This is a bit of a soft question, and I'm not sure if it's on topic; please let me know how I can improve it if it doesn't meet the site's criteria.
GPT models are trained in an unsupervised (more precisely, self-supervised) way: from my understanding, they are given a prompt and then either answer the question or continue the sentence/paragraph. They also seem to be the most advanced models for producing natural language, giving output that is syntactically correct and, to my eye at least, sometimes indistinguishable from something written by a human.
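To be concrete, this is the prompt-then-continue behaviour I mean. A minimal example using the Hugging Face transformers library and the small GPT-2 checkpoint (untested as written, but hopefully close):

```python
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")

# The model is given a prompt and simply continues it, one token at a time.
inputs = tokenizer("The cat sat on the", return_tensors="pt")
output_ids = model.generate(
    **inputs,
    max_new_tokens=20,
    do_sample=True,
    top_p=0.9,
    pad_token_id=tokenizer.eos_token_id,  # GPT-2 has no pad token by default
)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```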
However, suppose I have a problem with an input (it could be anything, but let's say an image or video) and a description of that image or video as the output. In theory I could train a model with convolutional filters to identify the objects and describe the image (assuming the test data stays within the bounds of the training data). But when I've seen models like this in the past, the language is either quite simple or 'feels' like it was produced by a machine.
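The captioning models I have in mind look roughly like this: a CNN encoder whose image feature is fed to a recurrent decoder as if it were the first 'word', trained on image-caption pairs. A rough PyTorch sketch (the class name and dimensions are placeholders, not any particular published model):

```python
import torch
import torch.nn as nn
import torchvision.models as models

class CaptionModel(nn.Module):
    def __init__(self, vocab_size, embed_dim=256, hidden_dim=512):
        super().__init__()
        cnn = models.resnet18(weights=None)  # pretrained weights would normally be used
        self.encoder = nn.Sequential(*list(cnn.children())[:-1])  # drop classifier head
        self.img_proj = nn.Linear(512, embed_dim)  # 512 = resnet18 feature size
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, images, captions):
        feats = self.encoder(images).flatten(1)      # (B, 512) image feature
        img_tok = self.img_proj(feats).unsqueeze(1)  # image acts as the first "token"
        word_toks = self.embed(captions)             # (B, T, embed_dim)
        seq = torch.cat([img_tok, word_toks], dim=1)
        hidden, _ = self.lstm(seq)
        return self.out(hidden)                      # per-position vocabulary logits
```

Trained with cross-entropy against the next caption token at each position, this is the kind of model whose output tends to 'feel' machine-generated to me.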
Is there a way to either train a GPT model as a supervised learning model, with inputs of some non-language type and outputs of sentences/paragraphs, or to use a similar type of machine learning model for this task?
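One approach I can imagine (and what I'd like confirmed or corrected) is to encode the non-language input with whatever network suits it, map the resulting feature vector into the GPT embedding space as a short 'prefix' of virtual tokens, and then train end-to-end with ordinary supervised cross-entropy on the target text. An untested sketch of what I mean, with made-up names and dimensions, again using GPT-2 from Hugging Face transformers:

```python
import torch
import torch.nn as nn
from transformers import GPT2LMHeadModel

class PrefixCaptioner(nn.Module):
    def __init__(self, input_dim, prefix_len=8):
        super().__init__()
        self.gpt = GPT2LMHeadModel.from_pretrained("gpt2")
        embed_dim = self.gpt.config.n_embd  # 768 for the small GPT-2
        self.prefix_len = prefix_len
        # map an arbitrary input feature vector to a sequence of "virtual tokens"
        self.mapper = nn.Linear(input_dim, prefix_len * embed_dim)

    def forward(self, features, token_ids):
        B = features.size(0)
        prefix = self.mapper(features).view(B, self.prefix_len, -1)
        tok_embeds = self.gpt.transformer.wte(token_ids)  # GPT-2 word embeddings
        inputs = torch.cat([prefix, tok_embeds], dim=1)
        # -100 masks the prefix positions out of the language-modelling loss
        ignore = torch.full((B, self.prefix_len), -100,
                            dtype=torch.long, device=token_ids.device)
        labels = torch.cat([ignore, token_ids], dim=1)
        return self.gpt(inputs_embeds=inputs, labels=labels).loss
```

If something along these lines is a standard or published approach, a pointer to it would answer my question.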
A few notes:
I have seen the deep learning image captioning methods; these are what I mention above. I'm looking for something more general that can take input-output pairs where the output is text and the input is of any form.