I am a software engineer who is quickly ramping up on AI tech, but I am nevertheless very new to the sector.
A colleague has an extensive collection of training videos. The vertical is wheelchair seating and mobility, and the training content covers the mechanical hardware skills he has built over a 30-year career. I am telling him that he needs to be the first seating vendor on the block with an AI that has incorporated everything from his video series into its context and is able to generate training videos of its own.
Stated another way, I'm looking for a way for an AI to learn the skills my friend teaches not only through verbal prompts but also through hands-on demonstrations.
He is very precise in his presentations, so one approach could be to pull text from the audio and normalize it as input for training an OpenAI model. Even so, I can't help but think that an AI should have full access to the videos, including the visual portions, and should be able to generate the same kind of output.
If we can have AIs that generate entertainment videos from a textual prompt, couldn't we have AIs that generate full training videos from textual and video inputs?
Perhaps some kind of time-code-correlated input: screen grabs from the training videos presented during training alongside text recognized from the audio track.
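For what it's worth, here's a rough sketch of how that time-code correlation could look. I'm assuming the transcript comes from a speech-to-text tool such as Whisper, which emits per-segment start/end timestamps, and that frames would be grabbed separately (e.g., with OpenCV) at the paired timestamps. The function and field names here are my own illustration, not any official API:

```python
# Sketch: pair each transcript segment with a representative frame timestamp.
# Input shape mirrors Whisper's transcription output:
#   [{"start": seconds, "end": seconds, "text": "..."}, ...]
# Actual frame extraction (e.g., OpenCV VideoCapture) is out of scope here.

def pair_segments(segments):
    """Return one (frame timestamp, text) training pair per transcript segment.

    Uses the segment midpoint on the theory that the on-screen demonstration
    mid-sentence best matches what is being said at that moment.
    """
    pairs = []
    for seg in segments:
        midpoint = (seg["start"] + seg["end"]) / 2.0
        pairs.append({
            "frame_time_sec": round(midpoint, 2),
            "text": seg["text"].strip(),
        })
    return pairs

# Example with a hand-made transcript (real input would come from Whisper):
transcript = [
    {"start": 0.0, "end": 4.5, "text": " First, remove the seat pan hardware."},
    {"start": 4.5, "end": 9.0, "text": " Then loosen the cane clamps on both sides."},
]
print(pair_segments(transcript))
```

Each resulting pair could then drive a frame grab at `frame_time_sec`, giving aligned image/text examples for whatever multimodal training or retrieval setup ends up being feasible.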
I hear that GPT-4 is able to train on correlated text/image pairs, but I'm having a hard time finding more information about this.
The owner wants to be able to validate/challenge the AI and correct any errors in the resulting output (audio and/or video). My reading of the OpenAI API docs indicates that, at least for text-based models, this is now possible.
Is current technology up to this task? How would a content creator correct an AI that produces incorrect video output?
Exemplar: https://youtu.be/hhgEBm7C2G8
Thanks!