6

YouTube has a huge number of videos, many of which also contain various spoken languages. This would presumably provide something like the data that a "challenged" baby would experience - "challenged" here meaning a baby born without arms or legs (unfortunately, many people are born that way).

Would this not allow unsupervised learning in a deep learning system that has both vision and audio capabilities? The neural network would presumably learn correlations between words and images, and could perhaps even learn rudimentary language skills, all without human supervision. I believe that the individual components needed to do this already exist.
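For concreteness, here is a rough sketch (in PyTorch, with placeholder encoders and feature sizes, not a tested recipe) of the kind of self-supervised objective I have in mind: each clip's own soundtrack acts as the "label" for its frames, so no human annotation is needed.

```python
# Minimal sketch of an audio-visual correspondence objective: two encoders map
# video-frame features and audio features into a shared embedding space, and a
# contrastive loss pulls together embeddings from the same moment of the same video.
# Layer sizes and input dimensions are illustrative placeholders.
import torch
import torch.nn as nn
import torch.nn.functional as F

class Encoder(nn.Module):
    def __init__(self, in_dim, emb_dim=256):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(in_dim, 512), nn.ReLU(), nn.Linear(512, emb_dim))

    def forward(self, x):
        return F.normalize(self.net(x), dim=-1)

video_enc = Encoder(in_dim=2048)   # e.g. pooled frame features
audio_enc = Encoder(in_dim=128)    # e.g. log-mel spectrogram features

def contrastive_loss(video_feats, audio_feats, temperature=0.07):
    # Similarity between every video clip and every audio clip in the batch.
    logits = video_enc(video_feats) @ audio_enc(audio_feats).T / temperature
    # The "label" for clip i is its own soundtrack i: no human annotation required.
    targets = torch.arange(logits.size(0))
    return (F.cross_entropy(logits, targets) + F.cross_entropy(logits.T, targets)) / 2

# Dummy batch of 8 co-occurring (video, audio) feature pairs.
loss = contrastive_loss(torch.randn(8, 2048), torch.randn(8, 128))
loss.backward()
```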

Has this been tried, and if not, why?

god of llamas
  • 332
  • 1
  • 6
Wolphram jonny
  • 284
  • 2
  • 14
  • In this day and age, you can expect someone has already done what you want to do. – FreezePhoenix Apr 09 '18 at 22:57
  • 1
    The underlying issue seems to be that integrated circuits aren't nearly as powerful as the human brain, and we're not even sure how the brain can store and correlate all of that data. I usually recommend taking a quick look at computational complexity theory, to get a sense of the problems sizes for problems where all of the parameters are known and information is perfect and complete. Compare to nature, where this is not the case. It's a very challenging problem. Good question, though. Welcome to AI! – DukeZhou Apr 10 '18 at 20:18
  • you should edit your question if your phrasing is not the way you intend it to be – k.c. sayz 'k.c sayz' Apr 14 '18 at 00:50
  • @k.c.sayz'k.csayz' Feel free to edit it, I will reverse it only in case the edited meaning disagrees with the intended one – Wolphram jonny Apr 14 '18 at 15:51
  • check https://arxiv.org/pdf/1112.6209.pdf – thecomplexitytheorist Jul 16 '18 at 16:42

3 Answers

2

Yes, it is possible, and yes, it probably has been done before. Odds are, however, the person(s) who tried were disappointed with the results and forgot to tell others.

The reasons they might be disappointed could be any of the following:

  • Took too long to train
  • Even when fully trained (or when it appeared to be), it did not give the expected output
  • Too sensitive to variations in the input
FreezePhoenix
  • 422
  • 3
  • 20
2

Yes! Unsupervised machine learning has absolutely been applied to YouTube videos... To recognize cats!

Here's an article about it in Wired. One of the leading ML researchers on the project was Andrew Ng.
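For a rough idea of the setup, here is a toy Python (PyTorch) sketch, loosely in the spirit of that experiment; it is not the actual Google Brain code, and the real system was vastly larger and used sparsity, local receptive fields, and pooling. It only illustrates that the training signal comes from reconstruction, not labels.

```python
# Toy autoencoder trained on unlabeled image patches (random data standing in
# for YouTube frame patches). No labels are used anywhere.
import torch
import torch.nn as nn

patches = torch.rand(1000, 32 * 32)          # stand-in for 32x32 grayscale frame patches

model = nn.Sequential(
    nn.Linear(32 * 32, 256), nn.ReLU(),      # encoder: compress each patch to 256 features
    nn.Linear(256, 32 * 32), nn.Sigmoid(),   # decoder: reconstruct the patch
)
opt = torch.optim.Adam(model.parameters(), lr=1e-3)

for epoch in range(10):
    recon = model(patches)
    loss = nn.functional.mse_loss(recon, patches)   # purely unsupervised objective
    opt.zero_grad()
    loss.backward()
    opt.step()
```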

Charles
  • 281
  • 2
  • 6
2

The answer is very much yes; have a look at what Google has done around this:

Google Cloud Video Intelligence makes videos searchable, and discoverable, by extracting metadata with an easy to use REST API. You can now search every moment of every video file in your catalog. It quickly annotates videos stored in Google Cloud Storage, and helps you identify key entities (nouns) within your video; and when they occur within the video.

https://cloud.google.com/video-intelligence/

So Google can recognize all kinds of data from a video: it classifies the whole content into tags.
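As a rough illustration, a label-detection request to the Video Intelligence REST API looks something like the sketch below; see the linked docs for the authoritative request schema. The bucket path and token handling are placeholders.

```python
# Sketch of a label-detection request to the Video Intelligence REST API.
import requests

ACCESS_TOKEN = "..."                               # e.g. from `gcloud auth print-access-token`
body = {
    "inputUri": "gs://my-bucket/my-video.mp4",     # hypothetical video in Cloud Storage
    "features": ["LABEL_DETECTION"],               # ask for entity (noun) labels over time
}
resp = requests.post(
    "https://videointelligence.googleapis.com/v1/videos:annotate",
    headers={"Authorization": f"Bearer {ACCESS_TOKEN}"},
    json=body,
)
# The call is asynchronous: the response names a long-running operation that
# can be polled for the per-segment labels once annotation finishes.
print(resp.json())
```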

What about Humanoid Robot Sophia?

Cameras within Sophia's eyes combined with computer algorithms allow her to see. She can follow faces, sustain eye contact, and recognize individuals. She is able to process speech and have conversations using a natural language subsystem.

https://en.wikipedia.org/wiki/Sophia_(robot)

These point in the direction of understanding (Google) and producing (Sophia) language from sounds and images. Machines are still not ready to learn to think by themselves. If you look into these two cases more closely, you will see that they are still quite mechanical and manual (requiring human pre-effort).

It is said that machines are now at the stage of a toddler who can ask the names of things around her and name them. Give it a few more years and maybe the abilities will be more advanced ;)

edit:

You asked about unsupervised learning. There is a video of a talk by an MIT researcher who ran experiments on text and images, and in his closing notes he mentioned that it would be nice to do the same with videos, for essentially the same reason you give: to learn a language. He promised to keep it in mind with his colleagues; maybe some of them are already working on it.

edit2:

An interesting research paper on the topic can be found at this link:

We address the problem of automatically learning the main steps to complete a certain task, such as changing a car tire, from a set of narrated instruction videos. The contributions of this paper are three-fold. [..] Third, we experimentally demonstrate that the proposed method can automatically discover, in an unsupervised manner, the main steps to achieve the task and locate the steps in the input videos.

mico
  • 927
  • 1
  • 8
  • 16