
Most companies dealing with deep learning (automotive: Comma.ai, Mobileye, various automakers, etc.) collect large amounts of data and then use a lot of computational power to train a neural network (NN) on that big data. I guess this model is used mainly because both the big data and the training algorithms are meant to remain secret/proprietary.

If I understand it correctly, the problem with deep learning is that one needs to have:

  1. big data to learn from
  2. lots of hardware to train the neural network from this big data

I am trying to think about how crowdsourcing could be used in this scenario. Is it possible to distribute the training of the NN to the crowd? I mean not collecting the big data in a central place, but instead training on local data on each user's hardware (in a distributed way). The result would be many trained NNs that would, in the end, be merged into one in a committee of machines (CoM) fashion. Would such a model be possible?
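To make the merging step concrete, here is a minimal sketch of one naive way to combine independently trained models: elementwise averaging of their weights. The function name and the flat-list weight representation are hypothetical simplifications (real models have many layers, and plain weight averaging only works well when the models share an architecture and are reasonably aligned); a CoM could also keep the models separate and average their predictions instead.

```python
def merge_models(weight_sets):
    """Elementwise average of several models' flattened weight vectors.

    weight_sets: list of flat weight lists, one per locally trained model,
    all of the same length.
    """
    n_models = len(weight_sets)
    return [sum(ws) / n_models for ws in zip(*weight_sets)]

# Three toy "models", each a flat list of two weights
models = [[1.0, 4.0], [2.0, 5.0], [3.0, 6.0]]
print(merge_models(models))  # [2.0, 5.0]
```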

Of course, the model described above has a significant drawback: one has no control over the data used for learning (users could intentionally submit wrong/fake data that would lower the quality of the final CoM). This could be mitigated, however, by sending random data samples to a central community server for review.

Example: think of a powerful smartphone using its camera to capture the road from a vehicle's dashboard and using the footage to train lane detection. Every user would do the training themselves (possibly including manual work, such as labeling input images for supervised learning).

I wonder whether the model proposed above is viable. Or is there a better model for using crowdsourcing (a user community) in machine learning?


2 Answers


First, you need to give more credit to more reliable users. You can establish this credit from the amount of data they send, plus a feature where users can review and classify each other's feeds. From there, you will have a measure of certainty about which data is good and which is not.
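The credit idea above could be sketched as a simple trust score combining contribution volume with peer approval. The function name, the log damping, and the inputs are all hypothetical choices for illustration, not a standard formula:

```python
import math

def trust_weight(samples_sent, peer_approvals, peer_reviews):
    """Hypothetical trust score for a contributor.

    Scales the (log-damped) volume of submitted samples by the fraction
    of peer reviews that approved this user's data.
    """
    if peer_reviews == 0:
        return 0.0  # no reviews yet -> no trust
    approval_rate = peer_approvals / peer_reviews
    # log1p damps volume so prolific users cannot dominate purely by quantity
    return math.log1p(samples_sent) * approval_rate
```

Such scores could then be used as weights when aggregating the users' data or models, so unreviewed or poorly reviewed contributors have little influence.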

You will need a centralized server, unless you're trying to build some kind of peer-to-peer trust system, but I don't think smartphones are powerful enough to do the training themselves.

You will need big machines for training NNs. Don't rely on users to have them; you would end up with tons of badly trained NNs, which don't make for a good CoM.


There is already an approach similar to the one you describe: federated learning (FL), where local nodes (e.g. mobile or edge devices, but also companies of different sizes) keep the training data locally, so each node may have a different (unbalanced and non-i.i.d.) dataset and model, which then need to be aggregated.

One possible definition of federated learning is

Federated learning is a machine learning setting where multiple entities (clients) collaborate in solving a machine learning problem, under the coordination of a central server or service provider. Each client's raw data is stored locally and not exchanged or transferred; instead, focused updates intended for immediate aggregation are used to achieve the learning objective.
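To illustrate the "focused updates intended for immediate aggregation" part of the definition, here is a toy sketch of one server-side aggregation round in the style of federated averaging (FedAvg), where each client's locally trained weights are averaged with weights proportional to its dataset size. The flat-list weight representation and function name are simplifications for illustration, not the API of any FL library:

```python
def fed_avg(global_w, client_updates):
    """One round of FedAvg-style aggregation (toy sketch).

    global_w: current global weights (a flat list; defines the length).
    client_updates: list of (n_samples, local_weights) pairs, where each
    local_weights is a flat list the same length as global_w.
    Returns the new global weights: the sample-size-weighted average.
    """
    total = sum(n for n, _ in client_updates)
    new_w = [0.0] * len(global_w)
    for n, w in client_updates:
        for i, wi in enumerate(w):
            new_w[i] += (n / total) * wi
    return new_w

# Two clients: one trained on 30 samples, one on 10
updates = [(30, [1.0, 1.0]), (10, [5.0, 5.0])]
print(fed_avg([0.0, 0.0], updates))  # [2.0, 2.0]
```

Note how the raw data never appears here: the server only sees each client's weights and sample count, which is the privacy point the definition makes.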

If you are interested in more details, there are already many sources on the topic that you can find on the web, but I would recommend the paper Advances and Open Problems in Federated Learning (2021, Peter Kairouz et al.) or Google's article Federated Learning: Collaborative Machine Learning without Centralized Training Data (2017). There are also software libraries for FL, such as TensorFlow Federated (TFF).

However, note that there are other approaches to distributed machine learning/training.
