Will training an AI still work if the input data is somewhat sparse?

Question

I'm looking at writing an AI agent for pattern recognition.

I want to be able to constantly feed new data to the AI to continuously train it as new data may have new patterns.

My problem, though, is that my input feed may break once in a while (the data comes from a remote computer), so some of the data will go missing. The other computer sends me real-time data, so when the connection goes down, any data produced while disconnected is lost as far as the AI agent is concerned. (For now I'm not looking at filling the gaps. Reducing them is ultimately one of my goals, but at this point I have to assume that isn't possible.)

What kind of impact does missing data have on a pattern recognition AI?

Eric Platon · Accepted Answer · 2018-05-29T09:56:46.847

First, the title mentions "sparse data". Recently the expression has taken on a clear meaning: the agent's input is data consisting mostly of zeros. The question uses a different meaning: a "sparse data stream", where data flows and sometimes vanishes. I understand the question as: "Will training an AI still work if the training data stream breaks?"

Note the explicit "training data stream". The question suggests the agent has at least two inputs: training data you want to feed "anytime", and "inference data" sent to the agent for actual recognition.

This question enters (to my eye) the realm of distributed AI and multi-agent systems, and ultimately touches on a common issue in distributed systems.

If we recast your problem as two humans, S and L, communicating: when S talks to L over a reliable channel, L gets all the information. When the channel breaks, L gets nothing. Does that prevent L from living normally? No. It merely cuts out whatever L expected to get from the conversation with S.

Back to your scenario: whenever the data stream from the source (S) is broken, the learning agent (L) is simply unable to learn from that data source. The impact on the pattern recognition agent is bounded by what it could have learned from the missing data. The agent's recognition performance remains constant while the data stream is interrupted.

Now, if the learning agent is only learning, and cannot perform recognition unless it is also learning, there is an architectural or implementation issue. Continuous learning entails that the agent is active (performs actual recognition) and learns from what it does.
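To make the separation between learning and recognition concrete, here is a minimal sketch, not a recommendation for any particular model. It assumes a toy online perceptron (the `OnlinePerceptron` class, the `consume` helper, and the convention that `None` stands for a sample lost while the link was down are all hypothetical, chosen only for illustration). The point it shows is that a gap in the training stream leaves the learned parameters, and therefore the predictions, untouched.

```python
from typing import Iterable, List, Optional, Sequence, Tuple

# Hypothetical sample type: (features, label in {0, 1}).
Sample = Tuple[Sequence[float], int]


class OnlinePerceptron:
    """Toy online learner: at most one weight update per labelled sample."""

    def __init__(self, n_features: int, lr: float = 0.1) -> None:
        self.w: List[float] = [0.0] * n_features
        self.b = 0.0
        self.lr = lr

    def predict(self, x: Sequence[float]) -> int:
        # Recognition only reads the current weights; it needs no training data.
        score = sum(wi * xi for wi, xi in zip(self.w, x)) + self.b
        return 1 if score > 0 else 0

    def learn(self, x: Sequence[float], y: int) -> None:
        # Classic perceptron rule: update only when the prediction is wrong.
        error = y - self.predict(x)
        if error:
            for i, xi in enumerate(x):
                self.w[i] += self.lr * error * xi
            self.b += self.lr * error


def consume(agent: OnlinePerceptron, stream: Iterable[Optional[Sample]]) -> int:
    """Feed a stream to the agent and return how many samples were learned.

    Assumption: None marks a sample that never arrived (broken connection).
    """
    learned = 0
    for item in stream:
        if item is None:
            continue  # nothing to learn from; the agent's state is unchanged
        x, y = item
        agent.learn(x, y)
        learned += 1
    return learned


agent = OnlinePerceptron(n_features=2)
learned_before_gap = consume(agent, [([1.0, 0.0], 1), ([0.0, 1.0], 0)] * 5)
weights_before = (list(agent.w), agent.b)

# The connection drops: three samples are lost.
learned_during_gap = consume(agent, [None, None, None])
weights_after = (list(agent.w), agent.b)

# Recognition keeps working during the gap, with exactly the same model.
prediction_during_gap = agent.predict([1.0, 0.0])
```

In this sketch `weights_before` and `weights_after` are identical, which is the "performance remains constant" claim in code form: the agent loses only what it could have learned from the missing samples.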

Update, for clarification:

The claim that performance remains constant is "true", but subtle. At time t, some metric like precision can be 99% with respect to what the agent has seen so far. Assuming continuous learning is interrupted and new recognition requests come in, the performance has "two faces":

As long as new recognition requests are "close" to what the agent has seen so far, performance is "constant": the agent still scores 99%.
If a request is quite different, the performance will drop. The size of the drop depends on how different the input is.

A concrete example: the agent is trained to find mushrooms with a dataset where all images are taken in the forest. Assuming learning stops, when an image of a mushroom growing in a concrete crack comes in, the agent will probably do worse. It will then keep doing worse on that kind of image for as long as it cannot "refresh" by learning from the experience.
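The "two faces" can be illustrated with a small sketch. Everything here is made up for illustration: the two numeric features stand in for whatever an image model would extract (say, background brightness and texture), the data points are invented, and the nearest-centroid classifier is just the simplest frozen model I could pick, not what a real mushroom detector would use.

```python
from math import dist
from statistics import mean


def centroid(points):
    """Component-wise mean of a list of equal-length feature tuples."""
    return tuple(mean(axis) for axis in zip(*points))


class FrozenCentroidClassifier:
    """Nearest-centroid model that is trained once and never updated."""

    def __init__(self, samples_by_label):
        self.centroids = {
            label: centroid(points) for label, points in samples_by_label.items()
        }

    def predict(self, x):
        # Pick the label whose centroid is closest to the request.
        return min(self.centroids, key=lambda label: dist(x, self.centroids[label]))


def accuracy(model, requests):
    hits = sum(model.predict(x) == y for x, y in requests)
    return hits / len(requests)


# Hypothetical "forest" training set: (background brightness, texture).
forest_training = {
    "mushroom": [(0.20, 0.80), (0.30, 0.90), (0.25, 0.85)],
    "other": [(0.70, 0.20), (0.80, 0.30), (0.75, 0.25)],
}
model = FrozenCentroidClassifier(forest_training)

# Requests that look like what the agent has already seen.
close_requests = [
    ((0.22, 0.82), "mushroom"),
    ((0.78, 0.22), "other"),
    ((0.30, 0.80), "mushroom"),
    ((0.70, 0.30), "other"),
]

# Requests from a different setting (mushrooms on bright concrete).
shifted_requests = [
    ((0.80, 0.50), "mushroom"),
    ((0.90, 0.40), "mushroom"),
    ((0.85, 0.20), "other"),
    ((0.70, 0.60), "mushroom"),
]

acc_close = accuracy(model, close_requests)
acc_shifted = accuracy(model, shifted_requests)
```

With these invented numbers the frozen model is perfect on the "close" requests and misses most of the shifted ones. Nothing about the model changed between the two evaluations; only the incoming requests did, and the gap stays until learning resumes on the new kind of data.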

Yes, the agent is used against the incoming data. However, the correct answers are not all available immediately, so teaching from the new data happens with a delay. Obviously, the recognition won't run _for a while_ until the agent _knows_ enough. Then it continues to improve as more data comes in. The delay could be around 1h, although that in itself is not too important. "The performance remaining constant" is certainly the bit I was looking for. Thank you. — Alexis Wilke, May 29 '18 at 09:44
@AlexisWilke As you mention that specific aspect, I have refined the answer to expand on that point. — Eric Platon, May 29 '18 at 09:57
