Are self-driving cars using single frame or multiple frame to make decision?

Question

This might be a trivial question but I couldn't find any reliable answers on the internet.

Almost all the neural network architectures for self-driving cars that I have seen on the internet have a feedforward network, previous frames will not help in making the current decision.

I have read somewhere that Tesla uses two last frames captured to make a decision, even then 2 frames will not be that useful in this case. This might not be very helpful when predicting things ie.. lane cut-ins, as the system needs to observe the vehicle (that is going to cut in) behavior such as turn indicator, vehicle veering towards center lane over time in order to predict.

Can someone explain if this is the way production self-driving card such as Tesla work?

Or Is it something like the below?

Or are they using something like Many to one Recurrent net where inputs are CNN vectors of previous few frames and output is the control?

score 1 · Answer 1 · answered May 02 '19 at 23:30

It varies quite substantially between different self-driving paradigms(rather obviously) but for the most part the vast majority of implementations are using a variety of different reference frames in order to make predictions.

For example, Tesla's Autopilot is being fed many different camera feeds as well as radar and ultrasonic signals that are processed in a variety of temporal contexts.

While, for the most part, all of these programs are very tight-lipped, we can make a variety of assumptions based on the information available and educated assumptions.

As with many large, complex ML/AI systems, there is a large amount of compartmentalization where many different connectionist(or sometimes classic) models are combined(à la youtube recommendation system). Tesla is likely utilizing recurrent and convolutional networks where particular modules(combinations of models) are deciding on specific contexts(temporal or signal-based). These outputs are then most likely fed into an actor network which makes real time decisions.

You say "For example, Tesla's Autopilot is being fed many different camera feeds as well as radar and ultrasonic signals that are processed in a variety of temporal contexts.". It would be nice if you could provide a reference for this. — nbro, Nov 23 '20 at 00:42

score 0 · Answer 2 · edited Jun 17 '20 at 09:57

Visualization Appreciated

The diagrams are nicely thought out. As you refine your comparative design visualizations, you might use something like Inkscape to draw them for web publication, whether or not you decide to submit a paper to a publisher or license your ideas Creative Commons.

Web Research Realities

The reliability of answers from the internet are a function of the author, how much time she or he took in researching the question themselves, what search terms are used, and where the search term is entered. This question falls under the more challenging to research, cases where the must reliable answers are in the results sets of experiments run in proprietary government and corporate research centers.

If the GM or Tesla or Toyota or DoD research results were posted on line, someone would likely be fired, sued, and possibly jailed and a team of lawyers would hunt down all references to the posted information, employing international agreements and backbone level content filters to eliminate the dissemination of secrets.

A Better Research Approach

We can determine with a fairly high confidence that a decision based on a single frame is much less likely to lead to collision avoidance, beginning with a simple thought experiment.

Two kids are playing creatively and decide to make up a game that involves a ball. One of the rules of the game is that each player can only run backward, not forward or sideways. That's what makes the same fun for these kids. It's silly because everyone wants to run forward to get to the ball. The rest of the rules of the game are not particularly relevant.

A machine driver is processing an image and goes to make the decision. In this case, the term decision indicates the use of a rules based system, whether fuzzy and probabilistic or of the former pure Boolean type, but that is not particularly relevant either. This thought experiment applies to a learned response arising from the training of an architecture built primarily of artificial networks as well.

Consider now the case where no decision or learned response is invoked on the basis of the direction of the ball travel or its proximity to the street and path of the vehicle being driven because the kids are facing away from the ball. It would not be reasonable to assume that the training data included this spontaneously made-up game. The result of the combination of child creativity and single frame selection leads to a delayed collision avoidance tactic, at best.

In contrast, if two or three frames are included in the analysis, the feature of the ball moving toward the street and the children, regardless of their orientation with respect to the ball and the street, may be detected as a feature of the entire system through which the vehicle is driving.

This is one of astronomical number of examples where the training without the temporal dimension will lead to a much higher likelihood of improper trajectory projection from pixel data on the basis of any reasonable training set than had training and use of the pixel data included the temporal dimension.

Mathematical Analysis

When the results of trials with real vehicles move from the domain of corporate intellectual property and the domain of governmental national security artifacts into the public domain, we will see our empirical evidence. Until then we can rest on theory. The above thought experiment and others like it can be represented. Consider the hypothesis of

$$ P_{\mathcal{A} = \mathcal{E}(S)} > P_{\mathcal{A} = \mathcal{E}(\vec{S})} \; \text{,}$$

where $P_c$ is the probability given condition $c$, $\mathcal{A}$ is the actuality (posteriori) and $\mathcal{E}$ is the expectation function applied to instantaneous sensory information $s$ on the left hand side of the inequality versus the recent history of instantaneous sensory information $\vec{s}$ on the right hand side.

If we bring intermediate actions (decisioning after each frame is acquired) into the scope of this question, then we may pose the hypothesis in a way that involves Markov's work on causality and prediction in chains of events.

The accuracy of decisioning based on optical acquisition in the context of collision avoidance is higher without the Markov property, in that historical optically acquired data in conjunction with new optically acquired data will produce better trajectory oriented computational collision avoidance results than without historical data.

Either of these might take considerable work to prove, but they are both likely to be provable in probabilistic terms with a fairly reasonable set of mathematical constraints placed on the system up front. We know this because the vast majority of thought experiences show an advantage to a vector of frames over a single frame for the determination of actions that are most likely to avoid a collision.

Design

As is often the case, convolution kernel use in a CNN is likely to be the best design to recognize edge, contour, reflectivity, and texture features in collide-able object detection.

Assembly of trajectories (as a somewhat ethereal internal intermediate result) and subsequent determination of beeping, steering, acceleration, breaking, or signaling is likely best handled by recurrent networks of some type, the most publicly touted being b-LSTM or GRU based networks. The attention based handling and preemption discussed in many papers regarding real time system control are among the primary candidates for eventual common use among the designs. This is because changes in focus is common during human driving operations and is even detectable in birds and insects.

The simple case is when an ant detects one of these.

A large insurmountable object
A predator
A morsel of good smelling food
Water

The mode of behavior switches, probably in conjunction with the neural pathways for sensory information, when a preemptive stimulus is detected. Humans pilot and drive aircraft and motor vehicles this way too. When you pilot or drive next, bring into consciousness what you have learned unconsciously and this preemptive detection and change of attention and focus of task will become obvious.

Are self-driving cars using single frame or multiple frame to make decision?

2 Answers2