
In some newer robotics literature, the term system identification is used with a specific meaning. The idea is not to use a fixed model, but to create the model on the fly, which amounts to a model-free kind of system identification. A short remark for anyone who doesn't know what the idea is: system identification means creating a prediction model, better known as a forward numerical simulation. The model takes the input and calculates the outcome. It's not exactly the same as a physics engine, but both operate with a model in the loop that generates the output in real time.
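To illustrate what I mean by a prediction model, here is a toy sketch (all names are invented for this example) that fits a linear forward model from logged state/action data and then serves as a one-step forward simulation:

```python
import numpy as np

# Toy "online system identification": fit a linear forward model
#   x_{t+1} ≈ A @ x_t + B @ u_t
# from streamed (state, action, next_state) samples via least squares.
# Everything here is illustrative, not from any specific library.

rng = np.random.default_rng(0)

# Unknown "true" plant that we only observe through data.
A_true = np.array([[1.0, 0.1], [0.0, 0.9]])
B_true = np.array([[0.0], [0.1]])

# Collect input/output data, as an online identifier would.
X, U, X_next = [], [], []
x = np.zeros(2)
for _ in range(200):
    u = rng.uniform(-1, 1, size=1)          # excitation signal
    x_next = A_true @ x + B_true @ u
    X.append(x); U.append(u); X_next.append(x_next)
    x = x_next

# Stack regressors and solve [A B] by least squares.
Z = np.hstack([np.array(X), np.array(U)])   # shape (T, 3)
Theta, *_ = np.linalg.lstsq(Z, np.array(X_next), rcond=None)
A_est, B_est = Theta.T[:, :2], Theta.T[:, 2:]

# The identified model is now a prediction model: given (x, u),
# it forecasts the next state, like a one-step forward simulation.
print(A_est.round(3), B_est.round(3))
```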

But what is policy learning? Somewhere I've read that policy learning is equal to online system identification. Is that correct? If yes, it doesn't make much sense, because reinforcement learning has the goal of learning a policy, and a policy is something which controls the robot. But if the aim is to do system identification, then the policy would be equal to the prediction model. Perhaps somebody can clear up the confusion about the different terms ...

Example: Q-learning is a good example of reinforcement learning. The idea is to construct a Q-table, and this table controls the robot movements. But if online system identification is equal to policy learning, and this is equal to Q-learning, then the Q-table doesn't contain the servo signals for the robot, but only provides a prediction of the system. That would mean the Q-table is equal to a Box2D physics engine which can say what x/y coordinates the robot will have. This interpretation doesn't make much sense. Or does it make sense, and the definition of a policy is quite different?
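For contrast, here is a minimal tabular Q-learning sketch (a toy 1-D world; everything is invented for the example). Note that the Q-table stores action values, and the greedy action per state is the controller; nothing in it predicts the next state:

```python
import numpy as np

# Minimal tabular Q-learning on a toy 1-D grid: the robot starts at
# cell 0 and gets a reward for reaching cell 4. The Q-table maps
# (state, action) to an estimated return; it does NOT predict the
# next state, so it is a controller, not a forward model.

n_states, n_actions = 5, 2          # actions: 0 = left, 1 = right
Q = np.zeros((n_states, n_actions))
alpha, gamma, eps = 0.5, 0.9, 0.3
rng = np.random.default_rng(1)

def step(s, a):
    s2 = max(0, min(n_states - 1, s + (1 if a == 1 else -1)))
    return s2, (1.0 if s2 == n_states - 1 else 0.0)

for _ in range(200):
    s = 0
    while s != n_states - 1:
        # Explore randomly with probability eps, and also when all
        # values are still tied (otherwise argmax would always pick 0).
        explore = rng.random() < eps or Q[s].max() == Q[s].min()
        a = int(rng.integers(n_actions)) if explore else int(Q[s].argmax())
        s2, r = step(s, a)
        # Q-learning update (off-policy: bootstraps from max over actions)
        Q[s, a] += alpha * (r + gamma * Q[s2].max() - Q[s, a])
        s = s2

print(Q.argmax(axis=1))  # greedy policy: should move right everywhere
```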

2 Answers


From the book *Reinforcement Learning: An Introduction* (R. Sutton, A. Barto):

> The term system identification is used in adaptive control for what we call model-learning (e.g., Goodwin and Sin, 1984; Ljung and Söderström, 1983; Young, 1984).

Model-learning refers to the act of learning a model of the environment. Reinforcement learning can be divided into two types:

  1. Model-based - first we build a model of the environment and then do the control.

  2. Model-free - we do not try to model the behaviour of the environment.

Policy learning is the act of learning an optimal policy. You can do it in two ways (the two update rules are contrasted in the sketch after this list):

  1. On-policy learning - learn about policy $\pi$ by sampling from that same policy.

  2. Off-policy learning - learn about policy $\pi$ from experience sampled from some other policy (e.g. watching a different agent play a game).
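A schematic side-by-side of the two update rules, assuming a tabular setting (generic names, not tied to any library):

```python
import numpy as np

# Schematic comparison of on-policy (SARSA) and off-policy (Q-learning)
# updates. Both update Q[s, a] after observing reward r and next state s2.

alpha, gamma = 0.1, 0.9
Q = np.zeros((5, 2))

def sarsa_update(s, a, r, s2, a2):
    # On-policy: bootstraps from the action a2 the SAME policy actually took.
    Q[s, a] += alpha * (r + gamma * Q[s2, a2] - Q[s, a])

def q_learning_update(s, a, r, s2):
    # Off-policy: bootstraps from the greedy action, regardless of which
    # (possibly different) behaviour policy generated the experience.
    Q[s, a] += alpha * (r + gamma * Q[s2].max() - Q[s, a])
```

The only difference is the bootstrap target: SARSA uses the value of the action the policy actually took, Q-learning uses the maximum over actions.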


System identification and policy learning are two completely different aspects of a system.

System identification is basically finding out the transfer functions, the hardware parameters, and the relationships and nature of behaviour of the different components that determine the result when acted upon by a control signal. Generally, hardware manufacturers provide the configuration details in their datasheets, and these are either used directly as system parameters or used to derive others. Online system identification is the process of determining that set of parameters not from already available measurements but from data coming in in real time.
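A minimal sketch of the real-time part, using recursive least squares, which is a common choice for this (all symbols are generic placeholders, not from a particular toolbox):

```python
import numpy as np

# Recursive least squares (RLS): update the parameter estimate theta
# with every new measurement instead of refitting a whole batch.
# Assumed model: y_t = phi_t . theta + noise, with known regressor phi_t.

theta = np.zeros(2)                 # parameter estimate, updated online
P = np.eye(2) * 1e3                 # estimate covariance (large = uncertain)
lam = 0.99                          # forgetting factor for slow drift

def rls_update(phi, y):
    global theta, P
    phi = phi.reshape(-1, 1)
    k = P @ phi / (lam + phi.T @ P @ phi)          # gain vector
    theta = theta + k.flatten() * (y - phi.flatten() @ theta)
    P = (P - k @ phi.T @ P) / lam

# Example: identify y = 2*u + 0.5 from streaming samples.
rng = np.random.default_rng(2)
for _ in range(200):
    u = rng.uniform(-1, 1)
    phi = np.array([u, 1.0])
    y = 2.0 * u + 0.5 + 0.01 * rng.standard_normal()
    rls_update(phi, y)

print(theta.round(2))  # approaches [2.0, 0.5]
```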

Policy learning is the process of correlating actions to results and discerning which actions are good or bad. It is about determining the control strategy that will produce the desired results given all the circumstances.

SI is like curve fitting: determining the equation of a curve on already available data, where you already know the polynomial degree, because you need to know the structure of the parameters you are trying to estimate. Policy learning, in contrast, uses a closed-loop system to repeatedly update your control signals until you find one that satisfies your performance and operational requirements.

In a robotics context, a robot manipulator has mass ($\mathbf{H}$), Coriolis ($\mathbf{C}$) and gravity ($\mathbf{g}$) terms that define the dynamics of the system, basically relating the physics of the robot to the torques applied at the joints and the tip, as shown in the equation below. Online parameter identification means using the measured torques and the known structure of the dynamic equation ($\mathbf{H}$ is an $n \times n$ matrix, where $n$ is the DOF, and so on) to determine the numerical values of the parameters. Similar online parameter identification is also done for friction components such as static and Coulomb friction and the viscous friction coefficients. The least squares method is often used for this.

$$\mathbf{H}(\mathbf{q})\,\ddot{\mathbf{q}} + \mathbf{C}(\mathbf{q},\dot{\mathbf{q}})\,\dot{\mathbf{q}} + \mathbf{B}\dot{\mathbf{q}} + \mathbf{g}(\mathbf{q}) = \boldsymbol{\tau}$$
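As a toy version of this idea (a hypothetical 1-DOF arm; all values are invented), note that the dynamics are linear in the unknown parameters, so stacked measurements give a standard least-squares problem:

```python
import numpy as np

# Toy 1-DOF version of the manipulator equation above:
#   tau = I*qdd + b*qd + gc*sin(q)
# The dynamics are linear in the unknown parameters theta = [I, b, gc],
# so stacked measurements give tau = Phi @ theta, solvable by least squares.

rng = np.random.default_rng(3)
I_true, b_true, gc_true = 0.8, 0.2, 3.0

# Simulated joint trajectory measurements (q, qd, qdd) and torques.
t = np.linspace(0, 5, 500)
q = np.sin(t); qd = np.cos(t); qdd = -np.sin(t)
tau = I_true * qdd + b_true * qd + gc_true * np.sin(q)
tau += 0.01 * rng.standard_normal(tau.shape)        # sensor noise

# Regressor matrix: one row per sample, one column per unknown parameter.
Phi = np.column_stack([qdd, qd, np.sin(q)])
theta, *_ = np.linalg.lstsq(Phi, tau, rcond=None)

print(theta.round(2))  # approaches [0.8, 0.2, 3.0]
```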

Policy learning in terms of RL is basically learning the set of actions that will produce the desired behaviour. Q-learning is model-free learning, so there is no predicted behaviour obtained from the inputs. Here, actions are simulated, results are obtained, and they are assigned a degree of belief (positive and negative, high and low rewards) depending on what part of the desired result they produce and how much of it. Over time, the learned policy is the sequence of actions that should be run to reach the desired result. The Q-table has nothing to do with system identification, which is a modelling step; it is rather a control step. So, for the arm, the learned policy would be which joints should be actuated in which sequence and to which angles to complete a pick-and-place task.
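To make the "control step" concrete, a small sketch of executing a learned policy (hypothetical values throughout; the Q-table is filled in by hand so the script runs). Rolling out the greedy actions yields exactly such an action sequence, with no state prediction involved:

```python
import numpy as np

# Hypothetical illustration: a 1-joint arm with discretized angles.
# Assume Q was already learned (here filled in by hand): the greedy
# rollout below is the "policy" -- a sequence of joint commands,
# not a forecast of where the arm will end up.

angles = [0, 15, 30, 45, 60]        # discretized joint angle states (deg)
goal = 4                            # index of the target angle
Q = np.zeros((len(angles), 2))      # actions: 0 = -15 deg, 1 = +15 deg
Q[:, 1] = 1.0                       # pretend learning favoured "increase angle"
Q[goal] = 0.0                       # terminal state: no action preferred

s, commands = 0, []
while s != goal:
    a = int(Q[s].argmax())          # policy = greedy action per state
    commands.append(+15 if a == 1 else -15)
    s = s + 1 if a == 1 else max(0, s - 1)

print(commands)  # the executable action sequence, e.g. [15, 15, 15, 15]
```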

Akshay Kumar