
I am working on a research project about the different reward functions used in the RL domain. I have read up on Inverse Reinforcement Learning (IRL) and Reward Shaping (RS), and I would like to clarify some doubts I have about the two concepts.

In the case of IRL, the goal is to find a reward function based on the policy that experts follow. I have read that recovering the reward function that the experts were trying to optimise, and then finding an optimal policy for that recovered reward function, can result in a better policy (e.g. apprenticeship learning). Why does it lead to a better policy?

    You're asking two distinct questions here. I suggest that you split this post into two. – nbro Jul 01 '20 at 15:13
  • RL does not "specify immediate rewards". It often calculates a value (state value or action value), which then can be used to select an action. The value is the *expected return*, which is calculated from rewards. Although this may seem like nit-picking, it is important to get the terminology solid in a question, otherwise you will get answers that simply correct your terminology when you want other information. If you understand the difference between return and reward (and I think you do from other questions), please fix this in the question – Neil Slater Jul 01 '20 at 16:52
  • Note that IRL, in comparison, really does calculate an immediate reward function. So the difference between IRL doing that, and the value functions of RL is probably part of your question. – Neil Slater Jul 01 '20 at 17:05
  • Hi @NeilSlater thank you for the comment, I will edit the question accordingly. – calveeen Jul 02 '20 at 08:55
  • Actually, I think there is no definite answer to this question. Based on what I have gathered, I will try to answer my own question. – calveeen Jul 02 '20 at 09:00

1 Answer


Inverse Reinforcement Learning (IRL) is a technique that attempts to recover the reward function that the expert is implicitly maximising, based on expert demonstrations. When solving reinforcement learning problems, the agent maximises a reward function specified by the designer and, in the process of reward maximisation, accomplishes the task it was designed to do. However, reward functions for certain tasks are difficult to specify by hand. For example, the task of driving involves many different factors, such as the distance to the car in front, the road conditions, and whether or not the driver needs to reach the destination quickly. A reward function can be hand-specified based on these features, but when there are trade-offs between them, it is difficult to know how the different desiderata should be weighted against each other.
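To make this concrete, a hand-specified reward for driving might be a weighted combination of such features. The sketch below is purely illustrative: the feature names and weight values are assumptions, and choosing those weights is exactly the part that is hard to get right by hand.

```python
import numpy as np

def phi(state):
    """Hypothetical driving features: gap to the car ahead, road quality, progress to goal."""
    return np.array([state["gap_ahead"], state["road_quality"], state["progress"]])

# Hand-picked trade-off weights; these numbers encode the desiderata mentioned above,
# and getting them right by hand is the difficult part.
w = np.array([0.5, 0.2, 0.3])

def reward(state):
    """Linear reward r(s) = w . phi(s)."""
    return float(w @ phi(state))

# Example: reward({"gap_ahead": 1.0, "road_quality": 0.8, "progress": 0.1})
```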

Instead of specifying the trade-offs manually, it is often easier to recover a reward function from expert demonstrations using IRL. Such a reward function can generalise better to unseen states, as long as the features of driving do not change.

In the case where reward shaping fails to learn a task (such as driving), it would be better to have someone demonstrate the task and learn a reward function from those demonstrations. Solving the MDP with the learnt reward function will then yield a policy that should resemble the demonstrated behaviour. The learnt reward function should also generalise to unseen states, so that the agent acting in unseen states performs the actions an expert would take under the same conditions, assuming that the unseen states come from the same distribution as the training states.
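As a rough sketch of that loop, apprenticeship learning via feature-expectation matching (in the spirit of Abbeel & Ng, 2004) alternates between solving the MDP for the current reward weights and moving the weights toward the expert's feature expectations. This is a heavily simplified sketch, not a faithful implementation: `solve_mdp` and `sample_trajs` are assumed helpers the caller must supply.

```python
import numpy as np

def feature_expectations(trajectories, phi, gamma=0.99):
    """Discounted feature expectations mu = E[sum_t gamma^t phi(s_t)], averaged over trajectories."""
    mu = np.zeros_like(phi(trajectories[0][0]), dtype=float)
    for traj in trajectories:
        for t, s in enumerate(traj):
            mu += (gamma ** t) * phi(s)
    return mu / len(trajectories)

def apprenticeship_irl(expert_trajs, phi, solve_mdp, sample_trajs, n_iters=10, gamma=0.99):
    """Simplified feature-matching IRL.

    solve_mdp(w)         -> policy that is (approximately) optimal for r(s) = w . phi(s)
    sample_trajs(policy) -> list of trajectories (lists of states) generated by that policy
    Both are assumed helpers, not part of any specific library.
    """
    mu_expert = feature_expectations(expert_trajs, phi, gamma)
    w = np.random.randn(len(mu_expert))        # initial guess for the reward weights
    policy = solve_mdp(w)
    for _ in range(n_iters):
        mu_policy = feature_expectations(sample_trajs(policy), phi, gamma)
        w = mu_expert - mu_policy              # push the reward toward under-represented expert features
        if np.linalg.norm(w) < 1e-3:           # expert feature expectations matched closely enough
            break
        policy = solve_mdp(w)                  # re-solve the MDP under the updated reward
    return w, policy
```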

While reward shaping might be able to solve the same task, IRL might do better according to some performance metric, and that metric will differ from problem to problem.
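For contrast, reward shaping keeps a hand-specified environment reward and adds a shaping term on top of it. A minimal sketch of potential-based shaping (Ng, Harada & Russell, 1999), which does not change the optimal policy, might look like the following; the potential function is an assumption the designer has to supply.

```python
def shaped_reward(env_reward, s, s_next, potential, gamma=0.99):
    """Potential-based shaping: add F(s, s') = gamma * Phi(s') - Phi(s) to the environment reward.

    Because F is a difference of potentials, the shaped MDP has the same
    optimal policy as the original MDP.
    """
    return env_reward + gamma * potential(s_next) - potential(s)

# Example usage with a hypothetical potential that estimates progress toward the goal:
# r_shaped = shaped_reward(r, s, s_next, potential=lambda state: state["progress"])
```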

calveeen