
Why are policy gradient methods popular in reinforcement learning when there exists a dual LP formulation in terms of occupation measures that can be solved easily?

nbro
  • Please clarify your specific problem or provide additional details to highlight exactly what you need. As it's currently written, it's hard to tell exactly what you're asking. – Community Oct 13 '22 at 11:33
  • Can you please clarify what you mean by "dual LP formulation in terms of occupation measures". I know what linear programming is, but what do you mean by "occupation measures" and "dual LP formulation in terms of occupation measures"? – nbro Dec 31 '22 at 12:34
  • Occupation measure refers to an MDP's discounted state-action visitation frequency under a given policy. For a detailed explanation of how it can be used to obtain policies, and of the corresponding dual formulation, see Section 6.9 (Linear Programming) of the book Markov Decision Processes (1994) by Martin L. Puterman. – blackbird_h71 Jan 03 '23 at 05:17
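To make the definition in the comment above concrete, here is a small sketch that computes the occupation measure of a fixed policy in a toy 2-state, 2-action MDP (all transition probabilities, the policy, and the discount factor are invented for illustration):

```python
import numpy as np

# Toy 2-state, 2-action discounted MDP; all numbers are invented for illustration.
gamma = 0.9
P = np.array([                      # P[a, s, s'] = transition probabilities
    [[0.8, 0.2], [0.3, 0.7]],       # action 0
    [[0.1, 0.9], [0.6, 0.4]],       # action 1
])
pi = np.array([[0.5, 0.5],          # pi[s, a] = a fixed (arbitrary) policy
               [0.2, 0.8]])
mu0 = np.array([1.0, 0.0])          # initial state distribution

# State-to-state transition matrix under pi: P_pi[s, s'] = sum_a pi[s, a] * P[a, s, s']
P_pi = np.einsum('sa,ast->st', pi, P)

# Discounted state visitation d solves d = mu0 + gamma * P_pi^T d
d = np.linalg.solve(np.eye(2) - gamma * P_pi.T, mu0)

# Occupation measure: mu(s, a) = d(s) * pi(a | s); its entries sum to 1/(1-gamma)
occ = d[:, None] * pi
```

The total mass of the occupation measure is always 1/(1-gamma), which is a useful sanity check.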

1 Answer


Policy gradient methods are popular in reinforcement learning because they are fast and easy to implement. They also tend to work well on simple problems, for example when the Hellinger distance between two measures is small.
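As a rough sketch of the "easy to implement" point, here is a minimal REINFORCE-style policy gradient on a toy two-armed bandit; the reward means, learning rate, and baseline step size are arbitrary choices for illustration:

```python
import numpy as np

# Minimal REINFORCE sketch on a two-armed bandit; the reward means, learning
# rate, and baseline step size below are arbitrary choices for illustration.
rng = np.random.default_rng(0)
reward_means = np.array([0.2, 0.8])   # arm 1 is better
theta = np.zeros(2)                   # softmax policy parameters
alpha, baseline = 0.1, 0.0

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

for _ in range(2000):
    p = softmax(theta)
    a = rng.choice(2, p=p)                      # sample an action from pi_theta
    r = rng.normal(reward_means[a], 0.1)        # noisy reward
    grad_log = -p
    grad_log[a] += 1.0                          # gradient of log pi_theta(a)
    theta += alpha * (r - baseline) * grad_log  # REINFORCE update with baseline
    baseline += 0.05 * (r - baseline)           # running-average baseline

# softmax(theta) should now put most of its probability on the better arm
```

The whole method is a sampled stochastic-gradient loop, which is why it scales so naturally and is so widely used in practice.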

Dual LP methods may be more accurate when the problem is not too simple, but they can be harder to implement and may require more computational resources. They are preferable when you have more information about the underlying distribution of the data.
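For comparison, here is a hedged sketch of the dual (occupation-measure) LP for a toy 2-state, 2-action MDP, solved with scipy.optimize.linprog. All transition probabilities and rewards are invented; the LP maximizes expected discounted reward subject to the flow constraints on the occupation measure:

```python
import numpy as np
from scipy.optimize import linprog

# Dual (occupation-measure) LP for a toy 2-state, 2-action discounted MDP;
# all transition probabilities and rewards are invented for illustration.
gamma, nS, nA = 0.9, 2, 2
P = np.array([                      # P[a, s, s']
    [[0.8, 0.2], [0.3, 0.7]],
    [[0.1, 0.9], [0.6, 0.4]],
])
R = np.array([[1.0, 0.0],           # R[s, a]
              [0.0, 2.0]])
mu0 = np.array([0.5, 0.5])          # initial state distribution

# Maximize sum_{s,a} mu(s,a) R(s,a)  <=>  minimize -R (linprog minimizes)
c = -R.flatten()

# Flow constraints: sum_a mu(s',a) - gamma * sum_{s,a} P(s'|s,a) mu(s,a) = mu0(s')
A_eq = np.zeros((nS, nS * nA))
for s2 in range(nS):
    for s in range(nS):
        for a in range(nA):
            A_eq[s2, s * nA + a] = float(s == s2) - gamma * P[a, s, s2]

res = linprog(c, A_eq=A_eq, b_eq=mu0, bounds=[(0, None)] * (nS * nA))
mu = res.x.reshape(nS, nA)

# An optimal vertex corresponds to a deterministic policy: pi(s) = argmax_a mu(s, a)
policy = mu.argmax(axis=1)
```

Note the LP has one variable per state-action pair and one constraint per state, which illustrates why this approach becomes expensive for large or continuous state spaces.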

Faizy