Questions tagged [dynamic-programming]

For questions related to the dynamic programming paradigm in the context of AI (and, in particular, reinforcement learning).

25 questions
10 votes, 3 answers

What algorithms are considered reinforcement learning algorithms?

What are the areas/algorithms that belong to reinforcement learning? TD(0), Q-Learning and SARSA are all temporal-difference algorithms, which belong to the reinforcement learning area, but is there more to it? Are the dynamic programming…
5 votes, 1 answer

Why does TD Learning require Markovian domains?

One of my friends and I were discussing the differences between Dynamic Programming, Monte-Carlo, and Temporal Difference (TD) Learning as policy evaluation methods - and we agreed on the fact that Dynamic Programming requires the Markov assumption…
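For context, the tabular TD(0) update being compared is

$$ V(S_t) \leftarrow V(S_t) + \alpha \big[ R_{t+1} + \gamma V(S_{t+1}) - V(S_t) \big] $$

The bootstrap term $V(S_{t+1})$ stands in for the entire future, which is only justified when the state is Markov; Monte Carlo instead waits for the complete return $G_t$ and needs no such assumption.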
4 votes, 1 answer

Why is the update rule of the value function different in policy evaluation and policy iteration?

In the textbook "Reinforcement Learning: An Introduction", by Richard Sutton and Andrew Barto, the pseudocode for Policy Evaluation is given as follows: The update equation for $V(s)$ comes from the Bellman equation for $v_{\pi}(s)$, which is…
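A minimal sketch of that update loop as code, assuming a small tabular MDP passed in as plain dicts (the data layout here is illustrative, not from the book):

```python
def policy_evaluation(P, R, policy, gamma=0.9, theta=1e-8):
    """In-place iterative policy evaluation, following the book's pseudocode.

    P[s][a]   : dict mapping next state s' -> p(s' | s, a)
    R[s][a]   : expected immediate reward for taking a in s
    policy[s] : dict mapping action a -> pi(a | s)
    Every successor state must appear as a key of P; terminal states
    map to an empty action dict and keep value 0.
    """
    V = {s: 0.0 for s in P}
    while True:
        delta = 0.0
        for s in P:
            v = V[s]
            # Bellman expectation backup for v_pi(s)
            V[s] = sum(
                policy[s][a]
                * (R[s][a] + gamma * sum(p * V[s2] for s2, p in P[s][a].items()))
                for a in P[s]
            )
            delta = max(delta, abs(v - V[s]))
        if delta < theta:  # stop once the largest per-sweep change is tiny
            return V
```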
3 votes, 1 answer

Are policy and value iteration used only in grid-world-like scenarios?

I am trying to teach myself reinforcement learning. At the moment I am focusing on policy and value iteration, and I am running into several problems and doubts. One of the main doubts is that I can't find many diversified examples of how…
3 votes, 1 answer

What are the common techniques one could use to deal with collisions in a transposition table?

Consider an iterative deepening search using a transposition table. Whenever the transposition table is full, what are common strategies applied to replace entries in the table? I'm aware of two strategies: Replace the oldest entry whenever a new…
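Beyond the two strategies the asker mentions, a common arrangement is a two-tier bucket: a depth-preferred slot next to an always-replace slot. A sketch under those assumptions (all names illustrative):

```python
from dataclasses import dataclass

@dataclass
class Entry:
    key: int      # full hash key, kept to detect index collisions on probe
    depth: int    # search depth that produced the stored result
    score: float

class TranspositionTable:
    def __init__(self, size: int):
        self.size = size
        self.deep = [None] * size    # depth-preferred tier
        self.always = [None] * size  # always-replace tier

    def store(self, key: int, depth: int, score: float) -> None:
        i = key % self.size
        entry = Entry(key, depth, score)
        slot = self.deep[i]
        if slot is None or depth >= slot.depth:
            self.deep[i] = entry     # deeper results are more expensive to recompute
        else:
            self.always[i] = entry   # shallow results go to the volatile tier

    def probe(self, key: int):
        i = key % self.size
        for slot in (self.deep[i], self.always[i]):
            if slot is not None and slot.key == key:
                return slot
        return None
```

The depth-preferred tier protects expensive deep searches from being evicted by the flood of shallow entries near the leaves, while the always-replace tier keeps the table fresh.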
2 votes, 1 answer

Do we need the transition probability function when calculating the importance sampling ratio?

I am reading the book titled "Reinforcement Learning: An Introduction" (by Sutton and Barto). I am at chapter 5, which is about Monte Carlo methods, but now I am quite confused. There is one thing I don't particularly understand. Why do we need the…
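For context, the resolution in chapter 5 of the book is that the transition probabilities appear in both the target-policy and behaviour-policy trajectory probabilities and cancel, leaving a ratio that depends only on the two policies:

$$ \rho_{t:T-1} = \prod_{k=t}^{T-1} \frac{\pi(A_k \mid S_k)\, p(S_{k+1} \mid S_k, A_k)}{b(A_k \mid S_k)\, p(S_{k+1} \mid S_k, A_k)} = \prod_{k=t}^{T-1} \frac{\pi(A_k \mid S_k)}{b(A_k \mid S_k)} $$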
2 votes, 2 answers

Should we feed a greater fraction of terminal states to the value network so that their values are learned first?

The basis of Q-learning is recursive (similar to dynamic programming), where only the value of the terminal state is known exactly. Wouldn't it make sense to feed the model a greater proportion of terminal states initially, to ensure that the…
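For reference, the recursion in question is the tabular Q-learning update

$$ Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \alpha \big[ R_{t+1} + \gamma \max_{a'} Q(S_{t+1}, a') - Q(S_t, A_t) \big] $$

where the bootstrap term $\max_{a'} Q(S_{t+1}, a')$ is defined to be zero when $S_{t+1}$ is terminal, so targets for terminal transitions are exact from the start, while all earlier targets are biased until the values they bootstrap from converge.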
2 votes, 0 answers

What trait of a planning problem makes reinforcement learning a well-suited solution?

Planning problems were among the first problems studied at the dawn of AI (Shakey the robot). Graph search (e.g. A*) and planning (e.g. GraphPlan) algorithms can be very efficient at generating a plan. As for problem formulation, for planning…
2 votes, 1 answer

Is it possible to retrieve the optimal policy from the state value function?

One can easily retrieve the optimal policy from the action value function, but how about obtaining it from the state value function?
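For reference, the standard construction is a greedy one-step lookahead, which, unlike the $Q$-based version, requires the model $p(s' \mid s, a)$:

$$ \pi_*(s) = \arg\max_a \sum_{s'} p(s' \mid s, a) \big[ r(s, a) + \gamma V_*(s') \big] $$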
1 vote, 1 answer

Interpretation of the Dynamic Time Warping (DTW) graph

How can I interpret the DTW graph? I understand the algorithm behind DTW, but I struggle to interpret the graph. When I compute the DTW for a signal that is the same signal but shifted in time, my result will be a diagonal, straight line…
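A minimal sketch of the accumulated-cost matrix behind such a graph, assuming NumPy and absolute difference as the local cost (both illustrative choices):

```python
import numpy as np

def dtw_matrix(x, y):
    """Accumulated-cost matrix D; the DTW 'graph' is the cheapest path through it."""
    n, m = len(x), len(y)
    D = np.full((n + 1, m + 1), np.inf)  # padded border blocks illegal moves
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(x[i - 1] - y[j - 1])
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[1:, 1:]

# Identical signals put the warping path exactly on the main diagonal;
# a time shift bends the path off the diagonal by roughly the shift size.
t = np.linspace(0, 2 * np.pi, 100)
D = dtw_matrix(np.sin(t), np.sin(t + 0.5))
```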
1 vote, 0 answers

Structured policies in dynamic programming: solving a toy example

I am trying to solve a dynamic programming toy example. Here is the prompt: imagine you arrive in a new city for $N$ days and every night you need to pick a restaurant to have dinner at. The qualities of the restaurants are iid according to distribution…
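Under one plausible reading of the (truncated) prompt, where each night you either revisit the best restaurant found so far or sample a new one iid from a finite distribution, backward induction over (nights left, best quality so far) solves it. Everything below is a hypothetical sketch of that reading, not the original assignment:

```python
def best_restaurant_value(N, qualities, probs):
    """V[n][b]: optimal expected total quality with n nights left, when the
    best restaurant found so far has index b. Assumes qualities is sorted
    ascending, so a larger index always means a better restaurant."""
    K = len(qualities)
    V = [[0.0] * K for _ in range(N + 1)]
    for n in range(1, N + 1):
        for b in range(K):
            exploit = qualities[b] + V[n - 1][b]   # revisit the best so far
            explore = sum(                          # try a fresh iid draw
                probs[j] * (qualities[j] + V[n - 1][max(b, j)])
                for j in range(K)
            )
            V[n][b] = max(exploit, explore)
    return V
```

The usual finite-horizon structure shows up: with many nights remaining, exploring is worth the risk; as the horizon shrinks, exploiting the best known restaurant takes over.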
1 vote, 0 answers

Is there a way to use DQN to find the optimal combination of actions (control inputs) and environment parameters?

I am using DQN to find the optimal sequence of control inputs to a dynamic system. The setup is as follows: At the beginning of each episode, the system is initialized to the SAME initial condition s0. Each episode spans from 0 seconds to tf…
1 vote, 1 answer

How are these two versions of the Bellman optimality equation related?

I saw two versions of the optimality equation for $V_{*}(s)$ and $Q_{*}(s,a)$. The first one is: $$ V_{*}(s)=\max _{a} \sum_{s^{\prime}} P_{s s^{\prime}}^{a}\left(r(s, a)+\gamma V_{*}\left(s^{\prime}\right)\right) $$ and $$ Q_{*}(s,…
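For reference, the two optimal value functions are linked by $V_*(s) = \max_a Q_*(s, a)$, which in this notation gives the standard form

$$ Q_*(s, a) = \sum_{s'} P_{s s'}^{a}\left( r(s, a) + \gamma \max_{a'} Q_*\left(s', a'\right) \right) $$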
1 vote, 1 answer

Is there a notion of exploration-exploitation tradeoff in dynamic programming (or model-based RL)?

Is there a notion of exploration-exploitation tradeoff in dynamic programming (or model-based RL)?
1 vote, 1 answer

How do we get from conditional expectation on both state and action to only state in the proof of the Policy Improvement Theorem?

I'm going through Sutton and Barto's book Reinforcement Learning: An Introduction and I'm trying to understand the proof of the Policy Improvement Theorem, presented on page 78 of the physical book. The theorem goes as follows: Let $\pi$ and $\pi'$…
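For reference, the theorem states that if $q_\pi(s, \pi'(s)) \ge v_\pi(s)$ for all $s \in \mathcal{S}$, then $v_{\pi'}(s) \ge v_\pi(s)$ for all $s \in \mathcal{S}$. The step the question asks about is

$$ \mathbb{E}\left[ R_{t+1} + \gamma v_\pi(S_{t+1}) \mid S_t = s, A_t = \pi'(s) \right] = \mathbb{E}_{\pi'}\left[ R_{t+1} + \gamma v_\pi(S_{t+1}) \mid S_t = s \right], $$

which holds because $\pi'$ is deterministic: under $\pi'$, conditioning on $S_t = s$ already fixes $A_t = \pi'(s)$.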