
I have a problem similar to the vehicle routing problem (VRP) that I want to solve with reinforcement learning. In this problem, the agent starts from the point $(x_0, y_0)$ and then needs to travel through $N$ other points, $(x_1, y_1), \dots, (x_N, y_N)$. The goal is to minimize the total distance travelled.

Right now, I am modeling a state as a point $(x, y)$. There are 8 possible actions: go east, north-east, north, north-west, west, south-west, south, and south-east. Each action moves the agent 100 metres.

Once the agent gets near a destination point, that point is removed from the list of destination points.

The reward is the reciprocal of the total distance travelled until all destination points are reached (there is also a short optimisation that reorders the remaining points for a better reward).
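For concreteness, here is a minimal sketch of how such an environment could be encoded in Python. Only the 8 directions, the 100-metre pace and the end-of-episode reward come from the setup above; the class name, the 50-metre "near" threshold and the coordinate convention are illustrative assumptions.

import math

# Minimal sketch of the environment described above. The class name,
# the 50 m "near" threshold and the coordinate convention are assumptions;
# the 8 directions, the 100 m pace and the end-of-episode reward follow
# the description in the question.
class RoutingEnv:
    STEP = 100.0  # metres moved per action
    # 8 unit vectors: E, NE, N, NW, W, SW, S, SE
    DIRECTIONS = [(math.cos(k * math.pi / 4), math.sin(k * math.pi / 4))
                  for k in range(8)]

    def __init__(self, start, destinations, near_threshold=50.0):
        self.start = start
        self.destinations = list(destinations)
        self.near = near_threshold
        self.reset()

    def reset(self):
        self.pos = self.start
        self.remaining = list(self.destinations)
        self.total_distance = 0.0
        return self.pos

    def step(self, action):
        dx, dy = self.DIRECTIONS[action]
        x, y = self.pos
        self.pos = (x + dx * self.STEP, y + dy * self.STEP)
        self.total_distance += self.STEP
        # drop any destination the agent is now "near"
        self.remaining = [p for p in self.remaining
                          if math.dist(self.pos, p) > self.near]
        done = not self.remaining
        # reward only when all points are reached: reciprocal of total distance
        reward = 1.0 / self.total_distance if done else 0.0
        return self.pos, reward, done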

I'm using a DNN model to represent the policy of the reinforcement learning agent, so the DNN maps a state to a suitable action.
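As a rough illustration, the policy network is something like the following minimal sketch (the layer sizes, optimiser and loss below are placeholder assumptions, not the actual settings used):

import tensorflow as tf

# Minimal sketch of the policy DNN: state (x, y) in, a probability over
# the 8 direction actions out. Layer sizes, optimiser and loss are
# placeholder assumptions.
policy = tf.keras.Sequential([
    tf.keras.Input(shape=(2,)),                      # state: (x, y)
    tf.keras.layers.Dense(64, activation="relu"),
    tf.keras.layers.Dense(64, activation="relu"),
    tf.keras.layers.Dense(8, activation="softmax"),  # one output per direction
])
policy.compile(optimizer="adam", loss="sparse_categorical_crossentropy")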

However, after every action of the agent with a good reward, one more sample is added to the training data, so it's essentially incremental learning.

Should the policy model be retrained again and again with every new sample added? That takes too much time.

Any better RL approach to the problem above?

nbro
Dee
  • Hello. Which RL algorithm are you using or planning to use to find the policy? – nbro Jan 21 '21 at 02:48
  • I may brute-force it, or use something greedy to find good rewards – Dee Jan 21 '21 at 02:50
  • Why don't you want to use some existing RL algorithm, such as Q-learning? Can you please describe more in detail the problem that you're trying to solve with RL (i.e. state space, action space and reward function)? Is this a toy problem that you came up with? – nbro Jan 21 '21 at 02:51
  • @nbro I've just added some details to the question – Dee Jan 21 '21 at 03:03
  • Thanks! Your post is way clearer now :) btw, you can use [mathjax/latex on this site](https://math.meta.stackexchange.com/questions/5020/mathjax-basic-tutorial-and-quick-reference). As far as I recall, one good approach to the travelling salesman problem (something similar to what you're trying to do) is **ant colony optimization** algorithms. I don't know if RL is the best approach for this problem, but it can certainly be applied to it too. – nbro Jan 21 '21 at 03:08
  • @nbro I know RL isn't good for TSP, I'm just trying it out to see whether it's better than the supervised model – Dee Jan 21 '21 at 03:10
  • Ok, then I think you should clarify that you want to use RL (and _not_ another approach), otherwise people may think that you're also looking for other types of solutions given your question "Any better tactics for the problem above?". Btw, ACO algorithms are not necessarily supervised. They are meta-heuristics, in the same way that genetic algorithms are also meta-heuristics. – nbro Jan 21 '21 at 03:11
  • Do you have to use the action space that you have proposed? I have seen TSP solved using RL but the action space is usually just the cities (here it would be your points), i.e. which city to visit next. – David Jan 21 '21 at 15:27
  • The model is a DNN, and there are thousands of destinations, so it won't be able to classify them into destination indices (city indices) – Dee Jan 22 '21 at 02:05

1 Answer


I found a concept called 'experience replay', which performs a single training step every time a new data sample is added, instead of retraining to the maximum number of epochs each time.

That is, instead of this training loop:

for i in range(max_paces):
    find the action with the best reward
    append it to the trajectory to build the training input inp
    train(batch_size=len(inp), epochs=max_epochs)  # full retraining at every step: slow

Train this way instead (for a single-episode problem, no incremental data):

for i in range(max_epochs):
    reset environment

    for j in range(max_paces):
        find the action with the best reward
        append it to the trajectory to build inp
        train(batch_size=len(inp), epochs=1)  # one training step per new sample

And train this way for a multi-episode problem (incremental data):

for i in range(max_epochs):
    reset environment
    inc_trajectory = get_random_past_data()  # replay a random batch of past episodes

    for j in range(max_paces):
        find the action with the best reward
        append it to inc_trajectory to build inp
        train(batch_size=len(inp), epochs=1)  # one training step per new sample

For multi-episode problems, especially those with an unlimited number of episodes, the training loop needs to forget (i.e. exclude from training) episodes that are very distant in the past, or select a random subset of old episodes (treated as a random batch) for every round of experience replay. Without eliminating some old data, the amount of training data keeps growing without bound.
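A minimal sketch of such a bounded buffer (the `ReplayBuffer` name, the capacity of 10,000 samples and the batch size of 64 are arbitrary illustrative choices, not fixed requirements):

import random
from collections import deque

# Sketch of a bounded replay buffer: once the capacity is exceeded the
# oldest samples are dropped automatically, and each training round only
# sees a random mini-batch of past experience. Capacity and batch size
# are arbitrary illustrative values.
class ReplayBuffer:
    def __init__(self, capacity=10_000):
        self.buffer = deque(maxlen=capacity)

    def add(self, sample):
        self.buffer.append(sample)

    def sample(self, batch_size=64):
        k = min(batch_size, len(self.buffer))
        return random.sample(list(self.buffer), k)

# Usage inside the incremental training loop above:
#   buffer.add((state, action, reward, next_state))
#   batch = buffer.sample()
#   train(batch, epochs=1)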

Dee