
I am using Q-learning in the Julia language.

Because of the solver's configuration, actions have to be defined over the whole action space, so impossible actions also have to be included. This means I can't use a function that, given a state, returns only the possible actions. To work around this, I am using a dummy terminal state with a bad reward.

When the agent tries to take an impossible action, what is the difference between transitioning to a terminal dummy state (with a bad reward) and remaining in the same state (with a bad reward) until the end of the episode? Are there other possible solutions?

Specifically, how can I either avoid defining an impossible action at all, or alternatively define an action explicitly as impossible?

Aquila
  • What do you mean by "Because of the solver"? Do you mean "Because of Q-learning"? Also, which function are you referring to when you say "a function that, given a state, returns all the possible actions"? A policy? Also, what are you using the "dummy state and a bad reward" for? – nbro May 19 '22 at 08:11
  • I am using the Julia language and its Q-learning solver doesn't work with state-dependent actions. The function was written by me. – Aquila May 19 '22 at 08:13
  • By "state dependent actions", do you mean that your Julia implementation doesn't accept an action that depend on the state (whatever that means - maybe you mean that the function you're calling doesn't accept a state as input in order to product an action)? I don't understand your question at all. **Edit your post to try to describe your problem more in detail and clearly**. Right now, it seems that you have a problem related to impossible/illegal actions, but, apart from that, it's not fully clear what your problem is (at least to me). – nbro May 19 '22 at 08:15
  • Done, I have changed the description – Aquila May 19 '22 at 08:30

1 Answer


You could code your agent's policy to never select impossible actions.

Your other question implies that you are writing your own behaviour policy function (e.g. you asked about implementing a softmax policy). The behaviour policy function must take the current state as input, and output an action choice.
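For illustration, here is a minimal sketch of such a behaviour policy in Julia (a softmax/Boltzmann policy), assuming the Q values live in a plain matrix indexed as `Q[s, a]` with integer states and actions; the function name, layout and temperature parameter are assumptions for this sketch, not part of your solver:

```julia
using StatsBase   # provides sample() and Weights

# Softmax (Boltzmann) behaviour policy over a tabular Q matrix.
# τ is the temperature: higher values mean more exploration.
function softmax_policy(Q::AbstractMatrix, s::Int; τ::Float64 = 1.0)
    prefs = Q[s, :] ./ τ
    prefs .-= maximum(prefs)                      # numerical stability
    probs = exp.(prefs) ./ sum(exp.(prefs))
    return sample(1:size(Q, 2), Weights(probs))   # sampled action index
end
```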

For your environment to work, some code somewhere presumably knows which actions are "impossible".

Rule the impossible actions out in the policy: call the environment code that knows about them and assign them probability zero. Then you don't need to care that the rest of the framework allocates places in the Q table for them.
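
As a sketch of that idea, assuming a hypothetical `possible_actions(env, s)` helper in your environment code that returns the indices of the legal actions in state `s` (every other action implicitly gets probability zero):

```julia
using StatsBase

# Softmax policy restricted to the actions the environment says are legal in s.
function masked_softmax_policy(Q::AbstractMatrix, env, s::Int; τ::Float64 = 1.0)
    valid = possible_actions(env, s)    # hypothetical environment query, e.g. [1, 3, 4]
    prefs = Q[s, valid] ./ τ
    prefs .-= maximum(prefs)            # numerical stability
    probs = exp.(prefs) ./ sum(exp.(prefs))
    return valid[sample(1:length(valid), Weights(probs))]
end
```

The Q table still contains entries for the impossible actions, but since they are never selected they are never visited or updated.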

An alternative to querying the environment directly from inside the agent would be to pre-populate the Q table with large negative values for all impossible combinations, and set probability to zero in your policy for any Q value this low or lower. You should be able to choose a really large negative value for this purpose (perhaps even -Inf if supported) to avoid any interference from the rest of the environment's reward structure. The advantage of this approach is that an eventual greedy agent would not need to query the environment code; it could use the Q table as-is.
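
A minimal sketch of that initialisation, again in Julia, assuming a hypothetical `is_possible(s, a)` predicate and an illustrative tabular layout:

```julia
# Pre-populate impossible (state, action) pairs with -Inf so a greedy policy
# can never pick them. is_possible(s, a) is a hypothetical predicate and the
# table sizes are purely illustrative.
n_states, n_actions = 100, 4
Q = zeros(n_states, n_actions)

for s in 1:n_states, a in 1:n_actions
    if !is_possible(s, a)
        Q[s, a] = -Inf
    end
end

# -Inf entries can never be the maximum, so the greedy choice is always legal.
greedy_action(Q, s) = argmax(Q[s, :])
```

Because the behaviour policy never selects those entries, they are never updated during learning and simply stay at -Inf.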

Neil Slater