
I've read in this discussion that "reinforcement learning is a way of finding the value function of a Markov Decision Process".

I want to implement an RL model whose state space and action space dimensions would increase as the MDP progresses. But I don't know how to define it in terms of e.g. Q-learning or some similar method.

Precisely, I want to create a model that would generate boolean circuits. At each step, it could perform four different actions:

  • apply $AND$ gate on two wires,
  • apply $OR$ gate on two wires,
  • apply $NOT$ gate on one wire,
  • add new wire.

Each of the first three actions could be performed on any currently available wires (targets). Also, the number of wires will change over time: it might increase if we perform the fourth action, or decrease after e.g. the application of an $AND$ gate (which takes two wires as input and outputs just one).
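To make the variable-sized action space concrete, here is a minimal sketch of such an environment in Python. All names, the reward placeholder, and the `max_wires` cap are hypothetical choices for illustration; the key idea is that the set of legal actions is re-enumerated from the current number of wires at every step.

```python
import itertools

class CircuitEnv:
    """Sketch of the circuit-building MDP described above.

    The state tracks the currently available wires (here just a count,
    plus a log of applied gates); the set of legal actions grows and
    shrinks with the number of wires.
    """

    def __init__(self, n_wires=2, max_wires=8):
        self.n_wires = n_wires
        self.max_wires = max_wires  # hypothetical cap on the fourth action
        self.gates = []             # history of applied gates

    def legal_actions(self):
        """Enumerate the actions valid in the current state."""
        actions = []
        # AND / OR act on any pair of distinct wires.
        for a, b in itertools.combinations(range(self.n_wires), 2):
            actions.append(("AND", a, b))
            actions.append(("OR", a, b))
        # NOT acts on a single wire.
        for w in range(self.n_wires):
            actions.append(("NOT", w))
        # Adding a wire is only allowed below the cap.
        if self.n_wires < self.max_wires:
            actions.append(("ADD_WIRE",))
        return actions

    def step(self, action):
        gate = action[0]
        if gate in ("AND", "OR"):
            self.n_wires -= 1   # two input wires merge into one output
        elif gate == "ADD_WIRE":
            self.n_wires += 1
        self.gates.append(action)
        reward = 0.0            # placeholder: task-specific reward goes here
        done = self.n_wires == 1
        return self.n_wires, reward, done
```

With this shape, a tabular method like Q-learning is awkward (the action set is state-dependent), but methods that score candidate actions individually, or mask invalid actions in a policy network, can work directly off `legal_actions()`.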

brzepkowski
    It's not fully clear to me what you mean by "targets" here. Can you clarify that? – nbro Dec 12 '20 at 17:23
  • Action $a_0$ might be for example an AND gate and targets would be specific circuits representing bits. So AND action would act on two targets - bits, while e.g. NOT action would act just on one target. – brzepkowski Dec 12 '20 at 19:23
  • I still don't get what is a target in RL terms. What would a target correspond to in RL or MDPs? – nbro Dec 12 '20 at 20:31
  • @nbro: My understanding is that the possible targets are part of the state, and both the state space and action space dimensions may increase as the MDP progresses. – Neil Slater Dec 12 '20 at 20:50
  • @NeilSlater That's exactly what I wanted to say, but I couldn't formulate it as clearly as you did. I edited my question. – brzepkowski Dec 12 '20 at 21:45
  • @brzepkowski It may be a good idea to describe the exact problem you're trying to solve with RL. That will provide some more context and clarity, although the question is now is clearer. – nbro Dec 12 '20 at 21:54
  • I suggest keep the mention of actions that add "targets" and the need to specify which target to work with in each action. If there is any important constraint on targets - a maximum number allowed, or that they must be arranged in a graph or in a discrete space (like a grid) that may be worth mentioning. – Neil Slater Dec 12 '20 at 22:14
  • I added the precise description of what I want to achieve. – brzepkowski Dec 15 '20 at 08:59

0 Answers