I've read in this discussion that "reinforcement learning is a way of finding the value function of a Markov Decision Process".
I want to implement an RL model whose state-space and action-space dimensions change as the MDP progresses, but I don't know how to define this in terms of e.g. Q-learning or a similar method.
More precisely, I want to create a model that generates boolean circuits. At each step, it can perform one of four actions:
- apply $AND$ gate on two wires,
- apply $OR$ gate on two wires,
- apply $NOT$ gate on one wire,
- add new wire.
Each of the first three actions can be performed on any currently available wires (targets), and the number of wires changes over time: it increases when we perform the fourth action, and decreases after, e.g., applying an $AND$ gate (which takes two wires as input and outputs just one). A minimal sketch of such an environment is below.
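To make this concrete, here is a minimal Python sketch of the environment I have in mind. Everything in it (the `CircuitEnv` name, the string representation of wires, the placeholder reward, and the termination condition) is my own illustration rather than a proposed solution; it only shows how the set of legal actions grows and shrinks with the number of wires.

```python
# Minimal sketch of the circuit-building environment (not a working RL agent).
# All names and the placeholder reward/termination are assumptions for illustration.
import itertools
import random


class CircuitEnv:
    def __init__(self, n_inputs=2):
        # Each wire is represented by the expression that produces it.
        self.wires = [f"x{i}" for i in range(n_inputs)]

    def legal_actions(self):
        """Enumerate the actions available in the current state.
        The size of this list changes as wires are added or consumed."""
        actions = [("NEW",)]  # add a fresh wire
        for i, j in itertools.combinations(range(len(self.wires)), 2):
            actions.append(("AND", i, j))
            actions.append(("OR", i, j))
        for i in range(len(self.wires)):
            actions.append(("NOT", i))
        return actions

    def step(self, action):
        kind = action[0]
        if kind == "NEW":
            self.wires.append(f"x{len(self.wires)}")
        elif kind == "NOT":
            i = action[1]
            self.wires[i] = f"(NOT {self.wires[i]})"
        else:  # AND / OR consume two wires and produce one, so the wire count drops
            i, j = action[1], action[2]
            combined = f"({self.wires[i]} {kind} {self.wires[j]})"
            self.wires = [w for k, w in enumerate(self.wires) if k not in (i, j)]
            self.wires.append(combined)
        reward = 0.0  # placeholder: e.g. how well the circuit matches a target truth table
        done = len(self.wires) == 1  # assumed termination criterion, also a placeholder
        return self.wires, reward, done


# Example rollout with a random policy.
env = CircuitEnv(n_inputs=2)
for _ in range(10):
    state, reward, done = env.step(random.choice(env.legal_actions()))
    if done:
        break
print(env.wires)
```

My question is how to set up Q-learning (or a similar method) when `legal_actions()` returns a different number of actions in every state.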