
I am experimenting with OpenAI Gym and reinforcement learning. As far as I understand it, the environment waits for the agent to make a decision, so it's a sequential operation like this:

decision = agent.decide(state)
state, reward, done = environment.act(decision)
agent.train(state, reward)

Doing it in this sequential way, the Markov property is fulfilled: the new state is a result of the old state and the action. However, many games will not wait for the player to make a decision. The game keeps running, and the action may come too late.
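For reference, with the classic Gym API (pre-0.26, where step() returns a 4-tuple) that loop looks roughly like this, here with a random agent on CartPole-v1:

import gym

# Minimal sketch of the blocking, turn-based Gym loop (classic pre-0.26 API).
env = gym.make("CartPole-v1")
state = env.reset()
done = False

while not done:
    action = env.action_space.sample()            # the agent "decides"
    state, reward, done, info = env.step(action)  # the env only advances here
    # agent.train(state, reward) would go here; the environment does not move
    # on until step() is called again, however long that takes.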

Has it been observed, or is it even possible, that a neural network adjusts its weights so that the computer evaluates it faster and thus makes the "better" decision? E.g. one AI beats another because it is faster.

Before posting an answer like "there is always the same number of calculations, so it's impossible", please be aware that there are effects such as caching (L1 cache versus RAM), branch prediction and maybe other things.

nbro
Thomas Weller
  • It's important to note that neural networks are usually trained with gradient descent or other learning algorithms (which are usually iterative/numerical algorithms that use data to find some target function, which will hopefully be represented by the neural network). So, I am not sure if it's a good idea to talk about "self-optimization" (in the title). – nbro Aug 23 '21 at 00:33

1 Answer


Yes and No.

Yes, it is possible to achieve that result. But instead of a self-optimizing neural network, I'd recommend other approaches:

1. Don't let training time interfere.

If you are trying to train the agent DURING the environment's runtime, that's probably the problem. Training time is usually much longer than inference time. And deployed models usually don't train, so this won't be a problem in production.

You can do 2 things about that:

1.1 "Pause" the game during training.

It might look like "cheating", but from your agent's point of view, it is not actually playing during training time. And once again, this just simulates how it would behave in production.
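In a simulated environment you get this for free, because the world only advances when you call step(). A minimal sketch, assuming a hypothetical agent with decide() and train_step() methods:

state = env.reset()
done = False

while not done:
    action = agent.decide(state)
    next_state, reward, done, info = env.step(action)
    # The world is effectively "paused" here: nothing moves until the next
    # env.step() call, no matter how long this training update takes.
    agent.train_step(state, action, reward, next_state, done)
    state = next_state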

But if you can't pause it:

1.2 Disable training during runtime.

Store all states and decisions, wait until the game is over, and then train on the whole batch.

Professional chess players don't try to learn during a blitz challenge. But they do study their own games later to learn from their mistakes.
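A minimal sketch of that idea, again assuming hypothetical agent methods (decide() and train_on_batch()):

# Collect the whole episode without training, then learn from it afterwards.
episode = []                 # (state, action, reward, next_state, done) tuples
state = env.reset()
done = False

while not done:
    action = agent.decide(state)                  # inference only: fast
    next_state, reward, done, info = env.step(action)
    episode.append((state, action, reward, next_state, done))
    state = next_state

agent.train_on_batch(episode)   # the slow part happens after the game is over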

2. Optimizing hyper-parameters for speed.

You could tweak some hyper-parameters (like the size of your NN) while looking for a faster model. Keep in mind it would still be an algorithm that runs in a roughly fixed time, but a smaller model can make that fixed time short enough.
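As a starting point, you can simply benchmark inference latency while varying the network size. A rough sketch using PyTorch (the layer sizes are arbitrary, purely for illustration):

import time
import torch
import torch.nn as nn

# Compare inference latency of small vs. large policy networks.
def make_mlp(hidden):
    return nn.Sequential(nn.Linear(4, hidden), nn.ReLU(),
                         nn.Linear(hidden, hidden), nn.ReLU(),
                         nn.Linear(hidden, 2))

state = torch.randn(1, 4)
for hidden in (16, 256, 4096):
    net = make_mlp(hidden)
    with torch.no_grad():
        start = time.perf_counter()
        for _ in range(1000):
            net(state)
        elapsed = time.perf_counter() - start
    print(f"hidden={hidden}: {elapsed / 1000:.6f} s per decision")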

2.1 Using Machine Learning for this Optimization

There are some meta-learning techniques, and other methods like NEAT, that can automate the search for a simple, effective topology. NEAT already rewards the simplest architectures by penalizing complexity (which usually correlates with speed), but you could also force it to consider running time explicitly, as in the sketch below.
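One way to do that is to put decision latency directly into the fitness function. A hypothetical sketch (net.decide() and the trade-off constant SPEED_WEIGHT are placeholders you would adapt to your NEAT setup):

import time

SPEED_WEIGHT = 100.0   # arbitrary trade-off constant, needs tuning

def evaluate(net, env):
    # Reward the game score, penalize the total time spent deciding.
    state = env.reset()
    done, score, decision_time = False, 0.0, 0.0
    while not done:
        start = time.perf_counter()
        action = net.decide(state)                # placeholder policy call
        decision_time += time.perf_counter() - start
        state, reward, done, info = env.step(action)
        score += reward
    return score - SPEED_WEIGHT * decision_time   # slower nets get lower fitness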

3. Another Network for Another Task

You could make another small network for deciding whether the next move should be accurate or fast. Based on this result, the agent chooses between precision and speed. This choice could be made by setting a flag (like a branchPrediction parameter) or even by running a whole different algorithm:

NeedForSpeed = TimeEstimation(state)
# Sorry, I couldn't resist the pun!

# Pick a decision procedure based on how urgently a move is needed.
if NeedForSpeed > 0.8:
    decision = agent.instantReactionDecisionTree(state)
elif NeedForSpeed > 0.5:
    decision = agent.decideStandard(state, branchPrediction=True)
elif NeedForSpeed > 0.2:
    decision = agent.decideStandard(state)
else:
    decision = agent.DeepNN(state)

Bonus: Use Other ML algorithms

Some algorithms have explicit parameters that directly affect the trade-off between time and precision.

For instance, MCTS (Monte Carlo Tree Search) can keep running until it has explored all possibilities, but it can be stopped at any time, giving you the best solution found so far.
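The "anytime" pattern looks roughly like this, where mcts_iteration() and best_move() stand in for whatever MCTS implementation you use:

import time

def decide(root, time_budget_s=0.05):
    # Run search iterations until the time budget is spent, then return
    # the best move found so far.
    deadline = time.perf_counter() + time_budget_s
    while time.perf_counter() < deadline:
        mcts_iteration(root)      # one select/expand/simulate/backpropagate pass
    return best_move(root)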

So, one possibility would be trying some other method instead of Neural Networks.

Andre Goulart