I have built a custom multi-agent environment with PettingZoo that sets up a turn-based game between two agents, A and B.
I want to examine situations where malicious behavior may arise given the game rules, and I am currently looking into training approaches.
To do that, I have implemented a deterministic policy as a baseline / control.
Fixing agent A to that baseline policy, I then want to train agent B and observe the resulting behaviors.
Once B settles on a desirable behavioral pattern, I want to train agent A (with B now fixed) to see how it responds to B's actions.
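To make the question concrete, here is a minimal, self-contained sketch of what I mean by "encapsulating A as part of the environment". It assumes a PettingZoo-style AEC interface (`reset` / `agent_iter` / `last` / `step`); the toy game, policy, and names below are illustrative stand-ins, not my actual environment:

```python
class ToyMatchingEnv:
    """Toy turn-based game standing in for the real env: A moves, then B moves;
    B earns +1 for matching A's preceding move. The API mimics PettingZoo's AEC loop."""
    N_ROUNDS = 4

    def reset(self):
        self._round = 0
        self._next = "agent_A"
        self._a_move = 0
        self._last_reward = {"agent_A": 0.0, "agent_B": 0.0}
        self._done = False

    def agent_iter(self):
        while not self._done:
            yield self._next
        self._next = "agent_B"      # final pass so B observes its last reward
        yield "agent_B"

    def last(self):
        # (observation, reward accrued since this agent last acted,
        #  termination, truncation, info) -- mirroring PettingZoo's AEC API
        return self._a_move, self._last_reward[self._next], self._done, False, {}

    def step(self, action):
        if action is None:          # post-termination convention
            return
        if self._next == "agent_A":
            self._a_move = action
            self._next = "agent_B"
        else:
            self._last_reward["agent_B"] = float(action == self._a_move)
            self._round += 1
            self._done = self._round >= self.N_ROUNDS
            self._next = "agent_A"


def fixed_policy_A(obs):
    # deterministic baseline (hypothetical: always plays move 1)
    return 1


def run_episode(env, act_B):
    """Roll out one episode with A frozen; from B's point of view,
    A is just part of the environment dynamics."""
    env.reset()
    total_B = 0.0
    for agent in env.agent_iter():
        obs, reward, termination, truncation, info = env.last()
        if agent == "agent_B":
            total_B += reward       # in a real setup: store B's transition here
        if termination or truncation:
            action = None
        elif agent == "agent_A":
            action = fixed_policy_A(obs)   # frozen baseline, not trained
        else:
            action = act_B(obs)            # the learner's policy
        env.step(action)
    return total_B
```

In this sketch only B's transitions would ever reach a learner; `fixed_policy_A` is called inside the rollout exactly like any other piece of environment dynamics. My question below is about whether this framing stays valid once I later unfreeze A.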
With the above setting in mind:
Is this training approach, which keeps one agent fixed while training the other, correct?
Should I instead follow a proper MARL approach for training, or is the above scheme, which encapsulates one agent as part of the environment, sound?
In general, what requirements or desiderata should I look for that hint that a MARL approach is the correct way, and/or that a separate alternating training scheme is erroneous?