Questions tagged [reward-hacking]

For questions about reward hacking, i.e. when an agent exploits flaws or loopholes in its reward function to obtain high reward in unintended ways. The paper "Concrete Problems in AI Safety" by Dario Amodei et al. discusses this topic in more detail.
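A minimal sketch of the idea (a hypothetical toy example, not taken from the tagged questions): a designer rewards a cleaning robot via a proxy signal, +1 per sensor activation, and the agent discovers a repetitive behavior that triggers the sensor without doing any cleaning. The names `clean_room` and `wave_at_sensor` are invented for illustration.

```python
def true_objective(actions):
    """What the designer actually wants: rooms cleaned."""
    return actions.count("clean_room")

def proxy_reward(actions):
    """What the designer implemented: +1 per sensor activation.
    Both genuinely cleaning and waving at the sensor trigger it."""
    return sum(1 for a in actions if a in ("clean_room", "wave_at_sensor"))

intended = ["clean_room"] * 5
hacked = ["wave_at_sensor"] * 50  # repetitive behavior that games the sensor

# The hack earns far more proxy reward while achieving nothing:
print(proxy_reward(intended), true_objective(intended))  # 5 5
print(proxy_reward(hacked), true_objective(hacked))      # 50 0
```

The gap between `proxy_reward` and `true_objective` is the crux: an optimizer that only sees the proxy has no reason to prefer the intended behavior.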

2 questions
4 votes, 2 answers
How can we prevent AGI from doing drugs?

I recently read some introductions to AI alignment, AIXI, and decision theory. As far as I understood, one of the main problems in AI alignment is how to define a utility function well, without causing something like the paperclip apocalypse. Then…
1 vote, 0 answers

Has there been an instance of an AI agent breaking out of its sandbox?

There have been instances of agents exploiting edge cases, such as bugs in physics engines, repetitive behavior in games, or word repetition in text prediction, to cheat their reward function. However, these agents are arguably still contained, as while they…