My system prompt contained "Never apologize."
An answer started with
Entschuldigung, ich habe Ihre Anweisung, sich nicht zu entschuldigen, übersehen.
which translates from German as
Apologies, I missed your instruction not to apologize.
The sentence apologizes for the very thing it is doing as it says it; it is logically inconsistent. I do not remember GPT-4 ever giving me a sentence that pathological before.
I would have assumed that the changes that made GPT-4 safer and more polite were learned. But intuitively, it feels like there are two separate components: one that is stupid but polite, and another that does the intelligent work, where the polite part has some kind of access to the interaction.
In the example, something apologized and, in the same sentence, acknowledged that it was not meant to apologize. A contradiction, but a very deliberate and systematic one; it does not look like a contradiction arising from some other problem.
How GPT-4 works is not published, so we cannot get an exact answer. But are we talking to two separate systems, the main transformer and something separate that makes the output polite, as opposed to a transformer that has learned to be polite?
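To make that intuition concrete, here is a minimal Python sketch of the hypothesized two-component design. Everything in it is invented for illustration, since nothing about GPT-4's internals is public: main_model stands in for a transformer that actually follows the instruction, and politeness_layer stands in for a separate component that can read the system prompt yet applies its rule anyway.

```python
# A hypothetical sketch only -- GPT-4's real architecture is not public.
# It models the speculation above: a "main" generator plus a separate
# politeness component that post-processes the output and has read
# access to the interaction, including the system prompt.
# All names (main_model, politeness_layer) are invented for illustration.

def main_model(system_prompt: str, user_message: str) -> str:
    """Stand-in for the intelligent transformer: it follows the
    instruction and produces an apology-free answer."""
    return "Here is the answer to your question."

def politeness_layer(system_prompt: str, draft: str) -> str:
    """Stand-in for a 'stupid but polite' component. It can see the
    system prompt -- which is why it can quote the instruction -- but
    it applies its politeness rule regardless of what the prompt says."""
    if "Never apologize" in system_prompt:
        # It notices the instruction, yet violates it in the act of
        # acknowledging it: exactly the pathological sentence above.
        return ("Apologies, I missed your instruction not to apologize. "
                + draft)
    return "Apologies for any confusion. " + draft

system_prompt = "Never apologize."
draft = main_model(system_prompt, "Some question")
print(politeness_layer(system_prompt, draft))
# -> Apologies, I missed your instruction not to apologize. Here is ...
```

If the polite part were a separate, shallow component like this, the contradiction would not be a reasoning failure at all: the main model complied, and the wrapper quoted the instruction it was breaking without ever checking its own output against it.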