zlacker

It can happen more or less no matter what language the model uses, so long as it's reinforcement trained. It's just that in English we have the illusion that we understand the meaning.

An example of this is toki pona, a minimalist constructed human language that is designed to express only "positive thinking". Yet it is extremely easy to insult people in toki pona, e.g. "sina toki li pona pona pona pona" (you are speaking very very very very well).

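To be free of a potential subtext side channel, there can be essentially no equivalent outputs: any freedom to choose between outputs that mean the same thing on the surface can be used to carry something else.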
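A rough sketch of why equivalent outputs matter, assuming a toy setup where each "slot" in a message has a few interchangeable phrasings. The phrase lists and the function names (`capacity_bits`, `encode`, `decode`) are made up for illustration and not taken from any real system. Choosing among k equivalent options can carry up to log2(k) bits that a surface reading would never notice.

```python
import math

# Hypothetical slots: within each list, a reader would treat every
# option as meaning the same thing.
EQUIVALENTS = [
    ["very well", "really well", "quite well", "extremely well"],  # 4 options
    ["you are speaking", "you speak"],                              # 2 options
]

def capacity_bits(slots):
    """Upper bound on hidden bits carried by the choice among equivalents."""
    return sum(math.log2(len(options)) for options in slots)

def encode(value, slots):
    """Pick one option per slot so that the picks spell out the integer `value`."""
    chosen = []
    for options in slots:
        value, index = divmod(value, len(options))
        chosen.append(options[index])
    return chosen

def decode(chosen, slots):
    """Recover the integer from which option was picked in each slot."""
    value = 0
    for options, pick in zip(reversed(slots), reversed(chosen)):
        value = value * len(options) + options.index(pick)
    return value
```

With these two slots the channel holds 3 bits, so any integer from 0 to 7 survives a round trip through `encode` and `decode`. If every slot had exactly one option, `capacity_bits` would be 0, which is the "no equivalent outputs" condition.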

replies(1): >>pona-a+Oc

>>nullc+(OP)
Can't you just say "sina toki ike suli a." (you are speaking very badly <exclamation>)? Just because it doesn't have official swearwords like most natural languages doesn't mean you can only express "positive thinking".

replies(1): >>nullc+Pq

>>pona-a+Oc
My mistake. In the future I'll refrain from using toki pona to make a rhetorical point. :)