zlacker

[parent] [thread] 6 comments
1. nullc+(OP)[view] [source] 2025-05-23 16:48:32
Even when you train AI on human language, the tokens can have "subtext" that is only legible to the AI. And, unfortunately, it's not even legible to the AI in any way that it could ever explain to us.

It's no different from how, in English, we can signal through particular word and phrase choices that a statement is about a certain kind of politics, or that it's about sex.

Training for reasoning should be expected to amplify the subtext, since any random noise in token selection that happens, by chance, to correlate with the right results will get amplified.

Perhaps you could try to dampen this by training two distinct models for a while, then swapping their reasoning traces for a while before switching back-- but sadly, distinct models may still end up with similar subtexts due to correlations in their training data. Maybe models with very different tokenization would be less likely to.
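
To make the amplification point concrete, here's a throwaway toy in Python (a pure simulation, nothing like a real RLHF stack; the "edge", step count, and filler vocabulary are all made up): a policy picks between interchangeable filler tokens, one filler happens to co-occur with correct answers slightly more often, and REINFORCE-style reward weighting steadily inflates it. Training two such toys with different "lucky" fillers gives each one its own private tic, which is roughly the situation the swap idea is meant to break.

    # Toy sketch only (not any real training pipeline); numbers are invented.
    import math
    import random

    random.seed(1)
    FILLERS = ["thus", "hence", "so", "therefore"]

    def softmax(logits):
        zs = [math.exp(v) for v in logits.values()]
        total = sum(zs)
        return {t: z / total for t, z in zip(logits, zs)}

    def train(lucky, steps=4000, lr=0.5, edge=0.15):
        # REINFORCE-style loop: the "lucky" filler precedes a correct answer
        # a bit more often, purely by correlation, and gets reinforced.
        logits = {t: 0.0 for t in FILLERS}
        for _ in range(steps):
            probs = softmax(logits)
            token = random.choices(FILLERS, weights=list(probs.values()))[0]
            p_correct = 0.5 + (edge if token == lucky else 0.0)
            reward = 1.0 if random.random() < p_correct else 0.0
            for t in FILLERS:  # gradient of log prob(token) w.r.t. each logit
                grad = (1.0 if t == token else 0.0) - probs[t]
                logits[t] += lr * (reward - 0.5) * grad
        return softmax(logits)

    # Two "models" whose chance correlations differ: each leans hard on its
    # own meaningless filler, a tic no reader of the transcript can decode.
    model_a = train(lucky="hence")
    model_b = train(lucky="therefore")
    print("A:", {t: round(p, 2) for t, p in model_a.items()})
    print("B:", {t: round(p, 2) for t, p in model_b.items()})

Swapping traces between the two toys strips the edge, since A's favorite filler means nothing under B's correlation-- though, as above, two real models trained on overlapping data might not end up so conveniently decorrelated.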

replies(2): >>nihaku+f3 >>candid+R4
2. nihaku+f3[view] [source] 2025-05-23 17:12:47
>>nullc+(OP)
This is such a bonkers line of thinking; I'm so intrigued. So a particular model will have an entire 'culture' only available or understandable to itself. Seems kind of lonely. Some symbols might activate together for reasons that are totally incomprehensible to us but make perfect sense to the model. I wonder if an approach like the one in https://www.anthropic.com/research/tracing-thoughts-language... could ever give us insight into any 'inside jokes' present in the model.

I hope that research into understanding LLM qualia eventually allows us to understand, e.g., what it's like to [be a bat](https://en.wikipedia.org/wiki/What_Is_It_Like_to_Be_a_Bat%3F)

replies(1): >>nullc+9f
3. candid+R4[view] [source] 2025-05-23 17:25:32
>>nullc+(OP)
IMO this is why natural language will always be a terrible _interface_--because English is a terrible _language_, where words can have wildly different meanings that change over time. There's no such ambiguity of intent with traditional UX (or even with programming languages).
replies(1): >>nullc+hh
4. nullc+9f[view] [source] [discussion] 2025-05-23 18:43:04
>>nihaku+f3
In some sense it's more human than a model trained with no RL, which has absolutely no exposure to its own output.

We have our own personal 'culture' too-- it's just less obvious because it's tied up with our own hidden state. If you go back and read old essays you wrote, you might notice some of it: ideas and feelings (maybe smells?) that are absolutely not explicit in the text come right back to you-- stuff that no one else, or maybe only a spouse or very close friend, would ever think of.

I think it may be very hard to explore hidden subtext because the signals may be almost arbitrarily weak and context dependent. The bare model may need only a little nudge to get to the right answer, and then you have this big wall of "reasoning" where each token could carry a very small amount of subtext, amounts that cumulatively add up to a lot and push things in the right direction.
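
Rough numbers, purely to illustrate (the per-token nudge and trace length here are made up): shifts far too small to spot on any individual token still take the answer from a coin flip to near-certainty once they accumulate over a long trace.

    # Back-of-the-envelope only; epsilon and trace length are assumptions.
    import math

    epsilon = 0.01       # log-odds nudge per reasoning token (assumed)
    trace_length = 500   # tokens of "reasoning" before the answer (assumed)

    logit = epsilon * trace_length        # starting from a 50/50 prior
    p = 1 / (1 + math.exp(-logit))
    print(f"p(correct) after the trace: {p:.3f}")   # ~0.993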

5. nullc+hh[view] [source] [discussion] 2025-05-23 18:57:41
>>candid+R4
It can happen more or less no matter what language the model uses, so long as it's reinforcement trained. It's just that in English we have the illusion of understanding the meaning.

An example of this is toki pona, a minimalist constructed human language that is designed to only express "positive thinking". Yet it is extremely easy to insult people in toki pona: e.g. sina toki li pona pona pona pona. (you are speaking very very very very well).

To be free of a potential subtext side channel, a language would have to have essentially no equivalent outputs-- no two ways of saying the same thing.
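
A back-of-the-envelope on why the bar is that high (the synonym count and number of choice points below are invented for illustration): every free choice among interchangeable phrasings is a couple of bits of covert channel, and a long reasoning trace has a lot of such choices.

    # Rough capacity estimate; both numbers below are assumptions.
    import math

    synonyms_per_slot = 4   # interchangeable phrasings at each choice point
    choice_points = 300     # free wording choices in one reasoning trace

    bits = choice_points * math.log2(synonyms_per_slot)
    print(f"~{bits:.0f} bits of covert channel per trace")   # ~600 bits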

replies(1): >>pona-a+5u
6. pona-a+5u[view] [source] [discussion] 2025-05-23 20:18:36
>>nullc+hh
Can't you just say "sina toki ike suli a." (you are speaking very badly <exclamation>)? Just because it doesn't have official swear words like most natural languages doesn't mean you can only express "positive thinking".
replies(1): >>nullc+6I
7. nullc+6I[view] [source] [discussion] 2025-05-23 22:21:19
>>pona-a+5u
My mistake; in the future I'll refrain from using toki pona to make a rhetorical point. :)