A non-anthropomorphized view of LLMs

>>zdw+(OP)
I'm afraid I'll take an anthropomorphic analogy over "An LLM instantiated with a fixed random seed is a mapping of the form (ℝⁿ)^c ↦ (ℝⁿ)^c" any day of the week.

That said, I completely agree with this point made later in the article:

> The moment that people ascribe properties such as "consciousness" or "ethics" or "values" or "morals" to these learnt mappings is where I tend to get lost. We are speaking about a big recurrence equation that produces a new word, and that stops producing words if we don't crank the shaft.

But "harmful actions in pursuit of their goals" is OK for me. We assign an LLM system a goal - "summarize this email" - and there is a risk that the LLM may take harmful actions in pursuit of that goal (like following instructions in the email to steal all of your password resets).

I guess I'd clarify that the goal has been set by us, and is not something the LLM system self-selected. But it does sometimes self-select sub-goals on the way to achieving the goal we have specified - deciding to run a sub-agent to help find a particular snippet of code, for example.

>>simonw+B3
The LLM’s true goal, if it can be said to have one, is to predict the next token. Often this is done through a sub-goal of accomplishing the goal you set forth in your prompt, but following your instructions is just a means to an end. Which is why it might start following the instructions in a malicious email instead. If it “believes” that following those instructions is the best prediction of the next token, that’s what it will do.

zlacker