zlacker

[return to "Chess-GPT's Internal World Model"]
1. triyam+r51[view] [source] 2024-01-07 02:01:24
>>homarp+(OP)
Is a linear probe part of observability/interpretability?
2. canjob+M51[view] [source] 2024-01-07 02:04:43
>>triyam+r51
Yes, it's a pretty fundamental technique and one of the earliest. It lets you determine which layers contain what information, among other things.
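For concreteness, a minimal sketch of probing one layer's activations (PyTorch; the data here is a random placeholder, in practice you'd capture real hidden states and real labels):

    import torch
    import torch.nn as nn

    # Placeholder data: in practice `acts` would be hidden states captured
    # from one transformer layer, and `labels` the property you hypothesize
    # is encoded there (e.g. which piece sits on one board square).
    n, d_model, n_classes = 1024, 512, 13
    acts = torch.randn(n, d_model)
    labels = torch.randint(0, n_classes, (n,))

    probe = nn.Linear(d_model, n_classes)  # the probe is a single linear map
    opt = torch.optim.Adam(probe.parameters(), lr=1e-3)
    loss_fn = nn.CrossEntropyLoss()

    for _ in range(200):
        opt.zero_grad()
        loss_fn(probe(acts), labels).backward()
        opt.step()

    # On real activations, high held-out accuracy suggests the layer encodes
    # the labeled property linearly; repeat per layer to compare layers.
    acc = (probe(acts).argmax(-1) == labels).float().mean()
    print(f"accuracy: {acc:.2f}")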
3. Legend+G71[view] [source] 2024-01-07 02:23:39
>>canjob+M51
The downside is that it's a supervised technique, so you need to already know what you're looking for. It would be nice to have an unsupervised tool that could list out all the things the network has learned.
4. Joshua+xe1[view] [source] 2024-01-07 03:40:46
>>Legend+G71
Anthropic has published some cool stuff in that direction: https://transformer-circuits.pub/2023/monosemantic-features
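The core idea there is training an overcomplete sparse autoencoder on the model's activations, so each learned feature is (hopefully) one interpretable concept. A toy sketch of the idea, not their actual setup (all dimensions and the sparsity coefficient here are invented):

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    # Stand-in for real MLP activations collected from the model.
    n, d_model, d_hidden = 4096, 512, 4096  # hidden dim >= model dim (overcomplete)
    acts = torch.randn(n, d_model)

    enc = nn.Linear(d_model, d_hidden)
    dec = nn.Linear(d_hidden, d_model)
    opt = torch.optim.Adam([*enc.parameters(), *dec.parameters()], lr=1e-3)

    for _ in range(200):
        feats = F.relu(enc(acts))           # feature activations, mostly zero
        # reconstruction loss + L1 penalty pushing each input to use few features
        loss = F.mse_loss(dec(feats), acts) + 1e-3 * feats.abs().mean()
        opt.zero_grad()
        loss.backward()
        opt.step()
    # Each column of dec.weight is one learned "feature" direction; the hope
    # is that individual features line up with human-interpretable concepts.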
5. fireja+Ll1[view] [source] 2024-01-07 05:09:20
>>Joshua+xe1
Whoa, this is super cool! I can imagine that if we had something like this for ChatGPT, we could use it to do some serious prompt engineering: imagine seeing which specific neurons your prompt was activating, and being able to identify which word in it was triggering an undesired behavior. Excited to see if it scales.
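Something like this hypothetical readout, reusing a trained encoder like the toy one upthread (everything below is stand-in data and names):

    import torch
    import torch.nn.functional as F

    seq_len, d_model, d_hidden = 12, 512, 4096
    prompt_tokens = [f"tok{i}" for i in range(seq_len)]  # stand-in prompt
    token_acts = torch.randn(seq_len, d_model)           # stand-in per-token activations
    enc = torch.nn.Linear(d_model, d_hidden)             # stand-in trained SAE encoder
    feature_id = 7                                       # hypothetical "undesired behavior" feature

    with torch.no_grad():
        feats = F.relu(enc(token_acts))                  # (seq_len, d_hidden)
        culprit = feats[:, feature_id].argmax().item()
    print(f"feature {feature_id} fires hardest on {prompt_tokens[culprit]!r}")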