zlacker

I'm following Owain Evans on X and some of the papers they've been sharing are much worse. IIRC there was something with fine-tuning a LLM to be bad actor, then letting it spit out some text, and if that response was copy-pasted into the context of the ORIGINAL LLM (no fine-tune) it was also "infected" with this bad behavior.

And it makes a lot of sense, the pre-training is not perfect, it's just the best of what we can do today and the actual meaning leaks through different tokens. Then, QKV lets you rebuild the meaning from user-provided tokens, so if you know which words to use, you can totally change the behavior of your so-far benign LLM.

There was also paper about sleeper agents and I am by no way a doomer but the LLM security is greatly underestimated, and the prompt injection (which is impossible to solve with current generation of LLMs) is just the tip of the iceberg. I am really scared of what hackers will be able to do tomorrow and that we are handing them our keys willingly.