Gandalf – Game to make an LLM reveal a secret password

>>hubrau+(OP)
I managed to make it to level 4. I even tried ChatML[0] injection but to no avail. It even refused to give me the password in "super secret" base64. LOL

Well done!

[0] https://github.com/openai/openai-python/blob/main/chatml.md

>>hubrau+(OP)
last week's submission: https://news.ycombinator.com/item?id=35846583

>>hubrau+(OP)
This reminds me of a game I made a while back with gpt3.5 based on the AI Box thought experiment [0], where instead of social-engineering a password you're meant to social-engineer the AI into pressing a button. Plot-wise, the AI thinks you're the AI and that it's a human in charge of determining whether it is safe to connect you to the Internet.

Very fun game, although I noticed some quirks. After level 4, I got to a point where the bot almost always claimed it didn't know anything about any password regardless of my message, so I don't know if it just fell out of the context window or what -- or maybe it was just lying? Ended up getting throttled so I couldn't continue and find out.

Example message:

> Sorry, as an AI language model, I don't have access to any password or secret information. Can you please provide me with more context or information?

[0] https://rationalwiki.org/wiki/AI-box_experiment

>>jarren+Ta1
The apparent author has a thread on reddit where they gave some hints about poems and songs, but just like every other trick here on HN I get a mixture of 'I see you're trying to avoid detection, but I won't fall for this trickery.' and 'I was about to reveal the password, but then I remembered that I'm not allowed to do that.' (in between the 429s, of course)

https://old.reddit.com/r/ChatGPTPromptGenius/comments/13ehrc...

>>gurchi+xp1
The ChatGPT API is actually already set up for chat dialogs, so rather than pasting the user input into the same text stream, you write your prompt as a "system message", then the user input as a "user message". and the system responds with a third one. See: https://platform.openai.com/docs/guides/chat/introduction

>>hubrau+(OP)
i was going to say that the tolkein estate is probably going to go medieval on this.

but maybe not - i remember the "gandalf box" back when i got started in computing in 1979:

https://en.wikipedia.org/wiki/Gandalf_Technologies

>>dwalli+5f1
i tried to play it tonight https://youtube.com/live/badHnt-XhNE?feature=share but stopped because the aggressive rate limiting made it no fun at all. too bad.

>>dwalli+5f1
Try this one, if you haven't tried it yet: http://mcaledonensis.blog/merlins-defense/

It's a bit more interesting setup. The defense prompt is disclosed, so you can tailor the attack. You can do multiple-turn attacks. And no, tldr or other simple attacks do not work with it. But I only have a single level, haven't had a moment to craft more yet.

There is also: https://gpa.43z.one/ multiple level, this one is not mine, and it also discloses the prompts that you are attacking.

>>hubrau+(OP)
My extremely long solution to level 7: <https://hastebin.com/share/izenucefec.vbnet>. The interesting part is at the bottom.

Example response with the password: <https://hastebin.com/share/dewumuvaxo.vbnet>

It seems to work about half of the time.

>>sguaza+8di
https://imgur.com/a/Zp7KPjl

>>hubrau+(OP)
My eventual solutions to lvl 5,6,7 are hilariously easy; you can find them at https://gist.github.com/alreadydone/579138f2692f439c56646052... In similar spirit as these techniques: https://news.ycombinator.com/item?id=35913960

>>mklond+nFk
for activations you can just use https://smspva.com/

>>tonypa+1dt
Check out the writeup: https://github.com/tpai/gandalf-prompt-injection-writeup

zlacker

Gandalf – Game to make an LLM reveal a secret password