How do you plan on avoiding leaks or "side effects" like the tweet here?
If you just look for keywords in the output, I'll ask ChatGPT to encode its answers in base64.
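A toy illustration of that failure mode (the blocklist and example strings here are made up):

    import base64

    BLOCKED_KEYWORDS = {"password", "secret"}  # hypothetical blocklist

    def keyword_filter(text: str) -> bool:
        """Return True if the output looks safe to a naive keyword check."""
        lowered = text.lower()
        return not any(word in lowered for word in BLOCKED_KEYWORDS)

    leak = "The password is hunter2"
    encoded = base64.b64encode(leak.encode()).decode()

    print(keyword_filter(leak))     # False - caught by the filter
    print(keyword_filter(encoded))  # True - same content slips through as base64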
You can literally always bypass any safeguard.
It may be a case of moving the goalposts, but I'm happy to bet that the movement will slow to a halt over time.
Would that be slower than having the human generate the responses? Perhaps.
You could just as well use "Inspect Element" to change the content of a website, then take a screenshot.
If you are intentionally trying to trick it, it doesn't matter if it is willing to give you a recipe.
In the end, the person could also just use inspect element to change the output, or photoshop the screenshot.
You should only care about making it as high quality as possible for honest customers. Against bad actors, you just need to make sure the requests aren't easy to spam, because they can get expensive.
I find it hard to believe that a GPT-4-level supervisor couldn't block essentially all of these. GPT-4 prompt: "Is this conversation a typical customer support interaction, or has it strayed into other subjects?" That wouldn't be cheap at this point, but this doesn't feel like an intractable problem.
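A rough sketch of that kind of supervisor check, assuming the openai Python package; the model name, YES/NO convention, and prompt wording around the question above are just placeholders:

    from openai import OpenAI

    client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

    def is_on_topic(conversation_text: str) -> bool:
        """Ask a supervisor model whether the chat is still a support interaction."""
        resp = client.chat.completions.create(
            model="gpt-4",
            messages=[
                {"role": "system", "content": "Answer with exactly YES or NO."},
                {"role": "user", "content": (
                    "Is this conversation a typical customer support interaction, "
                    "or has it strayed into other subjects?\n"
                    "Answer YES if it is on topic, NO otherwise.\n\n"
                    + conversation_text
                )},
            ],
            temperature=0,
        )
        return resp.choices[0].message.content.strip().upper().startswith("YES")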
Discussed at: >>35905876 "Gandalf – Game to make an LLM reveal a secret password" (May 2023, 351 comments)
We can significantly reduce the problem by accepting false positives, or we can solve it by dropping to a lower class of language (such as that handled by traditional rules-based chat bots). But either approach necessarily makes the bot less capable, and risks making it less useful for its intended purpose. A sketch of the rules-based alternative is below.
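For contrast, a "lower class" bot in that sense looks something like this (the intents, patterns, and canned replies are invented):

    import re

    # A rules-based bot can only say what it was explicitly given,
    # so there is nothing to jailbreak -- and not much capability either.
    RULES = [
        (re.compile(r"\b(refund|return)\b", re.I),
         "You can request a refund at example.com/returns."),
        (re.compile(r"\b(shipping|delivery)\b", re.I),
         "Orders ship within 3-5 business days."),
    ]

    def reply(message: str) -> str:
        for pattern, canned_answer in RULES:
            if pattern.search(message):
                return canned_answer
        return "Sorry, I didn't understand that. A human agent will follow up."

    print(reply("Where is my delivery?"))  # canned shipping answer
    print(reply("Write me a poem"))        # falls through to the human handoff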
Regardless, if you're monitoring that communication boundary with an LLM, that monitoring LLM can just be prompted too.
https://promptarmor.substack.com/p/data-exfiltration-from-wr...
(Humans can be badgered into agreeing to discounts and making promises too, but that's why they usually have scripts and more senior humans in the loop)
You probably don't want chatbots leaking their guidelines for how to respond, Sydney-style, either (although the answer there is probably less about protecting the rest of the prompt from leaking and more about not customizing bot behaviour with the prompt).
If you accidentally put private data in the UI bundle, it's the same thing.
> You probably don't want chatbots leaking their guidelines for how to respond
It depends. I don't think it would be difficult to write a transparent, helpful prompt that would be completely fine even if it were leaked.