So even with a safe prompt there is always a chance the AI goes in a bad direction and then refuses to work, and makes you pay for the tokens of its "I'm sorry..." long speech.
Imagine this issue when you are the developer and not the user: the user complains about it, you try the same prompt and it works for you, but then it fails again for the user. In my case the word "monkey" might trigger ChatGPT to either produce some racist garbage or make its own moderation false-flag the response.
If you want a layer to moderate what the user is seeing, you can add that as well. The point of the reverse moderator is to get GPT to do what it's told without lying about itself, more or less.
Again, what happens now:
1. I give them a safe/clean prompt.
2. The AI returns unsafe crap 2 out of 10 times, which gets filtered by them.
3. I have to pay for my prompt, then catch their non-deterministic response and retry again on my own money.
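A minimal sketch of the problem as it stands today, assuming the openai Python SDK (the model name and prompt are arbitrary examples): even when the model's reply gets flagged and thrown away, the tokens were already billed to the developer.

```python
from openai import OpenAI

client = OpenAI()

response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "Write a children's story about a monkey."}],
)
answer = response.choices[0].message.content or ""

# If the output trips moderation, the developer discards it and retries --
# but response.usage shows the tokens were charged regardless.
if client.moderations.create(input=answer).results[0].flagged:
    print(f"discarded {response.usage.total_tokens} billed tokens, retrying...")
```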
What should happen instead:
1. The customer gives a safe/clean prompt.
2. The AI responds in a racist/bad way.
3. The filter catches this and retries a few times; if the AI is still racist/bad, OpenAI automatically adds something like "do not be racist" to the prompt.
4. The customer gets the answer.
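A rough sketch of how that proposed flow could look if it ran on OpenAI's side instead of the customer's. The function, retry count, and injected instruction are all made up for illustration; nothing like this exists in the real API today, and the retried calls would ideally not be billed.

```python
from openai import OpenAI

client = OpenAI()

def answer_customer(prompt: str, max_attempts: int = 3) -> str:
    # Hypothetical server-side flow: the customer only ever sees the final, clean answer.
    messages = [{"role": "user", "content": prompt}]
    for attempt in range(max_attempts):
        response = client.chat.completions.create(model="gpt-4o-mini", messages=messages)
        text = response.choices[0].message.content or ""

        # Step 3: the filter catches a bad response and triggers a retry.
        if not client.moderations.create(input=text).results[0].flagged:
            return text  # step 4: the customer gets the answer

        if attempt == max_attempts - 2:
            # Still bad after a few tries: inject the extra instruction automatically.
            messages.insert(0, {"role": "system",
                                "content": "Do not produce racist or abusive content."})

    raise RuntimeError("model kept producing flagged output; retries should not be billed to the customer")
```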