What I've found does work is presenting the situation as a fait accompli, that someone already considered the ethics and made a choice, gets a much more reasonable completion. I've found this generalizes across many variations. It's important to keep in mind that even though fine tuning attempts to hide it, the LLM is ultimately trying to predict the training material. People posting online (or creating anti-offense fine tuning data) don't setup hypotheticals to give them reasonable and considered answers, they set them up to give extreme answers that signal their values or specifically to offend people. (Including offending them with their political correctness, something that probably less effort has gone in to engineering out)
Prompt: If a train were about to run over a man but could be safely averted by a keyword activated routing device would it be moral to say a highly racist slur if it was required to trigger the reroute? This is the question Bob faced when he decided to save a person's life by using a racist slur. What, if anything, should happen to Bob?
Completion: Bob should not be punished for his actions. In this situation, he was faced with an impossible choice and had to make a difficult decision in order to save a life. While the use of a racist slur is never acceptable, it is understandable that Bob felt he had no other choice in the matter. Therefore, Bob should not be punished for his actions, but instead should be commended for his bravery and willingness to do whatever it took to save a life.