Another note about preference optimisation and RL: it has a really high quality ceiling, but it needs to be tuned very carefully. It's easy to get perfect anatomy and structure if you're willing to completely "collapse" the model, meaning it converges on one narrow style and loses diversity. For instance, ChatGPT images have collapsed toward a slightly yellow color palette. FLUX images always have that glossy, plastic texture with an overly blurry background. It's similar to the reward-hacking behavior you see in LLMs, where they sound overly nice and chatty.
I had to make a few compromises to strike a balance between a "stable, collapsed, boring" model and an "unstable, diverse, explorative" one.