Often, the training is done in FP16 then quantized down to FP8 or FP4 for distribution.
Not the real reason. The real reason is that training has moved to FP16/BF16 over the years as NVIDIA made those formats more efficient in its hardware; that's the same reason you're starting to see some models released in 8-bit formats (DeepSeek).
Of course people can always quantize the weights to smaller sizes, but the master version of the weights is usually 16-bit.
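To make that concrete, here's a rough PyTorch sketch (just an illustration, not how any particular quantizer actually works) of squashing a 16-bit master weight tensor down to int8, keeping the scale around so it can be dequantized at load time:

```python
import torch

# Hypothetical 16-bit "master" weights, as they might be saved after training
master = torch.randn(4, 4, dtype=torch.bfloat16)

# Simple symmetric int8 quantization: one scale shared by the whole tensor
scale = master.abs().max().float() / 127.0
q = torch.clamp(torch.round(master.float() / scale), -127, 127).to(torch.int8)

# At load/inference time the int8 weights are expanded back using the scale
dequant = q.float() * scale
print((master.float() - dequant).abs().max())  # small rounding error, much smaller file
```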
I asked ChatGPT for an explanation and it said bfloat16 has a higher range (like FP32) but less precision.
What does that mean for image generation, and why was bfloat16 chosen over FP16?
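For anyone wondering what "higher range, less precision" actually looks like, here's a quick PyTorch sketch (assuming you have torch installed): FP16 runs out of range around 65504, while BF16 covers the same range as FP32 but rounds more coarsely.

```python
import torch

# FP16 (IEEE half): 1 sign, 5 exponent, 10 mantissa bits -> max value ~65504
# BF16 (bfloat16):  1 sign, 8 exponent, 7 mantissa bits  -> same exponent range as FP32

big = torch.tensor(70000.0)
print(big.to(torch.float16))   # inf     -- overflows FP16's range
print(big.to(torch.bfloat16))  # 70144.  -- fits, but rounded to ~3 significant digits

small = torch.tensor(1.001)
print(small.to(torch.float16))   # ~1.0010 -- 10 mantissa bits can resolve this
print(small.to(torch.bfloat16))  # 1.0     -- 7 mantissa bits round the .001 away
```

Roughly speaking, training cares a lot more about activations and gradients not overflowing to inf than about the last few bits of precision, which is the usual argument for BF16 over FP16.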