one more thing, that guide says:
> You can choose UD-Q4_K_XL or other quantized versions.
I see eight different 4-bit quants (I assume that's the size I want?). How do I pick which one to use?
IQ4_XS
Q4_K_S
Q4_1
IQ4_NL
MXFP4_MOE
Q4_0
Q4_K_M
Q4_K_XL

Also, depending on how much regular system RAM you have, you can offload mixture-of-experts models like this, keeping only the most important layers on your GPU. This may let you run larger, more accurate quants. That functionality is supported by llama.cpp and other frameworks, and it's worth looking into how to do it.
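For what it's worth, here's a minimal sketch of the simplest form of this using the llama-cpp-python binding. The model path and layer count are assumptions you'd tune to your own hardware; `n_gpu_layers` controls how many layers go to the GPU while the rest stay in system RAM. (The llama.cpp CLI exposes the same split via `-ngl`, and newer builds also have tensor-override options for pinning MoE expert weights to CPU specifically.)

```python
# Minimal sketch: partial GPU offload with llama-cpp-python.
# The filename and layer count below are hypothetical; adjust to your setup.
from llama_cpp import Llama

llm = Llama(
    model_path="models/MyModel-UD-Q4_K_XL.gguf",  # hypothetical path
    n_gpu_layers=20,  # keep this many layers on the GPU; the rest stay in system RAM
    n_ctx=4096,       # context window size
)

# Quick smoke test to confirm the model loads and generates.
out = llm("Q: What is 2 + 2? A:", max_tokens=16)
print(out["choices"][0]["text"])
```

The trade-off is speed: layers left in system RAM run on the CPU, so generation slows down as you offload more, but it can be worth it if it lets a bigger or higher-quality quant fit.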