zlacker

[parent] [thread] 5 comments
1. genpfa+(OP)[view] [source] 2026-02-03 18:17:38
Nice! Getting ~39 tok/s @ ~60% GPU util. (~170W out of 303W per nvtop).

System info:

    $ ./llama-server --version
    ggml_vulkan: Found 1 Vulkan devices:
    ggml_vulkan: 0 = Radeon RX 7900 XTX (RADV NAVI31) (radv) | uma: 0 | fp16: 1 | bf16: 0 | warp size: 64 | shared memory: 65536 | int dot: 1 | matrix cores: KHR_coopmat
    version: 7897 (3dd95914d)
    built with GNU 11.4.0 for Linux x86_64
llama.cpp command-line:

    $ ./llama-server --host 0.0.0.0 --port 2000 --no-warmup \
    -hf unsloth/Qwen3-Coder-Next-GGUF:UD-Q4_K_XL \
    --jinja --temp 1.0 --top-p 0.95 --min-p 0.01 --top-k 40 --fit on \
    --ctx-size 32768
replies(3): >>halcyo+W6 >>daniel+tj1 >>lnenad+o63
2. halcyo+W6[view] [source] 2026-02-03 18:42:51
>>genpfa+(OP)
What am I missing here? I thought this model needs 46GB of unified memory for the 4-bit quant, and the Radeon RX 7900 XTX only has 24GB of VRAM, right? Hoping to get some insight, thanks in advance!
replies(1): >>coder5+P9
3. coder5+P9[view] [source] [discussion] 2026-02-03 18:52:42
>>halcyo+W6
MoEs can be efficiently split between dense weights (attention/KV/etc) and sparse (MoE) weights. By running the dense weights on the GPU and offloading the sparse weights to slower CPU RAM, you can still get surprisingly decent performance out of a lot of MoEs.

Not as good as running the entire thing on the GPU, of course.
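If you want to control the split yourself rather than rely on `--fit`, llama.cpp has tensor-override flags for this (a hedged sketch; flag names and the exact tensor regex vary by version and model architecture, so check `./llama-server --help` for your build):

    # keep dense weights + KV cache on the GPU, route MoE expert
    # tensors (ffn_*_exps) to CPU RAM via a tensor-name regex override
    $ ./llama-server -hf unsloth/Qwen3-Coder-Next-GGUF:UD-Q4_K_XL \
        --jinja --ctx-size 32768 \
        -ot "ffn_.*_exps=CPU"

Recent builds also have a shorthand like `--n-cpu-moe N` to keep the first N layers' experts on CPU, which is easier to tune than the regex.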

4. daniel+tj1[view] [source] 2026-02-04 00:58:25
>>genpfa+(OP)
Super cool! Also, with `--fit on` you technically don't need `--ctx-size 32768` anymore - llama-server will auto-determine the max context size!
replies(1): >>genpfa+SB1
5. genpfa+SB1[view] [source] [discussion] 2026-02-04 03:18:26
>>daniel+tj1
Nifty, thanks for the heads-up!
6. lnenad+o63[view] [source] 2026-02-04 15:10:30
>>genpfa+(OP)
Thanks to you I decided to give it a go as well (didn't think I'd be able to run it on a 7900 XTX) and I must say it's awesome for a local model - more than capable for straightforward stuff. It uses full VRAM and about 60GB of RAM, and runs at about 10 tok/s, which is *very* usable.