Nano-vLLM: How a vLLM-style inference engine works

>>yz-yu+(OP)
The whole thing feels AI written, generated from the codebase.*

*this is incorrect per the author’s response, my apologies.

For instance, it goes into (nano)vLLM internals and doesn’t mention PagedAttention once (one of the core ideas that vLLM is based on)[1].

Also mentions that Part 2 will cover dense vs MoE’s, which is weird because nanovllm hardcodes a dense Qwen3 into the source.

Here are better (imo) explainers about how vLLM works:

- https://hamzaelshafie.bearblog.dev/paged-attention-from-firs...

- https://www.aleksagordic.com/blog/vllm

- https://huggingface.co/blog/continuous_batching

Aleksa’s blog is a bit in the weeds for my taste but it’s really worth working through.

A lot of the magic of vLLM happens in the PagedAttention kernels, which are really succinctly implanted in nanovllm. And the codebase is great and readable by itself!

—

1. https://arxiv.org/abs/2309.06180

>>yz-yu+(OP)
Since HN only allows one link per submission, dropping Part 2 here.

https://www.neutree.ai/blog/nano-vllm-part-2

>>Juvina+gC
Number one indicator? A single punctuation mark that's trivial to make on most keyboards (option-dash on macOS). And generally people who write software are extra fixated on punctuation for obvious reasons: missing semi-colons break your build, etc. Maybe in some other niche message board people will use dash and em dash interchangeably, but here?

Also, if the a single character is how you're red-flagging LLM output, do you know how easy it is to avoid? I didn't use it here at all, but how do you know I didn't run this through some slop-machine to tighten my prose? It's really low-effort take to say "just avoid em dashes so we know you're not an AI".

https://www.mcsweeneys.net/articles/the-em-dash-responds-to-...

>>yz-yu+(OP)
Shameless plug for my structured LLM outputs handbook which is written in a similar spirit: https://nanonets.com/cookbooks/structured-llm-outputs/

>>lambda+iJ
It looks like you were right about that.

>>46858409

But: this was never a problem and now we have to distinguish between LLM generated, human generated, LLM polished and human generated. I'd much prefer it if people just wrote their own text, warts and all.

zlacker

Nano-vLLM: How a vLLM-style inference engine works