
1. kcorbi+ (OP) 2023-09-12 17:50:44
We're finding that when running Llama-2-7B with vLLM (https://github.com/vllm-project/vllm) on an A40 GPU, we consistently get lower time-to-first-token and lower average per-token generation time than GPT-3.5, even when processing multiple requests in parallel. A40s are pretty easy to get your hands on these days (much easier than A100s, anyway).
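
For context, here's roughly what that setup looks like with vLLM's offline API. The model ID, prompts, and sampling parameters below are illustrative, not our exact configuration:

    # Minimal sketch of serving Llama-2-7B with vLLM's offline API.
    # Model ID and sampling parameters here are illustrative.
    from vllm import LLM, SamplingParams

    # Llama-2-7B in fp16 fits comfortably on a single A40 (48 GB).
    llm = LLM(model="meta-llama/Llama-2-7b-chat-hf")

    prompts = [
        "Summarize the plot of Hamlet in two sentences.",
        "Write a haiku about GPU utilization.",
    ]
    sampling_params = SamplingParams(temperature=0.7, max_tokens=256)

    # vLLM batches these requests internally (continuous batching),
    # which is what keeps per-request latency low under parallel load.
    outputs = llm.generate(prompts, sampling_params)
    for output in outputs:
        print(output.outputs[0].text)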

The 50x cheaper (that's 2% of the cost, not 50% of the cost) number does assume 100% GPU utilization, which may or may not be realistic for your use case. If you're doing batch processing as part of a data pipeline, which is not an unusual use case, you can run your GPU at 100% utilization and turn it off when the batch finishes.
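
To make the utilization assumption concrete, the back-of-the-envelope arithmetic looks like this. Every number below is a placeholder assumption for illustration (not a measurement, and not the basis of the 50x figure); plug in your own rental price, throughput, and API pricing to see what ratio you actually get:

    # Back-of-the-envelope cost-per-token arithmetic. All numbers are
    # placeholder assumptions, not measurements.
    A40_RENTAL_USD_PER_HOUR = 0.80       # assumed on-demand A40 price
    TOKENS_PER_SECOND = 1500             # assumed aggregate throughput
    UTILIZATION = 1.0                    # fraction of the hour doing useful work

    tokens_per_hour = TOKENS_PER_SECOND * 3600 * UTILIZATION
    gpu_usd_per_million_tokens = A40_RENTAL_USD_PER_HOUR / tokens_per_hour * 1_000_000

    GPT35_USD_PER_MILLION_TOKENS = 2.00  # assumed blended API price

    print(f"A40 + vLLM: ${gpu_usd_per_million_tokens:.3f} per 1M tokens")
    print(f"GPT-3.5:    ${GPT35_USD_PER_MILLION_TOKENS:.2f} per 1M tokens")
    print(f"Ratio:      {GPT35_USD_PER_MILLION_TOKENS / gpu_usd_per_million_tokens:.0f}x")
    # The ratio collapses as UTILIZATION drops: at 10% utilization the
    # GPU cost per token is 10x higher.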

If you've got a highly variable workload, then you're right, you'll see much lower utilization numbers. But if you work with an aggregator that can quickly hot-swap LoRA fine-tunes (disclaimer: my company, OpenPipe, works in this space), you can recover a lot of that lost efficiency, since GPU capacity only needs to scale up or down when aggregate usage across customers changes, which smooths things out.
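
For a sense of what hot-swapping looks like at the library level, here's a minimal sketch using Hugging Face PEFT to serve multiple LoRA fine-tunes over one shared base model. The adapter names and paths are hypothetical, and this isn't a description of OpenPipe's actual serving stack:

    # Sketch: swap LoRA adapters over a shared Llama-2-7B base with PEFT.
    # Adapter paths/names are hypothetical and for illustration only.
    from transformers import AutoModelForCausalLM, AutoTokenizer
    from peft import PeftModel

    base = AutoModelForCausalLM.from_pretrained(
        "meta-llama/Llama-2-7b-hf", device_map="auto"
    )
    tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")

    # Load two customers' LoRA adapters on top of the same base weights.
    model = PeftModel.from_pretrained(base, "adapters/customer-a", adapter_name="customer-a")
    model.load_adapter("adapters/customer-b", adapter_name="customer-b")

    def generate_for(adapter_name, prompt):
        # Switching adapters only swaps the small LoRA weights, so one GPU
        # can serve many fine-tunes without reloading the 7B base model.
        model.set_adapter(adapter_name)
        inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
        output_ids = model.generate(**inputs, max_new_tokens=64)
        return tokenizer.decode(output_ids[0], skip_special_tokens=True)

    print(generate_for("customer-a", "Classify this support ticket: ..."))
    print(generate_for("customer-b", "Classify this support ticket: ..."))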
