From very limited testing, it seems only slightly worse than MiniMax M2.1 Q6 (a model about twice its size). I'm impressed.
I tried FP8 in vLLM: it used 110 GB, and then my machine started to swap when I hit it with a query. There was only room for 16k context.
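For reference, roughly how I'm loading it, as a minimal sketch using vLLM's offline Python API; the model path, memory fraction, and sampling settings here are placeholders, not my exact config:

```python
from vllm import LLM, SamplingParams

# Sketch: load the model with FP8 quantization and a 16k context cap.
# "path/to/model" and the 0.90 memory fraction are placeholders.
llm = LLM(
    model="path/to/model",
    quantization="fp8",           # FP8 weights; still ~110 GB in my case
    max_model_len=16384,          # 16k context is all that fits
    gpu_memory_utilization=0.90,
)

params = SamplingParams(temperature=0.7, max_tokens=512)
outputs = llm.generate(["Explain this Rust function: fn main() {}"], params)
print(outputs[0].outputs[0].text)
```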
I suspect there will be some optimizations over the next few weeks that will improve performance on these kinds of machines.
I have it writing some Rust code. It's definitely slower than using a hosted model, but it actually seems pretty competent. These are the first results I've had from a locally hosted model that I could see myself actually using, though only once the speed picks up a bit.
I suspect the API providers will offer this model nice and cheap, too.
I'm asking it to analyze and explain some Rust code in a rather large open source project, and it's working nicely. I agree this is a model I could possibly, maybe use locally...