There's no reason for a coding model to contain all of AO3 and Wikipedia =)
Besides, programming is far more than knowing how to autocomplete syntax; you need a model that's proficient in the fields you're automating, otherwise it'll be no help in actually automating them.
I had not considered that, seems like a great solution for local models that may be more resource-constrained.
If building a SOTA coding model were as simple as just putting coding stuff in there, that's how SOTA coding models would be built.
Importantly, this isn't just throwing more data at the problem in an unstructured way. AFAIK companies are collecting as many git histories as they can and doing something along these lines: have an LLM checkpoint pull requests, features, etc. and convert them into plausible input prompts, then run deep RL with passing the acceptance criteria / tests as the reward signal.
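Roughly, I imagine the verifier side looks something like this (a minimal sketch; the `synthesize_prompt` step and the pytest command are my assumptions, not anything the labs have published):

```python
import subprocess

def reward_from_tests(repo_dir: str, test_cmd: list[str]) -> float:
    """Binary reward signal: 1.0 if the repo's acceptance tests pass, else 0.0."""
    result = subprocess.run(test_cmd, cwd=repo_dir, capture_output=True)
    return 1.0 if result.returncode == 0 else 0.0

def make_rl_episode(pr_title: str, pr_diff: str, synthesize_prompt) -> dict:
    """Turn a merged PR into an RL episode: an LLM reconstructs the user
    request that plausibly led to the change, and the project's own test
    suite becomes the verifier. `synthesize_prompt` is a hypothetical LLM
    call -- swap in whatever model/API you actually have."""
    return {
        "prompt": synthesize_prompt(pr_title, pr_diff),
        # Hypothetical test command; a real pipeline would detect each
        # repo's build/test tooling rather than assume pytest.
        "verifier": lambda repo_dir: reward_from_tests(repo_dir, ["pytest", "-q"]),
    }
```

The nice property is that the reward is verifiable and cheap to compute at scale, which is exactly what RL needs.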
Even for my usual toy coding problems it would get simple things wrong and require some poking to get there.
A few times it got stuck in thinking loops and I had to cancel prompts.
This was using the recommended settings from the unsloth repository. It's always possible that there are some bugs in early implementations that need to be fixed later, but so far I don't see any reason to believe this is actually a Sonnet 4.5 level model.
Of course you get degraded performance with this.
3.7 was not all that great. 4 was decent for specific things, especially self-contained stuff like tests, but couldn't do a good job with more complex work. 4.5 is now excellent at many things.
If it's around the perf of 3.7, that's interesting but not amazing. If it's around 4, that's useful.
I'm more bullish about small and medium-sized models + efficient tool calling than I am about LLMs too large to run at home without $20k of hardware.
The model doesn't need all of that knowledge baked in when it has the toolset to fetch, cache, and read any information available.
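Something like this, conceptually (a toy sketch; the `FETCH` convention and `model_call` are stand-ins for whatever tool-calling protocol and local runtime you actually use):

```python
import functools
import urllib.request

@functools.lru_cache(maxsize=256)
def fetch_doc(url: str) -> str:
    """Fetch and cache a reference page so repeated lookups are free."""
    with urllib.request.urlopen(url) as resp:
        return resp.read().decode("utf-8", errors="replace")

def answer_with_tools(question: str, model_call) -> str:
    """Let the model request documents by URL until it has enough
    context to answer. `model_call` stands in for whatever local model
    API you run (llama.cpp, Ollama, etc.); the FETCH prefix is a
    made-up protocol for the sketch."""
    context: list[str] = []
    while True:
        reply = model_call(question, context)
        if reply.startswith("FETCH "):  # model asks for a document
            context.append(fetch_doc(reply.removeprefix("FETCH ").strip()))
        else:
            return reply  # model has enough context to answer
```

The cache matters for local setups: the second lookup of the same doc is free, which keeps a small model competitive on repeated queries.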
Chinese labs are acting as a disruption to Altman et al.'s attempt to create big-tech monopolies, and that's why some of us cheer for them.
Those are the quant thresholds where people with mid-to-high-end hardware can run this locally at reasonable speed, though.
In my experience Q2 is flaky, but Q4 isn't dramatically worse than full precision.
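For anyone wanting the back-of-envelope math: weight memory is roughly params × bits-per-weight / 8. The bpw figures below are approximate llama.cpp values, and the 10% overhead is a guess that ignores context length:

```python
def approx_vram_gb(params_b: float, bits_per_weight: float, overhead: float = 1.1) -> float:
    """Rough weight-memory estimate in GB for a model with params_b
    billion parameters: params * bpw / 8, plus ~10% runtime overhead
    (a crude assumption; KV cache grows with context length)."""
    return params_b * bits_per_weight / 8 * overhead

# Approximate llama.cpp bits-per-weight for common quants
for name, bpw in [("Q2_K", 2.6), ("Q4_K_M", 4.8), ("Q8_0", 8.5)]:
    print(f"{name}: ~{approx_vram_gb(70, bpw):.0f} GB for a 70B model")
```

Which is why Q2/Q4 is where a 70B-class model starts fitting on one or two consumer GPUs.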
Thousands have been saying this; you aren't paying attention.
Then why did you write this?
> It's always possible that there are some bugs in early implementations that need to be fixed later, but so far I don't see any reason to believe this is actually a Sonnet 4.5 level model.