I'm starting to suspect this is the issue. Neither of these languages is in the top 5, so there is probably less to train on. It'd be interesting to see if this improves over time or if the gap between languages grows even wider as it becomes favorable to use a language simply because LLMs are so much better at it.
There are a lot of interesting discussions to be had here:
- if the efficiency gains are real and LLMs don't improve in lesser-used languages, we should expect companies that chose obscure languages and tech stacks to die out as they become a lot less competitive against stacks that are more compatible with LLMs
- if the efficiency gains are real, this might disincentivize new language adoption and creation unless the folks training models somehow address it
- languages like Python with higher output acceptance rates will probably become even more compatible with LLMs at a faster rate, if we extrapolate that positive reinforcement is more valuable than negative reinforcement for LLMs
I do wonder if the gap will keep widening. If newer tools/tech don't have enough training data, LLMs may struggle much more with them early on, although it's possible that RAG and other optimization techniques will evolve fast enough to narrow the gap and prevent diminishing returns on LLM-driven productivity.
As someone who works primarily within the Laravel stack, in PHP, the LLMs are wildly effective. That's not to say there aren't warts, but my productivity has skyrocketed.
But it's become clear that when you venture into the weeds of things that aren't very mainstream, you're going to get wildly more hallucinations and puzzling solutions.
Another observation: I believe that when you start getting outside your expertise, you're likely to spend a corresponding amount of 'waste' time where the LLM spits out solutions that an expert in the domain would immediately recognize as problematic, but that a non-expert will look at and judge reasonable, or, worse, not look at at all and just try to use.
100% of the time that I've tried to get Claude/Gemini/ChatGPT to one-shot a whole feature or refactor, it's been a waste of time and tokens. But when I've spent even a minimal amount of energy to focus it on the task, curate the context, and then approach it? Tremendously effective most of the time. This also requires me to do enough mental work that I probably have an idea of how it should come out, which primes my ability to parse the proposed solutions/code and pick up the pieces.

Another good flow is to prompt the LLM (in this case Claude Code, or something with MCP/filesystem access) with the feature/refactor/request, asking it to draw up an initial plan of implementation to feed to itself. Iterate on that plan as needed, then work through it one item at a time, keeping a running {TASK_NAME}_WORKBOOK.md (which you task the LLM to keep up to date with the relevant details) and starting a new session/context for each task/item on the plan, using the workbook to get each new session up to speed.
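For what it's worth, here's a rough sketch of the kind of workbook scaffold I mean. The file name, section headings, and helper function are purely illustrative (there's nothing standard here); the point is just having one file the LLM keeps current and that each fresh session reads first:

```python
from pathlib import Path

# Illustrative template; the exact sections don't matter much as long as
# the LLM is told to keep them current between sessions.
WORKBOOK_TEMPLATE = """\
# {task_name} workbook

## Goal
{goal}

## Plan
{plan}

## Decisions and constraints so far
(LLM: update this as items are completed)

## Current status
Not started.
"""


def create_workbook(task_name: str, goal: str, plan_items: list[str]) -> Path:
    """Scaffold a {TASK_NAME}_WORKBOOK.md that each new session is pointed at."""
    plan = "\n".join(f"- [ ] {item}" for item in plan_items)
    path = Path(f"{task_name.upper()}_WORKBOOK.md")
    path.write_text(
        WORKBOOK_TEMPLATE.format(task_name=task_name, goal=goal, plan=plan)
    )
    return path
```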
Also, this is just a hunch, but I'm generally a nocturnal creature and tend to be working from the evening into the early morning. Once 8am PST rolls around, I really feel like Claude (in particular) just turns into mush. Responses get slower, and it seems to lose context where it otherwise wouldn't: getting off topic, having to re-read files it should already have in context. (Note: I'm pretty diligent about refreshing/working with the context, and something happens in the 'work' hours to make it terrible.)
I'd imagine we're going to end up with language-specific LLMs (though I have no idea, it just seems logical to me) that a 'main' model pushes tasks/tool usage to. We don't need our coding LLMs to also be proficient in oceanic tidal patterns and 1800s boxing history. Those are all parameters that could have been better spent on the code.
Python is good because of the sheer volume of training data, but the lack of a strong type system means you can't automate a codegen -> typecheck -> codegen cycle; instead you have to get the LLM to produce tests and run those, which is mostly fine but not as efficient.
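To make that concrete, here's a minimal sketch of the test-driven loop you end up running with Python instead of a pure typecheck loop. `generate_code` is a hypothetical placeholder for whatever model call you're using; the loop just writes the output to a file, shells out to pytest, and feeds any failures back into the next prompt:

```python
import subprocess
import sys
from pathlib import Path


def generate_code(prompt: str) -> str:
    """Hypothetical placeholder for your model call (Claude, GPT, etc.)."""
    raise NotImplementedError


def codegen_test_loop(task: str, target_file: str, test_file: str, max_rounds: int = 3) -> bool:
    """Generate code, run the tests, and feed failures back until they pass."""
    prompt = task
    for _ in range(max_rounds):
        Path(target_file).write_text(generate_code(prompt))
        # Run only the relevant tests and capture the output for the next round.
        result = subprocess.run(
            [sys.executable, "-m", "pytest", test_file, "-q"],
            capture_output=True,
            text=True,
        )
        if result.returncode == 0:
            return True  # tests pass, we're done
        # Append the failure output so the next generation can try to correct it.
        prompt = f"{task}\n\nThe previous attempt failed these tests:\n{result.stdout}"
    return False
```

With a statically typed language, that pytest step can just be a compiler or typechecker run, so the model gets feedback without anyone writing tests first. (mypy gets you part of the way there in Python, but only on code that's actually annotated.)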