zlacker

[return to "Gemini 2.5 Pro Preview"]
1. segpha+J4[view] [source] 2025-05-06 15:34:48
>>meetpa+(OP)
My frustration with using these models for programming in the past has largely been around their tendency to hallucinate APIs that simply don't exist. The Gemini 2.5 models, both Pro and Flash, seem significantly less susceptible to this than any other model I've tried.

There are still significant limitations: no amount of prompting will get current models to approach abstraction and architecture the way a person does. But I'm finding that these Gemini models are finally able to replace web searches and Stack Overflow for a lot of my day-to-day programming.

◧◩
2. jstumm+jH[view] [source] 2025-05-06 19:23:17
>>segpha+J4
> no amount of prompting will get current models to approach abstraction and architecture the way a person does

I find this sentiment increasingly worrisome. It's entirely clear that every last human will be beaten on code design in the coming years (I am not going to argue about whether it's 1 or 5 years away; who cares?).

I wish people would just stop holding on to what amounts to nothing, and instead think and talk more about what can be done in a new world. We need good ideas, and I think this could be a place to advance them.

◧◩◪
3. ssalaz+gg1[view] [source] 2025-05-06 23:55:42
>>jstumm+jH
I code with multiple LLMs every day and build products that use LLM tech under the hood. I don't think we're anywhere near LLMs being good at code design. Existing models make _tons_ of basic mistakes and require supervision even for relatively simple coding tasks in popular languages, and it's worse for languages and frameworks that are less represented in public sources of training data. I am _frequently_ having to tell Claude/ChatGPT to clean up basic architectural and design defects. There's no way I would trust this unsupervised.

Can you point to _any_ evidence that human software development abilities will be eclipsed by LLMs, other than trying to predict which part of the S-curve we're on?

◧◩◪◨
4. xyzzy1+Cw1[view] [source] 2025-05-07 03:10:09
>>ssalaz+gg1
I can't point to any evidence. I also can't think of what direct evidence I could present that would be convincing, short of an actual demonstration. I would like to try to justify my intuition, though:

Seems like the key question is: should we expect AI programming performance to scale well as more compute and specialised training are thrown at it? I don't see why not; it seems like an almost ideal problem domain:

* Short and direct feedback loops

* Relatively easy to "ground" the LLM by running code (a rough sketch of what that could look like is at the end of this comment)

* Self-play / RL should be possible (it seems likely that you could also optimise for aesthetics of solutions based on common human preferences)

* Obvious economic value (based on the multi-billion dollar valuations of vscode forks)

All these things point to programming being "solved" much sooner than say, chemistry.
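
To make the "grounding" point concrete, here is a minimal sketch of what an execution-based reward could look like, in Python. The file names, the pytest invocation and the reward values are illustrative assumptions on my part, not anyone's actual training setup:

    import subprocess
    import tempfile
    from pathlib import Path

    def execution_reward(candidate_source: str, test_source: str) -> float:
        # Assumed reward shaping: -1.0 if the candidate does not even
        # parse, 0.0 if the test suite runs but fails, 1.0 if all pass.
        try:
            compile(candidate_source, "solution.py", "exec")
        except SyntaxError:
            return -1.0

        with tempfile.TemporaryDirectory() as tmp:
            Path(tmp, "solution.py").write_text(candidate_source)
            Path(tmp, "test_solution.py").write_text(test_source)
            try:
                # pytest exits with code 0 only when every test passes.
                result = subprocess.run(
                    ["pytest", "-q", "--tb=no", tmp],
                    capture_output=True, text=True, timeout=30,
                )
            except subprocess.TimeoutExpired:
                # Hangs and infinite loops get the same penalty as
                # code that fails to parse.
                return -1.0
            return 1.0 if result.returncode == 0 else 0.0

The point is that everything in that loop is automatic and cheap to run at scale, which is what makes the self-play / RL bullet plausible.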

◧◩◪◨⬒
5. energy+WH1[view] [source] 2025-05-07 05:43:37
>>xyzzy1+Cw1
This is my view. We've seen this before in other problems where an automatic verifier is readily at hand. The nature of the problem mirrors previously solved ones.

The LLM skeptics need to point out how code differs from Chess, DotA, etc. from an RL perspective. I don't believe they can. Until they can, I'm going to assume that LLMs will soon be better than any living human at writing good code.

◧◩◪◨⬒⬓
6. AnIris+0K1[view] [source] 2025-05-07 06:12:15
>>energy+WH1
> The LLM skeptics need to point out how code differs from Chess, DotA, etc. from an RL perspective.

An obviously correct automatable objective function? Programming can be generally described as converting a human-defined specification (often very, very rough and loose) into a bunch of precise text files.

Sure, you can use proxies like compilation success / failure and unit tests for RL. But key gaps remain. I'm unaware of any objective function that can grade "do these tests match the intent behind this user request".

Contrast with the automatically verifiable "is a player in checkmate on this board?"
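
To make the contrast concrete, here is roughly how the two objective functions compare. The chess check is a single call to the real python-chess library; the intent check is a deliberately unimplementable stub, since as far as I know no such grader exists:

    import chess

    def chess_reward(fen: str) -> float:
        # Exactly verifiable in one call: is the side to move checkmated?
        return 1.0 if chess.Board(fen).is_checkmate() else 0.0

    def intent_reward(user_request: str, tests: str) -> float:
        # No known objective function grades "do these tests match the
        # intent behind this user request" -- that is the gap.
        raise NotImplementedError

    # Fool's mate: white to move and checkmated, so the reward is 1.0.
    print(chess_reward("rnb1kbnr/pppp1ppp/8/4p3/6Pq/5P2/PPPPP2P/RNBQKBNR w KQkq - 1 3"))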

◧◩◪◨⬒⬓⬔
7. Hautho+qh2[view] [source] 2025-05-07 12:27:52
>>AnIris+0K1
This is in fact not how a chess engine works. It has an evaluation function that assigns a numerical value (score) based on a number of factors (material advantage, king "safety", pawn structure, etc.).

These heuristics are certainly "good enough" that Stockfish is able to beat the strongest humans, but it's rarely possible for a chess engine to determine if a position results in mate.

I guess the question is whether we can write an objective function good enough to encapsulate all the relevant attributes of "good code".
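
For what it's worth, a material-only toy version of such an evaluation function looks something like the sketch below (using python-chess and textbook piece values; real engines add king safety, pawn structure, mobility and much more, with far more carefully tuned weights):

    import chess

    # Classic textbook piece values; a real engine's terms and weights
    # are far more elaborate than this.
    PIECE_VALUES = {
        chess.PAWN: 1.0,
        chess.KNIGHT: 3.0,
        chess.BISHOP: 3.0,
        chess.ROOK: 5.0,
        chess.QUEEN: 9.0,
        chess.KING: 0.0,  # both sides always have exactly one king
    }

    def evaluate(board: chess.Board) -> float:
        # Positive scores favour white, negative favour black.
        score = 0.0
        for piece in board.piece_map().values():
            value = PIECE_VALUES[piece.piece_type]
            score += value if piece.color == chess.WHITE else -value
        return score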

◧◩◪◨⬒⬓⬔⧯
8. svnt+jX2[view] [source] 2025-05-07 15:59:19
>>Hautho+qh2
Maybe I am misunderstanding what you are saying, but e.g. Stockfish, given time and threads, seems very good at finding forced checkmates within 20 or more moves.