But I wonder when we'll be happy? Do we expect colleagues, friends, and family to be 100% laser-accurate 100% of the time? I'd wager we don't. Should we expect that from an artificial intelligence too?
Usually I’m using a minimum of 200k tokens to start with Gemini 2.5.
- (1e(1e10) + 1) - 1e(1e10)
- sqrt(sqrt(2)) * sqrt(sqrt(2)) * sqrt(sqrt(2)) * sqrt(sqrt(2))
So when I punched in 1/3 it was exactly 1/3.
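If it helps make the comparison concrete, here's a rough Python sketch of what those expressions are probing. The specifics are my own stand-ins, not whatever the calculator actually uses: sympy plays the role of the exact-arithmetic engine, and I've shrunk 1e(1e10) to smaller numbers, since 10^(10^10) overflows a double outright.

```python
import math
from fractions import Fraction
import sympy  # third-party symbolic package, standing in for an exact engine

# Plain IEEE-754 doubles: the small term gets swallowed by rounding.
print((1e16 + 1) - 1e16)              # 0.0, not 1.0
print(math.sqrt(math.sqrt(2)) ** 4)   # very close to 2, but typically not exactly 2

# Exact arithmetic: the same kinds of expressions come out exact.
print(Fraction(1, 3))                                              # exactly 1/3
print(sympy.sqrt(sympy.sqrt(2)) ** 4)                              # 2, exactly
print((sympy.Integer(10) ** 100 + 1) - sympy.Integer(10) ** 100)   # 1, exactly
```

The float versions lose the small term to rounding (or to overflow, for numbers the size of 1e(1e10)); the exact versions keep it, which is why 1/3 can come back as exactly 1/3.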
Mainly I meant to push back against the reflexive comparison to a friend or family member or colleague. AI is a multi-purpose tool that is used for many different kinds of tasks. Some of these tasks are analogous to human tasks, where we should anticipate human error. Others are not, and yet we often ask an LLM to do them anyway.
You could say that when I use my spanner/wrench to tighten a nut it works 100% of the time, but as soon as I try to use a screwdriver it's terrible and full of problems: it can't even reliably do something as trivially easy as tightening a nut, even though a screwdriver works the same way, using torque to tighten a fastener.
Well that's because one tool is designed for one thing, and one is designed for another.
And it is also not just about the %. It is also about the type of error. Will we reach a point where we change our perception and say these are expected non-human errors?
Or could we have a specific LLM that only checks for these types of error?
And tools in the game, even more so (there's no excuse for something that's engineered).
"AI"s are designed to be reliable; "AGI"s are designed to be intelligent; "LLM"s seem to be designed to make some qualities emerge.
> one tool is designed for one thing, and one is designed for another
The design of LLMs seems to be "let us see where the promise leads us". That is not really "design", i.e. "from need to solution".
Then why are we using them to write code, which should produce reliable outputs for a given input... much like a calculator?
Obviously we want the code to produce correct results for whatever input we give, and as it stands now, I can't trust LLM output without reviewing it first. It's still a helpful tool, but ultimately my desire would be for them to be as accurate as a calculator, so they can be trusted enough to not need the review step.
Using an LLM and being OK with untrustworthy results would be like clicking the terminal icon on my dock and sometimes it opens the terminal, sometimes it opens a browser, and sometimes it just silently fails, because there's no reproducible output for any given input to an LLM. To me that's a problem: output should be reproducible, especially if it's writing code.
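To make that reproducibility point concrete, here's a toy sketch in plain Python. It uses no real LLM API; random.choice just stands in for sampling-based decoding, so the names and behaviour here are illustrative assumptions, not any actual system.

```python
import random

def calculator(expr: str) -> int:
    """A pure function: the same input always maps to the same output."""
    a, b = expr.split("+")
    return int(a) + int(b)

def sampled_completion(prompt: str) -> str:
    """Stand-in for sampled decoding: for a given prompt, only the
    distribution over outputs is fixed, not the output itself."""
    return random.choice(["opens terminal", "opens browser", "fails silently"])

print(calculator("2+2") == calculator("2+2"))    # always True
print(sampled_completion("open my terminal") ==
      sampled_completion("open my terminal"))    # not guaranteed
```

Fixing a seed (or using greedy decoding) can make the second function repeatable, but that is something you have to bolt on, whereas the calculator's determinism is inherent to what it is.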
Your interaction with LLMs is categorically closer to interactions with people than with a calculator. Your inputs to them are language.
Of course the two are different. A calculator is a computer; an LLM is not. Comparing the two is making the same category error that would confuse Mr. Babbage, but in reverse.
(“On two occasions, I have been asked [by members of Parliament], 'Pray, Mr. Babbage, if you put into the machine wrong figures, will the right answers come out?' I am not able to rightly apprehend the kind of confusion of ideas that could provoke such a question.”)
That generally seems right to me, given how much we hold in our heads when we're discussing something with a coworker.
If we have an intern or junior dev on our team, do we expect them to be 100% totally correct all the time? Why do we have a culture of peer code review at all if we assume that everyone who commits code is 100% foolproof and correct 100% of the time?
Truth is, we don't trust all the humans who write code to be perfect. As the old-as-the-hills saying goes, "we all make mistakes". So replace "LLM" in your comment above with "junior dev" and everything you said still applies, whether it is LLMs or inexperienced colleagues. With code there is very rarely a single "correct" answer to how to implement something (unlike the calculator tautology you suggest) anyway, so an LLM or an intern (or even an experienced colleague) absolutely nailing their PRs with zero review comments seems unusual to me.
So we go back to the original (and, I admit, quite philosophical) point: when will we be happy? We take on juniors because they do the low-level and boring work, and we need to keep an eye on their output until they learn and grow and improve... but we cannot do the same for an LLM?
What we have today was literally science fiction not so long ago (e.g. the movie "Her" from 2013 is now pretty much a reality). Step back for a moment: the fact that we are even having the "yeah, it writes code but it needs to be checked" discussion is mind-blowing; it's remarkable that it writes mostly-correct code at all. Give things another couple of years and it's going to be even better.