It makes me wonder whether everyone else is kidding themselves, or if I'm just holding it wrong.
I've been wondering this ever since about a week after the initial ChatGPT release.
My cynical take is that most people don't do real work (i.e. work that is objectively evaluated against reality), so they are not able to see the difference between a gimmick and the real thing. Most people are in the business of making impressions, and LLMs are pretty good at that.
It's good that we have markets that will eventually sort it out.
Ask it something where the Google SERP is full of trash and you might have a more sane result from the LLM.
It is also excellent for writing one-off code experiments and plots, saving some time from having to write them from scratch.
I’m sorry but you are just using it wrong.
I also use them for various coding tasks and they, together with agent frameworks, regularly do refactoring or small feature implementations in 1-2 minutes that would've taken me 10-20 minutes. They've probably increased my developer productivity by 2-3x overall, and by a lot more when I'm working with technology stacks that I'm not so familiar with or haven't worked with for a while. And I've been an engineer for almost 30 years.
So yea, I think you're just using them wrong.
The code it generates is also... questionable, and I'm a pretty middling dev.
* Super-powered thesaurus
A traditional thesaurus can only take a word and provide alternative words; with an LLM, you can take a whole phrase or sentence and say: "give me more ways to express the same idea".
I have done this occasionally when writing, and the results were great. No, I do not blindly cut-and-paste LLM output, and would never do so. But when I am struggling to phrase something just right, often the LLM will come up with a sentence which is close, and which I can tweak to get it exactly the way I want.
* Explaining a step in a mathematical proof.
When reading mathematical research papers or textbooks, I often find myself stuck at some point in a proof, not able to see how one step follows from the previous ones. Asking an LLM to explain can be a great way to get unstuck.
When doing so, you absolutely cannot take whatever the LLM says as 'gospel'. They can and will get confused and say illogical things. But if you call the LLM out on its nonsense, it can often correct itself and come up with a better explanation. Even if it doesn't get all the way to the right answer, as long as it gets close enough to give me the flash of inspiration I needed, that's enough for me.
* Super-powered programming language reference manual
I have written computer software in more than 20 programming languages, and can't remember all the standard library functions in each language, what the order of parameters is, and so on.
There are definitely times when going to a manpage or reference manual is better. But there are also times when asking an LLM is better.
But then again, it's not as if markets always rewarded real work either.
I'm an AI sceptic (and generally disregard most AI announcements). I don't think it's going to replace SWE at all.
I've been chucking the same questions at both Gemini and GPT, and I'd say that until about 8 months ago they were both as bad as each other and basically useless.
However, recently Gemini has gotten noticeably better and has never hallucinated.
I don't let it write any code for me. Instead I treat Gemini as an engineer with 10+ YoE in {{subject}}.
Working as a platform engineer, my subjects are broad, so it's very useful to have a rubber duck ready to go on almost any topic.
I don't use copilot or any other AI. So I can't compare it to those.
Same for philosophy questions, "explain this piece of news through the lens of X philosopher's Y concept".
Some people seem to use them as a database of common programming patterns, but that's something I already have: hundreds of scaffolds in many programming languages that I've made myself, and hundreds of FOSS and non-FOSS git repos I've collected out of interest or necessity. Often I also just go look at some public remote repo if I'm reading up on a topic in preparation for an implementation or experiment, mainly because when I ask an LLM the code usually has defects and inconsistencies, whereas something that is already in production somewhere works and sits in a context I can learn from as well.
But hey, I rarely even use IDE autocomplete for browsing library methods and the like, in part because I've either read the relevant library code or picked a library with good documentation, since that tells you a lot more about intended use patterns and pitfalls.
“Computer” used to be a job, and human error rates were on the order of 1-2% no matter the level of training or experience. Work had to be done in triplicate and cross-checked if it mattered.
Digital computers are down to error rates of roughly 10^-15 to 10^-22 and are hence treated as nearly infallible. We regularly write code routines where a trillion steps have to be executed flawlessly in sequence for things not to explode!
AIs can now output maybe 1K to 2K tokens in a sequence before they make a mistake. That's a per-token accuracy of 99.9% to 99.95%! Better than a human already.
Don’t believe me?
Write me a 500 line program with pen and paper (not pencil!) and have it work the first time!
I’ve seen Gemini Pro 2.5 do this in a useful way.
As the error rates drop, the length of usefully correct sequences will get to 10K, then 100K, and maybe… who knows?
There was just a press release today about Gemini Diffusion that can alter already-generated tokens to correct mistakes.
Error rates will drop.
Useful output length will go up.
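To put rough numbers on that, here is a back-of-the-envelope sketch in Python (the per-step accuracies are illustrative assumptions, not measured benchmarks) of how per-step accuracy turns into error-free run length:

    # Back-of-the-envelope: how per-step accuracy turns into error-free run length.
    # The accuracy figures below are illustrative assumptions, not measurements.
    def p_flawless(per_step_accuracy: float, steps: int) -> float:
        """Probability that `steps` independent steps all succeed."""
        return per_step_accuracy ** steps

    for label, acc in [("human-ish (~2% error)", 0.98),
                       ("LLM-ish (99.9%)", 0.999),
                       ("LLM-ish (99.95%)", 0.9995)]:
        expected_run = 1 / (1 - acc)  # expected steps before the first error
        print(f"{label}: P(1000 flawless steps) = {p_flawless(acc, 1000):.2g}, "
              f"expected error-free run ~ {expected_run:.0f} steps")

At 98% per-step accuracy a 1000-step sequence essentially never comes out flawless; at 99.95% it does more than half the time.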
Its quality seems to vary wildly between various subjects, but annoyingly it presents itself with uniform confidence.
It's also pretty useful for brainstorming: talking to AI helps you refine your thoughts. It probably won't give you any innovative ideas, only a survey of mainstream ones, but it's a pretty good start for thinking about a problem.
I was up extremely late last night writing a project-status email. I could tell my paragraphs were not tight. I told Cursor: rewrite this 15% smaller. I didn't use the output verbatim, but it gave me several perfect rewrite ideas and the result was a crisp email.
After interviewing someone, I have it summarize my sloppy notes into full sentences. I double-check it for completeness and correctness, of course. But it saves me an hour of sweating the language.
I used it to get a better explanation of a polynomial problem with my child.
I use it to generate Google Spreadsheet formulas that I would never want to spend time figuring out on my own ("give me a formula that extracts the leading number from each cell, and treats blank cells as zero").
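The rule being asked for is simple to state; here is a minimal Python sketch of the same behaviour (not the actual spreadsheet formula, and the sample values are made up):

    import re

    # Sketch of the rule described above: extract the leading number from each
    # cell and treat blank cells as zero. Sample values are made up.
    def leading_number(cell: str) -> float:
        text = cell.strip()
        if not text:                      # blank cell -> 0
            return 0.0
        match = re.match(r"\d+(\.\d+)?", text)
        return float(match.group()) if match else 0.0

    print([leading_number(c) for c in ["42 apples", "3.5 kg", "", "n/a"]])
    # -> [42.0, 3.5, 0.0, 0.0]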
Part of the magic is finding a new use case that shaves another hour here and there.
It's much more helpful on popular topics where summarization itself is already high quality and sufficient.
It is the former.
When LLMs blew up a few years ago I was pretty excited about the novelty of the software, and that excitement was driven by what they might do rather than what they did do.
Now, years and many iterations later, the most vocal proponents of this stuff still pitch what it might do with a volume loud enough to drown out almost any discussion of what it does. What little discussion there is of what it does for individuals usually boils down to some variation of "it gives me answers to the questions for which I do not care about the answers", and (setting aside how ridiculous, wasteful, and contrary to the basic ideas of knowledge and reasoning that statement is) even that is usually given with a wink and a nod to suggest that maybe one day it will give answers to questions that matter.
No, no, you have to realise, most pianists don't make real music!
The issue seems to be more in the intelligence department. You can't really leave them in an agent-like loop with compiler/shell output and expect them to meaningfully progress on their tasks past some small number of steps.
Improving their initial error-free token length is solving the wrong problem. I would take less initial accuracy than a human in exchange for being equally capable of iterating on a solution over time.
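For concreteness, the kind of loop I mean looks roughly like the sketch below; ask_model is a hypothetical stand-in for whatever model client you use, not a real API, and the whole thing is a simplification of what agent frameworks actually do:

    import subprocess

    # Rough sketch of an agent-like loop: generate code, run it, feed the
    # interpreter's error output back, repeat. `ask_model` is hypothetical.
    def ask_model(prompt: str) -> str:
        raise NotImplementedError("plug in your model client here")

    def agent_loop(task: str, max_steps: int = 10) -> str | None:
        feedback = ""
        for _ in range(max_steps):
            code = ask_model(f"Task: {task}\nPrevious errors:\n{feedback}\nWrite the full program.")
            result = subprocess.run(["python", "-c", code],
                                    capture_output=True, text=True)
            if result.returncode == 0:
                return code                   # ran cleanly; call it done
            feedback = result.stderr[-2000:]  # otherwise iterate on the error output
        return None                           # stalled, which is the failure mode described above

The plumbing is trivial; the problem is that past some small number of iterations the model stops making real progress.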
Programmers who "iterate" buggy shit for 10 rounds until they get it right are a post-Google push-update phenomenon.
Google Cloud. (2024). "Broadcast Transformation with Google Cloud." https://cloud.google.com/solutions/media-entertainment/broad...
Microsoft Azure. (2024). "Azure for Media and Entertainment." https://azure.microsoft.com/en-us/solutions/media-entertainm...
IBC365. (2023). "The Future of Broadcast Engineering: Skills and Training." https://www.ibc.org/tech-advances/the-future-of-broadcast-en...
Broadcast Bridge. (2023). "Cloud Skills for Broadcast Engineers." https://www.thebroadcastbridge.com/content/entry/18744/cloud...
SVG Europe. (2023). "OTT and Cloud: The New Normal for Broadcast." https://www.svgeurope.org/blog/headlines/ott-and-cloud-the-n...
None of these exist, neither at the provided URLs nor anywhere else.
But what if you _don't_ have that kind of problem? Yes, LLMs can be useful to solve the above. But for many problems, you ask for a solution and what you get is a suggested solution that takes a long time to verify. Meaning: unless you are somewhat sure it will solve the problem, you don't want to bother. You need some estimate of confidence, and LLMs are useless for this. As a developer I find my problems are very rarely in the first category and much more often in the second.
Yes it's "using them wrong". It's doing what they struggle with. But it's also what I struggle with. It's hard to stop yourself when you have a difficult problem and you are weighing googling it for an hour or chatgpt-ing it for an hour. But I often regret going the ChatGPT route after several hours.
So maybe another LLM would have fared better, but still, so far it's mostly been wasted time. It works quite well for summarising texts and creating filler images, but overall I still find them not reliable enough outside of these two limited use cases.
If you aren't already, I suggest remembering, every 3-5 prompts, to throw in: "no waffling", "no flattery", "no obsequious garbage", etc. You can make it as salty as you like. If the AI says "Have fun!" or "Let's get coding!", you know you need to get the whip out haha.
Also, "3 sentences max on ...", "1 sentence explaining ...", "1 paragraph max on ...".
Another improvement for me was, you want to do procedure x in situation y, so you go "I'm in situation y, I'm considering procedure x, but I know I've missed something. Tell me what I could have missed". Or "list specific scenarios in which procedure x will lead to catastrophe".
Accepting the tool as a fundamentally dumb synthesiser and summariser is the first step to it getting a lot more useful, I think.
All that said, I use it pretty rarely. The revolution in learning we need is with John Holt and similar thinkers from that period, and is waiting to happen, and won't be provided by the next big tech thing, I fear.
But I've noticed recently it's started slipping back into more "That's a great idea" and "You've got this" cheerleading, so I'm going to have to tell it to knock that off again. It will definitely lean into confirmation bias if that's what you're looking for and you don't explicitly tell it not to worry about how you'll feel about the answer.
I find it useful for bouncing ideas off of, while keeping in mind that I'm really bouncing them off myself and sort of a hive mind made up of what's been said in certain mainstream sectors of the Internet. I'm less creative than average, so I get more ideas that way than I'd get from just journaling, so that's worth something.
> Suppose I'm standing on Earth and suddenly gravity stopped affecting me. What would be my trajectory? Specifically what would be my distance from Earth over time?
https://chatgpt.com/c/682edff8-c540-8010-acaa-8d9b5c26733d
It gives the "small distance approximation" in the examples, even if I ask for the solution after two hours, which at 879 km is already quite far off the correct ~820 km.
An approximation that is better on the order of seconds to hours is pretty simple:
s(t) = sqrt(R^2 + (Vt)^2) - R
And it's even plotted in the chart, but again - numbers are off.[0] Their results were giving wildly incorrect numbers at less than 100 seconds already, which was what originally prompted me to respond - they didn't even match the formula.
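A quick sanity check of that formula (assuming an equatorial rotation speed of about 465 m/s and R = 6378 km, and ignoring the Earth's orbital motion and solar gravity):

    import math

    R = 6378e3    # m, Earth's equatorial radius (assumed)
    V = 465.0     # m/s, surface speed from Earth's rotation at the equator (assumed)
    t = 2 * 3600  # s, two hours

    exact = math.sqrt(R**2 + (V * t)**2) - R  # straight-line flyoff over a spherical Earth
    approx = (V * t)**2 / (2 * R)             # small-distance (parabolic) approximation

    print(f"after 2 h: {exact / 1e3:.0f} km vs {approx / 1e3:.0f} km (small-distance approx)")
    # -> roughly 825 km vs 879 km; the exact figure shifts a few km with the assumed V,
    #    but the small-distance approximation is already off by tens of km at two hours.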
I keep reading all of this glazing, like in the rest of the thread, and it's really frustrating: you get this fatigue from all the BS coming out of them that makes you not want to use them at all. The more you try to get it to fix the output, the more it uses unrelated tokens.
In just the last 24 hours I've seen multiple models:
- Put C++ code structures in Python
- Synthesize non-existent functions, libraries, and features of programming languages
- Invent whole features of video file formats and associated ffmpeg flags that aren't applicable to imagery
I also think you're not going to get any good answers to this question, and a lot of pro-AI people are going to be left unsatisfied, because when you get into this spot every single thing that it does is wrong in some new way that cannot be easily categorized.
It is literally the limit of representing information digitally.