It makes me wonder whether everyone else is kidding themselves, or if I'm just holding it wrong.
I've been wondering this ever since about a week after the initial ChatGPT release.
My cynical take is that most people don't do real work (i.e. work that is objectively evaluated against reality), so they are not able to see the difference between a gimmick and the real thing. Most people are in the business of making impressions, and LLMs are pretty good at that.
It's good that we have markets that will eventually sort it out.
Ask it something where the Google SERP is full of trash and you might have a more sane result from the LLM.
It is also excellent for writing one-off code experiments and plots, saving some time from having to write them from scratch.
I’m sorry but you are just using it wrong.
I also use them for various coding tasks and they, together with agent frameworks, regularly do refactoring or small feature implementations in 1-2 minutes that would've taken me 10-20 minutes. They've probably increased my developer productivity by 2-3x overall, and by a lot more when I'm working with technology stacks that I'm not so familiar with or haven't worked with for a while. And I've been an engineer for almost 30 years.
So yea, I think you're just using them wrong.
The code it generates is also... questionable, and I'm a pretty middling dev.
* Super-powered thesaurus
A traditional thesaurus can only take a word and provide alternative words; with an LLM, you can take a whole phrase or sentence and say: "give me more ways to express the same idea".
I have done this occasionally when writing, and the results were great. No, I do not blindly cut-and-paste LLM output, and would never do so. But when I am struggling to phrase something just right, often the LLM will come up with a sentence which is close, and which I can tweak to get it exactly the way I want.
* Explaining a step in a mathematical proof.
When reading mathematical research papers or textbooks, I often find myself stuck at some point in a proof, not able to see how one step follows from the previous ones. Asking an LLM to explain can be a great way to get unstuck.
When doing so, you absolutely cannot take whatever the LLM says as 'gospel'. They can and will get confused and say illogical things. But if you call the LLM out on its nonsense, it can often correct itself and come up with a better explanation. Even if it doesn't get all the way to the right answer, as long as it gets close enough to give me the flash of inspiration I needed, that's enough for me.
* Super-powered programming language reference manual
I have written computer software in more than 20 programming languages, and can't remember all the standard library functions in each language, what the order of parameters is, and so on.
There are definitely times when going to a manpage or reference manual is better. But there are also times when asking an LLM is better.
But then again, it's not as if markets always rewarded real work either.
I'm an AI sceptic (and generally disregard most AI announcements). I don't think it's going to replace SWE at all.
I've been chucking the same questions at both Gemini and GPT, and I'd say that until about 8 months ago they were both as bad as each other and basically useless.
However, recently Gemini has gotten noticeably better and has never hallucinated.
I don't let it write any code for me. Instead I treat Gemini as an engineer with 10+ YoE in {{subject}}.
Working as a platform engineer, my subjects are broad, so it's very useful to have a rubber duck ready to go on almost any topic.
I don't use copilot or any other AI. So I can't compare it to those.
Same for philosophy questions, "explain this piece of news through the lens of X philosopher's Y concept".
Some people seem to use them as a database of common programming patterns, but that's something I already have: hundreds of scaffolds in many programming languages that I've made myself, and hundreds of FOSS and non-FOSS git repos I've collected out of interest or necessity. Often I also just go look at some public remote repo if I'm reading up on a topic in preparation for an implementation or experiment, mainly because when I ask an LLM the code usually has defects and inconsistencies, whereas something that is already in production somewhere works and sits in a context I can learn from as well.
But hey, I rarely even use IDE autocomplete for browsing library methods and the like, in part because I've either read the relevant library code or picked a library with good documentation, since that tells you a lot more about intended use patterns and pitfalls.
“Computer” used to be a job, and human error rates were on the order of 1-2% no matter the level of training or experience. Work had to be done in triplicate and cross-checked if it mattered.
Digital computers are down to error rates of roughly 10^-15 to 10^-22 and are hence treated as nearly infallible. We regularly write code routines where a trillion steps have to be executed flawlessly in sequence for things not to explode!
AIs can now output maybe 1K to 2K tokens in a sequence before they make a mistake. That's a per-token accuracy of 99.9% to 99.95%! Better than a human already.
Don’t believe me?
Write me a 500 line program with pen and paper (not pencil!) and have it work the first time!
I’ve seen Gemini Pro 2.5 do this in a useful way.
As the error rates drop, the length of usefully correct sequences will get to 10K, then 100K, and maybe… who knows?
There was just a press release today about Gemini Diffusion that can alter already-generated tokens to correct mistakes.
Error rates will drop.
Useful output length will go up.
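To put rough numbers on that, here is a back-of-the-envelope sketch in Python (the per-step accuracies are illustrative assumptions, not measured benchmarks) of how per-step accuracy turns into error-free run length:

    # Back-of-the-envelope: how per-step accuracy turns into error-free run length.
    # The accuracy figures below are illustrative assumptions, not measurements.
    def p_flawless(per_step_accuracy: float, steps: int) -> float:
        """Probability that `steps` independent steps all succeed."""
        return per_step_accuracy ** steps

    for label, acc in [("human-ish (~2% error)", 0.98),
                       ("LLM-ish (99.9%)", 0.999),
                       ("LLM-ish (99.95%)", 0.9995)]:
        expected_run = 1 / (1 - acc)  # expected steps before the first error
        print(f"{label}: P(1000 flawless steps) = {p_flawless(acc, 1000):.2g}, "
              f"expected error-free run ~ {expected_run:.0f} steps")

At 98% per-step accuracy a 1000-step sequence essentially never comes out flawless; at 99.95% it does more than half the time.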
Its quality seems to vary wildly between various subjects, but annoyingly it presents itself with uniform confidence.
It's also pretty useful for brainstorming: talking to AI helps you refine your thoughts. It probably won't give you any innovative ideas, only a survey of mainstream ones, but it's a pretty good start for thinking about a problem.
I was up extremely late last night writing a project-status email. I could tell my paragraphs were not tight. I told Cursor: rewrite this 15% smaller. I didn't use the output verbatim, but it gave me several perfect rewrite ideas and the result was a crisp email.
After interviewing someone, I have it summarize my sloppy notes into full sentences. I double-check it for completeness and correctness, of course. But it saves me an hour of sweating the language.
I used it to get a better explanation of a polynomial problem with my child.
I use it to generate Google Spreadsheet formulas that I would never want to spend time figuring out on my own ("give me a formula that extracts the leading number from each cell, and treats blank cells as zero").
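The rule being asked for is simple to state; here is a minimal Python sketch of the same behaviour (not the actual spreadsheet formula, and the sample values are made up):

    import re

    # Sketch of the rule described above: extract the leading number from each
    # cell and treat blank cells as zero. Sample values are made up.
    def leading_number(cell: str) -> float:
        text = cell.strip()
        if not text:                      # blank cell -> 0
            return 0.0
        match = re.match(r"\d+(\.\d+)?", text)
        return float(match.group()) if match else 0.0

    print([leading_number(c) for c in ["42 apples", "3.5 kg", "", "n/a"]])
    # -> [42.0, 3.5, 0.0, 0.0]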
Part of the magic is finding a new use case that shaves another hour here and there.
It's much more helpful on popular topics where summarization itself is already high quality and sufficient.
It is the former.
When LLMs blew up a few years ago I was pretty excited about the novelty of the software, and that excitement was driven by what they might do rather than what they did do.
Now, years and many iterations later, the most vocal proponents of this stuff still pitch what it might do with a volume loud enough to drown out almost any discussion of what it does. What little discussion there is of what it does for individuals usually boils down to some variation of "it gives me answers to the questions for which I do not care about the answers", and (setting aside how ridiculous, wasteful, and contrary to the basic ideas of knowledge and reasoning that statement is) even that is usually given with a wink and a nod to suggest that maybe one day it will give answers to questions that matter.
No, no, you have to realise, most pianists don't make real music!
The issue seems to be more in the intelligence department. You can't really leave them in an agent-like loop with compiler/shell output and expect them to meaningfully progress on their tasks past some small number of steps.
Improving their initial error-free token length is solving the wrong problem. I would take less initial accuracy than a human in exchange for being equally capable of iterating on a solution over time.
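For concreteness, the kind of loop I mean looks roughly like the sketch below; ask_model is a hypothetical stand-in for whatever model client you use, not a real API, and the whole thing is a simplification of what agent frameworks actually do:

    import subprocess

    # Rough sketch of an agent-like loop: generate code, run it, feed the
    # interpreter's error output back, repeat. `ask_model` is hypothetical.
    def ask_model(prompt: str) -> str:
        raise NotImplementedError("plug in your model client here")

    def agent_loop(task: str, max_steps: int = 10) -> str | None:
        feedback = ""
        for _ in range(max_steps):
            code = ask_model(f"Task: {task}\nPrevious errors:\n{feedback}\nWrite the full program.")
            result = subprocess.run(["python", "-c", code],
                                    capture_output=True, text=True)
            if result.returncode == 0:
                return code                   # ran cleanly; call it done
            feedback = result.stderr[-2000:]  # otherwise iterate on the error output
        return None                           # stalled, which is the failure mode described above

The plumbing is trivial; the problem is that past some small number of iterations the model stops making real progress.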
Programmers who "iterate" buggy shit for 10 rounds until they get it right are a post-Google push-update phenomenon.
Google Cloud. (2024). "Broadcast Transformation with Google Cloud." https://cloud.google.com/solutions/media-entertainment/broad...
Microsoft Azure. (2024). "Azure for Media and Entertainment." https://azure.microsoft.com/en-us/solutions/media-entertainm...
IBC365. (2023). "The Future of Broadcast Engineering: Skills and Training." https://www.ibc.org/tech-advances/the-future-of-broadcast-en...
Broadcast Bridge. (2023). "Cloud Skills for Broadcast Engineers." https://www.thebroadcastbridge.com/content/entry/18744/cloud...
SVG Europe. (2023). "OTT and Cloud: The New Normal for Broadcast." https://www.svgeurope.org/blog/headlines/ott-and-cloud-the-n...
None of these exist, neither at the provided URLs nor anywhere else.
But what if you _don't_ have that kind of problem? Yes, LLMs can be useful to solve the above. But for many problems, you ask for a solution and what you get is a suggested solution that takes a long time to verify. Meaning: unless you are somewhat sure it will solve the problem, you don't want to bother. You need some estimate of confidence, and LLMs are useless for this. As a developer I find my problems are very rarely in the first category and much more often in the second.
Yes it's "using them wrong". It's doing what they struggle with. But it's also what I struggle with. It's hard to stop yourself when you have a difficult problem and you are weighing googling it for an hour or chatgpt-ing it for an hour. But I often regret going the ChatGPT route after several hours.
So maybe another LLM would have fared better, but still, so far it's mostly been wasted time. It works quite well for summarising texts and creating filler images, but overall I still find them not reliable enough outside of these two limited use cases.
If you aren't already, I suggest remembering, every 3-5 prompts, to throw in: "no waffling", "no flattery", "no obsequious garbage", etc. You can make it as salty as you like. If the AI says "Have fun!" or "Let's get coding!", you know you need to get the whip out haha.
Also, "3 sentences max on ...", "1 sentence explaining ...", "1 paragraph max on ...".
Another improvement for me was, you want to do procedure x in situation y, so you go "I'm in situation y, I'm considering procedure x, but I know I've missed something. Tell me what I could have missed". Or "list specific scenarios in which procedure x will lead to catastrophe".
Accepting the tool as a fundamentally dumb synthesiser and summariser is the first step to it getting a lot more useful, I think.
All that said, I use it pretty rarely. The revolution in learning we need is with John Holt and similar thinkers from that period, and is waiting to happen, and won't be provided by the next big tech thing, I fear.
But I've noticed recently it's started slipping back into more "That's a great idea" and "You've got this" cheerleading, so I'm going to have to tell it to knock that off again. It will definitely lean into confirmation bias if that's what you're looking for and you don't explicitly tell it not to worry about how you'll feel about the answer.
I find it useful for bouncing ideas off of, while keeping in mind that I'm really bouncing them off myself and sort of a hive mind made up of what's been said in certain mainstream sectors of the Internet. I'm less creative than average, so I get more ideas that way than I'd get from just journaling, so that's worth something.
> Suppose I'm standing on Earth and suddenly gravity stopped affecting me. What would be my trajectory? Specifically what would be my distance from Earth over time?
https://chatgpt.com/c/682edff8-c540-8010-acaa-8d9b5c26733d
It gives the "small distance approximation" in the examples, even if I ask for the solution after two hours, which at 879 km is already quite far off the correct ~820 km.
An approximation that is better on the order of seconds to hours is pretty simple:
s(t) = sqrt(R^2 + (Vt)^2) - R
And it's even plotted in the chart, but again - numbers are off.[0] Their results were giving wildly incorrect numbers at less than 100 seconds already, which was what originally prompted me to respond - they didn't even match the formula.
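A quick sanity check of that formula (assuming an equatorial rotation speed of about 465 m/s and R = 6378 km, and ignoring the Earth's orbital motion and solar gravity):

    import math

    R = 6378e3    # m, Earth's equatorial radius (assumed)
    V = 465.0     # m/s, surface speed from Earth's rotation at the equator (assumed)
    t = 2 * 3600  # s, two hours

    exact = math.sqrt(R**2 + (V * t)**2) - R  # straight-line flyoff over a spherical Earth
    approx = (V * t)**2 / (2 * R)             # small-distance (parabolic) approximation

    print(f"after 2 h: {exact / 1e3:.0f} km vs {approx / 1e3:.0f} km (small-distance approx)")
    # -> roughly 825 km vs 879 km; the exact figure shifts a few km with the assumed V,
    #    but the small-distance approximation is already off by tens of km at two hours.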
I keep reading all of this glazing, like in the rest of the thread, and it's really frustrating: you get this fatigue from all the BS coming out of them that makes you not want to use them at all. The more you try to get it to fix the output, the more it uses unrelated tokens.
In just the last 24 hours I've seen multiple models:
- Put C++ code structures in Python
- Synthesize non-existent functions, libraries, and features of programming languages
- Invent whole features of video file formats and associated ffmpeg flags that aren't applicable to imagery
I also think you're not going to get any good answers to this question, and a lot of pro-AI people are going to be left unsatisfied, because when you get into this spot every single thing that it does is wrong in some new way that cannot be easily categorized.
It is literally the limit of representing information digitally.