zlacker

There is a certain amount of irony that people try really hard to say that hallucinations are not a big problem anymore and then a company that would benefit from that narrative gets directly hurt by it.

Which of course they are going to try to brush it all away. Better than admitting that this problem very much still exists and isn’t going away anytime soon.

replies(6): >>Modern+I2 >>crypto+Xk >>anonzz+xy >>lyngui+oZ >>_jonas+PM5 >>learni+14m

>>nerdjo+(OP)
It's a huge problem. I just can't get past it and I get burned by it every time I try one of these products. Cursor in particular was one of the worst; the very first time I allowed it to look at my codebase, it hallucinated a missing brace (my code parsed fine), "helpfully" inserted it, and then proceeded to break everything. How am I supposed to trust and work with such a tool? To me, it seems like the equivalent of lobbing a live hand grenade into your codebase.

Don't get me wrong, I use AI every day, but it's mostly as a localized code complete or to help me debug tricky issues. Meaning I've written and understand the code myself, and the AI is there to augment my abilities. AI works great if it's used as a deductive tool.

Where it runs into issues is when it's used inductively, to create things that aren't there. When it does this, I feel the hallucinations can be off the charts -- inventing APIs, function names, entire libraries, and even entire programming languages on occasion. The AI is more than happy to deliver any kind of information you want, no matter how wrong it is.

AI is not a tool, it's a tiny Kafkaesque bureaucracy inside of your codebase. Does it work today? Yes! Why does it work? Who can say! Will it work tomorrow? Fingers crossed!

replies(5): >>mediam+44 >>Mounta+25 >>yodsan+ia >>theone+ek >>skissa+co

>>Modern+I2
I'd add that the deductive abilities translate to well-defined spec. I've found it does well when I know what APIs I want it to use, and what general algorithmic approaches I want (which are still sometimes brainstormed separately with an AI, but not within the codebase). I provide it a numbered outline of the desired requirements and approach to take, and it usually does a good job.

It does poorly without heavy instruction, though, especially with anything more than toy projects.

Still a valuable tool, but far from the dreamy autonomous geniuses that they often get described as.

>>Modern+I2
Versioning in source control for even personal projects just got far more important.

replies(2): >>Adrian+Gb >>o11c+io

>>Modern+I2
You're not supposed to trust the tool, you're supposed to review and rework the code before submitting for external review.

I use AI for rather complex tasks. It's impressive. It can make a bunch of non-trivial changes to several files, and have the code compile without warnings. But I need to iterate a few times so that the code looks like what I want.

That being said, I also lose time pretty regularly. There's a learning curve, and the tool would be much more useful if it was faster. It takes a few minutes to make changes, and there may be several iterations.

replies(7): >>ryandr+Le >>schmic+gi >>gtirlo+hl >>bigstr+qL >>mrheos+lO >>e3bc54+LA1 >>jorvi+Wo6

>>Mounta+25
It's wild how people write without version control... Maybe I'm missing something.

replies(1): >>chrisw+Ul

>>yodsan+ia
> You're not supposed to trust the tool, you're supposed to review and rework the code before submitting for external review.

It sounds like the guys in this article should not have trusted AI to go fully open loop on their customer support system. That should be well understood by all "customers" of AI. You can't trust it to do anything correctly without human feedback/review and human quality control.

>>yodsan+ia
> You're not supposed to trust the tool

This is just an incredible statement. I can't think of another development tool we'd say this about. I'm not saying you're wrong, or that it's wrong to have tools we can't just, just... wow... what a sea change.

replies(5): >>theone+Sk >>ryandr+oo >>shipp0+Wu >>tevon+Nv >>Modern+By

>>Modern+I2
> it hallucinated a missing brace (my code parsed fine), "helpfully" inserted it, and then proceeded to break everything.

Your tone is rather hyperbolic here, making it sound like an extra brace resulted in a disaster. It didn't. It was easy to detect and easy to fix. Not a big deal.

replies(2): >>Modern+kA >>nothra+NT1

>>schmic+gi
> I can't think of another development tool we'd say this about.

Because no other dev tool actually generates unique code like AI does. So you treat it like the other components of your team that generates code, the other developers. Do you trust other developers to write good code without mistakes without getting it reviewed by others. Of course not.

replies(4): >>anonym+zl >>chrisw+Il >>forget+en >>seabir+yn

>>nerdjo+(OP)
I think that’s why Apple is very slow at rolling out AI if it ever actually will. Downside is way too big than the upside.

replies(7): >>sillyf+hp >>zdragn+ip >>m3kw9+pr >>saintf+Cs >>poink+Nu >>jmaker+M21 >>gambit+ti1

>>yodsan+ia
1) Once you get it to output something you like, do you check all the lines it changed? Is there a threshold after which you just... hope?

2) No matter what the learning curve, you're using a statistical tool that outputs in probabilities. If that's fine for your workflow/company, go for it. It's just not what a lot of developers are okay with.

Of course it's a spectrum with the AI deniers in one corner and the vibe coders in the other. I personally won't be relying 100% on a tool and letting my own critical thinking atrophy, which seems to be happening, considering recent studies posted here.

replies(4): >>senord+cs >>pjerem+0O >>nkoren+We1 >>yodsan+KX1

>>theone+Sk
I trust my colleagues to write code that compiles, at the very least

replies(1): >>Modern+Oy

>>theone+Sk
But of course everyone absolutely NEEDS to use AI for codereviews! How else could the huge volume of AI-generated code be managed?

>>Adrian+Gb
yeah, "git init" (if you haven't botherered to create a template repo) is not exactly cumbersome.

>>theone+Sk
"Do you trust other developers to write good code without mistakes without getting it reviewed by others."

Literally yes. Test coverage and QA to catch bugs sure but needing everything manually reviewed by someone else sounds like working in a sweatshop full of intern-level code bootcamp graduates, or if you prefer an absolute dumpster fire of incompetence.

replies(2): >>ryandr+Xo >>theone+Cp

>>theone+Sk
Yes, actually, I do! I trust my teammates with tens of thousands of hours of experience in programming, embedded hardware, our problem spaces, etc. to write from a fully formed worldview, and for their code to work as intended (as far as anybody can tell before it enters preliminary testing by users) by the time the rest of the team reviews it. Most code review is uneventful. Have some pride in your work and you'll be amazed at what's possible.

replies(1): >>theone+SD

>>Modern+I2
> the very first time I allowed it to look at my codebase, it hallucinated a missing brace (my code parsed fine), "helpfully" inserted it, and then proceeded to break everything.

This is not an inherent flaw of LLMs, rather it is a flaw of a particular implementation-if you use guided sampling, so during sampling you only consider tokens allowed by the programming language grammar at that position, it becomes impossible for the LLM to generate ungrammatical output

> When it does this, I feel the hallucinations can be off the charts -- inventing APIs, function names, entire libraries,

They can use guided sampling for this too - if you know the set of function names which exist in the codebase and its dependencies, you can reject tokens that correspond to non-existent function names during sampling

Another approach, instead of or as well as guided sampling, is to use an agent with function calling - so the LLM can try compiling the modified code itself, and then attempt to recover from any errors which occur.

>>Mounta+25
Thankfully modern source control doesn't reuse user-supplied filenames for its internals. In the dark ages, I destroyed more than one checkout using commands of the form:

  find -name '*somepattern*' -exec clobbering command ...

>>schmic+gi
Imagine if your compiler just randomly and non-deterministically compiled valid code to incorrect binaries, and the tool's developer couldn't really tell you why it happens, how often it was expected to happen, how severe the problem was expected to be, and told you to just not trust your compiler to create correct machine code.

Imagine if your calculator app randomly and non-deterministically performed arithmetic incorrectly, and you similarly couldn't get correctness expectations from the developer.

Imagine if any of your communication tools randomly and non-deterministically translated your messages into gibberish...

I think we'd all throw away such tools, but we are expected to accept it if it's an "AI tool?"

replies(4): >>andrei+is >>ToValu+nt >>arvins+XK >>learni+Nmm

>>forget+en
I would accept mistakes and inconsistency from a human, especially one not very experienced or skilled. But I expect perfection and consistency from a machine. When I command my computer to do something, I expect it to do it correctly, the same way every time, to convert a particular input to an exact particular output, every time. I don't expect it to guess, or randomly insert garbage, or behave non-deterministically. Those things are called defects(bugs) and I'd want them to be fixed.

replies(3): >>senord+ms >>forget+tt >>tevon+aw

>>crypto+Xk
They already rolled out an "AI" product. Got humiliated pretty bad, and rolled it back. [0]

[0] https://www.bbc.com/news/articles/cq5ggew08eyo

replies(2): >>furyof+Is >>Ventur+vw

>>crypto+Xk
Investors seem to be starved for novelty right now. Web 2.0 is a given, web 3.0 is old, crypto has lost the shine, all that's left to jump on at the moment is AI.

Apple fumbled a bit with Siri, and I'm guessing they're not too keen to keep chasing everyone else, since outside of limited applications it turns out half baked at best.

Sadly, unless something shinier comes along soon, we're going to have to accept that everything everywhere else is just going to be awful. Hallucinations in your doctor's notes, legal rulings, in your coffee and laundry and everything else that hasn't yet been IoT-ified.

replies(3): >>Ventur+Gw >>timr+cz >>pcthro+W71

>>forget+en
Ok, here I thought requiring PR review and approval before merging was standard industry best practice. I guess all the places I've worked have been doing it wrong?

replies(1): >>forget+ou

>>crypto+Xk
When you got 1-2billion users a day doing maybe 10 billion prompts a day, it’s risky

>>gtirlo+hl
1) Yes, I review every line it changed.

2) I find the tool analogy helpful but it has limits. Yes, it’s a stochastic tool, but in that sense it’s more like another mind, not a tool. And this mind is neither junior nor senior, but rather a savant.

>>ryandr+oo
Imagine that you yourself never use these tools directly but your employees do. And the sellers of said tools swear that the tools are amazing and correct and will save you millions.

They keep telling you that any employee who highlights problems with the tools are just trying to save their job.

Your investors tell you that the toolmakers are already saving money for your competitors.

Now, do you want that second house and white lotus vacation or not?

Making good tools is difficult. Bending perception (“is reality”) is easier and enterprise sales, just like good propaganda, work. The gold rush will leave a lot of bodies behind but the shovelmakers will make a killing.

replies(1): >>Modern+Gy

>>ryandr+Xo
Then you are going to hate the future.

replies(1): >>ryandr+yV1

>>crypto+Xk
You say slowly, but in my opinion Apple made an out of character misstep by releasing a terrible UX to everyone. Apple intelligence is a running joke now.

Yes they didn't push it as hard as, say, copilot. I still think they got in way too deep way too fast.

replies(5): >>stogot+cv >>manmal+FF >>throwa+9M >>devmor+zX >>miki12+i11

>>sillyf+hp
They also have text thread and email summaries. I still think it counts as a slow rollout.

>>ryandr+oo
If the only calculators that existed failed at 5% of the calculations, or if the only communication tools miscommunicated 5% of the time, we would still use both all the time. They would be far less than 95% as useful as perfect versions, but drastically better then not having the tools at all.

replies(1): >>gitrem+Pv

>>ryandr+Xo
Exactly this.

>>theone+Cp
There's a lot of shit that has become "best practice" over the last 15 years, and a lot more that was "best practice" but fell out of favor because reasons. All of it exists on a continuum of what is actually reasonable given the circumstances. Reviewing pull requests is one of those things that is reasonable af in theory, produces mediocre results in practice, and is frequently nothing more than bureaucratic overhead. Consider a case where an individual adds a new feature to an existing codebase. Given they are almost certainly the only one who has spent significant time researching the particulars of the feature set in question, and are the only individual with any experience at all with the new code, having another developer review it means you've got inexperienced, low-info eyes examining something they do not fully understand, and will have to take some amount of time to come up to speed on. Sure they'll catch obvious errors, but so would a decent test suite.

Am I arguing in favor of egalitarian commit food fights with no adults in the room? Absolutely not. But demanding literally every change go through a formal review process before getting committed, like any other coding dogma, has a tendency to generate at least as much bullshit as it catches, just a different flavor.

replies(2): >>rixed+9E >>Tainno+p31

>>crypto+Xk
Yet Apple has reenabled Apple Intelligence multiple times on my devices after OS updates despite me very deliberately and angrily disabling it multiple times

>>schmic+gi
In Mechanical Engineering, this is 100% a thing with fluid dynamics simulation. You need to know if the output is BS based on a number of factors that I don't understand.

>>saintf+Cs
Fast!? They were two years slow and still fell face flat, and then rolled back the software

replies(1): >>Michae+jz

>>schmic+gi
Stackoverflow is like this, you read an answer but are not fully sure if its right or if it fits your needs.

Of course there is a review system for a reason, but we frequently use "untrusted" tools in development.

That one guy in a github issue that said "this worked for me"

>>ToValu+nt
Absolutely not. We'd just do the calculations by hand, which is better than running the 95%-correct calculator and then doing the calculations by hand anyway to verify its output.

replies(1): >>ToValu+YI

>>ryandr+Xo
This seems like a particularly limited view of what a machine is. Specifically expecting it to behave deterministically.

replies(2): >>Modern+8z >>forget+pN1

>>sillyf+hp
They had an opportunity to actually adapt, to embrace getting rapid feedback/iterating: But they are not equipped for it culturally. Major lost opportunity as it could have been a driver of internal change.

I'm certain they'll get it right soon enough though. People were writing off Google in terms of AI until this year.. and oh how attitudes have changed.

replies(2): >>tonyha+OG >>skinke+ET

>>zdragn+ip
"all that's left to jump on at the moment is AI" -> No, it's the effective applications of AI. It's unprecedented.

I was in the VC space for a while previously, most pitch decks claimed to be using AI: But doing even the briefest of DD - it was generally BS. Now it's real.

With respect to everything being awful: One might say that's always been the case. However, now there's a chance (and requirement) to build in place safeguards/checks/evals and massively improve both speed and quality of services through AI.

Don't judge for the problems: Look at the exponential curve, think about how to solve the problems. Otherwise, you will get left behind.

replies(2): >>zdragn+7I >>hexo+bX

>>nerdjo+(OP)
Did anyone say that? They are an issue everywhere, including for code. But with code at least I can have tooling to automatically check and feed back that it hallucinated libraries, functions etc, but with just normal research / problems there is no such thing and you will spend a lot of time verifying everything.

replies(5): >>manmal+xF >>threes+PH >>felipe+qI >>rini17+f11 >>jmaker+d21

>>schmic+gi
Imagine! Imagine if 0.05% of the time gcc just injected random code into your binaries. Imagine, you swing a hammer and 1% of the time it just phases into the wall. Tools are supposed to be reliable.

replies(1): >>arvins+aL

>>andrei+is
I feel like there's a lot of motivated reasoning going on, yeah.

>>anonym+zl
Oh at the very least I trust them to not take code that compiles and immediately assess that it's broken.

>>tevon+aw
Still, the whole Unix philosophy of building tools starts with a foundation of building something small that can do one thing well. If that is your foundation, you can take advantage of composability and create larger tools that are more capable. The foundation of all computing today is built on this principle of design.

Building on AI seems more like building on a foundation of sand, or building in a swamp. You can probably put something together, but it's going to continually sink into the bog. Better to build on a solid foundation, so you don't have to continually stop the thing from sinking, so you can build taller.

>>zdragn+ip
> we're going to have to accept that everything everywhere else is just going to be awful. Hallucinations in your doctor's notes, legal rulings, in your coffee and laundry and everything else that hasn't yet been IoT-ified.

I installed a logitech mouse driver (sigh) the other day, and in addition to being obtrusive and horrible to use, it jams an LLM into the UI, for some reason.

AI has reached crapware status in record time.

>>stogot+cv
“Two years slow” relative to what?

Henry Ford was 23 years “slow” relative to Karl Benz.

replies(1): >>the_do+bE

>>theone+ek
It's not a big deal in the sense that it's easily reversed, but it is a big deal in that it means the tool is unpredictably unhelpful. Of the properties that good tools in my workflow possess, "unpredictably unhelpful" does not make the top 100.

When a tool starts confidently inserting random wrong code into my 100% correct code, there's not much more I need to see to know it's not a tool for me. That's less like a tool and more like a vandal. That's not something I need in my toolbox, and I'm certainly not going to replace my other tools with it.

>>seabir+yn
so your saying that yes you do "trust other developers to write good code without mistakes without getting it reviewed by others."

And then you say "by the time the rest of the team reviews it. Most code review is uneventful."

So you trust your team to develop without the need for code review but yet, your team does code review.

So what is the purpose of these code reviews? Is it the case that you actually don't think they are necessary, but perhaps management insists on them? You actually answer this question yourself:

> Most code review is uneventful.

Keyword here is "most" as opposed to "all" So based your team's applied practices and your own words, code review is for the purpose of catching mistakes and other needed corrections.

But it seems to me if you trust your team not to make mistakes, code review is superfluous.

As an aside, it seems your team culture doesn't make room for juniors because if your team had juniors I think it would be even more foolish to trust them not to make mistakes. Maybe a junior free culture works for your company, but that's not the case for every company.

My main point is code review is not superfluous no matter the skill level; junior, senior, or AI simply because everyone and every AI makes mistakes. So I don't trust those three classes of code emitters to not ever make mistakes or bad choices (i.e. be perfect) and therefore I think code review is useful.

Have some honesty and humility and you'll amazed at what's possible.

replies(1): >>seabir+mJ

>>forget+ou
And there is worst: in the cases when the reviewer has actually some knowledge of the problem at hand, she might say "oh you did all this to add that feature? But it's actually already there. You just had to include that file and call function xyz". Or "oh but two months ago that very same topic was discussed and it was decided that it would make more sense to wait for module xyz to be refactored in order to make it easier ", etc.

>>Michae+jz
Apple innovation glazing makes me ill

replies(1): >>Michae+UF1

>>anonzz+xy
You get some superficial checking by the compiler and test cases, but hallucinations that pass both are still an issue.

replies(1): >>anonzz+ZF

>>saintf+Cs
Remember „You are a bad user, I am a good bing“? Apple is just slower in fixing and improving things.

>>manmal+xF
Absolutely, but at least you have some lines of defence while with real world info you have nothing. And the most offending stuff like importing a package that doesn't exist or using a function that doesn't exist does get caught and can be auto fixed.

replies(1): >>skyfal+ZN

>>Ventur+vw
"to embrace getting rapid feedback/iterating"

that's the problem noo?? big company is sucks at that, you cant do that in certain company because sometimes its just not possible

>>anonzz+xy
I use Scala which has arguably the best compiler/type system with Cursor.

There is no world in which a compiler or tooling will save you from the absolute mayhem it can do. I’ve had it routinely try to re-implement third party libraries, modify code unrelated to what it was asked, quietly override functions etc.

It’s like a developer who is on LSD.

replies(4): >>Terr_+j01 >>Tainno+s11 >>jmaker+t21 >>larodi+ha1

>>Ventur+Gw
The problem isn't AI; it's just a tool. The problem is the people using it incorrectly because they don't understand it beyond the hype and surface details they hear about it.

Every week for the last few months, I get a recruiter for a healthcare startup note taking app with AI. It's just a rehash of all the existing products out there, but "with AI". It's the last place I want an overworked non-technical user relying on the computer to do the right thing, yet I've had at least four companies reach out with exactly that product. A few have been similar. All of them have been "with AI".

It's great that it is getting better, but at the end of the day, there's only so much it can be relied upon for, and I can't wait for something else to take away the spotlight.

replies(1): >>Ventur+xg1

>>anonzz+xy
Yes, most people who have an incentive in pushing AI say that hallucinations aren't a problem, since humans aren't correct all the time.

But in reality hallucinations either make people using AI lose a lot of their time trying to stuck the LLMs from dead ends or render those tools unusable.

replies(2): >>jimbok+YK1 >>Gormo+M02

>>gitrem+Pv
Suppose you work in a field where getting calculations right is critical. Your engineers make mistakes less than .01% of the time, but they do a lot of calculations and each mistake could cost $millions or lives. Double- and triple-checking help a lot, but they're costly. Here's a machine that verifies 95% of calculations, but you'd still have to do 5% of the work. Shall I throw it away?

Unreliable tools have a good deal of utility. That's an example of them helping reduce the problem space, but they also can be useful in situations where having a 95% confidence guess now matters more that a 99.99% confidence one in ten minutes- firing mortars in active combat, say.

There's situations where validation is easier than computation; canonically this is factoring, but even division is much simpler than multiplication. It could very easily save you time to multiply all of the calculator's output by the dividend while performing both a multiplication and a division for the 5% that are wrong.

edit: I submit this comment and click to go the front page and right at the top is Unsure Calculator (no relevance). Sorry, I had to mention this

replies(4): >>diputs+cM >>mrheos+GO >>Tainno+721 >>jimbok+TU1

>>theone+SD
I never said that code review was useless, I said "yes, I do" to your question as to whether or not I "trust other developers to write good code without mistakes without getting it reviewed by others". Of course I can trust them to do the right thing even when nobody's looking, and review it anyway in the off-chance they overlooked something. I can't trust AI to do that.

The purpose of the review is to find and fix occasional small details before it goes to physical testing. It does not involve constant babysitting of the developer. It's a little silly to bring up honesty when you spent that entire comment dancing around the reality that AI makes an inordinately large number of mistakes. I will pick the domain expert who refuses to touch AI over a generic programmer with access to it ten times out of ten.

The entire team as it is now (me included) were juniors. It's a traditional engineering environment in a location where people don't aggressively move between jobs at the drop of a hat. You don't need to constantly train younger developers when you can retain people.

replies(1): >>theone+HL

>>ryandr+oo
If you think of AI like a compiler, yes we should throw away such tools because we expect correctness and deterministic outcomes

If you think of AI like a programmer, no we shouldn't throw away such tools because we accept them as imperfect and we still need to review.

replies(1): >>bigstr+FL

>>Modern+By
There are no existing AI tools that guarantee correct code 100% of the time.

If there is such a tool, programmers will be on path of immediate reskilling or lose their jobs very quickly.

>>yodsan+ia
> You're not supposed to trust the tool, you're supposed to review and rework the code before submitting for external review.

Then it's not a useful tool, and I will decline to waste time on it.

>>arvins+XK
> If you think of AI like a programmer, no we shouldn't throw away such tools because we accept them as imperfect and we still need to review.

This is a common argument but I don't think it holds up. A human learns. If one of my teammates or I make a mistake, when we realize it we learn not to make that mistake in the future. These AI tools don't do that. You could use a model for a year, and it'll be just as unreliable as it is today. The fact that they can't learn makes them a nonstarter compared to humans.

>>seabir+mJ
You spend your comment dancing around the fact that everyone makes mistakes and yet you claim you trust your team not to make mistakes.

> I "trust other developers to write good code without mistakes without getting it reviewed by others". Of course I can trust them to do the right thing even when nobody's looking, and review it anyway in the off-chance they overlooked something.

You're saying yes, I trust other developers to not make mistakes, but I'll check anyways in case they do. If you really trusted them not to make mistakes, you wouldn't need to check. They (eventually) will. How can I assert that? Because everyone makes mistakes.

It's absurd to expect anyone to not make mistakes. Engineers build whole processes to account for the fact that people, even very smart people make mistakes.

And it's not even just about mistakes. Often times, other developers have more context, insight or are just plain better and can offer suggestions to improve the code during review. So that's about teamwork and working together to make the code better.

I fully admit AI makes mistakes, sometimes a lot of them. So it needs code review . And on the other hand, sometimes AI can really be good at enhancing productivity especially in areas of repetitive drudgery so the developer can focus on higher level tasks that require more creativity and wisdom like architectural decisions.

> I will pick the domain expert who refuses to touch AI over a generic programmer with access to it ten times out of ten.

I would too, but I won't trust them not to make mistakes or occasional bad decisions because again, everybody does.

> You don't need to constantly train younger developers when you can retain people.

But you do need to train them initially. Or do you just trust them to write good code without mistakes too?

>>saintf+Cs

    > Apple made an out of character misstep by releasing a terrible UX to everyone

What about Apple Maps? That roll-out was awful.

replies(1): >>miki12+n11

>>ToValu+YI
> Here's a machine that verifies 95% of calculations, but you'd still have to do 5% of the work.

The problem is that you don't know which 5% are wrong. The AI is confidently wrong all the time. So the only way to be sure is to double check everything, and at some point its easier to just do it the right way.

Sure, some things don't need to be perfect. But how much do you really want to risk? This company thought a little bit of potential misinformation was acceptable, and so it caused a completely self inflicted PR scandal, pissed off their customer base, and lost them a lot of confidence and revenue. Was that 5% error worth it?

Stories like this are going to keep coming the more we rely on AI to do things humans should be doing.

Someday you'll be affected by the fallout of some system failing because you happen to wind up in the 5% failure gap that some manager thought was acceptable (if that manager even ran a calculation and didn't just blindly trust whatever some other AI system told them) I just hope it's something as trivial as an IDE and not something in your car, your bank, or your hospital. But certainly LLMs will be irresponsibly shoved into all three within the next few years, if it's not there already.

replies(1): >>ToValu+gG1

>>anonzz+ZF
Such errors can be caught and auto-fixed for now, because LLMs haven't yet rotted the code that catches and auto-fixes errors. If slop makes it into your compiler etc., I wouldn't count on that being true in the future.

>>gtirlo+hl
> 1) Once you get it to output something you like, do you check all the lines it changed? Is there a threshold after which you just... hope?

Not op but yes. It sometimes takes a lot of time but I read everything. It still faster than nothing. Also, I ask very precise changes to the AI so it doesn’t generate huge diffs anyway.

Also for new code, TDD works wonders with AI : let it write the unit tests (you still have to be mindful of what you want to implement) and ask it to implement the code that run the tests. Since you talk the probabilistic output, the tool is incredibly good at iterating over things (running and checking tests) and also, unit tests are, in themselves, a pretty perfect prompt.

replies(2): >>iforgo+9U >>riffra+C71

>>yodsan+ia
If i dont trust my tool, i would never use it, or use something else better

>>ToValu+YI
> you'd still have to do 5% of the work

No, you still have to do 100% of the work.

replies(1): >>ToValu+7I1

>>Ventur+vw
> People were writing off Google in terms of AI until this year.. and oh how attitudes have changed.

Just give Google a year or two.

Google has a pretty amazing history of both messing up products generally and especially "ai like" things, including search.

(Yes I used to defend Google until a few years ago.)

>>pjerem+0O
> It sometimes takes a lot of time but I read everything. It still faster than nothing.

Opposite experience for me. It reliably fails at more involved tasks so that I don't even try anymore. Smaller tasks that are around a hundred lines maybe take me longer to review that I can just do it myself, even though it's mundane and boring.

The only time I found it useful is if I'm unfamiliar with a language or framework, where I'd have to spend a lot of time looking up how to do stuff, understand class structures etc. Then I just ask the AI and have to slowly step through everything anyways, but at least there's all the classes and methods that are relevant to my goal and I get to learn along the way.

>>Ventur+Gw
> it was generally BS. Now it's real.

Yes. Finally! Now it's real BS. I wouldn't touch it with 8 meter pole.

>>saintf+Cs
This is not the first time that Apple has released a terrible UX that very few users liked, and it certainly wont be the last.

I don’t necessarily agree with the post you’re responding to, but what I will give Apple credit for is making their AI offering unobtrusive.

I tried it, found it unwanted and promptly shut it off. I have not had to think about it again.

Contrast that with Microsoft Windows, or Google - both shoehorning their AI offering into as many facets of their products as possible, not only forcing their use, but in most cases actively degrading the functionality of the product in favor of this required AI functionality.

replies(2): >>johnis+9Z >>trilby+v81

>>devmor+zX
On Instagram, the search is now "powered by Meta AI". Cannot do anything about it.

replies(1): >>johnis+wpd

>>nerdjo+(OP)
https://www.anthropic.com/research/tracing-thoughts-language...

The section about hallucinations is deeply relevant.

Namely, Claude sometimes provides a plausible but incorrect chain-of-thought reasoning when its “true” computational path isn’t available. The model genuinely believes it’s giving a correct reasoning chain, but the interpretability microscope reveals it is constructing symbolic arguments backward from a conclusion.

https://en.wikipedia.org/wiki/On_Bullshit

This empirically confirms the “theory of bullshit” as a category distinct from lying. It suggests that “truth” emerges secondarily to symbolic coherence and plausibility.

This means knowledge itself is fundamentally symbolic-social, not merely correspondence to external fact.

Knowledge emerges from symbolic coherence, linguistic agreement, and social plausibility rather than purely from logical coherence or factual correctness.

replies(7): >>jmaker+Q11 >>emn13+t61 >>CodesI+L61 >>skrebb+x71 >>nickle+pA1 >>jimbok+hJ1 >>ScottB+Ho8

>>threes+PH
Yeah, everyone wanted a thinking machine, but the best we can do right now is a dreaming machine... And dreams don't have to make sense.

>>anonzz+xy
Except when the hallucinated library exists and it's malicious. This is actually happening. Without AI, by using plain google you are less likely to fall for that (so far).

>>saintf+Cs
Apple made a huge mistake by keeping their commitment to "local first" in the age of AI.

The models and devices just aren't quite there yet.

Once Google gets its shit together and starts deploying (cloud--based) AI features to Android devices en masse, Apple is going to have a really big problem on their hands.

Most users say that they want privacy, but if privacy comes in the way of features or UX, they choose the latter. Successful privacy-respecting companies (Apple, Signal) usually understand this, it's why they're successful, but I think Apple definitely chose the wrong tradeoff here.

>>throwa+9M
Apple had their hand forced by Google on that one afaik.

Yes they knew Apple maps was bad and not up to standard yet, but they didn't really have any other choice.

replies(1): >>emn13+471

>>threes+PH
I don't know Scala. I asked cursor to create a tutorial for me to learn Scala. It created two files for me, Basic.scala and Advanced.scala. The second one didn't compile and no matter how often I tried to paste the error logs into the chat, it couldn't fix the actual error and just made up something different.

>>lyngui+oZ
I haven’t used Cursor yet. Some colleagues have and seemed happy. I’ve had GitHub Copilot on for what feels like a couple years, a few days ago VS Code was extended to provide an agentic workflow, MCP, bring-your-own-key, it interprets instructions in a codebase. But the UX and the outputs are bad in over 3/4 of cases. It’s a nuisance to me. It injects bad code even though it has the full context. Is Cursor genuinely any better?

To me it feels like people that benefit from or at least enjoy that sort of assistance and I solve vastly different problems and code very differently.

I’ve done exhausting code reviews on juniors’ and middles’ PRs but what I’ve been feeling lately is that I’m reviewing changes introduced by a very naive poster. It doesn’t even type-check. Regardless of whether it’s Claude 3.7, o1, o3-mini, or a few models from Hugging Face.

I don’t understand how people find that useful. Yesterday I literally wasted half an hour for a test suite setup a colleague of mine introduced to the codebase that wasn’t good, and I tried delegating that fix to several of the Copilot models. All of them missed the point, some even introduced security vulnerabilities in the process invalidating JWT validation, I tried “vide coding” it till it works, until I gave up in frustration and just used an ordinary search engine, which led me to the docs, in which I immediately found the right knob. I reverted all that crap and did the simple and correct thing. So my conclusion was simple: vibe coding and LLMs made the codebase unnecessarily more complicated and wasted my time. How on earth do people code whole apps with that?

replies(1): >>trilby+p81

>>ToValu+YI
> Unreliable tools have a good deal of utility.

This is generally true when you can quantify the unreliability. E.g. random prime number tests with a specific error rate can be combined so that the error rates multiply and become negligible.

I'm not aware that we can quantify the uncertainty coming out of LLM tools reliably.

>>anonzz+xy
Until the model injects a subtle change to your logic that does type-check and then goes haywire in production. Just takes a colleague of yours under pressure and another one to review the PR, and then you’re on call and they out sick or on vacation.

>>threes+PH
Granted the Scala language is much more complex than Go. To produce something useful it must be capable of an equivalent of parsing the AST.

>>crypto+Xk
Even the iOS and macOS typing correction engine has been getting worse for me over the past few OS updates. I’m now typing this on iOS, and it’s really annoying how it injects completely unrelated words, replaces minor typos with completely irrelevant words. Same in Safari on macOS. The previous release felt better than now, but still worse than a couple years ago.

replies(1): >>TylerE+m41

>>forget+ou
Code review is actually one of the few practices for which research does exist[0] which points in the direction of it being generally good at reducing defects.

Additionally, in the example you share, where only one person knows the context of the change, code review is an excellent tool for knowledge sharing.

[0]: https://dl.acm.org/doi/10.1145/2597073.2597076, for example

replies(1): >>forget+KO1

>>jmaker+M21
It’s not just you. iOS auto correct has gotten damn near malicious. E seen it insert entire words out of nowhere

replies(2): >>trilby+W81 >>throww+JO1

>>lyngui+oZ
While some of what you say is an interesting thought experiment, I think the second half of this argument has, as you'd put it, a low symbolic coherence and low plausibility.

Recognizing the relevance of coherence and plausibility does not need to imply that other aspects are any less relevant. Redefining truth merely because coherence is important and sometimes misinterpreted is not at all reasonable.

Logically, a falsehood can validly be derived from assumptions when those assumptions are false. That simple reasoning step alone is sufficient to explain how a coherent-looking reasoning chain can result in incorrect conclusions. Also, there are other ways a coherent-looking reasoning chain can fail. What you're saying is just not a convincing argument that we need to redefine what truth is.

replies(2): >>dcow+Nc2 >>learni+p5m

>>lyngui+oZ
> The model genuinely believes it’s giving a correct reasoning chain, but the interpretability microscope reveals it is constructing symbolic arguments backward from a conclusion.

Sounds very human. It's quite common that we make a decision based on intuition, and the reasons we give are just post-hoc justification (for ourselves and others).

replies(5): >>Ransom+N91 >>jimbok+IJ1 >>throww+oN1 >>jerf+tO1 >>learni+46m

>>miki12+n11
Of course they had a choice: they could have stuck with google maps for longer, and they probably also could have invested more in data and UI beforehand. They could have launched a submarine non-apple-branded product to test the waters. They could likely have done other things we haven't thought of here, in this thread.

Quite plausibly they just didn't realize how rocky the start would be, or perhaps they valued that immediate strategic autonomy more in the short-term that we think, and willingly chose to take the hit to their reputation rather than wait.

Regardless, they had choices.

replies(1): >>throwa+alj

>>lyngui+oZ
Offtopic but I'm still sad that "On Bullshit" didn't go for that highest form of book titles, the single noun like "Capital", "Sapiens", etc

replies(1): >>mvieir+EU1

>>pjerem+0O
How do you have it write tests before the code? It seems writing a prompt for the LLM to generate the tests would take the same time as writing the tests themselves.

Unless you're thinking of repetitive code I can't imagine the process (I'm not arguing, I'm just curious of what you're flow looks like).

>>zdragn+ip
> Hallucinations in your doctor's notes, legal rulings, in your coffee

"OK Replicator, make me one espresso with creamer"

"Making one espresso with LSD"

replies(1): >>Jagerb+Px1

>>jmaker+Q11
I think it works until it doesn't. The nature of technical debt of this kind means you can sort of coast on things until the complexity of the system reaches such a level that it's effectively painted into a corner, and nothing but a massive teardown will do as a fix.

>>devmor+zX
Yep. Replacing google assistant with Gemini before feature parity was even close is such a fuck-you to users.

>>TylerE+m41
Spellcheck is an absolutely perfect example of what happens with technology long-term. Once the hype cycle is over for a certain tech, it gets left to languish, slowly degrading until it's completely useless. We should be far more outraged at how poor basic things like this still are in 2025. They are embarrassingly bad.

replies(1): >>nottor+Zh1

>>CodesI+L61
> Sounds very human

well yes, of course it does, that article goes out of its way to anthropomorphize LLMs, while providing very little substance

>>threes+PH
Developer on LSD is likely to hallucinate less in terms of how weird the LLM hallucinations are sometimes. Besides I know people, not myself, who fare very well on LSD and particularly when micro dosing Adderal style

replies(1): >>mining+oj1

>>gtirlo+hl
I've been doing AI-assisted coding for several months now, and have found a good balance that works for me. I'm working in Typescript and React, neither of which I know particularly well (although I know ES6 very well). In most cases, AI is excellent at tasks which involve writing quasi-custom boilerplate (eg. tests which require a lot of mocking), and at answering questions of how I should do _X_ in TS/React. For the latter, those are undoubtedly questions I could eventually find the answers on Stack Overflow and deduce how to apply those answers to my specific context -- but it's orders of magnitude faster to get the AI to do that for me.

Where the AI fails is in doing anything which requires having a model of the world. I'm writing a simulator which involves agents moving through an environment. A small change in agent behaviour may take many steps of the simulator to produce consequential effects, and thinking through how that happens -- or the reverse: reasoning about the possible upstream causes of some emergent macroscopic behaviour -- requires a mental model of the simulation process, and AI absolutely does _not_ have that. It doesn't know that it doesn't have that, and will therefore hallucinate wildly as it grasps at an answer. Sometimes those hallucinations will even hit the mark. But on the whole, if a mental model is required to arrive at the answer, AI wastes more time than it saves.

replies(1): >>jimbok+nT1

>>zdragn+7I
Well put and you're correct: There IS a lot of hype/BS still sadly - as companies seek to jump on the hype train without effectively adapting. My karma took a serious hit for my last post - but yesterday I met with someone whose life has been profoundly impacted by AI:

- An extremely dedicated and high achieving professional, at the very top of her game with deep industry/sectoral knowledge: Successful and with outstanding connections. - Mother of a young child. - Tradition/requirement for success within the sector was/is working extremely long hours: 80-hour weeks are common.

She's implemented AI to automate many of her previous laborious tasks and literally cut down her required hours by 90%. She's now able to spend more time with her family, but also - able to now focus on growing/scaling in ways previously impossible.

Knowing how to use it, what to rely upon, what to verify and building in effective processes is the key. But today AI is at its worst and it already exceeds human performance in many areas.. it's only going in one direction.

Hopefully the spotlight becomes humanity being able to focus on what makes us human and our values, not mundane/routine tasks and allows us to better focus on higher-value/relationships.

replies(2): >>e3bc54+aA1 >>zdragn+q72

>>trilby+W81
> it gets left to languish, slowly degrading until it's completely useless

What do you mean? Code shouldn't degrade if it's not changed. But the iOS spell checker is actively getting worse, meaning someone is updating it.

replies(1): >>TylerE+G07

>>crypto+Xk
>>if it ever actually will.

If they don't then I'd hope they get absolutely crucified by trade comissions everywhere, currently there are bilboards in my city advertising Apple AI even though it doesn't even exist yet - if it's never brought to the market then it's a serious case of misleading advertising.

>>larodi+ha1
Mushrooms too! I find they get me into a flow state much better than acid (when microdosing).

>>pcthro+W71
I'd like to get on the pre-order list for this product.

>>Ventur+xg1
90% is a big number. Assuming being an expert allows her to make better use of AI than most, that is still an astonishing number without knowing anything about the field in question. That makes me think that 80 hour work weeks are mostly unproductive. Again, assumption being an average person to be able to use AI less effectively, let's ballpark at half as effectively, I still get 40 hours per week of mostly non-sense work. How did we end up here as a society?!

>>lyngui+oZ
Yes

https://link.springer.com/article/10.1007/s10676-024-09775-5

> # ChatGPT is bullshit

> Recently, there has been considerable interest in large language models: machine learning systems which produce human-like text and dialogue. Applications of these systems have been plagued by persistent inaccuracies in their output; these are often called “AI hallucinations”. We argue that these falsehoods, and the overall activity of large language models, is better understood as bullshit in the sense explored by Frankfurt (On Bullshit, Princeton, 2005): the models are in an important way indifferent to the truth of their outputs. We distinguish two ways in which the models can be said to be bullshitters, and argue that they clearly meet at least one of these definitions. We further argue that describing AI misrepresentations as bullshit is both a more useful and more accurate way of predicting and discussing the behaviour of these systems.

>>yodsan+ia
> You're not supposed to trust the tool, you're supposed to review and rework the code before submitting for external review.

The vibe coding guy said to forget the code exists and give in to vibes, letting the AI 'take care' of things. Review and rework sounds more like 'work' and less like 'vibe'.

/s

>>the_do+bE
What does this mean?

>>diputs+cM
>The problem is that you don't know which 5% are wrong

This is not a problem in my unreliable calculator use-cases; are you disputing that or dropping the analogy?

Because I'd love to drop the analogy. You mention IDEs- I routinely use IntelliJ's tab completion, despite it being wrong >>5% of the time. I have to manually verify every suggestion. Sometimes I use it and then edit the final term of a nested object access. Sometimes I use the completion by mistake, clean up with backspace instead of undo, and wind up submitting a PR that adds an unused dependency. I consider it indispensable to my flow anyway. Maybe others turn this off?

You mention hospitals. Hospitals run loads of expensive tests every day with a greater than 5% false positive and false negative rate. Sometimes these results mean a benign patient undergoes invasive further testing. Sometimes a patient with cancer gets told they're fine and sent home. Hospitals continue to run these tests, presumably because having a 20x increase in specificity is helpful to doctors, even if it's unreliable. Or maybe they're just trying to get more money out of us?

Since we're talking LLMs again, it's worth noting that 95% is an underestimate of my hit rate. 4o writes code that works more reliably than my coworker does, and it writes more readable code 100% of the time. My coworker is net positive for the team. His 2% mistake rate is not enough to counter the advantage of having someone there to do the work.

An LLM with a 100% hit rate would be phenomenal. It would save my company my entire salary. A 99% one is way worse; they still have to pay me to use it. But I find a use for the 99% LLM more-or-less every day.

replies(1): >>gitrem+PP1

>>mrheos+GO
You simply do not. You do the math yourself to calculate 2(n) for n in [1, 2, 3, 4] and get [2, 5, 6, 8]. You plug it into your (75% accurate) unreliable calculator and get [3, 4, 6, 8]. You now know that you only need to recheck the first two (50%) of the entries.

replies(1): >>throww+IP1

>>lyngui+oZ
> Knowledge emerges from symbolic coherence, linguistic agreement, and social plausibility rather than purely from logical coherence or factual correctness.

This just seems like a redefinition of the word "knowledge" different from how it's commonly used. When most people say "knowledge" they mean beliefs that are also factually correct.

replies(2): >>dcow+2R1 >>indigo+V42

>>CodesI+L61
Isn't the point of computers to have machines that improve on default human weaknesses, not just reproduce them at scale?

replies(2): >>canada+AN1 >>floydn+5r2

>>felipe+qI
> Yes, most people who have an incentive in pushing AI say that hallucinations aren't a problem, since humans aren't correct all the time.

We have legal and social mechanisms in place for the way humans are incorrect. LLMs are incorrect in new ways that our legal and social systems are less prepared to handle.

If a support human lies about a change to policy, the human is fired and management communicates about the rogue actor, the unchanged policy, and how the issue has been handled.

How do you address an AI doing the same thing without removing the AI from your support system?

>>CodesI+L61
The other very human thing to do is invent disciplines of thought so that we don't just constantly spew bullshit all the time. For example you could have a discipline about "pursuit of facts" which means that before you say something you mentally check yourself and make sure it's actually factually correct. This is how large portions of the populace avoid walking around spewing made up facts and bullshit. In our rush to anthropomorphize ML systems we often forget that there are a lot of disciplines that humans are painstakingly taught from birth and those disciplines often give rise to behaviors that the ML-based system is incapable of like saying "I don't know the answer to that" or "I think that might be an unanswerable question."

>>tevon+aw
Would you welcome your car behaving in a nondeterministic fashion?

>>jimbok+IJ1
They've largely been complementary strengths, with less overlap. But human language is state-of-the-art, after hundreds of thousands of years of "development". It seems like reproducing SOTA (i.e. the current ongoing effort) is a good milestone for a computer algorithm as it gains language overlap with us.

>>CodesI+L61
In a way, the main problem with LLMs isn't that they are wrong sometimes. We humans are used to that. We encounter people who are professionally wrong all the time. Politicians, con-men, scammers, even people who are just honestly wrong. We have evaluation metrics for those things. Those metrics are flawed because there are humans on the other end intelligently gaming those too, but generally speaking we're all at least trying.

LLMs don't fit those signals properly. They always sound like an intelligent person who knows what they are talking about, even when spewing absolute garbage. Even very intelligent people, even very intelligent people in the field of AI research are routinely bamboozled by the sheer swaggering confidence these models convey in their own results.

My personal opinion is that any AI researcher who was shocked by the paper lynguist mentioned ought to be ashamed of themselves and their credulity. That was all obvious to me; I couldn't have told you the exact mechanism the arithmetic was being performed (though what is was doing was well in the realm of what I would have expected from a linguistic AI trying to do math), but the fact that its chain of reasoning bore no particular resemblance to how it drew its conclusions was always obvious. A neural net has no introspection on itself. It doesn't have any idea "why" it is doing what it is doing. It can't. There's no mechanism for that to even exist. We humans are not directly introspecting our own neural nets, we're building models of our own behavior and then consulting the models, and anyone with any practice doing that should be well aware of how those models can still completely fail to predict reality!

Does that mean the chain of reasoning is "false"? How do we account for it improving performance on certain tasks then? No. It means that it is occurring at a higher level and a different level. It is quite like humans imputing reasons to their gut impulses. With training, combining gut impulses with careful reasoning is actually a very, very potent way to solve problems. The reasoning system needs training or it flies around like an unconstrained fire hose uncontrollably spraying everything around, but brought under control it is the most powerful system we know. But the models should always have been read as providing a rationalization rather than an explanation of something they couldn't possibly have been explaining. I'm also not convinced the models have that "training" either, nor is it obvious to me how to give it to them.

(You can't just prompt it into a human, it's going to be more complicated than just telling a model to "be carefully rational". Intensive and careful RHLF is a bare minimum, but finding humans who can get it right will itself be a challenge, and it's possible that what we're looking for simply doesn't exist in the bias-set of the LLM technology, which is my base case at this point.)

>>TylerE+m41
Yeah it finishes my sentences and goes back and replaces entire words with other words that are not even in the same category of noun, then replaces pronouns and conjunctions to completely make up a new sentence for me in something I've already typed. I'm not stupid and I meant what I typed. If I didn't mean what I typed I would have typed something else. Which I didn't.

>>Tainno+p31
Oh I have no doubt it's an excellent tool for knowledge sharing. So are mailing lists (nobody reads email) and internal wikis (evergreen fist fight to get someone, anyone, to update). Despite best intentions knowledge sharing regimes are little more than well-intentioned pestering with irrelevant information that is absolutely purged from headspace during any number of daily/weekly/quarterly context switches. As I said, mediocre results.

>>ToValu+7I1
I resent becoming QA/QC for the machine instead of doing the same or better thinking myself.

replies(1): >>ToValu+NV1

>>ToValu+gG1
> This is not a problem in my unreliable calculator use-cases; are you disputing that or dropping the analogy?

If you use an unreliable calculator to sum a list of numbers, you then need to use a reliable method to sum the numbers to validate that the unreliable calculator's sum is correct or incorrect.

replies(1): >>ToValu+wU1

>>jimbok+hJ1
I don’t think it’s so clear cut… Even the most adamant “facts are immutable” person can agree that we’ve had trouble “fact checking” social media objectively. Fluoride is healthy, meta analysis of the facts reveals fluoride may be unhealthy. The truth of the matter is by and large what’s socially cohesive for doctors’ and dentists’ narrative, that “fluoride is fine any argument to the contrary—even the published meta-analysis—is politically motivated nonsense”.

replies(1): >>jimbok+SX1

>>nkoren+We1
> AI is excellent at tasks which involve writing quasi-custom boilerplate (eg. tests which require a lot of mocking)

I wonder if anyone has compared how well the AI auto-generating approach works compared to meta programming approaches (like Lisp macros) meant to address the same kind of issues with repetitive code.

replies(1): >>kazina+ZV2

>>theone+ek
https://dwheeler.com/essays/apple-goto-fail.html

>>gitrem+PP1
Yes, so in my first example in the GP, this happens first. Humans do the work. The calculator double checks and gives me a list of all errors plus 5% of the non-errors, and I only need to double check that list.

In my third example, the calculator does the hard work of dividing, and humans can validate by the simpler task of multiplication, only having to do extra work 5% of the time.

(In my second, the unreliablity is a trade-off against speed, and we need the speed more.)

In all cases, we benefit from the unreliable tool despite not knowing when it is unreliable.

replies(1): >>Modern+5g2

>>skrebb+x71
Starting with "On" is cooler in philosophical tradition, though, starting in classical and medieval times, e.g. On Interpretation, On the Heavens, etc by Aristotle, De Veritate, De Malo, etc. by Aquinas. Capital is actually "Das Kapital", too

replies(2): >>skrebb+cV3 >>pas+f34

>>ToValu+YI
> Here's a machine that verifies 95% of calculations

Which 95% did it get right?

>>senord+ms
Way ahead of you. I already hate the present, at least the current sad state of the software industry.

>>throww+IP1
This is fair. I expect you would resent the tool even more if it was perfect and you couldn't even land a job in QA anymore. If that's the case, your resentment doesn't reflect on the usefulness of LLMs.

>>gtirlo+hl
> Is there a threshold after which you just... hope?

Generally, all the code I write is reviewed by humans, so commits need to be small and easily reviewable. I can't submit something I don't understand myself or I may piss off my colleagues, or it may never get reviewed.

Now if it was a personal project or something with low value, I would probably be more lenient but I think if you use a statically typed language, the type system + unit tests can capture a lot of issues so it may be ok to have local blocks that you don't look in details.

replies(1): >>Modern+Kc2

>>dcow+2R1
You are just saying identifying "knowledge" vs "opinion" is difficult to achieve.

replies(1): >>dcow+N32

>>felipe+qI
> Yes, most people who have an incentive in pushing AI say that hallucinations aren't a problem, since humans aren't correct all the time.

Humans often make factual errors, but there's a difference between having a process to validate claims against external reality, and occasionally getting it wrong, and having no such process, with all output being the product of internal statistical inference.

The LLM is engaging in the same process in all cases. We're only calling it a "hallucination" when its output isn't consistent with our external expectations, but if we regard "hallucination" as referring to any situation where the output for a wholly endogenous process is mistaken for externally validated information, then LLMs are only ever hallucinating, and are just designed in such a way that what they hallucinate has a greater than chance likelihood of representing some external reality.

>>jimbok+SX1
No, I’m saying I’ve seen reasonbly minded experts in a field disagree over things-generally-considered-facts. I’ve seen social impetus and context shape the understanding of where to draw the line between fact and opinion. I do not believe there is an objective answer. I fundamentally believe Anthropic’s explanation is rooted in real phenomena and not just a self serving statement to explain AI hallucination in a positive quasi-intellectual light.

>>jimbok+hJ1
As a digression, the definition of knowledge as justified true belief runs into the Gettier problems:

    > Smith [...] has a justified belief that "Jones owns a Ford". Smith 
    > therefore (justifiably) concludes [...] that "Jones owns a Ford, or Brown 
    > is in Barcelona", even though Smith has no information whatsoever about 
    > the location of Brown. In fact, Jones does not own a Ford, but by sheer 
    > coincidence, Brown really is in Barcelona. Again, Smith had a belief that
    > was true and justified, but not knowledge.

Or from 8th century Indian philosopher Dharmottara:

   > Imagine that we are seeking water on a hot day. We suddenly see water, or so we 
   > think. In fact, we are not seeing water but a mirage, but when we reach the 
   > spot, we are lucky and find water right there under a rock. Can we say that we 
   > had genuine knowledge of water? The answer seems to be negative, for we were 
   > just lucky.

More to the point, the definition of knowledge as linguistic agreement is convincingly supported by much of what has historically been common knowledge, such as the meddling of deities in human affairs, or that the people of Springfield are eating the cats.

>>Ventur+xg1
> Knowing how to use it, what to rely upon, what to verify and building in effective processes is the key. But today AI is at its worst and it already exceeds human performance in many areas.. it's only going in one direction.

I suppose this is the difference between an optimist and a pessimist. No matter how much better the tool gets, I don't see people getting better, and so I don't see the addition of LLM chatbots as ever improving things on the whole.

Yes, expert users get expert results. There's a reason why I use a chainsaw to buck logs instead of a hand saw, and it's also much the same reason that my wife won't touch it.

>>yodsan+KX1
Yeah for me, I use AI with Rust and a suite of 1000 tests in my codebase. I also use CoPilot VS code plugin mostly, which as far as I can tell heavily weights toward local code around it and often it just writing code based on my other code. I've found AI to be a good macro debugger too, as macro debugging tools are severely lacking in most ecosystems.

But when I see people using these AI tools to write JavaScript of Python code wholesale from scratch, that's a huge question mark for me. Because how?? How are you sure that this thing works? How are you sure when you update it won't break? Indeed the answer seems to be "We don't know why it works, we can't tell you under which conditions it will break, we can't give you any performance guarantees because we didn't test or design for those, we can't give you any security guarantees because we don't know what security is and why that's important."

People forgot we're out here trying to do software engineering, not software generation. Eternal September is upon us.

>>emn13+t61
For this to be true everyone must be logically on the same page. They must share the same axioms. Everyone must be operating off the same data and must not make mistakes or have bias evaluating it. Otherwise inevitably sometimes people will arrive at conflicting truths.

In reality it’s messy and not possible with 100% certainty to discern falsehoods and truthoods. Our scientific method does a pretty good job. But it’s not perfect.

You can’t retcon reality and say “well retrospectively we know what happened and one side was just wrong”. That’s called history. It’s not useful or practical working definition of truth when trying to evaluate your possible actions (individually, communally, socially, etc) and make a decision in the moment.

I don’t think it’s accurate to say that we want to redefine truth. I think more accurately truth has inconvenient limitations and it’s arguably really nice most of the time to ignore them.

>>ToValu+wU1
I'd like to second the point made to you in this thread that went without reply: >>43702895

It's true that we use tools with uncertainty all the time, in many domains. But crucially that uncertainty is carefully modeled and accounted for.

For example, robots use sensors to make sense of the world around them. These sensors are not 100% accurate, and therefore if the robots rely on these sensors to be correct, they will fail.

So roboticists characterize and calibrate sensors. They attempt to understand how and why they fail, and under what conditions. Then they attempt to cover blind spots by using orthogonal sensing methods. Then they fuse these desperate data into a single belief of the robot's state, which include an estimate of its posterior uncertainty. Accounting for this uncertainty in this way is what keeps planes in the sky, boats afloat, and driverless cars on course.

With LLMs It seems like we are happy to just throw out all this uncertainty modeling and to leave it up to chance. To draw an analogy to robotics, what we should be doing is taking the output from many LLMs, characterizing how wrong they are, and fusing them into a final result, which is provided to the user with a level of confidence attached. Now that is something I can use in an engineering pipeline. That is something that can be used as a foundation to something bigger.

replies(1): >>ToValu+bA2

>>jimbok+IJ1
Why would computers have just one “point”? They have been used for endless purposes and those uses will expand forever

>>Modern+5g2
>went without reply

Yeah, I was getting a little self-conscious about replying to everyone and repeating myself a lot. It felt like too much noise.

But my first objection here is to repeat myself- none of my examples are sensitive to this problem. I don't need to understand what conditions cause the calculator/IDE/medical test/LLM to fail in order to benefit from a 95% success rate.

If I write a piece of code, I try to understand what it does and how it impacts the rest of the app with high confidence. I'm still going to run the unit test suite even if it has low coverage, and even if I have no idea what the tests actually measure. My confidence in my changes will go up if the tests pass.

This is one use of LLMs for me. I can refactor a piece of code and then send ChatGPT the before and after and ask "Do these do the same thing". I'm already highly confident that they do, but a yes from the AI means I can be more confident. If I get a no, I can read its explanation and agree or disagree. I'm sure it can get this wrong (though it hasn't after n~=100), but that's no reason to abandon this near-instantaneous, mostly accurate double-check. Nor would I give up on unit testing because somebody wrote a test of implementation details that failed after a trivial refactor.

I agree totally that having a good model of LLM uncertainty would make them orders of magnitude better (as would, obviously, removing the uncertainty altogether). And I wouldn't put them in a pipeline or behind a support desk. But I can and do use them for great benefit every day, and I have no idea why I should prefer to throw away the useful thing I have because it's imperfect.

replies(1): >>Modern+286

>>jimbok+nT1
The generation of volumes of boiler plate takes effort; nobody likes to do it.

The problem is, that phase is not the full life cycle of the boiler plate.

You have to live with it afterward.

>>mvieir+EU1
Yeah so I meant the Piketty book, not Marx. But I googled it and turns out it's actually named "Capital in the Twenty-First Century", which disappoints me even more than "On Bullshit"

replies(1): >>pas+654

>>mvieir+EU1
It's very hipster, Das Kapital. (with the dot/period, check the cover https://en.wikipedia.org/wiki/Das_Kapital#/media/File:Zentra... )

But in English it would be just "Capital", right? (The uncountable nouns are rarely used with articles, it's "happiness" not "the happiness". See also https://old.reddit.com/r/writing/comments/12hf5wd/comment/jf... )

>>skrebb+cV3
And, for the full picture it's probably important to consider that the main claim of the book is based on very unreliable data/methodology. (Though note that it does not necessarily make the claim false! See [1])

https://marginalrevolution.com/marginalrevolution/2017/10/pi...

And then later similar claims about inequality were similarly made using bad methodology (data).

https://marginalrevolution.com/marginalrevolution/2023/12/th...

[1] "Indeed, in some cases, Sutch argues that it has risen more than Piketty claims. Sutch is rather a journeyman of economic history upset not about Piketty’s conclusions but about the methods Piketty used to reach those conclusions."

replies(1): >>skrebb+XR5

>>nerdjo+(OP)
I see this fallacy often too.

My company provides hallucination detection software: https://cleanlab.ai/tlm/

But we somehow end up in sales meetings where the person who requested the meeting claims their AI does not hallucinate ...

>>pas+654
You misunderstand. I never read it. I simply liked the title, at least before I understood "Capital" that wasn't actually the title.

>>ToValu+bA2
> none of my examples are sensitive to this problem.

That's not true. You absolutely have to understand those conditions because when you try to use those things outside of their operating ranges, they fail at a higher than the nominal rate.

> I'm still going to run the unit test suite even if it has low coverage, and even if I have no idea what the tests actually measure. My confidence in my changes will go up if the tests pass.

Right, your confidence goes up because you know that if the test passes, that means the test passed. But if the test suite can probabilistically pass even though some or all of the tests actually fail, then you will have to fall back to the notions of systematic risk management in my last post.

> I can refactor a piece of code and then send ChatGPT the before and after and ask "Do these do the same thing". I'm already highly confident that they do, but a yes from the AI means I can be more confident. If I get a no, I can read its explanation and agree or disagree. I'm sure it can get this wrong (though it hasn't after n~=100)

This n is very very small for you to be confident the behavior is as consistent as you expect. In fact, it gets this wrong all the time. I use AI in a class environment so I see n=100 on a single day. When you get to n~1k+ you see all of these problems where it says things are one way but really thing are another.

> mostly accurate double-check

And that's the problem right there. You can say "mostly accurate" but you really have no basis to assert this, past your own experience. And even if it's true, we still need to understand how wrong it can be, because mostly accurate with a wild variance is still highly problematic.

> But I can and do use them for great benefit every day, and I have no idea why I should prefer to throw away the useful thing I have because it's imperfect.

Sure, they can be beneficial. And yes, we shouldn't throw them out. But that wasn't my original point, I wasn't suggesting that. What I had said was that they cannot be relied on, and you seem to agree with me in that.

>>yodsan+ia
> But I need to iterate a few times so that the code looks like what I want.

The LLM too. You can get a pretty big improvement by telling the LLM to "iterate 4 times on whichever code I want you to generate, but only show me the final iteration, and then continue as expected".

I personally just inject the request for 4 iterations into the system prompt.

>>nottor+Zh1
Real code has dependencies and they sometimes change, including growing undocumented behavior or new bugs.

>>lyngui+oZ
> The model genuinely believes it’s giving a correct reasoning chain

The model doesn't "genuinely believe" anything.

>>johnis+9Z
After a recent update, so does WhatsApp. There is "search with Meta AI". Probably will be rolled out or already is for all Meta products.

>>emn13+471

    > They could have launched a submarine non-apple-branded product to test the waters.

This is a great idea. Are there any past Apple (or non-Apple) examples of this product release strategy?

>>nerdjo+(OP)
People hallucinate all the time out of pressure or habit. We don't need AI for that. It's hard to tell most people from AI. Most people would fail Turing tests as subjects.

>>emn13+t61
Validity is not soundness. Wonder why people are just beginning to realize what logicians have been studying for more than a century. This goes to show that most programming was never based on logic but vibes. People have been vibe coding with themselves before AI became prominent.

>>CodesI+L61
Exactly, most of us behave in almost the same as AI does. We finally have a mirror to reflect upon.

>>ryandr+oo
Edsgar Dijkstra!