zlacker

> You're not supposed to trust the tool

This is just an incredible statement. I can't think of another development tool we'd say this about. I'm not saying you're wrong, or that it's wrong to have tools we can't trust, just... wow... what a sea change.

replies(5): >>theone+C2 >>ryandr+86 >>shipp0+Gc >>tevon+xd >>Modern+lg

>>schmic+(OP)
> I can't think of another development tool we'd say this about.

Because no other dev tool actually generates unique code like AI does. So you treat it like the other components of your team that generate code: the other developers. Do you trust other developers to write good code without mistakes without getting it reviewed by others? Of course not.

replies(4): >>anonym+j3 >>chrisw+s3 >>forget+Y4 >>seabir+i5

>>theone+C2
I trust my colleagues to write code that compiles, at the very least

replies(1): >>Modern+yg

>>theone+C2
But of course everyone absolutely NEEDS to use AI for code reviews! How else could the huge volume of AI-generated code be managed?

>>theone+C2
"Do you trust other developers to write good code without mistakes without getting it reviewed by others."

Literally yes. Test coverage and QA to catch bugs sure but needing everything manually reviewed by someone else sounds like working in a sweatshop full of intern-level code bootcamp graduates, or if you prefer an absolute dumpster fire of incompetence.

replies(2): >>ryandr+H6 >>theone+m7

>>theone+C2
Yes, actually, I do! I trust my teammates with tens of thousands of hours of experience in programming, embedded hardware, our problem spaces, etc. to write from a fully formed worldview, and for their code to work as intended (as far as anybody can tell before it enters preliminary testing by users) by the time the rest of the team reviews it. Most code review is uneventful. Have some pride in your work and you'll be amazed at what's possible.

replies(1): >>theone+Cl

>>schmic+(OP)
Imagine if your compiler just randomly and non-deterministically compiled valid code to incorrect binaries, and the tool's developer couldn't really tell you why it happens, how often it was expected to happen, how severe the problem was expected to be, and told you to just not trust your compiler to create correct machine code.

Imagine if your calculator app randomly and non-deterministically performed arithmetic incorrectly, and you similarly couldn't get correctness expectations from the developer.

Imagine if any of your communication tools randomly and non-deterministically translated your messages into gibberish...

I think we'd all throw away such tools, but we are expected to accept it if it's an "AI tool?"

replies(4): >>andrei+2a >>ToValu+7b >>arvins+Hs >>learni+x4m

>>forget+Y4
I would accept mistakes and inconsistency from a human, especially one not very experienced or skilled. But I expect perfection and consistency from a machine. When I command my computer to do something, I expect it to do it correctly, the same way every time, to convert a particular input to an exact particular output, every time. I don't expect it to guess, or randomly insert garbage, or behave non-deterministically. Those things are called defects (bugs) and I'd want them to be fixed.

replies(3): >>senord+6a >>forget+db >>tevon+Ud

>>forget+Y4
Ok, here I thought requiring PR review and approval before merging was standard industry best practice. I guess all the places I've worked have been doing it wrong?

replies(1): >>forget+8c

>>ryandr+86
Imagine that you yourself never use these tools directly but your employees do. And the sellers of said tools swear that the tools are amazing and correct and will save you millions.

They keep telling you that any employee who highlights problems with the tools is just trying to save their job.

Your investors tell you that the toolmakers are already saving money for your competitors.

Now, do you want that second house and white lotus vacation or not?

Making good tools is difficult. Bending perception (“is reality”) is easier and enterprise sales, just like good propaganda, work. The gold rush will leave a lot of bodies behind but the shovelmakers will make a killing.

replies(1): >>Modern+qg

>>ryandr+H6
Then you are going to hate the future.

replies(1): >>ryandr+iD1

>>ryandr+86
If the only calculators that existed failed at 5% of the calculations, or if the only communication tools miscommunicated 5% of the time, we would still use both all the time. They would be far less than 95% as useful as perfect versions, but drastically better than not having the tools at all.

replies(1): >>gitrem+zd

>>ryandr+H6
Exactly this.

>>theone+m7
There's a lot of shit that has become "best practice" over the last 15 years, and a lot more that was "best practice" but fell out of favor because reasons. All of it exists on a continuum of what is actually reasonable given the circumstances. Reviewing pull requests is one of those things that is reasonable af in theory, produces mediocre results in practice, and is frequently nothing more than bureaucratic overhead. Consider a case where an individual adds a new feature to an existing codebase. Given they are almost certainly the only one who has spent significant time researching the particulars of the feature set in question, and are the only individual with any experience at all with the new code, having another developer review it means you've got inexperienced, low-info eyes examining something they do not fully understand, and will have to take some amount of time to come up to speed on. Sure they'll catch obvious errors, but so would a decent test suite.

Am I arguing in favor of egalitarian commit food fights with no adults in the room? Absolutely not. But demanding literally every change go through a formal review process before getting committed, like any other coding dogma, has a tendency to generate at least as much bullshit as it catches, just a different flavor.

replies(2): >>rixed+Tl >>Tainno+9L

>>schmic+(OP)
In Mechanical Engineering, this is 100% a thing with fluid dynamics simulation. You need to know if the output is BS based on a number of factors that I don't understand.

>>schmic+(OP)
Stack Overflow is like this: you read an answer but are not fully sure if it's right or if it fits your needs.

Of course there is a review system for a reason, but we frequently use "untrusted" tools in development.

That one guy in a github issue that said "this worked for me"

>>ToValu+7b
Absolutely not. We'd just do the calculations by hand, which is better than running the 95%-correct calculator and then doing the calculations by hand anyway to verify its output.

replies(1): >>ToValu+Iq

>>ryandr+H6
This seems like a particularly limited view of what a machine is. Specifically expecting it to behave deterministically.

replies(2): >>Modern+Sg >>forget+9v1

>>schmic+(OP)
Imagine! Imagine if 0.05% of the time gcc just injected random code into your binaries. Imagine, you swing a hammer and 1% of the time it just phases into the wall. Tools are supposed to be reliable.

replies(1): >>arvins+Us

>>andrei+2a
I feel like there's a lot of motivated reasoning going on, yeah.

>>anonym+j3
Oh at the very least I trust them to not take code that compiles and immediately assess that it's broken.

>>tevon+Ud
Still, the whole Unix philosophy of building tools starts with a foundation of building something small that can do one thing well. If that is your foundation, you can take advantage of composability and create larger tools that are more capable. The foundation of all computing today is built on this principle of design.

Building on AI seems more like building on a foundation of sand, or building in a swamp. You can probably put something together, but it's going to continually sink into the bog. Better to build on a solid foundation, so you don't have to continually stop the thing from sinking, so you can build taller.

>>seabir+i5
So you're saying that yes, you do "trust other developers to write good code without mistakes without getting it reviewed by others."

And then you say "by the time the rest of the team reviews it. Most code review is uneventful."

So you trust your team to develop without the need for code review but yet, your team does code review.

So what is the purpose of these code reviews? Is it the case that you actually don't think they are necessary, but perhaps management insists on them? You actually answer this question yourself:

> Most code review is uneventful.

Keyword here is "most" as opposed to "all". So based on your team's applied practices and your own words, code review is for the purpose of catching mistakes and other needed corrections.

But it seems to me if you trust your team not to make mistakes, code review is superfluous.

As an aside, it seems your team culture doesn't make room for juniors because if your team had juniors I think it would be even more foolish to trust them not to make mistakes. Maybe a junior free culture works for your company, but that's not the case for every company.

My main point is code review is not superfluous no matter the skill level; junior, senior, or AI simply because everyone and every AI makes mistakes. So I don't trust those three classes of code emitters to not ever make mistakes or bad choices (i.e. be perfect) and therefore I think code review is useful.

Have some honesty and humility and you'll be amazed at what's possible.

replies(1): >>seabir+6r

>>forget+8c
And there is worse: in cases where the reviewer actually has some knowledge of the problem at hand, she might say "oh you did all this to add that feature? But it's actually already there. You just had to include that file and call function xyz". Or "oh but two months ago that very same topic was discussed and it was decided that it would make more sense to wait for module xyz to be refactored in order to make it easier", etc.

>>gitrem+zd
Suppose you work in a field where getting calculations right is critical. Your engineers make mistakes less than .01% of the time, but they do a lot of calculations and each mistake could cost $millions or lives. Double- and triple-checking help a lot, but they're costly. Here's a machine that verifies 95% of calculations, but you'd still have to do 5% of the work. Shall I throw it away?

Unreliable tools have a good deal of utility. That's an example of them helping reduce the problem space, but they also can be useful in situations where having a 95% confidence guess now matters more than a 99.99% confidence one in ten minutes (firing mortars in active combat, say).

There are situations where validation is easier than computation; canonically this is factoring, but even multiplication is much simpler than division. It could very easily save you time to multiply all of the calculator's output by the divisor, while performing both a multiplication and a division for the 5% that are wrong.
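To make the division case concrete, here is a minimal sketch. Checking a claimed quotient and remainder costs one multiplication and one addition, so only the answers that fail the check need to be redone the slow way. The `unreliable_divmod` stand-in, its 5% error rate, and the restriction to positive divisors are all assumptions for illustration, not a model of any real tool.

```python
import random

def unreliable_divmod(a, b, rng, error_rate=0.05):
    """Hypothetical fast-but-flaky divider: sometimes returns a wrong quotient."""
    q, r = divmod(a, b)
    if rng.random() < error_rate:
        q += 1  # simulate a silent wrong answer
    return q, r

def check(a, b, q, r):
    """Cheap validation: one multiplication and one addition (assumes b > 0)."""
    return 0 <= r < b and q * b + r == a

def divide_all(pairs, rng):
    """Use the flaky divider, validate every answer, redo only the failures."""
    results = []
    redone = 0
    for a, b in pairs:
        q, r = unreliable_divmod(a, b, rng)
        if not check(a, b, q, r):
            q, r = divmod(a, b)  # fall back to the trusted, slower path
            redone += 1
        results.append((q, r))
    return results, redone
```

The point of the design is that every answer is validated, but the expensive operation is only repeated for the ones that fail validation.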

edit: I submit this comment and click to go to the front page and right at the top is Unsure Calculator (no relevance). Sorry, I had to mention this

replies(4): >>diputs+Wt >>mrheos+qw >>Tainno+RJ >>jimbok+DC1

>>theone+Cl
I never said that code review was useless, I said "yes, I do" to your question as to whether or not I "trust other developers to write good code without mistakes without getting it reviewed by others". Of course I can trust them to do the right thing even when nobody's looking, and review it anyway in the off-chance they overlooked something. I can't trust AI to do that.

The purpose of the review is to find and fix occasional small details before it goes to physical testing. It does not involve constant babysitting of the developer. It's a little silly to bring up honesty when you spent that entire comment dancing around the reality that AI makes an inordinately large number of mistakes. I will pick the domain expert who refuses to touch AI over a generic programmer with access to it ten times out of ten.

The entire team as it is now (me included) were juniors. It's a traditional engineering environment in a location where people don't aggressively move between jobs at the drop of a hat. You don't need to constantly train younger developers when you can retain people.

replies(1): >>theone+rt

>>ryandr+86
If you think of AI like a compiler, yes we should throw away such tools because we expect correctness and deterministic outcomes

If you think of AI like a programmer, no we shouldn't throw away such tools because we accept them as imperfect and we still need to review.

replies(1): >>bigstr+pt

>>Modern+lg
There are no existing AI tools that guarantee correct code 100% of the time.

If there is such a tool, programmers will be on a path of immediate reskilling or lose their jobs very quickly.

>>arvins+Hs
> If you think of AI like a programmer, no we shouldn't throw away such tools because we accept them as imperfect and we still need to review.

This is a common argument but I don't think it holds up. A human learns. If one of my teammates or I make a mistake, when we realize it we learn not to make that mistake in the future. These AI tools don't do that. You could use a model for a year, and it'll be just as unreliable as it is today. The fact that they can't learn makes them a nonstarter compared to humans.

>>seabir+6r
You spend your comment dancing around the fact that everyone makes mistakes and yet you claim you trust your team not to make mistakes.

> I "trust other developers to write good code without mistakes without getting it reviewed by others". Of course I can trust them to do the right thing even when nobody's looking, and review it anyway in the off-chance they overlooked something.

You're saying yes, I trust other developers to not make mistakes, but I'll check anyways in case they do. If you really trusted them not to make mistakes, you wouldn't need to check. They (eventually) will. How can I assert that? Because everyone makes mistakes.

It's absurd to expect anyone to not make mistakes. Engineers build whole processes to account for the fact that people, even very smart people make mistakes.

And it's not even just about mistakes. Often times, other developers have more context, insight or are just plain better and can offer suggestions to improve the code during review. So that's about teamwork and working together to make the code better.

I fully admit AI makes mistakes, sometimes a lot of them. So it needs code review. And on the other hand, sometimes AI can really be good at enhancing productivity, especially in areas of repetitive drudgery, so the developer can focus on higher level tasks that require more creativity and wisdom, like architectural decisions.

> I will pick the domain expert who refuses to touch AI over a generic programmer with access to it ten times out of ten.

I would too, but I won't trust them not to make mistakes or occasional bad decisions because again, everybody does.

> You don't need to constantly train younger developers when you can retain people.

But you do need to train them initially. Or do you just trust them to write good code without mistakes too?

>>ToValu+Iq
> Here's a machine that verifies 95% of calculations, but you'd still have to do 5% of the work.

The problem is that you don't know which 5% are wrong. The AI is confidently wrong all the time. So the only way to be sure is to double check everything, and at some point it's easier to just do it the right way.

Sure, some things don't need to be perfect. But how much do you really want to risk? This company thought a little bit of potential misinformation was acceptable, and so it caused a completely self inflicted PR scandal, pissed off their customer base, and lost them a lot of confidence and revenue. Was that 5% error worth it?

Stories like this are going to keep coming the more we rely on AI to do things humans should be doing.

Someday you'll be affected by the fallout of some system failing because you happen to wind up in the 5% failure gap that some manager thought was acceptable (if that manager even ran a calculation and didn't just blindly trust whatever some other AI system told them). I just hope it's something as trivial as an IDE and not something in your car, your bank, or your hospital. But certainly LLMs will be irresponsibly shoved into all three within the next few years, if they're not there already.

replies(1): >>ToValu+0o1

>>ToValu+Iq
> you'd still have to do 5% of the work

No, you still have to do 100% of the work.

replies(1): >>ToValu+Rp1

>>ToValu+Iq
> Unreliable tools have a good deal of utility.

This is generally true when you can quantify the unreliability. E.g. random prime number tests with a specific error rate can be combined so that the error rates multiply and become negligible.
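A minimal sketch of that idea, assuming the Miller-Rabin test as the "random prime number test" (the comment doesn't name one). Each independent round wrongly passes a composite with probability at most 1/4, so `rounds` rounds bound the error by 4^-rounds. The function names are illustrative.

```python
import random

def miller_rabin_round(n, a):
    """One round with base a: False means n is definitely composite."""
    d, r = n - 1, 0
    while d % 2 == 0:  # write n - 1 as d * 2**r with d odd
        d //= 2
        r += 1
    x = pow(a, d, n)
    if x == 1 or x == n - 1:
        return True
    for _ in range(r - 1):
        x = pow(x, 2, n)
        if x == n - 1:
            return True
    return False

def is_probably_prime(n, rounds=20, rng=None):
    """Repeat independent rounds; any single failure proves n composite."""
    if n < 2:
        return False
    for p in (2, 3, 5, 7):
        if n % p == 0:
            return n == p
    rng = rng or random.Random()
    return all(miller_rabin_round(n, rng.randrange(2, n - 1))
               for _ in range(rounds))

def error_bound(rounds):
    """Upper bound on the chance that a composite survives every round."""
    return 0.25 ** rounds
```

With the default 20 rounds the bound is about 1e-12, which is what makes the unreliable test usable: the per-round error rate is known, so the combined one can be driven as low as needed.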

I'm not aware that we can quantify the uncertainty coming out of LLM tools reliably.

>>forget+8c
Code review is actually one of the few practices for which research does exist[0] which points in the direction of it being generally good at reducing defects.

Additionally, in the example you share, where only one person knows the context of the change, code review is an excellent tool for knowledge sharing.

[0]: https://dl.acm.org/doi/10.1145/2597073.2597076, for example

replies(1): >>forget+uw1

>>diputs+Wt
>The problem is that you don't know which 5% are wrong

This is not a problem in my unreliable calculator use-cases; are you disputing that or dropping the analogy?

Because I'd love to drop the analogy. You mention IDEs- I routinely use IntelliJ's tab completion, despite it being wrong >>5% of the time. I have to manually verify every suggestion. Sometimes I use it and then edit the final term of a nested object access. Sometimes I use the completion by mistake, clean up with backspace instead of undo, and wind up submitting a PR that adds an unused dependency. I consider it indispensable to my flow anyway. Maybe others turn this off?

You mention hospitals. Hospitals run loads of expensive tests every day with a greater than 5% false positive and false negative rate. Sometimes these results mean a benign patient undergoes invasive further testing. Sometimes a patient with cancer gets told they're fine and sent home. Hospitals continue to run these tests, presumably because having a 20x increase in specificity is helpful to doctors, even if it's unreliable. Or maybe they're just trying to get more money out of us?

Since we're talking LLMs again, it's worth noting that 95% is an underestimate of my hit rate. 4o writes code that works more reliably than my coworker does, and it writes more readable code 100% of the time. My coworker is net positive for the team. His 2% mistake rate is not enough to counter the advantage of having someone there to do the work.

An LLM with a 100% hit rate would be phenomenal. It would save my company my entire salary. A 99% one is way worse; they still have to pay me to use it. But I find a use for the 99% LLM more-or-less every day.

replies(1): >>gitrem+zx1

>>mrheos+qw
You simply do not. You do the math yourself to calculate 2(n) for n in [1, 2, 3, 4] and get [2, 5, 6, 8]. You plug it into your (75% accurate) unreliable calculator and get [3, 4, 6, 8]. You now know that you only need to recheck the first two (50%) of the entries.
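As a sketch of that cross-check, using the numbers above (the function name is made up for illustration): compare the two lists and recheck only the positions where they disagree. This assumes the human and the calculator make independent errors; if both are wrong in the same way at the same position, the disagreement check won't catch it.

```python
def entries_to_recheck(manual, machine):
    """Return the indices where the two independent results disagree."""
    return [i for i, (m, c) in enumerate(zip(manual, machine)) if m != c]

manual = [2, 5, 6, 8]   # done by hand, one mistake
machine = [3, 4, 6, 8]  # from the unreliable calculator, one mistake
print(entries_to_recheck(manual, machine))  # prints [0, 1]
```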

replies(1): >>throww+sx1

>>tevon+Ud
Would you welcome your car behaving in a nondeterministic fashion?

>>Tainno+9L
Oh I have no doubt it's an excellent tool for knowledge sharing. So are mailing lists (nobody reads email) and internal wikis (evergreen fist fight to get someone, anyone, to update). Despite best intentions knowledge sharing regimes are little more than well-intentioned pestering with irrelevant information that is absolutely purged from headspace during any number of daily/weekly/quarterly context switches. As I said, mediocre results.

>>ToValu+Rp1
I resent becoming QA/QC for the machine instead of doing the same or better thinking myself.

replies(1): >>ToValu+xD1

>>ToValu+0o1
> This is not a problem in my unreliable calculator use-cases; are you disputing that or dropping the analogy?

If you use an unreliable calculator to sum a list of numbers, you then need to use a reliable method to sum the numbers to validate that the unreliable calculator's sum is correct or incorrect.

replies(1): >>ToValu+gC1

>>gitrem+zx1
Yes, so in my first example in the GP, this happens first. Humans do the work. The calculator double checks and gives me a list of all errors plus 5% of the non-errors, and I only need to double check that list.

In my third example, the calculator does the hard work of dividing, and humans can validate by the simpler task of multiplication, only having to do extra work 5% of the time.

(In my second, the unreliability is a trade-off against speed, and we need the speed more.)

In all cases, we benefit from the unreliable tool despite not knowing when it is unreliable.

replies(1): >>Modern+PX1

>>ToValu+Iq
> Here's a machine that verifies 95% of calculations

Which 95% did it get right?

>>senord+6a
Way ahead of you. I already hate the present, at least the current sad state of the software industry.

>>throww+sx1
This is fair. I expect you would resent the tool even more if it was perfect and you couldn't even land a job in QA anymore. If that's the case, your resentment doesn't reflect on the usefulness of LLMs.

>>ToValu+gC1
I'd like to second the point made to you in this thread that went without reply: >>43702895

It's true that we use tools with uncertainty all the time, in many domains. But crucially that uncertainty is carefully modeled and accounted for.

For example, robots use sensors to make sense of the world around them. These sensors are not 100% accurate, and therefore if the robots rely on these sensors to be correct, they will fail.

So roboticists characterize and calibrate sensors. They attempt to understand how and why they fail, and under what conditions. Then they attempt to cover blind spots by using orthogonal sensing methods. Then they fuse these disparate data into a single belief of the robot's state, which includes an estimate of its posterior uncertainty. Accounting for this uncertainty in this way is what keeps planes in the sky, boats afloat, and driverless cars on course.
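The simplest version of that fusion step can be sketched as inverse-variance weighting. It assumes each sensor's error is independent, unbiased, and characterized by a known variance; real systems typically use something richer such as a Kalman filter. The function name and the readings are made up for illustration.

```python
def fuse(estimates):
    """Combine (value, variance) pairs into one estimate plus its variance.

    Each reading is weighted by 1/variance, so noisier sensors count less.
    """
    weights = [1.0 / var for _, var in estimates]
    total = sum(weights)
    value = sum(w * v for w, (v, _) in zip(weights, estimates)) / total
    return value, 1.0 / total

# Two hypothetical range sensors reading the same distance (metres, variance).
value, variance = fuse([(10.0, 4.0), (12.0, 4.0)])
print(value, variance)  # prints 11.0 2.0
```

The fused variance is smaller than either input variance, which is the payoff of characterizing each sensor's error in the first place.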

With LLMs it seems like we are happy to just throw out all this uncertainty modeling and to leave it up to chance. To draw an analogy to robotics, what we should be doing is taking the output from many LLMs, characterizing how wrong they are, and fusing them into a final result, which is provided to the user with a level of confidence attached. Now that is something I can use in an engineering pipeline. That is something that can be used as a foundation to something bigger.

replies(1): >>ToValu+Vh2

>>Modern+PX1
>went without reply

Yeah, I was getting a little self-conscious about replying to everyone and repeating myself a lot. It felt like too much noise.

But my first objection here is to repeat myself- none of my examples are sensitive to this problem. I don't need to understand what conditions cause the calculator/IDE/medical test/LLM to fail in order to benefit from a 95% success rate.

If I write a piece of code, I try to understand what it does and how it impacts the rest of the app with high confidence. I'm still going to run the unit test suite even if it has low coverage, and even if I have no idea what the tests actually measure. My confidence in my changes will go up if the tests pass.

This is one use of LLMs for me. I can refactor a piece of code and then send ChatGPT the before and after and ask "Do these do the same thing". I'm already highly confident that they do, but a yes from the AI means I can be more confident. If I get a no, I can read its explanation and agree or disagree. I'm sure it can get this wrong (though it hasn't after n~=100), but that's no reason to abandon this near-instantaneous, mostly accurate double-check. Nor would I give up on unit testing because somebody wrote a test of implementation details that failed after a trivial refactor.

I agree totally that having a good model of LLM uncertainty would make them orders of magnitude better (as would, obviously, removing the uncertainty altogether). And I wouldn't put them in a pipeline or behind a support desk. But I can and do use them for great benefit every day, and I have no idea why I should prefer to throw away the useful thing I have because it's imperfect.

replies(1): >>Modern+MP5

>>ToValu+Vh2
> none of my examples are sensitive to this problem.

That's not true. You absolutely have to understand those conditions because when you try to use those things outside of their operating ranges, they fail at a higher rate than nominal.

> I'm still going to run the unit test suite even if it has low coverage, and even if I have no idea what the tests actually measure. My confidence in my changes will go up if the tests pass.

Right, your confidence goes up because you know that if the test passes, that means the test passed. But if the test suite can probabilistically pass even though some or all of the tests actually fail, then you will have to fall back to the notions of systematic risk management in my last post.

> I can refactor a piece of code and then send ChatGPT the before and after and ask "Do these do the same thing". I'm already highly confident that they do, but a yes from the AI means I can be more confident. If I get a no, I can read its explanation and agree or disagree. I'm sure it can get this wrong (though it hasn't after n~=100)

This n is very, very small for you to be confident the behavior is as consistent as you expect. In fact, it gets this wrong all the time. I use AI in a class environment so I see n=100 on a single day. When you get to n~1k+ you see all of these problems where it says things are one way but really things are another.
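To put a rough number on "very small", here is a sketch that assumes the trials are independent and share one fixed failure rate p. Seeing zero failures in n trials has probability (1 - p)^n, so the largest p still consistent with that at 95% confidence is 1 - 0.05^(1/n), which is roughly 3/n (the "rule of three"). The function name is illustrative.

```python
def failure_rate_upper_bound(n_trials, confidence=0.95):
    """Largest per-trial failure rate consistent with 0 failures in n_trials."""
    return 1.0 - (1.0 - confidence) ** (1.0 / n_trials)

for n in (100, 1000, 10000):
    print(n, round(failure_rate_upper_bound(n), 5))
```

So zero failures in about 100 tries cannot rule out a failure rate close to 3%, and the bound only shrinks in proportion to 1/n.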

> mostly accurate double-check

And that's the problem right there. You can say "mostly accurate" but you really have no basis to assert this, past your own experience. And even if it's true, we still need to understand how wrong it can be, because mostly accurate with a wild variance is still highly problematic.

> But I can and do use them for great benefit every day, and I have no idea why I should prefer to throw away the useful thing I have because it's imperfect.

Sure, they can be beneficial. And yes, we shouldn't throw them out. But that wasn't my original point, I wasn't suggesting that. What I had said was that they cannot be relied on, and you seem to agree with me in that.

>>ryandr+86
Edsger Dijkstra!