War story: the hardest bug I ever debugged

submitted by jakevo+(OP) on 2025-03-24 14:37:46 | 462 points 188 comments

13. nneonn+au7[view] [source] 2025-03-27 04:30:16
>>jakevo+(OP)
FWIW: this type of bug in Chrome is exploitable to create out-of-bounds array accesses in JIT-compiled JavaScript code.

The JIT compiler contains passes that will eliminate unnecessary bounds checks. For example, if you write “var x = Math.abs(y); if(x >= 0) arr[x] = 0xdeadbeef;”, the JIT compiler will probably delete the if statement and the internal nonnegative array index check inside the [] operator, as it can assume that x is nonnegative.

However, if Math.abs is then “optimized” such that it can produce a negative number, then the lack of bounds checks means that the code will immediately access a negative array index - which can be abused to rewrite the array’s length and enable further shenanigans.

Further reading about a CVE pretty much exactly in this mold (this one is in WebKit's JavaScriptCore rather than Chrome): https://shxdow.me/cve-2020-9802/
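
A minimal sketch of how an "optimized" abs could end up negative, assuming a hypothetical fast path that works purely on 32-bit integers. The name `int32Abs` and the bit trick are illustrative and not taken from any engine's source. In two's complement the most negative int32 has no positive counterpart, so negating it wraps back to itself:

```javascript
// Hypothetical int32-only fast path for abs(); illustrative, not engine code.
function int32Abs(x) {
  const mask = x >> 31;            // 0 for non-negative x, -1 for negative x
  return ((x ^ mask) - mask) | 0;  // "| 0" truncates to int32, like 32-bit machine arithmetic
}

const INT32_MIN = -2147483648;

console.log(int32Abs(-5));         // 5
console.log(Math.abs(INT32_MIN));  // 2147483648 (the double-based result is correct)
console.log(int32Abs(INT32_MIN));  // -2147483648: a "non-negative" value that is negative
```

If later passes still assume the result is non-negative, that one input is enough to get a negative index past the removed checks.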

◧◩
15. Terr_+bv7[view] [source] [discussion] 2025-03-27 04:44:37
>>BobbyT+Hp7
This repro was a few times per day, but try fixing a Linux kernel panic when you don't even have C/C++ on your resume, and everyone who originally set stuff up has left...

>>37859771

Point being that the difficulty of a fix can come from many possible places.

◧◩◪
27. nneonn+MD7[view] [source] [discussion] 2025-03-27 06:48:13
>>saghm+7z7
Normally, there would be a bounds check to ensure that the index was actually non-negative; negative indices get treated as property accesses instead of array accesses (unlike e.g. Python where they would wrap around).
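
The property-access behavior is easy to check in plain JavaScript. This is only a sketch of the normal, checked semantics and says nothing about what miscompiled JIT code does:

```javascript
const arr = [10, 20, 30];
arr[-1] = 0xdeadbeef;  // not an element write: creates an ordinary property named "-1"

console.log(arr.length);        // 3
console.log(Object.keys(arr));  // [ '0', '1', '2', '-1' ]
console.log(arr[-1]);           // 3735928559
```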

However, if the JIT compiler has "proven" that the index is never negative (because it came from Math.abs), it may omit such checks. In that case, the resulting access to e.g. arr[-1] may directly access the memory that sits one position before the array elements - which could, for example, be part of the array metadata, such as the length of the array.

You can read the comments on the sample CVE's proof-of-concept to see what the JS engine "thinks" is happening, vs. what actually happens when the code is executed: https://github.com/shxdow/exploits/blob/master/CVE-2020-9802.... This exploit is a bit more complicated than my description, but uses a similar core idea.

45. chroma+YL7[view] [source] 2025-03-27 08:40:36
>>jakevo+(OP)
Great writeup :)

It's like this https://geek-and-poke.com/geekandpoke/2017/8/13/just-happene... but actually true, which is really bad for mental health :D

◧◩◪◨
48. latexr+aQ7[view] [source] [discussion] 2025-03-27 09:33:05
>>dharma+jK7
https://www.youtube.com/watch?v=fE2KDzZaxvE
53. latexr+LR7[view] [source] 2025-03-27 09:53:12
>>jakevo+(OP)
> I do it a few more times. It’s not always the 20th iteration, but it usually happens sometime between the 10th and 40th iteration. Sometimes it never happens. Okay, the bug is nondeterministic.

That’s an incorrect assumption. Just because your test case isn’t triggering the bug reliably, it does not mean the bug is nondeterministic.

That is like saying the “OpenOffice can’t print on Tuesdays” bug is nondeterministic because you can’t reproduce it every day. It is deterministic; you just need to find the right set of circumstances.

https://beza1e1.tuxen.de/lore/print_on_tuesday.html

From the writing it appears the author found one way to reproduce the bug sometimes and then relied on it for every test. Another approach would have been to tweak their test case until they found a situation which reproduced the bug more or less often, trying to find the threshold that causes it and continuing to deduce from there.
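
A rough sketch of that approach, with everything in it hypothetical: `buggyPrint` stands in for the code under test (here it fails only on Tuesdays, UTC), and `failureRateBy` is an illustrative helper that groups the failure rate by whichever factor you suspect. Overall the bug looks flaky, but grouped by the right factor it is all-or-nothing:

```javascript
// Hypothetical stand-in for the buggy code: it fails only on Tuesdays (UTC).
function buggyPrint(date) {
  if (date.getUTCDay() === 2) throw new Error("print failed");
  return "ok";
}

// Run the test once per input and report the failure rate per candidate factor.
function failureRateBy(inputs, keyOf, test) {
  const stats = new Map();
  for (const input of inputs) {
    const key = keyOf(input);
    const s = stats.get(key) ?? { runs: 0, failures: 0 };
    s.runs += 1;
    try {
      test(input);
    } catch {
      s.failures += 1;
    }
    stats.set(key, s);
  }
  const rates = {};
  for (const [key, s] of stats) rates[key] = s.failures / s.runs;
  return rates;
}

// 28 consecutive days starting Monday 2025-03-24 (UTC).
const days = Array.from({ length: 28 }, (_, i) => new Date(Date.UTC(2025, 2, 24 + i)));

const overall = failureRateBy(days, () => "all", buggyPrint);
const byWeekday = failureRateBy(days, (d) => d.getUTCDay(), buggyPrint);

console.log(overall);    // one failure in seven runs: looks "nondeterministic"
console.log(byWeekday);  // weekday 2 fails every time, the others never
```

In a real investigation the grouping key is the thing you are guessing at (input size, iteration count, cache state, and so on); a factor whose groups split cleanly into 0% and 100% is the one to chase.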

60. latexr+AT7[view] [source] 2025-03-27 10:12:04
>>jakevo+(OP)
It’s amusing how so many of the comments here are like “You think two days is hard? Well, I debugged a problem which was passed down to me by my father, and his father before him”. It reminds me of the Four Yorkshiremen sketch.

https://youtube.com/watch?v=sGTDhaV0bcw

The author’s “error”, of course, was calling it “the hardest bug I ever debugged”. It drives clicks, but comparisons too.

86. jason_+Ka8[view] [source] 2025-03-27 12:47:20
>>jakevo+(OP)
Reminds me of the classic bug story where users couldn’t send emails more than 500 miles.

https://web.mit.edu/jemorris/humor/500-miles

◧◩◪◨
116. amne+Cp8[view] [source] [discussion] 2025-03-27 14:20:40
>>latexr+Ik8
With the appropriate butterfly wing flap everything is deterministic.

https://xkcd.com/378/

◧◩◪
142. somat+yT8[view] [source] [discussion] 2025-03-27 17:27:37
>>seeing+cH8
One of my favorite man pages is scan_ffs https://man.openbsd.org/scan_ffs

    The basic operation of this program is as follows:

    1. Panic. You usually do so anyways, so you might as well get it over with. Just don't do anything stupid. Panic away from your machine. Then relax, and see if the steps below won't help you out.

    2. ...
◧◩◪
149. parlia+hX8[view] [source] [discussion] 2025-03-27 17:48:51
>>seeing+cH8
No, QR codes are auto-orienting[1]. If you're getting a different reading at different orientations, there is a bug in your scanner.

[1] https://en.wikipedia.org/wiki/QR_code#Design

◧◩
152. margin+xZ8[view] [source] [discussion] 2025-03-27 18:02:06
>>aetimm+cq8
Days-taken-to-fix is kind of a weird measure for how difficult a bug is. It's clearly a function of a large number of things that aren't the bug itself, including experience and whether you have to go it alone or can talk to the right people.

The bug ticks most of the boxes for a tricky bug:

* Non-deterministic

* Enormous haystack

* Unexpected "1+1=3"-type error with a cause outside of the code itself

Like sure, it would have been slower to debug if it took 30 hours to reproduce, and harder if he had to go down Niagara Falls in a barrel while debugging it, but I'm not sure those things quite count.

I had a similar category of bug I was struggling with the other year[1] that was related to a faulty optimization in the GraalVM JVM leading to bizarre behavior in very rare circumstances. If I'd been sitting next to the right JVM engineers over at Oracle I'm sure we'd have figured it out in days and not the weeks it took me.

[1] https://www.marginalia.nu/log/a_104_dep_bug/

◧◩◪◨
171. egyptu+od9[view] [source] [discussion] 2025-03-27 19:16:49
>>parlia+hX8
It does seem to be possible to design QR codes that scan differently depending on the orientation, though they look a little visibly malformed.

https://hackaday.com/2025/01/23/this-qr-code-leads-to-two-we...

◧◩
175. decima+Xi9[view] [source] [discussion] 2025-03-27 19:49:07
>>jason_+Ka8
Crashes only on Wednesdays:

https://gyrovague.com/2015/07/29/crashes-only-on-wednesdays/
