I'm seeing a lot of comments saying "only 2 days? must not have been that bad of a bug". Some thoughts here:
At my current day job, our postmortem template asks "Where did we get lucky?" In this instance, the author definitely got lucky that they were working at Google, where 1) there were enough users to generate this Heisenbug consistently and 2) they had direct access to Chrome devs.
Additionally - the author (and his team) triaged, root caused and remediated a JS compiler bug in 2 days. The sheer amount of complexity involved in trying to narrow down where in the browser code this could all be going wrong is staggering. Consider that the reason it took him "only" two days is because he is very, _very_ good at what he does.
One thing I notice is that Google has no way whatsoever to actually just ask users "hey, are you having problems?", a definite downside of their approach to software development where there is absolutely no communication between users and developers.
I recently realized that one question for me should be, "Did you panic? What was the result of that panic? What caused the panic?"
I had taken down a network, and the device led me down a pathway that required multiple apps and multiple logins I didn't have to regain access. I panicked and, because the network was small, roamed and moved all devices to my backup network.
The following day, under no stress, I realized that my mistake was that I was scanning a QR code 90 degrees off from its proper orientation. I didn't realize that QR codes had a proper orientation and figured that their corner identifiers handled any orientation. Then it was simple to gain access to that device. I couldn't even replicate the other odd path.
The basic operation of this program is as follows:
1. Panic. You usually do so anyways, so you might as well get it over with. Just don't do anything stupid. Panic away from your machine. Then relax, and see if the steps below won't help you out.
2. ...
I'm not sure this is really luck.
The fix is to just not use Math.abs. If they didn't work at Google they still would've done the same debugging and used the same fix. Working at Google probably harmed them, as once they discovered Math.abs didn't work correctly they could've just immediately used `> 0` instead of asking the Chrome team about it.
There's nothing lucky about slowly adding printf statements until you understand what the computer is actually doing; that's just good work.
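To illustrate the kind of workaround meant above, here is a minimal sketch that gets an absolute value from plain comparisons instead of calling Math.abs. The name `safeAbs` is hypothetical and this is not the code from the article; it only assumes the affected code needed a non-negative magnitude.

```typescript
// Hypothetical helper (not from the article): absolute value without Math.abs.
function safeAbs(x: number): number {
  if (x > 0) return x;   // already positive
  if (x < 0) return -x;  // flip the sign of negatives
  // Remaining cases: +0, -0, or NaN.
  // Zero (including -0) is normalized to +0; NaN is passed through unchanged.
  return x === 0 ? 0 : x;
}

// Example use: a distance check that no longer depends on Math.abs.
const delta = safeAbs(10 - 25);
console.log(delta > 0 ? "moved" : "unchanged");
```

Whether a bare `> 0` check is enough on its own depends on what the original call site did with the result, which the thread doesn't spell out.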
Same, I assumed they were designed to always work. I suspect it was whatever app or library you were using that wasn't designed to handle them correctly.
The bug ticks most of the boxes for a tricky bug:
* Non-deterministic
* Enormous haystack
* Unexpected "1+1=3"-type error with a cause outside of the code itself
Like sure, it would have been slower to debug if it took 30 hours to reproduce, and harder if he had to be going down Niagara Falls in a barrel while debugging it, but I'm not sure those things quite count.
I had a similar category of bug I was struggling with the other year[1] that was related to a faulty optimization in the GraalVM JVM leading to bizarre behavior in very rare circumstances. If I'd been sitting next to the right JVM engineers over at Oracle I'm sure we'd have figured it out in days and not the weeks it took me.
For instance, you might be mistaken about the operation of a system in some way that prolongs an outage or complicates recovery. Or perhaps there are complicated commands that someone pasted in a comment in a Slack channel once upon a time and you have to engage in gymnastics with Sloogle™ to find them, while the PM and PO are requesting updates. Or you end up saving the day because of a random confluence of rabbit holes you'd traversed that week, but you couldn't expect anyone else on the team to have had the same flash of insight that you did.
That might be information that is valuable to document or add to training materials before it is forgotten. A lot of postmortems focus on the root cause, which is great and necessary, but don't look closely at the process of trying to stop the bleeding.
https://hackaday.com/2025/01/23/this-qr-code-leads-to-two-we...