War story: the hardest bug I ever debugged
1. BobbyT+Hp7 2025-03-27 03:27:20
>>jakevo+(OP)
Interesting writeup, but 2 days to debug “the hardest bug ever”, while accurate, seems a bit overdone.

Though abs() returning negative numbers is hilarious. “You had one job…”

To me, the hardest bugs are nearly irreproducible “Heisenbugs” that vanish when instrumentation is added.

I’m not just talking about concurrency issues either…

The kind of bug where a reproduction attempt takes a week, not parallelizable due to HW constraints, and logging instrumentation makes it go away or fail differently.

2 days is cute though.

2. userbi+ps7 2025-03-27 04:02:11
>>BobbyT+Hp7
> The kind of bug where a reproduction attempt takes a week, not parallelizable due to HW constraints, and logging instrumentation makes it go away or fail differently.

The hardest one I've debugged took a few months to reproduce, and would only show up on hardware that only one person on the team had.

One of the interesting things about working on a very mature product is that bugs tend to be very rare, but those rare ones which do appear are also extremely difficult to debug. The 2-hour, 2-day, and 2-week bugs have long been debugged out already.

3. gmueck+pv7 2025-03-27 04:48:05
>>userbi+ps7
That reminded me of a former colleague at the desk next to me randomly exclaiming one day that he had just fixed a bug he had created 20 years ago.

The bug was actually quite funny in a way: it was in the code displaying the internal temperature of the electronics box of some industrial equipment. The string conversion was treating the temperature variable as an unsigned int when it was in fact signed. It took a brave field technician in Finland, inspecting a unit in an unheated space in winter, to even discover this particular bug, because the units' internal temperatures were usually about 20 °C above ambient.
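
For anyone who hasn't run into this class of bug, here is a minimal sketch of it in Python. The 16-bit little-endian reading and the function names are assumptions made for illustration; the comment above doesn't say what width, language, or format the real code used.

```python
import struct


def format_temp_buggy(raw: bytes) -> str:
    # Bug: "<H" decodes the two bytes as an UNSIGNED 16-bit integer,
    # so any reading below zero wraps around to a huge positive number.
    (value,) = struct.unpack("<H", raw)
    return f"{value} C"


def format_temp_fixed(raw: bytes) -> str:
    # Fix: "<h" decodes the same two bytes as a SIGNED 16-bit integer.
    (value,) = struct.unpack("<h", raw)
    return f"{value} C"


# Hypothetical sensor readings, encoded as signed 16-bit values.
warm = struct.pack("<h", 25)
cold = struct.pack("<h", -5)

print(format_temp_buggy(warm), "|", format_temp_fixed(warm))  # 25 C | 25 C
print(format_temp_buggy(cold), "|", format_temp_fixed(cold))  # 65531 C | -5 C
```

Both versions agree for every non-negative reading, which is presumably why a bug like this can sit unnoticed until a unit actually gets below freezing inside.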

4. treyd+Ow7 2025-03-27 05:12:32
>>gmueck+pv7
This is a surprisingly common mistake with temperature readings. It bites hardest when the system has a thermal safety power-off that triggers above some temperature, but reads the byte for -1 deg C as unsigned and sees 255 deg C.
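
A rough sketch of that failure mode, assuming an 8-bit two's-complement sensor register and a made-up 100 deg C cutoff (both are illustrative, not taken from any particular device):

```python
CUTOFF_C = 100  # hypothetical thermal safety threshold, in deg C


def as_unsigned8(raw: int) -> int:
    # Wrong for this sensor: treats the register byte as 0..255.
    return raw & 0xFF


def as_signed8(raw: int) -> int:
    # Right for this sensor: two's complement, -128..127.
    raw &= 0xFF
    return raw - 256 if raw >= 128 else raw


def should_power_off(temp_c: int) -> bool:
    return temp_c > CUTOFF_C


raw = 0xFF  # the byte a two's-complement sensor reports for -1 deg C

print(as_unsigned8(raw), should_power_off(as_unsigned8(raw)))  # 255 True
print(as_signed8(raw), should_power_off(as_signed8(raw)))      # -1 False
```

With the unsigned read, a unit sitting just below freezing trips the overheat shutdown; with the signed read it stays on.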