Vendor provided an Outlook plugin (ew) that linked storage directly into Outlook (double ew) and contained a built-in PDF viewer (disgusting) for law firms to manage their cases.
One user, regardless of PC, user account or any other isolation factor, would reliably crash the program and Outlook with it.
She could work for 40 minutes on another user's logged-in account on another PC and reproduce the issue.
Turns out it was a memory allocation issue. When you opened a file saved in the addon's storage via the built-in PDF viewer, it would allocate memory for it. However, when you closed the PDF file, it would not deallocate that memory. After debugging her usage for some time, I noted that there was memory deallocation, but it was performed at intervals.
If there were 20 or so PDF allocations and then she switched customer case files before a deallocation, regardless of available memory, the memory allocation system in the addon would shit the bed and crash.
This one user, an absolute powerhouse of a woman I must say, could type 300 wpm and would rapidly read -> close -> assign -> allocate -> write notes faster than anyone I have ever seen before. We legitimately got her to rate limit herself to 2 files per 10 minutes as an initial workaround while waiting for a patch from the vendor.
I had to write one hell of a bug report to the vendor before they would even look at it. Naturally they could not reproduce the error through their normal tests and tried closing the bug on me several times. The first update they rolled out upped it to something like 40 PDFs viewed every 15 minutes. But she still managed to touch the new ceiling on occasion (I imagine billing each of those customers 7 minutes a pop or whatever law firms do), and ultimately they had to rewrite the entire memory system.
Though abs() returning negative numbers is hilarious… "You had one job…"
To me, the hardest bugs are nearly irreproducible “Heisenbugs” that vanish when instrumentation is added.
I’m not just talking about concurrency issues either…
The kind of bug where a reproduction attempt takes a week, not parallelizable due to HW constraints, and logging instrumentation makes it go away or fail differently.
2 days is cute though.
> It didn’t correspond to a Google Docs release. The stack trace added very little information. There wasn’t an associated spike in user complaints, so we weren’t even sure it was really happening — but if it was happening it would be really bad. It was Chrome-only starting at a specific release.
That sounds like a Chrome bug. Or, at least, a bug triggered by a change in Chrome. Bisecting your code when their change reveals a crash is folly, regardless of whose bug it is.
I still don't understand how we've arrived at this state of affairs
It took me maybe three days to track down, from first clues to final resolution, on a 486/50 luggable with the orange on black monochrome built-in screen.
The hardest one I've debugged took a few months to reproduce, and would only show up on hardware that only one person on the team had.
One of the interesting things about working on a very mature product is that bugs tend to be very rare, but those rare ones which do appear are also extremely difficult to debug. The 2-hour, 2-day, and 2-week bugs have long been debugged out already.
Regarding "exhausting 2-day brute-force grind": is/was this just how you like to get things done, or was there external pressure of the "don't work on anything else" sort? I've never worked at a large company, and lots of descriptions of the way things get done are pretty foreign to me :). I am also used to being able to say "this isn't getting figured out today; probably going to be best if I work on something else for a bit, and sleep on it, too".
The JIT compiler contains passes that will eliminate unnecessary bounds checks. For example, if you write “var x = Math.abs(y); if(x >= 0) arr[x] = 0xdeadbeef;”, the JIT compiler will probably delete the if statement and the internal nonnegative array index check inside the [] operator, as it can assume that x is nonnegative.
However, if Math.abs is then “optimized” such that it can produce a negative number, then the lack of bounds checks means that the code will immediately access a negative array index - which can be abused to rewrite the array’s length and enable further shenanigans.
Further reading about a Chrome CVE pretty much exactly in this mold: https://shxdow.me/cve-2020-9802/
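To make that concrete, here is a minimal sketch of the pattern (illustrative code only, not the actual CVE):

// Hypothetical sketch of the pattern described above.
function store(arr, y) {
  var x = Math.abs(y);     // range analysis concludes x >= 0
  if (x >= 0) {            // ...so the JIT can fold this check away
    arr[x] = 0xdeadbeef;   // ...and drop its own lower-bound check on the index
  }
}
// If a broken Math.abs ever hands back a negative x here, the store goes
// out of bounds (e.g. into the array's length field) instead of being caught.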
Our team also had a very grindy culture, so "I'm going to put in extra hours focusing exclusively on our top crash" was a pretty normalized behavior. After I left that team (and Google), most of my future teams have been more forgiving on pace for non-outages.
Point being that the difficulty of a fix can come from many possible places.
The bug was actually quite funny in a way: it was in the code displaying the internal temperature of the electronics box of some industrial equipment. The string conversion was treating the temperature variable as an unsigned int when it was in fact signed. It took a brave field technician in Finland in winter, inspecting a unit in an unheated space to even discover this particular bug because the units' internal temperatures were usually about 20C above ambient.
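The same class of bug is easy to model in JavaScript with a DataView; a toy illustration, not the original firmware code:

// Toy model of the signed/unsigned confusion described above.
const view = new DataView(new ArrayBuffer(1));
view.setInt8(0, -20);            // the electronics box really is at -20C
console.log(view.getInt8(0));    // -20  (correct signed read)
console.log(view.getUint8(0));   // 236  (what the string-conversion code effectively displayed)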
I followed all of this up until here. JavaScript lets you modify the length of an array by assigning to negative indexes? I'm familiar with the paradigm of negative indexing being used to access things from the end of an array (like -1 being the last element), but I don't understand what operation someone could do that would modify the length of the array rather than modifying a specific element in place. Does JIT-compiled JavaScript not follow the usual JavaScript semantics that would normally apply when using a negative index, or are you describing something that would be used in combination with some other compiler bug (which honestly sounds a lot more severe even in the absence of an unusual Math.abs implementation)?
I.e., don't think of fancy language shenanigans that do negative indexing; think of a raw memory access at a negative offset from the beginning of the array.
When there's some inlining, there will be no function call into some index-operator function.
This is my no doubt dumb understanding of what you can do, based on some funky stuff I did one time to mess with people's heads
do the following

const arr = [];
arr[-1] = "hi";
console.log(arr);

this gives you

"-1": "hi"
length: 0
which I figured is because really an array is just a special type of object. (my interpretation, probably wrong)
Now we can see that the JavaScript Array length is 0, but since the value is findable in there, I would expect there is some length representation in the lower-level language that JavaScript is implemented in, in the browser. I would then think there could even be exploits available by somehow taking advantage of the difference between this lower-level representation of length and the JS array length. (Again, all this is silly stuff I thought up and have never investigated, and is probably laughably wrong in some ways.)
I remember seeing some additions to array a few years back that made it so you could protect against the possibility of negative indexes storing data in arrays - but that memory may be faulty as I have not had any reason to worry about it.
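That interpretation is basically right: a negative "index" on a JS array is just an ordinary string-keyed property on the array object, which you can confirm directly (standard behaviour, nothing engine-specific):

const arr = [10, 20];
arr[-1] = "hi";                  // stored as the string property "-1", not an element
console.log(arr.length);         // 2 -- length only tracks real array indices
console.log(Object.keys(arr));   // ["0", "1", "-1"]
console.log(arr[-1]);            // "hi" -- ordinary property lookup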
Currently working a bug where we saw file system corruption after 3 weeks of automated testing and tens of thousands of restarts. We might never see the problem again; it has only happened once so far.
Here's what a lawyer does:
1. They bill for time writing emails and on phone calls.
2. They bill for time reviewing emails.
3. They bill for printing (and faxing if they are diehards).
4. They also bill for the time they are face to face with a human.
They also need to gather all the data, much of which flows in and out via email (or fax if they hate you) related to the case in a single space.
The sad state is that 80% of this can be achieved in Outlook without much effort. Setting up an external application to capture all this shit is quite difficult, and generally requires mail to be run through it in some capacity. The question is: why reinvent the email client? (Sadly, they reinvented the PDF reader.) I have seen some law firms literally saving out every email as HTML and uploading it with billing stats to a third-party app. It's easier for me to support, but the user experience can be awful.
The users already exist in Outlook, and they already understand Outlook. A few buttons in the ribbon (mostly "file this under X open case", "time me", and "bill this customer") make more sense from a user perspective.
From a support perspective it's an absolute nightmare. Microsoft absolutely won't take a support case about an addon with shit memory management. And the addon provider will usually blame Microsoft.
However, if the JIT compiler has "proven" that the index is never negative (because it came from Math.abs), it may omit such checks. In that case, the resulting access to e.g. arr[-1] may directly access the memory that sits one position before the array elements - which could, for example, be part of the array metadata, such as the length of the array.
You can read the comments on the sample CVE's proof-of-concept to see what the JS engine "thinks" is happening, vs. what actually happens when the code is executed: https://github.com/shxdow/exploits/blob/master/CVE-2020-9802.... This exploit is a bit more complicated than my description, but uses a similar core idea.
The fix ended up being one character: change the priority of an eBPF tc filter from 0 to 1.
Math.abs(Integer.MIN_VALUE) in Java very seriously returns -2147483648, as there is no int for 2147483648.
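You can poke at the same two's-complement corner case from JavaScript by forcing the result back into an int32 (just an illustration; JS's own Math.abs is fine here, since JS numbers are doubles):

// |INT_MIN| doesn't fit in a signed 32-bit int, so coercing the result back wraps around.
console.log(Math.abs(-2147483648));        // 2147483648
console.log(Math.abs(-2147483648) | 0);    // -2147483648, same wraparound Java's int abs hits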
Where can mere mortals complain about a Google product?
Then a high-ranked non-employee 'product expert' will be along presently to tell you that's not really a problem and to stop bothering the almighty Google with such trivialities; your views are not important, they have millions of users, and really, why should they listen to you?
At least, that's been my experience.
You can do more and more in it and it's so fun, until it suddenly isn't anymore and dies.
> Then we called in our Tech Lead / Manager, who had a reputation of being a human JavaScript compiler. We explained how we got here, that Math.abs() is returning negative values, and whether she could find anything that we were doing wrong. After persuading her that we weren’t somehow horribly mistaken, she sat down and looked at the code. Her CPU spun up to 100%, and she was muttering in Russian about parse trees or something while staring at the code and typing into the debug console. Finally she leaned back and declared that Math.abs() was definitely returning negative values for negative inputs.
struct js_array {
    uint64_t length;
    js_value *values[];
};
Because after bounds checks have been taken care of, loading an element of a JS array probably compiles to a simple assembly-level load like mov. If you bypass the bounds checks, that mov can read or write any mapped address.

Love the story! There is so much complexity in the world around us that seemingly obviously wrong things happen through the most unlikely chains of dependency.
Oh my god.
Honestly, of all the stupid ideas, having your engine switch to a completely untested mode when under heavy load, a mode that no one ever checks and it might take years to discover bugs in, is absolutely one of the most insane things I can think of. That's at best really lazy, and at worst displays a corporate culture that prizes superficial performance over reliability and quality. Thankfully no one's deploying V8 in, like, avionics. I hope.
At least this is one of those bugs you can walk away from and say, it really truly was a low-level issue. And it takes serious time and energy to prove that.
It throws an OverflowException: ("Negating the minimum value of a twos complement number is invalid.")
At one point I found a bug where if you hit a sequence of buttons on the remote at a very specific time--I want to say it was "next track" twice right as a new track started--the whole device would crash and reboot. This was a show stopper; people would hit the roof if their $500 stereo crashed from hitting "next". Similar to the article, the engineering lead on the product cleared his schedule to reproduce, find, and fix the issue. He did explain what was going on at the time, but the specifics are lost to me.
Overall the work was incredibly boring. I heard the same few tracks so many times I literally started to hear them in my dreams. So it was cool to find a novel, highest severity bug by coloring outside the lines of the testcases. I felt great for finding the problem! I think the lead lost 20% of his hair in the course of fixing it, lol.
I haven't had QA as a job title in a long time but that job did teach me some important lessons about how to test outside the happy path, and how to write a reproducible and helpful bug report for the dev team. Shoutout to all the extremely underpaid and unappreciated QA folks out there. It sucks that the discipline doesn't get more respect.
It's like this https://geek-and-poke.com/geekandpoke/2017/8/13/just-happene... but actually true, which is really bad for mental health :D
To be clear, there are good reasons for this different mode. The fuck-up is not testing it properly.
These kinds of modes can be tested properly in various ways, e.g. by having an override switch that forces the chosen mode to be used all the time instead of using the default heuristics for switching between modes. And then you run your test suite in that configuration in addition to the default configuration.
The challenge is that you have now at least doubled the time it takes to run all your tests. And with this kind of project (like a compiler), there are usually multiple switches of this kind, so you very quickly get into combinatorial explosion where even a company like Google falls far short of the resources it would require to run all the tests. (Consider how many -f flags GCC has... there aren't enough physical resources to run any test suite against all combinations.)
The solution I'd love to see is stochastic testing. Instead of (or, more realistically, in addition to) a single fixed test suite that runs on every check-in and/or daily, you have an ongoing testing process that continuously tests your main branch against randomly sampled (test, config) pairs from the space of { test suite } x { configuration space }. Ideally combine it with an automatic bisector which, whenever a failure is found, goes back to an older version to see if the failure is a recent regression and identifies the regression point if so.
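Concretely, something like the sketch below (hypothetical harness, names made up), run continuously against main instead of as a fixed suite:

// Hypothetical stochastic harness: sample random (test, config) pairs from the
// full cross product instead of trying to enumerate it.
function randomChoice(xs) {
  return xs[Math.floor(Math.random() * xs.length)];
}

async function stochasticTestLoop(tests, configs, runTest, reportFailure) {
  for (;;) {
    const test = randomChoice(tests);
    const config = randomChoice(configs);   // e.g. { forceSuperOptimizedTier: true }
    const result = await runTest(test, config);
    if (!result.ok) reportFailure(test, config, result);  // hand off to the automatic bisector
  }
}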
The ultimate cause was in the network initialisation using a network library that was a tissue-paper-thin wrapper around Linux sockets. When downloading a new software version to the device, it would halt the PLC but this didn’t cleanly shut down open sockets, which would stay open, preventing a network service from starting until the unit was restarted. So I did the obvious thing and wrote the socket handle to a file. On startup I’d check the file and if it existed, shut that socket handle. This worked great during development.
Of course this file was still there after a power cycle. 99% of the time nothing would happen, but very occasionally, closing this random socket handle on startup would segfault the soft PLC runtime. So dumb, but so hard to actually catch in the wild.
As per the compute shader post from a few days ago, I'm currently "debugging" some pretty advanced code that's being ported to a shader, and the only way to do it is by creating an array of e.g. ints and inserting values into it in both the original and the shader code to see where they diverge. It's not the most difficult, but it's quite time consuming.
My favourites are bugs that not only don't appear in the debugger, but also no longer reproduce on normal settings after I've taken a closer look in the debugger (only to come back later at a random time). Feels like chasing ghosts.
An intern gets a devboard with a new mcu to play with. A new generation, but mostly backwards compatible or something like that. Intern gets the board up and running with the embedded equivalent of "hello world". They port basic product code - ${thing} does not work. After enough hair is pulled, I give them some guidance - ${thing} does not work. Okay, I instruct the intern to take the mcu vendor libraries/examples and get ${thing} running in isolation. Intern fails.
Okay, we are missing something huge that should be obvious. We start pair programming and strip the code down layer by layer. Eventually we are at a stage where we are accessing hand-coded memory addresses directly. ${thing} does not work. Okay, set up a peripheral and read state register back. Assertion fails. Okay, set up peripheral, nop some time for values to settle, read state register back. Assertion fails. Check generated assembly - nopsled is there.
We look at manual, the bit switching peripheral into the state we care about is not set. However we poke the mcu, whatever we write to control register, the bit is just not set and the peripheral never switches into the mode we need. We get a new devboard (or resolder mcu on the old one, don't remember) and it works first try.
"New device - must be new behavior" thinking with lack of easy access to the new hardware led us down a rabbit hole. Yes, nothing too fancy. However, I shudder thinking what if reading the state register gave back the value written?
That’s an incorrect assumption. Just because your test case isn’t triggering the bug reliably, it does not mean the bug is nondeterministic.
That is like saying the "OpenOffice can't print on Tuesdays" bug is nondeterministic because you can't reproduce it every day. It is deterministic; you just need to find the right set of circumstances.
https://beza1e1.tuxen.de/lore/print_on_tuesday.html
From the writing it appears the author found one way to reproduce the bug sometimes and then relied on it for every test. Another approach would have been to tweak their test case until they found a situation which reproduced the bug more or less often, trying to find the threshold that causes it and continuing to deduce from there.
Prior to this year, they could only handle 0-127 degrees for the water temperature, which used to be sensible. But there were some issues once pressurised water started being delivered to houses, resulting in negative temperatures being reported, like -125C, which immediately has the water switch off to prevent icing problems.
The software side also switched from COBOL to Ada. So that's kewl.
Weird things can happen anywhere, but I'm wondering why this issue wasn't caught by test cases before it escaped to production. I would think that a compiler team would have low-level tests for such common functions.
https://youtube.com/watch?v=sGTDhaV0bcw
The author’s “error”, of course, was calling it “the hardest bug I ever debugged”. It drives clicks, but comparisons too.
A lesson to learn seems obvious to me: the V8 team did not communicate sufficiently upfront about the "oops, our Math.abs() may return negative numbers, we fixed that in version X, be warned" change.
Which the V8 team should be able to do in an "advisory for Google developers who work on high-performance client-side view rendering stuff" sort of weekly newsletter.
Turned out that Node.js didn't gracefully close TCP connections. It just silently dropped the connection and sent a RST packet if the other side tried to reuse it. Fun times.
This had worked perfectly for many years but windows was upgraded underneath it, and some smartass had used clever tricks for a hover menu that didn’t work in a future (safer) version of the OS. A rarely triggered hover menu.
Thank you, authors of Advanced Windows Debugging and Advanced .NET Debugging.
I couldn't always get people to talk this way, but people who did usually worked out well
However we still saw these crash reports from one device (conveniently the partner of the CEO, so we got full debug reports). However the system logs were suspicious, lots of clock jumps especially when coming out of sleep. At the end of the day we concluded it was bad hardware (an M1 Max) and the OS was trusting it too much, returning out-of-order values for a supposedly monotonic clock. We updated the code to use saturating arithmetic to mitigate the problem.
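The mitigation boils down to refusing to believe the clock ever went backwards; roughly this, in JS for illustration (the real code was in whatever the app was written in):

// Sketch of the saturating-arithmetic mitigation for a "monotonic" clock
// that occasionally returns out-of-order values.
let lastTimestamp = 0;

function elapsedSinceLast(now) {
  const delta = Math.max(0, now - lastTimestamp);  // clamp impossible negative jumps to zero
  lastTimestamp = Math.max(lastTimestamp, now);    // never move the reference backwards
  return delta;
}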
A new customer comes in and we deploy a new VMware vSphere private cloud platform for them (our first using this type of hardware). Nothing special or too fancy, but the first one with 10G production networking.
After a few weeks, integration team complains that a random VM stopped being able to communicate with another VM, but only one other specific VM. Moving the "broken" VM to a different ESXi fixed things, so we suspected a bad cable/connection/port/switch. Various tests turned up nothing, so we just waited for something to happen again.
A few days later, same thing. Some more debugging, packet capture, nothing. Rebooting the ESXi fixed the issue, so it was not the cables/switch, probably. A support ticket was opened at VMware for them to throw all sorts of useless "advice" at us (update drivers, firmware, OS, etc etc).
This kept happening more and more; at some point there were multiple daily occurrences of this - again, just specific VMs to other specific VMs, while you could always SSH and communicate with other things - and we had to reboot the hypervisor each time to fix it. VMware were completely and utterly useless, even with all the logs, timelines, etc.
A few weeks in, customer is getting pissed. We say that we've tried all sorts of debugging of everything (packet capture on the ESX, switch stuff, in the guest OSes, etc etc), and there's no rhyme nor reason - all sorts of VMs, of different virtual hardware versions, on different guest OSes, different virtual NIC types, different ESXes, and we're trying stuff with the vendor, it probably being a software bug.
One morning I decided to just go and read all of the logs on one of the ESXes, trying to see if I could spot something weird (early on we had tried grepping for errors and warnings, which yielded just VMware vomit and nothing of use). There's too much of them, and I don't see anything. In desperation, I Googled various combinations of "vmware" "nic type" "network issues", and boom, I stumble upon Intel forums with months of people complaining that the Intel X710 NIC's drivers are broken, throw a "Malicious Driver Detected" message (not error) in the logs, and just shut down traffic on that specific port. And what do you know, that's the NIC we're using, and we have those messages.

The piece of shit of a driver had been known to not work for months (there was either that, or it crashing the whole machine), but was proudly sitting on VMware's compatibility list. When I told VMware's support about it, they said they were aware internally, but refused to remove it from the compatibility list. But if we upgraded to the beta release of the next major vSphere, there was a newer driver that supposedly fixed everything. We did that and everything was then finally fixed, but there were machines with similar issues where the driver wasn't updated for years after that.
This is the event that taught me that enterprise vendors don't know that much even about their own software, VMware's support is useless, hardware compatibility lists are also useless. So you actually need to know what you're doing and can't rely on support saving you.
It's not even malice/laziness, it's their entire interpretation of the problem/requirements drives their implementation which then drives their testing. It's like asking restaurants to self-certify they are up to food safety codes.
Part of it was difficulty of pinpointing the actual issue - fullness of drive vs throughput of writes.
A lot of it was unfortunately organizational politics such that the system spanned two teams with different reporting lines that didn't cooperate well / had poor testing practices.
The hardest bugs in my experience are those where your only source of vital information is a third party who is straight-up lying to you.
Spec: allow the internal BI tool to send scheduled reports to the user
Implementation: the server required the desktop front end of said user to have been opened that day for the scheduled reports to work, even though the server side was sending the mails
Why this was hilariously bad - the only reason to have this feature is for when the user is out of office / away from desk for an extended period, precisely when they may not have opened their desktop UI for the day.
One of my favorite examples of how an engineer can get the entire premise of the problem wrong.
In the end he had taken so long and was so intransigent that desktop support team found it easier to schedule the desktop UIs to auto-open in windows scheduler every day such that the whole Rube Goldberg scheduled reports would work.
I work with LLVM, and a huge % of my work is fixing bugs that are already fixed upstream.
With the lady, if she'd dialed it back a bit on her pace of work "because people are watching", that could have been a crazy one to debug... "only happens when no one is watching (and I'm not beastly-WPM closing cases)"
Back in 2005, when I only had paid-by-cash internet cafe access to a computer, the shopkeeper offered me free time on the computer IF I typed in and ran a 15-page class 12 computer project, printed on A4 sheets, in the compiler: Turbo C++. I gladly accepted the offer and typed it in.
When I finished typing and had fixed all the compile errors, the program didn't work as expected. A few hours later, I found out that 1 or 2 pages of the printed source code were not in the original order. :-O So I had to swap code from one function to another to finally get it working. That was one hell of a lesson!
The shopkeeper must have sold that project to many students, and I got some free internet access.
Think network appliances in the middle that don't log, or don't log at the level you need (and sometimes they can't log what you need).
Those usually mean that no reproduction is possible, except in production or very close to it, with tools you don't always control.
Annoying ones are the "this HTTP request is sometimes slow" kind, where chasing each box in the middle reveals a new box that is supposed to be transparent but isn't, or some rare timing issue caused by boxes interacting in a funny way.
edit: reminded me of the old joke
A programmer gets sent to the store by his wife. His wife says, “Get a gallon of milk, and if they have eggs, get a dozen.”
The programmer returns home with 12 gallons of milk and says, “They had eggs.”
I was telling someone the story a couple years ago and they said the opcodes linked to the symbols could get corrupted or something like that.
This is how humans work, and this is why I am reading the comments.
But then a production optimized build apparently contains different code? This sounds to me like a system flaw
> We rerun the repro. We look at the logged value. Math.abs() is returning negative values for negative inputs. We reload and run it again. Math.abs() is returning negative values for negative inputs. We reload and run it again. Math.abs() is returning negative values for negative inputs.
Regardless, that is beside the point. I was not arguing either way if this was a deterministic bug or not, I was pointing out that the author’s conclusion does not follow from the premise. Even if the bug had turned out to be nondeterministic, they had not done the necessary steps to confidently make that assertion. There is a chasm of difference between “this bug is nondeterministic” and “I haven’t yet determined the conditions that reproduce this bug”.
Emacs' #'message implementation has a debounce logic, that if you repeatedly debug-print the same string, it gets deduplicated. (If you call (message "foo") 50 times fast, the string printed is "foo [50 times]"). So: if you debug-print inspect a variable that infrequently changes (as was the case), no GUI thrashing occurs. The bug manifested when there were *two* debug-print statements active, which circumvented the debouncer, since the thing being printed was toggling between two different strings. Commenting out one debug-print statement, or the other, would hide the bug.
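A toy model of that debounce logic (in JavaScript rather than Emacs Lisp) shows why a single repeated print is harmless while two alternating prints thrash the display:

// Toy model of the "foo [50 times]" deduplication described above.
const redrawEchoArea = (s) => console.log(s);  // stand-in for the real display update
let lastMessage = null;
let repeatCount = 0;

function message(text) {
  if (text === lastMessage) {
    repeatCount++;
    redrawEchoArea(`${text} [${repeatCount} times]`);  // same string: cheap update, no thrash
  } else {
    lastMessage = text;
    repeatCount = 1;
    redrawEchoArea(text);  // new string: full redraw; two alternating prints hit this on every call
  }
}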
The language compiler was most likely written by someone who had never read a book about compilation, it was basically just like if you had written a compiler using macros. I don't think it had anything like an optimisation pass. This combined with it being a higher level language meant that debugging with a debugger was just infeasible. Even if you had figured out the issue, you wouldn't know what exactly caused it from the code side as most lines of code would get turned into pages of assembly. Not only that, I believe the format for the debug symbols was custom so line number information was something you would only get if you used the terrible debugger which shipped with the language. Windows is also a terrible development environment due to the incredible lack of any good documentation for almost anything at the WinAPI level.
The applications I was working on were multi-threaded Windows applications. Concurrency issues were everywhere. Troubleshooting them sometimes took months. In many cases the fixes made absolutely no sense.
The IDE (which you were basically forced to use) was incessantly buggy. You could reliably crash it in many contexts by simply clicking too fast. After 5 years of working with that tooling, I had gained an intuition for where I needed to slow down my clicks to prevent a crash.
The IDE also operated on these binary blobs which encapsulated the entire project. I never put in the time to investigate the format of these blobs but, unsurprisingly given the quality of the IDE, it was possible to put these opaque binary blobs into erroneous states. You could just revert to a previous version of the blob and copy-paste all your work back in (there was no way of easily accessing the raw text in the IDE because of an idiotically designed templating feature which was used throughout). If your project was in a weird state, you would get mystery compiler errors with a 32-bit integer printed as hex as an error identifier.
Searching the documentation or the internet for these numbers would either produce no results or would produce forum or comp.lang.clarion results for dozens of unrelated issues.
The language itself was an insane variation of Pascal and/or COBOL. It had some nice database-related features (as it was effectively CRUD domain specific) but that was about it. You look on GitHub these days and see people discussing the soundness and ergonomics issues of the never type in Rust for many months before even considering partially stabilising it. Meanwhile in Clarion, you get a half-arsedly written documentation page which serves as the language specification, and out of it you get a half-baked feature which doesn't work half the time. The documentation would often have duplicate pages for some features which would provide you with non-overlapping, sometimes conflicting or just outright wrong information.
When dealing with WINAPI you would need to deal with pointer types, and sometimes you would need to do pointer type conversions. The language wouldn't let you just do something like `void *p = &foo;` (this is C, actually very sane compared to Clarion). You had to do the language equivalent of `void *p = 1 ? &foo : NULL;` which magically lost enough type information for the language to let you do it. There was no documented alternative to this (there was casting, it just didn't work in this case), this wasn't even itself documented and was just a result of frustration and trial and error.
Not only this, the people I was working with had all adopted this terrible proprietary language (oh wait, did I mention you had to pay for a license for this shit) at a time when you were writing pure WinAPI code in C or C++. So for them, the fact that it had a forms editor was so amazing that they literally never considered, for the next 25 years, looking at alternative options. So when I complained about the complete insanity of using this completely ridiculous language, I would get told that the alternatives were worse.
Do you want to experience living hell when debugging? Find a company writing Clarion, apparently it's still popular in the US government.
But, love the story and I collect tales like this all the time so thanks for sharing
I mean, it's well known that there's very little engineering in most software "engineers", but you're describing a person I've never seen.
I work on server software for customers' online backups. We do thousands of mounts/unmounts of a particular filesystem daily. Once every month or so, we get an issue where a file timestamp fails to save; the error happens at the filesystem level.
Hard to reproduce! It's a filesystem bug! So it's purely theoretical work: reading code and seeing how it could happen.
Found out after a while, and the conditions were fun. I don't remember exactly, but it was something like you need to follow these steps:
1. Create a folder
2. Create 99 files in it (no more, no less)
3. Create a new folder
4. Copy the first of the 99 files into the new folder
The issue was linked to some data structure caching, and cache eviction.
Had fun finding it out!
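For reference, the recipe in script form; a hypothetical Node.js sketch (the bug itself was in the filesystem, not in any particular client code):

// Hypothetical re-creation of the reproduction steps described above.
const fs = require("fs");
const path = require("path");

const dirA = "repro-a";
const dirB = "repro-b";

fs.mkdirSync(dirA);                                         // 1. create a folder
for (let i = 0; i < 99; i++) {                              // 2. exactly 99 files, no more no less
  fs.writeFileSync(path.join(dirA, `file-${i}.txt`), "x");
}
fs.mkdirSync(dirB);                                         // 3. create a new folder
fs.copyFileSync(path.join(dirA, "file-0.txt"),              // 4. copy the first file into it
                path.join(dirB, "file-0.txt"));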
After 3 days of literally trying everything, I don't know why, I thought of rewriting the file character by character by hand and it worked. What was happening?
Eventually opened the two files side by side in a hex editor and here it is: several exotic unicode characters for "empty" space.
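In hindsight, a quick scan for anything outside printable ASCII would have caught it without the hex editor; a rough sketch:

// Flag characters that aren't printable ASCII or ordinary whitespace,
// e.g. non-breaking spaces and other exotic Unicode "empty" space.
function findSuspiciousChars(text) {
  const hits = [];
  for (let i = 0; i < text.length; i++) {
    const code = text.charCodeAt(i);
    const ok = (code >= 0x20 && code <= 0x7e) || code === 0x09 || code === 0x0a || code === 0x0d;
    if (!ok) hits.push({ index: i, codePoint: "U+" + code.toString(16).toUpperCase() });
  }
  return hits;
}

console.log(findSuspiciousChars("width:\u00A0100px"));  // [ { index: 6, codePoint: "U+A0" } ]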
I'm seeing a lot of comments saying "only 2 days? must not have been that bad of a bug". Some thoughts here:
At my current day job, our postmortem template asks "Where did we get lucky?" In this instance, the author definitely got lucky that they were working at Google where 1) there were enough users to generate this Heisenbug consistently and 2) that they had direct access to Chrome devs.
Additionally - the author (and his team) triaged, root caused and remediated a JS compiler bug in 2 days. The sheer amount of complexity involved in trying to narrow down where in the browser code this could all be going wrong is staggering. Consider that the reason it took him "only" two days is because he is very, _very_ good at what he does.
import torch

a = torch.tensor(-2**31, dtype=torch.int32)  # INT32_MIN; abs() wraps back to the same value
assert a == a.abs()
> A programmer gets sent to the store by his wife. His wife says, “Get a gallon of milk. If they have eggs, get a dozen.”
That's the perfect optimization: extremely fast, and mostly right -- probably more often than 50% if there are more positive numbers than negative ones.
No, that means you're dealing with an early alpha, rigged demo, or some sort of vibe coding nonsense.
But there's a lot hidden in "same inputs", because that includes everything that's an input to your program from the operating system. Which includes things like "time" (bane of reproduction), memory layout, execution scheduling order of multithreaded code, value of uninitialized memory, and so on.
> Another approach would have been to tweak their test case until they found a situation which reproduced the bug more or less often, trying to find the threshold that causes it and continuing to deduce from there.
Yes - when dealing with unknowns in a huge problem space it can be very effective to play hotter-colder and climb up the hill.
On an embedded system, we had this bug that we couldn't find. It was around for a month or two. Random crashes that we couldn't reproduce, couldn't even debug. We started calling it "the phantom".
Finally Ed said, "I think the phantom showed up after we made that change to the ethernet driver." We reverted it, and the bug disappeared.
We never found the bug in the source code. But Ed debugged it using the calendar.
In a large complicated application where a change to the environment revealed a crash, finding out what changed in the environment and thinking about how that affects the application makes a lot more sense than going back through application changes to see if you can find it that way.
Once you figure out what the problem is, sure, you can probably fix it in the application or the environment, and fixing the application is often easier if the environment is Chrome. But "Chrome changed and my app is broken" means look at the changes in Chrome and work from there.
Turns out IE8 doesn't define console until the devtools are open. That caused me to pull a few hairs out.
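The classic workaround, for anyone still maintaining code that has to run there, is to stub console out up front (a common defensive pattern, nothing framework-specific):

// Old-IE guard: console only exists once the devtools have been opened,
// so give the methods you use no-op fallbacks before any logging runs.
if (typeof window.console === "undefined") {
  window.console = { log: function () {}, warn: function () {}, error: function () {} };
}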
One thing I notice is that Google has no way whatsoever to actually just ask users "hey, are you having problems?", a definite downside of their approach to software development where there is absolutely no communication between users and developers.
This is really the same issue with the promo culture we see at Big Tech companies: you end up promoting the people who are good at crafting promo packets i.e. telling stories about their work. There is certainly a good overlap between that and the people who do genuinely good work, but it's not a perfect overlap.
Personally I don't really mind it because I consider myself good at story telling. But as an interviewer I would never do that to a candidate because not everyone can tell good stories.
I recently realized that one question for me should be, "Did you panic? What was the result of that panic? What caused the panic?"
I had taken down a network, and the device led me down a pathway that required multiple apps and multiple logins I didn't have in order to regain access. I panicked, and because the network was small, roamed around and moved all devices to my backup network.
The following day, under no stress, I realized that my mistake was that I was scanning a QR code 90 degrees off from its proper orientation. I didn't realize that QR codes had a proper orientation and figured that their corner identifiers handled any orientation. Then it was simple to gain access to that device. I couldn't even replicate the other odd path.
Pulled my hair out for a year, no progress/insights. Updated the driver for a device, haven't seen it since.
I hope the reverse-calendar debugging works for me!
> When doing the refactoring, they needed to provide new implementations for every opcode. Someone accidentally turned Math.abs() into the identity function for the super-optimized level. But nobody noticed because it almost never ran — and was right half of the time when it did.
If it never was tested, plain and simple as that, then it couldn't matter that it 'almost never ran' or 'was right half the time'.
So the root problem here is that their test-suite neither exercised all optimized levels appropriately, nor flagged the omission as a fatal problem breaking 100% branch coverage (which for a simple primitive like abs you'd definitely want). This meant that they could break lots of other things too without noticing. OP doesn't discuss if the JS team dealt with it appropriately; one hopes they did.
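The usual engine-test trick for this is to keep the primitive hot enough that the JIT promotes it through every tier, asserting the result the whole way; a hypothetical sketch (not V8's actual harness, which also has explicit tier-forcing flags):

// Hypothetical tier-coverage check: run Math.abs long enough for the JIT to
// optimize it, and verify the invariant on every call.
function checkAbs(x) {
  const r = Math.abs(x);
  if (r < 0) throw new Error("Math.abs(" + x + ") returned " + r);
  return r;
}

for (let i = 0; i < 1e6; i++) {
  checkAbs(-i);  // the negative half that was silently broken
  checkAbs(i);   // the positive half that happened to stay right
}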
60 seconds to reproduce? Slow!? Laughs in enterprise software
Bought a brand new battery, but the problem persisted. Started looking at all the various parts in the car, that were connected to the electrical system. Took them out, troubleshooting the parts to my best ability, even ended up buying a new alternator AND solenoid just out of sheer desperation.
3 months went by, countless hours in the garage, and I thought to myself...could it be...could it be the new battery I bought? Bought yet another battery, and everything worked. Just like that.
Turns out the battery I had in my car originally had degraded and couldn't store enough charge. And the second (brand new) one I bought turned out to also be defective, having the very same fault.
Those faulty batteries would charge up to measure the correct voltage, but didn't get the correct charge capacity - and thus the car couldn't draw enough current to start the engine.
And don't get me started on the weird wacky world of electronics...but the car debugging was by far the longest I've spent, at one point I had almost every component out of the car, going over the wiring.
I won't name the product because it's not its fault, but we had an HA cluster of 3 instances of it set up. Users reported that the first login of the day would fail, but only for the first person to come into the office. You hit the login button, it takes 30 seconds to give you an invalid login, and then you try logging in again and it works fine for the rest of the day.
Turns out IT had a "passive" firewall (traffic inspection and blocking, but no NAT) in place between the nodes. The nodes established long-running TCP connections between them for synchronization. The firewall internally kept a table of known established connections and eventually drops them out if they're idle. The product had turned on TCP keepalive, but the Linux default keepalive interval is longer than the firewall's timeout. When the firewall dropped the connection from the table it didn't spit out RST packets to anyone, it just silently stopped letting traffic flow.
When the first user of the day tried to log in, all three HA nodes believed their TCP connections were still alive and happy (since they had no reason not to think that) and had to wait for the connection to timeout before tearing those down and re-establishing them. That was a fun one to figure out...
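For what it's worth, when you do control the application, the usual fix is to make the keepalive interval shorter than the middlebox's idle timeout instead of relying on the Linux default (two hours). In Node, for example, it would look roughly like this (illustrative only; this product wasn't Node, and the host/port are made up):

// Illustrative: long-lived connection with an aggressive keepalive,
// so an idle-timeout firewall in the middle never forgets about it.
const net = require("net");

const sock = net.connect({ host: "peer-node.example", port: 8443 }, () => {
  sock.setKeepAlive(true, 30_000);  // first keepalive probe after 30s of idle
});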
The basic operation of this program is as follows:
1. Panic. You usually do so anyways, so you might as well get it over with. Just don't do anything stupid. Panic away from your machine. Then relax, and see if the steps below won't help you out.
2. ...I'm not sure this is really luck.
The fix is to just not use Math.abs. If they didn't work at Google they still would've done the same debugging and used the same fix. Working at Google probably harmed them as once they discovered Math.abs didn't work correctly they could've just immediately used `> 0` instead of asking the chrome team about it.
There's nothing lucky about slowly adding printf statements until you understand what the computer is actually doing; that's just good work.
You just needed to find another one like him, and bam, +4×.
(It is actually conceivable that two bad engineers could mostly cancel each other out, if they can occupy each other enough, but it’s not the most likely outcome.)
I had been promoted to technical writer and I needed a better test system that didn’t have customer data for screenshots. Something I needed was unique data because the archive used single instance storage, so I put together a bash script to create and send emails generated from random lines of public domain books I got from Gutenberg.
This worked great for me and at one point I had it fire off 1 million emails just for fun. I let my test email server and archive server chew on them over the weekend. It worked great but I had nearly maxed out my storage. No problem, use the deletion function. And it didn’t work.
It Didn't Work. I had reproduced the bug in-house on a system we had full control over. Engineering and QA both took copies of my environments and started working on the bug.
I also learned the lore of the deletion feature. The founding developer didn't think anyone wanted a deletion feature because it made no sense to him. But after pressure from the CEO, Board of Directors and customers, he banged out some code over a weekend and shipped it. It was now 10 years later, he was long gone, and it was finally beginning to bite us.
After devs banged on the code for a while, they found there was a design flaw: it failed if the number of items to delete was more than 500. QA had tested the feature repeatedly, but their test data set just happened to be smaller than 500 items, so the bug never triggered. I only exceeded that because Austin Powers is funny.
Now that we could reproduce it, we knew there was a design flaw and the deletion code needed to be replaced. It ended up taking over two years to replace, because project management never thought it was all that important compared to new features, even though customers were complaining about it.
<input name="tag" id="tag"
this failed in IE with very strange results. Took a long time to realize we had hit a browser bug and change it to: <input name="tagx" id="tagx"
which worked fine.Same, I assumed they were designed to always work. I suspect it was whatever app or library you were using that wasn't designed to handle them correctly.
The bug ticks most of the boxes for a tricky bug:
* Non-deterministic
* Enormous haystack
* Unexpected "1+1=3"-type error with a cause outside of the code itself
Like, sure, it would have been slower to debug if it took 30 hours to reproduce, and harder if he had to be going down Niagara Falls in a barrel while debugging it, but I'm not quite sure those things count.
I had a similar category of bug I was struggling with the other year[1] that was related to a faulty optimization in the GraalVM JVM leading to bizarre behavior in very rare circumstances. If I'd been sitting next to the right JVM engineers over at Oracle, I'm sure we'd have figured it out in days and not the weeks it took me.
Until comparatively recently, it was absurdly easy to crash machines via their graphics drivers, even by accident. And I bet a lot of them were security concerns, not just DoS vectors. WebGL has been marvellous at encouraging the makers to finally fix their drivers properly, because browsers declared that kind of thing unacceptable (you shouldn’t be able to bring the computer down from an unprivileged web page¹), and developed long blacklists of cards and drivers, and brought the methodical approach browsers had finally settled on to the graphics space.
Things aren’t perfect, but they are much better than ten years ago.
—⁂—
¹ Ah, fond memories of easy IE6 crashes, some of which would even BSOD Windows 98. My favourite was, if my memory serves me correctly, <script>document.createElement("table").appendChild(document.createElement("div"))</script>. This stuff was not robust.
The more abundant the undefined (mis)behavior, the more you're going to be tearing your hair out.
Almost the kind of frustration where you're supposed to have a logic-based system, and it rears its ugly head and defies logic anyway :\
For instance, you might be mistaken about the operation of a system in some way that prolongs an outage or complicates recovery. Or perhaps there are complicated commands that someone pasted in a comment in a Slack channel once upon a time and you have to engage in gymnastics with Sloogle™ to find them, while the PM and PO are requesting updates. Or you end up saving the day because of a random confluence of rabbit holes you'd traversed that week, but you couldn't expect anyone else on the team to have had the same flash of insight that you did.
That might be information that is valuable to document or add to training materials before it is forgotten. A lot of postmortems focus on the root cause, which is great and necessary, but don't look closely at the process of trying to stop the bleeding.
Not a hard thing to debug once the issue is noticed, and completely preventable (write specs in plain text).
I like the hard-earned lessons that are often taken away from such sessions.
While nowhere near the scale of this story, I helped a fellow student while I was at university whose program was outputting highly bogus numbers from punched-card deck input. I ultimately suggested that he print out the numbers that were being read by the program, and presto, the field alignments were off. This has now become my first step in debugging.
A co-op stint during my EE degree program was at a pulp bleach plant in Longview, Washington. They were implementing instrumentation of various metrics in the bleach tower. The engineers told a story about one of their instruments that measured flow or temperature or acidity. The instrument was failing, but the manufacturer couldn't find any flaw and shipped it back. The cycle repeated several times until one of the engineers accompanied the instrument to the repair lab. The technicians were standing the instrument on its side, not flat as it was in the instrument rack back at the plant. Laying it flat exposed the error.
Another bug sticks in my mind from reading Coders at Work by Peter Seibel. Guy Steele tells of a bug Bill Gosper reported in the bignum library. One thing that caught his eye was a conditional step he didn't quite understand. Since it was based on the division algorithms from Knuth: "And what caught my eye in Knuth was a comment that this step happens rarely—with a probability of roughly only one in two to the size of the word." The error was in a rarely-executed piece of code. The lesson here helped him find similar bugs.
While three of us were building a compiler at Sycor, we kept a large lab notebook in which we wrote brief release notes, and a one-line note about each bug we found and fixed.
My most recent bug was a new Emacs snippet causing errors in eval_buf. It made no sense, so I ultimately decided to clear out the .emacs.d directory and start over. There were files in there that were over 20 years old--I had just copied the directory each time I built a new machine.
It's just like those "what did you do when you had conflict with another employee" questions. I either worked it out with them like an adult or got our management involved and they worked it out for them. It's not some hero narrative I considered much past the time it happened.
As a software engineer, I've always been very proud of my thoroughness and attention to detail in testing my code. However, good QA people always leave me wondering "how did they even think to do that?" when reviewing bug reports.
QA is both a skillset AND a mindset.
A favourite of mine was a bug (specifically, a stack corruption) that I only managed to see under instrumentation. After a lot of debugging turns out that the bug was in the instrumentation software itself, which generated invalid assembly under certain conditions (calling one of its own functions with 5 parameters even though it takes only 4). Resolved by upgrading to their latest version.
https://hackaday.com/2025/01/23/this-qr-code-leads-to-two-we...
Me: "Your product is broken for all customers in this situation, probably has been so for years, here is the exact problem and how to fix it, can I talk with someone who can do the work?"
Customer Support: "Have you tried turning your machine off and turning it back on again?"
https://gyrovague.com/2015/07/29/crashes-only-on-wednesdays/
Turned out there was an undocumented MDM feature that would reboot the device if a package with a specific name wasn't running.
Upon decompilation it wasn't supposed to be active (they had screwed up and shipped a debug build of the MDM) and it was supposed to be 60 seconds according to the variable name, but they had mixed up milliseconds and seconds
I've had that experience. Turned out some boards in the wild didn't have the bodge wire that connected the shift register output to the gate that changed the behavior.
(And thanks for the war story!)
const arr = []; arr[false] = "hi";
which console.log(arr); - in FF at least - gives
Array []
false: "hi"
length: 0
which means
console.log(arr[Boolean(arr.length)]); returns
hi
which is funny, I just feel there must be an exploit somewhere among this area of things, but maybe not because it would be well covered.
On edit: for example, since the index could be produced - for some reason - by a numeric operation that outputs NaN, you would then have NaN: "hi"; or, since arr[-1] gives you "-1": "hi" but arr[0 - 1] returns that same "hi", there are obviously type conversions going on in the indexing... which has always struck me as a place you don't expect type conversions to be going on, the way you do with a == b.
Maybe I am just easily freaked out by things as I get older.