This morning I was reading about the analysis of an incident in which a London tube train drove away with its doors open. Nobody was harmed, or even in immediate danger; the train had relatively few passengers, and in fact they only alerted the driver at the next station. Classic British politeness: they made videos and took photographs, but they didn't use the emergency call button until the train reached a station.
Anyway, the underlying cause involved systems that were flooded with critical "I'm failing" messages and would just periodically reboot and press on. The train had been critically faulty for minutes, maybe even days, before the incident, but rather than fail and go out of service, its systems kept trying to press on. The safety systems wouldn't have allowed this failed train to drive with its doors open, but the safety-critical mistake of disabling those systems and driving the train anyway wouldn't have happened if the initial failure had caused the train to immediately go out of passenger service instead of limping on for who knows how long.
And I don't think he's making a system-level claim that the whole train system should be designed to limp on through failures. He's claiming that the kernel needs to be able to limp on so that the systems that use it have the best chance of, e.g., sending automated bug reports. (Or you can turn off the limping behavior if you want; maybe trains should do that. But maybe a train's control system randomly rebooting would be more catastrophic than leaving its doors open? I don't know.)
From a couple messages up-thread in the OP:
> … having behavior changes depending on context is a total disaster. And that's invariably why people want this disgusting thing.
> They want to do broken things like "I want to allocate memory, and I don't want to care where I am, so I want the memory allocator to just do the whole GFP_ATOMIC for me".
> And that is FUNDAMENTALLY BROKEN.
> If you want to allocate memory, and you don't want to care about what context you are in, or whether you are holding spinlocks etc, then you damn well shouldn't be doing kernel programming. Not in C, and not in Rust.
> It really is that simple. Contexts like this ("I am in a critical region, I must not do memory allocation or use sleeping locks") are fundamental to kernel programming. It has nothing to do with the language, and everything to do with the problem space.
> So don't go down this "let's have the allocator just know if you're in an atomic context automatically" path. It's wrong. It's complete garbage. It may generate kernel code that superficially "works", but one that is fundamentally broken, and will fail and become unreliable under memory pressure.
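To make his point concrete, here's a minimal sketch of what "caring about context" looks like in kernel C (the lock name and allocation size are made up for illustration). The caller declares the allocation context, because only the caller knows whether it's allowed to sleep:

    #include <linux/slab.h>
    #include <linux/spinlock.h>

    static DEFINE_SPINLOCK(example_lock);  /* hypothetical lock */

    static void *alloc_in_process_context(void)
    {
        /* No locks held: GFP_KERNEL may sleep to reclaim memory. */
        return kmalloc(64, GFP_KERNEL);
    }

    static void *alloc_under_spinlock(void)
    {
        void *p;

        spin_lock(&example_lock);
        /* Sleeping is forbidden here, so the caller must say
         * GFP_ATOMIC and be prepared for NULL under memory pressure. */
        p = kmalloc(64, GFP_ATOMIC);
        spin_unlock(&example_lock);
        return p;
    }

The information about which situation you're in lives at the call site, which is exactly why "just do GFP_ATOMIC for me" is the wrong abstraction.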
Which is safe. It's inconvenient, but it's safe. Failures of this sort do happen; electrical fires are probably the most extreme example. They're annoying, but nobody is at risk if you stop. And since the tube is in civilisation (even the extreme ends of the London Underground that lie outside London, like Chesham, are hardly wilderness; you can probably see a house from where your train stopped if there aren't trees in the way), we can just walk away.
https://commons.wikimedia.org/wiki/File:Chesham_Tube_Station...
> Linus was saying no, you carry on despite the error until you get to the next station
Depending on the error, the consequences of attempting to "carry on" may be fatal, and it's appropriate that the decision to attempt this rests with a human rather than being the normal function of a machine determined to get there regardless.
For example, say there's a bug in the Linux kernel that would produce a "panic" at midnight on Dec 31st 2022... do we accept a billion devices shutting down? Or, in the best case, rebooting and resuming whatever user-space program was running?
Despite the bad taste, I think the obvious answer is as Linus says: the kernel should keep going despite errors.
I'd say B is nearly always the better choice, because halting is a known state that it's almost always possible to recover from, whereas continuing in an unknown state may get you hacked or damage your peripherals. But if we were operating, say, a Mars rover, and shutting down meant we would never be able to boot again, then it'd be better to take kernel A and attempt to recover from whatever state we find ourselves in. That's pretty exotic, however.
In the case of an unanticipated error in a software component, we always need input from an external source to correct ourselves. When you're the kernel, that generally means either a human being or a hypervisor has to correct you, and it's better for them to do so from a halted state than from an entirely unknown one. Trying to muddle through regardless is super dangerous, and turns your software component into lava in the case of a fault.
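For what it's worth, "halt and wait for external correction" is exactly the pattern hardware watchdogs give you on Linux. A rough userspace sketch, assuming a watchdog driver is loaded and exposed at /dev/watchdog:

    #include <fcntl.h>
    #include <unistd.h>

    /* Pet the watchdog while the system is healthy. If this process
     * (or the kernel underneath it) wedges, the petting stops and
     * the hardware reboots the machine: external correction, no
     * human required. */
    int main(void)
    {
        int fd = open("/dev/watchdog", O_WRONLY);
        if (fd < 0)
            return 1;
        for (;;) {
            write(fd, "\0", 1);  /* any write resets the countdown */
            sleep(30);           /* must beat the watchdog timeout */
        }
    }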
That you view it as exotic is partly a lack of imagination on your part; with a little more effort it's possible to identify similar use cases that are much closer to home than Mars.
But that doesn't really matter. What matters is that the Linux kernel needs to support both options, because it's just one component in a larger system, and the context outside the kernel is what determines which option is correct for that system.
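And it does support both, at the integrator's discretion. For example, these stock sysctls (documented in the kernel admin guide) let you choose between hanging forever, rebooting after a delay, and escalating recoverable errors into a halt:

    # illustrative /etc/sysctl.conf policy, not a recommendation
    kernel.panic = 10          # reboot 10 seconds after a panic; 0 = hang forever
    kernel.panic_on_oops = 1   # escalate any oops into a full panic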
If you feel there are some that would add to this conversation, feel free to share them.
There's a video of passengers doing this for real in this 2016 news article: