This morning I was reading about the analysis of an incident in which a London tube train drove away with its doors open. Nobody was harmed, or even in immediate danger; the train had relatively few passengers, and in fact they only alerted the driver at the next station. Classic British politeness: they made videos and took photographs, but they didn't use the emergency call button until the train reached a station.
Anyway, the underlying cause involved systems that were flooded with critical "I'm failing" messages and would just periodically reboot and press on. The train had been critically faulty for minutes, maybe even days, before the incident, but rather than fail and go out of service, its systems kept trying to press on. The safety systems wouldn't have allowed this failed train to drive with its doors open, but the safety-critical mistake of disabling those systems and driving the train anyway wouldn't have happened if the initial failure had caused the train to immediately go out of passenger service instead of limping on for who knows how long.
And I don't think he's making a system-level claim that the whole train system should be designed to limp on through failures. He's claiming that the kernel needs to be able to limp on so that the systems that use it have the best chance of, e.g., sending automated bug reports. (Or you can turn off the limping behavior if you want; maybe trains should do that. But maybe a train's control system randomly rebooting would be more catastrophic than leaving its doors open? I don't know.)
From a couple messages up-thread in the OP:
> … having behavior changes depending on context is a total disaster. And that's invariably why people want this disgusting thing.
> They want to do broken things like "I want to allocate memory, and I don't want to care where I am, so I want the memory allocator to just do the whole GFP_ATOMIC for me".
> And that is FUNDAMENTALLY BROKEN.
> If you want to allocate memory, and you don't want to care about what context you are in, or whether you are holding spinlocks etc, then you damn well shouldn't be doing kernel programming. Not in C, and not in Rust.
> It really is that simple. Contexts like this ("I am in a critical region, I must not do memory allocation or use sleeping locks") are fundamental to kernel programming. It has nothing to do with the language, and everything to do with the problem space.
> So don't go down this "let's have the allocator just know if you're in an atomic context automatically" path. It's wrong. It's complete garbage. It may generate kernel code that superficially "works", but one that is fundamentally broken, and will fail and become unreliable under memory pressure.
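To make his point concrete, here's a minimal sketch of what "caring about context" looks like in kernel C (the lock name and allocation size are made up for illustration). The caller declares the allocation context, because only the caller knows whether it's allowed to sleep:

    #include <linux/slab.h>
    #include <linux/spinlock.h>

    static DEFINE_SPINLOCK(example_lock);  /* hypothetical lock */

    static void *alloc_in_process_context(void)
    {
        /* No locks held: GFP_KERNEL may sleep to reclaim memory. */
        return kmalloc(64, GFP_KERNEL);
    }

    static void *alloc_under_spinlock(void)
    {
        void *p;

        spin_lock(&example_lock);
        /* Sleeping is forbidden here, so the caller must say
         * GFP_ATOMIC and be prepared for NULL under memory pressure. */
        p = kmalloc(64, GFP_ATOMIC);
        spin_unlock(&example_lock);
        return p;
    }

The information about which situation you're in lives at the call site, which is exactly why "just do GFP_ATOMIC for me" is the wrong abstraction.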
Which is safe. It's inconvenient, but it's safe. Failures of this sort do happen; electrical fires are probably the most extreme example. They're annoying, but nobody is at risk if you stop. And since the tube is in civilisation (even the extreme ends of the London Underground that lie outside London, like Chesham, are hardly wilderness; you can probably see a house from where your train stopped if there aren't trees in the way), we can just walk away.
https://commons.wikimedia.org/wiki/File:Chesham_Tube_Station...
> Linus was saying no, you carry on despite the error until you get to the next station
Depending on the error, the consequences of attempting to "carry on" may be fatal, and it's appropriate that the decision to attempt this rests with a human rather than being the normal function of a machine determined to get there regardless.
For example, say there's a bug in the Linux kernel that would produce a "panic" at midnight on Dec 31st 2022... do we accept a billion devices shutting down? Or, in the best case, rebooting and resuming whatever user-space program was running?
Despite the bad taste, I think the obvious answer is as Linus says: the kernel should keep going despite errors.
I'd say B is nearly always the better choice, because halting is a known state that it's almost always possible to recover from, whereas continuing in an unknown state may get you hacked or damage your peripherals. But if we were operating, say, a Mars rover, and shutting down meant we would never be able to boot again, then it'd be better to take kernel A and attempt to recover from whatever state we find ourselves in. That's pretty exotic, however.
In the case of an unanticipated error in a software component, we always need input from an external source to correct ourselves. When you're the kernel, that generally means either a human being or a hypervisor has to correct you, and it's better for them to do so from a halted state than from an entirely unknown one. Trying to muddle through regardless is super dangerous, and turns your software component into lava in the case of a fault.
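For what it's worth, "halt and wait for external correction" is exactly the pattern hardware watchdogs give you on Linux. A rough userspace sketch, assuming a watchdog driver is loaded and exposed at /dev/watchdog:

    #include <fcntl.h>
    #include <unistd.h>

    /* Pet the watchdog while the system is healthy. If this process
     * (or the kernel underneath it) wedges, the petting stops and
     * the hardware reboots the machine: external correction, no
     * human required. */
    int main(void)
    {
        int fd = open("/dev/watchdog", O_WRONLY);
        if (fd < 0)
            return 1;
        for (;;) {
            write(fd, "\0", 1);  /* any write resets the countdown */
            sleep(30);           /* must beat the watchdog timeout */
        }
    }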
That you view it as exotic is partly a lack of imagination on your part; with a little more effort it's possible to identify similar use cases that are much closer to home than Mars.
But that doesn't really matter. What matters is that the Linux kernel needs to support both options, because it's just one component in a larger system, and the context outside the kernel is what determines which option is correct for that system.
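And it does support both, at the integrator's discretion. For example, these stock sysctls (documented in the kernel admin guide) let you choose between hanging forever, rebooting after a delay, and escalating recoverable errors into a halt:

    # illustrative /etc/sysctl.conf policy, not a recommendation
    kernel.panic = 10          # reboot 10 seconds after a panic; 0 = hang forever
    kernel.panic_on_oops = 1   # escalate any oops into a full panic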
If you feel there are some that would add to this conversation, feel free to share them.
There's a video of passengers doing this for real in this 2016 news article: