zlacker

1. notaco+(OP)[view] [source] 2022-10-02 20:55:11
In a system already designed to handle the sudden and possibly permanent loss of a single machine to hardware failure, those are nice to have at best. "Panic" doesn't have to mean not executing a single other instruction. Logging e.g. over the network is one of the things a system might do as part of its death throes, and definitely was for the last few such systems I worked on. What's important is that it not touch storage any more, or issue instructions to other machines to do so, or return any more possibly-corrupted data to other systems. For example, what if the faulty machine itself is performing block reconstruction when it realizes the world has turned upside down? Or if it returns a corrupted shard to another machine that's doing such reconstruction? In both of those scenarios the whole block could be corrupted even though that machine's local storage is no longer involved. I've seen both happen.
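
To make the second scenario concrete, here is a minimal sketch of reconstruction from a corrupted peer shard. It assumes plain XOR parity over three data shards, which is far simpler than the erasure codes real storage systems use; the helper names (`xor_bytes`, `parity`, `reconstruct`) and the shard contents are made up for illustration.

```python
from functools import reduce


def xor_bytes(a: bytes, b: bytes) -> bytes:
    """XOR two equal-length byte strings."""
    return bytes(x ^ y for x, y in zip(a, b))


def parity(shards):
    """Parity shard = XOR of all data shards (hypothetical scheme)."""
    return reduce(xor_bytes, shards)


def reconstruct(surviving, parity_shard):
    """Rebuild the one missing shard from the survivors plus parity."""
    return reduce(xor_bytes, surviving, parity_shard)


data = [b"AAAA", b"BBBB", b"CCCC"]
p = parity(data)

# Shard 1 is lost. A healthy machine rebuilds it from healthy peers.
good = reconstruct([data[0], data[2]], p)

# Same rebuild, but the faulty machine hands over a corrupted shard 0.
corrupted_shard0 = b"AAAZ"
bad = reconstruct([corrupted_shard0, data[2]], p)

print(good == data[1])  # rebuild from clean inputs matches the lost shard
print(bad == data[1])   # rebuild from a corrupted input does not
```

The bad result gets written to a perfectly healthy machine, so the damage spreads without the faulty machine's own disks being touched again.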

Since the mechanisms for ensuring the orderly stoppage of all such activity system-wide are themselves complicated and possibly error-prone, and more importantly not present in a commodity OS such as Linux, the safe option is "opt in" rather than "opt out". In other words, don't try to say you must stop X and Y and Z ad infinitum. Instead say you may only do A and B and nothing else. That can easily be accomplished with a panic, where certain parts such as dmesg are specifically enabled between the panic() call and the final halt instruction. Making that window bigger, e.g. to return errors to clients who don't really need them, only creates further potential for destructive activity to occur, and IMO is best avoided.
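
A rough sketch of the "opt in" idea, in userspace Python rather than anything resembling actual kernel code: once the panic flag is set, only actions on an explicit allowlist go through, and everything else is refused by default. The class name `PanicGate` and the action names are hypothetical.

```python
class PanicGate:
    """Toy model of an opt-in panic window (not how any real kernel does it)."""

    # The only things permitted between panic() and the final halt.
    ALLOWED_AFTER_PANIC = frozenset({"log_local", "log_network"})

    def __init__(self):
        self.panicked = False
        self.performed = []

    def panic(self):
        self.panicked = True

    def perform(self, action: str) -> bool:
        """Run an action if permitted; return whether it was allowed."""
        if self.panicked and action not in self.ALLOWED_AFTER_PANIC:
            return False  # refused: not explicitly opted in
        self.performed.append(action)
        return True


gate = PanicGate()
before = gate.perform("write_block")          # normal operation

gate.panic()
log_ok = gate.perform("log_network")          # explicitly enabled
write_ok = gate.perform("write_block")        # storage is off limits
reply_ok = gate.perform("reply_to_client")    # so is talking to clients
future_ok = gate.perform("some_new_feature")  # unknown actions fail closed
```

The point of the design is the last line: with a denylist, every new activity is allowed after a panic until somebody remembers to forbid it; with an allowlist, it is forbidden until somebody decides it is safe.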

Note that this is a fundamental difference between a user (compute-centric) view of software and a systems/infra view. It's actually the point Linus was trying to get across, even if he picked a horrible example. What's arguably better in one domain might be professional malfeasance in the other. Given the many ways Linux is used, saying that "stopping is not an option" is silly, and "continuing is not an option" would be equally so. My point is not that what's true for my domain must be true for others, but that both really are and must remain options.

P.S. No, stopping userspace is not stopping everything, and not what I was talking about. Or what you were talking about until the narrowing became convenient. Your reply is a non sequitur. Also, I can see from other comments that you already agree with points I have made from the start - e.g. that both must remain options, that the choice depends on the system as a whole. Why badger so much, then? Why equivocate on the importance of (or even the meaningful difference between) kernel and userspace? Heightening conflict for its own sake isn't what this site is supposed to be about.

replies(1): >>wtalli+t2
2. wtalli+t2[view] [source] 2022-10-02 21:12:30
>>notaco+(OP)
> "Panic" doesn't have to mean not executing a single other instruction.

We're talking specifically about the current meaning of a Linux kernel panic. That means an immediate halt to all of userspace.
