zlacker

The policy of ‘oopsing’ and limping on is, in my opinion, literally one of Linux’s worst features. It has bitten me in various cases:

- Remember when Linux had that caused the kernel to partially crash and eat 100% CPU due to some bug in the leap second application code? That caused a >1MW spike in power usage at Hetzner at the time. That must have been >1GW globally. Many people didn’t notice it immediately, so it must have taken weeks before everyone rebooted.

- I’ve personally run into issues where not crashing caused Linux to go on and eat my file system.

On any Linux server I maintain, I always toggle those sysctls that cause the kernel to panic on oops, and reboot on panic.

replies(7): >>xyzzy_+h2 >>pca006+B3 >>mike_h+c4 >>Miguel+k5 >>UncleM+68 >>amluto+Te >>notaco+3y

>>EdScho+(OP)
Right, the fact that you can toggle those sysctls is sort of the point. There are plenty of environments where running at all is better than not, hardware watchdogs can solve for unresponsiveness.

TFA is about making it possible for the kernel to decide what to do, rather than exploding on the spot, which is terrible.

>>EdScho+(OP)
Probably not very linux, but consider the case where a device is controlling a motor speed by setting its current, computed by getting its current speed and do a simple PID control. If your device crashed (due to failure of something else, for example some LED display) while the current output is pretty large, you may cause some serious damage to whatever that motor is connected to. The thing the system should be able to do is notify such error and gracefully shutdown, for example sending commands to shutdown everything else.

Although I think this can be better done by some special panic handler that performs a sjlj and notify other systems about the failure, without continuing running with the wrong output...

replies(1): >>Gare+Aa

>>EdScho+(OP)
So instead of a power spike, we'd have had a major internet outage across the world, across the entire industry and beyond, probably, if everyone had panicked on oops. The blame really lies with people not monitoring their systems.

As you said, you have the option to reboot on panic, but Linus is absolutely not wrong that this size does not fit all.

What about a medical procedure that WILL kill the patient if interrupted? What about life support in space? Hitting an assert in those kinds of systems is a very bad place to be, but an automatic halt is worse than at least giving the people involved a CHANCE to try and get to a state where it's safe to take the system offline and restart it.

replies(6): >>Tomte+w6 >>ok_dad+d7 >>notaco+ny >>acje+6K1 >>stouse+kL1 >>kaba0+ou2

>>EdScho+(OP)
What are those sysctls? It was worth my time to read Hacker News this morning.

replies(1): >>stjohn+Fq

>>mike_h+c4
You won't see Linux in those fields. I'm aware of a project by OSADL to qualify Linux for SIL 2, but your examples are way beyond that.

replies(1): >>retard+rR9

>>mike_h+c4
> What about a medical procedure that WILL kill the patient if interrupted? What about life support in space? Hitting an assert in those kinds of systems is a very bad place to be, but an automatic halt is worse than at least giving the people involved a CHANCE to try and get to a state where it's safe to take the system offline and restart it.

Kinda a strawman there. That's got to account for, what, 0.0001% of all use of computers, and probably they would never ever use Linux for these applications (I know medical devices DO NOT use Linux).

replies(3): >>gmueck+C8 >>goodpo+2h >>Kim_Br+xv2

>>EdScho+(OP)
It is also a fabulous way of introducing vulnerabilities.

>>ok_dad+d7
Do you know absolutely every medical device in existence and do you know how broad the definition of a medical device is (including e.g. the monitor attached to the PC used for displaying X-ray images)?

replies(2): >>jmilli+v9 >>ok_dad+ec

>>gmueck+C8

  > including e.g. the monitor attached to the PC used for displaying
  > X-ray images

Somewhat off-topic, but I used to work in a dental office. The monitors used for displaying X-rays were just normal monitors, bought from Amazon or Newegg or whatever big-box store had a clearance sale. Even the X-ray sensors themselves were (IIRC) not regulated devices, you could buy one right now on AliExpress if you wanted to.

replies(1): >>gmueck+cm

>>pca006+B3
Such a device should have a simple protective circuit that doesn't allow this. This is common in any expensive or critical industrial system.

replies(1): >>yellow+LJ1

>>gmueck+C8
I worked in medical device quality control and so, yes, I know all about the FDA requirements for medical devices and ISO 13485. I can say, with certainty, that base Linux would not be allowed to run in a medical device in the USA. It's software of unknown provenance (SOUP) and would absolutely NOT be used as-is.

replies(6): >>smolde+0i >>gmueck+Tn >>voakba+Ip >>sarlal+2W >>cplusp+0v1 >>Suzura+GE2

>>EdScho+(OP)
As a kernel developer, I mostly disagree. Panicking hard is nice unless you are the user whose system rebooted without explanation or the developer trying to handle the bug report saying “my system rebooted and I have nothing more to say”.

Getting logs out is critical.

replies(1): >>EdScho+bg

>>amluto+Te
One does not rule out the other. You could simply write crash info to some NVRAM or something, and then do a reboot. Then you can recover it during the next boot.

But there is no need to let userspace processes continue to run, which is exactly what Linux does.

replies(1): >>wtalli+8m

>>ok_dad+d7
> I know medical devices DO NOT use Linux

Absolutely false.

>>ok_dad+ec
Makes me wonder what they run their NAS software with. Or their internal web-hosting, or their networking devices, or any of the other devices they have littered about. I'd swear on the Bible that I've seen a dentist or two running KDE 3 before...

replies(1): >>ok_dad+mt

>>EdScho+bg
> You could simply write crash info to some NVRAM or something, and then do a reboot. Then you can recover it during the next boot.

That works for some systems: those for which "some NVRAM or something" evaluates to a real device usable for that purpose. Not all Linux systems provide such a device.

> But there is no need to let userspace processes continue to run, which is exactly what Linux does.

Userspace processes usually contain state that the user would also like to be persisted before rebooting. If my WiFi driver crashes, there's nothing helpful or safer about immediately bringing down the whole system when it's possible to keep running with everything but networking still functioning.

replies(2): >>EdScho+Do >>kaba0+Kx2

>>jmilli+v9
That's not the case in the EU. I've worked for an equipment manufacturer for dental clinics. While the monitors were allowed to be off the shelf, the operator (dental clinic) is required to make sure that they work properly and display the image correctly - obey certain brightness and color resolution/calibration standards. Our display software had to refuse to work on an uncalibrated monitor.

replies(1): >>alias_+yP

>>ok_dad+ec
Then you should know that the use of SOUP is not so clear cut. It depends on the class of device and more specifically, on the part of the device that the software is used on. I know medical devices running SOUP operating systems like Linux. They went to some length to show that the parts running Linux and the critical functions of the device were sufficiently independent. This isolation is specifically allowed by the standards you quote.

It's even worse on things like car dashboards: some warning lights on dashboards need to be ASIL-D conformant, which is quite strict. However, developing the whole dashboard software stack to that standard is too expensive. So the common solution these days is to have a safe, ASIL-D compliant compositor and a small renderer for the warning lights section of the display while the rendering for all the flashy graphics runs in an isolated VM on standard software with lower safety requirements. It's all done on the same CPU and GPU.

replies(1): >>ok_dad+it

>>wtalli+8m
> If my WiFi driver crashes, there's nothing helpful or safer about immediately bringing down the whole system when it's possible to keep running with everything but networking still functioning.

There have been various examples of WiFi driver bugs leading to security issues. Didn’t some Broadcom WiFi driver once have a bug in how it processed non-ASCII SSID names, allowing you to trigger remote code execution?

replies(1): >>wtalli+5s

>>ok_dad+ec
That’s an odd thing to claim. I have worked on certified medical devices that run custom Linux distribution.

Mind you, that experience also severely soured me on the quality of medical software systems, due to poor quality of the software that ran in that distribution. Linux itself was a golden god in comparison to the crap that was layered on top of it.

replies(1): >>ok_dad+Dt

>>Miguel+k5
Maybe these? https://www.cyberciti.biz/tips/reboot-or-halt-linux-system-i...

>>EdScho+Do
We're not talking about bugs in general, we're talking about bugs whose manifestation is caught by error checking already in the code. For device drivers, those situations can often be handled safely by simply disabling the device in question while leaving the rest of the OS running. I doubt the Broadcom bug you're thinking of triggered a WARN_ON() in the code path allowing for a remote code execution. (Also, the highest-profile Broadcom WiFi remote code execution bug I'm aware of was a bug in the WiFi chip's closed-source firmware, which doesn't even run on the same processor as the host Linux OS.)

>>gmueck+Tn
> They went to some length to show that the parts running Linux and the critical functions of the device were sufficiently independent.

Let's not be too pedantic. You, as an experienced medical device engineer, probably knew what I meant was that they would never use Linux in the critical parts of a medical device as the OP had originally argued. Any device would definitely do all of it's functionality without the part with Linux on it.

The OP was still a major strawman, regardless of my arguments, because the Linux kernel will never be in the critical path of a medical device without a TON of work to harden it from errors and such. Just the fact that Linus' stance is as said would mean that it's not an appropriate kernel for a medical device, because they should always fail with an error and stop under unknown conditions rather than just doing some random crap.

>>smolde+0i
Those aren't medical devices.

>>voakba+Ip
I'd like to hear more about that, but I assume it's much like the other poster here that described a Linux system that is a peripheral device attached to the actual medical device that does the medical shit.

replies(2): >>gmueck+MG >>voakba+pH2

>>EdScho+(OP)
Yeah, this part has never really been true.

> In the kernel, "panic and stop" is not an option

That's simply not true. It's an option I've seen exercised many times, even in default configurations. Furthermore, for some domains - e.g. storage - it's the only sane option. Continuing when the world is clearly crazy risks losing or corrupting data, and that's far worse than a crash. No, it's not weird to think all types of computation are ephemeral or less important than preserving the integrity of data. Especially in a distributed context, where this machine might be one of thousands which can cover for a transient loss of one component but letting it continue to run puts everything at risk, rebooting is clearly the better option. A system that can't survive such a reboot is broken. See also: Erlang OTP, Recovery Oriented Computing @ Berkeley.

Linus is right overall, but that particular argument is a very bad one. There are systems where "panic and stop" is not an option and there are systems where it's the only option.

replies(1): >>wtalli+gz

>>mike_h+c4
> What about a medical procedure that WILL kill the patient if interrupted? What about life support in space?

The proper answer to those is redundancy, not continuing in an unknown and quite likely harmful state.

replies(2): >>yencab+rM >>John23+bK1

>>notaco+3y
> Furthermore, for some domains - e.g. storage - it's the only sane option.

Can you elaborate on this? Because failing storage is a common occurrence that usually does not warrant immediately crashing the whole OS, unless it's the root filesystem that becomes inaccessible.

replies(1): >>notaco+rF

>>wtalli+gz
Depends on what you mean by "failing storage" but IMX it does warrant an immediate stop (with or without reboot depending on circumstances). Yes, for some kinds of media errors it's reasonable to continue, or at least not panic. Another option in some cases is to go read-only. OTOH, if either media or memory corruption is detected, it would almost certainly be unsafe to continue because that might lead to writing the wrong data or writing it to the wrong place. The general rule in storage is that inaccessible data is preferable to lost, corrupted, or improperly overwritten data.

Especially in a distributed storage system using erasure codes etc., losing one machine means absolutely nothing even if it's permanent. On the last storage project I worked on, we routinely ran with 1-5% of machines down, whether it was due to failures or various kinds of maintenance actions, and all it meant was a loss of some capacity/performance. It's what the system was designed for. Leaving a faulty machine running, OTOH, could have led to a Byzantine failure mode corrupting all shards for a block and thus losing its contents forever.

BTW, in that sort of context - where most bytes in the world are held BTW - the root filesystem is more expendable than any other. It's just part of the access system, much like firmware, and re-imaging or even hardware replacement doesn't affect the real persistence layer. It's user data that must be king, and those media whose contents must be treated with the utmost care.

replies(1): >>wtalli+mK

>>ok_dad+Dt
It is not a peripheral device if it runs the UI with all the main controls, is it?

replies(1): >>ok_dad+yD1

>>notaco+rF
I understand why a failing drive or apparently corrupt filesystem would be reason to freeze a filesystem. But that's nowhere close to kernel panic territory.

Even in a distributed, fault-tolerant multi-node system, it seems like it would be useful for the kernel to keep running long enough for userspace to notify other systems of the failure (eg. return errors to clients with pending requests so they don't have to wait for a timeout to try retrieving data from a different node) or at least send logs to where ever you're aggregating them.

replies(1): >>notaco+0O

>>notaco+ny
The leap second bug would have crashed all nodes of a redundant system, at the same time...

replies(1): >>notaco+vP

>>wtalli+mK
In a system already designed to handle the sudden and possibly permanent loss of a single machine to hardware failure, those are nice to have at best. "Panic" doesn't have to mean not executing a single other instruction. Logging e.g. over the network is one of the things a system might do as part of its death throes, and definitely was for the last few such systems I worked on. What's important is that it not touch storage any more, or issue instructions to other machines to do so, or return any more possibly-corrupted data to other systems. For example, what if the faulty machine itself is performing block reconstruction when it realizes the world has turned upside down? Or if it returns a corrupted shard to another machine that's doing such reconstruction? In both of those scenarios the whole block could be corrupted even though that machine's local storage is no longer involved. I've seen both happen.

Since the mechanisms for ensuring the orderly stoppage of all such activity system-wide are themselves complicated and possibly error-prone, and more importantly not present in a commodity OS such as Linux, the safe option is "opt in" rather than "opt out". In other words, don't try to say you must stop X and Y and Z ad infinitum. Instead say you may only do A and B and nothing else. That can easily be accomplished with a panic, where certain parts such as dmesg are specifically enabled between the panic() call and the final halt instruction. Making that window bigger, e.g. to return errors to clients who don't really need them, only creates further potential for destructive activity to occur, and IMO is best avoided.

Note that this is a fundamental difference between a user (compute-centric) view of software and a systems/infra view. It's actually the point Linus was trying to get across, even if he picked a horrible example. What's arguably better in one domain might be professional malfeasance in the other. Given the many ways Linux is used, saying that "stopping is not an option" is silly, and "continuing is not an option" would be equally so. My point is not that what's true for my domain must be true for others, but that both really are and must remain options.

P.S. No, stopping userspace is not stopping everything, and not what I was talking about. Or what you were talking about until the narrowing became convenient. Your reply is a non sequitur. Also, I can see from other comments that you already agree with points I have made from the start - e.g. that both must remain options, that the choice depends on the system as a whole. Why badger so much, then? Why equivocate on the importance (or even meaningful difference) between kernel vs. userspace? Heightening conflict for its own sake isn't what this site is supposed to be about.

replies(1): >>wtalli+tQ

>>yencab+rM
Perhaps. On the other hand, letting a medical device continue moving an actuator or dispensing a medication when it's known to be in a bad "never happen" state could also be fatal. Ditto for the "life support in space" example. Ditto for anything reliant on position, where the system suddenly realizes it has no idea whether its position is correct. Imagine that e.g. on a warship. Limiting responses to external inputs (including time adjustments) can ameliorate such problems. So can software diversity. Many safety-critical systems require one or both, and other measures as well. Picking one black-swan event while ignoring literally every day scenarios doesn't seem very helpful. That's especially true when the thing you're advocating is what actually happened and led to its own bad outcomes.

replies(1): >>yencab+jq1

>>gmueck+cm
Interesting, how does your software detect an uncalibrated monitor? Did it come with a calibration device which had to be used to scan the display output to check?

I don't suppose monitors report calibration data back to display adapters do they?

replies(2): >>Gauntl+SQ >>gmueck+sS

>>notaco+0O
> "Panic" doesn't have to mean not executing a single other instruction.

We're talking specifically about the current meaning of a Linux kernel panic. That means an immediate halt to all of userspace.

>>alias_+yP
My guess is they had some heuristic based on EDIDs, which are incredibly easy to spoof.

https://smile.amazon.com/EVanlak-Passthrough-Generrtion-Elim...

replies(1): >>gmueck+GT

>>alias_+yP
I didn't work on that specific software team and it has been a long time since I worked there. But the software came with its custom calibration routine and I believe that the calibration result was stored with model and serial number information from the monitor EDID.

replies(1): >>alias_+EQ1

>>Gauntl+SQ
Yes, but why would you go to these lengths? The purpose of the whole mechanism is to prevent accidental misdiagnosis based on an incorrectly interpreted X-ray image. This isn't DRM, just a safeguard against incorrect use of equipment.

replies(1): >>Gauntl+3E1

>>ok_dad+ec
Ok, that's good for a U.S. centric view. Do you know that every medical device manufactured in China, for use in China meets the same requirements? Same for India, Russia, etc. The U.S. isn't the world and I'd be surprised if Linux weren't in use in some critical systems around the world that would be shocking for U.S. experts on those types of systems.

>>notaco+vP
Picking medical devices and warships is also quite the cherry picking. Most Linuxes aren't like that. Critical embedded systems tend to have a hard realtime component, and if Linux is on the system it sits under e.g. seL4, or on a different CPU.

At the end of the day, what Linux does is what Linus wants out of it. He's stated, often, that halting the CPU at the exact moment something goes wrong is not the goal. If your goal is to do that, you might not be able to use Linux. If your goal is to put Rust in the Linux kernel, you might have to let go of your goal.

replies(1): >>notaco+IA2

>>ok_dad+ec
Surely we can “harden” Linux for this application?

>>gmueck+MG
No, do you have a concrete example of this strawman, though?

Edit: I should also add (probably earlier too) that all my examples are specific to the USA FDA process. I'm sure some other place might not have the same rules.

replies(1): >>gmueck+cQ1

>>gmueck+GT
People are cheap and corrupt. The speed bump this presents is real, but minor, in the face of a couple medical shops looking to save $100/pop on a dozen monitors.

I hope it's rare, but I think a persistent nag window ("Your display isn't calibrated and may not be accurate") is probably a better answer than refusing to work altogether, because it will be clear about the source of the problem and less likely to get nailed down.

replies(2): >>gmueck+SQ1 >>kaba0+Iw2

>>Gare+Aa
"Should" unfortunately ain't the same as "does". The Torvaldsian (for lack of a better word) attitude seems to be to assume that someone is indeed dumb enough to design a system wherein all safety measures are software-defined, and in such a situation the software in question probably shouldn't catastrophically fail on every last failed assertion.

>>mike_h+c4
I remember working on some telecom equipment in the 90ies. It had a x86/Unix feature rich distributed management system. In other words complicated and expected to fail. The solution was a “watch dog” circuit that the main CPU had to poll every 100ms or so. Three misses and the CPU would get hard rebooted by the watch dog.

This reminds me of two things. Good system design needs a hardware-software codesign. Oxide computers has identified this, but it was probably much more common before the 90ies than after. The second thing is that all things can fail so a strategy that only hardens the one component is fundamentally limited, even flawed. If the component must not fail you need redundancy and supervision. Joe Armstrong would be my source of quote if I needed to find one.

Both rust and Linux has some potential for improvement here, but the best answers may lie in their relation to the greater system, rather than within it self. I’m thinking of WASM and hardware codesign respectively.

>>notaco+ny
Right. It's clear that many people have not heard of, or considered, Therac-25[1].

[1] https://en.wikipedia.org/wiki/Therac-25

replies(1): >>eesmit+CW1

>>mike_h+c4
> What about a medical procedure that WILL kill the patient if interrupted?

Allow me to introduce you to Therac-25: https://en.wikipedia.org/wiki/Therac-25

>>ok_dad+yD1
I can't see how you can make out a strawman in what I said. There are medical devices where the UI is running on a processor separate from the controller in charge of the core device functions. The two are talking to each other and there is no secondary way of interacting with the controller. This lessens the requirements that are put on the part running the UI, but does not eliminate them.

I'm mostly familiar with EU rules, but as far as I know the FDA regulations follow the same idea of tiered requirements based on potential harm done.

replies(1): >>ok_dad+P42

>>gmueck+sS
Thanks, sounds like I need to do some reading about EDIDs; I knew _of_ them but no real understanding is what they are and what they do.

>>Gauntl+3E1
You have to draw a line somewhere, I guess. As far as I remember, protections against accidental misuse and foreseeable abuse of a device are required in medical equipment. But malicious circumvention of protections or any kind of active tampering are a whole other category in my opinion.

>>John23+bK1
Therac-25 removed redundancy. Quoting the Wikipedia article: "Previous models had hardware interlocks to prevent such faults, but the Therac-25 had removed them, depending instead on software checks for safety."

replies(1): >>John23+X0c

>>gmueck+cQ1
The UI is one of the most important parts of a machine, look at the Therac-25! The FDA regulations require a lot of effort goes into the human factors, too, and the UI definitely had to be as reliable as the rest of the device and be as well engineered as the rest.

https://www.fda.gov/medical-devices/human-factors-and-medica...

Honestly, the FDA regulations go too far vs the EU regs. The company I worked for was based in the EU and the products there were so advanced compared to our versions. Ours were all based on an original design from Europe that was approved and then basically didn’t charge for 30 years. The European device was fucking cool and had so many features, it was also capable of being carried around rather than rolled. The manufacturing was almost all automated, too, but in the USA it was not at all automated, it was humans assembling parts then recording it in a computer terminal.

>>mike_h+c4
That’s why monitoring, fail-safe power offs and redundant systems are important. E.g. even at the complete failure of a CAT scanner’s higher level control (which let’s say would run on an embedded linux kernel), the system would safely stop the radiation and power off, without any instructions from an OS. Here, an inconsistent state from the OS is actually more dangerous than stopping in the middle (e.g. the OS stucks and the same, high energy radiation is continuously being released)

>>ok_dad+d7
Industrial control systems (sadly, imho) don't use Linux as often as they could/should, but such systems do have the ability to injure their operators or cause large amounts of damage of course. [1]

The first priority is safety, absolutely and without question. And then the immediate second priority is the fact that time is money. For every minute that the system is not operating, x amount of product is not being produced.

Generally, having the software fully halt on error is both dangerous and time-consuming.

Instead you want to switch to an ERROR and/or EMERGENCY_STOP state, where things like lasers or plasma torches get turned off, motors are stopped, brakes are applied, doors get locked/unlocked (as appropriate/safe), etc. And then you want to report that to the user, and give them tools to diagnose and correct the source of the error and to restart the machine/line [safely!] as quickly as possible.

In short, error handling and recovery is its own entire thing, and tends to be something that gets tested for separately during commissioning.

[1] PLC's do have the ability to <not stop> and execute code in a real time manner, but I haven't encountered a lot of PLC programmers who actually exploit these abilities effectively. Basically for more complex situations you're quickly going to be better off with more general purpose tools [2], at most handing off critical tasks to PLCs, micro-controllers, or motor controllers etc.

[2] except for that stupid propensity to give-up-and-halt at exactly that moment where it'll cause the most damage.

>>Gauntl+3E1
Medical devices are insanely expensive (a CT scanner may reach a million dollars), you won’t risk $100 on such a small thing as a screen.

>>wtalli+8m
Isn’t that exactly the reason behind microkernel’s supposed superiority?

replies(1): >>wtalli+4Q3

>>yencab+jq1
> Picking medical devices ... is also quite the cherry picking

It wasn't my example. It was mike_hock's, and I was responding in the context they had set.

> Most Linuxes aren't like that.

Your ally picked the medical-device and space-life-support examples. If you think they're invalid because such systems don't use Linux, why did you forego bringing it up with them and then change course when replying to me? As I said: not helpful.

The point is not specific to Linux, and more Linux systems than you seem to be aware of do adopt the "crash before doing more damage" approach because they have some redundancy. If you're truly interested, I had another whole thread in this discussion explaining one class of such cases in what I feel was a reasonably informative and respectful way while another bad-faith interlocutor threw out little more than one-liners.

>>ok_dad+ec
I am an American citizen and a former dialysis patient, now kidney transplant recipient. I have watched in-center dialysis machines reboot during treatment, show the old "Energy Star" BIOS logo, and then boot Linux...

Felt kinda bad until I thought about how well a "Linux literally killed me" headline would do on HN, but then I realized I wouldn't be able to post the article if I actually died. Such is life. Or death? One or the other.

>>ok_dad+Dt
These were not peripherals. We are talking devices that would be front line in an emergency room. Terrifying.

>>kaba0+Kx2
It's the primary claimed benefit to microkernels. But since Linux as it exists today already handles this case, it isn't a very strong argument in favor of microkernels.

>>Tomte+w6
SUSE was used on Mars

https://www.pcmag.com/news/linux-is-now-on-mars-thanks-to-na...

>>eesmit+CW1
Right, that is the point I was making.