zlacker

[return to "“Rust is safe” is not some kind of absolute guarantee of code safety"]
1. jmilli+Fb[view] [source] 2022-10-02 15:34:06
>>rvz+(OP)
As usual, HN comments react to the headline without reading the content.

A lot of modern userspace code, including the Rust standard library, takes the view that invariant failures (AKA "programmer errors") should cause some sort of assertion failure or crash (Rust or Go `panic`, C/C++ `assert`, etc.). In the kernel, Linus argues, failing loudly is worse than trying to keep going, because crashing would also take out the failure-reporting mechanisms.

He advocates a sort of soft failure, where the code tells you you're entering unknown territory and then goes ahead and does whatever. Maybe it crashes later, maybe it returns the wrong answer, who knows; the only thing it won't do is halt the kernel at the point where the error was detected.

Think of the following Rust API for an array, which needs to be able to handle the case of a user reading an index outside its bounds:

  struct Array<T> { ... }
  impl<T> Array<T> {
    fn len(&self) -> usize;

    // if idx >= len, panic
    fn get_or_panic(&self, idx: usize) -> &T;

    // if idx >= len, return None
    fn get_or_none(&self, idx: usize) -> Option<&T>;

    // if idx >= len, print a stack trace and return
    // who knows what
    unsafe fn get_or_undefined(&self, idx: usize) -> &T;
  }
The first two are safe by the Rust definition, because they can't cause memory-unsafe behavior. The last two are safe by the Linus/Linux definition, because they won't cause a kernel panic. If you have to choose between #1 and #3, Linus is putting his foot down and saying that the kernel's answer is #3.
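
For concreteness, here's a minimal sketch of what implementations of those three could look like over a Vec-backed array (the Vec field, the warning message, and the use of get_unchecked are illustrative choices, not anything from the actual kernel work):

  struct Array<T> {
    items: Vec<T>,
  }

  impl<T> Array<T> {
    fn len(&self) -> usize {
      self.items.len()
    }

    // #1: memory-safe, but fatal at the call site if idx is bad.
    fn get_or_panic(&self, idx: usize) -> &T {
      &self.items[idx] // out-of-bounds indexing panics
    }

    // #2: memory-safe and panic-free; the caller must handle None.
    fn get_or_none(&self, idx: usize) -> Option<&T> {
      self.items.get(idx)
    }

    // #3: warn and limp on. The caller must guarantee idx < len;
    // if it doesn't, this is undefined behavior (hence `unsafe`).
    unsafe fn get_or_undefined(&self, idx: usize) -> &T {
      if idx >= self.items.len() {
        eprintln!("Array index {} out of bounds (len {})", idx, self.items.len());
      }
      unsafe { self.items.get_unchecked(idx) }
    }
  }

In kernel terms, #3 is roughly the warn-and-keep-going ("oops") behavior the next comment complains about.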
2. EdScho+Vf[view] [source] 2022-10-02 15:58:40
>>jmilli+Fb
The policy of ‘oopsing’ and limping on is, in my opinion, literally one of Linux’s worst features. It has bitten me in various cases:

- Remember when a bug in the leap-second handling code caused the kernel to partially crash and eat 100% CPU? That caused a >1MW spike in power usage at Hetzner at the time; it must have been >1GW globally. Many people didn't notice it immediately, so it must have taken weeks before everyone rebooted.

- I’ve personally run into issues where not crashing caused Linux to go on and eat my file system.

On any Linux server I maintain, I always toggle those sysctls that cause the kernel to panic on oops, and reboot on panic.
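
For reference, that boils down to something like this (assuming the stock kernel.panic_on_oops and kernel.panic knobs; the 10-second delay is just an example value):

  # e.g. /etc/sysctl.d/99-panic.conf
  # Escalate an oops to a full panic instead of limping on.
  kernel.panic_on_oops = 1
  # Reboot automatically 10 seconds after a panic (0 = wait forever).
  kernel.panic = 10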

3. notaco+YN[view] [source] 2022-10-02 19:10:49
>>EdScho+Vf
Yeah, this part has never really been true.

> In the kernel, "panic and stop" is not an option

That's simply not true. It's an option I've seen exercised many times, even in default configurations. Furthermore, for some domains - e.g. storage - it's the only sane option. Continuing when the world is clearly crazy risks losing or corrupting data, and that's far worse than a crash. No, it's not weird to treat computation as ephemeral, or as less important than preserving the integrity of data. Especially in a distributed context - where this machine might be one of thousands, the cluster can cover for the transient loss of one component, and letting a faulty machine keep running puts everything at risk - rebooting is clearly the better option. A system that can't survive such a reboot is broken. See also: Erlang OTP, Recovery Oriented Computing @ Berkeley.

Linus is right overall, but that particular argument is a very bad one. There are systems where "panic and stop" is not an option and there are systems where it's the only option.

4. wtalli+bP[view] [source] 2022-10-02 19:19:25
>>notaco+YN
> Furthermore, for some domains - e.g. storage - it's the only sane option.

Can you elaborate on this? Because failing storage is a common occurrence that usually does not warrant immediately crashing the whole OS, unless it's the root filesystem that becomes inaccessible.

5. notaco+mV[view] [source] 2022-10-02 20:02:00
>>wtalli+bP
Depends on what you mean by "failing storage" but IMX it does warrant an immediate stop (with or without reboot depending on circumstances). Yes, for some kinds of media errors it's reasonable to continue, or at least not panic. Another option in some cases is to go read-only. OTOH, if either media or memory corruption is detected, it would almost certainly be unsafe to continue because that might lead to writing the wrong data or writing it to the wrong place. The general rule in storage is that inaccessible data is preferable to lost, corrupted, or improperly overwritten data.

Especially in a distributed storage system using erasure codes etc., losing one machine means absolutely nothing even if it's permanent. On the last storage project I worked on, we routinely ran with 1-5% of machines down, whether it was due to failures or various kinds of maintenance actions, and all it meant was a loss of some capacity/performance. It's what the system was designed for. Leaving a faulty machine running, OTOH, could have led to a Byzantine failure mode corrupting all shards for a block and thus losing its contents forever.

BTW, in that sort of context - where most of the bytes in the world are held - the root filesystem is more expendable than any other. It's just part of the access system, much like firmware, and re-imaging or even hardware replacement doesn't affect the real persistence layer. It's user data that must be king, and the media holding that data whose contents must be treated with the utmost care.

6. wtalli+h01[view] [source] 2022-10-02 20:32:32
>>notaco+mV
I understand why a failing drive or an apparently corrupt filesystem would be reason to freeze that filesystem. But that's nowhere close to kernel panic territory.

Even in a distributed, fault-tolerant multi-node system, it seems like it would be useful for the kernel to keep running long enough for userspace to notify other systems of the failure (e.g. return errors to clients with pending requests so they don't have to wait for a timeout before retrying against a different node), or at least to send logs to wherever you're aggregating them.
