zlacker

[return to "“Rust is safe” is not some kind of absolute guarantee of code safety"]
1. jmilli+Fb[view] [source] 2022-10-02 15:34:06
>>rvz+(OP)
As usual, HN comments react to the headline without reading the content.

A lot of modern userspace code, including Rust code in the standard library, holds that invariant failures (AKA "programmer errors") should cause some sort of assertion failure or crash (a Rust or Go `panic`, a C/C++ `assert`, etc.). In the kernel, Linus argues, failing loudly is worse than trying to keep going, because a hard failure would also kill the failure-reporting mechanisms.

He advocates for a sort of soft-failure, where the code tells you you're entering unknown territory and then goes ahead and does whatever. Maybe it crashes later, maybe it returns the wrong answer, who knows, the only thing it won't do is halt the kernel at the point the error was detected.

Think of the following Rust API for an array, which needs to be able to handle the case of a user reading an index outside its bounds:

  struct Array<T> { /* ... */ }

  impl<T> Array<T> {
    fn len(&self) -> usize;

    // if idx >= len, panic
    fn get_or_panic(&self, idx: usize) -> &T;

    // if idx >= len, return None
    fn get_or_none(&self, idx: usize) -> Option<&T>;

    // if idx >= len, print a stack trace and return
    // who knows what
    unsafe fn get_or_undefined(&self, idx: usize) -> &T;
  }
The first two are safe by the Rust definition, because they can't cause memory-unsafe behavior. The last two are safe by the Linus/Linux definition, because they won't cause a kernel panic. If you have to choose between #1 and #3, Linus is putting his foot down and saying that the kernel's answer is #3.
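For comparison, safe Rust's real slice API already offers the first two behaviors; a minimal sketch using only the standard library:

```rust
use std::panic;

fn main() {
    let v: &[i32] = &[10, 20, 30];

    // Variant #2 (get_or_none): an out-of-bounds lookup
    // returns None instead of crashing.
    assert_eq!(v.get(5), None);
    assert_eq!(v.get(1), Some(&20));

    // Variant #1 (get_or_panic): direct indexing panics
    // past the end; catch_unwind observes the panic.
    let caught = panic::catch_unwind(|| v[5]);
    assert!(caught.is_err());
}
```

There is no safe equivalent of variant #3; the closest real API is `slice::get_unchecked`, which is `unsafe` precisely because an out-of-bounds index is undefined behavior.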
2. EdScho+Vf 2022-10-02 15:58:40
>>jmilli+Fb
The policy of ‘oopsing’ and limping on is, in my opinion, literally one of Linux’s worst features. It has bitten me in various cases:

- Remember when Linux had that bug in the leap-second handling code that caused the kernel to partially crash and eat 100% CPU? That caused a >1 MW spike in power usage at Hetzner alone; globally it must have been >1 GW. Many people didn't notice it immediately, so it must have taken weeks before everyone rebooted.

- I’ve personally run into issues where not crashing caused Linux to go on and eat my file system.

On any Linux server I maintain, I always toggle those sysctls that cause the kernel to panic on oops, and reboot on panic.
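For reference, the pair of knobs being described is `kernel.panic_on_oops` and `kernel.panic`; a sketch (the 10-second delay is an arbitrary illustrative value):

```shell
# Turn any oops into a full panic instead of limping on.
sysctl -w kernel.panic_on_oops=1

# Automatically reboot N seconds after a panic (0 = hang forever).
sysctl -w kernel.panic=10

# To persist across reboots, put the same settings in a
# /etc/sysctl.d/ drop-in file:
#   kernel.panic_on_oops = 1
#   kernel.panic = 10
```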

3. amluto+Ou 2022-10-02 17:17:42
>>EdScho+Vf
As a kernel developer, I mostly disagree. Panicking hard is nice unless you are the user whose system rebooted without explanation, or the developer trying to handle a bug report that says "my system rebooted and I have nothing more to say".

Getting logs out is critical.

4. EdScho+6w 2022-10-02 17:24:54
>>amluto+Ou
One does not rule out the other. You could simply write crash info to some NVRAM or something, and then do a reboot. Then you can recover it during the next boot.

But there is no need to let userspace processes continue to run, which is exactly what Linux does.

5. wtalli+3C 2022-10-02 17:55:08
>>EdScho+6w
> You could simply write crash info to some NVRAM or something, and then do a reboot. Then you can recover it during the next boot.

That works for some systems: those for which "some NVRAM or something" evaluates to a real device usable for that purpose. Not all Linux systems provide such a device.

> But there is no need to let userspace processes continue to run, which is exactly what Linux does.

Userspace processes usually contain state that the user would also like to be persisted before rebooting. If my WiFi driver crashes, there's nothing helpful or safer about immediately bringing down the whole system when it's possible to keep running with everything but networking still functioning.

6. kaba0+FN2 2022-10-03 12:58:26
>>wtalli+3C
Isn’t that exactly the reason behind microkernel’s supposed superiority?