“Rust is safe” is not some kind of absolute guarantee of code safety

>>rvz+(OP)
As usual HN comments react to the headline, without reading the content.

A lot of modern userspace code, including Rust code in the standard library, thinks that invariant failures (AKA "programmer errors") should cause some sort of assertion failure or crash (Rust or Go `panic`, C/C++ `assert`, etc). In the kernel, claims Linus, failing loudly is worse than trying to keep going because failing would also kill the failure reporting mechanisms.

He advocates for a sort of soft-failure, where the code tells you you're entering unknown territory and then goes ahead and does whatever. Maybe it crashes later, maybe it returns the wrong answer, who knows, the only thing it won't do is halt the kernel at the point the error was detected.

Think of the following Rust API for an array, which needs to be able to handle the case of a user reading an index outside its bounds:

  struct Array<T> { ... }
  impl<T> Array<T> {
    fn len(&self) -> usize;

    // if idx >= len, panic
    fn get_or_panic(&self, idx: usize) -> T;

    // if idx >= len, return None
    fn get_or_none(&self, idx: usize) -> Option<T>;

    // if idx >= len, print a stack trace and return
    // who knows what
    unsafe fn get_or_undefined(&self, idx: usize) -> T;
  }

The first two are safe by the Rust definition, because they can't cause memory-unsafe behavior. The second two are safe by the Linus/Linux definition, because they won't cause a kernel panic. If you have to choose between #1 and #3, Linus is putting his foot down and saying that the kernel's answer is #3.

>>jmilli+Fb
The way to handle this is split up kernel work into fail-able tasks [1]. When a safety check (like array OOB) occurs, it unwinds the stack up to the start of the task, and the task fails.

Linus sounds so ignorant in this comment. As if no one else thought of writing safety-critical systems in a language that had dynamic errors, and that dynamic errors are going to bring the whole system down or turn it into a brick. No way!

Errors don't have to be full-blown exceptions with all that rigamarole, but silently continuing with corruption is utter madness and in 2022 Linus should feel embarrassed for advocating such a backwards view.

[1] This works for Erlang. Not everything needs to be a shared-nothing actor, to be sure, but failing a whole task is about the right granularity to allow reasoning about the system. E.g. a few dozen to a few hundred types of tasks or processes seems about right.

>>titzer+cf
I think Linus's response would be that those failable tasks are called "processes", and the low-level supervisor that starts + monitors them is the kernel. If you have code that might fail and restart, it belongs in userspace.

If you want to run an Erlang-style distributed system in the kernel then that's an interesting research project, but it isn't where Linux is today. You'd be better off starting with SeL4 or Fuchsia.

>>jmilli+Uf
40 years of microkernels, of which I know Linus is aware of, beg to differ. Maybe Linus's extreme opposition to microkernels, ostensibly because they have historically a little lower performance--I dunno--but my comment should not be read as "yes, you must have a microkernel". There are cheaper fault isolation mechanisms than full-blown separate processes. Just having basic stack unwinding and failing a task would be a start.

>>titzer+wg
I think it might be a bit more complicated than that, considering you can have static data and unwinding the stack will not reset those states. I guess you still need some sort of task level abstraction and reset all the data for that task when unwinding from it. Btw, do we need stack unwinding or can we just do a sjlj?

>>pca006+Yh
Note I am not going to advocate for try ... catch .. finally, because I think that language mechanism is abused out the wazoo for handling all kinds of things, like IO errors, but this is exactly what try ... finally would be for.

Regardless, I think just killing the task instantly, even with partial updates to memory, would be totally fine. It'd be cheap, as automatically undoing the updates (effectively a transaction rollback) is still too expensive. Software transactional memory just comes with too much overhead.

I vote "kill and unwind" and then dealing with partial updates has to be left to a higher level.

zlacker