zlacker

“Rust is safe” is not some kind of absolute guarantee of code safety
1. jmilli+Fb 2022-10-02 15:34:06
>>rvz+(OP)
As usual, HN comments react to the headline without reading the content.

A lot of modern userspace code, including Rust code in the standard library, thinks that invariant failures (AKA "programmer errors") should cause some sort of assertion failure or crash (Rust or Go `panic`, C/C++ `assert`, etc). In the kernel, claims Linus, failing loudly is worse than trying to keep going because failing would also kill the failure reporting mechanisms.

He advocates for a sort of soft-failure, where the code tells you you're entering unknown territory and then goes ahead and does whatever. Maybe it crashes later, maybe it returns the wrong answer, who knows, the only thing it won't do is halt the kernel at the point the error was detected.

Think of the following Rust API for an array, which needs to be able to handle the case of a user reading an index outside its bounds:

  struct Array<T> { ... }
  impl<T> Array<T> {
    fn len(&self) -> usize;

    // if idx >= len, panic
    fn get_or_panic(&self, idx: usize) -> T;

    // if idx >= len, return None
    fn get_or_none(&self, idx: usize) -> Option<T>;

    // if idx >= len, print a stack trace and return
    // who knows what
    unsafe fn get_or_undefined(&self, idx: usize) -> T;
  }
The first two are safe by the Rust definition, because they can't cause memory-unsafe behavior. The last two (#2 and #3) are safe by the Linus/Linux definition, because they won't cause a kernel panic. If you have to choose between #1 and #3, Linus is putting his foot down and saying that the kernel's answer is #3.
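
For anyone who wants to poke at the three behaviors, here is a rough userspace sketch, not kernel code. Everything in it is an assumption made for illustration: `Array` is just a wrapper around `Vec`, the `Copy + Default` bound exists only so values can be returned by value, and #3 is renamed `get_or_warn` because this version returns `T::default()` instead of truly undefined data, so it stays memory-safe and needs no `unsafe`.

```rust
// Illustrative sketch only: `Array` is a thin wrapper over Vec,
// not any real kernel or standard-library type.
struct Array<T> {
    items: Vec<T>,
}

impl<T: Copy + Default> Array<T> {
    fn len(&self) -> usize {
        self.items.len()
    }

    // #1: if idx >= len, panic (ordinary slice indexing does this).
    fn get_or_panic(&self, idx: usize) -> T {
        self.items[idx]
    }

    // #2: if idx >= len, return None.
    fn get_or_none(&self, idx: usize) -> Option<T> {
        self.items.get(idx).copied()
    }

    // #3: if idx >= len, report the problem and return a stand-in value.
    // Assumption: T::default() plays the role of "who knows what".
    fn get_or_warn(&self, idx: usize) -> T {
        match self.items.get(idx) {
            Some(value) => *value,
            None => {
                eprintln!("WARN: index {} out of bounds (len {})", idx, self.len());
                T::default()
            }
        }
    }
}
```

With `let a = Array { items: vec![10, 20, 30] };`, calling `a.get_or_none(7)` gives `None`, `a.get_or_warn(7)` logs a warning and returns `0`, and `a.get_or_panic(7)` panics.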
◧◩
2. titzer+cf 2022-10-02 15:53:46
>>jmilli+Fb
The way to handle this is to split up kernel work into fail-able tasks [1]. When a safety check (like array OOB) fails, it unwinds the stack up to the start of the task, and the task fails.

Linus sounds so ignorant in this comment. As if no one else thought of writing safety-critical systems in a language that had dynamic errors, and that dynamic errors are going to bring the whole system down or turn it into a brick. No way!

Errors don't have to be full-blown exceptions with all that rigamarole, but silently continuing with corruption is utter madness and in 2022 Linus should feel embarrassed for advocating such a backwards view.

[1] This works for Erlang. Not everything needs to be a shared-nothing actor, to be sure, but failing a whole task is about the right granularity to allow reasoning about the system. E.g. a few dozen to a few hundred types of tasks or processes seems about right.
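
The fail-able task idea can be approximated in userspace Rust with `std::panic::catch_unwind`. This is only a sketch under stated assumptions: it relies on the default unwinding panic strategy (with `panic = "abort"` nothing is caught), it says nothing about how in-kernel Rust behaves, and `run_task` and `sum_first` are hypothetical names invented for the example.

```rust
use std::panic;

// Hypothetical helper: run one unit of work and turn a panic inside it
// into an Err, so only that task fails and the caller keeps running.
fn run_task<F, R>(name: &str, work: F) -> Result<R, String>
where
    F: FnOnce() -> R + panic::UnwindSafe,
{
    panic::catch_unwind(work).map_err(|_| format!("task '{}' failed", name))
}

// Hypothetical task body: slicing past the end of `data` panics,
// which stands in for a failed bounds check.
fn sum_first(data: &[u32], n: usize) -> u32 {
    data[..n].iter().sum()
}
```

Given `let data = vec![1u32, 2, 3];`, `run_task("ok", || sum_first(&data, 2))` returns `Ok(3)`, while `run_task("bad", || sum_first(&data, 5))` returns an `Err` and the program carries on. The task boundary is where the failure stops, which is the granularity argued for above.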

◧◩◪
3. mike_h+Un 2022-10-02 16:39:00
>>titzer+cf
It doesn't continue silently, it warns. More accurately, it does what you tell it to, which can also be a hard stop if you want to.

If you don't want to panic, it's up to you to choose the right failure strategy, monitor your system, and take appropriate measures rather than just ignoring the warning.

It's not Linus who sounds ignorant here, it's the people applying user-space "best practices" to the kernel. If the kernel panics, the system is dead and you've lost the opportunity to diagnose the problem, which may be non-deterministic and hard to trigger on purpose.
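
As a rough illustration of "warn, but make the hard stop configurable", here is a small userspace sketch loosely modeled on the kernel's `WARN_ON()` and its `panic_on_warn` option. It is not the kernel's implementation: `OnWarn`, `warn_on`, and `read_slot` are hypothetical names, and falling back to `0` is an arbitrary choice made for the example.

```rust
// Hypothetical policy switch: what should happen after a warning fires.
#[derive(Clone, Copy, PartialEq, Debug)]
enum OnWarn {
    Continue,
    Halt,
}

// Returns `cond` so the call can sit directly inside an `if`.
// Logs when `cond` is true and panics only if the policy says so.
fn warn_on(cond: bool, what: &str, policy: OnWarn) -> bool {
    if cond {
        eprintln!("WARNING: {}", what);
        if policy == OnWarn::Halt {
            panic!("halting on warning: {}", what);
        }
    }
    cond
}

// Example caller: reports a bad index, then falls back to 0 and keeps going.
fn read_slot(slots: &[u8], idx: usize, policy: OnWarn) -> u8 {
    if warn_on(idx >= slots.len(), "slot index out of range", policy) {
        return 0;
    }
    slots[idx]
}
```

With `OnWarn::Continue`, an out-of-range read logs a warning and returns the fallback; with `OnWarn::Halt`, the same call stops right there. The call site is identical in both cases, so the choice of failure strategy stays with whoever configures the system.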

◧◩◪◨
4. jeffre+AB 2022-10-02 17:52:35
>>mike_h+Un
I agree with your statements, but I wonder: who is typically warned? An end user, via a log he neither reads nor understands? The chance that this will lead to the right measure is low, isn't it?
◧◩◪◨⬒
5. Gibbon+j91 2022-10-02 21:32:32
>>jeffre+AB
The couple of times I had to go digging into the kernel, the thing looked to me like a very large bare-metal piece of firmware. As someone who writes firmware, the very last thing you ever want is for it to hang or reset without reporting any diagnostics, because then you have no idea where the offending code is. I'll belabor the point for people who think a large program is a few thousand lines: with the kernel it's millions of lines of code, mostly written by other people.

Small rant: ARM Cortex processors overwrite the stack pointer on reset. That's very, very, very dumb, because after the watchdog trips you have no idea what the code was doing, which means you can't report it.
