zlacker

As usual HN comments react to the headline, without reading the content.

A lot of modern userspace code, including Rust code in the standard library, thinks that invariant failures (AKA "programmer errors") should cause some sort of assertion failure or crash (Rust or Go `panic`, C/C++ `assert`, etc). In the kernel, claims Linus, failing loudly is worse than trying to keep going because failing would also kill the failure reporting mechanisms.

He advocates for a sort of soft-failure, where the code tells you you're entering unknown territory and then goes ahead and does whatever. Maybe it crashes later, maybe it returns the wrong answer, who knows, the only thing it won't do is halt the kernel at the point the error was detected.

Think of the following Rust API for an array, which needs to be able to handle the case of a user reading an index outside its bounds:

  struct Array<T> { ... }
  impl<T> Array<T> {
    fn len(&self) -> usize;

    // if idx >= len, panic
    fn get_or_panic(&self, idx: usize) -> T;

    // if idx >= len, return None
    fn get_or_none(&self, idx: usize) -> Option<T>;

    // if idx >= len, print a stack trace and return
    // who knows what
    unsafe fn get_or_undefined(&self, idx: usize) -> T;
  }

The first two are safe by the Rust definition, because they can't cause memory-unsafe behavior. The second two are safe by the Linus/Linux definition, because they won't cause a kernel panic. If you have to choose between #1 and #3, Linus is putting his foot down and saying that the kernel's answer is #3.

replies(12): >>ChrisS+11 >>layer8+l1 >>lucasy+a2 >>titzer+x3 >>echelo+14 >>EdScho+g4 >>whatsh+H6 >>hedgeh+L8 >>3a2d29+ee >>oconno+bf >>zozbot+Mf >>scoutt+o42

>>jmilli+(OP)
That's all very well but the Rust-in-Linux advocates are advocating for #2 and fully agree with Linus on #1. So attacking #1 is attacking straw.

>>jmilli+(OP)
Please correct me if I’m wrong, but Rust also has no built-in mechanism to statically determine “this code won’t ever panic”, and thus with regards to Linux kernel requirements isn’t safer in that aspect than C. To the contrary, Rust is arguably less safe in that aspect than C, due to the general Rust practice of panicking upon unexpected conditions.

replies(6): >>pornel+U2 >>jrochk+p3 >>gerane+D3 >>jmilli+N3 >>dcsomm+l4 >>stjohn+2A

>>jmilli+(OP)
Then Linus is wrong because the unsafe keyword has nothing to do with no-panic guarantees? Unsafe correlates with memory safety / UB, so using it in a different way in the kernel would be flat out wrong.

The language determines the definition of its constructs, not the software being written with it.

Edit: It's worth mentioning that while I think he is wrong, I think it's symptomatic of there not being a keyword/designation in Rust to express what Linus is trying to say. I would completely oppose misusing the unsafe keyword since it has negative downstream effects on all future dependency crates, where it's not clear what characteristics "unsafe" refers to which causes a split. So maybe they need to just discuss a different way to label these for now and agree to improve it later.

>>layer8+l1
Lack of a non-hacky no-panic guarantee is a pain. That would be like a no-segfault guarantee in C.

But Rust's situation is still safer, because Rust can typically prevent more errors from ever becoming a run-time issue, e.g. you may not even need to use array indexing at all if you use iterators. You have a guarantee that references are never NULL, so you don't risk nullptr crash, etc.

Rust panics are safer, because they reliably happen instead of an actually unsafe operation. Mitigations in C are usually best-effort and you may be lucky/unlucky to silently corrupt memory.

Panics are a problem for uptime, but not for safety (in the sense they're not exploitable for more than DoS).

In the long term crashing loud and clear may be better for reliability. You shake out all the bugs instead of having latent issues that corrupt your data.

replies(2): >>layer8+n8 >>hegels+t9

>>layer8+l1
> but Rust also has no built-in mechanism to statically determine “this code won’t ever panic”,

My intuition says that's the Halting Problem, so not actually possible to implement perfectly? https://en.wikipedia.org/wiki/Halting_problem

replies(5): >>jmilli+O4 >>im3w1l+25 >>skybri+i5 >>pca006+w5 >>layer8+H5

>>jmilli+(OP)
The way to handle this is split up kernel work into fail-able tasks [1]. When a safety check (like array OOB) occurs, it unwinds the stack up to the start of the task, and the task fails.

Linus sounds so ignorant in this comment. As if no one else thought of writing safety-critical systems in a language that had dynamic errors, and that dynamic errors are going to bring the whole system down or turn it into a brick. No way!

Errors don't have to be full-blown exceptions with all that rigamarole, but silently continuing with corruption is utter madness and in 2022 Linus should feel embarrassed for advocating such a backwards view.

[1] This works for Erlang. Not everything needs to be a shared-nothing actor, to be sure, but failing a whole task is about the right granularity to allow reasoning about the system. E.g. a few dozen to a few hundred types of tasks or processes seems about right.

replies(3): >>jmilli+f4 >>fritol+e9 >>mike_h+fc

>>layer8+l1
We cannot ensure that an arbitrary program halts by statically analyzing it. And it doesn’t have anything to do with the language of choice.

https://en.m.wikipedia.org/wiki/Halting_problem

replies(2): >>layer8+b4 >>roywig+G7

>>layer8+l1
Rust doesn't have an official `#[never_panic]` annotation, but there's a variety of approaches folks use. Static analysis (Clippy) can get you pretty far. My favorite trick is to link a no_std binary with no panic handler, and see if it has a linker error. No linker error = no calls to panic handler = no panics.

Note that Rust is easier to work with than C here, because although the C-like API isn't shy about panicking where C would segfault, it also inherits enough of the OCaml/Haskell/ML idiom to have non-panic APIs for pretty much any major operation. Calling `saturating_add()` instead of `+` is verbose, but it's feasible in a way that C just isn't unless you go full MISRA.

replies(2): >>ajross+ig >>pdimit+kr

>>jmilli+(OP)
Use #2 everywhere you're not doing C FFI.

Both #1 and #3 are gross and wrong.

>>gerane+D3
Proof assistants, which I expect to eventually merge with programming languages, can be used to restrict the set of programs you write to those where you can statically prove all properties you expect the program to hold. It’s not much different from what diligent programmers have always done in their head (with, of course, much more room for error).

The fact that arbitrary programs are undecidable is a red herring here.

replies(2): >>yonixw+Dk >>gerane+QJ

>>titzer+x3
I think Linus's response would be that those failable tasks are called "processes", and the low-level supervisor that starts + monitors them is the kernel. If you have code that might fail and restart, it belongs in userspace.

If you want to run an Erlang-style distributed system in the kernel then that's an interesting research project, but it isn't where Linux is today. You'd be better off starting with SeL4 or Fuchsia.

replies(1): >>titzer+R4

>>jmilli+(OP)
The policy of ‘oopsing’ and limping on is, in my opinion, literally one of Linux’s worst features. It has bitten me in various cases:

- Remember when Linux had that caused the kernel to partially crash and eat 100% CPU due to some bug in the leap second application code? That caused a >1MW spike in power usage at Hetzner at the time. That must have been >1GW globally. Many people didn’t notice it immediately, so it must have taken weeks before everyone rebooted.

- I’ve personally run into issues where not crashing caused Linux to go on and eat my file system.

On any Linux server I maintain, I always toggle those sysctls that cause the kernel to panic on oops, and reboot on panic.

replies(7): >>xyzzy_+x6 >>pca006+R7 >>mike_h+s8 >>Miguel+A9 >>UncleM+mc >>amluto+9j >>notaco+jC

>>layer8+l1
> ... the general Rust practice of panicking upon unexpected conditions

What makes you say this? From the sample I've seen, Rust programs are far more diligent about handling errors (not panicking: either returning error or handling it explicitly) than C or Go programs due to the nature of wrapped types like Option<T> and Result<T, E>. You can't escape handling the error, and panicking potential is very easy to see and lint against with clippy in the code.

replies(1): >>layer8+85

>>jrochk+p3
If you were to define a subset of Rust's standard library (core + alloc + std) that does not contain the `panic!` macro, and excluded all functionality that needed to panic, then safe Rust could be proven to never panic (because it can't).

That's different than solving the halting problem. You're not trying to prove it halts, you're just trying to prove it doesn't halt in a specific way, which is trivial to prove if you first make it impossible.

replies(1): >>gpm+7a

>>jmilli+f4
40 years of microkernels, of which I know Linus is aware of, beg to differ. Maybe Linus's extreme opposition to microkernels, ostensibly because they have historically a little lower performance--I dunno--but my comment should not be read as "yes, you must have a microkernel". There are cheaper fault isolation mechanisms than full-blown separate processes. Just having basic stack unwinding and failing a task would be a start.

replies(5): >>pca006+j6 >>jstimp+07 >>Wastin+I7 >>geertj+J9 >>lr1970+rO

>>jrochk+p3
If you are fine with saying that stuff like this code may panic (and many people are fine with just that), then it's perfectly doable

    if false {
        panic!()
    }

Basically you'd prohibit any call to panic whether they may actually end up running or not.

replies(1): >>jrochk+fB1

>>dcsomm+l4
I’m referring to the fact that ubiquitous functions like unwrap() panic if the programmer has made an error. Guarding against such panics is outside of the scope of Rust-the-language, and has to be handled through external means. There are linters for C as well.

replies(4): >>Velila+x7 >>hegels+A8 >>stjohn+OC >>tomjak+YY

>>jrochk+p3
That's different. You can't perfectly detect all infinite loops in a language that allows arbitrary loops. This also means you can't perfectly detect unreachable code.

But determining that a function (such as panic) is never called because there are no calls to it is pretty easy.

>>jrochk+p3
No, panic is not halting, you just need some static check to check that you never call some functions that can panic in your code. Essentially it is just checking if some code (panic) might be reachable, if it is not, it will never panic (but it can still do other crazy things).

Note that we can only check for maybe, because in general we don't know if some code in the middle will somehow execute forever and never reach the panic call after it.

replies(1): >>roywig+N9

>>jrochk+p3
See https://news.ycombinator.com/item?id=33057059.

>>titzer+R4
I think it might be a bit more complicated than that, considering you can have static data and unwinding the stack will not reset those states. I guess you still need some sort of task level abstraction and reset all the data for that task when unwinding from it. Btw, do we need stack unwinding or can we just do a sjlj?

replies(1): >>titzer+e7

>>EdScho+g4
Right, the fact that you can toggle those sysctls is sort of the point. There are plenty of environments where running at all is better than not, hardware watchdogs can solve for unresponsiveness.

TFA is about making it possible for the kernel to decide what to do, rather than exploding on the spot, which is terrible.

>>jmilli+(OP)
#3 could cause a security problem. I don't think you'd find it in the Linux kernel - it would let attackers read arbitrary kernel memory. That's one of safe Rust's strongest features: it will not compile a direct memory access vulnerability.

>>titzer+R4
How do you unwind if most of your kernel is written in C? (answering my own question - they are doing stack unwinding - only manually).

Where do you unwind to if memory is corrupted?

I don't think we're talking about what would be exception handling in other languages. I believe it's asserts. How do userland processes handle a failed assertion? Usually the process is terminated, but giving a debugger the possibility to examine the state first, or dumping core.

And that's similar to what they are doing in the kernel. Only in that in the kernel, it's more dangerous because there is limited process / task isolation. I think that is an argument that taking down "full-blown separate processes" might not even be enough in the kernel.

>>pca006+j6
Note I am not going to advocate for try ... catch .. finally, because I think that language mechanism is abused out the wazoo for handling all kinds of things, like IO errors, but this is exactly what try ... finally would be for.

Regardless, I think just killing the task instantly, even with partial updates to memory, would be totally fine. It'd be cheap, as automatically undoing the updates (effectively a transaction rollback) is still too expensive. Software transactional memory just comes with too much overhead.

I vote "kill and unwind" and then dealing with partial updates has to be left to a higher level.

>>layer8+85
I think I prefer Rust's way of doing things. Just last night I used the Vec! macro incorrectly putting in a comma instead of a semi-colon and despite the program compiling correctly, it immediately panicked with an OOB error. With C it would have been a lot harder to even notice a bug little alone track it down.

replies(1): >>layer8+P8

>>gerane+D3
You can prove that a machine can't ever write "1" to the tape if you just look at the state machine and see that none of the rules write a 1 to the tape. Since no rules ever write 1, no possible execution could.

Working out whether it will write 1 to the tape in general is undecidable, but in certain cases (you've just banned states that write 1) it's trivial.

If all of the state transitions are valid (a transition to a non-existing state is a halt) then the machine can't get into a state that will transition into a halt, so it can't halt. That's a small fraction of all the machines that won't halt, but it's easy to tell when you have one of this kind by looking at the state machine.

replies(2): >>yonixw+Zk >>gerane+xG

>>titzer+R4
Sorry it’s hard to take you seriously after that.

Linux isn’t a microkernel. If you want to work on a microkernel, go work on Fuchsia. It’s interesting research but utterly irrelevant to the point at hand.

Anyway, the microkernel discussion has been happening for three decades now. They haven’t historically had a little lower performance. They had garbage performance, to the point of being unsuitable in the 90s.

Plenty of kernel code can’t be written as to be unwindable. That’s the issue at hand. In a fantasy world, it might have been written as such but it’s not the world will live in which is what matters to Linus.

replies(3): >>jeffre+hp >>pjmlp+Or >>rleigh+0J

>>EdScho+g4
Probably not very linux, but consider the case where a device is controlling a motor speed by setting its current, computed by getting its current speed and do a simple PID control. If your device crashed (due to failure of something else, for example some LED display) while the current output is pretty large, you may cause some serious damage to whatever that motor is connected to. The thing the system should be able to do is notify such error and gracefully shutdown, for example sending commands to shutdown everything else.

Although I think this can be better done by some special panic handler that performs a sjlj and notify other systems about the failure, without continuing running with the wrong output...

replies(1): >>Gare+Qe

>>pornel+U2
One has to be careful about words. When Rust (or Linux) is used in (say) a vehicle or in a nuclear power plant, panicking certainly has immediate safety implications.

replies(2): >>pornel+Z8 >>avgcor+mb

>>EdScho+g4
So instead of a power spike, we'd have had a major internet outage across the world, across the entire industry and beyond, probably, if everyone had panicked on oops. The blame really lies with people not monitoring their systems.

As you said, you have the option to reboot on panic, but Linus is absolutely not wrong that this size does not fit all.

What about a medical procedure that WILL kill the patient if interrupted? What about life support in space? Hitting an assert in those kinds of systems is a very bad place to be, but an automatic halt is worse than at least giving the people involved a CHANCE to try and get to a state where it's safe to take the system offline and restart it.

replies(6): >>Tomte+Ma >>ok_dad+tb >>notaco+DC >>acje+mO1 >>stouse+AP1 >>kaba0+Ey2

>>layer8+85
Linters can catch panics, linters for C won't catch memory issues which is what rust prevents.

replies(1): >>layer8+Ca

>>jmilli+(OP)
As I read it the issue is a little bit deeper, start with the context here and read down the thread:

https://lkml.org/lkml/2022/9/19/640

Get at least down to here:

https://lkml.org/lkml/2022/9/20/1342

What Linus seems to be getting at is that there are many varying contextual restrictions on what code can do in different parts of the kernel, that Filho etc appear to be attempting to hide that complexity using language features, and that in his opinion it is not workable to fit kernel APIs into Rust's common definition of a single kind of "safe" code. All of this makes sense, in user land you don't normally have to care about things like whether different functional units are turned on or off, how interrupts are set up, etc, but in kernel you do. I'm not sure if Rust's macros and type system will allow solving the problem as Linus frames it but it seems like a worthy target & will be interesting to watch.

>>Velila+x7
Right. My personal opinion is that exceptions provide a better trade-off between catching bugs and still allowing the chance of graceful shutdown or recovery.

replies(2): >>Velila+hb >>stjohn+TC

>>layer8+n8
But it can still be safer - e.g. a panic can trigger an emergency stop instead of silently overwriting the "go full throttle" variable.

replies(1): >>mike_h+ye

>>titzer+x3
What if the task was invoked asynchronously (and maybe it keeps happening.) What does async stack unwinding entail in Rust? Is there a parent-child relationship between invoker and invokee? async scopes (https://rust-lang.github.io/wg-async/vision/roadmap/scopes.h...) ? I've not touched Rust at all.

>>pornel+U2
I think you're missing the point Linus made. Panicking is safer from a memory safety perspective, but it's not from a kernel perspective. You'll lose all the file changes that are not saved, you'll risk having disk written in a bad state which can be catastrophic, etc.

replies(3): >>pornel+Lb >>lifthr+Bc >>zozbot+ii

>>EdScho+g4
What are those sysctls? It was worth my time to read Hacker News this morning.

replies(1): >>stjohn+Vu

>>titzer+R4
> Just having basic stack unwinding and failing a task would be a start.

As the sibling comment pointed out, if you extend this idea to clean up all state, you end up with processes.

I do have some doubt on the no panic rule. But instead of emulating processes in the kernel, I’d see a firmware like subsystem whose only job it is to export core dumps from the local system, after which the kernel is free to panic.

As a general point and in my view, and I agree this is an appeal to authority, Linus has this uncanny ability to find compromises between practicality and theory that result in successful real world software. He’s not always right but he’s almost never completely wrong.

>>pca006+w5
Even if it is halting, you can sometimes statically detect if a Turing machine never halts. Just look through the state machine and see if any states will transition to a halt; if none of them do, the machine will loop forever. This is not a very large fraction of machines that loop forever, but if you're writing a machine and want to be absolutely sure it won't halt, just don't put in any states that halt.

>>jmilli+O4
> If you were to define a subset of Rust's standard library (core + alloc + std) that does not contain the `panic!` macro, and excluded all functionality that needed to panic, then safe Rust could be proven to never panic (because it can't).

Not quite, because stack overflows can cause panics independent of any actual invocation of the panic macro.

You need to either change how stack overflows are handled as well, or you need to do some static analysis of the stack size as well.

Both are possible (while keeping rust turing complete), so it's still not like the halting problem.

replies(1): >>jmilli+Uc

>>hegels+A8
Linters like Splint [0] (predating Rust) can do that for C. I’m not saying that Rust’s built-in approach isn’t better, but please be careful about what exactly you claim.

[0] http://splint.org/

replies(2): >>dcsomm+7e >>hegels+gp1

>>mike_h+s8
You won't see Linux in those fields. I'm aware of a project by OSADL to qualify Linux for SIL 2, but your examples are way beyond that.

replies(1): >>retard+HV9

>>layer8+P8
In my case it wouldn't have raised an exception though, it would have just been UB.

It's not like there's not exceptions in Rust though. The error handling is thorough to a fault when it's used. Unwrap is just a shortcut to say "I know there might be bad input, I don't want to handle it right now, just let me do it and I'll accept the panic."

replies(1): >>layer8+Ad

>>layer8+n8
And a perfect, bug-free ballistic rocket program is unsafe in the sense that it is efficient at causing damage.

Rust’s “safety” has always meant what the Rust team meant by that term. There’s no gotcha to be found here except if you can find some way that Rust violates its own definition of the S-word.

This submission is not really about safety. It’s a perfectly legitimate concern that Rust likes to panic and that panicking is inappropriate for Linux. That isn’t about safety per se.

“Safety“ is a very technical term in the PL context and you just end up endlessly bickering if you try to torture the term into certain applications. Is it safer to crash immediately or to continue the program in a corrupted state? That entirely depends on the application and the domain, so it isn’t a useful distinction to make in this context.

EDIT: The best argument one could make from this continue-can-be-safer perspective is that given two PLs, the one that lets you abstract over this decision (to panic or to continue in a corrupted state, preferably with some out of band error reporting) is safer. And maybe C is safer than Rust in that regard (I wouldn’t know).

replies(2): >>layer8+Oc >>goto11+GS1

>>mike_h+s8
> What about a medical procedure that WILL kill the patient if interrupted? What about life support in space? Hitting an assert in those kinds of systems is a very bad place to be, but an automatic halt is worse than at least giving the people involved a CHANCE to try and get to a state where it's safe to take the system offline and restart it.

Kinda a strawman there. That's got to account for, what, 0.0001% of all use of computers, and probably they would never ever use Linux for these applications (I know medical devices DO NOT use Linux).

replies(3): >>gmueck+Sc >>goodpo+il >>Kim_Br+Nz2

>>hegels+t9
I understand his point. I just disagree, and prefer a different tradeoff.

Yes, a kernel panic will cause disruption when it happens. But that will also give a precise error location, which makes reporting and fixing of the root cause easier. It could be harder to pinpoint of the code rolled forward in broken state.

It will cause loss of unsaved data when it happens, but OTOH it will prevent corrupted data from being persisted.

replies(2): >>wtalli+Jf >>stjohn+zB

>>titzer+x3
It doesn't continue silently, it warns. More accurately, it does what you tell it to, which can also be a hard stop if you want to.

It's up to you to choose the right failure strategy and monitor your system if you don't want to panic, and take appropriate measures and not just ignore the warning.

It's not Linus who sounds ignorant here, it's the people applying user-space "best practices" to the kernel. If the kernel panics, the system is dead and you've lost the opportunity to diagnose the problem, which may be non-deterministic and hard to trigger on purpose.

replies(1): >>jeffre+Vp

>>EdScho+g4
It is also a fabulous way of introducing vulnerabilities.

>>hegels+t9
Panic corresponds to a potential logic bug. If you have a logic bug, you already risk its consequences even if the panic didn't happen. As long as panic can be caught (and in Rust, it's indeed the case) it is safer than the alternative.

replies(1): >>hegels+Ar1

>>avgcor+mb
That’s exactly my point. Rust’s definition of safety is a very specific one, and one has to be careful about what it actually implies in the context where Rust is employed. “Safety” isn’t a well-defined term for PL in general. “Soundness” is.

replies(2): >>avgcor+Kk >>UncleM+Ay

>>ok_dad+tb
Do you know absolutely every medical device in existence and do you know how broad the definition of a medical device is (including e.g. the monitor attached to the PC used for displaying X-ray images)?

replies(2): >>jmilli+Ld >>ok_dad+ug

>>gpm+7a
In a Rust defined without `panic!`, a stack overflow would not be able to panic. What would probably happen is the process would just die, like C.

>>Velila+hb
By exceptions, I’m referring to languages with exceptions as a dedicated language construct with automatic stack unwinding, and preferably without UB (e.g. Java or C#). Rust doesn’t have exceptions in that sense.

replies(1): >>dureui+Uq

>>gmueck+Sc

  > including e.g. the monitor attached to the PC used for displaying
  > X-ray images

Somewhat off-topic, but I used to work in a dental office. The monitors used for displaying X-rays were just normal monitors, bought from Amazon or Newegg or whatever big-box store had a clearance sale. Even the X-ray sensors themselves were (IIRC) not regulated devices, you could buy one right now on AliExpress if you wanted to.

replies(1): >>gmueck+sq

>>layer8+Ca
Interesting that despite tools like Splint, 70% of high severity security vulns, including in well staffed projects like Chrome and Windows, are due to memory unsafety. The false negatives of security analysis tools are significant and are the very reason Rust got developed.

replies(1): >>layer8+ff

>>jmilli+(OP)
I might be missing something here. So I understand panic! will essentially crash the kernel, that makes sense to me as a problem.

But wouldn't reading outside an array bounds also possibly do that? It coudl seg fault which is essentially the same thing.

Is it that reading out of bounds on an array isn't guaranteed to crash everything while a panic always will?

replies(1): >>fdr+XU1

>>pornel+Z8
Yes, or jumping to the "emergency stop" routine can instead trigger "go full throttle" because the jump address has been corrupted.

Or in an actual vehicle, the "emergency stop" (if that means just stomping on the brakes) can flip the car and kill its passengers.

replies(1): >>stjohn+uC

>>pca006+R7
Such a device should have a simple protective circuit that doesn't allow this. This is common in any expensive or critical industrial system.

replies(1): >>yellow+1O1

>>jmilli+(OP)
I don't have any experience with this project, but I know a lot of panics in my Rust code look like this (you probably know this already, just setting up a question):

    fn foo<T>() -> Option<T> {
        // Oops, something went wrong and we don't have a T.
        None
    }

    fn bar<T>() -> T {
        if let Some(t) = foo() {
            t
        } else {
            // This could've been an `unwrap`; just being explicit here
            panic!("oh no!");
        }
    }

A panic in this case is exactly like an exception in that the function that's failing doesn't need to come up with a return value. Unwinding happens instead of returning anything. But if I was writing `bar` and I was trying to follow a policy like "never unwind, always return something", I'd be in a real pickle, because the way the underlying `foo` function is designed, there aren't any T's sitting around for me to return. Should I conjure one out of thin air / uninitialized memory? What does the kernel do in situations like this? I guess the ideal solution is making `bar` return `Option<T>` instead of `T`, but I don't imagine that's always possible?

replies(2): >>strict+Xf >>jmilli+jg

>>dcsomm+7e
No, the reason Rust was developed (with regard to that aspect) was that the necessary static analysis is enforced by the compiler if it is built into the language, whereas otherwise (if not built in) it empirically doesn’t get a lot of adoption. There’s nothing Rust’s static analysis is doing that couldn’t be done with the same semantics using an external static analyzer and linter annotations.

The ideas of Rust weren’t new when Rust was developed. The actual integration into a new programming language beyond experimental status was, and the combination with ML-style functional programming.

>>pornel+Lb
> But that will also give a precise error location, which makes reporting and fixing of the root cause easier. It could be harder to pinpoint of the code rolled forward in broken state.

I think you must have missed out on how Linux currently handles these situations. It does not silently move on past the error; it prints a stack trace and CPU state to the kernel log before moving on. So you have all of the information you'd get from a full kernel panic, plus the benefit of a system that may be able to keep running long enough to save that kernel log to disk.

>>jmilli+(OP)
The problem with #3 is not really about C vs. Rust, it's about modern optimizing compilers (including GCC). A compiler is allowed to assume that UB simply won't happen, so it makes no guarantees whatsoever about what happens if UB is ever hit. It's not "return a random value", it really is "all bets are off". There is no guarantee that you manage to "limp along" in any reasonable sense, let alone report the failure. That's what "panic and recover" mechanisms are for, and yes even the ? operator in Rust can be seen as such a mechanism.

>>oconno+bf
The harsh truth is that you need to think about every single case of failure, and decide what to do when things go south.

If you look at how POSIX does it, pretty much every single function has error codes, signaling everything from lost connections, to running out of memory, entropy or whatnot. Failures are hard to abstract away. Unless you have some real viable fallback to use, you're going to have to tell the user that something went wrong and leave it up to them to decide what the application can best do in this case.

So in your case, I would return Result<T>, and encode the errors in that. Simply expose the problem to the caller.

>>jmilli+N3
> Static analysis (Clippy) can get you pretty far.

What's funny about this is that (while it's true!) it's exactly the argument that Rustaceans tend to reject out of hand when the subject is hardening C code with analysis tools (or instrumentation gadgets like ASAN/MSAN/fuzzing, which get a lot of the same bile).

In fact when used well, my feeling is that extra-language tooling has largely eliminated the practical safety/correctness advantages of a statically-checked language like rust, or frankly even managed runtimes like .NET or Go. C code today lives in a very different world than it did even a decade ago.

replies(6): >>jmilli+xi >>lifthr+ej >>j-krie+Em >>tialar+Es >>P5fRxh+p41 >>pjmlp+504

>>oconno+bf
Couple options:

1. Have a constraint on T that lets you return some sort of placeholder. For example, if you've got an array of u8, maybe every read past the end of the array returns 0.

  fn bar<T: Default>() -> T {
    if let Some(t) = foo() {
      t
    } else {
      eprintln!("programmer error, foo() returned None!");
      Default::default()
    }
  }

2. Return a `Option<T>` from bar, as you describe.

3. Return a `Result<T, BarError>`, where `BarError` is a struct or enum describing possible error conditions.

  #[non_exhaustive]
  enum BarError {
    FooIsNone,
  }
  
  fn bar<T>() -> Result<T, BarError> {
    if let Some(t) = foo() {
      Ok(t)
    } else {
      eprintln!("programmer error, foo() returned None!");
      Err(BarError::FooIsNone)
    }
  }

>>gmueck+Sc
I worked in medical device quality control and so, yes, I know all about the FDA requirements for medical devices and ISO 13485. I can say, with certainty, that base Linux would not be allowed to run in a medical device in the USA. It's software of unknown provenance (SOUP) and would absolutely NOT be used as-is.

replies(6): >>smolde+gm >>gmueck+9s >>voakba+Yt >>sarlal+i01 >>cplusp+gz1 >>Suzura+WI2

>>hegels+t9
Panic in Rust need not equate to a literal kernel panic. It can call an oops handler, which might manage to keep the system operational depending on where the failure occurred.

replies(1): >>hegels+os1

>>ajross+ig
Analysis for memory safety is really hard. For >40 years there's been entire sub-industries focused just on analysis of C/C++ memory safety and it's nowhere near a solved problem. That's why Rust has `unsafe`, it's so the programmer has a way to encode logic that they believe is safe but aren't yet able to prove.

Analysis of panic-safety in Rust is comparatively easy. The set of standard library calls that can panic is finite, so if your tool just walks every call graph you can figure out whether panic is disproven or not.

replies(1): >>ajross+Uj1

>>EdScho+g4
As a kernel developer, I mostly disagree. Panicking hard is nice unless you are the user whose system rebooted without explanation or the developer trying to handle the bug report saying “my system rebooted and I have nothing more to say”.

Getting logs out is critical.

replies(1): >>EdScho+rk

>>ajross+ig
While C/C++ has been made much safer by sanitizers, fuzzing and lints, they are:

- Only useful when actually being used, which is never the case. (Seriously, can we make at least ASAN the default?)

- Often costly to always turn them on (e.g. MSAN).

- Often requires restructuring or redesign to get the most out of them (especially fuzzing).

Rust's memory safety guarantee does not suffer from first two points, and the third point is largely amortized into the language learning cost.

replies(1): >>pjmlp+G04

>>amluto+9j
One does not rule out the other. You could simply write crash info to some NVRAM or something, and then do a reboot. Then you can recover it during the next boot.

But there is no need to let userspace processes continue to run, which is exactly what Linux does.

replies(1): >>wtalli+oq

>>layer8+b4
"Undecidable" Is way too close to your day2day program than you think: https://en.wikipedia.org/wiki/Rice%27s_theorem

I would like to learn otherwise, but even a React JS+HTML page is undecidable... its scope is limited by chrome V8 js engine (like a vm), but within that scope I don't think you can prove anything more. otherwise we could just make static analysis to check if it will leak passwords...

replies(1): >>layer8+hn

>>layer8+Oc
Memory safety is a well-defined term.

replies(1): >>layer8+Pl

>>roywig+G7
"Print 1" is trivial according to this: https://en.wikipedia.org/wiki/Rice%27s_theorem.

But day to day programs are not trivial... as for your example, just switch it with this code: `print(gcd(user_input--,137))`... now it's quite more hard to "just ban some final states"

replies(3): >>joshua+5z >>UncleM+8z >>roywig+oz

>>ok_dad+tb
> I know medical devices DO NOT use Linux

Absolutely false.

>>avgcor+Kk
I agree, but that isn’t the term that was used here, and Rust proponents usually mean more than memory safety by “safe” (like e.g. absence of UB).

replies(2): >>avgcor+Ap >>veber-+G52

>>ok_dad+ug
Makes me wonder what they run their NAS software with. Or their internal web-hosting, or their networking devices, or any of the other devices they have littered about. I'd swear on the Bible that I've seen a dentist or two running KDE 3 before...

replies(1): >>ok_dad+Cx

>>ajross+ig
The difference being that C is barely changed for the sake of backwards compatibility while the panicking in kernel space is a recognized problem with rust that is being actively worked on and will have a solution in the future.

>>yonixw+Dk
I’m not sure you understand Rice’s theorem correctly. It means that you can’t write an algorithm that takes an arbitrary program as input and tells you whether it fulfills a given nontrivial semantic property. But you can write an algorithm that can tell you for some subset of programs. So as a developer, if you restrict yourself to releasing programs for which the algorithm has halted and given you the desired answer, you are fine.

Depending on the semantic property to check for, writing such an algorithm isn’t trivial. But the Rust compiler for example does it for memory safety, for the subset of valid Rust programs that don’t use Unsafe.

replies(1): >>yonixw+Yn

>>layer8+hn
But isn't every program we write today (Rust, C++, Python, JS, etc.) raise up to the level of an "arbitrary program"? How do you find those "some subset of programs" that will halt by said algorithm?

The only sure way I can think of, is when you force your program to go through a more narrow non-turing algorithm. Like sending data through a network after Serialization. Where we could limit the De-Serialization process to be non Turing (json, yaml?).

Same for code, that uses non-turing API, like memory allocation in a dedicated per process space. Or rust "borrow" mechanics that the compiler enforces.

But my point is, everyday program are "arbitrary program" and not a red haring. Surly from the kernel perspective, which is Linus point imo.

replies(1): >>layer8+Jq

>>Wastin+I7
Well QNX is running on a gazillion of devices, even resource restricted ones, without problems. It can be slower but it does not have to always. That is gar from being a fantasy world.

>>layer8+Pl
Going through that thread (a few posts back) it seems that “Rust is safe” (as seen in this submission title) was stated first by Torvalds. It wasn’t mentioned first by a “Rust aficianado”. So you would really have to ask Torvalds what he meant. But his mentioning of it (and this submission) obviously alludes to “safe” claims by the Rust project. Which has always been memory safety.

replies(1): >>layer8+gs

>>mike_h+fc
I agree with your statements, but I wonder: who is warned typically? An end user via a log he neither reads nor understands? The chance that this will lead to the right measure is low, isn't it.

replies(1): >>Gibbon+EX

>>EdScho+rk
> You could simply write crash info to some NVRAM or something, and then do a reboot. Then you can recover it during the next boot.

That works for some systems: those for which "some NVRAM or something" evaluates to a real device usable for that purpose. Not all Linux systems provide such a device.

> But there is no need to let userspace processes continue to run, which is exactly what Linux does.

Userspace processes usually contain state that the user would also like to be persisted before rebooting. If my WiFi driver crashes, there's nothing helpful or safer about immediately bringing down the whole system when it's possible to keep running with everything but networking still functioning.

replies(2): >>EdScho+Ts >>kaba0+0C2

>>jmilli+Ld
That's not the case in the EU. I've worked for an equipment manufacturer for dental clinics. While the monitors were allowed to be off the shelf, the operator (dental clinic) is required to make sure that they work properly and display the image correctly - obey certain brightness and color resolution/calibration standards. Our display software had to refuse to work on an uncalibrated monitor.

replies(1): >>alias_+OT

>>yonixw+Yn
For the first question, see the second paragraph I added in my previous comment.

Regarding the second question, in the general case you have to guess or think hard, and proceed by trial and error. You notice that the analyzer takes more time than you’re willing to wait, so you stop it and try to change your program in order to fix that problem.

We already have that situation today, because the Rust type system is turing-complete. Meaning, the Rust compiler may in principle need an infinite amount of time to type-check a program. Normally the types used in actual programs don’t trigger that situation (and the compiler also may first run out of memory).

By the way, even if Rust’s type system wasn’t turing-complete, the kind of type inference it uses takes exponential time, which in practice is almost the same as the possibility of non-halting cases, because you can’t afford to wait a hundred or more years for your program to finish compiling.

> But my point is, everyday program are "arbitrary program"

No, most programs we write are from a very limited subset of all possible programs. This is because we already reason in our heads about the validity and suitability of our programs.

replies(1): >>yonixw+os

>>layer8+Ad
But panics in rust are pretty much exceptions though?

The differences are they are actually meant to be used for exceptional situations ("assert violated => there's a bug in this program" or "out of memory, catastrophic runtime situation") and they are not typed (rather, the panic holds a type erased payload).

Other than that, it performs unwinding without UB, and is catchable[0]. I'm not seeing the technical difference?

[0]: https://doc.rust-lang.org/std/panic/fn.catch_unwind.html

replies(1): >>layer8+nt

>>jmilli+N3
> My favorite trick is to link a no_std binary with no panic handler, and see if it has a linker error. No linker error = no calls to panic handler = no panics.

Oh? How do you do that? Do you have a written guide handy? Very curious about this.

>>Wastin+I7
QNX and INTEGRITY customers will be to differ.

>>ok_dad+ug
Then you should know that the use of SOUP is not so clear cut. It depends on the class of device and more specifically, on the part of the device that the software is used on. I know medical devices running SOUP operating systems like Linux. They went to some length to show that the parts running Linux and the critical functions of the device were sufficiently independent. This isolation is specifically allowed by the standards you quote.

It's even worse on things like car dashboards: some warning lights on dashboards need to be ASIL-D conformant, which is quite strict. However, developing the whole dashboard software stack to that standard is too expensive. So the common solution these days is to have a safe, ASIL-D compliant compositor and a small renderer for the warning lights section of the display while the rendering for all the flashy graphics runs in an isolated VM on standard software with lower safety requirements. It's all done on the same CPU and GPU.

replies(1): >>ok_dad+yx

>>avgcor+Ap
I disagree that “safe” as used by the Rust community is always restricted to memory safety, see my parent comment.

>>layer8+Jq
> Regarding the second question, in the general case you have to guess or do trial and error.

> You notice that the analyzer takes more time than you’re willing to wait,

I see, thanks, didn't know about this feedback loop as I'm not a rust programmer. Still on my todo list to learn.

replies(1): >>layer8+Du

>>ajross+ig
Rust makes choices which drastically simplify the analysis problem.

The most obvious is mutable references. In Rust there can be either one mutable reference to an object or there may be any number of immutable references. So if we're thinking about this value here, V, and we're got an immutable reference &V so that we can examine V well... it's not changing, there are no mutable references to it by definition. The Rust language won't let us have &mut V the mutable reference at the same time &V exists and so it needn't ever account for that possibility†.

In C and C++ they break this rule all the time. It's really convenient, and it's not forbidden in C or C++ so why not. Well, now the analysis you wanted to do is incredibly difficult, so good luck with that.

† This also has drastic implications for an optimiser. Rust's optimiser can often trivially conclude that a= f(b); can't change b where a C or C++ optimiser is obliged to admit that actually it's not sure, we need to emit slower code in case b is just an alias for a.

replies(1): >>ajross+fq1

>>wtalli+oq
> If my WiFi driver crashes, there's nothing helpful or safer about immediately bringing down the whole system when it's possible to keep running with everything but networking still functioning.

There have been various examples of WiFi driver bugs leading to security issues. Didn’t some Broadcom WiFi driver once have a bug in how it processed non-ASCII SSID names, allowing you to trigger remote code execution?

replies(1): >>wtalli+lw

>>dureui+Uq
You’re probably right now that I’ve read up on it, I wasn’t previously aware of catch_unwind.

replies(1): >>dureui+gu

>>ok_dad+ug
That’s an odd thing to claim. I have worked on certified medical devices that run custom Linux distribution.

Mind you, that experience also severely soured me on the quality of medical software systems, due to poor quality of the software that ran in that distribution. Linux itself was a golden god in comparison to the crap that was layered on top of it.

replies(1): >>ok_dad+Tx

>>layer8+nt
Glad to be of service. Note that the idiomatic error handling in rust is still Result based rather than panic/catch_unwind based.

Nevertheless a long living application like, e.g., a webserver will catch panics coming from its subtasks (e.g., its request handlers) via catch_unwind

>>yonixw+os
I don’t think it actually happens in Rust in practice, or only very rarely. I was more talking about the hypothetical case for any static analysis of nontrivial program properties as in Rice’s theorem.

>>Miguel+A9
Maybe these? https://www.cyberciti.biz/tips/reboot-or-halt-linux-system-i...

>>EdScho+Ts
We're not talking about bugs in general, we're talking about bugs whose manifestation is caught by error checking already in the code. For device drivers, those situations can often be handled safely by simply disabling the device in question while leaving the rest of the OS running. I doubt the Broadcom bug you're thinking of triggered a WARN_ON() in the code path allowing for a remote code execution. (Also, the highest-profile Broadcom WiFi remote code execution bug I'm aware of was a bug in the WiFi chip's closed-source firmware, which doesn't even run on the same processor as the host Linux OS.)

>>gmueck+9s
> They went to some length to show that the parts running Linux and the critical functions of the device were sufficiently independent.

Let's not be too pedantic. You, as an experienced medical device engineer, probably knew what I meant was that they would never use Linux in the critical parts of a medical device as the OP had originally argued. Any device would definitely do all of it's functionality without the part with Linux on it.

The OP was still a major strawman, regardless of my arguments, because the Linux kernel will never be in the critical path of a medical device without a TON of work to harden it from errors and such. Just the fact that Linus' stance is as said would mean that it's not an appropriate kernel for a medical device, because they should always fail with an error and stop under unknown conditions rather than just doing some random crap.

>>smolde+gm
Those aren't medical devices.

>>voakba+Yt
I'd like to hear more about that, but I assume it's much like the other poster here that described a Linux system that is a peripheral device attached to the actual medical device that does the medical shit.

replies(2): >>gmueck+2L >>voakba+FL2

>>layer8+Oc
> “Safety” isn’t a well-defined term for PL in general. “Soundness” is.

This is false. "Safety" and "Liveness" are terms used by the PL field to describe precise properties of programs and they have been used this way for like 50 years (https://en.wikipedia.org/wiki/Safety_and_liveness_properties). A "safety" property describes a guarantee that a program will never reach some form of unwanted state. A "liveness" property describes a guarantee that a program will eventually reach some form of wanted state. These terms would be described very early in a PL course.

replies(1): >>layer8+uX

>>yonixw+Zk
Indeed, but panic is easier because in some ways checking if a program can panic is akin to checking if the program links the panic function.

And that's pretty easy to statically analyze.

>>yonixw+Zk
That's not what "trivial property" means w.r.t. Rice's Thm.

The point is that you can produce a perfectly working analysis method that is either sound or complete but not both. "Nowhere in the entire program does the call 'panic()' appear is a perfectly workable analysis - it just has false positives.

>>yonixw+Zk
Turing machines have a set of states and a transition function that governs how it moves between states. The transition function is a bunch of mappings like this:

    (input state, input symbol) --> (output state, output symbol, move left/right)

This is all static, so you can look at the transition table to see all the possible output symbols. If no transition has output symbol 1, then it never outputs 1. It doesn't matter how big the Turing machine is or what input it gets, it won't do it. This is basically trivial, but it's still a type of very simple static analysis that you can do. Similarly, if you don't have any states that halt, the machine will never halt.

This is like just not linking panic() into the program: it isn't going to be able to call it, no matter what else is in there.

>>layer8+l1
I think you're mixing up what a -kernel- rust programmer would do (but should know not to do) vs what rust "is", it's not the same. You have to enter a different mindset with the kernel, it will be a new hybrid of c context in the kernel and user-land programming.

>>pornel+Lb
On one embedded projected we had a separate debug chip could safely shutdown what might be dangerous circuits in the case of controller failure. The source code for that was much much much smaller than the controller and heavily vetted by mulitiple people. The small dedicated would initiate circuit shutdown on panic from the linux kernel on the controller. My point being is it's hard to know what happens after a panic, and logging and such is nice, but that may or may not be available, but being able to do some action as simple as sending a "panic" signal to a second dedicated processor to shut down critical systems in a controlled manner is nice. "Stop the world" can be very dangerous in some situations. There were even more independent backup failsafes on the potentially dangerous circuits as well, but redundancy is even more insurance something bad won't happen.

>>EdScho+g4
Yeah, this part has never really been true.

> In the kernel, "panic and stop" is not an option

That's simply not true. It's an option I've seen exercised many times, even in default configurations. Furthermore, for some domains - e.g. storage - it's the only sane option. Continuing when the world is clearly crazy risks losing or corrupting data, and that's far worse than a crash. No, it's not weird to think all types of computation are ephemeral or less important than preserving the integrity of data. Especially in a distributed context, where this machine might be one of thousands which can cover for a transient loss of one component but letting it continue to run puts everything at risk, rebooting is clearly the better option. A system that can't survive such a reboot is broken. See also: Erlang OTP, Recovery Oriented Computing @ Berkeley.

Linus is right overall, but that particular argument is a very bad one. There are systems where "panic and stop" is not an option and there are systems where it's the only option.

replies(1): >>wtalli+wD

>>mike_h+ye
It's about the odds here. Nothing is 100% safe. Independent systems almost always provide backup safety incase the OS/embedded system fails. Thing like overcurrent detector, brown out detectors, speed governors, etc in case code does something as a result of running corrupted (or something similarly awful)

>>mike_h+s8
> What about a medical procedure that WILL kill the patient if interrupted? What about life support in space?

The proper answer to those is redundancy, not continuing in an unknown and quite likely harmful state.

replies(2): >>yencab+HQ >>John23+rO1

>>layer8+85
That's where linters and code reviews come in, you will never 100% prevent stupid coding, that's where review either automated and/or other coders and coding standards come in.

>>layer8+P8
exceptions would be awful in the kernel. I would be highly surprised if kernels like fuscia allow c++ exceptions.

>>notaco+jC
> Furthermore, for some domains - e.g. storage - it's the only sane option.

Can you elaborate on this? Because failing storage is a common occurrence that usually does not warrant immediately crashing the whole OS, unless it's the root filesystem that becomes inaccessible.

replies(1): >>notaco+HJ

>>roywig+G7
> Rust is arguably less safe in that aspect than C, due to the general Rust practice of panicking upon unexpected conditions

For context, this is OP's sentence that I responded to in particular. Ensuring safety [1] is way less trivial than looking for a call to "panic" in the state machine. You can remove the calls to "panic" and this alone does not make your program safer than the equivalent C code. It just makes it more kernel friendly.

[1] not only memory safety

>>Wastin+I7
Others have mentioned QNX. There is also ThreadX, which is a "picokernel". Both are certified for use in safety-critical domains. There are other options as well. Segger do one, for example, and there's also SafeRTOS, and others.

"Performance" is a red herring. In a safety-critical system, what matters is the behaviour and the consistency. ThreadX provides timing guarantees which Linux can not, and all of the system threads are executed in strict priority order. It works extremely well, and the result is a system for which one can can understand the behaviour exactly, which is important for validating that it is functioning correctly. Simplicity equates to reliability. It doesn't matter if it's "slow" so long as it's consistently slow. If it meets the product requirements, then it's fine. And when you do the board design, you'll pick a part appropriate to the task at hand to meet the timing requirements.

Anyway, systems like ThreadX provide safety guarantees that Linux will never be able to. But the interface is not POSIX. And for dedicated applications that's OK. It's not a general-purpose OS, and that's OK too. There are good reasons not to use complex general-purpose kernels in safety-critical systems.

IEC 62304 and ISO 13485 are serious standards for serious applications, where faults can be life-critical. You wouldn't use Linux in this context. No matter how much we might like Linux, you wouldn't entrust your life to it, would you? Anyone who answered "yes" to that rhetorical question should not be trusted with writing safety-critical applications. Linux is too big and complex to fully understand and reason about, and as a result impossible to validate properly in good faith. You might use it in an ancillary system in a non-safety-critical context, but you wouldn't use it anywhere where safety really mattered. IEC 62304 is all about hazards and risks, and risk mitigation. You can't mitigate risks you can't fully reason about, and any given release of Linux has hundreds of silly bugs in it on top of very complex behaviours we can't fully understand either even if they are correct.

replies(1): >>Wastin+bR

>>wtalli+wD
Depends on what you mean by "failing storage" but IMX it does warrant an immediate stop (with or without reboot depending on circumstances). Yes, for some kinds of media errors it's reasonable to continue, or at least not panic. Another option in some cases is to go read-only. OTOH, if either media or memory corruption is detected, it would almost certainly be unsafe to continue because that might lead to writing the wrong data or writing it to the wrong place. The general rule in storage is that inaccessible data is preferable to lost, corrupted, or improperly overwritten data.

Especially in a distributed storage system using erasure codes etc., losing one machine means absolutely nothing even if it's permanent. On the last storage project I worked on, we routinely ran with 1-5% of machines down, whether it was due to failures or various kinds of maintenance actions, and all it meant was a loss of some capacity/performance. It's what the system was designed for. Leaving a faulty machine running, OTOH, could have led to a Byzantine failure mode corrupting all shards for a block and thus losing its contents forever.

BTW, in that sort of context - where most bytes in the world are held BTW - the root filesystem is more expendable than any other. It's just part of the access system, much like firmware, and re-imaging or even hardware replacement doesn't affect the real persistence layer. It's user data that must be king, and those media whose contents must be treated with the utmost care.

replies(1): >>wtalli+CO

>>layer8+b4
> Rust is arguably less safe in that aspect than C, due to the general Rust practice of panicking upon unexpected conditions

For clarification, I responded to this in particular because "safety" is being conflated with "panicking" (bad for kernel). I reckoned "Unexpected conditions" means "arbitrary programs", hence my response, otherwise you could just remove the call to panic.

>>ok_dad+Tx
It is not a peripheral device if it runs the UI with all the main controls, is it?

replies(1): >>ok_dad+OH1

>>titzer+R4
> 40 years of microkernels, of which I know Linus is aware of, beg to differ.

For better or worse Linux is NOT a microkernel. Therefore, the sound microkernel wisdom is not applicable to Linux in its present form. The "impedance match" of any new language added to the linux kernel is driven by what current kernel code in C is doing. This is essentially linux kernel limitation. If Rust cannot adapt to these requirements it is a mismatch for linux kernel development. For the other kernels like Fuchsia Rust is a good fit. BTW, core Fuchsia kernel itself is still in C++.

>>notaco+HJ
I understand why a failing drive or apparently corrupt filesystem would be reason to freeze a filesystem. But that's nowhere close to kernel panic territory.

Even in a distributed, fault-tolerant multi-node system, it seems like it would be useful for the kernel to keep running long enough for userspace to notify other systems of the failure (eg. return errors to clients with pending requests so they don't have to wait for a timeout to try retrieving data from a different node) or at least send logs to where ever you're aggregating them.

replies(1): >>notaco+gS

>>notaco+DC
The leap second bug would have crashed all nodes of a redundant system, at the same time...

replies(1): >>notaco+LT

>>rleigh+0J
Sorry, I’m a bit lost regarding your comment. The discussion was about code safety in Linux in the context of potentially introducing Rust. I don’t really see the link with microkernels in the context of safety oriented RTOS. I think you are reacting to my comment about microkernels performance in the 90s which I maintain.

Neither QNX nor ThreadX are intended to be general purpose kernel. I haven’t looked into it for a long time but QNX performances used to not be very good. It’s small. It can boot fast. It gives you guaranty regarding time of return. Everything you want from a RTOS in a safety critical environment. It’s not very fast however which is why it never tried to move towards the general market.

>>wtalli+CO
In a system already designed to handle the sudden and possibly permanent loss of a single machine to hardware failure, those are nice to have at best. "Panic" doesn't have to mean not executing a single other instruction. Logging e.g. over the network is one of the things a system might do as part of its death throes, and definitely was for the last few such systems I worked on. What's important is that it not touch storage any more, or issue instructions to other machines to do so, or return any more possibly-corrupted data to other systems. For example, what if the faulty machine itself is performing block reconstruction when it realizes the world has turned upside down? Or if it returns a corrupted shard to another machine that's doing such reconstruction? In both of those scenarios the whole block could be corrupted even though that machine's local storage is no longer involved. I've seen both happen.

Since the mechanisms for ensuring the orderly stoppage of all such activity system-wide are themselves complicated and possibly error-prone, and more importantly not present in a commodity OS such as Linux, the safe option is "opt in" rather than "opt out". In other words, don't try to say you must stop X and Y and Z ad infinitum. Instead say you may only do A and B and nothing else. That can easily be accomplished with a panic, where certain parts such as dmesg are specifically enabled between the panic() call and the final halt instruction. Making that window bigger, e.g. to return errors to clients who don't really need them, only creates further potential for destructive activity to occur, and IMO is best avoided.

Note that this is a fundamental difference between a user (compute-centric) view of software and a systems/infra view. It's actually the point Linus was trying to get across, even if he picked a horrible example. What's arguably better in one domain might be professional malfeasance in the other. Given the many ways Linux is used, saying that "stopping is not an option" is silly, and "continuing is not an option" would be equally so. My point is not that what's true for my domain must be true for others, but that both really are and must remain options.

P.S. No, stopping userspace is not stopping everything, and not what I was talking about. Or what you were talking about until the narrowing became convenient. Your reply is a non sequitur. Also, I can see from other comments that you already agree with points I have made from the start - e.g. that both must remain options, that the choice depends on the system as a whole. Why badger so much, then? Why equivocate on the importance (or even meaningful difference) between kernel vs. userspace? Heightening conflict for its own sake isn't what this site is supposed to be about.

replies(1): >>wtalli+JU

>>yencab+HQ
Perhaps. On the other hand, letting a medical device continue moving an actuator or dispensing a medication when it's known to be in a bad "never happen" state could also be fatal. Ditto for the "life support in space" example. Ditto for anything reliant on position, where the system suddenly realizes it has no idea whether its position is correct. Imagine that e.g. on a warship. Limiting responses to external inputs (including time adjustments) can ameliorate such problems. So can software diversity. Many safety-critical systems require one or both, and other measures as well. Picking one black-swan event while ignoring literally every day scenarios doesn't seem very helpful. That's especially true when the thing you're advocating is what actually happened and led to its own bad outcomes.

replies(1): >>yencab+zu1

>>gmueck+sq
Interesting, how does your software detect an uncalibrated monitor? Did it come with a calibration device which had to be used to scan the display output to check?

I don't suppose monitors report calibration data back to display adapters do they?

replies(2): >>Gauntl+8V >>gmueck+IW

>>notaco+gS
> "Panic" doesn't have to mean not executing a single other instruction.

We're talking specifically about the current meaning of a Linux kernel panic. That means an immediate halt to all of userspace.

>>alias_+OT
My guess is they had some heuristic based on EDIDs, which are incredibly easy to spoof.

https://smile.amazon.com/EVanlak-Passthrough-Generrtion-Elim...

replies(1): >>gmueck+WX

>>alias_+OT
I didn't work on that specific software team and it has been a long time since I worked there. But the software came with its custom calibration routine and I believe that the calibration result was stored with model and serial number information from the monitor EDID.

replies(1): >>alias_+UU1

>>UncleM+Ay
What I mean is that there is no universal definition of which properties are safety properties. In principle, you can define any property you can formally reason about as a safety property. Therefore, whenever you talk about safety, you first have to define which properties you mean by that.

In the context of Rust, there are a number of safety properties that Rust guarantees (modulo unsafe, FFI UB, etc.), but that set of safety properties is specific to Rust and not universal. For example, Java has a different set of safety properties, e.g. its memory model gives stronger guarantees than Rust’s.

Therefore, the meaning of “language X is safe” is entirely dependent on the specific language, and can only be understood by explicitly specifying its safety properties.

replies(2): >>avgcor+iZ >>UncleM+c61

>>jeffre+Vp
The couple of times I had to go digging into the kernel what the thing looks like to me is a very large bare metal piece of firmware. As someone who writes firmware that very last thing you ever want is it to hang or reset without reporting any diagnostics. Because you have no idea where the offending code is. I'll belabor the point for people that think a large program is a few thousand lands. With the kernel it's millions of lings of code mostly written by other people.

Small rant. ARM cortex processors overwrites the stack pointer on reset. That's very very very dumb because after the watchdog trips you have no idea what the code was doing. Which means you can't report what the code was doing when that happened.

>>Gauntl+8V
Yes, but why would you go to these lengths? The purpose of the whole mechanism is to prevent accidental misdiagnosis based on an incorrectly interpreted X-ray image. This isn't DRM, just a safeguard against incorrect use of equipment.

replies(1): >>Gauntl+jI1

>>layer8+85
> that ubiquitous functions like unwrap() panic if the programmer has made an error.

You're not wrong but you chose a hilarious example. Unwrap's entire purpose is to turn unhandled errors into panics!

Array indexing, arithmetic (with overflow-checks enabled), and slicing are examples where it's not so obvious there be panic dragons. Library code does sometimes panic in cases of truly unrecoverable errors also.

>>layer8+uX
> Therefore, whenever you talk about safety, you first have to define which properties you mean by that.

Like “memory safety”?

replies(1): >>layer8+F31

>>ok_dad+ug
Ok, that's good for a U.S. centric view. Do you know that every medical device manufactured in China, for use in China meets the same requirements? Same for India, Russia, etc. The U.S. isn't the world and I'd be surprised if Linux weren't in use in some critical systems around the world that would be shocking for U.S. experts on those types of systems.

>>avgcor+iZ
For example. Rust has other safety properties beyond memory safety.

>>ajross+ig
I've long been of the opinion that the rust community is missing the boat in a lot of respects.

Rust could do the external tooling better than any other language out there, but they're so focused on the _language_ preventing abuse that they've largely missed the boat.

>>layer8+uX
That's true for "soundness" too. Things aren't just "sound". They are sound with respect to something. So when you use "soundness" as a comparison against "safety", you'll have to understand how somebody could interpret your post in the way that I did.

Almost all discussion about Rust is in comparison to C and C++, by far the dominant languages for developing native applications. C and C++ are famously neither type-safe nor memory-safe and it becomes a pretty easy shorthand in discussions of Rust for "safety" to refer to these properties.

>>jmilli+xi
> That's why Rust has `unsafe`, it's so the programmer has a way to encode logic that they believe is safe but aren't yet able to prove.

That's not the right way to characterize this. Rust has unsafe for code that is correct but that the compiler is unable to detect. Foreign memory access (or hardware MMIO) and cyclic data structures are the big ones, and those are well-specified, provable, verifiable regimes. They just don't fit within the borrow checker's world view.

Which is something I think a lot of Rust folks tend to gloss over: even at it's best, most maximalist interpretation, Rust can only verify "correctness" along axes it understands, and those aren't really that big a part of the problem area in practice.

replies(1): >>kaba0+zJ2

>>layer8+Ca
Splint doesn't make C memory safe. What I meant is that it doesn't prevent the same problems that Rust does. Hence, you can add a linter to rust to prevent panics. You cannot add a linter to C to make it memory safe.

>>tialar+Es
That point would be stronger if Rust were showing up as faster than C compilers on this axis, and it isn't. In point of fact aliasing rules (which is what C calls the problem you're talking about) are very well travelled ground and the benefits and risks of -fno-strict-aliasing are well understood. Static analysis tools are in fact extremely good at this particular area of the problem. And of course at the level of emerging technologies, LTO makes this largely a non-issue because the compiler is able to see the aliasing opportunities at the level of global analysis. It doesn't need the programmer or the compiler to promise it an access is unaliased, it can just check and see.

replies(1): >>kaba0+ZJ2

>>lifthr+Bc
1. As far as i'm aware there's no way to reliably catch all panics. catch_unwind does not catch all panics. 2. The whole point is that consequences of a panic are worse than the consequences of memory corruption. That's how the kernel was designed. There was an explicit design decision not to kernel panic in every situation where a logic error occurs.

replies(1): >>lifthr+tE1

>>zozbot+ii
As far as I'm aware there's no way to reliably catch all panics. catch_unwind does not catch all panics. Handlers don't stop the program from terminating abruptly.

>>notaco+LT
Picking medical devices and warships is also quite the cherry picking. Most Linuxes aren't like that. Critical embedded systems tend to have a hard realtime component, and if Linux is on the system it sits under e.g. seL4, or on a different CPU.

At the end of the day, what Linux does is what Linus wants out of it. He's stated, often, that halting the CPU at the exact moment something goes wrong is not the goal. If your goal is to do that, you might not be able to use Linux. If your goal is to put Rust in the Linux kernel, you might have to let go of your goal.

replies(1): >>notaco+YE2

>>ok_dad+ug
Surely we can “harden” Linux for this application?

>>im3w1l+25
Fair. From the responses, clearly i didn't know what I was talking about, fair enough!

But ok, uninformed me would have guessed checking for that would be pretty straightforward in statically typed Rust. Is that something people want? Why isn't there a built-in mechanism to do it?

>>hegels+Ar1
There are tons of edge cases with panics, e.g. panic can trigger a destructor that can panic itself, or unwinding may cross a language boundary which may not be well defined, but to my knowledge `catch_unwind` does catch all panics as long as unwinding reliably works. That disclaimer in the `catch_unwind` documentation only describes the `panic = abort` case.

And I thought it was clear that kernel panic is different from Rust panic, which you don't seem to distinguish. Rust panic doesn't need to cause a kernel panic because it can be caught earlier.

replies(1): >>hegels+hI1

>>gmueck+2L
No, do you have a concrete example of this strawman, though?

Edit: I should also add (probably earlier too) that all my examples are specific to the USA FDA process. I'm sure some other place might not have the same rules.

replies(1): >>gmueck+sU1

>>lifthr+tE1
Obviously rust panic is not the same as a kernel panic. What you're taking for granted is that just because rust can catch a panic that it will. A simple overflow can cause a panic. When this happens, the panic might be caught before the kernel panics, but by then the program is probably already in an undefined state. It also might not be caught at all, and cause an actual kernel panic.

replies(1): >>lifthr+gL1

>>gmueck+WX
People are cheap and corrupt. The speed bump this presents is real, but minor, in the face of a couple medical shops looking to save $100/pop on a dozen monitors.

I hope it's rare, but I think a persistent nag window ("Your display isn't calibrated and may not be accurate") is probably a better answer than refusing to work altogether, because it will be clear about the source of the problem and less likely to get nailed down.

replies(2): >>gmueck+8V1 >>kaba0+YA2

>>hegels+hI1
The program is in a defined but undesirable state, both when panic occurred in Rust and when a "simple" uncontrolled overflow happened in C (provided that the compiler is configured to not treat it as an UB, otherwise it'd be worse). And anyone doing Rust-C interop already has to be conscious about the language boundary, which happens to be a perfect place to catch Rust panics.

>>Gare+Qe
"Should" unfortunately ain't the same as "does". The Torvaldsian (for lack of a better word) attitude seems to be to assume that someone is indeed dumb enough to design a system wherein all safety measures are software-defined, and in such a situation the software in question probably shouldn't catastrophically fail on every last failed assertion.

>>mike_h+s8
I remember working on some telecom equipment in the 90ies. It had a x86/Unix feature rich distributed management system. In other words complicated and expected to fail. The solution was a “watch dog” circuit that the main CPU had to poll every 100ms or so. Three misses and the CPU would get hard rebooted by the watch dog.

This reminds me of two things. Good system design needs a hardware-software codesign. Oxide computers has identified this, but it was probably much more common before the 90ies than after. The second thing is that all things can fail so a strategy that only hardens the one component is fundamentally limited, even flawed. If the component must not fail you need redundancy and supervision. Joe Armstrong would be my source of quote if I needed to find one.

Both rust and Linux has some potential for improvement here, but the best answers may lie in their relation to the greater system, rather than within it self. I’m thinking of WASM and hardware codesign respectively.

>>notaco+DC
Right. It's clear that many people have not heard of, or considered, Therac-25[1].

[1] https://en.wikipedia.org/wiki/Therac-25

replies(1): >>eesmit+S02

>>mike_h+s8
> What about a medical procedure that WILL kill the patient if interrupted?

Allow me to introduce you to Therac-25: https://en.wikipedia.org/wiki/Therac-25

>>avgcor+mb
Linus' point is that safety means something different in kernel programming than in PL theory, and that Rust have to be safe according to kernel rules before it can be used for kernel programming.

>>ok_dad+OH1
I can't see how you can make out a strawman in what I said. There are medical devices where the UI is running on a processor separate from the controller in charge of the core device functions. The two are talking to each other and there is no secondary way of interacting with the controller. This lessens the requirements that are put on the part running the UI, but does not eliminate them.

I'm mostly familiar with EU rules, but as far as I know the FDA regulations follow the same idea of tiered requirements based on potential harm done.

replies(1): >>ok_dad+592

>>gmueck+IW
Thanks, sounds like I need to do some reading about EDIDs; I knew _of_ them but no real understanding is what they are and what they do.

>>3a2d29+ee
What segment? You don't have mapped memory to overrun in the kernel, so that array would have to be in a very special spot to cause a bus error in such a situation. Also, even in user space, overrunning an array is far from guaranteed in trying to address an unmapped page...in fact, it often doesn't, since mapping memory with gaps for each array is prohibitively expensive (though, debugging aides like Electric Fence rely on exactly this mechanism)

>>Gauntl+jI1
You have to draw a line somewhere, I guess. As far as I remember, protections against accidental misuse and foreseeable abuse of a device are required in medical equipment. But malicious circumvention of protections or any kind of active tampering are a whole other category in my opinion.

>>John23+rO1
Therac-25 removed redundancy. Quoting the Wikipedia article: "Previous models had hardware interlocks to prevent such faults, but the Therac-25 had removed them, depending instead on software checks for safety."

replies(1): >>John23+d5c

>>jmilli+(OP)
I think it goes beyond "panics". Linus is saying that Rust cannot guarantee safety under certain circumstances and that safety still depends on the order of fuctions called by the kernel module developer. Because some checks on some compilations will be disabled.

"If you want to allocate memory, and you don't want to care about what context you are in, or whether you are holding spinlocks etc, then you damn well shouldn't be doing kernel programming. Not in C, and not in Rust.

It really is that simple. Contexts like this ("I am in a critical region, I must not do memory allocation or use sleeping locks") is fundamental to kernel programming. It has nothing to do with the language, and everything to do with the problem space."

https://lkml.org/lkml/2022/9/19/840

>>layer8+Pl
absence of UB is literally memory safety.

Rust proponents mean exactly "memory safety" when they say rust is safe because that is the only safety rust guarantees.

>>gmueck+sU1
The UI is one of the most important parts of a machine, look at the Therac-25! The FDA regulations require a lot of effort goes into the human factors, too, and the UI definitely had to be as reliable as the rest of the device and be as well engineered as the rest.

https://www.fda.gov/medical-devices/human-factors-and-medica...

Honestly, the FDA regulations go too far vs the EU regs. The company I worked for was based in the EU and the products there were so advanced compared to our versions. Ours were all based on an original design from Europe that was approved and then basically didn’t charge for 30 years. The European device was fucking cool and had so many features, it was also capable of being carried around rather than rolled. The manufacturing was almost all automated, too, but in the USA it was not at all automated, it was humans assembling parts then recording it in a computer terminal.

>>mike_h+s8
That’s why monitoring, fail-safe power offs and redundant systems are important. E.g. even at the complete failure of a CAT scanner’s higher level control (which let’s say would run on an embedded linux kernel), the system would safely stop the radiation and power off, without any instructions from an OS. Here, an inconsistent state from the OS is actually more dangerous than stopping in the middle (e.g. the OS stucks and the same, high energy radiation is continuously being released)

>>ok_dad+tb
Industrial control systems (sadly, imho) don't use Linux as often as they could/should, but such systems do have the ability to injure their operators or cause large amounts of damage of course. [1]

The first priority is safety, absolutely and without question. And then the immediate second priority is the fact that time is money. For every minute that the system is not operating, x amount of product is not being produced.

Generally, having the software fully halt on error is both dangerous and time-consuming.

Instead you want to switch to an ERROR and/or EMERGENCY_STOP state, where things like lasers or plasma torches get turned off, motors are stopped, brakes are applied, doors get locked/unlocked (as appropriate/safe), etc. And then you want to report that to the user, and give them tools to diagnose and correct the source of the error and to restart the machine/line [safely!] as quickly as possible.

In short, error handling and recovery is its own entire thing, and tends to be something that gets tested for separately during commissioning.

[1] PLC's do have the ability to <not stop> and execute code in a real time manner, but I haven't encountered a lot of PLC programmers who actually exploit these abilities effectively. Basically for more complex situations you're quickly going to be better off with more general purpose tools [2], at most handing off critical tasks to PLCs, micro-controllers, or motor controllers etc.

[2] except for that stupid propensity to give-up-and-halt at exactly that moment where it'll cause the most damage.

>>Gauntl+jI1
Medical devices are insanely expensive (a CT scanner may reach a million dollars), you won’t risk $100 on such a small thing as a screen.

>>wtalli+oq
Isn’t that exactly the reason behind microkernel’s supposed superiority?

replies(1): >>wtalli+kU3

>>yencab+zu1
> Picking medical devices ... is also quite the cherry picking

It wasn't my example. It was mike_hock's, and I was responding in the context they had set.

> Most Linuxes aren't like that.

Your ally picked the medical-device and space-life-support examples. If you think they're invalid because such systems don't use Linux, why did you forego bringing it up with them and then change course when replying to me? As I said: not helpful.

The point is not specific to Linux, and more Linux systems than you seem to be aware of do adopt the "crash before doing more damage" approach because they have some redundancy. If you're truly interested, I had another whole thread in this discussion explaining one class of such cases in what I feel was a reasonably informative and respectful way while another bad-faith interlocutor threw out little more than one-liners.

>>ok_dad+ug
I am an American citizen and a former dialysis patient, now kidney transplant recipient. I have watched in-center dialysis machines reboot during treatment, show the old "Energy Star" BIOS logo, and then boot Linux...

Felt kinda bad until I thought about how well a "Linux literally killed me" headline would do on HN, but then I realized I wouldn't be able to post the article if I actually died. Such is life. Or death? One or the other.

>>ajross+Uj1
I don’t really get your comment - are you agreeing or disagreeing with parent? Because you seemingly say the same thing.

And continuing on parent’s comment, rust can only make its memory guarantees by restricting the set of programmable programs, while C and the like’s static analysis has to work on the whole set which is simply an undecidable problem. As soon as unsafe is in the picture, it becomes undecidable as well in Rust, in general.

replies(1): >>ajross+wp3

>>ajross+fq1
It’s not a problem, because compilers simply declare it as UB and don’t care about it. If you by accident did depend on that, your program will just get a nice, hard to debug bug.

>>ok_dad+Tx
These were not peripherals. We are talking devices that would be front line in an emergency room. Terrifying.

>>kaba0+zJ2
The parent comment seemed to imply that using unsafe was a failing of the developer to prove to the compiler that the code is correct. And that's not right, unless you view thing like "doubly linked list" as incorrect code. Unsafe is for correct code that the compiler is unable to verify.

>>kaba0+0C2
It's the primary claimed benefit to microkernels. But since Linux as it exists today already handles this case, it isn't a very strong argument in favor of microkernels.

>>ajross+ig
Static analysis for C exists since 1979, date of lint's availability, the problem isn't lack of tooling, rather people actually using it.

replies(1): >>jstimp+wG7

>>lifthr+ej
> Seriously, can we make at least ASAN the default?

Android helps a bit into that sense,

https://source.android.com/docs/security/test/hwasan

>>pjmlp+504
I would assume that the things that lint could do you nowadays get by simply using -Wall -Wextra with gcc for example. While I haven't checked what a lint is required to do, but there have been plenty of situations in the past where I had to change my code in order to avoid triggering false positives from tests that run during normal compilation. For instance, there are tests that find accesses to potentially uninitialized variables.

replies(1): >>pjmlp+8y9

>>jstimp+wG7
If we are talking about the 1979 version, most likely.

If we are talking about products like PC-lint, Sonar qube, Coverity, the experience is much more than that.

>>Tomte+Ma
SUSE was used on Mars

https://www.pcmag.com/news/linux-is-now-on-mars-thanks-to-na...

>>eesmit+S02
Right, that is the point I was making.