A lot of modern userspace code, including Rust code in the standard library, thinks that invariant failures (AKA "programmer errors") should cause some sort of assertion failure or crash (Rust or Go `panic`, C/C++ `assert`, etc). In the kernel, claims Linus, failing loudly is worse than trying to keep going because failing would also kill the failure reporting mechanisms.
He advocates for a sort of soft-failure, where the code tells you you're entering unknown territory and then goes ahead and does whatever. Maybe it crashes later, maybe it returns the wrong answer, who knows, the only thing it won't do is halt the kernel at the point the error was detected.
Think of the following Rust API for an array, which needs to be able to handle the case of a user reading an index outside its bounds:
    struct Array<T> { ... }

    impl<T> Array<T> {
        fn len(&self) -> usize;

        // if idx >= len, panic
        fn get_or_panic(&self, idx: usize) -> T;

        // if idx >= len, return None
        fn get_or_none(&self, idx: usize) -> Option<T>;

        // if idx >= len, print a stack trace and return
        // who knows what
        unsafe fn get_or_undefined(&self, idx: usize) -> T;
    }
The first two are safe by the Rust definition, because they can't cause memory-unsafe behavior. The last two are safe by the Linus/Linux definition, because they won't cause a kernel panic. If you have to choose between #1 and #3, Linus is putting his foot down and saying that the kernel's answer is #3.

The language determines the definition of its constructs, not the software being written with it.
Edit: It's worth mentioning that while I think he is wrong, I think it's symptomatic of there not being a keyword/designation in Rust to express what Linus is trying to say. I would completely oppose misusing the unsafe keyword, since it has negative downstream effects on all future dependent crates, where it's no longer clear which characteristics "unsafe" refers to, which causes a split. So maybe they need to discuss a different way to label these for now and agree to improve it later.
But Rust's situation is still safer, because Rust can typically prevent more errors from ever becoming a run-time issue; e.g. you may not need array indexing at all if you use iterators. You have a guarantee that references are never null, so you don't risk a null-pointer crash, etc.
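For instance (a trivial sketch of my own, not from the thread):

    // Iterating instead of indexing: there is no `xs[i]` anywhere,
    // so there is no bounds-check panic path at all.
    fn sum(xs: &[u32]) -> u32 {
        xs.iter().copied().sum()
    }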
Rust panics are safer, because they reliably happen instead of an actually unsafe operation. Mitigations in C are usually best-effort, and if you're unlucky you silently corrupt memory instead.
Panics are a problem for uptime, but not for safety (in the sense they're not exploitable for more than DoS).
In the long term crashing loud and clear may be better for reliability. You shake out all the bugs instead of having latent issues that corrupt your data.
My intuition says that's the Halting Problem, so not actually possible to implement perfectly? https://en.wikipedia.org/wiki/Halting_problem
Linus sounds so ignorant in this comment. As if no one else ever thought of writing safety-critical systems in a language that has dynamic errors, and as if dynamic errors inevitably bring the whole system down or turn it into a brick. No way!
Errors don't have to be full-blown exceptions with all that rigamarole, but silently continuing with corruption is utter madness and in 2022 Linus should feel embarrassed for advocating such a backwards view.
[1] This works for Erlang. Not everything needs to be a shared-nothing actor, to be sure, but failing a whole task is about the right granularity to allow reasoning about the system. E.g. a few dozen to a few hundred types of tasks or processes seems about right.
Note that Rust is easier to work with than C here, because although the C-like API isn't shy about panicking where C would segfault, it also inherits enough of the OCaml/Haskell/ML idiom to have non-panic APIs for pretty much any major operation. Calling `saturating_add()` instead of `+` is verbose, but it's feasible in a way that C just isn't unless you go full MISRA.
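To make that concrete (my sketch, not from the comment):

    fn main() {
        let a: u8 = 250;
        // `a + 10` would overflow: that panics in debug builds and wraps
        // in release builds (unless overflow-checks is enabled).
        let clamped = a.saturating_add(10); // 255: clamps, never panics
        let checked = a.checked_add(10);    // None: caller decides what to do
        println!("{clamped} {checked:?}");
    }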
The fact that arbitrary programs are undecidable is a red herring here.
If you want to run an Erlang-style distributed system in the kernel, that's an interesting research project, but it isn't where Linux is today. You'd be better off starting with seL4 or Fuchsia.
- Remember when Linux had that bug that caused the kernel to partially crash and eat 100% CPU, due to some bug in the code applying the leap second? That caused a >1 MW spike in power usage at Hetzner at the time; it must have been >1 GW globally. Many people didn't notice it immediately, so it must have taken weeks before everyone rebooted.
- I’ve personally run into issues where not crashing caused Linux to go on and eat my file system.
On any Linux server I maintain, I always toggle those sysctls that cause the kernel to panic on oops, and reboot on panic.
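For reference, those are the kernel.panic* sysctls; a typical setup looks like this (the values are my example, pick your own):

    # /etc/sysctl.d/99-panic.conf
    kernel.panic_on_oops = 1   # escalate any oops to a full panic
    kernel.panic = 10          # reboot 10 seconds after a panic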
What makes you say this? From the sample I've seen, Rust programs are far more diligent about handling errors (not panicking: either returning the error or handling it explicitly) than C or Go programs, due to the nature of wrapped types like Option<T> and Result<T, E>. You can't escape handling the error, and potential panics are easy to spot in the code and lint against with clippy.
That's different from solving the halting problem. You're not trying to prove that it halts; you're just trying to prove that it doesn't halt in a specific way, which is trivial to prove if you first make that way impossible.
    if false {
        panic!()
    }
Basically you'd prohibit any call to panic, whether it may actually end up running or not. But determining that a function (such as panic) is never called, because there are no calls to it, is pretty easy.
Note that we can only check for "maybe panics", because in general we don't know whether some code in the middle will somehow execute forever and never reach the panic call after it.
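For what it's worth (my addition, not mentioned upthread), dtolnay's `no-panic` crate approximates exactly this check by turning a potentially reachable panic into a link-time error. It leans on the optimizer, so it can false-positive in unoptimized builds:

    use no_panic::no_panic;

    // Fails to *link* if the optimizer can't prove this never panics.
    // The non-panicking fallback is what makes it provable here.
    #[no_panic]
    fn first(v: &[u8]) -> u8 {
        *v.get(0).unwrap_or(&0)
    }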
TFA is about making it possible for the kernel to decide what to do, rather than exploding on the spot, which is terrible.
Where do you unwind to if memory is corrupted?
I don't think we're talking about what would be exception handling in other languages. I believe it's asserts. How do userland processes handle a failed assertion? Usually the process is terminated, but giving a debugger the possibility to examine the state first, or dumping core.
And that's similar to what they are doing in the kernel. Only in that in the kernel, it's more dangerous because there is limited process / task isolation. I think that is an argument that taking down "full-blown separate processes" might not even be enough in the kernel.
Regardless, I think just killing the task instantly, even with partial updates to memory, would be totally fine. It'd be cheap, as automatically undoing the updates (effectively a transaction rollback) is still too expensive. Software transactional memory just comes with too much overhead.
I vote "kill and unwind" and then dealing with partial updates has to be left to a higher level.
Working out whether it will write 1 to the tape in general is undecidable, but in certain cases (you've just banned states that write 1) it's trivial.
If all of the state transitions are valid (a transition to a non-existing state is a halt) then the machine can't get into a state that will transition into a halt, so it can't halt. That's a small fraction of all the machines that won't halt, but it's easy to tell when you have one of this kind by looking at the state machine.
Linux isn’t a microkernel. If you want to work on a microkernel, go work on Fuchsia. It’s interesting research but utterly irrelevant to the point at hand.
Anyway, the microkernel discussion has been going on for three decades now. Historically they didn't just have slightly lower performance; they had garbage performance, to the point of being unsuitable in the 90s.
Plenty of kernel code can't be written so as to be unwindable. That's the issue at hand. In a fantasy world it might have been written that way, but that's not the world we live in, which is what matters to Linus.
Although I think this could be better done by a special panic handler that performs a setjmp/longjmp and notifies other systems about the failure, instead of continuing to run with the wrong output...
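A minimal sketch of that shape in a no_std crate, assuming the notification machinery exists elsewhere (the sjlj part is elided):

    #![no_std]

    use core::panic::PanicInfo;

    #[panic_handler]
    fn on_panic(info: &PanicInfo) -> ! {
        // Hypothetical: report `info` out of band here
        // (watchdog, log buffer, whatever the system provides).
        let _ = info;
        loop {} // park instead of continuing with wrong output
    }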
As you said, you have the option to reboot on panic, but Linus is absolutely not wrong that one size does not fit all.
What about a medical procedure that WILL kill the patient if interrupted? What about life support in space? Hitting an assert in those kinds of systems is a very bad place to be, but an automatic halt is worse than at least giving the people involved a CHANCE to try and get to a state where it's safe to take the system offline and restart it.
https://lkml.org/lkml/2022/9/19/640
Get at least down to here:
https://lkml.org/lkml/2022/9/20/1342
What Linus seems to be getting at is that there are many varying contextual restrictions on what code can do in different parts of the kernel, that Filho et al. appear to be attempting to hide that complexity using language features, and that in his opinion it is not workable to fit kernel APIs into Rust's common definition of a single kind of "safe" code. All of this makes sense: in userland you don't normally have to care about things like whether different functional units are turned on or off, or how interrupts are set up, but in the kernel you do. I'm not sure whether Rust's macros and type system will allow solving the problem as Linus frames it, but it seems like a worthy target and will be interesting to watch.
As the sibling comment pointed out, if you extend this idea to clean up all state, you end up with processes.
I do have some doubt about the no-panic rule. But instead of emulating processes in the kernel, I'd rather see a firmware-like subsystem whose only job is to export core dumps from the local system, after which the kernel is free to panic.
As a general point and in my view, and I agree this is an appeal to authority, Linus has this uncanny ability to find compromises between practicality and theory that result in successful real world software. He’s not always right but he’s almost never completely wrong.
Not quite, because stack overflows can cause panics independent of any actual invocation of the panic macro.
You need to either change how stack overflows are handled as well, or you need to do some static analysis of the stack size as well.
Both are possible (while keeping Rust Turing-complete), so it's still not like the halting problem.
It's not like there's not exceptions in Rust though. The error handling is thorough to a fault when it's used. Unwrap is just a shortcut to say "I know there might be bad input, I don't want to handle it right now, just let me do it and I'll accept the panic."
Rust’s “safety” has always meant what the Rust team meant by that term. There’s no gotcha to be found here except if you can find some way that Rust violates its own definition of the S-word.
This submission is not really about safety. It’s a perfectly legitimate concern that Rust likes to panic and that panicking is inappropriate for Linux. That isn’t about safety per se.
"Safety" is a very technical term in the PL context and you just end up endlessly bickering if you try to torture the term into certain applications. Is it safer to crash immediately or to continue the program in a corrupted state? That entirely depends on the application and the domain, so it isn't a useful distinction to make in this context.
EDIT: The best argument one could make from this continue-can-be-safer perspective is that given two PLs, the one that lets you abstract over this decision (to panic or to continue in a corrupted state, preferably with some out of band error reporting) is safer. And maybe C is safer than Rust in that regard (I wouldn’t know).
Kinda a strawman there. That's got to account for, what, 0.0001% of all computer use, and they would probably never use Linux for those applications anyway (I know medical devices DO NOT use Linux).
Yes, a kernel panic will cause disruption when it happens. But it will also give a precise error location, which makes reporting and fixing the root cause easier. It could be harder to pinpoint if the code rolled forward in a broken state.
It will cause loss of unsaved data when it happens, but OTOH it will prevent corrupted data from being persisted.
It's up to you to choose the right failure strategy and monitor your system if you don't want to panic, and take appropriate measures and not just ignore the warning.
It's not Linus who sounds ignorant here, it's the people applying user-space "best practices" to the kernel. If the kernel panics, the system is dead and you've lost the opportunity to diagnose the problem, which may be non-deterministic and hard to trigger on purpose.
> including e.g. the monitor attached to the PC used for displaying
> X-ray images
Somewhat off-topic, but I used to work in a dental office. The monitors used for displaying X-rays were just normal monitors, bought from Amazon or Newegg or whatever big-box store had a clearance sale. Even the X-ray sensors themselves were (IIRC) not regulated devices; you could buy one right now on AliExpress if you wanted to.

But wouldn't reading outside an array's bounds also possibly do that? It could segfault, which is essentially the same thing.
Is it that reading out of bounds on an array isn't guaranteed to crash everything while a panic always will?
Or in an actual vehicle, the "emergency stop" (if that means just stomping on the brakes) can flip the car and kill its passengers.
    fn foo<T>() -> Option<T> {
        // Oops, something went wrong and we don't have a T.
        None
    }

    fn bar<T>() -> T {
        if let Some(t) = foo() {
            t
        } else {
            // This could've been an `unwrap`; just being explicit here
            panic!("oh no!");
        }
    }
A panic in this case is exactly like an exception, in that the function that's failing doesn't need to come up with a return value. Unwinding happens instead of returning anything. But if I were writing `bar` and trying to follow a policy like "never unwind, always return something", I'd be in a real pickle, because the way the underlying `foo` function is designed, there aren't any T's sitting around for me to return. Should I conjure one out of thin air / uninitialized memory? What does the kernel do in situations like this? I guess the ideal solution is making `bar` return `Option<T>` instead of `T`, but I don't imagine that's always possible?

The ideas of Rust weren't new when Rust was developed. What was new was the actual integration into a programming language beyond experimental status, and the combination with ML-style functional programming.
I think you must have missed out on how Linux currently handles these situations. It does not silently move on past the error; it prints a stack trace and CPU state to the kernel log before moving on. So you have all of the information you'd get from a full kernel panic, plus the benefit of a system that may be able to keep running long enough to save that kernel log to disk.
If you look at how POSIX does it, pretty much every single function has error codes, signaling everything from lost connections, to running out of memory, entropy or whatnot. Failures are hard to abstract away. Unless you have some real viable fallback to use, you're going to have to tell the user that something went wrong and leave it up to them to decide what the application can best do in this case.
So in your case, I would return a Result<T, E> and encode the errors in that. Simply expose the problem to the caller.
What's funny about this is that (while it's true!) it's exactly the argument that Rustaceans tend to reject out of hand when the subject is hardening C code with analysis tools (or instrumentation gadgets like ASAN/MSAN/fuzzing, which get a lot of the same bile).
In fact when used well, my feeling is that extra-language tooling has largely eliminated the practical safety/correctness advantages of a statically-checked language like rust, or frankly even managed runtimes like .NET or Go. C code today lives in a very different world than it did even a decade ago.
1. Have a constraint on T that lets you return some sort of placeholder. For example, if you've got an array of u8, maybe every read past the end of the array returns 0.
    fn bar<T: Default>() -> T {
        if let Some(t) = foo() {
            t
        } else {
            eprintln!("programmer error, foo() returned None!");
            Default::default()
        }
    }
2. Return an `Option<T>` from bar, as you describe.

3. Return a `Result<T, BarError>`, where `BarError` is a struct or enum describing possible error conditions:
    #[non_exhaustive]
    enum BarError {
        FooIsNone,
    }

    fn bar<T>() -> Result<T, BarError> {
        if let Some(t) = foo() {
            Ok(t)
        } else {
            eprintln!("programmer error, foo() returned None!");
            Err(BarError::FooIsNone)
        }
    }

Analysis of panic-safety in Rust is comparatively easy. The set of standard library calls that can panic is finite, so if your tool just walks every call graph you can figure out whether panic is disproven or not.
Getting logs out is critical.
- Only useful when actually being used, which is never the case. (Seriously, can we make at least ASAN the default?)
- Often costly to always turn them on (e.g. MSAN).
- Often requires restructuring or redesign to get the most out of them (especially fuzzing).
Rust's memory safety guarantee does not suffer from first two points, and the third point is largely amortized into the language learning cost.
But there is no need to let userspace processes continue to run, which is exactly what Linux does.
I would like to learn otherwise, but even a React JS+HTML page is undecidable... its scope is limited by Chrome's V8 JS engine (like a VM), but within that scope I don't think you can prove anything more. Otherwise we could just write a static analysis to check whether it will leak passwords...
But day-to-day programs are not trivial... as for your example, just replace it with this code: `print(gcd(user_input--, 137))`... now it's much harder to "just ban some final states".
Depending on the semantic property to check for, writing such an algorithm isn't trivial. But the Rust compiler, for example, does it for memory safety, for the subset of valid Rust programs that don't use unsafe.
The only sure way I can think of is forcing your program to go through a narrower, non-Turing-complete algorithm, like sending data through a network after serialization, where we could limit the deserialization process to be non-Turing-complete (JSON, YAML?).
Same for code that uses a non-Turing-complete API, like memory allocation in a dedicated per-process space, or Rust's borrow mechanics that the compiler enforces.
But my point is that everyday programs are "arbitrary programs" and not a red herring. Surely so from the kernel's perspective, which is Linus's point, IMO.
That works for some systems: those for which "some NVRAM or something" evaluates to a real device usable for that purpose. Not all Linux systems provide such a device.
> But there is no need to let userspace processes continue to run, which is exactly what Linux does.
Userspace processes usually contain state that the user would also like to be persisted before rebooting. If my WiFi driver crashes, there's nothing helpful or safer about immediately bringing down the whole system when it's possible to keep running with everything but networking still functioning.
Regarding the second question, in the general case you have to guess or think hard, and proceed by trial and error. You notice that the analyzer takes more time than you’re willing to wait, so you stop it and try to change your program in order to fix that problem.
We already have that situation today, because the Rust type system is Turing-complete. Meaning, the Rust compiler may in principle need an infinite amount of time to type-check a program. Normally the types used in actual programs don't trigger that situation (and the compiler may also first run out of memory).
By the way, even if Rust's type system weren't Turing-complete, the kind of type inference it uses takes exponential time, which in practice is almost the same as the possibility of non-halting cases, because you can't afford to wait a hundred or more years for your program to finish compiling.
> But my point is, everyday program are "arbitrary program"
No, most programs we write are from a very limited subset of all possible programs. This is because we already reason in our heads about the validity and suitability of our programs.
The differences are that they are actually meant to be used for exceptional situations ("assert violated => there's a bug in this program", or "out of memory, catastrophic runtime situation") and that they are not typed (rather, the panic holds a type-erased payload).
Other than that, it performs unwinding without UB, and is catchable[0]. I'm not seeing the technical difference?
[0]: https://doc.rust-lang.org/std/panic/fn.catch_unwind.html
Oh? How do you do that? Do you have a written guide handy? Very curious about this.
It's even worse on things like car dashboards: some warning lights on dashboards need to be ASIL-D conformant, which is quite strict. However, developing the whole dashboard software stack to that standard is too expensive. So the common solution these days is to have a safe, ASIL-D compliant compositor and a small renderer for the warning lights section of the display while the rendering for all the flashy graphics runs in an isolated VM on standard software with lower safety requirements. It's all done on the same CPU and GPU.
> You notice that the analyzer takes more time than you’re willing to wait,
I see, thanks, didn't know about this feedback loop as I'm not a rust programmer. Still on my todo list to learn.
The most obvious is mutable references. In Rust there can be either one mutable reference to an object or any number of immutable references. So if we're thinking about this value here, V, and we've got an immutable reference &V so that we can examine V... well, it's not changing: there are no mutable references to it, by definition. The Rust language won't let us have &mut V, the mutable reference, at the same time &V exists, and so it needn't ever account for that possibility†.
In C and C++ they break this rule all the time. It's really convenient, and it's not forbidden in C or C++ so why not. Well, now the analysis you wanted to do is incredibly difficult, so good luck with that.
† This also has drastic implications for an optimiser. Rust's optimiser can often trivially conclude that `a = f(b);` can't change `b`, where a C or C++ optimiser is obliged to admit that actually it's not sure, and must emit slower code in case `b` is just an alias for `a`.
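A two-line illustration of the rule (mine, not from the comment):

    fn main() {
        let mut v = 1;
        let r = &v;          // shared borrow of v
        // v += 1;           // ERROR: cannot assign to `v` while it is borrowed
        // let m = &mut v;   // ERROR: cannot borrow `v` as mutable either
        println!("{r}");     // r is provably unchanged since it was taken
    }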
There have been various examples of WiFi driver bugs leading to security issues. Didn’t some Broadcom WiFi driver once have a bug in how it processed non-ASCII SSID names, allowing you to trigger remote code execution?
Mind you, that experience also severely soured me on the quality of medical software systems, due to poor quality of the software that ran in that distribution. Linux itself was a golden god in comparison to the crap that was layered on top of it.
Nevertheless, a long-lived application such as a webserver will catch panics coming from its subtasks (e.g. its request handlers) via catch_unwind
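Something like this (a toy sketch; real servers built on e.g. tokio isolate tasks in a similar spirit):

    use std::panic;

    fn handle_request() {
        panic!("bug in this particular handler");
    }

    fn main() {
        // One bad request doesn't take down the whole server.
        if panic::catch_unwind(handle_request).is_err() {
            eprintln!("handler panicked; dropping connection, server lives on");
        }
    }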
Let's not be too pedantic. You, as an experienced medical device engineer, probably knew what I meant was that they would never use Linux in the critical parts of a medical device, as the OP had originally argued. Any device would definitely be able to provide all of its functionality without the part that has Linux on it.
The OP was still a major strawman, regardless of my arguments, because the Linux kernel will never be in the critical path of a medical device without a TON of work to harden it from errors and such. Just the fact that Linus' stance is as said would mean that it's not an appropriate kernel for a medical device, because they should always fail with an error and stop under unknown conditions rather than just doing some random crap.
This is false. "Safety" and "Liveness" are terms used by the PL field to describe precise properties of programs and they have been used this way for like 50 years (https://en.wikipedia.org/wiki/Safety_and_liveness_properties). A "safety" property describes a guarantee that a program will never reach some form of unwanted state. A "liveness" property describes a guarantee that a program will eventually reach some form of wanted state. These terms would be described very early in a PL course.
And that's pretty easy to statically analyze.
The point is that you can produce a perfectly working analysis method that is either sound or complete but not both. "Nowhere in the entire program does the call 'panic()' appear" is a perfectly workable analysis - it just has false positives.
(input state, input symbol) --> (output state, output symbol, move left/right)
This is all static, so you can look at the transition table to see all the possible output symbols. If no transition has output symbol 1, then it never outputs 1. It doesn't matter how big the Turing machine is or what input it gets, it won't do it. This is basically trivial, but it's still a type of very simple static analysis that you can do. Similarly, if you don't have any states that halt, the machine will never halt.

This is like just not linking panic() into the program: it isn't going to be able to call it, no matter what else is in there.
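As a toy version of the transition-table check described above (my own sketch), the whole analysis is one pass over the table:

    #[derive(PartialEq)]
    enum Sym { Zero, One }

    // One row of the table; next-state and head move don't matter
    // for this particular property, so they're elided.
    struct Rule { write: Sym }

    // Sound but conservative: if no rule writes One, the machine
    // provably never outputs One, on any input.
    fn never_writes_one(table: &[Rule]) -> bool {
        table.iter().all(|r| r.write != Sym::One)
    }

    fn main() {
        let table = [Rule { write: Sym::Zero }, Rule { write: Sym::Zero }];
        assert!(never_writes_one(&table));
    }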
> In the kernel, "panic and stop" is not an option
That's simply not true. It's an option I've seen exercised many times, even in default configurations. Furthermore, for some domains - e.g. storage - it's the only sane option. Continuing when the world is clearly crazy risks losing or corrupting data, and that's far worse than a crash. No, it's not weird to think all types of computation are ephemeral or less important than preserving the integrity of data. Especially in a distributed context, where this machine might be one of thousands which can cover for a transient loss of one component but letting it continue to run puts everything at risk, rebooting is clearly the better option. A system that can't survive such a reboot is broken. See also: Erlang OTP, Recovery Oriented Computing @ Berkeley.
Linus is right overall, but that particular argument is a very bad one. There are systems where "panic and stop" is not an option and there are systems where it's the only option.
The proper answer to those is redundancy, not continuing in an unknown and quite likely harmful state.
Can you elaborate on this? Because failing storage is a common occurrence that usually does not warrant immediately crashing the whole OS, unless it's the root filesystem that becomes inaccessible.
For context, this is OP's sentence that I responded to in particular. Ensuring safety [1] is way less trivial than looking for a call to "panic" in the state machine. You can remove the calls to "panic" and this alone does not make your program safer than the equivalent C code. It just makes it more kernel friendly.
[1] not only memory safety
"Performance" is a red herring. In a safety-critical system, what matters is the behaviour and the consistency. ThreadX provides timing guarantees which Linux can not, and all of the system threads are executed in strict priority order. It works extremely well, and the result is a system for which one can can understand the behaviour exactly, which is important for validating that it is functioning correctly. Simplicity equates to reliability. It doesn't matter if it's "slow" so long as it's consistently slow. If it meets the product requirements, then it's fine. And when you do the board design, you'll pick a part appropriate to the task at hand to meet the timing requirements.
Anyway, systems like ThreadX provide safety guarantees that Linux will never be able to provide. But the interface is not POSIX, and for dedicated applications that's OK. It's not a general-purpose OS, and that's OK too. There are good reasons not to use complex general-purpose kernels in safety-critical systems.
IEC 62304 and ISO 13485 are serious standards for serious applications, where faults can be life-critical. You wouldn't use Linux in this context. No matter how much we might like Linux, you wouldn't entrust your life to it, would you? Anyone who answered "yes" to that rhetorical question should not be trusted with writing safety-critical applications. Linux is too big and complex to fully understand and reason about, and as a result impossible to validate properly in good faith. You might use it in an ancillary system in a non-safety-critical context, but you wouldn't use it anywhere where safety really mattered. IEC 62304 is all about hazards and risks, and risk mitigation. You can't mitigate risks you can't fully reason about, and any given release of Linux has hundreds of silly bugs in it on top of very complex behaviours we can't fully understand either even if they are correct.
Especially in a distributed storage system using erasure codes etc., losing one machine means absolutely nothing even if it's permanent. On the last storage project I worked on, we routinely ran with 1-5% of machines down, whether it was due to failures or various kinds of maintenance actions, and all it meant was a loss of some capacity/performance. It's what the system was designed for. Leaving a faulty machine running, OTOH, could have led to a Byzantine failure mode corrupting all shards for a block and thus losing its contents forever.
BTW, in that sort of context - where most bytes in the world are held - the root filesystem is more expendable than any other. It's just part of the access system, much like firmware, and re-imaging or even hardware replacement doesn't affect the real persistence layer. It's user data that must be king, and those media whose contents must be treated with the utmost care.
For clarification, I responded to this in particular because "safety" is being conflated with "panicking" (bad for kernel). I reckoned "Unexpected conditions" means "arbitrary programs", hence my response, otherwise you could just remove the call to panic.
For better or worse, Linux is NOT a microkernel. Therefore, the sound microkernel wisdom is not applicable to Linux in its present form. The "impedance match" of any new language added to the Linux kernel is driven by what current kernel code in C is doing. This is essentially a Linux kernel limitation. If Rust cannot adapt to these requirements, it is a mismatch for Linux kernel development. For other kernels like Fuchsia, Rust is a good fit. BTW, the core Fuchsia kernel itself is still in C++.
Even in a distributed, fault-tolerant multi-node system, it seems like it would be useful for the kernel to keep running long enough for userspace to notify other systems of the failure (eg. return errors to clients with pending requests so they don't have to wait for a timeout to try retrieving data from a different node) or at least send logs to where ever you're aggregating them.
Neither QNX nor ThreadX is intended to be a general-purpose kernel. I haven't looked into it for a long time, but QNX's performance used to not be very good. It's small, it can boot fast, and it gives you guarantees regarding response times: everything you want from an RTOS in a safety-critical environment. It's not very fast, however, which is why it never tried to move toward the general market.
Since the mechanisms for ensuring the orderly stoppage of all such activity system-wide are themselves complicated and possibly error-prone, and more importantly not present in a commodity OS such as Linux, the safe option is "opt in" rather than "opt out". In other words, don't try to say you must stop X and Y and Z ad infinitum. Instead say you may only do A and B and nothing else. That can easily be accomplished with a panic, where certain parts such as dmesg are specifically enabled between the panic() call and the final halt instruction. Making that window bigger, e.g. to return errors to clients who don't really need them, only creates further potential for destructive activity to occur, and IMO is best avoided.
Note that this is a fundamental difference between a user (compute-centric) view of software and a systems/infra view. It's actually the point Linus was trying to get across, even if he picked a horrible example. What's arguably better in one domain might be professional malfeasance in the other. Given the many ways Linux is used, saying that "stopping is not an option" is silly, and "continuing is not an option" would be equally so. My point is not that what's true for my domain must be true for others, but that both really are and must remain options.
P.S. No, stopping userspace is not stopping everything, and not what I was talking about. Or what you were talking about until the narrowing became convenient. Your reply is a non sequitur. Also, I can see from other comments that you already agree with points I have made from the start - e.g. that both must remain options, that the choice depends on the system as a whole. Why badger so much, then? Why equivocate on the importance (or even meaningful difference) between kernel vs. userspace? Heightening conflict for its own sake isn't what this site is supposed to be about.
I don't suppose monitors report calibration data back to display adapters do they?
We're talking specifically about the current meaning of a Linux kernel panic. That means an immediate halt to all of userspace.
https://smile.amazon.com/EVanlak-Passthrough-Generrtion-Elim...
In the context of Rust, there are a number of safety properties that Rust guarantees (modulo unsafe, FFI UB, etc.), but that set of safety properties is specific to Rust and not universal. For example, Java has a different set of safety properties, e.g. its memory model gives stronger guarantees than Rust’s.
Therefore, the meaning of “language X is safe” is entirely dependent on the specific language, and can only be understood by explicitly specifying its safety properties.
Small rant: ARM Cortex processors overwrite the stack pointer on reset. That's very, very dumb, because after the watchdog trips you have no idea what the code was doing. Which means you can't report what the code was doing when that happened.
You're not wrong but you chose a hilarious example. Unwrap's entire purpose is to turn unhandled errors into panics!
Array indexing, arithmetic (with overflow-checks enabled), and slicing are examples where it's not so obvious there be panic dragons. Library code does sometimes panic in cases of truly unrecoverable errors also.
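Concretely (my own examples): the panicking forms never mention panic in the source, and each has an Option-returning sibling:

    fn main() {
        let xs = [10u8, 20, 30];
        let i = 5;
        // These would panic, with no `panic!` visible anywhere:
        // let a = xs[i];       // index out of bounds
        // let b = &xs[1..i];   // slice range out of bounds
        // Non-panicking siblings:
        let a = xs.get(i);      // Option<&u8>
        let b = xs.get(1..i);   // Option<&[u8]>
        println!("{a:?} {b:?}");
    }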
Like “memory safety”?
Rust could do the external tooling better than any other language out there, but they're so focused on the _language_ preventing abuse that they've largely missed the boat.
Almost all discussion about Rust is in comparison to C and C++, by far the dominant languages for developing native applications. C and C++ are famously neither type-safe nor memory-safe and it becomes a pretty easy shorthand in discussions of Rust for "safety" to refer to these properties.
That's not the right way to characterize this. Rust has unsafe for code that is correct but whose correctness the compiler is unable to verify. Foreign memory access (or hardware MMIO) and cyclic data structures are the big ones, and those are well-specified, provable, verifiable regimes. They just don't fit within the borrow checker's world view.
Which is something I think a lot of Rust folks tend to gloss over: even at its best, most maximalist interpretation, Rust can only verify "correctness" along axes it understands, and those aren't really that big a part of the problem area in practice.
At the end of the day, what Linux does is what Linus wants out of it. He's stated, often, that halting the CPU at the exact moment something goes wrong is not the goal. If your goal is to do that, you might not be able to use Linux. If your goal is to put Rust in the Linux kernel, you might have to let go of your goal.
But ok, uninformed me would have guessed checking for that would be pretty straightforward in statically typed Rust. Is that something people want? Why isn't there a built-in mechanism to do it?
And I thought it was clear that a kernel panic is different from a Rust panic, which you don't seem to distinguish. A Rust panic doesn't need to cause a kernel panic because it can be caught earlier.
Edit: I should also add (probably earlier too) that all my examples are specific to the USA FDA process. I'm sure some other place might not have the same rules.
I hope it's rare, but I think a persistent nag window ("Your display isn't calibrated and may not be accurate") is probably a better answer than refusing to work altogether, because it will be clear about the source of the problem and less likely to get nailed down.
This reminds me of two things. Good system design needs hardware-software codesign; Oxide Computer has identified this, but it was probably much more common before the '90s than after. The second thing is that anything can fail, so a strategy that only hardens one component is fundamentally limited, even flawed. If the component must not fail, you need redundancy and supervision. Joe Armstrong would be my source for a quote if I needed to find one.
Both Rust and Linux have some potential for improvement here, but the best answers may lie in their relation to the greater system rather than within themselves. I'm thinking of WASM and hardware codesign, respectively.
Allow me to introduce you to Therac-25: https://en.wikipedia.org/wiki/Therac-25
I'm mostly familiar with EU rules, but as far as I know the FDA regulations follow the same idea of tiered requirements based on potential harm done.
"If you want to allocate memory, and you don't want to care about what context you are in, or whether you are holding spinlocks etc, then you damn well shouldn't be doing kernel programming. Not in C, and not in Rust.
It really is that simple. Contexts like this ("I am in a critical region, I must not do memory allocation or use sleeping locks") is fundamental to kernel programming. It has nothing to do with the language, and everything to do with the problem space."
Rust proponents mean exactly "memory safety" when they say Rust is safe, because that is the only safety Rust guarantees.
https://www.fda.gov/medical-devices/human-factors-and-medica...
Honestly, the FDA regulations go too far vs the EU regs. The company I worked for was based in the EU, and the products there were so advanced compared to our versions. Ours were all based on an original design from Europe that was approved and then basically didn't change for 30 years. The European device was fucking cool and had so many features; it was also capable of being carried around rather than rolled. The manufacturing was almost all automated, too, but in the USA it was not automated at all: it was humans assembling parts and then recording it in a computer terminal.
The first priority is safety, absolutely and without question. And then the immediate second priority is the fact that time is money. For every minute that the system is not operating, x amount of product is not being produced.
Generally, having the software fully halt on error is both dangerous and time-consuming.
Instead you want to switch to an ERROR and/or EMERGENCY_STOP state, where things like lasers or plasma torches get turned off, motors are stopped, brakes are applied, doors get locked/unlocked (as appropriate/safe), etc. And then you want to report that to the user, and give them tools to diagnose and correct the source of the error and to restart the machine/line [safely!] as quickly as possible.
In short, error handling and recovery is its own entire thing, and tends to be something that gets tested for separately during commissioning.
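A minimal sketch of that pattern (all names here are hypothetical stand-ins for real hardware I/O, not from the comment above):

    enum MachineState {
        Running,
        EmergencyStop { reason: &'static str },
    }

    fn on_fault(state: &mut MachineState, reason: &'static str) {
        // De-energize the dangerous parts first...
        torch_off();       // hypothetical hardware I/O
        apply_brakes();    // hypothetical hardware I/O
        // ...then surface the error and keep running so the operator
        // can diagnose, correct, and restart safely.
        *state = MachineState::EmergencyStop { reason };
    }

    fn torch_off() {}
    fn apply_brakes() {}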
[1] PLC's do have the ability to <not stop> and execute code in a real time manner, but I haven't encountered a lot of PLC programmers who actually exploit these abilities effectively. Basically for more complex situations you're quickly going to be better off with more general purpose tools [2], at most handing off critical tasks to PLCs, micro-controllers, or motor controllers etc.
[2] except for that stupid propensity to give-up-and-halt at exactly that moment where it'll cause the most damage.
It wasn't my example. It was mike_hock's, and I was responding in the context they had set.
> Most Linuxes aren't like that.
Your ally picked the medical-device and space-life-support examples. If you think they're invalid because such systems don't use Linux, why did you forego bringing it up with them and then change course when replying to me? As I said: not helpful.
The point is not specific to Linux, and more Linux systems than you seem to be aware of do adopt the "crash before doing more damage" approach because they have some redundancy. If you're truly interested, I had another whole thread in this discussion explaining one class of such cases in what I feel was a reasonably informative and respectful way while another bad-faith interlocutor threw out little more than one-liners.
Felt kinda bad until I thought about how well a "Linux literally killed me" headline would do on HN, but then I realized I wouldn't be able to post the article if I actually died. Such is life. Or death? One or the other.
And continuing on the parent's comment: Rust can only make its memory guarantees by restricting the set of programs you can express, while static analysis for C and the like has to work on the whole set, which is simply an undecidable problem. As soon as unsafe is in the picture, it becomes undecidable in Rust as well, in general.
If we are talking about products like PC-lint, SonarQube, or Coverity, the experience is much more than that.