A lot of modern userspace code, including Rust code in the standard library, thinks that invariant failures (AKA "programmer errors") should cause some sort of assertion failure or crash (Rust or Go `panic`, C/C++ `assert`, etc). In the kernel, claims Linus, failing loudly is worse than trying to keep going, because a hard failure would also take out the failure-reporting mechanisms.
He advocates for a sort of soft failure, where the code tells you you're entering unknown territory and then goes ahead and does whatever. Maybe it crashes later, maybe it returns the wrong answer, who knows; the only thing it won't do is halt the kernel at the point the error was detected.
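Roughly that pattern, sketched in Rust (the function and its clamping fallback are my illustration, not anything from the actual thread):

// Soft failure: report the violated invariant, then continue with a
// defined-but-possibly-wrong value instead of halting.
fn requested_or_max(requested: usize, max: usize) -> usize {
    if requested > max {
        // In kernel C this would be a WARN_ON-style log.
        eprintln!("invariant violated: requested {requested} > max {max}");
        return max; // keep going; possibly wrong, but not halted
    }
    requested
}

fn main() {
    assert_eq!(requested_or_max(10, 4096), 10);
    assert_eq!(requested_or_max(999_999, 4096), 4096); // warns, then clamps
}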
Think of the following Rust API for an array, which needs to be able to handle the case of a user reading an index outside its bounds:
trait Array<T> {
    fn len(&self) -> usize;

    // If idx >= len, panic.
    fn get_or_panic(&self, idx: usize) -> T;

    // If idx >= len, return None.
    fn get_or_none(&self, idx: usize) -> Option<T>;

    // If idx >= len, print a stack trace and return
    // who knows what.
    unsafe fn get_or_undefined(&self, idx: usize) -> T;
}
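To make the contrast concrete, here's one way those signatures could be filled in, assuming a Vec-backed store; the backtrace print is a stand-in for the kernel's WARN-style reporting, not anything proposed in the thread:

use std::backtrace::Backtrace;

impl<T: Copy> Array<T> for Vec<T> {
    fn len(&self) -> usize {
        Vec::len(self)
    }

    fn get_or_panic(&self, idx: usize) -> T {
        self[idx] // the Index impl panics on out-of-bounds
    }

    fn get_or_none(&self, idx: usize) -> Option<T> {
        self.as_slice().get(idx).copied()
    }

    unsafe fn get_or_undefined(&self, idx: usize) -> T {
        if idx >= Vec::len(self) {
            // Kernel-style soft failure: complain loudly, keep going.
            eprintln!("OOB read at {idx}:\n{}", Backtrace::force_capture());
        }
        // The read happens regardless; out of range, this is undefined behavior.
        unsafe { *self.as_slice().get_unchecked(idx) }
    }
}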
The first two are safe by the Rust definition, because they can't cause memory-unsafe behavior. The second two are safe by the Linus/Linux definition, because they won't cause a kernel panic. If you have to choose between #1 and #3, Linus is putting his foot down and saying that the kernel's answer is #3.

Linus sounds so ignorant in this comment. As if no one else has thought of writing safety-critical systems in a language with dynamic errors, and as if dynamic errors would bring the whole system down or turn it into a brick. No way!
Errors don't have to be full-blown exceptions with all that rigamarole, but silently continuing with corruption is utter madness, and in 2022 Linus should feel embarrassed for advocating such a backwards view.
[1] This works for Erlang. Not everything needs to be a shared-nothing actor, to be sure, but failing a whole task is about the right granularity for reasoning about the system. E.g., a few dozen to a few hundred types of tasks or processes seems about right.
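A userspace analogy of that granularity, in Rust rather than Erlang (thread-per-task is my framing, not the commenter's): a panic kills only the failing task, and a supervisor can observe the failure and restart it.

use std::thread;

fn main() {
    // The failing "task" is isolated in its own thread: the panic
    // unwinds that thread only, and the parent sees it as Err from join().
    let worker = thread::spawn(|| {
        panic!("invariant violated in this task");
    });

    match worker.join() {
        Ok(()) => println!("task finished normally"),
        Err(_) => println!("task died; restart it, the rest of the system lives on"),
    }
}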
If you want to run an Erlang-style distributed system in the kernel then that's an interesting research project, but it isn't where Linux is today. You'd be better off starting with seL4 or Fuchsia.
Linux isn’t a microkernel. If you want to work on a microkernel, go work on Fuchsia. It’s interesting research but utterly irrelevant to the point at hand.
Anyway, the microkernel discussion has been going on for three decades now. It's not that they historically had slightly lower performance; they had garbage performance, to the point of being unsuitable in the 90s.
Plenty of kernel code can't be written so as to be unwindable. That's the issue at hand. In a fantasy world it might have been written that way, but it's not the world we live in, and that's what matters to Linus.
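A userspace shadow of the problem, sketched with std's Mutex (only an analogy; the kernel's raw spinlocks have no poisoning or recovery path at all):

use std::sync::Mutex;

fn main() {
    let m = Mutex::new(0);
    // Simulate a panic while the lock is held (caught here so we can
    // inspect the aftermath; a real unwind wouldn't be this tidy).
    let _ = std::panic::catch_unwind(|| {
        let _guard = m.lock().unwrap();
        panic!("invariant violated while holding the lock");
    });
    // std at least poisons the mutex; kernel code that unwinds past a
    // spinlock with interrupts disabled has no equivalent safety net.
    assert!(m.is_poisoned());
}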
"Performance" is a red herring. In a safety-critical system, what matters is the behaviour and the consistency. ThreadX provides timing guarantees which Linux can not, and all of the system threads are executed in strict priority order. It works extremely well, and the result is a system for which one can can understand the behaviour exactly, which is important for validating that it is functioning correctly. Simplicity equates to reliability. It doesn't matter if it's "slow" so long as it's consistently slow. If it meets the product requirements, then it's fine. And when you do the board design, you'll pick a part appropriate to the task at hand to meet the timing requirements.
Anyway, systems like ThreadX provide safety guarantees that Linux will never be able to match. But the interface is not POSIX, and for dedicated applications that's OK. It's not a general-purpose OS, and that's OK too. There are good reasons not to use complex general-purpose kernels in safety-critical systems.
IEC 62304 and ISO 13485 are serious standards for serious applications, where faults can be life-critical. You wouldn't use Linux in this context. No matter how much we might like Linux, you wouldn't entrust your life to it, would you? Anyone who answered "yes" to that rhetorical question should not be trusted with writing safety-critical applications. Linux is too big and complex to fully understand and reason about, and as a result it is impossible to validate properly in good faith. You might use it in an ancillary system in a non-safety-critical context, but you wouldn't use it anywhere safety really mattered. IEC 62304 is all about hazards, risks, and risk mitigation. You can't mitigate risks you can't fully reason about, and any given release of Linux has hundreds of silly bugs in it, on top of very complex behaviours that we can't fully understand even when they are correct.
Neither QNX nor ThreadX is intended to be a general-purpose kernel. I haven't looked into it for a long time, but QNX's performance used to be not very good. It's small, it can boot fast, and it gives you guarantees about response times: everything you want from an RTOS in a safety-critical environment. It's not very fast, however, which is why it never tried to move towards the general market.