zlacker

[parent] [thread] 11 comments
1. jmilli+(OP)[view] [source] 2022-10-02 15:58:38
I think Linus's response would be that those failable tasks are called "processes", and the low-level supervisor that starts + monitors them is the kernel. If you have code that might fail and restart, it belongs in userspace.

If you want to run an Erlang-style distributed system in the kernel then that's an interesting research project, but it isn't where Linux is today. You'd be better off starting with seL4 or Fuchsia.
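The "failable task + userspace supervisor" model described above can be sketched in plain C: the kernel's process abstraction provides the isolation, and an ordinary fork()/waitpid() loop provides Erlang-style restarts. This is a minimal illustration, not any real supervisor API; `run_supervised` and `max_restarts` are made-up names.

```c
/* Sketch: a failable task runs as a child process; the parent supervises
 * it and restarts it on failure. run_supervised() is illustrative only. */
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Run task() in a child process; restart it until it exits cleanly or
 * we give up. Returns the number of attempts made, or -1 on fork failure. */
int run_supervised(int (*task)(int attempt), int max_restarts)
{
    for (int attempt = 1; attempt <= max_restarts; attempt++) {
        pid_t pid = fork();
        if (pid < 0)
            return -1;
        if (pid == 0)
            _exit(task(attempt));     /* child: run the failable task */

        int status = 0;
        waitpid(pid, &status, 0);     /* parent: supervise */
        if (WIFEXITED(status) && WEXITSTATUS(status) == 0)
            return attempt;           /* task succeeded */
        /* crashed or failed: fall through and restart */
    }
    return max_restarts;
}

/* A task that fails on its first two attempts, then succeeds. */
static int flaky_task(int attempt)
{
    return attempt < 3 ? 1 : 0;
}
```

Because each attempt is a fresh process, a crashed attempt cannot corrupt the supervisor's state, which is exactly the isolation argument being made.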

replies(1): >>titzer+C
◧
2. titzer+C[view] [source] 2022-10-02 16:01:48
>>jmilli+(OP)
40 years of microkernels, which I know Linus is aware of, beg to differ. Maybe it's Linus's extreme opposition to microkernels, ostensibly because they have historically had somewhat lower performance--I dunno--but my comment should not be read as "yes, you must have a microkernel". There are cheaper fault isolation mechanisms than full-blown separate processes. Just having basic stack unwinding and failing a task would be a start.
replies(5): >>pca006+42 >>jstimp+L2 >>Wastin+t3 >>geertj+u5 >>lr1970+cK
◧◩
3. pca006+42[view] [source] [discussion] 2022-10-02 16:09:33
>>titzer+C
I think it might be a bit more complicated than that, considering you can have static data, and unwinding the stack will not reset that state. I guess you still need some sort of task-level abstraction, and you have to reset all the data for that task when unwinding from it. Btw, do we need stack unwinding or can we just do an sjlj (setjmp/longjmp)?
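The sjlj option and the static-data caveat above can be sketched together in C: longjmp() pops the stack back to a recovery point cheaply, but static data survives the jump, so the task abstraction has to reset it by hand. All names here (`run_task`, `task_scratch`, etc.) are illustrative, not from any real kernel.

```c
/* Sketch: fail a task via setjmp/longjmp and manually reset its static
 * state, since unwinding the stack alone leaves static data stale. */
#include <setjmp.h>
#include <string.h>

static jmp_buf task_recover;
static int task_scratch[16];   /* "static data" the unwind won't touch */

static void fail_task(void)
{
    longjmp(task_recover, 1);  /* pops frames without running any cleanup */
}

static void risky_work(void)
{
    task_scratch[0] = 42;      /* partial update to static state... */
    fail_task();               /* ...then the task fails */
}

/* Returns 0 on success, 1 if the task was killed and its state reset. */
int run_task(void (*body)(void))
{
    if (setjmp(task_recover) != 0) {
        /* unwound to here; static state must be reset explicitly */
        memset(task_scratch, 0, sizeof task_scratch);
        return 1;
    }
    body();
    return 0;
}
```

Note that sjlj, unlike real unwinding, runs no destructors or cleanup handlers on the way out, which is why the explicit reset step is load-bearing here.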
replies(1): >>titzer+Z2
◧◩
4. jstimp+L2[view] [source] [discussion] 2022-10-02 16:13:08
>>titzer+C
How do you unwind if most of your kernel is written in C? (Answering my own question: they are doing stack unwinding, only manually.)

Where do you unwind to if memory is corrupted?

I don't think we're talking about what would be exception handling in other languages. I believe it's asserts. How do userland processes handle a failed assertion? Usually the process is terminated, though with the possibility of letting a debugger examine the state first, or of dumping core.

And that's similar to what they are doing in the kernel. Only in the kernel it's more dangerous, because there is limited process/task isolation. I think that is an argument that taking down "full-blown separate processes" might not even be enough in the kernel.
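What a failed assert() looks like from outside in userland, as described above, can be demonstrated directly: the process raises SIGABRT, and a supervising process sees a signal termination (and, if enabled, a core dump) rather than a clean exit. The `observe_failed_assert` helper is made up for this demo.

```c
/* Sketch: fork a child that fails an assertion and observe, from the
 * parent, that it was killed by SIGABRT rather than exiting normally. */
#include <assert.h>
#include <signal.h>
#include <sys/resource.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Returns the signal that killed the child, 0 if it exited normally,
 * or -1 on fork failure. */
int observe_failed_assert(void)
{
    pid_t pid = fork();
    if (pid < 0)
        return -1;
    if (pid == 0) {
        struct rlimit no_core = { 0, 0 };
        setrlimit(RLIMIT_CORE, &no_core);  /* demo only: suppress the core file */
        assert(1 == 2);                    /* fails: prints a message, raises SIGABRT */
        _exit(0);                          /* never reached */
    }
    int status = 0;
    waitpid(pid, &status, 0);
    return WIFSIGNALED(status) ? WTERMSIG(status) : 0;
}
```

The kernel analogue of this pattern is exactly what's missing from the comparison: there is no outer process boundary to absorb the abort.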

◧◩◪
5. titzer+Z2[view] [source] [discussion] 2022-10-02 16:14:37
>>pca006+42
Note I am not going to advocate for try ... catch ... finally, because I think that language mechanism is abused out the wazoo for handling all kinds of things, like IO errors, but this is exactly what try ... finally would be for.

Regardless, I think just killing the task instantly, even with partial updates to memory, would be totally fine. It'd be cheap, whereas automatically undoing the updates (effectively a transaction rollback) would be too expensive. Software transactional memory just comes with too much overhead.

I vote "kill and unwind"; dealing with partial updates then has to be left to a higher level.
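One cheap way the higher level can deal with partial updates, in the spirit of the comment above, is to never make them visible in the first place: the task works on a private staging copy, and the supervisor publishes it only if the task ran to completion. This is a hypothetical sketch (the `run_transactional` name and `counter` type are invented), not a real kernel mechanism, and it deliberately avoids any rollback machinery.

```c
/* Sketch of "kill and unwind, leave partial updates to a higher level":
 * the task mutates a scratch copy; a killed task's half-written state is
 * simply discarded, never rolled back. */
#include <setjmp.h>

static jmp_buf kill_point;

struct counter { int committed; };

static void kill_task(void) { longjmp(kill_point, 1); }

/* Higher level: run the task against a scratch copy; commit on success.
 * Returns 0 on success, -1 if the task was killed. */
int run_transactional(struct counter *live, void (*body)(struct counter *))
{
    struct counter scratch = *live;   /* task sees a private copy */
    if (setjmp(kill_point) != 0)
        return -1;                    /* killed: scratch is discarded */
    body(&scratch);
    *live = scratch;                  /* success: publish the result */
    return 0;
}

static void task_ok(struct counter *c)  { c->committed += 1; }
static void task_bad(struct counter *c) { c->committed += 100; kill_task(); }
```

The discard-on-failure design is what makes the instant kill cheap: no undo log, no STM, just a copy that never gets published.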

◧◩
6. Wastin+t3[view] [source] [discussion] 2022-10-02 16:16:34
>>titzer+C
Sorry it’s hard to take you seriously after that.

Linux isn’t a microkernel. If you want to work on a microkernel, go work on Fuchsia. It’s interesting research but utterly irrelevant to the point at hand.

Anyway, the microkernel discussion has been happening for three decades now. They haven’t historically had a little lower performance. They had garbage performance, to the point of being unsuitable in the 90s.

Plenty of kernel code can't be written so as to be unwindable. That's the issue at hand. In a fantasy world it might have been written as such, but that's not the world we live in, which is what matters to Linus.

replies(3): >>jeffre+2l >>pjmlp+zn >>rleigh+LE
◧◩
7. geertj+u5[view] [source] [discussion] 2022-10-02 16:26:54
>>titzer+C
> Just having basic stack unwinding and failing a task would be a start.

As the sibling comment pointed out, if you extend this idea to clean up all state, you end up with processes.

I do have some doubts about the no-panic rule. But instead of emulating processes in the kernel, I'd rather see a firmware-like subsystem whose only job is to export core dumps from the local system, after which the kernel is free to panic.

As a general point, and in my view (and I agree this is an appeal to authority): Linus has this uncanny ability to find compromises between practicality and theory that result in successful real-world software. He's not always right, but he's almost never completely wrong.

◧◩◪
8. jeffre+2l[view] [source] [discussion] 2022-10-02 17:49:29
>>Wastin+t3
Well, QNX is running on a gazillion devices, even resource-restricted ones, without problems. It can be slower, but it doesn't always have to be. That is far from being a fantasy world.
◧◩◪
9. pjmlp+zn[view] [source] [discussion] 2022-10-02 18:03:03
>>Wastin+t3
QNX and INTEGRITY customers will beg to differ.
◧◩◪
10. rleigh+LE[view] [source] [discussion] 2022-10-02 19:58:14
>>Wastin+t3
Others have mentioned QNX. There is also ThreadX, which is a "picokernel". Both are certified for use in safety-critical domains. There are other options as well: Segger do one, for example, and there's also SafeRTOS, among others.

"Performance" is a red herring. In a safety-critical system, what matters is the behaviour and the consistency. ThreadX provides timing guarantees which Linux can not, and all of the system threads are executed in strict priority order. It works extremely well, and the result is a system for which one can can understand the behaviour exactly, which is important for validating that it is functioning correctly. Simplicity equates to reliability. It doesn't matter if it's "slow" so long as it's consistently slow. If it meets the product requirements, then it's fine. And when you do the board design, you'll pick a part appropriate to the task at hand to meet the timing requirements.

Anyway, systems like ThreadX provide safety guarantees that Linux will never be able to match. But the interface is not POSIX, and for dedicated applications that's OK. It's not a general-purpose OS, and that's OK too. There are good reasons not to use complex general-purpose kernels in safety-critical systems.

IEC 62304 and ISO 13485 are serious standards for serious applications, where faults can be life-critical. You wouldn't use Linux in this context. No matter how much we might like Linux, you wouldn't entrust your life to it, would you? Anyone who answered "yes" to that rhetorical question should not be trusted with writing safety-critical applications. Linux is too big and complex to fully understand and reason about, and as a result impossible to validate properly in good faith. You might use it in an ancillary system in a non-safety-critical context, but you wouldn't use it anywhere where safety really mattered. IEC 62304 is all about hazards and risks, and risk mitigation. You can't mitigate risks you can't fully reason about, and any given release of Linux has hundreds of silly bugs in it, on top of very complex behaviours that we can't fully understand even when they are correct.

replies(1): >>Wastin+WM
◧◩
11. lr1970+cK[view] [source] [discussion] 2022-10-02 20:30:54
>>titzer+C
> 40 years of microkernels, which I know Linus is aware of, beg to differ.

For better or worse, Linux is NOT a microkernel. Therefore, the sound microkernel wisdom is not applicable to Linux in its present form. The "impedance match" of any new language added to the Linux kernel is driven by what the current kernel code in C is doing. This is essentially a Linux kernel limitation. If Rust cannot adapt to these requirements, it is a mismatch for Linux kernel development. For other kernels, like Fuchsia, Rust is a good fit. BTW, the core Fuchsia kernel itself is still in C++.

◧◩◪◨
12. Wastin+WM[view] [source] [discussion] 2022-10-02 20:48:01
>>rleigh+LE
Sorry, I’m a bit lost regarding your comment. The discussion was about code safety in Linux in the context of potentially introducing Rust. I don’t really see the link with microkernels in the context of safety-oriented RTOSes. I think you are reacting to my comment about microkernel performance in the 90s, which I maintain.

Neither QNX nor ThreadX is intended to be a general-purpose kernel. I haven’t looked into it for a long time, but QNX performance used not to be very good. It’s small. It can boot fast. It gives you guarantees regarding return times. Everything you want from an RTOS in a safety-critical environment. It’s not very fast, however, which is why it never tried to move towards the general market.
