zlacker

[parent] [thread] 17 comments
1. ChrisS+(OP)[view] [source] 2022-10-02 15:26:30
What is better: continuing to "limp along" in some unknown corrupted state (aka undefined behaviour) or in a well defined (albeit invalid) state?
replies(3): >>throw8+55 >>Someon+X8 >>yencab+SZ
2. throw8+55[view] [source] 2022-10-02 15:55:13
>>ChrisS+(OP)
I've run into the same question often on MCUs: limp along in the hope of getting the error reported somehow, because otherwise it won't be noticed unless a JTAG debugger is attached (and by default none is in the field).

So I can understand where Linus is coming from.

replies(2): >>gmueck+oj >>mlindn+Cj
3. Someon+X8[view] [source] 2022-10-02 16:16:05
>>ChrisS+(OP)
This question is answered in Linus' emails, fully and better than I could here.

But to restate briefly, the answer varies wildly between kernel and user programs, because a user program that fails hard on corrupt state is still able to report that failure/bug, whereas a kernel panic is a difficult-to-report problem (and breaks a bunch of automated reporting tooling).

So in answer: Read the discussion.

replies(1): >>ChrisS+Hc
◧◩
4. ChrisS+Hc[view] [source] [discussion] 2022-10-02 16:34:28
>>Someon+X8
You seem to have misunderstood me. The distinction I'm making is not between kernel panic and undefined behaviour. The distinction is between undefined behaviour and defined behaviour. That defined behaviour can be anything, even including "limping on" somehow.
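To make that distinction concrete, here is a minimal Rust sketch. The function names are hypothetical and chosen only for illustration. It uses integer overflow as the example: signed overflow is undefined behaviour in C, whereas Rust's standard integer methods let the caller choose among several fully defined outcomes, including ones that "limp on" with a wrong but predictable value.

```rust
/// Stop: report the overflow to the caller instead of producing a value.
fn add_or_report(a: u8, b: u8) -> Result<u8, String> {
    a.checked_add(b)
        .ok_or_else(|| format!("overflow adding {} and {}", a, b))
}

/// Limp on: the result wraps around. It is wrong, but it is well defined.
fn add_and_limp_on(a: u8, b: u8) -> u8 {
    a.wrapping_add(b)
}

/// Limp on differently: clamp to the largest representable value.
fn add_and_clamp(a: u8, b: u8) -> u8 {
    a.saturating_add(b)
}
```

For the inputs 200 and 56, the first function returns an error, the second returns 0 and the third returns 255. All three outcomes are defined, so the caller can reason about what happens next, which is not possible once undefined behaviour has occurred.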
◧◩
5. gmueck+oj[view] [source] [discussion] 2022-10-02 17:10:32
>>throw8+55
Yes. You could still hard reset after the error is reported if you wanted to. And if system availability matters, a hardware watchdog would handle the case where the error handling doesn't finish.
◧◩
6. mlindn+Cj[view] [source] [discussion] 2022-10-02 17:11:59
>>throw8+55
Limping along is what the salespeople and the business people want, because failures look bad.

Engineers should want the immediate stop, because that's safer, especially in safety-critical situations.

replies(3): >>wtalli+In >>warinu+n91 >>niscoc+4R1
◧◩◪
7. wtalli+In[view] [source] [discussion] 2022-10-02 17:33:35
>>mlindn+Cj
The kernel is not the whole system. The kernel needs to offer the "limping along" option so that the other parts of the system can implement whatever graceful failure method is appropriate for that system. There's no one-size-fits-all solution for the kernel to pick.
8. yencab+SZ[view] [source] 2022-10-02 21:37:48
>>ChrisS+(OP)
What is better for a desktop user:

1) needing to reload a wifi driver to reinitialize hardware (with a tiny probability of memory corruption) OR choosing to reboot as soon as convenient (with a tiny probability of corrupting the latest saved files)

2) to lose unsaved files for sure and not even know what caused the crash

replies(2): >>Jweb_G+721 >>notaco+JJ2
◧◩
9. Jweb_G+721[view] [source] [discussion] 2022-10-02 21:55:58
>>yencab+SZ
The latter, because the "tiny probability of memory corruption" can easily become a CVE.
replies(1): >>P5fRxh+Eb1
◧◩◪
10. warinu+n91[view] [source] [discussion] 2022-10-02 22:43:30
>>mlindn+Cj
You sound like you code websites or something.

Real engineers, like, say, the people who code the machines that fly on Mars, don't want "oops that's unexpected, ruin the entire mission because that's safer". Same for the Linux kernel.

◧◩◪
11. P5fRxh+Eb1[view] [source] [discussion] 2022-10-02 23:02:03
>>Jweb_G+721
We have a term for this.

FUD

replies(1): >>Jweb_G+8f1
◧◩◪◨
12. Jweb_G+8f1[view] [source] [discussion] 2022-10-02 23:31:00
>>P5fRxh+Eb1
Linux has numerous CVEs, and a large percentage stem from memory corruption. That's not FUD, I'm afraid.
replies(1): >>scoutt+A12
◧◩◪
13. niscoc+4R1[view] [source] [discussion] 2022-10-03 05:47:00
>>mlindn+Cj
What are you talking about? Should planes stop flying when they encounter an error?

Safety-critical systems will try to recover to a working state as much as possible. They are designed with redundancy so that if one path fails, they can use path 2 or path 3 to reach a safe, usable state.
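As a rough illustration of that fallback idea, here is a minimal Rust sketch. It is not taken from any real avionics or kernel code; the `Path` alias, the sensor functions and the safe default are all hypothetical. Each redundant path is tried in order, failures are recorded rather than ignored, and a predefined safe value is used only if every path fails.

```rust
/// One way of obtaining a reading; returns an error when this path has failed.
type Path = fn() -> Result<f64, &'static str>;

/// Try each redundant path in order and return the first usable reading,
/// together with the failures seen along the way. If every path fails,
/// fall back to a predefined safe value instead of stopping.
fn read_with_redundancy(paths: &[Path], safe_default: f64) -> (f64, Vec<&'static str>) {
    let mut errors = Vec::new();
    for path in paths {
        match path() {
            Ok(value) => return (value, errors),
            // Record the failure so it can be reported, then try the next path.
            Err(reason) => errors.push(reason),
        }
    }
    (safe_default, errors)
}

// Hypothetical sensors used only for this illustration.
fn primary_sensor() -> Result<f64, &'static str> {
    Err("primary sensor timed out")
}

fn backup_sensor() -> Result<f64, &'static str> {
    Ok(101.3)
}
```

Calling `read_with_redundancy(&[primary_sensor as Path, backup_sensor as Path], 0.0)` yields the backup reading plus the recorded primary failure, so the system keeps working while the error is still available for reporting.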

◧◩◪◨⬒
14. scoutt+A12[view] [source] [discussion] 2022-10-03 07:29:37
>>Jweb_G+8f1
It's FUD. And not only that. The fear of constantly being attacked by an external entity is also paranoid.
replies(1): >>Jweb_G+rcl
◧◩
15. notaco+JJ2[view] [source] [discussion] 2022-10-03 13:30:02
>>yencab+SZ
Why focus exclusively on the desktop, or over-generalize from it to other uses? What is appropriate for desktops is not necessarily so for the many millions of machines in server rooms and data centers. Also, you present a false dichotomy. "Lose unsaved files for sure" is not the case for many systems, and "not even know" is not necessarily the case. Logging during shutdown is a real thing, as is saving a crash dump for retrieval after reboot. Both have been standard at my last several projects and companies.

As I've said over and over, both approaches - "limp along" and "reboot before causing harm" - need to remain options, for different scenarios. Anyone who treats the one use case they're familiar with as the only one which should drive policy for everyone is doing the community a disservice.

replies(1): >>yencab+fQ2
◧◩◪
16. yencab+fQ2[view] [source] [discussion] 2022-10-03 14:02:17
>>notaco+JJ2
Yes, both need to remain options. Rust-in-kernel needs to be able to support both. That's like half of Linus's ranting there.

The other half is that the kernel has a lot of rules about what is safe to do where, and Rust has to be able to follow those rules, or not be used in those contexts. This is the GFP_ATOMIC part.

◧◩◪◨⬒⬓
17. Jweb_G+rcl[view] [source] [discussion] 2022-10-09 04:13:55
>>scoutt+A12
Unfortunately, whether you personally care about this sort of thing isn't good enough anymore. Owned Linux boxes on IoT devices are now being marshaled into massive botnets used to perform denial of service attacks, while other vulnerabilities are exploited to enable ransomware. You having negligent security on your own unpatched box because you don't personally feel like it's a good tradeoff has many negative external consequences. Fortunately, the decision isn't actually up to you (and having fewer vulnerabilities won't influence you negatively anyway, so I'm not sure why you're so angry about it).
replies(1): >>scoutt+Eoo
◧◩◪◨⬒⬓⬔
18. scoutt+Eoo[view] [source] [discussion] 2022-10-10 12:31:58
>>Jweb_G+rcl
> why you're so angry about it

Am I?

You suppose a lot of things about me from literally a bunch of words.

"A 'tiny probability of memory corruption' can easily become a CVE" is still FUD, because it is simply not true in most cases. The words "tiny" and "easily" show the bias here.

The rest of the conversation seems like a symptom of hypervigilance: fixation on potential threats (dangerous people, animals, or situations).

Fortunately, the decision isn't up to you either.
