I'm seeing an increasing trend of pushback against this norm. An early example was David Crawshaw's one-process programming notes [1]. Running the database in the same process as the application server, using SQLite, is getting more popular with the rise of Litestream [2]. Earlier this year, I found the post "One machine can go pretty far if you build things properly" [3] quite refreshing.
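To make the "one process" idea concrete, here is a minimal sketch of an application server and its database sharing a single OS process, using Python's stdlib sqlite3 and http.server. The port, table name, and file path are placeholders of mine, and replication (the part Litestream would add, from outside the process) is left out entirely:

    # Minimal sketch: HTTP server and database in one OS process.
    # SQLite runs as an in-process library; there is no separate DB server.
    # (Litestream, if used, would replicate app.db from outside this process.)
    import sqlite3
    from http.server import BaseHTTPRequestHandler, HTTPServer

    db = sqlite3.connect("app.db", check_same_thread=False)
    db.execute("CREATE TABLE IF NOT EXISTS visits (ts TEXT DEFAULT CURRENT_TIMESTAMP)")

    class Handler(BaseHTTPRequestHandler):
        def do_GET(self):
            # Each request is an in-process library call, not a network hop
            # to a database running elsewhere.
            db.execute("INSERT INTO visits DEFAULT VALUES")
            db.commit()
            (count,) = db.execute("SELECT count(*) FROM visits").fetchone()
            body = f"visit #{count}\n".encode()
            self.send_response(200)
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)

    if __name__ == "__main__":
        HTTPServer(("", 8000), Handler).serve_forever()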
Most of us can ignore FAANG-scale problems and keep right on using POSIX on a handful of machines.
But his architecture does seem to be consistent with a "minutes of downtime" model. He's using AWS, and has his database on a separate EBS volume with a sane backup strategy. So he's not manually fixing servers, and has reasonable migration routes for most disaster scenarios.
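The post doesn't spell out the exact backup mechanics, but a "sane backup strategy" for a database kept on its own EBS volume can be as small as a scheduled snapshot. Here's a hedged sketch using boto3; the volume ID, region, and tags are placeholders of mine, not details from [3]:

    # Hypothetical sketch: periodic EBS snapshot of the database volume,
    # run from cron or any scheduler. Volume ID and region are placeholders.
    import boto3

    def snapshot_db_volume(volume_id: str = "vol-0123456789abcdef0") -> str:
        ec2 = boto3.client("ec2", region_name="us-east-1")
        # Snapshots of a live volume are crash-consistent; for a cleaner copy
        # you'd quiesce writes first or lean on the database's own recovery.
        snap = ec2.create_snapshot(
            VolumeId=volume_id,
            Description="nightly database volume backup",
            TagSpecifications=[{
                "ResourceType": "snapshot",
                "Tags": [{"Key": "purpose", "Value": "db-backup"}],
            }],
        )
        return snap["SnapshotId"]

    if __name__ == "__main__":
        print("created snapshot:", snapshot_db_volume())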
Except for PEBKAC (problem exists between keyboard and chair), which is what really kills most servers. And HA setups are more vulnerable to that, since they're more complicated.
I posit that this kind of "deal killer" is most often a wish-list item, not a true need. Teams without a working product tend to treat these theoretical reliability issues as "deal killers" as a form of premature optimization.
I worked at a FANG on a product where we thought the availability problems of having each session "owned" by a single server were a deal killer -- i.e., any one machine could crash at any time and people would notice. We spent the better part of a year designing and implementing a fancy, fully distributed system where sessions could migrate seamlessly, and so on.
Then, before we finished, a PM orchestrated the purchase of a startup that had a launched product with similar functionality. Its design held per-user session state on a single server and was thus much simpler -- almost laughably simple compared to what we were attempting. The kind of design you'd write on a napkin over a burrito lunch as minimally viable, then quickly code up -- just what you'd do in a startup.
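For illustration only, here is roughly what that napkin design amounts to: hash the user ID to pick the one server that owns the session, and route everything for that user there. The host names and hashing choice are hypothetical, not taken from the actual product:

    # Hypothetical napkin design: each user's session lives on exactly one
    # server, chosen by hashing the user ID. If that box dies, its users
    # notice until they're reassigned; nothing migrates seamlessly.
    import hashlib

    SERVERS = ["app-1.internal", "app-2.internal", "app-3.internal"]  # placeholders

    def owner_for(user_id: str, servers: list[str] = SERVERS) -> str:
        """Deterministically pick the single server owning this user's session."""
        digest = hashlib.sha256(user_id.encode()).digest()
        return servers[int.from_bytes(digest[:8], "big") % len(servers)]

    print(owner_for("alice"))  # always routes "alice" to the same host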
After the acquisition there were big arguments between our team and the startup's about which core technology the FANG should go forward with. We'd point at math and theory about availability and failure rates; they'd point at happy users and a working product. It ended with a VP pointing at the startup's launched product and saying "we're going with what is working now." Within months the product was live on the FANG's production infrastructure, and it has run almost unchanged architecturally for over a decade. Is the system theoretically less reliable than our fancier would-be system? Yes. Does anybody actually notice or care? No.
That's not at all to say it's a deal breaker for everyone, but it certainly will be for some companies.