zlacker

> Cranshaw mentions "tens of hours of downtime" as a drawback. Given that downtime sometimes corresponds with times of high load, aka the really important times, that's going to be a deal killer for most.

I posit that this kind of "deal killer" is most often a wish list item and not a true need. I think most teams without a working product think these kinds of theoretical reliability issues are "deal killers" as a form of premature optimization.

I worked at a FANG on a product where we thought the availability issues caused by a design in which each session was "owned" by a single server were a deal killer. I.e., we thought that one machine could crash at any time and people would notice. We spent a lot of time designing a fancy fully distributed system where sessions could migrate seamlessly, etc. We spent the better part of a year designing and implementing it.

Then, before we finished, a PM orchestrated the purchase of a startup that had a launched product with similar functionality. Its design held per-user session state on a single server and was thus much simpler. It was almost laughably simple compared to what we were attempting. The kind of design you'd write on a napkin over a burrito lunch as minimally viable, and quickly code up -- just what you'd do in a startup.

After the acquisition we had big arguments between our team and those at the startup about which core technology the FANG should go forward with. We'd point at math and theory about availability and failure rates. They'd point at happy users and a working product. It ended with a VP pointing at the startup's launched product saying "we're going with what is working now." Within months the product was working within the FANG's production infrastructure, and it has run almost unchanged architecturally for over a decade. Is the system theoretically less reliable than our fancier would-be system? Yes. Does anybody actually notice or care? No.

replies(2): >>thayne+Ib >>hiptob+I12

>>mattar+(OP)
It is a deal killer for anyone who has SLAs specified in contracts, which is pretty common in B2B.

replies(1): >>macint+Jh

>>thayne+Ib
Maybe. In that example, if the service has run for over a decade, it seems plausible that whatever contractual penalties they would have had to pay out for occasional downtimes would be far less than the initial and ongoing development time required to implement a far more complex solution, not to mention the additional hardware/cloud costs.

replies(1): >>thayne+3G2

>>mattar+(OP)
So many examples of this across Google it's not even funny.

>>macint+Jh
I would consider it dishonest to promise your customers a certain uptime, knowing you likely won't meet it. And some customers, particularly the more lucrative ones, want to see historical uptime and/or evidence that you have a resilient architecture.

That is not at all to say that it is a deal breaker for everyone, but it certainly will be for some companies.