zlacker

[parent] [thread] 11 comments
1. uyzstv+(OP)[view] [source] 2025-12-05 17:03:50
What I'm missing here is a test environment. Gradual or not, why are they deploying straight to prod? At Cloudflare's scale, there should be a dedicated room in Cloudflare HQ with a full, isolated, model-scale deployment of their entire system. All changes should go there first, with tests run for every possible scenario.

Only after that do you use gradual deployment, with a big red oopsie button which immediately rolls the changes back. Languages with strong type systems won't save you, good procedure will.
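The staged pipeline plus "big red oopsie button" the OP describes can be sketched as a tiny state machine. This is a hypothetical illustration, not Cloudflare's actual tooling; all names (Stage, Rollout, etc.) are made up for the example.

```rust
// Minimal sketch of a staged rollout with an instant rollback path.
// Hypothetical names throughout; not any vendor's real deploy system.

#[derive(Clone, Copy, PartialEq, Debug)]
enum Stage {
    TestEnv,    // isolated model-scale deployment: everything lands here first
    Canary,     // small slice of production
    Production, // full rollout
}

#[allow(dead_code)]
struct Rollout {
    stage: Stage,
    previous_good: u32, // version restored by the big red button
    current: u32,       // version being rolled out
}

impl Rollout {
    fn new(previous_good: u32, current: u32) -> Self {
        Rollout { stage: Stage::TestEnv, previous_good, current }
    }

    // Promote one stage, but only if the current stage's checks passed.
    fn promote(&mut self, checks_passed: bool) -> bool {
        if !checks_passed {
            return false;
        }
        self.stage = match self.stage {
            Stage::TestEnv => Stage::Canary,
            Stage::Canary => Stage::Production,
            Stage::Production => Stage::Production,
        };
        true
    }

    // The "big red oopsie button": revert to the last known-good version
    // and send the change back to the test environment.
    fn rollback(&mut self) -> u32 {
        self.stage = Stage::TestEnv;
        self.previous_good
    }
}
```

The point of the sketch is that promotion is gated on checks at every stage, and rollback is a single cheap operation rather than a forward fix under pressure.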

replies(5): >>vouwfi+IA >>bombca+EM >>tetha+X01 >>znkr+1Y1 >>eviks+212
2. vouwfi+IA[view] [source] 2025-12-05 19:43:56
>>uyzstv+(OP)
> Languages with strong type systems won't save you

Neither will seatbelts if you drive into the ocean, or helmets if you drink poison. I'm not sure what your point is.

replies(1): >>djmips+WS
3. bombca+EM[view] [source] 2025-12-05 20:45:16
>>uyzstv+(OP)
They have millions of “free” subscribers; those subscribers should be the guinea pigs for rollouts, and paying (read: big) subscribers can get the breaking changes later.
replies(2): >>bearde+eQ >>ectosp+8V
4. bearde+eQ[view] [source] [discussion] 2025-12-05 21:02:46
>>bombca+EM
This feels like such a valid solution and is how past $dayjobs released things: send to the free users, then roll out to Paying Users once that's proven not to blow up.
replies(1): >>sznio+S91
5. djmips+WS[view] [source] [discussion] 2025-12-05 21:16:56
>>vouwfi+IA
I think you strengthened their point.
replies(1): >>vouwfi+ua2
6. ectosp+8V[view] [source] [discussion] 2025-12-05 21:28:16
>>bombca+EM
Free tier doesn’t get WAF. We kept working.
replies(1): >>bsdpqw+SW
7. bsdpqw+SW[view] [source] [discussion] 2025-12-05 21:34:41
>>ectosp+8V
Their December 3rd blog about React states:

"These new protections are included in both the Cloudflare Free Managed Ruleset (available to all Free customers) ..... "

Having some burn-in time in the free tier before it hits the whole network would have been good?!

8. tetha+X01[view] [source] 2025-12-05 21:58:21
>>uyzstv+(OP)
This is kinda what I'm thinking. We're absolutely not at the scale Cloudflare is at.

But we run software and configuration changes through three tiers: a first stage for the dev team only, a second stage with internal customers and other teams depending on it for integration and internal usage, and finally production. Some teams have also split production into different rings depending on the criticality and the number of customers.

This has led to a bunch of discussions early on, because teams with simpler software and very good testing usually push through dev and test with little or no problem. And that's fine. If you have a track record of good changes, there is little reason to artificially prolong deployment in dev and test just because. If you want to, just go through it in minutes.

But after a few spicy production incidents, even the better and faster teams understood and accepted that once technical velocity exists, actual velocity is a choice, or a throttle if you want an analogy.

If you're doing well, by all means promote from test to prod within minutes. If you fuck up production several times in a row and start threatening SLAs, slow down: spend more resources on manual testing and on improving automated testing, give changes time to simmer in the internally productive environment, and spend more time between promotions from production ring to production ring.

And this is on top of considerations of e.g. change risk. Some frontend-only application can move much faster than the PostgreSQL team, because one rollback is a container restart, and the other could be a multi-hour recovery from backups.
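The "velocity is a throttle" idea above can be made concrete: the required soak time between promotion rings grows with a team's recent incident count. A minimal sketch, with made-up numbers and names (nothing here is from the commenter's actual setup):

```rust
// Hypothetical sketch: minimum soak time (in hours) between promotion
// rings, scaled by the team's recent incident count. A clean track
// record keeps the throttle open; each incident doubles the wait.

fn soak_hours(base_hours: u64, recent_incidents: u32) -> u64 {
    // Double per incident, capped at 2^6 so the factor stays bounded.
    let factor = 1u64 << recent_incidents.min(6);
    // Never require more than a week between rings.
    (base_hours * factor).min(7 * 24)
}
```

So a team with no recent incidents and a one-hour base promotes almost immediately, while four incidents turn a one-day base into the full one-week cap.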

9. sznio+S91[view] [source] [discussion] 2025-12-05 22:54:25
>>bearde+eQ
If your target is availability, that's correct.

If your target is security, then _assuming your patch is actually valid_ you're giving better security coverage to free customers than to your paying ones.

Cloudflare is both, and their tradeoffs seem to be set on maximizing security at the cost of availability. And it makes sense. A fully unavailable system is perfectly secure.

10. znkr+1Y1[view] [source] 2025-12-06 08:19:40
>>uyzstv+(OP)
I am sure they have this. What tends to happen is that the gradual rollout system becomes too slow for some rare, low-latency rollout requirements, so a config system is introduced that fulfills them. For example, say you have gradual rollout for binaries (slow) and for configuration (fast).

Over time, the fast rollout of the configuration system will cause outages, so it's slowed down. Then a requirement pops up for which the config system is too slow, and someone identifies a global system with no gradual rollout (e.g. a database) to be used as the solution. That solution will be compliant to the letter with all the processes that have been introduced, because so far nobody has thought of using a single database row for global configuration.

Add new processes whenever this happens, and at some point everything will be too slow and taking on more risk becomes necessary to stay competitive. So processes are adjusted. Repeat forever.
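The failure mode described above is the difference between a gradual config gate and a global flip. A sketch of the gradual side, with illustrative names (this is a common deterministic-bucketing pattern, not any specific vendor's implementation):

```rust
// Sketch of a gradual config gate: a change covers a growing percentage
// of hosts, and each host's membership is deterministic, so a bad change
// hits a bounded, stable slice first. Contrast with a single global
// database row, which flips everywhere at once with no ramp at all.

fn host_in_rollout(host_id: u64, rollout_percent: u64) -> bool {
    // Deterministic bucketing: a host that is "in" at 10% stays in
    // as the percentage ramps toward 100.
    host_id % 100 < rollout_percent
}
```

The "single database row" shortcut skips exactly this bounding: every host reads the new value on its next poll, so the blast radius is 100% from the first second.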
11. eviks+212[view] [source] 2025-12-06 08:57:47
>>uyzstv+(OP)
> Languages with strong type systems won't save you, good procedure will.

One of the items in the list of procedures is to use types to encode rules of your system.
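One standard way to "use types to encode rules" is the newtype pattern: a value that cannot be constructed unless it was validated, so downstream code never has to re-check it. A minimal illustration (the name RolloutPercent is invented for the example):

```rust
// Newtype encoding an invariant in the type system: a RolloutPercent
// can only come into existence via `new`, which enforces 0..=100.
// Downstream code that takes a RolloutPercent never sees a bad value.

struct RolloutPercent(u8); // invariant: 0..=100, enforced at construction

impl RolloutPercent {
    fn new(p: u8) -> Result<Self, String> {
        if p <= 100 {
            Ok(RolloutPercent(p))
        } else {
            Err(format!("{p} is not a valid percentage"))
        }
    }

    fn get(&self) -> u8 {
        self.0
    }
}
```

The procedure ("always validate config before use") becomes a compile-time guarantee: there is no code path that hands an unvalidated number to the rest of the system.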

12. vouwfi+ua2[view] [source] [discussion] 2025-12-06 11:04:36
>>djmips+WS
I don't think I did. The implication is that using languages with strong types (as discussed in the article) is not a solution. That's rubbish. It's at least part of the solution.