I truly believe they're going to make resilience their #1 priority now, and acknowledging the release-process errors they didn't acknowledge for a while (according to other HN comments) is the first step towards that.
HugOps. Although bad for reputation, I think these incidents will help them shape (and prioritize!) resilience efforts more than ever.
At the same time, I can't think of a company more transparent than CloudFlare when it comes to this kind of thing. I also understand the urgency behind this change: CloudFlare acted (too) fast to mitigate the React vulnerability, and this is the result.
Say what you want, but I'd rather trust CloudFlare, which admits and acts upon its fuckups, than providers that cover them up or downplay them, like some other major clouds do.
@eastdakota: ignore the negative comments here. Transparency is a very good strategy, and this article lays out a good plan to avoid further problems.
You can be angry, but that doesn't help anyone. They fucked up, yes; they admitted it and provided plans for how to address it.
I don't think they do these things on purpose. Of course, given their market penetration, they end up disrupting a lot of customers, and they should focus on slow rollouts. But I also believe that in a DDoS protection system (or WAF) you don't want, or have, the luxury of waiting days until your rule is applied.
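To make that trade-off concrete, here's a rough, invented sketch (not Cloudflare's actual pipeline; the urgency levels, percentages, and soak times are all made up) of how a rollout planner could stage routine changes over days while still pushing an emergency WAF rule through a short canary instead of straight to 100%:

    // Hypothetical sketch: a rollout planner where "emergency" rules skip the
    // multi-day schedule but still pass a small canary before going global.

    #[derive(Debug, Clone, Copy)]
    enum Urgency {
        Routine,   // e.g. proxy refactors, config cleanups
        Emergency, // e.g. a rule for an actively exploited vulnerability
    }

    #[derive(Debug)]
    struct Stage {
        traffic_percent: u8, // share of traffic (or PoPs) receiving the change
        soak_minutes: u64,   // how long to watch error rates before widening
    }

    // Returns a staged plan; the numbers are invented for illustration only.
    fn rollout_plan(urgency: Urgency) -> Vec<Stage> {
        match urgency {
            Urgency::Routine => vec![
                Stage { traffic_percent: 1, soak_minutes: 24 * 60 },
                Stage { traffic_percent: 10, soak_minutes: 24 * 60 },
                Stage { traffic_percent: 50, soak_minutes: 12 * 60 },
                Stage { traffic_percent: 100, soak_minutes: 0 },
            ],
            // Emergency rules can't wait days, but even a short canary catches
            // the "rule breaks the proxy" class of failure before it's global.
            Urgency::Emergency => vec![
                Stage { traffic_percent: 1, soak_minutes: 5 },
                Stage { traffic_percent: 25, soak_minutes: 10 },
                Stage { traffic_percent: 100, soak_minutes: 0 },
            ],
        }
    }

    fn main() {
        for stage in rollout_plan(Urgency::Emergency) {
            println!("{:>3}% of traffic, soak {} min", stage.traffic_percent, stage.soak_minutes);
        }
    }

The point being that "slow rollout" and "fast response" aren't mutually exclusive: a few minutes at 1% of traffic is still far better than going everywhere at once.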
This childish nonsense needs to end.
Ops are heavily rewarded because they're supposed to be responsible. If they're not, then the associated rewards need to stop as well.
I think it's human nature (it's hard to notice something is going well until it breaks), but it still has a very negative psychological effect. I can barely imagine the stress the team is going through right now.
That's why their salaries are so high.
In this particular case, they seem to be doing two things:

- Phasing out the old proxy (Lua-based), which is being replaced by FL2 (Rust-based, the one that caused the previous incident)

- Reacting to an actively exploited vulnerability in React by deploying WAF rules

And they're doing both in a relatively careful way (test rules) to avoid fuckups, which is what produced the unknown state that triggered the issue.
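To illustrate that failure class (a hypothetical sketch only, not Cloudflare's actual code or root cause; RuleAction, apply_action_strict, and the fallback behaviour are invented for the example): a rule engine that only models the actions it expects can land in an unhandled state when a rule arrives in a "test"/log-only mode it never anticipated:

    // Hypothetical sketch: a strict handler hits an unhandled state when a
    // monitor-only rule action shows up; the caller decides how to fail.

    #[derive(Debug)]
    enum RuleAction {
        Block,
        Challenge,
        Log, // monitor-only "test" mode added for the emergency WAF rules
    }

    fn apply_action_strict(action: &RuleAction) -> Result<&'static str, String> {
        match action {
            RuleAction::Block => Ok("request blocked"),
            RuleAction::Challenge => Ok("challenge issued"),
            // The path that never anticipated a log-only action:
            other => Err(format!("unhandled rule action: {:?}", other)),
        }
    }

    fn main() {
        // A defensive caller turns the unknown state into pass-through plus an
        // alert instead of taking down the whole request path.
        match apply_action_strict(&RuleAction::Log) {
            Ok(outcome) => println!("{outcome}"),
            Err(e) => eprintln!("falling back to allow + alert: {e}"),
        }
    }

Whether the right default is "fail open with an alert" or "fail closed" is exactly the kind of decision that should be made deliberately, not discovered during an incident.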
That's not deserving of sympathy.
It's fine to be upset, especially after the second outage in less than 30 days, but that doesn't justify toxicity.
I hope that was their #1 priority from the very start given the services they sell...
Anyway, people always tend to overthink these black-swan events. Yes, two happened in quick succession, but what is the average frequency overall? Insignificant.
https://www.csoonline.com/article/3814810/backdoor-in-chines...
Most hospital and healthcare IT teams are extremely underfunded, undertrained, and overworked, and the software, configurations, and platforms are usually not the most resilient things.
I have a friend at one in the Northeast who has been going through a hell of a security breach for months now, and I'm flabbergasted no one is dead yet.
When it comes to tech, I get the impression most organizations are not very "healthy" in terms of the durability of their systems.
(And also, rolling your own WAF is probably not the right answer if you need better uptime. It's exceedingly unlikely a medical-device company will beat CF at this game.)
Looking across the errors, they point to some underlying practices: a lack of system metaphors, modularity, and testability, and a reliance on super-generic configuration instead of software with enforced semantics.
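For what "enforced semantics" means in practice, here's a small hypothetical sketch (the key names, ranges, and defaults are invented) contrasting a super-generic string map, which silently accepts garbage, with a typed config value that rejects invalid input at parse time:

    // Hypothetical sketch: generic config hides mistakes; typed config surfaces them.

    use std::collections::HashMap;

    // Generic configuration: nothing stops "timeout_ms" = "banana" or a typo'd key.
    fn read_timeout_generic(cfg: &HashMap<String, String>) -> u64 {
        cfg.get("timeout_ms")
            .and_then(|v| v.parse().ok())
            .unwrap_or(30_000) // silent fallback hides the misconfiguration
    }

    // Enforced semantics: invalid values are rejected up front, not at 3 a.m.
    struct TimeoutMs(u64);

    impl TimeoutMs {
        fn parse(raw: &str) -> Result<Self, String> {
            let ms: u64 = raw.parse().map_err(|_| format!("not a number: {raw}"))?;
            if !(1..=120_000).contains(&ms) {
                return Err(format!("timeout {ms}ms outside allowed range"));
            }
            Ok(TimeoutMs(ms))
        }
    }

    fn main() {
        let mut cfg = HashMap::new();
        cfg.insert("timeout_ms".to_string(), "banana".to_string());
        println!("generic config silently yields {} ms", read_timeout_generic(&cfg));
        println!("typed config says: {:?}", TimeoutMs::parse("banana").err());
    }

The generic version "works" right up until the bad value matters; the typed version fails loudly at the point where someone can still fix it.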