zlacker

Predicting the Future of Distributed Systems

submitted by borisj+(OP) on 2024-08-27 00:26:37 | 173 points 42 comments
replies(12): >>awkii+r5 >>notver+X7 >>samsta+28 >>BraveN+t8 >>jamesb+Rk >>jensne+0n >>buro9+7n >>willva+yo >>purple+sp >>nyrikk+AY >>pjdesn+pt1 >>__turb+mr3
1. awkii+r5[view] [source] 2024-08-27 01:27:55
>>borisj+(OP)
I think the author has a point with one-way doors slowing down the adoption of distributed systems. The best way to build two-way doors is to push for industry adoption of a particular API. In theory the backend of these APIs matters little to me, the developer, so long as it is fast and consistent. Some examples that come to mind: Apache Beam is a "programming model" for data pipelines, Akka is a "programming model" for stateful distributed systems, OpenTelemetry covers logging/telemetry, and Kubernetes covers orchestration. Oh, and local development is a strong preference.
replies(2): >>jaunty+17 >>mikepu+Ta
◧◩
2. jaunty+17[view] [source] [discussion] 2024-08-27 01:51:21
>>awkii+r5
OTel being a capture & ingest only specification is kind of messed up. From what I can tell there's no attempt to specify how to query or present stored data; it's just an over-the-wire specification, & that drastically limits usable scope. It means vendors each get to make their own services & backends & tools, but it's grievously limiting the effort as a whole, & makes even an open spec like OTel a one-way door.

Ideally OTel would be more than observability, imo. Traces would be event-sources, would be a thing that begets more computing. The system to observe computing should in turn also be the system to react & respond to computing, should be begetting more computing elsewhere. That's the distributed system I want to see; local agents reporting their actions, other agents seeing that & responding. OTel would fit the need well, if only we could expand ourselves beyond thinking of observability as an operator affordance & start thinking of it as the system to track computation.

replies(1): >>singro+T9
3. notver+X7[view] [source] 2024-08-27 02:04:26
>>borisj+(OP)
This is the right time to mention Designing Data-Intensive Applications by Martin Kleppmann. Amazing book explaining distributed systems concepts in digestible language.
4. samsta+28[view] [source] 2024-08-27 02:06:09
>>borisj+(OP)
>>the biggest opportunity for a new programming model is extracting the majority of the code from an application and moving it into the infrastructure instead. The second biggest opportunity is for the remaining code—what people refer to as the business logic, the essence of the program—to be portable and secure.

This was such a well-put comment that it truly made me grok the entire article in just this one statement.

---

Infrastructure needs to be invisible, and that is where the future of AI-enabled orchestration/abstraction will allow development to be more poetry than code - whereby we can describe complex logic paths/workflows in a language of intent - and all the components required to accomplish that desired outcome will much more quickly and elegantly become a reality.

The real challenge ahead is the divide between those who have the capability and power of all the AI tools available to them, and those who are subjugated by those who do.

For example, an individual can build a lot with the current state of the available tool universe... but a more sophisticated and well funded organization will have a lot more potential capability.

What I am really interested to know, is if there is a dark Dystopian Cyberpunk AI under-world happening yet?

What's the state of BadActor/BigCorpo/BigSpy's capability and covert actions currently?

While we are distracted by AI_ClipArt and celebrity voice squabbles, and seemingly Top AI Voices are being ignored after founding organizations for Alignment/Governance/Humane/etc and warning of catastrophe - define The State of Things?

But yeah - extracting the code and letting logic just be handled yet portable, clonable, and easily refactorable is where we are already headed. It's amazing and terrifying at the same time.

I'm thankful for all my Cyberpunk Fantasy reading, thinking, and imagining, and then for my tiny part in the overall evolution of the world of tech today: having the opportunity to be here, to work with and build it, and now to see the birth of AI and use it daily in my actual IRL interactions.

Such an amazing moment in Human History to be here through this.

5. BraveN+t8[view] [source] 2024-08-27 02:11:17
>>borisj+(OP)
> One-Way-Door and Two-Way-Door Decisions

See also the "Linux kernel management style" document that's been in the kernel since forever: https://docs.kernel.org/6.1/process/management-style.html

replies(2): >>ent101+K8 >>ZaoLah+nm
◧◩
6. ent101+K8[view] [source] [discussion] 2024-08-27 02:13:39
>>BraveN+t8
> Most people are idiots, and being a manager means you’ll have to deal with it, and perhaps more importantly, that they have to deal with you.

> It turns out that while it’s easy to undo technical mistakes, it’s not as easy to undo personality disorders. You just have to live with theirs - and yours.

this was definitely written by Linus XD

◧◩◪
7. singro+T9[view] [source] [discussion] 2024-08-27 02:32:32
>>jaunty+17
OTel works as a standard since there isn't any need to innovate at that level. Despite its overcomplications, all the implementations largely have the same requirements, and it's useful to instrument everything the same way.

Querying unfortunately has lots of room for innovation, and it's really hard to nail down in a spec especially when the vendors all want to compete.

replies(1): >>__turb+wq3
◧◩
8. mikepu+Ta[view] [source] [discussion] 2024-08-27 02:45:57
>>awkii+r5
It boggles my mind that people accept architectures where the only dev story is a duplicate cloud instance of the required services.

Being able to bring the whole application up locally should be an absolute non-negotiable.

replies(2): >>cybera+We >>pjmlp+C79
◧◩◪
9. cybera+We[view] [source] [discussion] 2024-08-27 03:52:34
>>mikepu+Ta
> Being able to bring the whole application up locally should be an absolute non-negotiable.

This usually doesn't work that well for larger systems with services split between multiple teams. And it's not typically the RAM/CPU limitations that are the problem, but the amount of configuration that needs to be customized (and, in some cases, data).

Sooner or later, you just start testing with the other teams' production/staging environments rather than deal with local incompatibilities.

replies(1): >>choege+EF1
10. jamesb+Rk[view] [source] 2024-08-27 05:29:20
>>borisj+(OP)
I really enjoyed this article. The one point I have issue with is that the dominance of object storage in today's distributed systems is very much due to economics, not technology. There's basically cheering every little step S3 takes towards a POSIX-like distributed file system like HDFS - "consistent listing of files, yeah!". Last week it was preconditions for writing files. There's still huge gymnastics needed in Iceberg/Delta to work with S3 given the lack of atomic rename.
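
To make the point about write preconditions concrete, here is a minimal sketch of how a table format could commit log entries without atomic rename, assuming only a put-if-absent primitive. Everything in it is hypothetical: `ObjectStore`, `put_if_absent`, `commit`, and the `_log/` key layout are illustrative names, not the S3 API and not how Iceberg or Delta actually implement commits.

```python
import threading


class PreconditionFailed(Exception):
    """Raised when the key already exists (loosely analogous to an HTTP 412)."""


class ObjectStore:
    """Hypothetical in-memory stand-in for an object store with conditional puts."""

    def __init__(self):
        self._objects = {}
        self._lock = threading.Lock()

    def put_if_absent(self, key, body):
        # The existence check and the write happen atomically under one lock.
        with self._lock:
            if key in self._objects:
                raise PreconditionFailed(key)
            self._objects[key] = body

    def keys(self):
        with self._lock:
            return sorted(self._objects)


def commit(store, table, payload, max_attempts=10):
    """Claim the next free log version for `table`; retry if another writer wins."""
    prefix = f"{table}/_log/"
    for _ in range(max_attempts):
        # Versions are dense (0, 1, 2, ...), so the count is the next version.
        version = sum(1 for k in store.keys() if k.startswith(prefix))
        key = f"{prefix}{version:08d}.json"
        try:
            store.put_if_absent(key, payload)
            return version
        except PreconditionFailed:
            continue  # lost the race; recount and try the next version
    raise RuntimeError("too many concurrent writers")
```

Two writers racing for the same version cannot both succeed, which is the guarantee an atomic rename would otherwise provide. Without the conditional put, the listing and the write are separate steps and one writer can silently overwrite the other.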
◧◩
11. ZaoLah+nm[view] [source] [discussion] 2024-08-27 05:55:21
>>BraveN+t8
I really like the avoidance (elimination) of one-way-door decisions by turning them into several small(er) two-way-door decisions. I guess the software development interpretation of it is clearly defined boundaries of responsibility, and avoiding leaking implementation details beyond those?
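
One way to read "clearly defined boundaries" in code is a narrow interface with interchangeable backends. The sketch below is only an illustration under that assumption; `BlobStore`, `InMemoryStore`, `DirectoryStore`, and `save_report` are made-up names, not anything from the article.

```python
from pathlib import Path
from typing import Protocol


class BlobStore(Protocol):
    """The boundary: callers depend only on these two methods."""

    def put(self, key: str, data: bytes) -> None: ...

    def get(self, key: str) -> bytes: ...


class InMemoryStore:
    """Backend for local development and tests."""

    def __init__(self):
        self._data = {}

    def put(self, key, data):
        self._data[key] = data

    def get(self, key):
        return self._data[key]


class DirectoryStore:
    """Backend that keeps each key as a file under a root directory."""

    def __init__(self, root):
        self._root = Path(root)

    def put(self, key, data):
        path = self._root / key
        path.parent.mkdir(parents=True, exist_ok=True)
        path.write_bytes(data)

    def get(self, key):
        return (self._root / key).read_bytes()


def save_report(store: BlobStore, name: str, text: str) -> str:
    """Business logic sees only the boundary, never the backend's details."""
    key = f"reports/{name}.txt"
    store.put(key, text.encode("utf-8"))
    return key
```

Because `save_report` never touches paths or dictionaries directly, swapping one backend for the other stays a reversible decision.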
12. jensne+0n[view] [source] 2024-08-27 06:07:13
>>borisj+(OP)
I'd like to add that I'm seeing more and more companies unifying synchronous and asynchronous APIs. With the concept of GraphQL Federation, it's possible to "extend" Entities by defining their (primary) keys in a GraphQL Schema. If we're combining this with Async APIs, e.g. NATS or Kafka, we can enable teams to build APIs around events, while still being able to distribute the implementation of how certain fields can be resolved. The Federation Router then joins the Stream with additional data from synchronous services, a very powerful pattern I believe. I wrote a bit more on the topic here: https://wundergraph.com/blog/distributed_graphql_subscriptio...
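
A toy version of the join described above might look like the following. This is a sketch under heavy assumptions: `enrich_stream`, `resolve_user`, and the `USERS` table are hypothetical stand-ins, with a plain Python iterable in place of NATS/Kafka and a dict lookup in place of a synchronous subgraph service. It is not the GraphQL Federation or WunderGraph API.

```python
def enrich_stream(events, resolvers):
    """Join each event with fields fetched from synchronous services.

    `resolvers` maps an output field name to (key_field, lookup_function),
    mirroring the idea of extending an entity by its key.
    """
    for event in events:
        enriched = dict(event)  # leave the original event untouched
        for field_name, (key_field, lookup) in resolvers.items():
            enriched[field_name] = lookup(event[key_field])
        yield enriched


# Hypothetical synchronous "users" service, stubbed as a dict lookup.
USERS = {"u1": {"name": "Ada"}, "u2": {"name": "Linus"}}


def resolve_user(user_id):
    return USERS[user_id]
```

The team that publishes the events only needs to emit the key (`user_id` here); the team that owns users decides how the remaining fields get resolved.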
13. buro9+7n[view] [source] 2024-08-27 06:08:20
>>borisj+(OP)
Things I have come to know about distributed systems:

The S3 API (object storage) is the accepted storage API, but you do not need AWS (though they are very good at this).

The Kafka API is the accepted stream/buffer/queue API, but you do not need Confluent.

SQL is the query language, but you do not need a relational database.

replies(1): >>jensne+No
14. willva+yo[view] [source] 2024-08-27 06:31:17
>>borisj+(OP)
Something I anticipate is smarter storage that can do some filtering on push down predicates. There's compute on the storage nodes that is being wasted today.

I was kinda expecting BigQuery to do this under the hood, but it seems like they don't, which is a shame. BigQuery isn't faster than, say, Trino on GCS, even though Google could do some major optimisations here.
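
For anyone unfamiliar with the idea, here is a minimal sketch of what pushing a predicate down to storage buys you, measured as rows shipped over the wire. `StorageNode`, `query_without_pushdown`, and `query_with_pushdown` are hypothetical names for illustration only; this says nothing about how BigQuery or Trino work internally.

```python
from dataclasses import dataclass


@dataclass
class StorageNode:
    """Hypothetical storage node holding rows and counting what it ships out."""

    rows: list
    rows_sent: int = 0

    def scan(self, predicate=None):
        # With a predicate, the filtering runs here, next to the data.
        selected = [r for r in self.rows if predicate is None or predicate(r)]
        self.rows_sent += len(selected)
        return selected


def query_without_pushdown(node, predicate):
    # Ship every row, then filter on the compute side.
    return [r for r in node.scan() if predicate(r)]


def query_with_pushdown(node, predicate):
    # Ship only the rows that match.
    return node.scan(predicate)
```

Both functions return the same result; the difference shows up in `rows_sent`, which stands in for network transfer and for the compute the query engine would otherwise spend discarding rows.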

replies(2): >>okr+Qv >>levent+Eib
◧◩
15. jensne+No[view] [source] [discussion] 2024-08-27 06:33:55
>>buro9+7n
I'd argue that a lot of people are moving from Kafka to NATS. NATS and Kafka serve different purposes and for many use cases related to APIs, NATS has a lot more to offer, like wildcard topic topologies.
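
As a rough illustration of wildcard subjects: NATS subjects are dot-separated tokens, `*` matches exactly one token, and `>` matches one or more trailing tokens. The function below is a hypothetical re-implementation of that matching rule for explanation only; it is not code from the NATS server or any client library, and `subject_matches` is a made-up name.

```python
def subject_matches(pattern, subject):
    """Return True if a dot-separated `subject` matches a wildcard `pattern`."""
    p_tokens = pattern.split(".")
    s_tokens = subject.split(".")
    for i, p in enumerate(p_tokens):
        if p == ">":
            # ">" is only valid as the last token and needs at least one token.
            return i == len(p_tokens) - 1 and len(s_tokens) > i
        if i >= len(s_tokens):
            return False  # pattern is longer than the subject
        if p != "*" and p != s_tokens[i]:
            return False  # literal token mismatch
    # Without a trailing ">", the token counts must agree exactly.
    return len(p_tokens) == len(s_tokens)
```

So a consumer subscribed to `orders.*.created` would see `orders.eu.created` but not `orders.eu.west.created`, while `orders.>` would see both.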
replies(1): >>buro9+Ny
16. purple+sp[view] [source] 2024-08-27 06:42:27
>>borisj+(OP)
> Programming Models

If you read this section, the author gets a lot of things right, but clearly doesn't know the space that well since there have been people building things along these lines for years. And making vague commentary instead of describing the nitty-gritty doesn't evoke much confidence.

I work on one such language/tool called mgmt config, but I have had virtually no interest and/or skill in marketing it. TBQH, I'm disenchanted by the fact that it seems that to get any recognition you need to have VCs and a three-year timeline, short-term goals, and a plan to be done by then or move on.

If you're serious about future infra, then it's all here:

https://github.com/purpleidea/mgmt/

Looking for coding help for some of the harder bits that people might wish to add, and for people to take it into production and find issues that we've missed.

replies(2): >>lifty+bz >>filter+z36
◧◩
17. okr+Qv[view] [source] [discussion] 2024-08-27 08:13:29
>>willva+yo
I also wonder if Athena does this on AWS. Parquet supports pushdown. But I would suspect pushdown predicates mean that the file storage unit has to have some logic to execute custom code, bringing the code back to the data. The promise of Spark, once. It would be a huge win, definitely. Hmmm.

But it also opens up a threat vector. And you have competing users running their predicates, so one also has to think about queues and pipelining and so on. But that is probably also solvable, just like on any multiuser system.

Interesting.

◧◩◪
18. buro9+Ny[view] [source] [discussion] 2024-08-27 09:00:05
>>jensne+No
it's the Kafka API, not Kafka itself, that I see as having become the standard.
replies(2): >>xyzzy_+nZ >>629514+Ck1
◧◩
19. lifty+bz[view] [source] [discussion] 2024-08-27 09:07:53
>>purple+sp
I remember seeing your presentation many years ago, at FOSDEM. Very cool project and if I had to manage classic OS deployments I would definitely give mgmt a try. That being said, I think the world is moving to more immutable systems similar to how Talos works (https://talos.dev).
replies(2): >>karmar+iA >>purple+po1
◧◩◪
20. karmar+iA[view] [source] [discussion] 2024-08-27 09:24:03
>>lifty+bz
I would be hesitant to claim "the world is moving to" anything, really. Deployments that would now be called "traditional", so anything that does not run in a container but in a VM, will continue to exist for quite some time.

And not only because of legacy systems that are hard to migrate to a modern platform. At my place of work there are workloads that can easily run on Kubernetes and it would be wise to do so. On the other hand there are systems that are not designed to run in a container and there is frankly no need to, because not everything needs to scale up and down or be available 100% of the time at all costs.

I think configuration management systems like mgmt (or Ansible and Puppet) are here to stay.

replies(3): >>lifty+vB >>hnthro+cZ >>purple+or1
◧◩◪◨
21. lifty+vB[view] [source] [discussion] 2024-08-27 09:39:15
>>karmar+iA
Not disagreeing with you there; technology lingers for many years. But in terms of market share and mind share, configuration management has shrunk in dominance and I suspect it will continue to do so.
22. nyrikk+AY[view] [source] 2024-08-27 13:23:28
>>borisj+(OP)
On an unrelated note, does anyone know the origins of the one-way vs two-way door analogy?

In this post it is attributed to Jeff Bezos, but it was popular in the Pacific Northwest before his rise.

◧◩◪◨
23. hnthro+cZ[view] [source] [discussion] 2024-08-27 13:27:47
>>karmar+iA
>Deployments that would now be called "traditional", so anything that does not run in a container but in a VM, will continue to exist for quite some time.

I think there is even a widening talent gap where you can't get people excited about doing something that maybe should have been done years ago (assuming VM -> containers makes sense for a thing). The salary needs to go higher for things that are less beneficial to the resume.

The industry at large asks most developers to stay up-to-date, so it starts looking suspicious when a company doesn't stay up-to-date too. For C# in particular, companies who have only recently migrated to .NET 5+ are now a red flag to me considering how long .NET Core has been out.

replies(2): >>karmar+Kd1 >>pjmlp+q79
◧◩◪◨
24. xyzzy_+nZ[view] [source] [discussion] 2024-08-27 13:28:42
>>buro9+Ny
Which is honestly a shame. It's an awful API.
replies(1): >>ako+sF1
◧◩◪◨⬒
25. karmar+Kd1[view] [source] [discussion] 2024-08-27 14:51:33
>>hnthro+cZ
I think we have to make a distinction between "concepts" being out of date and tools being out of date. I would not consider the concept (or architectural decision) to run a system on a fleet of VMs as outdated. However tools (e.g. compilers) absolutely go out of date once they are being deprecated and need timely migrations.

In the latter case I would consider it a red flag if some long-deprecated tool turned up in the tech stack of a company, but there might be perfectly good reasons to stick to the former, a bunch of VMs, instead of operating a Kubernetes cluster.

I ran a small Kubernetes cluster once and it turned out to be the wrong decision _at that time_. I think I would be delighted to see a job ad from a company that mentioned both (common hypervisors/VMs, containers/Kubernetes) in their tech stack. Without more information I would think that company took their time to evaluate their needs irrespective of current tech trends.

replies(1): >>purple+Dr1
◧◩◪◨
26. 629514+Ck1[view] [source] [discussion] 2024-08-27 15:26:44
>>buro9+Ny
"Ce n'est pas un Kafka: Kafka is a Protocol". Subtitle: "Apache Kafka is an aging open source project. It's time to accept that Kafka's protocol is what matters." (https://materializedview.io/p/ce-nest-pas-un-kafka)
◧◩◪
27. purple+po1[view] [source] [discussion] 2024-08-27 15:48:30
>>lifty+bz
> I think the world is moving to more immutable systems

Mgmt doesn't care whether or not you want to build your system to be immutable, that's up to you! Mgmt lets you glue together the different pieces with a safe, reactive, distributed DSL.

Regarding your Talos comment, Kubernetes makes building things so complicated, so no, I don't think it will win out long term.

replies(1): >>pjmlp+v79
◧◩◪◨
28. purple+or1[view] [source] [discussion] 2024-08-27 16:05:39
>>karmar+iA
> I think configuration management systems like mgmt (or Ansible and Puppet) are here to stay.

I think so too, however "mgmt config" builds a lot of radical new primitives that Ansible and Puppet don't have. It's been negative for my "PR" to classify it as "config management" because people assume I'm building a "Puppet clone", but I really do see it as belonging to that space; it's just that those legacy tools never properly delivered on the idea the way I thought they should have.

◧◩◪◨⬒⬓
29. purple+Dr1[view] [source] [discussion] 2024-08-27 16:06:47
>>karmar+Kd1
I'm hiring for a company that is building a tech stack of VMs. My username at mastodon or twitter has the details, and it's about working with https://github.com/purpleidea/mgmt/
30. pjdesn+pt1[view] [source] 2024-08-27 16:14:59
>>borisj+(OP)
Something missing here in the discussion of object storage and databases is any mention of the declining importance of the file system.

From the 70s through the 90s or 00s everything was file system-based, and it was just assumed that the best way to store data in a distributed system - even a globally-distributed one - was some sort of distributed file system (e.g. Andrew File System, or research projects like OceanStore).

Nowadays the file system holds applications and configuration, but applications mostly store data in databases and object stores. In distributed systems this is done almost exclusively through system-specific network connections (e.g. port 3306 to MySQL, or HTTP for S3) rather than OS-level mounting of a file system.

(not counting HPC, where distributed file systems are used to preserve the developer look and feel of early non-distributed HPC systems)

◧◩◪◨⬒
31. ako+sF1[view] [source] [discussion] 2024-08-27 17:02:12
>>xyzzy_+nZ
Why?
◧◩◪◨
32. choege+EF1[view] [source] [discussion] 2024-08-27 17:03:22
>>cybera+We
> Sooner or later, you just start testing with the other teams' production/staging environments rather than deal with local incompatibilities.

That's probably about the time when your development pace goes downhill.

I think it's an interesting idea to consider: If some team interfaces with something outside of its control, they need to have a mock of it. That policy increases the development effort by at least a factor of two (you always have to create the mock alongside the thing), but it's just a linear increase.

replies(2): >>mikepu+DQ1 >>cybera+kS1
◧◩◪◨⬒
33. mikepu+DQ1[view] [source] [discussion] 2024-08-27 17:55:30
>>choege+EF1
In theory it should be the cloud providers themselves maintaining the locally-runnable stand-ins for their services, but as it stands you basically either get it as a third party effort (MinIO for S3) or in cases where the service is just a hosted version of some existing OSS product (Postgres for RDS).

Either way, once the local version exists, then the job becomes maintaining all the infrastructure that lets you bring up the pieces, populate them with reasonable state and wire them into whatever the bits are that are being actively hacked-on.

◧◩◪◨⬒
34. cybera+kS1[view] [source] [discussion] 2024-08-27 18:02:17
>>choege+EF1
> That's probably about the time when your development pace goes downhill.

Oh, absolutely. But at this point, your team is probably around several dozen people and you have a product with paying customers. This naturally slows the development speed, however you organize the development process.

> I think it's an interesting idea to consider: If some team interfaces with something outside of its control, they need to have a mock of it. That policy increases the development effort by at least a factor of two (you always have to create the mock alongside the thing), but it's just a linear increase.

The problem is, you can't really recapture the actual behavior of a service in a mock.

To give you an example, DynamoDB in AWS has a local mock in-memory DB for testing and development. It has nearly the same functionality, but stores all the data in RAM. So the simulated global secondary indexes (something like table views in classic SQL databases) are updated instantly. But on the real database they are eventually consistent, and an update can take a fraction of a second to show up.

So when you try to use your service in production, it can start breaking under the load.

Perhaps we need better mocks that also simulate the behavior of the real services for delays, retries, and so on.
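
As a rough idea of what such a mock could look like, here is a sketch of an in-memory table whose secondary index lags behind writes by a configurable delay. It is purely hypothetical: `LaggyTableMock`, `FakeClock`, the `owner` attribute, and the 0.5 second lag are invented for illustration and have nothing to do with the actual DynamoDB Local implementation or API.

```python
class FakeClock:
    """Manually advanced clock so tests stay deterministic."""

    def __init__(self):
        self.now = 0.0

    def advance(self, seconds):
        self.now += seconds


class LaggyTableMock:
    """Hypothetical table mock: primary reads are immediate, index reads lag."""

    def __init__(self, clock, index_lag=0.5):
        self._clock = clock
        self._index_lag = index_lag
        self._items = {}    # primary key -> item
        self._pending = []  # (visible_at, key, owner) index updates in flight
        self._indexed = {}  # primary key -> owner, as seen by the index

    def put(self, key, item):
        self._items[key] = item
        visible_at = self._clock.now + self._index_lag
        self._pending.append((visible_at, key, item["owner"]))

    def get(self, key):
        # Reads by primary key see the write straight away.
        return self._items.get(key)

    def query_by_owner(self, owner):
        # Reads through the index only see updates whose lag has elapsed.
        still_pending = []
        for visible_at, key, item_owner in self._pending:
            if visible_at <= self._clock.now:
                self._indexed[key] = item_owner
            else:
                still_pending.append((visible_at, key, item_owner))
        self._pending = still_pending
        return sorted(k for k, o in self._indexed.items() if o == owner)
```

A test that writes an item and immediately queries the index now fails locally the same way it would against the real service, instead of passing by accident.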

replies(1): >>KAKAN+UX7
◧◩◪◨
35. __turb+wq3[view] [source] [discussion] 2024-08-28 06:49:24
>>singro+T9
Otel is nice and all but I still think you are best off going 100% all in prometheus. Prometheus is so common that it has become a de-facto standard in metrics.

At BigCo we have migrated a number of internal things to Otel but I don’t think it has been worth the effort.

So many projects come with prometheus metrics, dashboards, and alerts out of the box that it becomes hard to use anything else. When I pick some random helm chart to install you can almost guarantee that it comes with prometheus integrations.

With grafana mimir you can now scale easily to a few billion metrics streams so a lot of the issues with the old model of prometheus have been fixed.

Like you said I don’t think there is much to innovate on in this area, which is a good thing.

36. __turb+mr3[view] [source] 2024-08-28 06:57:18
>>borisj+(OP)
Pushing as much down to the infra sounds like AWS Lambda and friends. You basically upload a zip or container and say, “just run this business code somewhere, I don’t care”. OCI bundles are basically a two-way door at this point: you can build them with many tools, and run them with many other tools.

It works great for stateless things, but not so great for stateful things. I guess this plays into state being persisted in object storage or DBs, which allows the application to be stateless.

◧◩
37. filter+z36[view] [source] [discussion] 2024-08-29 02:49:27
>>purple+sp
> ... closed-loop feedback systems ...

It's good to actually see even a mention of control theory.

My degree was electronics and control theory and whilst I've only had one job that involved either electronics or control theory I often think about software in these terms: I genuinely think that as an industry we need to seriously consider the systems we build in control theoretic terms.
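
A tiny example of what thinking "in control theoretic terms" can mean for software: a proportional feedback loop, where each correction is a fraction (`gain`) of the current error. The names `proportional_step` and `simulate` are hypothetical, and the numbers are arbitrary; this is a sketch of the textbook idea, not of how mgmt or any autoscaler works.

```python
def proportional_step(setpoint, measured, gain):
    """Correction proportional to the current error."""
    return gain * (setpoint - measured)


def simulate(setpoint, initial, gain, steps):
    """Apply the correction repeatedly and record the value after each step."""
    value = initial
    history = [value]
    for _ in range(steps):
        value += proportional_step(setpoint, value, gain)
        history.append(value)
    return history
```

With a gain of 0.5 the error halves every step, so the system converges on the setpoint. In this model the error is multiplied by `(1 - gain)` each step, so any gain above 2 makes the loop overshoot by more each time and diverge, which is the sort of failure that reconcile loops, retries, and autoscalers can exhibit when nobody has asked the stability question.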

◧◩◪◨⬒⬓
38. KAKAN+UX7[view] [source] [discussion] 2024-08-29 18:53:06
>>cybera+kS1
This reminds me of an article I read somewhere (probably here in HN) wherein people implementing Banking Services just straight up test the API in Production after a few cycles of mock development, due to constantly having to deal with edge cases not present in the dev env.
◧◩◪◨⬒
39. pjmlp+q79[view] [source] [discussion] 2024-08-30 07:43:57
>>hnthro+cZ
Even Microsoft themselves have a bunch of products that still require .NET Framework.

SharePoint CSM, Dynamics, SQL Server CLR, Visual Studio extensions, Office AddIns.

◧◩◪◨
40. pjmlp+v79[view] [source] [discussion] 2024-08-30 07:45:08
>>purple+po1
There is a reason why most cloud providers now sell managed Kubernetes.
◧◩◪
41. pjmlp+C79[view] [source] [discussion] 2024-08-30 07:46:40
>>mikepu+Ta
We are back to timesharing days, in better clothing, and that is non-negotiable from management's point of view.
◧◩
42. levent+Eib[view] [source] [discussion] 2024-08-31 06:32:58
>>willva+yo
BigQuery Storage Read API claims to support filters and simple projections pushed down to the storage: https://cloud.google.com/bigquery/docs/reference/storage. See also this recent paper: https://research.google/pubs/biglake-bigquerys-evolution-tow...

I've also recently proposed a Table Read protocol that should be a "non-vendor-controlled" equivalent of BigQuery Storage APIs: https://engineeringideas.substack.com/p/table-transfer-proto...
