Hello HN, we're Gabe and Alexander from Hatchet (https://hatchet.run), and we're working on an open-source, distributed task queue. It's an alternative to tools like Celery for Python and BullMQ for Node.js, primarily focused on reliability and observability. It uses Postgres for the underlying queue.
Why build another managed queue? We wanted to build something with the benefits of full transactional enqueueing - particularly for dependent, DAG-style execution - and felt strongly that Postgres solves 99.9% of queueing use-cases better than most alternatives (Celery uses Redis or RabbitMQ as a broker, BullMQ uses Redis). Since the introduction of SKIP LOCKED and the milestones of recent PG releases (like active-active replication), it's becoming more feasible to horizontally scale Postgres across multiple regions and vertically scale to 10k TPS or more. Many queues (like BullMQ) are built on Redis, where data loss can occur under OOM conditions if you're not careful; using PG avoids an entire class of problems.
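To make the SKIP LOCKED point concrete, here's a minimal sketch of a transactional dequeue in Python - the `tasks` table and columns are illustrative, not Hatchet's actual schema:

    # Minimal SKIP LOCKED dequeue; the "tasks" table and its columns are
    # illustrative, not Hatchet's schema.
    import psycopg2

    conn = psycopg2.connect("postgresql://localhost/queue_demo")

    def dequeue_one():
        with conn:  # commits on success, rolls back on error
            with conn.cursor() as cur:
                cur.execute(
                    """
                    UPDATE tasks
                    SET status = 'running', started_at = now()
                    WHERE id = (
                        SELECT id FROM tasks
                        WHERE status = 'queued'
                        ORDER BY created_at
                        LIMIT 1
                        FOR UPDATE SKIP LOCKED
                    )
                    RETURNING id, payload;
                    """
                )
                return cur.fetchone()  # None if the queue is empty

Because rows are claimed under SKIP LOCKED, many workers can poll the same table concurrently without blocking on each other's locks, and enqueueing can happen in the same transaction as your application's other writes.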
We also wanted something that was significantly easier to use and debug for application developers. A lot of times the burden of building task observability falls on the infra/platform team (for example, asking the infra team to build a Grafana view for their tasks based on exported prom metrics). We're building this type of observability directly into Hatchet.
What do we mean by "distributed"? You can run workers (the instances which run tasks) across multiple VMs, clusters and regions - they are remotely invoked via a long-lived gRPC connection with the Hatchet queue. We've attempted to optimize our latency to get task start times down to 25-50ms, and much more optimization is on the roadmap.
We also support a number of extra features that you'd expect, like retries, timeouts, cron schedules, and dependent tasks. A few things we're currently working on: we use RabbitMQ (confusing, yes) for pub/sub between engine components and would prefer to just use Postgres, but we didn't want to spend additional time on the exchange logic until we'd built a stable underlying queue. We are also considering the use of NATS for engine-engine and engine-worker connections.
We'd greatly appreciate any feedback you have and hope you get the chance to try out Hatchet.
[1] https://github.com/hatchet-dev/hatchet/blob/main/README.md#h...
> Welcome to Hatchet! This guide walks you through getting set up on Hatchet Cloud. If you'd like to self-host Hatchet, please see the self-hosted quickstart instead.
but the link to "self-hosted quickstart" links back to the same page
Hatchet looks cool nonetheless. Queues are a pain for many other use-cases too.
We also store the input/output of each workflow step in the database. So resuming a multi-step workflow is pretty simple - we just replay the step with the same input.
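Conceptually, the replay looks something like this - the `step_runs` table and the handler registry below are a sketch of the idea, not Hatchet's internals:

    # Sketch of replaying a step from its persisted input. The step_runs table
    # (with JSON text columns) and the handlers dict are hypothetical.
    import json

    def replay_step(conn, step_run_id, handlers):
        with conn.cursor() as cur:
            cur.execute(
                "SELECT step_name, input FROM step_runs WHERE id = %s",
                (step_run_id,),
            )
            step_name, stored_input = cur.fetchone()

        # Re-invoke the same handler with the exact input it saw the first time.
        output = handlers[step_name](json.loads(stored_input))

        with conn.cursor() as cur:
            cur.execute(
                "UPDATE step_runs SET output = %s, status = 'succeeded' WHERE id = %s",
                (json.dumps(output), step_run_id),
            )
        conn.commit()
        return output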
To zoom out a bit - unlike many alternatives [2], the execution path of a multi-step workflow in Hatchet is declared ahead of time. There are tradeoffs to this approach: it makes it much easier to run a single-step workflow, or any workflow whose execution path you know ahead of time. You also avoid classes of problems related to workflow versioning, since we can gracefully drain older workflow versions with a different execution path. And it's more natural to debug and visualize a DAG execution than to debug procedural logic.
The clear tradeoff is that you can't try...catch the execution of a single task or concatenate a bunch of futures that you wait for later. Roadmap-wise, we're considering adding procedural execution on top of our workflows concept, which means providing a nice API for calling `await workflow.run` and capturing errors. These would be a higher-level concept in Hatchet and are not built yet.
There are some interesting concepts around using semaphores and durable leases that are relevant here, which we're exploring [3].
[1] https://docs.hatchet.run/home/basics/workflows [2] https://temporal.io [3] https://www.citusdata.com/blog/2016/08/12/state-machines-to-...
https://temporal.io/ https://cadenceworkflow.io/ https://conductor-oss.org/
The component which needs the highest uptime is our ingestion service [1]. This ingests events from the Hatchet SDKs and is responsible for writing the workflow execution path, and then sends messages downstream to our other engine components. This is a horizontally scalable service and you should run at least 2 replicas across different AZs. Also see how to configure different services for engine components [2].
The other piece of this is PostgreSQL; use your favorite managed provider that has point-in-time restores and backups. This is the core of our self-healing; I'm not sure where it makes sense to route writes if the primary goes down.
Let me know what you need for self-hosted docs, happy to write them up for you.
[1] https://github.com/hatchet-dev/hatchet/tree/main/internal/se... [2] https://docs.hatchet.run/self-hosting/configuration-options#...
Yes, I'm not a fan of the RabbitMQ dependency either - see here for the reasoning: >>39643940 .
It would take some work to replace this with listen/notify in Postgres, less work to replace this with an in-memory component, but we can't provide the same guarantees in that case.
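For anyone unfamiliar, the Postgres side of that is pretty small; a bare-bones LISTEN/NOTIFY subscriber in Python (generic Postgres, not our engine code) looks like:

    # Minimal LISTEN/NOTIFY subscriber using psycopg2; the channel name is
    # illustrative.
    import select
    import psycopg2
    import psycopg2.extensions

    conn = psycopg2.connect("postgresql://localhost/hatchet_demo")
    conn.set_isolation_level(psycopg2.extensions.ISOLATION_LEVEL_AUTOCOMMIT)

    with conn.cursor() as cur:
        cur.execute("LISTEN task_events;")

    while True:
        # Block until the socket is readable, then drain pending notifications.
        if select.select([conn], [], [], 5) == ([], [], []):
            continue  # timed out, poll again
        conn.poll()
        while conn.notifies:
            note = conn.notifies.pop(0)
            print(f"channel={note.channel} payload={note.payload}")

The catch is that NOTIFY is fire-and-forget: a listener that's disconnected when the notification fires never sees it, so you still need a durable table (or broker) behind it for the guarantees we're after.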
I love the simplicity & approachability of Deno queues for example, but I’d need to roll my own way to subscribe to task status from the client.
Wondering if perhaps the Postgres underpinnings here would make that possible.
EDIT: seems so! https://docs.hatchet.run/home/features/streaming
It's dead simple: the existence of the URI means the topic/channel/what-have-you exists; to access it you need to know the URI; data is streamed, but there's no access to old data; multiple consumers are no problem.
Long live Postgres queues.
The daemon feels fragile to me; why not just shut down the worker client-side after some period of inactivity?
Would be interested to know what features you feel it’s lacking.
> I'm wondering if you could i.e. have a task act as a generator and yield values, or just return a list, and have each individual item get passed off to its own execution of the next task(s) in the DAG.
Yeah, we were having a conversation yesterday about this - there's probably a simple decorator we could add so that if a step returns an array and a child step depends on that parent step, it fans out when a `fanout` key is set. If we can avoid unstructured trace diagrams in favor of a nice DAG-style workflow execution, we'd prefer to support that.
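To make that concrete, here's a rough sketch of what such a fanout helper could look like - the decorator and `ctx.enqueue_child` are made-up names, not the current SDK:

    # Hypothetical fanout helper: if the parent step returned a list, enqueue one
    # child run per element. The decorator and ctx.enqueue_child are illustrative,
    # not part of the current Hatchet SDK.
    import functools

    def fanout(child_step):
        def decorator(fn):
            @functools.wraps(fn)
            def wrapper(ctx, *args, **kwargs):
                result = fn(ctx, *args, **kwargs)
                if isinstance(result, list):
                    for item in result:
                        ctx.enqueue_child(child_step, input=item)
                return result
            return wrapper
        return decorator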
The other thing we've started on is propagating a single "flow id" to each child workflow so we can provide the same visualization/tracing that we provide for each workflow execution. This is similar to AWS X-Ray.
As I mentioned we're working on the durable workflow model, and we'll find a way to make child workflows durable in the same way activities (and child workflows) are durable on Temporal.
[1] https://docs.hatchet.run/sdks/typescript-sdk/api/admin-clien...
> How do you distribute inference across workers?
In Hatchet, "run inference" would be a task. By default, tasks get randomly assigned to workers in a FIFO fashion. But we give you a few options for controlling how tasks get ordered and sent. For example, let's say you'd like to limit users to 1 inference task at a time per session. You could do this by setting a concurrency key "<session-id>" and `maxRuns=1` [1]. This means that for each session key, you only run 1 inference task. The purpose of this would be fairness.
> Can one use just any protocol
We handle the communication between the worker and the queue through a gRPC connection. We assume that you're passing JSON-serializable objects through the queue.
[1] https://docs.hatchet.run/home/features/concurrency/round-rob...
We're both second-time CTOs and we've been on both sides of this, as consumers of and creators of OSS. I was previously a co-founder and CTO of Porter [2], which had an open-core model. There are two risks that most companies think about in the open-core model:
1. Big companies using your platform without contributing back in some way or buying a license. I think this is less of a risk, because these organizations are incentivized to buy a support license to help with maintenance and upgrades and, since we sit on a critical path, with uptime.
2. Hyperscalers folding your product into their offering [3]. This is a bigger risk, but it's also a bit of a "champagne problem".
Note that smaller companies/individual developers are who we'd like to enable, not crowd out. If people would like to use our cloud offering because it reduces the headache for them, they should do so. If they just want to run our service and manage their own PostgreSQL, they should have the option to do that too.
Based on all of this, here's where we land on things:
1. Everything we've built so far has been 100% MIT licensed. We'd like to keep it that way and make money off of Hatchet Cloud. We'll likely roll out a separate enterprise support agreement for self-hosting.
2. Our cloud version isn't going to run a different core engine or API server than our open source version. We'll write interfaces for all plugins to our servers and engines, so even if we have something super specific to how we've chosen to do things on the cloud version, we'll expose the options to write your own plugins on the engine and server.
3. We'd like to make self-hosting as easy to use as our cloud version. We don't want our self-hosted offering to be a second-class citizen.
Would love to hear everyone's thoughts on this.
> Do you publish pricing for your cloud offering?
Not yet; we're rolling out the cloud offering slowly to make sure we don't experience any widespread outages. As soon as we're open for self-serve on the cloud side, we'll publish our pricing model.
> For the self hosted option, are there plans to create a Kubernetes operator?
Not at the moment; our initial plan was to help folks with a KEDA autoscaling setup based on Hatchet queue metrics, which is something I've done with Sidekiq queue depth. We'll probably wait to build a k8s operator until our existing Helm chart is relatively stable.
> With an MIT license do you fear Amazon could create a Amazon Hatchet Service sometime in the future?
Yes. The question is whether that risk is worth the tradeoff of not being MIT-licensed. There are also paths to getting integrated into AWS marketplace we'll explore longer-term. I added some thoughts here: >>39646788 .
There's still a lot of work to do on optimization though, particularly to improve the polling interval when there aren't workers available to run the task. Some people might expect to set a max concurrency limit of 1 on each worker and have each subsequent workflow take 50ms to start, which isn't the case at the moment.
[1] https://github.com/hatchet-dev/hatchet/tree/main/examples/lo...
'But I am really saying, I'm dubious of anyone promoting "Use my new thing X which is good because it doesn't introduce a new dependency."'
"Advances in software technology and increasing economic pressure have begun to break down many of the barriers to improved software productivity. The ${PRODUCT} is designed to remove the remaining barriers […]"
It reads like the above quote from the pitch for the R1000 in 1985. https://datamuseum.dk/bits/30003882
If you're saying that the scheduling in Hatchet should be a separate library, we rely on go-cron [1] to run cron schedules.
It's not an eternity in a task queue that supports DAG-style workflows with concurrency limits and fairness strategies. The reason is that you need to check all of the subscribed workers and assign a task in a transactional way.
The limit at the Postgres level is probably on the order of 5-10ms on a managed PG provider. Have a look at: >>39593384 .
Also, these are not my benchmarks, but have a look at [1] for Temporal timings.
[1] https://www.windmill.dev/blog/launch-week-1/fastest-workflow...
This seems like a lot of boilerplate to write functions with, to me (context: I created http://github.com/DAGWorks-Inc/hamilton).
1. Functions which allow you to declaratively sleep until a specific time, automatically rescheduling jobs (https://www.inngest.com/docs/reference/functions/step-sleep-...).
2. Declarative cancellation, which allows you to cancel jobs if the user reschedules their appointment automatically (https://www.inngest.com/docs/guides/cancel-running-functions).
3. General reliability and API access.
Inngest does that for you, but again — disclaimer, I made it and am biased.
https://renegadeotter.com/2023/11/30/job-queues-with-postrgr...
Like I mentioned here [1], we'll expand our comparison section over time. If Pueue's an alternative people are asking about, we'll definitely put it in there.
> Having the possibility to schedule stuff in a smart way is nice and all, but how do you overlook it? It's important to get a good overview of how your tasks perform.
I'm not sure what you mean by this. Perhaps you're referring to this - >>39647154 - in which case I'd say: most software is far from perfect. Our scheduling works but has limitations and is being refactored before we advertise it and build it into our other SDKs.
[1] >>39643631
I'm personally very excited about River and I think it fills an important gap in the Go ecosystem! Also now that sqlc w/ pgx seems to be getting more popular, it's very easy to integrate.
Having HTTP targets means you get things like rate limiting, middleware, and observability that your regular application uses, and you aren't tied to whatever backend the task system supports.
Set up a separate scaling group and away you go.
[0] https://github.com/wakatime/wakaq
[1] https://github.com/wakatime/wakaq-ts
[2] >>32730038
Plus some adjacent discussion on GitHub: https://github.com/prometheus/client_python/issues/902
Hope that helps!
You say Celery can use Redis or RabbitMQ as a broker, but I've also used it with Postgres as a broker successfully, although on a smaller scale (just a single DB node). It's undocumented, so I definitely won't recommend anybody use this in production now, but it seems to still work fine. [1]
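Concretely, that setup goes through Kombu's SQLAlchemy transport, roughly like this (connection strings are illustrative):

    # Celery using Postgres as the broker via Kombu's SQLAlchemy transport
    # (experimental); connection strings are illustrative.
    from celery import Celery

    app = Celery(
        "tasks",
        broker="sqla+postgresql://user:pass@localhost/celery_broker",
        backend="db+postgresql://user:pass@localhost/celery_results",
    )

    @app.task
    def add(x, y):
        return x + y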
How does Hatchet compare to this setup? Also, have you considered making a plugin backend for Celery, so that old systems can be ported more easily?
[1] https://www.temporal.io/replay/videos/zero-downtime-deploys-...
Here are the most heavily upvoted discussions in the past 12 months:
Hatchet >>39643136
Inngest >>36403014
Windmill >>35920082
HN comments on Temporal.io https://github.com/temporalio https://hn.algolia.com/?dateRange=all&page=0&prefix=false&qu...
Internally we rant about the complexity of the above projects vs. using transactional job queue libs like:
river >>38349716
neoq: https://github.com/acaloiaro/neoq
gue: https://github.com/vgarvardt/gue
Deep inside, I can't wait to see someone like ThePrimeTimeagen review it ;) https://www.youtube.com/@ThePrimeTimeagen
The license is more permissive than ours (MIT vs AGPLv3), and you're using Go where we use Rust, but other than that the architecture looks extremely similar: also based mostly on Postgres, with the same insight as us that it's sufficient. I'm curious where you see the main differentiator long-term.
Like I mention in that comment, we'd like to keep our repository 100% MIT licensed. I realize this is unpopular among open source startups - and I'm sure there are good reasons for that. We've considered these reasons and still landed on the MIT license.
[1] >>39647101
[2] >>39646788
> I'm curious how you're building a money making business around an open source product.
We'd like to make money off of our cloud version. See the comment on pricing here - >>39653084 - which also links to other comments about pricing, sorry about that.
We still need to do some work on this feature, though; we'll make sure to document it when it's well-supported.
You simply define a task using our API and we take care of pushing it to any HTTP endpoint, holding the connection open and using the HTTP status code to determine success/failure, whether or not we should retry, etc.
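On your side, the task target is just a normal HTTP handler that signals success or failure with its status code; a minimal sketch (Flask and the route/payload shape are arbitrary choices for illustration):

    # Minimal sketch of a task target endpoint: a 2xx response tells the queue
    # the task succeeded; a non-2xx signals failure so it can be retried.
    # Flask and the route/payload shape are arbitrary choices for illustration.
    from flask import Flask, request, jsonify

    app = Flask(__name__)

    def deliver_email(to: str) -> None:
        ...  # your application logic goes here

    @app.route("/tasks/send-welcome-email", methods=["POST"])
    def send_welcome_email():
        payload = request.get_json()
        try:
            deliver_email(payload["to"])
        except Exception:
            return jsonify({"status": "retry"}), 500  # non-2xx => queue retries
        return jsonify({"status": "ok"}), 200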
Happy to answer any questions here or over email james@mergent.co