Hello HN, we're Gabe and Alexander from Hatchet (https://hatchet.run), and we're working on an open-source, distributed task queue. It's an alternative to tools like Celery for Python and BullMQ for Node.js, primarily focused on reliability and observability. It uses Postgres for the underlying queue.
Why build another managed queue? We wanted to build something with the benefits of full transactional enqueueing - particularly for dependent, DAG-style execution - and felt strongly that Postgres solves 99.9% of queueing use-cases better than most alternatives (Celery uses Redis or RabbitMQ as a broker, BullMQ uses Redis). Since the introduction of SKIP LOCKED and the milestones of recent PG releases (like active-active replication), it's becoming more feasible to horizontally scale Postgres across multiple regions and vertically scale to 10k TPS or more. Many queues (like BullMQ) are built on Redis, where data loss can occur under OOM conditions if you're not careful; using PG avoids an entire class of problems.
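To make the SKIP LOCKED point concrete, here's a minimal sketch of a transactional dequeue in Python - the `tasks` table and columns are illustrative, not Hatchet's actual schema:

    # Minimal SKIP LOCKED dequeue; the "tasks" table and its columns are
    # illustrative, not Hatchet's schema.
    import psycopg2

    conn = psycopg2.connect("postgresql://localhost/queue_demo")

    def dequeue_one():
        with conn:  # commits on success, rolls back on error
            with conn.cursor() as cur:
                cur.execute(
                    """
                    UPDATE tasks
                    SET status = 'running', started_at = now()
                    WHERE id = (
                        SELECT id FROM tasks
                        WHERE status = 'queued'
                        ORDER BY created_at
                        LIMIT 1
                        FOR UPDATE SKIP LOCKED
                    )
                    RETURNING id, payload;
                    """
                )
                return cur.fetchone()  # None if the queue is empty

Because rows are claimed under SKIP LOCKED, many workers can poll the same table concurrently without blocking on each other's locks, and enqueueing can happen in the same transaction as your application's other writes.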
We also wanted something that was significantly easier to use and debug for application developers. A lot of times the burden of building task observability falls on the infra/platform team (for example, asking the infra team to build a Grafana view for their tasks based on exported prom metrics). We're building this type of observability directly into Hatchet.
What do we mean by "distributed"? You can run workers (the instances which run tasks) across multiple VMs, clusters and regions - they are remotely invoked via a long-lived gRPC connection with the Hatchet queue. We've attempted to optimize our latency to get task start times down to 25-50ms, and much more optimization is on the roadmap.
We also support a number of extra features that you'd expect, like retries, timeouts, cron schedules, and dependent tasks. A few things we're currently working on: we use RabbitMQ (confusing, yes) for pub/sub between engine components and would prefer to just use Postgres, but we didn't want to spend additional time on the exchange logic until we'd built a stable underlying queue. We are also considering the use of NATS for engine-engine and engine-worker connections.
We'd greatly appreciate any feedback you have and hope you get the chance to try out Hatchet.
[1] https://github.com/hatchet-dev/hatchet/blob/main/README.md#h...
> Welcome to Hatchet! This guide walks you through getting set up on Hatchet Cloud. If you'd like to self-host Hatchet, please see the self-hosted quickstart instead.
but the link to "self-hosted quickstart" links back to the same page
Hatchet looks cool nonetheless. Queues are a pain for many other use-cases too.
We also store the input/output of each workflow step in the database. So resuming a multi-step workflow is pretty simple - we just replay the step with the same input.
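Conceptually, the replay looks something like this - the `step_runs` table and the handler registry below are a sketch of the idea, not Hatchet's internals:

    # Sketch of replaying a step from its persisted input. The step_runs table
    # (with JSON text columns) and the handlers dict are hypothetical.
    import json

    def replay_step(conn, step_run_id, handlers):
        with conn.cursor() as cur:
            cur.execute(
                "SELECT step_name, input FROM step_runs WHERE id = %s",
                (step_run_id,),
            )
            step_name, stored_input = cur.fetchone()

        # Re-invoke the same handler with the exact input it saw the first time.
        output = handlers[step_name](json.loads(stored_input))

        with conn.cursor() as cur:
            cur.execute(
                "UPDATE step_runs SET output = %s, status = 'succeeded' WHERE id = %s",
                (json.dumps(output), step_run_id),
            )
        conn.commit()
        return output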
To zoom out a bit - unlike many alternatives [2], the execution path of a multi-step workflow in Hatchet is declared ahead of time. There are tradeoffs to this approach: it makes it much easier to run a single-step workflow, or any workflow whose execution path you know ahead of time. You also avoid classes of problems related to workflow versioning, since we can gracefully drain older workflow versions with a different execution path. And it's more natural to debug and visualize a DAG execution than to debug procedural logic.
The clear tradeoff is that you can't try...catch the execution of a single task or concatenate a bunch of futures that you wait for later. Roadmap-wise, we're considering adding procedural execution on top of our workflows concept, which means providing a nice API for calling `await workflow.run` and capturing errors. These would be a higher-level concept in Hatchet and are not built yet.
There are some interesting concepts around using semaphores and durable leases that are relevant here, which we're exploring [3].
[1] https://docs.hatchet.run/home/basics/workflows [2] https://temporal.io [3] https://www.citusdata.com/blog/2016/08/12/state-machines-to-...
https://temporal.io/ https://cadenceworkflow.io/ https://conductor-oss.org/
The component which needs the highest uptime is our ingestion service [1]. This ingests events from the Hatchet SDKs and is responsible for writing the workflow execution path, and then sends messages downstream to our other engine components. This is a horizontally scalable service and you should run at least 2 replicas across different AZs. Also see how to configure different services for engine components [2].
The other piece of this is PostgreSQL; use your favorite managed provider that has point-in-time restores and backups. This is the core of our self-healing; I'm not sure where it makes sense to route writes if the primary goes down.
Let me know what you need for self-hosted docs, happy to write them up for you.
[1] https://github.com/hatchet-dev/hatchet/tree/main/internal/se... [2] https://docs.hatchet.run/self-hosting/configuration-options#...
Yes, I'm not a fan of the RabbitMQ dependency either - see here for the reasoning: >>39643940 .
It would take some work to replace this with listen/notify in Postgres, less work to replace this with an in-memory component, but we can't provide the same guarantees in that case.
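For anyone unfamiliar, the Postgres side of that is pretty small; a bare-bones LISTEN/NOTIFY subscriber in Python (generic Postgres, not our engine code) looks like:

    # Minimal LISTEN/NOTIFY subscriber using psycopg2; the channel name is
    # illustrative.
    import select
    import psycopg2
    import psycopg2.extensions

    conn = psycopg2.connect("postgresql://localhost/hatchet_demo")
    conn.set_isolation_level(psycopg2.extensions.ISOLATION_LEVEL_AUTOCOMMIT)

    with conn.cursor() as cur:
        cur.execute("LISTEN task_events;")

    while True:
        # Block until the socket is readable, then drain pending notifications.
        if select.select([conn], [], [], 5) == ([], [], []):
            continue  # timed out, poll again
        conn.poll()
        while conn.notifies:
            note = conn.notifies.pop(0)
            print(f"channel={note.channel} payload={note.payload}")

The catch is that NOTIFY is fire-and-forget: a listener that's disconnected when the notification fires never sees it, so you still need a durable table (or broker) behind it for the guarantees we're after.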
I love the simplicity & approachability of Deno queues for example, but I’d need to roll my own way to subscribe to task status from the client.
Wondering if perhaps the Postgres underpinnings here would make that possible.
EDIT: seems so! https://docs.hatchet.run/home/features/streaming
It's dead simple: the existence of the URI means the topic/channel/what-have-you exists; to access it you need to know the URI; data is streamed, but there's no access to old data; multiple consumers are no problem.
Long live Postgres queues.
The daemon feels fragile to me; why not just shut down the worker client-side after some period of inactivity?
Would be interested to know what features you feel it’s lacking.
> I'm wondering if you could i.e. have a task act as a generator and yield values, or just return a list, and have each individual item get passed off to its own execution of the next task(s) in the DAG.
Yeah, we were having a conversation yesterday about this - there's probably a simple decorator we could add so that if a step returns an array and a child step depends on that parent step, it fans out when a `fanout` key is set. If we can avoid unstructured trace diagrams in favor of a nice DAG-style workflow execution, we'd prefer to support that.
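To make that concrete, here's a rough sketch of what such a fanout helper could look like - the decorator and `ctx.enqueue_child` are made-up names, not the current SDK:

    # Hypothetical fanout helper: if the parent step returned a list, enqueue one
    # child run per element. The decorator and ctx.enqueue_child are illustrative,
    # not part of the current Hatchet SDK.
    import functools

    def fanout(child_step):
        def decorator(fn):
            @functools.wraps(fn)
            def wrapper(ctx, *args, **kwargs):
                result = fn(ctx, *args, **kwargs)
                if isinstance(result, list):
                    for item in result:
                        ctx.enqueue_child(child_step, input=item)
                return result
            return wrapper
        return decorator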
The other thing we've started on is propagating a single "flow id" to each child workflow so we can provide the same visualization/tracing that we provide for each workflow execution. This is similar to AWS X-Ray.
As I mentioned we're working on the durable workflow model, and we'll find a way to make child workflows durable in the same way activities (and child workflows) are durable on Temporal.
[1] https://docs.hatchet.run/sdks/typescript-sdk/api/admin-clien...
> How do you distribute inference across workers?
In Hatchet, "run inference" would be a task. By default, tasks get randomly assigned to workers in a FIFO fashion. But we give you a few options for controlling how tasks get ordered and sent. For example, let's say you'd like to limit users to 1 inference task at a time per session. You could do this by setting a concurrency key "<session-id>" and `maxRuns=1` [1]. This means that for each session key, you only run 1 inference task. The purpose of this would be fairness.
> Can one use just any protocol
We handle the communication between the worker and the queue through a gRPC connection. We assume that you're passing JSON-serializable objects through the queue.
[1] https://docs.hatchet.run/home/features/concurrency/round-rob...
We're both second-time CTOs and we've been on both sides of this, as consumers of and creators of OSS. I was previously a co-founder and CTO of Porter [2], which had an open-core model. There are two risks that most companies think about in the open-core model:
1. Big companies using your platform without contributing back in some way or buying a license. I think this is less of a risk, because these organizations are incentivized to buy a support license to help with maintenance and upgrades and, since we sit on a critical path, with uptime.
2. Hyperscalers folding your product into their offering [3]. This is a bigger risk, but it's also a bit of a "champagne problem".
Note that smaller companies/individual developers are who we'd like to enable, not crowd out. If people would like to use our cloud offering because it reduces the headache for them, they should do so. If they just want to run our service and manage their own PostgreSQL, they should have the option to do that too.
Based on all of this, here's where we land on things:
1. Everything we've built so far has been 100% MIT licensed. We'd like to keep it that way and make money off of Hatchet Cloud. We'll likely roll out a separate enterprise support agreement for self-hosting.
2. Our cloud version isn't going to run a different core engine or API server than our open source version. We'll write interfaces for all plugins to our servers and engines, so even if we have something super specific to how we've chosen to do things on the cloud version, we'll expose the options to write your own plugins on the engine and server.
3. We'd like to make self-hosting as easy to use as our cloud version. We don't want our self-hosted offering to be a second-class citizen.
Would love to hear everyone's thoughts on this.
> Do you publish pricing for your cloud offering?
Not yet; we're rolling out the cloud offering slowly to make sure we don't experience any widespread outages. As soon as we're open for self-serve on the cloud side, we'll publish our pricing model.
> For the self hosted option, are there plans to create a Kubernetes operator?
Not at the moment; our initial plan was to help folks with a KEDA autoscaling setup based on Hatchet queue metrics, which is something I've done with Sidekiq queue depth. We'll probably wait to build a k8s operator until our existing Helm chart is relatively stable.
> With an MIT license do you fear Amazon could create a Amazon Hatchet Service sometime in the future?
Yes. The question is whether that risk is worth the tradeoff of not being MIT-licensed. There are also paths to getting integrated into AWS marketplace we'll explore longer-term. I added some thoughts here: >>39646788 .
There's still a lot of work to do on optimization though, particularly to improve the polling interval when there aren't workers available to run the task. Some people might expect to set a max concurrency limit of 1 on each worker and have each subsequent workflow take 50ms to start, which isn't the case at the moment.
[1] https://github.com/hatchet-dev/hatchet/tree/main/examples/lo...
'But I am really saying, I'm dubious of anyone promoting "Use my new thing X which is good because it doesn't introduce a new dependency."'
"Advances in software technology and increasing economic pressure have begun to break down many of the barriers to improved software productivity. The ${PRODUCT} is designed to remove the remaining barriers […]"
It reads like the above quote from the pitch for the R1000 in 1985. https://datamuseum.dk/bits/30003882
If you're saying that the scheduling in Hatchet should be a separate library, we rely on go-cron [1] to run cron schedules.
It's not an eternity in a task queue that supports DAG-style workflows with concurrency limits and fairness strategies. The reason is that you need to check all of the subscribed workers and assign a task in a transactional way.
The limit at the Postgres level is probably on the order of 5-10ms on a managed PG provider. Have a look at: >>39593384 .
Also, these are not my benchmarks, but have a look at [1] for Temporal timings.
[1] https://www.windmill.dev/blog/launch-week-1/fastest-workflow...
This seems like a lot of boilerplate to write functions with, to me (context: I created http://github.com/DAGWorks-Inc/hamilton).
1. Functions which allow you to declaratively sleep until a specific time, automatically rescheduling jobs (https://www.inngest.com/docs/reference/functions/step-sleep-...).
2. Declarative cancellation, which allows you to cancel jobs if the user reschedules their appointment automatically (https://www.inngest.com/docs/guides/cancel-running-functions).
3. General reliability and API access.
Inngest does that for you, but again — disclaimer, I made it and am biased.
https://renegadeotter.com/2023/11/30/job-queues-with-postrgr...
Like I mentioned here [1], we'll expand our comparison section over time. If Pueue's an alternative people are asking about, we'll definitely put it in there.
> Having the possibility to schedule stuff in a smart way is nice and all, but how do you overlook it? It's important to get a good overview of how your tasks perform.
I'm not sure what you mean by this. Perhaps you're referring to this - >>39647154 - in which case I'd say: most software is far from perfect. Our scheduling works but has limitations and is being refactored before we advertise it and build it into our other SDKs.
[1] >>39643631
I'm personally very excited about River and I think it fills an important gap in the Go ecosystem! Also now that sqlc w/ pgx seems to be getting more popular, it's very easy to integrate.
Having HTTP targets means you get things like rate limiting, middleware, and observability that your regular application uses, and you aren't tied to whatever backend the task system supports.
Set up a separate scaling group and away you go.
[0] https://github.com/wakatime/wakaq
[1] https://github.com/wakatime/wakaq-ts
[2] >>32730038
Plus some adjacent discussion on GitHub: https://github.com/prometheus/client_python/issues/902
Hope that helps!
You say Celery can use Redis or RabbitMQ as a broker, but I've also used it with Postgres as a broker successfully, although on a smaller scale (just a single DB node). It's undocumented, so I definitely won't recommend anybody use this in production now, but it seems to still work fine. [1]
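Concretely, that setup goes through Kombu's SQLAlchemy transport, roughly like this (connection strings are illustrative):

    # Celery using Postgres as the broker via Kombu's SQLAlchemy transport
    # (experimental); connection strings are illustrative.
    from celery import Celery

    app = Celery(
        "tasks",
        broker="sqla+postgresql://user:pass@localhost/celery_broker",
        backend="db+postgresql://user:pass@localhost/celery_results",
    )

    @app.task
    def add(x, y):
        return x + y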
How does Hatchet compare to this setup? Also, have you considered making a plugin backend for Celery, so that old systems can be ported more easily?
[1] https://www.temporal.io/replay/videos/zero-downtime-deploys-...
Here are the most heavily upvoted discussions in the past 12 months:
Hatchet >>39643136
Inngest >>36403014
Windmill >>35920082
HN comments on Temporal.io https://github.com/temporalio https://hn.algolia.com/?dateRange=all&page=0&prefix=false&qu...
Internally we rant about the complexity of the above projects vs. using transactional job queue libs like:
river >>38349716
neoq: https://github.com/acaloiaro/neoq
gue: https://github.com/vgarvardt/gue
Deep inside, I can't wait to see someone like ThePrimeTimeagen review it ;) https://www.youtube.com/@ThePrimeTimeagen
The license is more permissive than ours (MIT vs AGPLv3), and you're using Go where we use Rust, but other than that the architecture looks extremely similar: also based mostly on Postgres, with the same insight as us that it's sufficient. I'm curious where you see the main differentiator long-term.
Like I mention in that comment, we'd like to keep our repository 100% MIT licensed. I realize this is unpopular among open source startups - and I'm sure there are good reasons for that. We've considered these reasons and still landed on the MIT license.
[1] >>39647101
[2] >>39646788
> I'm curious how you're building a money making business around an open source product.
We'd like to make money off of our cloud version. See the comment on pricing here - >>39653084 - which also links to other comments about pricing, sorry about that.
We still need to do some work on this feature, though; we'll make sure to document it when it's well-supported.
You simply define a task using our API and we take care of pushing it to any HTTP endpoint, holding the connection open and using the HTTP status code to determine success/failure, whether or not we should retry, etc.
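On your side, the task target is just a normal HTTP handler that signals success or failure with its status code; a minimal sketch (Flask and the route/payload shape are arbitrary choices for illustration):

    # Minimal sketch of a task target endpoint: a 2xx response tells the queue
    # the task succeeded; a non-2xx signals failure so it can be retried.
    # Flask and the route/payload shape are arbitrary choices for illustration.
    from flask import Flask, request, jsonify

    app = Flask(__name__)

    def deliver_email(to: str) -> None:
        ...  # your application logic goes here

    @app.route("/tasks/send-welcome-email", methods=["POST"])
    def send_welcome_email():
        payload = request.get_json()
        try:
            deliver_email(payload["to"])
        except Exception:
            return jsonify({"status": "retry"}), 500  # non-2xx => queue retries
        return jsonify({"status": "ok"}), 200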
Happy to answer any questions here or over email james@mergent.co