
Launch HN: Hatchet (YC W24) – Open-source task queue, now with a cloud version

submitted by abelan+(OP) on 2024-06-27 14:35:34 | 245 points 94 comments

Hey HN - this is Alexander and Gabe from Hatchet (https://hatchet.run). We’re building a modern task queue as an alternative to tools like Celery for Python and BullMQ for Node. Our open-source repo is at https://github.com/hatchet-dev/hatchet and is 100% MIT licensed.

When we did a Show HN a few months ago (https://news.ycombinator.com/item?id=39643136), our cloud version was invite-only and we were focused on our open-source offering.

Today we’re launching our self-serve cloud so that anyone can start creating tasks on our platform - you can get started at https://cloud.onhatchet.run, or use these credentials to access a demo (should be prefilled):

  URL: https://demo.hatchet-tools.com 
  Email: hacker@news.ycombinator.com
  Password: HatchetDemo123!
People are currently using Hatchet for a bunch of use-cases: orchestrating RAG pipelines, queueing up user notifications, building agentic LLM workflows, or scheduling image generation tasks on GPUs.

We built this out of frustration with existing tools and a conviction that PostgreSQL is the right choice for a task queue. Beyond the fact that many developers already have Postgres in their stack, which makes it easier to self-host Hatchet, it’s also easier to model higher-order concepts in Postgres, like chains of tasks (which we call workflows). In our system, the acknowledgement of the task, the task result, and the updates to higher-order models are done as part of the same Postgres transaction, which significantly reduces the risk of data loss and race conditions compared with other task queues (which usually pass acknowledgements through a broker, store task results elsewhere, and only then figure out the next task in the chain).
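To make that concrete, here's a minimal sketch of the worker-side transaction (table and column names are illustrative, not our actual schema), using psycopg2:

  import psycopg2
  from psycopg2.extras import Json

  conn = psycopg2.connect("postgresql://localhost/hatchet_example")

  def complete_task(task_id, result, next_task_input):
      # Ack the task, persist its result, and enqueue the next task in the
      # chain inside a single transaction - either all of it happens or none.
      with conn:  # commits on success, rolls back on exception
          with conn.cursor() as cur:
              cur.execute(
                  "UPDATE tasks SET status = 'completed', result = %s WHERE id = %s",
                  (Json(result), task_id),
              )
              cur.execute(
                  "INSERT INTO tasks (status, input) VALUES ('queued', %s)",
                  (Json(next_task_input),),
              )

If the worker dies midway, nothing commits and the task can simply be retried - there's no window where the ack landed in a broker but the result or the next task in the chain didn't.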

We also became increasingly frustrated with tools like Celery and the challenges they introduce when using a modern Python stack (> 3.5). We wrote up a list of these frustrations here: https://docs.hatchet.run/blog/problems-with-celery.

Since our Show HN, we’ve (partially or completely) addressed the most common pieces of feedback from the post, which we’ll outline here:

1. The most common ask was built-in support for fanout workflows — one task which triggers an arbitrary number of child tasks to run in parallel. We previously only supported DAG executions. We generalized this concept and launched child workflows (https://docs.hatchet.run/home/features/child-workflows); a rough sketch follows this list. This is the first step towards a developer-friendly model of durable execution.

2. Support for HTTP-based triggers — we’ve built out support for webhook workers (https://docs.hatchet.run/home/features/webhooks), which allow you to trigger any workflow over an HTTP webhook. This is particularly useful for apps on Vercel, which are subject to timeout limits of 60s, 300s, or 900s (depending on your tier).

3. Our RabbitMQ dependency — while we haven’t gotten rid of this completely, we’ve recently launched hatchet-lite (https://docs.hatchet.run/self-hosting/hatchet-lite), which lets you run the various Hatchet components in a single Docker image that bundles RabbitMQ along with a migration process, admin CLI, our REST API, and our gRPC engine. Hopefully the "lite" was a giveaway, but this is meant for local development and low-volume processing, on the order of hundreds of tasks per minute.
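As promised above, here's a rough sketch of what a fanout looks like with our Python SDK (method names are approximate - see the child-workflows docs for the exact API):

  from hatchet_sdk import Hatchet, Context

  hatchet = Hatchet()

  @hatchet.workflow(on_events=["batch:created"])
  class FanoutParent:
      @hatchet.step()
      def spawn_children(self, context: Context):
          # Spawn one child workflow per input item; the children run in
          # parallel, and their results are replayed rather than re-run if
          # this parent step has to re-execute.
          items = context.workflow_input()["items"]
          children = [
              context.spawn_workflow("ProcessItem", {"item": item})  # name/signature approximate
              for item in items
          ]
          return {"results": [child.result() for child in children]}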

We’ve also launched more features, like support for global rate limiting, steps which only run on workflow failure, and custom event streaming.

We’ll be here the whole day for questions and feedback, and look forward to hearing your thoughts!


◧◩
5. abelan+A9[view] [source] [discussion] 2024-06-27 15:26:58
>>dalber+r8
We're using RabbitMQ for pub/sub between different components of our engine. The actual task queue is entirely backed by Postgres, but things like streaming events between workers, and passing messages between components when you distribute the engine, currently go through RabbitMQ. I've written a little more about this here: >>39643940 .

We're eventually going to support a lightweight Postgres-backed messaging table, but the number of pub/sub messages sent through RabbitMQ is typically an order of magnitude higher than the number of tasks sent.
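For the curious, the simplest shape of Postgres-native pub/sub is LISTEN/NOTIFY; a messaging table would add durability on top of something like this. A sketch only (not a committed design), using psycopg2:

  import select
  import psycopg2
  import psycopg2.extensions

  conn = psycopg2.connect("postgresql://localhost/hatchet_example")
  conn.set_isolation_level(psycopg2.extensions.ISOLATION_LEVEL_AUTOCOMMIT)

  def publish(channel, payload):
      with conn.cursor() as cur:
          cur.execute("SELECT pg_notify(%s, %s)", (channel, payload))

  def subscribe(channel):
      with conn.cursor() as cur:
          cur.execute(f"LISTEN {channel}")  # channel name must be trusted
      while True:
          # Block until the socket is readable, then drain notifications.
          if select.select([conn], [], [], 5) == ([], [], []):
              continue
          conn.poll()
          while conn.notifies:
              yield conn.notifies.pop(0).payload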

◧◩
9. abelan+pd[view] [source] [discussion] 2024-06-27 15:51:58
>>cedws+La
Yeah, pretty much - that was more of a side project while figuring out what to work on next. Plus the Terraform licensing changes were on the horizon and I became a little frustrated with the whole ecosystem.

Part of the reason for working on Hatchet (this version) was that I built the Terraform management tool on top of Temporal and felt there was room for improvement.

(for those curious - https://github.com/hatchet-dev/hatchet-v1-archived)

12. gabev+eg[view] [source] 2024-06-27 16:03:01
>>abelan+(OP)
Hey, this is Gabe from zenfetch. Been following you guys for a few months now since your first launch. I definitely resonate with all the problems you've described regarding Celery's shortcomings / other distributed task queues. We're on Celery right now and have been through the wringer with various workflow platforms. The only reason we haven't switched to Hatchet is that we're finally in a stable place, though that might change soon, in which case I'd be very open to jumping ship.

I know a lot of folks are going after the AI agent workflow orchestration space - do you see yourselves progressing there?

In my head, Hatchet coupled with BAML (https://www.boundaryml.com/) could be an incredible combination to support these AI agents. Congrats on the launch

13. numloc+fg[view] [source] 2024-06-27 16:03:03
>>abelan+(OP)
Being MIT licensed, does that mean that another company could also offer this as a hosted solution? Did you think about encumbering it with a license that allows commercial use but prohibits resale?

Also, somewhat related, years ago I wrote a very small framework for fan-out of Django-based tasks in Celery. We have been running it in production for years. It doesn't have adoption beyond our company, but I think there are some good ideas in it. Feel free to take a look if it's of interest! https://github.com/groveco/django-sprinklers

◧◩
19. gabrie+kk[view] [source] [discussion] 2024-06-27 16:23:47
>>gabev+eg
Hi Gabe, also Gabe here. Yes, this is a core use case we're continuing to develop. Prior to Hatchet I spent some time as a contractor building LLM agents, where I was frustrated with the state of the tooling for orchestration and the lock-in of some of these platforms.

To that end, we’re building Hatchet to orchestrate agents with commonly needed features like streaming from running workers to the frontend [1] and rate limiting [2], without imposing too many opinions on core application logic.

[1] https://docs.hatchet.run/home/features/streaming [2] https://docs.hatchet.run/home/features/rate-limits

◧◩
20. abelan+5m[view] [source] [discussion] 2024-06-27 16:35:34
>>nickze+ud
> My main thing is the RabbitMQ dependency (that seems to be a topic of interest in this thread). Getting rid of that and just depending on PG seems like the main path forward that would increase adoption.

Yep, we agree - this is more a matter of bandwidth, as well as figuring out the final definition of the pub/sub interface. We'd prefer not to maintain two message queue implementations, but we likely won't drop the RabbitMQ implementation entirely even if we offer Postgres as an alternative. So if we do need to support two implementations, we'd prefer to build out a core set of features that we're happy with first. That said, the message queue API is definitely stabilizing (https://github.com/hatchet-dev/hatchet/blob/31cf5be248ff9ed7...), so I hope we can pick this up in the coming months.

> You also compare yourself against Celery and BullMQ, but there is also talk in the readme around durable execution. That to me puts you in the realm of Temporal. How would you say you compare/compete with Temporal? Are you looking to compete with them?

Yes, our child workflows feature is an alternative to Temporal which lets you execute Temporal-like workflows. These are durable from the perspective of the parent step which executes them, as any events generated by the child workflows get replayed if the parent step re-executes. Non-parent steps are the equivalent of a Temporal activity, while parent steps are the equivalent of a Temporal workflow.

Our longer-term goal is to build a better developer experience than Temporal, centered around observability and worker management. On the observability side, we're investing heavily in our dashboard, eventing, alerting and logging features. On the worker management side, we'd love to integrate more natively with worker runtime environments to handle use-cases like autoscaling.

23. fangpe+Bn[view] [source] 2024-06-27 16:45:59
>>abelan+(OP)
Hatchet looks pretty awesome. I was thinking about using it to replace my Celery worker. However, the problem is that I can only use the gRPC client to create a task (correct me if I'm wrong). What I want is to be able to commit a bunch of database rows together with the background task itself directly. The benefit of doing so with a PostgreSQL database is that all the rows land in the same transaction. With traditional background worker solutions, you run into two problems:

1. Commit the changes in the DB first: if you then fail to enqueue the task, there are data rows hanging in the DB but no task to process them.

2. Push the task first: the task may kick off too early, before the DB transaction has committed, so it can't find the rows that are still in the transaction. You then need to retry on failure.

We also looked at Celery hoping it could provide something similar, but the issue has been open for years:

https://github.com/celery/celery/issues/5149

With these needs in mind, I built a simple Python library on top of SQLAlchemy:

https://github.com/LaunchPlatform/bq

It would be super cool if Hatchet also supported native SQL inserts with ORM frameworks. Without the ability to commit tasks together with all the other data rows, I think it misses out on a bit of the benefit of using a database as the worker queue backend.
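For reference, the pattern I'm after looks roughly like this (illustrative models, not bq's actual API) - the data row and the task row commit or roll back together:

  from sqlalchemy import JSON, Column, Integer, String, create_engine
  from sqlalchemy.orm import Session, declarative_base

  Base = declarative_base()

  class Order(Base):
      __tablename__ = "orders"
      id = Column(Integer, primary_key=True)
      status = Column(String, default="pending")

  class Task(Base):
      __tablename__ = "tasks"
      id = Column(Integer, primary_key=True)
      name = Column(String)
      payload = Column(JSON)
      state = Column(String, default="queued")

  engine = create_engine("postgresql://localhost/app")
  Base.metadata.create_all(engine)

  with Session(engine) as session:
      order = Order()
      session.add(order)
      session.flush()  # assigns order.id without committing yet
      # Enqueue the background task in the *same* transaction as the data
      # row, so there's never an order without a task or a task without rows.
      session.add(Task(name="process_order", payload={"order_id": order.id}))
      session.commit()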

◧◩
31. tonyhb+hw[view] [source] [discussion] 2024-06-27 17:37:56
>>cyral+fn
Chiming in as the founder of https://www.inngest.com. It looks like Hatchet is trying to catch up with us, though there are some immediate differences:

* Inngest is fully event driven, with replays, fan-outs, `step.waitForEvent` to automatically pause and resume durable functions when specific events are received, declarative cancellation based off of events, etc.

* We have real-time metrics, tracing, etc. out of the box in our UI

* Out of the box support for TS, Python, Golang, Java. We're also interchangeable with zero-downtime language and cloud migrations

* I don't know Hatchet's local dev story, but it's a one-liner for us

* Batching, to turn e.g. 100 events into a single execution

* Concurrency, throttling, rate limiting, and debouncing, built in and operate at a function level

* Support for your own multi-tenancy keys, allowing you to create queues and set concurrency limits per tenant

* Works on serverless, on servers, or anywhere else

* And, specifically, it's all procedural and doesn't have to be a DAG.

We've also invested heavily in flow control — the aspects of batching, concurrency, custom multi-tenancy controls, etc. are all things that you have to layer over other systems.

I expect that, because we've been around for a couple of years, newer folks like Hatchet will end up trying to replicate some of what we've done, though building this takes quite some time. Either way, happy to see our API and approach start to spread :)

◧◩◪◨
32. tiraz+Ow[view] [source] [discussion] 2024-06-27 17:40:53
>>dalber+cq
Then maybe Procrastinate (https://procrastinate.readthedocs.io/en/main/) is something for you (I just contributed some features to it). It has very good documentation, MIT license, and also some nice features like job scheduling, priorities, cancellation, etc.
◧◩
33. abelan+oz[view] [source] [discussion] 2024-06-27 17:56:55
>>cyral+fn
Re Inngest - there are a few differences:

1. Hatchet is MIT licensed and designed to be self-hosted in production, with cloud as an alternative. While the Inngest dev server is open source, it doesn't support self-hosting: https://www.inngest.com/docs/self-hosting.

2. Inngest is built on an HTTP webhook model while Hatchet is built on a long-lived, client-initiated gRPC connection. While we support HTTP webhooks for serverless environments, a core part of the Hatchet platform is built to display the health of a long-lived worker and provide worker-level metrics that can be used for autoscaling. All async runtimes that we've worked on in the past have eventually migrated off of serverless for a number of reasons, like reducing latency or having more control over things like runtime environment and DB connections. AFAIK the concept of a worker or worker health doesn't exist in Inngest.

There are the finer details which we can hash out in the other thread, but both products rely on events, tasks and durable workflows as core concepts, and there's a lot of overlap.

36. mads_q+hB[view] [source] 2024-06-27 18:06:23
>>abelan+(OP)
Awesome! Reducing moving parts is always a great thing! For 99.9% of teams, relying only on the database the team is already using is a great alternative. For those teams that use MongoDB, I created something similar (and simpler, of course): https://allquiet.app/open-source/mongo-queueing The package is C#, but the idea could be adapted to practically any language that has a MongoDB driver.
◧◩◪
42. abelan+YD[view] [source] [discussion] 2024-06-27 18:24:06
>>tonyhb+hw
If we’re going to give credit where credit’s due, the history of durable execution traces back to the ideas of AWS Step Functions and Azure Durable Functions, alongside the original Cadence and Conductor projects. A lot of the features here are attempts to make patterns from those projects accessible to a wider range of developers.

Hatchet is also event driven [1], has built-in support for tracing and metrics, has TS [2], Python [3], and Golang [4] SDKs, supports throttling and rate limiting [5] as well as concurrency with custom multi-tenancy keys [6], works on serverless [7], and supports procedural workflows [8].

That said, there are certainly lots of things to work on. Batching and better tracing are on our roadmap. And while we don’t have a Java SDK, we do have a Github discussion for future SDKs that you can vote on here: https://github.com/hatchet-dev/hatchet/discussions/436.

[1] https://docs.hatchet.run/home/features/triggering-runs/event...

[2] https://docs.hatchet.run/sdks/typescript-sdk

[3] https://docs.hatchet.run/sdks/python-sdk

[4] https://docs.hatchet.run/sdks/go-sdk

[5] https://docs.hatchet.run/home/features/rate-limits

[6] https://docs.hatchet.run/home/features/concurrency/round-rob...

[7] https://docs.hatchet.run/home/features/webhooks

[8] https://docs.hatchet.run/home/features/child-workflows

◧◩◪
46. p10jkl+AI[view] [source] [discussion] 2024-06-27 18:52:53
>>tonyhb+hw
Maybe let them have their launch? Mitchell said it best:

https://x.com/mitchellh/status/1759626842817069290?s=46&t=57...

◧◩
47. abelan+fP[view] [source] [discussion] 2024-06-27 19:32:35
>>mind-b+zC
While the execution model is very similar to Airflow, we're primarily targeting async jobs which are spawned from an application, while Airflow is primarily for data pipelines. The connector ecosystem of Airflow is very powerful and not something that we're trying to replace.

That's not to say you can't use Hatchet for data pipelines - this is a common use-case. But you probably don't want to use Hatchet for big data pipelines where payload sizes are very large and you're working with payloads that aren't JSON serializable.

Airflow also tends to be quite slow when the task itself is short-lived. We don't have benchmarks, but you can have a look at Windmill's benchmarks on this: https://www.windmill.dev/docs/misc/benchmarks/competitors#re....

◧◩
52. abelan+UR[view] [source] [discussion] 2024-06-27 19:48:45
>>didip+hC
It does seem like some really great options are emerging in the Go community, and a lot of newer execution frameworks are supporting Go as one of the first languages. Another great addition is https://github.com/riverqueue/river.
67. andrew+Il1[view] [source] 2024-06-27 23:12:18
>>abelan+(OP)
I've built several message queues over the years.

I hated the configuration and management complexity of RabbitMQ and Celery and pretty much everything else.

My ultimate goal was to build a message queue that was extremely fast, required absolutely zero config, and was HTTP based, so it has no requirement for any specific client.

I developed one in Python that was pretty complete but slow, then developed a prototype in Rust that was extremely fast but incomplete.

The latest is sasquatch. It's written in Go, uses SQLite for the DB, and behaves in a very similar way to Amazon SQS in that connections are HTTP and it uses long polling to wait for messages.

https://github.com/crowdwave/sasquatch

It's only in the very early stages of development and likely doesn't even compile yet, but most of the code is in place. I'm hoping to get around to the next phase of development soon.

I just love the idea of a message queue that is a single static binary: when you run it, you have a fully functioning message queue with nothing more to do - not even fiddling with Postgres.

Absolutely zero config - not minutes, hours, or days of futzing with configs and blogs and tutorials.

68. smalls+Yn1[view] [source] 2024-06-27 23:30:02
>>abelan+(OP)
I'm wondering what the difference is from https://docs.urlinks.io/gateway-chain/ . There are a lot of similar concepts - like, very similar. Hatchet feels like the same product, but with money from VCs.
◧◩
72. bgentr+xs1[view] [source] [discussion] 2024-06-28 00:20:13
>>acaloi+fj
I remember that post and I’ve read it a few times, thank you for it! I was already working on River at the time but it was refreshing to see the case made so strongly by another person who gets it.

- Blake, co-author of riverqueue.com / https://github.com/riverqueue/river :)

78. grogen+EN1[view] [source] 2024-06-28 05:07:28
>>abelan+(OP)
Shameless plug since I never get to do those: https://github.com/gaffo/jorb

There are many great distributed job runners out there. I've never found one for Go that lets me have the features without running 7 processes and message queues sprawled over hosts and Docker containers.

jorb is just a framework to slap into a go script when you want to fire a lot of work at your computer and let it run it to completion.

I've tried to build this many times and this is the first time I've gotten it to stick.

Yes, you can do this with core Go primitives, but I find this abstraction to be a lot better and (eventually) easier for debugging deadlocks.

I'm just putting it here cause it's semi related.

◧◩◪
87. mathnm+u12[view] [source] [discussion] 2024-06-28 08:30:48
>>vasco+DS1
> You can use celery with postgres without issues

How? This issue still seems to be open after 6 years: https://github.com/celery/celery/issues/5149

88. subham+za2[view] [source] 2024-06-28 10:37:27
>>abelan+(OP)
How is this technically different from what Solid Queue is doing on Rails?

https://github.com/rails/solid_queue

Just trying to understand. I do get that Hatchet would be a language-agnostic, SDK/API kind of solution.

◧◩
89. abelan+Pj2[view] [source] [discussion] 2024-06-28 12:20:38
>>krick+RD1
You're right, if all you need is a queue with a small number of workers connected at low volume, you don't need Hatchet or any other managed queue - you can get some pretty performant behavior with something like: https://github.com/abelanger5/postgres-fair-queue/blob/main/....
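The core of that is just a SKIP LOCKED dequeue - roughly (simplified, illustrative schema):

  import psycopg2

  conn = psycopg2.connect("postgresql://localhost/app")

  def dequeue_one():
      # Claim the oldest queued task; SKIP LOCKED lets concurrent workers
      # pull different rows without blocking on each other.
      with conn, conn.cursor() as cur:
          cur.execute(
              """
              UPDATE tasks SET status = 'running'
              WHERE id = (
                  SELECT id FROM tasks
                  WHERE status = 'queued'
                  ORDER BY created_at
                  FOR UPDATE SKIP LOCKED
                  LIMIT 1
              )
              RETURNING id, payload
              """
          )
          return cur.fetchone()  # None if the queue is empty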

The point of Hatchet is to support more complex behavior - like chaining tasks together, building automation around querying and retrying failed tasks, handling a lot of the fairness and concurrency use-cases you'd otherwise need to build yourself, etc - or just getting something that works out of the box and can support those use-cases in the future.

And if you are running at low volume and trying to debug user issues, a grafana panel isn't going to get you the level of granularity or admin control you need to track down the errors in your methods (rather than just at the queue level). You'd need to integrate your task queue with Sentry and a logging system - and in our case, error tracing and logging are available in the Hatchet UI.

◧◩◪◨
91. abelan+Zv2[view] [source] [discussion] 2024-06-28 13:58:08
>>sillys+ss2
Yep, we have support for "retry 5 times, then give up" (https://docs.hatchet.run/home/features/retries/simple) and "text me" - you can use either our built-in alerting features which integrate with email and Slack, or configure your own on failure step (https://docs.hatchet.run/home/features/on-failure-step).

Configurable retry delays are currently in development.
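In SDK terms it looks roughly like this (decorator and argument names are approximate - the linked docs have the exact API):

  from hatchet_sdk import Hatchet, Context

  hatchet = Hatchet()

  def render_report(inp):  # placeholder for the real work
      return {"url": "https://example.com/report.pdf"}

  def page_on_call(msg):  # placeholder for your alerting hook
      print(msg)

  @hatchet.workflow(on_events=["report:requested"])
  class GenerateReport:
      # Retry up to 5 times before giving up (argument name approximate).
      @hatchet.step(retries=5)
      def generate(self, context: Context):
          return render_report(context.workflow_input())

      # Runs only if the workflow fails after exhausting retries - this is
      # the "text me" part, e.g. paging or posting to Slack.
      @hatchet.on_failure_step()
      def notify(self, context: Context):
          page_on_call("report generation failed")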
