Hey HN - this is Alexander and Gabe from Hatchet (https://hatchet.run). We’re building a modern task queue as an alternative to tools like Celery for Python and BullMQ for Node. Our open-source repo is at https://github.com/hatchet-dev/hatchet and is 100% MIT licensed.
When we did a Show HN a few months ago (https://news.ycombinator.com/item?id=39643136), our cloud version was invite-only and we were focused on our open-source offering.
Today we’re launching our self-serve cloud so that anyone can get started creating tasks on our platform - you can get started at https://cloud.onhatchet.run, or you can use these credentials to access a demo (should be prefilled):
URL: https://demo.hatchet-tools.com
Email: hacker@news.ycombinator.com
Password: HatchetDemo123!
People are currently using Hatchet for a bunch of use-cases: orchestrating RAG pipelines, queueing up user notifications, building agentic LLM workflows, and scheduling image generation tasks on GPUs.

We built this out of frustration with existing tools and a conviction that PostgreSQL is the right choice for a task queue. Beyond the fact that many developers already have Postgres in their stack, which makes it easier to self-host Hatchet, it's also easier to model higher-order concepts in Postgres, like chains of tasks (which we call workflows). In our system, the acknowledgement of the task, the task result, and the updates to higher-order models happen in the same Postgres transaction, which significantly reduces the risk of data loss and race conditions compared with other task queues, which usually pass acknowledgements through a broker, store task results elsewhere, and only then figure out the next task in the chain.
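To make that concrete, here's a minimal sketch of the idea (illustrative schema and names only, not Hatchet's actual tables): the task acknowledgement, its result, and the enqueue of the next step in the chain either all commit or none do.

    # Illustrative sketch only -- not Hatchet's real schema. The point is that the
    # task ack, the task result, and the enqueue of the next step in the chain are
    # one atomic commit, so a crash can't leave a finished task with no successor.
    import json
    import psycopg

    def complete_step(dsn: str, task_id: int, result: dict) -> None:
        with psycopg.connect(dsn) as conn:
            with conn.transaction():
                # 1. Acknowledge the task and persist its result.
                conn.execute(
                    "UPDATE tasks SET status = 'SUCCEEDED', result = %s WHERE id = %s",
                    (json.dumps(result), task_id),
                )
                # 2. Enqueue the next step of the workflow in the same transaction
                #    (here the next step's input is the previous step's result).
                conn.execute(
                    """
                    INSERT INTO tasks (workflow_run_id, step, status, input)
                    SELECT workflow_run_id, step + 1, 'QUEUED', %s
                    FROM tasks WHERE id = %s
                    """,
                    (json.dumps(result), task_id),
                )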
We also became increasingly frustrated with tools like Celery and the challenges it introduces when using a modern Python stack (> 3.5). We wrote up a list of these frustrations here: https://docs.hatchet.run/blog/problems-with-celery.
Since our Show HN, we’ve (partially or completely) addressed the most common pieces of feedback from the post, which we’ll outline here:
1. The most common ask was built-in support for fanout workflows — one task which triggers an arbitrary number of child tasks to run in parallel. We previously only had support for DAG executions. We generalized this concept and launched child workflows (https://docs.hatchet.run/home/features/child-workflows). This is the first step towards a developer-friendly model of durable execution.
2. Support for HTTP-based triggers — we’ve built out support for webhook workers (https://docs.hatchet.run/home/features/webhooks), which allow you to trigger any workflow over an HTTP webhook. This is particularly useful for apps on Vercel, which are subject to timeout limits of 60s, 300s, or 900s (depending on your tier).
3. Our RabbitMQ dependency — while we haven’t gotten rid of this completely, we’ve recently launched hatchet-lite (https://docs.hatchet.run/self-hosting/hatchet-lite), which lets you run the various Hatchet components in a single Docker image that bundles RabbitMQ along with a migration process, admin CLI, our REST API, and our gRPC engine. Hopefully the "lite" was a giveaway, but this is meant for local development and low-volume processing, on the order of hundreds of tasks per minute.
We’ve also launched more features, like support for global rate limiting, steps which only run on workflow failure, and custom event streaming.
We’ll be here the whole day for questions and feedback, and look forward to hearing your thoughts!
For some more detail -- to ensure we can't assign duplicate work, we track which workers are assigned to jobs using the concept of a WorkerSemaphore, where each worker slot is backed by a row in the WorkerSemaphore table. When assigning tasks, we scan the WorkerSemaphore table and use `FOR UPDATE SKIP LOCKED` to skip any locked rows held by other assignment transactions. We also have a uniqueness constraint on the task id across all WorkerSemaphores to ensure that a task can't be acquired by more than one semaphore.
This is slightly different from the way most Postgres-backed queues work, where `FOR UPDATE SKIP LOCKED` is done at the task level. The reason is that not every worker maintains its own connection to the database in Hatchet, so we use this pattern to assign tasks across multiple workers and then route each task via gRPC to the correct worker after the transaction completes.
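As a rough sketch of what that assignment looks like (the pattern, not our actual schema or query):

    # A rough sketch of the assignment pattern described above -- not Hatchet's
    # actual schema or SQL. Each row in worker_semaphore_slots is one free slot on
    # a worker; FOR UPDATE SKIP LOCKED lets concurrent assignment transactions skip
    # slots that another transaction is already claiming.
    import psycopg

    ASSIGN_SQL = """
    WITH slot AS (
        SELECT id, worker_id
        FROM worker_semaphore_slots
        WHERE task_id IS NULL
        ORDER BY id
        LIMIT 1
        FOR UPDATE SKIP LOCKED
    )
    UPDATE worker_semaphore_slots s
    SET task_id = %(task_id)s  -- a unique index on task_id prevents double-assignment
    FROM slot
    WHERE s.id = slot.id
    RETURNING s.worker_id;
    """

    def assign_task(dsn: str, task_id: int) -> str | None:
        """Claim a free worker slot for task_id; returns the worker to route to via gRPC."""
        with psycopg.connect(dsn) as conn:
            with conn.transaction():
                row = conn.execute(ASSIGN_SQL, {"task_id": task_id}).fetchone()
        return row[0] if row else None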
We're eventually going to support a lightweight Postgres-backed messaging table, but the number of pub/sub messages sent through RabbitMQ is typically an order of magnitude higher than the number of tasks sent.
(1) you, for free
(2) develop all the functionality of RabbitMQ as a Postgres extension with the most permissive license
(3) in order to have it on RDS
(4) and never hear from you again?
This is a colorful exaggeration. But it’s true. It is playing out with the pgvecto-rs people too.
People don’t want Postgres because it is good. They want it because it is offered by RDS, which makes it good.
Great job so far - the flow-based UI with triggers is killer! AFAIK, this surpasses what Celery includes.
Part of the reason for working on Hatchet (this version) was that I built a Terraform management tool on top of Temporal and felt there was room for improvement.
(for those curious - https://github.com/hatchet-dev/hatchet-v1-archived)
I am definitely a fan of all things postgres and it's great to see another solution that uses it.
My main thing is the RabbitMQ dependency (that seems to be a topic of interest in this thread). Getting rid of that and just depending on PG seems like the main path forward that would increase adoption. Right now I'd be considering something like this over using a tool like Rabbit (if I were making that consideration.)
You also compare yourself against Celery and BullMQ, but there is also talk in the readme around durable execution. That to me puts you in the realm of Temporal. How would you say you compare/compete with Temporal? Are you looking to compete with them?
EDIT: I also understand that Rabbit comes with certain things (or rather, lacks certain things) that you are building on top of, which is cool. It's easy to say "why are you using Rabbit??", but if it's allowing you to function like it, with new additions/features, that seems like a good thing!
I know a lot of folks are going after the AI agent workflow orchestration platform, do you see yourselves progressing there?
In my head, Hatchet coupled with BAML (https://www.boundaryml.com/) could be an incredible combination to support these AI agents. Congrats on the launch
Also, somewhat related, years ago I wrote a very small framework for fan-out of Django-based tasks in Celery. We have been running it in production for years. It doesn't have adoption beyond our company, but I think there are some good ideas in it. Feel free to take a look if it's of interest! https://github.com/groveco/django-sprinklers
The advice of "commoditize your complements" is working out great for Amazon. Ironically, AWS is almost a commodity itself, and the OSS community could flip the table, but we haven't figured out how to do it.
To that end, we’re building Hatchet to orchestrate agents, with common requirements like streaming from running workers to the frontend [1] and rate limiting [2] built in, without imposing too many opinions on your core application logic.
[1] https://docs.hatchet.run/home/features/streaming [2] https://docs.hatchet.run/home/features/rate-limits
Yep, we agree - this is more a matter of bandwidth, as well as figuring out the final definition of the pub/sub interface. While we'd prefer not to maintain two message queue implementations, we likely won't drop the RabbitMQ implementation entirely, even if we offer Postgres as an alternative. So if we do need to support two implementations, we'd prefer to build out a core set of features that we're happy with first. That said, the message queue API is definitely stabilizing (https://github.com/hatchet-dev/hatchet/blob/31cf5be248ff9ed7...), so I hope we can pick this up in the coming months.
> You also compare yourself against Celery and BullMQ, but there is also talk in the readme around durable execution. That to me puts you in the realm of Temporal. How would you say you compare/compete with Temporal? Are you looking to compete with them?
Yes, our child workflows feature is an alternative to Temporal which lets you execute Temporal-like workflows. These are durable from the perspective of the parent step which executes them, as any events generated by the child workflows get replayed if the parent step re-executes. Non-parent steps are the equivalent of a Temporal activity, while parent steps are the equivalent of a Temporal workflow.
Our longer-term goal is to build a better developer experience than Temporal, centered around observability and worker management. On the observability side, we're investing heavily in our dashboard, eventing, alerting and logging features. On the worker management side, we'd love to integrate more natively with worker runtime environments to handle use-cases like autoscaling.
1. Commit changes in the db first: if you fail to enqueue the task, there will be data rows left hanging in the db with no task to process them
2. Push the task first: the task may kick off too early, before the DB transaction has committed, so it can't see the rows that are still in the open transaction. You will need to retry on failure
We also looked at Celery, hoping it could offer something similar, but the issue has been open for years:
https://github.com/celery/celery/issues/5149
With those needs in mind, I built a simple Python library on top of SQLAlchemy:
https://github.com/LaunchPlatform/bq
It would be super cool if Hatchet also supported native SQL inserts with ORM frameworks. Without the ability to commit tasks alongside all other data rows, I think it misses out on a bit of the benefit of using a database as the worker queue backend.
> develop all the functionality of RabbitMQ as a Postgres extension with the most permissive license
That's fair - we're not going to develop all the functionality of RabbitMQ on Postgres (if we were, we probably would have started with an AMQP-compatible broker). We're building the orchestration layer that sits on top of the underlying message queue and database to manage the lifecycle of a remotely invoked function.
We've been using Celery at ListenNotes.com since 2017. I agree that observability of Celery tasks is not great.
It seems like a very lightweight tasks table in your existing PG database, representing whether or not the task has been written to Hatchet, would solve both of these cases. Once Hatchet is sent the workflow/task to execute, it's guaranteed to be enqueued/requeued. That way, you could get the other benefits of Hatchet while still getting transactional enqueueing. We could definitely add this for certain ORM frameworks/SDKs with enough interest.
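Roughly, what we have in mind is a transactional outbox. A minimal sketch with SQLAlchemy, using hypothetical table and helper names (`push_to_hatchet` stands in for whatever SDK call actually enqueues the run):

    # Sketch of the "lightweight tasks table" idea above (a transactional outbox).
    # Table names, DSN, and push_to_hatchet are hypothetical placeholders.
    import json
    from sqlalchemy import create_engine, text

    engine = create_engine("postgresql+psycopg://app:app@localhost/app")

    def push_to_hatchet(workflow: str, input_json: str) -> None:
        ...  # stand-in for the SDK call that actually triggers the workflow run

    def create_document(body: str) -> None:
        # The domain row and the outbox row commit (or roll back) together.
        with engine.begin() as conn:
            doc_id = conn.execute(
                text("INSERT INTO documents (body) VALUES (:body) RETURNING id"),
                {"body": body},
            ).scalar_one()
            conn.execute(
                text("INSERT INTO hatchet_outbox (workflow, input, sent) "
                     "VALUES ('document:ingest', :input, false)"),
                {"input": json.dumps({"document_id": doc_id})},
            )

    def relay_outbox() -> None:
        # A small poller (cron, background loop, etc.) drains the outbox. Rows are
        # only marked sent after the enqueue succeeds, so re-running after a crash
        # is safe and gives at-least-once delivery into the queue.
        with engine.begin() as conn:
            pending = conn.execute(
                text("SELECT id, workflow, input FROM hatchet_outbox "
                     "WHERE NOT sent FOR UPDATE SKIP LOCKED")
            ).all()
            for row in pending:
                push_to_hatchet(row.workflow, row.input)
                conn.execute(
                    text("UPDATE hatchet_outbox SET sent = true WHERE id = :id"),
                    {"id": row.id},
                )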
With Hatchet, the starting point is a single function call that gets enqueued according to a configuration you've set to respect different fairness and concurrency constraints. Durable workflows can be built on top of that, but the entire platform should feel intuitive and familiar to anyone working in the codebase.
* Inngest is fully event driven, with replays, fan-outs, `step.waitForEvent` to automatically pause and resume durable functions when specific events are received, declarative cancellation based off of events, etc.
* We have real-time metrics, tracing, etc. out of the box in our UI
* Out-of-the-box support for TS, Python, Golang, Java. We're also interchangeable, with zero-downtime language and cloud migrations
* I don't know Hatchet's local dev story, but it's a one-liner for us
* Batching, to turn e.g. 100 events into a single execution
* Concurrency, throttling, rate limiting, and debouncing, built in and operate at a function level
* Support for your own multi-tenancy keys, allowing you to create queues and set concurrency limits per tenant
* Works on serverless, on servers, or anywhere else
* And, specifically, it's all procedural and doesn't have to be a DAG.
We've also invested heavily in flow control — the aspects of batching, concurrency, custom multi-tenancy controls, etc. are all things that you have to layer over other systems.
I expect because we've been around for a couple of years that newer folks like Hatchet end up trying to replicate some of what we've done, though building this takes quite some time. Either way, happy to see our API and approach start to spread :)
1. Hatchet is MIT licensed and designed to be self-hosted in production, with cloud as an alternative. While the Inngest dev server is open source, it doesn't support self-hosting: https://www.inngest.com/docs/self-hosting.
2. Inngest is built on an HTTP webhook model while Hatchet is built on a long-lived, client-initiated gRPC connection. While we support HTTP webhooks for serverless environments, a core part of the Hatchet platform is built to display the health of a long-lived worker and provide worker-level metrics that can be used for autoscaling. All async runtimes that we've worked on in the past have eventually migrated off of serverless for a number of reasons, like reducing latency or having more control over things like runtime environment and DB connections. AFAIK, the concept of a worker or worker health doesn't exist in Inngest.
There are the finer details which we can hash out in the other thread, but both products rely on events, tasks and durable workflows as core concepts, and there's a lot of overlap.
Isn't all the money going to AI companies these days? (Even the unicorns didn't do well with their IPOs, e.g. HashiCorp.)
That said, I love every single addition to the Go community so thumbs up from me.
ETA: I really like the idea of this being entirely built on Postgres. That makes infrastructure a lot easier to manage
One of our key aspects is reliability. We were apprehensive about officially supporting self-hosting, with its awkward queue and state-store migrations, until you could "set it and forget it". Otherwise, you're almost certainly going to be many versions behind with a very tedious upgrade path.
So, if you're a cowboy, totally self hostable. If you're not (which makes sense — you're using durable execution), check back in a short amount of time :)
Hatchet is also event driven [1], has built-in support for tracing and metrics, and has a TS [2], Python [3] and Golang SDK [4], has support for throttling and rate limiting [5], concurrency with custom multi-tenancy keys [6], works on serverless [7], and supports procedural workflows [8].
That said, there are certainly lots of things to work on. Batching and better tracing are on our roadmap. And while we don’t have a Java SDK, we do have a Github discussion for future SDKs that you can vote on here: https://github.com/hatchet-dev/hatchet/discussions/436.
[1] https://docs.hatchet.run/home/features/triggering-runs/event...
[2] https://docs.hatchet.run/sdks/typescript-sdk
[3] https://docs.hatchet.run/sdks/python-sdk
[4] https://docs.hatchet.run/sdks/go-sdk
[5] https://docs.hatchet.run/home/features/rate-limits
[6] https://docs.hatchet.run/home/features/concurrency/round-rob...
Where I differ is that if you already have Redis in the mix, I might be inclined to reach for it first in a lot of scenarios. If you have complex distribution needs, then something more like RabbitMQ would be better.
I do think their cloud offering is interesting, and being PostgreSQL backed is a big plus for in-house development.
https://x.com/mitchellh/status/1759626842817069290?s=46&t=57...
That's not to say you can't use Hatchet for data pipelines - this is a common use-case. But you probably don't want to use Hatchet for big data pipelines where payload sizes are very large and you're working with payloads that aren't JSON serializable.
Airflow also tends to be quite slow when the task itself is short-lived. We don't have benchmarks, but you can have a look at Windmill's benchmarks on this: https://www.windmill.dev/docs/misc/benchmarks/competitors#re....
I disagree that "adding Redis to our software stack" to support a queue is problematic. It's a single process and extremely simple. Instead, with tools like this, you're cluttering up your database with ephemeral tasks alongside your operational data.
1. Repository/document ingestion and indexing fanout for applications like code generation or legal tech LLM agents
2. Orchestrating cloud deployment pipelines
3. Web scraping and post-processing
4. GPU inference jobs requiring multiple steps, compute classes, or batches
Once that's done and we consider our core API stable, there's a good chance we'll start tackling a new set of SDKs later this year.
I particularly like the section on escape hatches - though you start to see the issue with this approach when you use something like Celery, where the docs and Github issues contain a number of warnings about using Redis. RabbitMQ also tends to be very feature-rich from an MQ perspective compared to Redis, so it gets more and more difficult to support both over time.
We'd like to build in escape hatches as well - this starts with the application code being the exact same whether you're on cloud or self-hosted - and adding support for things like archiving task result storage to the object store of your choice, or swapping out the pub/sub system.
I’m guessing :shrug:
And to answer the question, no, the license doesn't restrict a company from offering a hosted version of Hatchet. We chose the license that we'd want to see if we were making a decision to adopt Hatchet.
That said, managing and running the cloud version is significantly different from running a version meant for one org -- the infra surrounding the cloud version manages hundreds and eventually thousands of different tenants. While it's all the same open-source engine + API, there's a lot of work required to distribute the different engine components in a way that's reliable and supports partitioning databases between tenants.
Also, minor thing, but the granularity around rate limiting and queues feels like quite a luxury. Excited for more here too
Cool to see them on the front page, congrats on the launch
A question around workflows having just skimmed your docs. Is it possible to define a workflow that has steps in Python and a TS app?
It's also possible to have a single DAG workflow (instead of parent/child) that has steps across multiple languages, but you'll need to use a relatively undocumented method called `RegisterAction` within each SDK and use the API to register the DAG (instead of using the built-in helpers) for this use-case. So we recommend using the parent/child workflows instead.
I hated the configuration and management complexity of RabbitMQ and Celery and pretty much everything else.
My ultimate goal was to build a message queue that was extremely fast, required absolutely zero config, and was HTTP-based, thus having no requirement for any specific client.
I developed one in Python that was pretty complete but slow, then developed a prototype in Rust that was extremely fast but incomplete.
The latest is sasquatch. It's written in Go, uses SQLite for the db, and behaves in a very similar way to Amazon SQS in that connections are HTTP and it uses long polling to wait for messages.
https://github.com/crowdwave/sasquatch
It's only in the very early stages of development and likely doesn't even compile yet, but most of the code is in place. I'm hoping to get around to the next phase of development soon.
I just love the idea of a message queue that is a single static binary and when you run it, you have a fully functioning message queue with nothing more to do - not even fiddling with Postgres.
Absolutely zero config - not minutes, hours, or days of futzing with configs and blogs and tutorials.
The problem we often hit when building apps on top of LLMs is managing LLM context windows (and sometimes swappable LLM providers), for which you need different types of worker/consumer/queue setups.
TypeScript is amazing for building full-stack web apps quickly. For a decade my go-to was Django, but everything just goes so much faster with endpoints & frontend all in the same place. But finding a good job/queue service is a little more of a challenge in this world than "just set up Celery". BullMQ is great, but doesn't work with "distributed" Redis providers like Upstash (Vercel's choice).
So, in a roundabout way, an offering like this is in a super-duper position for AI money :)
- Blake, co-author of riverqueue.com / https://github.com/riverqueue/river :)
With Hatchet there's been a little bit of a dance trying to get workflows and runs to play nicely, but all in all I was able to get everything I needed working without much trouble. You end up running quite a few more tasks than needed (essentially no-ops), or wrapping small tasks in wrapper workflows, but from both an implementation and implication standpoint, there's almost no difference.
10/10 solved problem with SAQ, 8/10 not an issue with Hatchet... 2/10 smh Celery
There are many great distributed job runners out there. I've never found one for Go that lets me have the features without running 7 processes and message queues sprawled over hosts and Docker containers.
jorb is just a framework to slap into a Go script when you want to fire a lot of work at your computer and let it run to completion.
I've tried to build this many times and this is the first time I've gotten it to stick.
Yes, you can do this with core Go primitives, but I find this abstraction to be a lot better, and (eventually) it made deadlocks easier to debug.
I'm just putting it here cause it's semi related.
The dashboard in Hatchet has a great GUI where you can navigate between all the tasks, see how they connect, see the data passed into each one, and see the return results from each task, and each one has a log box you can print information to. You can rerun tasks, override variables, trigger identical workflows, and filter tasks by metadata.
It's dramatically reduced the amount of time it takes me to spot, identify, and fix bugs. I miss the simplicity of SAQ but that's the reason I switched and it's paid off already
FWIW, I've used both dramatiq and Celery with Redis in heavy prod environments and never had an issue with debugging. And I'm having a tough time understanding how switching the underlying queue infrastructure would have made my life easier.
It's nice to be running on Postgres (i.e. not really having to worry about payload size, I heard some people were passing images from task to task) but for me that is just a nicety and wasn't a reason to switch.
If you're happy with your current infra, happy with the visibility, and there's nothing lacking in the development perspective, then yeah probably not much point in switching your infra to begin with [1]. But if you're building complicated workflows, and just want your code to run with an extreme level of visibility, it's worth checking out Hatchet.
[1] I'm sure the founders would have more to say here, but as a consumer I'm not really deep in the architecture of the product. Best I could do could be to give you 100 reasons I will never use Celery again XD
How? This issue still seems to be open after 6 years: https://github.com/celery/celery/issues/5149
https://github.com/rails/solid_queue
Just trying to understand. I do get that Hatchet would be a language-agnostic, SDK/API kind of solution.
The point of Hatchet is to support more complex behavior - like chaining tasks together, building automation around querying and retrying failed tasks, handling a lot of the fairness and concurrency use-cases you'd otherwise need to build yourself, etc - or just getting something that works out of the box and can support those use-cases in the future.
And if you are running at low volume and trying to debug user issues, a grafana panel isn't going to get you the level of granularity or admin control you need to track down the errors in your methods (rather than just at the queue level). You'd need to integrate your task queue with Sentry and a logging system - and in our case, error tracing and logging are available in the Hatchet UI.
(Wiring together 40+ preemptible TPUs was a nice crucible for learning about all of these. And much like a crucible, it was as painful as it sounds. Hatchet would’ve been nice.)
Thanks for making this!
Configurable retry delays are currently in development.