Hey HN - this is Alexander and Gabe from Hatchet (https://hatchet.run). We’re building a modern task queue as an alternative to tools like Celery for Python and BullMQ for Node. Our open-source repo is at https://github.com/hatchet-dev/hatchet and is 100% MIT licensed.
When we did a Show HN a few months ago (https://news.ycombinator.com/item?id=39643136), our cloud version was invite-only and we were focused on our open-source offering.
Today we’re launching our self-serve cloud so that anyone can get started creating tasks on our platform - you can get started at https://cloud.onhatchet.run, or you can use these credentials to access a demo (should be prefilled):
URL: https://demo.hatchet-tools.com
Email: hacker@news.ycombinator.com
Password: HatchetDemo123!
People are currently using Hatchet for a bunch of use-cases: orchestrating RAG pipelines, queueing up user notifications, building agentic LLM workflows, and scheduling image generation tasks on GPUs.

We built this out of frustration with existing tools and a conviction that PostgreSQL is the right choice for a task queue. Beyond the fact that many developers already have Postgres in their stack, which makes it easier to self-host Hatchet, it's also easier to model higher-order concepts in Postgres, like chains of tasks (which we call workflows). In our system, the acknowledgement of the task, the task result, and the updates to higher-order models happen in the same Postgres transaction, which significantly reduces the risk of data loss and race conditions compared with other task queues, which usually pass acknowledgements through a broker, store task results elsewhere, and only then figure out the next task in the chain.
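To make that concrete, here's a minimal sketch of the idea (illustrative schema and names only, not Hatchet's actual tables): the task acknowledgement, its result, and the enqueue of the next step in the chain either all commit or none do.

    # Illustrative sketch only -- not Hatchet's real schema. The point is that the
    # task ack, the task result, and the enqueue of the next step in the chain are
    # one atomic commit, so a crash can't leave a finished task with no successor.
    import json
    import psycopg

    def complete_step(dsn: str, task_id: int, result: dict) -> None:
        with psycopg.connect(dsn) as conn:
            with conn.transaction():
                # 1. Acknowledge the task and persist its result.
                conn.execute(
                    "UPDATE tasks SET status = 'SUCCEEDED', result = %s WHERE id = %s",
                    (json.dumps(result), task_id),
                )
                # 2. Enqueue the next step of the workflow in the same transaction
                #    (here the next step's input is the previous step's result).
                conn.execute(
                    """
                    INSERT INTO tasks (workflow_run_id, step, status, input)
                    SELECT workflow_run_id, step + 1, 'QUEUED', %s
                    FROM tasks WHERE id = %s
                    """,
                    (json.dumps(result), task_id),
                )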
We also became increasingly frustrated with tools like Celery and the challenges it introduces when using a modern Python stack (> 3.5). We wrote up a list of these frustrations here: https://docs.hatchet.run/blog/problems-with-celery.
Since our Show HN, we’ve (partially or completely) addressed the most common pieces of feedback from the post, which we’ll outline here:
1. The most common ask was built-in support for fanout workflows — one task which triggers an arbitrary number of child tasks to run in parallel. We previously only had support for DAG executions. We generalized this concept and launched child workflows (https://docs.hatchet.run/home/features/child-workflows). This is the first step towards a developer-friendly model of durable execution.
2. Support for HTTP-based triggers — we’ve built out support for webhook workers (https://docs.hatchet.run/home/features/webhooks), which allow you to trigger any workflow over an HTTP webhook. This is particularly useful for apps on Vercel, which are subject to timeout limits of 60s, 300s, or 900s (depending on your tier).
3. Our RabbitMQ dependency — while we haven’t gotten rid of this completely, we’ve recently launched hatchet-lite (https://docs.hatchet.run/self-hosting/hatchet-lite), which lets you run the various Hatchet components in a single Docker image that bundles RabbitMQ along with a migration process, admin CLI, our REST API, and our gRPC engine. Hopefully the "lite" was a giveaway, but this is meant for local development and low-volume processing, on the order of hundreds of tasks per minute.
We’ve also launched more features, like support for global rate limiting, steps which only run on workflow failure, and custom event streaming.
We’ll be here the whole day for questions and feedback, and look forward to hearing your thoughts!
For some more detail -- to ensure we can't assign duplicate work, we track which workers are assigned to jobs using the concept of a WorkerSemaphore, where each worker slot is backed by a row in the WorkerSemaphore table. When assigning tasks, we scan the WorkerSemaphore table and use `FOR UPDATE SKIP LOCKED` to skip any locked rows held by other assignment transactions. We also have a uniqueness constraint on the task id across all WorkerSemaphores to ensure that a task can't be acquired by more than one semaphore.
This is slightly different from the way most Postgres-backed queues work, where `FOR UPDATE SKIP LOCKED` is done at the task level. The reason is that not every worker maintains its own connection to the database in Hatchet, so we use this pattern to assign tasks across multiple workers and then route each task via gRPC to the correct worker after the transaction completes.
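As a rough sketch of what that assignment looks like (the pattern, not our actual schema or query):

    # A rough sketch of the assignment pattern described above -- not Hatchet's
    # actual schema or SQL. Each row in worker_semaphore_slots is one free slot on
    # a worker; FOR UPDATE SKIP LOCKED lets concurrent assignment transactions skip
    # slots that another transaction is already claiming.
    import psycopg

    ASSIGN_SQL = """
    WITH slot AS (
        SELECT id, worker_id
        FROM worker_semaphore_slots
        WHERE task_id IS NULL
        ORDER BY id
        LIMIT 1
        FOR UPDATE SKIP LOCKED
    )
    UPDATE worker_semaphore_slots s
    SET task_id = %(task_id)s  -- a unique index on task_id prevents double-assignment
    FROM slot
    WHERE s.id = slot.id
    RETURNING s.worker_id;
    """

    def assign_task(dsn: str, task_id: int) -> str | None:
        """Claim a free worker slot for task_id; returns the worker to route to via gRPC."""
        with psycopg.connect(dsn) as conn:
            with conn.transaction():
                row = conn.execute(ASSIGN_SQL, {"task_id": task_id}).fetchone()
        return row[0] if row else None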
We're eventually going to support a lightweight Postgres-backed messaging table, but the number of pub/sub messages sent through RabbitMQ is typically an order of magnitude higher than the number of tasks sent.
(1) you, for free
(2) develop all the functionality of RabbitMQ as a Postgres extension with the most permissive license
(3) in order to have it on RDS
(4) and never hear from you again?
This is a colorful exaggeration. But it’s true. It is playing out with the pgvecto-rs people too.
People don’t want Postgres because it is good. They want it because it is offered by RDS, which makes it good.
Great job so far - the flow-based UI with triggers is killer! AFAIK, this surpasses what Celery includes.
Part of the reason for working on Hatchet (this version) was that I built a Terraform management tool on top of Temporal and felt there was room for improvement.
(for those curious - https://github.com/hatchet-dev/hatchet-v1-archived)
I am definitely a fan of all things postgres and it's great to see another solution that uses it.
My main thing is the RabbitMQ dependency (that seems to be a topic of interest in this thread). Getting rid of that and just depending on PG seems like the main path forward that would increase adoption. Right now I'd be considering something like this over using a tool like Rabbit (if I were making that consideration.)
You also compare yourself against Celery and BullMQ, but there is also talk in the readme around durable execution. That to me puts you in the realm of Temporal. How would you say you compare/compete with Temporal? Are you looking to compete with them?
EDIT: I also understand that Rabbit comes with certain things (or rather, lacks certain things) that you are building on top of, which is cool. It's easy to say "why are you using Rabbit??", but if it's allowing you to function like it, with new additions/features, that seems like a good thing!
I know a lot of folks are going after the AI agent workflow orchestration platform, do you see yourselves progressing there?
In my head, Hatchet coupled with BAML (https://www.boundaryml.com/) could be an incredible combination to support these AI agents. Congrats on the launch
Also, somewhat related, years ago I wrote a very small framework for fan-out of Django-based tasks in Celery. We have been running it in production for years. It doesn't have adoption beyond our company, but I think there are some good ideas in it. Feel free to take a look if it's of interest! https://github.com/groveco/django-sprinklers
The advice of "commoditize your complements" is working out great for Amazon. Ironically, AWS is almost a commodity itself, and the OSS community could flip the table, but we haven't figured out how to do it.
To that end, we’re building Hatchet to orchestrate agents, with common requirements like streaming from running workers to the frontend [1] and rate limiting [2] built in, without imposing too many opinions on your core application logic.
[1] https://docs.hatchet.run/home/features/streaming [2] https://docs.hatchet.run/home/features/rate-limits
Yep, we agree - this is more a matter of bandwidth, as well as figuring out the final definition of the pub/sub interface. While we'd prefer not to maintain two message queue implementations, we likely won't drop the RabbitMQ implementation entirely, even if we offer Postgres as an alternative. So if we do need to support two implementations, we'd prefer to build out a core set of features that we're happy with first. That said, the message queue API is definitely stabilizing (https://github.com/hatchet-dev/hatchet/blob/31cf5be248ff9ed7...), so I hope we can pick this up in the coming months.
> You also compare yourself against Celery and BullMQ, but there is also talk in the readme around durable execution. That to me puts you in the realm of Temporal. How would you say you compare/compete with Temporal? Are you looking to compete with them?
Yes, our child workflows feature is an alternative to Temporal which lets you execute Temporal-like workflows. These are durable from the perspective of the parent step which executes them, as any events generated by the child workflows get replayed if the parent step re-executes. Non-parent steps are the equivalent of a Temporal activity, while parent steps are the equivalent of a Temporal workflow.
Our longer-term goal is to build a better developer experience than Temporal, centered around observability and worker management. On the observability side, we're investing heavily in our dashboard, eventing, alerting and logging features. On the worker management side, we'd love to integrate more natively with worker runtime environments to handle use-cases like autoscaling.
1. Commit changes in the db first: if you fail to enqueue the task, there will be data rows left hanging in the db with no task to process them
2. Push the task first: the task may kick off too early, before the DB transaction has committed, so it can't see the rows that are still in the open transaction. You will need to retry on failure
We also looked at Celery, hoping it could offer something similar, but the issue has been open for years:
https://github.com/celery/celery/issues/5149
With those needs in mind, I built a simple Python library on top of SQLAlchemy:
https://github.com/LaunchPlatform/bq
It would be super cool if Hatchet also supported native SQL inserts with ORM frameworks. Without the ability to commit tasks alongside all other data rows, I think it misses out on a bit of the benefit of using a database as the worker queue backend.
> develop all the functionality of RabbitMQ as a Postgres extension with the most permissive license
That's fair - we're not going to develop all the functionality of RabbitMQ on Postgres (if we were, we probably would have started with an AMQP-compatible broker). We're building the orchestration layer that sits on top of the underlying message queue and database to manage the lifecycle of a remotely invoked function.
We've been using Celery at ListenNotes.com since 2017. I agree that observability of Celery tasks is not great.
It seems like a very lightweight tasks table in your existing PG database, representing whether or not the task has been written to Hatchet, would solve both of these cases. Once Hatchet is sent the workflow/task to execute, it's guaranteed to be enqueued/requeued. That way, you could get the other benefits of Hatchet while still getting transactional enqueueing. We could definitely add this for certain ORM frameworks/SDKs with enough interest.
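Roughly, what we have in mind is a transactional outbox. A minimal sketch with SQLAlchemy, using hypothetical table and helper names (`push_to_hatchet` stands in for whatever SDK call actually enqueues the run):

    # Sketch of the "lightweight tasks table" idea above (a transactional outbox).
    # Table names, DSN, and push_to_hatchet are hypothetical placeholders.
    import json
    from sqlalchemy import create_engine, text

    engine = create_engine("postgresql+psycopg://app:app@localhost/app")

    def push_to_hatchet(workflow: str, input_json: str) -> None:
        ...  # stand-in for the SDK call that actually triggers the workflow run

    def create_document(body: str) -> None:
        # The domain row and the outbox row commit (or roll back) together.
        with engine.begin() as conn:
            doc_id = conn.execute(
                text("INSERT INTO documents (body) VALUES (:body) RETURNING id"),
                {"body": body},
            ).scalar_one()
            conn.execute(
                text("INSERT INTO hatchet_outbox (workflow, input, sent) "
                     "VALUES ('document:ingest', :input, false)"),
                {"input": json.dumps({"document_id": doc_id})},
            )

    def relay_outbox() -> None:
        # A small poller (cron, background loop, etc.) drains the outbox. Rows are
        # only marked sent after the enqueue succeeds, so re-running after a crash
        # is safe and gives at-least-once delivery into the queue.
        with engine.begin() as conn:
            pending = conn.execute(
                text("SELECT id, workflow, input FROM hatchet_outbox "
                     "WHERE NOT sent FOR UPDATE SKIP LOCKED")
            ).all()
            for row in pending:
                push_to_hatchet(row.workflow, row.input)
                conn.execute(
                    text("UPDATE hatchet_outbox SET sent = true WHERE id = :id"),
                    {"id": row.id},
                )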
With Hatchet, the starting point is a single function call that gets enqueued according to a configuration you've set to respect different fairness and concurrency constraints. Durable workflows can be built on top of that, but the entire platform should feel intuitive and familiar to anyone working in the codebase.
* Inngest is fully event driven, with replays, fan-outs, `step.waitForEvent` to automatically pause and resume durable functions when specific events are received, declarative cancellation based off of events, etc.
* We have real-time metrics, tracing, etc. out of the box in our UI
* Out-of-the-box support for TS, Python, Golang, Java. We're also interchangeable, with zero-downtime language and cloud migrations
* I don't know Hatchet's local dev story, but it's a one-liner for us
* Batching, to turn e.g. 100 events into a single execution
* Concurrency, throttling, rate limiting, and debouncing, built in and operate at a function level
* Support for your own multi-tenancy keys, allowing you to create queues and set concurrency limits per tenant
* Works on serverless, on servers, or anywhere else
* And, specifically, it's all procedural and doesn't have to be a DAG.
We've also invested heavily in flow control — the aspects of batching, concurrency, custom multi-tenancy controls, etc. are all things that you have to layer over other systems.
I expect because we've been around for a couple of years that newer folks like Hatchet end up trying to replicate some of what we've done, though building this takes quite some time. Either way, happy to see our API and approach start to spread :)
1. Hatchet is MIT licensed and designed to be self-hosted in production, with cloud as an alternative. While the Inngest dev server is open source, it doesn't support self-hosting: https://www.inngest.com/docs/self-hosting.
2. Inngest is built on an HTTP webhook model while Hatchet is built on a long-lived, client-initiated gRPC connection. While we support HTTP webhooks for serverless environments, a core part of the Hatchet platform is built to display the health of a long-lived worker and provide worker-level metrics that can be used for autoscaling. All async runtimes that we've worked on in the past have eventually migrated off of serverless for a number of reasons, like reducing latency or having more control over things like runtime environment and DB connections. AFAIK, the concept of a worker or worker health doesn't exist in Inngest.
There are the finer details which we can hash out in the other thread, but both products rely on events, tasks and durable workflows as core concepts, and there's a lot of overlap.
Isn't all the money going to AI companies these days? (Even the unicorns didn't do well with their IPOs, e.g. HashiCorp.)
That said, I love every single addition to the Go community so thumbs up from me.
ETA: I really like the idea of this being entirely built on Postgres. That makes infrastructure a lot easier to manage
One of our key aspects is reliability. We were apprehensive about officially supporting self-hosting, with its awkward queue and state-store migrations, until you could "set it and forget it". Otherwise, you're almost certainly going to be many versions behind with a very tedious upgrade path.
So, if you're a cowboy, totally self hostable. If you're not (which makes sense — you're using durable execution), check back in a short amount of time :)
Hatchet is also event driven [1], has built-in support for tracing and metrics, and has a TS [2], Python [3] and Golang SDK [4], has support for throttling and rate limiting [5], concurrency with custom multi-tenancy keys [6], works on serverless [7], and supports procedural workflows [8].
That said, there are certainly lots of things to work on. Batching and better tracing are on our roadmap. And while we don’t have a Java SDK, we do have a Github discussion for future SDKs that you can vote on here: https://github.com/hatchet-dev/hatchet/discussions/436.
[1] https://docs.hatchet.run/home/features/triggering-runs/event...
[2] https://docs.hatchet.run/sdks/typescript-sdk
[3] https://docs.hatchet.run/sdks/python-sdk
[4] https://docs.hatchet.run/sdks/go-sdk
[5] https://docs.hatchet.run/home/features/rate-limits
[6] https://docs.hatchet.run/home/features/concurrency/round-rob...
Where I differ is that if you already have Redis in the mix, I might be inclined to reach for it first in a lot of scenarios. If you have complex distribution needs, then something more like RabbitMQ would be better.
I do think their cloud offering is interesting, and being PostgreSQL backed is a big plus for in-house development.
https://x.com/mitchellh/status/1759626842817069290?s=46&t=57...
That's not to say you can't use Hatchet for data pipelines - this is a common use-case. But you probably don't want to use Hatchet for big data pipelines where payload sizes are very large and you're working with payloads that aren't JSON serializable.
Airflow also tends to be quite slow when the task itself is short-lived. We don't have benchmarks, but you can have a look at Windmill's benchmarks on this: https://www.windmill.dev/docs/misc/benchmarks/competitors#re....
I disagree that "adding Redis to our software stack" to support a queue is problematic. It's a single process and extremely simple. Instead, with tools like this, you're cluttering up your database with ephemeral tasks alongside your operational data.
1. Repository/document ingestion and indexing fanout for applications like code generation or legal tech LLM agents
2. Orchestrating cloud deployment pipelines
3. Web scraping and post-processing
4. GPU inference jobs requiring multiple steps, compute classes, or batches
Once that's done and we consider our core API stable, there's a good chance we'll start tackling a new set of SDKs later this year.
I particularly like the section on escape hatches - though you start to see the issue with this approach when you use something like Celery, where the docs and Github issues contain a number of warnings about using Redis. RabbitMQ also tends to be very feature-rich from an MQ perspective compared to Redis, so it gets more and more difficult to support both over time.
We'd like to build in escape hatches as well - this starts with the application code being the exact same whether you're on cloud or self-hosted - and adding support for things like archiving task result storage to the object store of your choice, or swapping out the pub/sub system.
I’m guessing :shrug:
And to answer the question, no, the license doesn't restrict a company from offering a hosted version of Hatchet. We chose the license that we'd want to see if we were making a decision to adopt Hatchet.
That said, managing and running the cloud version is significantly different from running a version meant for one org -- the infra surrounding the cloud version manages hundreds and eventually thousands of different tenants. While it's all the same open-source engine + API, there's a lot of work required to distribute the different engine components in a way that's reliable and supports partitioning databases between tenants.
Also, minor thing, but the granularity around rate limiting and queues feels like quite a luxury. Excited for more here too
Cool to see them on the front page, congrats on the launch
A question around workflows having just skimmed your docs. Is it possible to define a workflow that has steps in Python and a TS app?
It's also possible to have a single DAG workflow (instead of parent/child) that has steps across multiple languages, but you'll need to use a relatively undocumented method called `RegisterAction` within each SDK and use the API to register the DAG (instead of using the built-in helpers) for this use-case. So we recommend using the parent/child workflows instead.
I hated the configuration and management complexity of RabbitMQ and Celery and pretty much everything else.
My ultimate goal was to build a message queue that was extremely fast, required absolutely zero config, and was HTTP-based, thus having no requirement for any specific client.
I developed one in Python that was pretty complete but slow, then developed a prototype in Rust that was extremely fast but incomplete.
The latest is sasquatch. It's written in Go, uses SQLite for the db, and behaves in a very similar way to Amazon SQS in that connections are HTTP and it uses long polling to wait for messages.
https://github.com/crowdwave/sasquatch
It's only in the very early stages of development and likely doesn't even compile yet, but most of the code is in place. I'm hoping to get around to the next phase of development soon.
I just love the idea of a message queue that is a single static binary and when you run it, you have a fully functioning message queue with nothing more to do - not even fiddling with Postgres.
Absolutely zero config - not minutes, hours, or days of futzing with configs and blogs and tutorials.
The problem we often hit when building apps on top of LLMs is managing LLM context windows (and sometimes swappable LLM providers), for which you need different types of worker/consumer/queue setups.
TypeScript is amazing for building full-stack web apps quickly. For a decade my go-to was Django, but everything just goes so much faster with endpoints & frontend all in the same place. But finding a good job/queue service is a little more of a challenge in this world than "just set up Celery". BullMQ is great, but doesn't work with "distributed" Redis providers like Upstash (Vercel's choice).
So, in a roundabout way, an offering like this is in a super-duper position for AI money :)
- Blake, co-author of riverqueue.com / https://github.com/riverqueue/river :)
With Hatchet there's been a little bit of a dance trying to get workflows and runs to play nicely, but all in all I was able to get everything I needed working without much trouble. You end up running quite a few more tasks than needed (essentially no-ops), or wrapping small tasks in wrapper workflows, but from both an implementation and implication standpoint, there's almost no difference.
10/10 solved problem with SAQ, 8/10 not an issue with Hatchet... 2/10 smh Celery
There are many great distributed job runners out there. I've never found one for Go that lets me have the features without running 7 processes and message queues sprawled over hosts and Docker containers.
jorb is just a framework to slap into a Go script when you want to fire a lot of work at your computer and let it run to completion.
I've tried to build this many times and this is the first time I've gotten it to stick.
Yes, you can do this with core Go primitives, but I find this abstraction to be a lot better, and (eventually) it made deadlocks easier to debug.
I'm just putting it here cause it's semi related.
The dashboard in Hatchet has a great GUI where you can navigate between all the tasks, see how they connect, see the data passed into each one, and see the return results from each task, and each one has a log box you can print information to. You can rerun tasks, override variables, trigger identical workflows, and filter tasks by metadata.
It's dramatically reduced the amount of time it takes me to spot, identify, and fix bugs. I miss the simplicity of SAQ but that's the reason I switched and it's paid off already
FWIW, I've used both dramatiq and Celery with Redis in heavy prod environments and never had an issue with debugging. And I'm having a tough time understanding how switching the underlying queue infrastructure would have made my life easier.
It's nice to be running on Postgres (i.e. not really having to worry about payload size, I heard some people were passing images from task to task) but for me that is just a nicety and wasn't a reason to switch.
If you're happy with your current infra, happy with the visibility, and there's nothing lacking in the development perspective, then yeah probably not much point in switching your infra to begin with [1]. But if you're building complicated workflows, and just want your code to run with an extreme level of visibility, it's worth checking out Hatchet.
[1] I'm sure the founders would have more to say here, but as a consumer I'm not really deep in the architecture of the product. Best I could do could be to give you 100 reasons I will never use Celery again XD
How? This issue still seems to be open after 6 years: https://github.com/celery/celery/issues/5149
https://github.com/rails/solid_queue
Just trying to understand. I do get that Hatchet would be a language-agnostic, SDK/API kind of solution.
The point of Hatchet is to support more complex behavior - like chaining tasks together, building automation around querying and retrying failed tasks, handling a lot of the fairness and concurrency use-cases you'd otherwise need to build yourself, etc - or just getting something that works out of the box and can support those use-cases in the future.
And if you are running at low volume and trying to debug user issues, a grafana panel isn't going to get you the level of granularity or admin control you need to track down the errors in your methods (rather than just at the queue level). You'd need to integrate your task queue with Sentry and a logging system - and in our case, error tracing and logging are available in the Hatchet UI.
(Wiring together 40+ preemptible TPUs was a nice crucible for learning about all of these. And much like a crucible, it was as painful as it sounds. Hatchet would’ve been nice.)
Thanks for making this!
Configurable retry delays are currently in development.