Show HN: Hatchet – Open-source distributed task queue

>>topics+G3
Each task in Hatchet is backed by a workflow [1]. Workflows are predefined steps which are persisted in PostgreSQL. If a worker dies or crashes midway through (stops heartbeating to the engine), we reassign tasks (assuming they have retries left). We also track timeouts in the database, which means if we miss a timeout, we simply retry after some amount of time. Like I mentioned in the post, we avoid some classes of faults just by relying on PostgreSQL and persisting each workflow run, so you don't need to time out with distributed locks in Redis, for example, or worry about data loss if Redis OOMs. Our `ticker` service is basically its own worker which is assigned a lease for each step run.

We also store the input/output of each workflow step in the database. So resuming a multi-step workflow is pretty simple - we just replay the step with the same input.

To zoom out a bit - unlike many alternatives [2], the execution path of a multi-step workflow in Hatchet is declared ahead of time. There are tradeoffs to this approach; it makes it much easier to run a single-step workflow or if you know the workflow execution path ahead of time. You also avoid classes of problems related to workflow versioning, we can gracefully drain older workflow version with a different execution path. It's also more natural to debug and see a DAG execution instead of debugging procedural logic.

The clear tradeoff is that you can't try...catch the execution of a single task or concatenate a bunch of futures that you wait for later. Roadmap-wise, we're considering adding procedural execution on top of our workflows concept. Which means providing a nice API for calling `await workflow.run` and capturing errors. These would be a higher-level concept in Hatchet and are not built yet.

There are some interesting concepts around using semaphores and durable leases that are relevant here, which we're exploring [3].

[1] https://docs.hatchet.run/home/basics/workflows [2] https://temporal.io [3] https://www.citusdata.com/blog/2016/08/12/state-machines-to-...

>>abelan+1c
I think the answer is no but just to be sure: are you able to trigger step executions programmatically from within a step, even if you can't await their results?

Related, but separately: can you trigger a variable number of task executions from one step? If the answer to the previous question is yes then it would of course be trivial; if not, I'm wondering if you could i.e. have a task act as a generator and yield values, or just return a list, and have each individual item get passed off to its own execution of the next task(s) in the DAG.

For example some of the examples involve a load_docs step, but all loaded docs seem to be passed to the next step execution in the DAG together, unless I'm just misunderstanding something. How could we tweak such an example to have a separate task execution per document loaded? The benefits of durable execution and being able to resume an intensive workflow without repeating work is lessened if you can't naturally/easily control the size of the unit of work for task executions.

>>sigmar+cD
You can execute a new workflow programmatically, for example see [1]. So people have triggered, for example, 50 child workflows from a parent step. As you've identified the difficult part there is the "collect" or "gathering" step, we've had people hack around that by waiting for all the steps from a second workflow (and falling back to the list events method to get status), but this isn't an approach I'd recommend and it's not well documented. And there's no circuit breaker.

> I'm wondering if you could i.e. have a task act as a generator and yield values, or just return a list, and have each individual item get passed off to its own execution of the next task(s) in the DAG.

Yeah, we were having a conversation yesterday about this - there's probably a simple decorator we could add so that if a step returns an array, and a child step is dependent on that parent step, it fans out if a `fanout` key is set. If we can avoid unstructured trace diagrams in favor of a nice DAG-style workflow execution we'd prefer to support that.

The other thing we've started on is propagating a single "flow id" to each child workflow so we can provide the same visualization/tracing that we provide in each workflow execution. This is similar to AWS X-rays.

As I mentioned we're working on the durable workflow model, and we'll find a way to make child workflows durable in the same way activities (and child workflows) are durable on Temporal.

[1] https://docs.hatchet.run/sdks/typescript-sdk/api/admin-clien...

zlacker