Hi HN! We’re Dan and Tony - founders of Inngest (https://www.inngest.com/). Inngest is a developer platform and toolchain for developing, testing, and running background jobs and workflows. Inngest invokes your jobs via HTTP, wherever you deploy your code.
Shipping reliable background jobs and workflows is a time suck for any software team. They’re painful to develop locally, and getting them into production is a tedious experience of configuring infra. When you want to add scheduling, orchestrate multi-step workflows, or handle concurrency or idempotency, you spend even more time building bespoke systems - not your actual product.
Software engineers spend a ton of duplicated effort building and rebuilding this at every company. It shouldn’t be this way.
We’ve drawn on our experience building and scaling reliable, secure queueing systems across healthcare, B2B SaaS, and developer infra companies. With Inngest, we set out to create a single platform and set of developer tools to unburden the developer.
- You write functions alongside your API, in your existing codebase with our simple SDK. We invoke your functions via HTTPS, so there are no additional worker services to set up.
- End-to-end local development, with one command. Our dev server runs Inngest on any machine with a web interface to visualize, debug, and test your functions with zero additional dependencies.
- Our serverless queue calls you, so you can run your code anywhere - serverless, servers or edge.
- Inngest manages state across functions and long-running workflows for you. We handle retries, concurrency, idempotency, and coordinating parallel and sequential workloads out-of-the-box.
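To make the "manages state" and "handles retries" points above concrete, here is a minimal sketch of the general idea of step memoization: results of completed steps are stored, so retrying the function does not re-run them. This is not Inngest's implementation or SDK API; `StepRunner` and `runWithRetries` are hypothetical names, and the sketch is synchronous and in-memory for brevity.

```typescript
// Hypothetical sketch of step memoization. NOT Inngest's implementation or SDK API.
type StepState = Record<string, unknown>;

class StepRunner {
  // `state` holds results of steps that already completed, keyed by step id.
  constructor(private state: StepState = {}) {}

  // Run a step once; on later attempts return the stored result instead.
  run<T>(id: string, fn: () => T): T {
    if (id in this.state) {
      return this.state[id] as T; // already completed on an earlier attempt
    }
    const result = fn(); // may throw; nothing is stored in that case
    this.state[id] = result;
    return result;
  }
}

// Re-invoke the whole workflow until it succeeds or attempts run out.
// The same `state` object is shared across attempts, so completed steps are skipped.
function runWithRetries<T>(
  workflow: (step: StepRunner) => T,
  maxAttempts: number,
  state: StepState = {},
): T {
  let lastError: unknown;
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    try {
      return workflow(new StepRunner(state));
    } catch (err) {
      lastError = err; // a real system would back off before retrying
    }
  }
  throw lastError;
}
```

With this shape, a workflow that charges a card and then sends an email charges the card only once even if the email step fails and the function is retried.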
We’ve helped users like:
- Snaplet.dev uses Inngest to handle the lifecycle of managing preview databases for their developer platform.
- Ocoya.com rebuilt their e-commerce and social media scheduling workflows in days while dramatically simplifying their infra to run solely with Inngest + serverless functions.
- Secta.ai uses Inngest to run all of their AI image generation models on GPU-optimized instances.
Today, we have a TypeScript SDK and we will expand to other languages soon (Go is next). We’re building in the open on GitHub and we offer usage-based plans with a generous free tier.
We’re excited to share this with HN and we’re eager for your feedback! What are your experiences building systems for background jobs and workflows?
I realize you can't please everyone at all times but I'd love to have a Rust or Zig SDK option. Go is a good start in that direction I guess..
A lot of folks in the TS/JS community also don't often build distributed systems, and it's easy to get wrong. So we think they're hungry for something like Inngest that they don't need to manage, without spending weeks learning a complex system. Plus, TS gives us typing for all events/messages.
We already have a working Go SDK that we use internally and we have a test harness that will enable us to add other languages like Rust or Zig more easily. We even have a community member building a PoC for Elixir.
The last straw for me was the few times I ran into issues, often due to my own mistakes: their support was nearly real-time and worked with me to either help me solve the problem or dig in on their end to see where the issue was. Honestly, more than anything, the support gives me confidence to fully commit to this and use it across all my production apps.
Anyway, great stuff all, you’ve built something awesome here.
This is why we've built event schema versioning and versioning for functions into the platform. We have big plans for the schema management side of things that bring concepts of data governance to engineering teams. It shouldn't just be for data teams. As a bonus, we can also easily generate language types from schemas.
What else about schema management is a pain? What have you used for this?
Thanks! What type of monitoring were you looking for? We have some basic metrics now, but know we need to improve this. What metrics, alerting, observability are important for you?
1. Wait timings for jobs.
2. Run timings for jobs.
3. Timeout occurrences and stdout/stderr logs of those runs
4. Retry metrics, and if there is a retry limit, then metrics on jobs that were abandoned.
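The four metrics listed above can be derived from per-run timestamps and attempt counts. A rough sketch, assuming a hypothetical `JobRun` record shape (none of these field names come from Inngest); log capture for timed-out runs is left out:

```typescript
// Hypothetical record shape for one job run; field names are assumptions.
interface JobRun {
  id: string;
  enqueuedAt: number; // ms since epoch
  startedAt: number;  // ms since epoch
  finishedAt: number; // ms since epoch
  attempts: number;   // total attempts, including the first
  status: "succeeded" | "timed_out" | "abandoned";
}

interface JobMetrics {
  avgWaitMs: number; // 1. wait timings
  avgRunMs: number;  // 2. run timings
  timeouts: number;  // 3. timeout occurrences
  retries: number;   // 4. retry count
  abandoned: number; // 4. runs that hit the retry limit
}

function summarize(runs: JobRun[]): JobMetrics {
  const metrics: JobMetrics = { avgWaitMs: 0, avgRunMs: 0, timeouts: 0, retries: 0, abandoned: 0 };
  if (runs.length === 0) return metrics;

  let totalWait = 0;
  let totalRun = 0;
  for (const r of runs) {
    totalWait += r.startedAt - r.enqueuedAt;
    totalRun += r.finishedAt - r.startedAt;
    metrics.retries += Math.max(0, r.attempts - 1); // first attempt is not a retry
    if (r.status === "timed_out") metrics.timeouts++;
    if (r.status === "abandoned") metrics.abandoned++;
  }
  metrics.avgWaitMs = totalWait / runs.length;
  metrics.avgRunMs = totalRun / runs.length;
  return metrics;
}
```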
One thing that is easy to overlook is giving users the ability to define a specific “urgency” for their jobs which would allow for different alerting thresholds on things like running time or waiting.
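As an illustration of per-urgency thresholds, here is a small sketch. The urgency levels, threshold values, and the `checkAlerts` helper are all made up for the example and are not an Inngest feature:

```typescript
// Hypothetical urgency levels and thresholds; the values are illustrative only.
type Urgency = "low" | "normal" | "critical";

interface Thresholds {
  maxWaitMs: number;
  maxRunMs: number;
}

const THRESHOLDS: Record<Urgency, Thresholds> = {
  low: { maxWaitMs: 600_000, maxRunMs: 3_600_000 },
  normal: { maxWaitMs: 60_000, maxRunMs: 300_000 },
  critical: { maxWaitMs: 1_000, maxRunMs: 30_000 },
};

// Returns which thresholds a job exceeded, given its declared urgency.
function checkAlerts(urgency: Urgency, waitMs: number, runMs: number): string[] {
  const limits = THRESHOLDS[urgency];
  const alerts: string[] = [];
  if (waitMs > limits.maxWaitMs) alerts.push("wait");
  if (runMs > limits.maxRunMs) alerts.push("run");
  return alerts;
}
```

The same two-second wait would alert for a `critical` job but not for a `low` one, which is the point of letting users declare urgency.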
Observability is super key for background work even more so since it's not always tied to a specific user action, so you need to have a trail to understand issues.
> One thing that is easy to overlook is giving users the ability to define a specific “urgency” for their jobs which would allow for different alerting thresholds on things like running time or waiting.
We are adding prioritization for functions soon, so this is helpful for thinking about how to handle telemetry for jobs with different priority or urgency.
re: timeouts - managing timeouts usually means managing dead-letter queues and our goal is to remove the need to think about DLQs at all and build metrics and smarter retry/replay logic right into the Inngest platform.
Do I get it right that the difference between this and, for example, ActiveJob in Rails is that you handle multi-step workflows well, where there's a need to coordinate and wait for some event/thing to finish (or just sleep)? And the benefit is that it's easy to read the whole flow as it's an async function?
Being HTTP based (push vs. pull), it's easier to manage and works natively with serverless and servers.
Inngest is also event-driven, so you can fan-out and do things like have your workflow wait for another event. Our `step.waitForEvent()` allows you to pause a function until another event is received, creating dynamic jobs that can wait for additional actions or input. Also, using events allows us to replay failures super easily.
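To illustrate the general pattern of pausing work until a matching event arrives, here is a minimal in-memory sketch. It does not use or reproduce the Inngest SDK: `EventRouter`, `waitFor`, and `send` are hypothetical names, and timeouts and persistence are omitted.

```typescript
// Hypothetical in-memory sketch of "wait for event". NOT the Inngest SDK.
interface AppEvent {
  name: string;
  data: Record<string, unknown>;
}

interface Waiter {
  eventName: string;
  match: (e: AppEvent) => boolean;
  resume: (e: AppEvent) => void;
}

class EventRouter {
  private waiters: Waiter[] = [];

  // Register a paused function: it resumes when a matching event is sent.
  waitFor(eventName: string, match: (e: AppEvent) => boolean, resume: (e: AppEvent) => void): void {
    this.waiters.push({ eventName, match, resume });
  }

  // Deliver an event to every matching waiter (fan-out) and return how many resumed.
  send(event: AppEvent): number {
    const ready = this.waiters.filter((w) => w.eventName === event.name && w.match(event));
    this.waiters = this.waiters.filter((w) => !ready.includes(w));
    ready.forEach((w) => w.resume(event));
    return ready.length;
  }

  pending(): number {
    return this.waiters.length;
  }
}
```

One event can resume several paused functions at once, and a waiter whose predicate does not match stays paused until the right event shows up.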
re: ActiveJob - Yeah, multi-step workflows are a huge difference. We manage step retries and the function state for you. That makes things like sleep and coordinating between events easy. As you mentioned, it leads to simpler function definitions, so almost any engineer can write workflows quickly and easily read the code in a single place, reducing bugs due to disconnected jobs.
Agreed that alerting is important! We alert on job failures, plus we integrate with observability tools like Sentry.
For DLQs, you're right that they have value. We aren't killing DLQs but rather rethinking them with better ergonomics. Instead of having a dumping ground for unacked messages, we're developing a "replay" feature that lets you retry failed jobs over a period of time. Our planned replay feature will run failures in a separate queue, which can be cancelled at any time. The replay itself can be retried as well if there's still a problem.
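Since the replay feature is described as planned, the following is only a sketch of the idea under stated assumptions: failures are selected by a time window, re-run separately from normal traffic, can be cancelled between runs, and whatever fails again can be replayed later. `FailedRun`, `replay`, and `isCancelled` are hypothetical names, not Inngest's design.

```typescript
// Hypothetical sketch of time-window replay. NOT Inngest's planned design.
interface FailedRun {
  id: string;
  failedAt: number; // ms since epoch
  payload: unknown;
}

interface ReplayResult {
  succeeded: string[];
  failedAgain: FailedRun[]; // can be passed to replay() again later
  cancelled: boolean;
}

function replay(
  failures: FailedRun[],
  from: number,
  to: number,
  handler: (payload: unknown) => void,
  isCancelled: () => boolean = () => false,
): ReplayResult {
  const succeeded: string[] = [];
  const failedAgain: FailedRun[] = [];
  // Only failures inside the requested window are replayed.
  const selected = failures.filter((f) => f.failedAt >= from && f.failedAt <= to);

  for (const run of selected) {
    // Check for cancellation before each run so a replay can stop midway.
    if (isCancelled()) return { succeeded, failedAgain, cancelled: true };
    try {
      handler(run.payload);
      succeeded.push(run.id);
    } catch {
      failedAgain.push(run);
    }
  }
  return { succeeded, failedAgain, cancelled: false };
}
```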
*Caveat*: This is super nuanced and hotly debated, so this is a high-level take and there is no perfect answer here.
Mid term, we plan to move from SSPL to a more open license as we further develop our open source project.
As for the schema management part, we at bytebase.com have also built an OSS product to tackle this specifically.
DX is great! Writing the jobs feels very natural, much much simpler than Temporal. The development server is neat and makes debugging jobs very easy. TypeScript SDK is idiomatic, the types are properly inferred & propagated throughout the whole app.
The nice thing about writing step functions for Inngest vs regular "async worker queues" is that we can express logic, e.g. "if X then wait for event Y", with a layer of caching/retries on top.
Over the last year we've been iterating on the internals a lot to build things like:
- Concurrency (shared nothing, auto-scalable)
- Batching (have one fn run with 100 events, vs 1:1 mapping)
- Prioritization
- Replay
- Parallelization
- Branch deploys
- Rate limiting
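As an illustration of the batching item in the list above (one function run for many events instead of a 1:1 mapping), here is a minimal size-based batcher. It is a generic sketch, not how Inngest implements batching; `Batcher` is a hypothetical name and time-based flushing is left out.

```typescript
// Hypothetical size-based batcher. NOT Inngest's batching implementation.
class Batcher<E> {
  private buffer: E[] = [];

  constructor(
    private maxSize: number,
    private handler: (events: E[]) => void,
  ) {}

  // Buffer an event; invoke the handler once the batch is full.
  add(event: E): void {
    this.buffer.push(event);
    if (this.buffer.length >= this.maxSize) this.flush();
  }

  // Hand any buffered events to the handler as one batch.
  flush(): void {
    if (this.buffer.length === 0) return;
    const batch = this.buffer;
    this.buffer = [];
    this.handler(batch);
  }
}
```

A real system would also flush on a timer so a partially filled batch does not wait forever.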
The changes have been heavy, and it would be really hard for self-hosted people to handle the migrations necessary for these. Now that this is slowing, self hosting is realistically something that's possible soon. We'd prefer to offer self hosting when it's easy and ready, vs something that's a burden to operate.
What led you guys to work on this problem? What inspired you guys?