zlacker

[parent] [thread] 8 comments
1. moribv+(OP)[view] [source] 2024-03-08 20:50:52
One recurring issue I’ve had in my past position is needing to schedule an unbounded number of jobs, often months to a year from now. Example use case: a patient schedules an appointment for a follow-up in 6 months, so I schedule a series of appointment reminders in the days leading up to it. I might have millions of these jobs.

I started out by just entering a record into a database queue and polling every few seconds. Functional, but our IO costs for polling weren’t ideal, and we wanted to distribute this without using stuff like ShedLock. I switched to Redis, but it got complicated dealing with multiple dispatchers, OOM issues, having to run a secondary job to move individual tasks in and out of the immediate queue, etc. I had started looking at switching to backing it with PG and SKIP LOCKED, etc., but I’ve since changed positions.
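For what it’s worth, the PG-backed approach alluded to here usually boils down to a single claim query. A minimal sketch, assuming a hypothetical `jobs` table with `id`, `run_at`, and `status` columns (the schema and names are illustrative, not from any particular system):

```python
# Sketch of the Postgres FOR UPDATE SKIP LOCKED claim pattern.
# Multiple dispatchers can run this concurrently: rows locked by one
# dispatcher are skipped (not blocked on) by the others, so each due
# job is claimed exactly once without an external distributed lock.
CLAIM_DUE_JOBS = """
UPDATE jobs
   SET status = 'running'
 WHERE id IN (
       SELECT id
         FROM jobs
        WHERE status = 'pending'
          AND run_at <= now()
        ORDER BY run_at
        LIMIT %(batch)s
          FOR UPDATE SKIP LOCKED)
RETURNING id;
"""
```

Each poller would run this in its own transaction (e.g. via psycopg) with a `batch` size parameter, then dispatch the returned IDs to workers.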

I can see a similar use case on my horizon and wondered if Hatchet would be suitable for it.

replies(3): >>herval+c1 >>kbar13+p3 >>abelan+Md
2. herval+c1[view] [source] 2024-03-08 20:56:21
>>moribv+(OP)
why do you need to schedule things 6 months in advance, instead of, say, check everything that needs notifications in a rolling window (eg 24h ahead) and schedule those?
replies(1): >>moribv+L6
3. kbar13+p3[view] [source] 2024-03-08 21:08:08
>>moribv+(OP)
can you explain why this cannot be a simple daily cronjob to query for appointments upcoming next <time window> and send out notifications at that time? polling every few seconds seems way overkill
replies(1): >>moribv+Z7
4. moribv+L6[view] [source] [discussion] 2024-03-08 21:28:44
>>herval+c1
Well, it was a dumbed-down example. In that particular case, appointments can be added, removed, or moved at any moment, so I can’t just run one job every 24 hours to tee up the next day’s work and leave it at that. Simply polling the database for messages that are due to go out gives me my just-in-time queue, but then I need to build out the work to distribute it, and we didn’t like the IO costs.

I did end up moving it to Redis: basically ZADD an execution timestamp and job ID, then ZRANGEBYSCORE at my desired interval and remove those jobs as I successfully distribute them out to workers. I then set a fence time. At that time, a job runs to move stuff that should have run but didn’t (rare, thankfully) into a remediation queue, and to load the next block of items that should run between now and now + fence. At the service level, any item with a scheduled date within the fence gets ZADDed after being inserted into the normal database. Anything outside the fence will be picked up at the appropriate time.
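The fence logic above can be sketched in pure Python, using a plain dict to stand in for the Redis sorted set (assignment ≈ ZADD, score filter ≈ ZRANGEBYSCORE, delete ≈ ZREM). Class and method names are illustrative, not from any SDK:

```python
import time

class FenceScheduler:
    def __init__(self, fence_seconds):
        self.fence = fence_seconds
        self.zset = {}  # job_id -> execution timestamp (the "score")

    def submit(self, job_id, run_at, now=None):
        """Service-level rule: only jobs due inside the fence are
        ZADDed immediately; later jobs wait for the window loader."""
        now = time.time() if now is None else now
        if run_at <= now + self.fence:
            self.zset[job_id] = run_at  # ZADD equivalent

    def load_window(self, jobs, now=None):
        """Periodic job: top up the queue with everything scheduled
        between now and now + fence (jobs: iterable of (id, run_at))."""
        now = time.time() if now is None else now
        for job_id, run_at in jobs:
            if now <= run_at <= now + self.fence:
                self.zset[job_id] = run_at

    def poll_due(self, now=None):
        """ZRANGEBYSCORE -inf..now, then remove dispatched members."""
        now = time.time() if now is None else now
        due = sorted((j for j, ts in self.zset.items() if ts <= now),
                     key=self.zset.get)
        for job_id in due:
            del self.zset[job_id]  # ZREM equivalent
        return due
```

Against real Redis, `poll_due` would be a ZRANGEBYSCORE followed by ZREM (ideally in a Lua script or MULTI block, to avoid the mid-work crash window mentioned below).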

This worked. I was able to tighten the polling interval to get near-real-time dispatch while also noticeably reducing costs. The problems were occasional Redis issues (OOM, and having to either keep bumping up the Redis instance size or reduce the fence duration), allowing multiple pollers for redundancy and scale (I used ShedLock for that :/), and an occasional bug where the poller craps out in the middle of the Redis work, resulting in an at-least-once SLA that required downstream protections to make sure I don’t send the same message to the patient multiple times.

Again, it all works but I’m interested in seeing if there are solutions that I don’t have to hand roll.

replies(2): >>herval+pr >>tonyhb+Kv
5. moribv+Z7[view] [source] [discussion] 2024-03-08 21:36:58
>>kbar13+p3
Sure: >>39646719
6. abelan+Md[view] [source] 2024-03-08 22:13:29
>>moribv+(OP)
It wouldn't be suitable for that at the moment, but it might be after some refactors coming this weekend. I wrote a very quick scheduling API which pushes schedules as workflow triggers, but it's only supported in the Go SDK. It's also CPU-intensive at thousands of schedules, as the schedules are run as separate goroutines (on a dedicated `ticker` service) - I'm not proud of this. It was a pattern that made sense for cron schedules, and I just adapted it for one-time scheduling.

Looking ahead (and back) in the database and placing an exclusive lock on the schedule is the way to do this. You basically guarantee scheduling within +/- the polling interval even if your service goes down while holding the lock. It also lets you horizontally scale the `tickers` that poll for the schedules.

replies(1): >>moribv+Gj
7. moribv+Gj[view] [source] [discussion] 2024-03-08 22:57:05
>>abelan+Md
Thanks for the follow-up! I’ll keep an eye on the progress.
8. herval+pr[view] [source] [discussion] 2024-03-08 23:56:47
>>moribv+L6
Couldn’t you just enqueue + change a status, then check before firing? I don’t see why you’d need more than a dumb queue and a db table for that, unless you’re doing millions of qps
9. tonyhb+Kv[view] [source] [discussion] 2024-03-09 00:46:06
>>moribv+L6
I built https://www.inngest.com specifically because of healthcare flows. You should check it out, with the obvious disclaimer that I'm biased. Here's what you need:

1. Functions which allow you to declaratively sleep until a specific time, automatically rescheduling jobs (https://www.inngest.com/docs/reference/functions/step-sleep-...).

2. Declarative cancellation, which lets you automatically cancel jobs if the user reschedules their appointment (https://www.inngest.com/docs/guides/cancel-running-functions).

3. General reliability and API access.

Inngest does that for you, but again — disclaimer, I made it and am biased.
