The component which needs the highest uptime is our ingestion service [1]. This ingests events from the Hatchet SDKs and is responsible for writing the workflow execution path, and then sends messages downstream to our other engine components. This is a horizontally scalable service and you should run at least 2 replicas across different AZs. Also see how to configure different services for engine components [2].
The other piece of this is PostgreSQL, use your favorite managed provider which has point-in-time restores and backups. This is the core of our self-healing, I'm not sure where it makes sense to route writes if the primary goes down.
Let me know what you need for self-hosted docs, happy to write them up for you.
[1] https://github.com/hatchet-dev/hatchet/tree/main/internal/se... [2] https://docs.hatchet.run/self-hosting/configuration-options#...