Five Invariants of a Crash-Safe Background Worker

Most background workers look correct right up until the day they aren’t.

They process tasks, retry on failure, and log errors. They work in development. They often work in production. And then one day a machine reboots, a process crashes, or a dependency disappears for a few hours, and suddenly you’re left with stuck work, duplicated work, or missing work — and no clear explanation for any of it.

At that point people usually reach for a bigger framework, more retries, or more concurrency.

None of those fix the real problem.

The problem is that most background workers are built without invariants.

What follows are five invariants that must hold if you want a background worker to survive crashes, restarts, and long-running failures without losing work or lying about progress.

These are not implementation details. They are properties the system must maintain at all times.

Invariant 1: Work must exist independently of the worker

If work only exists in memory, it does not exist.

A background worker that discovers work, processes it, and tracks progress in local variables is already broken. The moment the process exits — cleanly or otherwise — that work is gone.

Crash-safe systems treat work as durable state. Tasks live somewhere persistent before a worker ever touches them. The worker is not the owner of the work; it is a temporary executor.

This changes how you reason about everything. A worker no longer “has” tasks. It merely claims them for a while.

If the worker disappears, the work remains.
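As a minimal sketch of what "durable before a worker touches it" could look like, the following uses SQLite from Python's standard library. The table layout and the names `open_store`, `enqueue`, and `pending` are assumptions made for illustration, not a prescribed design.

```python
import sqlite3


def open_store(path):
    """Open (or create) the task store. The schema is a hypothetical example."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS tasks ("
        " id INTEGER PRIMARY KEY,"
        " payload TEXT NOT NULL,"
        " status TEXT NOT NULL DEFAULT 'pending')"
    )
    conn.commit()
    return conn


def enqueue(conn, payload):
    """Persist a task before any worker sees it; returns the new task id."""
    with conn:  # commits on success, rolls back on exception
        cur = conn.execute("INSERT INTO tasks (payload) VALUES (?)", (payload,))
    return cur.lastrowid


def pending(conn):
    """List tasks that still need work, oldest first."""
    return conn.execute(
        "SELECT id, payload FROM tasks WHERE status = 'pending' ORDER BY id"
    ).fetchall()
```

A process that starts later can call `open_store` on the same path and find the task waiting, regardless of what happened to the process that enqueued it.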

Invariant 2: Claims must expire without cooperation

A common failure mode in background systems is the “stuck task.” A worker claims a task, crashes mid-execution, and nothing ever picks that task up again because the system is waiting for a signal that will never come.

If task ownership requires a worker to voluntarily release it, your system is not crash-safe.

Claims must be time-bound. They must expire even if the worker never comes back. The system must be able to say, “This task was claimed, but the claim is no longer valid,” without asking the worker for permission.

This single invariant eliminates an entire class of stuck-task failures that retries, restarts, and watchdogs never fully solve.
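One way to make claims time-bound is a lease: the claim carries an expiry timestamp, and any worker may take over a task whose lease has lapsed. The sketch below assumes a SQLite table with hypothetical columns `claimed_by` and `claim_expires_at`, and an assumed 30-second lease. The current time is passed in as `now` so the behaviour is easy to follow.

```python
import sqlite3

LEASE_SECONDS = 30.0  # assumed lease length; tune to the real task duration


def make_db():
    """Create an in-memory store with a hypothetical claim schema."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE tasks ("
        " id INTEGER PRIMARY KEY,"
        " claimed_by TEXT,"
        " claim_expires_at REAL,"
        " done INTEGER NOT NULL DEFAULT 0)"
    )
    conn.commit()
    return conn


def claim_next(conn, worker_id, now):
    """Claim the oldest task that is unclaimed or whose lease has expired.

    Returns the task id, or None if nothing is claimable right now.
    """
    with conn:
        row = conn.execute(
            "SELECT id FROM tasks"
            " WHERE done = 0"
            " AND (claimed_by IS NULL OR claim_expires_at <= ?)"
            " ORDER BY id LIMIT 1",
            (now,),
        ).fetchone()
        if row is None:
            return None
        # Re-check the condition in the UPDATE so that a competing worker
        # that claimed the task in between causes rowcount to be 0.
        cur = conn.execute(
            "UPDATE tasks SET claimed_by = ?, claim_expires_at = ?"
            " WHERE id = ? AND done = 0"
            " AND (claimed_by IS NULL OR claim_expires_at <= ?)",
            (worker_id, now + LEASE_SECONDS, row[0], now),
        )
        return row[0] if cur.rowcount == 1 else None
```

Nothing here asks the original worker for anything. If it crashed, its lease simply runs out and the next call to `claim_next` hands the task to someone else.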

Invariant 3: Failure must be represented as state, not control flow

Exceptions are not durable. Stack traces are not durable. Log messages may outlive the process, but they are not state the system can query and act on.

If the only record of a failure is that an exception was raised and caught, then the system has already forgotten why something failed.

Crash-safe systems record failure as explicit state: what failed, how it failed, whether it is retryable, and when it should be retried — if at all.

This allows the system to make decisions later, possibly in a different process, possibly on a different machine, with full context.

It also forces an important discipline: retries become intentional. The system retries because it decided to, not because a loop happened to run again.
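A sketch of failure recorded as state might look like the following. The field names, the `PermanentError` marker class, the attempt limit, and the exponential backoff are all assumptions chosen for illustration. In a real system the returned record would be written to the durable store rather than kept in memory.

```python
from dataclasses import dataclass, replace
from typing import Optional

MAX_ATTEMPTS = 5    # assumed retry budget
BASE_DELAY = 10.0   # assumed first retry delay, in seconds


class PermanentError(Exception):
    """Hypothetical marker for failures that retrying cannot fix."""


@dataclass(frozen=True)
class TaskRecord:
    id: int
    status: str = "pending"
    attempts: int = 0
    error_kind: Optional[str] = None
    error_message: Optional[str] = None
    retry_at: Optional[float] = None


def record_failure(task, exc, now):
    """Turn a caught exception into an explicit, persistable failure record."""
    attempts = task.attempts + 1
    retryable = not isinstance(exc, PermanentError) and attempts < MAX_ATTEMPTS
    return replace(
        task,
        attempts=attempts,
        status="retry_scheduled" if retryable else "failed",
        error_kind=type(exc).__name__,   # what failed
        error_message=str(exc),          # how it failed
        # When to retry, if at all: exponential backoff on the attempt count.
        retry_at=now + BASE_DELAY * 2 ** (attempts - 1) if retryable else None,
    )
```

The decision to retry is now a value in the record (`status` and `retry_at`), so a different process can pick it up later and act on it with the same context.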

Invariant 4: Waiting is a valid and visible state

Many background systems treat waiting as an absence of activity. Nothing is happening, so nothing is recorded.

That’s a mistake.

If a task is waiting — because a file is missing, a network path is offline, or an external service is unavailable — that waiting must be explicit. There should be a concrete answer to the question “Why hasn’t this run yet?” that does not involve reading logs or guessing.

Crash-safe systems treat waiting as first-class state. They record why the task is waiting and when it will be reconsidered.

This is not about observability dashboards. It is about correctness. A system that cannot explain its own inactivity cannot be trusted.
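To make waiting concrete, a task record could carry the reason it is waiting and the time it will next be considered. The sketch below is one possible shape. The names `wait_reason`, `recheck_at`, and `why_not_running` are hypothetical, and the record is held in memory here only to keep the example short.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Task:
    id: int
    status: str = "pending"
    wait_reason: Optional[str] = None
    recheck_at: Optional[float] = None


def mark_waiting(task, reason, recheck_at):
    """Record why the task cannot run yet and when to look at it again."""
    task.status = "waiting"
    task.wait_reason = reason
    task.recheck_at = recheck_at


def is_due(task, now):
    """A task is due if it is pending, or waiting and past its recheck time."""
    if task.status == "pending":
        return True
    return (
        task.status == "waiting"
        and task.recheck_at is not None
        and task.recheck_at <= now
    )


def why_not_running(task, now):
    """Answer "why hasn't this run yet?" from state alone, without logs."""
    if task.status == "waiting" and not is_due(task, now):
        remaining = task.recheck_at - now
        return f"waiting: {task.wait_reason}; recheck in {remaining:.0f}s"
    return "eligible to run"
```

The answer comes from the task's own state, so it survives restarts as long as that state is stored durably.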

Invariant 5: Progress must be idempotent by default

If re-running a task can corrupt state, your system is brittle.

Crash-safe workers assume they may run the same task more than once. They assume retries may overlap with restarts. They assume partial progress may have occurred before failure.

This forces all progress to be idempotent, or at least safely repeatable.

Once you accept this invariant, a lot of complexity disappears. You stop trying to perfectly coordinate execution and start designing operations that are safe even when coordination fails.

That is the correct direction of effort.
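One common way to make a side effect safely repeatable is to record that it was applied in the same transaction as the effect itself. The sketch below assumes a hypothetical `applied` table keyed by task id and an `accounts` table to credit. It illustrates the idea and is not the only way to get idempotence.

```python
import sqlite3


def make_db():
    """Create an in-memory example store with one account at balance 0."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE accounts (name TEXT PRIMARY KEY, balance INTEGER NOT NULL)"
    )
    conn.execute("CREATE TABLE applied (task_id INTEGER PRIMARY KEY)")
    conn.execute("INSERT INTO accounts VALUES ('alice', 0)")
    conn.commit()
    return conn


def apply_credit(conn, task_id, account, amount):
    """Credit an account at most once per task id.

    Returns True if the credit was applied now, False if it had already
    been applied by an earlier run of the same task.
    """
    with conn:  # marker and balance change commit together, or not at all
        cur = conn.execute(
            "INSERT OR IGNORE INTO applied (task_id) VALUES (?)", (task_id,)
        )
        if cur.rowcount == 0:
            return False  # a previous run already did this work
        conn.execute(
            "UPDATE accounts SET balance = balance + ? WHERE name = ?",
            (amount, account),
        )
        return True
```

Running the task a second time, whether from a retry or from a lease that expired while the first run was still finishing, leaves the balance unchanged.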

Why these invariants matter more than frameworks

You can implement all of the above with a database table and careful thinking. You can violate all of them while using a sophisticated job framework.

Crash safety does not come from tooling. It comes from constraints you refuse to violate.

Once these invariants hold, crashes stop being special events. Restarts stop being scary. Background work becomes something the system does, not something it hopes will finish.

Most systems never reach this point because they mistake retries for reliability and activity for progress.

Reliability starts with invariants.

If you’ve ever looked at a “stuck” background task and thought, “I don’t know why this isn’t running,” one of these invariants was already broken.
