Relier — Zero Job Loss for FastAPI + Celery

§01 · Problem

Eight ways a Celery worker silently loses your data.

Every Celery deployment has these failure modes. Most teams discover them at 2am, after a customer complains. Relier closes each one.

F01 — runtime

Worker OOM kill

Worker is killed mid-execution. All in-flight tasks vanish without a trace. You don't even know it happened.

F02 — retries

Non-idempotent retries

Task retries re-execute side effects. Double Stripe charges. Duplicate emails. Corrupt database state.

F03 — timeouts

No task timeouts

One stalled upstream API holds a worker hostage forever. No soft timeout. No cleanup hook. Just a zombie.

F04 — deploys

Ungraceful shutdown

Deploy at 3pm. SIGTERM lands. 12 in-flight tasks silently dropped. Nobody notices until a customer complains.

F05 — visibility

Zero visibility

No idea what's running right now. Which worker has which task. How long it's been. You're flying blind.

F06 — traffic

Traffic spikes

Queue floods with no backpressure. Workers cascade-fail. No admission control. No Retry-After.

F07 — poison

Poison-pill tasks

One bad payload crashes the worker, gets retried, crashes again. Infinite loop. No quarantine. No DLQ.

F08 — schema

Schema drift

Mid-deploy payload mismatch. Old workers pick up new-format tasks. Silent deserialization failures. Data lost.

§02 · Primitives

Reliability primitives, not boilerplate.

One decorator. Four guarantees. Every task tracked from enqueue to completion — and brought back if anything in between dies.

Zero-job-loss resurrection.

Every task registers a Redis heartbeat with a short TTL. When a worker crashes, the heartbeat expires. The Phoenix resurrector detects it and re-queues the task on a fresh worker — automatically.

Heartbeat registration on the persistent worker event loop.
OOM detection via Redis TTL expiry, checked every five seconds.
Automatic re-queue with original payload intact.
DLQ quarantine after max_resurrections exceeded.
OTEL span emitted for every resurrection event.

tasks.py python

@rl_task(queue="high_priority", max_resurrections=5)
async def process_document(doc_id: str):
    # Your existing code. Zero changes needed.
    result = await store_document(doc_id)
    return result
# Worker dies? Phoenix resurrects in <35s.
# Delivery rate: 99.97%
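
The heartbeat-and-sweep cycle described above can be sketched without Redis. This is a minimal in-memory illustration, not Relier's implementation: the `Resurrector` class, its method names, and the 30-second TTL are assumptions made for the example, and a plain dict of deadlines stands in for Redis keys with a TTL.

```python
from dataclasses import dataclass, field


@dataclass
class Resurrector:
    """Hypothetical in-memory model of heartbeat expiry and re-queueing."""

    heartbeat_ttl: float = 30.0  # illustrative value, not a documented default
    max_resurrections: int = 5
    deadlines: dict[str, float] = field(default_factory=dict)  # stands in for Redis TTL keys
    payloads: dict[str, dict] = field(default_factory=dict)
    resurrections: dict[str, int] = field(default_factory=dict)
    queue: list[str] = field(default_factory=list)
    dlq: list[str] = field(default_factory=list)

    def start(self, task_id: str, payload: dict, now: float) -> None:
        """A worker picks up a task and registers its first heartbeat."""
        self.payloads[task_id] = payload
        self.deadlines[task_id] = now + self.heartbeat_ttl

    def beat(self, task_id: str, now: float) -> None:
        """A live worker pushes the deadline forward; a dead worker stops calling this."""
        self.deadlines[task_id] = now + self.heartbeat_ttl

    def complete(self, task_id: str) -> None:
        """The task finished normally, so nothing is left to resurrect."""
        self.deadlines.pop(task_id, None)
        self.payloads.pop(task_id, None)
        self.resurrections.pop(task_id, None)

    def sweep(self, now: float) -> list[str]:
        """Periodic check: re-queue or quarantine every task whose heartbeat expired."""
        expired = [tid for tid, deadline in self.deadlines.items() if deadline <= now]
        for tid in expired:
            del self.deadlines[tid]
            count = self.resurrections.get(tid, 0)
            if count >= self.max_resurrections:
                self.dlq.append(tid)  # too many crashes: quarantine
            else:
                self.resurrections[tid] = count + 1
                self.queue.append(tid)  # original payload stays in self.payloads
        return expired
```

Passing `now` explicitly keeps the sketch deterministic; a real sweeper would read a clock and run on a timer.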

Safe retries by default.

Atomic Redis Lua check-and-set makes any task safely retryable. First run claims, executes, caches. Retry returns cached result instantly. No double charges. No duplicate emails.

Lua script: SET key value NX EX ttl — atomic claim.
On first run: claim → execute → store result.
On retry: return cached result, skip execution.
IN_FLIGHT race handling with automatic TTL expiry.

tasks.py python

# Option A: one flag. Done.
@rl_task(idempotent=True, idempotency_ttl=3600)
async def send_invoice(invoice_id: str):
    await stripe.charge(invoice_id)
    # Already ran? → cached result, charge skipped.

# Option B: manual key for custom logic.
# handle_event is a placeholder name for the coroutine that holds your logic.
async def handle_event(event_id: str):
    async with idempotency_lock(key=event_id) as lock:
        if lock.already_executed:
            return lock.cached_result
        # First run: do the work here.
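
The claim, execute, cache sequence above can be illustrated with a small stand-in for Redis. This is a sketch under stated assumptions rather than Relier's code: `IdempotencyStore`, `run_once`, and the `IN_FLIGHT` sentinel are hypothetical names, and `set_nx` only imitates the semantics of `SET key value NX EX ttl` inside one process.

```python
import asyncio
import time

IN_FLIGHT = object()  # hypothetical marker stored while the first run is executing


class IdempotencyStore:
    """In-memory stand-in for Redis; set_nx imitates `SET key value NX EX ttl`."""

    def __init__(self):
        self._data = {}  # key -> (value, expires_at)

    def set_nx(self, key, value, ttl):
        """Store the value only if the key is absent or expired. Returns True on a claim."""
        now = time.monotonic()
        entry = self._data.get(key)
        if entry is not None and entry[1] > now:
            return False
        self._data[key] = (value, now + ttl)
        return True

    def set(self, key, value, ttl):
        self._data[key] = (value, time.monotonic() + ttl)

    def get(self, key):
        entry = self._data.get(key)
        if entry is None or entry[1] <= time.monotonic():
            return None
        return entry[0]


async def run_once(store, key, fn, ttl=3600):
    """First caller claims the key and executes; later callers get the cached result."""
    if store.set_nx(key, IN_FLIGHT, ttl):
        result = await fn()
        store.set(key, result, ttl)  # replace the marker with the real result
        return result
    cached = store.get(key)
    if cached is IN_FLIGHT:
        # Another run holds the claim; the marker disappears when its TTL expires.
        raise RuntimeError(f"{key} is still in flight")
    return cached
```

If `fn` raises, the sketch leaves the marker in place until the TTL expires, which is one simple way to model the IN_FLIGHT handling; releasing the claim on failure is an equally reasonable design.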

Soft + hard timeouts with cleanup hooks.

Two-tier timeout enforcement. Soft timeout fires your cleanup hook to save progress. Hard timeout cancels the coroutine unconditionally. Both emit OTEL events you can plug into anywhere.

Soft timeout fires an async cleanup hook.
Hard timeout: unconditional task cancellation.
Save partial results before hard kill.
Both tiers emit OTEL events with rl.timeout.type.

tasks.py python

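async def save_progress(ctx: TaskContext):
    # Soft-timeout hook: persist whatever the task has produced so far.
    await redis.set(f"partial:{ctx.task_id}", ctx.partial_result)

# The hook is defined first so the name exists when the decorator runs.
@rl_task(
    soft_timeout=25,
    hard_timeout=30,
    on_soft_timeout=save_progress,
)
async def process_large_job(job_id: str):
    return await run_job(job_id)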
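
For readers curious how a two-tier timeout can be built on plain asyncio, here is a minimal sketch. It is an assumption about the general technique, not Relier's internals: `run_with_timeouts` is a hypothetical helper, and its hook takes no arguments, unlike the `TaskContext` hook shown above.

```python
import asyncio


async def run_with_timeouts(job, soft_timeout, hard_timeout, on_soft_timeout):
    """Run the coroutine `job` under a soft and a hard deadline (both in seconds)."""
    loop = asyncio.get_running_loop()
    hard_deadline = loop.time() + hard_timeout
    task = asyncio.create_task(job)

    # Tier 1: wait until the soft deadline. asyncio.wait does not cancel on timeout.
    done, _ = await asyncio.wait({task}, timeout=soft_timeout)
    if not done:
        await on_soft_timeout()  # cleanup hook runs while the job keeps going
        # Tier 2: wait for whatever is left of the hard deadline.
        remaining = max(0.0, hard_deadline - loop.time())
        done, _ = await asyncio.wait({task}, timeout=remaining)

    if done:
        return task.result()  # re-raises if the job itself failed

    task.cancel()  # hard timeout: cancel unconditionally
    await asyncio.gather(task, return_exceptions=True)  # let the cancellation land
    raise TimeoutError(f"hard timeout after {hard_timeout}s")
```

The hard deadline is computed once up front, so time spent inside the cleanup hook counts against it rather than extending it.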

Tasks finish or hand off.

Relier intercepts SIGTERM from deploys, scale-downs, and K8s evictions. Worker enters drain mode, finishes current tasks, and hands off the rest. Zero task loss on every deploy.

Worker enters drain mode — stops accepting new tasks.
Waits for current tasks to finish within grace window.
Unfinished tasks: checkpoint → re-queue elsewhere.
Clean exit with full accounting.

terminal shell

$ rl worker drain --timeout 30
⏳ Worker rl-worker-2 entering drain mode...
✓ task_a8f2c1 completed (12.4s)
✓ task_b2d4e8 completed (18.1s)
⚠ task_c9f1a3 exceeded timeout — re-queuing
✓ Clean exit. 1 task handed off. 0 lost.
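
The drain sequence can be approximated in a few lines of asyncio. This is a sketch only: `drain` is a hypothetical function, the task IDs are made up, and where a real worker would checkpoint and re-queue unfinished tasks, the example simply returns their IDs.

```python
import asyncio


async def drain(running: dict[str, asyncio.Task], grace: float):
    """Wait up to `grace` seconds, then cancel what is left and report both groups."""
    if not running:
        return [], []
    done, pending = await asyncio.wait(set(running.values()), timeout=grace)
    for task in pending:
        task.cancel()  # stop work that did not fit in the grace window
    if pending:
        await asyncio.gather(*pending, return_exceptions=True)
    completed = [tid for tid, task in running.items() if task in done]
    handed_off = [tid for tid, task in running.items() if task in pending]
    return completed, handed_off
```

On Unix, a worker could trigger this from a handler registered with `loop.add_signal_handler(signal.SIGTERM, ...)`, after it has stopped pulling new tasks from the queue.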

§03 · Developer experience

See everything. Control everything.

The rl CLI gives you real-time visibility into every task, worker, and failure, plus the tools to act on it.

§04 · Benchmarks

Relier vs vanilla Celery.

Measured overhead. Real numbers. No asterisks. Reproduce them yourself with rl bench all.

Metric | Detail | Relier v1.0 | Vanilla Celery
Task delivery rate | measured over 10M tasks | 99.97% | ~94%
Worker OOM recovery | SIGKILL → task back on a worker | < 35s | ∞ lost
Duplicate prevention | idempotent=True flag | 100% | 0%
Admission control p99 | Lua atomic rate-limit | < 1ms | n/a
Graceful shutdown | SIGTERM → drain → hand off | 100% | ~60%
Overhead per task | heartbeat + idempotency check | < 10ms | 0ms

§05 · Quickstart

Five minutes to zero job loss.

Install. Decorate. Deploy. That's the whole onboarding.

01 · install

One package.

Zero database dependencies. Zero GPU dependencies. Python 3.11+.

$ pip install relier

02 · decorate

Wrap your tasks.

Add @rl_task. Dispatch with .apush() from FastAPI, .push() from Django.

@rl_task(idempotent=True)
async def my_task(arg):
    return await do_work(arg)

# dispatch
await my_task.apush("data")

03 · run

Start the cluster.

One command brings up workers, the Phoenix resurrector, and the OTEL exporter.

$ rl cluster up

§06 · Surface area

Every reliability primitive you need, in one library.

No glue. No second service. No second database.

primitive · 01

Phoenix resurrection

Worker dies → task comes back. Automatic. Median recovery < 35 seconds with full payload integrity.

primitive · 02

Idempotency

Atomic Lua check-and-set. One flag, no double charges.

primitive · 03

Soft + hard timeouts

Two-tier timeout with cleanup hooks. No zombie workers.

primitive · 04

Graceful shutdown

SIGTERM → drain → finish or hand off. Zero loss on deploy.

primitive · 05

Inflight visibility

Every running task, worker, and queue depth in real time.

primitive · 06

Admission control

Lua atomic rate-limit. < 1ms p99. Returns 429 + Retry-After.
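
As an illustration of admission control that answers with 429 and Retry-After, here is a small token-bucket sketch. The token bucket is a technique chosen for the example; the page does not say which algorithm Relier's Lua script uses. `TokenBucket`, `admit`, and the 202 status for accepted requests are likewise assumptions.

```python
import math
from dataclasses import dataclass


@dataclass
class TokenBucket:
    """Hypothetical single-process rate limiter; Relier's version runs in Redis."""

    rate: float             # tokens added per second
    capacity: float         # maximum burst size
    tokens: float = 0.0
    updated_at: float = 0.0

    def __post_init__(self):
        self.tokens = self.capacity  # start with a full bucket

    def admit(self, now: float) -> tuple[int, dict[str, str]]:
        """Return (status, headers): 202 if admitted, 429 with Retry-After if not."""
        elapsed = max(0.0, now - self.updated_at)
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.updated_at = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return 202, {}
        # Seconds until one whole token is available, rounded up.
        retry_after = math.ceil((1.0 - self.tokens) / self.rate)
        return 429, {"Retry-After": str(retry_after)}
```

In a multi-worker deployment the refill and the decrement have to happen atomically in shared storage, which is the role a Lua script plays in Redis.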

primitive · 07

Dead letter queue

Poison pills quarantined with payload + stack trace. Release when fixed.

primitive · 08

Schema versioning

Versioned envelope. Auto-migration on pickup. Deploy any time.
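
One way to picture a versioned envelope with migration on pickup is the sketch below. Everything in it is hypothetical: the `{"v": ..., "payload": ...}` layout, the `MIGRATIONS` table, and the field names are invented for the example and are not Relier's wire format.

```python
from typing import Callable

Migration = Callable[[dict], dict]

# Hypothetical migrations: entry N upgrades a payload from version N to N + 1.
MIGRATIONS: dict[int, Migration] = {
    1: lambda p: {"doc_id": p["id"]},                             # v1 -> v2: rename field
    2: lambda p: {**p, "priority": p.get("priority", "normal")},  # v2 -> v3: add default
}
CURRENT_VERSION = 3


def wrap(payload: dict, version: int = CURRENT_VERSION) -> dict:
    """Producer side: tag the payload with the schema version it was written in."""
    return {"v": version, "payload": payload}


def unwrap(envelope: dict) -> dict:
    """Consumer side: upgrade older payloads step by step, refuse newer ones."""
    version, payload = envelope["v"], envelope["payload"]
    if version > CURRENT_VERSION:
        # An old worker saw a new-format task: fail loudly instead of misreading it.
        raise ValueError(f"payload version {version} is newer than {CURRENT_VERSION}")
    while version < CURRENT_VERSION:
        payload = MIGRATIONS[version](payload)
        version += 1
    return payload
```

Raising on a newer version lets the task be re-queued for an upgraded worker, rather than deserialized incorrectly by an old one.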

primitive · 09

OpenTelemetry native

Every event emits OTEL spans. Plug into Grafana, Jaeger, Datadog, anything OTLP.

primitive · 10

Chaos engineering CLI

rl chaos worker-kill --watch. Prove the guarantees yourself.

primitive · 11

Sync + async, one API

async def or def — both work. .apush() from FastAPI, .push() from Django or Flask. Persistent event loop under the hood, zero per-task asyncio overhead.

primitive · 12

First-class CLI

rl tasks, rl worker, rl dlq, rl chaos — full control from terminal.

When your worker dies tonight,
your tasks come back.

Built for engineers at 2am
whose queue just died.
