Relier¶
Reliability layer for Celery. Zero job loss.
Pre-1.0 — public API may change
Relier is currently v0.x.y. The core engine (Phoenix resurrector,
idempotency, schema envelope, admission control, fence-token protocol)
is production-grade and has been validated against the first-party
chaos suite.
The public API layout is still stabilising. Minor version bumps
(0.y.0) may introduce breaking changes to decorator options,
dispatch helpers, and CLI command shapes. Pin to a minor version
(e.g. relier==0.1.*) until 1.0 ships.
See the versioning policy at the bottom of this page and the CHANGELOG for the migration log.
Relier wraps your existing Celery setup with a self-healing reliability layer. Every task that enters your system leaves it in one of two ways: it completes successfully, or it lands in the Dead Letter Queue with a fully traceable reason.
# Before Relier, a plain Celery task.
# If the worker dies mid-execution, this is gone. Silently.
@celery_app.task
def process_order(order_id: str):
    charge_card(order_id)
    send_receipt(order_id)

# After Relier, same function, zero changes to your business logic.
# Worker dies? Relier re-queues it automatically, within 35 seconds.
# Called twice with the same order_id? Runs once. Card charged once.
@rl_task(idempotent=True)
async def process_order(order_id: str):
    await charge_card(order_id)
    await send_receipt(order_id)
Built for FastAPI, works with anything¶
Relier is async-first: the orchestration layer runs on a persistent asyncio
event loop owned by each worker process, so async tasks dispatch and execute
without spinning up a new loop per call. FastAPI integration is the design
target, and await task.apush(...) is the canonical dispatch path.
Relier also runs cleanly in synchronous code (Flask, classic synchronous
Django, management commands, plain scripts) via task.push(...).
push is a thin sync wrapper over apush that bridges back to the worker's
loop when one is running and falls back to asyncio.run() otherwise. The
reliability guarantees are identical either way.
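The bridging behaviour described above can be sketched with the standard library alone. This is a minimal illustration, not Relier's implementation: `run_sync`, the `loop` parameter, and the stand-in `apush` and `push` functions are hypothetical names chosen for the example.

```python
import asyncio
from typing import Any, Coroutine, Optional, TypeVar

T = TypeVar("T")


def run_sync(
    coro: Coroutine[Any, Any, T],
    loop: Optional[asyncio.AbstractEventLoop] = None,
    timeout: Optional[float] = None,
) -> T:
    """Run `coro` from synchronous code and return its result."""
    if loop is not None and loop.is_running():
        # A persistent loop is running in another thread: schedule the
        # coroutine onto it and block this thread until it completes.
        future = asyncio.run_coroutine_threadsafe(coro, loop)
        return future.result(timeout)
    # No running loop to bridge to: use a temporary loop for this call.
    return asyncio.run(coro)


async def apush(order_id: str) -> str:
    """Hypothetical stand-in for an async dispatch call."""
    await asyncio.sleep(0)
    return f"queued:{order_id}"


def push(order_id: str, loop: Optional[asyncio.AbstractEventLoop] = None) -> str:
    """Sync facade over the async dispatch path."""
    return run_sync(apush(order_id), loop=loop)
```

One caveat of this approach: calling `run_sync` from the thread that is already running `loop` would block that loop forever, so a wrapper like this is only safe when invoked from a different thread or when no loop is running.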
What problem does Relier solve?¶
Celery is a great task queue. But out of the box, it has some sharp edges:
| Problem | What breaks | Without Relier |
|---|---|---|
| Worker OOM-killed | All in-flight tasks | Lost forever, no trace |
| Non-idempotent retries | Double charges, duplicate emails | Your problem to solve |
| No task timeouts | Zombie tasks block workers | Manual implementation |
| Ungraceful deploys | Tasks mid-flight during restart | ~40% silently lost |
| Zero visibility | What's running right now? | Check logs and hope |
| Traffic spikes | Queue floods, cascade failures | Manual rate limiting |
| Poison-pill tasks | One bad payload loops forever | Workers keep dying |
| Rolling deploy schema drift | Old payloads on new code | Silent failures |
Relier solves all 8.
Key Features¶
Zero job loss (Phoenix Pattern)¶
Every task registers a heartbeat in Redis. If a worker is killed mid-task, the heartbeat expires. A background resurrector detects this within 35 seconds and re-queues the task on a healthy worker automatically, with the original payload intact.
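The heartbeat-and-resurrect idea can be illustrated with an in-memory model. This is a sketch under stated assumptions: `HeartbeatRegistry`, `resurrect`, and the 30-second TTL are hypothetical, a dictionary stands in for Redis key expiry, and the clock is injectable so the behaviour can be exercised without waiting.

```python
import time
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, List


@dataclass
class HeartbeatRegistry:
    """In-memory stand-in for heartbeat keys that expire after `ttl` seconds."""

    ttl: float = 30.0  # illustrative value, not a Relier default
    clock: Callable[[], float] = time.monotonic
    _expires_at: Dict[str, float] = field(default_factory=dict, init=False)
    _payloads: Dict[str, Dict[str, Any]] = field(default_factory=dict, init=False)

    def start(self, task_id: str, payload: Dict[str, Any]) -> None:
        """Record the payload and write the first heartbeat."""
        self._payloads[task_id] = payload
        self.beat(task_id)

    def beat(self, task_id: str) -> None:
        """Extend the task's lease; a live worker calls this periodically."""
        self._expires_at[task_id] = self.clock() + self.ttl

    def finish(self, task_id: str) -> None:
        """Drop all state for a task that completed normally."""
        self._expires_at.pop(task_id, None)
        self._payloads.pop(task_id, None)

    def expired(self) -> List[str]:
        """Return tasks whose heartbeat lapsed, i.e. whose worker went silent."""
        now = self.clock()
        return [tid for tid, exp in self._expires_at.items() if exp <= now]

    def payload(self, task_id: str) -> Dict[str, Any]:
        return self._payloads[task_id]


def resurrect(
    registry: HeartbeatRegistry,
    requeue: Callable[[str, Dict[str, Any]], None],
) -> List[str]:
    """One resurrector pass: re-queue every task with a lapsed heartbeat."""
    revived = []
    for task_id in registry.expired():
        requeue(task_id, registry.payload(task_id))  # original payload intact
        registry.beat(task_id)  # fresh lease for whichever worker picks it up
        revived.append(task_id)
    return revived
```

In this model, detection latency is bounded by the heartbeat TTL plus the interval between resurrector passes.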
Safe retries (Idempotency)¶
An atomic Redis Lua script ensures a task only executes once, even when retried multiple times after failure. Pass idempotent=True and Relier handles the rest.
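The execute-once behaviour can be sketched as a decorator that performs an atomic check-and-record step. This is an illustration only: the `idempotent` decorator and its `key_fn` argument are hypothetical, and a lock-guarded dictionary stands in for the atomic Redis script.

```python
import functools
import threading
from typing import Any, Callable, Dict

_lock = threading.Lock()
_results: Dict[str, Any] = {}  # stand-in for Redis: idempotency key -> result


def idempotent(key_fn: Callable[..., str]) -> Callable:
    """Run the wrapped function at most once per key; replay the stored result."""

    def decorator(fn: Callable[..., Any]) -> Callable[..., Any]:
        @functools.wraps(fn)
        def wrapper(*args: Any, **kwargs: Any) -> Any:
            key = f"{fn.__name__}:{key_fn(*args, **kwargs)}"
            # Check and record under one lock so two concurrent callers
            # with the same key cannot both execute the body.
            with _lock:
                if key in _results:
                    return _results[key]
                result = fn(*args, **kwargs)
                # Recorded only after success, so a call that raised
                # can still be retried later.
                _results[key] = result
                return result

        return wrapper

    return decorator
```

For example, decorating a `charge(order_id)` function with `@idempotent(lambda order_id: order_id)` means a second call with the same `order_id` returns the stored result without charging again. Holding one global lock while the body runs keeps the sketch short but serialises all calls; a production design would scope the claim to the individual key.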
Timeout enforcement¶
Two-tier timeout: a soft timeout fires your cleanup hook so you can save partial progress. A hard timeout terminates the task unconditionally. Both emit OpenTelemetry events.
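A two-tier timeout can be sketched in plain asyncio. The function name `run_with_timeouts` and its parameters are hypothetical, and the hard stop here relies on asyncio cancellation, which is cooperative; terminating a task that ignores cancellation would need process-level enforcement that this sketch does not show.

```python
import asyncio
from typing import Any, Callable, Coroutine, Optional


async def run_with_timeouts(
    coro: Coroutine[Any, Any, Any],
    soft: float,
    hard: float,
    on_soft_timeout: Optional[Callable[[], None]] = None,
) -> Any:
    """Await `coro`; call the hook after `soft` seconds, cancel after `hard`."""
    task = asyncio.ensure_future(coro)

    # Tier 1: wait up to the soft limit without cancelling anything.
    done, _ = await asyncio.wait({task}, timeout=soft)
    if not done:
        if on_soft_timeout is not None:
            on_soft_timeout()  # e.g. save partial progress
        # Tier 2: allow the remaining time up to the hard limit.
        done, _ = await asyncio.wait({task}, timeout=hard - soft)

    if not done:
        task.cancel()
        try:
            await task  # let the cancellation propagate into the task
        except asyncio.CancelledError:
            pass
        raise TimeoutError("hard timeout exceeded")

    return task.result()
```

`asyncio.wait` is used instead of `asyncio.wait_for` because it reports a timeout without cancelling the task, which is what lets the soft tier fire a hook while the work keeps running.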
Graceful shutdown¶
Relier intercepts SIGTERM (deploys, scale-downs, Kubernetes evictions) and waits for in-flight tasks to finish. Tasks that won't finish in time are handed off to another worker, not dropped.
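The drain-then-hand-off step can be sketched as follows. The `drain` function, its `grace` period, and the `handoff` callback are hypothetical names for illustration; how Relier actually transfers ownership of a task is not shown here.

```python
import asyncio
from typing import Callable, Dict, List


async def drain(
    in_flight: Dict[str, "asyncio.Task"],
    grace: float,
    handoff: Callable[[str], None],
) -> List[str]:
    """Wait up to `grace` seconds, then hand off whatever is still running."""
    if not in_flight:
        return []

    # Give every in-flight task the same grace period to finish on its own.
    _, pending = await asyncio.wait(set(in_flight.values()), timeout=grace)

    handed_off = []
    for task_id, task in in_flight.items():
        if task in pending:
            task.cancel()      # stop local execution
            handoff(task_id)   # e.g. re-queue for another worker
            handed_off.append(task_id)

    # Let the cancellations settle before the process exits.
    await asyncio.gather(*pending, return_exceptions=True)
    return handed_off
```

In a real worker, a routine like this would be triggered from a SIGTERM handler, for example one registered with `loop.add_signal_handler(signal.SIGTERM, ...)` on Unix, with the grace period chosen to fit inside the orchestrator's termination window.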
Dead Letter Queue¶
Tasks that can't be recovered after repeated attempts are quarantined in the DLQ with a full payload, stack trace, and resurrection history. Inspect and re-release them at any time via the CLI.
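A quarantine step of this kind can be sketched with a bounded retry loop. The `DeadLetter` record, `run_or_quarantine`, and the `max_attempts` default are hypothetical; a plain list stands in for the DLQ, and the record keeps only a subset of the fields described above.

```python
import traceback
from dataclasses import dataclass
from typing import Any, Callable, Dict, List


@dataclass
class DeadLetter:
    """What gets kept for later inspection or re-release."""

    task_id: str
    payload: Dict[str, Any]
    stack_trace: str
    attempts: int


def run_or_quarantine(
    task_id: str,
    payload: Dict[str, Any],
    handler: Callable[[Dict[str, Any]], None],
    dlq: List[DeadLetter],
    max_attempts: int = 3,  # illustrative value, not a Relier default
) -> bool:
    """Try the handler up to `max_attempts` times, then quarantine the task."""
    last_trace = ""
    for _ in range(max_attempts):
        try:
            handler(payload)
            return True
        except Exception:
            # Keep the most recent traceback for the DLQ record.
            last_trace = traceback.format_exc()

    # Every attempt failed: park the task instead of looping forever.
    dlq.append(DeadLetter(task_id, payload, last_trace, max_attempts))
    return False
```

Bounding the attempts is what stops a poison-pill payload from cycling through workers indefinitely, and storing the payload alongside the trace makes it possible to re-release the task after a fix.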
Admission control¶
A Redis Lua script enforces cluster-wide rate limits atomically. Rejected dispatches receive a Retry-After value, so workers never see traffic above configured capacity.
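An admit-or-reject decision with a retry hint can be sketched with a fixed-window counter. The algorithm is an assumption made for illustration (the rate-limiting algorithm Relier uses is not specified here), and `FixedWindowLimiter` keeps its counter in memory rather than in Redis.

```python
import time
from dataclasses import dataclass, field
from typing import Callable, Dict, Tuple


@dataclass
class FixedWindowLimiter:
    """Admit at most `limit` requests per `window` seconds."""

    limit: int
    window: float = 1.0
    clock: Callable[[], float] = time.monotonic
    _counts: Dict[int, int] = field(default_factory=dict, init=False)

    def admit(self) -> Tuple[bool, float]:
        """Return (admitted, retry_after_seconds)."""
        now = self.clock()
        bucket = int(now // self.window)  # index of the current window
        count = self._counts.get(bucket, 0)
        if count >= self.limit:
            # Rejected: tell the caller when the next window opens.
            retry_after = (bucket + 1) * self.window - now
            return False, retry_after
        # Admitted: store only the current window, dropping older ones.
        self._counts = {bucket: count + 1}
        return True, 0.0
```

In a shared deployment the read, compare, and increment must happen as one atomic step, which is the role a server-side script plays; the in-memory version gets away without it only because it runs in a single thread.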
Full OpenTelemetry observability¶
Every task, retry, resurrection, timeout, and DLQ quarantine emits OTEL spans and metrics. Plug into Grafana, Jaeger, Honeycomb, or any OTLP endpoint.
Requirements¶
- Python 3.11+
- Redis 7+ with AOF persistence and maxmemory-policy noeviction. See deployment guide. Relier preflight-checks both settings and refuses to start if either is wrong.
- Celery 5.4+
- A Python web framework (FastAPI, Flask, Django, Starlette, …) is optional; Relier works inside any Python process, including plain scripts and CLIs.
Installation¶
That's it. No GPU, no native extensions, no database required. Just Python + Redis.
Where to go next¶
If you're new¶
| I want to… | Go here |
|---|---|
| Get something running in 5 minutes | Quickstart |
| Learn what Celery is (and how Relier sits on top) | Celery Primer |
| Understand how Relier's mechanisms work | Core Concepts |
| Plug Relier into FastAPI / Flask / Django | Integration Recipes |
Once you're building¶
| I want to… | Go here |
|---|---|
| Pick the right pattern for retries, batches, locks | Patterns Cookbook |
| Get unstuck when something breaks | Troubleshooting & FAQ |
| Deploy safely across rolling code changes | Rolling Deploys & Schema Migrations |
| Deploy bare-metal, Docker dev, Docker prod, or Kubernetes | Deployment |
| Verify reliability with chaos tests | Chaos Guide |
Reference¶
| I want to… | Go here |
|---|---|
| See all @rl_task options + dispatch methods | API Reference |
| See all rl CLI commands and what they touch in Redis | CLI Reference |
| Configure Relier for production | Configuration |
| Set up dashboards and alerts | Metrics Reference |
| Read the deep-dive on internals | Architecture |
| Understand exactly what's protected against what failure | Durability, HA, & Failure Boundaries |
Versioning policy¶
Relier follows Semantic Versioning, with one explicit caveat for the pre-1.0 series:
- 0.x.PATCH: bug fixes, doc updates, internal refactors. Always safe to upgrade.
- 0.MINOR.0: feature additions and potentially breaking changes to the public API surface (@rl_task options, dispatch helpers, CLI command shapes, relier.config.Settings field names). Read the CHANGELOG before bumping.
- 1.0.0: locks the public API. From that point on, breaking changes require a major-version bump per standard SemVer.
The core engine (Redis key layout, Lua scripts, fence-token protocol, schema envelope format) is treated as an internal contract even pre-1.0: changes there are versioned and migrated, never silent.