The reliability layer for FastAPI + Celery. Sync or async — zero job loss. No database. Just Redis.
# If the worker dies mid-execution tonight,
# this task is lost silently and forever.
@celery_app.task
def process_order(order_id: str):
    charge_card(order_id)
    send_receipt(order_id)
# Worker dies tonight? Phoenix auto-resurrects it!
# Deduplicates retries automatically.
@rl_task(idempotent=True)
async def process_order(order_id: str):
    await charge_card(order_id)
    await send_receipt(order_id)
Every Celery deployment has these failure modes. Most teams discover them at 2am, after a customer complains. Relier closes each one.
Worker is OOM-killed mid-execution. All in-flight tasks vanish without a trace. You don't even know it happened.
Task retries re-execute side effects. Double Stripe charges. Duplicate emails. Corrupt database state.
One stalled upstream API holds a worker hostage forever. No soft timeout. No cleanup hook. Just a zombie.
Deploy at 3pm. SIGTERM lands. 12 in-flight tasks silently dropped. Nobody notices until a customer complains.
No idea what's running right now. Which worker has which task. How long it's been. You're flying blind.
Queue floods with no backpressure. Workers cascade-fail. No admission control. No Retry-After.
One bad payload crashes the worker, gets retried, crashes again. Infinite loop. No quarantine. No DLQ.
Mid-deploy payload mismatch. Old workers pick up new-format tasks. Silent deserialization failures. Data lost.
One decorator. Four guarantees. Every task tracked from enqueue to completion — and brought back if anything in between dies.
Every task registers a Redis heartbeat with a short TTL. When a worker crashes, the heartbeat expires. The Phoenix resurrector detects it and re-queues the task on a fresh worker — automatically.
@rl_task(queue="high_priority", max_resurrections=5)
async def process_document(doc_id: str):
    # Your existing code. Zero changes needed.
    result = await store_document(doc_id)
    return result

# Worker dies? Phoenix resurrects in <35s.
# Delivery rate: 99.97%
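The heartbeat-and-expiry idea can be modelled in a few lines. This is a sketch of the general technique, not Relier's implementation: `HeartbeatStore`, `resurrect`, and the explicit `now` argument are hypothetical names chosen for illustration, and a plain dict stands in for Redis keys that would normally be written with a TTL.

```python
from dataclasses import dataclass, field


@dataclass
class HeartbeatStore:
    """In-memory stand-in for Redis keys with a TTL (hypothetical helper)."""

    ttl: float
    expires_at: dict[str, float] = field(default_factory=dict)

    def beat(self, task_id: str, now: float) -> None:
        # A live worker refreshes the key before it expires.
        self.expires_at[task_id] = now + self.ttl

    def is_alive(self, task_id: str, now: float) -> bool:
        # A missing or expired key means the owning worker stopped beating.
        return self.expires_at.get(task_id, float("-inf")) > now


def resurrect(in_flight: set[str], store: HeartbeatStore,
              queue: list[str], now: float) -> list[str]:
    """Move every in-flight task with an expired heartbeat back onto the queue."""
    dead = sorted(t for t in in_flight if not store.is_alive(t, now))
    for task_id in dead:
        in_flight.discard(task_id)
        queue.append(task_id)
    return dead


store = HeartbeatStore(ttl=10.0)
in_flight = {"task_a", "task_b"}
queue: list[str] = []

store.beat("task_a", now=0.0)
store.beat("task_b", now=0.0)
store.beat("task_a", now=8.0)  # task_a's worker is still alive
# task_b's worker crashed, so its heartbeat lapses at t=10.

requeued = resurrect(in_flight, store, queue, now=12.0)
```

Passing `now` explicitly keeps the model deterministic. A real resurrector would poll on a timer and rely on the store's own key expiry instead.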
Atomic Redis Lua check-and-set makes any task safely retryable. First run claims, executes, caches. Retry returns cached result instantly. No double charges. No duplicate emails.
# Option A: one flag. Done.
@rl_task(idempotent=True, idempotency_ttl=3600)
async def send_invoice(invoice_id: str):
    await stripe.charge(invoice_id)
    # Already ran? → cached result, charge skipped.

# Option B: manual key for custom logic.
# (handle_event is an illustrative wrapper; the snippet must live inside a function.)
async def handle_event(event_id: str):
    async with idempotency_lock(key=event_id) as lock:
        if lock.already_executed:
            return lock.cached_result
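The claim, execute, cache sequence can be sketched without Redis. This is an in-memory model under stated assumptions, not Relier's Lua script: `IdempotencyGuard` and `run_once` are hypothetical names, a per-key `asyncio.Lock` plays the role of the atomic claim, and result expiry (the TTL) is left out.

```python
import asyncio
from typing import Any, Awaitable, Callable


class IdempotencyGuard:
    """In-memory model of claim, execute, cache (hypothetical helper)."""

    def __init__(self) -> None:
        self._results: dict[str, Any] = {}
        self._locks: dict[str, asyncio.Lock] = {}

    async def run_once(self, key: str, fn: Callable[[], Awaitable[Any]]) -> Any:
        # One lock per key stands in for the atomic claim.
        lock = self._locks.setdefault(key, asyncio.Lock())
        async with lock:
            if key in self._results:
                # A retry lands here and never re-runs the side effect.
                return self._results[key]
            result = await fn()
            self._results[key] = result
            return result


charges: list[str] = []


async def charge_invoice() -> str:
    await asyncio.sleep(0)   # yield so the other deliveries get a chance to race
    charges.append("inv_1")  # the side effect that must not happen twice
    return "charged"


async def demo() -> list[str]:
    guard = IdempotencyGuard()
    # Three concurrent deliveries of the same task.
    return await asyncio.gather(
        *(guard.run_once("invoice:inv_1", charge_invoice) for _ in range(3))
    )


results = asyncio.run(demo())
```

All three deliveries receive the same result, and the charge runs once. Across processes the same claim would need a shared store, which is where an atomic Redis operation comes in.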
Two-tier timeout enforcement. Soft timeout fires your cleanup hook to save progress. Hard timeout cancels the coroutine unconditionally. Both emit OTEL events you can plug into anywhere.
# Define the hook first so the decorator can reference it.
async def save_progress(ctx: TaskContext):
    await redis.set(f"partial:{ctx.task_id}", ctx.partial_result)

@rl_task(
    soft_timeout=25,
    hard_timeout=30,
    on_soft_timeout=save_progress
)
async def process_large_job(job_id: str):
    return await run_job(job_id)
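Two-tier enforcement can be expressed with plain asyncio. The sketch below shows the general pattern only: `run_with_timeouts` is a hypothetical helper, not part of Relier, and the hook here takes no arguments to keep the example short.

```python
import asyncio
from typing import Any, Awaitable, Callable


async def run_with_timeouts(
    job: Callable[[], Awaitable[Any]],
    soft: float,
    hard: float,
    on_soft_timeout: Callable[[], Awaitable[None]],
) -> Any:
    task = asyncio.ensure_future(job())

    # Tier 1: wait up to the soft limit without disturbing the job.
    done, _ = await asyncio.wait({task}, timeout=soft)
    if not done:
        await on_soft_timeout()
        # Tier 2: give the job the remaining time up to the hard limit.
        done, _ = await asyncio.wait({task}, timeout=hard - soft)

    if done:
        return task.result()

    # Hard limit reached: cancel and wait for the cancellation to settle.
    task.cancel()
    await asyncio.gather(task, return_exceptions=True)
    raise TimeoutError(f"hard timeout after {hard}s")


events: list[str] = []


async def save_progress() -> None:
    events.append("soft")


async def slow_job() -> str:
    await asyncio.sleep(5)
    return "done"


async def demo() -> str:
    try:
        return await run_with_timeouts(
            slow_job, soft=0.05, hard=0.1, on_soft_timeout=save_progress
        )
    except TimeoutError:
        return "cancelled"


outcome = asyncio.run(demo())
```

The job keeps running while the soft hook fires, so a task that finishes between the two limits still returns its result normally.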
Relier intercepts SIGTERM from deploys, scale-downs, and K8s evictions. Worker enters drain mode, finishes current tasks, and hands off the rest. Zero task loss on every deploy.
$ rl worker drain --timeout 30
⏳ Worker rl-worker-2 entering drain mode...
✓ task_a8f2c1 completed (12.4s)
✓ task_b2d4e8 completed (18.1s)
⚠ task_c9f1a3 exceeded timeout; re-queuing
✓ Clean exit. 1 task handed off. 0 lost.
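The finish-or-hand-off step of a drain can be sketched with asyncio. This illustrates the pattern only: `drain`, `work`, and the `requeue` list are hypothetical, and Relier's actual drain logic may differ.

```python
import asyncio


async def drain(in_flight: dict[str, asyncio.Task], timeout: float,
                requeue: list[str]) -> list[str]:
    """Wait for in-flight tasks; cancel and re-queue any that outlive the timeout."""
    if not in_flight:
        return []
    _, pending = await asyncio.wait(set(in_flight.values()), timeout=timeout)
    handed_off = [tid for tid, task in in_flight.items() if task in pending]
    for task in pending:
        task.cancel()
    # Let the cancellations settle before the process exits.
    await asyncio.gather(*pending, return_exceptions=True)
    requeue.extend(handed_off)
    return handed_off


async def work(seconds: float) -> None:
    await asyncio.sleep(seconds)


async def demo() -> tuple[list[str], list[str]]:
    requeue: list[str] = []
    in_flight = {
        "task_fast": asyncio.create_task(work(0.01)),
        "task_slow": asyncio.create_task(work(5.0)),
    }
    handed_off = await drain(in_flight, timeout=0.1, requeue=requeue)
    return handed_off, requeue


handed_off, requeued = asyncio.run(demo())
```

In a real worker this would be triggered by the shutdown signal, for example through `loop.add_signal_handler(signal.SIGTERM, ...)` on Unix, after the worker stops accepting new tasks.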
The rl CLI gives you real-time visibility into every task, worker, and failure, plus the controls to act on it.
Measured overhead. Real numbers. No asterisks. Reproduce them yourself with rl bench all.
Install. Decorate. Deploy. That's the whole onboarding.
Zero database dependencies. Zero GPU dependencies. Python 3.11+.
$ pip install relier
Add @rl_task. Dispatch with .apush() from FastAPI, .push() from Django.
One command brings up workers, the Phoenix resurrector, and the OTEL exporter.
$ rl cluster up
No glue. No second service. No second database.
Worker dies → task comes back. Automatic. Median recovery < 35 seconds with full payload integrity.
Atomic Lua check-and-set. One flag, no double charges.
Two-tier timeout with cleanup hooks. No zombie workers.
SIGTERM → drain → finish or hand off. Zero loss on deploy.
Every running task, worker, and queue depth in real time.
Lua atomic rate-limit. < 1ms p99. Returns 429 + Retry-After.
Poison pills quarantined with payload + stack trace. Release when fixed.
Versioned envelope. Auto-migration on pickup. Deploy any time.
Every event emits OTEL spans. Plug into Grafana, Jaeger, Datadog, anything OTLP.
rl chaos worker-kill --watch. Prove the guarantees yourself.
async def or def — both work. .apush() from FastAPI, .push() from Django or Flask. Persistent event loop under the hood, zero per-task asyncio overhead.
rl tasks, rl worker, rl dlq, rl chaos — full control from terminal.
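The admission-control item in the list above (429 + Retry-After) follows a familiar pattern. Below is a minimal in-memory sketch of a fixed-window limiter. It is an assumption about the general technique, not Relier's Lua script: `FixedWindowLimiter`, the 202 status for accepted work, and the explicit `now` argument are all hypothetical.

```python
import math
from dataclasses import dataclass


@dataclass
class FixedWindowLimiter:
    """Admit at most `limit` requests per `window` seconds (hypothetical helper)."""

    limit: int
    window: float
    _window_start: float = 0.0
    _count: int = 0

    def admit(self, now: float) -> tuple[int, dict[str, str]]:
        # Start a fresh window once the current one has elapsed.
        if now - self._window_start >= self.window:
            self._window_start = now
            self._count = 0
        if self._count < self.limit:
            self._count += 1
            return 202, {}
        # Over the limit: tell the client when the window reopens.
        retry_after = math.ceil(self._window_start + self.window - now)
        return 429, {"Retry-After": str(max(retry_after, 1))}


limiter = FixedWindowLimiter(limit=2, window=10.0)
first = limiter.admit(now=100.0)
second = limiter.admit(now=101.0)
rejected = limiter.admit(now=103.0)
reopened = limiter.admit(now=110.0)
```

With a shared store, the check and the increment would have to happen as one atomic step so that concurrent requests cannot both slip under the limit.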
Open source. Apache 2.0. Free forever. Made with conviction in Abuja.