Configuration¶
All Relier settings are read from environment variables with the RELIER_
prefix. The recommended approach is a .env file in your project root; Relier
loads it automatically via pydantic-settings.
# Validate your current configuration
rl config show # display all active values
rl config validate # check Redis policy + all env vars
How configuration is loaded under the hood
Settings are parsed exactly once per process, at the first call to
get_settings() (an @lru_cache-wrapped function in relier/config.py). The
resulting Settings object is frozen, so Pydantic raises if anything
tries to mutate it. That immutability is intentional: every worker,
the resurrector, the CLI, and your producer each work from a single
config snapshot for their whole lifetime, which prevents surprises like an
admission limit raised at runtime taking effect in only one process. To
change a value, edit .env (or set the env var) and restart the affected
processes.
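The load-once, frozen behaviour described above can be illustrated with a minimal standard-library sketch. This is not Relier's code: a frozen dataclass stands in for the pydantic-settings model, and the two fields shown are only a sample of the variables documented on this page.

```python
import os
from dataclasses import FrozenInstanceError, dataclass
from functools import lru_cache


@dataclass(frozen=True)
class Settings:
    """Illustrative stand-in for the real settings model (two fields only)."""
    redis_url: str
    heartbeat_ttl: int


@lru_cache(maxsize=1)
def get_settings() -> Settings:
    # The body runs once per process; later calls return the cached object.
    return Settings(
        redis_url=os.environ.get("RELIER_REDIS_URL", "redis://localhost:6379/0"),
        heartbeat_ttl=int(os.environ.get("RELIER_HEARTBEAT_TTL", "10")),
    )


if __name__ == "__main__":
    first = get_settings()
    os.environ["RELIER_HEARTBEAT_TTL"] = "15"  # changed after the first call
    assert get_settings() is first             # the same snapshot is returned
    try:
        first.heartbeat_ttl = 15               # in-place mutation is rejected
    except FrozenInstanceError:
        print("settings are frozen; restart the process to pick up changes")
```

Because the cache lives inside the process, an environment change made after the first call is invisible until that process restarts.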
Minimal production .env¶
RELIER_REDIS_URL=redis://redis:6379/0
RELIER_ADMISSION_LIMIT=10000
RELIER_CELERY_WORKER_COUNT=8
RELIER_CELERY_WORKER_CONCURRENCY=8
RELIER_OTEL_ENABLED=true
RELIER_OTEL_EXPORTER_OTLP_ENDPOINT=http://otel-collector:4317
Redis Connection¶
| Variable | Type | Default | Description |
|---|---|---|---|
| RELIER_REDIS_URL | str | redis://localhost:6379/0 | Direct Redis connection URL. Ignored if RELIER_REDIS_USE_SENTINEL=true. |
| RELIER_REDIS_PASSWORD | str | (none) | Redis requirepass value. Sensitive, never commit to version control. |
| RELIER_REDIS_MAX_CONNECTIONS | int | 20 | Connection pool size per worker process. Total connections = workers × this value. |
| RELIER_REDIS_SOCKET_TIMEOUT | float | 5.0 | Socket read timeout in seconds. |
| RELIER_REDIS_CONNECT_TIMEOUT | float | 2.0 | Connection establishment timeout in seconds. |
| RELIER_REDIS_HEALTH_CHECK_INTERVAL | int | 30 | Pool health check interval in seconds. |
Redis persistence is required
Relier stores all task state in Redis. Without persistence, a Redis restart drops every heartbeat and payload, and tasks in flight are lost.
Minimum required Redis config:
See Deployment for full Redis setup.
Redis Sentinel (High Availability)¶
For production deployments that need Redis HA without manual failover:
| Variable | Type | Default | Description |
|---|---|---|---|
| RELIER_REDIS_USE_SENTINEL | bool | false | Route connections through Redis Sentinel instead of a direct URL. |
| RELIER_REDIS_SENTINEL_NODES | str | (none) | Comma-separated host:port list. Example: sentinel-1:26379,sentinel-2:26379,sentinel-3:26379 |
| RELIER_REDIS_SENTINEL_MASTER_NAME | str | relier-master | Sentinel master group name. |
| RELIER_REDIS_SENTINEL_PASSWORD | str | (none) | Sentinel requirepass value (if set). |
RELIER_REDIS_USE_SENTINEL=true
RELIER_REDIS_SENTINEL_NODES=sentinel-1:26379,sentinel-2:26379,sentinel-3:26379
RELIER_REDIS_SENTINEL_MASTER_NAME=relier-master
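For illustration, a comma-separated node list like the one above can be split into (host, port) pairs, which is the shape redis-py's Sentinel client accepts for its node list. The helper name parse_sentinel_nodes is hypothetical and not part of Relier; how Relier parses the value internally is not documented here.

```python
def parse_sentinel_nodes(raw: str) -> list[tuple[str, int]]:
    """Split 'host:port,host:port' into [(host, port), ...]."""
    nodes = []
    for item in raw.split(","):
        item = item.strip()
        if not item:
            continue  # tolerate trailing commas and stray whitespace
        host, sep, port = item.rpartition(":")
        if not sep or not host:
            raise ValueError(f"expected host:port, got {item!r}")
        nodes.append((host, int(port)))  # int() rejects a non-numeric port
    return nodes


if __name__ == "__main__":
    print(parse_sentinel_nodes(
        "sentinel-1:26379,sentinel-2:26379,sentinel-3:26379"
    ))
```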
Worker Pool¶
| Variable | Type | Default | Description |
|---|---|---|---|
| RELIER_CELERY_WORKER_COUNT | int | 8 | Number of worker processes. Used to validate pool size. |
| RELIER_CELERY_WORKER_CONCURRENCY | int | 8 | Concurrent tasks per worker. Set this in your celery worker command with --concurrency. |
Pool sizing formula
Set RELIER_REDIS_MAX_CONNECTIONS ≥ (RELIER_CELERY_WORKER_CONCURRENCY × 3).
Each worker task needs up to 3 simultaneous Redis connections (heartbeat, inflight, idempotency).
rl config validate will warn you if the pool is undersized.
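The sizing rule is easy to check by hand. The sketch below applies the formula as stated above; the function names are illustrative and this is not the logic rl config validate runs. Note that the documented defaults (pool of 20, concurrency of 8) come out below the formula's minimum of 24, so either the pool needs raising or the concurrency lowering.

```python
def check_pool_size(max_connections: int, worker_concurrency: int,
                    connections_per_task: int = 3) -> tuple[bool, int]:
    """Return (is_large_enough, required_minimum) per the sizing formula."""
    required = worker_concurrency * connections_per_task
    return max_connections >= required, required


def total_connections(worker_count: int, max_connections: int) -> int:
    """Upper bound on Redis connections opened by all worker processes."""
    return worker_count * max_connections


if __name__ == "__main__":
    ok, required = check_pool_size(max_connections=20, worker_concurrency=8)
    print(ok, required)              # False 24
    print(total_connections(8, 20))  # 160
```

The second function reflects the "workers × pool size" total from the Redis Connection table, which is the number to compare against the Redis server's own connection limit.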
Phoenix Resurrector¶
| Variable | Type | Default | Description |
|---|---|---|---|
| RELIER_HEARTBEAT_TTL | int | 10 | Heartbeat key TTL in seconds. Worker death is detected within this window. Shorter = faster detection, more Redis writes. |
| RELIER_MAX_RESURRECTIONS | int | 5 | Maximum resurrection attempts before a task is quarantined to the DLQ. |
| RELIER_RESURRECTION_CHECK_INTERVAL | int | 2 | How often the resurrector scans for dead tasks (seconds). |
| RELIER_RESURRECTION_REQUEUE_DELAY | float | 0.05 | Delay between individual task re-queues in a batch (seconds). Prevents broker flooding. |
| RELIER_RESURRECTION_BATCH_SIZE | int | 1000 | Maximum tasks resurrected per scan pass. |
| RELIER_RESURRECTION_MAX_QUEUE_DEPTH | int | 10000 | If the recovery queue has more than this many tasks, skip the resurrection scan. Prevents runaway resurrection under sustained failures. |
Detection latency = RELIER_HEARTBEAT_TTL + RELIER_RESURRECTION_CHECK_INTERVAL.
With defaults: 10s + 2s = 12s to detect, plus broker round-trip. Total resurrection time ≤ 35s under normal conditions.
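To see how the two knobs trade off, the detection formula can be evaluated for a few settings. This is a plain arithmetic sketch; the function name is illustrative.

```python
def detection_latency(heartbeat_ttl: float, check_interval: float) -> float:
    """Upper bound (seconds) between a worker dying and the resurrector noticing.

    The heartbeat key can take up to heartbeat_ttl to expire, and the next
    scan can be up to check_interval away after that.
    """
    return heartbeat_ttl + check_interval


if __name__ == "__main__":
    print(detection_latency(10, 2))  # 12  (defaults)
    print(detection_latency(15, 2))  # 17  (RELIER_HEARTBEAT_TTL raised to 15)
```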
Why these knobs exist: thundering-herd protection
RELIER_RESURRECTION_BATCH_SIZE, RELIER_RESURRECTION_MAX_QUEUE_DEPTH, and
RELIER_RESURRECTION_REQUEUE_DELAY together form Relier's defence against
a mass-failure thundering herd. When 100 workers die at once, naive
resurrection would flood the broker. Instead:
- The resurrector replays at most RESURRECTION_BATCH_SIZE tasks per scan pass.
- Before each pass, it checks the recovery-queue depth. If it's already at RESURRECTION_MAX_QUEUE_DEPTH, the scan is deferred: expired tasks stay in the index and are picked up on a future pass once the workers have drained things.
- RESURRECTION_REQUEUE_DELAY paces the resurrector itself so it can't CPU-pin during sustained failure storms.
See Durability → Thundering-herd defences for the full picture, including the internal semaphore that bounds concurrent broker submissions.
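How the three knobs interact in a single scan pass can be sketched as follows. This is an illustration under stated assumptions, not the resurrector's implementation: scan_pass, the requeue callback, and the injectable sleep are hypothetical names, and the internal semaphore mentioned above is left out.

```python
import time
from typing import Callable, Sequence


def scan_pass(
    expired: Sequence[str],
    queue_depth: int,
    requeue: Callable[[str], None],
    batch_size: int = 1000,
    max_queue_depth: int = 10000,
    requeue_delay: float = 0.05,
    sleep: Callable[[float], None] = time.sleep,
) -> list[str]:
    """Re-queue at most batch_size expired task ids; return the ones handled."""
    if queue_depth >= max_queue_depth:
        return []  # defer: expired tasks stay in the index for a later pass
    batch = list(expired[:batch_size])
    for i, task_id in enumerate(batch):
        if i:
            sleep(requeue_delay)  # pace submissions after the first one
        requeue(task_id)
    return batch


if __name__ == "__main__":
    sent: list[str] = []
    done = scan_pass([f"task-{n}" for n in range(5)], queue_depth=0,
                     requeue=sent.append, batch_size=3, requeue_delay=0.0)
    print(done)  # ['task-0', 'task-1', 'task-2']
```

Tasks beyond the batch limit are simply left for the next pass, so a mass failure is drained over several scan intervals instead of in one burst.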
Idempotency¶
| Variable | Type | Default | Description |
|---|---|---|---|
| RELIER_IDEMPOTENCY_DEFAULT_TTL | int | 3600 | Default result cache TTL in seconds (1 hour). |
| RELIER_IDEMPOTENCY_INFLIGHT_TTL | int | 120 | How long the IN_FLIGHT sentinel lives. Must be longer than RELIER_HARD_TIMEOUT. |
IN_FLIGHT TTL must exceed hard_timeout
If RELIER_IDEMPOTENCY_INFLIGHT_TTL is shorter than hard_timeout, the sentinel can expire while the task is still running. A retry or duplicate delivery then finds no IN_FLIGHT marker and runs the task a second time alongside the first, which defeats the idempotency guarantee.
Relier validates this at decoration time and raises ValueError if violated.
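A check of this kind might look like the sketch below. The function name and error message are assumptions made for illustration; only the rule itself (the IN_FLIGHT TTL must be longer than the hard timeout, with ValueError on violation) comes from this page.

```python
def validate_inflight_ttl(inflight_ttl: int, hard_timeout: int) -> None:
    """Raise ValueError unless the IN_FLIGHT TTL outlives the hard timeout."""
    if inflight_ttl <= hard_timeout:
        raise ValueError(
            f"RELIER_IDEMPOTENCY_INFLIGHT_TTL ({inflight_ttl}s) must be greater "
            f"than hard_timeout ({hard_timeout}s)"
        )


if __name__ == "__main__":
    validate_inflight_ttl(120, 30)       # defaults: passes silently
    try:
        validate_inflight_ttl(120, 300)  # a per-task hard_timeout of 300s
    except ValueError as exc:
        print(exc)
```

The second call shows the common way to trip the rule: a per-task hard_timeout override that is larger than the global IN_FLIGHT TTL.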
Timeouts¶
| Variable | Type | Default | Description |
|---|---|---|---|
| RELIER_SOFT_TIMEOUT | int | 25 | Global default soft timeout in seconds. Overridden per-task with @rl_task(soft_timeout=N). |
| RELIER_HARD_TIMEOUT | int | 30 | Global default hard timeout in seconds. Overridden per-task with @rl_task(hard_timeout=N). |
| RELIER_GRACEFUL_SHUTDOWN_TIMEOUT | int | 30 | How long to wait for in-flight tasks during SIGTERM drain before handing them off. |
Checkpointing¶
| Variable | Type | Default | Description |
|---|---|---|---|
| RELIER_CHECKPOINT_MAX_INLINE_BYTES | int | 262144 | Maximum size of a checkpoint stored directly in Redis (256 KiB). Larger checkpoints spill to the filesystem backend. |
| RELIER_CHECKPOINT_BACKEND | str | "inline" | Storage backend for large checkpoints: "inline" (Redis only) or "filesystem". |
| RELIER_CHECKPOINT_DIR | str | /var/lib/relier/checkpoints | Directory for filesystem checkpoints. Must be shared across all workers (e.g., a mounted volume). |
Admission Control¶
| Variable | Type | Default | Description |
|---|---|---|---|
| RELIER_ADMISSION_LIMIT | int | 5000 | Maximum tasks dispatched per admission window. |
| RELIER_ADMISSION_WINDOW | int | 10 | Window duration in seconds. Requests beyond the limit return HTTP 429. |
Effective rate = RELIER_ADMISSION_LIMIT / RELIER_ADMISSION_WINDOW tasks/second.
Default: 5000 / 10s = 500 tasks/second sustained. Burst up to 5000 in any 10-second window.
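The limit-per-window idea can be sketched with a simple fixed-window counter. This page does not say whether Relier uses fixed or sliding windows, so treat the class below (and its name) as an assumption chosen for brevity, not as a description of the real admission controller.

```python
import time
from typing import Callable


class FixedWindowAdmission:
    """Admit at most `limit` tasks per `window` seconds (fixed windows)."""

    def __init__(self, limit: int = 5000, window: float = 10.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.limit = limit
        self.window = window
        self._clock = clock
        self._window_start = clock()
        self._count = 0

    def try_admit(self) -> bool:
        now = self._clock()
        if now - self._window_start >= self.window:
            self._window_start = now  # a new window begins; reset the counter
            self._count = 0
        if self._count >= self.limit:
            return False  # the caller would answer with HTTP 429
        self._count += 1
        return True

    @property
    def effective_rate(self) -> float:
        """Sustained admissions per second: limit / window."""
        return self.limit / self.window


if __name__ == "__main__":
    print(FixedWindowAdmission().effective_rate)                # 500.0
    print(FixedWindowAdmission(limit=10000).effective_rate)     # 1000.0
```

The second line corresponds to the minimal production .env above, where RELIER_ADMISSION_LIMIT=10000 with the default 10-second window gives 1000 tasks/second sustained.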
OpenTelemetry¶
| Variable | Type | Default | Description |
|---|---|---|---|
| RELIER_OTEL_ENABLED | bool | false | Enable OpenTelemetry export. |
| RELIER_OTEL_EXPORTER_OTLP_ENDPOINT | str | http://localhost:4317 | OTLP gRPC endpoint (OpenTelemetry Collector, Grafana Agent, etc.). |
When RELIER_OTEL_ENABLED=true, Relier exports:
- Distributed traces (spans) via OTLP gRPC
- Metrics via OTLP gRPC (Prometheus scrape also available via prometheus-client)
General¶
| Variable | Type | Default | Description |
|---|---|---|---|
| RELIER_LOG_LEVEL | str | "INFO" | Logging level: DEBUG, INFO, WARNING, ERROR, CRITICAL. |
| RELIER_SECRET_KEY | str | "change-in-production" | Reserved for future signing features. Set this to a random value in production. |
Updating configuration¶
# Write a value to .env and print what changed
rl config set RELIER_HEARTBEAT_TTL 15
# Then restart your workers for the change to take effect
rl worker restart rl-worker-1
Settings are read once at worker startup. The only way to apply a change to a running worker is to restart it.