Deployment¶

Relier is a plain Python library: the engine (@rl_task, the Phoenix resurrector, idempotency, DLQ, admission control, SLO tracking) is pure Python. Like Celery, it runs as ordinary processes. Docker is a convenience, not a requirement: nothing in Relier depends on running inside a container.

There is exactly one hard dependency: a reachable Redis with persistence enabled. Relier preflight-checks Redis at startup and refuses to start with a clear error if it is unreachable or misconfigured, so nothing comes up half-working.

This guide covers the four supported ways to run Relier:

Tier	Redis	How	Use for
Bare metal	You provide one	`make worker` + `make resurrector`	Local development, CI, tests, integrating into an existing host
Docker dev	Single node, AOF + RDB	`make dev` (uses `docker-compose.yml`)	A full local cluster that mirrors prod shape
Docker prod	HA: master + replicas + Sentinel + backup	`make prod` (uses `docker-compose.prod.yml`)	Production deployments on a VM or single host
Kubernetes	StatefulSet or managed Redis	YAML manifests below	Multi-host production

Relier always runs the same two process types regardless of tier: Celery workers consuming task queues, and the Phoenix resurrector scanning for dead workers. Only the surrounding infrastructure changes.

Before you deploy: what Relier needs¶

Two things, every tier:

Redis with persistence enabled and maxmemory-policy noeviction. Without persistence, a Redis restart drops every heartbeat and payload in flight, and the zero-job-loss guarantee breaks. Without noeviction, Redis can silently evict heartbeats under memory pressure, causing the resurrector to misread a live worker as dead and re-queue its tasks (duplicate execution). Relier validates maxmemory-policy at worker startup and refuses to start if it is wrong. See Production Redis configuration.
A running resurrector process (rl run-resurrector). Every worker already embeds a resurrection scanner, so surviving workers will pick up tasks from a dead worker automatically. The dedicated rl run-resurrector process provides coverage for the one edge case the embedded scanners cannot handle: all workers dying simultaneously. It is strongly recommended for production.

Everything else (admission control, SLO tracking, DLQ, idempotency) is already inside the worker process.
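As a rough illustration of the two Redis requirements above, the sketch below checks a mapping shaped like the key/value pairs that Redis `CONFIG GET` returns. The function name `check_redis_config` and its messages are hypothetical; this is not Relier's own validator, and it assumes AOF (`appendonly yes`) is the persistence setting being checked.

```python
from typing import Mapping

REQUIRED_POLICY = "noeviction"


def check_redis_config(config: Mapping[str, str]) -> list[str]:
    """Return a list of problems found in a CONFIG GET-style mapping.

    Hypothetical helper for illustration only; Relier's real startup
    check may look at different keys or report errors differently.
    """
    problems: list[str] = []

    # Persistence: this sketch treats AOF as the required mechanism.
    if config.get("appendonly", "no") != "yes":
        problems.append("appendonly is not 'yes' (AOF persistence is off)")

    # Eviction: any policy other than noeviction may drop heartbeat keys.
    policy = config.get("maxmemory-policy", "")
    if policy != REQUIRED_POLICY:
        problems.append(
            f"maxmemory-policy is {policy!r}, expected {REQUIRED_POLICY!r}"
        )

    return problems


if __name__ == "__main__":
    good = {"appendonly": "yes", "maxmemory-policy": "noeviction"}
    bad = {"appendonly": "no", "maxmemory-policy": "allkeys-lru"}
    print(check_redis_config(good))  # no problems reported
    for problem in check_redis_config(bad):
        print("problem:", problem)
```

An empty list means both requirements are met; a caller could refuse to start whenever the list is non-empty.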

Tier 1: Bare metal (no Docker)¶

The simplest way to run Relier. Useful for local development, CI, and any host where Docker is overkill.

Prerequisites: Python 3.11+ and a reachable Redis (brew install redis && redis-server, a system package, a remote instance; anything works). Set RELIER_REDIS_URL if it is not redis://localhost:6379/0.

Using the bundled Makefile¶

make setup                       # create the venv and install Relier
export RELIER_REDIS_URL=redis://localhost:6379/0

make worker                      # terminal 1: a Celery worker
make resurrector                 # terminal 2: the Phoenix resurrector

The make worker target consumes every public Relier queue plus the internal re-queue queue used by Phoenix for resurrections. The make resurrector target shells out to rl run-resurrector.

Raw commands (no Makefile)¶

# Worker consumes every queue (recommended for local dev)
celery -A relier.tasks.app worker -l info \
  -Q high_priority,default,low_priority,re-queue

# Resurrector, single process per cluster
rl run-resurrector

Bare-metal preflight¶

If Redis is not running, both processes exit immediately with:

RuntimeError: Relier cannot reach Redis (localhost:6379). ... Refusing to start.

Start Redis or fix RELIER_REDIS_URL and re-run. Nothing starts in a half-broken state.
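The fail-fast behaviour described here can be approximated with a plain TCP reachability check. The sketch below is an assumption about the general shape of such a check, not Relier's actual preflight code; `redis_endpoint` and `preflight` are hypothetical names, and a real check would also speak the Redis protocol rather than only opening a socket.

```python
import socket
from urllib.parse import urlparse


def redis_endpoint(url: str) -> tuple[str, int]:
    """Extract (host, port) from a redis:// URL, defaulting to localhost:6379."""
    parsed = urlparse(url)
    return parsed.hostname or "localhost", parsed.port or 6379


def preflight(url: str, timeout: float = 1.0) -> None:
    """Raise RuntimeError if nothing accepts TCP connections at the Redis URL.

    Hypothetical sketch: it only proves the port is open, not that the
    server behind it is a correctly configured Redis.
    """
    host, port = redis_endpoint(url)
    try:
        with socket.create_connection((host, port), timeout=timeout):
            pass  # connection succeeded; close it immediately
    except OSError as exc:
        raise RuntimeError(
            f"cannot reach Redis ({host}:{port}): {exc}. Refusing to start."
        ) from exc


if __name__ == "__main__":
    try:
        preflight("redis://localhost:6379/0")
        print("Redis port is reachable")
    except RuntimeError as err:
        print(err)
```

Running the check before any worker or resurrector logic starts is what keeps a process from coming up half-working.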

A minimal local Redis with the right config¶

For bare-metal dev, the quickest way to get a correctly-configured Redis is to borrow Relier's own config file:

redis-server scripts/redis/redis.conf

That config enables AOF + RDB persistence and sets maxmemory-policy noeviction, the only two settings Relier strictly requires.

Tier 2: Docker, development cluster¶

A full local cluster running in Docker: one Redis node (with AOF + RDB), the worker pool (three queue-specialized workers), the resurrector, and the observability stack (OTel collector + Prometheus + Grafana).

This is defined entirely in the bundled docker-compose.yml. You do not need to write your own.

Bring it up¶

make dev          # builds and starts in detached mode
make dev-logs     # follow logs
make dev-down     # stop the cluster

or directly:

docker compose up -d --build
docker compose logs -f
docker compose down

What's actually running¶

The docker-compose.yml ships with these services:

Service	Purpose
`redis`	Single-node Redis with persistence (`scripts/redis/redis.conf`)
`worker-high`	Worker consuming `high_priority,default`
`worker-default`	Worker consuming `default,low_priority`
`worker-recovery`	Worker consuming `re-queue` (Phoenix's recovery queue)
`resurrector`	The `rl run-resurrector` process
`otel-collector`	Receives OTLP from workers and exports to Prometheus
`prometheus`	Scrapes the OTel collector
`grafana`	Dashboards on `http://localhost:3000` (anonymous viewer)

Source code is bind-mounted into the worker containers (./src:/app/src) so edits in your editor take effect after restarting the affected service. This is explicitly a dev configuration; production never bind-mounts source.

Queue topology, explained¶

Relier exposes three public queues and one internal queue. The dev compose splits workers across them so that a flood of low-priority work cannot starve high-priority traffic, and so that Phoenix's recovery queue is consumed by a dedicated pool:

worker-high     ← high_priority, default
worker-default  ← default, low_priority
worker-recovery ← re-queue          (Phoenix-only; never publish into this)

Your code routes a task into a queue via @rl_task(queue="high_priority"). Publishing into re-queue from user code is rejected at decoration time, because re-queue is for resurrections only.
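To make "rejected at decoration time" concrete, here is a minimal sketch of a queue-aware decorator that refuses the reserved queue before the decorated function is ever called. `routed_task` is a hypothetical stand-in written for this example; it is not `@rl_task` and does not reflect Relier's internals or its exception types.

```python
from typing import Callable

INTERNAL_QUEUE = "re-queue"  # reserved for resurrections in this sketch


def routed_task(queue: str = "default") -> Callable[[Callable], Callable]:
    """Hypothetical decorator factory that records a task's target queue."""
    # This check runs when the decorator is applied (typically at import
    # time), so a bad queue name fails before any task is dispatched.
    if queue == INTERNAL_QUEUE:
        raise ValueError(f"{INTERNAL_QUEUE!r} is reserved for resurrections")

    def decorate(func: Callable) -> Callable:
        func.queue = queue  # remember the routing decision on the function
        return func

    return decorate


@routed_task(queue="high_priority")
def send_receipt(order_id: int) -> str:
    return f"receipt for order {order_id}"


if __name__ == "__main__":
    print(send_receipt.queue)
    try:
        routed_task(queue="re-queue")
    except ValueError as err:
        print("rejected:", err)
```

Failing at decoration time means a misrouted task surfaces as an import error in development rather than as a runtime surprise in production.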

Configuration for the dev stack¶

All app services share the same environment via a YAML anchor at the top of docker-compose.yml:

x-app-env: &app-env
  RELIER_REDIS_URL: redis://redis:6379/0
  RELIER_OTEL_ENABLED: "true"
  RELIER_OTEL_EXPORTER_OTLP_ENDPOINT: http://otel-collector:4317

To override, edit docker-compose.yml or add a .env file in the project root (Docker Compose picks it up automatically).

Tier 3: Docker, production HA cluster¶

The full reliability stack. Defined in docker-compose.prod.yml. This is what ships when you run make prod.

What it includes that dev doesn't¶

Addition	Why
1 Redis master + 2 replicas	Survives the loss of any single Redis node
3 Sentinels	Automatic failover quorum (2 of 3 needed to promote)
Authenticated Redis (`requirepass` + `masterauth`)	Locked-down brokers
Authenticated Sentinels (`requirepass`)	Sentinel-to-Sentinel auth
Backup sidecar	Hourly RDB snapshots from a replica, 7-day retention, optional S3 offsite
Filesystem checkpoint volume	Large checkpoints spill to shared storage
Per-service memory limits	Predictable resource budget
No bind-mounts	Source is baked into the image; nothing mutable from the host

Relier uses Sentinel, not Cluster. Sentinel gives transparent failover from one Redis master to a replica. Relier's working set is small (in-flight heartbeats, payloads, idempotency locks; large checkpoints spill to the filesystem backend), so sharding is unnecessary. Sentinel also keeps Lua scripts, MULTI/EXEC, and Pub/Sub semantics intact, all of which Relier uses extensively. See Durability → Layer 2.

Bring it up¶

export REDIS_PASSWORD=...      # required: Redis data-node password
export SENTINEL_PASSWORD=...   # required: Sentinel password
export GRAFANA_ADMIN_PASSWORD=... # optional, defaults to 'admin'

make prod        # builds + starts detached
make prod-down   # stop

The two _PASSWORD variables are referenced as ${REDIS_PASSWORD:?...} in the compose file, so Compose refuses to start if either is unset. Do not commit these to git. Put them in a .env file (excluded from git) or inject them from your secrets manager.

What the manifest does, at a glance¶

flowchart TD
  sentinel["Sentinel quorum\nsentinel-1 · sentinel-2 · sentinel-3"]
  workers["Workers (3 services)\n+ resurrector"]
  master[relier-redis-master]
  replicas["Replicas\nreplica-1 · replica-2"]
  backup["Backup sidecar\nhourly RDB snapshots"]

  sentinel -- monitors --> master
  workers -- connects to --> master
  master -- replicates to --> replicas
  replicas -- snapshot --> backup

When the master dies, Sentinel promotes a replica. Workers reconnect through Sentinel and the cluster keeps running. No tasks are lost.

How workers find the right Redis¶

The production manifest sets these Relier variables for every app service:

RELIER_REDIS_USE_SENTINEL: "true"
RELIER_REDIS_SENTINEL_NODES: "relier-sentinel-1:26379,relier-sentinel-2:26379,relier-sentinel-3:26379"
RELIER_REDIS_SENTINEL_MASTER_NAME: "relier-master"
RELIER_REDIS_PASSWORD: ${REDIS_PASSWORD:?...}
RELIER_REDIS_SENTINEL_PASSWORD: ${SENTINEL_PASSWORD:?...}

With RELIER_REDIS_USE_SENTINEL=true, RELIER_REDIS_URL is ignored. Relier discovers the current master through the Sentinel quorum on each connection and reconnects automatically on failover. See Configuration → Redis Sentinel.
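The comma-separated node list above has to be turned into host/port pairs before a Sentinel client can use it. The sketch below shows one way to parse it; `parse_sentinel_nodes` is a hypothetical helper, and the note about redis-py is an assumption about how a client could consume the result, not a description of Relier's code.

```python
def parse_sentinel_nodes(raw: str, default_port: int = 26379) -> list[tuple[str, int]]:
    """Parse 'host:port,host:port' into a list of (host, port) tuples.

    Entries without an explicit port fall back to default_port.
    """
    nodes: list[tuple[str, int]] = []
    for entry in raw.split(","):
        entry = entry.strip()
        if not entry:
            continue  # tolerate trailing commas and blank entries
        host, sep, port = entry.rpartition(":")
        if sep:
            nodes.append((host, int(port)))
        else:
            nodes.append((entry, default_port))
    return nodes


if __name__ == "__main__":
    raw = "relier-sentinel-1:26379,relier-sentinel-2:26379,relier-sentinel-3:26379"
    nodes = parse_sentinel_nodes(raw)
    print(nodes)
    # A list of (host, port) tuples is the shape redis-py's
    # redis.sentinel.Sentinel constructor accepts; its master_for(name)
    # method then returns a client bound to the current master.
```

Asking Sentinel for the master by name, rather than pinning a host, is what lets clients follow a failover without a config change.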

Large checkpoints in production¶

Production sets:

RELIER_CHECKPOINT_BACKEND: "filesystem"
RELIER_CHECKPOINT_DIR: "/var/lib/relier/checkpoints"

…with a shared redis_checkpoints volume mounted into every app service. This matters because a checkpoint written by worker-high may need to be read by worker-recovery when Phoenix resurrects that task, so both must see the same filesystem. See ctx.set_partial for what gets checkpointed.

If you skip the shared volume, oversized checkpoints either fail (with CheckpointTooLargeError) or get written to one container's local disk and disappear when a different worker tries to resume.

Tier 4: Kubernetes¶

For larger deployments, Relier maps cleanly onto standard Kubernetes primitives. You need three workloads:

Component	Kind	Notes
Redis	StatefulSet or managed service (ElastiCache, Memorystore, Upstash)	Must have AOF + `noeviction`
Workers	Deployment	Scales horizontally; PodDisruptionBudget recommended
Resurrector	Deployment with `replicas: 1`	One dedicated process is enough; every worker also embeds a scanner. Distributed locks prevent double-resurrection if you run more.

Redis (StatefulSet with persistence)¶

apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: relier-redis
spec:
  selector: { matchLabels: { app: relier-redis } }
  serviceName: relier-redis
  replicas: 1
  template:
    metadata: { labels: { app: relier-redis } }
    spec:
      containers:
        - name: redis
          image: redis:7-alpine
          args:
            - redis-server
            - --appendonly
            - "yes"
            - --appendfsync
            - everysec
            - --maxmemory-policy
            - noeviction
          ports: [{ containerPort: 6379 }]
          volumeMounts:
            - { name: redis-data, mountPath: /data }
          livenessProbe:
            exec: { command: ["redis-cli", "ping"] }
            initialDelaySeconds: 10
            periodSeconds: 10
  volumeClaimTemplates:
    - metadata: { name: redis-data }
      spec:
        accessModes: ["ReadWriteOnce"]
        resources: { requests: { storage: 10Gi } }
---
apiVersion: v1
kind: Service
metadata: { name: relier-redis }
spec:
  selector: { app: relier-redis }
  ports: [{ port: 6379, targetPort: 6379 }]
  clusterIP: None

Worker Deployment¶

apiVersion: apps/v1
kind: Deployment
metadata: { name: relier-worker }
spec:
  replicas: 4
  selector: { matchLabels: { app: relier-worker } }
  template:
    metadata: { labels: { app: relier-worker } }
    spec:
      containers:
        - name: worker
          image: your-registry/relier-app:latest
          command:
            - celery
            - -A
            - relier.tasks.app
            - worker
            - --loglevel=info
            - --concurrency=8
            - -Q
            - high_priority,default,low_priority,re-queue
          env:
            - { name: RELIER_REDIS_URL, value: redis://relier-redis:6379/0 }
            - { name: RELIER_HEARTBEAT_TTL, value: "10" }
            - { name: RELIER_CELERY_WORKER_CONCURRENCY, value: "8" }
            - { name: RELIER_REDIS_MAX_CONNECTIONS, value: "30" }
          resources:
            requests: { cpu: "500m", memory: "512Mi" }
            limits:   { cpu: "2",    memory: "2Gi" }
          lifecycle:
            preStop:
              exec:
                command: ["/bin/sh", "-c", "sleep 5"]
      terminationGracePeriodSeconds: 60   # ≥ RELIER_GRACEFUL_SHUTDOWN_TIMEOUT + 30s buffer

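If you want isolation between user traffic and Phoenix-injected re-queues, run a second Deployment that consumes only the recovery queue (-Q re-queue). This is optional; the default works fine for most workloads. Whichever layout you choose, make sure at least one worker pool consumes re-queue, otherwise resurrected tasks are never picked up.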

Resurrector Deployment¶

apiVersion: apps/v1
kind: Deployment
metadata: { name: relier-resurrector }
spec:
  replicas: 1   # One is enough; distributed locks guard against double-resurrection if you run more
  selector: { matchLabels: { app: relier-resurrector } }
  template:
    metadata: { labels: { app: relier-resurrector } }
    spec:
      containers:
        - name: resurrector
          image: your-registry/relier-app:latest
          command: ["rl", "run-resurrector"]
          env:
            - { name: RELIER_REDIS_URL, value: redis://relier-redis:6379/0 }
          resources:
            requests: { cpu: "100m", memory: "128Mi" }
            limits:   { cpu: "500m", memory: "512Mi" }

Graceful rolling deploys on Kubernetes¶

A rolling update sends SIGTERM to old pods while new ones start. Relier handles this correctly because it intercepts SIGTERM:

Worker receives SIGTERM.
Relier's drain phase stops accepting new tasks from the broker.
Running tasks either finish, or their heartbeats expire on shutdown.
Phoenix re-queues any unfinished tasks onto a new pod within ~12 s.
Worker exits cleanly.

Set terminationGracePeriodSeconds ≥ RELIER_GRACEFUL_SHUTDOWN_TIMEOUT + 30 s (default: 60 s) so the drain phase has room to complete. Add a PodDisruptionBudget to keep at least one worker alive during voluntary disruptions:

apiVersion: policy/v1
kind: PodDisruptionBudget
metadata: { name: relier-worker-pdb }
spec:
  minAvailable: 1
  selector: { matchLabels: { app: relier-worker } }

Production Redis configuration¶

Regardless of platform, your Redis instance MUST have these settings:

# Persistence: without this, a Redis restart loses heartbeats and payloads
appendonly yes
appendfsync everysec

# Eviction: without this, Redis can silently delete heartbeats under pressure
# and the resurrector will see live workers as dead (duplicate execution)
maxmemory-policy noeviction

# Recommended: bound memory so Redis errors on writes instead of OOM-killing
maxmemory 2gb

The bundled scripts/redis/redis.conf ships these settings (plus RDB snapshots as a fast-restart base for the backup sidecar). Both docker-compose.yml and docker-compose.prod.yml mount that file as /etc/relier/redis.conf.

Why `noeviction`?¶

Other eviction policies (allkeys-lru, volatile-lru, etc.) let Redis delete keys when it runs out of memory. Relier stores heartbeat keys (rl:hb:*) and Phoenix payloads (rl:phoenix:*) in Redis. If Redis evicts a heartbeat while the worker is alive, the resurrector sees the heartbeat as expired and re-queues the task even though the original worker is still running it. That's a duplicate execution.

With noeviction, Redis returns an error on writes when memory is full, which your application can catch, retry, or alert on. Silent data loss is not recoverable.

rl config validate checks this and exits non-zero if it is wrong.

What `everysec` AOF actually guarantees¶

appendfsync everysec is the default in the bundled scripts/redis/redis.conf. The guarantee it gives you: you lose at most 1 second of acknowledged writes if the Redis process crashes. Replication tightens that further in practice (the promoted replica usually has the last write the master ACK'd).

The bundled config also keeps no-appendfsync-on-rewrite no so that AOF rewrites do not extend the data-loss window. This trades a possible latency spike during the rewrite for a stable durability guarantee, which is the right call for a coordination plane.

For the full breakdown of how persistence, Sentinel failover, leases, fence tokens, backups, and thundering-herd defences interact, see Durability, HA, & Failure Boundaries.

Tuning AOF rewrite cadence for high-throughput deployments¶

The bundled scripts/redis/redis.conf ships:

auto-aof-rewrite-percentage 100
auto-aof-rewrite-min-size 64mb

Redis triggers an AOF rewrite whenever the file has doubled in size since the last rewrite, with a 64 MB floor. For most deployments that is fine: rewrites fire occasionally and the latency blip is small. On high-throughput clusters (hundreds of sustained dispatches per second) the AOF crosses 64 MB quickly, and rewrites can fire every few minutes. Because Relier deliberately keeps no-appendfsync-on-rewrite no (see Durability → AOF rewrite), the active fsyncs keep happening while the rewrite child is writing the new file, and the two streams contend on the same disk. That contention is where P99 spikes on apush come from.

The honest fix is fewer, larger rewrites plus faster or isolated disk. None of the levers here weaken durability:

Lever	Default	High-throughput suggestion	Effect
`auto-aof-rewrite-min-size`	64mb	512mb – 2gb	Floor before a rewrite is eligible. Raising it makes rewrites sparser and more predictable.
`auto-aof-rewrite-percentage`	100	200 – 300	How much the AOF must grow past the last rewrite before a new one fires. Higher values widen the gap between rewrites.
AOF volume placement	shared with RDB	dedicated NVMe / provisioned IOPS	Removes I/O contention entirely instead of just thinning it.

A reasonable starting point for a busy cluster:

auto-aof-rewrite-percentage 200
auto-aof-rewrite-min-size 512mb

…combined with the AOF on a dedicated NVMe or provisioned-IOPS volume. Verify with redis-cli INFO persistence after a load test: aof_pending_rewrite should be 0 most of the time, and the interval between rewrites should be measured in tens of minutes, not seconds.
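To see how the two levers interact, the sketch below models the automatic rewrite trigger as it is commonly described for Redis: a rewrite is eligible once the AOF is larger than the minimum size and has grown by at least the configured percentage over its size after the last rewrite. The function `rewrite_due` is a hypothetical model for reasoning about settings, not code taken from Redis or Relier.

```python
MB = 1024 * 1024


def rewrite_due(current_size: int, base_size: int, percentage: int, min_size: int) -> bool:
    """Model of the auto-aof-rewrite trigger (a sketch, not Redis source).

    current_size: AOF size now, in bytes
    base_size:    AOF size right after the last rewrite, in bytes
    percentage:   auto-aof-rewrite-percentage (0 disables auto rewrite)
    min_size:     auto-aof-rewrite-min-size, in bytes
    """
    if percentage <= 0 or current_size <= min_size:
        return False
    base = base_size or 1  # avoid dividing by zero on a fresh AOF
    growth = current_size * 100 // base - 100
    return growth >= percentage


if __name__ == "__main__":
    # Bundled defaults: 100 % growth, 64 MB floor.
    print(rewrite_due(128 * MB, 64 * MB, 100, 64 * MB))   # True
    # Suggested busy-cluster settings: 200 % growth, 512 MB floor.
    print(rewrite_due(128 * MB, 64 * MB, 200, 512 * MB))  # False
    print(rewrite_due(600 * MB, 200 * MB, 200, 512 * MB)) # True
```

With the defaults, an AOF that was 64 MB after its last rewrite becomes eligible again at 128 MB; with the suggested settings the same file is left alone until it passes the 512 MB floor and has tripled in size.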

What NOT to tune¶

Two settings look tempting but cost more than they save:

no-appendfsync-on-rewrite yes: skips fsyncs while a rewrite is running. Removes the latency spike, but silently extends the data-loss window to every write since the rewrite began if the process crashes mid-rewrite. Relier ships no on purpose; do not change it.
appendonly no on the master (with persistence delegated to replicas): eliminates AOF I/O on the active node, at the cost of making async replication the only durability boundary. Async replication ACKs before it replicates, so a master crash + restart-loop, or a Sentinel-failover race window, will silently drop writes that apush already returned success for. This breaks Relier's only promise. Do not do this, even with a docs warning.

If your latency is still unacceptable after the tuning above, raise an issue upstream rather than reaching for either of these knobs. The right fix is on the I/O path, not the durability contract.

Managed Redis compatibility¶

If you're using a hosted Redis service, here's whether it can satisfy Relier's two hard requirements (AOF persistence + noeviction):

Provider	AOF persistence	`noeviction`	Notes
Redis Cloud	✅ Available	✅ Available	Set in database config
AWS ElastiCache	✅ Available	✅ Available	Set via parameter group
Upstash	⚠️ Always-on	❌ Not configurable	Use Upstash only for dev/staging
Heroku Redis	✅ Available	✅ Available	Premium plans only

Relier validates maxmemory-policy at startup and refuses to start if it is wrong, so a misconfigured managed instance will surface immediately rather than silently breaking the zero-job-loss guarantee.

Capacity planning¶

Connection pool sizing¶

Each concurrent task on a worker needs up to 3 simultaneous Redis connections (heartbeat, inflight tracking, idempotency). The formula:

RELIER_REDIS_MAX_CONNECTIONS ≥ RELIER_CELERY_WORKER_CONCURRENCY × 3

With 8 concurrent tasks per worker, set RELIER_REDIS_MAX_CONNECTIONS=30 (a little headroom above 24).

rl config validate checks this and warns if undersized.
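The sizing rule is simple enough to check by hand, but a small helper makes it easy to test several worker shapes at once. The names below are hypothetical and exist only for this example; the factor of 3 comes from the formula above.

```python
CONNECTIONS_PER_TASK = 3  # heartbeat, inflight tracking, idempotency


def min_redis_connections(concurrency: int) -> int:
    """Smallest pool size that satisfies the sizing formula."""
    return concurrency * CONNECTIONS_PER_TASK


def pool_is_undersized(max_connections: int, concurrency: int) -> bool:
    """True when the configured pool cannot cover every concurrent task."""
    return max_connections < min_redis_connections(concurrency)


if __name__ == "__main__":
    print(min_redis_connections(8))    # 24
    print(pool_is_undersized(30, 8))   # False: 30 leaves headroom above 24
    print(pool_is_undersized(20, 8))   # True: 20 is below the 24 minimum
```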

Sizing worker fleet¶

A worker process is one OS process plus the memory your task code needs. Reasonable starting points:

Instance size	Workers	Concurrency	Max connections / worker
1 CPU / 1 GB	1	4	15
2 CPU / 2 GB	2	8	30
4 CPU / 4 GB	4	8	30
8 CPU / 8 GB	8	8	30

Profile your task memory and adjust --concurrency accordingly.

Admission control sizing¶

RELIER_ADMISSION_LIMIT / RELIER_ADMISSION_WINDOW = sustained tasks/second

Default: 5000 tasks per 10-second window = 500 tasks/second sustained, with burst headroom up to 5000 in any 10 s. Raise RELIER_ADMISSION_LIMIT in step with worker capacity.
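The same arithmetic can be run in both directions: from a limit and window to a sustained rate, or from a target rate back to the limit to configure. The helper names are hypothetical, and the sketch assumes the window is expressed in seconds, as in the default above.

```python
import math


def sustained_rate(limit: int, window_s: float) -> float:
    """Sustained tasks per second allowed by a limit over a window."""
    return limit / window_s


def limit_for_rate(target_rate: float, window_s: float) -> int:
    """Admission limit needed to sustain target_rate tasks per second."""
    return math.ceil(target_rate * window_s)


if __name__ == "__main__":
    print(sustained_rate(5000, 10))   # 500.0
    print(limit_for_rate(1200, 10))   # 12000
```

For example, a fleet that can absorb 1,200 tasks per second with a 10-second window would need a limit of 12,000.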

Secrets management¶

Never commit RELIER_REDIS_PASSWORD, RELIER_REDIS_SENTINEL_PASSWORD, or RELIER_SECRET_KEY:

Platform	Approach
Docker Compose	`.env` excluded from git, or Docker secrets
Kubernetes	`kubectl create secret generic relier-secrets --from-literal=...`
AWS ECS	Secrets Manager + `secrets:` in task definition
GCP Cloud Run	Secret Manager + `--set-secrets`
Fly.io	`fly secrets set KEY=value`

Health checks¶

rl doctor

Pings Redis and exits 1 on failure. Wire it into your orchestrator:

# Kubernetes liveness
livenessProbe:
  exec: { command: ["rl", "doctor"] }
  initialDelaySeconds: 15
  periodSeconds: 30
  failureThreshold: 3

rl config validate is the corresponding readiness check: it asserts the Redis policy and environment variables before declaring the worker ready to serve.

Observability stack (bundled)¶

Both docker-compose.yml and docker-compose.prod.yml include the OTel collector, Prometheus, and Grafana out of the box. When you bring the cluster up, Grafana is reachable at http://localhost:3000:

dev: anonymous viewer enabled (GF_AUTH_ANONYMOUS_ENABLED=true)
prod: anonymous disabled, admin password from GRAFANA_ADMIN_PASSWORD

If you don't want the observability stack, set RELIER_OTEL_ENABLED=false and remove the otel-collector / prometheus / grafana services from your compose file. The worker and resurrector do not require them.
