Redis persistence: AOF vs RDB for queues

This walkthrough sits under In-Memory vs Persistent Queue Storage and the wider Backend Frameworks & Worker Scaling guide, and zeroes in on the single config decision that determines whether a Redis-backed queue loses jobs on a crash.

Redis-backed queues (RQ, Celery-on-Redis, BullMQ, Sidekiq) hold jobs in memory. Stock Redis ships with only coarse RDB save points and AOF turned off, and container images or Helm charts often disable even those. When the process is killed (OOM, deploy, node failure), every enqueued-but-unprocessed job written since the last durable write vanishes. The concrete symptom is a customer payment that was accepted by your API, enqueued, and then never ran because the Redis pod restarted. Redis offers two persistence engines, RDB and AOF, with very different durability windows and very different effects on enqueue latency. This guide configures both, measures the fsync cost, and lands on a recommended setup for queue workloads.

Prerequisites

  • A Redis instance you control (redis.conf editable, not a locked-down managed tier).
  • A job queue running on it where you can measure enqueue latency under load.
  • Disk you understand: persistence durability is only as good as the underlying volume's fsync behavior (local SSD vs network EBS matters a lot).
  • A stated tolerance for data loss measured in jobs or seconds — this number drives the entire choice. Review exactly-once vs at-least-once delivery if you have not framed it yet.

Step 1: Understand what each engine persists

RDB writes a point-in-time binary snapshot of the whole dataset on an interval. AOF appends every write command to a log and replays it on restart. The difference is the size of the loss window.

Engine       | Mechanism                            | Loss window on crash                     | Restart cost         | Write amplification
RDB          | Periodic full snapshot (fork + dump) | Everything since last snapshot (minutes) | Fast (load one file) | Low
AOF everysec | Append log, fsync once/sec           | Up to ~1 second of writes                | Slower (replay log)  | Moderate
AOF always   | Append log, fsync every write        | ~Zero (last write only)                  | Slower (replay log)  | High

For a queue, the loss window is lost jobs. RDB's minutes-long window is usually unacceptable for anything other than disposable work. AOF everysec caps loss at roughly one second of enqueues; AOF always caps it at essentially the in-flight write.
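To turn the loss window into a number you can compare against your stated tolerance, a back-of-the-envelope estimate is enough: jobs lost is roughly enqueue rate times window length. The sketch below assumes a steady 200 jobs/sec and a 5-minute gap between snapshots; both figures and the helper name estimate_lost_jobs are illustrative, not measurements or Redis features.

```shell
# Rough estimate: jobs lost = enqueue rate (jobs/sec) x loss window (sec).
# Hypothetical helper; the rates and windows below are assumptions.
estimate_lost_jobs() {
  rate_per_sec=$1
  window_sec=$2
  echo $(( rate_per_sec * window_sec ))
}

estimate_lost_jobs 200 300   # RDB with a 5-minute gap between snapshots
estimate_lost_jobs 200 1     # AOF everysec
```

Plug in your own peak enqueue rate rather than the average, since crashes under load are the expensive ones.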

Step 2: Configure RDB snapshots

RDB alone suits queues where losing a few minutes of jobs is acceptable (idempotent analytics, caches, replaceable work).

# redis.conf — RDB snapshot triggers
save 900 1       # snapshot if >=1 key changed in 900s
save 300 10      # ...or >=10 keys in 300s
save 60 10000    # ...or >=10000 keys in 60s (busy queue -> frequent dumps)

dbfilename dump.rdb
dir /var/lib/redis

# If a background save fails, stop accepting writes so you notice immediately,
# instead of silently running with stale/no persistence.
stop-writes-on-bgsave-error yes

rdbcompression yes        # smaller file, slight CPU cost
rdbchecksum yes           # detect corruption on load

The save rules show RDB's core weakness for queues: between snapshots, nothing is durable. If a burst of jobs is enqueued and Redis crashes 50 seconds later, before any rule has tripped, all of them are lost. The fork used to snapshot can also need up to 2x memory transiently under copy-on-write on a write-heavy instance, which can OOM a memory-tight queue host.
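The 50-second scenario can be checked by hand against the three save rules. The sketch below is a simplified model of how a "save <seconds> <changes>" rule is evaluated (enough time elapsed and enough keys changed); rule_trips and any_rule_trips are hypothetical helpers, and the real server's timing check may differ at the boundaries.

```shell
# Simplified model of a "save <seconds> <changes>" rule: it trips once at
# least <seconds> have passed since the last snapshot AND at least <changes>
# keys changed in that time. Hypothetical helper, not part of Redis.
rule_trips() {
  rule_seconds=$1; rule_changes=$2; elapsed=$3; dirty=$4
  [ "$elapsed" -ge "$rule_seconds" ] && [ "$dirty" -ge "$rule_changes" ]
}

# Evaluate the three rules from the config above.
any_rule_trips() {
  elapsed=$1; dirty=$2
  rule_trips 900 1 "$elapsed" "$dirty" ||
  rule_trips 300 10 "$elapsed" "$dirty" ||
  rule_trips 60 10000 "$elapsed" "$dirty"
}

# 500 jobs enqueued, crash 50 seconds after the last snapshot:
if any_rule_trips 50 500; then
  echo "snapshot taken"
else
  echo "no snapshot: 500 jobs at risk"
fi
```

Under this model the 500-job burst is not snapshotted, which is the gap AOF is meant to close.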

Step 3: Configure AOF with the right fsync policy

AOF is the right default for durable queues. The appendfsync setting is the durability/latency dial.

# redis.conf — AOF persistence
appendonly yes
appendfilename "appendonly.aof"
appenddirname "appendonlydir"

# THE durability dial:
#   always   -> fsync after every write. ~zero loss, highest enqueue latency.
#   everysec -> fsync once/sec. <=1s loss, near-memory latency. (recommended)
#   no       -> let the OS flush. fastest, largest/undefined loss window.
appendfsync everysec

# AOF rewrite compacts the log so replay stays fast and the file stays bounded.
auto-aof-rewrite-percentage 100   # rewrite when AOF doubles since last rewrite
auto-aof-rewrite-min-size 64mb    # but never below 64mb (avoid churn on tiny logs)

# Setting this to 'yes' would skip fsync while a rewrite or snapshot child is
# running, which avoids disk contention spikes but widens the loss window.
no-appendfsync-on-rewrite no      # 'no' = keep fsyncing; safer for queues

appendfsync everysec is the production sweet spot: it bounds loss to roughly one second of writes (slightly more if the disk stalls and a background fsync is delayed) while keeping enqueue latency close to pure in-memory operation. always removes nearly all loss but, as Step 5 shows, can multiply enqueue latency several-fold because every LPUSH/ZADD waits on a physical disk flush.
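The two auto-aof-rewrite settings in the config above combine into a single trigger condition, which is easy to misread. The sketch below is a simplified model of that check, with sizes in megabytes for readability; should_rewrite is a hypothetical helper and the example sizes are assumptions.

```shell
# Simplified model of the automatic AOF rewrite trigger.
# Args: current AOF size, size after the last rewrite, percentage, min size.
# Hypothetical helper; all sizes must use the same unit (MB here).
should_rewrite() {
  current=$1; base=$2; pct=$3; min=$4
  [ "$base" -gt 0 ] || base=1               # avoid dividing by zero on a fresh log
  [ "$current" -gt "$min" ] || return 1     # below min size: never rewrite
  growth=$(( current * 100 / base - 100 ))  # percent growth since last rewrite
  [ "$growth" -ge "$pct" ]
}

should_rewrite 150 100 100 64 && echo "rewrite" || echo "no rewrite yet"  # grew 50%
should_rewrite 200 100 100 64 && echo "rewrite" || echo "no rewrite yet"  # grew 100%
should_rewrite 40 10 100 64 && echo "rewrite" || echo "no rewrite yet"    # under 64mb
```

The third case shows why the minimum size exists: a tiny log that quadruples is still cheap to replay, so rewriting it would only add churn.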

Step 4: Use hybrid RDB+AOF for fast recovery

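Redis 4.0 and later can write the AOF as an RDB-format preamble plus an incremental command tail (enabled by default since 5.0), getting RDB's fast load and AOF's small loss window together. Redis 7 additionally splits the AOF into multiple files under the appenddirname directory.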

# redis.conf — hybrid persistence (recommended for durable queues)
appendonly yes
appendfsync everysec
aof-use-rdb-preamble yes     # AOF begins with a compact RDB snapshot,
                             # then appends recent commands -> fast replay + low loss
save 300 100                 # keep periodic RDB too, as a portable backup artifact

With aof-use-rdb-preamble yes, restart loads the compact snapshot portion quickly and replays only the recent command tail, so recovery time on a large queue does not grow linearly with the full command history. This is the configuration most durable Redis queues should run.
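One way to confirm the preamble is actually in use is to look at the first bytes of the AOF base file: RDB-format data starts with the magic string REDIS, while a plain command log starts with a RESP array marker such as *2. The sketch below assumes that check is sufficient for a quick sanity test; has_rdb_preamble is a hypothetical helper, the demo files are synthetic, and the real file name and location depend on your Redis version and dir/appenddirname settings.

```shell
# Return success if the file begins with the RDB magic string "REDIS".
# Hypothetical helper; point it at your AOF (or AOF base file on Redis 7+).
has_rdb_preamble() {
  [ "$(head -c 5 "$1" 2>/dev/null)" = "REDIS" ]
}

# Demo on synthetic files (not real Redis output):
demo_dir=$(mktemp -d)
printf 'REDIS0011' > "$demo_dir/with_preamble"
printf '*2\r\n$6\r\nSELECT\r\n' > "$demo_dir/plain_aof"

for f in "$demo_dir/with_preamble" "$demo_dir/plain_aof"; do
  if has_rdb_preamble "$f"; then
    echo "$(basename "$f"): RDB preamble present"
  else
    echo "$(basename "$f"): plain command log"
  fi
done
rm -rf "$demo_dir"
```

The preamble is written when the AOF is rewritten, so a log that has never been rewritten may still look like a plain command log even with the setting enabled.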

Step 5: Measure fsync impact on enqueue latency

Do not guess the cost — measure it on your disk. Enqueue latency under always vs everysec is the number that justifies your choice.

# Compare enqueue latency under each policy on YOUR hardware.
# 1) Set everysec, run the benchmark against the queue's push operation:
redis-cli CONFIG SET appendfsync everysec
redis-benchmark -t lpush -n 100000 -q
# 2) Switch to always and re-run:
redis-cli CONFIG SET appendfsync always
redis-benchmark -t lpush -n 100000 -q
# Expect 'always' to show markedly higher latency / lower ops on networked disks,
# because every LPUSH blocks on a physical fsync. On local NVMe the gap shrinks.
# Watch real fsync stalls in production (a growing aof_delayed_fsync means fsync is falling behind):
redis-cli INFO persistence | grep -E 'aof_last_write_status|aof_delayed_fsync|aof_pending_rewrite|rdb_last_bgsave_status'
redis-cli --latency             # rising latency often = disk fsync contention

If redis-benchmark shows always cutting your enqueue throughput unacceptably, everysec plus producer-side retries and idempotent consumers (so a rare lost second of jobs can be re-enqueued at the source without double execution) is the better engineering trade than paying per-enqueue fsync latency.
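To make "unacceptably" concrete, it can help to reduce the two benchmark runs to a single slowdown factor and compare it with a budget you chose in advance. The sketch below assumes you have copied the requests-per-second figure from each run by hand; throughput_ratio is a hypothetical helper and the two numbers are placeholders, not results.

```shell
# Ratio of everysec throughput to always throughput (requests per second).
# Hypothetical helper; pass the two figures reported by your own benchmark runs.
throughput_ratio() {
  awk -v fast="$1" -v slow="$2" 'BEGIN {
    if (slow <= 0) { print "n/a"; exit 1 }
    printf "%.1f\n", fast / slow
  }'
}

# Placeholder figures: 80000 req/s under everysec, 20000 req/s under always.
echo "always is $(throughput_ratio 80000 20000)x slower than everysec"
```

Repeat each run a few times before comparing, since a single benchmark pass on shared or networked disks can vary widely.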

Verification

Prove persistence actually survives a restart — the only test that matters.

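# 1) Enqueue a marker job, then hard-kill and restart Redis.
redis-cli LPUSH task_queue '{"job_id":"persist-check","run":true}'
sleep 2                               # let AOF everysec fsync; do NOT run SAVE here,
                                      # a forced snapshot would hide a missing AOF
sudo systemctl kill -s SIGKILL redis  # real crash; 'systemctl restart' shuts down cleanly and may save first
sudo systemctl start redis            # unit name 'redis' is an assumption; some distros use 'redis-server'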

# 2) The job MUST still be there after restart:
redis-cli LRANGE task_queue 0 -1 | grep persist-check && echo "DURABLE" || echo "LOST"

# 3) Confirm the engine loaded cleanly (no corruption):
redis-cli INFO persistence | grep -E 'loading:0|aof_last_bgrewrite_status:ok|rdb_last_load_keys_loaded'

A correctly persisted queue prints DURABLE; a misconfigured one (AOF off, or RDB with no recent snapshot) prints LOST. Run this in staging as a release gate after any persistence change.
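For a release gate it is handy to wrap the INFO checks in a function that returns a pass/fail exit status instead of eyeballing grep output. The sketch below reads INFO persistence text on stdin and requires three fields to have healthy values; persistence_ok is a hypothetical helper, and the demo input is a hand-written sample rather than captured server output. redis-cli emits CRLF line endings, so the carriage returns are stripped first.

```shell
# Read "INFO persistence" output on stdin; succeed only if Redis has finished
# loading, AOF is enabled, and the last AOF write succeeded.
# Hypothetical helper; extend the field list to match your own gate.
persistence_ok() {
  info=$(tr -d '\r')
  printf '%s\n' "$info" | grep -qx 'loading:0' &&
  printf '%s\n' "$info" | grep -qx 'aof_enabled:1' &&
  printf '%s\n' "$info" | grep -qx 'aof_last_write_status:ok'
}

# Demo with a hand-written sample (not real server output):
printf 'loading:0\r\naof_enabled:1\r\naof_last_write_status:ok\r\n' |
  persistence_ok && echo "persistence healthy"

# Against a live server, assuming redis-cli can reach it:
#   redis-cli INFO persistence | persistence_ok || echo "persistence check FAILED"
```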

Gotchas & edge cases

  • appendfsync everysec still loses up to about a second of writes. It is not zero-loss. If a job is non-idempotent and irreplaceable (a one-time charge), pair everysec with an idempotency key at the producer, or move that specific work to a disk-first broker. See preventing duplicate job execution with idempotency.
  • Managed Redis may override your config. ElastiCache, Memorystore, and Upstash expose limited or different persistence knobs; AOF always may be unavailable. Confirm the actual appendfsync in effect with CONFIG GET appendfsync rather than trusting your intended config file.
  • RDB fork OOM on tight hosts. Snapshotting forks the process; copy-on-write can transiently need up to 2x memory on a write-heavy queue. Leave headroom or the snapshot is killed and stop-writes-on-bgsave-error yes then freezes enqueues.
  • Connection pool, not just persistence, gates latency. A correctly tuned AOF still bottlenecks if clients exhaust connections. Tune the client side too — see tuning Sidekiq's Redis connection pool for the pattern.

Related