articles/distributed-kv-store-interview-summary.md

Summary: "Airbnb Staff Interview: Designing a Distributed Key-Value Store"

Source: https://freedium-mirror.cfd/airbnb-staff-interview-designing-a-distributed-key-value-store-11f49fc65da5 Author: Anurag Goel ยท ~6 min read ยท February 20, 2026

Core Message

"Design a key-value store" is a classic staff-level interview question that seems simple but quietly murders candidates with its depth. The interviewer isn't looking for a checklist โ€” they want to see how storage internals, distributed systems theory, and workload-specific constraints connect into a coherent story.

The Three Load-Bearing Constraints

  • Append-only writes โ€” no in-place modification; every update appends to a log
  • Strong consistency โ€” reads must reflect the latest committed write, even across servers
  • Values up to 2 GB โ€” can't buffer full values in memory on read or write

Together these eliminate obvious designs: no in-place hash tables, no Redis, no eventual-consistency hand-waving.

Write Path: LSM-Trees

The append-only constraint is actually a gift โ€” sequential writes crush random writes. Architecture:

  1. WAL (Write-Ahead Log) โ€” append-only durability on disk
  2. Memtable โ€” in-memory sorted buffer absorbing writes at memory speed
  3. SSTables โ€” when memtable fills, flush to disk as an immutable sorted file
  4. Compaction โ€” background merge of SSTables. Leveled compaction for read-heavy workloads (fewer files to scan), tiered for write-heavy (less write amplification)

Handling the 2 GB Problem: Value Separation (WiscKey)

Stuffing giant values into SSTables makes compaction catastrophic (re-reading and re-writing massive blobs on every merge = write amplification).

Fix: keep only keys + small pointers (file ID + byte offset) in the SSTable; store actual blobs in a separate append-only value log. Compaction only touches tiny metadata.

Always stream large values in 4โ€“16 MB chunks with checksums. Maintain a chunk manifest per key for resumable uploads, range reads, and crash recovery.

Read Path & Strong Consistency

Strong consistency = linearizability: after a write, any subsequent read from any server must see it.

  • Leader-based replication with Raft โ€” one leader per partition handles all writes
  • Writes acknowledged only after a majority quorum has persisted
  • Reads go to the leader by default
  • Fencing tokens (epoch numbers) prevent stale ex-leaders from serving data after failover โ€” non-negotiable for true linearizability

Scaling reads: lease-based follower reads. The leader grants timed leases to followers; followers can serve reads within that window without sacrificing consistency.

Lookup order: memtable โ†’ immutable memtables โ†’ SSTables (newest first). Bloom filters skip SSTables that definitely don't contain the key.

Conditional Operations (CAS)

ETags / version numbers on every value. A conditional write says: "set this, but only if version is still X." With leader-based design, this is a single Raft log entry combining precondition check + new value. If the check fails, the entry is rejected atomically.

Failure Recovery

  • Crash recovery: replay WAL from last checkpoint. Periodic checkpointing is critical at 100 TB scale โ€” you can't replay from the beginning of time
  • Node loss: rebuild global index from SSTable metadata (not full data scan); rebuild value-log GC metadata lazily in the background
  • Torn writes: caught by checksums; mismatched chunks get discarded, client retries

Multi-Region & CAP

At multi-region scale, every quorum write pays cross-region latency. Mitigation: place majority of replicas in one primary region with one remote replica โ€” fast commits most of the time, cross-region durability.

During a network partition: choose CP (block writes) for a strongly consistent system. Document the trade-off explicitly and design SLAs around it.

Bottom Line โ€” What the Interviewer Wants

The deepest question in the problem is about engineering taste: where do you put the complexity?

Complexity goes to... To keep...
Background compaction Foreground writes fast
Value-log GC Compaction cheap
Leader failover Reads simple

Good systems design is about moving complexity to where it's cheapest to pay for it. The answer they're looking for is not a perfect system โ€” it's a thoughtful one.