Summary: "Airbnb Staff Interview: Designing a Distributed Key-Value Store"

Source: https://freedium-mirror.cfd/airbnb-staff-interview-designing-a-distributed-key-value-store-11f49fc65da5 Author: Anurag Goel · ~6 min read · February 20, 2026

Core Message

"Design a key-value store" is a classic staff-level interview question that seems simple but quietly murders candidates with its depth. The interviewer isn't looking for a checklist — they want to see how storage internals, distributed systems theory, and workload-specific constraints connect into a coherent story.

The Three Load-Bearing Constraints

Append-only writes — no in-place modification; every update appends to a log
Strong consistency — reads must reflect the latest committed write, even across servers
Values up to 2 GB — can't buffer full values in memory on read or write

Together these eliminate obvious designs: no in-place hash tables, no Redis, no eventual-consistency hand-waving.

Write Path: LSM-Trees

The append-only constraint is actually a gift — sequential writes crush random writes. Architecture:

WAL (Write-Ahead Log) — append-only durability on disk
Memtable — in-memory sorted buffer absorbing writes at memory speed
SSTables — when memtable fills, flush to disk as an immutable sorted file
Compaction — background merge of SSTables. Leveled compaction for read-heavy workloads (fewer files to scan), tiered for write-heavy (less write amplification)

Handling the 2 GB Problem: Value Separation (WiscKey)

Stuffing giant values into SSTables makes compaction catastrophic (re-reading and re-writing massive blobs on every merge = write amplification).

Fix: keep only keys + small pointers (file ID + byte offset) in the SSTable; store actual blobs in a separate append-only value log. Compaction only touches tiny metadata.

Always stream large values in 4–16 MB chunks with checksums. Maintain a chunk manifest per key for resumable uploads, range reads, and crash recovery.

Read Path & Strong Consistency

Strong consistency = linearizability: after a write, any subsequent read from any server must see it.

Leader-based replication with Raft — one leader per partition handles all writes
Writes acknowledged only after a majority quorum has persisted
Reads go to the leader by default
Fencing tokens (epoch numbers) prevent stale ex-leaders from serving data after failover — non-negotiable for true linearizability

Scaling reads: lease-based follower reads. The leader grants timed leases to followers; followers can serve reads within that window without sacrificing consistency.

Lookup order: memtable → immutable memtables → SSTables (newest first). Bloom filters skip SSTables that definitely don't contain the key.

Conditional Operations (CAS)

ETags / version numbers on every value. A conditional write says: "set this, but only if version is still X." With leader-based design, this is a single Raft log entry combining precondition check + new value. If the check fails, the entry is rejected atomically.

Failure Recovery

Crash recovery: replay WAL from last checkpoint. Periodic checkpointing is critical at 100 TB scale — you can't replay from the beginning of time
Node loss: rebuild global index from SSTable metadata (not full data scan); rebuild value-log GC metadata lazily in the background
Torn writes: caught by checksums; mismatched chunks get discarded, client retries

Multi-Region & CAP

At multi-region scale, every quorum write pays cross-region latency. Mitigation: place majority of replicas in one primary region with one remote replica — fast commits most of the time, cross-region durability.

During a network partition: choose CP (block writes) for a strongly consistent system. Document the trade-off explicitly and design SLAs around it.

Bottom Line — What the Interviewer Wants

The deepest question in the problem is about engineering taste: where do you put the complexity?

Complexity goes to...	To keep...
Background compaction	Foreground writes fast
Value-log GC	Compaction cheap
Leader failover	Reads simple

Good systems design is about moving complexity to where it's cheapest to pay for it. The answer they're looking for is not a perfect system — it's a thoughtful one.