articles/distributed-kv-store-interview-summary.md
Table of Contents
Summary: "Airbnb Staff Interview: Designing a Distributed Key-Value Store"
Source: https://freedium-mirror.cfd/airbnb-staff-interview-designing-a-distributed-key-value-store-11f49fc65da5 Author: Anurag Goel ยท ~6 min read ยท February 20, 2026
Core Message
"Design a key-value store" is a classic staff-level interview question that seems simple but quietly murders candidates with its depth. The interviewer isn't looking for a checklist โ they want to see how storage internals, distributed systems theory, and workload-specific constraints connect into a coherent story.
The Three Load-Bearing Constraints
- Append-only writes โ no in-place modification; every update appends to a log
- Strong consistency โ reads must reflect the latest committed write, even across servers
- Values up to 2 GB โ can't buffer full values in memory on read or write
Together these eliminate obvious designs: no in-place hash tables, no Redis, no eventual-consistency hand-waving.
Write Path: LSM-Trees
The append-only constraint is actually a gift โ sequential writes crush random writes. Architecture:
- WAL (Write-Ahead Log) โ append-only durability on disk
- Memtable โ in-memory sorted buffer absorbing writes at memory speed
- SSTables โ when memtable fills, flush to disk as an immutable sorted file
- Compaction โ background merge of SSTables. Leveled compaction for read-heavy workloads (fewer files to scan), tiered for write-heavy (less write amplification)
Handling the 2 GB Problem: Value Separation (WiscKey)
Stuffing giant values into SSTables makes compaction catastrophic (re-reading and re-writing massive blobs on every merge = write amplification).
Fix: keep only keys + small pointers (file ID + byte offset) in the SSTable; store actual blobs in a separate append-only value log. Compaction only touches tiny metadata.
Always stream large values in 4โ16 MB chunks with checksums. Maintain a chunk manifest per key for resumable uploads, range reads, and crash recovery.
Read Path & Strong Consistency
Strong consistency = linearizability: after a write, any subsequent read from any server must see it.
- Leader-based replication with Raft โ one leader per partition handles all writes
- Writes acknowledged only after a majority quorum has persisted
- Reads go to the leader by default
- Fencing tokens (epoch numbers) prevent stale ex-leaders from serving data after failover โ non-negotiable for true linearizability
Scaling reads: lease-based follower reads. The leader grants timed leases to followers; followers can serve reads within that window without sacrificing consistency.
Lookup order: memtable โ immutable memtables โ SSTables (newest first). Bloom filters skip SSTables that definitely don't contain the key.
Conditional Operations (CAS)
ETags / version numbers on every value. A conditional write says: "set this, but only if version is still X." With leader-based design, this is a single Raft log entry combining precondition check + new value. If the check fails, the entry is rejected atomically.
Failure Recovery
- Crash recovery: replay WAL from last checkpoint. Periodic checkpointing is critical at 100 TB scale โ you can't replay from the beginning of time
- Node loss: rebuild global index from SSTable metadata (not full data scan); rebuild value-log GC metadata lazily in the background
- Torn writes: caught by checksums; mismatched chunks get discarded, client retries
Multi-Region & CAP
At multi-region scale, every quorum write pays cross-region latency. Mitigation: place majority of replicas in one primary region with one remote replica โ fast commits most of the time, cross-region durability.
During a network partition: choose CP (block writes) for a strongly consistent system. Document the trade-off explicitly and design SLAs around it.
Bottom Line โ What the Interviewer Wants
The deepest question in the problem is about engineering taste: where do you put the complexity?
| Complexity goes to... | To keep... |
|---|---|
| Background compaction | Foreground writes fast |
| Value-log GC | Compaction cheap |
| Leader failover | Reads simple |
Good systems design is about moving complexity to where it's cheapest to pay for it. The answer they're looking for is not a perfect system โ it's a thoughtful one.