Understanding the hidden costs of traditional database structures on modern storage
The Problem in One Sentence
B-trees work fine on SSDs initially, but their in-place updates cause high write amplification and hot-spot wear over time, resulting in slower performance and shorter SSD lifespan compared to flash-optimized alternatives.
Background: Why B-trees Exist
Let me start with some context. B-trees are everywhere today:
- Database engines (MySQL, PostgreSQL)
- Filesystems (NTFS, ext4)
- Storage systems across the industry
They were designed decades ago for spinning disks, where:
- Random seeks were incredibly expensive
- The goal was to minimize disk head movement
- Solution: Keep trees shallow, pack many keys per node
This design made perfect sense for mechanical storage.
The SSD Reality Check
When SSDs first arrived, B-trees seemed to work great:
- Reads are lightning fast
- Random access is practically free
- Initial writes perform well too
But here's the catch: Over time, B-trees' update patterns clash fundamentally with how flash memory actually works.
The result isn't just slower performance—it's measurable reduction in your SSD's lifespan.
The Write Amplification Problem
Let me show you what happens with a simple example:
The Setup
- Page size: 4 KB
- Erase block size: 256 KB (64 pages per block)
- Operation: insert one record into a B-tree leaf page
What the B-tree "Thinks" Happens
- Find page #17
- Update it in place
- Done!
What Actually Happens on the SSD
- Read the entire 256 KB erase block containing page #17
- Modify just 4 KB in memory
- Erase the whole 256 KB block
- Write the entire 256 KB back
Result: 64× more physical work than the logical 4 KB write (256 KB / 4 KB). In practice the flash translation layer defers the erase by redirecting the updated page to a fresh location, but that debt comes due later, during garbage collection.
This is called write amplification, and it's just the beginning.
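To make the arithmetic concrete, here's a minimal back-of-the-envelope calculation using the numbers from the setup above. It models only the naive read-modify-erase-write cycle; real controllers defer and batch this work, but the ratio is the point.

```python
# Toy model of the naive in-place update described above.
PAGE_SIZE = 4 * 1024          # logical write: one 4 KB B-tree page
ERASE_BLOCK = 256 * 1024      # smallest unit flash can erase

logical_bytes = PAGE_SIZE
physical_bytes = ERASE_BLOCK  # whole block read, erased, rewritten

print(f"write amplification: {physical_bytes / logical_bytes:.0f}x")  # -> 64x
```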
The Hot Spot Problem
B-trees have a natural hierarchy problem:
Root and upper nodes → Sit on the path of every insert and delete; splits and merges rewrite them over and over, so the same few pages churn constantly
Leaf nodes → Updated only when their specific keys change
What This Means for Your SSD
- The same erase blocks get rewritten constantly
- Flash memory cells wear out after 3,000–10,000 program/erase cycles
- Hot spots wear out much faster than the rest of the drive
- SSD controller remaps failed blocks to spares
- But spare blocks are finite
Bottom line: Uneven wear patterns in B-trees directly accelerate SSD degradation.
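A toy simulation makes the skew visible. Everything here is an illustrative assumption (the block count, the update mix, the endurance figure), not a measurement, and real controllers wear-level to spread this out. The point is how fast a hot block burns through its program/erase budget relative to the average.

```python
import random

# Hypothetical drive: 1,000 erase blocks rated for 3,000 P/E cycles each.
NUM_BLOCKS = 1_000
ENDURANCE = 3_000
HOT_BLOCKS = 10   # assume the tree's upper-level pages live in 10 blocks

random.seed(0)
erases = [0] * NUM_BLOCKS
for _ in range(1_000_000):
    # Assumption: half of all physical rewrites hit the hot upper levels.
    if random.random() < 0.5:
        block = random.randrange(HOT_BLOCKS)
    else:
        block = random.randrange(NUM_BLOCKS)
    erases[block] += 1

avg = sum(erases) / NUM_BLOCKS
print(f"hottest block: {max(erases):,} erases "
      f"({max(erases) / ENDURANCE:.0%} of rated endurance)")
print(f"average block: {avg:,.0f} erases ({avg / ENDURANCE:.0%} of rated endurance)")
# The hot blocks blow far past their P/E budget while most of the
# drive is barely used: exactly the uneven wear described above.
```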
The Garbage Collection Tax
As your SSD fills up, another problem emerges: garbage collection.
How Garbage Collection Works
- SSD controller needs to clean partially-used blocks
- Moves valid pages out of a block
- Erases the block
- Writes data back elsewhere
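The four steps above translate into a short sketch. The block representation here (a list of pages flagged valid or invalid, plus an erase counter) is my own simplification, not what any real flash translation layer looks like.

```python
# Minimal sketch of one garbage-collection pass over a victim block.
PAGE_SIZE = 4096

def collect(victim: dict, free_blocks: list) -> int:
    """Reclaim one partially-used block; return bytes physically relocated."""
    valid = [p for p in victim["pages"] if p["valid"]]
    # 1. Move the still-valid pages into a fresh block...
    destination = free_blocks.pop()
    destination["pages"].extend(valid)
    # 2. ...erase the victim (one more P/E cycle on its cells)...
    victim["pages"].clear()
    victim["erase_count"] += 1
    # 3. ...and return the emptied block to the free pool.
    free_blocks.append(victim)
    return len(valid) * PAGE_SIZE  # writes the application never issued
```

Every byte collect() relocates is the tax: physical writing the host never asked for.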
Why B-trees Make This Worse
- B-tree updates scatter small writes across many different pages
- Scattered writes leave many blocks only partially invalid, so garbage collection must relocate more valid data to reclaim the same space
- A single 4 KB update can cascade into multiple block rewrites during GC
- The drive's internal amplification multiplies with the application's, rather than merely adding to it
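A standard first-order approximation shows why mostly-valid blocks are so expensive to clean: if a victim block is still a fraction u valid, reclaiming the other (1 − u) requires relocating the u that remains, so each unit of reclaimed space costs 1 / (1 − u) units of physical writing. This is a common textbook simplification, not the behavior of any specific controller.

```python
# First-order GC cost: freeing (1 - u) of a block means relocating
# the u fraction that is still valid.
def gc_write_amplification(u: float) -> float:
    return 1 / (1 - u)

for u in (0.2, 0.5, 0.8, 0.9):
    print(f"{u:.0%} valid -> {gc_write_amplification(u):.1f}x amplification")
# Scattered B-tree updates invalidate a little data in many blocks,
# keeping u high everywhere: the expensive end of this curve.
```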
The Downward Spiral
These problems don't exist in isolation—they feed on each other:
Frequent B-tree updates
↓
Same blocks hit repeatedly (hot spots)
↓
Controller remaps worn blocks, spare pool shrinks
↓
More garbage collection overhead needed
↓
Higher write amplification across the board
↓
More erases per logical write
↓
Performance degrades + Lifespan shortens
This creates a feedback loop where the problems accelerate over time.
Real-World Impact
Performance Degradation
- Garbage collection competes with real application writes
- Response times become unpredictable
- Throughput drops significantly under sustained load
Lifespan Reduction
- Flash cells wear out sooner due to excessive erase cycles
- Drives can fail years earlier than their rated lifespan
- Premature replacement costs
The Hidden Costs
- Increased infrastructure replacement budgets
- Potential data availability issues
- Performance troubleshooting overhead
The Modern Solution
This is why the industry has largely moved away from B-trees for write-heavy workloads on SSDs:
LSM Trees (Log-Structured Merge Trees)
- Used by: RocksDB, Cassandra, LevelDB
- How they work: Sequential writes, periodic compaction
- Trade-off: Slightly more complex reads for much better write patterns
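Here's a heavily simplified sketch of the LSM idea, not any real engine's API: writes land in an in-memory table, and a full table is flushed to storage as one sorted, sequential run. RocksDB and friends add write-ahead logging, bloom filters, and leveled compaction on top; this only shows the write pattern flash cares about.

```python
# Toy LSM tree: buffer updates in RAM, flush them as sorted immutable runs.
class ToyLSM:
    def __init__(self, memtable_limit: int = 4):
        self.memtable: dict[str, str] = {}            # random access stays in RAM
        self.runs: list[list[tuple[str, str]]] = []   # immutable on-"disk" runs
        self.limit = memtable_limit

    def put(self, key: str, value: str) -> None:
        self.memtable[key] = value
        if len(self.memtable) >= self.limit:
            # One large sequential write instead of scattered page updates.
            self.runs.append(sorted(self.memtable.items()))
            self.memtable.clear()

    def get(self, key: str):
        if key in self.memtable:
            return self.memtable[key]
        for run in reversed(self.runs):   # newest run wins
            for k, v in run:              # a real engine would binary-search
                if k == key:
                    return v
        return None
```

Compaction would periodically merge runs (more sequential I/O); reads may have to consult several runs, which is exactly the trade-off named above.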
Copy-on-Write B-trees
- Used by: Btrfs, APFS, ZFS
- How they work: Never modify data in place, always write to new locations
- Trade-off: More metadata overhead for better flash compatibility
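And a minimal sketch of the copy-on-write idea, using an immutable node type invented for illustration rather than any filesystem's actual structures: an update writes a new leaf plus fresh copies of every node on the path up to a new root, and the old version stays intact.

```python
from dataclasses import dataclass

@dataclass(frozen=True)              # frozen: nodes are never mutated
class Node:
    keys: tuple
    children: tuple = ()             # empty for leaf nodes

def cow_update(node: Node, path: list, new_leaf: Node) -> Node:
    """Return a new root with new_leaf swapped in at the given child path."""
    if not path:
        return new_leaf
    i = path[0]
    new_child = cow_update(node.children[i], path[1:], new_leaf)
    new_children = node.children[:i] + (new_child,) + node.children[i + 1:]
    return Node(node.keys, new_children)  # fresh copy; old node untouched

leaf = Node(keys=("a",))
old_root = Node(keys=("m",), children=(leaf, Node(keys=("z",))))
new_root = cow_update(old_root, [0], Node(keys=("a", "b")))
assert old_root.children[0] is leaf       # the old version still exists
```

Flash only ever sees new pages being written; the extra metadata for tracking versions is the cost of never overwriting anything in place.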
Both approaches sacrifice some read efficiency for dramatically better write behavior and longer SSD life.
Key Takeaways
- B-trees aren't inherently "bad" for SSDs—they work fine initially, especially for read-heavy workloads
- The problems emerge over time under sustained write workloads due to fundamental mismatches with flash memory
- Write amplification is the root cause: what looks like a 4 KB write can balloon into 256 KB or more of physical work
- Modern alternatives exist that are specifically designed for flash storage characteristics
- The choice of data structure can significantly impact both performance and hardware longevity