Geth(5) The Storage Stack - Kehao Zheng's Website

The previous chapters introduced the Merkle Patricia Trie (how data is authenticated) and the Account & State layer (how state is organized and mutated). But neither chapter answered a practical question: where do all these bytes actually end up?

This chapter traces the full path from an in-memory state mutation down to bytes on disk. It covers four things:

The interface hierarchy — how geth defines a storage contract that every backend must implement
The key-value store — how Pebble (the default engine) turns Put/Get calls into disk I/O
The key schema and accessor layer — how core/rawdb/ organizes all of Ethereum’s data into a single flat key space
The freezer — how ancient, finalized blocks are moved out of the key-value store into append-only flat files

The Four-Layer Diagram

When StateDB.Commit() finishes (covered in Chapter 04), the trie nodes and account data need to reach disk. They travel through four layers:

+-----------------------------------------------------------+
|  Layer 4: StateDB                                         |
|  In-memory dirty state, journal, snapshots                |
|  (core/state/)                                            |
+---------------------------+-------------------------------+
                            |  Commit()
                            v
+-----------------------------------------------------------+
|  Layer 3: Trie + TrieDB                                   |
|  Merkle Patricia Trie nodes, path-based or hash-based     |
|  persistence (trie/, triedb/)                             |
+---------------------------+-------------------------------+
                            |  triedb.Commit() → batch writes
                            v
+-----------------------------------------------------------+
|  Layer 2: rawdb accessor layer                            |
|  Key-prefix schema, Read/Write functions                  |
|  (core/rawdb/)                                            |
+---------------------------+-------------------------------+
                            |  ethdb.Put(), ethdb.Batch.Write()
                            v
+-----------------------------------------------------------+
|  Layer 1: Key-Value Store + Freezer                       |
|  Pebble (default) or LevelDB for live data                |
|  Freezer for ancient chain segments                       |
|  (ethdb/pebble/, core/rawdb/freezer.go)                   |
+-----------------------------------------------------------+

Layers 3 and 4 were covered in previous chapters. This chapter focuses on Layers 1 and 2 — the bottom half of the stack.

The Interface Hierarchy

Geth defines all storage contracts in a single file: ethdb/database.go. Everything above this boundary — trie code, rawdb accessors, the Ethereum service — programs against these interfaces. The actual storage engine (Pebble, LevelDB, or an in-memory map) is invisible to them.

The Key-Value Side

The core building blocks are two tiny interfaces — one for reading, one for writing:

type KeyValueReader interface {
    Has(key []byte) (bool, error)
    Get(key []byte) ([]byte, error)
}

type KeyValueWriter interface {
    Put(key []byte, value []byte) error
    Delete(key []byte) error
}

These are combined into KeyValueStore, which adds batch support, iteration, compaction, and statistics:

type KeyValueStore interface {
    KeyValueReader
    KeyValueWriter
    KeyValueStater
    KeyValueSyncer
    KeyValueRangeDeleter
    Batcher
    Iteratee
    Compacter
    io.Closer
}

Batcher provides NewBatch() for atomic multi-key writes (covered below).
Iteratee provides NewIterator(prefix, start) for ordered key scans.
Compacter provides Compact(start, limit) for triggering LSM-tree compaction.
KeyValueSyncer provides SyncKeyValue() to force-flush the write-ahead log.
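To make the "program against the interface" idea concrete, here is a minimal sketch. The two interfaces are copied from above; `memStore`, `errNotFound`, and `copyKey` are hypothetical names invented for this illustration and are not geth's own in-memory database.

```go
package main

import (
	"errors"
	"fmt"
)

// KeyValueReader and KeyValueWriter mirror the two interfaces quoted above.
type KeyValueReader interface {
	Has(key []byte) (bool, error)
	Get(key []byte) ([]byte, error)
}

type KeyValueWriter interface {
	Put(key []byte, value []byte) error
	Delete(key []byte) error
}

// errNotFound is a hypothetical sentinel error used only in this sketch.
var errNotFound = errors.New("not found")

// memStore is a hypothetical map-backed backend (not geth's memorydb).
type memStore struct {
	data map[string][]byte
}

func newMemStore() *memStore { return &memStore{data: make(map[string][]byte)} }

func (m *memStore) Has(key []byte) (bool, error) {
	_, ok := m.data[string(key)]
	return ok, nil
}

func (m *memStore) Get(key []byte) ([]byte, error) {
	v, ok := m.data[string(key)]
	if !ok {
		return nil, errNotFound
	}
	return append([]byte(nil), v...), nil // hand the caller its own copy
}

func (m *memStore) Put(key []byte, value []byte) error {
	m.data[string(key)] = append([]byte(nil), value...)
	return nil
}

func (m *memStore) Delete(key []byte) error {
	delete(m.data, string(key))
	return nil
}

// copyKey only sees the interfaces, so it works with any backend.
func copyKey(src KeyValueReader, dst KeyValueWriter, key []byte) error {
	val, err := src.Get(key)
	if err != nil {
		return err
	}
	return dst.Put(key, val)
}

func main() {
	a, b := newMemStore(), newMemStore()
	if err := a.Put([]byte("k"), []byte("v")); err != nil {
		fmt.Println("put failed:", err)
		return
	}
	if err := copyKey(a, b, []byte("k")); err != nil {
		fmt.Println("copy failed:", err)
		return
	}
	v, _ := b.Get([]byte("k"))
	fmt.Printf("%s\n", v)
}
```

Because `copyKey` never names a concrete type, swapping the map for a disk-backed engine would not change it, which is the property the interface boundary is meant to give the layers above.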

The Ancient Side

Old, finalized blocks rarely change and are better stored in flat, append-only files. Geth calls this the “ancient store” and defines a separate interface family for it:

type AncientReaderOp interface {
    Ancient(kind string, number uint64) ([]byte, error)
    AncientRange(kind string, start, count, maxBytes uint64) ([][]byte, error)
    Ancients() (uint64, error)
    Tail() (uint64, error)
    // ...
}

type AncientWriter interface {
    ModifyAncients(func(AncientWriteOp) error) (int64, error)
    TruncateHead(n uint64) (uint64, error)
    TruncateTail(n uint64) (uint64, error)
    SyncAncient() error
}

Ancient(kind, number) retrieves a single item (e.g., Ancient("headers", 42) returns block 42’s header).
ModifyAncients(fn) is the write API. The callback receives an AncientWriteOp with Append/AppendRaw methods. If the callback returns an error, all changes are rolled back.
Tail() returns the first available item number — items before this have been pruned.

The Unified Database

At the top, a single interface combines both worlds:

type Database interface {
    KeyValueStore
    AncientStore
}

Every component in geth that needs storage receives an ethdb.Database. Internally it is a freezerdb — a struct that embeds a KeyValueStore (Pebble) and a chainFreezer (flat files):

type freezerdb struct {
    ethdb.KeyValueStore
    *chainFreezer

    readOnly    bool
    ancientRoot string
}

The rawdb.Open() function constructs this combination, validates that the key-value store and freezer are consistent (matching genesis hashes, no gaps in block numbers), and starts a background goroutine that periodically freezes finalized blocks.
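The embedding trick used by freezerdb can be sketched with toy types. Everything below (`kvStore`, `ancientStore`, `database`, `mapKV`, `flatFiles`, `combined`, `open`) is a hypothetical stand-in with a deliberately tiny method set; only the pattern of embedding two halves to satisfy a combined interface comes from the text.

```go
package main

import "fmt"

// kvStore and ancientStore are toy stand-ins for the two interface families.
type kvStore interface {
	Get(key string) (string, bool)
}

type ancientStore interface {
	Ancients() uint64
}

// database plays the role of the unified interface: the union of both.
type database interface {
	kvStore
	ancientStore
}

// mapKV is a toy key-value half.
type mapKV map[string]string

func (m mapKV) Get(key string) (string, bool) {
	v, ok := m[key]
	return v, ok
}

// flatFiles is a toy ancient half that only tracks an item count.
type flatFiles struct{ items uint64 }

func (f *flatFiles) Ancients() uint64 { return f.items }

// combined embeds both halves. Their methods are promoted, so combined
// satisfies database without any hand-written forwarding methods.
type combined struct {
	kvStore
	*flatFiles
}

// open wires the two halves together and returns them behind one interface.
func open() database {
	return &combined{
		kvStore:   mapKV{"LastBlock": "0xabc"},
		flatFiles: &flatFiles{items: 3},
	}
}

func main() {
	db := open()
	head, _ := db.Get("LastBlock")
	fmt.Println(head, db.Ancients())
}
```

Callers hold a single value and never need to know which half answers a given call.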

The Key-Value Store: Pebble

Pebble is geth’s default storage engine (replacing LevelDB). It is an LSM-tree key-value store from CockroachDB; geth wraps it in the ethdb/pebble package so that it satisfies the ethdb.KeyValueStore interface.

Configuration

The pebble.New() constructor in ethdb/pebble/pebble.go sets up the engine with these key parameters:

// ethdb/pebble/pebble.go (inside New)

opt := &pebble.Options{
    Cache:        pebble.NewCache(int64(cache * 1024 * 1024)),
    MaxOpenFiles: handles,
    MemTableSize: uint64(memTableSize),
    MemTableStopWritesThreshold: memTableLimit,        // 4
    MaxConcurrentCompactions:    runtime.NumCPU,
    Levels: []pebble.LevelOptions{
        {TargetFileSize: 2 * 1024 * 1024, FilterPolicy: bloom.FilterPolicy(10)},
        {TargetFileSize: 4 * 1024 * 1024, FilterPolicy: bloom.FilterPolicy(10)},
        // ... 5 more levels, doubling each time up to 128 MB
    },
    L0CompactionThreshold: 2,
}

Cache is split between read and write buffers. The total is set from geth’s --cache flag.
4 memtables allow smoother write flushing (smaller, more frequent flushes instead of large spikes).
Bloom filters (10 bits per key) on every level accelerate point lookups by avoiding disk reads for keys that don’t exist.
L0 compaction threshold = 2 is lower than Pebble’s default of 4, reducing the compaction debt at the cost of more frequent compactions.
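The elided per-level sizes follow directly from the doubling rule in the snippet. A small sketch of that arithmetic, where `levelTargetSizes` is a hypothetical helper written for this illustration and not part of geth or Pebble:

```go
package main

import "fmt"

// levelTargetSizes returns per-level target file sizes in bytes, starting
// at base and doubling for every deeper level.
func levelTargetSizes(base int64, levels int) []int64 {
	sizes := make([]int64, levels)
	for i := range sizes {
		sizes[i] = base << uint(i)
	}
	return sizes
}

func main() {
	// Two levels are shown in the snippet plus five elided ones: seven total.
	for level, size := range levelTargetSizes(2*1024*1024, 7) {
		fmt.Printf("level %d: %d MB\n", level, size/(1024*1024))
	}
}
```

Starting at 2 MB, seven levels end at 128 MB, which matches the "5 more levels, doubling each time up to 128 MB" comment.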

Asynchronous Writes

By default, geth uses async writes — Put and Batch.Write return before the write-ahead log (WAL) is fsynced to disk:

writeOptions: pebble.NoSync,

This gives much better write throughput, especially on macOS. Geth is designed to recover from unclean shutdowns, so losing a few recent writes is acceptable. For safety, periodic background fsyncs are triggered via WALBytesPerSync.

Core Operations

The Get and Put methods are thin wrappers around Pebble’s native API:

func (d *Database) Get(key []byte) ([]byte, error) {
    d.quitLock.RLock()
    defer d.quitLock.RUnlock()
    if d.closed {
        return nil, pebble.ErrClosed
    }
    dat, closer, err := d.db.Get(key)
    if err != nil {
        return nil, err
    }
    ret := make([]byte, len(dat))
    copy(ret, dat)
    closer.Close()
    return ret, nil
}

func (d *Database) Put(key []byte, value []byte) error {
    // ... closed check ...
    return d.db.Set(key, value, d.writeOptions)
}

Note that Get copies the value into a new byte slice. Pebble’s Get returns a pointer into an internal buffer with a closer — the data is only valid until closer.Close() is called. The copy ensures the caller owns the bytes.
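The hazard that the copy avoids can be reproduced with a toy store. This sketch does not use Pebble: `bufStore`, `getView`, and `get` are hypothetical, and zeroing the buffer on release is only a stand-in for whatever the engine does with recycled memory.

```go
package main

import "fmt"

// bufStore is a hypothetical store that returns views into an internal
// buffer, which it is free to recycle once the view is released.
type bufStore struct {
	data map[string]string
	buf  []byte
}

// getView returns a slice aliasing s.buf plus a release function. The slice
// is only valid until release is called.
func (s *bufStore) getView(key string) (view []byte, release func()) {
	s.buf = append(s.buf[:0], s.data[key]...)
	view = s.buf
	return view, func() {
		for i := range view {
			view[i] = 0 // simulate the buffer being recycled
		}
	}
}

// get copies the view before releasing it, so the caller owns the bytes.
func (s *bufStore) get(key string) []byte {
	view, release := s.getView(key)
	ret := make([]byte, len(view))
	copy(ret, view)
	release()
	return ret
}

func main() {
	s := &bufStore{data: map[string]string{"k": "value"}}

	view, release := s.getView("k")
	release()
	fmt.Printf("view after release: %q\n", view)
	fmt.Printf("copied value:       %q\n", s.get("k"))
}
```

The raw view no longer holds the stored value once it has been released, while the copied slice does.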

Batch Writes

Individual Put calls are expensive: each one goes through the WAL individually. When geth needs to write many keys atomically (e.g., inserting a block’s worth of trie nodes), it uses a batch:

type Batch interface {
    KeyValueWriter        // Put, Delete
    KeyValueRangeDeleter  // DeleteRange

    ValueSize() int       // bytes queued so far
    Write() error         // flush all queued ops to disk atomically
    Reset()               // clear the batch for reuse
    Replay(w KeyValueWriter) error  // replay ops against another writer
}

A batch buffers Put/Delete operations in memory. Nothing touches the database until Write() is called, and the entire batch is applied atomically — either all writes succeed or none do.

The IdealBatchSize constant (100 KB) serves as a guideline: callers can check batch.ValueSize() >= ethdb.IdealBatchSize to decide when to flush and start a new batch. This prevents batches from growing too large in memory.
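The flush-when-full pattern can be sketched without a real database. Here `toyBatch` and `writeAll` are hypothetical, the batch only counts bytes and flushes, and counting key plus value bytes toward the size is an assumption of this sketch.

```go
package main

import "fmt"

// idealBatchSize mirrors the 100 KB guideline described above.
const idealBatchSize = 100 * 1024

// toyBatch is a hypothetical batch that only tracks queued bytes and how
// many times it has been flushed.
type toyBatch struct {
	size    int
	flushes int
}

func (b *toyBatch) Put(key, value []byte) { b.size += len(key) + len(value) }
func (b *toyBatch) ValueSize() int        { return b.size }
func (b *toyBatch) Write() error          { b.flushes++; return nil }
func (b *toyBatch) Reset()                { b.size = 0 }

// writeAll queues n values of valueLen bytes each, flushing whenever the
// batch reaches the ideal size and once more for any remainder.
func writeAll(b *toyBatch, n, valueLen int) error {
	value := make([]byte, valueLen)
	for i := 0; i < n; i++ {
		key := []byte(fmt.Sprintf("key-%08d", i)) // 12-byte key
		b.Put(key, value)
		if b.ValueSize() >= idealBatchSize {
			if err := b.Write(); err != nil {
				return err
			}
			b.Reset()
		}
	}
	if b.ValueSize() > 0 {
		if err := b.Write(); err != nil {
			return err
		}
		b.Reset()
	}
	return nil
}

func main() {
	b := &toyBatch{}
	// 12-byte keys + 1012-byte values = 1 KB per entry, so 100 entries per flush.
	if err := writeAll(b, 1050, 1012); err != nil {
		fmt.Println("write failed:", err)
		return
	}
	fmt.Println("flushes:", b.flushes)
}
```

With 1 KB entries the batch fills after 100 puts, so 1050 entries produce ten full flushes and one final flush for the remaining 50.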

Here is how batches are used in practice. During chain freezing, old blocks are deleted from the key-value store in batches:

// core/rawdb/chain_freezer.go (inside freeze)

batch := db.NewBatch()
for i := 0; i < len(ancients); i++ {
    if first+uint64(i) != 0 {
        DeleteBlockWithoutNumber(batch, ancients[i], first+uint64(i))
        DeleteCanonicalHash(batch, first+uint64(i))
    }
}
if err := batch.Write(); err != nil {
    log.Crit("Failed to delete frozen canonical blocks", "err", err)
}
batch.Reset()

Under the hood in Pebble, Write() calls pebble.Batch.Commit(), which applies all buffered operations to the database in a single atomic write.

The Key Schema

Geth stores everything — headers, bodies, receipts, trie nodes, contract code, snapshots, transaction indices — in a single flat key-value namespace. The core/rawdb/schema.go file defines the key-prefix schema that organizes this namespace.

Singleton Keys

Some values are global, storing a single piece of state:

headHeaderKey         = []byte("LastHeader")
headBlockKey          = []byte("LastBlock")
headFinalizedBlockKey = []byte("LastFinalized")
persistentStateIDKey  = []byte("LastStateID")
trieJournalKey        = []byte("TrieJournal")
SnapshotRootKey       = []byte("SnapshotRoot")

These are fixed keys: each one is a constant byte string that maps to a single value (typically a 32-byte hash or an 8-byte block number).

Prefix-Based Keys

Most data is keyed by combining a single-byte prefix with a block number (big-endian uint64) and/or a hash (32 bytes). Single-byte prefixes keep keys short and ensure different data types never collide:

Prefix	Key Format	Value
`"h"`	`h` + num(8) + hash(32)	Block header (RLP)
`"h"` + `"n"`	`h` + num(8) + `n`	Canonical hash for block number
`"H"`	`H` + hash(32)	Block number for hash
`"b"`	`b` + num(8) + hash(32)	Block body (RLP)
`"r"`	`r` + num(8) + hash(32)	Block receipts (RLP)
`"l"`	`l` + txHash(32)	Transaction lookup metadata
`"c"`	`c` + codeHash(32)	Contract bytecode
`"a"`	`a` + accountHash(32)	Snapshot: account data
`"o"`	`o` + accountHash(32) + storageHash(32)	Snapshot: storage slot
`"A"`	`A` + hexPath	Trie node (path-based, account trie)
`"O"`	`O` + accountHash(32) + hexPath	Trie node (path-based, storage trie)
`"L"`	`L` + stateRoot(32)	State ID (path-based)

The key-building functions are also defined in schema.go:

func headerKey(number uint64, hash common.Hash) []byte {
    return append(append(headerPrefix, encodeBlockNumber(number)...), hash.Bytes()...)
}

func blockBodyKey(number uint64, hash common.Hash) []byte {
    return append(append(blockBodyPrefix, encodeBlockNumber(number)...), hash.Bytes()...)
}

func codeKey(hash common.Hash) []byte {
    return append(CodePrefix, hash.Bytes()...)
}

Block numbers are always encoded as 8-byte big-endian integers. This ensures that keys sort in block-number order within each prefix, which makes range scans efficient.
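The ordering claim is easy to check. The sketch below re-implements the key layout with the standard library only; `encodeBlockNumber` and `headerKey` here are local stand-ins that take a plain `[32]byte` instead of `common.Hash`, and they build each key in a fresh slice.

```go
package main

import (
	"bytes"
	"encoding/binary"
	"fmt"
)

// encodeBlockNumber encodes a block number as 8 big-endian bytes.
func encodeBlockNumber(number uint64) []byte {
	enc := make([]byte, 8)
	binary.BigEndian.PutUint64(enc, number)
	return enc
}

// headerKey builds "h" + num(8) + hash(32) in a freshly allocated slice.
func headerKey(number uint64, hash [32]byte) []byte {
	key := make([]byte, 0, 1+8+32)
	key = append(key, 'h')
	key = append(key, encodeBlockNumber(number)...)
	return append(key, hash[:]...)
}

func main() {
	var hash [32]byte

	// Big-endian: byte-wise comparison agrees with numeric order.
	k255, k256 := headerKey(255, hash), headerKey(256, hash)
	fmt.Println("key length:", len(k255))
	fmt.Println("big-endian keeps order:", bytes.Compare(k255, k256) < 0)

	// Little-endian would put 256 before 255 in a byte-wise sort.
	le255, le256 := make([]byte, 8), make([]byte, 8)
	binary.LittleEndian.PutUint64(le255, 255)
	binary.LittleEndian.PutUint64(le256, 256)
	fmt.Println("little-endian keeps order:", bytes.Compare(le255, le256) < 0)
}
```

An iterator over the `h` prefix therefore visits headers in block-number order, which a little-endian encoding would not guarantee.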

The Accessor Layer

The core/rawdb/ package provides accessor functions — typed Read/Write/Delete helpers that handle key construction, RLP encoding/decoding, and the ancient-vs-live lookup logic. Higher layers never construct raw keys or call db.Get() directly.

Chain Data Accessors

The pattern is consistent across all chain data. Here is ReadHeader:

func ReadHeader(db ethdb.Reader, hash common.Hash, number uint64) *types.Header {
    data := ReadHeaderRLP(db, hash, number)
    if len(data) == 0 {
        return nil
    }
    header := new(types.Header)
    if err := rlp.DecodeBytes(data, header); err != nil {
        log.Error("Invalid block header RLP", "hash", hash, "err", err)
        return nil
    }
    return header
}

It delegates to ReadHeaderRLP, which handles the two-tier lookup — check the freezer first, fall back to the key-value store:

func ReadHeaderRLP(db ethdb.Reader, hash common.Hash, number uint64) rlp.RawValue {
    var data []byte
    db.ReadAncients(func(reader ethdb.AncientReaderOp) error {
        data, _ = reader.Ancient(ChainFreezerHeaderTable, number)
        if len(data) > 0 && crypto.Keccak256Hash(data) == hash {
            return nil
        }
        data, _ = db.Get(headerKey(number, hash))
        return nil
    })
    return data
}

First, try reader.Ancient("headers", number) — the freezer is indexed by block number alone.
If found, verify the hash matches (the freezer only stores canonical data — the requested hash might be a fork block).
If not found (or hash mismatch), fall back to db.Get(headerKey(number, hash)) — the key-value store, which stores both canonical and non-canonical blocks.

The ReadAncients wrapper ensures the entire callback runs under the freezer’s read lock, so no concurrent writes can change the data mid-read.
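The two-tier lookup can be modelled with plain Go data structures. In this sketch `toyDB`, `kvKey`, and `readHeader` are hypothetical, locking is left out, and SHA-256 from the standard library stands in for Keccak-256 purely to keep the example dependency-free.

```go
package main

import (
	"crypto/sha256"
	"fmt"
)

// hashOf stands in for Keccak-256 (assumption: SHA-256, to avoid dependencies).
func hashOf(data []byte) [32]byte { return sha256.Sum256(data) }

// kvKey models the (number, hash) part of a header key.
type kvKey struct {
	number uint64
	hash   [32]byte
}

// toyDB has an ancient tier indexed by block number alone and a live
// key-value tier indexed by number and hash.
type toyDB struct {
	ancient [][]byte
	kv      map[kvKey][]byte
}

// readHeader checks the ancient tier first, verifies the hash, and falls
// back to the key-value tier on a miss or mismatch.
func (db *toyDB) readHeader(number uint64, hash [32]byte) []byte {
	if number < uint64(len(db.ancient)) {
		data := db.ancient[number]
		if len(data) > 0 && hashOf(data) == hash {
			return data
		}
	}
	return db.kv[kvKey{number, hash}]
}

func main() {
	h0, h1 := []byte("header 0 (frozen)"), []byte("header 1 (live)")
	db := &toyDB{
		ancient: [][]byte{h0},
		kv:      map[kvKey][]byte{{1, hashOf(h1)}: h1},
	}
	fmt.Printf("%s\n", db.readHeader(0, hashOf(h0))) // served by the ancient tier
	fmt.Printf("%s\n", db.readHeader(1, hashOf(h1))) // served by the key-value tier
	fmt.Println(db.readHeader(0, hashOf(h1)) == nil) // hash mismatch, then a miss
}
```

The hash check is what lets a number-only index answer a (number, hash) query safely: a request for a non-canonical block at a frozen height simply falls through.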

The write side is simpler — it always targets the key-value store (data is only moved to the freezer later by the background freezer goroutine):

func WriteHeader(db ethdb.KeyValueWriter, header *types.Header) {
    var (
        hash   = header.Hash()
        number = header.Number.Uint64()
    )
    WriteHeaderNumber(db, hash, number)

    data, err := rlp.EncodeToBytes(header)
    if err != nil {
        log.Crit("Failed to RLP encode header", "err", err)
    }
    key := headerKey(number, hash)
    if err := db.Put(key, data); err != nil {
        log.Crit("Failed to store header", "err", err)
    }
}

WriteHeader does two things: stores the hash→number mapping (for reverse lookups) and stores the RLP-encoded header at h + number + hash.

State Data Accessors

The core/rawdb/accessors_state.go file provides accessors for state-related data — contract code, preimages, state IDs, and trie journals:

func ReadCode(db ethdb.KeyValueReader, hash common.Hash) []byte {
    data := ReadCodeWithPrefix(db, hash)
    if len(data) != 0 {
        return data
    }
    data, _ = db.Get(hash.Bytes())
    return data
}

func WriteCode(db ethdb.KeyValueWriter, hash common.Hash, code []byte) {
    if err := db.Put(codeKey(hash), code); err != nil {
        log.Crit("Failed to store contract code", "err", err)
    }
}

ReadCode tries the current prefixed scheme ("c" + codeHash) first, then falls back to a legacy scheme (bare codeHash as key) for backward compatibility.

State IDs map state roots to sequential numbers, used by the path-based trie database (see Chapter 03):

func ReadStateID(db ethdb.KeyValueReader, root common.Hash) *uint64 {
    data, err := db.Get(stateIDKey(root))
    if err != nil || len(data) == 0 {
        return nil
    }
    number := binary.BigEndian.Uint64(data)
    return &number
}

The Freezer: Ancient Storage

The key-value store (Pebble) is optimized for random reads and writes, but it pays a cost: LSM-tree compaction continuously rewrites data on disk. For historical chain data that is never modified after finalization, this overhead is wasteful. The freezer solves this by moving finalized blocks out of Pebble into append-only flat files.

How the Freezer Works

The freezer stores data in tables — each table holds one type of data. The chain freezer has four tables:

Table	Data	Prunable
`"headers"`	RLP-encoded block headers	No
`"hashes"`	Canonical block hashes (32 bytes each)	No
`"bodies"`	RLP-encoded block bodies	Yes
`"receipts"`	RLP-encoded receipts	Yes

Headers and hashes are kept forever (not prunable). Bodies and receipts can be pruned via TruncateTail — once pruned, they are no longer accessible from the freezer (though an optional Era database can serve as a backup).

Each table is stored as a pair of files on disk:

type freezerTable struct {
    items      atomic.Uint64   // total items stored (including removed from tail)
    itemOffset atomic.Uint64   // items removed from the table
    itemHidden atomic.Uint64   // items marked deleted but not yet physically removed
    // ...
    head   *os.File            // current data file being written to
    index  *os.File            // index file: maps item number → (filenum, offset)
    files  map[uint32]*os.File // all open data files
    // ...
}

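The index file contains fixed-size 6-byte entries (uint16 file number + uint32 offset). Each entry marks a boundary between items, so locating item N takes two adjacent entries: the one at byte position N×6 gives where the item starts and the one at (N+1)×6 gives where it ends (positions are relative to the table's tail once items have been pruned).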
The data files contain the actual blobs, optionally Snappy-compressed. Data files are capped at 2 GB each (freezerTableSize).

This design makes reads O(1): seek to the index entry, read the file number and offset, seek to that position in the data file, read the blob.

The Freezer Struct

The Freezer struct ties the tables together:

type Freezer struct {
    datadir string
    frozen  atomic.Uint64          // number of items frozen
    tail    atomic.Uint64          // first stored item
    // ...
    tables       map[string]*freezerTable
    instanceLock *flock.Flock      // prevents double-open
}

frozen tracks how many items have been written. tail tracks how many have been pruned from the start. Valid items are in the range [tail, frozen).

Writing to the Freezer

All writes go through ModifyAncients, which provides transactional semantics:

func (f *Freezer) ModifyAncients(fn func(ethdb.AncientWriteOp) error) (writeSize int64, err error) {
    f.writeLock.Lock()
    defer f.writeLock.Unlock()

    prevItem := f.frozen.Load()
    defer func() {
        if err != nil {
            for name, table := range f.tables {
                err := table.truncateHead(prevItem)
                // ...
            }
        }
    }()

    f.writeBatch.reset()
    if err := fn(f.writeBatch); err != nil {
        return 0, err
    }
    item, writeSize, err := f.writeBatch.commit()
    if err != nil {
        return 0, err
    }
    f.frozen.Store(item)
    return writeSize, nil
}

The write lock is held for the entire operation.
If the callback or the commit fails, the defer rolls back all tables to their previous item count.
On success, f.frozen advances to the new item count.
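The rollback behaviour can be exercised with an in-memory model. `toyFreezer` and `writeOp` below are hypothetical: tables are slices, appends are applied immediately rather than buffered in a write batch, and locking is omitted. Only the "remember the old item count, truncate every table back to it on error" pattern is taken from the code above.

```go
package main

import (
	"errors"
	"fmt"
)

// toyFreezer is a hypothetical in-memory freezer: each table is a slice of
// blobs and frozen is the committed item count shared by all tables.
type toyFreezer struct {
	frozen uint64
	tables map[string][][]byte
}

// writeOp plays the role of the write-operation handle given to the callback.
type writeOp struct{ f *toyFreezer }

// AppendRaw appends item as entry number of the named table.
func (op writeOp) AppendRaw(kind string, number uint64, item []byte) error {
	table, ok := op.f.tables[kind]
	if !ok {
		return fmt.Errorf("unknown table %q", kind)
	}
	if number != uint64(len(table)) {
		return fmt.Errorf("table %q: got item %d, expected %d", kind, number, len(table))
	}
	op.f.tables[kind] = append(table, item)
	return nil
}

// ModifyAncients runs fn and either commits all appends or undoes them.
func (f *toyFreezer) ModifyAncients(fn func(writeOp) error) (err error) {
	prevItem := f.frozen
	defer func() {
		if err != nil {
			// Roll every table back to the item count seen at entry.
			for name, table := range f.tables {
				f.tables[name] = table[:prevItem]
			}
		}
	}()
	if err = fn(writeOp{f}); err != nil {
		return err
	}
	// Commit only if all tables advanced to the same item count.
	item, first := prevItem, true
	for name, table := range f.tables {
		n := uint64(len(table))
		if first {
			item, first = n, false
		} else if n != item {
			return fmt.Errorf("table %q has %d items, expected %d", name, n, item)
		}
	}
	f.frozen = item
	return nil
}

func main() {
	f := &toyFreezer{tables: map[string][][]byte{"hashes": nil, "headers": nil}}

	_ = f.ModifyAncients(func(op writeOp) error {
		if err := op.AppendRaw("hashes", 0, []byte("hash-0")); err != nil {
			return err
		}
		return op.AppendRaw("headers", 0, []byte("header-0"))
	})

	err := f.ModifyAncients(func(op writeOp) error {
		if err := op.AppendRaw("hashes", 1, []byte("hash-1")); err != nil {
			return err
		}
		return errors.New("simulated failure before the header was appended")
	})
	fmt.Println(err != nil, f.frozen, len(f.tables["hashes"]), len(f.tables["headers"]))
}
```

The second call leaves both tables at one item, so a half-written block never becomes visible.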

The Chain Freezer: Background Migration

The chainFreezer in core/rawdb/chain_freezer.go wraps the base Freezer and adds a background goroutine that periodically moves finalized blocks from the key-value store into the freezer:

// core/rawdb/chain_freezer.go (inside freeze, simplified)

threshold, _ := f.freezeThreshold(nfdb)
frozen, _ := f.Ancients()

// frozen is unsigned, so guard against frozen-1 wrapping around when the
// freezer is still empty.
if frozen != 0 && frozen-1 >= threshold {
    return // nothing to freeze
}

// Phase 1: Copy blocks to the freezer
ancients, _ := f.freezeRange(nfdb, first, last)

// Phase 2: Sync freezer files to disk
f.SyncAncient()

// Phase 3: Delete the frozen blocks from the key-value store
batch := db.NewBatch()
for i := 0; i < len(ancients); i++ {
    if first+uint64(i) != 0 {
        DeleteBlockWithoutNumber(batch, ancients[i], first+uint64(i))
        DeleteCanonicalHash(batch, first+uint64(i))
    }
}
batch.Write()

The freeze threshold is max(finalized_block, head - 90000) — the higher of the finalized block number and 90,000 blocks behind the head (about 12.5 days of blocks).

The freezeRange method copies each block’s hash, header, body, and receipts into the freezer:

func (f *chainFreezer) freezeRange(nfdb *nofreezedb, number, limit uint64) (hashes []common.Hash, err error) {
    _, err = f.ModifyAncients(func(op ethdb.AncientWriteOp) error {
        for ; number <= limit; number++ {
            hash := ReadCanonicalHash(nfdb, number)
            header := ReadHeaderRLP(nfdb, hash, number)
            body := ReadBodyRLP(nfdb, hash, number)
            receipts := ReadReceiptsRLP(nfdb, hash, number)

            op.AppendRaw(ChainFreezerHashTable, number, hash[:])
            op.AppendRaw(ChainFreezerHeaderTable, number, header)
            op.AppendRaw(ChainFreezerBodiesTable, number, body)
            op.AppendRaw(ChainFreezerReceiptTable, number, receipts)
            hashes = append(hashes, hash)
        }
        return nil
    })
    return hashes, err
}

After copying, the freezer files are fsynced, and only then are the frozen blocks deleted from the key-value store in a batch. The same cleanup also removes any side-chain blocks (non-canonical forks) stored in the key-value store at those block numbers.

State History Freezer

In addition to the chain freezer, geth has a state history freezer for the path-based trie database. This stores old account and storage values so nodes can serve historical state queries. The tables are:

Table	Data
`"history.meta"`	Metadata for each state transition
`"account.index"`	Index into account data
`"storage.index"`	Index into storage data
`"account.data"`	Concatenated account diffs
`"storage.data"`	Concatenated storage slot diffs

All state history tables are prunable. The accessors_state.go file provides typed readers and writers (e.g., ReadStateHistory, WriteStateHistory) that work with these tables through the same AncientReaderOp/AncientWriteOp interfaces.

Putting It All Together

Here is the complete write path for a new block, from top to bottom:

StateDB.Commit()
  ├─ Write contract code      → rawdb.WriteCode()       → db.Put("c" + codeHash, code)
  ├─ Commit trie nodes         → triedb batch writes     → db.Put("A" + path, node)
  └─ Update snapshot           → db.Put("a" + hash, account)

blockchain.writeBlockAndSetHead()
  ├─ rawdb.WriteHeader()       → db.Put("h" + num + hash, headerRLP)
  ├─ rawdb.WriteBody()         → db.Put("b" + num + hash, bodyRLP)
  ├─ rawdb.WriteReceipts()     → db.Put("r" + num + hash, receiptsRLP)
  ├─ rawdb.WriteTxLookupEntries() → db.Put("l" + txHash, blockNum)
  └─ rawdb.WriteHeadBlockHash() → db.Put("LastBlock", hash)

        ↓ (later, background goroutine)

chainFreezer.freeze()
  ├─ Copy h/b/r to freezer flat files
  ├─ fsync freezer
  └─ Delete h/b/r from Pebble via batch

All key-value writes go through Pebble with async WAL writes for speed. The freezer migrates finalized blocks to flat files in the background, keeping the key-value store lean. Reads check the freezer first (for old canonical data), then fall back to the key-value store.

What’s Next

With the storage stack complete, we’ve covered the full bottom-to-top path: from raw bytes on disk through the trie, up to the in-memory state. Chapter 06 — Transaction Execution shifts to the other axis of the system — tracing what happens when a single transaction moves through geth’s execution pipeline.
