Database Indexes: Part 2 — How B-Tree Indexes Work Internally (With a Real Query Walkthrough)

Ever wonder how databases locate one record among millions in just milliseconds? This deep dive traces a real query through every layer of B-tree architecture—from root pages to raw bytes.

Makoto

Jul 29, 2025

A Millisecond Mystery

You're running a query on a table with 10 million users. You type:

And almost instantly, the row appears. Not seconds, not even a full blink — just milliseconds.

How is that even possible?

If we had to read the table record by record, it would take minutes. But under the hood, the database engine performs a dance of jumps, skips, and direct page reads. The secret behind this speed?

A humble yet powerful structure: the B-tree index.

Quick Bridge from Part 1

In Part 1, we saw how databases evolved from sequential files to B-trees to solve the linear search problem. Now it's time to see exactly how a B-tree delivers those millisecond responses.

Time to open the black box and see what happens inside.

Missed Part 1? Don’t worry — this deep dive stands on its own, but feel free to circle back later to understand how we got here.

Database Indexes: Part 1 — The Origin

Makoto

July 15, 2025

Read full story

Following Maria's Email Through the B-Tree

Let's trace this exact query through every step:

The Complete Path:

Query → BTreeIndex → Root Page → Leaf Page → Row Locator → Final Data

We'll see exactly what happens in memory, on disk, and how they transform into each other.

🔑 Key Concepts First

Before we dive into the journey, let's quickly understand the main concepts we'll encounter:

What is a B-Tree?

A B-tree is similar to a multi-way decision tree, where each node can have multiple children (not just two, as in binary trees). Think of it as a book with a detailed index where:

Each index entry (node) contains multiple topic ranges (keys)
Each topic range guides you to the right section or specific page
You can skip entire sections with each decision

Key Properties:

Always balanced - All leaf nodes are at the same depth
Wide, not tall - Nodes contain many keys (50-200+ per node)
Ordered - Keys within each node are sorted
Self-maintaining - Automatically stays balanced during inserts/deletes

Visual Structure:

            [100]                   ← Root node
           /     \
     [50 | 75]   [150 | 200]        ← Internal nodes  
     /   |   \    /   |    \
  [10] [60] [80] [120] [160] [250]  ← Leaf nodes with actual data

Each internal node acts like a "signpost" that says:

Root: everything < 100 goes left, ≥ 100 goes right
Left internal: < 50, 50-74, ≥ 75
Right internal: 100-149, 150-199, ≥ 200

Why This Structure Matters:

Height stays low: A 3-level B-tree can handle millions of records. Compare this to a binary tree that might need 20+ levels for the same data.

Disk-friendly: Each node fits exactly in one 4KB page, so every "decision" requires just one disk read.

What is a BTreeIndex?

Think of it as a separate file (users_email.idx) that acts like a "smartphone book" for your tablet. Instead of names→phone numbers, it maps emails→exact locations where those user records live.

What are "Pages"?

Both index and data are stored in 4KB chunks called "pages" - like pages in a book. Each page has a unique number (Page #1, Page #47, etc.) so the database can jump directly to any page without reading everything before it.

Why 4KB pages?

This matches how your hard disk and operating system prefer to read data - one efficient chunk at a time, not byte by byte. It's like reading a paragraph instead of one letter at a time.

Memory ↔ Disk Transformation

You'll see "Memory" and "Disk" side-by-side throughout our journey:

In Memory: Objects with properties (BTreeNode.keys, BTreeNode.children)
On Disk: Raw bytes stored in files (0x6d 0x61 0x72...)

The database constantly transforms between these: read from disk → deserialize into objects → work in memory → serialize back → write to disk.

Now let's see how these concepts work together in practice...

🔍Step 0: Understanding Our Data

Before we search for Maria’s email, the engine first needs to understand what kind of table it’s dealing with — its shape, structure, and what fields exist.

What is our Users Table?

First, let's see what we're indexing. Our users table looks like this:

id: UUID identifier
email: user's email address
name: user's full name
age: user's age
created_at: timestamp
An Index: Build B-tree on email field for fast lookups

Sample data:

What happens on disk:

When you create this table and index, the database creates two separate files:

users_data.bin - The actual table data
- Contains all user records (id, email, name, age, created_at)
- Organized in 4KB pages for efficient reading
- This is where your actual data lives
users_email.idx - The email index
- Contains only email addresses + pointers to the data file
- Much smaller than the data file
- Organized as a B-tree for fast searching

File System:
├── users_data.bin    (10MB - user records)
└── users_email.idx   (2MB - just emails + pointers)

📌 Note: This article focuses on the **logical layout** of B-tree indexes — how they map keys to data pages. We’re not diving into insertion, deletion, or rebalancing algorithms here — not because they’re boring (they’re not!), but because many excellent articles already cover them well.

Enjoying the deep dive? Feel free to share it
Share

Step 1: Loading the Root Page

What the Database Does:

Open the index file: users_email.idx
Read page #1 (4KB) from disk
Deserialize bytes into BTreeNode object

Step 2: Navigation Logic

Understanding the Structure We Just Discovered

Now that we've loaded the root page, we can see the navigation paths available to us:

What this structure tells us:

The path we'll take - Root → Page #52 → Maria's record
Page numbers aren't sequential - #47, #52, #67 could be scattered across the disk
We only load what we need - We've loaded Page #1 (root), and will only read Page #52 from disk into memory next, leaving #47 and #67 as untouched bytes on disk
Navigation boundaries - The root keys ["clara@", "pedro@"] divide the data space into ranges

B-Tree Decision at the Root

To decide which child page to follow, the B-tree compares our target email with the boundary keys in the root node:

Target: "maria@example.com"
Root keys: ["clara@", "pedro@"]

We go step by step:

Compare with "clara@"
"maria@" >= "clara@" → ✅ Yes → So it's not in the leftmost child (Page #47)
Compare with "pedro@"
"maria@" < "pedro@" → ✅ Yes → So it's before the "pedro@" boundary

Together, these comparisons tell us:
"maria@" falls in the middle range: ["clara@", "pedro@")

This means we navigate to:

→ children[1] = Page #52

Note: B-trees use half-open intervals — ranges that are inclusive on the left and exclusive on the right. So the middle child is responsible for any key ≥ "clara@" and < "pedro@".

Step 3: Loading the Leaf Page

Understanding Internal vs Leaf Pages

We're about to encounter two different types of pages in our B-tree:

Internal Pages (like the root): Act like a "table of contents"

Store boundary keys that divide data into sections
Point to other pages (either more internal pages or leaf pages)
Don't contain actual email addresses - just navigation guides

Leaf Pages: Store the actual index entries

Contain the real email addresses we're searching for
Point to actual data locations (not other pages)
This is where we'll finally find "maria@example.com"

Think of it like a library:

Internal pages = Floor directory ("Science books: Floor 3")
Leaf pages = Shelf labels ("Biology: Books A-M on this shelf")

Disk Read Result:

Step 4: Finding the Exact Match

We already loaded Page #52 and saw its memory structure:

keys:     ["clara@example.com", "luis@example.com", "maria@example.com"]
locators: [(202,45), (89,12), (134,7)]

Now it’s time to match our search target.

Searching in the Leaf Page

The database now performs a linear scan inside the node — but only within that tiny 4KB page.

It compares each stored key with your target:

Target key: "maria@example.com" 
1. "clara@example.com" → ❌ not a match 
2. "maria@example.com" → ✅ match found!

Once matched, it returns the Row Locator:

RowLocator result:
├─ page_id: 134     ← Points to the data file page
├─ slot_id: 7       ← Record’s slot within that page

The index never stores the actual user row. It just tells you exactly where to find it.

The Row Locator acts like GPS: it leads the engine directly to Maria’s record in the data file — without touching anything else.

Step 5: Final Data Access

Understanding Index vs Data Storage

We're about to jump from the index file to the actual data file. These are stored separately:

Index File (users_email.idx): Contains only navigation info (keys + pointers)

Like a library catalog that tells you where books are located
Small and fast to search through
Doesn't contain actual user data

Data File (users_data.bin): Contains the actual user records

Like the actual books on library shelves
Large file with all the real data
Organized in 4KB pages for efficient access

The Row Locator (134,7) is like a call number that takes you from the catalog to the exact shelf and book you want.

Row Locator Usage:

Using (page=134, slot=7):

Open data file: users_data.bin
seek(data_file, page_134_offset)
Calculate slot offset: 24 + (7 × 118) = 850 bytes
read(118_bytes) from that position

Data Page Structure:

Record Layout (118 bytes each):

Reality Check: Binary Storage on Disk

So far we've shown conceptual examples — nodes, keys, and locators. But when the database hits the disk, there are no JSON objects or structs waiting to be read.

Everything is just bytes — raw, fixed-size, and precisely arranged.

When the database receives a row locator like (134, 7), it doesn't need to search or parse. It does one thing: math.

What is `seek()`?

seek(offset) is a system call that tells the operating system:

“Move the read pointer in this file to byte position X. Don’t read anything yet — just get ready.”

Why seek() works so efficiently

Because every part of the layout is predictable:

Pages are fixed at 4KB → easy to compute start offsets
Records are fixed at 118 bytes → slot math is trivial
No variable-length gaps → no need to read forward to find boundaries

No scanning, no parsing, no disk traversal. Just a mathematical jump.

Why it’s so fast

Files are byte-addressable
All files are just sequences of bytes. Your OS sees a file like this:

[0][1][2]...[827400][827401][827402]...

So when you say seek(827400), the OS moves the internal read pointer there instantly.

No I/O happens during seek()
seek() itself doesn’t read — it only positions the pointer.
Actual disk I/O happens when you call read().

🌳Why B-Trees Matter

We just traced a single query through every layer — from index to page, to raw bytes. Now, let's connect the dots:

The Complete Architecture

BTreeIndex - The entry point that knows where to find the root page on disk
Page IDs - Not memory addresses, but file positions that seek() can jump to instantly
Internal pages - Navigation layers that divide the search space into manageable ranges
Leaf pages - The final destination where actual email keys live alongside their row locators
Row Locators - GPS coordinates (page_id, slot_id) that become pure math: page_id × 4096 + slot_id × record_size

The Speed Formula

What makes this architecture so fast isn't just one thing — it's how every piece works together:

Logarithmic Navigation - O(log n) path means 3-4 page reads for millions of records
Disk-Optimized Design - 4KB pages match exactly what your OS and disk prefer to read
Mathematical Precision - Row locators become instant byte calculations via seek(549714)
Separation of Concerns - Index file (small, fast) tells you exactly where to look in data file (large, direct access)
Minimal I/O Operations - Root + internal + leaf + data = 4-5 disk reads total, regardless of table size

This is why a query on 10 million records feels instant — the database isn't searching through millions of records, it's doing math to jump directly to the right few bytes.

🚀 What's Next in Part 3

In the final part, we'll dive into how PostgreSQL actually implements this:

Real page headers, checksums, and MVCC metadata
The pageinspect extension to examine actual index pages
How concurrent transactions affect index navigation
Advanced optimization techniques with EXPLAIN ANALYZE

The journey from our conceptual B-tree to production PostgreSQL internals—where theory meets the messy reality of concurrent, persistent storage.

Want to dive deeper into B-tree algorithms? We focused on how B-trees enable fast queries through smart architecture and disk optimization. If you're curious about the detailed mechanics of B-tree insertion, deletion, and rebalancing algorithms, check out PlanetScale's excellent deep dive — it covers the algorithmic details we intentionally skipped here.

Makotons Substack

Database Indexes: Part 1 — The Origin

Discussion about this post

Ready for more?

Makotons Substack

Database Indexes: Part 2 — How B-Tree Indexes Work Internally (With a Real Query Walkthrough)

Ever wonder how databases locate one record among millions in just milliseconds? This deep dive traces a real query through every layer of B-tree architecture—from root pages to raw bytes.

A Millisecond Mystery

Quick Bridge from Part 1

Database Indexes: Part 1 — The Origin

Following Maria's Email Through the B-Tree

🔑 Key Concepts First

What is a B-Tree?

Key Properties:

Visual Structure:

Why This Structure Matters:

What is a BTreeIndex?

What are "Pages"?

Why 4KB pages?

Memory ↔ Disk Transformation

🔍Step 0: Understanding Our Data

What is our Users Table?

Step 1: Loading the Root Page

Step 2: Navigation Logic

Understanding the Structure We Just Discovered

B-Tree Decision at the Root

Step 3: Loading the Leaf Page

Understanding Internal vs Leaf Pages

Step 4: Finding the Exact Match

Searching in the Leaf Page

Step 5: Final Data Access

Understanding Index vs Data Storage

Row Locator Usage:

Reality Check: Binary Storage on Disk

What is seek()?

Why seek() works so efficiently

Why it’s so fast

🌳Why B-Trees Matter

The Complete Architecture

The Speed Formula

🚀 What's Next in Part 3

Discussion about this post

Ready for more?

What is `seek()`?