Building a Memory Engine on Object Storage: A Deep Dive


Introduction

I've been fascinated by how modern data systems are moving away from traditional databases. Companies are storing petabytes of data on S3 while keeping their compute separate. Products like TurboPuffer, DuckDB, and ClickHouse are leading this shift.

So I decided to build my own memory engine to understand how this works. Not a production system, just enough to grasp the core concepts: how do you make queries fast when your data lives in slow object storage?

Turns out, it's all about clever indexing, smart query planning, and understanding the cost model of object stores.


Why Object Storage?

Before diving into the implementation, let's talk about why this architecture matters.

The Traditional Model: Coupled Storage and Compute

Traditional databases like PostgreSQL or MySQL couple storage and compute: the same servers that execute queries also hold the data on local disks.

This has problems:

  • Storage costs scale with compute - You pay for SSD even when idle
  • Hard to scale independently - Need more storage? Add more servers
  • Data locality - Difficult to share data across regions

The Modern Model: Separated Storage and Compute

Now look at how systems like Snowflake and DuckDB work: data lives in cheap object storage, while stateless compute nodes fetch only what each query needs.

Benefits:

  • Storage is cheap - S3 costs $0.023/GB/month vs $0.10+ for SSD
  • Infinite scale - No disk limits
  • Independent scaling - Spin up compute only when needed
  • Data sharing - Multiple systems access the same data

The catch? Object storage is slow. S3 GET requests take 10-100ms. A database disk read takes <1ms.

So how do we make it fast?


The Core Challenge: Minimizing Object Store Reads

Here's the fundamental problem: if you store 1TB across 10,000 blocks in S3, and you want to find rows where city = "NYC", you could:

Naive approach:

  1. Fetch all 10,000 blocks from S3 (10,000 GET requests)
  2. Scan every row in memory
  3. Return matching rows

Cost: 10,000 × 100ms = 1,000 seconds (16+ minutes)

Smart approach:

  1. Use an index to find which 5 blocks contain NYC data
  2. Fetch only those 5 blocks (5 GET requests)
  3. Return matching rows

Cost: 5 × 100ms = 500ms

The difference? Indexes and query planning.


Building the Engine: Architecture Overview

I built a simplified memory engine with six key components: a storage abstraction, a block manager, a write-ahead log, indexes, a query planner, and an execution engine.

Let me walk through each layer.


Layer 1: Storage Abstraction

First, we need to abstract the object store. Whether it's S3, GCS, or Azure, they all have the same basic operations:

ObjectStore Interface:
  - Put(key, data)
  - Get(key) → data
  - List(prefix) → keys[]
  - Delete(key)

This abstraction means you can develop locally against the filesystem and deploy to real S3 without changing any code above this layer.
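A minimal sketch of this interface in Python, with an in-memory implementation standing in for S3/GCS/Azure; the class and method names are assumptions for illustration, not the engine's actual API:

```python
# Sketch of the ObjectStore abstraction. MemoryStore stands in for S3
# during local development; a real S3 implementation would wrap boto3.
from abc import ABC, abstractmethod


class ObjectStore(ABC):
    @abstractmethod
    def put(self, key: str, data: bytes) -> None: ...

    @abstractmethod
    def get(self, key: str) -> bytes: ...

    @abstractmethod
    def list(self, prefix: str) -> list[str]: ...

    @abstractmethod
    def delete(self, key: str) -> None: ...


class MemoryStore(ObjectStore):
    """In-memory stand-in for local development and tests."""

    def __init__(self):
        self._objects: dict[str, bytes] = {}

    def put(self, key, data):
        self._objects[key] = data

    def get(self, key):
        return self._objects[key]

    def list(self, prefix):
        return sorted(k for k in self._objects if k.startswith(prefix))

    def delete(self, key):
        self._objects.pop(key, None)
```

Everything above this layer talks to `ObjectStore`, so swapping the backend is a constructor change.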


Layer 2: Block Manager

Object stores work with immutable objects. Once written, they can't be modified. So we organize data into fixed-size blocks:

Block Structure:
{
  id: "users_block_0"
  table_name: "users"
  rows: {
    "u001": {id: "u001", name: "Alice", age: 25, city: "NYC"},
    "u002": {id: "u002", name: "Bob", age: 30, city: "SF"}
  }
}

Each block becomes one object in S3. For example:

  • tables/users/users_block_0 contains rows 1-100
  • tables/users/users_block_1 contains rows 101-200

Why fixed-size blocks?

  • Predictable S3 GET costs
  • Better caching (fetch 1MB, not 1GB)
  • Enables parallel fetching

The block manager maintains a manifest file tracking all blocks:

{
  "blocks": {
    "users": ["users_block_0", "users_block_1", "users_block_2"],
    "products": ["products_block_0"]
  }
}
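The block and manifest bookkeeping can be sketched as follows; `BlockManager`, the key layout, and the JSON encoding are illustrative assumptions, with any object with `put`/`get` standing in for the object store:

```python
import json


class BlockManager:
    """Sketch: packs a batch of rows into an immutable block object and
    tracks all blocks in a manifest object. Names are illustrative."""

    def __init__(self, store):
        self.store = store
        self.manifest = {"blocks": {}}

    def write_block(self, table: str, rows: dict) -> str:
        block_ids = self.manifest["blocks"].setdefault(table, [])
        block_id = f"{table}_block_{len(block_ids)}"
        block = {"id": block_id, "table_name": table, "rows": rows}
        self.store.put(f"tables/{table}/{block_id}", json.dumps(block).encode())
        block_ids.append(block_id)
        # Persist the manifest so readers can discover every block.
        self.store.put("manifest.json", json.dumps(self.manifest).encode())
        return block_id

    def read_block(self, table: str, block_id: str) -> dict:
        return json.loads(self.store.get(f"tables/{table}/{block_id}"))
```

Because blocks are immutable, the manifest is the only object that ever changes, which keeps consistency concerns in one place.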

Layer 3: Write-Ahead Log (WAL)

Here's a key insight: S3 PUT operations are expensive. Each PUT costs $0.005 per 1,000 requests.

If we write every single row insertion as a separate object, costs explode. Instead, we batch writes using a Write-Ahead Log:

Writes accumulate in memory, then flush to blocks periodically (e.g., after 100 entries).

This is inspired by LSM-trees (used in RocksDB, Cassandra). The pattern:

  1. Buffer writes in memory
  2. Periodically flush to immutable blocks
  3. Merge/compact old blocks later
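The buffer-then-flush pattern can be sketched in a few lines; `WriteAheadLog`, `flush_fn`, and the threshold are illustrative assumptions (a real WAL would also persist entries durably before acknowledging writes, for crash recovery):

```python
class WriteAheadLog:
    """Sketch of the batching WAL: buffer writes in memory and hand them
    to a flush callback once the buffer hits a threshold."""

    def __init__(self, flush_fn, flush_threshold=100):
        self.entries = []            # buffered (table, primary_key, row) tuples
        self.flush_fn = flush_fn     # e.g. writes a block via the block manager
        self.flush_threshold = flush_threshold

    def append(self, table, pk, row):
        self.entries.append((table, pk, row))
        if len(self.entries) >= self.flush_threshold:
            self.flush()

    def flush(self):
        if self.entries:
            self.flush_fn(self.entries)   # one S3 PUT per batch, not per row
            self.entries = []
```

One PUT per 100 writes instead of one per write cuts request costs by two orders of magnitude.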

Layer 4: Indexing - The Secret Sauce

This is where the magic happens. Without indexes, you scan every block. With indexes, you know exactly which blocks to fetch.

Schema Definition

First, we define a schema with typed columns:

Schema: users
Columns:
  - id: String (Primary Key, Required)
  - name: String (Indexed, Required)
  - age: Integer (Indexed, Required)
  - city: String (Indexed, Optional)

Marking columns as indexed creates secondary indexes on those fields.

Index Structure

Each indexed column gets its own in-memory index:

City Index Structure:
{
  "NYC": [
    {primary_key: "u001", block_id: "users_block_0"},
    {primary_key: "u003", block_id: "users_block_0"}
  ],
  "SF": [
    {primary_key: "u002", block_id: "users_block_0"},
    {primary_key: "u005", block_id: "users_block_1"}
  ],
  "LA": [
    {primary_key: "u004", block_id: "users_block_1"}
  ]
}

When you insert a row with city: "NYC", the city index gets updated to point to that row's location.
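One way to sketch such an index in Python; `SecondaryIndex` and its methods are illustrative names, not the engine's actual API:

```python
from collections import defaultdict


class SecondaryIndex:
    """Sketch of one in-memory secondary index: maps a column value to
    the (primary_key, block_id) locations of matching rows."""

    def __init__(self, column):
        self.column = column
        self.entries = defaultdict(list)   # value -> [(pk, block_id), ...]

    def add(self, row, pk, block_id):
        value = row.get(self.column)
        if value is not None:              # optional columns may be absent
            self.entries[value].append((pk, block_id))

    def lookup(self, value):
        return self.entries.get(value, [])
```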

Real-World Example

Let's say we have 5 users across 2 blocks:

Block 0: u001 (Alice, 25, NYC)
         u002 (Bob, 30, SF)
         u003 (Charlie, 25, NYC)

Block 1: u004 (David, 35, LA)
         u005 (Eve, 28, SF)

Indexes in Memory:
├─ City Index:
│  ├─ "NYC" → [u001@block_0, u003@block_0]
│  ├─ "SF"  → [u002@block_0, u005@block_1]
│  └─ "LA"  → [u004@block_1]
└─ Age Index:
   ├─ 25 → [u001@block_0, u003@block_0]
   ├─ 28 → [u005@block_1]
   ├─ 30 → [u002@block_0]
   └─ 35 → [u004@block_1]

Query: WHERE city = "NYC"

  • Look up "NYC" in city index → finds u001, u003 in block_0
  • Fetch only block_0 from S3
  • Return matching rows

1 block fetch instead of 2. On a real dataset with thousands of blocks, this is the difference between 10ms and 10 seconds.


Layer 5: Query Planner

The query planner turns logical queries into physical execution plans. It's the "brain" that decides which blocks to fetch.

Here's how it works for a multi-predicate query:

Query: WHERE age = 25 AND city = "NYC"

Planning steps:

1. Look up age = 25 in age index
   → Returns: [u001@block_0, u003@block_0]

2. Look up city = "NYC" in city index
   → Returns: [u001@block_0, u003@block_0]

3. Intersect (AND logic)
   → Matching: [u001@block_0, u003@block_0]

4. Deduplicate blocks
   → Unique blocks: [block_0]

5. Create execution plan
   → Fetch: [block_0]
   → Extract: [u001, u003]

The executor then fetches only block_0 and extracts u001, u003.
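The planning steps above can be sketched as a single function; the plan shape and `plan_and_query` name are illustrative assumptions:

```python
def plan_and_query(index_lookups):
    """Sketch of the AND planner: intersect per-predicate postings on
    (pk, block_id), then deduplicate block ids. `index_lookups` is a list
    of postings lists, each [(pk, block_id), ...] as an index returns."""
    sets = [set(postings) for postings in index_lookups]
    matching = set.intersection(*sets) if sets else set()
    blocks = sorted({block_id for _, block_id in matching})   # dedupe fetches
    keys = sorted({pk for pk, _ in matching})
    return {"fetch_blocks": blocks, "extract_keys": keys}
```

OR queries would union the sets instead of intersecting them; everything else stays the same.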

Query Types Supported

1. Point Lookup (Primary Key)

Query: GET user WHERE id = "u002"
Plan: Fetch block_0, extract u002
Cost: 1 S3 GET

2. Equality Filter

Query: SELECT * WHERE city = "NYC"
Plan: Fetch [block_0], filter for NYC
Cost: 1 S3 GET

3. Range Query

Query: SELECT * WHERE age BETWEEN 25 AND 30
Plan: Fetch [block_0, block_1], filter age range
Cost: 2 S3 GETs

4. Multi-Predicate (AND)

Query: SELECT * WHERE age = 25 AND city = "NYC"
Plan: Intersect indexes, fetch [block_0]
Cost: 1 S3 GET

Layer 6: Execution Engine

The execution engine takes the plan and fetches data. The process:

Execution Flow:
1. Check WAL (recent writes not yet in blocks)
2. For each block_id in plan:
   a. Fetch block from S3
   b. Extract specified rows
   c. Apply any remaining filters
3. Merge results from WAL + blocks
4. Return to client

Key optimizations:

  • WAL check first - Recent writes aren't in blocks yet
  • Parallel block fetching - Fetch multiple blocks concurrently
  • Early termination - Stop once limit is reached
  • Result streaming - Don't materialize everything in memory
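The flow can be sketched as follows (without the parallelism and streaming optimizations); `execute`, the plan shape, and the WAL-entry tuples are illustrative assumptions:

```python
def execute(plan, wal_entries, fetch_block):
    """Sketch of the executor: extract rows from the planned blocks,
    then overlay not-yet-flushed WAL rows, which win on key conflicts
    because they are newer. `fetch_block` does one GET per block."""
    results = {}
    for block_id in plan["fetch_blocks"]:
        block = fetch_block(block_id)
        for pk in plan["extract_keys"]:
            if pk in block["rows"]:
                results[pk] = block["rows"][pk]
    for _table, pk, row in wal_entries:        # overlay recent writes
        if pk in plan["extract_keys"]:
            results[pk] = row
    return results
```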

How TurboPuffer and DuckDB Do It

Now that we understand the basics, let's see how production systems tackle this.

TurboPuffer's Approach

TurboPuffer is a vector database built on S3. They use:

1. Columnar Storage

Instead of storing rows together, they store columns separately.

Why? If you query SELECT age WHERE city = "NYC", you only fetch the age and city columns, not names or IDs.

2. Bloom Filters

Before fetching a block, check a bloom filter (probabilistic data structure):

Bloom Filter per Block:
block_0: bloomfilter(all_values_in_block)
block_1: bloomfilter(all_values_in_block)

Query: WHERE city = "NYC"
- Check block_0 bloom filter → "might contain NYC" → fetch
- Check block_1 bloom filter → "definitely no NYC" → skip

This avoids wasted S3 GETs.
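A toy bloom filter along these lines, built on stdlib hashing; the sizes and hashing scheme are illustrative. A "might contain" can be a false positive, but a "no" is always safe to skip:

```python
import hashlib


class BloomFilter:
    """Sketch of a per-block bloom filter: no false negatives, rare
    false positives. Real systems tune num_bits/num_hashes to the
    expected number of values."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = 0                       # bitset stored as a big int

    def _positions(self, value: str):
        for seed in range(self.num_hashes):
            h = hashlib.sha256(f"{seed}:{value}".encode()).digest()
            yield int.from_bytes(h[:8], "big") % self.num_bits

    def add(self, value: str):
        for pos in self._positions(value):
            self.bits |= 1 << pos

    def might_contain(self, value: str) -> bool:
        return all(self.bits & (1 << pos) for pos in self._positions(value))
```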

3. Compression

Each column is compressed separately using algorithms optimized for the data type:

  • Integers: Delta encoding, bit packing
  • Strings: Dictionary encoding
  • Floats: Gorilla compression

Result: blocks are roughly 10x smaller, so the same data spans ~10x fewer blocks and needs ~10x fewer S3 GETs.
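Delta and dictionary encoding are simple enough to sketch directly; the function names are illustrative, and real engines pack the resulting small integers into fewer bits:

```python
def delta_encode(values):
    """Delta encoding for integer columns: store the first value plus
    successive differences, which are small and compress well."""
    if not values:
        return []
    return [values[0]] + [b - a for a, b in zip(values, values[1:])]


def delta_decode(deltas):
    out, acc = [], 0
    for i, d in enumerate(deltas):
        acc = d if i == 0 else acc + d
        out.append(acc)
    return out


def dictionary_encode(values):
    """Dictionary encoding for strings: store each distinct value once,
    plus a small integer code per row."""
    dictionary = sorted(set(values))
    codes = {v: i for i, v in enumerate(dictionary)}
    return dictionary, [codes[v] for v in values]
```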

DuckDB's Approach

DuckDB takes it further with vectorized execution:

1. Parquet Format

DuckDB reads Parquet files directly from S3. Parquet stores data in columnar format with built-in compression and statistics.

2. Predicate Pushdown

DuckDB uses Parquet metadata to skip entire row groups:

Query: SELECT * FROM users WHERE age > 30

Parquet Metadata:
row_group_0: min_age=20, max_age=29  → SKIP (all ages < 30)
row_group_1: min_age=30, max_age=45  → FETCH (might contain matches)
row_group_2: min_age=22, max_age=28  → SKIP (all ages < 30)

Only fetches row_group_1
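The pruning logic can be sketched against hypothetical row-group metadata mirroring the example above; the metadata shape and `prune_row_groups` name are assumptions, not Parquet's actual API:

```python
def prune_row_groups(row_groups, column, predicate_min):
    """Sketch of min/max pruning (the idea behind Parquet predicate
    pushdown): skip any row group whose max for the column cannot
    satisfy `column > predicate_min`."""
    return [
        rg["name"]
        for rg in row_groups
        if rg["stats"][column]["max"] > predicate_min
    ]
```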

3. Parallel S3 Reads

DuckDB can fetch 100+ blocks in parallel, saturating network bandwidth.

Apache Arrow's Influence

Both systems are inspired by Apache Arrow, which defines:

  • Columnar memory layout - CPU cache-friendly
  • Zero-copy reads - No deserialization overhead
  • SIMD operations - Process multiple values per instruction

Arrow format structure:

Column "age" (int32):
┌──────────────────────┐
│ Validity Bitmap      │  1 bit per value (null or not)
├──────────────────────┤
│ Data Buffer          │  Packed int32 array
│ [25, 30, 25, 35, 28] │
└──────────────────────┘

This layout enables SIMD instructions to process 8 values at once.
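A stdlib-only sketch of that layout for a nullable int32 column (real code would use pyarrow); `build_arrow_column` is an illustrative name:

```python
import struct


def build_arrow_column(values):
    """Sketch of the Arrow-style layout: a validity bitmap (1 bit per
    value) plus a packed little-endian int32 data buffer. Null slots
    still occupy space so positions stay aligned."""
    validity = 0
    data = []
    for i, v in enumerate(values):
        if v is not None:
            validity |= 1 << i    # bit i set means value i is non-null
            data.append(v)
        else:
            data.append(0)        # slot occupied, value undefined
    buffer = struct.pack(f"<{len(data)}i", *data)
    return validity, buffer
```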


Performance Comparison

Let's compare approaches on a 1M row table:

| Approach                     | Blocks Fetched | S3 GETs | Query Time   |
|------------------------------|----------------|---------|--------------|
| No indexes (full scan)       | 1,000          | 1,000   | ~100 seconds |
| Primary key index            | 1              | 1       | ~100ms       |
| Secondary index (city)       | 10             | 10      | ~1 second    |
| Multi-index (age + city)     | 5              | 5       | ~500ms       |
| + Bloom filters              | 3              | 3       | ~300ms       |
| + Columnar storage           | 1              | 1       | ~100ms       |
| + Parquet predicate pushdown | 0.5            | 1       | ~50ms        |

The progression is clear: smart indexing and storage format matter more than raw compute power.


Real-World Tradeoffs

Building this taught me that separation of storage and compute isn't free:

Pros

  • Cheap storage - S3 is 10x cheaper than SSD
  • Infinite scale - Store petabytes without worry
  • Elasticity - Spin up compute only when needed
  • Data sharing - Multiple systems access the same data

Cons

  • Latency - 100ms per S3 GET vs <1ms for local disk
  • Network costs - S3 egress can be expensive
  • Complexity - Need sophisticated indexing and caching
  • Cold starts - First query is slow (empty cache)

The sweet spot? OLAP workloads (analytics, BI tools) where:

  • Queries are complex but infrequent
  • Data is read-heavy, not write-heavy
  • Cost matters more than latency

For OLTP workloads (transactional databases), traditional databases still win.


The Complete Flow: A Query's Journey

Let me show you what happens when a query executes:

Step-by-step:

  1. Query arrives - Client sends: WHERE city='NYC' AND age=25

  2. Planning phase - Planner consults indexes:

    • City index: "NYC" → [u001, u003] in block_0
    • Age index: 25 → [u001, u003] in block_0
    • Intersection: [u001, u003]
    • Plan: Fetch block_0
  3. Execution phase:

    • Check WAL for recent writes
    • Fetch block_0 from S3 (100ms)
    • Extract rows u001, u003
    • Apply any remaining filters
  4. Return results - Send 2 rows to client

Total time: ~150ms (instead of scanning all blocks)


What I Learned

Building this from scratch was eye-opening:

1. Indexes are everything
Without indexes, object storage is unusable. With indexes, it's competitive with traditional databases.

2. The cost model is different
In traditional DBs, you optimize for disk seeks. In object storage, you optimize for GET requests.

3. Immutability is powerful
Immutable blocks enable aggressive caching, time travel queries, and easy replication.

4. Columnar wins for analytics
Storing columns separately means you only fetch what you need.

5. There's no free lunch
Separation of storage and compute trades latency for cost and scale. Know your workload.


Going Further

This implementation barely scratches the surface. To make it production-ready, you'd need:

  • Bloom filters per block - Skip blocks without fetching
  • Compaction - Merge small blocks into larger ones
  • Columnar storage - Use Parquet or Arrow format
  • Caching layer - Cache hot blocks in memory/local SSD
  • Parallel execution - Fetch 100+ blocks concurrently
  • Metadata versioning - Handle concurrent writes
  • Query optimization - Cost-based query planning

But the fundamentals are all here: schemas, indexes, query planning, and block management.


Conclusion

The shift from coupled to separated storage and compute is one of the most important architectural changes in modern data systems. It's why Snowflake can scale to exabytes, why DuckDB can query S3 faster than many databases, and why TurboPuffer can serve vector search at massive scale.

The key insight? Indexes and metadata in memory, data in object storage. Use the fast tier (memory) to avoid the slow tier (S3).

If you're building a data system today, this architecture should be your starting point. The economics are too compelling to ignore.

Storage is cheap. Compute is elastic. The future of databases is separated.

