Building a Memory Engine on Object Storage: A Deep Dive
Introduction
I've been fascinated by how modern data systems are moving away from traditional databases. Companies are storing petabytes of data on S3 while keeping their compute separate. Products like TurboPuffer, DuckDB, and ClickHouse are leading this shift.
So I decided to build my own memory engine to understand how this works. Not a production system, just enough to grasp the core concepts: how do you make queries fast when your data lives in slow object storage?
Turns out, it's all about clever indexing, smart query planning, and understanding the cost model of object stores.
Why Object Storage?
Before diving into the implementation, let's talk about why this architecture matters.
The Traditional Model: Coupled Storage and Compute
Traditional databases like PostgreSQL or MySQL couple storage and compute: the data lives on disks attached to the same machines that run the queries.
This has problems:
- Storage costs scale with compute - You pay for SSD even when idle
- Hard to scale independently - Need more storage? Add more servers
- Data locality - Difficult to share data across regions
The Modern Model: Separated Storage and Compute
Now look at how systems like Snowflake and DuckDB work: stateless compute nodes read shared data straight from object storage.
Benefits:
- Storage is cheap - S3 costs $0.023/GB/month vs $0.10+ for SSD
- Infinite scale - No disk limits
- Independent scaling - Spin up compute only when needed
- Data sharing - Multiple systems access the same data
The catch? Object storage is slow. S3 GET requests take 10-100ms. A database disk read takes <1ms.
So how do we make it fast?
The Core Challenge: Minimizing Object Store Reads
Here's the fundamental problem: if you store 1TB across 10,000 blocks in S3, and you want to find rows where city = "NYC", you could:
Naive approach:
- Fetch all 10,000 blocks from S3 (10,000 GET requests)
- Scan every row in memory
- Return matching rows
Cost: 10,000 × 100ms = 1,000 seconds (16+ minutes)
Smart approach:
- Use an index to find which 5 blocks contain NYC data
- Fetch only those 5 blocks (5 GET requests)
- Return matching rows
Cost: 5 × 100ms = 500ms
The difference? Indexes and query planning.
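The arithmetic above can be wrapped in a tiny cost model. This is a sketch, not a real pricing calculator; the 100ms GET latency is the figure assumed throughout this article:

```python
def scan_cost_ms(blocks_fetched, get_latency_ms=100):
    """Back-of-the-envelope query latency for sequential block fetches:
    dominated entirely by object-store GET round trips."""
    return blocks_fetched * get_latency_ms
```

Plugging in the two scenarios: `scan_cost_ms(10_000)` gives 1,000,000ms (about 16 minutes), while `scan_cost_ms(5)` gives 500ms.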
Building the Engine: Architecture Overview
I built a simplified memory engine with six key components:
- Storage abstraction
- Block manager
- Write-ahead log (WAL)
- Indexing
- Query planner
- Execution engine
Let me walk through each layer.
Layer 1: Storage Abstraction
First, we need to abstract the object store. Whether it's S3, GCS, or Azure, they all have the same basic operations:
ObjectStore Interface:
- Put(key, data)
- Get(key) → data
- List(prefix) → keys[]
- Delete(key)
This abstraction means you can develop locally against the filesystem and deploy to real S3 without changing any code above this layer.
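In Python, a minimal sketch of this interface might look like the following (the class and method names are my own, not from any particular library), with an in-memory implementation standing in for the local-development backend:

```python
class ObjectStore:
    """Abstract interface: the same four operations S3, GCS, and Azure all expose."""
    def put(self, key, data): raise NotImplementedError
    def get(self, key): raise NotImplementedError
    def list(self, prefix): raise NotImplementedError
    def delete(self, key): raise NotImplementedError

class InMemoryStore(ObjectStore):
    """Dict-backed store for local development; swap in an S3 client in production."""
    def __init__(self):
        self._objects = {}
    def put(self, key, data):
        self._objects[key] = data
    def get(self, key):
        return self._objects[key]
    def list(self, prefix):
        return sorted(k for k in self._objects if k.startswith(prefix))
    def delete(self, key):
        self._objects.pop(key, None)
```

Everything above this layer only ever sees `ObjectStore`, so swapping backends is a one-line change.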
Layer 2: Block Manager
Object stores work with immutable objects. Once written, they can't be modified. So we organize data into fixed-size blocks:
Block Structure:
{
  id: "users_block_0",
  table_name: "users",
  rows: {
    "u001": {id: "u001", name: "Alice", age: 25, city: "NYC"},
    "u002": {id: "u002", name: "Bob", age: 30, city: "SF"}
  }
}
Each block becomes one object in S3. For example:
tables/users/users_block_0 contains rows 1-100
tables/users/users_block_1 contains rows 101-200
Why fixed-size blocks?
- Predictable S3 GET costs
- Better caching (fetch 1MB, not 1GB)
- Enables parallel fetching
The block manager maintains a manifest file tracking all blocks:
{
  "blocks": {
    "users": ["users_block_0", "users_block_1", "users_block_2"],
    "products": ["products_block_0"]
  }
}
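Here is a sketch of a block manager that chunks rows into fixed-size blocks and keeps the manifest up to date. The shapes are illustrative: `store` is any dict-like key-to-bytes mapping standing in for the object store, and the naming scheme follows the `users_block_N` convention above.

```python
import json

class BlockManager:
    """Groups rows into fixed-size blocks and tracks them in a manifest object."""
    def __init__(self, store, rows_per_block=100):
        self.store = store
        self.rows_per_block = rows_per_block

    def write_table(self, table, rows):
        block_ids = []
        items = list(rows.items())
        for i in range(0, len(items), self.rows_per_block):
            block_id = f"{table}_block_{i // self.rows_per_block}"
            block = {"id": block_id, "table_name": table,
                     "rows": dict(items[i:i + self.rows_per_block])}
            # Each block becomes one immutable object
            self.store[f"tables/{table}/{block_id}"] = json.dumps(block).encode()
            block_ids.append(block_id)
        # The manifest maps each table to its ordered list of block ids
        manifest = json.loads(self.store.get("manifest", b'{"blocks": {}}'))
        manifest["blocks"][table] = block_ids
        self.store["manifest"] = json.dumps(manifest).encode()
        return block_ids
```

Because blocks are immutable, "updating" a table means writing new blocks and swapping the manifest, never rewriting an existing object.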
Layer 3: Write-Ahead Log (WAL)
Here's a key insight: S3 PUT operations are expensive, at $0.005 per 1,000 requests.
If we write every single row insertion as a separate object, costs explode. Instead, we batch writes using a Write-Ahead Log:
Writes accumulate in memory, then flush to blocks periodically (e.g., after 100 entries).
This is inspired by LSM-trees (used in RocksDB, Cassandra). The pattern:
- Buffer writes in memory
- Periodically flush to immutable blocks
- Merge/compact old blocks later
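A minimal version of the buffering step might look like this (a sketch; in a real engine `flush` would write a block through the block manager instead of appending to a list):

```python
class WriteAheadLog:
    """Buffers writes in memory and flushes once `flush_threshold` entries
    accumulate, trading a little freshness for far fewer S3 PUTs."""
    def __init__(self, flush_threshold=100):
        self.flush_threshold = flush_threshold
        self.entries = []
        self.flushed_batches = []  # stand-in for "write a block to S3"

    def append(self, table, key, row):
        self.entries.append((table, key, row))
        if len(self.entries) >= self.flush_threshold:
            self.flush()

    def flush(self):
        if self.entries:
            self.flushed_batches.append(self.entries)
            self.entries = []
```

With a threshold of 100, a thousand inserts cost 10 PUTs instead of 1,000.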
Layer 4: Indexing - The Secret Sauce
This is where the magic happens. Without indexes, you scan every block. With indexes, you know exactly which blocks to fetch.
Schema Definition
First, we define a schema with typed columns:
Schema: users
Columns:
- id: String (Primary Key, Required)
- name: String (Indexed, Required)
- age: Integer (Indexed, Required)
- city: String (Indexed, Optional)
Marking columns as indexed creates secondary indexes on those fields.
Index Structure
Each indexed column gets its own in-memory index:
City Index Structure:
{
  "NYC": [
    {primary_key: "u001", block_id: "users_block_0"},
    {primary_key: "u003", block_id: "users_block_0"}
  ],
  "SF": [
    {primary_key: "u002", block_id: "users_block_0"},
    {primary_key: "u005", block_id: "users_block_1"}
  ],
  "LA": [
    {primary_key: "u004", block_id: "users_block_1"}
  ]
}
When you insert a row with city: "NYC", the city index gets updated to point to that row's location.
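The index itself can be sketched as a value-to-locations map (illustrative names; a production index would also persist itself so it survives restarts):

```python
from collections import defaultdict

class SecondaryIndex:
    """Maps each indexed value to the (primary_key, block_id) locations holding it."""
    def __init__(self):
        self.entries = defaultdict(list)

    def add(self, value, primary_key, block_id):
        # Called on every insert for each indexed column
        self.entries[value].append({"primary_key": primary_key, "block_id": block_id})

    def lookup(self, value):
        return self.entries.get(value, [])

    def blocks_for(self, value):
        # Deduplicated set of blocks to fetch for this value
        return sorted({e["block_id"] for e in self.lookup(value)})
```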
Real-World Example
Let's say we have 5 users across 2 blocks:
Block 0: u001 (Alice, 25, NYC)
u002 (Bob, 30, SF)
u003 (Charlie, 25, NYC)
Block 1: u004 (David, 35, LA)
u005 (Eve, 28, SF)
Indexes in Memory:
├─ City Index:
│ ├─ "NYC" → [u001@block_0, u003@block_0]
│ ├─ "SF" → [u002@block_0, u005@block_1]
│ └─ "LA" → [u004@block_1]
│
└─ Age Index:
├─ 25 → [u001@block_0, u003@block_0]
├─ 28 → [u005@block_1]
├─ 30 → [u002@block_0]
└─ 35 → [u004@block_1]
Query: WHERE city = "NYC"
- Look up "NYC" in city index → finds u001, u003 in block_0
- Fetch only block_0 from S3
- Return matching rows
1 block fetch instead of 2. On a real dataset with thousands of blocks, this is the difference between 10ms and 10 seconds.
Layer 5: Query Planner
The query planner turns logical queries into physical execution plans. It's the "brain" that decides which blocks to fetch.
Here's how it works for a multi-predicate query:
Query: WHERE age = 25 AND city = "NYC"
Planning steps:
1. Look up age = 25 in age index
→ Returns: [u001@block_0, u003@block_0]
2. Look up city = "NYC" in city index
→ Returns: [u001@block_0, u003@block_0]
3. Intersect (AND logic)
→ Matching: [u001@block_0, u003@block_0]
4. Deduplicate blocks
→ Unique blocks: [block_0]
5. Create execution plan
→ Fetch: [block_0]
→ Extract: [u001, u003]
The executor then fetches only block_0 and extracts u001, u003.
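The five planning steps above boil down to set intersection plus block deduplication. A sketch, with `indexes` shaped as column → {value → [(primary_key, block_id)]} for illustration:

```python
def plan_and_query(indexes, predicates):
    """Plan a multi-predicate AND query: intersect per-index candidate sets,
    then deduplicate the block ids to fetch."""
    candidate_sets = []
    for column, value in predicates.items():
        postings = indexes[column].get(value, [])
        candidate_sets.append({(pk, block) for pk, block in postings})
    matches = set.intersection(*candidate_sets) if candidate_sets else set()
    blocks = sorted({block for _, block in matches})
    keys = sorted(pk for pk, _ in matches)
    return {"fetch_blocks": blocks, "extract_keys": keys}
```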
Query Types Supported
1. Point Lookup (Primary Key)
Query: GET user WHERE id = "u002"
Plan: Fetch block_0, extract u002
Cost: 1 S3 GET
2. Equality Filter
Query: SELECT * WHERE city = "NYC"
Plan: Fetch [block_0], filter for NYC
Cost: 1 S3 GET
3. Range Query
Query: SELECT * WHERE age BETWEEN 25 AND 30
Plan: Fetch [block_0, block_1], filter age range
Cost: 2 S3 GETs
4. Multi-Predicate (AND)
Query: SELECT * WHERE age = 25 AND city = "NYC"
Plan: Intersect indexes, fetch [block_0]
Cost: 1 S3 GET
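The range query is worth sketching, since it differs from the equality cases: it walks every indexed value in the range and unions the blocks (a sketch assuming the age index from the earlier example; a sorted index structure would avoid the full walk):

```python
def range_query_blocks(age_index, lo, hi):
    """Plan a BETWEEN query: gather every block holding any indexed value
    inside [lo, hi]."""
    blocks = set()
    for value, postings in age_index.items():
        if lo <= value <= hi:
            blocks.update(block_id for _, block_id in postings)
    return sorted(blocks)
```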
Layer 6: Execution Engine
The execution engine takes the plan and fetches data. The process:
Execution Flow:
1. Check WAL (recent writes not yet in blocks)
2. For each block_id in plan:
a. Fetch block from S3
b. Extract specified rows
c. Apply any remaining filters
3. Merge results from WAL + blocks
4. Return to client
Key optimizations:
- WAL check first - Recent writes aren't in blocks yet
- Parallel block fetching - Fetch multiple blocks concurrently
- Early termination - Stop once limit is reached
- Result streaming - Don't materialize everything in memory
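A stripped-down executor following this flow might look like the following (a sketch; the plan and store shapes match the earlier illustrations, and the WAL overlay is the interesting part):

```python
import json

def execute(plan, store, wal_entries):
    """Run an execution plan: fetch each planned block (simulated S3 GET),
    extract the requested rows, then overlay recent WAL writes so readers
    see data that hasn't been flushed to blocks yet."""
    results = {}
    for block_id in plan["fetch_blocks"]:
        block = json.loads(store[block_id])  # the S3 GET in a real engine
        for key in plan["extract_keys"]:
            if key in block["rows"]:
                results[key] = block["rows"][key]
    for key, row in wal_entries:
        if key in plan["extract_keys"]:
            results[key] = row  # WAL is always newer than any block
    return [results[k] for k in sorted(results)]
```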
How TurboPuffer and DuckDB Do It
Now that we understand the basics, let's see how production systems tackle this.
TurboPuffer's Approach
TurboPuffer is a vector database built on S3. They use:
1. Columnar Storage
Instead of storing rows together, they store each column separately: all ids in one contiguous region, all ages in another, and so on.
Why? If you query SELECT age WHERE city = "NYC", you only fetch the age and city columns, not names or IDs.
2. Bloom Filters
Before fetching a block, check a bloom filter (probabilistic data structure):
Bloom Filter per Block:
block_0: bloomfilter(all_values_in_block)
block_1: bloomfilter(all_values_in_block)
Query: WHERE city = "NYC"
- Check block_0 bloom filter → "might contain NYC" → fetch
- Check block_1 bloom filter → "definitely no NYC" → skip
This avoids wasted S3 GETs.
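A Bloom filter is small enough to sketch in full. This is a generic implementation, not TurboPuffer's; the key property is that "absent" answers are always correct, while "maybe present" can occasionally be a false positive:

```python
import hashlib

class BloomFilter:
    """m-bit array with k hash positions per value. A block's filter is
    checked before paying for the S3 GET."""
    def __init__(self, m_bits=1024, k_hashes=3):
        self.m, self.k = m_bits, k_hashes
        self.bits = bytearray(m_bits // 8)

    def _positions(self, value):
        for i in range(self.k):
            digest = hashlib.sha256(f"{i}:{value}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.m

    def add(self, value):
        for pos in self._positions(value):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, value):
        # All k bits set → "maybe"; any bit clear → "definitely not"
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(value))
```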
3. Compression
Each column is compressed separately using algorithms optimized for the data type:
- Integers: Delta encoding, bit packing
- Strings: Dictionary encoding
- Floats: Gorilla compression
Result: 10x smaller blocks = 10x fewer S3 GETs.
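Delta encoding is the simplest of these to show concretely. A sketch for integers: store the first value plus successive differences, which are small numbers that bit-pack far better than the raw values:

```python
def delta_encode(values):
    """Replace each value after the first with its difference from the
    previous one."""
    if not values:
        return []
    return [values[0]] + [b - a for a, b in zip(values, values[1:])]

def delta_decode(deltas):
    """Invert delta encoding by running a cumulative sum."""
    out, total = [], 0
    for d in deltas:
        total += d
        out.append(total)
    return out
```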
DuckDB's Approach
DuckDB takes it further with vectorized execution:
1. Parquet Format
DuckDB reads Parquet files directly from S3. Parquet stores data in columnar format with built-in compression and statistics.
2. Predicate Pushdown
DuckDB uses Parquet metadata to skip entire row groups:
Query: SELECT * FROM users WHERE age > 30
Parquet Metadata:
row_group_0: min_age=20, max_age=29 → SKIP (all ages < 30)
row_group_1: min_age=30, max_age=45 → FETCH (might contain matches)
row_group_2: min_age=22, max_age=28 → SKIP (all ages < 30)
Only fetches row_group_1
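The skipping logic above is just min/max zone-map pruning. A sketch in the spirit of what DuckDB does with Parquet row-group statistics (the stats shape here is illustrative, not the Parquet metadata format):

```python
def row_groups_to_fetch(stats, lo=None, hi=None):
    """Keep only row groups whose [min, max] range can overlap the
    predicate range [lo, hi]; skip the rest without fetching them."""
    keep = []
    for name, (g_min, g_max) in stats.items():
        if lo is not None and g_max < lo:
            continue  # every value in this group is too small
        if hi is not None and g_min > hi:
            continue  # every value in this group is too large
        keep.append(name)
    return keep
```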
3. Parallel S3 Reads
DuckDB can fetch 100+ blocks in parallel, saturating network bandwidth.
Apache Arrow's Influence
Both systems are inspired by Apache Arrow, which defines:
- Columnar memory layout - CPU cache-friendly
- Zero-copy reads - No deserialization overhead
- SIMD operations - Process multiple values per instruction
Arrow format structure:
Column "age" (int32):
┌──────────────────────┐
│ Validity Bitmap │ 1 bit per value (null or not)
├──────────────────────┤
│ Data Buffer │ Packed int32 array
│ [25, 30, 25, 35, 28] │
└──────────────────────┘
This layout enables SIMD instructions to process 8 values at once.
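The validity-bitmap convention can be sketched in a few lines. This is a plain-Python illustration of the layout, not Arrow's actual API; Arrow packs validity bits least-significant-bit first:

```python
def read_nullable_int32(validity_bitmap, data):
    """Arrow-style nullable column: one validity bit per value plus a packed
    data buffer. A cleared bit means NULL regardless of the data slot."""
    out = []
    for i, value in enumerate(data):
        is_valid = (validity_bitmap[i // 8] >> (i % 8)) & 1
        out.append(value if is_valid else None)
    return out
```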
Performance Comparison
Let's compare approaches on a 1M row table:
| Approach | Blocks Fetched | S3 GETs | Query Time |
|---|---|---|---|
| No Indexes (full scan) | 1,000 | 1,000 | ~100 seconds |
| Primary key index | 1 | 1 | ~100ms |
| Secondary index (city) | 10 | 10 | ~1 second |
| Multi-index (age + city) | 5 | 5 | ~500ms |
| + Bloom filters | 3 | 3 | ~300ms |
| + Columnar storage | 1 | 1 | ~100ms |
| + Parquet predicate pushdown | 0.5 (range GET of part of a block) | 1 | ~50ms |
The progression is clear: smart indexing and storage format matter more than raw compute power.
Real-World Tradeoffs
Building this taught me that separation of storage and compute isn't free:
Pros
✅ Cheap storage - S3 is 10x cheaper than SSD
✅ Infinite scale - Store petabytes without worry
✅ Elasticity - Spin up compute only when needed
✅ Data sharing - Multiple systems access same data
Cons
❌ Latency - 100ms per S3 GET vs <1ms for local disk
❌ Network costs - S3 egress can be expensive
❌ Complexity - Need sophisticated indexing and caching
❌ Cold starts - First query is slow (empty cache)
The sweet spot? OLAP workloads (analytics, BI tools) where:
- Queries are complex but infrequent
- Data is read-heavy, not write-heavy
- Cost matters more than latency
For OLTP workloads (transactional databases), traditional databases still win.
The Complete Flow: A Query's Journey
Let me show you what happens when a query executes:
Step-by-step:
1. Query arrives - Client sends: WHERE city='NYC' AND age=25
2. Planning phase - Planner consults indexes:
   - City index: "NYC" → [u001, u003] in block_0
   - Age index: 25 → [u001, u003] in block_0
   - Intersection: [u001, u003]
   - Plan: Fetch block_0
3. Execution phase:
   - Check WAL for recent writes
   - Fetch block_0 from S3 (100ms)
   - Extract rows u001, u003
   - Apply any remaining filters
4. Return results - Send 2 rows to client
Total time: ~150ms (instead of scanning all blocks)
What I Learned
Building this from scratch was eye-opening:
1. Indexes are everything
Without indexes, object storage is unusable. With indexes, it's competitive with traditional databases.
2. The cost model is different
In traditional DBs, you optimize for disk seeks. In object storage, you optimize for GET requests.
3. Immutability is powerful
Immutable blocks enable aggressive caching, time travel queries, and easy replication.
4. Columnar wins for analytics
Storing columns separately means you only fetch what you need.
5. There's no free lunch
Separation of storage and compute trades latency for cost and scale. Know your workload.
Going Further
This implementation barely scratches the surface. To make it production-ready, you'd need:
- Bloom filters per block - Skip blocks without fetching
- Compaction - Merge small blocks into larger ones
- Columnar storage - Use Parquet or Arrow format
- Caching layer - Cache hot blocks in memory/local SSD
- Parallel execution - Fetch 100+ blocks concurrently
- Metadata versioning - Handle concurrent writes
- Query optimization - Cost-based query planning
But the fundamentals are all here: schemas, indexes, query planning, and block management.
Conclusion
The shift from coupled to separated storage and compute is one of the most important architectural changes in modern data systems. It's why Snowflake can scale to exabytes, why DuckDB can query S3 faster than many databases, and why TurboPuffer can serve vector search at massive scale.
The key insight? Indexes and metadata in memory, data in object storage. Use the fast tier (memory) to avoid the slow tier (S3).
If you're building a data system today, this architecture should be your starting point. The economics are too compelling to ignore.
Storage is cheap. Compute is elastic. The future of databases is separated.