Post

A Short Primer on Scalable Data Storage Solutions for ML Systems

A Short Primer on Scalable Data Storage Solutions for ML Systems

Here is a consolidated view of the most popular data storage technologies used in modern system design. It’s grouped by their primary architectural category to make the comparison easier.

1. Relational Databases (RDBMS) - The General Purpose Workhorses

These are the default choice for structured data where data integrity is paramount.

TechnologyPostgreSQL
TypeSQL (Relational)
PatternOLTP (Online Transaction Processing)
ACIDYes (Strict ACID compliance)
ScalabilityVertical (Scale Up). Horizontal scaling is hard (requires sharding/read replicas), but tools like Citus exist.
CAP TheoremCA (In practice, it prioritizes Consistency and Availability, usually single-master).
Primary Use CaseCore application DB, User profiles, Financial ledgers, Complex joins.
LatencyLow (ms).
Managed ServiceYes (AWS RDS, Google Cloud SQL, Azure Database).
Critical InfoThe industry standard for general-purpose app development. Very extensible (PostGIS for geo, JSONB for NoSQL-like features).

2. Data Warehouses (OLAP) - The Analytics Engines

These are designed for scanning billions of rows to answer business questions. They are not for live application backends.

TechnologyGoogle BigQueryAmazon RedshiftApache Hive
TypeSQL (Columnar)SQL (Columnar)SQL-like (HQL) on Hadoop
PatternOLAP (Online Analytical Processing)OLAPOLAP
ACIDAtomic updates supported, but not designed for transactional work.Yes (Snapshot isolation).ACID support is available (ACID v2) but limited compared to RDBMS.
ScalabilityHorizontal (Massively Parallel Processing - MPP).Horizontal (MPP).Horizontal (Uses HDFS/Cloud Storage).
Use CaseAd-hoc analysis, BI dashboards, ML training data generation.Enterprise Data Warehouse, BI reporting.Batch processing on huge datasets (Petabyte scale) over HDFS/S3.
LatencySeconds to Minutes (High).Seconds to Minutes (High).Minutes to Hours (Very High).
Managed?Serverless (Fully Managed).Cluster-based (You manage nodes) or Serverless.Often self-managed or via EMR/Dataproc.
Critical InfoDecouples compute and storage perfectly. You pay per query.Tightly coupled compute/storage (traditionally), though RA3 nodes fix this.Legacy tech. Being replaced by Spark SQL or Presto/Trino.

3. NoSQL: Key-Value & In-Memory - The Speed Demons

Used for caching, real-time features, and session management.

TechnologyRedisAerospike
TypeNoSQL (Key-Value)NoSQL (Key-Value)
PatternOLTP (Real-time)OLTP (Real-time at Scale)
ACIDAtomic single-key ops. No multi-key transactions.Strong Consistency available (ACID for single row).
ScalabilityHorizontal (Redis Cluster) - AP or CP depending on config.Horizontal (Linear scale). AP (Available/Partition Tolerant).
Use CaseCaching, Session Store, Leaderboards, Pub/Sub.Real-time Feature Store (Fraud detection), Ad-Tech bidding.
LatencyMicroseconds (In-memory).Sub-millisecond (Hybrid RAM/SSD).
Managed?Yes (AWS ElastiCache, Redis Enterprise).Yes (Aerospike Cloud).
Critical InfoSingle-threaded event loop. Data must usually fit in RAM.Optimized for Flash/SSD. Can store TBs/PBs of data with RAM-like speed.

4. NoSQL: Wide-Column Stores - The Big Data Writers

Designed for massive write throughput and storing petabytes of data.

TechnologyApache CassandraGoogle BigtableApache HBase
TypeNoSQL (Wide-Column)NoSQL (Wide-Column)NoSQL (Wide-Column)
PatternOLTP (Heavy Write)OLTP (Heavy Write/Read)OLTP (Heavy Write)
ACIDNo (Tunable Consistency).No (Single-row atomicity).Strong Consistency (Single row).
ScalabilityHorizontal. AP (Eventual Consistency).Horizontal. CP (Strong Consistency).Horizontal. CP (Strong Consistency).
Use CaseIoT sensor data, activity logs, messaging apps (Discord/Messenger).Time-series data, financial data, user history (Gmail/Search/Maps).Hadoop-native random access storage.
LatencyLow (Single digit ms).Low (Single digit ms).Low to Medium.
Managed?Yes (AWS Keyspaces, Datastax).Fully Managed (Serverless).Self-managed or Cloud Bigtable.
Critical Info“Masterless” architecture (no single point of failure). Great for writes.Powered by Colossus (Google’s file system). Storage/Compute decoupled.Built on top of HDFS. Requires ZooKeeper. High operational overhead.

5. NoSQL: Document & Search - The Flexible Ones

Used for unstructured data and complex text searching.

TechnologyMongoDBElasticsearch
TypeNoSQL (Document - JSON)NoSQL (Search Engine / Document)
PatternOLTPOLAP (Text Analytics / Log Search)
ACIDYes (Multi-document ACID supported since v4.0).No (Near real-time, eventual consistency).
ScalabilityHorizontal (Sharding). CP (Consistency/Partition Tolerance).Horizontal. AP (Focuses on Availability).
Use CaseContent management, Catalogs, Rapid prototyping (Schema-less).Full-text search, Log analytics (ELK Stack), Autocomplete.
LatencyLow (ms).Low (ms) for search, Slower for updates.
Managed?Yes (MongoDB Atlas).Yes (Elastic Cloud, AWS OpenSearch).
Critical InfoVery developer-friendly. Flexible schema allows rapid iteration.Based on Lucene. Not a primary datastore (danger of split-brain).

Summary of CAP Theorem Categorization

  • CP (Consistency + Partition Tolerance): MongoDB, HBase, Bigtable, Redis (if configured for strong consistency), Kafka.
    • Trade-off: If the network partitions, the system may stop accepting writes to ensure data doesn’t diverge.
  • AP (Availability + Partition Tolerance): Cassandra, DynamoDB, Couchbase, Aerospike (in AP mode).
    • Trade-off: If the network partitions, the system stays up, but you might read stale data (“Eventual Consistency”).
  • CA (Consistency + Availability): RDBMS (Postgres, MySQL).
    • Trade-off: They cannot handle network partitions well. They usually run on a single node or a primary/replica setup where partition tolerance is sacrificed (the system goes down or becomes read-only).

Summary Table for Quick Reference

NameCategoryACIDCAPBest For
PostgresRDBMSYesCARelational data, Core Apps
BigQueryData WarehouseNo-Analytics on Petabytes
RedisKey-ValueAtomicCPCaching, Sessions, Queues
AerospikeKey-ValueAtomicAPReal-time Feature Store (Fraud)
CassandraWide-ColumnNoAPMassive Write workloads (IoT)
BigtableWide-ColumnAtomCPHigh throughput K-V at scale
ElasticSearchNoAPText Search, Log Analysis
MongoDocumentYesCPFlexible Schema, Catalogs

What’s the difference between ‘Columner’ and ‘Wide-Column’ data storage solutions?

That is an excellent question that clarifies a very common confusion. Although both terms contain the word “column,” columnar and wide-column storage systems are built for completely different purposes and have vastly different internal designs.

Here is a side-by-side comparison.

Columnar Storage (The OLAP/Analytics Engine)

Columnar storage systems are built to run analytical queries (OLAP) by prioritizing reading efficiency over everything else.

Core Idea: Storage by Attribute

  • How it Works: Data is stored on disk grouped by attribute (column), not by row.
    • Example: A table with columns (User_ID, Name, Sale_Amount, Date). All Sale_Amount values are stored next to each other on the disk, and all Date values are stored next to each other.
  • Best For: Read-heavy, analytical queries that involve aggregating data over large numbers of rows.
  • Key Benefit (I/O Reduction): When you run a query like SELECT AVG(Sale_Amount), the database only reads the Sale_Amount column from the disk. It completely skips reading the User_ID, Name, and Date columns, which drastically reduces I/O and speeds up the query.
  • Data Types: Optimized for homogeneous, structured data (like SQL tables).
  • Examples: Google BigQuery, Amazon Redshift, Snowflake.

Wide-Column Stores (The Scalable OLTP Engine)

Wide-column stores are a NoSQL database design built to handle massive write throughput and petabyte-scale horizontal scaling. They are designed for operational workloads.

Core Idea: Flexible, Distributed Rows

  • How it Works: Data is stored as rows, but each row can have a massive number of columns (hence “wide”). Unlike a traditional RDBMS, columns can be created on the fly, and different rows in the same table can have different sets of columns. The columns are grouped into structures called “Column Families” or “Bins.”
  • Best For: Extremely high-throughput writes (like log ingestion, sensor data, or storing user history) and fast key-based lookups.
  • Key Benefit (Scalability & Flexibility): The database automatically partitions rows across thousands of machines based on the primary key, providing linear scalability for writes and reads. The dynamic schema allows rapid application development.
  • Data Types: Optimized for mutable, semi-structured data where the schema is expected to evolve.
  • Examples: Apache Cassandra, Google Bigtable, Apache HBase, Aerospike (conceptually shares this distributed, K-V model).

Side-by-Side Comparison

FeatureColumnar StorageWide-Column Stores
Primary GoalAnalytics (OLAP)Scalable Transactions (OLTP/Real-Time)
Data LayoutData is stored by column. (All Sale_Amount values are adjacent).Data is stored by row (but rows are wide and flexible).
OptimizationRead I/O. Only read the columns you need.Write Throughput and Horizontal Partitioning.
SchemaFixed (Schema-on-Write).Flexible/Dynamic (Schema-on-Read).
Typical QuerySELECT SUM(Sales) GROUP BY MonthGET user_data WHERE user_id = 123
Key AdvantageMassive cost and speed savings on analytical queries.Linear scalability and predictable low latency for key lookups.
ExamplesBigQuery, Redshift, SnowflakeCassandra, Bigtable, ScyllaDB

Do Wide-column stores allow partitioning and indexing? If so, then explain how.

Wide-column stores absolutely allow both partitioning and indexing. In fact, these features are essential to their core function of providing massive horizontal scalability and fast retrieval.

The difference lies in how they implement them compared to a traditional SQL database.

Here is a breakdown of how partitioning and indexing work in a wide-column store (using Cassandra/Bigtable as primary examples).

1. Partitioning (The Key to Scale)

Partitioning is the primary mechanism for scaling wide-column stores horizontally. It dictates where data lives on the cluster.

How it Works

  • Partition Key (Required): When you define a table, you must designate one or more columns as the Partition Key (or Row Key in Bigtable).
  • Data Distribution: The database uses a hashing function on the Partition Key to determine which physical machine (node) in the cluster will store that data.
  • Data Grouping: All rows (records) that share the same Partition Key are guaranteed to be stored together on the same physical node.

The Impact

  • Fast Writes/Reads: Queries that target a specific partition key (e.g., SELECT * FROM users WHERE user_id = 'Alice') are extremely fast because the system knows exactly which single node to query, avoiding a cluster-wide search.
  • Scalability: If your cluster has 100 machines, partitioning ensures your 100 terabytes of data are evenly distributed across all 100 machines, allowing the database to scale linearly.
  • The Limitation: You must know the Partition Key to query efficiently. Trying to query without it results in an expensive “full cluster scan,” which is discouraged.

2. Indexing (The Key to Flexibility)

Indexing in wide-column stores is available, but it is implemented differently than in an RDBMS, and it often comes with significant performance trade-offs.

The Problem with Traditional RDBMS Indexes

In a traditional RDBMS, if you index the city column, the index is a separate structure that maps all city values to the exact physical location of the rows on disk. This is fast for lookups but complex to update.

The Wide-Column Approach: Secondary Indexes

Wide-column stores use Secondary Indexes to allow queries on non-primary-key columns.

  • Primary Access (The “Main Index”): The Partition Key acts as the primary, built-in index, optimized for that specific query.
  • Secondary Indexes: These are separate, secondary tables that the database creates and manages automatically.
    • How it Works (Conceptual): If you create a secondary index on the status='active' column, the database effectively creates a new internal key-value table:
      • Key: 'status_active'
      • Value: [user_id_1, user_id_50, user_id_300, ...]
  • Trade-off (The “Cost”):
    • Reads (Pros): They allow you to run queries like SELECT * FROM users WHERE city = 'New York', which would otherwise require a full cluster scan.
    • Writes (Cons): Every time you update the city column on a user record, the database has to also update the massive, distributed Secondary Index table. This adds latency to your write operations, which defeats the purpose of using a high-throughput system like Cassandra.

Conclusion:

Both Partitioning and Indexing are used in wide-column stores. Partitioning is your mandatory, structural mechanism for scalability. Indexing is an optional feature (the secondary index) that provides query flexibility but adds significant overhead to your critical write path and must be used sparingly.


Show examples of some ‘good’ and ‘bad’ queries for storage systems like BigTable.

This is a fantastic way to understand the limitations and design philosophy of Bigtable. Unlike a relational database, where a simple SELECT can be slow due to poor indexing, Bigtable’s speed is entirely governed by how efficiently the system can locate and scan data using the Row Key.

Here is a list of three fast and three very slow queries on Bigtable.

Fast Queries (O(1) to O(logN) by Row Key)

The fast queries are all based on Row Key lookups or contiguous Row Key range scans. Bigtable is optimized to find data based on the key because the key maps directly to a specific server node in the cluster.

#Query GoalRow Key StructureWhy It’s Fast
1Single Point Lookup (Finding one user’s profile)user_4321This is the fastest possible operation. Bigtable hashes the key and immediately sends the request to the one server that owns that row. It’s an O(1) lookup.
2Time-Series Range Scan (Recent history for one device)device_XYZ#2025-11-19-14:00:00The user queries for key_prefix='device_XYZ' AND time BETWEEN T1 AND T2. Since all data for one device is stored contiguously on a few servers, Bigtable only scans a small, optimized range. This is often called a Row Key Prefix Scan.
3Small Transaction Lookup (A tiny slice of the data)transaction_1234Similar to #1, this is a direct, targeted read. If your key design is good (e.g., you’ve structured the key as user_ID#transaction_ID), you still benefit from the fast primary lookup.

Slow Queries (Full Cluster Scan / Scatter-Gather)

The slow queries are those that force Bigtable to abandon the Row Key and scan large portions of the cluster to find the answer. This is the anti-pattern for Bigtable.

#Query GoalReason for SlownessWhy It’s Slow (Full Scan)
1Secondary Attribute Lookup (Finding all users in one country)SELECT * WHERE country_code = 'AUS'Bigtable has no secondary indexes. To answer this, it must read the country_code column of every single row on every single server in the cluster. This is the definition of an expensive, slow, Full Cluster Scan.
2Complex Aggregation (Counting the total number of items)COUNT(*) or GROUP BY on a non-key column.Bigtable must read every single row in the cluster and then perform a cluster-wide aggregation. While faster than traditional databases, it is the slowest possible operation in Bigtable and defeats its design goals.
3Poorly Designed Key Range Scan (Scanning a non-contiguous range)user_ID LIKE 'A%' (If users A-Z are scattered)If the users whose IDs start with ‘A’ are stored across 50 different servers (i.e., they are not contiguous on disk), the query requires 50 parallel network requests, making the operation slow and high-latency. Effective range scans require careful Row Key design.

The Core Principle

To use Bigtable correctly, you must ensure that all your read queries are satisfiable by the Row Key and its prefix. If you find yourself needing to query based on an attribute that is not part of the Row Key, you should use a different database (like Elasticsearch or BigQuery).


Explain CAP theory in a concise manner with examples.

Here is a simple explanation of the CAP Theorem, with examples for Kafka and Elasticsearch.

What is the CAP Theorem?

In simple terms, the CAP Theorem is a rule for distributed systems (like databases or applications spread across multiple computers). It states that any distributed system can only provide two of the following three guarantees at the same time:

  1. C - Consistency:
    • What it means: All clients see the same, most up-to-date data at the same time, no matter which node they connect to.
    • Analogy: A bank account. If you withdraw money, your balance must be instantly and consistently updated everywhere. You can’t be shown an old, incorrect balance.
  2. A - Availability:
    • What it means: The system is always available to respond to requests. It never returns an error, even if some of its nodes are down or the data is slightly out-of-date.
    • Analogy: A social media feed. It always loads, even if the “like” count you see is 30 seconds stale. It prefers to show you something rather than an error page.
  3. P - Partition Tolerance:
    • What it means: The system continues to operate even if the network breaks down between its nodes (a “network partition”). The nodes are still alive, but they just can’t talk to each other.

The Real Choice: C vs. A

In any real-world system, you must choose Partition Tolerance (P). Network failures will happen. The internet is unreliable, and servers will get disconnected.

Therefore, the CAP theorem isn’t really “pick two of three.” It’s:

“When a network partition (P) happens, you must choose between Consistency (C) and Availability (A).”

  • CP (Choose Consistency over Availability): When a partition happens, the system will become unavailable (it will stop responding or return errors) to avoid the risk of serving stale or incorrect data.
  • AP (Choose Availability over Consistency): When a partition happens, the system will stay available, but it might serve inconsistent (stale) data.

Example 1: Kafka (A CP System)

Kafka is a distributed streaming log. Its primary purpose is to be a durable, ordered, and truthful record of events.

  • How it works: Data is written to a topic partition. That partition has one “leader” node and several “follower” nodes that copy the data.
  • Consistency (C): For a write to be successful, the leader must write the data and then wait for a “quorum” of its in-sync followers (the ISR) to also copy it. Both writes and reads are served only by the leader.
  • Partition (P): What happens if the leader node is partitioned and can’t talk to its followers or the central coordinator (ZooKeeper/KRaft)?
  • The Choice: The cluster will elect a new leader from the followers that were in the ISR.
  • The Consequence: During the time the old leader is partitioned and a new leader is being elected (a process that can take seconds), that partition is unavailable. It will refuse to accept new writes.
  • Conclusion: Kafka is a CP system. It sacrifices availability (for a short time) to guarantee that the data log is perfectly consistent and no messages are ever lost or out of order. A bank would use a CP system.

Example 2: Elasticsearch (An AP System)

Elasticsearch is a distributed search and analytics engine. Its primary purpose is to be fast and always available for search queries.

  • How it works: Data is written to a “primary shard” (node) and then copied to “replica shards.”
  • Availability (A): A search query (a read request) can be answered by either the primary shard or any of the replica shards.
  • Partition (P): What happens if a node with a replica shard gets partitioned from the primary?
  • The Choice: The primary shard might receive new data, but the isolated replica doesn’t know about it. If a user’s read request hits that isolated replica, the replica will happily serve the stale data it has. It doesn’t know it’s inconsistent.
  • The Consequence: The system remains fully available for reads, but different users might see slightly different results (stale data) for a short time.
  • Conclusion: Elasticsearch is an AP system. It prioritizes availability (you can always search) at the cost of “eventual consistency.” A search engine or a product catalog, where being 10 seconds out-of-date is acceptable, would use an AP system.

Explain what are OLTP and OLAP systems.

That’s a core concept in data systems design. Understanding the difference between OLTP and OLAP defines what kind of database you should use for any given task.

Here is a detailed breakdown of OLTP and OLAP, including their characteristics and real-world examples.

What is OLTP? (Online Transaction Processing)

OLTP is about running the core business. It involves fast, frequent, short transactions that keep the lights on.

Core Characteristics

  • Primary Goal: Data integrity, high concurrency, and speed for small, atomic transactions.
  • Data Pattern: Read, Insert, Update, and Delete (CRUD) operations. Transactions are typically simple and affect only a few rows at a time.
  • Database Design: Optimized for row-based storage (faster to read/write entire records) and relies heavily on indexing (to locate a single record quickly).
  • Scale: Must handle high write volume and high concurrency (many users trying to update data simultaneously).
  • Integrity: Requires strict ACID compliance (Atomicity, Consistency, Isolation, Durability) to ensure no money is lost, and the data is never corrupted.
  • Data Volume: Typically holds the current, operational state (Terabytes, but focused on the speed of retrieval).

Real-World OLTP Examples

SystemTransaction ExampleCharacteristic
Banking (ATM)A deposit or withdrawal.High concurrency, requires immediate write integrity.
E-commerce CheckoutProcessing an order.The transaction must either fully succeed or fully fail.
Website LoginUpdating a user’s last login time.Fast key-value lookup, small update.
User Feature StoreUpdating a user’s tx_count_last_hour feature.High-frequency read-modify-write operation.

Appropriate Databases: PostgreSQL, MySQL, Cassandra, Aerospike, Google Bigtable.

What is OLAP? (Online Analytical Processing)

OLAP is about running the business analysis. It involves complex queries on vast amounts of historical data to uncover trends and insights.

Core Characteristics

  • Primary Goal: Speed for complex queries, aggregation, and deep analysis.
  • Data Pattern: Read-only, large-scale, complex queries. Queries typically aggregate data over millions or billions of rows (e.g., SUM, AVG, GROUP BY).
  • Database Design: Optimized for columnar storage (faster to read only the columns needed) and Massively Parallel Processing (MPP). It uses partitioning and clustering instead of traditional indexes.
  • Scale: Optimized for massive read volume and high computational power.
  • Integrity: Typically follows BASE principles (Basically Available, Soft state, Eventually consistent) or is highly durable but not transactional.
  • Data Volume: Holds historical data, logs, and facts (Petabytes).

Real-World OLAP Examples

SystemQuery ExampleCharacteristic
Business Intelligence (BI)What was the total revenue in Q3 for customers aged 30-40 in Europe?Aggregates over a large time window and segment.
ML TrainingCalculate the average transaction amount for all users over the past 3 years.Full scan of historical logs to build feature vectors.
Reporting DashboardShow the latency trend of the API over the last 90 days.High-volume reads of time-series data.
Financial ForecastingPredict the sales volume for the next quarter by region.Complex joins and large-scale computation.

Appropriate Databases: Google BigQuery, Amazon Redshift, Apache Hive, Snowflake.

Summary of Key Differences

FeatureOLTP (Transaction Processing)OLAP (Analytical Processing)
ActivityDay-to-day operations (Updating a record).Strategic analysis (Running a report).
Typical QuerySELECT * WHERE user_id = 123SELECT AVG(sales) GROUP BY country
MetricsTransactions per second (TPS).Query latency for complex aggregates.
Data StructureRow-based, indexed.Columnar-based, partitioned.
GoalConsistency and Speed of Writes/Updates.Speed and efficiency of Reads/Scans.

Is there any solution to achieve massive scaling for an RDBMS system while still maintaining ACID compliance?

Yes, one such product in Google Spanner.

What is Spanner?

Google Cloud Spanner is Google’s globally distributed, horizontally scalable, strongly consistent, fully managed relational database.

In one line:

It combines the scalability of NoSQL with the consistency + SQL capabilities of a relational database.

It’s the backbone of many Google-scale systems (AdWords, Google Play, Gmail metadata, etc.) and is available as a fully managed service in GCP.

How does Spanner work? (the core concepts)

Spanner achieves something extremely rare: global distribution + horizontal scaling + strong consistency + SQL + ACID transactions. It uses a set of innovations to pull this off.

Let’s break it down simply.

1. Tables → Shards (Splits) → Distributed Across Nodes

Spanner automatically splits tables into “splits” based on primary key ranges:

  • You choose a primary key.
  • Spanner automatically shards data based on the key.
  • Each shard is stored on different nodes and can be replicated across regions.

This gives:

  • Horizontal scalability
  • Automatic load balancing
  • Automatic resharding when data grows

2. TrueTime: the magic ingredient

Spanner uses TrueTime, a globally synchronized clock API that provides bounded clock uncertainty.

TrueTime is powered by:

  • GPS receivers
  • Atomic clocks inside Google data centers

TrueTime gives timestamps with a guarantee:

1
t ∈ [earliest_possible_time, latest_possible_time]

This allows Spanner to:

  • Order transactions globally
  • Avoid inconsistency due to clock drift
  • Achieve external consistency (strongest consistency you can get)

TrueTime is why Spanner can do global ACID transactions, something most distributed databases can’t do.

3. Synchronous Replication Across Regions

Data is replicated using Paxos/Raft-like quorum consensus, not asynchronous replication.

For example (3 replicas):

  • Majority (2/3) must acknowledge writes.
  • If one replica fails, the system continues.

This provides:

  • No data loss
  • Strong consistency
  • Geographic availability

You can choose:

  • multi-region,
  • regional, or
  • multi-zone configurations

4. Strongly Consistent Transactions

Spanner supports:

  • Strongly consistent reads (linearizable)
  • ACID transactions
  • Serializable isolation across distributed shards

SQL support includes:

  • ANSI SQL
  • Joins
  • Secondary indexes
  • Mutations
  • Commit timestamps
  • Change streams (CDC)

This is full relational behavior but at horizontal scale.

5. Automatic scaling + storage separation

Compute and storage scale independently.

Spanner:

  • Automatically adds splits as data grows
  • Rebalances hot shards
  • Handles replica placement
  • Manages failovers automatically

This is crucial for YouTube-scale or multi-petabyte workloads.

6. Fully Managed

You don’t manage:

  • Replication
  • Sharding
  • Clock sync
  • Failover
  • Backups
  • Storage resizing

GCP handles all of that.

7. Use cases where Spanner shines

  • Global-scale OLTP
  • Financial systems needing strong consistency
  • Inventory, account balances, orders
  • Gaming backends
  • Multi-region user databases
  • Enterprise SaaS with global tenants
  • Google-scale metadata systems
  • High-QPS transactional workloads

The moment you need:

  • billions of rows
  • globally consistent transactions
  • horizontal scale
  • SQL + strong consistency Spanner is often the right answer.

8. When NOT to use Spanner

Spanner is not cheap. It is premium infrastructure.

Avoid Spanner when:

  • Dataset is small (<1 TB)
  • QPS is low
  • You don’t need multi-region or strong global consistency
  • You need simple lookups only (use Bigtable)
  • You need flexible schema/noSQL (use Firestore/Bigtable)

9. Comparison: Bigtable vs Spanner

FeatureBigtableSpanner
TypeWide-column NoSQLDistributed RDBMS
SchemaSemi-structuredStructured SQL
TransactionsSingle rowGlobal, ACID
JoinsNoYes
ConsistencyEventually/strong (per row)Strong globally
Use casesHigh-throughput lookupsGlobal OLTP
ScalingMassive linear scaleMassive linear scale
PricingCheaperPremium

If you need:

  • key-value store → Bigtable
  • global SQL database → Spanner

TL;DR Answer

  • How does Spanner work? Using sharded tables, synchronous replication, Paxos, and TrueTime global clock to provide globally consistent, scalable SQL with ACID transactions.

  • Is it available in GCP? Yes, it is a fully managed GCP service known as Cloud Spanner.


Since Google Cloud Spanner is a distributed RDBMS, does it support all 3 properties in CAP theorem?

Yes, Google Cloud Spanner is a unique distributed database that challenges the traditional understanding of the CAP theorem.

While the CAP theorem states that a distributed system can only guarantee two out of the three properties (Consistency, Availability, and Partition Tolerance), Spanner effectively provides Consistency (C) and Availability (A) while running on a highly partitioned, global network.

How Spanner “Cheats” CAP

It doesn’t technically violate CAP, but it minimizes the trade-off to a point where it feels like it supports all three.

  1. Consistency (Strong External Consistency): Spanner provides global, synchronous replication. It uses a unique timekeeping API called TrueTime, which relies on GPS and atomic clocks in Google’s data centers. This allows Spanner to assign globally consistent timestamps to transactions, ensuring strict serializability across the globe.
  2. Availability (High Availability): Google claims 99.999% availability for Spanner. It achieves this through massive redundancy and a highly optimized network.
  3. Partition Tolerance (The “Technical” Trade-off): In the strict CAP sense, Spanner is a CP system. If a major network partition occurs (e.g., a fiber cut isolating a continent), Spanner chooses Consistency over Availability. The isolated partition will stop accepting writes to maintain data integrity.
    • However: Google’s private global network is so redundant that such partitions are exceptionally rare. Thus, in practice, users experience it as a CA system that also handles partitioning (P) seamlessly 99.999% of the time.

So, practically speaking, Spanner is often described as a CA database that is effectively P-tolerant due to Google’s infrastructure, though theoretically, it remains a CP system.

Enjoyed this article? Never miss out on future posts - follow me.
This post is licensed under CC BY 4.0 by the author.