Modern data platforms allow multiple jobs to read and write data at the same time.
- Spark jobs run while ingestion pipelines are updating tables.
- Streaming systems append new data while dashboards query historical results.
- Nodes crash. Networks fail.
And yet… your data usually doesn’t get corrupted.
Why?
Because of ACID.
But ACID is not just four textbook words. It’s a collection of engineering mechanisms implemented deep inside storage engines, transaction logs, and distributed coordination systems.
Let’s start simple — and go all the way down.
What is ACID? (Simple Explanation)
ACID stands for:
- Atomicity
- Consistency
- Isolation
- Durability
These four properties guarantee that transactions behave reliably.
Figure: ACID as four complementary guarantees for reliable transaction processing.
Simple Bank Transfer Example
Imagine Alice transfers €100 to Bob.
Two things must happen:
- Alice’s balance decreases by €100
- Bob’s balance increases by €100
Atomicity
Either both operations succeed, or neither does. No half-completed transfers.
Consistency
The total money in the system remains correct before and after the transaction.
Isolation
If two transfers happen at the same time, they don’t interfere in a way that corrupts balances.
Durability
Once the transfer is confirmed, it survives crashes.
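These four guarantees are exactly what a transactional database gives you when you wrap both updates in one transaction. A minimal sketch using Python’s built-in sqlite3 (the `accounts` table and balances are invented for illustration):

```python
import sqlite3

# In-memory database; the `accounts` schema is invented for this example.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (name TEXT PRIMARY KEY, balance INTEGER)")
conn.execute("INSERT INTO accounts VALUES ('alice', 500), ('bob', 100)")
conn.commit()

try:
    with conn:  # one transaction: commits on success, rolls back on error
        conn.execute("UPDATE accounts SET balance = balance - 100 WHERE name = 'alice'")
        conn.execute("UPDATE accounts SET balance = balance + 100 WHERE name = 'bob'")
except sqlite3.Error:
    pass  # on failure, neither update is applied (atomicity)

balances = dict(conn.execute("SELECT name, balance FROM accounts"))
print(balances)  # total money is still 600 (consistency)
```

If either UPDATE raised, the `with conn:` block would roll both back, so no half-completed transfer is ever visible.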
That’s the theory.
Now let’s see how systems actually implement this.
Atomicity & Durability: How Systems Make Changes Crash-Safe
The core mechanism behind atomicity and durability is Write-Ahead Logging (WAL), used in systems like PostgreSQL and MySQL.
The Rule
Before modifying actual data files, the system writes the change to a log.
Only after the log is safely persisted to disk does it update the real data.
Why?
If the system crashes:
- It replays the log.
- It reconstructs the correct state.
This ensures:
- No partial writes
- No corrupted rows
- No lost committed transactions
Durability depends not just on writing to memory, but on forcing the log to disk (fsync).
This ordering is what makes databases crash-safe.
Figure: WAL ordering ensures changes are recoverable before data files are updated.
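The ordering fits in a few lines of Python. This is a toy log (JSON lines plus fsync), not any real database’s format:

```python
import json
import os

LOG = "wal.log"
if os.path.exists(LOG):
    os.remove(LOG)  # start from a clean log for this demo

def log_write(record: dict) -> None:
    # The rule: persist the log entry to disk BEFORE touching data files.
    with open(LOG, "a") as f:
        f.write(json.dumps(record) + "\n")
        f.flush()
        os.fsync(f.fileno())  # force the entry to stable storage

def apply(data: dict, record: dict) -> None:
    data[record["key"]] = record["value"]

def recover() -> dict:
    # Crash recovery: replay the log to reconstruct committed state.
    data = {}
    with open(LOG) as f:
        for line in f:
            apply(data, json.loads(line))
    return data

data = {}
for rec in [{"key": "alice", "value": 400}, {"key": "bob", "value": 200}]:
    log_write(rec)    # 1. write-ahead: log first...
    apply(data, rec)  # 2. ...then update the "data files"

print(recover())  # a restart rebuilds exactly the same state
```

A crash between step 1 and step 2 loses nothing: replaying the log reapplies the change. A crash before step 1 loses an uncommitted change, which is the correct outcome.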
Isolation: Preventing Transactions From Interfering
Now things get more interesting.
If multiple transactions run at the same time, how do we prevent chaos?
There are two major approaches:
- Locking
- MVCC
Traditional Locking
Older systems rely heavily on locks:
- Row-level locks
- Table-level locks
- Blocking readers or writers
While safe, locking reduces concurrency and can cause contention.
Modern analytical systems prefer something smarter.
MVCC — Multi-Version Concurrency Control
Instead of overwriting data immediately, systems create new versions.
Each transaction:
- Has a start timestamp
- Sees only data committed before it started
Writers create new row versions. Readers continue seeing the old version.
This is how:
- Long-running queries don’t block ingestion
- Concurrent writes don’t corrupt results
PostgreSQL uses MVCC at the row level.
Lakehouse systems use the same idea — but at the file level.
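The row-level idea can be sketched directly. The `VersionedRow` class and the timestamps below are invented for illustration:

```python
from dataclasses import dataclass, field

@dataclass
class VersionedRow:
    # Each committed write appends (commit_ts, value); nothing is overwritten.
    versions: list = field(default_factory=list)

    def write(self, commit_ts: int, value) -> None:
        self.versions.append((commit_ts, value))

    def read(self, snapshot_ts: int):
        # A reader sees the newest version committed at or before its snapshot.
        visible = [v for ts, v in self.versions if ts <= snapshot_ts]
        return visible[-1] if visible else None

row = VersionedRow()
row.write(commit_ts=10, value="old")
snapshot = 15                          # a long-running query starts here
row.write(commit_ts=20, value="new")   # a concurrent writer commits later

print(row.read(snapshot))  # the running query still sees "old"
print(row.read(25))        # a query started now sees "new"
```

The writer never blocked the reader, and the reader never blocked the writer: that is the whole point of MVCC.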
Snapshot Isolation
Snapshot Isolation means:
When your query starts, it sees a consistent snapshot of the data at that moment.
Even if new data arrives while the query runs, your result does not change mid-execution.
This is critical in analytics:
- A 30-minute Spark job must see consistent data.
- It cannot return “half old, half new” results.
But here’s something important:
Snapshot Isolation ≠ Serializable Isolation
Snapshot isolation still allows certain anomalies.
Example: Write Skew
Two doctors are on call. The rule: at least one must always stay on call.
Transaction A:
- Reads its snapshot, sees two doctors on call
- Takes doctor A off call
Transaction B (concurrently):
- Reads its own snapshot, also sees two doctors on call
- Takes doctor B off call
They write to different rows, so neither commit conflicts.
Both succeed. Now nobody is on call.
That’s write skew.
Snapshot isolation prevents many conflicts — but not all.
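One anomaly snapshot isolation famously permits is write skew: two transactions each validate a constraint against their own snapshot, write to different rows, and both commit, leaving the combined state invalid. A toy simulation in plain Python (the doctor roster and field names are invented for illustration):

```python
# Toy snapshot store: both "transactions" read the same committed snapshot.
committed = {"doctor_a": "on_call", "doctor_b": "on_call"}

snapshot_1 = dict(committed)  # transaction 1's private snapshot
snapshot_2 = dict(committed)  # transaction 2's private snapshot

writes = {}

# Invariant each transaction checks: at least two doctors are on call
# before anyone is allowed to go off call.
if sum(v == "on_call" for v in snapshot_1.values()) >= 2:
    writes["doctor_a"] = "off_call"   # txn 1: doctor A goes off call

if sum(v == "on_call" for v in snapshot_2.values()) >= 2:
    writes["doctor_b"] = "off_call"   # txn 2: doctor B goes off call

# The writes touch different rows, so snapshot isolation detects no
# write-write conflict and both commits succeed...
committed.update(writes)

print(committed)  # ...but now NOBODY is on call: write skew
```

Each transaction was individually correct against its snapshot; only the combination violates the rule. Serializable isolation would force one of them to abort.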
Serializable Isolation (Stronger Guarantees)
Serializable isolation guarantees:
The result is as if transactions ran one after another in some order.
- No phantoms
- No write skew
- No anomalies
But this comes at a cost:
- More coordination
- More locking or validation
- Reduced performance
Some distributed databases implement Serializable Snapshot Isolation (SSI).
For example, Google Spanner uses tightly synchronized clocks (TrueTime) to maintain strong guarantees across global nodes.
This is where ACID becomes a distributed systems problem.
ACID in Distributed Systems
Single-node databases are simpler.
In distributed systems:
- Nodes can fail independently
- Network partitions can occur
- Partial commits are possible
To coordinate transactions across nodes, systems use Two-Phase Commit (2PC).
Phase 1: Prepare
All participants agree they can commit.
Phase 2: Commit
Coordinator instructs all to finalize.
But 2PC has drawbacks:
- If coordinator crashes, participants may block
- It reduces availability
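The happy path of 2PC fits in a short sketch. The `Participant` class below is a stand-in for a real node, not a real protocol implementation:

```python
class Participant:
    def __init__(self, can_commit: bool):
        self.can_commit = can_commit  # whether this node can vote yes
        self.state = "active"

    def prepare(self) -> bool:
        # Phase 1: vote, and promise to commit if asked.
        self.state = "prepared" if self.can_commit else "abort_voted"
        return self.can_commit

    def commit(self) -> None:
        self.state = "committed"

    def abort(self) -> None:
        self.state = "aborted"

def two_phase_commit(participants) -> bool:
    # Phase 1 (prepare): every participant must vote yes.
    votes = [p.prepare() for p in participants]
    if not all(votes):
        for p in participants:
            p.abort()
        return False
    # Phase 2 (commit): the coordinator instructs everyone to finalize.
    for p in participants:
        p.commit()
    return True

nodes = [Participant(True), Participant(True)]
print(two_phase_commit(nodes))   # all vote yes, all commit

nodes = [Participant(True), Participant(False)]  # one node cannot commit
print(two_phase_commit(nodes))   # one no vote, so nobody commits
```

The blocking problem lives between the phases: a participant that voted yes is stuck in "prepared" until it hears from the coordinator, and if the coordinator crashes there, it can neither commit nor abort on its own.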
To solve broader consensus problems, systems use protocols like:
- Paxos
- Raft
Now ACID is no longer just about storage. It becomes about coordination and consensus.
ACID in Lakehouse Architectures
Modern data engineering platforms like:
- Delta Lake
- Apache Iceberg
implement ACID differently.
They do not use row-level MVCC.
Instead, they rely on:
- File-level versioning
- Metadata transaction logs
- Atomic metadata commits
- Optimistic concurrency control
Optimistic Concurrency Control
Writers:
- Read current metadata version
- Write new data files
- Attempt to commit new metadata version
If metadata changed → retry.
No heavy locks. Conflicts are detected at commit time.
This enables high write concurrency in data lakes.
Figure: Optimistic concurrency in lakehouses uses atomic metadata commits and retries on conflicts.
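A compare-and-swap over the metadata version captures the whole loop. This sketch uses an in-process object where a real lakehouse uses an atomic rename or a catalog call, and the file names are invented:

```python
import itertools

class TableMetadata:
    # Stand-in for a lakehouse transaction log: a version number plus
    # the list of data files that make up the table at that version.
    def __init__(self):
        self.version = 0
        self.files = []

    def try_commit(self, expected_version: int, new_files: list) -> bool:
        # Atomic compare-and-swap on the version: the commit succeeds only
        # if nobody else has committed since we read the table.
        if self.version != expected_version:
            return False
        self.files = self.files + new_files
        self.version += 1
        return True

def write_with_retries(table: TableMetadata, new_files: list, max_retries: int = 5) -> int:
    for attempt in itertools.count():
        base = table.version                    # 1. read current metadata version
        # 2. data files are written out here, with no locks held
        if table.try_commit(base, new_files):   # 3. attempt the metadata commit
            return table.version
        if attempt >= max_retries:
            raise RuntimeError("too many concurrent writers")
        # conflict: another writer committed first; re-read and retry

table = TableMetadata()
write_with_retries(table, ["part-0001.parquet"])
write_with_retries(table, ["part-0002.parquet"])
print(table.version, table.files)
```

Notice that the expensive work (writing data files) happens outside any critical section; only the tiny metadata swap has to be atomic.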
How Spark Achieves Snapshot Isolation
When Spark reads a Delta or Iceberg table:
- It reads a specific table version
- That version points to a specific set of files
Writers produce new files and create a new version.
Readers continue reading their original version.
Only after metadata commit does the new version become visible.
This is snapshot isolation implemented at the file and metadata layer.
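The same mechanism in miniature (the version numbers and Parquet file names are invented):

```python
# Each commit produces an immutable snapshot: version -> list of data files.
snapshots = {1: ["part-0001.parquet"]}
latest = 1

def begin_read():
    # A reader pins the table version that was current when it started.
    return latest, snapshots[latest]

reader_version, reader_files = begin_read()  # long-running query starts

# A concurrent writer adds a file and commits version 2.
snapshots[2] = snapshots[latest] + ["part-0002.parquet"]
latest = 2

# The running query still resolves to version 1's file list...
print(reader_version, reader_files)
# ...while a query started now sees version 2.
print(begin_read())
```

Because data files are immutable and versions are only ever added, the long-running reader needs no locks at all: its snapshot simply cannot change underneath it.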
Trade-offs of ACID
ACID guarantees correctness.
But it comes with trade-offs:
- Write amplification (logs + data files)
- Metadata overhead
- Compaction requirements
- Retry conflicts under heavy concurrency
- Increased coordination cost in distributed systems
Engineering is always about balancing:
- Correctness
- Performance
- Scalability
Final Thoughts
ACID is not just an acronym.
It is:
- Logging mechanisms
- Version management
- Isolation protocols
- Crash recovery strategies
- Distributed coordination algorithms
Understanding ACID means understanding how data systems survive concurrency, crashes, and scale.
It separates:
using data tools from designing data platforms.
And in modern data engineering, that difference matters.