Distributed Systems
CAP theorem, split-brain prevention, shard allocation, cluster state management, and data lifecycle — the theoretical foundations behind Elasticsearch's architecture.
CAP Theorem & Elasticsearch
C: Consistency
A: Availability
P: Partition Tolerance
Elasticsearch prioritizes Consistency and Partition Tolerance (CP system). During a network partition, Elasticsearch will sacrifice availability to prevent split-brain scenarios — a minority partition will stop accepting writes rather than risking data divergence. The master node requires a quorum of master-eligible nodes to make decisions.
However, Elasticsearch behaves more like an AP system for reads in practice: search queries can be served from any replica, even during a partition, though results may be stale. The near-real-time (NRT) refresh interval (default 1s) already means search results are inherently eventually consistent.
Split-Brain Prevention
Split-brain occurs when a network partition divides the cluster into two groups, each electing its own master. Both sides accept writes, creating divergent data that cannot be reconciled.
Legacy: minimum_master_nodes (pre-7.x)
Required manual configuration: (master_eligible_nodes / 2) + 1. For 3 nodes: minimum_master_nodes = 2. This was error-prone because operators often forgot to update the setting when adding or removing nodes.
Modern: Voting Configuration (7.x+)
Elasticsearch automatically manages the voting configuration. When nodes join or leave, the cluster updates the set of voting members. A master decision requires a majority (quorum) of voting members. This eliminated the most common cause of split-brain in production.
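The majority rule can be sketched as a small helper (illustration only, not Elasticsearch code; the function names are made up for this example):

```python
def quorum(voting_members: int) -> int:
    """Majority of the voting configuration: floor(n / 2) + 1."""
    return voting_members // 2 + 1

def can_elect_master(voting_members: int, reachable: int) -> bool:
    """A partition can elect a master only if it reaches a majority of voters."""
    return reachable >= quorum(voting_members)

# Three voting members, as in the scenarios below:
assert quorum(3) == 2
assert can_elect_master(3, reachable=2)      # one node down: cluster survives
assert not can_elect_master(3, reachable=1)  # two nodes down: no master
```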
Voting members: [es01, es02, es03]
Quorum required: (3 / 2) + 1 = 2 nodes
Scenario: es02 goes down
Remaining voters: [es01, es03] = 2 ≥ quorum
Result: Cluster continues operating
Scenario: es02 AND es03 go down
Remaining voters: [es01] = 1 < quorum
Result: es01 cannot elect a master → cluster stops accepting writes
This is BY DESIGN: it prevents split-brain.

Shard Allocation
Every index is divided into shards. Each shard is a self-contained Lucene index. The master node decides which node holds which shard based on allocation rules.
Primary Shards
The authoritative copy. All write operations target the primary shard first, then replicate to replicas. The number of primary shards is fixed at index creation (except via _split or _shrink APIs).
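The fixed primary count follows from the routing formula: a document's shard is derived from a hash of its routing value modulo the number of primaries. A simplified sketch (Elasticsearch actually hashes the routing value with Murmur3; MD5 stands in here):

```python
import hashlib

def shard_for(doc_id: str, num_primaries: int) -> int:
    """Map a document to a primary shard: hash(routing) % num_primaries."""
    h = int(hashlib.md5(doc_id.encode()).hexdigest(), 16)
    return h % num_primaries

# Routing is deterministic for a fixed primary count:
assert shard_for("doc-42", 5) == shard_for("doc-42", 5)
# Changing num_primaries would remap most documents to different shards,
# which is why the count is fixed at index creation.
```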
Replica Shards
Copies of primaries on different nodes. Serve read requests (increasing throughput) and provide failover if the primary node goes down. Replica count can be changed at any time.
// Force awareness across availability zones
PUT _cluster/settings
{
"persistent": {
"cluster.routing.allocation.awareness.attributes": "zone",
"cluster.routing.allocation.awareness.force.zone.values": "zone-a,zone-b"
}
}
// Each node is tagged with its zone:
// node.attr.zone: zone-a (on es01, es02)
// node.attr.zone: zone-b (on es03)
// Elasticsearch ensures primary and replica are in different zones

Cluster State Management
The cluster state is a global data structure maintained by the master node and replicated to all nodes. It contains:
Index metadata
Mappings, settings, aliases for every index
Shard routing table
Which shard lives on which node
Node membership
Active nodes and their roles
Ingest pipelines
Registered pipeline definitions
Index templates
Template patterns and their configurations
Persistent settings
Cluster-wide settings that survive restarts
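Publication of this state follows the same quorum discipline as elections. A toy model (not Elasticsearch internals; class and method names are invented for illustration): the master bumps a version, publishes to all nodes, and commits only when a majority of voting members acknowledge:

```python
class ClusterState:
    """Toy model of versioned cluster-state publication."""

    def __init__(self, voting_members):
        self.version = 0
        self.voting_members = set(voting_members)
        self.committed = False

    def publish(self, acks):
        """Publish a new state version; commit on majority acknowledgement."""
        self.version += 1
        majority = len(self.voting_members) // 2 + 1
        self.committed = len(set(acks) & self.voting_members) >= majority
        return self.committed

state = ClusterState(["es01", "es02", "es03"])
assert state.publish(acks=["es01", "es02"])       # 2 of 3 -> committed
assert not state.publish(acks=["es01"])           # 1 of 3 -> not committed
assert state.version == 2
```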
Performance Impact
Large cluster states (thousands of indices, millions of fields) slow down master operations. Each mapping change requires a cluster state update that must be replicated to all nodes. This is why mapping explosions (dynamic mapping creating thousands of fields) are a serious production issue — they bloat the cluster state and can cause master instability.
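A rough illustration of the bloat: every dynamically created field adds an entry to the index mapping, which is part of the state replicated to every node. A sketch measuring the serialized size of a hypothetical mapping as fields accumulate:

```python
import json

def mapping_size(num_fields: int) -> int:
    """Bytes of a JSON mapping with num_fields keyword fields (illustrative)."""
    mapping = {"properties": {
        f"field_{i}": {"type": "keyword"} for i in range(num_fields)
    }}
    return len(json.dumps(mapping))

# The replicated payload grows with every dynamic field:
assert mapping_size(10_000) > 100 * mapping_size(10)
```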
Master Election Process
1. Node startup
└── Node discovers other nodes via discovery.seed_hosts
2. Voting
└── Each master-eligible node votes for the node it believes
should be master (typically the node with the highest
cluster state version, or lowest node ID as tiebreaker)
3. Quorum
└── A node becomes master when it receives votes from a
majority of voting members
4. Cluster state publishing
└── New master publishes the cluster state to all nodes
└── Nodes acknowledge receipt
└── Master commits the state when a majority acknowledge
5. Fault detection
└── Master pings followers every 1s (cluster.fault_detection.follower_check.interval)
└── Followers ping master every 1s
└── After 3 consecutive failures, the node is considered dead
└── If master is considered dead, a new election begins

ILM: Index Lifecycle Management
ILM automates index management through lifecycle phases. As data ages, it moves through tiers optimized for different access patterns:
Hot
Active writes, frequent searches. Fast SSD storage.
Warm
No writes, occasional searches. Can use cheaper storage.
Cold
Rare searches. Compressed storage; searchable snapshots possible.
Frozen
Searchable snapshots. Data in object storage, cached locally.
Delete
Data removed after retention period.
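As a sketch, the phase an index belongs to can be resolved from its age using the thresholds of the example policy that follows (hot → warm at 30d, → cold at 90d, → delete at 365d). Note that in a real policy, min_age is measured from rollover, not index creation; this illustration simplifies to a single age:

```python
# Phase thresholds in days, checked from oldest to newest.
PHASES = [("delete", 365), ("cold", 90), ("warm", 30), ("hot", 0)]

def phase_for(age_days: float) -> str:
    """Return the lifecycle phase for an index of the given age (simplified)."""
    for name, min_age in PHASES:
        if age_days >= min_age:
            return name
    return "hot"

assert phase_for(3) == "hot"
assert phase_for(45) == "warm"
assert phase_for(120) == "cold"
assert phase_for(400) == "delete"
```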
PUT _ilm/policy/maclab-logs-policy
{
"policy": {
"phases": {
"hot": {
"min_age": "0ms",
"actions": {
"rollover": {
"max_primary_shard_size": "50gb",
"max_age": "7d"
}
}
},
"warm": {
"min_age": "30d",
"actions": {
"shrink": { "number_of_shards": 1 },
"forcemerge": { "max_num_segments": 1 }
}
},
"cold": {
"min_age": "90d",
"actions": {
"searchable_snapshot": {
"snapshot_repository": "maclab-backup"
}
}
},
"delete": {
"min_age": "365d",
"actions": { "delete": {} }
}
}
}
}

Data Streams
Data streams are the modern replacement for index-per-day patterns. They provide an abstraction over multiple backing indices, automatically managing rollover and ILM integration.
Data stream: logs-maclab
├── .ds-logs-maclab-2024.01.15-000001 (backing index, warm)
├── .ds-logs-maclab-2024.01.22-000002 (backing index, warm)
├── .ds-logs-maclab-2024.01.29-000003 (backing index, hot)
└── .ds-logs-maclab-2024.02.05-000004 (write index, hot)
Writes → always go to the latest backing index (write index)
Reads → automatically fan out across all backing indices
Rollover → ILM creates a new backing index, old one becomes read-only

Data streams require a @timestamp field and only support append-only operations (no updates or deletes by document ID). This constraint enables optimizations like sorted index merging and efficient time-based filtering.
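The backing index names in the diagram follow the pattern .ds-&lt;stream&gt;-&lt;yyyy.MM.dd&gt;-&lt;generation&gt;, where the six-digit generation counter increments on each rollover. A minimal sketch of that naming scheme:

```python
from datetime import date

def backing_index(stream: str, day: date, generation: int) -> str:
    """Build a data stream backing index name: .ds-<stream>-<date>-<generation>."""
    return f".ds-{stream}-{day:%Y.%m.%d}-{generation:06d}"

# Matches the write index in the diagram above:
assert backing_index("logs-maclab", date(2024, 2, 5), 4) == \
    ".ds-logs-maclab-2024.02.05-000004"
```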