Domain Knowledge

Distributed Systems

CAP theorem, split-brain prevention, shard allocation, cluster state management, and data lifecycle — the theoretical foundations behind Elasticsearch's architecture.

CAP Theorem & Elasticsearch

C

Consistency

A

Availability

P

Partition Tolerance

Elasticsearch prioritizes Consistency and Partition Tolerance (CP system). During a network partition, Elasticsearch will sacrifice availability to prevent split-brain scenarios — a minority partition will stop accepting writes rather than risking data divergence. The master node requires a quorum of master-eligible nodes to make decisions.

However, Elasticsearch behaves more like an AP system for reads in practice: search queries can be served from any replica, even during a partition, though results may be stale. The near-real-time (NRT) refresh interval (default 1s) already means search results are inherently eventually consistent.
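The refresh interval is a dynamic per-index setting, so the staleness window can be tuned at runtime; a minimal sketch, assuming an index named logs-maclab:

```json
// Relax the NRT refresh interval to trade search freshness
// for indexing throughput (default is 1s)
PUT /logs-maclab/_settings
{
  "index": {
    "refresh_interval": "30s"
  }
}
```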

Split-Brain Prevention

Split-brain occurs when a network partition divides the cluster into two groups, each electing its own master. Both sides accept writes, creating divergent data that cannot be reconciled.

Legacy: minimum_master_nodes (pre-7.x)

Required manual configuration: (master_eligible_nodes / 2) + 1. For 3 nodes: minimum_master_nodes = 2. This was error-prone: operators often forgot to update the value when adding or removing nodes, leaving the cluster open to split-brain.

Modern: Voting Configuration (7.x+)

Elasticsearch automatically manages the voting configuration. When nodes join or leave, the cluster updates the set of voting members. A master decision requires a majority (quorum) of voting members. This eliminated the most common cause of split-brain in production.
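The current voting configuration can be inspected from the cluster state; a sketch using filter_path to return only the coordination metadata:

```json
GET _cluster/state?filter_path=metadata.cluster_coordination.last_committed_config

// Response shape (node IDs are placeholders):
// { "metadata": { "cluster_coordination": {
//     "last_committed_config": ["node-id-1", "node-id-2", "node-id-3"] } } }
```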

Quorum calculation for our 3-node cluster
Voting members: [es01, es02, es03]
Quorum required: (3 / 2) + 1 = 2 nodes

Scenario: es02 goes down
  Remaining voters: [es01, es03] = 2 ≥ quorum
  Result: Cluster continues operating

Scenario: es02 AND es03 go down
  Remaining voters: [es01] = 1 < quorum
  Result: es01 cannot elect a master → cluster stops accepting writes
  This is BY DESIGN — prevents split-brain

Shard Allocation

Every index is divided into shards. Each shard is a self-contained Lucene index. The master node decides which node holds which shard based on allocation rules.

Primary Shards

The authoritative copy. All write operations target the primary shard first, then replicate to replicas. The number of primary shards is fixed at index creation (except via _split or _shrink APIs).
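For example, an under-sharded index can be resized with the _split API; a hedged sketch (index names are illustrative), noting that the source must first be made read-only and the target shard count must be a multiple of the original:

```json
// Block writes on the source index before splitting
PUT /logs-old/_settings
{ "index.blocks.write": true }

// Split into 4 primaries (a multiple of the original count)
POST /logs-old/_split/logs-new
{
  "settings": { "index.number_of_shards": 4 }
}
```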

Replica Shards

Copies of primaries on different nodes. Serve read requests (increasing throughput) and provide failover if the primary node goes down. Replica count can be changed at any time.
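Because the replica count is dynamic, it can be raised on a live index to scale read throughput; a minimal sketch, assuming an index named logs-maclab:

```json
PUT /logs-maclab/_settings
{
  "index": { "number_of_replicas": 2 }
}
```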

Shard allocation awareness
// Force awareness across availability zones
PUT _cluster/settings
{
  "persistent": {
    "cluster.routing.allocation.awareness.attributes": "zone",
    "cluster.routing.allocation.awareness.force.zone.values": "zone-a,zone-b"
  }
}

// Each node is tagged with its zone:
// node.attr.zone: zone-a  (on es01, es02)
// node.attr.zone: zone-b  (on es03)
// Elasticsearch ensures primary and replica are in different zones

Cluster State Management

The cluster state is a global data structure maintained by the master node and replicated to all nodes. It contains:

Index metadata

Mappings, settings, aliases for every index

Shard routing table

Which shard lives on which node

Node membership

Active nodes and their roles

Ingest pipelines

Registered pipeline definitions

Index templates

Template patterns and their configurations

Persistent settings

Cluster-wide settings that survive restarts
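Each of these components can be inspected directly via the cluster state API; a sketch (the index name is illustrative), using metric filters and filter_path to keep responses manageable:

```json
// Shard routing table for a single index
GET _cluster/state/routing_table/logs-maclab

// Just node membership names
GET _cluster/state/nodes?filter_path=nodes.*.name
```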

Performance Impact

Large cluster states (thousands of indices, millions of fields) slow down master operations. Each mapping change requires a cluster state update that must be replicated to all nodes. This is why mapping explosions (dynamic mapping creating thousands of fields) are a serious production issue — they bloat the cluster state and can cause master instability.
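A common guardrail is to cap dynamic field growth and reject unexpected fields at the mapping level; a sketch using real index settings (the index name and the limit of 500 are illustrative choices, the default limit is 1000):

```json
PUT /logs-maclab-strict
{
  "settings": {
    "index.mapping.total_fields.limit": 500
  },
  "mappings": {
    "dynamic": "strict",
    "properties": {
      "@timestamp": { "type": "date" },
      "message":    { "type": "text" }
    }
  }
}
```

With "dynamic": "strict", documents containing unmapped fields are rejected instead of silently growing the mapping.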

Master Election Process

Master election flow
1. Node startup
   └── Node discovers other nodes via discovery.seed_hosts

2. Voting
   └── Each master-eligible node votes for a candidate whose
       cluster state is at least as up to date as its own
       (highest accepted term, then highest state version)

3. Quorum
   └── A node becomes master when it receives votes from a
       majority of voting members

4. Cluster state publishing
   └── New master publishes the cluster state to all nodes
   └── Nodes acknowledge receipt
   └── Master commits the state when a majority acknowledge

5. Fault detection
   └── Master pings followers every 1s (cluster.fault_detection.follower_check.interval)
   └── Followers ping master every 1s
   └── After 3 consecutive failures, the node is considered dead
   └── If master is considered dead, a new election begins
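The intervals and retry counts above correspond to real fault-detection settings; they are static, so they belong in elasticsearch.yml rather than the cluster settings API. A sketch showing the defaults:

```yaml
# elasticsearch.yml — static fault-detection settings (defaults shown)
cluster.fault_detection.follower_check.interval: 1s
cluster.fault_detection.follower_check.timeout: 10s
cluster.fault_detection.follower_check.retry_count: 3
cluster.fault_detection.leader_check.interval: 1s
cluster.fault_detection.leader_check.timeout: 10s
cluster.fault_detection.leader_check.retry_count: 3
```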

ILM: Index Lifecycle Management

ILM automates index management through lifecycle phases. As data ages, it moves through tiers optimized for different access patterns:

Hot

Active writes, frequent searches. Fast SSD storage.

Warm

No writes, occasional searches. Can use cheaper storage.

Cold

Rare searches. Compressed, frozen possible.

Frozen

Searchable snapshots. Data in object storage, cached locally.

Delete

Data removed after retention period.

ILM policy example
PUT _ilm/policy/maclab-logs-policy
{
  "policy": {
    "phases": {
      "hot": {
        "min_age": "0ms",
        "actions": {
          "rollover": {
            "max_primary_shard_size": "50gb",
            "max_age": "7d"
          }
        }
      },
      "warm": {
        "min_age": "30d",
        "actions": {
          "shrink": { "number_of_shards": 1 },
          "forcemerge": { "max_num_segments": 1 }
        }
      },
      "cold": {
        "min_age": "90d",
        "actions": {
          "searchable_snapshot": {
            "snapshot_repository": "maclab-backup"
          }
        }
      },
      "delete": {
        "min_age": "365d",
        "actions": { "delete": {} }
      }
    }
  }
}
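A policy does nothing until an index uses it; a sketch of wiring the policy above into an index template (the template name, index pattern, and rollover alias are illustrative):

```json
PUT _index_template/maclab-logs-template
{
  "index_patterns": ["maclab-logs-*"],
  "template": {
    "settings": {
      "index.lifecycle.name": "maclab-logs-policy",
      "index.lifecycle.rollover_alias": "maclab-logs"
    }
  }
}
```

New indices matching maclab-logs-* then pick up the policy automatically; the rollover alias is only needed for classic indices, not for data streams.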

Data Streams

Data streams are the modern replacement for index-per-day patterns. They provide an abstraction over multiple backing indices, automatically managing rollover and ILM integration.

Data stream structure
Data stream: logs-maclab
├── .ds-logs-maclab-2024.01.15-000001  (backing index, warm)
├── .ds-logs-maclab-2024.01.22-000002  (backing index, warm)
├── .ds-logs-maclab-2024.01.29-000003  (backing index, hot)
└── .ds-logs-maclab-2024.02.05-000004  (write index, hot)

Writes → always go to the latest backing index (write index)
Reads  → automatically fan out across all backing indices
Rollover → ILM creates a new backing index, old one becomes read-only

Data streams require a @timestamp field and only support append-only operations (no updates or deletes by document ID). This constraint enables optimizations like sorted index merging and efficient time-based filtering.
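Creating one requires an index template whose pattern matches the stream name and which declares data_stream; the first document indexed then creates the stream and its initial backing index automatically. A sketch (template and stream names are illustrative):

```json
PUT _index_template/logs-maclab-template
{
  "index_patterns": ["logs-maclab*"],
  "data_stream": {},
  "template": {
    "settings": { "index.lifecycle.name": "maclab-logs-policy" }
  }
}

// Append-only write; data streams only accept op_type=create,
// and every document must carry a @timestamp field
POST /logs-maclab/_doc?op_type=create
{
  "@timestamp": "2024-02-05T12:00:00Z",
  "message": "example event"
}
```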