Failure Drills
Three hands-on failure scenarios executed against the live MacLab 3-node cluster: node failure, mapping conflict, and snapshot restore. Each drill includes the exact commands, observed behavior, and interview talking points.
These are not theoretical exercises. The observations below are real data points that demonstrate understanding of Elasticsearch failure modes and recovery mechanisms — exactly the kind of scenarios a Senior Support Engineer handles daily.
Node Failure & Recovery
Scenario
Simulate a node going down (hardware failure, OOM kill, network partition) and observe cluster behavior. This is the most common failure mode in production — a support engineer needs to know exactly what happens and how to guide a customer through recovery.
Command
# Kill es02 abruptly (simulates hardware failure)
docker compose stop es02
# Check cluster health immediately
curl -s https://localhost:9200/_cluster/health \
--cacert ca.crt -u elastic:$PASSWORD | jq
# Check unassigned shards
curl -s "https://localhost:9200/_cat/shards?v&h=index,shard,prirep,state,node" \
--cacert ca.crt -u elastic:$PASSWORD | grep UNASSIGNED
# Recover: restart the node
docker compose start es02
# Watch recovery
curl -s "https://localhost:9200/_cat/recovery?v&active_only=true" \
--cacert ca.crt -u elastic:$PASSWORD
Observation
Cluster immediately transitioned from GREEN to YELLOW. 22 replica shards that had been assigned to es02 became UNASSIGNED. Primary shards remained available on es01 and es03, so all search and indexing operations continued without interruption. The cluster logged master-election activity, but because es01 and es03 maintained quorum (2 of 3 master-eligible nodes), no re-election was needed.
Resolution
After restarting es02, the node rejoined the cluster within 15 seconds. Elasticsearch detected that es02 had a recent copy of the shard data (synced allocation IDs matched) and performed peer recovery — only replaying the transaction log rather than copying entire segments. Cluster returned to GREEN in under 30 seconds.
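A related preventive step for planned maintenance: raising the delayed-allocation timeout before stopping a node keeps the cluster from copying replica shards elsewhere during a brief outage. A sketch, assuming the same credentials and CA file as the drills above; the 5m value is illustrative:

```shell
# Delay replica reallocation for all indices before planned maintenance,
# so a short node outage does not trigger full shard copies.
# (Default is 1m; 5m here is illustrative.)
curl -X PUT "https://localhost:9200/_all/_settings" \
  --cacert ca.crt -u elastic:$PASSWORD \
  -H 'Content-Type: application/json' -d '{
  "settings": {
    "index.unassigned.node_left.delayed_timeout": "5m"
  }
}'
```

Setting the value back to null after maintenance restores the default.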
Interview Talking Point
When a customer calls about a node failure, my first question is: what color is the cluster? YELLOW means replicas are unassigned but data is safe. RED means primary shards are missing — that's the emergency. I'd check _cluster/allocation/explain to understand why shards aren't allocating, check disk watermarks, and guide the customer through recovery without panicking.
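The allocation-explain call mentioned above can be sketched like this; the index name and shard number are illustrative:

```shell
# Ask Elasticsearch why a specific shard copy is unassigned.
# (index/shard/primary values are illustrative.)
curl -s -X GET "https://localhost:9200/_cluster/allocation/explain" \
  --cacert ca.crt -u elastic:$PASSWORD \
  -H 'Content-Type: application/json' -d '{
  "index": "support-tickets",
  "shard": 0,
  "primary": false
}' | jq '.unassigned_info.reason, .allocate_explanation'
```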
Mapping Conflict
Scenario
Index a document with a field type that conflicts with the existing mapping. This is one of the most common support tickets — customers have multiple data sources sending different types for the same field name.
Command
# First, index a document with age as a long (number)
curl -X POST "https://localhost:9200/test-mapping/_doc/1" \
--cacert ca.crt -u elastic:$PASSWORD \
-H 'Content-Type: application/json' -d '{
"name": "Kim Minji",
"age": 28
}'
# Now try to index age as a string — this will fail
curl -X POST "https://localhost:9200/test-mapping/_doc/2" \
--cacert ca.crt -u elastic:$PASSWORD \
-H 'Content-Type: application/json' -d '{
"name": "Park Sooyoung",
"age": "twenty-five"
}'
Observation
The second indexing request returned HTTP 400 with a document_parsing_exception: "failed to parse field [age] of type [long] in document with id 2. Preview of field value: twenty-five". Elasticsearch auto-detected the age field as type long from the first document, and the mapping is immutable — you cannot change a field type once set.
Resolution
Three options for the customer: (1) Reindex into a new index with the correct mapping defined upfront. (2) Use a multi-field mapping with both long and keyword sub-fields. (3) Use an ingest pipeline with a convert processor to normalize the data before indexing. The root cause is usually missing explicit mappings — I'd recommend always defining mappings upfront in an index template rather than relying on dynamic mapping.
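Option (1) can be sketched as follows; the new index name (test-mapping-v2) and the chosen field types are illustrative:

```shell
# Create a new index with an explicit mapping, then copy the data across.
# Mapping "age" as keyword accepts both "28" and "twenty-five" as strings.
curl -X PUT "https://localhost:9200/test-mapping-v2" \
  --cacert ca.crt -u elastic:$PASSWORD \
  -H 'Content-Type: application/json' -d '{
  "mappings": {
    "properties": {
      "name": { "type": "text" },
      "age":  { "type": "keyword" }
    }
  }
}'
# Copy documents from the old index into the new one.
curl -X POST "https://localhost:9200/_reindex" \
  --cacert ca.crt -u elastic:$PASSWORD \
  -H 'Content-Type: application/json' -d '{
  "source": { "index": "test-mapping" },
  "dest":   { "index": "test-mapping-v2" }
}'
```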
Interview Talking Point
Mapping conflicts are a top-5 support ticket category. The key teaching moment is: dynamic mapping is convenient for development but dangerous in production. I always recommend explicit mappings in index templates, with dynamic: strict to reject unexpected fields. For the immediate fix, reindexing with _reindex API is usually the fastest path.
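A minimal index-template sketch with dynamic: strict; the template name, index pattern, and fields are illustrative:

```shell
# Index template that rejects documents containing unmapped fields.
# (Template name and pattern are illustrative.)
curl -X PUT "https://localhost:9200/_index_template/tickets-template" \
  --cacert ca.crt -u elastic:$PASSWORD \
  -H 'Content-Type: application/json' -d '{
  "index_patterns": ["tickets-*"],
  "template": {
    "mappings": {
      "dynamic": "strict",
      "properties": {
        "name": { "type": "text" },
        "age":  { "type": "long" }
      }
    }
  }
}'
```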
Snapshot Restore
Scenario
Delete an entire index (simulating accidental deletion or data corruption) and restore it from a previously created snapshot. This validates the backup/restore pipeline end-to-end.
Command
# Verify snapshot exists
curl -s https://localhost:9200/_snapshot/maclab-backup/snapshot_1 \
--cacert ca.crt -u elastic:$PASSWORD | jq '.snapshots[0].state'
# → "SUCCESS"
# Delete the support-tickets index
curl -X DELETE "https://localhost:9200/support-tickets" \
--cacert ca.crt -u elastic:$PASSWORD
# → {"acknowledged": true}
# Confirm deletion
curl -s "https://localhost:9200/support-tickets/_count" \
--cacert ca.crt -u elastic:$PASSWORD
# → 404 index_not_found_exception
# Restore from snapshot
curl -X POST "https://localhost:9200/_snapshot/maclab-backup/snapshot_1/_restore" \
--cacert ca.crt -u elastic:$PASSWORD \
-H 'Content-Type: application/json' -d '{
"indices": "support-tickets"
}'
# Verify restoration
curl -s "https://localhost:9200/support-tickets/_count" \
--cacert ca.crt -u elastic:$PASSWORD | jq '.count'
# → 10
Observation
The index was completely removed — all primary and replica shards deleted. The snapshot restore created a new index with identical settings (including the Korean analyzer configuration) and restored all 10 documents. The restore operation took under 2 seconds for this small index.
Resolution
All 10 bilingual support tickets were restored with their original mappings, settings, and data intact. The Korean analyzer configuration (Nori tokenizer with decompound_mode: mixed) was preserved because snapshot/restore captures the full index metadata. A post-restore search for '클러스터 상태' returned the same 3 results as before deletion.
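One way to spot-check the analyzer after a restore is the _analyze API. A sketch, assuming the restored index defines a custom analyzer named korean (the analyzer name is an assumption from the lab setup):

```shell
# Run the restored index's Korean analyzer against a sample query string
# to confirm the Nori configuration survived the restore.
curl -s -X GET "https://localhost:9200/support-tickets/_analyze" \
  --cacert ca.crt -u elastic:$PASSWORD \
  -H 'Content-Type: application/json' -d '{
  "analyzer": "korean",
  "text": "클러스터 상태"
}' | jq '[.tokens[].token]'
```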
Interview Talking Point
Snapshot/restore is the last line of defense. I'd walk a customer through: (1) Don't panic — the data is in the snapshot. (2) Check snapshot state first. (3) If restoring to the same index name, you need to close or delete the existing index first. (4) For large indices, restore can take hours — use the _recovery API to monitor progress. (5) Always test your restore procedure before you need it in an emergency.
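Step (4) can be sketched with the index-level _recovery API; the jq filter assumes the standard recovery response shape:

```shell
# Monitor restore progress for a large index: one line per shard
# still recovering, with its stage and bytes-restored percentage.
curl -s "https://localhost:9200/support-tickets/_recovery?active_only=true" \
  --cacert ca.crt -u elastic:$PASSWORD \
  | jq '.["support-tickets"].shards[]
        | {id: .id, stage: .stage, percent: .index.size.percent}'
```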