Domain Knowledge

Customer Support Perspective

KCS methodology, troubleshooting playbooks, customer communication patterns, and APJ-specific considerations for the Senior Support Engineer role.

KCS: Knowledge Centered Service

KCS is a methodology for integrating knowledge creation and maintenance into the support workflow. Instead of treating knowledge base articles as a separate task, KCS makes knowledge capture a natural byproduct of solving customer issues. Elastic uses KCS as a core practice in its support organization.

Four KCS Principles

Abundance

Knowledge is not a scarce resource to be hoarded. The more you share, the more valuable it becomes. Everyone contributes, everyone benefits.

Create Value

Every interaction creates organizational value. Even if you can't solve the issue immediately, documenting the symptoms and investigation steps creates knowledge for the next engineer.

Demand-Driven

Create and maintain knowledge based on actual customer demand. Don't pre-write articles speculatively — write them when customers ask. The most-viewed articles are the most valuable.

Trust

Trust contributors to create and modify knowledge. Peer review happens naturally through reuse. Articles improve over time as more engineers encounter the same issue.

The Solve Loop

The Solve Loop is the operational heart of KCS. It happens with every customer interaction:

Capture

Document the customer's context and problem in their words as you work the case. Don't wait until resolution.

Structure

Use the consistent SPRE template (Situation, Problem, Resolution, Environment) so articles are scannable.

Reuse

Before investigating, search the knowledge base. If an article exists, link it. If it's incomplete, improve it while solving.

Improve

Every touch improves an article. Add missing steps, correct errors, expand the environment section. Flag articles that are wrong.

The Evolve Loop

The Evolve Loop is the organizational layer that improves the KCS practice itself over time:

Content Health

Monitor article quality metrics: reuse rate, freshness, flagged articles. Retire stale content. Identify gaps in coverage.

Process Integration

KCS must be embedded in the workflow, not bolted on. Tools should prompt for knowledge capture. Search should surface articles in the case form.

Performance Assessment

Measure success by knowledge contribution quality, not just ticket count. Recognize engineers who improve the most articles.

Leadership & Communication

Leadership must model KCS behavior. Celebrate knowledge sharing. Invest in training. Communicate the business value of the knowledge base.

Article Lifecycle

WIP → Not Validated → Validated → Archived

Articles start as work-in-progress drafts (WIP), become visible but unreviewed (Not Validated), are promoted once they meet the content standard (Validated), and are retired when no longer relevant (Archived).

SPRE Template

Knowledge article template:
SITUATION
  What is the customer experiencing? What are the symptoms?
  Example: "Cluster health is RED after adding a 4th node"

PROBLEM
  What is the root cause?
  Example: "Disk watermark exceeded on new node, preventing shard allocation"

RESOLUTION
  Step-by-step fix
  Example:
    1. Check disk watermarks: GET _cluster/settings
    2. Free disk space or adjust watermarks
    3. Re-enable allocation
    4. Verify cluster health returns to GREEN

ENVIRONMENT
  Version, OS, deployment type, relevant configuration
  Example: "Elasticsearch 8.12.0, RHEL 8, self-managed, 4 nodes"
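The RESOLUTION steps in the example article can be sketched as Dev Tools requests. This is a sketch, not a definitive runbook: the endpoints are the standard cluster settings and health APIs, but the watermark values shown are illustrative and should match your environment.

```
# 1. Check current disk watermark settings (including defaults)
GET _cluster/settings?include_defaults=true&filter_path=*.cluster.routing.allocation.disk*

# 2. Temporarily raise watermarks while freeing disk space
PUT _cluster/settings
{
  "transient": {
    "cluster.routing.allocation.disk.watermark.low": "90%",
    "cluster.routing.allocation.disk.watermark.high": "95%",
    "cluster.routing.allocation.disk.watermark.flood_stage": "97%"
  }
}

# 3. Re-enable allocation if it was disabled
PUT _cluster/settings
{
  "transient": { "cluster.routing.allocation.enable": "all" }
}

# Clear the read-only block that the flood watermark applies to indices
PUT _all/_settings
{ "index.blocks.read_only_allow_delete": null }

# 4. Verify cluster health returns to GREEN
GET _cluster/health
```

Remember to revert any transient watermark overrides once disk space is recovered, or the cluster will run closer to full disks than the defaults allow.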

Elastic + KCS + GenAI

Elastic has integrated GenAI into their KCS workflow with significant results:

6x

Increase in case deflection through AI-powered knowledge base search. Customers find answers before opening tickets.

23%

Improvement in Mean Time to First Response (MTFR). AI suggests relevant articles to support engineers, reducing research time.

KCS Terminology Cheat Sheet

Solve Loop: Capture-Structure-Reuse-Improve cycle during case work
Evolve Loop: Organizational improvement of KCS practices
SPRE: Situation-Problem-Resolution-Environment article template
Reuse: Using existing knowledge to solve new cases
Flagging: Marking articles as incorrect or incomplete
Content Standard: Quality criteria for articles at each lifecycle stage
KCS Coach: Peer mentor who helps engineers improve KCS practices
Deflection: Customer self-serves using the knowledge base without opening a ticket

Troubleshooting Methodology: RED THEN GREEN

Cluster is RED

Step 1: Identify unassigned primary shards

# Check cluster health
curl -s "https://localhost:9200/_cluster/health?pretty" \
  --cacert ca.crt -u elastic:$PASSWORD

# Find unassigned shards
curl -s "https://localhost:9200/_cat/shards?v&h=index,shard,prirep,state,unassigned.reason" \
  --cacert ca.crt -u elastic:$PASSWORD | grep UNASSIGNED

# Get allocation explanation for a specific shard
curl -s "https://localhost:9200/_cluster/allocation/explain?pretty" \
  --cacert ca.crt -u elastic:$PASSWORD -H 'Content-Type: application/json' -d '{
    "index": "support-tickets",
    "shard": 0,
    "primary": true
  }'

Common Causes of RED

Disk watermark exceeded (default: 85% low, 90% high, 95% flood)

Node holding the only copy of a primary shard is down

Corrupted shard data (Lucene segment corruption)

Insufficient master-eligible nodes for quorum

Allocation filtering rules preventing shard assignment
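Each cause in the list above has a quick first check. The sketch below uses the standard _cat and settings APIs; the index name `support-tickets` follows the example used earlier and is illustrative.

```
# Disk watermark: per-node disk usage and shard counts
GET _cat/allocation?v

# Node down: confirm all expected nodes are present and their roles
GET _cat/nodes?v&h=name,node.role,heap.percent,disk.used_percent

# Master quorum: check which node (if any) is the elected master
GET _cat/master?v

# Allocation filtering: inspect index-level routing/allocation settings
GET support-tickets/_settings?filter_path=*.settings.index.routing
```

If these checks are inconclusive, the allocation explain API shown in Step 1 gives the authoritative per-shard reason.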

Search is Slow

Performance investigation (Kibana Dev Tools syntax):
# Enable slow logs
PUT /support-tickets/_settings
{
  "index.search.slowlog.threshold.query.warn": "5s",
  "index.search.slowlog.threshold.query.info": "2s",
  "index.search.slowlog.threshold.fetch.warn": "1s"
}

# Profile a specific query (the match text is Korean for "cluster status")
POST /support-tickets/_search
{
  "profile": true,
  "query": {
    "match": { "description": "클러스터 상태" }
  }
}

# Check hot threads (find CPU-heavy operations)
GET _nodes/hot_threads

Data is Missing

Investigation Checklist

1. Check if the ingest pipeline is running: GET _ingest/pipeline/maclab-logs

2. Check for pipeline errors: GET _nodes/stats/ingest

3. Verify the index template matches the index pattern

4. Check if the index exists and is writeable (not read-only from watermark)

5. Verify the mapping accepts the field types being sent

6. Check for bulk indexing rejections in node stats

7. Verify the refresh interval — documents are not searchable until refreshed
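Several checklist items above map directly to API calls. A sketch follows; `maclab-logs` is reused from the pipeline name in step 1, and the assumption that an index of the same name exists is illustrative.

```
# 2. Pipeline errors: look for non-zero "failed" counts per pipeline
GET _nodes/stats/ingest?filter_path=nodes.*.ingest.pipelines

# 4. Read-only block set by the flood watermark?
GET maclab-logs/_settings?filter_path=*.settings.index.blocks

# 6. Bulk indexing rejections on the write thread pool
GET _cat/thread_pool/write?v&h=node_name,name,active,queue,rejected

# 7. Make recently indexed documents searchable immediately
POST maclab-logs/_refresh
```

A manual `_refresh` is a diagnostic aid only; in production, rely on the index's refresh interval rather than forcing refreshes.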

Customer Communication

Lead with Empathy

Acknowledge the customer's frustration before diving into technical details. 'I understand this is impacting your production environment and I'm prioritizing this immediately.' In Korean: '프로덕션 환경에 영향을 주고 있다는 점 충분히 이해합니다. 즉시 최우선으로 대응하겠습니다.'

Set Expectations Early

Tell the customer what you're going to do, how long it might take, and when you'll next update them. Never leave a customer wondering if you're still working on their issue.

Explain, Don't Just Fix

A support engineer who fixes the issue AND explains what happened creates trust. 'The cluster went RED because disk usage exceeded the flood watermark at 95%. Here's how to prevent this in the future...'

Follow Up Proactively

After resolving the issue, check back in 24-48 hours. 'Hi, I wanted to confirm that your cluster health has remained GREEN since our last interaction. Did the disk monitoring alert we set up trigger correctly?'

APJ-Specific Considerations

Timezone Management

APJ spans UTC+5:30 (India) to UTC+13 (New Zealand). Korean business hours (KST, UTC+9) overlap well with Japan and Australia but require handoff coordination with India. As a Korean-based engineer, maintaining flexible hours for APAC-wide escalations is expected.

Cultural Sensitivity

Korean enterprise customers (Samsung, LG, SK, Hyundai) use formal honorific language (존댓말). Technical support in Korean requires the proper formal register: "확인해 보겠습니다" ("I will check," formal), not "확인해 볼게" (casual). Japanese customers similarly expect keigo (敬語). Understanding these nuances builds trust.

Korean Language Support

Providing support in Korean eliminates the translation barrier that adds resolution time. A Korean customer explaining "샤드가 할당되지 않습니다" (shards are not being allocated) should not need to translate their problem to English to get help. This is why the role requires native Korean.

Regional Compliance

Korean customers often operate under PIPA (개인정보보호법, Personal Information Protection Act). Data residency requirements may affect cluster architecture decisions — some customers require all data to remain within Korean borders, impacting snapshot repository locations and cross-region replication.