Data Ingestion
Ingest pipelines, index templates, Korean language analysis with Nori tokenizer, and bulk indexing of bilingual support tickets.
Data Summary
- System Logs: 15 documents (maclab-logs index)
- Support Tickets: 10 documents (bilingual KR/EN)
- Total Documents: 25 (across 2 indices)
- Korean Analyzer: Nori (analysis-nori plugin)
Ingest Pipeline: maclab-logs
The ingest pipeline transforms raw log entries before indexing. It parses timestamps into @timestamp, extracts structured fields from log messages using grok patterns, and normalizes field values such as log levels.
{
"description": "Process MacLab system logs",
"processors": [
{
"date": {
"field": "raw_timestamp",
"formats": ["ISO8601"],
"target_field": "@timestamp"
}
},
{
"grok": {
"field": "message",
"patterns": [
"%{LOGLEVEL:log_level} %{GREEDYDATA:log_message}"
],
"ignore_failure": true
}
},
{
"set": {
"field": "pipeline",
"value": "maclab-logs"
}
},
{
"lowercase": {
"field": "log_level",
"ignore_failure": true
}
}
]
}

Pipeline Processors Explained
date — Parses the raw_timestamp field into a proper @timestamp, enabling time-based queries and Kibana time filters.
grok — Extracts the log level (INFO, WARN, ERROR) and message body from unstructured log text using regex patterns.
set — Adds a pipeline metadata field so you can track which pipeline processed each document.
lowercase — Normalizes log levels to lowercase for consistent aggregation (INFO and info both become info).
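With the pipeline stored under the id maclab-logs (assumed here from the pipeline's name), the _simulate API lets you verify all four processors against a sample document before indexing anything; the raw message below is a hypothetical log line:

POST _ingest/pipeline/maclab-logs/_simulate
{
  "docs": [
    {
      "_source": {
        "raw_timestamp": "2024-01-15T09:30:00Z",
        "message": "ERROR disk watermark exceeded on node maclab-2"
      }
    }
  ]
}

The response shows the transformed document: @timestamp parsed from raw_timestamp, log_level extracted by grok and lowercased to error, and the pipeline field set to maclab-logs.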
Index Template: maclab-logs-*
{
"index_patterns": ["maclab-logs-*"],
"template": {
"settings": {
"number_of_shards": 1,
"number_of_replicas": 1,
"default_pipeline": "maclab-logs"
},
"mappings": {
"properties": {
"@timestamp": { "type": "date" },
"message": { "type": "text" },
"log_level": { "type": "keyword" },
"log_message": { "type": "text" },
"service": { "type": "keyword" },
"host": { "type": "keyword" },
"pipeline": { "type": "keyword" }
}
}
}
}

The template applies to any index matching maclab-logs-*. With 1 primary shard and 1 replica, every document lives on 2 of our 3 nodes, providing redundancy while keeping the shard count manageable. The default_pipeline setting automatically routes all documents through the ingest pipeline above.
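Assuming the template is registered under the name maclab-logs (the template name itself is not shown above), any newly created index matching the pattern inherits its settings and mappings, which can be spot-checked via the settings API; the date-suffixed index name below is illustrative:

PUT maclab-logs-2024.01

GET maclab-logs-2024.01/_settings?filter_path=*.settings.index.default_pipeline

The second request should return "default_pipeline": "maclab-logs", confirming that documents written to this index pass through the pipeline without it being named in each request.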
Korean Language Support: Nori Analyzer
The support-tickets index is created with a custom analyzer named korean, built on the Nori tokenizer:
{
"settings": {
"number_of_shards": 1,
"number_of_replicas": 1,
"analysis": {
"tokenizer": {
"nori_user_dict": {
"type": "nori_tokenizer",
"decompound_mode": "mixed"
}
},
"analyzer": {
"korean": {
"type": "custom",
"tokenizer": "nori_user_dict",
"filter": [
"nori_readingform",
"nori_part_of_speech",
"lowercase"
]
}
}
}
},
"mappings": {
"properties": {
"ticket_id": { "type": "keyword" },
"title": {
"type": "text",
"analyzer": "korean",
"fields": {
"keyword": { "type": "keyword" }
}
},
"description": {
"type": "text",
"analyzer": "korean"
},
"severity": { "type": "keyword" },
"product": { "type": "keyword" },
"language": { "type": "keyword" },
"customer_name": { "type": "keyword" },
"resolution_hours": { "type": "float" },
"status": { "type": "keyword" },
"created_at": { "type": "date" }
}
}
}

Nori Tokenizer Deep Dive
The Nori tokenizer is Elasticsearch's official Korean morphological analyzer, built on the Lucene Nori module. Korean is an agglutinative language in which words are formed by combining morphemes, so plain whitespace tokenization falls short: particles attach directly to words and compound nouns are common.
decompound_mode: mixed — Outputs both the original compound token and its decomposed parts. For example, "가곡역" (Gagok Station) produces both "가곡역" and "가곡" + "역". This gives the best recall for Korean text search.
nori_readingform — Converts Hanja (Chinese characters occasionally used in Korean) to their Hangul reading form.
nori_part_of_speech — Filters out grammatical particles and suffixes (like 을/를, 이/가) that don't contribute to search relevance, similar to stop words in English.
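The analyzer can be inspected directly with the _analyze API against the support-tickets index; with decompound_mode set to mixed, the compound example from above yields both the full compound and its parts (exact tokens can vary with the dictionary version):

GET support-tickets/_analyze
{
  "analyzer": "korean",
  "text": "가곡역"
}

The token list should contain "가곡역" alongside "가곡" and "역", matching the recall-oriented behavior described above.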
Sample Bilingual Support Tickets
{"index": {"_index": "support-tickets", "_id": "T001"}}
{"ticket_id":"T001","title":"클러스터 상태 RED 긴급 대응","description":"고객 프로덕션 클러스터가 RED 상태입니다. 주 샤드 할당에 실패하였으며, 디스크 사용률이 95%를 초과했습니다. 긴급 대응이 필요합니다.","severity":"critical","product":"elasticsearch","language":"ko","customer_name":"Samsung SDS","resolution_hours":2.5,"status":"resolved","created_at":"2024-01-15T09:30:00Z"}
{"index": {"_index": "support-tickets", "_id": "T002"}}
{"ticket_id":"T002","title":"Kibana dashboard loading slowly","description":"Customer reports Kibana dashboards take 30+ seconds to load. The cluster has 500+ indices and heavy aggregation queries. Need to optimize visualization queries and check cluster resources.","severity":"high","product":"kibana","language":"en","customer_name":"NTT Data","resolution_hours":4.0,"status":"resolved","created_at":"2024-01-16T14:00:00Z"}
{"index": {"_index": "support-tickets", "_id": "T003"}}
{"ticket_id":"T003","title":"인덱스 매핑 충돌 해결","description":"서로 다른 데이터 소스에서 같은 필드명에 다른 타입으로 데이터를 보내고 있습니다. age 필드가 long과 text로 혼재되어 매핑 충돌이 발생했습니다.","severity":"medium","product":"elasticsearch","language":"ko","customer_name":"LG CNS","resolution_hours":3.0,"status":"resolved","created_at":"2024-01-17T11:00:00Z"}The support tickets index contains 10 bilingual documents (6 Korean, 4 English) representing realistic customer issues across Elasticsearch, Kibana, and Elastic Agent products. Severity ranges from critical to low, with resolution times averaging 5.2 hours.
Korean Text Search Verification
GET support-tickets/_search
{
"query": {
"multi_match": {
"query": "클러스터 상태",
"fields": ["title", "description"],
"analyzer": "korean"
}
}
}
// Result: 3 hits
// T001: "클러스터 상태 RED 긴급 대응" (score: 4.82)
// T005: "클러스터 성능 모니터링 설정" (score: 1.23)
// T008: "클러스터 노드 추가 후 샤드 재배치" (score: 0.95)

The Nori tokenizer correctly decomposes "클러스터 상태" into morphemes and matches documents containing related terms. This demonstrates that Korean language search is fully functional, a critical capability for supporting Korean enterprise customers.
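The keyword and numeric fields in the mapping also support the aggregations behind the summary figures above; for example, a severity breakdown plus the average resolution time (the 5.2-hour average quoted earlier):

GET support-tickets/_search
{
  "size": 0,
  "aggs": {
    "by_severity": {
      "terms": { "field": "severity" }
    },
    "avg_resolution": {
      "avg": { "field": "resolution_hours" }
    }
  }
}

Because severity, product, and language are keyword fields, terms aggregations on them are exact and inexpensive; text fields such as description are not aggregatable by default.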