Kubernetes Centralized Logging: Best-in-Class Observability Pipeline for 100+ Microservices

Six months ago, I implemented a centralized logging pipeline for Kubernetes-based microservices running in production.

The goal was simple: build a reliable, scalable, secure, searchable, and alertable observability pipeline that could handle logs from more than 100 microservices without becoming fragile, expensive, or difficult to operate.

The result was a production-ready logging architecture where every microservice writes structured logs to stdout and stderr, and the platform takes care of collection, buffering, processing, storage, search, dashboards, alerting, and long-term retention.

The pipeline has been running successfully for six months with no major issues.

Kubernetes centralized logging observability pipeline architecture diagram — End-to-end architecture for a centralized Kubernetes logging pipeline across 100+ microservices.

The Problem

When a system grows beyond a few services, logging becomes much more than “write logs somewhere.”

In a Kubernetes environment with 100+ microservices, logs are constantly generated across many pods, containers, nodes, namespaces, and environments. Without a centralized pipeline, troubleshooting becomes slow and inconsistent.

Common problems include:

Application logs are scattered across pods and nodes.
Developers need manual access to containers or clusters to inspect logs.
Important errors are discovered too late.
Log formats are inconsistent across services.
Sensitive data may appear in logs if there is no processing layer.
Storage costs grow quickly when hot and cold data are not separated.
There is no reliable way to replay, reprocess, or recover logs during downstream outages.

High-Level Architecture

Kubernetes Pods
  → OpenTelemetry Collector
  → Apache Kafka
  → Vector
  → OpenSearch / Elasticsearch
  → Grafana
  → ElastAlert 2
  → PagerDuty / Slack / Email / Teams / Webhooks

For long-term retention:

Vector → Amazon S3

1. Kubernetes Logging Model

Each microservice writes logs to standard output and standard error. This is preferred for containerized workloads because the application does not need to know anything about the logging backend.

Application logs
Container logs
Kubernetes events
System and audit logs

2. Collection Layer: OpenTelemetry Collector

The preferred collector is OpenTelemetry Collector, deployed as a DaemonSet across Kubernetes nodes. It collects from stdout/stderr, adds metadata, and forwards data downstream.

3. Stream and Buffer Layer: Apache Kafka

Kafka is the durable buffer between collection and processing, enabling high throughput, backpressure handling, replay capability, and resilience during downstream outages.

4. Processing Layer: Vector

Vector handles parsing, normalization, enrichment, transformation, routing, sensitive data redaction, error handling, and dead-letter queue workflows.

5. Storage and Search: OpenSearch

OpenSearch is the hot storage and search layer for fast investigation, analytics, and lifecycle management.

6. Dashboards and Visualization: Grafana

Grafana dashboards make logs actionable for developers, operations, support, and leadership teams.

7. Alerting: ElastAlert 2

ElastAlert 2 provides rule-based alerting and routes alerts to PagerDuty, Slack, Teams, email, and webhooks.

8. Archive and Cold Storage: Amazon S3

S3 provides long-term, cost-effective retention for compliance, audits, and reprocessing scenarios.

Pipeline Guarantees

Reliability: buffering, retries, and replay reduce log loss risk.
Scalability: each layer scales independently.
Security: RBAC, TLS, secrets management, redaction.
Observability: logs, metrics, and traces under one strategy.
Compliance: retention + audit support.
Cost Optimization: hot/cold separation.

Lessons Learned

Use structured logs whenever possible.
Include service name, environment, correlation ID, request ID, and severity.
Use durable buffering (Kafka) before search storage.
Normalize formats and redact sensitive data early.
Separate hot search from cold archive.
Design alerting to avoid noise.
Monitor the logging pipeline itself.

Final Result

This pipeline has been running in production for six months across more than 100 microservices. The architecture is reliable, scalable, secure, searchable, alertable, and cost-conscious.

← Back to Articles