Monitoring

Building Observability into Modern Applications

Understanding the three pillars of observability - metrics, logs, and traces - and implementing comprehensive monitoring strategies for distributed systems.

May 22, 2025
11 min read
Observability Dashboard

The Observability Imperative

In today's world of distributed systems, microservices, and cloud-native applications, traditional monitoring approaches fall short. The complexity of modern software systems means that we can't predict all the ways they might fail. This is where observability becomes crucial—it's not just about knowing when something breaks, but understanding why it broke and how to fix it quickly.

Observability is the ability to understand the internal state of a system by examining its external outputs. It's about building systems that can tell you their own story when things go wrong, even in ways you never anticipated.

The Three Pillars of Observability

Observability is built on three fundamental pillars: metrics, logs, and traces. While each pillar provides value on its own, the real power comes from correlating data across all three.

The Three Pillars:

  • Metrics: Numerical values that represent your system's health over time
  • Logs: Discrete events with contextual information about what happened
  • Traces: The journey of a request through your distributed system

Metrics: The Heartbeat of Your System

Metrics are numerical measurements taken over time. They're your first line of defense in understanding system health and are essential for alerting and capacity planning.

Types of Metrics

Understanding different metric types helps you choose the right measurement for each situation:

Counters

Monotonically increasing values that represent cumulative totals. Examples include total requests processed, errors encountered, or bytes transferred.

// Counter example (assuming a Prometheus client such as prom-client)
const { Counter } = require('prom-client')

const requestCounter = new Counter({
  name: 'http_requests_total',
  help: 'Total number of HTTP requests',
  labelNames: ['method', 'status']
})

// Increment the counter for a successful GET request
requestCounter.inc({ method: 'GET', status: '200' })

Gauges

Values that can go up or down, representing a snapshot at a point in time. Examples include CPU usage, memory consumption, or active connection counts.
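
A minimal gauge sketch in the same style as the counter example above (assuming the same prom-client-style API; the connection-tracking use case is illustrative):

// Gauge example
const { Gauge } = require('prom-client')

const activeConnections = new Gauge({
  name: 'active_connections',
  help: 'Number of currently active connections'
})

activeConnections.inc()   // a connection was opened
activeConnections.dec()   // a connection was closed
activeConnections.set(42) // or set an absolute value from a periodic poll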

Histograms

Sample observations and count them in configurable buckets. Perfect for measuring request durations or response sizes.

// Histogram example (assuming the same prom-client-style API)
const { Histogram } = require('prom-client')

const requestDuration = new Histogram({
  name: 'http_request_duration_seconds',
  help: 'Duration of HTTP requests in seconds',
  buckets: [0.1, 0.3, 0.5, 0.7, 1, 3, 5, 7, 10]
})

const startTime = Date.now()
// ... process request
requestDuration.observe((Date.now() - startTime) / 1000) // record the duration in seconds

Key Metrics to Track

While metrics will vary by application, some universal patterns apply (a sketch of RED-style instrumentation follows the list below):

Essential Metrics Categories:

RED Metrics (Request-focused)
  • Rate: Requests per second
  • Errors: Error rate or count
  • Duration: Response time percentiles
USE Metrics (Resource-focused)
  • Utilization: % of time resource is busy
  • Saturation: Amount of queued work
  • Errors: Count of error events
Business Metrics
  • Conversion rates
  • User engagement
  • Revenue per request
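
The RED metrics sketch mentioned above, implemented as Express middleware (a hypothetical example assuming prom-client and Express; the metric names, label names, and buckets are illustrative assumptions, not taken from this post):

// RED metrics middleware sketch
const express = require('express')
const { Counter, Histogram } = require('prom-client')

const httpRequestsTotal = new Counter({
  name: 'http_requests_total',
  help: 'Total HTTP requests',
  labelNames: ['method', 'route', 'status']
})

const httpRequestDuration = new Histogram({
  name: 'http_request_duration_seconds',
  help: 'HTTP request duration in seconds',
  labelNames: ['method', 'route', 'status'],
  buckets: [0.05, 0.1, 0.3, 0.5, 1, 3, 5]
})

const app = express()

app.use((req, res, next) => {
  const endTimer = httpRequestDuration.startTimer()
  res.on('finish', () => {
    // In production, prefer the matched route template over the raw path to keep cardinality bounded
    const labels = { method: req.method, route: req.path, status: res.statusCode }
    httpRequestsTotal.inc(labels) // Rate, and Errors via the status label
    endTimer(labels)              // Duration
  })
  next()
})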

Logs: The Detailed Story

Logs provide detailed, contextual information about events in your system. While metrics tell you something is wrong, logs help you understand what exactly happened.

Structured Logging

Moving beyond simple text logs to structured formats like JSON enables powerful searching, filtering, and analysis capabilities.

// Structured logging example
logger.info('User login successful', {
  userId: '12345',
  email: 'user@example.com',
  loginMethod: 'oauth',
  ipAddress: '192.168.1.1',
  userAgent: 'Mozilla/5.0...',
  timestamp: '2025-05-22T10:30:00Z',
  requestId: 'req-789'
})

Log Levels and When to Use Them

Proper log level usage is crucial for maintaining signal-to-noise ratio:

  • ERROR: Application errors that need immediate attention
  • WARN: Potential issues that might need investigation
  • INFO: Important business events and state changes
  • DEBUG: Detailed information for troubleshooting
  • TRACE: Very detailed information, typically disabled in production
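
In practice the active level is driven by configuration so verbose output never reaches production. A minimal sketch, assuming a winston-style logger (this post does not prescribe a specific logging library):

// Log level configuration example (winston is an illustrative choice)
const winston = require('winston')

const logger = winston.createLogger({
  // Default to 'info' in production so debug logging becomes a no-op
  level: process.env.LOG_LEVEL || (process.env.NODE_ENV === 'production' ? 'info' : 'debug'),
  format: winston.format.json(),
  transports: [new winston.transports.Console()]
})

logger.debug('Cache lookup', { key: 'user:12345' })        // suppressed when level is 'info'
logger.info('User login successful', { userId: '12345' })  // emitted at 'info' and above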

Log Aggregation and Analysis

In distributed systems, logs from multiple services need to be aggregated and correlated. Popular solutions include:

  • ELK Stack: Elasticsearch, Logstash, and Kibana
  • Grafana Loki: Prometheus-inspired log aggregation
  • Fluentd/Fluent Bit: Open-source data collectors
  • Cloud Solutions: Amazon CloudWatch, Google Cloud Logging (formerly Stackdriver), Splunk

Distributed Tracing: Following the Request Journey

Distributed tracing tracks requests as they flow through multiple services, providing a complete picture of how your distributed system handles each request.

Core Tracing Concepts

Understanding these concepts is essential for implementing effective tracing:

Trace

A trace represents the entire journey of a request through your system, from the initial entry point to the final response.

Span

A span represents a single operation within a trace. Each span has a start time, duration, and can include metadata about the operation.

Context Propagation

The mechanism by which trace context is passed between services, typically through HTTP headers or message metadata.
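
For example, with the W3C Trace Context propagator that OpenTelemetry uses by default, the active context can be injected into outgoing request headers (a minimal sketch assuming the @opentelemetry/api package; callDownstream is a hypothetical helper):

// Context propagation sketch
const { context, propagation } = require('@opentelemetry/api')

async function callDownstream(url) {
  const headers = {}
  // Copies the current trace context into carrier headers such as `traceparent`
  propagation.inject(context.active(), headers)
  return fetch(url, { headers })
}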

// OpenTelemetry tracing example (assuming the @opentelemetry/api package)
const { trace, SpanStatusCode } = require('@opentelemetry/api')

const tracer = trace.getTracer('my-service')

app.get('/api/users/:id', async (req, res) => {
  const span = tracer.startSpan('get_user')
  
  try {
    span.setAttributes({
      'user.id': req.params.id,
      'http.method': req.method,
      'http.url': req.url
    })
    
    const user = await getUserFromDatabase(req.params.id)
    span.setStatus({ code: SpanStatusCode.OK })
    res.json(user)
  } catch (error) {
    span.recordException(error)
    span.setStatus({ 
      code: SpanStatusCode.ERROR, 
      message: error.message 
    })
    res.status(500).json({ error: 'Internal server error' })
  } finally {
    span.end()
  }
})

Sampling Strategies

Tracing every request can be expensive. Implement smart sampling strategies (a configuration sketch follows this list):

  • Probabilistic Sampling: Sample a percentage of traces
  • Rate-based Sampling: Limit traces per second
  • Adaptive Sampling: Adjust sampling based on system load
  • Error Sampling: Always sample traces with errors
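
As an example of probabilistic, parent-based sampling, here is a minimal configuration sketch assuming the OpenTelemetry Node SDK (the 10% ratio is an arbitrary illustration):

// Probabilistic sampling sketch
const { NodeSDK } = require('@opentelemetry/sdk-node')
const { ParentBasedSampler, TraceIdRatioBasedSampler } = require('@opentelemetry/sdk-trace-base')

const sdk = new NodeSDK({
  // Sample roughly 10% of root traces; child spans follow the parent's decision
  sampler: new ParentBasedSampler({ root: new TraceIdRatioBasedSampler(0.1) })
})

sdk.start()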

Implementing Observability: A Practical Approach

Building observability into your applications requires thoughtful planning and implementation. Here's a practical approach to get started:

Step 1: Instrument Your Code

Start by adding instrumentation to your application code. Use libraries like OpenTelemetry for vendor-neutral instrumentation:

// Auto-instrumentation setup
const { NodeSDK } = require('@opentelemetry/sdk-node')
const { getNodeAutoInstrumentations } = require('@opentelemetry/auto-instrumentations-node')

const sdk = new NodeSDK({
  instrumentations: [getNodeAutoInstrumentations()]
})

sdk.start()

Step 2: Set Up Data Collection

Deploy collectors to gather observability data from your applications:

# OpenTelemetry Collector configuration
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317

processors:
  batch:

exporters:
  prometheus:
    endpoint: "0.0.0.0:8889"
  jaeger:
    endpoint: jaeger-collector:14250

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [batch]
      exporters: [jaeger]
    metrics:
      receivers: [otlp]
      processors: [batch]
      exporters: [prometheus]

Step 3: Build Dashboards and Alerts

Create dashboards that provide both high-level system health views and detailed drill-down capabilities:

Dashboard Design Principles:

  • Start with Business Metrics: Show what matters to users first
  • Use the Inverted Pyramid: High-level overview, then details
  • Include Context: Show related metrics together
  • Enable Drill-down: Link from metrics to logs and traces
  • Set Appropriate Time Ranges: Default to relevant time windows
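
On the alerting side of this step, a rate-based error alert might look like the following sketch in Prometheus rule syntax (the metric name matches the counter examples above; the 5% threshold and 10-minute window are illustrative):

# Example Prometheus alerting rule (thresholds are illustrative)
groups:
  - name: api-alerts
    rules:
      - alert: HighErrorRate
        expr: |
          sum(rate(http_requests_total{status=~"5.."}[5m]))
            / sum(rate(http_requests_total[5m])) > 0.05
        for: 10m
        labels:
          severity: page
        annotations:
          summary: "5xx error rate above 5% for the last 10 minutes"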

Advanced Observability Patterns

As your observability practice matures, consider these advanced patterns:

Correlation and Context

The real power of observability comes from correlating data across the three pillars. Use correlation IDs to link logs and traces, and shared, low-cardinality labels to tie metrics back to them:

// Correlation ID pattern
const correlationId = req.headers['x-correlation-id'] || generateId()

// Add to all logs
logger.info('Processing request', { correlationId, action: 'start' })

// Add to traces
span.setAttributes({ 'correlation.id': correlationId })

// Avoid using the correlation ID as a metric label: every unique value creates a
// new time series (the "metric explosion" anti-pattern discussed below). Link
// metrics to logs and traces through bounded labels (method, route, status) or
// exemplars instead.

Service Level Objectives (SLOs)

Define and monitor SLOs to ensure your observability efforts align with business objectives (a worked error-budget example follows these definitions):

  • SLI (Service Level Indicator): A quantitative measure of service behavior, such as the fraction of successful requests
  • SLO (Service Level Objective): A target value or range for an SLI over a compliance window
  • Error Budget: The amount of unreliability the SLO allows (1 - SLO target)
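
As a worked example of the arithmetic (the 99.9% target and 30-day window are illustrative):

// Error budget calculation
const sloTarget = 0.999             // 99.9% availability objective
const windowMinutes = 30 * 24 * 60  // 30-day window = 43,200 minutes
const errorBudget = (1 - sloTarget) * windowMinutes

console.log(errorBudget.toFixed(1)) // about 43.2 minutes of allowed downtime per window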

Chaos Engineering

Use chaos engineering to validate that your observability tools can detect and help you respond to failures:

  • Introduce controlled failures
  • Verify that monitoring detects the issues
  • Practice incident response procedures
  • Improve observability based on learnings

Observability in Production

Running observability tools in production requires careful consideration of performance, costs, and operational overhead:

Performance Considerations

  • Sampling: Use appropriate sampling rates for traces
  • Batching: Batch metrics and logs for efficient transmission
  • Async Processing: Don't let observability block application performance
  • Circuit Breakers: Fail fast if observability backends are down

Cost Management

  • Data Retention: Set appropriate retention policies
  • Sampling: Sample expensive high-volume data
  • Tiered Storage: Use cheaper storage for older data
  • Alert Fatigue: Tune alerts to reduce noise

Common Pitfalls and How to Avoid Them

Learn from common mistakes teams make when implementing observability:

Observability Anti-Patterns:

  • Metric Explosion: Creating too many high-cardinality metrics
  • Log Spam: Logging everything without considering value
  • Trace Pollution: Creating spans for every function call
  • Dashboard Overload: Creating too many dashboards to maintain
  • Alert Fatigue: Setting up alerts that fire too frequently
  • Vendor Lock-in: Tying observability to specific vendor formats

The Future of Observability

Observability continues to evolve with new technologies and approaches:

eBPF and Kernel-level Observability

eBPF enables deep system observability without modifying application code, providing insights into network traffic, system calls, and kernel behavior.

AI-Powered Observability

Machine learning is being applied to:

  • Automatic anomaly detection
  • Intelligent alerting and noise reduction
  • Root cause analysis automation
  • Predictive capacity planning

OpenTelemetry Standardization

The OpenTelemetry project is creating vendor-neutral standards for observability instrumentation, making it easier to avoid vendor lock-in.

Conclusion

Observability is not just about tools—it's about building a culture of understanding your systems. It requires investment in instrumentation, tooling, and processes, but the payoff in reduced MTTR (Mean Time To Recovery) and improved system reliability is substantial.

Start small with basic metrics and logging, then gradually add distributed tracing and more sophisticated analysis. Focus on correlation between the three pillars and always tie your observability efforts back to business outcomes.

Remember that observability is a journey, not a destination. As your systems evolve and grow more complex, your observability practices must evolve with them. The goal is not perfect observability, but sufficient observability to understand and operate your systems effectively.

In our increasingly complex distributed world, observability isn't optional—it's essential for building reliable, maintainable systems that serve users well. Invest in observability early and often, and your future self will thank you when you need to debug that critical production issue at 3 AM.

Tags:

Observability, Monitoring, Distributed Systems, SRE, Metrics, Logging, Tracing