intermediate kubernetes · 90 minutes

LLM Observability on Kubernetes: A Practical Guide

Master LLM observability on Kubernetes with this practical guide. Learn to monitor AI agent performance, cost, and behavior using OpenTelemetry, Prometheus, Grafa...

Prerequisites

  • kubectl installed
  • Kubernetes cluster (minikube, k3s, or cloud)
  • Helm installed
  • Python 3.8+
  • Docker installed

Tools Used

  • kubectl
  • helm
  • python
  • docker
  • opentelemetry-collector

Monitoring traditional applications is a well-trodden path: you collect logs, scrape some metrics, and perhaps add a few traces. Integrating Large Language Models (LLMs) or AI agents, especially on Kubernetes, changes the picture. Status codes and resource graphs no longer tell you whether the system is behaving correctly, so LLM observability on Kubernetes demands a more nuanced approach than standard application monitoring.

This tutorial is designed for DevOps, ML, or platform engineers grappling with the unique challenges of monitoring LLM-powered applications and AI agents on Kubernetes. You’ll learn why traditional tools fall short and how to build a practical, end-to-end observability pipeline with battle-tested Kubernetes-native tools: Prometheus, Grafana, Loki, and OpenTelemetry. Along the way you will instrument a simple AI agent application, deploy it to Kubernetes, and set up a unified observability stack to monitor its performance, cost, and behavior.

Why Traditional Observability Fails for LLM-Powered AI Agents

Generative AI applications, particularly those powered by LLMs and complex AI agents, introduce a new dimension to observability that traditional methods struggle to address. It’s no longer just about CPU and memory; it’s about context, coherence, and cost per token. For effective LLM observability on Kubernetes, your standard monitoring stack needs an upgrade.

Here’s why traditional monitoring falls short for LLMs and AI agents:

  • Non-Determinism: LLMs are inherently non-deterministic. The same prompt can yield different responses, making it hard to track performance or identify regressions solely through request/response codes. You need to understand the content and quality of responses. For example, a successful HTTP 200 response doesn’t indicate whether the LLM’s answer was a hallucination or off-topic.
  • Complex Prompt/Response Dynamics: The interaction isn’t a simple input-output. It involves intricate prompt engineering, context windows, and diverse response formats, so you need to record what was actually sent to the model and what came back, not just that a call happened.
  • Token Usage & Cost: Every interaction with an LLM consumes tokens, which directly translates to cost, especially with proprietary models like OpenAI’s GPT-4. Monitoring token usage per request, per user, or per session is critical for cost control and capacity planning. Traditional metrics simply don’t capture this. For example, a simple query might cost fractions of a cent, but 10,000 queries per minute can quickly escalate costs to thousands of dollars daily.
  • Latency Nuances: LLM response times are often dominated by token generation, not just initial processing. You need to differentiate between prompt processing latency and response streaming latency for accurate performance tuning of your LLM applications.
  • AI Agent Complexity: This is where it gets really interesting. AI agents involve multiple steps: planning, tool selection, tool execution, memory management, and iterative reasoning. Each step is a potential failure point. You need to trace the agent’s entire decision path, track tool call successes/failures, and understand how the agent arrived at its final answer. A simple error log tells you that something failed, but not why the agent chose a particular path or tool.
  • “Black Box” Nature: While you can control the inputs, the internal workings of large foundation models are opaque. Observability needs to shine a light on the model’s behavior at the application layer.
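
To make the latency distinction above concrete, here is a minimal sketch that measures time to first token separately from total generation time. The fake_token_stream generator and its delays are hypothetical stand-ins for a real streaming client; only the measurement pattern is the point.

```python
import time
from typing import Iterable, Iterator, Tuple


def fake_token_stream(tokens: Iterable[str], first_delay: float = 0.05,
                      per_token_delay: float = 0.01) -> Iterator[str]:
    """Hypothetical stand-in for a streaming LLM client."""
    time.sleep(first_delay)  # simulated prompt processing before the first token
    for i, token in enumerate(tokens):
        if i > 0:
            time.sleep(per_token_delay)  # simulated per-token generation time
        yield token


def measure_stream(stream: Iterator[str]) -> Tuple[str, float, float]:
    """Consume a stream and return (text, time_to_first_token, total_time) in seconds."""
    start = time.perf_counter()
    first_token_at = None
    parts = []
    for token in stream:
        if first_token_at is None:
            first_token_at = time.perf_counter()
        parts.append(token)
    end = time.perf_counter()
    # If the stream was empty, fall back to the total duration.
    ttft = (first_token_at - start) if first_token_at is not None else end - start
    return "".join(parts), ttft, end - start


text, ttft, total = measure_stream(fake_token_stream(["Hello", ", ", "world"]))
print(f"ttft={ttft:.3f}s total={total:.3f}s text={text!r}")
```

In a real application, both values would typically be recorded as separate histogram observations so that slow prompt processing and slow generation can be told apart.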

On Kubernetes, you’re adept at monitoring resource utilization at the pod and container level. But for LLM applications, this is only half the story. You need to correlate Kubernetes infrastructure metrics with application-specific LLM and agent metrics to get a complete picture of your LLM observability on Kubernetes.

Core Observability Pillars for LLM Workloads on Kubernetes

The traditional pillars of observability (Logs, Metrics, and Traces) remain foundational, but they need to be adapted and extended for LLM workloads on Kubernetes. This integrated approach is key to achieving comprehensive LLM observability.

Logs: The Narrative of Your AI Agent’s Decisions

For LLM applications and AI agents, logs are more than just error messages. They are the narrative of your agent’s reasoning process, and they are crucial for understanding and debugging its behavior on Kubernetes.

  • Prompt/Response Logging: Crucial for debugging and understanding model behavior. Log the full input prompt, context, and the LLM’s raw response. This helps diagnose why a model might have hallucinated or gone off-topic.
  • Agent Decision Logs: For AI agents, log every significant step:
    • Initial plan formulation.
    • Tool selection decisions.
    • Inputs and outputs of each tool call.
    • Re-planning attempts.
    • Intermediate reasoning steps.
  • Structured Logging: Absolutely essential. Use JSON logging to include metadata like trace_id, span_id, user_id, session_id, model_name, temperature, token_counts, and safety_flags. This makes logs queryable and correlatable with metrics and traces.
  • Kubernetes Integration: Leverage Kubernetes’ standard output (stdout/stderr) for logs. A log collector like Fluent Bit, deployed as a DaemonSet, can then ship these structured logs to a centralized logging solution like Loki, Elasticsearch, or Splunk.
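
As a rough illustration of the structured logging point above, the following sketch writes one JSON object per line to stdout, which is the shape a collector such as Fluent Bit can parse. The build_log_line helper, the field values, and the random trace_id are all placeholders invented for this example; in an instrumented service the trace_id would come from the active span.

```python
import json
import logging
import sys
import uuid


def build_log_line(event: str, **fields) -> str:
    """Serialize one log event as a single JSON line (hypothetical helper)."""
    record = {"event": event, **fields}
    return json.dumps(record, sort_keys=True)


# Emit the raw message only; the JSON string already carries all metadata.
logging.basicConfig(stream=sys.stdout, level=logging.INFO, format="%(message)s")
log = logging.getLogger("agent")

line = build_log_line(
    "llm_call_completed",
    trace_id=uuid.uuid4().hex,  # placeholder; normally taken from the active span
    session_id="session-123",
    model_name="example-model",
    temperature=0.2,
    token_counts={"input": 42, "output": 128},
    safety_flags=[],
)
log.info(line)
```

Keeping every event on a single line matters on Kubernetes, because container runtimes capture stdout line by line.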

Metrics: Quantifying LLM Performance and Cost

Metrics provide the quantitative insights into your LLM application’s health, performance, and operational cost. These are vital for effective LLM observability on Kubernetes.

  • Application-Level Metrics: These are paramount. We’ll dive into specific examples shortly, but think latency, token usage, error rates for LLM calls, and specific agent action success rates.
  • Resource Utilization: Standard Kubernetes metrics for CPU, memory, network I/O are still vital. For GPU-accelerated inference, monitoring GPU utilization, memory, and temperature is critical. Prometheus can scrape container CPU and memory from the kubelet’s cAdvisor endpoint, node-level metrics from node-exporter, Kubernetes object state from kube-state-metrics, and GPU metrics from a dedicated exporter such as NVIDIA’s DCGM Exporter.
  • Cost Metrics: Beyond just resource utilization, track API calls to external LLMs and internal token consumption. This allows for real-time cost estimation and budgeting.
  • Kubernetes Integration: Prometheus, with its ServiceMonitor and PodMonitor custom resources, is perfectly suited for scraping application-level metrics directly from your LLM application pods.

Traces: Following the AI Agent’s Chain of Thought

Distributed tracing is arguably the most powerful pillar for debugging complex, multi-step AI agents, because it visualizes the entire execution path of a request.

  • End-to-End Flow: A trace provides a timeline view of a single request or agent interaction, spanning across multiple services and internal functions. For an AI agent, this means seeing the initial user query, the LLM call, the tool selection logic, the tool execution, and the final response, all linked together.
  • Span Granularity: Create spans for:
    • Incoming request.
    • Each LLM API call (input prompt, model, temperature, output response).
    • Each step of the agent’s reasoning chain (e.g., “planning step”, “tool invocation”, “context retrieval”).
    • Each external tool call (e.g., database lookup, external API call).
  • Context Propagation: Essential for connecting spans across service boundaries. OpenTelemetry automatically handles this for many protocols.
  • Semantic Conventions: Use OpenTelemetry’s semantic conventions for generative AI operations (the gen_ai.* attribute namespace, which is still evolving). This ensures consistency and makes traces easier to interpret across different tools.
  • Kubernetes Integration: Deploy an OpenTelemetry Collector within your Kubernetes cluster to receive traces from your instrumented applications and export them to a tracing backend like Jaeger or Tempo.
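
To show what context propagation actually carries between services, here is a minimal standard-library sketch of the W3C traceparent header, which is the format used by OpenTelemetry's default propagator. The inject and extract functions below are simplified illustrations written for this guide, not the OpenTelemetry API; in practice the SDK does this for you.

```python
import re
import secrets
from typing import Dict, Optional, Tuple

# version "00", 32 hex chars of trace ID, 16 hex chars of parent span ID, 2 hex chars of flags
TRACEPARENT_RE = re.compile(r"^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


def inject(headers: Dict[str, str], trace_id: str, span_id: str, sampled: bool = True) -> None:
    """Write the caller's trace context into outgoing request headers."""
    flags = "01" if sampled else "00"
    headers["traceparent"] = f"00-{trace_id}-{span_id}-{flags}"


def extract(headers: Dict[str, str]) -> Optional[Tuple[str, str, bool]]:
    """Read (trace_id, parent_span_id, sampled) from incoming headers, or None if absent/invalid."""
    match = TRACEPARENT_RE.match(headers.get("traceparent", ""))
    if match is None:
        return None
    trace_id, parent_span_id, flags = match.groups()
    return trace_id, parent_span_id, bool(int(flags, 16) & 0x01)


# Service A starts a trace and calls service B.
outgoing: Dict[str, str] = {}
inject(outgoing, trace_id=secrets.token_hex(16), span_id=secrets.token_hex(8))

# Service B continues the same trace with a new span of its own.
incoming = extract(outgoing)
if incoming is not None:
    trace_id, parent_span_id, sampled = incoming
    child_span_id = secrets.token_hex(8)
    print(f"trace={trace_id} parent={parent_span_id} child={child_span_id} sampled={sampled}")
```

The trace ID stays constant across both services while each hop contributes its own span ID, which is what lets a backend stitch the spans into one timeline.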

Key Metrics for LLM Apps & AI Agents on Kubernetes

Let’s get specific about the metrics you must track for LLM-powered applications and AI agents to ensure comprehensive LLM observability on Kubernetes.

Performance Metrics for LLM Applications

These reveal how quickly and efficiently your LLM application is serving requests.

  • Prompt Processing Latency: Time taken from receiving a prompt to sending it to the LLM API.
    • Prometheus metric type: histogram
    • Example: llm_prompt_processing_seconds_bucket
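  • Response Generation Latency: Time taken for the LLM to generate the full response. For streaming responses, also record the time until the first token arrives as a separate measurement, because it reflects perceived responsiveness rather than total generation time.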
    • Prometheus metric type: histogram
    • Example: llm_response_generation_seconds_bucket
  • Total Request Latency: End-to-end time for a user query.
    • Prometheus metric type: histogram
    • Example: llm_agent_total_request_seconds_bucket
  • Throughput (Queries Per Second): Number of requests or agent interactions handled per second.
    • Prometheus metric type: counter
    • Example: llm_agent_requests_total
  • Tool Call Latency: Time taken for specific tools invoked by the agent (e.g., database query, external API call).
    • Prometheus metric type: histogram
    • Example: agent_tool_call_seconds_bucket{tool_name="weather_api"}
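
The _bucket suffix in the examples above comes from how Prometheus histograms are stored: each bucket counts all observations less than or equal to its upper bound (the le label), so the counts are cumulative. The sketch below reproduces that bookkeeping in plain Python; the cumulative_buckets function and the sample values are made up for illustration.

```python
from bisect import bisect_left
from typing import Dict, List, Sequence


def cumulative_buckets(observations: Sequence[float], bounds: List[float]) -> Dict[str, int]:
    """Count observations per upper bound the way a Prometheus histogram does.

    `bounds` must be sorted ascending; a final "+Inf" bucket is always added.
    """
    per_bucket = [0] * (len(bounds) + 1)
    for value in observations:
        # Index of the first bound >= value, i.e. the smallest bucket that contains it.
        per_bucket[bisect_left(bounds, value)] += 1

    result: Dict[str, int] = {}
    running = 0
    for bound, count in zip(bounds, per_bucket):
        running += count
        result[str(bound)] = running
    result["+Inf"] = running + per_bucket[-1]
    return result


latencies = [0.3, 0.7, 1.0, 4.2, 12.0]  # seconds, invented sample data
print(cumulative_buckets(latencies, [0.5, 1.0, 5.0, 10.0]))
# → {'0.5': 1, '1.0': 3, '5.0': 4, '10.0': 4, '+Inf': 5}
```

Because buckets are cumulative, choosing bounds that bracket your latency targets (for example 1 s and 5 s) is what makes quantile estimates from these histograms meaningful.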

Resource Utilization Metrics for LLMs on Kubernetes

While standard Kubernetes metrics cover CPU and memory, pay special attention to:

  • GPU Utilization: Percentage of GPU compute units being used. Critical for local inference.
    • Prometheus metric type: gauge
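    • Example: gpu_utilization_percentage (illustrative name; the actual metric depends on your exporter, for instance DCGM Exporter reports DCGM_FI_DEV_GPU_UTIL)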
  • GPU Memory Usage: Amount of memory allocated on the GPU.
    • Prometheus metric type: gauge
    • Example: gpu_memory_usage_bytes
  • CPU and Memory (per Pod): Standard container_cpu_usage_seconds_total and container_memory_working_set_bytes, which are exposed by cAdvisor through the kubelet (not by kube-state-metrics or node-exporter).

LLM Cost Monitoring Metrics

Directly impacting your budget, these are often overlooked in initial deployments and are crucial for comprehensive LLM observability.

  • Input Token Count: Number of tokens sent in the prompt.
    • Prometheus metric type: counter
    • Example: llm_input_tokens_total
  • Output Token Count: Number of tokens received in the response.
    • Prometheus metric type: counter
    • Example: llm_output_tokens_total
  • Total LLM API Calls: Number of requests made to the underlying LLM (internal or external).
    • Prometheus metric type: counter
    • Example: llm_api_calls_total{model_name="gpt-4"}
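  • Estimated Cost: A derived metric calculated by multiplying token counts or API calls by their respective per-unit costs. This is often best handled in Grafana using PromQL. For example, if input tokens cost $0.01 per 1000 and output tokens cost $0.03 per 1000, you can estimate the current spend rate. Because rate() returns a per-second value, the expression yields dollars per second; multiply by 3600 or 86400 for an hourly or daily projection.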
    • Prometheus metric type: (Derived in Grafana)
    • Example (PromQL): (sum(rate(llm_input_tokens_total[5m])) / 1000 * 0.01) + (sum(rate(llm_output_tokens_total[5m])) / 1000 * 0.03)
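
The same arithmetic as the PromQL expression can be checked by hand. This sketch uses the example prices from the text ($0.01 and $0.03 per 1000 tokens); the token rates of 2000 and 500 tokens per second are invented purely to have numbers to work with.

```python
INPUT_COST_PER_1K = 0.01   # dollars per 1000 input tokens (example price from the text)
OUTPUT_COST_PER_1K = 0.03  # dollars per 1000 output tokens (example price from the text)


def cost_per_second(input_tokens_per_s: float, output_tokens_per_s: float) -> float:
    """Mirror the PromQL expression: per-second token rates in, dollars per second out."""
    return (input_tokens_per_s / 1000 * INPUT_COST_PER_1K
            + output_tokens_per_s / 1000 * OUTPUT_COST_PER_1K)


# Hypothetical load: 2000 input tokens/s and 500 output tokens/s.
per_second = cost_per_second(2000, 500)
per_day = per_second * 86400  # seconds in a day

print(f"${per_second:.4f} per second, ${per_day:.2f} per day")
# → $0.0350 per second, $3024.00 per day
```

Note that a projection like this assumes the current rate holds for the whole day, so it is best read as a trend indicator rather than a bill.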

Model Quality & AI Agent Behavior Metrics

These are more challenging to define but crucial for understanding the LLM’s effectiveness and the performance of your AI agents.

  • Safety Guardrail Activations: Count of times a safety or moderation filter was triggered (e.g., content flagged as unsafe).
    • Prometheus metric type: counter
    • Example: llm_safety_violations_total
  • Hallucination Flags: If your application has logic to detect potential hallucinations, count these. This is often heuristic.
    • Prometheus metric type: counter
    • Example: llm_hallucinations_detected_total
  • Agent Goal Completion Rate: For multi-step agents, track the percentage of interactions where the agent successfully achieved its objective.
    • Prometheus metric type: counter (for successes/failures)
    • Example: agent_goal_completions_total, agent_goal_failures_total
  • Tool Call Success/Failure Rates: Track how often an agent’s chosen tool executed successfully.
    • Prometheus metric type: counter
    • Example: agent_tool_call_success_total{tool_name="database_lookup"}, agent_tool_call_failure_total{tool_name="external_api"}
  • Number of Agent Steps/Iterations: How many steps or LLM calls an agent took to complete a task. High numbers might indicate inefficiency.
    • Prometheus metric type: histogram (one observation per completed task)
    • Example: agent_iteration_steps_count_bucket
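
Rates such as goal completion or tool success are not stored directly; they are derived from pairs of counters. The sketch below shows the calculation on two counter samples taken some time apart. The helper names and sample values are hypothetical, and the reset handling is a simplification of what Prometheus does when a pod restarts.

```python
from typing import Optional


def counter_increase(previous: float, current: float) -> float:
    """Increase between two samples; a drop is treated as a counter reset to zero."""
    return current if current < previous else current - previous


def success_ratio(successes: float, failures: float) -> Optional[float]:
    """Fraction of successful outcomes, or None when there was no activity."""
    total = successes + failures
    return successes / total if total > 0 else None


# Invented samples of agent_goal_completions_total and agent_goal_failures_total.
completions = counter_increase(previous=1200, current=1380)  # 180 new completions
failures = counter_increase(previous=90, current=110)        # 20 new failures

ratio = success_ratio(completions, failures)
print(f"goal completion rate: {ratio:.1%}")
# → goal completion rate: 90.0%
```

In Grafana, the equivalent query would divide the rate of the success counter by the sum of the success and failure rates, for example sum(rate(agent_goal_completions_total[5m])) / (sum(rate(agent_goal_completions_total[5m])) + sum(rate(agent_goal_failures_total[5m]))).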

Instrumenting LLM Applications & Agents for Observability

Instrumentation is where you expose the internal state of your LLM application and AI agent. OpenTelemetry is the gold standard here for its vendor-neutral approach and comprehensive support for traces, metrics, and logs, making it ideal for LLM observability on Kubernetes.

Why OpenTelemetry is Essential for LLM Observability

OpenTelemetry (OTel) provides a set of APIs, SDKs, and tools to instrument your application to generate and export telemetry data. It’s language-agnostic and supports various exporters, allowing you to switch observability backends without re-instrumenting your code. For complex distributed systems like Kubernetes-hosted AI agents, OTel’s distributed tracing capabilities are invaluable for understanding the flow of LLM interactions.

Instrumentation Methods for LLM Observability

1. OpenTelemetry for Tracing and Metrics

This is the recommended approach for deep, custom instrumentation for LLM observability.

  • Traces:
    • Use the OpenTelemetry SDK for your language (e.g., opentelemetry-python).
    • Wrap LLM calls, tool calls, and agent steps with spans.
    • Add relevant attributes (e.g., model_name, prompt_hash, token_counts, tool_name, status) to spans.
  • Metrics:
    • While OpenTelemetry can also generate metrics, for simple counter/gauge/histogram metrics that Prometheus can scrape directly, using your language’s Prometheus client library (e.g., prometheus_client for Python) is often simpler for HTTP exposition.
    • For richer, more complex metrics or if you want to use OTLP for metrics, OpenTelemetry’s metrics API is also powerful.

2. Framework Callbacks (e.g., LangChain)

If you’re using an LLM framework like LangChain, many provide callback systems that are perfect for capturing agent activity for better LLM observability.

  • LangChain Callbacks: Implement custom callbacks (BaseCallbackHandler) to log, emit metrics, or create traces at various points:
    • on_llm_start/on_llm_end
    • on_tool_start/on_tool_end
    • on_chain_start/on_chain_end
    • on_agent_action/on_agent_finish
  • Integrating with OpenTelemetry: Within these callbacks, you can explicitly create OpenTelemetry spans and add attributes.
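
A callback handler along these lines might look like the sketch below. It is a plain duck-typed class so that it runs without LangChain installed; in a real application it would subclass LangChain's BaseCallbackHandler, and the recorded durations would be sent to a histogram or attached to a span instead of a list. The class name and the events list are assumptions made for this example.

```python
import time
import uuid
from typing import Any, Dict, List, Optional


class TimingCallbackHandler:
    """Records how long each LLM call and tool call takes, keyed by run_id."""

    def __init__(self) -> None:
        self._started: Dict[Any, Dict[str, Any]] = {}
        self.events: List[Dict[str, Any]] = []

    def _begin(self, kind: str, serialized: Optional[Dict[str, Any]], run_id: Any) -> None:
        name = (serialized or {}).get("name", "unknown")
        self._started[run_id] = {"kind": kind, "name": name, "start": time.perf_counter()}

    def _finish(self, run_id: Any) -> None:
        started = self._started.pop(run_id, None)
        if started is None:
            return  # end event without a matching start; nothing to record
        self.events.append({
            "kind": started["kind"],
            "name": started["name"],
            "duration_seconds": time.perf_counter() - started["start"],
        })

    def on_llm_start(self, serialized, prompts, *, run_id=None, **kwargs) -> None:
        self._begin("llm", serialized, run_id)

    def on_llm_end(self, response, *, run_id=None, **kwargs) -> None:
        self._finish(run_id)

    def on_tool_start(self, serialized, input_str, *, run_id=None, **kwargs) -> None:
        self._begin("tool", serialized, run_id)

    def on_tool_end(self, output, *, run_id=None, **kwargs) -> None:
        self._finish(run_id)


# Simulate the calls a framework would make around one tool invocation.
handler = TimingCallbackHandler()
run_id = uuid.uuid4()
handler.on_tool_start({"name": "weather_tool"}, "New York", run_id=run_id)
time.sleep(0.01)
handler.on_tool_end("sunny", run_id=run_id)
print(handler.events)
```

Keying the start time by run_id rather than storing a single timestamp keeps the handler correct when several LLM or tool calls overlap.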

3. Custom Wrappers

For simpler LLM integrations or when frameworks don’t offer enough hooks, you can create custom wrapper functions around your LLM API calls and agent logic.

import json
import time
import logging
from prometheus_client import Histogram, Counter
from opentelemetry import trace
from opentelemetry.propagate import set_global_textmap, extract
from opentelemetry.sdk.resources import Resource
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import (
    ConsoleSpanExporter,
    BatchSpanProcessor,
)
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
from opentelemetry.instrumentation.requests import RequestsInstrumentor
from opentelemetry.sdk.trace.sampling import ALWAYS_ON
from opentelemetry.propagators.b3 import B3MultiFormat as B3Format  # pip install opentelemetry-propagator-b3

# --- OpenTelemetry Setup (for Tracing) ---
# The OTLPSpanExporter below points at the OpenTelemetry Collector service.
# For local experiments, swap it for ConsoleSpanExporter() to print spans to stdout.
resource = Resource.create({"service.name": "llm-agent-app", "service.version": "1.0.0"})
provider = TracerProvider(resource=resource, sampler=ALWAYS_ON)
# In a K8s cluster, this endpoint should point to the OTel Collector service.
otlp_exporter = OTLPSpanExporter(endpoint="http://otel-collector.observability:4317", insecure=True)
span_processor = BatchSpanProcessor(otlp_exporter)
provider.add_span_processor(span_processor)
trace.set_tracer_provider(provider)
tracer = trace.get_tracer("llm.agent.tracer")

# Propagator for B3 headers (commonly used in microservices)
set_global_textmap(B3Format())

# Instrument requests library for outgoing HTTP calls (e.g., to actual LLM API)
RequestsInstrumentor().instrument()

# --- Prometheus Metrics Setup ---
# Latency of the LLM call itself
LLM_CALL_LATENCY_SECONDS = Histogram(
    'llm_call_latency_seconds',
    'Latency of LLM API calls in seconds',
    ['model_name', 'status_code'],
    buckets=[0.1, 0.5, 1.0, 2.0, 5.0, 10.0, 20.0]
)

# Total input and output tokens
LLM_INPUT_TOKENS_TOTAL = Counter(
    'llm_input_tokens_total',
    'Total input tokens processed by LLM',
    ['model_name']
)
LLM_OUTPUT_TOKENS_TOTAL = Counter(
    'llm_output_tokens_total',
    'Total output tokens generated by LLM',
    ['model_name']
)

# Agent tool call success/failure
AGENT_TOOL_CALLS_TOTAL = Counter(
    'agent_tool_calls_total',
    'Total agent tool calls',
    ['tool_name', 'status'] # status: 'success' or 'failure'
)

# Agent overall request latency
AGENT_REQUEST_LATENCY_SECONDS = Histogram(
    'agent_request_latency_seconds',
    'End-to-end latency of agent requests in seconds',
    buckets=[0.5, 1.0, 2.0, 5.0, 10.0, 20.0, 30.0, 60.0]
)

# --- Structured Logger Setup ---
# Note: In a real Flask app, this would be integrated into the app's logger.
# This is a simplified example.
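# Configure only this module's logger. Also calling logging.basicConfig() would attach
# a second handler to the root logger and print every log line twice.
logger = logging.getLogger(__name__)
logger.setLevel(logging.INFO)
logger.propagate = False  # the JSON handler added below is the only output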

# Override the default formatter to inject trace_id and span_id
class JsonFormatter(logging.Formatter):
    def format(self, record):
        log_data = {
            "timestamp": self.formatTime(record, self.datefmt),
            "level": record.levelname,
            "message": record.getMessage(), # Default to plain message
            "service": "llm-agent-app",
            "component": record.name,
        }

        # Attempt to parse message as JSON and merge
        try:
            msg_dict = json.loads(record.getMessage())
            log_data.update(msg_dict)
            if "message" not in msg_dict: # Ensure a 'message' field is always present
                log_data["message"] = record.getMessage()
        except json.JSONDecodeError:
            pass # Message is not JSON, use original record.getMessage()

        # Inject trace_id and span_id as zero-padded lowercase hex (32 and 16 characters),
        # which is the form tracing backends use when you search by ID
        span_context = trace.get_current_span().get_span_context()
        if span_context.is_valid:
            log_data["trace_id"] = format(span_context.trace_id, '032x')
            log_data["span_id"] = format(span_context.span_id, '016x')
        else:
            log_data["trace_id"] = "0"
            log_data["span_id"] = "0"

        return json.dumps(log_data)

# Replace any existing handlers on this logger with one that emits JSON
# Note: For Flask, actual integration may differ.
for handler in logger.handlers[:]:
    logger.removeHandler(handler)
handler = logging.StreamHandler()
# Pass datefmt explicitly so timestamps are written in ISO 8601 form
handler.setFormatter(JsonFormatter(datefmt="%Y-%m-%dT%H:%M:%S%z"))
logger.addHandler(handler)


# --- Mock LLM and Agent Logic ---
import json
def mock_llm_call(prompt: str, model_name: str = "gpt-3.5-turbo"):
    """Simulates an LLM API call."""
    with tracer.start_as_current_span("llm_api_call") as span:
        span.set_attribute("model_name", model_name)
        span.set_attribute("llm.request.type", "chat")
        span.set_attribute("llm.prompts", json.dumps({"role": "user", "content": prompt})) # Store as JSON string

        start_time = time.perf_counter()
        time.sleep(0.5 + 0.5 * (len(prompt) / 100)) # Simulate variable latency
        end_time = time.perf_counter()
        latency = end_time - start_time

        # Simulate token usage
        input_tokens = len(prompt.split()) + 10 # Example
        output_tokens = 50 + len(prompt.split()) // 2 # Example

        response_text = f"This is a simulated response to: '{prompt}'. "
        response_action = None
        status_code = 200  # default; without it the tool branches below leave status_code undefined

        # Simulate agent action suggestion
        if "weather" in prompt.lower() or "forecast" in prompt.lower():
            response_text += "I recommend using a weather tool."
            response_action = {"tool_name": "weather_tool", "city": "New York"}
        elif "time" in prompt.lower():
            response_text += "I recommend using a time tool."
            response_action = {"tool_name": "time_tool"}
        elif "error" in prompt.lower():
            response_text += "Simulating an LLM error."
            status_code = 500
        else:
            response_text += "No tool suggested."
            status_code = 200

        span.set_attribute("llm.response.model", model_name)
        span.set_attribute("llm.response.tokens.total", input_tokens + output_tokens)
        span.set_attribute("llm.response.tokens.prompt", input_tokens)
        span.set_attribute("llm.response.tokens.completion", output_tokens)
        span.set_attribute("llm.response.content", response_text)
        span.set_attribute("http.status_code", status_code)


        LLM_CALL_LATENCY_SECONDS.labels(model_name=model_name, status_code=status_code).observe(latency)
        LLM_INPUT_TOKENS_TOTAL.labels(model_name=model_name).inc(input_tokens)
        LLM_OUTPUT_TOKENS_TOTAL.labels(model_name=model_name).inc(output_tokens)
        logger.info(json.dumps({
            "event": "llm_call_completed",
            "message": "LLM call finished",
            "model": model_name,
            "prompt": prompt,
            "response_summary": response_text,
            "latency_sec": f"{latency:.2f}",
            "input_tokens": input_tokens,
            "output_tokens": output_tokens,
            "status_code": status_code
        }))
        return {"text": response_text, "action": response_action, "status_code": status_code}

def mock_weather_tool(city: str):
    """Simulates an external weather API call."""
    with tracer.start_as_current_span("tool_call_weather") as span:
        span.set_attribute("tool.name", "weather_tool")
        span.set_attribute("tool.parameters.city", city)

        start_time = time.perf_counter()
        time.sleep(0.3)
        end_time = time.perf_counter()
        latency = end_time - start_time

        result = f"The weather in {city} is sunny with 25°C."
        status = "success"
        span.set_attribute("tool.status", status)
        span.set_attribute("tool.result", result)

        AGENT_TOOL_CALLS_TOTAL.labels(tool_name="weather_tool", status=status).inc()
        logger.info(json.dumps({
            "event": "tool_call",
            "message": "Weather tool call",
            "tool_name": "weather_tool",
            "city": city,
            "status": status,
            "latency_sec": f"{latency:.2f}"
        }))
        return result

def mock_time_tool():
    """Simulates an external time API call."""
    with tracer.start_as_current_span("tool_call_time") as span:
        span.set_attribute("tool.name", "time_tool")

        start_time = time.perf_counter()
        time.sleep(0.1)
        end_time = time.perf_counter()
        latency = end_time - start_time

        result = f"The current time is {time.strftime('%H:%M:%S')}."
        status = "success"
        span.set_attribute("tool.status", status)
        span.set_attribute("tool.result", result)

        AGENT_TOOL_CALLS_TOTAL.labels(tool_name="time_tool", status=status).inc()
        logger.info(json.dumps({
            "event": "tool_call",
            "message": "Time tool call",
            "tool_name": "time_tool",
            "status": status,
            "latency_sec": f"{latency:.2f}"
        }))
        return result


def llm_agent_handler(prompt: str, request_headers: dict = None):
    """The core logic of our simple LLM agent."""
    # Extract trace context from incoming request headers, if any.
    # Passing context=None makes the new span use whatever context is current.
    ctx = extract(request_headers) if request_headers else None

    with tracer.start_as_current_span("llm_agent_request", context=ctx) as span:
        span.set_attribute("user.query", prompt)

        agent_start_time = time.perf_counter()

        logger.info(json.dumps({"event": "agent_started", "message": "Agent request initiated", "query": prompt}))

        llm_response_data = mock_llm_call(prompt, "gpt-3.5-turbo-mock")
        final_answer = llm_response_data["text"]

        # Step 2: Agent decision making based on LLM response
        if llm_response_data["action"]:
            action = llm_response_data["action"]
            tool_name = action["tool_name"]
            logger.info(json.dumps({"event": "agent_decision", "message": "Agent decided to call tool", "decision": "call_tool", "tool_name": tool_name}))
            with tracer.start_as_current_span("agent_decision_making") as decision_span:
                decision_span.set_attribute("decision.type", "tool_call")
                decision_span.set_attribute("decision.tool_name", tool_name)

                if tool_name == "weather_tool":
                    city = action.get("city", "New York")
                    tool_result = mock_weather_tool(city)
                    final_answer = f"{final_answer}\nTool result: {tool_result}"
                elif tool_name == "time_tool":
                    tool_result = mock_time_tool()
                    final_answer = f"{final_answer}\nTool result: {tool_result}"
                else:
                    logger.warning(json.dumps({"event": "unknown_tool", "message": "Agent attempted to call unknown tool", "tool_name": tool_name}))
                    AGENT_TOOL_CALLS_TOTAL.labels(tool_name=tool_name, status='failure').inc()

        agent_end_time = time.perf_counter()
        total_latency = agent_end_time - agent_start_time
        AGENT_REQUEST_LATENCY_SECONDS.observe(total_latency)

        span.set_attribute("agent.total_latency_seconds", total_latency)
        span.set_attribute("agent.final_response", final_answer)
        logger.info(json.dumps({"event": "agent_finished", "message": "Agent request completed", "final_response_summary": final_answer, "total_latency_sec": f"{total_latency:.2f}"}))
        return final_answer

This Python code snippet demonstrates the principles for building LLM observability on Kubernetes:

  • OpenTelemetry tracer: Used to create spans for the overall agent request, the LLM call, and each tool call. Attributes are added to these spans for rich context. The extract(request_headers) ensures trace context propagation.
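  • Prometheus client metrics: Histogram for latencies and Counter for tokens and tool calls are defined and updated here. They still need to be served over an HTTP endpoint (for example /metrics) before Prometheus can scrape them.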
  • Structured Logging: logger.info calls output JSON logs, including trace_id and span_id for easy correlation. The custom JsonFormatter ensures proper structure.

Leveraging Kubernetes-Native Observability Tools for LLMs

Now, let’s tie this into your Kubernetes cluster using familiar tools to enhance LLM observability on Kubernetes.

Prometheus & Grafana for LLM Metrics

Prometheus is the de facto standard for metric collection in Kubernetes. Grafana provides powerful visualization.

  1. Prometheus Operator: The easiest way to deploy and manage Prometheus in Kubernetes is using the Prometheus Operator. It introduces Custom Resource Definitions (CRDs) like ServiceMonitor and PodMonitor that simplify scraping configuration for your LLM applications.
  2. Grafana: A leading open-source dashboarding tool that integrates seamlessly with Prometheus for visualizing LLM metrics.

Logging Solutions with Fluent Bit and Loki/Elasticsearch

Centralized logging is non-negotiable for debugging microservices, including LLM-powered applications.

  1. Fluent Bit: A lightweight and efficient log processor and forwarder. Deploy it as a DaemonSet on your Kubernetes nodes to collect logs from container stdout/stderr.
  2. Loki: Grafana Labs’ log aggregation system, designed for cost-effectiveness and scalability, especially when paired with Grafana for visualization. It indexes metadata (labels) rather than full log content, which is great for LLM observability.
  3. Elasticsearch/Kibana: Another popular stack for log aggregation and analysis, especially powerful for full-text search.

OpenTelemetry Collector for Traces and LLM Observability

The OpenTelemetry Collector is an essential component for distributed tracing within your Kubernetes environment.

  • Role: It receives, processes, and exports telemetry data. Your LLM applications send traces to the collector, which then forwards them to your chosen tracing backend (e.g., Jaeger, Tempo).
  • Deployment: Deploy the collector as a Deployment in your Kubernetes cluster, typically in your observability namespace.
  • Configuration: Configure it to receive OTLP (OpenTelemetry Protocol) traces, process them (e.g., batching, sampling), and then export them to your backend.

Hands-on Guide: Building LLM Observability Pipeline on Kubernetes

Let’s put theory into practice. We’ll deploy our simple LLM agent application, then set up Prometheus, Grafana, OpenTelemetry Collector, and Loki to observe it, building out comprehensive LLM observability on Kubernetes.

Prerequisites

Ensure you have:

  • A running Kubernetes cluster (Minikube, k3s, or a cloud-managed cluster).
  • kubectl (v1.25+) installed and configured to connect to your cluster.
  • helm (v3.10+) installed.
  • docker installed (to build the application image).
  • A Docker Hub account (or other container registry) to push your image.

Step 1: Prepare the LLM Agent Application

First, let’s create our Python Flask application (app.py) which implements the agent logic and instrumentation we discussed.

# app.py
import time
import logging
import json
import os
from flask import Flask, request, jsonify
from prometheus_client import Histogram, Counter, generate_latest, REGISTRY
from opentelemetry import trace
from opentelemetry.propagate import set_global_textmap, extract
from opentelemetry.sdk.resources import Resource
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import (
    BatchSpanProcessor,
)
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
from opentelemetry.instrumentation.wsgi import OpenTelemetryMiddleware as WsgiMiddleware  # pip install opentelemetry-instrumentation-wsgi
from opentelemetry.instrumentation.requests import RequestsInstrumentor
from opentelemetry.sdk.trace.sampling import ALWAYS_ON
from opentelemetry.metrics import set_meter_provider
from opentelemetry.sdk.metrics import MeterProvider
from opentelemetry.exporter.prometheus import PrometheusMetricReader
from opentelemetry.propagators.b3 import B3MultiFormat as B3Format  # pip install opentelemetry-propagator-b3

# --- OpenTelemetry Setup ---
# Resource for the service
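resource = Resource.create({
    "service.name": "llm-agent-app",
    "service.version": "1.0.0",
    # Inside a pod, HOSTNAME defaults to the pod name
    "k8s.pod.name": os.getenv("HOSTNAME", "unknown"),
    # KUBERNETES_NAMESPACE is not set automatically; inject it in the pod spec
    # via the Downward API (fieldRef: metadata.namespace)
    "k8s.namespace.name": os.getenv("KUBERNETES_NAMESPACE", "default")
})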

# Tracer Provider
provider = TracerProvider(resource=resource, sampler=ALWAYS_ON)
# Configure OTLP exporter to send traces to the OpenTelemetry Collector in the 'observability' namespace
otlp_exporter = OTLPSpanExporter(endpoint="http://otel-collector.observability:4317", insecure=True)
span_processor = BatchSpanProcessor(otlp_exporter)
provider.add_span_processor(span_processor)
trace.set_tracer_provider(provider)
tracer = trace.get_tracer("llm.agent.tracer")

# Propagator for B3 headers (commonly used in microservices for context propagation)
set_global_textmap(B3Format())

# Instrument requests library for outgoing HTTP calls (e.g., to actual LLM API)
RequestsInstrumentor().instrument()

# --- Prometheus Metrics Setup ---
# Latency of the LLM call itself
LLM_CALL_LATENCY_SECONDS = Histogram(
    'llm_call_latency_seconds',
    'Latency of LLM API calls in seconds',
    ['model_name', 'status_code'],
    buckets=[0.1, 0.5, 1.0, 2.0, 5.0, 10.0, 20.0]
)

# Total input and output tokens
LLM_INPUT_TOKENS_TOTAL = Counter(
    'llm_input_tokens_total',
    'Total input tokens processed by LLM',
    ['model_name']
)
LLM_OUTPUT_TOKENS_TOTAL = Counter(
    'llm_output_tokens_total',
    'Total output tokens generated by LLM',
    ['model_name']
)

# Agent tool call success/failure
AGENT_TOOL_CALLS_TOTAL = Counter(
    'agent_tool_calls_total',
    'Total agent tool calls',
    ['tool_name', 'status'] # status: 'success' or 'failure'
)

# Agent overall request latency
AGENT_REQUEST_LATENCY_SECONDS = Histogram(
    'agent_request_latency_seconds',
    'End-to-end latency of agent requests in seconds',
    buckets=[0.5, 1.0, 2.0, 5.0, 10.0, 20.0, 30.0, 60.0]
)

# --- Structured Logger Setup ---
class JsonFormatter(logging.Formatter):
    def format(self, record):
        # Base log data
        log_data = {
            "timestamp": self.formatTime(record, self.datefmt),
            "level": record.levelname,
            "message": record.getMessage(),
            "service": "llm-agent-app",
            "component": record.name,
            "filename": record.filename,
            "lineno": record.lineno,
            "funcName": record.funcName,
        }

        # Attempt to parse message as JSON and merge. This supports logger.info(json.dumps({"event": "..."}))
        try:
            msg_dict = json.loads(record.getMessage())
            log_data.update(msg_dict)
            # If the JSON message didn't contain a 'message' field, keep the default one
            if "message" not in msg_dict:
                log_data["message"] = record.getMessage()
        except json.JSONDecodeError:
            pass # Message is not JSON, use original record.getMessage() as the 'message' field

        # Inject trace_id and span_id if available
        current_span = trace.get_current_span()
        if current_span and current_span.get_span_context().is_valid:
            # Zero-pad to the standard 32/16 hex characters so IDs match what tracing backends display
            log_data["trace_id"] = format(current_span.get_span_context().trace_id, '032x')
            log_data["span_id"] = format(current_span.get_span_context().span_id, '016x')
        else:
            log_data["trace_id"] = "0"
            log_data["span_id"] = "0"

        return json.dumps(log_data)

logger = logging.getLogger(__name__)
logger.setLevel(logging.INFO)
# Clear existing handlers to prevent duplicate logs from Flask's default logger
for handler in logger.handlers[:]:
    logger.removeHandler(handler)
handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
logger.addHandler(handler)


# --- Mock LLM and Agent Logic ---
def mock_llm_call(prompt: str, model_name: str = "gpt-3.5-turbo"):
    """Simulates an LLM API call."""
    with tracer.start_as_current_span("llm_api_call") as span:
        span.set_attribute("model_name", model_name)
        span.set_attribute("llm.request.type", "chat")
        span.set_attribute("llm.prompts", json.dumps([{"role": "user", "content": prompt}])) # Store as JSON string

        start_time = time.perf_counter()
        time.sleep(0.5 + 0.5 * (len(prompt) / 100)) # Simulate variable latency
        end_time = time.perf_counter()
        latency = end_time - start_time

        # Simulate token usage
        input_tokens = len(prompt.split()) + 10 # Example: base 10 tokens + words
        output_tokens = 50 + len(prompt.split()) // 2 # Example: base 50 tokens + half of input words

        response_text = f"This is a simulated response to: '{prompt}'. "
        response_action = None
        status_code = 200

        # Simulate agent action suggestion
        if "weather" in prompt.lower() or "forecast" in prompt.lower():
            response_text += "I recommend using a weather tool."
            response_action = {"tool_name": "weather_tool", "city": "New York"}
        elif "time" in prompt.lower():
            response_text += "I recommend using a time tool."
            response_action = {"tool_name": "time_tool"}
        elif "error" in prompt.lower():
            response_text += "Simulating an LLM error."
            status_code = 500
        else:
            response_text += "No tool suggested."

        span.set_attribute("llm.response.model", model_name)
        span.set_attribute("llm.response.tokens.total", input_tokens + output_tokens)
        span.set_attribute("llm.response.tokens.prompt", input_tokens)
        span.set_attribute("llm.response.tokens.completion", output_tokens)
        span.set_attribute("llm.response.content", response_text)
        span.set_attribute("http.status_code", status_code)

        LLM_CALL_LATENCY_SECONDS.labels(model_name=model_name, status_code=status_code).observe(latency)
        LLM_INPUT_TOKENS_TOTAL.labels(model_name=model_name).inc(input_tokens)
        LLM_OUTPUT_TOKENS_TOTAL.labels(model_name=model_name).inc(output_tokens)

        logger.info(json.dumps({
            "event": "llm_call_completed",
            "message": "LLM call finished",
            "model": model_name,
            "prompt": prompt,
            "response_summary": response_text,
            "latency_sec": f"{latency:.2f}",
            "input_tokens": input_tokens,
            "output_tokens": output_tokens,
            "status_code": status_code
        }))

        return {"text": response_text, "action": response_action, "status_code": status_code}

def mock_weather_tool(city: str):
    """Simulates an external weather API call."""
    with tracer.start_as_current_span("tool_call_weather") as span:
        span.set_attribute("tool.name", "weather_tool")
        span.set_attribute("tool.parameters.city", city)

        start_time = time.perf_counter()
        time.sleep(0.3)
        end_time = time.perf_counter()
        latency = end_time - start_time

        result = f"The weather in {city} is sunny with 25°C."
        status = "success"
        span.set_attribute("tool.status", status)
        span.set_attribute("tool.result", result)

        AGENT_TOOL_CALLS_TOTAL.labels(tool_name="weather_tool", status=status).inc()
        logger.info(json.dumps({
            "event": "tool_call",
            "message": "Weather tool call",
            "tool_name": "weather_tool",
            "city": city,
            "status": status,
            "latency_sec": f"{latency:.2f}"
        }))
        return result

def mock_time_tool():
    """Simulates an external time API call."""
    with tracer.start_as_current_span("tool_call_time") as span:
        span.set_attribute("tool.name", "time_tool")

        start_time = time.perf_counter()
        time.sleep(0.1)
        end_time = time.perf_counter()
        latency = end_time - start_time

        result = f"The current time is {time.strftime('%H:%M:%S')}."
        status = "success"
        span.set_attribute("tool.status", status)
        span.set_attribute("tool.result", result)

        AGENT_TOOL_CALLS_TOTAL.labels(tool_name="time_tool", status=status).inc()
        logger.info(json.dumps({
            "event": "tool_call",
            "message": "Time tool call",
            "tool_name": "time_tool",
            "status": status,
            "latency_sec": f"{latency:.2f}"
        }))
        return result

def llm_agent_handler(prompt: str, request_headers: dict):
    """The core logic of our simple LLM agent."""
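    # The WSGI middleware has already extracted the incoming trace context (via the
    # global B3 propagator) and started a SERVER span, so this span becomes its child.
    # Passing context=extract(request_headers) here would bypass that SERVER span and,
    # when no trace headers are present, start an unrelated trace.
    with tracer.start_as_current_span("llm_agent_request") as span: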
        span.set_attribute("user.query", prompt)

        agent_start_time = time.perf_counter()

        logger.info(json.dumps({"event": "agent_started", "message": "Agent request initiated", "query": prompt}))

        # Step 1: Ask the LLM for a response and an optional tool suggestion
        llm_response_data = mock_llm_call(prompt, "gpt-3.5-turbo-mock")
        final_answer = llm_response_data["text"]

        # Step 2: Agent decision making based on LLM response
        if llm_response_data["action"]:
            action = llm_response_data["action"]
            tool_name = action["tool_name"]
            logger.info(json.dumps({"event": "agent_decision", "message": "Agent decided to call tool", "decision": "call_tool", "tool_name": tool_name}))
            with tracer.start_as_current_span("agent_decision_making") as decision_span:
                decision_span.set_attribute("decision.type", "tool_call")
                decision_span.set_attribute("decision.tool_name", tool_name)

                if tool_name == "weather_tool":
                    city = action.get("city", "New York")
                    tool_result = mock_weather_tool(city)
                    final_answer = f"{final_answer}\nTool result: {tool_result}"
                elif tool_name == "time_tool":
                    tool_result = mock_time_tool()
                    final_answer = f"{final_answer}\nTool result: {tool_result}"
                else:
                    logger.warning(json.dumps({"event": "unknown_tool", "message": "Agent attempted to call unknown tool", "tool_name": tool_name}))
                    AGENT_TOOL_CALLS_TOTAL.labels(tool_name=tool_name, status='failure').inc()

        agent_end_time = time.perf_counter()
        total_latency = agent_end_time - agent_start_time
        AGENT_REQUEST_LATENCY_SECONDS.observe(total_latency)

        span.set_attribute("agent.total_latency_seconds", total_latency)
        span.set_attribute("agent.final_response", final_answer)
        logger.info(json.dumps({"event": "agent_finished", "message": "Agent request completed", "final_response_summary": final_answer, "total_latency_sec": f"{total_latency:.2f}"}))
        return final_answer

# --- Flask App ---
app = Flask(__name__)
# Wrap Flask app with OpenTelemetry WSGI middleware for automatic request tracing
app.wsgi_app = WsgiMiddleware(app.wsgi_app)

@app.route('/healthz', methods=['GET'])
def healthz():
    return "OK", 200

@app.route('/metrics', methods=['GET'])
def metrics():
    """Expose Prometheus metrics."""
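    # Prometheus expects the text exposition content type, not Flask's default text/html
    return generate_latest(), 200, {"Content-Type": "text/plain; version=0.0.4; charset=utf-8"}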

@app.route('/query', methods=['POST'])
def query_agent():
    """Handle LLM agent queries."""
    data = request.get_json(silent=True) or {}  # Tolerate a missing or malformed JSON body
    prompt = data.get('prompt')
    if not prompt:
        return jsonify({"error": "Prompt is required"}), 400

    # Pass request headers for context propagation
    response = llm_agent_handler(prompt, request.headers)
    return jsonify({"response": response})

if __name__ == '__main__':
    # PrometheusMetricReader needs to be set up globally for default registry
    reader = PrometheusMetricReader()
    meter_provider = MeterProvider(metric_readers=[reader], resource=resource)
    set_meter_provider(meter_provider)

    app.run(host='0.0.0.0', port=5000)

Now, create a Dockerfile for our application:

# Dockerfile
FROM python:3.10-slim

WORKDIR /app

COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

COPY app.py .

EXPOSE 5000

CMD ["python", "app.py"]

And requirements.txt:

Flask==2.3.3
prometheus_client==0.18.0
opentelemetry-api==1.25.0
opentelemetry-sdk==1.25.0
opentelemetry-exporter-otlp-proto-grpc==1.25.0
opentelemetry-exporter-prometheus==0.46b0
opentelemetry-propagator-b3==1.25.0
opentelemetry-instrumentation-requests==0.46b0
opentelemetry-instrumentation-wsgi==0.46b0
requests==2.31.0

Build and push your Docker image. Remember to replace your-dockerhub-user with your actual Docker Hub username.

docker build -t your-dockerhub-user/llm-agent-app:v1.0.0 .
docker push your-dockerhub-user/llm-agent-app:v1.0.0

Step 2: Deploy Observability Stack to Kubernetes

We’ll use Helm to deploy the core components for robust LLM observability. Create an observability namespace first.

kubectl create namespace observability

2.1 Deploy Prometheus Operator and Grafana

helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update

# Install kube-prometheus-stack which includes Prometheus, Grafana, and Alertmanager
helm install prometheus prometheus-community/kube-prometheus-stack \
  --namespace observability \
  --set prometheus.prometheusSpec.serviceMonitorSelectorNilUsesHelmValues=false \
  --set prometheus.prometheusSpec.podMonitorSelectorNilUsesHelmValues=false \
  --set grafana.enabled=true \
  --set grafana.adminPassword="prom-operator" \
  --set grafana.service.type="LoadBalancer" # Use "NodePort" for Minikube or local clusters

Wait for Prometheus and Grafana pods to be ready. You can get Grafana’s LoadBalancer IP (or NodePort):

kubectl get svc -n observability prometheus-grafana

Log in to Grafana using admin as the username and prom-operator as the password.

2.2 Deploy Loki and Fluent Bit for Logging

helm repo add grafana https://grafana.github.io/helm-charts
helm repo update

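# Install Loki (these --set keys follow the older single-binary chart layout; confirm them with `helm show values grafana/loki` for your chart version)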
helm install loki grafana/loki \
  --namespace observability \
  --set service.type="ClusterIP" \
  --set persistence.enabled=true \
  --set persistence.size=10Gi # Adjust size as needed

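# Install Fluent Bit (confirm the --set keys with `helm show values grafana/fluent-bit`; quote keys containing [0] if your shell expands brackets)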
helm install fluent-bit grafana/fluent-bit \
  --namespace observability \
  --set config.service.flush=1 \
  --set config.inputs[0].name=tail \
  --set config.inputs[0].path="/var/log/containers/*.log" \
  --set config.inputs[0].multiline.parser=docker,cri \
  --set config.inputs[0].db=/var/log/flb_kube.db \
  --set config.inputs[0].mem_buf_limit=5MB \
  --set config.outputs[0].name=loki \
  --set config.outputs[0].host="loki.observability.svc.cluster.local" \
  --set config.outputs[0].port=3100 \
  --set config.outputs[0].labels="job=fluent-bit" \
  --set config.outputs[0].removeKeys="kubernetes.host,kubernetes.labels,kubernetes.annotations,kubernetes.pod_id,kubernetes.container_id,kubernetes.docker_id,kubernetes.container_hash,kubernetes.container_image,kubernetes.daemonset,kubernetes.deployment" \
  --set config.outputs[0].labelMapPath="/fluent-bit/config/label_map.json" \
  --set extraFiles.label_map_json="{\"kubernetes\": {\"container_name\": \"container\", \"namespace_name\": \"namespace\", \"pod_name\": \"pod\"}}" # Maps K8s metadata to Loki labels

2.3 Deploy OpenTelemetry Collector

Create otel-collector.yaml:

# otel-collector.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: otel-collector
  namespace: observability
  labels:
    app: otel-collector
spec:
  replicas: 1
  selector:
    matchLabels:
      app: otel-collector
  template:
    metadata:
      labels:
        app: otel-collector
    spec:
      containers:
      - name: otel-collector
        image: otel/opentelemetry-collector:0.100.0 # Use a specific, stable version
        command:
        - "/otelcol"
        - "--config=/conf/otel-collector-config.yaml"
        ports:
        - containerPort: 4317 # OTLP gRPC receiver
        - containerPort: 4318 # OTLP HTTP receiver
        volumeMounts:
        - name: otel-collector-config-vol
          mountPath: /conf
      volumes:
      - name: otel-collector-config-vol
        configMap:
          name: otel-collector-config
---
apiVersion: v1
kind: ConfigMap
metadata:
  name: otel-collector-config
  namespace: observability
data:
  otel-collector-config.yaml: |
    receivers:
      otlp:
        protocols:
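          grpc:
            endpoint: 0.0.0.0:4317 # Listen on all interfaces so other pods can reach the receiver
          http:
            endpoint: 0.0.0.0:4318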
    processors:
      batch: # Batching for efficiency
        send_batch_size: 100
        timeout: 10s
    exporters:
      logging: # For demo purposes, logs traces to stdout. Newer collector releases replace this exporter with `debug`.
        verbosity: detailed
      # Example for a tracing backend: both Jaeger and Tempo accept OTLP, so use an otlp exporter
      # (the dedicated jaeger exporter was removed from recent collector releases).
      # otlp/tempo:
      #   endpoint: "tempo.observability:4317" # Assuming Tempo is deployed
      #   tls:
      #     insecure: true
    service:
      pipelines:
        traces:
          receivers: [otlp]
          processors: [batch]
          exporters: [logging] # Change to [otlp/tempo] once the exporter above is enabled
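---
# Service so the app can reach the collector at otel-collector.observability:4317
apiVersion: v1
kind: Service
metadata:
  name: otel-collector
  namespace: observability
  labels:
    app: otel-collector
spec:
  selector:
    app: otel-collector
  ports:
  - name: otlp-grpc
    port: 4317
    targetPort: 4317
  - name: otlp-http
    port: 4318
    targetPort: 4318

Apply the manifest:

kubectl apply -f otel-collector.yaml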
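
Before deploying the app, it can help to confirm that the collector's OTLP gRPC port accepts connections. The following is a minimal sketch using only the Python standard library. It assumes you have forwarded the port to your machine first, for example with kubectl port-forward -n observability deployment/otel-collector 4317:4317. The helper name can_connect is illustrative, and a successful TCP connect only shows that something is listening, not that the OTLP pipeline is healthy.

```python
import socket


def can_connect(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        # create_connection resolves the host and performs the TCP handshake
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and DNS failures
        return False
```

With the port-forward running, can_connect("127.0.0.1", 4317) should return True; a False result usually means the pod is not ready or the port-forward has exited.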

Step 3: Deploy the LLM Agent Application to Kubernetes

Now, deploy your instrumented application. Create llm-agent-app.yaml. Remember to replace your-dockerhub-user with your actual Docker Hub username.

# llm-agent-app.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: llm-agent-app
  namespace: default # Deploy in default or your app namespace
  labels:
    app: llm-agent-app
spec:
  replicas: 1
  selector:
    matchLabels:
      app: llm-agent-app
  template:
    metadata:
      labels:
        app: llm-agent-app
      annotations:
        prometheus.io/scrape: "true" # Enable Prometheus scraping
        prometheus.io/port: "5000"   # Port where metrics are exposed by the application
        prometheus.io/path: "/metrics"
    spec:
      containers:
      - name: llm-agent-app
        image: your-dockerhub-user/llm-agent-app:v1.0.0 # Replace with your image
        ports:
        - containerPort: 5000
          name: http-app # This port will serve both the application and /metrics
        env:
        - name: KUBERNETES_NAMESPACE
          valueFrom:
            fieldRef:
              fieldPath: metadata.namespace
        resources:
          requests:
            cpu: "200m"
            memory: "512Mi"
          limits:
            cpu: "1000m"
            memory: "1Gi"
---
apiVersion: v1
kind: Service
metadata:
  name: llm-agent-app
  namespace: default
  labels:
    app: llm-agent-app
spec:
  selector:
    app: llm-agent-app
  ports:
  - protocol: TCP
    port: 80 # Service exposed port
    targetPort: http-app # Maps to containerPort: 5000
    name: http # Name of this service port
  type: LoadBalancer # Use "NodePort" for local clusters like Minikube/k3s
---
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: llm-agent-app-sm
  namespace: observability # ServiceMonitor should be in the same namespace as Prometheus
  labels:
    release: prometheus # This label links to the Prometheus instance from kube-prometheus-stack
spec:
  selector:
    matchLabels:
      app: llm-agent-app # Selects services with this label
  endpoints:
  - port: http # Name of the port in the Service that exposes the metrics endpoint
    path: /metrics
    interval: 15s
    scrapeTimeout: 10s
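  namespaceSelector:
    matchNames:
    - default # Or the namespace where your app is deployed
  targetLabels:
  - app # Copy the Service's app label onto scraped series so queries can filter on app="llm-agent-app"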
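
Apply the manifest. Every object in the file sets its own namespace, so omit -n (kubectl rejects a -n value that conflicts with the ServiceMonitor's observability namespace):

kubectl apply -f llm-agent-app.yaml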

Wait for the llm-agent-app pod and service to be ready. Get its external IP:

kubectl get svc llm-agent-app -n default

Step 4: Interact with the Agent to Generate Data

Use curl or a simple script to send queries to your agent. This will generate logs, metrics, and traces, populating your LLM observability stack.

# For LoadBalancer, get the external IP
AGENT_IP=$(kubectl get svc llm-agent-app -n default -o jsonpath='{.status.loadBalancer.ingress[0].ip}')

# For NodePort (e.g., Minikube), get Minikube IP and NodePort
# AGENT_IP=$(minikube ip)
# AGENT_PORT=$(kubectl get svc llm-agent-app -n default -o jsonpath='{.spec.ports[?(@.name=="http")].nodePort}')
# export AGENT_URL="http://$AGENT_IP:$AGENT_PORT"

# Use AGENT_IP for LoadBalancer, or AGENT_URL for NodePort
export TARGET_URL="http://$AGENT_IP" # or $AGENT_URL if using NodePort

echo "Sending requests to $TARGET_URL/query"

# Send some queries
curl -X POST -H "Content-Type: application/json" -d '{"prompt": "What is the weather like in London today?"}' $TARGET_URL/query
curl -X POST -H "Content-Type: application/json" -d '{"prompt": "Tell me a fun fact about Kubernetes."}' $TARGET_URL/query
curl -X POST -H "Content-Type: application/json" -d '{"prompt": "What time is it?"}' $TARGET_URL/query
curl -X POST -H "Content-Type: application/json" -d '{"prompt": "Simulate an error now."}' $TARGET_URL/query # Test error paths

# Send more queries to generate enough data for dashboards
for i in $(seq 1 10); do
  curl -X POST -H "Content-Type: application/json" -d '{"prompt": "Give me another interesting fact."}' $TARGET_URL/query &>/dev/null
  sleep 0.5
  curl -X POST -H "Content-Type: application/json" -d '{"prompt": "What is the current weather?"}' $TARGET_URL/query &>/dev/null
  sleep 0.5
done

echo "Requests sent. Data should now be flowing into your observability stack."

Repeat these calls a few times to generate sufficient data for visualization.
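
If you prefer a script to the curl loop, the same traffic can be generated from Python. This is a sketch using only the standard library: build_request and send_all are illustrative names, the prompts mirror the curl examples above, and the base URL is whatever you assigned to TARGET_URL.

```python
import json
import urllib.request
from typing import Iterable, List

# Prompts mirroring the curl examples: tool paths, a plain answer, and the error path
PROMPTS = [
    "What is the weather like in London today?",
    "Tell me a fun fact about Kubernetes.",
    "What time is it?",
    "Simulate an error now.",
]


def build_request(base_url: str, prompt: str) -> urllib.request.Request:
    """Build a POST request for the agent's /query endpoint."""
    body = json.dumps({"prompt": prompt}).encode("utf-8")
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/query",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send_all(base_url: str, prompts: Iterable[str], timeout: float = 30.0) -> List[int]:
    """Send each prompt in order and return the HTTP status codes."""
    statuses = []
    for prompt in prompts:
        with urllib.request.urlopen(build_request(base_url, prompt), timeout=timeout) as resp:
            statuses.append(resp.status)
    return statuses
```

Calling send_all(TARGET_URL, PROMPTS * 5) sends twenty requests in sequence, which is usually enough to populate the latency histograms.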

Step 5: Visualize LLM Observability Data in Grafana

Access Grafana using the LoadBalancer IP or NodePort you obtained earlier.

5.1 Prometheus Dashboard for LLM Metrics

  1. Go to Connections -> Data sources and ensure Prometheus is configured (it should be automatically set up by kube-prometheus-stack).
  2. Create a new Dashboard (+ -> New Dashboard).
  3. Add new panels with the following PromQL queries to monitor your LLM observability on Kubernetes:
    • LLM Call Latency (95th Percentile):
      • Query: histogram_quantile(0.95, sum by(le, model_name) (rate(llm_call_latency_seconds_bucket{app="llm-agent-app", namespace="default"}[5m])))
    • LLM Token Usage (Input/Output Rate):
      • Query: sum by(model_name) (rate(llm_input_tokens_total{app="llm-agent-app", namespace="default"}[5m]))
      • Query: sum by(model_name) (rate(llm_output_tokens_total{app="llm-agent-app", namespace="default"}[5m]))
    • Agent Request Latency (99th Percentile):
      • Query: histogram_quantile(0.99, sum by(le) (rate(agent_request_latency_seconds_bucket{app="llm-agent-app", namespace="default"}[5m]))) (99th percentile end-to-end agent latency)
    • Agent Tool Call Success/Failure Rates:
      • Query: sum by(tool_name, status) (rate(agent_tool_calls_total{app="llm-agent-app", namespace="default"}[5m]))
    • Estimated Cost (Example, adjust token costs as per your LLM provider):
      • Query: (sum(rate(llm_input_tokens_total{app="llm-agent-app", namespace="default"}[5m])) / 1000 * 0.01) + (sum(rate(llm_output_tokens_total{app="llm-agent-app", namespace="default"}[5m])) / 1000 * 0.03)
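      • Explanation: This query estimates spend using hypothetical prices of $0.01 per 1,000 input tokens and $0.03 per 1,000 output tokens. Because rate() returns a per-second value, the result is in dollars per second; multiply by 3600 for an hourly figure.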
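
The arithmetic behind the cost query can be checked by hand. The sketch below mirrors the PromQL expression in Python, using the same hypothetical prices; the function names are illustrative, and the inputs are token rates in tokens per second, which is what rate() returns.

```python
# Hypothetical prices from the example query; replace with your provider's pricing
INPUT_PRICE_PER_1K = 0.01
OUTPUT_PRICE_PER_1K = 0.03


def cost_per_second(input_tokens_per_s: float, output_tokens_per_s: float) -> float:
    """Convert token rates (tokens/second) into estimated dollars per second."""
    return (input_tokens_per_s / 1000 * INPUT_PRICE_PER_1K
            + output_tokens_per_s / 1000 * OUTPUT_PRICE_PER_1K)


def cost_per_hour(input_tokens_per_s: float, output_tokens_per_s: float) -> float:
    """Scale the per-second estimate to an hourly figure."""
    return cost_per_second(input_tokens_per_s, output_tokens_per_s) * 3600


if __name__ == "__main__":
    # Example: 200 input tokens/s and 100 output tokens/s
    print(f"${cost_per_second(200, 100):.4f} per second")
    print(f"${cost_per_hour(200, 100):.2f} per hour")
```

At those example rates the estimate works out to $0.005 per second, or $18 per hour, which shows why a per-second panel can look deceptively small.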

5.2 Loki Dashboard for LLM Logs

  1. Go to Connections -> Data sources and add Loki as a new data source:

    • Name: Loki
    • URL: http://loki.observability.svc.cluster.local:3100
  2. Add a new panel to your dashboard. Set the visualization type to Logs.

  3. Select Loki as your data source.

  4. Log Queries (examples):

    • Show all logs from your app: {container="llm-agent-app", namespace="default"}
    • Filter for LLM calls: {container="llm-agent-app", namespace="default"} | json | event="llm_call_completed"
    • Filter for agent tool calls: {container="llm-agent-app", namespace="default"} | json | event="tool_call"
    • Filter for specific trace ID: {container="llm-agent-app", namespace="default"} | json | trace_id="<your-trace-id>" (You can pick a trace_id from another log query or from the collector's trace output).
    • Filter for LLM errors: {container="llm-agent-app", namespace="default"} | json | status_code="500"

    The | json pipe command is crucial as our application emits structured JSON logs, allowing you to filter and parse fields within the log lines. You can then use | line_format "{{.message}}" to display only the message, | level="ERROR" to filter by log level, or | prompt=~".*weather.*" to search within specific fields.
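
To build intuition for what | json followed by a label filter does, here is a rough local equivalent in Python. It is a sketch, not Loki's implementation: filter_events is an illustrative name, the sample lines imitate the fields the app logs, and unlike Loki it silently skips lines that are not JSON.

```python
import json
from typing import Dict, Iterable, Iterator

# Sample lines imitating the app's structured logs, plus one non-JSON line
SAMPLE_LINES = [
    '{"event": "tool_call", "tool_name": "weather_tool", "status": "success"}',
    '{"event": "llm_call_completed", "model": "gpt-3.5-turbo-mock", "status_code": 500}',
    '{"event": "llm_call_completed", "model": "gpt-3.5-turbo-mock", "status_code": 200}',
    "plain text line that is not JSON",
]


def filter_events(lines: Iterable[str], **wanted: str) -> Iterator[Dict]:
    """Yield parsed records whose fields equal every wanted value (compared as strings)."""
    for line in lines:
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # Not JSON, so there are no fields to match
        if not isinstance(record, dict):
            continue
        # Extracted fields behave like string labels, hence the str() comparison
        if all(key in record and str(record[key]) == value for key, value in wanted.items()):
            yield record


if __name__ == "__main__":
    for rec in filter_events(SAMPLE_LINES, event="llm_call_completed", status_code="500"):
        print(rec["model"], rec["status_code"])
```

The string comparison is the detail worth noticing: it is why the LogQL filter status_code="500" matches even though the application logs the status code as a number.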

5.3 OpenTelemetry Traces (Optional: If Jaeger/Tempo is set up)

If you had deployed a full tracing backend like Jaeger (or Tempo), you would typically configure Jaeger as a data source in Grafana. Then you could navigate to the Traces section in Grafana, search by service name (llm-agent-app) and trace ID to visualize the full agent execution flow with all its spans, greatly enhancing your LLM observability.

For this tutorial, since we configured the OpenTelemetry Collector to log traces to stdout, you can inspect the collector’s logs to see the trace data being received.

kubectl logs -f -n observability deployment/otel-collector

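You’ll see detailed output for each span, confirming that your application is successfully sending trace data to the collector. The exporter prints a plain-text dump rather than JSON; the fields follow the OTLP structure, shown here in simplified, annotated form: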

{
  "resource": {
    "attributes": [
      {"key": "service.name", "value": {"stringValue": "llm-agent-app"}},
      {"key": "k8s.pod.name", "value": {"stringValue": "llm-agent-app-..."}},
      {"key": "k8s.namespace.name", "value": {"stringValue": "default"}}
    ]
  },
  "scopeSpans": [
    {
      "scope": {"name": "llm.agent.tracer", "version": "1.0.0"},
      "spans": [
        {
          "traceId": "...",
          "spanId": "...",
          "parentSpanId": "...",
          "name": "llm_agent_request",
          "kind": "SPAN_KIND_INTERNAL", # The SERVER span is a separate span created by the WSGI middleware
          "startTimeUnixNano": "...",
          "endTimeUnixNano": "...",
          "attributes": [
            {"key": "user.query", "value": {"stringValue": "..."}},
            {"key": "agent.total_latency_seconds", "value": {"doubleValue": 1.23}},
            ...
          ],
          "events": [],
          "status": {"code": "STATUS_CODE_UNSET"}
        },
        {
          "traceId": "...",
          "spanId": "...",
          "parentSpanId": "...", # Parent will be llm_agent_request's spanId
          "name": "llm_api_call",
          "kind": "SPAN_KIND_INTERNAL",
          "attributes": [
            {"key": "model_name", "value": {"stringValue": "gpt-3.5-turbo-mock"}},
            {"key": "llm.response.tokens.total", "value": {"intValue": 120}},
            ...
          ],
          ...
        }
        # ... more spans for agent_decision_making, tool_call_weather etc.
      ]
    }
  ]
}

This confirms traces are being generated and processed by the collector. Integrating a full Jaeger or Tempo setup is a worthy next step, but is beyond the immediate scope of this tutorial.
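
When you correlate logs with traces, the IDs must be rendered the same way on both sides. Trace IDs are 128-bit values conventionally written as 32 lowercase hex characters, and span IDs are 64-bit values written as 16. The sketch below shows that rendering with plain Python formatting; the helper names are illustrative.

```python
def format_trace_id(trace_id: int) -> str:
    """Render a 128-bit trace ID as 32 zero-padded lowercase hex characters."""
    return format(trace_id, "032x")


def format_span_id(span_id: int) -> str:
    """Render a 64-bit span ID as 16 zero-padded lowercase hex characters."""
    return format(span_id, "016x")


if __name__ == "__main__":
    # A small value makes the zero padding visible
    print(format_trace_id(0xABC))
    print(format_span_id(0xABC))
```

Without the zero padding, an ID with leading zeros would be logged in a shorter form and a search for the full 32-character ID copied from a tracing backend would find nothing.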

Actionable Insights, Alerting, and Troubleshooting for LLM Observability

Collecting data is only half the battle. You need to interpret it to derive actionable insights and set up effective alerts for your LLM observability on Kubernetes.

Interpreting LLM Observability Data

  • Performance Bottlenecks:
    • High LLM Call Latency (P95/P99): Check llm_call_latency_seconds_bucket. Is it the model itself, the network, or rate limiting from the LLM provider? If P95 LLM call latency is consistently above 5 seconds, it might indicate network issues or an overloaded LLM endpoint.
    • High Agent Request Latency: Examine traces (llm_agent_request span) to pinpoint which step (LLM call, tool call, internal logic) is contributing most to the delay. Look for unusually long tool_call spans.
    • Low Throughput: Is the request rate, rate(agent_request_latency_seconds_count[5m]) in the sample app, not meeting expectations? Check CPU/memory utilization of the pod. Is the LLM slow, or is the application itself bottlenecked? A 20% drop in QPS without a corresponding reduction in load might indicate an issue.
  • Cost Overruns:
    • Spikes in Token Usage: Monitor llm_input_tokens_total and llm_output_tokens_total. Are prompts getting unexpectedly long? Is the model generating excessively verbose responses? This could indicate a prompt engineering issue or model drift. A sudden 50% increase in output tokens per request could drastically raise costs.
    • Frequent LLM API Calls: The LLM call count (llm_call_latency_seconds_count in the sample app) can highlight agents making too many iterative LLM calls, indicating inefficiency.
  • Model Degradation & Agent Failures:
    • Increased LLM Error Rates: Monitor sum by(status_code) (rate(llm_call_latency_seconds_count{status_code!="200"}[5m])). Correlate with logs to see actual error messages and prompts.
    • Increased Agent Goal Failures: The sample app does not export a goal-level metric; if you add a counter such as agent_goal_failures_total, watch what type of queries are failing. Dive into traces for these failed requests to see the agent’s decision path.
    • Decreased Tool Call Success Rates: Watch agent_tool_calls_total{status="failure"}. Is a specific external tool failing? This points to external dependency issues.
    • Unexpected Agent Behavior: Use Loki to search for keywords in prompt/response logs or agent decision logs. “Why did the agent choose X tool here?” can often be answered by reviewing the log of its thought process.
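
The P95 and P99 figures discussed above are estimates derived from histogram buckets, which matters when you interpret them. The sketch below is a simplified re-implementation of the idea behind histogram_quantile: find the bucket containing the target rank, then interpolate linearly inside it. The function is illustrative and omits edge cases that Prometheus handles, and the counts are made up for the example.

```python
import math
from typing import List, Tuple


def histogram_quantile(q: float, buckets: List[Tuple[float, float]]) -> float:
    """Estimate a quantile from (upper_bound, cumulative_count) pairs.

    Buckets must be sorted by upper bound and end with the +Inf bucket.
    """
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    prev_bound, prev_count = 0.0, 0.0
    for bound, count in buckets:
        if count >= rank and count > prev_count:
            if math.isinf(bound):
                # The rank falls in the +Inf bucket: report the highest finite bound
                return prev_bound
            # Linear interpolation between the bucket's lower and upper bounds
            return prev_bound + (bound - prev_bound) * (rank - prev_count) / (count - prev_count)
        prev_bound, prev_count = bound, count
    return prev_bound


if __name__ == "__main__":
    # Bucket bounds match llm_call_latency_seconds; the counts are invented
    observed = [(0.1, 0), (0.5, 10), (1.0, 60), (2.0, 90), (5.0, 98),
                (10.0, 100), (20.0, 100), (math.inf, 100)]
    print(histogram_quantile(0.95, observed))
```

With these counts the 95th-percentile rank lands in the 2 to 5 second bucket, so the estimate can be no more precise than that bucket's width. If a latency target sits inside a wide bucket, consider adding bucket boundaries around it.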

Setting Up Effective Alerts for LLM Workloads

Alerts should be proactive and actionable. Use Grafana Alerting (integrated with Prometheus Alertmanager).

  • Critical Alerts (PagerDuty, Slack):
    • High Error Rate: sum(increase(llm_call_latency_seconds_count{status_code!="200"}[5m])) > 5 (more than 5 LLM errors in a 5-minute window, for instance).
    • Service Unavailability: up{app="llm-agent-app", namespace="default"} == 0 (agent service is down).
    • Critical Latency Spike: histogram_quantile(0.99, sum by(le) (rate(agent_request_latency_seconds_bucket{app="llm-agent-app", namespace="default"}[5m]))) > 15 (99th percentile request latency exceeds 15 seconds).
  • Warning Alerts (Slack, Email):
    • High Token Consumption Rate: sum(increase(llm_output_tokens_total{app="llm-agent-app", namespace="default"}[1h])) > 1000000 (e.g., more than 1 million output tokens in the last hour, indicating potential cost overrun).
    • Increased Agent Tool Failures: sum(increase(agent_tool_calls_total{status="failure", app="llm-agent-app", namespace="default"}[5m])) > 0 (any tool failure in a 5-minute window could warrant investigation).
    • High GPU Utilization: DCGM_FI_DEV_GPU_UTIL{pod=~"llm-agent-app-.*"} > 90, assuming GPU metrics come from NVIDIA's DCGM exporter (indicates resource saturation, potentially leading to performance degradation on GPU-accelerated workloads).
  • Informational Alerts: Track trends without immediate action, e.g., daily cost reports.
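
Alert thresholds only make sense when their units match the PromQL function. rate() returns a per-second average over the window, while increase() returns the total growth over the window. The sketch below illustrates the difference with plain arithmetic; it ignores counter resets and the extrapolation Prometheus applies, and the function names are illustrative.

```python
def rate_per_second(count_start: float, count_end: float, window_seconds: float) -> float:
    """Average per-second growth of a counter over the window (simplified rate())."""
    return (count_end - count_start) / window_seconds


def increase(count_start: float, count_end: float) -> float:
    """Total growth of a counter over the window (simplified increase())."""
    return count_end - count_start


if __name__ == "__main__":
    # Six errors during a 5-minute (300 s) window
    start, end, window = 100, 106, 300
    print(rate_per_second(start, end, window))  # errors per second
    print(increase(start, end))                 # errors in the window
```

Six errors in five minutes is a rate of only 0.02 per second, so a condition like rate(...) > 5 would need more than 1,500 errors in the window to fire, whereas increase(...) > 5 fires on the sixth error.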

Troubleshooting Workflow for LLM Observability

  1. Alert Triggered: Receive an alert about high latency or errors.
  2. Dashboard Check: Go to your Grafana dashboard. Look at the specific metric that triggered the alert. Correlate with other metrics (e.g., if latency is high, are CPU/memory also high? Are token counts spiking?).
  3. Logs Investigation: If metrics point to an application-specific issue, jump to Loki. The metrics in this setup carry no trace IDs, so narrow the search to the incident's time range, then take the trace_id from a matching log line to follow a single request. Search for errors, warnings, or specific agent decision steps around the time of the incident. Review the full prompt/response pairs.
  4. Traces Deep Dive: If the issue is complex and spans multiple steps or services (especially for agents), use your tracing backend (Jaeger/Tempo). Find the trace for a representative failing request. Visualize the spans to identify the exact step (LLM call, tool call, external API) that introduced latency or an error. Examine attributes attached to spans for detailed context like LLM model, parameters, or tool call arguments.
  5. Identify Root Cause: Based on the correlated data, pinpoint the root cause: an inefficient prompt, a slow external tool, a bug in agent logic, or an overloaded Kubernetes node.
  6. Remediate and Verify: Implement the fix, then monitor your observability stack to verify the issue is resolved and new alerts are not triggered.
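
The dashboard check and the final verification step can also be scripted against Prometheus's HTTP API, which serves instant queries at /api/v1/query. The sketch below only builds the URL and parses a response body, so it runs without a cluster. The helper names are illustrative, and the base URL assumes you have port-forwarded the Prometheus service to localhost:9090.

```python
import json
from typing import Dict, List
from urllib.parse import urlencode


def instant_query_url(base_url: str, promql: str) -> str:
    """Build the URL for an instant query against the Prometheus HTTP API."""
    return f"{base_url.rstrip('/')}/api/v1/query?{urlencode({'query': promql})}"


def extract_values(payload: Dict) -> List[float]:
    """Pull sample values out of a successful instant-vector response."""
    if payload.get("status") != "success":
        raise ValueError(f"query failed: {payload.get('error', 'unknown error')}")
    # Each result carries value = [unix_timestamp, "value as a string"]
    return [float(item["value"][1]) for item in payload["data"]["result"]]


if __name__ == "__main__":
    print(instant_query_url("http://localhost:9090", "up"))
    # A response body in the shape the API returns for an instant vector
    sample = json.loads(
        '{"status": "success", "data": {"resultType": "vector", '
        '"result": [{"metric": {"job": "llm-agent-app"}, "value": [1700000000, "1"]}]}}'
    )
    print(extract_values(sample))
```

Fetching the URL with any HTTP client and passing the decoded JSON to extract_values gives you numbers you can compare against an alert threshold after a fix.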

Frequently Asked Questions About LLM Observability on Kubernetes

What is LLM observability and why is it important on Kubernetes?

LLM observability is the ability to understand the internal state, performance, cost, and behavior of Large Language Model (LLM) powered applications and AI agents. It’s crucial on Kubernetes because traditional monitoring tools don’t capture the unique non-deterministic nature, token usage, and complex decision-making processes inherent to LLMs and AI agents, leading to blind spots in performance and cost management.

How do I monitor LLM costs on Kubernetes?

To monitor LLM costs on Kubernetes, track metrics like llm_input_tokens_total, llm_output_tokens_total, and llm_api_calls_total. These application-level metrics, collected by Prometheus, can then be used in Grafana with PromQL queries to calculate real-time estimated costs based on your LLM provider’s pricing for tokens or API calls.
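
As a rough illustration of the cost calculation described above, the sketch below turns token counts into an estimated dollar figure and builds a matching PromQL expression for a Grafana panel. The `TokenPricing` class, both function names, and the prices in the usage example are hypothetical; substitute your provider's actual per-token rates. The metric names are the ones mentioned in this answer.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TokenPricing:
    """Hypothetical prices in USD per 1,000 tokens."""

    input_per_1k: float
    output_per_1k: float


def estimate_cost_usd(input_tokens: int, output_tokens: int, pricing: TokenPricing) -> float:
    """Estimated spend for the given input and output token counts."""
    input_cost = (input_tokens / 1000) * pricing.input_per_1k
    output_cost = (output_tokens / 1000) * pricing.output_per_1k
    return input_cost + output_cost


def hourly_cost_promql(pricing: TokenPricing) -> str:
    """Build a PromQL expression estimating cost over the last hour."""
    return (
        f"sum(increase(llm_input_tokens_total[1h])) / 1000 * {pricing.input_per_1k}"
        f" + sum(increase(llm_output_tokens_total[1h])) / 1000 * {pricing.output_per_1k}"
    )


if __name__ == "__main__":
    # Example rates only; these are not any provider's real prices.
    pricing = TokenPricing(input_per_1k=0.5, output_per_1k=1.5)
    print(estimate_cost_usd(2000, 1000, pricing))
    print(hourly_cost_promql(pricing))
```

Keeping the prices in one place makes it easier to update the dashboard when provider pricing changes. Note that `increase()` extrapolates from counter samples, so the result is an estimate and will not match your provider's invoice exactly.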

Can OpenTelemetry be used for LLM metrics and traces?

Yes. OpenTelemetry is a widely adopted, vendor-neutral standard for instrumenting applications, and it suits LLM workloads for both metrics and traces. It provides APIs and SDKs to create detailed spans for LLM calls and agent steps, and to emit custom metrics like token usage and latency, all of which can be collected and exported by the OpenTelemetry Collector.

Why do traditional monitoring tools fail for AI agents?

Traditional monitoring tools primarily focus on infrastructure metrics (CPU, memory) and simple request/response codes. They fail for AI agents because agents behave non-deterministically, follow complex multi-step reasoning processes, incur costs driven by token usage, and can hallucinate or go off-topic even when the HTTP response succeeds. Understanding these nuances requires deep application-level metrics, structured logs, and distributed traces.

What key metrics should I track for LLM performance on Kubernetes?

Key metrics for LLM performance on Kubernetes include:

  • Prompt Processing Latency
  • Response Generation Latency
  • Total Agent Request Latency
  • Throughput (Queries Per Second)
  • Tool Call Latency (for AI agents)
  • Input and Output Token Counts
  • LLM API Call Counts

These provide a comprehensive view of speed, efficiency, and resource consumption.
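
To make the latency metrics in the list above concrete, here is a small in-process sketch that times named steps and reports a 95th percentile. `StepTimer` is a hypothetical helper written for illustration; in a deployed service you would normally record these durations in a Prometheus histogram and compute percentiles in PromQL instead.

```python
import time
from collections import defaultdict
from contextlib import contextmanager
from statistics import quantiles


class StepTimer:
    """Collect wall-clock durations (in seconds) per named step."""

    def __init__(self):
        self.samples = defaultdict(list)

    @contextmanager
    def measure(self, step: str):
        """Time the enclosed block and record it under `step`."""
        start = time.perf_counter()
        try:
            yield
        finally:
            # Recorded even if the block raises, so failures still count.
            self.samples[step].append(time.perf_counter() - start)

    def p95(self, step: str) -> float:
        """95th percentile of recorded durations, or 0.0 with no samples."""
        data = self.samples[step]
        if len(data) < 2:
            return data[0] if data else 0.0
        # n=20 yields 19 cut points; the last one is the 95th percentile.
        return quantiles(data, n=20)[-1]


if __name__ == "__main__":
    timer = StepTimer()
    for _ in range(5):
        with timer.measure("tool_call"):
            time.sleep(0.01)  # stand-in for a real tool or LLM call
    print(f"tool_call p95: {timer.p95('tool_call'):.4f}s")
```

Timing each step separately (prompt processing, generation, tool calls) is what lets you tell a slow model apart from a slow external tool when total request latency rises.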

Conclusion

Building robust LLM observability on Kubernetes is not just about extending your existing monitoring stack; it requires a paradigm shift. You need to look beyond infrastructure metrics and delve deep into the nuances of LLM behavior, token economics, and agent decision-making.

By leveraging OpenTelemetry for rich instrumentation, Prometheus and Grafana for comprehensive metrics, and Loki with Fluent Bit for structured logging, you can construct a powerful, integrated observability pipeline. This pipeline gives you the visibility needed to understand performance, control costs, and proactively troubleshoot the complex, often non-deterministic world of generative AI on Kubernetes.

Your next steps should be:

  1. Refine your application instrumentation: Integrate your actual LLM calls and agent logic with OpenTelemetry and Prometheus client metrics.
  2. Experiment with diverse prompts: Send various types of queries to your agent, including edge cases and error-inducing prompts, to see how your observability stack reacts.
  3. Build custom Grafana dashboards: Create dashboards tailored to your specific LLM applications, focusing on the most critical performance, cost, and quality metrics for LLM observability on Kubernetes.
  4. Implement robust alerting: Define clear, actionable alerts based on thresholds that matter for your application’s reliability and cost efficiency.
  5. Explore tracing backends: Consider deploying Jaeger or Tempo to gain full end-to-end distributed tracing capabilities, which are invaluable for complex AI agent debugging.

The journey to effective LLM observability is iterative. Continuously refine your instrumentation, dashboards, and alerts as your AI agents evolve and your understanding of their behavior deepens.
