Intermediate · LLMOps · 60 minutes

How to Set Up LLM Observability with OpenTelemetry

Use OpenTelemetry to add LLM observability to your Python apps. Follow our guide to trace requests, track API costs, and build monitoring dashboards in Grafana.

Prerequisites

  • Python installed
  • Docker and Docker Compose installed
  • Basic knowledge of OpenTelemetry concepts

Tools Used

OpenTelemetry, Docker, Python, Jaeger, Prometheus, Grafana, OpenLIT

Large Language Model (LLM) applications are difficult to monitor. Unlike traditional software, their outputs are non-deterministic, their internal logic is a “black box”, and their costs are directly tied to unpredictable token counts. This creates a blind spot for DevOps and MLOps teams. You can’t easily answer simple questions like: Which features are costing us the most? Is a specific model’s latency degrading over time? Why did a particular user request fail inside a complex agent chain?

OpenTelemetry (OTel) provides a solution. It is a vendor-neutral, open-source standard for collecting traces, metrics, and logs. This tutorial guides you through setting up a complete, local LLM observability stack using OTel. You will learn how to manually instrument a Python application to capture detailed traces, track token usage, calculate API costs, and visualize everything in Jaeger and Grafana.

Prerequisites

Before you begin, you need a few tools installed on your local machine.

  • Docker and Docker Compose: We use Docker to run our local observability stack, which includes the OTel Collector, Jaeger, Prometheus, and Grafana. Ensure you have Docker Engine and Docker Compose v2.0 or newer installed.
  • Python: The sample application is written in Python. You need Python 3.9+ and pip installed. This guide was tested with Python 3.11.
  • An OpenAI API Key: To make real requests, you need an API key from OpenAI. You can get one from their developer platform. Once you have it, export it as an environment variable in your terminal:
export OPENAI_API_KEY="sk-..."

Replace sk-... with your actual key. The application code reads this variable to authenticate with the OpenAI API.
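
A small guard, run before any request is made, can surface a missing key early instead of failing later inside the API call. The following is a minimal sketch; the helper name require_api_key is hypothetical and is not part of the OpenAI SDK.

```python
import os


def require_api_key(var_name="OPENAI_API_KEY"):
    """Return the key stored in var_name, or raise a clear error if it is unset."""
    # require_api_key is a hypothetical helper for this tutorial,
    # not a function provided by the OpenAI SDK.
    key = os.environ.get(var_name, "").strip()
    if not key:
        raise RuntimeError(f"{var_name} is not set; export it before running the app.")
    return key
```

Calling require_api_key() at the top of a script and passing the result to the client keeps the failure mode obvious when the variable was exported in a different terminal session.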

Why OpenTelemetry for LLM Apps?

LLM-powered applications, especially those using chains or agents, behave more like complex distributed systems than monolithic applications. A single user request might trigger multiple model invocations, database lookups, and API calls. Trying to debug this flow with just logs is incredibly painful.

This is where OpenTelemetry excels. It provides three key signals:

  1. Traces: A trace represents the end-to-end journey of a single request. It is composed of spans, where each span represents a single operation (like an API call or a function execution). For LLMs, traces let you visualize the entire sequence of events, see latency for each step, and pinpoint exactly where an error occurred in a chain.
  2. Metrics: Metrics are aggregated numerical data, like counters, gauges, and histograms. They are perfect for monitoring high-level performance and business KPIs. For GenAI, the most critical custom metrics are token counts and API costs. You can use metrics to build dashboards that answer questions like “What is our total spend on gpt-4o this week?”
  3. Logs: Logs are timestamped text records. OTel allows you to correlate logs directly with traces and spans, providing deep context for debugging.

By combining these signals, you move from a reactive “what happened?” approach to a proactive “why did it happen and how can we optimize it?” mindset. This tutorial focuses on traces and metrics because they solve the most immediate problems in LLM observability: understanding request flows and tracking costs.
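
To make the relationship between a trace and its spans concrete, here is a toy model in plain Python. It is not the OpenTelemetry API: the ToySpan class and its fields are assumptions made purely for illustration. The point it shows is that every span in a request shares one trace ID, and each child span records the ID of its parent.

```python
import uuid
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class ToySpan:
    """Illustrative stand-in for a span; NOT the OpenTelemetry Span class."""
    name: str
    trace_id: str
    parent_id: Optional[str] = None
    span_id: str = field(default_factory=lambda: uuid.uuid4().hex[:16])

    def child(self, name):
        # A child shares the trace ID and points back at its parent span.
        return ToySpan(name=name, trace_id=self.trace_id, parent_id=self.span_id)


root = ToySpan(name="generate_story", trace_id=uuid.uuid4().hex)
llm_call = root.child("openai.chat.completions.create")

print(llm_call.trace_id == root.trace_id)  # same request, same trace
print(llm_call.parent_id == root.span_id)  # nested under the parent operation
```

A tracing backend uses exactly this kind of parent/child linkage to draw the waterfall view of a request.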

Step 1: Launch the Local Observability Stack

First, we need a backend to receive and visualize our telemetry data. We’ll use Docker Compose to launch a pre-configured stack containing four key services:

  • OTel Collector: A central agent that receives telemetry data from your application, processes it, and exports it to one or more backends.
  • Jaeger: An open-source tool for visualizing traces.
  • Prometheus: A time-series database for storing metrics.
  • Grafana: A dashboarding tool for visualizing metrics from Prometheus.

Create a directory for your project and add the following four files.

docker-compose.yml

version: '3.8'

services:
  otel-collector:
    image: otel/opentelemetry-collector-contrib:0.105.0
    command: ["--config=/etc/otelcol-contrib/config.yaml"]
    volumes:
      - ./otel-collector-config.yaml:/etc/otelcol-contrib/config.yaml
    ports:
      - "4317:4317" # OTLP gRPC receiver
      - "4318:4318" # OTLP HTTP receiver
    depends_on:
      - jaeger
      - prometheus

  jaeger:
    image: jaegertracing/all-in-one:1.59.0
    ports:
      - "16686:16686" # Jaeger UI
      - "14268:14268" # Jaeger collector

  prometheus:
    image: prom/prometheus:v2.53.1
    command:
      - '--config.file=/etc/prometheus/prometheus.yml'
      - '--web.enable-lifecycle'
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml
    ports:
      - "9090:9090"

  grafana:
    image: grafana/grafana:11.1.0
    ports:
      - "3000:3000"
    environment:
      - GF_SECURITY_ADMIN_PASSWORD=admin
    volumes:
      - ./grafana-datasources.yml:/etc/grafana/provisioning/datasources/datasources.yaml

otel-collector-config.yaml

This file configures the OTel Collector. It sets up receivers to accept data, processors to batch it, and exporters to send it to Jaeger and Prometheus.

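receivers:
  otlp:
    protocols:
      # Bind explicitly to all interfaces. Recent Collector releases default
      # to localhost, which is not reachable through Docker's published ports.
      grpc:
        endpoint: 0.0.0.0:4317
      http:
        endpoint: 0.0.0.0:4318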

processors:
  batch:

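exporters:
  # The dedicated "jaeger" exporter has been removed from recent Collector
  # releases. Jaeger accepts OTLP natively, so traces are sent over OTLP gRPC.
  otlp/jaeger:
    endpoint: jaeger:4317
    tls:
      insecure: true
  prometheus:
    endpoint: "0.0.0.0:8889"

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [batch]
      exporters: [otlp/jaeger]
    metrics:
      receivers: [otlp]
      processors: [batch]
      exporters: [prometheus]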

prometheus.yml

This file tells Prometheus where to scrape metrics from, which in our case is the OTel Collector.

global:
  scrape_interval: 15s

scrape_configs:
  - job_name: 'otel-collector'
    static_configs:
      - targets: ['otel-collector:8889']

grafana-datasources.yml

This file pre-configures Grafana to connect to our Prometheus instance.

apiVersion: 1

datasources:
  - name: Prometheus
    type: prometheus
    access: proxy
    url: http://prometheus:9090
    isDefault: true

With these four files in your project directory, launch the stack:

docker-compose up -d

After a minute, verify that everything is running. The container names will be prefixed with your project directory name.

docker-compose ps

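You should see the grafana, jaeger, otel-collector, and prometheus services in the running state. You can now access the UIs in your browser:

  • Jaeger: http://localhost:16686
  • Prometheus: http://localhost:9090
  • Grafana: http://localhost:3000 (log in as admin with the password admin set in docker-compose.yml)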
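
If you prefer to check the stack from a script rather than a browser, a plain TCP probe of the published ports is usually enough. This is a minimal sketch using only the standard library; the helper names port_open and check_stack are hypothetical, and the port list is taken from docker-compose.yml above. An open port only shows that something is listening, not that the service is healthy.

```python
import socket

# Ports published by the docker-compose.yml above.
STACK_PORTS = {
    "otel-collector (OTLP gRPC)": 4317,
    "otel-collector (OTLP HTTP)": 4318,
    "jaeger UI": 16686,
    "prometheus": 9090,
    "grafana": 3000,
}


def port_open(host, port, timeout=1.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def check_stack(host="localhost"):
    """Map each service label to whether its port accepts connections."""
    return {label: port_open(host, port) for label, port in STACK_PORTS.items()}


if __name__ == "__main__":
    for label, ok in check_stack().items():
        print(f"{label:30s} {'open' if ok else 'CLOSED'}")
```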

Step 2: Instrument a Python App for Basic Tracing

Now let’s create a simple Python application that calls the OpenAI API and instrument it to send traces to our collector.

First, set up a virtual environment and install the required libraries.

python3 -m venv venv
source venv/bin/activate
pip install "openai==1.35.13" \
  "opentelemetry-api==1.26.0" \
  "opentelemetry-sdk==1.26.0" \
  "opentelemetry-exporter-otlp-proto-grpc==1.26.0"

Create a file named tracing_setup.py. This module contains the boilerplate code for initializing the OpenTelemetry SDK. Separating this makes your main application code cleaner.

tracing_setup.py

import os
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
from opentelemetry.sdk.resources import Resource

def configure_tracer(service_name):
    """
    Configures and returns a tracer for a given service name.
    """
    resource = Resource(attributes={
        "service.name": service_name
    })

    # Set up OTLP exporter
    otlp_exporter = OTLPSpanExporter(
        endpoint="localhost:4317",  # OTel Collector gRPC endpoint
        insecure=True
    )

    # Set up trace provider and processor
    trace_provider = TracerProvider(resource=resource)
    span_processor = BatchSpanProcessor(otlp_exporter)
    trace_provider.add_span_processor(span_processor)

    # Set the global tracer provider
    trace.set_tracer_provider(trace_provider)

    return trace.get_tracer(__name__)

Now, create your main application file, app_manual.py. This script defines a simple function that takes a topic and asks an LLM to write a short story.

app_manual.py

import os
import time
from openai import OpenAI
from opentelemetry import trace  # needed for trace.Status / trace.StatusCode below
from tracing_setup import configure_tracer

# Configure the tracer with a service name
tracer = configure_tracer("genai-story-app")

# Initialize OpenAI client
client = OpenAI(api_key=os.environ.get("OPENAI_API_KEY"))

def generate_story(topic, user_id):
    # Create a parent span for the entire operation
    with tracer.start_as_current_span("generate_story") as parent_span:
        print(f"Generating story for topic: {topic}")

        # Add attributes to the span for context
        parent_span.set_attribute("app.topic", topic)
        parent_span.set_attribute("enduser.id", user_id)

        # Create a child span specifically for the OpenAI API call
        with tracer.start_as_current_span("openai.chat.completions.create") as child_span:
            try:
                start_time = time.time()
                response = client.chat.completions.create(
                    model="gpt-3.5-turbo-0125",
                    messages=[
                        {"role": "system", "content": "You are a helpful assistant that writes short stories."},
                        {"role": "user", "content": f"Write a 50-word story about {topic}."}
                    ]
                )
                duration = time.time() - start_time
                story = response.choices[0].message.content

                # Add response attributes to the child span
                child_span.set_attribute("gen_ai.system", "openai")
                child_span.set_attribute("gen_ai.request.model", "gpt-3.5-turbo-0125")
                child_span.set_attribute("llm.response.duration_ms", round(duration * 1000))

                print("Story generated successfully.")
                return story
            except Exception as e:
                # Record exceptions on the span
                child_span.record_exception(e)
                child_span.set_status(trace.Status(trace.StatusCode.ERROR, str(e)))
                print(f"An error occurred: {e}")
                raise

if __name__ == "__main__":
    story_topic = "a robot learning to paint"
    user = "user-123"
    generated_story = generate_story(story_topic, user)
    print("\n--- Generated Story ---")
    print(generated_story)

Run the application:

python app_manual.py

You should see output indicating the story was generated. Now, go to the Jaeger UI at http://localhost:16686.

In the “Service” dropdown, select genai-story-app and click “Find Traces”. You will see a waterfall diagram representing the trace. It shows one parent bar for the generate_story operation and a second, nested bar for the openai.chat.completions.create API call. Clicking on a span reveals all the attributes you added, like app.topic and enduser.id. This provides a detailed, contextual view of a single request’s lifecycle.

Step 3: Fast and Easy Tracing with Auto-instrumentation

Manual instrumentation gives you full control, but it can be verbose. For many standard libraries, auto-instrumentation can create basic spans for you with minimal code changes. Let’s try this with OpenLIT.

First, install the package.

pip install "openlit==0.1.48"

Now, create a new application file, app_auto.py. The code is much simpler.

app_auto.py

import os
from openai import OpenAI
import openlit

# Initialize OpenLIT with service name and OTLP endpoint
openlit.init(
    service_name="genai-story-app-auto",
    otlp_endpoint="http://localhost:4318"
)

client = OpenAI(api_key=os.environ.get("OPENAI_API_KEY"))

def generate_story_auto(topic):
    print(f"Generating story for topic: {topic}")
    response = client.chat.completions.create(
        model="gpt-3.5-turbo-0125",
        messages=[
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": f"Write a 50-word story about {topic}."}
        ]
    )
    story = response.choices[0].message.content
    return story

if __name__ == "__main__":
    story_topic = "a dragon who loves to bake"
    generated_story = generate_story_auto(story_topic)
    print("\n--- Generated Story ---")
    print(generated_story)

Notice the differences:

  1. There are no with tracer.start_as_current_span(...) blocks.
  2. We initialize OpenLIT once at the start.
  3. We don’t need our tracing_setup.py module.

Run this new script:

python app_auto.py

Go back to Jaeger. You will now see a new service called genai-story-app-auto. It will have a trace that was generated automatically. OpenLIT instruments the openai library and creates a span for the API call, automatically adding useful attributes like the model name and token counts.

Auto-instrumentation is great for getting started quickly. However, it lacks business-specific context. For example, it doesn’t know about our user_id or the feature_name this call belongs to, so it cannot attribute spend to a particular feature or user, and it gives you no control over how cost is calculated. A hybrid approach is often best: use auto-instrumentation for standard libraries and apply manual instrumentation for your core business logic and custom metrics like cost.

Step 4: Add Cost Tracking with Custom Metrics

We’ll now modify our manual application to not only trace the request but also to calculate the cost of each LLM call and export it as a metric to Prometheus.

First, create a separate file metrics_setup.py to handle the metrics initialization.

metrics_setup.py

from opentelemetry import metrics
from opentelemetry.sdk.metrics import MeterProvider
from opentelemetry.sdk.metrics.export import PeriodicExportingMetricReader
from opentelemetry.exporter.otlp.proto.grpc.metric_exporter import OTLPMetricExporter
from opentelemetry.sdk.resources import Resource

def configure_meter_provider(service_name):
    """
    Configures and returns a meter for a given service name.
    """
    resource = Resource(attributes={
        "service.name": service_name
    })

    # Set up OTLP exporter for metrics
    metric_exporter = OTLPMetricExporter(
        endpoint="localhost:4317",
        insecure=True
    )

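    # Set up the metric reader. The default export interval is 60 seconds;
    # a shorter interval makes metrics show up sooner while testing locally.
    metric_reader = PeriodicExportingMetricReader(
        metric_exporter,
        export_interval_millis=5000
    )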

    # Set up the meter provider
    meter_provider = MeterProvider(resource=resource, metric_readers=[metric_reader])

    # Set the global meter provider
    metrics.set_meter_provider(meter_provider)

    return metrics.get_meter(__name__)

Next, create a pricing utility file, pricing.py. In a real-world scenario, you might fetch these prices from a central service, but a simple dictionary is enough for this tutorial. Keeping this file updated is important. While you can track costs after the fact with billing dashboards, real-time cost metrics allow you to set up alerts and catch unexpected spending spikes immediately. This complements cost-reduction techniques such as LLM prompt caching: caching lowers the bill, while metrics tell you when the bill is changing.

pricing.py

# Prices per 1 million tokens (as of mid-2024)
MODEL_PRICING = {
    "gpt-4o": {"prompt": 5.00, "completion": 15.00},
    "gpt-3.5-turbo-0125": {"prompt": 0.50, "completion": 1.50},
    "default": {"prompt": 1.00, "completion": 2.00} # A fallback
}

def calculate_cost(model_name, prompt_tokens, completion_tokens):
    """
    Calculates the estimated cost of an LLM call in USD.
    """
    pricing = MODEL_PRICING.get(model_name, MODEL_PRICING["default"])

    prompt_cost = (prompt_tokens / 1_000_000) * pricing["prompt"]
    completion_cost = (completion_tokens / 1_000_000) * pricing["completion"]

    total_cost = prompt_cost + completion_cost
    return total_cost
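
As a quick sanity check of the arithmetic, the sketch below repeats the pricing table and formula so that it runs on its own, then prices one call against a listed model and one against an unlisted model. The token counts and the model name some-unlisted-model are hypothetical, chosen only to make the numbers easy to follow.

```python
# Same per-million-token prices as pricing.py above.
MODEL_PRICING = {
    "gpt-4o": {"prompt": 5.00, "completion": 15.00},
    "gpt-3.5-turbo-0125": {"prompt": 0.50, "completion": 1.50},
    "default": {"prompt": 1.00, "completion": 2.00},
}


def calculate_cost(model_name, prompt_tokens, completion_tokens):
    pricing = MODEL_PRICING.get(model_name, MODEL_PRICING["default"])
    prompt_cost = (prompt_tokens / 1_000_000) * pricing["prompt"]
    completion_cost = (completion_tokens / 1_000_000) * pricing["completion"]
    return prompt_cost + completion_cost


# Hypothetical token counts for illustration only.
known = calculate_cost("gpt-4o", prompt_tokens=1_000, completion_tokens=500)
unknown = calculate_cost("some-unlisted-model", prompt_tokens=1_000, completion_tokens=500)

print(f"gpt-4o:   ${known:.6f}")    # 1000/1e6 * 5.00 + 500/1e6 * 15.00
print(f"fallback: ${unknown:.6f}")  # 1000/1e6 * 1.00 + 500/1e6 * 2.00
```

The same call is priced very differently once it falls through to the default entry, which is why a silent fallback can skew a cost dashboard.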

Now, let’s create a new application file, app_full.py, that combines tracing and metrics. This version is our final, most complete example.

app_full.py

import os
import time
from openai import OpenAI
from opentelemetry import trace
from tracing_setup import configure_tracer
from metrics_setup import configure_meter_provider
from pricing import calculate_cost

# --- OpenTelemetry Setup ---
SERVICE_NAME = "genai-story-app-prod"
tracer = configure_tracer(SERVICE_NAME)
meter = configure_meter_provider(SERVICE_NAME)

# --- Custom Metric Instruments ---
cost_histogram = meter.create_histogram(
    name="genai.cost.usd",
    description="Cost of LLM calls in USD",
    unit="usd"
)
token_histogram = meter.create_histogram(
    name="llm.usage.total_tokens",
    description="Total tokens used in LLM calls",
    unit="1"
)

# --- OpenAI Client ---
client = OpenAI(api_key=os.environ.get("OPENAI_API_KEY"))

def generate_story(topic, user_id, feature_name):
    # Parent span for our business logic
    with tracer.start_as_current_span("generate_story", attributes={
        "app.topic": topic,
        "enduser.id": user_id,
        "app.feature.name": feature_name
    }) as parent_span:

        model = "gpt-4o"

        # Child span for the API call
        with tracer.start_as_current_span("openai.chat.completions.create") as child_span:
            try:
                start_time = time.time()
                response = client.chat.completions.create(
                    model=model,
                    messages=[
                        {"role": "system", "content": "You are a helpful assistant that writes short stories."},
                        {"role": "user", "content": f"Write a 100-word story about {topic}."}
                    ]
                )
                duration = time.time() - start_time
                story = response.choices[0].message.content

                # Extract usage data
                usage = response.usage
                prompt_tokens = usage.prompt_tokens
                completion_tokens = usage.completion_tokens
                total_tokens = usage.total_tokens

                # Calculate cost
                cost = calculate_cost(model, prompt_tokens, completion_tokens)

                print(f"Tokens Used: {total_tokens} | Cost: ${cost:.6f}")

                # Add attributes to the trace span
                span_attributes = {
                    "gen_ai.system": "openai",
                    "gen_ai.request.model": model,
                    "llm.usage.prompt_tokens": prompt_tokens,
                    "llm.usage.completion_tokens": completion_tokens,
                    "llm.usage.total_tokens": total_tokens,
                    "llm.response.duration_ms": round(duration * 1000),
                    "gen_ai.cost.usd": f"{cost:.6f}"
                }
                child_span.set_attributes(span_attributes)

                # Record the cost and token count as metrics
                metric_attributes = {
                    "gen_ai.request.model": model,
                    "app.feature.name": feature_name
                }
                cost_histogram.record(cost, attributes=metric_attributes)
                token_histogram.record(total_tokens, attributes=metric_attributes)

                return story
            except Exception as e:
                child_span.record_exception(e)
                child_span.set_status(trace.Status(trace.StatusCode.ERROR, str(e)))
                raise

if __name__ == "__main__":
    tasks = [
        {"topic": "a spaceship powered by music", "user": "user-456", "feature": "story-generator"},
        {"topic": "a detective who is also a ghost", "user": "user-789", "feature": "mystery-plot-creator"},
        {"topic": "a chef cooking for aliens", "user": "user-456", "feature": "story-generator"}
    ]
    for task in tasks:
        generate_story(task["topic"], task["user"], task["feature"])

    # In a script, the program might exit before the OTel SDK has time to export 
    # the buffered telemetry data. A short delay helps ensure data is sent.
    # In a long-running service, this is not necessary.
    time.sleep(5)

Key improvements in this version:

  1. Metrics Initialization: We now call configure_meter_provider to set up the metrics pipeline.
  2. Histogram Creation: We define histograms for both cost and token count. A histogram is more powerful than a simple counter because it allows Prometheus to calculate quantiles, averages, sums and counts.
  3. Cost Calculation: After the API call, we extract token counts from the response.usage object and use our calculate_cost function.
  4. Recording Metrics: We use cost_histogram.record() and token_histogram.record() to send the calculated data to the OpenTelemetry SDK. We include attributes (model, feature_name) which become labels in Prometheus, allowing us to slice and dice the data.
  5. Richer Traces: We add the token counts and cost directly to the trace span. While metrics are for aggregation, having this data in the trace is invaluable for debugging a specific, expensive request.
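
To see why a histogram yields sums, counts, and averages in one instrument, consider this toy model. It is an illustration only, not the OpenTelemetry or Prometheus implementation; the ToyHistogram class, the bucket boundaries, and the recorded values are all assumptions.

```python
from bisect import bisect_left


class ToyHistogram:
    """Illustrative only; not the OpenTelemetry or Prometheus implementation."""

    def __init__(self, boundaries):
        self.boundaries = sorted(boundaries)
        # One bucket per upper bound, plus a final overflow bucket.
        self.bucket_counts = [0] * (len(self.boundaries) + 1)
        self.sum = 0.0
        self.count = 0

    def record(self, value):
        # bisect_left finds the first boundary >= value (upper bounds inclusive).
        self.bucket_counts[bisect_left(self.boundaries, value)] += 1
        self.sum += value
        self.count += 1

    def average(self):
        return self.sum / self.count if self.count else 0.0


tokens = ToyHistogram(boundaries=[100, 200, 500])  # assumed boundaries
for total_tokens in (140, 142, 144, 620):          # made-up sample values
    tokens.record(total_tokens)

print(tokens.bucket_counts)  # per-bucket counts (Prometheus exposes them cumulatively)
print(tokens.count, tokens.sum, tokens.average())
```

Dividing the running sum by the running count is the same idea the “Average Tokens per Call” query relies on, and the bucket counts are what quantile estimates are derived from.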

Run the new script to generate some data:

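python app_full.py

The script prints one line per request, similar to the following (illustrative values; your token counts and costs will differ):

Tokens Used: 140 | Cost: $0.000850
Tokens Used: 142 | Cost: $0.000865
Tokens Used: 144 | Cost: $0.000885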

Check Jaeger again. You will see the new traces for the genai-story-app-prod service. Inspect a span and notice the new llm.usage.* and gen_ai.cost.usd attributes. Now, let’s visualize the aggregated metrics.

Step 5: Build a Cost and Performance Dashboard in Grafana

The final piece is to visualize our custom metrics. Go to Grafana at http://localhost:3000.

  1. The Prometheus data source should already be configured from our grafana-datasources.yml file.
  2. Create a new dashboard. Click the “+” icon in the left sidebar and select “Dashboard”.
  3. Click “Add visualization”.

Let’s create a few useful panels using PromQL (Prometheus Query Language).

Panel 1: Total Estimated Cost (Stat)

This panel shows the total cumulative cost.

  • Visualization: Stat
  • PromQL Query: sum(genai_cost_usd_sum)
  • Panel Title: Total Cost
  • Standard options -> Unit: Currency -> Dollars ($)

Panel 2: Cost by Model (Time Series)

This shows how costs for different models are changing over time.

  • Visualization: Time series
  • PromQL Query: sum by (gen_ai_request_model) (rate(genai_cost_usd_sum[5m]))
  • Panel Title: Cost Rate by Model
  • Legend: {{gen_ai_request_model}}

Panel 3: Total Requests by Feature (Bar Chart)

This helps identify which features of your application are driving the most LLM calls.

  • Visualization: Bar chart
  • PromQL Query: sum by (app_feature_name) (genai_cost_usd_count)
  • Panel Title: Total Requests by Feature

Panel 4: Average Tokens per Call (Stat)

This panel shows the average number of tokens consumed per request.

  • Visualization: Stat
  • PromQL Query: sum(rate(llm_usage_total_tokens_sum[5m])) / sum(rate(llm_usage_total_tokens_count[5m]))
  • Panel Title: Avg Tokens per Call
  • Standard options -> Unit: Misc -> Count

After adding these panels and running your script a few times, your dashboard will populate with data, showing a grid of panels with a large “Total Cost” stat, a line graph for cost rate, a bar chart for requests by feature, and another stat for average tokens. You now have a complete, real-time view of your LLM application’s cost and performance.
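
The same PromQL expressions can be tried outside Grafana through Prometheus’s HTTP API, which is handy for confirming that a query returns data before building a panel around it. The sketch below uses only the standard library; build_query_url and run_query are hypothetical helper names, and the base URL assumes the port published in docker-compose.yml. Only the URL is built when the block runs; run_query needs the stack from Step 1 to be up.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

PROMETHEUS_URL = "http://localhost:9090"  # port published in docker-compose.yml


def build_query_url(promql, base_url=PROMETHEUS_URL):
    """Build a URL for Prometheus's instant-query endpoint (/api/v1/query)."""
    return f"{base_url}/api/v1/query?{urlencode({'query': promql})}"


def run_query(promql, base_url=PROMETHEUS_URL):
    """Run an instant query and return the list of result series."""
    with urlopen(build_query_url(promql, base_url), timeout=5) as resp:
        payload = json.load(resp)
    return payload["data"]["result"]


# Building the URL needs no network access.
print(build_query_url("sum(genai_cost_usd_sum)"))
```

With the stack running, run_query("sum by (app_feature_name) (genai_cost_usd_count)") returns one entry per feature, each carrying its labels and latest value.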

Troubleshooting Common Issues

When setting up an observability pipeline, a few things can go wrong.

Traces or Metrics Not Appearing

  • Check the OTel Collector Logs: The first place to look is the collector. Run docker-compose logs otel-collector. Look for connection errors or messages like “exporter failed”. This often points to a misconfigured endpoint.
  • Verify Endpoints: Ensure your application is sending data to the correct OTel Collector port. Our setup uses localhost:4317 for gRPC. If you were using HTTP, it would be localhost:4318.
  • Check Service Name: In Jaeger and Grafana, make sure you are looking for the correct service.name you defined in your application (genai-story-app-prod in our final example).
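  • Flush Telemetry Data: The OpenTelemetry SDKs batch data before sending it. For short-lived scripts, the program might exit before the batch is sent. The time.sleep(5) at the end of app_full.py is a simple fix for this during testing; for a more deterministic result, call force_flush() or shutdown() on the tracer and meter providers before the script exits.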

Metrics are in Prometheus but Not Grafana

  • Data Source Connection: In Grafana, go to “Connections” -> “Data sources” and test your Prometheus connection. Ensure the URL is http://prometheus:9090.
  • Metric Name Mismatch: OpenTelemetry metrics are converted to a Prometheus-compatible format. Our histogram genai.cost.usd becomes several Prometheus metrics: genai_cost_usd_bucket, genai_cost_usd_sum, and genai_cost_usd_count. Double-check your PromQL queries against the actual metric names available in the Prometheus UI (http://localhost:9090).
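
The renaming described above can be approximated in a few lines, which helps when guessing what to search for in the Prometheus UI. This is a simplified sketch, not the exporter’s actual translation code: the function prometheus_series_names is hypothetical, and it only replaces characters that are invalid in Prometheus metric names. Exporters may also append unit or type suffixes depending on version and configuration, so treat the output as a starting point.

```python
import re


def prometheus_series_names(otel_histogram_name):
    """Approximate the series names a histogram produces in Prometheus.

    Simplification: characters that are invalid in Prometheus metric names
    become underscores. Real exporters may add further suffixes.
    """
    base = re.sub(r"[^a-zA-Z0-9_:]", "_", otel_histogram_name)
    return [f"{base}_{suffix}" for suffix in ("bucket", "sum", "count")]


print(prometheus_series_names("genai.cost.usd"))
print(prometheus_series_names("llm.usage.total_tokens"))
```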

Incorrect Cost Calculations

  • Stale Pricing Data: The pricing.py file is a manual mapping. LLM providers update their prices. If your costs seem off, verify the prices in your code against the provider’s official pricing page.
  • Model Name Mismatch: The cost calculation depends on the exact model string (gpt-4o, gpt-3.5-turbo-0125). If the model variable in your code doesn’t exactly match a key in your MODEL_PRICING dictionary, it will fall back to the default, which could lead to inaccurate costs.
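
One way to keep a silent fallback from going unnoticed is to make the lookup report it. The sketch below is an assumption-laden variation on pricing.py, not part of the original files: lookup_pricing is a hypothetical helper that logs a warning and returns a flag whenever the default rates are used. It deliberately avoids prefix matching, since a name such as gpt-4o-mini would otherwise be priced as gpt-4o.

```python
import logging

logger = logging.getLogger("pricing")

# Same per-million-token prices as pricing.py above.
MODEL_PRICING = {
    "gpt-4o": {"prompt": 5.00, "completion": 15.00},
    "gpt-3.5-turbo-0125": {"prompt": 0.50, "completion": 1.50},
    "default": {"prompt": 1.00, "completion": 2.00},
}


def lookup_pricing(model_name):
    """Return (pricing, used_default); warn when falling back to default rates."""
    pricing = MODEL_PRICING.get(model_name)
    if pricing is None:
        logger.warning("No pricing entry for model %r; using default rates", model_name)
        return MODEL_PRICING["default"], True
    return pricing, False


pricing, used_default = lookup_pricing("gpt-4o-2024-05-13")  # hypothetical unlisted name
print(pricing, used_default)
```

The used_default flag could also be attached as a span or metric attribute so that estimated costs based on fallback rates are easy to filter out.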

By building this observability stack, you’ve replaced the LLM “black box” with a rich, queryable system. You can now make data-driven decisions about model selection, feature development, and cost optimization. This standardized approach using OpenTelemetry not only solves today’s problems but also future-proofs your stack, allowing you to plug in different models, backends, or cloud services without rewriting all your instrumentation code.
