How to Set Up LLM Observability with OpenTelemetry
Use OpenTelemetry to add LLM observability to your Python apps. Follow our guide to trace requests, track API costs, and build monitoring dashboards in Grafana.
Large Language Model (LLM) applications are difficult to monitor. Unlike traditional software, their outputs are non-deterministic, their internal logic is a “black box”, and their costs are directly tied to unpredictable token counts. This creates a blind spot for DevOps and MLOps teams. You can’t easily answer simple questions like: Which features are costing us the most? Is a specific model’s latency degrading over time? Why did a particular user request fail inside a complex agent chain?
OpenTelemetry (OTel) provides a solution. It is a vendor-neutral, open-source standard for collecting traces, metrics, and logs. This tutorial guides you through setting up a complete, local LLM observability stack using OTel. You will learn how to manually instrument a Python application to capture detailed traces, track token usage, calculate API costs, and visualize everything in Jaeger and Grafana.
Prerequisites
Before you begin, you need a few tools installed on your local machine and a basic understanding of OpenTelemetry concepts such as traces, spans, and metrics.
- Docker and Docker Compose: We use Docker to run our local observability stack, which includes the OTel Collector, Jaeger, Prometheus, and Grafana. Ensure you have Docker Engine and Docker Compose v2.0 or newer installed.
- Python: The sample application is written in Python. You need Python 3.9+ and pip installed. This guide was tested with Python 3.11.
- An OpenAI API Key: To make real requests, you need an API key from OpenAI. You can get one from their developer platform. Once you have it, export it as an environment variable in your terminal:
export OPENAI_API_KEY="sk-..."
Replace sk-... with your actual key. The application code reads this variable to authenticate with the OpenAI API.
Why OpenTelemetry for LLM Apps?
LLM-powered applications, especially those using chains or agents, behave more like complex distributed systems than monolithic applications. A single user request might trigger multiple model invocations, database lookups, and API calls. Trying to debug this flow with just logs is incredibly painful.
This is where OpenTelemetry excels. It provides three key signals:
- Traces: A trace represents the end-to-end journey of a single request. It is composed of spans, where each span represents a single operation (like an API call or a function execution). For LLMs, traces let you visualize the entire sequence of events, see latency for each step, and pinpoint exactly where an error occurred in a chain.
- Metrics: Metrics are aggregated numerical data, like counters, gauges, and histograms. They are perfect for monitoring high-level performance and business KPIs. For GenAI, the most critical custom metrics are token counts and API costs. You can use metrics to build dashboards that answer questions like “What is our total spend on gpt-4o this week?”
- Logs: Logs are timestamped text records. OTel allows you to correlate logs directly with traces and spans, providing deep context for debugging.
By combining these signals, you move from a reactive “what happened?” approach to a proactive “why did it happen and how can we optimize it?” mindset. This tutorial focuses on traces and metrics because they solve the most immediate problems in LLM observability: understanding request flows and tracking costs.
Step 1: Launch the Local Observability Stack
First, we need a backend to receive and visualize our telemetry data. We’ll use Docker Compose to launch a pre-configured stack containing four key services:
- OTel Collector: A central agent that receives telemetry data from your application, processes it, and exports it to one or more backends.
- Jaeger: An open-source tool for visualizing traces.
- Prometheus: A time-series database for storing metrics.
- Grafana: A dashboarding tool for visualizing metrics from Prometheus.
Create a directory for your project and add the following four files.
docker-compose.yml
services:
  otel-collector:
    image: otel/opentelemetry-collector-contrib:0.105.0
    command: ["--config=/etc/otelcol-contrib/config.yaml"]
    volumes:
      - ./otel-collector-config.yaml:/etc/otelcol-contrib/config.yaml
    ports:
      - "4317:4317" # OTLP gRPC receiver
      - "4318:4318" # OTLP HTTP receiver
    depends_on:
      - jaeger
      - prometheus
  jaeger:
    image: jaegertracing/all-in-one:1.59.0
    environment:
      # Accept OTLP from the collector (on by default in recent Jaeger releases)
      - COLLECTOR_OTLP_ENABLED=true
    ports:
      - "16686:16686" # Jaeger UI
      - "14268:14268" # Jaeger collector
  prometheus:
    image: prom/prometheus:v2.53.1
    command:
      - '--config.file=/etc/prometheus/prometheus.yml'
      - '--web.enable-lifecycle'
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml
    ports:
      - "9090:9090"
  grafana:
    image: grafana/grafana:11.1.0
    ports:
      - "3000:3000"
    environment:
      - GF_SECURITY_ADMIN_PASSWORD=admin
    volumes:
      - ./grafana-datasources.yml:/etc/grafana/provisioning/datasources/datasources.yaml
otel-collector-config.yaml
This file configures the OTel Collector. It sets up receivers to accept data, processors to batch it, and exporters to send it to Jaeger and Prometheus.
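receivers:
  otlp:
    protocols:
      grpc:
        # Bind to all interfaces: recent collector versions (including the
        # 0.105.0 image used here) default to localhost, which is unreachable
        # through Docker's port mapping.
        endpoint: 0.0.0.0:4317
      http:
        endpoint: 0.0.0.0:4318
processors:
  batch:
exporters:
  # Recent collector releases no longer include a dedicated "jaeger" exporter.
  # Jaeger accepts OTLP natively, so traces are sent to it over OTLP/gRPC.
  otlp/jaeger:
    endpoint: jaeger:4317
    tls:
      insecure: true
  prometheus:
    endpoint: "0.0.0.0:8889"
service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [batch]
      exporters: [otlp/jaeger]
    metrics:
      receivers: [otlp]
      processors: [batch]
      exporters: [prometheus]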
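Once the stack from this step is running, you can exercise the collector's OTLP/HTTP receiver without any SDK at all. The following is a minimal, standard-library-only smoke test, assuming the receiver is reachable at http://localhost:4318/v1/traces (the standard OTLP/HTTP traces path); the function names and the otlp-smoke-test service name are hypothetical. It hand-builds one span in the OTLP/JSON encoding, where trace and span IDs are hex strings and enum values such as the span kind are integers.

```python
import json
import os
import time
import urllib.request

# OTLP/HTTP traces endpoint of the collector configured above (assumption: default port mapping).
COLLECTOR_TRACES_URL = "http://localhost:4318/v1/traces"


def build_smoke_test_payload(service_name: str = "otlp-smoke-test") -> dict:
    """Build a one-span OTLP/JSON payload."""
    now_ns = time.time_ns()
    return {
        "resourceSpans": [{
            "resource": {"attributes": [
                {"key": "service.name", "value": {"stringValue": service_name}},
            ]},
            "scopeSpans": [{
                "scope": {"name": "manual-smoke-test"},
                "spans": [{
                    "traceId": os.urandom(16).hex(),  # 32 hex characters
                    "spanId": os.urandom(8).hex(),    # 16 hex characters
                    "name": "smoke-test",
                    "kind": 1,  # SPAN_KIND_INTERNAL
                    "startTimeUnixNano": str(now_ns - 1_000_000),  # started 1 ms ago
                    "endTimeUnixNano": str(now_ns),
                }],
            }],
        }],
    }


def send_smoke_test(url: str = COLLECTOR_TRACES_URL) -> int:
    """POST the payload to the collector and return the HTTP status code."""
    body = json.dumps(build_smoke_test_payload()).encode("utf-8")
    request = urllib.request.Request(
        url, data=body, headers={"Content-Type": "application/json"}, method="POST"
    )
    with urllib.request.urlopen(request, timeout=5) as response:
        return response.status
```

If the collector accepts the span, send_smoke_test() should return 200, and an otlp-smoke-test service should then appear in Jaeger's Service dropdown. A connection error instead points at the receiver endpoint or port mapping.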
prometheus.yml
This file tells Prometheus where to scrape metrics from, which in our case is the OTel Collector.
global:
  scrape_interval: 15s
scrape_configs:
  - job_name: 'otel-collector'
    static_configs:
      - targets: ['otel-collector:8889']
grafana-datasources.yml
This file pre-configures Grafana to connect to our Prometheus instance.
apiVersion: 1
datasources:
  - name: Prometheus
    type: prometheus
    access: proxy
    url: http://prometheus:9090
    isDefault: true
With these four files in your project directory, launch the stack:
docker compose up -d
After a minute, verify that everything is running. The container names will be prefixed with your project directory name.
docker compose ps
You should see grafana, jaeger, otel-collector, and prometheus services in the running state. You can now access the UIs in your browser:
- Jaeger: http://localhost:16686
- Prometheus: http://localhost:9090
- Grafana: http://localhost:3000 (login with admin/admin)
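If one of the UIs does not load, a plain TCP check shows which published ports accept connections. This is a minimal sketch using only the standard library. The port map mirrors docker-compose.yml, and the check_ports name is hypothetical. A successful connection only proves that something is listening on the host port (Docker's port publishing can accept connections even when the service inside the container is unhealthy), so treat it as a first filter before reading container logs.

```python
import socket

# Host ports published by docker-compose.yml in this tutorial.
STACK_PORTS = {
    "otel-collector (OTLP gRPC)": 4317,
    "otel-collector (OTLP HTTP)": 4318,
    "jaeger (UI)": 16686,
    "prometheus": 9090,
    "grafana": 3000,
}


def check_ports(ports: dict, host: str = "localhost", timeout: float = 1.0) -> dict:
    """Return {service: True/False} depending on whether a TCP connect succeeds."""
    results = {}
    for name, port in ports.items():
        try:
            with socket.create_connection((host, port), timeout=timeout):
                results[name] = True
        except OSError:  # refused, timed out, or host unreachable
            results[name] = False
    return results


if __name__ == "__main__":
    for name, is_open in check_ports(STACK_PORTS).items():
        print(f"{name:28} {'OK' if is_open else 'NOT REACHABLE'}")
```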
Step 2: Instrument a Python App for Basic Tracing
Now let’s create a simple Python application that calls the OpenAI API and instrument it to send traces to our collector.
First, set up a virtual environment and install the required libraries.
python3 -m venv venv
source venv/bin/activate
# httpx is pinned because openai 1.35.x fails with httpx 0.28 and newer
pip install "openai==1.35.13" \
"httpx==0.27.2" \
"opentelemetry-api==1.26.0" \
"opentelemetry-sdk==1.26.0" \
"opentelemetry-exporter-otlp-proto-grpc==1.26.0"
Create a file named tracing_setup.py. This module contains the boilerplate code for initializing the OpenTelemetry SDK. Separating this makes your main application code cleaner.
tracing_setup.py
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
from opentelemetry.sdk.resources import Resource


def configure_tracer(service_name):
    """
    Configures and returns a tracer for a given service name.
    """
    resource = Resource(attributes={
        "service.name": service_name
    })
    # Set up OTLP exporter
    otlp_exporter = OTLPSpanExporter(
        endpoint="localhost:4317",  # OTel Collector gRPC endpoint
        insecure=True
    )
    # Set up trace provider and processor
    trace_provider = TracerProvider(resource=resource)
    span_processor = BatchSpanProcessor(otlp_exporter)
    trace_provider.add_span_processor(span_processor)
    # Set the global tracer provider
    trace.set_tracer_provider(trace_provider)
    return trace.get_tracer(__name__)
Now, create your main application file, app_manual.py. This script defines a simple function that takes a topic and asks an LLM to write a short story.
app_manual.py
import os
import time

from openai import OpenAI
from opentelemetry import trace  # needed for trace.Status / trace.StatusCode below

from tracing_setup import configure_tracer

# Configure the tracer with a service name
tracer = configure_tracer("genai-story-app")
# Initialize OpenAI client
client = OpenAI(api_key=os.environ.get("OPENAI_API_KEY"))


def generate_story(topic, user_id):
    # Create a parent span for the entire operation
    with tracer.start_as_current_span("generate_story") as parent_span:
        print(f"Generating story for topic: {topic}")
        # Add attributes to the span for context
        parent_span.set_attribute("app.topic", topic)
        parent_span.set_attribute("enduser.id", user_id)
        # Create a child span specifically for the OpenAI API call
        with tracer.start_as_current_span("openai.chat.completions.create") as child_span:
            try:
                start_time = time.time()
                response = client.chat.completions.create(
                    model="gpt-3.5-turbo-0125",
                    messages=[
                        {"role": "system", "content": "You are a helpful assistant that writes short stories."},
                        {"role": "user", "content": f"Write a 50-word story about {topic}."}
                    ]
                )
                duration = time.time() - start_time
                story = response.choices[0].message.content
                # Add response attributes to the child span
                child_span.set_attribute("gen_ai.system", "openai")
                child_span.set_attribute("gen_ai.request.model", "gpt-3.5-turbo-0125")
                child_span.set_attribute("llm.response.duration_ms", round(duration * 1000))
                print("Story generated successfully.")
                return story
            except Exception as e:
                # Record exceptions on the span
                child_span.record_exception(e)
                child_span.set_status(trace.Status(trace.StatusCode.ERROR, str(e)))
                print(f"An error occurred: {e}")
                raise


if __name__ == "__main__":
    story_topic = "a robot learning to paint"
    user = "user-123"
    generated_story = generate_story(story_topic, user)
    print("\n--- Generated Story ---")
    print(generated_story)
Run the application:
python app_manual.py
You should see output indicating the story was generated. Now, go to the Jaeger UI at http://localhost:16686.
In the “Service” dropdown, select genai-story-app and click “Find Traces”. You will see a waterfall diagram representing the trace. It shows one parent bar for the generate_story operation and a second, nested bar for the openai.chat.completions.create API call. Clicking on a span reveals all the attributes you added, like app.topic and enduser.id. This provides a detailed, contextual view of a single request’s lifecycle.
Step 3: Fast and Easy Tracing with Auto-instrumentation
Manual instrumentation gives you full control, but it can be verbose. For many standard libraries, auto-instrumentation can create basic spans for you with minimal code changes. Let’s try this with OpenLIT.
First, install the package.
pip install "openlit==0.1.48"
Now, create a new application file, app_auto.py. The code is much simpler.
app_auto.py
import os
from openai import OpenAI
import openlit

# Initialize OpenLIT with service name and OTLP endpoint
openlit.init(
    service_name="genai-story-app-auto",
    otlp_endpoint="http://localhost:4318"
)
client = OpenAI(api_key=os.environ.get("OPENAI_API_KEY"))


def generate_story_auto(topic):
    print(f"Generating story for topic: {topic}")
    response = client.chat.completions.create(
        model="gpt-3.5-turbo-0125",
        messages=[
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": f"Write a 50-word story about {topic}."}
        ]
    )
    story = response.choices[0].message.content
    return story


if __name__ == "__main__":
    story_topic = "a dragon who loves to bake"
    generated_story = generate_story_auto(story_topic)
    print("\n--- Generated Story ---")
    print(generated_story)
Notice the differences:
- There are no with tracer.start_as_current_span(...) blocks.
- We initialize OpenLIT once at the start.
- We don’t need our tracing_setup.py module.
Run this new script:
python app_auto.py
Go back to Jaeger. You will now see a new service called genai-story-app-auto. It will have a trace that was generated automatically. OpenLIT instruments the openai library and creates a span for the API call, automatically adding useful attributes like the model name and token counts.
Auto-instrumentation is great for getting started quickly. However, it lacks business-specific context. For example, it doesn’t know about our user_id or the feature_name this call belongs to, so it cannot tell you which users or features are driving your spend. If it reports a cost at all, that figure comes from the library’s own pricing data rather than rates you control. A hybrid approach is often best: use auto-instrumentation for standard libraries and apply manual instrumentation for your core business logic and custom metrics like cost.
Step 4: Add Cost Tracking with Custom Metrics
We’ll now modify our manual application to not only trace the request but also to calculate the cost of each LLM call and export it as a metric to Prometheus.
First, create a separate file metrics_setup.py to handle the metrics initialization.
metrics_setup.py
from opentelemetry import metrics
from opentelemetry.sdk.metrics import MeterProvider
from opentelemetry.sdk.metrics.export import PeriodicExportingMetricReader
from opentelemetry.exporter.otlp.proto.grpc.metric_exporter import OTLPMetricExporter
from opentelemetry.sdk.resources import Resource


def configure_meter_provider(service_name):
    """
    Configures and returns a meter for a given service name.
    """
    resource = Resource(attributes={
        "service.name": service_name
    })
    # Set up OTLP exporter for metrics
    metric_exporter = OTLPMetricExporter(
        endpoint="localhost:4317",
        insecure=True
    )
    # Set up the metric reader
    metric_reader = PeriodicExportingMetricReader(metric_exporter)
    # Set up the meter provider
    meter_provider = MeterProvider(resource=resource, metric_readers=[metric_reader])
    # Set the global meter provider
    metrics.set_meter_provider(meter_provider)
    return metrics.get_meter(__name__)
Next, create a pricing utility file, pricing.py. In a real-world scenario, you might fetch these prices from a central service, but a simple dictionary is enough for this tutorial. Keep this file up to date. Billing dashboards only show costs after the fact, whereas real-time cost metrics let you set up alerts and catch unexpected spending spikes as they happen. This kind of monitoring complements cost-reduction techniques such as LLM prompt caching rather than replacing them: it shows you where those techniques will pay off.
pricing.py
# Prices per 1 million tokens (as of mid-2024)
MODEL_PRICING = {
    "gpt-4o": {"prompt": 5.00, "completion": 15.00},
    "gpt-3.5-turbo-0125": {"prompt": 0.50, "completion": 1.50},
    "default": {"prompt": 1.00, "completion": 2.00}  # A fallback
}


def calculate_cost(model_name, prompt_tokens, completion_tokens):
    """
    Calculates the estimated cost of an LLM call in USD.
    """
    pricing = MODEL_PRICING.get(model_name, MODEL_PRICING["default"])
    prompt_cost = (prompt_tokens / 1_000_000) * pricing["prompt"]
    completion_cost = (completion_tokens / 1_000_000) * pricing["completion"]
    total_cost = prompt_cost + completion_cost
    return total_cost
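calculate_cost applies the same per-million-token formula to prompt and completion tokens separately. Here p and c are the prompt and completion token counts, and P_prompt and P_completion are the per-million-token prices from MODEL_PRICING. As a worked example with hypothetical token counts (30 prompt tokens and 130 completion tokens on gpt-4o at the mid-2024 prices above):

```latex
\text{cost}_{\text{USD}} = \frac{p}{10^{6}}\,P_{\text{prompt}} + \frac{c}{10^{6}}\,P_{\text{completion}}

\frac{30}{10^{6}} \times 5.00 + \frac{130}{10^{6}} \times 15.00 = 0.00015 + 0.00195 = 0.0021\ \text{USD}
```

Because the function works in floating point, printed results can carry tiny rounding artifacts; formatting the value to six decimal places, as app_full.py does below, hides them.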
Now, let’s create a new application file, app_full.py, that combines tracing and metrics. This version is our most complete example.
app_full.py
import os
import time

from openai import OpenAI
from opentelemetry import trace

from tracing_setup import configure_tracer
from metrics_setup import configure_meter_provider
from pricing import calculate_cost

# --- OpenTelemetry Setup ---
SERVICE_NAME = "genai-story-app-prod"
tracer = configure_tracer(SERVICE_NAME)
meter = configure_meter_provider(SERVICE_NAME)

# --- Custom Metric Instruments ---
cost_histogram = meter.create_histogram(
    name="genai.cost.usd",
    description="Cost of LLM calls in USD",
    unit="usd"
)
token_histogram = meter.create_histogram(
    name="llm.usage.total_tokens",
    description="Total tokens used in LLM calls",
    unit="1"
)

# --- OpenAI Client ---
client = OpenAI(api_key=os.environ.get("OPENAI_API_KEY"))


def generate_story(topic, user_id, feature_name):
    # Parent span for our business logic
    with tracer.start_as_current_span("generate_story", attributes={
        "app.topic": topic,
        "enduser.id": user_id,
        "app.feature.name": feature_name
    }) as parent_span:
        model = "gpt-4o"
        # Child span for the API call
        with tracer.start_as_current_span("openai.chat.completions.create") as child_span:
            try:
                start_time = time.time()
                response = client.chat.completions.create(
                    model=model,
                    messages=[
                        {"role": "system", "content": "You are a helpful assistant that writes short stories."},
                        {"role": "user", "content": f"Write a 100-word story about {topic}."}
                    ]
                )
                duration = time.time() - start_time
                story = response.choices[0].message.content
                # Extract usage data
                usage = response.usage
                prompt_tokens = usage.prompt_tokens
                completion_tokens = usage.completion_tokens
                total_tokens = usage.total_tokens
                # Calculate cost
                cost = calculate_cost(model, prompt_tokens, completion_tokens)
                print(f"Tokens Used: {total_tokens} | Cost: ${cost:.6f}")
                # Add attributes to the trace span
                span_attributes = {
                    "gen_ai.system": "openai",
                    "gen_ai.request.model": model,
                    "llm.usage.prompt_tokens": prompt_tokens,
                    "llm.usage.completion_tokens": completion_tokens,
                    "llm.usage.total_tokens": total_tokens,
                    "llm.response.duration_ms": round(duration * 1000),
                    # Stored as a number (not a string) so it stays numeric in trace backends
                    "gen_ai.cost.usd": round(cost, 6)
                }
                child_span.set_attributes(span_attributes)
                # Record the cost and token count as metrics
                metric_attributes = {
                    "gen_ai.request.model": model,
                    "app.feature.name": feature_name
                }
                cost_histogram.record(cost, attributes=metric_attributes)
                token_histogram.record(total_tokens, attributes=metric_attributes)
                return story
            except Exception as e:
                child_span.record_exception(e)
                child_span.set_status(trace.Status(trace.StatusCode.ERROR, str(e)))
                raise


if __name__ == "__main__":
    tasks = [
        {"topic": "a spaceship powered by music", "user": "user-456", "feature": "story-generator"},
        {"topic": "a detective who is also a ghost", "user": "user-789", "feature": "mystery-plot-creator"},
        {"topic": "a chef cooking for aliens", "user": "user-456", "feature": "story-generator"}
    ]
    for task in tasks:
        generate_story(task["topic"], task["user"], task["feature"])
    # Short pause before exiting, as a safeguard during testing. The SDK's
    # providers also flush buffered telemetry on normal interpreter exit.
    # In a long-running service, this is not necessary.
    time.sleep(5)
Key improvements in this version:
- Metrics Initialization: We now call configure_meter_provider to set up the metrics pipeline.
- Histogram Creation: We define histograms for both cost and token count. A histogram is more powerful than a simple counter because it exports a sum, a count, and bucket counts, from which Prometheus can compute totals, averages, and estimated quantiles (the quantiles are only as precise as the bucket boundaries).
- Cost Calculation: After the API call, we extract token counts from the response.usage object and pass them to our calculate_cost function.
- Recording Metrics: We use cost_histogram.record() and token_histogram.record() to send the calculated data to the OpenTelemetry SDK. We include the attributes gen_ai.request.model and app.feature.name, which become the Prometheus labels gen_ai_request_model and app_feature_name, allowing us to slice and dice the data.
- Richer Traces: We add the token counts and cost directly to the trace span. While metrics are for aggregation, having this data in the trace is invaluable for debugging a specific, expensive request.
Run the new script to generate some data:
python app_full.py
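For each call, the script prints a line with the total tokens used and the estimated cost, formatted to six decimal places. The exact values vary from run to run because the length of each generated story varies.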
Check Jaeger again. You will see the new traces for the genai-story-app-prod service. Inspect a span and notice the new llm.usage.* and gen_ai.cost.usd attributes. Now, let’s visualize the aggregated metrics.
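The llm.usage.* attribute names are this tutorial's own choice. Recent versions of the OpenTelemetry GenAI semantic conventions (still marked experimental) name the token counts gen_ai.usage.input_tokens and gen_ai.usage.output_tokens, and some backends recognize those names. To align with them without breaking existing searches, one option is to emit both sets, as in this sketch. usage_to_span_attributes is a hypothetical helper, and the SimpleNamespace object stands in for response.usage.

```python
from types import SimpleNamespace


def usage_to_span_attributes(usage, legacy_names: bool = True) -> dict:
    """Map an OpenAI-style usage object to span attributes.

    Always emits the GenAI semantic-convention names; optionally also emits the
    llm.usage.* names used in app_full.py so existing searches keep working.
    """
    attributes = {
        "gen_ai.usage.input_tokens": usage.prompt_tokens,
        "gen_ai.usage.output_tokens": usage.completion_tokens,
    }
    if legacy_names:
        attributes.update({
            "llm.usage.prompt_tokens": usage.prompt_tokens,
            "llm.usage.completion_tokens": usage.completion_tokens,
            "llm.usage.total_tokens": usage.total_tokens,
        })
    return attributes


# Stand-in for response.usage, with hypothetical token counts.
fake_usage = SimpleNamespace(prompt_tokens=30, completion_tokens=130, total_tokens=160)
print(usage_to_span_attributes(fake_usage, legacy_names=False))
```

Inside generate_story, the result could be passed straight to child_span.set_attributes(); once nothing depends on the old names, set legacy_names=False.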
Step 5: Build a Cost and Performance Dashboard in Grafana
The final piece is to visualize our custom metrics. Go to Grafana at http://localhost:3000.
- The Prometheus data source should already be configured from our grafana-datasources.yml file.
- Create a new dashboard: click the “+” icon in the top-right corner and select “New dashboard”.
- Click “Add visualization” and select the Prometheus data source.
Let’s create a few useful panels using PromQL (Prometheus Query Language).
Panel 1: Total Estimated Cost (Stat)
This panel shows the total cumulative cost.
- Visualization: Stat
- PromQL Query: sum(genai_cost_usd_sum)
- Panel Title: Total Cost
- Standard options -> Unit: Currency -> Dollars ($)
Panel 2: Cost by Model (Time Series)
This shows how costs for different models are changing over time.
- Visualization: Time series
- PromQL Query: sum by (gen_ai_request_model) (rate(genai_cost_usd_sum[5m]))
- Panel Title: Cost Rate by Model
- Legend: {{gen_ai_request_model}}
Panel 3: Total Requests by Feature (Bar Chart)
This helps identify which features of your application are driving the most LLM calls.
- Visualization: Bar chart
- PromQL Query: sum by (app_feature_name) (genai_cost_usd_count)
- Panel Title: Total Requests by Feature
Panel 4: Average Tokens per Call (Stat)
This panel shows the average number of tokens consumed per request.
- Visualization: Stat
- PromQL Query: sum(rate(llm_usage_total_tokens_sum[5m])) / sum(rate(llm_usage_total_tokens_count[5m]))
- Panel Title: Avg Tokens per Call
- Standard options -> Unit: Misc -> short
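Panel 4 divides two rates computed over the same 5-minute window. Because the _sum and _count series are scraped together, the window length cancels, and the query approximates the average number of tokens per call made during that window (approximately, because rate() also extrapolates to the window edges). With hypothetical numbers, if the sum grew by 480 tokens and the count by 3 calls over the last five minutes:

```latex
\frac{\operatorname{rate}(\text{sum}[5m])}{\operatorname{rate}(\text{count}[5m])}
\approx \frac{\Delta_{\text{sum}} / 300\,\text{s}}{\Delta_{\text{count}} / 300\,\text{s}}
= \frac{\Delta_{\text{sum}}}{\Delta_{\text{count}}}
= \frac{480}{3} = 160 \ \text{tokens per call}
```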
After adding these panels and running your script a few times, your dashboard will populate with data, showing a grid of panels with a large “Total Cost” stat, a line graph for cost rate, a bar chart for requests by feature, and another stat for average tokens. You now have a complete, real-time view of your LLM application’s cost and performance.
Troubleshooting Common Issues
When setting up an observability pipeline, a few things can go wrong.
Traces or Metrics Not Appearing
- Check the OTel Collector Logs: The first place to look is the collector. Run docker compose logs otel-collector. Look for connection errors or messages like “exporter failed”. This often points to a misconfigured endpoint.
- Verify Endpoints: Ensure your application is sending data to the correct OTel Collector port. Our setup uses localhost:4317 for gRPC. If you were using HTTP, it would be localhost:4318.
- Check Service Name: In Jaeger and Grafana, make sure you are looking for the correct service.name you defined in your application (genai-story-app-prod in our final example).
- Flush Telemetry Data: The OpenTelemetry SDK batches spans and exports metrics on a timer (every 60 seconds by default) rather than immediately. The Python SDK's providers flush on normal interpreter exit, but a short-lived script that is killed or exits abruptly can still lose its last batch. The time.sleep(5) at the end of app_full.py is a simple safeguard during testing.
Metrics are in Prometheus but Not Grafana
- Data Source Connection: In Grafana, go to “Connections” -> “Data sources” and test your Prometheus connection. Ensure the URL is http://prometheus:9090.
- Metric Name Mismatch: OpenTelemetry metrics are converted to a Prometheus-compatible format. Our histogram genai.cost.usd becomes several Prometheus metrics: genai_cost_usd_bucket, genai_cost_usd_sum, and genai_cost_usd_count. Double-check your PromQL queries against the actual metric names available in the Prometheus UI (http://localhost:9090).
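To check names without clicking through the UI, you can ask Prometheus directly. This sketch uses the Prometheus HTTP API endpoint /api/v1/label/__name__/values to list metric names, plus a simplified helper that predicts the series names an OTel histogram should produce. Both helper names are hypothetical, and the name conversion is an approximation. The real OTel-to-Prometheus translation can also append unit suffixes (such as _seconds), so always compare against what Prometheus actually reports.

```python
import json
import re
import urllib.request

PROMETHEUS_URL = "http://localhost:9090"  # published in docker-compose.yml


def expected_histogram_series(otel_name: str) -> list:
    """Approximate the Prometheus series names for an OTel histogram:
    replace disallowed characters with '_' and add the histogram suffixes."""
    base = re.sub(r"[^a-zA-Z0-9_:]", "_", otel_name)
    return [f"{base}_bucket", f"{base}_sum", f"{base}_count"]


def list_metric_names(base_url: str = PROMETHEUS_URL, match: str = "") -> list:
    """Return metric names known to Prometheus that contain `match`."""
    url = f"{base_url}/api/v1/label/__name__/values"
    with urllib.request.urlopen(url, timeout=5) as response:
        payload = json.load(response)  # {"status": "success", "data": [...]}
    return [name for name in payload["data"] if match in name]
```

For example, compare expected_histogram_series("genai.cost.usd") with list_metric_names(match="genai"). A name that appears in the first list but not the second means the query is looking for a series Prometheus never scraped.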
Incorrect Cost Calculations
- Stale Pricing Data: The pricing.py file is a manual mapping, and LLM providers update their prices. If your costs seem off, verify the prices in your code against the provider’s official pricing page.
- Model Name Mismatch: The cost calculation depends on the exact model string (gpt-4o, gpt-3.5-turbo-0125). If the model variable in your code doesn’t exactly match a key in your MODEL_PRICING dictionary, it will fall back to the default, which could lead to inaccurate costs.
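Model names can drift from your table in practice. For example, the model reported in an API response can be a dated snapshot such as gpt-4o-2024-05-13 even when you requested gpt-4o. A silent fallback to default prices is easy to miss, so one option is to normalize dated names and log a warning whenever the default is used. This is a minimal sketch; resolve_pricing_key is a hypothetical helper, and the table is copied from this tutorial's pricing.py. It strips only an exact trailing date rather than matching by prefix, because prefix matching would wrongly price a name like gpt-4o-mini as gpt-4o.

```python
import logging
import re

logger = logging.getLogger(__name__)

# Same table as pricing.py in this tutorial (USD per 1M tokens, mid-2024).
MODEL_PRICING = {
    "gpt-4o": {"prompt": 5.00, "completion": 15.00},
    "gpt-3.5-turbo-0125": {"prompt": 0.50, "completion": 1.50},
    "default": {"prompt": 1.00, "completion": 2.00},
}

# Matches a trailing snapshot date such as "-2024-05-13".
_DATE_SUFFIX = re.compile(r"-\d{4}-\d{2}-\d{2}$")


def resolve_pricing_key(model_name: str) -> str:
    """Map a model name to a MODEL_PRICING key, warning when falling back."""
    if model_name in MODEL_PRICING:
        return model_name
    base_name = _DATE_SUFFIX.sub("", model_name)
    if base_name in MODEL_PRICING:
        return base_name
    logger.warning("No pricing entry for model %r; using default prices", model_name)
    return "default"
```

calculate_cost could then look up MODEL_PRICING[resolve_pricing_key(model_name)] instead of calling MODEL_PRICING.get() with a default.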
By building this observability stack, you’ve replaced the LLM “black box” with a rich, queryable system. You can now make data-driven decisions about model selection, feature development, and cost optimization. This standardized approach using OpenTelemetry not only solves today’s problems but also future-proofs your stack, allowing you to plug in different models, backends, or cloud services without rewriting all your instrumentation code.