My Goal
In this entry, I want to monitor the performance of the trip planner agent, covering:
- the agent's internal process
- response time, length and cost
- tool usage efficiency
- user sentiment
- overall metrics
Preview
Try a few messages that trigger tool calls, for example “where to eat now”.
Now switch over to the Arize Phoenix observability platform below. Go to Projects > Trip Planner Agent v2. You should see your interactions at the top of the list. You can click a row to view more details.
Background
Why Observability?
Observability is the process of understanding a system’s internal state by collecting and analysing data such as logs, traces and metrics. It measures application performance and provides clues for troubleshooting when things go wrong.
The agent log that you see in Trip Planner Agent is a simplified example of observability. It shows you what’s happening under the hood and helps you understand why the agent responds the way it does. A full-fledged platform displays a lot more useful information for debugging.
Where things go wrong
Observability has historically been implemented in the later stages of traditional software development, as code generally functions predictably and transparently. For LLM applications and AI agents, it is necessary right from the start due to the non-deterministic and opaque nature of the underlying LLMs.
Here are some common issues or challenges with AI agents:
- Tools not being triggered, or triggered incorrectly
- Tools encountering errors or returning unexpected output
- Inconsistent agent responses to the same query
- Responses not addressing the user’s intent
- Responses taking too long
- Some issues being hard to reproduce with complex conversation context
Observability tools help simplify the process by tracking related details across all users and all conversations.
Observability
Span
A span is a unit of work for a specific operation. It contains related metadata such as timing, messages, model details and status to provide information about the operation it tracks.
There are different types of spans, such as LLM, TOOL, CHAIN and AGENT, some of which can be nested as part of a larger operation. In the platform, you can toggle between “Root Spans” and “All” to switch between grouped and flat views. The root span refers to the top-level parent span.
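To make this concrete, here is a minimal sketch of how nested spans could be emitted to Phoenix using its OpenTelemetry integration; the project name, span names and tool result are placeholders from my setup rather than anything the platform requires.

```python
# Sketch: emitting nested spans to Phoenix via its OpenTelemetry integration.
# Assumes the arize-phoenix package and a Phoenix instance running locally;
# the project name, span names and tool result are illustrative only.
from phoenix.otel import register

tracer_provider = register(project_name="Trip Planner Agent v2")
tracer = tracer_provider.get_tracer(__name__)

def plan_trip(user_input: str) -> str:
    # AGENT span: the top-level (root) span for the whole request.
    with tracer.start_as_current_span("agent_run") as agent_span:
        agent_span.set_attribute("openinference.span.kind", "AGENT")
        agent_span.set_attribute("input.value", user_input)

        # TOOL span: a nested unit of work for a single tool call.
        with tracer.start_as_current_span("get_daily_conditions_for_location") as tool_span:
            tool_span.set_attribute("openinference.span.kind", "TOOL")
            weather = "Sunny, 31°C"  # placeholder tool result
            tool_span.set_attribute("output.value", weather)

        # LLM span: a nested unit of work for the model call.
        with tracer.start_as_current_span("ChatOpenAI") as llm_span:
            llm_span.set_attribute("openinference.span.kind", "LLM")
            response = f"Based on the forecast ({weather}), here is your plan..."
            llm_span.set_attribute("output.value", response)

        agent_span.set_attribute("output.value", response)
        return response
```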

Trace
A trace is the complete set of operations or steps needed to fulfil a request. It groups one or more spans from start to finish, which allows you to see the bigger picture of how a request is processed or where it fails.
In the screen below, you can see how the user input “based on weather at singapore and osaka next week, which should i visit?” led to a chain of TOOL span (e.g. get_daily_conditions_for_location) and LLM span (ChatOpenAI) operations. Each is displayed in sequence with its latency.
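Phoenix also lets you pull recorded spans into a dataframe for ad-hoc analysis outside the UI. A rough sketch, assuming a local Phoenix instance and the arize-phoenix client (column names may differ slightly between versions):

```python
# Sketch: pulling recorded spans out of Phoenix for ad-hoc analysis.
# Assumes a local Phoenix instance and the arize-phoenix client; exact column
# names in the spans dataframe may differ between versions.
import phoenix as px

client = px.Client()
spans_df = client.get_spans_dataframe(project_name="Trip Planner Agent v2")

# Latency per span kind (LLM, TOOL, CHAIN, AGENT, ...)
spans_df["latency_s"] = (spans_df["end_time"] - spans_df["start_time"]).dt.total_seconds()
print(spans_df.groupby("span_kind")["latency_s"].describe())
```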

Session
A session is a conversational thread between the user and the system, with traces representing each interaction. This helps you see the context leading up to the current response. Unlike the chatbot interface, which shows only a single message at a time, the session view displays the full trace payload.
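To group traces into a session, every interaction in a conversation needs to share a session identifier. A minimal sketch, assuming the openinference-instrumentation helper using_session (the same effect can be achieved by setting the session.id attribute on the root span directly); plan_trip is the hypothetical agent entry point from the span sketch above:

```python
# Sketch: tagging all spans of a conversation with a shared session ID so that
# Phoenix can group its traces into one session view. Assumes the
# openinference-instrumentation package; the helper name follows OpenInference
# conventions but may vary by version.
import uuid
from openinference.instrumentation import using_session

session_id = str(uuid.uuid4())  # one ID per conversational thread

def handle_user_message(message: str) -> str:
    # Every span created inside this context carries the session ID, so each
    # trace of the conversation lands in the same session in Phoenix.
    with using_session(session_id=session_id):
        return plan_trip(message)  # hypothetical agent entry point from the span sketch
```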

Metrics
Observability platforms track a set of default metrics that address the most common needs:
- Latency: Time to complete an operation
- Token usage: Split by prompt (tokens from user input + processing context) and completion (tokens generated by the LLM)
- Cost: Based on token usage for the selected model
- Annotations: Both system and manual annotations
- LLM spans and errors
- Tool spans and errors
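Cost tracking boils down to simple arithmetic over the token counts above. A rough sketch with made-up per-token prices, just to show the shape of the calculation that the platform performs automatically for known models:

```python
# Sketch: deriving cost from token usage. The per-1K-token prices here are
# placeholders, not real rates for any specific model.
PRICE_PER_1K_PROMPT = 0.00015      # hypothetical USD per 1K prompt tokens
PRICE_PER_1K_COMPLETION = 0.0006   # hypothetical USD per 1K completion tokens

def estimate_cost(prompt_tokens: int, completion_tokens: int) -> float:
    return (
        prompt_tokens / 1000 * PRICE_PER_1K_PROMPT
        + completion_tokens / 1000 * PRICE_PER_1K_COMPLETION
    )

print(estimate_cost(prompt_tokens=2400, completion_tokens=850))  # ~0.00087
```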

Evals
Evaluations (often referred to as evals) are the process of assessing LLM outputs, especially for production use. They ensure LLM applications are reliably producing the expected interactions and indicate when performance is sub-par.
Types
Evals are generally implemented with three distinct types:
- Code: Leverages code functions/libraries for simpler assessments, producing deterministic results
- LLM-as-a-Judge: Leverages an LLM for complex assessments, producing non-deterministic results (prompt engineering techniques like few-shot examples can make it close to deterministic)
- Human Feedback: A fallback for subjective assessments where Code and LLM-as-a-Judge are not suitable
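To illustrate the difference between the first two types, here is a sketch of a code-based eval and an LLM-as-a-judge eval over the same agent response; the judge prompt, model name and word limit are my own placeholders rather than part of any evals library:

```python
# Sketch: a code-based eval (deterministic) vs an LLM-as-a-judge eval
# (non-deterministic) over the same agent response. The model name, prompt
# and word limit are placeholders, not part of any specific evals library.
from openai import OpenAI

client = OpenAI()

def word_count_eval(response: str, max_words: int = 300) -> bool:
    # Code eval: a pure function, so the same input always gives the same result.
    return len(response.split()) <= max_words

JUDGE_PROMPT = """You are grading a travel assistant's answer.
Question: {question}
Answer: {answer}
Does the answer address the user's intent? Reply with exactly one word:
"relevant" or "irrelevant"."""

def relevance_eval(question: str, answer: str) -> str:
    # LLM-as-a-judge eval: results can vary run to run; constraining the output
    # to fixed labels and setting temperature to 0 keeps it close to deterministic.
    result = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder judge model
        messages=[{"role": "user", "content": JUDGE_PROMPT.format(question=question, answer=answer)}],
        temperature=0,
    )
    return result.choices[0].message.content.strip().lower()
```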
Modes: Online vs Offline vs Guardrail
Evaluating LLM applications across their lifecycle requires a multi-pronged approach. The same evals can run in more than one mode:
- Offline: Predefined inputs in a pre-production environment. Measures baseline performance under controlled conditions.
- Online: Real-time user interactions in the production environment. Monitors actual performance under dynamic, unpredictable conditions for visibility purposes.
- Guardrails: Real-time user interactions in the production environment. Proactively limits interactions from reaching the system or the user for defensive purposes.
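As a concrete example of the guardrail mode, a simple input check can run inline before a request ever reaches the agent. A minimal sketch with a made-up blocklist and the hypothetical handle_user_message entry point from earlier:

```python
# Sketch: a guardrail runs inline on live traffic and blocks the interaction
# before it reaches the agent. The blocklist is illustrative only, and
# handle_user_message is the hypothetical entry point from the session sketch.
BLOCKED_TOPICS = ("credit card number", "passport number")

def guarded_handle(message: str) -> str:
    if any(topic in message.lower() for topic in BLOCKED_TOPICS):
        # Defensive: never forward the input to the LLM.
        return "Sorry, I can't help with that request."
    return handle_user_message(message)
```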
Goals
An agent is a system that perceives, reasons, and acts to achieve a specified outcome. The goal of agent evals is therefore to verify that the agent can reliably understand users’ intent and deliver successful real-world results without causing harm.
A good set of evals should cover the following areas:
- Fulfilment: Ensure the final outcome matches the user’s stated goals while meeting stated conditions.
- Capability: Validate core ability of the agent to perform accurately, reliably, and effectively.
- Safety & Alignment: Ensure outputs are free of bias, fair, safe, and adhere to specific policies or ethical guidelines.
- Efficiency: Ensure agent operates within acceptable operational costs and runtime efficiencies.
- Robustness: Handle unexpected inputs, edge cases, errors or suboptimal conditions gracefully.
My Custom Evals
For the trip planner, I’ve implemented 5 evals:
- Input Sentiment Polarity (safety & alignment / fulfilment): Scores user input on a range of -1 (negative) to 1 (positive) to track user satisfaction with the agent experience. Overly negative sentiment points to issues with alignment or fulfilment.
- Tool Usage Count (capability): Count of tool uses per response. Helps identify interactions with excessive tool calls.
- Tool Error Count (robustness): Count of tool errors per response. Here I track error responses such as invalid input parameters, which are technically considered successful tool calls by the platform.
- Word Count (efficiency): Thought it would be useful in case I needed to tune the length of responses. In retrospect, the default token count tracking is a good proxy for this.
- Input Language: Identifies the user input language. Enables segmentation or filtering for language-specific performance or issues.
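For reference, the first and last of these evals can be built from small off-the-shelf libraries. A rough sketch, assuming textblob for polarity and langdetect for language detection (not necessarily how an observability platform would implement them):

```python
# Sketch: two of the custom evals as plain functions run over each user input.
# Assumes the textblob and langdetect packages; output handling is my own choice.
from textblob import TextBlob
from langdetect import detect

def input_sentiment_polarity(user_input: str) -> float:
    # Polarity in [-1, 1]; persistently negative scores flag alignment or
    # fulfilment issues worth investigating.
    return TextBlob(user_input).sentiment.polarity

def input_language(user_input: str) -> str:
    # ISO 639-1 code of the detected language, e.g. "en" or "ja".
    return detect(user_input)

print(input_sentiment_polarity("the suggestions were useless"))  # negative score
print(input_language("where to eat now"))                        # likely "en"
```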
FYI
By default, the platform tracks tool status, which refers to whether the tool successfully responds. Even if the responses are technically error messages, they will be logged as success or ok. If the tool fails to respond, it will be logged as error. Hence the need to separately track error messages like the one above.
In the trace below, the eval results at the top show a Tool Error Count of 2 and a Tool Usage Count of 5. The tool output shows "content": "Error: Invalid date format for start_date. Please use YYYY-MM-DD (e.g., 2025-11-29)." while "status": "success".
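Since the platform marks these calls as successful, the Tool Error Count eval has to inspect the tool output itself. A minimal sketch that counts outputs starting with an error marker, assuming my tools follow that convention:

```python
# Sketch: counting tool "errors" that the platform still logs as successful calls.
# Relies on the convention that my tools prefix failure messages with "Error:".
def tool_error_count(tool_outputs: list[str]) -> int:
    return sum(1 for output in tool_outputs if output.startswith("Error:"))

outputs = [
    "Error: Invalid date format for start_date. Please use YYYY-MM-DD (e.g., 2025-11-29).",
    "Sunny, 31°C",
]
print(tool_error_count(outputs))  # 1
```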

What I’ve Learned
Improving with span
During the integration of the observability platform, I noticed responses taking quite long. In the example below, the trace took 31.9 seconds to complete. The chat completion steps under the ChatOpenAI spans averaged 4 seconds each.

I realised that max_completion_tokens at 10,000 was much higher than needed for responses that averaged less than 1,000 tokens for chat completion (not counting prompt input). The higher limit contributed to increased latency.

I changed the setting to 1,500 tokens and immediately experienced faster turnaround. The span latency dropped from 4 seconds to 1.6 seconds, and the overall trace latency for the same interaction dropped by half, from 31.9 seconds to 16.7 seconds.
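For context, the change itself was a one-line tweak on the model client. A sketch assuming langchain_openai's ChatOpenAI with a placeholder model name (the parameter is max_tokens in some versions and max_completion_tokens in others):

```python
# Sketch: capping completion length on the model client. The model name is a
# placeholder, and parameter naming varies across langchain_openai versions
# (max_tokens vs max_completion_tokens).
from langchain_openai import ChatOpenAI

# Before: a 10,000-token ceiling, far above the ~1,000 tokens responses actually used.
# llm = ChatOpenAI(model="gpt-4o-mini", max_tokens=10_000)

# After: a tighter 1,500-token ceiling, which cut span latency noticeably.
llm = ChatOpenAI(model="gpt-4o-mini", max_tokens=1_500)
```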

Observability is not just for AI Engineers
With LLM applications being a black box, anyone who wants to work with AI needs a good understanding of observability and evals. I hope to cover other types and methods of evals in future articles.