Agent Observability
5 exercises — master the vocabulary of making AI agent systems visible and debuggable: traces, spans, LangSmith, LangFuse, token budgets, and span-level analysis.
Agent observability vocabulary quick reference
- Trace — the complete record of one agent run (all steps, costs, inputs/outputs)
- Span — a single step within the trace (one LLM call, one tool call, etc.)
- LLM span — span recording one LLM call (prompt, completion, tokens, latency)
- Token budget — max tokens a run is allowed to consume (hard or soft limit)
- Budget-exceeded status — run terminated by resource limit, not task completion
- LangSmith / LangFuse — LLM observability platforms for tracing, eval, and cost analysis
- Span-level token analysis — inspecting token counts per step to find optimisation targets
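How these terms fit together can be sketched in plain Python. This is a conceptual model using dataclasses, not the actual schema of LangSmith or LangFuse; all names and the budget-checking logic are illustrative.

```python
from dataclasses import dataclass, field

@dataclass
class Span:
    """One step within a run: an LLM call, a tool call, etc."""
    name: str
    kind: str            # "llm" | "tool" | "agent"
    tokens: int = 0      # tokens consumed by this step
    latency_ms: int = 0

@dataclass
class Trace:
    """The complete record of one agent run."""
    trace_id: str
    token_budget: int = 10_000          # hard limit for the run
    spans: list[Span] = field(default_factory=list)
    status: str = "running"

    def total_tokens(self) -> int:
        return sum(s.tokens for s in self.spans)

    def add(self, span: Span) -> None:
        self.spans.append(span)
        if self.total_tokens() > self.token_budget:
            # Run terminated by resource limit, not task completion
            self.status = "budget_exceeded"

trace = Trace(trace_id="run-001", token_budget=5_000)
trace.add(Span("plan", "llm", tokens=1_200))
trace.add(Span("search", "tool", tokens=300))
trace.add(Span("answer", "llm", tokens=4_000))
print(trace.status, trace.total_tokens())  # budget_exceeded 5500
```

The point of the sketch: a trace owns its spans, and run-level properties (total tokens, budget status) are aggregates over those spans.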
1 / 5
In agent observability platforms like LangSmith or LangFuse, what does a "trace" represent?
A trace is the observability unit of an entire agent run — the full story of what happened from the first input to the final output.
What a trace contains:
① Run metadata — session ID, start time, duration, total cost, model used
② Input — the user's initial request
③ All spans — every step taken during the run (LLM calls, tool calls, sub-agent calls)
④ Intermediate outputs — thoughts, tool results, partial answers at each step
⑤ Final output — the agent's response to the user
⑥ Feedback (if collected) — human ratings, automated eval scores
Why traces are the fundamental observability unit:
• A single user interaction might involve 15+ LLM calls and 30+ tool calls — you need the trace to understand the full picture
• Unlike simple logs (text), traces are structured and queryable — you can filter by model, cost, step count, outcome
• Traces can be replayed, compared, and used as training examples
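Because traces are structured rather than free text, you can query them programmatically. A minimal sketch, assuming traces exported as plain dicts; the field names here are illustrative, not any platform's real export format:

```python
# Hypothetical trace records (illustrative schema, not a real export format)
traces = [
    {"trace_id": "t1", "model": "gpt-4o",      "cost_usd": 0.42, "steps": 18, "outcome": "success"},
    {"trace_id": "t2", "model": "gpt-4o-mini", "cost_usd": 0.03, "steps": 6,  "outcome": "success"},
    {"trace_id": "t3", "model": "gpt-4o",      "cost_usd": 0.91, "steps": 34, "outcome": "budget_exceeded"},
]

# Filter by cost and outcome: expensive runs that never completed their task
expensive_failures = [
    t for t in traces
    if t["cost_usd"] > 0.50 and t["outcome"] != "success"
]
print([t["trace_id"] for t in expensive_failures])  # ['t3']
```

The same filter-by-field pattern is what a trace viewer's query bar does for you: slice the trace dataset by model, cost, step count, or outcome.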
Key vocabulary:
• Trace ID — unique identifier for one complete agent run
• Run — synonym for trace in some frameworks (LangChain uses "run")
• Trace viewer — the UI that renders a tree of all steps in a trace
• Trace dataset — a collection of traces used for offline evaluation
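Span-level token analysis, from the vocabulary above, can be sketched like this: sum token counts per span name within a trace, then rank the steps to find optimisation targets. The span names and counts are made-up example data.

```python
from collections import Counter

# Spans from one trace as (span_name, tokens) pairs — illustrative data
spans = [
    ("plan", 1200), ("search_web", 150), ("summarize", 2800),
    ("search_web", 180), ("summarize", 3100), ("final_answer", 900),
]

# Aggregate tokens per step name
tokens_by_step = Counter()
for name, tokens in spans:
    tokens_by_step[name] += tokens

# Rank by consumption — the top entries are the optimisation targets
for name, total in tokens_by_step.most_common():
    print(f"{name:14s} {total:>6d}")
```

Here the two `summarize` spans dominate the run's token spend, so that step is where prompt trimming or a cheaper model would pay off first.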