Sets up MLflow tracing for Python and TypeScript agents and LLM apps, with autoinstrumentation for LangChain, LangGraph, OpenAI, and other frameworks. The guide tells you what's actually worth tracing (LLM calls, retrieval, tool use) versus what adds noise (string formatting, config loading), which is more helpful than most observability docs. Includes verification steps to confirm traces are actually being logged before you waste time on evaluation, plus patterns for feedback collection and production deployment with sampling. Load this before running agent evaluation or you'll be debugging blind.
`npx -y skills add mlflow/skills --skill instrumenting-with-mlflow-tracing --agent claude-code`

Installs into `.claude/skills` of the current project.
Based on the user's project, load the appropriate guide:
- Python: references/python.md
- TypeScript: references/typescript.md

If unclear, check for package.json (TypeScript) or requirements.txt/pyproject.toml (Python) in the project.
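
The file check described above can be sketched in a few lines of Python. This is only an illustration: `detect_guide` and `GUIDES` are hypothetical names, and returning `None` when both or neither marker is present is an assumption about how to handle the ambiguous case.

```python
from pathlib import Path
from typing import Optional

# Guide paths as listed above, keyed by project language.
GUIDES = {
    "typescript": "references/typescript.md",
    "python": "references/python.md",
}


def detect_guide(project_dir: str) -> Optional[str]:
    """Return the guide path for the project, or None if the language is unclear."""
    root = Path(project_dir)
    has_ts = (root / "package.json").is_file()
    has_py = any(
        (root / name).is_file() for name in ("requirements.txt", "pyproject.toml")
    )
    if has_ts and not has_py:
        return GUIDES["typescript"]
    if has_py and not has_ts:
        return GUIDES["python"]
    # Neither marker or both markers found: ask the user instead of guessing.
    return None
```

Calling `detect_guide(".")` from the project root returns the matching guide path, or `None` when the project type has to be confirmed manually.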
Trace these operations (high debugging/observability value):
| Operation Type | Examples | Why Trace |
|---|---|---|
| Root operations | Main entry points, top-level pipelines, workflow steps | End-to-end latency, input/output logging |
| LLM calls | Chat completions, embeddings | Token usage, latency, prompt/response inspection |
| Retrieval | Vector DB queries, document fetches, search | Relevance debugging, retrieval quality |
| Tool/function calls | API calls, database queries, web search | External dependency monitoring, error tracking |
| Agent decisions | Routing, planning, tool selection | Understand agent reasoning and choices |
| External services | HTTP APIs, file I/O, message queues | Dependency failures, timeout tracking |
Skip tracing operations that are too granular and only add noise, such as string formatting and config loading.
Rule of thumb: Trace operations that are important for debugging and identifying issues in your application.
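
As a rough illustration of this rule of thumb, the sketch below traces a root operation and a retrieval step with the `mlflow.trace` decorator while leaving prompt formatting untraced. The function names, the toy corpus, and the `traced` helper are hypothetical; the helper exists only so the sketch still runs where MLflow is not installed, and in a real project you would apply `@mlflow.trace(span_type=...)` directly.

```python
try:
    import mlflow
except ImportError:  # assumption: keep the sketch runnable without MLflow installed
    mlflow = None


def traced(span_type: str):
    """Hypothetical helper: wrap with mlflow.trace when available, else leave the function as is."""
    if mlflow is None:
        return lambda fn: fn
    return mlflow.trace(span_type=span_type)


def format_prompt(question: str, docs: list) -> str:
    # Plain string formatting: deliberately left untraced.
    context = "\n".join(docs)
    return f"Context:\n{context}\n\nQuestion: {question}"


@traced("RETRIEVER")
def retrieve(question: str) -> list:
    # Stand-in for a vector DB query: naive substring matching over a toy corpus.
    corpus = ["MLflow records traces as trees of spans.", "Bananas are yellow."]
    words = question.lower().split()
    return [doc for doc in corpus if any(word in doc.lower() for word in words)]


@traced("CHAIN")
def answer(question: str) -> str:
    # Root operation: the retrieval span is recorded as a child of this span.
    docs = retrieve(question)
    prompt = format_prompt(question, docs)
    return f"{len(docs)} document(s) retrieved; prompt is {len(prompt)} characters"
```

With MLflow installed, calling `answer(...)` should produce one trace containing a root span and a nested retrieval span, with no span for `format_prompt`.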
After instrumenting the code, always verify that tracing is working.
Planning to evaluate your agent? Tracing must be working before you run
agent-evaluation. Complete verification below first.
Use `mlflow.search_traces()` or `MlflowClient().search_traces()` to check that traces appear in the experiment:

```python
import mlflow

# Confirm that at least one trace was logged to the experiment.
traces = mlflow.search_traces(experiment_ids=["<experiment_id>"])
print(f"Found {len(traces)} trace(s)")
assert len(traces) > 0, "No traces were logged; check tracking URI and experiment settings"

# Inspect the spans of the first trace (search_traces returns a pandas DataFrame by default).
trace = traces.iloc[0]
spans = mlflow.get_trace(trace.trace_id).data.spans
print(f"Trace has {len(spans)} span(s)")
for span in spans:
    print(f"  - {span.name} ({span.span_type})")
```
Check these in order:
1. Was `mlflow.set_tracking_uri(...)` called before the agent run? Without this, traces go to a local ./mlruns directory instead of the configured server.
2. Did `mlflow.autolog()` or the framework-specific `mlflow.<framework>.autolog()` raise any warnings during setup? Check stderr for patching failures.
3. Does the experiment passed to `search_traces()` match the experiment that was active when the code ran? Use `mlflow.get_experiment_by_name(...)` to confirm.

For automated validation, use agent-evaluation/scripts/validate_tracing_runtime.py.
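
The tracking URI check can be partly approximated without importing MLflow at all. The sketch below is an assumption-laden heuristic, not part of the skill: `tracking_hints` is a hypothetical helper that looks at the `MLFLOW_TRACKING_URI` environment variable and at whether a local `mlruns` directory exists in the project. It cannot see a URI set in code via `mlflow.set_tracking_uri(...)`, so treat its output as hints only.

```python
import os
from pathlib import Path
from typing import List, Mapping


def tracking_hints(env: Mapping[str, str], project_dir: str) -> List[str]:
    """Hypothetical helper: return hints about where traces may have been written."""
    hints = []
    if not env.get("MLFLOW_TRACKING_URI"):
        hints.append(
            "MLFLOW_TRACKING_URI is unset: confirm mlflow.set_tracking_uri(...) "
            "runs before the agent"
        )
    if (Path(project_dir) / "mlruns").is_dir():
        hints.append(
            "local mlruns directory found: traces may have been written there "
            "instead of the configured server"
        )
    return hints


if __name__ == "__main__":
    # Print any hints for the current environment and working directory.
    for hint in tracking_hints(os.environ, "."):
        print(f"- {hint}")
```

An empty list means neither symptom was detected; it does not prove that tracing is configured correctly.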
Log user feedback on traces for evaluation, debugging, and fine-tuning. Essential for identifying quality issues in production.
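
A minimal sketch of logging a thumbs up/down rating with `mlflow.log_feedback()`, assuming the MLflow 3.x keyword arguments (`trace_id`, `name`, `value`, `source`, `rationale`). The helper names and the `user_satisfaction` feedback name are hypothetical, and the trace ID is assumed to be captured by your application when the request is served.

```python
from typing import Any, Dict, Optional


def feedback_kwargs(
    trace_id: str, thumbs_up: bool, comment: Optional[str] = None
) -> Dict[str, Any]:
    """Build keyword arguments for mlflow.log_feedback from a thumbs up/down rating."""
    return {
        "trace_id": trace_id,
        "name": "user_satisfaction",  # hypothetical feedback name
        "value": thumbs_up,
        "rationale": comment,
    }


def log_user_feedback(
    trace_id: str, thumbs_up: bool, user_id: str, comment: Optional[str] = None
) -> None:
    """Attach human feedback to an existing trace (requires MLflow to be installed)."""
    # Imported lazily so feedback_kwargs can be used and tested without MLflow.
    import mlflow
    from mlflow.entities import AssessmentSource, AssessmentSourceType

    mlflow.log_feedback(
        source=AssessmentSource(
            source_type=AssessmentSourceType.HUMAN, source_id=user_id
        ),
        **feedback_kwargs(trace_id, thumbs_up, comment),
    )
```

Keeping the argument construction separate from the MLflow call makes the mapping from UI events to feedback fields easy to unit test.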
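See references/feedback-collection.md for feedback collection patterns, including `mlflow.log_feedback()`.

See references/production.md for production deployment, including sampling and the `mlflow-tracing` package.

See references/advanced-patterns.md for advanced tracing patterns.

See references/distributed-tracing.md for distributed tracing.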