Pulls aggregated metrics from an MLflow tracking server so you can analyze token usage, latency, and quality scores without writing custom queries. Results can be bucketed by time or grouped by dimensions such as trace name and status, and percentile aggregations (P50, P95, P99) show how values are distributed. The examples cover common cases such as hourly token trends over the last 24 hours and P95 latency grouped by trace. The skill is a thin wrapper around MLflow's metrics API, so you do not have to call the raw endpoints yourself. It is most useful when you run LLM experiments in MLflow and need quick cost or performance insights without building dashboards.
npx -y skills add mlflow/skills --skill querying-mlflow-metrics --agent claude-code
Installs into .claude/skills of the current project.
Run scripts/fetch_metrics.py to query metrics from an MLflow tracking server.
Token usage summary:
python scripts/fetch_metrics.py -s http://localhost:5000 -x 1 -m total_tokens -a SUM,AVG
Output: AVG: 223.91 SUM: 7613
Hourly token trend (last 24h):
python scripts/fetch_metrics.py -s http://localhost:5000 -x 1 -m total_tokens -a SUM \
-t 3600 --start-time="-24h" --end-time=now
Output: Time-bucketed token sums per hour
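To make the bucketing arithmetic concrete, here is a minimal sketch of how a 24-hour window splits into 3600-second buckets. It is not the script's implementation: `bucket_starts` is a hypothetical helper, and how the server actually aligns bucket boundaries is an assumption.

```python
from datetime import datetime, timedelta, timezone


def bucket_starts(start, end, interval_seconds):
    """Return the start time of each fixed-width bucket in [start, end).

    Hypothetical helper: the server may align buckets differently
    (for example to epoch boundaries), so treat this as illustration only.
    """
    step = timedelta(seconds=interval_seconds)
    starts = []
    current = start
    while current < end:
        starts.append(current)
        current += step
    return starts


end = datetime(2024, 1, 2, tzinfo=timezone.utc)
start = end - timedelta(hours=24)

hourly = bucket_starts(start, end, 3600)   # -t 3600
print(len(hourly))  # 24
```

With `-t 86400` the same window would collapse into a single daily bucket.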
Latency percentiles by trace:
python scripts/fetch_metrics.py -s http://localhost:5000 -x 1 -m latency -a AVG,P95 -d trace_name
Error rate by status (returns trace counts per status; divide the error count by the total to get a rate):
python scripts/fetch_metrics.py -s http://localhost:5000 -x 1 -m trace_count -a COUNT -d trace_status
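The command above reports counts per status rather than a rate. A small sketch of turning those counts into an error rate follows; the `error_rate` helper, the status labels `OK` and `ERROR`, and the sample numbers are all assumptions for illustration, not output from the script.

```python
def error_rate(counts_by_status, error_status="ERROR"):
    """Fraction of traces whose status equals error_status.

    counts_by_status is a hypothetical {status: count} mapping that you
    would fill in from the script's output.
    """
    total = sum(counts_by_status.values())
    if total == 0:
        return 0.0  # avoid dividing by zero when there are no traces
    return counts_by_status.get(error_status, 0) / total


# Made-up counts, not real results.
counts = {"OK": 95, "ERROR": 5}
print(f"{error_rate(counts):.1%}")  # 5.0%
```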
Quality scores by evaluator (assessments):
python scripts/fetch_metrics.py -s http://localhost:5000 -x 1 -v ASSESSMENTS \
-m assessment_value -a AVG,P50 -d assessment_name
Output: Average and median scores for each evaluator (e.g., correctness, relevance)
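As a rough picture of what `AVG` and `P50` grouped by `assessment_name` mean, the sketch below computes a mean and median per evaluator locally. The `summarize` helper and the sample rows are hypothetical, and the server's P50 may use a different percentile method than `statistics.median`, so small differences are possible.

```python
from collections import defaultdict
from statistics import mean, median


def summarize(rows):
    """Group (assessment_name, value) pairs and compute AVG and P50 per name."""
    grouped = defaultdict(list)
    for name, value in rows:
        grouped[name].append(value)
    return {
        name: {"AVG": mean(values), "P50": median(values)}
        for name, values in grouped.items()
    }


# Made-up scores for two evaluators.
rows = [
    ("correctness", 1.0),
    ("correctness", 0.0),
    ("correctness", 1.0),
    ("relevance", 0.5),
]
for name, stats in summarize(rows).items():
    print(name, stats)
```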
Assessment count by name:
python scripts/fetch_metrics.py -s http://localhost:5000 -x 1 -v ASSESSMENTS \
-m assessment_count -a COUNT -d assessment_name
JSON output: Add -o json to any command.
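If you want to consume the JSON from another Python program, one possible approach is to build the argument list and run the script with `subprocess`. This is a sketch under assumptions: `build_command` and `fetch` are hypothetical helpers, only flags shown in this document are used, and the shape of the JSON is not specified here, so the parsed result is returned as-is.

```python
import json
import subprocess
import sys


def build_command(server, experiment_ids, metric, aggregations, dimension=None):
    """Assemble the argv for scripts/fetch_metrics.py with JSON output."""
    cmd = [
        sys.executable, "scripts/fetch_metrics.py",
        "-s", server,
        "-x", ",".join(str(i) for i in experiment_ids),
        "-m", metric,
        "-a", ",".join(aggregations),
    ]
    if dimension:
        cmd += ["-d", dimension]
    cmd += ["-o", "json"]
    return cmd


def fetch(cmd):
    """Run the script and parse stdout as JSON (raises if the script fails)."""
    result = subprocess.run(cmd, capture_output=True, text=True, check=True)
    return json.loads(result.stdout)


cmd = build_command("http://localhost:5000", [1], "latency", ["AVG", "P95"], "trace_name")
print(" ".join(cmd[1:]))
# data = fetch(cmd)  # requires a reachable MLflow server
```

Passing the arguments as a list avoids shell quoting issues with values such as `--start-time="-24h"`.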
| Arg | Required | Description |
|---|---|---|
| -s, --server | Yes | MLflow server URL |
| -x, --experiment-ids | Yes | Experiment IDs (comma-separated) |
| -m, --metric | Yes | trace_count, latency, input_tokens, output_tokens, total_tokens |
| -a, --aggregations | Yes | COUNT, SUM, AVG, MIN, MAX, P50, P95, P99 |
| -d, --dimensions | No | Group by: trace_name, trace_status |
| -t, --time-interval | No | Bucket size in seconds (3600=hourly, 86400=daily) |
| --start-time | No | -24h, -7d, now, ISO 8601, or epoch ms |
| --end-time | No | Same formats as start-time |
| -o, --output | No | table (default) or json |
For SPANS metrics (span_count, latency), add -v SPANS.
For ASSESSMENTS metrics, add -v ASSESSMENTS.
See references/api_reference.md for filter syntax and full API details.
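The time formats accepted by `--start-time` and `--end-time` can be pictured as all resolving to epoch milliseconds. The sketch below shows one way such values could be normalized. It is not the script's actual parser: `to_epoch_ms` is a hypothetical helper, it handles only the `h` and `d` suffixes shown above, and it assumes UTC for ISO 8601 strings without an offset.

```python
import re
from datetime import datetime, timedelta, timezone

_UNITS = {"h": "hours", "d": "days"}


def to_epoch_ms(value, now=None):
    """Resolve 'now', '-24h', '-7d', ISO 8601, or epoch-ms text to epoch ms."""
    now = now or datetime.now(timezone.utc)
    relative = re.fullmatch(r"-(\d+)([hd])", value)
    if value == "now":
        moment = now
    elif relative:
        amount, unit = int(relative.group(1)), relative.group(2)
        moment = now - timedelta(**{_UNITS[unit]: amount})
    elif value.isdigit():
        return int(value)  # already epoch milliseconds
    else:
        moment = datetime.fromisoformat(value)
        if moment.tzinfo is None:
            moment = moment.replace(tzinfo=timezone.utc)  # assumption: naive means UTC
    return int(moment.timestamp() * 1000)


fixed_now = datetime(2024, 1, 2, tzinfo=timezone.utc)
print(to_epoch_ms("-24h", fixed_now))
print(to_epoch_ms("2024-01-01T00:00:00+00:00"))
```

Both calls print the same value, since 24 hours before the fixed `now` is midnight UTC on 2024-01-01.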