Querying MLflow Metrics

308 installs · 48 stars

Summary

Pulls aggregated metrics from MLflow tracking servers so you can analyze token usage, latency, and quality scores without writing custom queries. You get flexible bucketing by time or dimensions like trace name and status, plus percentiles for understanding distribution. The examples show real use cases like hourly token trends over 24 hours or P95 latency grouped by trace. It's a straightforward wrapper around MLflow's metrics API that saves you from dealing with the raw endpoints. Most useful when you're running LLM experiments in MLflow and need quick cost or performance insights without building dashboards.

Install to Claude Code

npx -y skills add mlflow/skills --skill querying-mlflow-metrics --agent claude-code

Installs into .claude/skills of the current project.

Files

SKILL.md

MLflow Metrics

Run scripts/fetch_metrics.py to query metrics from an MLflow tracking server.

Examples

Token usage summary:

python scripts/fetch_metrics.py -s http://localhost:5000 -x 1 -m total_tokens -a SUM,AVG

Output: AVG: 223.91 SUM: 7613
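As a quick sanity check, the two aggregates in that sample output imply how many traces were averaged, since SUM = AVG × count. The arithmetic below uses only the numbers shown above; the variable names are illustrative.

```python
# Values copied from the sample output above.
total_tokens_sum = 7613
total_tokens_avg = 223.91

# SUM = AVG * count, so count = SUM / AVG.
# Rounding is needed because AVG is printed to two decimal places.
implied_trace_count = round(total_tokens_sum / total_tokens_avg)
print(implied_trace_count)  # 34
```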

Hourly token trend (last 24h):

python scripts/fetch_metrics.py -s http://localhost:5000 -x 1 -m total_tokens -a SUM \
    -t 3600 --start-time="-24h" --end-time=now

Output: Time-bucketed token sums per hour
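To make "time-bucketed sums" concrete, here is a minimal sketch of fixed-width bucketing done client-side. It assumes buckets are aligned to multiples of the interval since the Unix epoch, which may not match how the MLflow server aligns them; `bucket_sums` and the sample points are hypothetical.

```python
from collections import defaultdict


def bucket_sums(points, interval_seconds=3600):
    """Sum values per fixed-width time bucket.

    points: iterable of (epoch_seconds, value) pairs (hypothetical sample data).
    Buckets start at multiples of interval_seconds since the Unix epoch;
    this alignment is an assumption, not confirmed MLflow behavior.
    """
    sums = defaultdict(int)
    for ts, value in points:
        bucket_start = ts - (ts % interval_seconds)  # floor to bucket start
        sums[bucket_start] += value
    return dict(sorted(sums.items()))


# Two points fall in the hour starting at 7200 s, two in the hour starting at 10800 s.
sample = [(7200, 100), (7260, 50), (10800, 25), (14399, 75)]
hourly = bucket_sums(sample)
print(hourly)  # {7200: 150, 10800: 100}
```

With `-t 3600` over a 24-hour window, this kind of bucketing yields at most 24 buckets.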

Latency percentiles by trace:

python scripts/fetch_metrics.py -s http://localhost:5000 -x 1 -m latency -a AVG,P95 -d trace_name
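Requesting AVG and P95 together is useful because a few slow traces can pull the average far from typical latency. The sketch below illustrates this with the nearest-rank percentile method; the percentile definition MLflow actually uses is not stated here, and the latency values are hypothetical.

```python
import math


def percentile_nearest_rank(values, p):
    """Return the p-th percentile (0 < p <= 100) using the nearest-rank method."""
    if not values:
        raise ValueError("values must be non-empty")
    ordered = sorted(values)
    rank = math.ceil(p * len(ordered) / 100)  # 1-based rank into the sorted list
    return ordered[rank - 1]


# Hypothetical latencies in milliseconds, with one slow outlier.
latencies_ms = [120, 95, 110, 105, 100, 2500, 98, 102, 115, 108]

avg = sum(latencies_ms) / len(latencies_ms)
p50 = percentile_nearest_rank(latencies_ms, 50)
p95 = percentile_nearest_rank(latencies_ms, 95)
print(avg, p50, p95)
```

Here the median stays near 100 ms while the average is more than three times higher, and P95 exposes the outlier directly.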

Error rate by status:

python scripts/fetch_metrics.py -s http://localhost:5000 -x 1 -m trace_count -a COUNT -d trace_status
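The command above returns counts per status, not a rate, so the error rate has to be derived from them. A minimal sketch, assuming the counts have already been collected into a dict; the status labels and numbers below are hypothetical, and `error_rate` is not part of the skill.

```python
def error_rate(counts_by_status, error_status="ERROR"):
    """Return the fraction of traces whose status equals error_status."""
    total = sum(counts_by_status.values())
    if total == 0:
        return 0.0  # no traces, so no errors to report
    return counts_by_status.get(error_status, 0) / total


# Hypothetical counts, keyed by trace status.
counts = {"OK": 188, "ERROR": 12}
rate = error_rate(counts)
print(f"{rate:.1%}")  # 6.0%
```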

Quality scores by evaluator (assessments):

python scripts/fetch_metrics.py -s http://localhost:5000 -x 1 -v ASSESSMENTS \
    -m assessment_value -a AVG,P50 -d assessment_name

Output: Average and median scores for each evaluator (e.g., correctness, relevance)

Assessment count by name:

python scripts/fetch_metrics.py -s http://localhost:5000 -x 1 -v ASSESSMENTS \
    -m assessment_count -a COUNT -d assessment_name

JSON output: Add -o json to any command.
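JSON output makes the script easy to call from other Python code. The sketch below builds the argument list from the flags documented here and parses stdout. It assumes the script prints only JSON to stdout when `-o json` is set; the JSON schema is not documented here, so the result is returned as-is. `build_command` and `fetch_metrics_json` are hypothetical helpers.

```python
import json
import subprocess
import sys


def build_command(server, experiment_ids, metric, aggregations, dimension=None):
    """Build the argv list for scripts/fetch_metrics.py with JSON output."""
    cmd = [
        sys.executable, "scripts/fetch_metrics.py",
        "-s", server,
        "-x", ",".join(str(i) for i in experiment_ids),  # comma-separated IDs
        "-m", metric,
        "-a", ",".join(aggregations),  # e.g. "AVG,P95"
        "-o", "json",
    ]
    if dimension:
        cmd += ["-d", dimension]
    return cmd


def fetch_metrics_json(**kwargs):
    """Run the script and parse its stdout as JSON (assumes stdout is pure JSON)."""
    result = subprocess.run(
        build_command(**kwargs), capture_output=True, text=True, check=True
    )
    return json.loads(result.stdout)


# Example call (requires a running MLflow server and the skill's script):
# data = fetch_metrics_json(server="http://localhost:5000", experiment_ids=[1],
#                           metric="latency", aggregations=["AVG", "P95"],
#                           dimension="trace_name")
```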

Arguments

Arg	Required	Description
`-s, --server`	Yes	MLflow server URL
`-x, --experiment-ids`	Yes	Experiment IDs (comma-separated)
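`-v`	No	View to query; the examples pass `ASSESSMENTS` for assessment metrics
`-m, --metric`	Yes	`trace_count`, `latency`, `input_tokens`, `output_tokens`, `total_tokens`; with `-v ASSESSMENTS`: `assessment_value`, `assessment_count`
`-a, --aggregations`	Yes	Comma-separated: `COUNT`, `SUM`, `AVG`, `MIN`, `MAX`, `P50`, `P95`, `P99`
`-d, --dimensions`	No	Group by: `trace_name`, `trace_status`; with `-v ASSESSMENTS`: `assessment_name`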
`-t, --time-interval`	No	Bucket size in seconds (3600=hourly, 86400=daily)
`--start-time`	No	`-24h`, `-7d`, `now`, ISO 8601, or epoch ms
`--end-time`	No	Same formats as start-time
`-o, --output`	No	`table` (default) or `json`
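The time formats accepted by `--start-time` and `--end-time` can all be reduced to epoch milliseconds. The sketch below shows one way to do that; it is not the script's actual parser. `resolve_time` is a hypothetical helper, only the `h` and `d` units from the table are handled, and treating naive ISO 8601 strings as UTC is an assumption.

```python
import re
from datetime import datetime, timedelta, timezone

_UNITS = {"h": "hours", "d": "days"}  # only the units shown in the table


def resolve_time(spec, now=None):
    """Resolve a time spec to epoch milliseconds.

    Accepts "now", relative offsets such as "-24h" or "-7d",
    an all-digit epoch-millisecond string, or an ISO 8601 timestamp.
    """
    now = now or datetime.now(timezone.utc)
    if spec == "now":
        moment = now
    elif (m := re.fullmatch(r"-(\d+)([hd])", spec)):
        moment = now - timedelta(**{_UNITS[m.group(2)]: int(m.group(1))})
    elif spec.isdigit():
        return int(spec)  # already epoch milliseconds
    else:
        moment = datetime.fromisoformat(spec)
        if moment.tzinfo is None:
            moment = moment.replace(tzinfo=timezone.utc)  # assumption: naive means UTC
    return int(moment.timestamp() * 1000)


fixed_now = datetime(2024, 1, 2, tzinfo=timezone.utc)
window_ms = resolve_time("now", fixed_now) - resolve_time("-24h", fixed_now)
print(window_ms // 3_600_000)  # 24 hourly buckets in the window
```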