redline

gowtham0992/redline
STDIO · registry active
Summary

Turns your existing prompt-response logs into local regression tests. It generates eval suites from JSONL exports or traces you already have, then replays changed prompts and diffs the outputs to catch structural breaks before deployment. The MCP server exposes suite creation, case inspection, eval runs, and diff operations as tools. It flags missing JSON keys, dropped URLs, format changes, refusals, and empty responses without requiring an LLM judge or cloud service. Reach for this when you're iterating on prompts in coding assistants and want a safety check that spots behavioral regressions between versions. The core workflow is suite generation from baseline logs, then eval or diff commands on every prompt change.

redline

Catch prompt regressions before they ship.

Automatic eval suites from the prompt logs you already have.

redline turns real prompt-response logs into local regression tests. It selects representative cases, replays your changed prompt, and shows the behavioral diff before a bad prompt reaches users.


Start Here

Install from PyPI:

python -m pip install redline-ai

Run the public proof:

redline demo --public --compact

The demo catches ten synthetic regressions without API keys, private logs, a cloud account, or an LLM judge. It writes JSON, Markdown, and self-contained HTML reports under .redline/demo.

Open the local demo report index:

redline dashboard --reports-dir .redline/demo/reports --open

On headless CI or remote shells, skip --open and use the printed HTML path:

redline dashboard --reports-dir .redline/demo/reports --out .redline/dashboard.html

First-run troubleshooting

  • redline: command not found: run python -m pip install redline-ai, then confirm the install with python -m pip show redline-ai.
  • Dashboard did not open: use --out .redline/dashboard.html and open or upload that file from your environment.
  • Suite not found: run redline suite logs/baseline.jsonl --out redline-suite.json.
  • Validation failed: run redline validate redline-suite.json --strict and fix the first reported error.
  • GitHub Action cannot find a suite: commit redline-suite.json or point the action suite input at your prompt manifest.

Full guide: docs/troubleshooting.md.

Product Proof

These are real local artifacts generated by redline demo --public --compact.

[Screenshot: Dashboard, showing reports, benchmark evidence, history, and ship readiness]
[Screenshot: HTML report, showing concrete regression reasons and side-by-side baseline and candidate outputs]

What Is redline?

redline is an open-source, local-first eval tool for AI teams. It uses logs you already have: prompts, outputs, support tickets, traces, model responses, and production JSONL exports.

Instead of asking you to hand-write evals first, redline generates the first suite from real behavior. You can then run that suite every time a prompt, model, or runner changes.

No cloud account is required. No manual test writing is required. No LLM judge is required for the core regression signal. The package has zero runtime dependencies, which keeps installs fast and the default supply-chain surface small.

How It Works

redline gives you three primitives that cover the prompt-regression loop:

1. Logs

Start with prompt-response data you already have. Import JSONL, convert exports from tools like Langfuse or Helicone, capture OpenAI/Anthropic SDK calls, or add bounded FastAPI/ASGI middleware.

redline import downloaded.jsonl --input-field instruction --output-field response --out logs/baseline.jsonl
redline suite logs/baseline.jsonl --out redline-suite.json
redline cases redline-suite.json
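If your prompt-response pairs live in application code rather than in a file, they first need to land in JSONL. The sketch below is an illustration, not part of redline: the helper name write_jsonl is hypothetical, and it simply reuses the instruction and response field names from the import command above so those flags still apply.

```python
import json
from pathlib import Path
from typing import Iterable, Mapping


def write_jsonl(records: Iterable[Mapping[str, str]], path: Path) -> int:
    """Write one JSON object per line and return the number of rows written.

    Field names ("instruction", "response") mirror the flags passed to
    `redline import` above; this helper itself is a hypothetical example.
    """
    path.parent.mkdir(parents=True, exist_ok=True)
    count = 0
    with path.open("w", encoding="utf-8") as handle:
        for record in records:
            # A row without a prompt or a response cannot act as a baseline pair.
            if not record.get("instruction") or not record.get("response"):
                continue
            handle.write(json.dumps(dict(record), ensure_ascii=False) + "\n")
            count += 1
    return count
```

Calling write_jsonl(rows, Path("downloaded.jsonl")) produces a file that the redline import line above can then map into logs/baseline.jsonl.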

2. Suite

redline groups behavior into deterministic signatures and selects representative cases first. You can add pinned edge cases and explicit requirements when a scenario must never be missed.

redline cases redline-suite.json
redline suite add redline-suite.json --prompt "..." --response "..."

3. Eval

Replay a changed prompt or compare candidate outputs. redline names the behavior that broke: missing JSON keys, URLs, numbers, tables, code blocks, refusals, empty answers, or requirement failures.

redline eval --prompt prompts/v2.txt
redline diff redline-suite.json logs/candidate.jsonl

Product Promise

In under five minutes, on a real prompt log, redline should catch one regression you did not want to ship.

That promise is intentionally narrow. redline is not a hosted eval platform, a generic score, or a replacement for human judgment. It is the local safety loop between "I changed the prompt" and "this is safe enough to merge."

Real Workflow

Build a suite from baseline logs:

redline suite logs/baseline.jsonl --out redline-suite.json

Evaluate a changed prompt file through your configured runner:

redline eval --prompt prompts/v2.txt

Or compare candidate outputs you already generated:

redline diff redline-suite.json logs/candidate.jsonl

When redline finds a blocking change, it exits non-zero for CI and prints the reason:

REGRESSION case_004
- candidate missing JSON keys: owner, required_action
- candidate missing URL: https://example.com/policies/refunds

Confidence: HIGH | fix blocking cases before shipping
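To make reasons like the ones above concrete, here is a minimal sketch of a structural comparison between a baseline and a candidate response. It is not redline's implementation: the function names, the URL pattern, and the exact reason strings are assumptions chosen to resemble the sample output, and it covers only empty outputs, top-level JSON keys, and URLs.

```python
import json
import re
from typing import List, Set

# Assumed URL pattern for illustration; it stops at whitespace, quotes, and brackets.
URL_PATTERN = re.compile(r"https?://[^\s\"'<>)\]]+")


def _json_keys(text: str) -> Set[str]:
    """Return top-level keys if text parses as a JSON object, else an empty set."""
    try:
        value = json.loads(text)
    except json.JSONDecodeError:
        return set()
    return set(value) if isinstance(value, dict) else set()


def _urls(text: str) -> List[str]:
    """Return URLs in order of appearance, without trailing sentence punctuation."""
    return [match.rstrip(".,;") for match in URL_PATTERN.findall(text)]


def structural_reasons(baseline: str, candidate: str) -> List[str]:
    """List structural elements present in the baseline but absent in the candidate."""
    if baseline.strip() and not candidate.strip():
        return ["candidate is empty"]

    reasons: List[str] = []
    missing_keys = sorted(_json_keys(baseline) - _json_keys(candidate))
    if missing_keys:
        reasons.append("candidate missing JSON keys: " + ", ".join(missing_keys))

    candidate_urls = set(_urls(candidate))
    for url in dict.fromkeys(_urls(baseline)):  # de-duplicate, keep order
        if url not in candidate_urls:
            reasons.append("candidate missing URL: " + url)
    return reasons
```

An empty result means none of these three checks fired. As the Trust Boundary section notes, that says nothing about factual correctness.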

What redline Catches

| Signal | Example regression |
| --- | --- |
| JSON validity and keys | Candidate stops returning valid JSON or drops owner. |
| Tables, lists, and code blocks | Markdown table becomes prose; code fence disappears. |
| Numbers, URLs, and entities | Refund window, ticket ID, policy URL, or owner is missing. |
| Empty outputs and refusals | Candidate newly refuses a safe task or returns nothing. |
| Content drift | Same-shape response changes substantially. |
| Explicit requirements | Pinned cases require or forbid exact strings. |

redline is deterministic and local-first by default. Optional judge commands are available for ambiguous changed cases, but redline does not call a cloud model unless you explicitly configure that command.

Suite generation groups logs by deterministic behavior signatures, not opaque embedding clusters. It picks one representative per group first, then adds high-variance edges and evenly spread prompt-diverse samples from large clusters when the case budget allows.
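The idea of grouping by a deterministic signature and keeping one representative per group can be sketched as follows. This is an assumption-heavy illustration rather than redline's algorithm: the feature set, the signature format, and the function names are all hypothetical, and the high-variance edge and prompt-diversity sampling steps are not modeled.

```python
import json
import re
from typing import Dict, List, Tuple

Signature = Tuple[str, ...]
FENCE = "`" * 3  # Markdown code fence marker


def behavior_signature(response: str) -> Signature:
    """Describe the shape of a response with a small, deterministic feature tuple."""
    text = response.strip()
    if not text:
        return ("empty",)

    features: List[str] = []
    try:
        value = json.loads(text)
    except json.JSONDecodeError:
        value = None
    if isinstance(value, dict):
        features.append("json:" + ",".join(sorted(value)))
    if FENCE in text:
        features.append("code_block")
    if re.search(r"^\s*\|.+\|\s*$", text, flags=re.MULTILINE):
        features.append("table")
    if re.search(r"^\s*(?:[-*]|\d+\.)\s+", text, flags=re.MULTILINE):
        features.append("list")
    if re.search(r"https?://", text):
        features.append("url")
    return tuple(features) or ("prose",)


def pick_representatives(rows: List[Dict[str, str]]) -> List[Dict[str, str]]:
    """Keep the first row seen for each signature, in input order."""
    groups: Dict[Signature, Dict[str, str]] = {}
    for row in rows:
        groups.setdefault(behavior_signature(row["response"]), row)
    return list(groups.values())
```

Because the signature depends only on the response text, the same log always yields the same groups, which is the property that distinguishes this approach from embedding clusters.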

Trust Boundary

A green redline run means no configured high-signal structural blockers were found. It does not prove factual correctness, tone, hallucination safety, policy compliance, or subtle reasoning quality.

That boundary is visible in CLI output and reports because over-trusting eval tools is dangerous. Each reported case includes a confidence and signal (structural, shallow_semantic, requirement, judge, or human_judgment) so reviewers can see why redline is making the call. Use requirements or an optional judge for semantic risks that structural checks cannot prove.

Product Surface

redline is built around the full prompt-regression loop:

  • redline watch: collect prompt-response observations from logs, Python functions, OpenAI/Anthropic-compatible SDK calls, or ASGI apps. By default, common secrets and PII are redacted on a best-effort basis before anything is written.
  • RedlineMiddleware: capture bounded JSON FastAPI or ASGI request/response pairs locally, with optional skip diagnostics.
  • redline redact --check: scan logs for common secrets and PII, then write a scrubbed copy when needed. Redaction is best-effort pattern matching, not a privacy boundary; review sensitive logs before sharing.
  • redline cluster: inspect behavior groups before suite generation.
  • redline suite: generate a representative eval suite from baseline logs.
  • redline prompts: scan many prompt files and write or check a versionable prompt-to-suite manifest. Add --check-suites in CI when every prompt should already have a built and valid suite.
  • redline suite add: pin hand-picked edge cases the algorithm should never miss.
  • redline budget / redline benchmark: estimate suite or prompt-manifest runtime without executing replay commands, write budget artifacts, and optionally fail on a CI time budget. Add --measure-local to time redline's deterministic local diff work on your suite baselines without calling a model.
  • redline eval: replay each suite case through your local app or model runner.
  • redline diff: compare candidate JSONL outputs against the suite baseline.
  • redline mark and redline accept: review intentional changes and promote the new baseline.
  • redline require: add deterministic must-include or must-not-include rules.
  • redline audit --verify: inspect the local audit trail and verify the hash chain. Add --expect-last-hash or --expect-entries when you want to prove the local log tail still matches a checkpoint from CI or release evidence. Add --out-checkpoint .redline/audit-checkpoint.json to persist that evidence, then --checkpoint .redline/audit-checkpoint.json to verify against it later.
  • redline sbom: write CycloneDX SBOM release evidence for security review.
  • redline history, redline compare, and redline dashboard: track quality over time and inspect reports locally. The dashboard surfaces feature-level rollups, prompt-level eval rows, benchmark evidence, and a latest-report review queue when reports come from a prompt manifest. It also warns when reports exist without benchmark evidence from the same project.
  • redline summary: inspect suite readiness, or pass redline-prompts.json to roll up multi-prompt suite coverage, owners, requirements, and missing suites.
  • redline-mcp: let AI coding assistants run checks inside Claude, Codex, Cursor, Kiro, or any MCP client.
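The best-effort redaction mentioned for redline watch and redline redact --check in the list above is pattern matching, and a small sketch shows why it cannot be a privacy boundary. The patterns, labels, and function name below are hypothetical examples, not redline's rule set.

```python
import re
from typing import List, Tuple

# Hypothetical pattern set for illustration; real secret and PII formats vary widely,
# so anything that does not match one of these shapes passes through untouched.
PATTERNS: List[Tuple[str, re.Pattern]] = [
    ("EMAIL", re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")),
    ("BEARER_TOKEN", re.compile(r"(?i)\bbearer\s+[A-Za-z0-9._~+/=-]{16,}")),
    ("API_KEY", re.compile(r"\bsk-[A-Za-z0-9_-]{16,}\b")),
]


def redact(text: str) -> Tuple[str, List[str]]:
    """Replace matches with labeled placeholders and report which labels fired."""
    hits: List[str] = []
    for label, pattern in PATTERNS:
        text, count = pattern.subn(f"[REDACTED_{label}]", text)
        if count:
            hits.append(label)
    return text, hits
```

A secret in an unanticipated format produces no hit, which is why the source advises reviewing sensitive logs before sharing them.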

For repos with many prompt files, the manifest becomes the eval plan:

redline prompts prompts/ --suite-dir suites --out redline-prompts.json
redline prompts prompts/ --suite-dir suites --out redline-prompts.json --check --check-suites
redline summary redline-prompts.json
redline validate redline-prompts.json --strict
redline budget redline-prompts.json
redline eval redline-prompts.json

Manifest summaries show readiness across every mapped suite, manifest validation checks every mapped suite, manifest benchmarks aggregate runtime budget, and manifest evals print prompt-level rollups before case details. Large repos can see which prompt files or feature folders need attention first.

When mapped suites are valid, the check prints ready commands such as:

redline eval suites/support/triage.redline-suite.json --prompt prompts/support/triage.txt

Connect Your App

Any command that reads a prompt from stdin and prints a response to stdout can be a redline runner:

redline init --runner stdio --copy-runner --github-action
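The stdin-to-stdout contract described above is small enough to sketch in a few lines. This is a hypothetical runner, not the one that --copy-runner writes: respond is a placeholder for your own model or application call, and the stream-taking run function exists only so the logic can be exercised without a terminal.

```python
import sys
from typing import TextIO


def respond(prompt: str) -> str:
    """Hypothetical stand-in: replace with a call into your model or app."""
    return "ECHO: " + prompt.strip()


def run(stdin: TextIO, stdout: TextIO) -> int:
    """Read the whole prompt from stdin, write one response to stdout."""
    prompt = stdin.read()
    stdout.write(respond(prompt) + "\n")
    return 0  # a non-zero return would signal a runner failure to the caller


# To use this file as a runner script, end it with:
#     raise SystemExit(run(sys.stdin, sys.stdout))
```

Keeping diagnostics on stderr rather than stdout is a sensible default for this kind of runner, since anything printed to stdout becomes part of the response being compared.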

Built-in adapters cover provider-neutral stdio, OpenAI, Anthropic, LiteLLM, HTTP APIs, Python chains, JSONL log imports, and OpenAI/Anthropic SDK capture:

redline runners
redline runners --copy all

Runner details live in docs/runners.md. Log import and SDK capture adapters are for building suites from real observations, not for redline eval replay. The JSONL log adapter includes Langfuse, Helicone, LangSmith, and Braintrust presets for exported observability logs.

AI Assistant Native

redline ships a local Model Context Protocol server:

redline-mcp

Use docs/mcp.md to wire redline into an MCP client. The MCP surface exposes safe capture-readiness, privacy, audit, scale, read, case-inspection, eval, and report tools plus workflow prompts like setup_redline_project, check_prompt_change, build_suite_from_logs, and review_candidate_outputs. It can also list or copy runner adapters and optional judge templates during setup. The only mutating MCP tool is guarded: redline_mark requires allow_write: true and a note before it records an intentional case judgment. Baseline promotion stays CLI-only.

CI And GitHub

Create config plus a GitHub Actions workflow:

redline init --runner stdio --copy-runner --github-action

Use redline as a composite GitHub Action from another repo:

- uses: gowtham0992/redline@v0.2.1
  with:
    prompt-path: prompts/v2.txt
    benchmark-max-seconds: "300"

For multi-prompt repos, point suite at redline-prompts.json. The action checks every mapped suite with redline prompts --check --check-suites, runs a manifest-wide benchmark, then runs the manifest eval.

The action writes JSON, full Markdown, concise PR-comment Markdown, HTML, JUnit, Slack-ready JSON, history, dashboard, and audit checkpoint artifacts under .redline/, appends benchmark, concise eval, and trend summaries to the GitHub step summary, and exits with the eval gate status. Set benchmark-max-seconds when a suite should fail CI if its worst-case runtime budget grows too far.

Reports

Every diff and eval run can write:

  • JSON for machines and dashboards
  • full Markdown for detailed summaries, including prompt-manifest rollups
  • concise PR-comment Markdown for merge-review surfaces
  • self-contained HTML for side-by-side inspection, including feature and prompt eval tables
  • JUnit XML for CI test reporting
  • Slack Block Kit JSON for CI bots or webhook integrations you control
  • GitHub annotations for changed or blocking cases

Example:

redline diff redline-suite.json logs/candidate.jsonl \
  --out-json .redline/reports/diff.json \
  --out-md .redline/reports/diff.md \
  --out-comment .redline/reports/diff-comment.md \
  --out-html .redline/reports/diff.html \
  --out-junit .redline/reports/diff.xml \
  --out-slack .redline/reports/diff.slack.json

Optional Judges

Use judges only where structural checks are not enough. redline sends only ambiguous changed cases to the configured command as JSON on stdin:

redline judges
redline judges --copy openai
redline judges --copy support-rubric
redline diff logs/candidate.jsonl --judge "python examples/judge_changed.py"

Repo examples and installable templates:

  • examples/judge_changed.py
  • examples/openai_judge.sh
  • examples/anthropic_judge.sh
  • examples/litellm_judge.sh
  • examples/judges/support_rubric.md
  • examples/judges/extraction_rubric.md
  • examples/judges/safety_rubric.md

Calibration guidance lives in docs/judges.md.

Config

redline init writes redline.json with a $schema reference for editor autocomplete. Important keys:

| Key | Purpose |
| --- | --- |
| suite | Suite baseline path, default redline-suite.json. |
| input_field, output_field | JSONL field paths for prompts and responses. |
| max_cases | Maximum representative cases selected for a suite. |
| replay | Command used by eval; prompts go to stdin by default. {prompt} is for small legacy argv runners; {prompt_file} passes a temporary rendered-prompt file path. |
| workers | Number of replay cases to run concurrently. |
| owners | Optional pattern-to-owner rules so regressions show the responsible team. |
| approval | Optional local guardrail; require_approver makes accept record an approver. |
| fail_on | Statuses that fail diff or eval; use "none" for report-only setup. |
| reports | JSON, Markdown, PR-comment Markdown, HTML, JUnit, and Slack-ready JSON output paths. |
| logs | Observed prompt-response log path and optional middleware skip diagnostics path. |
| audit | Append-only JSONL audit log path for evals, judgments, requirements, and accepted baselines. New entries include operator/approver context plus a local hash chain that redline audit --verify can check; use expected hash/count checkpoints or --out-checkpoint evidence files to detect tail truncation. |
| judge | Optional command for ambiguous changed cases. |

Check setup before relying on a suite:

redline doctor --strict
redline validate redline-suite.json --strict
redline summary redline-suite.json

doctor shows whether the suite has explicit requirements or recorded judgments before you rely on structural checks in CI. summary reports cluster/case coverage, owner coverage, accepted baseline history, approver coverage, and explicit guard coverage for cases with requirements or recorded judgments so teams can review suite readiness before CI. dashboard also shows audit checkpoint evidence when .redline/audit-checkpoint.json is present.
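The audit hash chain and checkpoint evidence mentioned above rest on a general technique that is easy to sketch. The entry layout, the SHA-256 choice, and every name below are assumptions for illustration; redline's actual audit format may differ.

```python
import hashlib
import json
from typing import Dict, List, Optional

GENESIS = "0" * 64  # assumed starting value for the first entry's "prev" field


def entry_hash(prev_hash: str, payload: Dict[str, object]) -> str:
    """Hash the previous hash together with a canonical JSON form of the payload."""
    body = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256((prev_hash + body).encode("utf-8")).hexdigest()


def append_entry(log: List[Dict[str, object]], payload: Dict[str, object]) -> None:
    """Append a payload, linking it to the hash of the entry before it."""
    prev_hash = str(log[-1]["hash"]) if log else GENESIS
    log.append({"payload": payload, "prev": prev_hash, "hash": entry_hash(prev_hash, payload)})


def verify(
    log: List[Dict[str, object]],
    expect_last_hash: Optional[str] = None,
    expect_entries: Optional[int] = None,
) -> bool:
    """Recompute the chain, then compare the tail against an optional checkpoint."""
    prev_hash = GENESIS
    for entry in log:
        if entry["prev"] != prev_hash or entry["hash"] != entry_hash(prev_hash, entry["payload"]):
            return False  # an entry was edited, reordered, or removed mid-chain
        prev_hash = str(entry["hash"])
    if expect_entries is not None and len(log) != expect_entries:
        return False
    if expect_last_hash is not None and prev_hash != expect_last_hash:
        return False
    return True
```

The sketch also shows why a checkpoint matters: a log with its last entries cut off is still internally consistent, so only a hash or count recorded elsewhere reveals tail truncation.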

Dogfood Assets

The public fixture is synthetic, shaped after public instruction/chat dataset patterns, and documented in examples/public_dogfood_sources.md.

python -m redline suite examples/public_dogfood_baseline.jsonl --out /tmp/redline-public-suite.json --all-cases
python -m redline diff /tmp/redline-public-suite.json examples/public_dogfood_candidate.jsonl --compact --fail-on none

For AI-assistant session dogfood, use docs/ai-session-dogfood-prompts.jsonl and normalize raw exports with scripts/normalize_ai_session_logs.py. Reproducible dogfood case studies live in docs/case-studies.md. Public dataset candidates for internet dogfood are ranked in docs/internet-dogfood-sources.md.

From a repo checkout, record the public demo:

bash scripts/demo_terminal.sh
bash scripts/demo_gif.sh .redline/launch .redline/launch/redline-demo.gif

Development

python -m pip install -e ".[dev]"
python -m pytest -q
python -m ruff check .
python -m mypy redline tests scripts examples

Before cutting a release or asking someone else to try a branch:

bash scripts/release_check.sh

Project Docs

  • docs/release.md: package, tag, PyPI, and MCP Registry release flow
  • docs/launch.md: public alpha launch plan
  • docs/troubleshooting.md: first-run and CI failure recovery
  • docs/commands.md: compact CLI command reference
  • docs/dogfood.md: first-user dogfood protocol
  • docs/case-studies.md: reproducible dogfood case studies
  • docs/internet-dogfood-sources.md: public prompt-response datasets for dogfood sourcing
  • docs/runners.md: runner and log adapter setup
  • docs/mcp.md: MCP server setup
  • docs/benchmarks.md: performance contract and CI benchmark artifacts
  • docs/repository.md: GitHub repository controls
  • scripts/README.md: maintainer script index
  • CONTRIBUTING.md: contributor validation
  • SECURITY.md: privacy and vulnerability reporting
  • LICENSE: MIT open source license

Website source for GitHub Pages lives in site/ and deploys from the committed static assets on main.

Categories: AI & LLM Tools
Registry: active
Package: redline-ai
Transport: STDIO
Updated: May 27, 2026
