CAT
/MCP
SkillsMCPMarketplacesDigestToolsAdvertise

This week in Claude

Every Monday: Claude Code, Agent SDK, MCP, and the Anthropic platform moves worth your time.

Skills by Category
Frontend DevelopmentBackend & APIsTesting & QASecurityDevOps & CI/CDGit & Pull RequestsDocumentationCode Review & QualityAI & Agent BuildingSkill Development
MCP Servers by Category
Sales & MarketingWeb & Browser AutomationDatabasesAI & LLM ToolsCloud & InfrastructureCommunication & MessagingDeveloper ToolsDesign & CreativeDocuments & KnowledgeSearch & Web Crawling
Marketplaces by Category
AI Agents & OrchestrationLLM IntegrationDevelopment ToolsFrontend & UIBackend & APIsDatabasesTesting & Code QualityDevOps & CloudSecurity & ComplianceGit & Version Control

Cross AI Tools

Discover Claude Code plugins, extensions, and tools. Automatically updated directory of Anthropic Claude AI marketplaces with development tools, productivity plugins, and integrations.

Resources

  • Browse Skills
  • Browse MCP Servers
  • Browse Marketplaces
  • Plugins Reference

Community

  • About
  • Tools
  • Feedback
  • Privacy Policy
  • Advertise

Built for the Claude Code community with Claude Code by @mertduzgun

Independent project, not affiliated with Anthropic

Forgejudge

ahmedeid1/forgejudge
STDIOregistry active
Summary

Gives Claude programmatic access to ForgeJudge's autonomous coding agent evaluation infrastructure. You get operations to run the solver against tasks from the golden set, grade patches with the deterministic harness, fetch leaderboard results, and query per-run traces from the Langfuse observability backend. The golden set is 18 contamination-resistant make-CI-green tasks, each mutation-hardened to catch wrong fixes. Useful when you're building or tuning agentic coding workflows and need reproducible benchmarking with execution-as-judge grading, or when you want Claude to help analyze regression patterns across model swaps or seed sweeps. The harness runs patches in sandboxed GitHub Actions VMs and returns strict RESOLVED verdicts based on FAIL_TO_PASS and PASS_TO_PASS test outcomes.

CodeRabbit
CodeRabbit
AI writes the code. CodeRabbit catches the slop.
Try For Free →
Keep your Mac awake
Keep your Mac awake
Keep your Mac awake while Claude Code and 40+ AI agents run. Sleeps when they're idle.
One time payment $9 →
Context.devContext.dev
Context.dev
Integrate web data into your AI product. One API to scrape website & brand data.
Get API Key Now →
Make your agent a DeFi expert
Make your agent a DeFi expert
Agent, run crypto. Access onchain data & trade routes via 1inch.
Install now →
Make money from your Skills
Make money from your Skills
On Capafy, your Skill runs online 24/7 as an agent product, and you get paid every time someone uses it.
Start earning →
AppSignal
AppSignal
Monitor with ease. Code with confidence.
Start Free Trial →
CodeRabbit
CodeRabbit
AI writes the code. CodeRabbit catches the slop.
Try For Free →
Keep your Mac awake
Keep your Mac awake
Keep your Mac awake while Claude Code and 40+ AI agents run. Sleeps when they're idle.
One time payment $9 →
Context.devContext.dev
Context.dev
Integrate web data into your AI product. One API to scrape website & brand data.
Get API Key Now →
Make your agent a DeFi expert
Make your agent a DeFi expert
Agent, run crypto. Access onchain data & trade routes via 1inch.
Install now →
Make money from your Skills
Make money from your Skills
On Capafy, your Skill runs online 24/7 as an agent product, and you get paid every time someone uses it.
Start earning →
AppSignal
AppSignal
Monitor with ease. Code with confidence.
Start Free Trial →

ForgeJudge

An open, always-on leaderboard and CI gate for autonomous coding agents — every patch runs in a sandbox, every run has a public trace, every regression fails the build.

CI regression gate license: MIT python 3.12

▶ Live leaderboard: forgejudge.ahmedhobeishy.tech · playground · methodology · model swap · MCP registry

Current numbers (hidden-test = the agent never sees the failing test; $0 free tier; same harness, swap the model; 18 tasks × 3 seeds = 54 runs/model, 162 total):

Modelpass@1pass@3
gpt-oss-120b90.7%100%
llama-3.3-70b88.9%94.4%
llama-3.1-8b48.1%66.7%

The score rises with the better model while the harness stays fixed (model-swap proof), and pass@3 > pass@1 shows real run-to-run variance — which is exactly why the CI gate is multi-seed. Every run deep-links its Langfuse trace.

ForgeJudge is the only open-source autonomous software-engineering agent that proves its quality in public on every commit: a hand-rolled single-agent solver, a deterministic execution-as-judge harness, an always-on leaderboard with per-run traces, and a CI gate that blocks regressions — all on a $0 / self-hostable stack against a contamination-resistant, intrinsically-verifiable golden set.

The engineered harness, observability, and gate are the deliverable — not a high resolution rate. A $0 free-model agent will score modestly by design. We prove value with a model-swap comparison: the score rises with a better model while the harness stays fixed.

How it works

flowchart TD
    G["Golden set · Git-canonical<br/>18 intrinsically-verifiable, mutation-hardened<br/>make-CI-green tasks"]

    subgraph SOLVER["Single-agent solver"]
        direction LR
        L["localize<br/>(BM25)"] --> R["repair<br/>(LLM router · critic · syntax edit-gate)"] --> V["validate<br/>(run tests)"]
    end

    G --> SOLVER
    SOLVER --> PATCH["unified diff"]
    SOLVER -. "every step traced" .-> TRACE["OTel → Langfuse<br/>per-run public trace"]

    PATCH --> H["Deterministic harness, in a sandbox<br/>apply test_patch + candidate patch · run F2P / P2P<br/><b>RESOLVED iff</b> every FAIL_TO_PASS passes AND every PASS_TO_PASS stays green<br/>swebench-equivalent · stricter on skips · cheat-resistant"]

    H --> STORE["Run store<br/>Neon + pgvector"]
    STORE --> LB["Leaderboard<br/>pass@1 / pass@3 · cost · tokens · trace link"]
    H --> GATE["Multi-seed CI gate<br/>a PR that lowers the resolution rate fails the build"]

    style G stroke:#3fb950,stroke-width:2px
    style H stroke:#4cc2ff,stroke-width:2px
    style GATE stroke:#f0883e,stroke-width:2px
  • Solver — a single, phase-structured loop (localize → repair → validate), not a multi-agent swarm: cheapest, most deterministic, most debuggable. BM25 localization, an LLM router over free tiers, a syntax edit-gate, a cheap critic pre-filter, and a cost/step budget with autosubmit.
  • Harness — encodes the SWE-bench RESOLVED_FULL rule and is verified equivalent to swebench.harness.grading on real PASS/FAIL/ERROR/XFAIL outcomes in CI — and deliberately stricter on a skipped FAIL_TO_PASS: swebench 4.1.0 rates a skipped oracle test RESOLVED_FULL (a skip is neither success nor failure), so a patch that makes the oracle skip rather than run grades as resolved. ForgeJudge counts a skip as not-passed, closing that cheat vector. Patches are also cheat-resistant: the canonical test files are restored before grading, so a patch can't neuter the oracle.
  • Golden set — 15 purpose-built post-cutoff fixtures + 3 tasks mined from the author's own repos (real commit SHAs, MIT/own license — zero leak/copyleft risk). Each is mutation-hardened: a wrong fix to the patched region is caught (16 mutation-hardened at mean score 0.94; 2 inconclusive for regex/string code; 0 weak).
  • Sandbox / CI / cron — GitHub Actions on a public repo does triple duty (ephemeral isolated VM sandbox + regression gate + scheduled sweep) at $0.
  • Observability — OpenTelemetry GenAI spans (invoke_agent → retrieval / chat / execute_tool, gen_ai.usage.*, a gen_ai.evaluation.result pass/fail verdict) exported to Langfuse Cloud; every run is a clickable trace.

Two gates, two jobs

The deterministic gold-integrity gate (does the harness itself still work?) is kept separate from the stochastic regression gate (did a change make the agent meaningfully worse?) — because gold grading is deterministic and must never be averaged with noisy per-seed runs.

flowchart TD
    PR["Pull request / commit"] --> GG["Gold-integrity gate<br/>deterministic · $0 · re-grade all gold patches"]
    GG -->|"any gold task unresolved"| F1["fail — the harness broke"]
    GG -->|"all gold tasks resolved"| OK1["harness intact"]

    CRON["Scheduled multi-seed sweep"] --> SEEDS["run the agent × N seeds<br/>→ one resolution rate per seed"]
    SEEDS --> RG["Regression gate<br/>small-sample CI (Student-t / Wilson)"]
    BASE["baseline_scores.json<br/>per-seed reference"] --> RG
    RG -->|"candidate CI upper bound &lt; baseline CI lower bound"| F2["❌ fail — real regression"]
    RG -->|"overlapping · equal · improved"| OK2["✓ no regression"]

    style GG stroke:#4cc2ff,stroke-width:2px
    style RG stroke:#f0883e,stroke-width:2px

Quickstart

Prereq: uv (Python 3.12 is provisioned for you) — curl -LsSf https://astral.sh/uv/install.sh | sh.

git clone https://github.com/ahmedEid1/forgejudge && cd forgejudge
uv sync                       # Python 3.12, deps via uv

# Run the deterministic harness self-test (no API key, no network):
uv run python -m forgejudge.harness.runner_actions --patch-source gold   # 18/18 resolved

# Solve a task with a free model and grade it.
# Needs a (free) Groq key. Either export it, or put it in .env and pass --env-file:
#   export GROQ_API_KEY=...                     # or
#   cp .env.example .env && edit GROQ_API_KEY   # then: uv run --env-file .env python - <<'PY'
uv run python - <<'PY'
from forgejudge.golden.loader import load_tasks
from forgejudge.agent.solver import solve
from forgejudge.harness.grade import grade
task = {t.instance_id: t for t in load_tasks("golden/dataset.jsonl")}["fixture-semver-001"]
res = solve(task, run_id="demo", budget_usd=0.10, seed=0)
print(res.status, "→ resolved:", grade(task, res.patch).resolved)
PY

Fast tests: uv run pytest -m "not slow". Full golden validation + mutation hardening: uv run pytest -m slow. Sweep the leaderboard: uv run python -m forgejudge.eval.sweep --model groq/llama-3.3-70b-versatile --seeds 0,1,2. See CONTRIBUTING.md for the full pytest marker map and dev workflow.

Install

Working on the agent/harness itself? Clone and uv sync (above). To consume ForgeJudge as a package:

# Library + the `forgejudge` CLI (selftest / mcp / info):
pip install forgejudge
forgejudge selftest           # deterministic harness check — 18/18 resolved, no key
forgejudge mcp                # MCP server over stdio (needs the [mcp] extra)

# Zero-install MCP server (no venv to manage) — for an MCP client config:
uvx --from "forgejudge[mcp]" forgejudge mcp

Optional extras (installed only when you need them):

ExtraPulls inFor
forgejudge[harness]swebenchthe swebench-equivalence grading check
forgejudge[mcp]fastmcpthe MCP server (forgejudge mcp)
forgejudge[playground]fastapi, uvicorn, httpxthe guarded live playground API
pip install "forgejudge[mcp]"            # one extra
pip install "forgejudge[harness,mcp]"    # several

forgejudge selftest and forgejudge info work with the base install — no extras, no API key, no network.

Six objections, pre-empted

  1. "Your benchmark is contaminated / cherry-picked." The golden set is freshly authored / post-cutoff, sourced only from the author's own repos + fixtures (no third-party leak surface), and mutation-hardened so a wrong patch can't pass. SWE-bench Verified is now widely held contaminated — OpenAI stopped reporting it (2026-02); >32% of "passed" cases leaked the solution and ~31% passed on weak tests. Decontamination here is a documented, tested property — not a footnote.
  2. "Thin wrapper around an LLM / a framework." The orchestrator is hand-rolled (no LangChain): the control loop, the sandbox-and-score harness, the cheat-resistant grader, the mutation hardener, the OTel instrumentation, and the multi-seed CI gate are the work.
  3. "Your resolution rate is low vs SOTA." SOTA is ~88–94% with premium models and budgets; a $0 free-model number is modest on purpose. The deliverable is the engineered system; the model-swap comparison (score rises with a better model, harness fixed) is the proof.
  4. "Is it actually autonomous or staged?" Every run has a public OpenTelemetry/Langfuse trace and a deterministic, reproducible score. The replay-first playground demos a real solve without exposing cost/abuse surface.
  5. "Three agent projects — one-trick pony?" One eval methodology — golden set + judge + traces + CI gate — across three domains at rising autonomy (Lumen → Thoth → ForgeJudge).
  6. Determinism. temperature=0 does not guarantee determinism (pass@1 varies 2–6pp). The scorer is fully deterministic; the gate is multi-seed (fail only when the candidate's CI upper bound is below the baseline's CI lower bound), so flaky single runs don't break the build.

Repository layout

PathWhat
forgejudge/golden/golden-set loader, fixture contract, dataset builder, mutation hardener
forgejudge/harness/deterministic grade(), cheat-resistant runner, swebench-equivalence check, sandbox executor
forgejudge/agent/localize → repair → validate solve loop, critic
forgejudge/llm/role-based LiteLLM router with fallback + cost accounting
forgejudge/obs/OpenTelemetry GenAI tracing → Langfuse / Phoenix
forgejudge/eval/scheduled sweep, multi-seed regression gate, LLM-as-judge + Cohen's κ
forgejudge/store/Neon (Postgres + pgvector) run store + leaderboard query
golden/dataset.jsonlcanonical golden set (one Task per line)
.github/workflows/ci, eval (sandbox), sweep (cron), gate (regression)

License

MIT © 2026 Ahmed Hobeishy. Imports and attributes the MIT-licensed swebench grading harness.

Featured
CodeRabbit
CodeRabbit
AI writes the code. CodeRabbit catches the slop.
Try For Free →
Keep your Mac awake
Keep your Mac awake
Keep your Mac awake while Claude Code and 40+ AI agents run. Sleeps when they're idle.
One time payment $9 →
Context.devContext.dev
Context.dev
Integrate web data into your AI product. One API to scrape website & brand data.
Get API Key Now →
Make your agent a DeFi expert
Make your agent a DeFi expert
Agent, run crypto. Access onchain data & trade routes via 1inch.
Install now →
Make money from your Skills
Make money from your Skills
On Capafy, your Skill runs online 24/7 as an agent product, and you get paid every time someone uses it.
Start earning →
AppSignal
AppSignal
Monitor with ease. Code with confidence.
Start Free Trial →
Categories
Monitoring & Observability
Registryactive
Packageforgejudge
TransportSTDIO
UpdatedMay 30, 2026
View on GitHub

Related Monitoring & Observability MCP Servers

View all →
Mcp Observability

io.github.infoinlet-marketplace/mcp-observability

Observability for incident agents — query Loki (LogQL), Prometheus (PromQL), Elasticsearch.
Monitor

betterdb-inc/monitor

BetterDB MCP server - Valkey observability for Claude Code and other MCP clients
1.1k
Datadog

com.mcparmory/datadog

Monitor infrastructure, manage agents and deployments, track metrics, logs, and events
25
Observability Mcp

thotischner/observability-mcp

Unified observability gateway for AI agents — Prometheus, Loki & more, with anomaly detection.
5
Datadog Mcp

io.github.tantiope/datadog-mcp

Full Datadog API access: monitors, logs, metrics, traces, dashboards, and observability tools
4
Datadog

io.github.us-all/datadog

Datadog MCP — 165 tools for metrics, monitors, logs, APM, RUM, incidents, CI/CD, fleet
1