Forgejudge

STDIOregistry active

Summary

Gives Claude programmatic access to ForgeJudge's autonomous coding agent evaluation infrastructure. You get operations to run the solver against tasks from the golden set, grade patches with the deterministic harness, fetch leaderboard results, and query per-run traces from the Langfuse observability backend. The golden set is 18 contamination-resistant make-CI-green tasks, each mutation-hardened to catch wrong fixes. Useful when you're building or tuning agentic coding workflows and need reproducible benchmarking with execution-as-judge grading, or when you want Claude to help analyze regression patterns across model swaps or seed sweeps. The harness runs patches in sandboxed GitHub Actions VMs and returns strict RESOLVED verdicts based on FAIL_TO_PASS and PASS_TO_PASS test outcomes.

CodeRabbit

AI writes the code. CodeRabbit catches the slop.

Try For Free →

Keep your Mac awake

Keep your Mac awake while Claude Code and 40+ AI agents run. Sleeps when they're idle.

One time payment $9 →

Context.dev

Integrate web data into your AI product. One API to scrape website & brand data.

Get API Key Now →

Make your agent a DeFi expert

Agent, run crypto. Access onchain data & trade routes via 1inch.

Install now →

Make money from your Skills

On Capafy, your Skill runs online 24/7 as an agent product, and you get paid every time someone uses it.

Start earning →

AppSignal

Monitor with ease. Code with confidence.

Start Free Trial →

CodeRabbit

AI writes the code. CodeRabbit catches the slop.

Try For Free →

Keep your Mac awake

Keep your Mac awake while Claude Code and 40+ AI agents run. Sleeps when they're idle.

One time payment $9 →

Context.dev

Integrate web data into your AI product. One API to scrape website & brand data.

Get API Key Now →

Make your agent a DeFi expert

Agent, run crypto. Access onchain data & trade routes via 1inch.

Install now →

Make money from your Skills

On Capafy, your Skill runs online 24/7 as an agent product, and you get paid every time someone uses it.

Start earning →

AppSignal

Monitor with ease. Code with confidence.

Start Free Trial →

ForgeJudge

An open, always-on leaderboard and CI gate for autonomous coding agents — every patch runs in a sandbox, every run has a public trace, every regression fails the build.

▶ Live leaderboard: forgejudge.ahmedhobeishy.tech · playground · methodology · model swap · MCP registry

Current numbers (hidden-test = the agent never sees the failing test; $0 free tier; same harness, swap the model; 18 tasks × 3 seeds = 54 runs/model, 162 total):

Model pass@1 pass@3
gpt-oss-120b 90.7% 100%
llama-3.3-70b 88.9% 94.4%
llama-3.1-8b 48.1% 66.7%

The score rises with the better model while the harness stays fixed (model-swap proof), and pass@3 > pass@1 shows real run-to-run variance — which is exactly why the CI gate is multi-seed. Every run deep-links its Langfuse trace.

ForgeJudge is the only open-source autonomous software-engineering agent that proves its quality in public on every commit: a hand-rolled single-agent solver, a deterministic execution-as-judge harness, an always-on leaderboard with per-run traces, and a CI gate that blocks regressions — all on a $0 / self-hostable stack against a contamination-resistant, intrinsically-verifiable golden set.

The engineered harness, observability, and gate are the deliverable — not a high resolution rate. A $0 free-model agent will score modestly by design. We prove value with a model-swap comparison: the score rises with a better model while the harness stays fixed.

How it works

flowchart TD
    G["Golden set · Git-canonical<br/>18 intrinsically-verifiable, mutation-hardened<br/>make-CI-green tasks"]

    subgraph SOLVER["Single-agent solver"]
        direction LR
        L["localize<br/>(BM25)"] --> R["repair<br/>(LLM router · critic · syntax edit-gate)"] --> V["validate<br/>(run tests)"]
    end

    G --> SOLVER
    SOLVER --> PATCH["unified diff"]
    SOLVER -. "every step traced" .-> TRACE["OTel → Langfuse<br/>per-run public trace"]

    PATCH --> H["Deterministic harness, in a sandbox<br/>apply test_patch + candidate patch · run F2P / P2P<br/><b>RESOLVED iff</b> every FAIL_TO_PASS passes AND every PASS_TO_PASS stays green<br/>swebench-equivalent · stricter on skips · cheat-resistant"]

    H --> STORE["Run store<br/>Neon + pgvector"]
    STORE --> LB["Leaderboard<br/>pass@1 / pass@3 · cost · tokens · trace link"]
    H --> GATE["Multi-seed CI gate<br/>a PR that lowers the resolution rate fails the build"]

    style G stroke:#3fb950,stroke-width:2px
    style H stroke:#4cc2ff,stroke-width:2px
    style GATE stroke:#f0883e,stroke-width:2px

Solver — a single, phase-structured loop (localize → repair → validate), not a multi-agent swarm: cheapest, most deterministic, most debuggable. BM25 localization, an LLM router over free tiers, a syntax edit-gate, a cheap critic pre-filter, and a cost/step budget with autosubmit.
Harness — encodes the SWE-bench RESOLVED_FULL rule and is verified equivalent to swebench.harness.grading on real PASS/FAIL/ERROR/XFAIL outcomes in CI — and deliberately stricter on a skipped FAIL_TO_PASS: swebench 4.1.0 rates a skipped oracle test RESOLVED_FULL (a skip is neither success nor failure), so a patch that makes the oracle skip rather than run grades as resolved. ForgeJudge counts a skip as not-passed, closing that cheat vector. Patches are also cheat-resistant: the canonical test files are restored before grading, so a patch can't neuter the oracle.
Golden set — 15 purpose-built post-cutoff fixtures + 3 tasks mined from the author's own repos (real commit SHAs, MIT/own license — zero leak/copyleft risk). Each is mutation-hardened: a wrong fix to the patched region is caught (16 mutation-hardened at mean score 0.94; 2 inconclusive for regex/string code; 0 weak).
Sandbox / CI / cron — GitHub Actions on a public repo does triple duty (ephemeral isolated VM sandbox + regression gate + scheduled sweep) at $0.
Observability — OpenTelemetry GenAI spans (invoke_agent → retrieval / chat / execute_tool, gen_ai.usage.*, a gen_ai.evaluation.result pass/fail verdict) exported to Langfuse Cloud; every run is a clickable trace.

Two gates, two jobs

The deterministic gold-integrity gate (does the harness itself still work?) is kept separate from the stochastic regression gate (did a change make the agent meaningfully worse?) — because gold grading is deterministic and must never be averaged with noisy per-seed runs.

flowchart TD
    PR["Pull request / commit"] --> GG["Gold-integrity gate<br/>deterministic · $0 · re-grade all gold patches"]
    GG -->|"any gold task unresolved"| F1["fail — the harness broke"]
    GG -->|"all gold tasks resolved"| OK1["harness intact"]

    CRON["Scheduled multi-seed sweep"] --> SEEDS["run the agent × N seeds<br/>→ one resolution rate per seed"]
    SEEDS --> RG["Regression gate<br/>small-sample CI (Student-t / Wilson)"]
    BASE["baseline_scores.json<br/>per-seed reference"] --> RG
    RG -->|"candidate CI upper bound &lt; baseline CI lower bound"| F2["❌ fail — real regression"]
    RG -->|"overlapping · equal · improved"| OK2["✓ no regression"]

    style GG stroke:#4cc2ff,stroke-width:2px
    style RG stroke:#f0883e,stroke-width:2px

Quickstart

Prereq: uv (Python 3.12 is provisioned for you) — curl -LsSf https://astral.sh/uv/install.sh | sh.

git clone https://github.com/ahmedEid1/forgejudge && cd forgejudge
uv sync                       # Python 3.12, deps via uv

# Run the deterministic harness self-test (no API key, no network):
uv run python -m forgejudge.harness.runner_actions --patch-source gold   # 18/18 resolved

# Solve a task with a free model and grade it.
# Needs a (free) Groq key. Either export it, or put it in .env and pass --env-file:
#   export GROQ_API_KEY=...                     # or
#   cp .env.example .env && edit GROQ_API_KEY   # then: uv run --env-file .env python - <<'PY'
uv run python - <<'PY'
from forgejudge.golden.loader import load_tasks
from forgejudge.agent.solver import solve
from forgejudge.harness.grade import grade
task = {t.instance_id: t for t in load_tasks("golden/dataset.jsonl")}["fixture-semver-001"]
res = solve(task, run_id="demo", budget_usd=0.10, seed=0)
print(res.status, "→ resolved:", grade(task, res.patch).resolved)
PY

Fast tests: uv run pytest -m "not slow". Full golden validation + mutation hardening: uv run pytest -m slow. Sweep the leaderboard: uv run python -m forgejudge.eval.sweep --model groq/llama-3.3-70b-versatile --seeds 0,1,2. See CONTRIBUTING.md for the full pytest marker map and dev workflow.

Install

Working on the agent/harness itself? Clone and uv sync (above). To consume ForgeJudge as a package:

# Library + the `forgejudge` CLI (selftest / mcp / info):
pip install forgejudge
forgejudge selftest           # deterministic harness check — 18/18 resolved, no key
forgejudge mcp                # MCP server over stdio (needs the [mcp] extra)

# Zero-install MCP server (no venv to manage) — for an MCP client config:
uvx --from "forgejudge[mcp]" forgejudge mcp

Optional extras (installed only when you need them):

Extra	Pulls in	For
`forgejudge[harness]`	`swebench`	the swebench-equivalence grading check
`forgejudge[mcp]`	`fastmcp`	the MCP server (`forgejudge mcp`)
`forgejudge[playground]`	`fastapi`, `uvicorn`, `httpx`	the guarded live playground API

pip install "forgejudge[mcp]"            # one extra
pip install "forgejudge[harness,mcp]"    # several

forgejudge selftest and forgejudge info work with the base install — no extras, no API key, no network.

Six objections, pre-empted

"Your benchmark is contaminated / cherry-picked." The golden set is freshly authored / post-cutoff, sourced only from the author's own repos + fixtures (no third-party leak surface), and mutation-hardened so a wrong patch can't pass. SWE-bench Verified is now widely held contaminated — OpenAI stopped reporting it (2026-02); >32% of "passed" cases leaked the solution and ~31% passed on weak tests. Decontamination here is a documented, tested property — not a footnote.
"Thin wrapper around an LLM / a framework." The orchestrator is hand-rolled (no LangChain): the control loop, the sandbox-and-score harness, the cheat-resistant grader, the mutation hardener, the OTel instrumentation, and the multi-seed CI gate are the work.
"Your resolution rate is low vs SOTA." SOTA is ~88–94% with premium models and budgets; a $0 free-model number is modest on purpose. The deliverable is the engineered system; the model-swap comparison (score rises with a better model, harness fixed) is the proof.
"Is it actually autonomous or staged?" Every run has a public OpenTelemetry/Langfuse trace and a deterministic, reproducible score. The replay-first playground demos a real solve without exposing cost/abuse surface.
"Three agent projects — one-trick pony?" One eval methodology — golden set + judge + traces + CI gate — across three domains at rising autonomy (Lumen → Thoth → ForgeJudge).
Determinism. temperature=0 does not guarantee determinism (pass@1 varies 2–6pp). The scorer is fully deterministic; the gate is multi-seed (fail only when the candidate's CI upper bound is below the baseline's CI lower bound), so flaky single runs don't break the build.

Repository layout

Path	What
`forgejudge/golden/`	golden-set loader, fixture contract, dataset builder, mutation hardener
`forgejudge/harness/`	deterministic `grade()`, cheat-resistant runner, swebench-equivalence check, sandbox executor
`forgejudge/agent/`	`localize → repair → validate` solve loop, critic
`forgejudge/llm/`	role-based LiteLLM router with fallback + cost accounting
`forgejudge/obs/`	OpenTelemetry GenAI tracing → Langfuse / Phoenix
`forgejudge/eval/`	scheduled sweep, multi-seed regression gate, LLM-as-judge + Cohen's κ
`forgejudge/store/`	Neon (Postgres + pgvector) run store + leaderboard query
`golden/dataset.jsonl`	canonical golden set (one `Task` per line)
`.github/workflows/`	`ci`, `eval` (sandbox), `sweep` (cron), `gate` (regression)

License

Featured

CodeRabbit

AI writes the code. CodeRabbit catches the slop.

Try For Free →

Keep your Mac awake

Keep your Mac awake while Claude Code and 40+ AI agents run. Sleeps when they're idle.

One time payment $9 →

Context.dev

Integrate web data into your AI product. One API to scrape website & brand data.

Get API Key Now →

Make your agent a DeFi expert

Agent, run crypto. Access onchain data & trade routes via 1inch.

Install now →

Make money from your Skills

On Capafy, your Skill runs online 24/7 as an agent product, and you get paid every time someone uses it.

Start earning →

AppSignal

Monitor with ease. Code with confidence.

Start Free Trial →

ForgeJudge

An open, always-on leaderboard and CI gate for autonomous coding agents — every patch runs in a sandbox, every run has a public trace, every regression fails the build.

▶ Live leaderboard: forgejudge.ahmedhobeishy.tech · playground · methodology · model swap · MCP registry

Current numbers (hidden-test = the agent never sees the failing test; $0 free tier; same harness, swap the model; 18 tasks × 3 seeds = 54 runs/model, 162 total):

Model pass@1 pass@3
gpt-oss-120b 90.7% 100%
llama-3.3-70b 88.9% 94.4%
llama-3.1-8b 48.1% 66.7%

The score rises with the better model while the harness stays fixed (model-swap proof), and pass@3 > pass@1 shows real run-to-run variance — which is exactly why the CI gate is multi-seed. Every run deep-links its Langfuse trace.

The engineered harness, observability, and gate are the deliverable — not a high resolution rate. A $0 free-model agent will score modestly by design. We prove value with a model-swap comparison: the score rises with a better model while the harness stays fixed.

How it works

flowchart TD
    G["Golden set · Git-canonical<br/>18 intrinsically-verifiable, mutation-hardened<br/>make-CI-green tasks"]

    subgraph SOLVER["Single-agent solver"]
        direction LR
        L["localize<br/>(BM25)"] --> R["repair<br/>(LLM router · critic · syntax edit-gate)"] --> V["validate<br/>(run tests)"]
    end

    G --> SOLVER
    SOLVER --> PATCH["unified diff"]
    SOLVER -. "every step traced" .-> TRACE["OTel → Langfuse<br/>per-run public trace"]

    PATCH --> H["Deterministic harness, in a sandbox<br/>apply test_patch + candidate patch · run F2P / P2P<br/><b>RESOLVED iff</b> every FAIL_TO_PASS passes AND every PASS_TO_PASS stays green<br/>swebench-equivalent · stricter on skips · cheat-resistant"]

    H --> STORE["Run store<br/>Neon + pgvector"]
    STORE --> LB["Leaderboard<br/>pass@1 / pass@3 · cost · tokens · trace link"]
    H --> GATE["Multi-seed CI gate<br/>a PR that lowers the resolution rate fails the build"]

    style G stroke:#3fb950,stroke-width:2px
    style H stroke:#4cc2ff,stroke-width:2px
    style GATE stroke:#f0883e,stroke-width:2px

Solver — a single, phase-structured loop (localize → repair → validate), not a multi-agent swarm: cheapest, most deterministic, most debuggable. BM25 localization, an LLM router over free tiers, a syntax edit-gate, a cheap critic pre-filter, and a cost/step budget with autosubmit.
Harness — encodes the SWE-bench RESOLVED_FULL rule and is verified equivalent to swebench.harness.grading on real PASS/FAIL/ERROR/XFAIL outcomes in CI — and deliberately stricter on a skipped FAIL_TO_PASS: swebench 4.1.0 rates a skipped oracle test RESOLVED_FULL (a skip is neither success nor failure), so a patch that makes the oracle skip rather than run grades as resolved. ForgeJudge counts a skip as not-passed, closing that cheat vector. Patches are also cheat-resistant: the canonical test files are restored before grading, so a patch can't neuter the oracle.
Golden set — 15 purpose-built post-cutoff fixtures + 3 tasks mined from the author's own repos (real commit SHAs, MIT/own license — zero leak/copyleft risk). Each is mutation-hardened: a wrong fix to the patched region is caught (16 mutation-hardened at mean score 0.94; 2 inconclusive for regex/string code; 0 weak).
Sandbox / CI / cron — GitHub Actions on a public repo does triple duty (ephemeral isolated VM sandbox + regression gate + scheduled sweep) at $0.
Observability — OpenTelemetry GenAI spans (invoke_agent → retrieval / chat / execute_tool, gen_ai.usage.*, a gen_ai.evaluation.result pass/fail verdict) exported to Langfuse Cloud; every run is a clickable trace.

Two gates, two jobs

flowchart TD
    PR["Pull request / commit"] --> GG["Gold-integrity gate<br/>deterministic · $0 · re-grade all gold patches"]
    GG -->|"any gold task unresolved"| F1["fail — the harness broke"]
    GG -->|"all gold tasks resolved"| OK1["harness intact"]

    CRON["Scheduled multi-seed sweep"] --> SEEDS["run the agent × N seeds<br/>→ one resolution rate per seed"]
    SEEDS --> RG["Regression gate<br/>small-sample CI (Student-t / Wilson)"]
    BASE["baseline_scores.json<br/>per-seed reference"] --> RG
    RG -->|"candidate CI upper bound &lt; baseline CI lower bound"| F2["❌ fail — real regression"]
    RG -->|"overlapping · equal · improved"| OK2["✓ no regression"]

    style GG stroke:#4cc2ff,stroke-width:2px
    style RG stroke:#f0883e,stroke-width:2px

Quickstart

Prereq: uv (Python 3.12 is provisioned for you) — curl -LsSf https://astral.sh/uv/install.sh | sh.

git clone https://github.com/ahmedEid1/forgejudge && cd forgejudge
uv sync                       # Python 3.12, deps via uv

# Run the deterministic harness self-test (no API key, no network):
uv run python -m forgejudge.harness.runner_actions --patch-source gold   # 18/18 resolved

# Solve a task with a free model and grade it.
# Needs a (free) Groq key. Either export it, or put it in .env and pass --env-file:
#   export GROQ_API_KEY=...                     # or
#   cp .env.example .env && edit GROQ_API_KEY   # then: uv run --env-file .env python - <<'PY'
uv run python - <<'PY'
from forgejudge.golden.loader import load_tasks
from forgejudge.agent.solver import solve
from forgejudge.harness.grade import grade
task = {t.instance_id: t for t in load_tasks("golden/dataset.jsonl")}["fixture-semver-001"]
res = solve(task, run_id="demo", budget_usd=0.10, seed=0)
print(res.status, "→ resolved:", grade(task, res.patch).resolved)
PY

Install

Working on the agent/harness itself? Clone and uv sync (above). To consume ForgeJudge as a package:

# Library + the `forgejudge` CLI (selftest / mcp / info):
pip install forgejudge
forgejudge selftest           # deterministic harness check — 18/18 resolved, no key
forgejudge mcp                # MCP server over stdio (needs the [mcp] extra)

# Zero-install MCP server (no venv to manage) — for an MCP client config:
uvx --from "forgejudge[mcp]" forgejudge mcp

Optional extras (installed only when you need them):

Extra	Pulls in	For
`forgejudge[harness]`	`swebench`	the swebench-equivalence grading check
`forgejudge[mcp]`	`fastmcp`	the MCP server (`forgejudge mcp`)
`forgejudge[playground]`	`fastapi`, `uvicorn`, `httpx`	the guarded live playground API

pip install "forgejudge[mcp]"            # one extra
pip install "forgejudge[harness,mcp]"    # several

forgejudge selftest and forgejudge info work with the base install — no extras, no API key, no network.

Six objections, pre-empted

"Your benchmark is contaminated / cherry-picked." The golden set is freshly authored / post-cutoff, sourced only from the author's own repos + fixtures (no third-party leak surface), and mutation-hardened so a wrong patch can't pass. SWE-bench Verified is now widely held contaminated — OpenAI stopped reporting it (2026-02); >32% of "passed" cases leaked the solution and ~31% passed on weak tests. Decontamination here is a documented, tested property — not a footnote.
"Thin wrapper around an LLM / a framework." The orchestrator is hand-rolled (no LangChain): the control loop, the sandbox-and-score harness, the cheat-resistant grader, the mutation hardener, the OTel instrumentation, and the multi-seed CI gate are the work.
"Your resolution rate is low vs SOTA." SOTA is ~88–94% with premium models and budgets; a $0 free-model number is modest on purpose. The deliverable is the engineered system; the model-swap comparison (score rises with a better model, harness fixed) is the proof.
"Is it actually autonomous or staged?" Every run has a public OpenTelemetry/Langfuse trace and a deterministic, reproducible score. The replay-first playground demos a real solve without exposing cost/abuse surface.
"Three agent projects — one-trick pony?" One eval methodology — golden set + judge + traces + CI gate — across three domains at rising autonomy (Lumen → Thoth → ForgeJudge).
Determinism. temperature=0 does not guarantee determinism (pass@1 varies 2–6pp). The scorer is fully deterministic; the gate is multi-seed (fail only when the candidate's CI upper bound is below the baseline's CI lower bound), so flaky single runs don't break the build.

Repository layout

Path	What
`forgejudge/golden/`	golden-set loader, fixture contract, dataset builder, mutation hardener
`forgejudge/harness/`	deterministic `grade()`, cheat-resistant runner, swebench-equivalence check, sandbox executor
`forgejudge/agent/`	`localize → repair → validate` solve loop, critic
`forgejudge/llm/`	role-based LiteLLM router with fallback + cost accounting
`forgejudge/obs/`	OpenTelemetry GenAI tracing → Langfuse / Phoenix
`forgejudge/eval/`	scheduled sweep, multi-seed regression gate, LLM-as-judge + Cohen's κ
`forgejudge/store/`	Neon (Postgres + pgvector) run store + leaderboard query
`golden/dataset.jsonl`	canonical golden set (one `Task` per line)
`.github/workflows/`	`ci`, `eval` (sandbox), `sweep` (cron), `gate` (regression)

Forgejudge

ForgeJudge

How it works

Two gates, two jobs

Quickstart

Install

Six objections, pre-empted

Repository layout

License

Forgejudge

ForgeJudge

How it works

Two gates, two jobs

Quickstart

Install

Six objections, pre-empted

Repository layout

License

Related Monitoring & Observability MCP Servers

Related Monitoring & Observability MCP Servers

Model	pass@1	pass@3
`gpt-oss-120b`	90.7%	100%
`llama-3.3-70b`	88.9%	94.4%
`llama-3.1-8b`	48.1%	66.7%