This server wraps the GoldenMatch entity resolution toolkit, which finds duplicate records across messy datasets with a 97.2% F1 score out of the box. You get access to deduplication, clustering, and the broader Golden Suite pipeline (InferMap for schema alignment, GoldenCheck for profiling, GoldenFlow for standardization). The MCP layer exposes 36+ tools including auto_configure for adaptive tuning and controller_telemetry for inspecting clustering decisions. Useful when you need to clean customer lists, merge data sources, or resolve entities across organizations without writing custom fuzzy matching logic. The zero-config defaults work immediately, and a learning memory system stops asking for the same correction twice across runs.
claude mcp add --transport http goldenmatch https://goldenmatch-mcp-production.up.railway.app/mcp/
Run this in your terminal. Add --scope user to make it available in every project.
Review the command, arguments, and environment values before installing — MCP servers run with your local permissions.
Verified live against the running server on Jun 10, 2026.
An asterisk (*) marks a required parameter.

| Tool | Description | Parameters |
|---|---|---|
| analyze_data | Profile data, detect domain, recommend ER strategy. | file_path* (string) |
| auto_configure | Run AutoConfigController on a CSV; return the committed GoldenMatchConfig (incl. negative_evidence / Path Y when chosen) plus telemetry: stop_reason, health, decision trace, indicator column priors. Programmatic equivalent of `goldenmatch autoconfig`. | file_path* (string), constraints (object) |
| controller_telemetry | Return the AutoConfigController telemetry from the most recent `auto_configure` or `agent_deduplicate` call in this MCP session. Same JSON shape as the web /api/v1/controller/telemetry endpoint. | none |
| agent_deduplicate | Run full ER pipeline with confidence gating and reasoning. | config (object), file_path* (string) |
| agent_match_sources | Match two files with intelligent strategy selection. | config (object), file_a* (string), file_b* (string) |
| agent_explain_pair | Natural language explanation for a record pair. | exact (array), fuzzy (object), record_a* (object), record_b* (object) |
| agent_explain_cluster | Explain why records are in the same cluster. | cluster_id* (integer) |
| agent_review_queue | Get borderline pairs awaiting approval. | job_name* (string) |
| agent_approve_reject | Approve or reject a review queue pair. | id_a* (integer), id_b* (integer), reason (string), decision* (string), job_name* (string), decided_by* (string) |
| agent_compare_strategies | Compare ER strategies on your data. | file_path* (string), ground_truth (string) |
| suggest_pprl | Check if data needs privacy-preserving matching. | file_path* (string) |
| scan_quality | Run GoldenCheck data quality scan on a CSV file. Returns issues found (encoding errors, Unicode problems, format violations) without applying fixes. Requires goldencheck: pip install goldenmatch[quality] | domain (string), file_path* (string) |
| fix_quality | Run GoldenCheck scan and apply fixes to a CSV file. Returns the fixed data summary and a manifest of all fixes applied. Requires goldencheck: pip install goldenmatch[quality] | domain (string), fix_mode (string: safe · moderate, default: safe), file_path* (string), output_path (string) |
| run_transforms | Run GoldenFlow data transforms on a CSV file. Normalizes phone numbers (E.164), dates (ISO), categorical spelling, and Unicode issues. Returns a manifest of transforms applied. Requires goldenflow: pip install goldenmatch[transform] | file_path* (string), output_path (string) |
| list_corrections | List stored Learning Memory corrections, optionally filtered by dataset. Returns id_a, id_b, decision, source, trust, reason, matchkey_name, dataset, original_score, created_at. | path (string), dataset (string) |
| add_correction | Add a pair correction to Learning Memory. Source is set to 'agent' with trust=0.5 (lower than human steward decisions which are 1.0). Pair (id_a, id_b) is canonicalized to (min, max) before storage. | id_a* (integer), id_b* (integer), path (string), reason (string), dataset* (string), decision* (string: approve · reject), matchkey_name (string) |
| learn_thresholds | Force a MemoryLearner pass over accumulated corrections. Returns the list of LearnedAdjustments produced (matchkey_name, threshold, sample_size, learned_at). Requires >= 10 corrections per matchkey before threshold tuning fires; otherwise returns an empty list. | path (string), matchkey_name (string) |
| memory_stats | Return Learning Memory status: total correction count, last learn time, and current learned adjustments. Cheap; safe for status checks. | path (string) |
| memory_export | Return all corrections as a list of dicts (CSV-shaped). Caller is responsible for writing the file. Optionally filter by dataset. | path (string), dataset (string) |
| identity_resolve | Resolve a record_id to its durable identity. Returns the full identity view (members, evidence edges, recent events) or null when no identity exists for that record. | path (string), record_id* (string) |
| identity_list | List identities, optionally filtered by dataset/status. | path (string), limit (integer), offset (integer), status (string), dataset (string) |
| identity_history | Return the temporal event log for an identity. | path (string), limit (integer), entity_id* (string) |
| identity_conflicts | List evidence edges marked `conflicts_with`. | path (string), dataset (string) |
| identity_merge | Manually merge two identities. All records from `absorb_entity_id` are reassigned to `keep_entity_id`. | path (string), reason (string), keep_entity_id* (string), absorb_entity_id* (string) |
| identity_split | Split a subset of records off an identity into a brand-new identity. The original keeps the remaining records. | path (string), reason (string), entity_id* (string), record_ids* (array) |
| get_stats | Get dataset statistics: record count, cluster count, match rate, cluster sizes. | none |
| find_duplicates | Find duplicate matches for a record. Provide field values to search against the loaded dataset. | top_k (integer), record* (object) |
| explain_match | Explain why two records match or don't match. Shows per-field score breakdown. | record_a* (object), record_b* (object) |
| list_clusters | List duplicate clusters found in the dataset. Returns cluster IDs, sizes, and member counts. | limit (integer), min_size (integer) |
| get_cluster | Get details of a specific cluster: all member records and their field values. | cluster_id* (integer) |
| get_golden_record | Get the merged golden (canonical) record for a cluster. | cluster_id* (integer) |
| match_record | Match a single record against the loaded dataset in real-time. Paste a record's fields and instantly see if it matches any existing record. Uses the configured matchkeys, scorers, and thresholds. Example: {"name": "John Smith", "email": "john@test.com", "zip": "10001"} | top_k (integer), record* (object), threshold (number) |
| unmerge_record | Remove a record from its cluster. The record becomes a singleton. Remaining cluster members are re-clustered using stored pair scores. Use this to fix bad merges. | record_id* (integer) |
| shatter_cluster | Break an entire cluster into individual records. All members become singletons. Use when a cluster is completely wrong. | cluster_id* (integer) |
| suggest_config | Analyze bad merges and suggest config changes. Provide examples of incorrect merges (pairs that should NOT have matched) and GoldenMatch will identify which fields/thresholds to tighten. Example: [{"record_a": {...}, "record_b": {...}, "reason": "different people"}] | bad_merges* (array) |
| profile_data | Get data quality profile: column types, null rates, unique counts, sample values. | none |
| export_results | Export matching results to a file (CSV or JSON). | format (string: csv · json, default: csv), output_path* (string) |
| list_domains | List available domain extraction rulebooks (built-in + user-defined). | none |
| create_domain | Create a custom domain extraction rulebook. Define patterns for a specific data domain (medical devices, automotive parts, real estate, etc.). | name* (string), scope (string: local · global, default: local), signals* (array), stop_words (array), brand_patterns (array), attribute_patterns (object), identifier_patterns (object) |
| test_domain | Test a domain extraction rulebook against sample records. Shows what features would be extracted from the loaded data. | domain_name* (string), sample_size (integer) |
| pprl_auto_config | Analyze the loaded dataset and recommend optimal PPRL (privacy-preserving record linkage) configuration. Returns recommended fields, bloom filter parameters, threshold, and explanation. | use_llm (boolean), security_level (string: standard · high · paranoid, default: high) |
| pprl_link | Run privacy-preserving record linkage between two parties' data. Computes bloom filters, matches records without sharing raw data. Specify fields, threshold, and security level. | fields* (array), file_a* (string), file_b* (string), threshold (number), security_level (string: standard · high · paranoid, default: high) |

A polyglot data-quality and entity-resolution toolkit. Polished, opinionated, AI-native.
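The add_correction entry in the tool list says a pair is canonicalized to (min, max) before storage and that agent corrections carry trust 0.5 against 1.0 for human stewards. A minimal sketch of what that implies for a caller, assuming a plain in-memory dict as the store; `Correction`, `make_correction`, and the store layout are hypothetical names for illustration, not GoldenMatch's internals:

```python
from dataclasses import dataclass

AGENT_TRUST = 0.5  # stated in the add_correction description
HUMAN_TRUST = 1.0  # human steward decisions, per the same description


@dataclass(frozen=True)
class Correction:
    id_a: int
    id_b: int
    decision: str  # "approve" or "reject"
    dataset: str
    source: str = "agent"
    trust: float = AGENT_TRUST


def make_correction(id_a: int, id_b: int, decision: str, dataset: str) -> Correction:
    """Build a correction with the pair stored as (min, max)."""
    if decision not in ("approve", "reject"):
        raise ValueError(f"decision must be 'approve' or 'reject', got {decision!r}")
    lo, hi = min(id_a, id_b), max(id_a, id_b)
    return Correction(lo, hi, decision, dataset)


# Hypothetical store keyed on the canonical pair: (7, 3) and (3, 7)
# land on the same key, so the later correction overwrites the earlier one.
store = {}
for a, b, decision in [(7, 3, "approve"), (3, 7, "reject")]:
    c = make_correction(a, b, decision, "customers")
    store[(c.dataset, c.id_a, c.id_b)] = c
```

Canonicalizing before storage means a caller never has to remember which order a pair was first submitted in.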
GoldenCheck profiles → GoldenFlow standardizes → GoldenMatch deduplicates → GoldenAnalysis reports, all orchestrated by GoldenPipe. With InferMap for schema mapping, a Rust extension layer for Postgres / DuckDB, and optional WebAssembly acceleration behind the edge-safe TypeScript ports.
⚡ GoldenMatch scales from a CSV on your laptop to 100M+ rows on a Ray cluster — verified: 100,000,000 records deduped recall-complete (correct across any partitioning) in 9.2 min, with a 0.36 GB driver footprint.
Pair drilldown in the web workbench: cluster members, field-level diff, and a one-line NL explanation per pair. Install with pip install goldenmatch[web], then run goldenmatch serve-ui <project>.
# Headline package: dedupe a CSV in 30 seconds
pip install goldenmatch && goldenmatch dedupe customers.csv
# TypeScript / Edge runtimes
npm install goldenmatch
- 🆕 v2.3.0: Auto-enabled semantic blocking, now default-on. Text-heavy data automatically routes to SimHash-over-embeddings blocking when an embedder is reachable (a byte-identical no-op otherwise). Plus pluggable pgvector / DuckDB-HNSW vector-index backends and opt-in Fellegi-Sunter routing for no-strong-identifier datasets (GOLDENMATCH_AUTOCONFIG_ROUTE_PROBABILISTIC=1).
- v2.2.0: Semantic blocking, an opt-in recall lever for abbreviations and aliases. dedupe_df(semantic_blocking=...) unions extra candidate sources (initialism/abbreviation blocking, a business-alias canonical-form table, and an embedding ANN pass) into the pipeline. Off by default; on the abbreviation-heavy benchmark it adds +5.3pp recall at zero precision cost.
- v2.1.0: Correlated survivorship. Golden-record survivorship can now keep correlated fields (street/city/postcode) in lock-step from a single winning source instead of mixing best-per-field values across records. New FieldGroupSpec + DomainPack.groups (domain-pack schema v3, additive), an anchor/allow_fill group-winner strategy, and per-cluster provenance surfaced through lineage, explain, the MCP tools, and the review queue. Plus chunked PPRL linkage (peak memory ~9-14x lower, byte-identical) and result.native dispatch telemetry that flags a silently-slow Python fallback.
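The difference between best-per-field and correlated survivorship is easiest to see on a two-record cluster. The sketch below is a toy illustration only: `best_per_field` and `group_winner` are hypothetical helpers, the "most complete, then most recent" winner rule is an assumption, and none of this uses the FieldGroupSpec API.

```python
ADDRESS_GROUP = ("street", "city", "postcode")

cluster = [
    {"source": "crm", "street": "12 High St", "city": "Leeds",
     "postcode": "LS1 4AB", "updated": 3},
    {"source": "billing", "street": "99 Low Rd", "city": "York",
     "postcode": None, "updated": 5},
]


def best_per_field(records, fields):
    """Pick each field independently from the most recent non-null value."""
    golden = {}
    for field in fields:
        candidates = [r for r in records if r.get(field) is not None]
        golden[field] = (
            max(candidates, key=lambda r: r["updated"])[field] if candidates else None
        )
    return golden


def group_winner(records, fields):
    """Pick one winning record for the whole group and copy all fields from it."""
    def completeness(record):
        return sum(record.get(f) is not None for f in fields)

    winner = max(records, key=lambda r: (completeness(r), r["updated"]))
    return {f: winner.get(f) for f in fields}


mixed = best_per_field(cluster, ADDRESS_GROUP)
coherent = group_winner(cluster, ADDRESS_GROUP)
```

Here `mixed` pairs a York street with a Leeds postcode, an address that exists in neither source, while `coherent` keeps the three fields together from the CRM record.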
Each tool stands alone, but they compose into a single pipeline:
flowchart LR
raw([raw rows])
golden([golden records])
subgraph orchestration ["GoldenPipe orchestrates"]
direction LR
infermap[InferMap]
goldencheck[GoldenCheck]
goldenflow[GoldenFlow]
goldenmatch[GoldenMatch]
infermap --> goldencheck --> goldenflow --> goldenmatch
end
raw --> infermap
goldenmatch --> golden
| Step | Role |
|---|---|
| InferMap | schema mapping — auto-aligns columns across heterogeneous sources |
| GoldenCheck | profile + validate — encoding, format, anomaly detection |
| GoldenFlow | standardize + transform — phone, date, address, categorical normalization |
| GoldenMatch | dedupe + cluster + survivorship — fuzzy / exact / probabilistic / LLM |
| GoldenAnalysis | analysis + reporting — one exportable report over any stage's output, plus cross-run regression detection |
| GoldenPipe | orchestrator — declarative YAML pipeline wiring the steps |
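To make the stage order concrete, here is a toy stand-in for the four stages written with the standard library only. Every function below is a hypothetical simplification (the real packages do far more, and GoldenCheck profiles rather than simply dropping rows); the point is only how the output of one stage feeds the next.

```python
import re


def map_schema(rows, mapping):
    """InferMap stand-in: rename source columns to a shared schema."""
    return [{mapping.get(k, k): v for k, v in row.items()} for row in rows]


def check(rows, required):
    """GoldenCheck stand-in: keep only rows with every required field filled."""
    return [row for row in rows if all(row.get(f) for f in required)]


def standardize(rows):
    """GoldenFlow stand-in: digits-only phone, trimmed lower-case name."""
    return [
        {**row,
         "phone": re.sub(r"\D", "", row["phone"]),
         "name": row["name"].strip().lower()}
        for row in rows
    ]


def dedupe(rows, key):
    """GoldenMatch stand-in: exact match on one key, first record wins."""
    seen = {}
    for row in rows:
        seen.setdefault(row[key], row)
    return list(seen.values())


raw = [
    {"Name": " Ada Lovelace ", "Tel": "(555) 010-0199"},
    {"Name": "ada lovelace", "Tel": "555.010.0199"},
    {"Name": "Grace Hopper", "Tel": ""},
]

rows = map_schema(raw, {"Name": "name", "Tel": "phone"})
rows = check(rows, required=("name", "phone"))
rows = standardize(rows)
golden = dedupe(rows, key="phone")
```

The two Ada Lovelace rows only collapse into one because standardization ran before matching, which is the reason the stages are ordered this way.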
- historical_50k pairwise F1 0.778 vs 0.757, cluster-level B³ 0.844 vs 0.789; one shared evaluator, reproducible bake-off).
- entity_ids that survive across runs, an append-only event log, and create / absorb / merge / split semantics, surfaced on the CLI, REST, MCP, and SQL interfaces (the Identity Graph v2 feature, shipped in GoldenMatch v1.15).
- auto_configure + controller_telemetry for v1.7-v1.12 introspection.
- ControllerPanel, TUI Ctrl+A, CLI goldenmatch autoconfig, REST /autoconfig + /controller/telemetry, Postgres goldenmatch_autoconfig + gm_telemetry, DuckDB UDFs, MCP/A2A telemetry tools. One JSON shape across every interface.
- node:*-free, so they run in browsers, Cloudflare Workers, Vercel Edge, and Deno. An opt-in WebAssembly backend (await enableWasm() / enableAnalysisWasm()) swaps in the same pyo3-free Rust kernels the Python wheels and the SQL UDFs use; pure-TS stays the default and the byte-identical fallback, so default users download zero wasm bytes.
- evaluate, Fellegi-Sunter probabilistic scoring, and GoldenFlow transforms.

| Package | Lang | What it does | Install |
|---|---|---|---|
| GoldenMatch 🟡 | Python · TS | Zero-config entity resolution. Fuzzy + exact + probabilistic + LLM. Headline package. | pip install goldenmatch · npm i goldenmatch |
| GoldenCheck | Python · TS | Data-quality scanning: encoding, Unicode, format validation, anomaly detection. | pip install goldencheck · npm i goldencheck |
| GoldenFlow | Python · TS | Transforms & standardizers: phone, date, address, categorical normalization. | pip install goldenflow · npm i goldenflow |
| GoldenPipe | Python · TS | Orchestrator that wires Check → Flow → Match into one declarative pipeline. | pip install goldenpipe · npm i goldenpipe |
| InferMap | Python · TS | Schema mapping engine — auto-aligns columns across heterogeneous sources. | pip install infermap · npm i infermap |
| GoldenAnalysis | Python · TS | Cross-cutting analysis & reporting — consumes any stage's typed artifacts (or a raw DataFrame) and emits a unified, exportable AnalysisReport; optional Rust / WASM histogram+quantile kernels. | pip install goldenanalysis · npm i goldenanalysis |
| goldenmatch-extensions | Rust | Postgres extension (pgrx) + DuckDB UDFs. SQL-native fuzzy matching. | source build |
| dbt-goldensuite | dbt · Python | dbt package — dedupe + two-table match materializations (incl. zero-config Fellegi-Sunter), an ER match-quality build gate, quality-gate tests, transforms, and identity-graph reads for warehouse models. | packages.yml (git subdir) |
| goldencheck-action | YAML | GitHub Action — fail PRs that introduce data-quality regressions. | Marketplace |
Headline pitch and the deepest docs live in packages/python/goldenmatch/README.md (~1,300 lines, full feature list, CLI, architecture, benchmarks).
Entity resolution is the stage most GraphRAG pipelines do badly — duplicate surface forms of the same entity scatter across documents. Two new packages put GoldenMatch's resolution there:
| Package | What it does | Status |
|---|---|---|
| goldenmatch-kg | Drop-in GoldenMatch resolution as the entity-resolution stage of existing KG frameworks (neo4j-graphrag, LlamaIndex PropertyGraphIndex, Graphiti). One framework-agnostic resolve_entities core + per-framework adapters. The ER-stage lift is measured by ER-KG-Bench, not asserted. | in-repo · first PyPI release pending |
| goldengraph | Build-your-own-KG from text — text → LLM extraction → GoldenMatch resolution → a durable bi-temporal store. The engine (store / query / community detection) is pyo3-free Rust; ER is the differentiator. Early evidence program. | in-repo · first PyPI release pending |
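The "duplicate surface forms" problem can be illustrated with a small candidate-generation sketch. It groups entity mentions that share either a punctuation-free key or an initialism key; this is a simplified illustration with hypothetical helper names, not the resolve_entities core, and the groups it produces are candidates to score, not confirmed matches.

```python
import re
from collections import defaultdict


def blocking_keys(mention):
    """Return a punctuation-free key plus, for multi-word mentions, an initialism."""
    words = re.findall(r"[A-Za-z0-9]+", mention)
    keys = {"".join(words).lower()}
    if len(words) > 1:
        keys.add("".join(w[0] for w in words).lower())
    return keys


def candidate_blocks(mentions):
    """Index mentions by key and keep only keys shared by two or more mentions."""
    index = defaultdict(list)
    for mention in mentions:
        for key in sorted(blocking_keys(mention)):
            index[key].append(mention)
    return {key: group for key, group in index.items() if len(group) > 1}


mentions = ["IBM", "I.B.M.", "International Business Machines", "Apple Inc."]
blocks = candidate_blocks(mentions)
```

Without the initialism key, "International Business Machines" would never be compared against "IBM", which is how one entity ends up as several graph nodes.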
| I want to... | Go here |
|---|---|
| Deduplicate a CSV right now | packages/python/goldenmatch |
| Use from Claude Desktop / Code | packages/python/goldenmatch — MCP |
| Edit rules in a browser, label pairs, compare runs | packages/python/goldenmatch — Web UI |
| Build AI agents that deduplicate | ER Agent / A2A wiki page |
| Profile data quality before matching | packages/python/goldencheck |
| Standardize messy fields (phone, date, address) | packages/python/goldenflow |
| Run the full pipeline declaratively | packages/python/goldenpipe |
| Map columns across schemas | packages/python/infermap |
| Analyze + report across stages and runs | packages/python/goldenanalysis |
| Write TypeScript / Node.js / Edge (browser, Workers; optional WASM) | packages/typescript/goldenmatch |
| Match in Postgres / DuckDB SQL | packages/rust/extensions |
| Add data-quality gates to dbt | packages/dbt/goldensuite |
| Block bad data in GitHub PRs | packages/actions/goldencheck |
| Run as Airflow DAGs | examples/airflow/ — 12 drop-in DAGs |
| Run from a single MCP container | docker run ghcr.io/benseverndev-oss/goldensuite-mcp:latest |
| Pull every Suite container | GitHub Packages |
import goldenmatch as gm
# Zero-config
result = gm.dedupe("customers.csv")
print(result) # DedupeResult(records=5000, clusters=847, match_rate=12.0%)
result.golden.write_csv("deduped.csv")
# Or be explicit
result = gm.dedupe("customers.csv",
exact=["email"],
fuzzy={"name": 0.85, "zip": 0.95},
blocking=["zip"],
threshold=0.85)
import { dedupe } from "goldenmatch";
const result = dedupe(rows, {
fuzzy: { name: 0.85 },
blocking: ["zip"],
threshold: 0.85,
});
console.log(result.stats); // { totalRecords, totalClusters, matchRate, ... }
Runs in browsers, Vercel Edge, Cloudflare Workers, Deno — and optionally swaps in the Rust score-core kernel via await enableWasm(). ~940 tests, strict TypeScript (noUncheckedIndexedAccess, exactOptionalPropertyTypes).
pip install 'goldenmatch[web]'
goldenmatch serve-ui my-project # opens http://localhost:5050

Edit rules with live validation, preview against a sampled slice, label pairs
(mirrored into Learning Memory automatically), compare runs (CCMS), sweep
parameters, browse the corrections store. Single-process localhost workbench
shipped as the optional [web] extra.
import goldenpipe as gp
pipeline = gp.Pipeline.from_yaml("pipeline.yaml") # check → flow → match
result = pipeline.run("customers.csv")
result.report.write_html("report.html")
More: examples/ has runnable demos for every Suite scenario:
Python (quickstart, full pipeline, customer 360, PPRL, review workflow, MCP client) ·
TypeScript (quickstart, Vercel Edge route, MCP client) ·
Airflow DAGs (12 production-shaped pipelines).
Reproducible end-to-end pipelines running GoldenMatch on public data at scale, each with measured headline numbers vs baselines:
GoldenMatch ships fat optional extras so you only pay for what you use:
pip install goldenmatch # core (CSV in, CSV out) + native acceleration on common platforms
pip install goldenmatch[native] # back-compat alias; native is already default on common platforms
pip install goldenmatch[embeddings] # + sentence-transformers, FAISS
pip install goldenmatch[llm] # + Claude / OpenAI for LLM boost
pip install goldenmatch[postgres] # + Postgres sync
pip install goldenmatch[snowflake] # + Snowflake connector
pip install goldenmatch[bigquery] # + BigQuery connector
pip install goldenmatch[databricks] # + Databricks connector
pip install goldenmatch[salesforce] # + Salesforce connector
pip install goldenmatch[duckdb] # + DuckDB out-of-core backend
pip install goldenmatch[ray] # + Ray distributed backend (50M+ rows)
pip install goldenmatch[quality] # + GoldenCheck integration
pip install goldenmatch[transform] # + GoldenFlow integration
pip install goldenmatch[mcp] # + MCP server for Claude Desktop
pip install goldenmatch[agent] # + A2A agent (aiohttp)
pip install goldenmatch[web] # + localhost browser workbench (FastAPI + React)
goldenmatch setup # interactive wizard: GPU, API keys, database
Sister packages compose: pip install goldenpipe[full] brings in Check + Flow + Match together.
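Because most integrations are optional, calling code may want to check what is importable before using it. A small sketch using only the standard library; the mapping from extra name to import name is an assumption (the import names are guessed from the pip package names above and may differ):

```python
import importlib.util


def available(modules):
    """Report which top-level modules can be imported, without importing them."""
    return {name: importlib.util.find_spec(name) is not None for name in modules}


# Assumed import names for two of the extras listed above.
EXTRA_MODULES = {"quality": "goldencheck", "transform": "goldenflow"}

status = available(EXTRA_MODULES.values())
missing_extras = [extra for extra, mod in EXTRA_MODULES.items() if not status[mod]]
```

If `missing_extras` is non-empty, the matching `pip install goldenmatch[...]` line from the list above is the fix.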
GoldenMatch is hosted as an MCP server on Smithery — connect from any MCP client without installing anything.
{
"mcpServers": {
"goldenmatch": {
"url": "https://goldenmatch-mcp-production.up.railway.app/mcp/"
}
}
}
50+ MCP tools across the suite: deduplicate, match, explain, review, link privately, configure, scan quality, transform, synthesize golden records, and manage Learning Memory corrections.
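Under the hood, an MCP client invokes a tool with a JSON-RPC 2.0 request whose method is tools/call. The sketch below only builds that request body, using the match_record example record from the tool listing; `tool_call` is a hypothetical helper, and the transport and the initialize handshake that a real client performs first are left out.

```python
import json
from itertools import count

_request_ids = count(1)  # JSON-RPC requests need a unique id per call


def tool_call(name, arguments=None):
    """Build the JSON-RPC 2.0 body for an MCP tools/call request."""
    return {
        "jsonrpc": "2.0",
        "id": next(_request_ids),
        "method": "tools/call",
        "params": {"name": name, "arguments": arguments or {}},
    }


request = tool_call(
    "match_record",
    {
        "record": {"name": "John Smith", "email": "john@test.com", "zip": "10001"},
        "top_k": 3,
    },
)
body = json.dumps(request)

# Tools documented as taking no parameters are called with an empty object.
stats_request = tool_call("get_stats")
```

In practice an MCP client library handles this framing; seeing the raw shape mainly helps when debugging a connection.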
Every Suite package ships as a multi-arch container image (linux/amd64 + linux/arm64) on GitHub Container Registry. Pull anonymously, no auth needed:
# One container, every Suite tool — the convenience option
docker run -p 8300:8300 ghcr.io/benseverndev-oss/goldensuite-mcp:latest
# Per-package containers — narrower deployments
docker run -p 8200:8200 ghcr.io/benseverndev-oss/goldenmatch-mcp:latest
docker run -p 8100:8100 ghcr.io/benseverndev-oss/goldencheck-mcp:latest
docker run -p 8150:8150 ghcr.io/benseverndev-oss/goldenflow-mcp:latest
docker run -p 8250:8250 ghcr.io/benseverndev-oss/goldenpipe-mcp:latest
docker run -p 8400:8400 ghcr.io/benseverndev-oss/infermap-mcp:latest
# Postgres + extension preinstalled
docker run -e POSTGRES_PASSWORD=secret ghcr.io/benseverndev-oss/goldenmatch-extensions:latest
Tags:
- :latest tracks current main
- :main-<sha7> is pushed on every push to main and is immutable
- :vX.Y.Z and :vX.Y are pushed when a <package>-vX.Y.Z tag is created

See packages/python/goldensuite-mcp/README.md for the aggregator's tool-collision behaviour.
12 drop-in DAGs at examples/airflow/, grouped by lifecycle stage:
| Group | DAGs |
|---|---|
| Core pipeline | daily_dedupe, incremental_match, warehouse_native (Snowflake), customer_360 (multi-source) |
| Privacy | pprl_linkage (two-party PPRL) |
| Onboarding & monitoring | schema_align_and_load, schema_drift_alarm, quality_gate |
| Feedback loop | review_worker, active_learning |
| Operationalize | reverse_etl (Salesforce/HubSpot), backfill |
TaskFlow API, Airflow 2.7+ (compatible with 3.x). Each DAG has tunable knobs at the top, idempotent retries, and is marker-protected against double-processing. Drop the file you want into your Airflow dags/ folder.
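"Marker-protected against double-processing" describes a general pattern: record that a batch finished, and skip it if the record already exists. A minimal sketch of that pattern with marker files, assuming nothing about how the shipped DAGs implement it; `process_once` and the `.done` naming are hypothetical.

```python
import tempfile
from pathlib import Path


def process_once(batch_id, marker_dir, work):
    """Run work(batch_id) unless a marker shows it already completed."""
    marker = Path(marker_dir) / f"{batch_id}.done"
    if marker.exists():
        return False  # a retry or re-run: nothing to do
    work(batch_id)
    marker.touch()  # written only after the work succeeds
    return True


calls = []
with tempfile.TemporaryDirectory() as marker_dir:
    first = process_once("2026-05-01", marker_dir, calls.append)
    second = process_once("2026-05-01", marker_dir, calls.append)
```

Writing the marker after the work means a crash mid-batch leads to a retry rather than a silent skip, so the work itself still has to be safe to repeat.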
goldenmatch/
├── packages/
│ ├── python/
│ │ ├── goldenmatch/ # entity resolution — headline package
│ │ ├── goldencheck/ # data quality scanning
│ │ ├── goldenflow/ # transforms & standardizers
│ │ ├── goldenpipe/ # orchestrator
│ │ ├── infermap/ # schema mapping
│ │ └── goldenanalysis/ # cross-cutting analysis & reporting
│ ├── typescript/
│ │ ├── goldenmatch/ # full TS port (edge-safe core)
│ │ ├── goldencheck/ # TS implementation
│ │ ├── goldencheck-types/ # shared TS types
│ │ ├── goldenflow/ # TS transforms
│ │ ├── infermap/ # TS schema mapping
│ │ └── goldenanalysis/ # TS analysis & reporting (edge-safe + WASM)
│ ├── rust/
│ │ └── extensions/ # Postgres pgrx + DuckDB UDFs (own Cargo workspace)
│ ├── python/goldensuite-mcp/ # aggregator MCP server (one container, all tools)
│ ├── dbt/goldensuite/ # dbt package (materializations, tests, macros)
│ └── actions/goldencheck/ # GitHub Action
├── examples/
│ ├── python/ # 6 runnable Python scripts (quickstart → MCP)
│ ├── typescript/ # 3 TS scripts (quickstart, Vercel Edge, MCP)
│ └── airflow/ # 12 drop-in Airflow DAGs
├── docs/superpowers/ # design specs and implementation plans
├── justfile # install / test / lint / build, all languages
├── pyproject.toml # uv workspace (root)
├── pnpm-workspace.yaml # TypeScript pnpm workspace (Turborepo)
├── package.json # root scripts + pnpm workspace root
└── .github/workflows/ci.yml
- packages/rust/extensions/ is itself a Cargo workspace (the postgres crate is excluded for pgrx-specific build requirements). Cargo doesn't allow nested workspaces sharing members, so Cargo commands run from inside packages/rust/extensions/.
- packages/typescript/* form one pnpm + Turborepo workspace (see TypeScript dev setup). .npmrc pins node-linker=hoisted, giving a flat node_modules that avoids the Windows symlink issues an earlier per-package layout hit.

just install # uv sync + per-package npm install + cargo fetch
just test # all languages
just lint
just build
Published GoldenMatch numbers (DQbench composite 91.04, DBLP-ACM 0.9641 F1, Febrl3 0.9443 F1, NCVR 0.9719 F1) map back to a single committed runner: scripts/run_benchmarks.py. See docs/reproducing-benchmarks.md for per-number commands, dataset URLs, expected output (with tolerance), variance notes (deterministic vs LLM-augmented), and a copy-pasteable one-click reproduction snippet for the DQbench composite. The same runner powers the weekly benchmarks.yml workflow.
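As a rough illustration of what "expected output (with tolerance)" means when reproducing one of these numbers, the sketch below computes pairwise F1 from predicted and true duplicate pairs and compares a measured score to a published one. The published values are copied from the paragraph above; the 0.005 tolerance and both helper names are assumptions, and docs/reproducing-benchmarks.md remains the authority on the real tolerances.

```python
import math

PUBLISHED_F1 = {"DBLP-ACM": 0.9641, "Febrl3": 0.9443, "NCVR": 0.9719}


def pairwise_f1(predicted, truth):
    """F1 over unordered record pairs: (1, 2) and (2, 1) count as the same pair."""
    pred = {frozenset(p) for p in predicted}
    true = {frozenset(p) for p in truth}
    hits = len(pred & true)
    if hits == 0:
        return 0.0
    precision = hits / len(pred)
    recall = hits / len(true)
    return 2 * precision * recall / (precision + recall)


def within_tolerance(benchmark, measured, abs_tol=0.005):
    """True when a measured F1 is within abs_tol of the published value (assumed tolerance)."""
    return math.isclose(measured, PUBLISHED_F1[benchmark], abs_tol=abs_tol)


# Two of three predicted pairs are correct and two of three true pairs are found.
f1 = pairwise_f1([(1, 2), (2, 3), (4, 5)], [(2, 1), (4, 5), (6, 7)])
```

A deterministic run should reproduce a published number closely, while LLM-augmented runs vary more, which is why the variance notes matter when choosing a tolerance.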
"How big can this handle?" is answered in docs/scale-envelope.md: per-backend ranges (Polars in-memory < 500K, DuckDB out-of-core 500K - 50M, Ray distributed >= 50M), block-size failure modes, candidate-pair math, and a single-page decision tree for picking a backend.
Verified at the top end: a full 100,000,000-row GoldenMatch dedupe on a 5-node Ray cluster (e2-standard-16, 80 CPU) in 9.2 min (554 s), 20,000,000 golden records recovered exactly, driver process peak 0.36 GB RSS — the default distributed path is now recall-complete (blocking-key shuffle scoring + a distributed randomized-contraction WCC), so duplicates merge correctly no matter how the input is partitioned, and it stays driver-collect-free end to end (#844). A faster per-partition path is available via GOLDENMATCH_DISTRIBUTED_BLOCK_SHUFFLE=0 (driver-collect-free, ~213 s on a 4-worker run) for inputs where duplicates already co-locate within partitions — but it under-merges when a cluster's members land in different input partitions, which is why recall-complete is the default. Recipe in packages/python/goldenmatch/configs/distributed-100m.yaml.
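Why the per-partition path under-merges can be shown with six records and three matching pairs. The sketch uses a plain single-machine union-find to compute connected components; it is not the distributed randomized-contraction WCC described above, and the partition assignment is made up for the example.

```python
def connected_components(n, pairs):
    """Cluster records 0..n-1 from matching pairs using union-find."""
    parent = list(range(n))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b in pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[max(root_a, root_b)] = min(root_a, root_b)

    groups = {}
    for record in range(n):
        groups.setdefault(find(record), []).append(record)
    return sorted(groups.values())


pairs = [(0, 1), (1, 4), (2, 3)]
partition_of = {0: "A", 1: "A", 2: "A", 3: "B", 4: "B", 5: "B"}

# Recall-complete: every matching pair is considered, wherever its records live.
global_clusters = connected_components(6, pairs)

# Per-partition: a pair is only seen when both records share a partition.
local_pairs = [(a, b) for a, b in pairs if partition_of[a] == partition_of[b]]
per_partition_clusters = connected_components(6, local_pairs)
```

The global pass finds three clusters; the per-partition pass sees only the (0, 1) pair and returns five, leaving records 2, 3, and 4 as false singletons. That is the under-merge the default path avoids.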
- feature/<name> branches; merge via squash PR.
- feat: <description>, fix: <description>, docs: <description>.
- packages/typescript/goldenmatch/tests/parity/ enforces 4-decimal-tolerance Python ↔ TypeScript scorer parity.
- docs/superpowers/specs/ for design rationale on architectural decisions.

The TypeScript packages live in a single pnpm workspace orchestrated by Turborepo. From the repo root:
corepack enable # one-time, picks up pnpm@9.15.0 from package.json
pnpm install # installs all workspace packages
pnpm turbo run build test typecheck lint # full pipeline (cached after first run)
pnpm --filter goldenmatch test # single package
Windows: enable Developer Mode for pnpm. pnpm install creates symlinks under node_modules/. Settings → For Developers → Developer Mode → On. If you see EPERM: operation not permitted, symlink ... during install, Dev Mode is off.
If corepack enable fails (often needs an admin shell on Windows), the fallback is npm i -g pnpm@9.15.0 — functionally equivalent.
This repository was formed on 2026-05-01 by folding 8 sibling repos into the existing goldenmatch repo using git filter-repo. Full commit history is preserved for every source. See docs/superpowers/specs/2026-05-01-goldenmatch-monorepo-fold-in-design.md for the design rationale and docs/superpowers/plans/2026-05-01-goldenmatch-monorepo-fold-in.md for the step-by-step migration plan.
Built by Ben Severn.
MIT — see LICENSE.