GoldenMatch

benseverndev-oss/goldenmatch
103 · 42 tools · STDIO, HTTP · registry active
Summary

This server wraps the GoldenMatch entity resolution toolkit, which finds duplicate records across messy datasets and reports 96.4% F1 on DBLP-ACM out of the box. You get access to deduplication, clustering, and the broader Golden Suite pipeline (InferMap for schema alignment, GoldenCheck for profiling, GoldenFlow for standardization). The MCP layer exposes 42 tools, including auto_configure for adaptive tuning and controller_telemetry for inspecting the auto-configuration decisions behind a run. It is useful when you need to clean customer lists, merge data sources, or resolve entities across organizations without writing custom fuzzy-matching logic. The zero-config defaults work immediately, and the opt-in Learning Memory persists corrections across runs so the same correction is not needed twice.

Install to Claude Code

verified
claude mcp add --transport http goldenmatch https://goldenmatch-mcp-production.up.railway.app/mcp/

Run in your terminal. Add --scope user to make it available in every project.

Review the command, arguments, and environment values before installing — MCP servers run with your local permissions.


Tools

Verified live against the running server on Jun 10, 2026.

analyze_data (1 param)

Profile data, detect domain, recommend ER strategy.

Parameters (* required)
  • file_path* (string)

auto_configure (2 params)

Run AutoConfigController on a CSV; return the committed GoldenMatchConfig (incl. negative_evidence / Path Y when chosen) plus telemetry: stop_reason, health, decision trace, indicator column priors. Programmatic equivalent of `goldenmatch autoconfig`.

Parameters (* required)
  • file_path* (string)
  • constraints (object)

controller_telemetry (no params)

Return the AutoConfigController telemetry from the most recent `auto_configure` or `agent_deduplicate` call in this MCP session. Same JSON shape as the web /api/v1/controller/telemetry endpoint.

No parameters; call it with no arguments.

agent_deduplicate (2 params)

Run full ER pipeline with confidence gating and reasoning.

Parameters (* required)
  • config (object)
  • file_path* (string)

agent_match_sources (3 params)

Match two files with intelligent strategy selection.

Parameters (* required)
  • config (object)
  • file_a* (string)
  • file_b* (string)

agent_explain_pair (4 params)

Natural language explanation for a record pair.

Parameters (* required)
  • exact (array)
  • fuzzy (object)
  • record_a* (object)
  • record_b* (object)

agent_explain_cluster (1 param)

Explain why records are in the same cluster.

Parameters (* required)
  • cluster_id* (integer)

agent_review_queue (1 param)

Get borderline pairs awaiting approval.

Parameters (* required)
  • job_name* (string)

agent_approve_reject (6 params)

Approve or reject a review queue pair.

Parameters (* required)
  • id_a* (integer)
  • id_b* (integer)
  • reason (string)
  • decision* (string)
  • job_name* (string)
  • decided_by* (string)
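
The agent tools above describe confidence gating with a review queue for borderline pairs. A minimal sketch of that idea, assuming two cutoffs: the names `auto_merge` and `auto_reject` and their values are hypothetical, not GoldenMatch settings.

```python
from typing import NamedTuple


class ScoredPair(NamedTuple):
    id_a: int
    id_b: int
    score: float


def gate(pairs, auto_merge=0.90, auto_reject=0.60):
    """Split scored pairs into merge / review / reject buckets.

    The cutoffs are illustrative assumptions for this sketch.
    """
    buckets = {"merge": [], "review": [], "reject": []}
    for pair in pairs:
        if pair.score >= auto_merge:
            buckets["merge"].append(pair)
        elif pair.score < auto_reject:
            buckets["reject"].append(pair)
        else:
            # Borderline: left for an approve/reject decision.
            buckets["review"].append(pair)
    return buckets


pairs = [ScoredPair(1, 2, 0.97), ScoredPair(3, 4, 0.75), ScoredPair(5, 6, 0.20)]
buckets = gate(pairs)
```

Only the middle band would reach a reviewer; the other two buckets are decided without human input.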
agent_compare_strategies (2 params)

Compare ER strategies on your data.

Parameters (* required)
  • file_path* (string)
  • ground_truth (string)

suggest_pprl (1 param)

Check if data needs privacy-preserving matching.

Parameters (* required)
  • file_path* (string)

scan_quality (2 params)

Run GoldenCheck data quality scan on a CSV file. Returns issues found (encoding errors, Unicode problems, format violations) without applying fixes. Requires goldencheck: pip install goldenmatch[quality]

Parameters (* required)
  • domain (string): Optional domain hint (healthcare, finance, ecommerce).
  • file_path* (string): Path to the CSV file to scan.

fix_quality (4 params)

Run GoldenCheck scan and apply fixes to a CSV file. Returns the fixed data summary and a manifest of all fixes applied. Requires goldencheck: pip install goldenmatch[quality]

Parameters (* required)
  • domain (string): Optional domain hint (healthcare, finance, ecommerce).
  • fix_mode (string): Fix aggressiveness: safe (conservative) or moderate (balanced). One of safe · moderate. Default: safe.
  • file_path* (string): Path to the CSV file to fix.
  • output_path (string): Optional path to save the fixed CSV. If omitted, returns summary only.

run_transforms (2 params)

Run GoldenFlow data transforms on a CSV file. Normalizes phone numbers (E.164), dates (ISO), categorical spelling, and Unicode issues. Returns a manifest of transforms applied. Requires goldenflow: pip install goldenmatch[transform]

Parameters (* required)
  • file_path* (string): Path to the CSV file to transform.
  • output_path (string): Optional path to save the transformed CSV. If omitted, returns summary only.
list_corrections (2 params)

List stored Learning Memory corrections, optionally filtered by dataset. Returns id_a, id_b, decision, source, trust, reason, matchkey_name, dataset, original_score, created_at.

Parameters (* required)
  • path (string): SQLite memory DB path. Default: .goldenmatch/memory.db
  • dataset (string): Optional dataset filter (e.g. file path).

add_correction (7 params)

Add a pair correction to Learning Memory. Source is set to 'agent' with trust=0.5 (lower than human steward decisions, which are 1.0). Pair (id_a, id_b) is canonicalized to (min, max) before storage.

Parameters (* required)
  • id_a* (integer)
  • id_b* (integer)
  • path (string): SQLite memory DB path. Default: .goldenmatch/memory.db
  • reason (string)
  • dataset* (string): Dataset identifier (e.g. file path). Required, non-empty.
  • decision* (string): One of approve · reject.
  • matchkey_name (string)
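
The add_correction description mentions two details that are easy to show in code: the pair is stored as (min, max), and agent corrections carry trust 0.5 against 1.0 for human stewards. A minimal sketch using an in-memory dict instead of the SQLite store; the function and dictionary names are hypothetical, and how GoldenMatch resolves conflicting corrections is not described here, so the sketch simply keeps the latest write.

```python
# Trust levels quoted in the tool description; the dictionary itself is illustrative.
TRUST = {"agent": 0.5, "human": 1.0}


def canonical_pair(id_a, id_b):
    """Order a pair as (min, max) so (7, 3) and (3, 7) share one key."""
    return (min(id_a, id_b), max(id_a, id_b))


def add_correction(store, id_a, id_b, decision, source="agent"):
    if decision not in ("approve", "reject"):
        raise ValueError("decision must be 'approve' or 'reject'")
    key = canonical_pair(id_a, id_b)
    # Assumption: the latest write for a pair replaces the earlier one.
    store[key] = {"decision": decision, "source": source, "trust": TRUST[source]}
    return key


store = {}
add_correction(store, 7, 3, "approve")
add_correction(store, 3, 7, "reject")  # same pair, opposite argument order
```

Because both calls map to the key (3, 7), the store ends up with a single entry for the pair regardless of argument order.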
learn_thresholds (2 params)

Force a MemoryLearner pass over accumulated corrections. Returns the list of LearnedAdjustments produced (matchkey_name, threshold, sample_size, learned_at). Requires >= 10 corrections per matchkey before threshold tuning fires; otherwise returns an empty list.

Parameters (* required)
  • path (string): SQLite memory DB path. Default: .goldenmatch/memory.db
  • matchkey_name (string): Optional: learn only for this matchkey.
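
The ">= 10 corrections per matchkey" rule explains why learn_thresholds can return an empty list. A small sketch of that eligibility check only, assuming corrections are dicts with a `matchkey_name` field as in list_corrections; the threshold-tuning step itself is not described in this listing and is left out.

```python
from collections import Counter

MIN_CORRECTIONS = 10  # minimum per matchkey, from the tool description


def eligible_matchkeys(corrections, min_count=MIN_CORRECTIONS):
    """Return the matchkeys that have enough corrections for tuning to fire."""
    counts = Counter(c["matchkey_name"] for c in corrections)
    return sorted(name for name, n in counts.items() if n >= min_count)


corrections = (
    [{"matchkey_name": "name_zip"}] * 12   # enough corrections
    + [{"matchkey_name": "email"}] * 3     # too few, skipped
)
ready = eligible_matchkeys(corrections)
```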
memory_stats (1 param)

Return Learning Memory status: total correction count, last learn time, and current learned adjustments. Cheap; safe for status checks.

Parameters (* required)
  • path (string): SQLite memory DB path. Default: .goldenmatch/memory.db

memory_export (2 params)

Return all corrections as a list of dicts (CSV-shaped). Caller is responsible for writing the file. Optionally filter by dataset.

Parameters (* required)
  • path (string): SQLite memory DB path. Default: .goldenmatch/memory.db
  • dataset (string)
identity_resolve (2 params)

Resolve a record_id to its durable identity. Returns the full identity view (members, evidence edges, recent events) or null when no identity exists for that record.

Parameters (* required)
  • path (string): Identity DB path.
  • record_id* (string): Record id in `{source}:{source_pk}` form.

identity_list (5 params)

List identities, optionally filtered by dataset/status.

Parameters (* required)
  • path (string)
  • limit (integer): Default: 50.
  • offset (integer): Default: 0.
  • status (string)
  • dataset (string)

identity_history (3 params)

Return the temporal event log for an identity.

Parameters (* required)
  • path (string)
  • limit (integer): Default: 100.
  • entity_id* (string)

identity_conflicts (2 params)

List evidence edges marked `conflicts_with`.

Parameters (* required)
  • path (string)
  • dataset (string)

identity_merge (4 params)

Manually merge two identities. All records from `absorb_entity_id` are reassigned to `keep_entity_id`.

Parameters (* required)
  • path (string)
  • reason (string)
  • keep_entity_id* (string)
  • absorb_entity_id* (string)

identity_split (4 params)

Split a subset of records off an identity into a brand-new identity. The original keeps the remaining records.

Parameters (* required)
  • path (string)
  • reason (string)
  • entity_id* (string)
  • record_ids* (array)
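
The merge and split semantics described for identity_merge and identity_split can be sketched with a toy in-memory store. This is an illustration only: the `IdentityStore` class and the `ent-N` id format are hypothetical, and the event log and evidence edges of the real identity graph are omitted. Record ids follow the `{source}:{source_pk}` form from identity_resolve.

```python
import itertools


class IdentityStore:
    """Toy identity store: entity_id -> set of record ids."""

    def __init__(self):
        self.members = {}
        self._counter = itertools.count(1)

    def create(self, record_ids):
        entity_id = f"ent-{next(self._counter)}"  # hypothetical id format
        self.members[entity_id] = set(record_ids)
        return entity_id

    def merge(self, keep_entity_id, absorb_entity_id):
        # Every record of the absorbed identity is reassigned to the kept one.
        self.members[keep_entity_id] |= self.members.pop(absorb_entity_id)

    def split(self, entity_id, record_ids):
        moving = set(record_ids)
        if not moving <= self.members[entity_id]:
            raise ValueError("record_ids must all belong to entity_id")
        # The original keeps the remainder; the subset gets a new identity.
        self.members[entity_id] -= moving
        return self.create(moving)


store = IdentityStore()
a = store.create({"crm:1", "crm:2"})
b = store.create({"erp:9"})
store.merge(keep_entity_id=a, absorb_entity_id=b)
new_id = store.split(a, ["crm:2"])
```

After these calls the kept identity holds crm:1 and erp:9, the absorbed identity no longer exists, and crm:2 lives in a newly created identity.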
get_stats (no params)

Get dataset statistics: record count, cluster count, match rate, cluster sizes.

No parameters; call it with no arguments.

find_duplicates (2 params)

Find duplicate matches for a record. Provide field values to search against the loaded dataset.

Parameters (* required)
  • top_k (integer): Max results to return. Default: 5.
  • record* (object): Record fields to match (e.g. {"name": "John Smith", "zip": "10001"}).
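
Over MCP, a tool such as find_duplicates is invoked with a JSON-RPC 2.0 `tools/call` request whose `arguments` object carries the parameters listed above. The sketch below only builds that request body; an MCP client normally handles the session setup and transport, and the helper name `tools_call` is hypothetical.

```python
import json


def tools_call(request_id, name, arguments):
    """Build the JSON-RPC 2.0 body for an MCP tools/call request."""
    return {
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "tools/call",
        "params": {"name": name, "arguments": arguments},
    }


request = tools_call(
    1,
    "find_duplicates",
    {"record": {"name": "John Smith", "zip": "10001"}, "top_k": 5},
)
body = json.dumps(request)
```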
explain_match (2 params)

Explain why two records match or don't match. Shows per-field score breakdown.

Parameters (* required)
  • record_a* (object): First record fields.
  • record_b* (object): Second record fields.

list_clusters (2 params)

List duplicate clusters found in the dataset. Returns cluster IDs, sizes, and member counts.

Parameters (* required)
  • limit (integer): Max clusters to return. Default: 20.
  • min_size (integer): Minimum cluster size to include. Default: 2.

get_cluster (1 param)

Get details of a specific cluster: all member records and their field values.

Parameters (* required)
  • cluster_id* (integer): Cluster ID to look up.

get_golden_record (1 param)

Get the merged golden (canonical) record for a cluster.

Parameters (* required)
  • cluster_id* (integer): Cluster ID.

match_record (3 params)

Match a single record against the loaded dataset in real time. Paste a record's fields and instantly see if it matches any existing record. Uses the configured matchkeys, scorers, and thresholds. Example: {"name": "John Smith", "email": "john@test.com", "zip": "10001"}

Parameters (* required)
  • top_k (integer): Max matches to return. Default: 5.
  • record* (object): Record fields to match against the dataset.
  • threshold (number): Minimum score to consider a match (default: use config threshold).

unmerge_record (1 param)

Remove a record from its cluster. The record becomes a singleton. Remaining cluster members are re-clustered using stored pair scores. Use this to fix bad merges.

Parameters (* required)
  • record_id* (integer): Row ID of the record to unmerge.

shatter_cluster (1 param)

Break an entire cluster into individual records. All members become singletons. Use when a cluster is completely wrong.

Parameters (* required)
  • cluster_id* (integer): Cluster ID to shatter.
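
The difference between unmerge_record and shatter_cluster is easiest to see on a small cluster. The sketch below assumes re-clustering means taking connected components over stored pair scores at or above a threshold; that method and the 0.85 value are assumptions for illustration, not a statement of how GoldenMatch re-clusters.

```python
def components(nodes, scores, threshold):
    """Connected components over edges whose stored score meets the threshold."""
    parent = {n: n for n in nodes}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for (a, b), score in scores.items():
        if a in parent and b in parent and score >= threshold:
            parent[find(a)] = find(b)

    groups = {}
    for n in nodes:
        groups.setdefault(find(n), set()).add(n)
    return sorted(groups.values(), key=min)


def unmerge(cluster, record_id, scores, threshold=0.85):
    """Make record_id a singleton and re-cluster the remaining members."""
    rest = [r for r in cluster if r != record_id]
    return [{record_id}] + components(rest, scores, threshold)


def shatter(cluster):
    """Turn every member into a singleton."""
    return [{r} for r in cluster]


cluster = [1, 2, 3, 4]
scores = {(1, 2): 0.95, (2, 3): 0.90, (3, 4): 0.91}
after_unmerge = unmerge(cluster, 3, scores)
after_shatter = shatter(cluster)
```

Record 3 was the only link between record 4 and the rest, so removing it leaves {1, 2} together and 4 on its own, whereas shattering discards all stored evidence for the cluster.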
suggest_config (1 param)

Analyze bad merges and suggest config changes. Provide examples of incorrect merges (pairs that should NOT have matched) and GoldenMatch will identify which fields/thresholds to tighten. Example: [{"record_a": {...}, "record_b": {...}, "reason": "different people"}]

Parameters (* required)
  • bad_merges* (array): List of bad merge examples with record_a, record_b, and optional reason.

profile_data (no params)

Get data quality profile: column types, null rates, unique counts, sample values.

No parameters; call it with no arguments.

export_results (2 params)

Export matching results to a file (CSV or JSON).

Parameters (* required)
  • format (string): Output format. One of csv · json. Default: csv.
  • output_path* (string): File path to save results.

list_domains (no params)

List available domain extraction rulebooks (built-in + user-defined).

No parameters; call it with no arguments.

create_domain (7 params)

Create a custom domain extraction rulebook. Define patterns for a specific data domain (medical devices, automotive parts, real estate, etc.).

Parameters (* required)
  • name* (string): Domain name (e.g. 'medical_devices', 'automotive_parts').
  • scope (string): Save locally (.goldenmatch/domains/) or globally (~/.goldenmatch/domains/). One of local · global. Default: local.
  • signals* (array): Column name keywords that trigger this domain (e.g. ['ndc', 'fda', 'implant']).
  • stop_words (array): Words to strip during name normalization.
  • brand_patterns (array): Brand/manufacturer names to extract (e.g. ['Medtronic', 'Abbott']).
  • attribute_patterns (object): Named regex patterns for domain attributes (e.g. {'size': '\\b(\\d+mm)\\b'}).
  • identifier_patterns (object): Named regex patterns for domain identifiers (e.g. {'ndc': '\\b(\\d{5}-\\d{4}-\\d{2})\\b'}).
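
To see what the example patterns in create_domain would pull out of a value, they can be run directly with Python's `re` module. This is a sketch of the pattern semantics only, not of GoldenMatch's extraction code; the `extract` helper and the sample string are made up. The doubled backslashes in the listing are JSON escaping, so the raw strings below use single backslashes.

```python
import re

# Patterns and brand names copied from the parameter examples above.
identifier_patterns = {"ndc": r"\b(\d{5}-\d{4}-\d{2})\b"}
attribute_patterns = {"size": r"\b(\d+mm)\b"}
brand_patterns = ["Medtronic", "Abbott"]


def extract(text):
    """Apply each named pattern and return the first captured group per name."""
    features = {}
    for name, pattern in {**identifier_patterns, **attribute_patterns}.items():
        match = re.search(pattern, text)
        if match:
            features[name] = match.group(1)
    for brand in brand_patterns:
        if re.search(re.escape(brand), text, flags=re.IGNORECASE):
            features["brand"] = brand
            break
    return features


features = extract("Medtronic stent 12mm NDC 12345-6789-01")
```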
test_domain (2 params)

Test a domain extraction rulebook against sample records. Shows what features would be extracted from the loaded data.

Parameters (* required)
  • domain_name* (string): Name of the domain rulebook to test.
  • sample_size (integer): Number of records to test. Default: 10.

pprl_auto_config (2 params)

Analyze the loaded dataset and recommend optimal PPRL (privacy-preserving record linkage) configuration. Returns recommended fields, bloom filter parameters, threshold, and explanation.

Parameters (* required)
  • use_llm (boolean): Use LLM for enhanced recommendations (requires API key). Default: false.
  • security_level (string): Security level. One of standard · high · paranoid. Default: high.

pprl_link (5 params)

Run privacy-preserving record linkage between two parties' data. Computes bloom filters, matches records without sharing raw data. Specify fields, threshold, and security level.

Parameters (* required)
  • fields* (array): Field names to match on (e.g. ['first_name', 'last_name', 'zip_code']).
  • file_a* (string): Path to party A's CSV file.
  • file_b* (string): Path to party B's CSV file.
  • threshold (number): Match threshold. Default: auto-detected.
  • security_level (string): One of standard · high · paranoid. Default: high.
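
pprl_link mentions bloom filters as the way two parties compare records without exchanging raw values. The sketch below shows the general textbook technique (keyed hashing of character bigrams into a bit set, compared with the Dice coefficient). It is not GoldenMatch's implementation: the filter size, hash count, and shared secret are illustrative assumptions, and the security levels are not modeled.

```python
import hashlib
import hmac


def bigrams(value):
    """Character bigrams of a padded, lower-cased value."""
    padded = f"_{value.lower().strip()}_"
    return {padded[i:i + 2] for i in range(len(padded) - 1)}


def bloom_encode(value, secret, size=1024, num_hashes=4):
    """Return the set of bit positions set for a value (illustrative parameters)."""
    bits = set()
    for gram in bigrams(value):
        for k in range(num_hashes):
            digest = hmac.new(secret, f"{k}|{gram}".encode(), hashlib.sha256).digest()
            bits.add(int.from_bytes(digest[:8], "big") % size)
    return bits


def dice(a, b):
    """Dice coefficient between two bit sets."""
    if not a and not b:
        return 1.0
    return 2 * len(a & b) / (len(a) + len(b))


secret = b"shared-secret"  # agreed between the parties out of band (assumption)
party_a = bloom_encode("Jonathan Smith", secret)
party_b_similar = bloom_encode("Jonathon Smith", secret)
party_b_other = bloom_encode("Maria Garcia", secret)

similar_score = dice(party_a, party_b_similar)
other_score = dice(party_a, party_b_other)
```

Each party would share only the encoded bit sets; near-duplicate names overlap heavily while unrelated names overlap only through chance hash collisions.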

🟡 Golden Suite

A polyglot data-quality and entity-resolution toolkit. Polished, opinionated, AI-native.

GoldenCheck profiles → GoldenFlow standardizes → GoldenMatch deduplicates → GoldenAnalysis reports, all orchestrated by GoldenPipe. With InferMap for schema mapping, a Rust extension layer for Postgres / DuckDB, and optional WebAssembly acceleration behind the edge-safe TypeScript ports.

⚡ GoldenMatch scales from a CSV on your laptop to 100M+ rows on a Ray cluster — verified: 100,000,000 records deduped recall-complete (correct across any partitioning) in 9.2 min, with a 0.36 GB driver footprint.


License: MIT

Pair drilldown in the web workbench: cluster members, field-level diff, and a one-line NL explanation per pair. Install with pip install goldenmatch[web], then run goldenmatch serve-ui <project>.

# Headline package: dedupe a CSV in 30 seconds
pip install goldenmatch && goldenmatch dedupe customers.csv

# TypeScript / Edge runtimes
npm install goldenmatch

🆕 v2.3.0 — Auto-enabled semantic blocking, now default-on — text-heavy data automatically routes to SimHash-over-embeddings blocking when an embedder is reachable (a byte-identical no-op otherwise). Plus pluggable pgvector / DuckDB-HNSW vector-index backends and opt-in Fellegi-Sunter routing for no-strong-identifier datasets (GOLDENMATCH_AUTOCONFIG_ROUTE_PROBABILISTIC=1).

v2.2.0 — Semantic blocking — an opt-in recall lever for abbreviations and aliases. dedupe_df(semantic_blocking=...) unions extra candidate sources (initialism/abbreviation blocking, a business-alias canonical-form table, and an embedding ANN pass) into the pipeline. Off by default; on the abbreviation-heavy benchmark it adds +5.3pp recall at zero precision cost.
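
Initialism blocking, one of the candidate sources named in the v2.2.0 note, can be illustrated with a few lines of Python. This is one plausible way to build such a blocking key, written as an assumption-laden sketch rather than the library's code; `block_key` is a hypothetical helper.

```python
import re
from collections import defaultdict


def block_key(name):
    """Blocking key: the initialism of a multi-word name, or the single token itself."""
    words = re.findall(r"[A-Za-z]+", name)
    if len(words) == 1:
        # Assumption: a lone token may already be an abbreviation.
        return words[0].upper()
    return "".join(word[0] for word in words).upper()


names = ["International Business Machines", "IBM", "Acme Corp"]
blocks = defaultdict(list)
for name in names:
    blocks[block_key(name)].append(name)
```

The long form and its abbreviation land in the same block and become a candidate pair, which a string-similarity blocker on the raw names would likely miss.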

v2.1.0 — Correlated survivorship — golden-record survivorship can now keep correlated fields (street/city/postcode) in lock-step from a single winning source instead of mixing best-per-field values across records. New FieldGroupSpec + DomainPack.groups (domain-pack schema v3, additive), an anchor/allow_fill group-winner strategy, and per-cluster provenance surfaced through lineage, explain, the MCP tools, and the review queue. Plus chunked PPRL linkage (peak memory ~9-14x lower, byte-identical) and result.native dispatch telemetry that flags a silently-slow Python fallback.
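
The difference between best-per-field and correlated survivorship described in the v2.1.0 note shows up clearly on an address. In this sketch the group winner is the record with the most populated fields in the group; that rule is an assumption for illustration and does not reproduce the FieldGroupSpec or anchor/allow_fill behavior.

```python
records = [
    {"source": "crm", "street": "12 High St", "city": "Leeds", "postcode": None},
    {"source": "erp", "street": "40 Park Rd", "city": "York", "postcode": "YO1 7HH"},
]
ADDRESS_GROUP = ("street", "city", "postcode")


def best_per_field(records, fields):
    """Independent survivorship: first non-null value per field, in record order."""
    return {f: next((r[f] for r in records if r[f] is not None), None) for f in fields}


def group_winner(records, fields):
    """Correlated survivorship: take every field in the group from one record."""
    winner = max(records, key=lambda r: sum(r[f] is not None for f in fields))
    return {f: winner[f] for f in fields}


mixed = best_per_field(records, ADDRESS_GROUP)
locked = group_winner(records, ADDRESS_GROUP)
```

The per-field result pairs a Leeds street with a York postcode, an address that exists in neither source, while the group winner keeps the three fields in lock-step.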


Why a suite?

Each tool stands alone, but they compose into a single pipeline:

flowchart LR
    raw([raw rows])
    golden([golden records])

    subgraph orchestration ["GoldenPipe orchestrates"]
        direction LR
        infermap[InferMap]
        goldencheck[GoldenCheck]
        goldenflow[GoldenFlow]
        goldenmatch[GoldenMatch]
        infermap --> goldencheck --> goldenflow --> goldenmatch
    end

    raw --> infermap
    goldenmatch --> golden

| Step | Role |
| --- | --- |
| InferMap | schema mapping: auto-aligns columns across heterogeneous sources |
| GoldenCheck | profile + validate: encoding, format, anomaly detection |
| GoldenFlow | standardize + transform: phone, date, address, categorical normalization |
| GoldenMatch | dedupe + cluster + survivorship: fuzzy / exact / probabilistic / LLM |
| GoldenAnalysis | analysis + reporting: one exportable report over any stage's output, plus cross-run regression detection |
| GoldenPipe | orchestrator: declarative YAML pipeline wiring the steps |

  • Zero-config defaults that admit when they're unsure — every step has a self-verifying preflight + postflight; results carry an inspectable report instead of failing silently.
  • 96.4% F1 on DBLP-ACM out of the box for entity resolution — and the opt-in Fellegi-Sunter engine beats hand-rolled, expert-tuned Splink head-to-head on every dataset Splink scores (historical_50k pairwise F1 0.778 vs 0.757, cluster-level B³ 0.844 vs 0.789; one shared evaluator, reproducible bake-off).
  • Learning Memory — corrections persist across runs and re-anchor across row reorders, so the system stops needing the same correction twice (GoldenMatch v1.6.0; off by default).
  • Identity Graph — a durable graph layer above run-local clusters: stable entity_ids that survive across runs, an append-only event log, and create / absorb / merge / split semantics, surfaced on the CLI, REST, MCP, and SQL interfaces (the Identity Graph v2 feature, shipped in GoldenMatch v1.15).
  • Privacy-preserving record linkage — match across organizations without sharing raw data (PPRL, 92.4% F1 on FEBRL4).
  • AI-native by design — every package ships an MCP server, a REST API, and an A2A agent surface. 50+ MCP tools across the suite, including auto_configure + controller_telemetry for v1.7-v1.12 introspection.
  • AutoConfigController visible everywhere (v1.7-v1.12 surface-parity arc) — web ControllerPanel, TUI Ctrl+A, CLI goldenmatch autoconfig, REST /autoconfig + /controller/telemetry, Postgres goldenmatch_autoconfig + gm_telemetry, DuckDB UDFs, MCP/A2A telemetry tools. One JSON shape across every interface.
  • Polyglot parity — the full suite ships on npm (goldenmatch, goldencheck, goldenflow, goldenanalysis, infermap, goldenpipe) alongside PyPI; the TypeScript and Python implementations track the same outputs to 4-decimal precision via a cross-language parity harness.
  • Edge-safe, with optional native speed — the TypeScript cores are dependency-free and node:*-free, so they run in browsers, Cloudflare Workers, Vercel Edge, and Deno. An opt-in WebAssembly backend (await enableWasm() / enableAnalysisWasm()) swaps in the same pyo3-free Rust kernels the Python wheels and the SQL UDFs use — pure-TS stays the default and the byte-identical fallback, so default users download zero wasm bytes.
  • SQL-native, both engines at parity — the same functions run inside PostgreSQL (pgrx extension) and DuckDB: dedupe / match / score / auto-config + telemetry / identity graph, plus data profiling, evaluate, Fellegi-Sunter probabilistic scoring, and GoldenFlow transforms.
  • Production paths — Postgres sync, daemon mode, lineage tracking, review queues, dbt integration, GitHub Actions, and a Rust extension layer for Postgres / DuckDB.
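
The benchmark bullet above quotes both pairwise F1 and cluster-level B³. As a reminder of how the two differ, here is a small self-contained sketch of the standard definitions; it is not the project's shared evaluator, and treating the quoted B³ figure as a B³ F1 is an assumption.

```python
from itertools import combinations


def pairs(clusters):
    """All unordered record pairs that share a cluster."""
    return {frozenset(p) for c in clusters for p in combinations(sorted(c), 2)}


def pairwise_f1(pred, truth):
    p, t = pairs(pred), pairs(truth)
    tp = len(p & t)
    precision = tp / len(p) if p else 1.0
    recall = tp / len(t) if t else 1.0
    total = precision + recall
    return 2 * precision * recall / total if total else 0.0


def b_cubed_f1(pred, truth):
    """B-cubed: per-record precision and recall, averaged over all records."""
    pred_of = {r: frozenset(c) for c in pred for r in c}
    true_of = {r: frozenset(c) for c in truth for r in c}
    records = list(true_of)
    precision = sum(len(pred_of[r] & true_of[r]) / len(pred_of[r]) for r in records) / len(records)
    recall = sum(len(pred_of[r] & true_of[r]) / len(true_of[r]) for r in records) / len(records)
    return 2 * precision * recall / (precision + recall)


truth = [{1, 2, 3}, {4}]
pred = [{1, 2}, {3, 4}]
pw = pairwise_f1(pred, truth)
b3 = b_cubed_f1(pred, truth)
```

On this toy example the pairwise score is much lower than the B³ score, because pairwise counting penalizes every missed or spurious pair while B³ averages per record.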

The Suite

| Package | Lang | What it does | Install |
| --- | --- | --- | --- |
| GoldenMatch 🟡 | Python · TS | Zero-config entity resolution. Fuzzy + exact + probabilistic + LLM. Headline package. | pip install goldenmatch · npm i goldenmatch |
| GoldenCheck | Python · TS | Data-quality scanning: encoding, Unicode, format validation, anomaly detection. | pip install goldencheck · npm i goldencheck |
| GoldenFlow | Python · TS | Transforms & standardizers: phone, date, address, categorical normalization. | pip install goldenflow · npm i goldenflow |
| GoldenPipe | Python · TS | Orchestrator that wires Check → Flow → Match into one declarative pipeline. | pip install goldenpipe · npm i goldenpipe |
| InferMap | Python · TS | Schema mapping engine: auto-aligns columns across heterogeneous sources. | pip install infermap · npm i infermap |
| GoldenAnalysis | Python · TS | Cross-cutting analysis & reporting: consumes any stage's typed artifacts (or a raw DataFrame) and emits a unified, exportable AnalysisReport; optional Rust / WASM histogram+quantile kernels. | pip install goldenanalysis · npm i goldenanalysis |
| goldenmatch-extensions | Rust | Postgres extension (pgrx) + DuckDB UDFs. SQL-native fuzzy matching. | source build |
| dbt-goldensuite | dbt · Python | dbt package: dedupe + two-table match materializations (incl. zero-config Fellegi-Sunter), an ER match-quality build gate, quality-gate tests, transforms, and identity-graph reads for warehouse models. | packages.yml (git subdir) |
| goldencheck-action | YAML | GitHub Action: fail PRs that introduce data-quality regressions. | Marketplace |

Headline pitch and the deepest docs live in packages/python/goldenmatch/README.md (~1,300 lines, full feature list, CLI, architecture, benchmarks).

Knowledge graphs

Entity resolution is the stage most GraphRAG pipelines do badly — duplicate surface forms of the same entity scatter across documents. Two new packages put GoldenMatch's resolution there:

| Package | What it does | Status |
| --- | --- | --- |
| goldenmatch-kg | Drop-in GoldenMatch resolution as the entity-resolution stage of existing KG frameworks (neo4j-graphrag, LlamaIndex PropertyGraphIndex, Graphiti). One framework-agnostic resolve_entities core + per-framework adapters. The ER-stage lift is measured by ER-KG-Bench, not asserted. | in-repo · first PyPI release pending |
| goldengraph | Build-your-own-KG from text: text → LLM extraction → GoldenMatch resolution → a durable bi-temporal store. The engine (store / query / community detection) is pyo3-free Rust; ER is the differentiator. Early evidence program. | in-repo · first PyPI release pending |

Choose your path

| I want to... | Go here |
| --- | --- |
| Deduplicate a CSV right now | packages/python/goldenmatch |
| Use from Claude Desktop / Code | packages/python/goldenmatch (MCP) |
| Edit rules in a browser, label pairs, compare runs | packages/python/goldenmatch (Web UI) |
| Build AI agents that deduplicate | ER Agent / A2A wiki page |
| Profile data quality before matching | packages/python/goldencheck |
| Standardize messy fields (phone, date, address) | packages/python/goldenflow |
| Run the full pipeline declaratively | packages/python/goldenpipe |
| Map columns across schemas | packages/python/infermap |
| Analyze + report across stages and runs | packages/python/goldenanalysis |
| Write TypeScript / Node.js / Edge (browser, Workers; optional WASM) | packages/typescript/goldenmatch |
| Match in Postgres / DuckDB SQL | packages/rust/extensions |
| Add data-quality gates to dbt | packages/dbt/goldensuite |
| Block bad data in GitHub PRs | packages/actions/goldencheck |
| Run as Airflow DAGs | examples/airflow/ (12 drop-in DAGs) |
| Run from a single MCP container | docker run ghcr.io/benseverndev-oss/goldensuite-mcp:latest |
| Pull every Suite container | GitHub Packages |

Quick examples

Python — dedupe in 30 seconds

import goldenmatch as gm

# Zero-config
result = gm.dedupe("customers.csv")
print(result)  # DedupeResult(records=5000, clusters=847, match_rate=12.0%)
result.golden.write_csv("deduped.csv")

# Or be explicit
result = gm.dedupe("customers.csv",
    exact=["email"],
    fuzzy={"name": 0.85, "zip": 0.95},
    blocking=["zip"],
    threshold=0.85)
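
For readers new to the `blocking` and `fuzzy` arguments in the explicit call above, the following standard-library sketch shows the underlying idea: records are only compared inside the same block, and a pair counts as a match when its name similarity reaches the threshold. `difflib.SequenceMatcher` is a stand-in scorer chosen for this illustration, not the scorer GoldenMatch uses, and the sample rows are made up.

```python
from collections import defaultdict
from difflib import SequenceMatcher
from itertools import combinations

rows = [
    {"id": 1, "name": "John Smith", "zip": "10001"},
    {"id": 2, "name": "Jon Smith", "zip": "10001"},
    {"id": 3, "name": "John Smith", "zip": "94105"},
]


def candidate_pairs(rows, blocking_key):
    """Yield pairs of rows that share the same blocking-key value."""
    blocks = defaultdict(list)
    for row in rows:
        blocks[row[blocking_key]].append(row)
    for block in blocks.values():
        yield from combinations(block, 2)


def similarity(a, b):
    """Case-insensitive string similarity in [0, 1] (stand-in scorer)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


matches = [
    (a["id"], b["id"])
    for a, b in candidate_pairs(rows, "zip")
    if similarity(a["name"], b["name"]) >= 0.85
]
```

Rows 1 and 2 are compared and matched; row 3 has an identical name to row 1 but sits in a different zip block, so the pair is never scored. That is the recall cost blocking trades for speed.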

TypeScript — edge-safe core

import { dedupe } from "goldenmatch";

const result = dedupe(rows, {
  fuzzy: { name: 0.85 },
  blocking: ["zip"],
  threshold: 0.85,
});
console.log(result.stats);  // { totalRecords, totalClusters, matchRate, ... }

Runs in browsers, Vercel Edge, Cloudflare Workers, Deno — and optionally swaps in the Rust score-core kernel via await enableWasm(). ~940 tests, strict TypeScript (noUncheckedIndexedAccess, exactOptionalPropertyTypes).

Web workbench — browser UI for matching

pip install 'goldenmatch[web]'
goldenmatch serve-ui my-project   # opens http://localhost:5050


Edit rules with live validation, preview against a sampled slice, label pairs (mirrored into Learning Memory automatically), compare runs (CCMS), sweep parameters, browse the corrections store. Single-process localhost workbench shipped as the optional [web] extra.

Composed pipeline

import goldenpipe as gp

pipeline = gp.Pipeline.from_yaml("pipeline.yaml")  # check → flow → match
result = pipeline.run("customers.csv")
result.report.write_html("report.html")

More: examples/ has runnable demos for every Suite scenario: Python (quickstart, full pipeline, customer 360, PPRL, review workflow, MCP client) · TypeScript (quickstart, Vercel Edge route, MCP client) · Airflow DAGs (12 production-shaped pipelines).


Use cases (real-world pipelines)

Reproducible end-to-end pipelines running GoldenMatch on public data at scale, each with measured headline numbers vs baselines:

  • 🕵️ goldenmatch-shell-company-network — investigative ER across ICIJ Offshore Leaks + OpenSanctions + GLEIF + UK PSC + UK disqualified-directors. Confidence-weighted graph, structure mining, named investigative candidates. −62.5% analyst-hours to triage vs single-source baselines; +133% adversarial perturbation recovery.
  • 🛡️ goldenmatch-vuln-attribution — cross-database ER on 6.1M OSS vulnerability records across 40 sources (OSV, GHSA, PyPA, RustSec, Go vulndb, EPSS, CISA KEV, CVE Project bulk). 6,126,895 records → 847,475 canonical vulns in ~5 minutes end-to-end on a single 64GB runner via the full Golden Suite (Check + Flow + Match + Pipe).
  • ⚖️ goldenmatch-sanctions-reconciliation — cross-list coverage analysis on 85 public sanctions lists across 50+ jurisdictions via OpenSanctions, plus 10-year OFAC SDN history and PEP/crypto cross-analysis. Coverage-gap benchmark for any sanctions-screening vendor.

Install variants

GoldenMatch ships its heavier dependencies as optional extras, so you only install what you use:

pip install goldenmatch                      # core (CSV in, CSV out) + native acceleration on common platforms
pip install 'goldenmatch[native]'            # back-compat alias; native is already default on common platforms
pip install 'goldenmatch[embeddings]'        # + sentence-transformers, FAISS
pip install 'goldenmatch[llm]'               # + Claude / OpenAI for LLM boost
pip install 'goldenmatch[postgres]'          # + Postgres sync
pip install 'goldenmatch[snowflake]'         # + Snowflake connector
pip install 'goldenmatch[bigquery]'          # + BigQuery connector
pip install 'goldenmatch[databricks]'        # + Databricks connector
pip install 'goldenmatch[salesforce]'        # + Salesforce connector
pip install 'goldenmatch[duckdb]'            # + DuckDB out-of-core backend
pip install 'goldenmatch[ray]'               # + Ray distributed backend (50M+ rows)
pip install 'goldenmatch[quality]'           # + GoldenCheck integration
pip install 'goldenmatch[transform]'         # + GoldenFlow integration
pip install 'goldenmatch[mcp]'               # + MCP server for Claude Desktop
pip install 'goldenmatch[agent]'             # + A2A agent (aiohttp)
pip install 'goldenmatch[web]'               # + localhost browser workbench (FastAPI + React)

goldenmatch setup    # interactive wizard: GPU, API keys, database

Sister packages compose: pip install goldenpipe[full] brings in Check + Flow + Match together.


Remote MCP Server

GoldenMatch is hosted as an MCP server on Smithery — connect from any MCP client without installing anything.

{
  "mcpServers": {
    "goldenmatch": {
      "url": "https://goldenmatch-mcp-production.up.railway.app/mcp/"
    }
  }
}

50+ MCP tools across the suite: deduplicate, match, explain, review, link privately, configure, scan quality, transform, synthesize golden records, and manage Learning Memory corrections.


Container images

Every Suite package ships as a multi-arch container image (linux/amd64 + linux/arm64) on GitHub Container Registry. Pull anonymously, no auth needed:

# One container, every Suite tool — the convenience option
docker run -p 8300:8300 ghcr.io/benseverndev-oss/goldensuite-mcp:latest

# Per-package containers — narrower deployments
docker run -p 8200:8200 ghcr.io/benseverndev-oss/goldenmatch-mcp:latest
docker run -p 8100:8100 ghcr.io/benseverndev-oss/goldencheck-mcp:latest
docker run -p 8150:8150 ghcr.io/benseverndev-oss/goldenflow-mcp:latest
docker run -p 8250:8250 ghcr.io/benseverndev-oss/goldenpipe-mcp:latest
docker run -p 8400:8400 ghcr.io/benseverndev-oss/infermap-mcp:latest

# Postgres + extension preinstalled
docker run -e POSTGRES_PASSWORD=secret ghcr.io/benseverndev-oss/goldenmatch-extensions:latest

Tags:

  • :latest — current main
  • :main-<sha7> — every push to main, immutable
  • :vX.Y.Z and :vX.Y — pushed when a <package>-vX.Y.Z tag is created
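
The release-tag rule above can be sketched as a small parser. This is only an illustration of the naming scheme as described; image_tags is a hypothetical helper, the version numbers in the usage line are made up, and the actual CI workflow may derive its tags differently.

```python
import re

# Matches '<package>-vX.Y.Z', e.g. 'goldenmatch-v1.2.3' (example version is hypothetical).
_RELEASE_TAG = re.compile(r"(?P<package>.+)-v(?P<major>\d+)\.(?P<minor>\d+)\.(?P<patch>\d+)")


def image_tags(git_tag):
    """Return (package, [full_version_tag, minor_version_tag]) for a release git tag."""
    m = _RELEASE_TAG.fullmatch(git_tag)
    if m is None:
        raise ValueError(f"not a release tag: {git_tag!r}")
    major, minor, patch = m["major"], m["minor"], m["patch"]
    return m["package"], [f"v{major}.{minor}.{patch}", f"v{major}.{minor}"]


package, tags = image_tags("goldenmatch-v1.2.3")
print(package, tags)  # goldenmatch ['v1.2.3', 'v1.2']
```

Pinning :vX.Y.Z or :main-<sha7> gives a reproducible deployment, while :vX.Y and :latest move as new images are pushed.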

See packages/python/goldensuite-mcp/README.md for the aggregator's tool-collision behaviour.


Airflow

12 drop-in DAGs at examples/airflow/, grouped by lifecycle stage:

Group                   | DAGs
Core pipeline           | daily_dedupe, incremental_match, warehouse_native (Snowflake), customer_360 (multi-source)
Privacy                 | pprl_linkage (two-party PPRL)
Onboarding & monitoring | schema_align_and_load, schema_drift_alarm, quality_gate
Feedback loop           | review_worker, active_learning
Operationalize          | reverse_etl (Salesforce/HubSpot), backfill

TaskFlow API, Airflow 2.7+ (compatible with 3.x). Each DAG has tunable knobs at the top, idempotent retries, and is marker-protected against double-processing. Drop the file you want into your Airflow dags/ folder.
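
One common way to get "marker-protected" idempotent retries is to record a completion marker per run and skip work when the marker already exists. The sketch below shows that pattern in plain Python, without Airflow. It is an assumption about the general technique, not a description of how the shipped DAGs store their markers; run_once, the marker directory, and the run key format are all hypothetical.

```python
import tempfile
from pathlib import Path


def run_once(marker_dir, run_key, task):
    """Run task() only if no marker exists for run_key; return True if it ran."""
    marker = Path(marker_dir) / f"{run_key}.done"
    if marker.exists():
        return False      # already processed: the retry becomes a no-op
    task()
    marker.touch()        # written only after success, so a failed run stays retryable
    return True


processed = []
with tempfile.TemporaryDirectory() as tmp:
    key = "daily_dedupe_2026-01-01"  # hypothetical run key
    first = run_once(tmp, key, lambda: processed.append("batch"))
    retry = run_once(tmp, key, lambda: processed.append("batch"))

print(first, retry, processed)  # True False ['batch']
```

Writing the marker after the task succeeds means a crash between the two steps can still cause one repeat, so the task itself should tolerate being re-run.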


Repository layout

goldenmatch/
├── packages/
│   ├── python/
│   │   ├── goldenmatch/      # entity resolution — headline package
│   │   ├── goldencheck/      # data quality scanning
│   │   ├── goldenflow/       # transforms & standardizers
│   │   ├── goldenpipe/       # orchestrator
│   │   ├── infermap/         # schema mapping
│   │   └── goldenanalysis/   # cross-cutting analysis & reporting
│   ├── typescript/
│   │   ├── goldenmatch/      # full TS port (edge-safe core)
│   │   ├── goldencheck/      # TS implementation
│   │   ├── goldencheck-types/ # shared TS types
│   │   ├── goldenflow/       # TS transforms
│   │   ├── infermap/         # TS schema mapping
│   │   └── goldenanalysis/   # TS analysis & reporting (edge-safe + WASM)
│   ├── rust/
│   │   └── extensions/       # Postgres pgrx + DuckDB UDFs (own Cargo workspace)
│   ├── python/goldensuite-mcp/ # aggregator MCP server (one container, all tools)
│   ├── dbt/goldensuite/      # dbt package (materializations, tests, macros)
│   └── actions/goldencheck/  # GitHub Action
├── examples/
│   ├── python/               # 6 runnable Python scripts (quickstart → MCP)
│   ├── typescript/           # 3 TS scripts (quickstart, Vercel Edge, MCP)
│   └── airflow/              # 12 drop-in Airflow DAGs
├── docs/superpowers/         # design specs and implementation plans
├── justfile                  # install / test / lint / build, all languages
├── pyproject.toml            # uv workspace (root)
├── pnpm-workspace.yaml       # TypeScript pnpm workspace (Turborepo)
├── package.json              # root scripts + pnpm workspace root
└── .github/workflows/ci.yml

Workspaces (Cargo vs pnpm)

  • Cargo — no root workspace. packages/rust/extensions/ is itself a Cargo workspace (the postgres crate is excluded for pgrx-specific build requirements). Cargo doesn't allow nested workspaces sharing members, so Cargo commands run from inside packages/rust/extensions/.
  • TypeScript — a single pnpm workspace. packages/typescript/* form one pnpm + Turborepo workspace (see TypeScript dev setup). .npmrc pins node-linker=hoisted, giving a flat node_modules that avoids the Windows symlink issues an earlier per-package layout hit.

Build / test / lint everything

just install   # uv sync + per-package npm install + cargo fetch
just test      # all languages
just lint
just build

Reproducing benchmarks

Published GoldenMatch numbers (DQbench composite 91.04, DBLP-ACM 0.9641 F1, Febrl3 0.9443 F1, NCVR 0.9719 F1) map back to a single committed runner: scripts/run_benchmarks.py. See docs/reproducing-benchmarks.md for per-number commands, dataset URLs, expected output (with tolerance), variance notes (deterministic vs LLM-augmented), and a copy-pasteable one-click reproduction snippet for the DQbench composite. The same runner powers the weekly benchmarks.yml workflow.

Scale envelope

"How big can this handle?" is answered in docs/scale-envelope.md: per-backend ranges (Polars in-memory < 500K, DuckDB out-of-core 500K - 50M, Ray distributed >= 50M), block-size failure modes, candidate-pair math, and a single-page decision tree for picking a backend.

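The per-backend row ranges quoted above reduce to a simple threshold check. The sketch below encodes only those published ranges; pick_backend and the returned labels are hypothetical, and the real decision tree in docs/scale-envelope.md also weighs block sizes and candidate-pair counts, which are ignored here.

```python
def pick_backend(n_rows):
    """Map a row count to a backend label using the ranges quoted in the text."""
    if n_rows < 0:
        raise ValueError("n_rows must be non-negative")
    if n_rows < 500_000:
        return "polars"   # in-memory
    if n_rows < 50_000_000:
        return "duckdb"   # out-of-core
    return "ray"          # distributed


for n in (100_000, 500_000, 49_999_999, 50_000_000):
    print(n, pick_backend(n))
```
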
Verified at the top end: a full 100,000,000-row GoldenMatch dedupe on a 5-node Ray cluster (e2-standard-16, 80 CPU) completed in 9.2 min (554 s) and recovered all 20,000,000 golden records exactly, with the driver process peaking at 0.36 GB RSS. The default distributed path is now recall-complete (blocking-key shuffle scoring plus a distributed randomized-contraction WCC), so duplicates merge correctly no matter how the input is partitioned, and it stays driver-collect-free end to end (#844). A faster per-partition path is available via GOLDENMATCH_DISTRIBUTED_BLOCK_SHUFFLE=0 (also driver-collect-free, ~213 s on a 4-worker run) for inputs where duplicates already co-locate within partitions. That path under-merges when a cluster's members land in different input partitions, which is why recall-complete is the default. Recipe in packages/python/goldenmatch/configs/distributed-100m.yaml.
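
The under-merge risk of per-partition clustering can be shown on a toy example. The sketch below uses a plain union-find as a stand-in for connected components; it is not the distributed randomized-contraction WCC mentioned above, and connected_components, the record ids, and the partition labels are all hypothetical.

```python
def connected_components(n, edges):
    """Cluster records 0..n-1 by match edges using union-find with path halving."""
    parent = list(range(n))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in edges:
        ra, rb = find(a), find(b)
        if ra != rb:
            parent[rb] = ra

    groups = {}
    for i in range(n):
        groups.setdefault(find(i), set()).add(i)
    return sorted(groups.values(), key=min)


# Records 0, 1, 2 are the same entity; record 3 is unrelated.
match_edges = [(0, 1), (1, 2)]
partition_of = {0: "A", 1: "A", 2: "B", 3: "B"}

# Global view: every match edge is visible.
global_clusters = connected_components(4, match_edges)

# Per-partition view: an edge is only seen if both records sit in the same partition.
local_edges = [(a, b) for a, b in match_edges if partition_of[a] == partition_of[b]]
per_partition_clusters = connected_components(4, local_edges)

print(global_clusters)         # [{0, 1, 2}, {3}]
print(per_partition_clusters)  # [{0, 1}, {2}, {3}]
```

Record 2 is split off in the per-partition result because its only match edge crosses a partition boundary, which is the situation the recall-complete default is meant to handle.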


Contributing

  • Feature work goes on feature/<name> branches; merge via squash PR.
  • PR title format: feat: <description>, fix: <description>, docs: <description>.
  • Tests must pass on all three languages where the change applies; the parity harness in packages/typescript/goldenmatch/tests/parity/ enforces 4-decimal-tolerance Python ↔ TypeScript scorer parity.
  • See docs/superpowers/specs/ for design rationale on architectural decisions.
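
A parity check of the kind described above might look like the following. This is a hedged sketch, not the harness in packages/typescript/goldenmatch/tests/parity/: parity_mismatches is a hypothetical helper, the score dictionaries are made-up data, and "4-decimal tolerance" is interpreted here as an absolute difference of at most 1e-4.

```python
def parity_mismatches(python_scores, typescript_scores, tol=1e-4):
    """Return {pair_id: (py, ts)} for scores that differ by more than tol."""
    if python_scores.keys() != typescript_scores.keys():
        raise ValueError("the two runs scored different sets of pairs")
    return {
        pair_id: (python_scores[pair_id], typescript_scores[pair_id])
        for pair_id in python_scores
        if abs(python_scores[pair_id] - typescript_scores[pair_id]) > tol
    }


# Made-up scores for two record pairs, one per implementation.
py_scores = {"a|b": 0.91234, "c|d": 0.5000}
ts_scores = {"a|b": 0.91237, "c|d": 0.5002}

print(parity_mismatches(py_scores, ts_scores))  # {'c|d': (0.5, 0.5002)}
```

An empty result means the two scorers agree within tolerance for every pair.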

TypeScript dev setup (pnpm + Turborepo)

The TypeScript packages live in a single pnpm workspace orchestrated by Turborepo. From the repo root:

corepack enable                               # one-time, picks up pnpm@9.15.0 from package.json
pnpm install                                  # installs all workspace packages
pnpm turbo run build test typecheck lint      # full pipeline (cached after first run)
pnpm --filter goldenmatch test                # single package

Windows: enable Developer Mode for pnpm, because pnpm install creates symlinks under node_modules/ (Settings → For Developers → Developer Mode → On). If you see EPERM: operation not permitted, symlink ... during install, Developer Mode is off.

If corepack enable fails (often needs an admin shell on Windows), the fallback is npm i -g pnpm@9.15.0 — functionally equivalent.


History

This repository was formed on 2026-05-01 by folding 8 sibling repos into the existing goldenmatch repo using git filter-repo. Full commit history is preserved for every source. See docs/superpowers/specs/2026-05-01-goldenmatch-monorepo-fold-in-design.md for the design rationale and docs/superpowers/plans/2026-05-01-goldenmatch-monorepo-fold-in.md for the step-by-step migration plan.


Author & License

Built by Ben Severn.

MIT — see LICENSE.

Registry: active
Package: goldenmatch
Transport: STDIO, HTTP
Prompts: 5
Tools verified: Jun 10, 2026
Updated: Jun 9, 2026