Matchbox concepts
Matchbox orchestrates entity resolution pipelines and stores their outputs in a shared backend. It gives data engineers, analysts, reviewers, and downstream services a common view of sources, matching logic, and resolved entities.
A Matchbox backend stores two connected structures:
- An execution graph containing sources, models, and resolvers.
- A data graph containing source clusters, model score edges, and resolver clusters.
The execution graph is the small DAG you build and publish. The data graph follows the same topology, but stores the source clusters, score edges, and resolver clusters created by those steps.
```mermaid
graph LR
    subgraph EG["Execution graph"]
        direction TD
        SA["Source A"] --> MA["Deduper A"]
        MA --> RA["Resolver A"]
        RA --> LAB["Linker AB"]
        SB["Source B"] --> LAB
        LAB --> RF["Final resolver"]
    end
    subgraph DG["Data graph"]
        direction TD
        SAC["Source clusters A"] --> MAS["Deduper A score edges"]
        MAS --> RAC["Resolver A clusters"]
        RAC --> LABS["Linker AB score edges"]
        SBC["Source clusters B"] --> LABS
        LABS --> RFC["Final clusters"]
    end
    SA -.-> SAC
    MA -.-> MAS
    RA -.-> RAC
    SB -.-> SBC
    LAB -.-> LABS
    RF -.-> RFC
```
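As a rough illustration of how the execution graph determines what lands in the data graph, the sketch below encodes the diagram's steps as a dependency mapping and walks them in dependency order. The snake_case step names and the `step_kind` and `writes` dictionaries are assumptions made for this example; they are not Matchbox identifiers or APIs.

```python
from graphlib import TopologicalSorter

# Step name -> set of upstream steps it depends on, copied from the diagram.
# These names are illustrative only, not Matchbox identifiers.
execution_graph = {
    "source_a": set(),
    "source_b": set(),
    "deduper_a": {"source_a"},
    "resolver_a": {"deduper_a"},
    "linker_ab": {"resolver_a", "source_b"},
    "final_resolver": {"linker_ab"},
}

# Kind of each step, and what that kind of step writes to the data graph.
step_kind = {
    "source_a": "source",
    "source_b": "source",
    "deduper_a": "model",
    "linker_ab": "model",
    "resolver_a": "resolver",
    "final_resolver": "resolver",
}
writes = {
    "source": "source clusters",
    "model": "score edges",
    "resolver": "resolver clusters",
}

# A topological order guarantees every step runs after its inputs exist.
run_order = list(TopologicalSorter(execution_graph).static_order())

for step in run_order:
    print(f"{step}: writes {writes[step_kind[step]]}")
```

Any valid order places both inputs of the linker before it and ends with the final resolver, which mirrors the one-to-one dotted links between the two graphs.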
Sources
A source is a curated view of the records you want to match. It usually comes from a warehouse query, file extract, or other structured data feed.
Every source needs:
- A location that tells Matchbox where the data lives.
- An extract-transform definition that produces the source rows.
- A key field that uniquely identifies each row.
- Index fields that Matchbox is allowed to use for matching.
Imagine a warehouse with customer and customer_addresses tables linked by customer_id.
```mermaid
erDiagram
    customer ||--o{ customer_addresses : has
    customer {
        int customer_id PK
        string full_name
        string email
    }
    customer_addresses {
        int address_id PK
        int customer_id FK
        string street
        string city
        string postal_code
    }
```
One source might use this SQL:
```sql
SELECT
    customer.customer_id,
    full_name,
    email,
    ARRAY_AGG(postal_code) AS postal_codes
FROM customer
LEFT JOIN customer_addresses
    ON customer.customer_id = customer_addresses.customer_id
GROUP BY customer.customer_id;
```
The source key is customer_id. The index fields are full_name, email, and postal_codes.
| customer_id | full_name | email | postal_codes |
|---|---|---|---|
| 1 | Alice Johnson | alice@johnson.com | {"90210", "10001"} |
| 2 | Alice Johnson | ajohnson@domain.com | {"10001"} |
| 3 | Bob Smith | bsmith@domain.com | {"12345"} |
| 4 | Bob Smith | bsmith@domain.com | {"12345"} |
Note that the third and fourth rows are identical once the key is excluded. No model could tell them apart using the fields the source returns. For this reason, Matchbox indexes them as one item and records that the item maps to two distinct source keys, 3 and 4.
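The collapsing of identical rows can be sketched as a simple grouping by index-field values. This is a minimal illustration using the example rows; the variable names are hypothetical and the code does not reflect how Matchbox implements indexing internally.

```python
from collections import defaultdict

# Example rows: (customer_id, full_name, email, postal_codes).
# Tuples are used for postal_codes so the index values are hashable.
rows = [
    (1, "Alice Johnson", "alice@johnson.com", ("90210", "10001")),
    (2, "Alice Johnson", "ajohnson@domain.com", ("10001",)),
    (3, "Bob Smith", "bsmith@domain.com", ("12345",)),
    (4, "Bob Smith", "bsmith@domain.com", ("12345",)),
]

# Group source keys by the values of the index fields.
# Rows that agree on every index field become one indexed item.
keys_by_item = defaultdict(list)
for customer_id, *index_values in rows:
    keys_by_item[tuple(index_values)].append(customer_id)

for item, keys in keys_by_item.items():
    print(keys, item)
```

Four source rows produce three indexed items, and the Bob Smith item carries both of its source keys.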
Matchbox never sends raw source fields to the backend. It hashes the indexed values client-side and uploads those hashes instead, so the server stores stable identifiers for matching without storing the source data itself.
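To illustrate the idea of client-side hashing, the sketch below derives a stable digest from a row's indexed values. The choice of SHA-256, the JSON canonicalisation, and the function name `hash_indexed_values` are all assumptions for this example; the text above does not specify which hashing scheme Matchbox actually uses.

```python
import hashlib
import json


def hash_indexed_values(values: dict) -> str:
    """Return a hex digest for a row's indexed values (hypothetical helper)."""
    # Sorting keys and fixing separators makes the serialisation canonical,
    # so field order does not change the digest.
    canonical = json.dumps(values, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


row_3 = {"full_name": "Bob Smith", "email": "bsmith@domain.com", "postal_codes": ["12345"]}
row_4 = {"postal_codes": ["12345"], "email": "bsmith@domain.com", "full_name": "Bob Smith"}
row_1 = {"full_name": "Alice Johnson", "email": "alice@johnson.com", "postal_codes": ["90210", "10001"]}

# Identical indexed values give identical digests; only digests would be uploaded.
print(hash_indexed_values(row_3) == hash_indexed_values(row_4))
print(hash_indexed_values(row_3) == hash_indexed_values(row_1))
```

Under these assumptions, rows 3 and 4 hash to the same identifier, which is consistent with indexing them as a single item.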
Models and scores
Models perform the matching work.
- A deduper consumes one query.
- A linker consumes a left query and a right query.
- A model can consume sources directly or query through upstream resolvers.
The output of a model is a table of scored pairs. Each row contains:
- left_id
- right_id
- score
The score is a floating-point value between 0.0 and 1.0. Deterministic models usually emit 1.0. Learned or weighted models can emit any value in that range.
Matchbox uses the word score rather than probability because these values
act as match-strength signals without claiming a formal probabilistic
interpretation.
For example, if a deduper thinks customer 1 and customer 2 refer to the same entity with score 0.8, the model output looks like this:
```mermaid
graph LR
    1((1))
    2((2))
    1 -- 0.8 --> 2
```
Model steps store those scored edges on the backend. They do not define the final entity view on their own.
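As a minimal sketch of the scored-pair output described in this section, the toy deduper below compares every pair of records from one query and emits a score of 1.0 for exact name matches. The function `exact_name_deduper` and its input shape are hypothetical and stand in for whatever matching logic a real model applies.

```python
from itertools import combinations

# One query's worth of records: id -> full_name (illustrative data).
records = {1: "Alice Johnson", 2: "Alice Johnson", 3: "Bob Smith"}


def exact_name_deduper(records: dict) -> list:
    """Emit (left_id, right_id, score) rows for records with equal names."""
    scored_pairs = []
    # combinations() visits each unordered pair exactly once.
    for (left_id, left_name), (right_id, right_name) in combinations(sorted(records.items()), 2):
        if left_name == right_name:
            # A deterministic rule emits the maximum score.
            scored_pairs.append((left_id, right_id, 1.0))
    return scored_pairs


scored_pairs = exact_name_deduper(records)
print(scored_pairs)
```

A learned or weighted model would return the same three columns but with scores anywhere between 0.0 and 1.0.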
Resolvers and clusters
Resolvers turn model score edges into clusters. A resolver can consume one model or several models, which makes clustering policy explicit and reusable.
One common strategy is connected components over all model edges that meet per-model thresholds. Imagine a second model produced the following model output:
```mermaid
graph LR
    2((2)) -- 0.9 --> 3((3))
```
Concatenating that with the first model’s edges gives:
| left | right | score |
|---|---|---|
| 1 | 2 | 0.8 |
| 2 | 3 | 0.9 |
Running connected components over that combined edge set produces one cluster containing customers 1, 2, and 3:
```mermaid
graph LR
    0((cluster))
    1((1))
    2((2))
    3((3))
    0 --> 1
    0 --> 2
    0 --> 3
```
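The connected-components strategy with per-model thresholds can be sketched with a small union-find. This is an illustration under stated assumptions: the `resolve` function, the model names, and the threshold values are hypothetical and are not taken from Matchbox.

```python
def resolve(edges_by_model: dict, thresholds: dict) -> list:
    """Cluster nodes by connected components over edges that meet each model's threshold."""
    parent = {}

    def find(node):
        # Register unseen nodes as their own root, then follow parents to the root.
        parent.setdefault(node, node)
        while parent[node] != node:
            parent[node] = parent[parent[node]]  # path halving
            node = parent[node]
        return node

    for model, edges in edges_by_model.items():
        for left, right, score in edges:
            # Look up both endpoints first so sub-threshold nodes still appear as singletons.
            root_left, root_right = find(left), find(right)
            if score >= thresholds[model]:
                parent[root_right] = root_left

    clusters = {}
    for node in parent:
        clusters.setdefault(find(node), set()).add(node)
    return sorted(clusters.values(), key=min)


edges_by_model = {
    "model_1": [(1, 2, 0.8)],
    "model_2": [(2, 3, 0.9)],
}

# Both edges pass their thresholds, so all three customers join one cluster.
print(resolve(edges_by_model, {"model_1": 0.7, "model_2": 0.7}))

# Raising model_1's threshold drops the 1-2 edge and splits customer 1 off.
print(resolve(edges_by_model, {"model_1": 0.85, "model_2": 0.7}))
```

Keeping the thresholds as an input to the resolver, rather than baking them into each model, is what makes the clustering policy explicit: the same stored score edges can be re-clustered under a different policy without rerunning the models.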
Matchbox uses three important ideas here:
- Source steps create source clusters.
- Model steps create score edges between existing clusters.
- Resolver steps create the clusters that users query.
When you query Matchbox, you always query through a resolver. The default resolver is the single final resolver for a published DAG.
Architecture
Sources, models, and resolvers run client-side.
- Sources materialise data and hashes locally.
- Models compute score edges locally.
- Resolvers compute cluster assignments locally.
The backend stores fingerprints, step metadata, model scores, resolver clusters, and evaluation data. This keeps the server focused on coordination, storage, and querying rather than warehouse-side matching logic.
The PostgreSQL adapter is one implementation of that backend contract. Other adapters can implement the same interfaces as long as they preserve the same high-level behaviour.