API reference
matchbox.client is the client used to interact with the Matchbox server.
All names in matchbox.client are also accessible from the top-level matchbox module.
matchbox.client
All client-side functionalities of Matchbox.
Modules:
- authorisation – Utilities for JWT API authorisation.
- dags – Objects to define a DAG which indexes, deduplicates and links data.
- eval – Module implementing client-side evaluation features.
- extract – Functions to extract data out of the Matchbox server.
- helpers – Core functionalities of the Matchbox client.
- models – Deduplication and linking methodologies.
- results – Objects representing the results of running a model client-side.
- visualisation – Visualisation utilities.
Functions:
- index – Indexes data in Matchbox.
- clean – Clean data using DuckDB with the provided cleaning SQL.
- match – Matches IDs against the selected backend.
- query – Runs queries against the selected backend.
- select – From one location client, builds and verifies a list of selectors.
- make_model – Create a unified model instance for either linking or deduping operations.
index
index(
    source_config: SourceConfig,
    batch_size: int | None = None,
) -> None
Indexes data in Matchbox.
Parameters:
- source_config (SourceConfig) – A SourceConfig with client set.
- batch_size (int | None, default: None) – The size of each batch when fetching data from the warehouse, which helps reduce the load on the database. Default is None.
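The effect of batch_size can be pictured with a small, library-independent sketch. The helper name iter_batches and the treatment of None as a single unbatched fetch are assumptions made for illustration; the source does not describe how Matchbox implements batching.

```python
from itertools import islice
from typing import Iterable, Iterator, List, Optional


def iter_batches(
    rows: Iterable[dict], batch_size: Optional[int] = None
) -> Iterator[List[dict]]:
    """Yield rows in lists of at most batch_size items (hypothetical helper)."""
    if batch_size is None:
        # Assumption: no batch size means everything is fetched in one go.
        yield list(rows)
        return
    if batch_size <= 0:
        raise ValueError("batch_size must be a positive integer")
    iterator = iter(rows)
    # islice pulls at most batch_size rows per pass, so only one batch
    # is held in memory at a time.
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            break
        yield batch


rows = [{"pk": i} for i in range(5)]
batch_sizes = [len(batch) for batch in iter_batches(rows, batch_size=2)]
```

Smaller batches mean more round trips but a lower peak load per fetch, which is the trade-off the parameter description points at.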
clean
clean(
    data: DataFrame, cleaning_dict: dict[str, str] | None
) -> DataFrame
Clean data using DuckDB with the provided cleaning SQL.
- ID is passed through automatically
- If present, leaf_id and key are passed through automatically
- Columns not mentioned in the cleaning_dict are passed through unchanged
- Each key in cleaning_dict is an alias for a SQL expression
Parameters:
- data (DataFrame) – Raw polars dataframe to clean.
- cleaning_dict (dict[str, str] | None) – A dictionary mapping field aliases to SQL expressions. The SQL expressions can reference columns in the data using their names. If None, no cleaning is applied and the original data is returned. SourceConfig.f() can be used to help reference qualified fields.
Returns:
- DataFrame – Cleaned polars dataframe
Examples:
Column passthrough behavior:
import polars as pl
from matchbox import clean

data = pl.DataFrame(
    {
        "id": [1, 2, 3],
        "name": ["John", "Jane", "Bob"],
        "age": [25, 30, 35],
        "city": ["London", "Hull", "Stratford-upon-Avon"],
    }
)
cleaning_dict = {
    "full_name": "name"  # Only references 'name' column
}
result = clean(data, cleaning_dict)
# Result columns: id, full_name, age, city
# 'name' is dropped because it was used in cleaning_dict
# 'age' and 'city' are passed through unchanged
Multiple column references:
data = pl.DataFrame(
    {
        "id": [1, 2, 3],
        "first": ["John", "Jane", "Bob"],
        "last": ["Doe", "Smith", "Johnson"],
        "salary": [50000, 60000, 55000],
    }
)
cleaning_dict = {
    "name": "first || ' ' || last",  # References both 'first' and 'last'
    "high_earner": "salary > 55000",
}
result = clean(data, cleaning_dict)
# Result columns: id, name, high_earner
# 'first', 'last', and 'salary' are dropped (used in expressions)
# No other columns to pass through
Special columns (leaf_id, key) handling:
data = pl.DataFrame(
    {
        "id": [1, 2, 3],
        "leaf_id": ["a", "b", "c"],
        "key": ["x", "y", "z"],
        "value": [10, 20, 30],
        "status": ["active", "inactive", "pending"],
    }
)
cleaning_dict = {"processed_value": "value * 2"}
result = clean(data, cleaning_dict)
# Result columns: id, leaf_id, key, processed_value, status
# 'id', 'leaf_id', 'key' always included automatically
# 'value' dropped (used in expression), 'status' passed through
No cleaning (returns original data):
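Passing None as cleaning_dict leaves the data untouched. The column rules listed above, including this case, can be sketched without polars or DuckDB. The function below only predicts output column names; result_columns, the regex-based detection of referenced columns, and the output ordering (taken from the comments in the examples) are assumptions for illustration, not how clean is implemented.

```python
import re
from typing import Dict, List, Optional

# Columns the documentation says are always passed through when present.
SPECIAL_COLUMNS = ("id", "leaf_id", "key")


def result_columns(
    columns: List[str], cleaning_dict: Optional[Dict[str, str]]
) -> List[str]:
    """Predict output column names under the documented rules (hypothetical)."""
    if cleaning_dict is None:
        # No cleaning: the original columns come back unchanged.
        return list(columns)
    used = set()
    for expression in cleaning_dict.values():
        # Drop single-quoted string literals so their contents are not
        # mistaken for column names.
        without_literals = re.sub(r"'[^']*'", "", expression)
        tokens = re.findall(r"[A-Za-z_][A-Za-z0-9_]*", without_literals)
        used.update(token for token in tokens if token in columns)
    special = [c for c in columns if c in SPECIAL_COLUMNS]
    passthrough = [
        c for c in columns if c not in SPECIAL_COLUMNS and c not in used
    ]
    # Ordering assumption: special columns, then aliases, then passthrough.
    return special + list(cleaning_dict) + passthrough


unchanged = result_columns(["id", "name", "age"], None)
renamed = result_columns(["id", "name", "age", "city"], {"full_name": "name"})
```

A token-level regex is a rough stand-in for real SQL parsing; it would misread a column that shares its name with a SQL function or keyword.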
match
match(
    *targets: list[SourceResolutionName],
    source: SourceResolutionName,
    key: str,
    resolution: ResolutionName = DEFAULT_RESOLUTION,
    threshold: int | None = None,
) -> list[Match]
Matches IDs against the selected backend.
Parameters:
- targets (list[SourceResolutionName], default: ()) – Source resolutions to find keys in.
- source (SourceResolutionName) – The source resolution the provided key belongs to.
- key (str) – The value to match from the source. Usually a primary key.
- resolution (optional, default: DEFAULT_RESOLUTION) – The resolution to use to resolve matches against. If not set, it will look for a default resolution.
- threshold (optional, default: None) – The threshold to use for creating clusters. If None, uses the resolutions’ default threshold. If an integer, uses that threshold for the specified resolution, and the resolution’s cached thresholds for its ancestors.
Examples:
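The threshold rule described in the parameters above can be sketched in plain Python. This is a toy model rather than a call to match: effective_thresholds, the lineage list, and the resolution names are hypothetical, and it assumes each resolution's cached threshold is also its default.

```python
from typing import Dict, List, Optional


def effective_thresholds(
    lineage: List[str],
    cached: Dict[str, int],
    threshold: Optional[int] = None,
) -> Dict[str, int]:
    """Pick the threshold applied to each resolution (hypothetical helper).

    lineage[0] is the resolution being queried; the rest are its ancestors.
    """
    # Start from every resolution's own cached threshold.
    resolved = {name: cached[name] for name in lineage}
    if threshold is not None:
        # An explicit integer overrides only the specified resolution;
        # ancestors keep their cached values.
        resolved[lineage[0]] = threshold
    return resolved


cached = {"final_linker": 80, "dedupe_a": 90, "dedupe_b": 85}
lineage = ["final_linker", "dedupe_a", "dedupe_b"]

defaults = effective_thresholds(lineage, cached)
overridden = effective_thresholds(lineage, cached, threshold=60)
```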
query
query(
    *selectors: list[Selector],
    resolution: ResolutionName | None = None,
    combine_type: Literal[
        "concat", "explode", "set_agg"
    ] = "concat",
    return_leaf_id: bool = True,
    return_type: ReturnTypeStr = "pandas",
    threshold: int | None = None,
    batch_size: int | None = None,
) -> QueryReturnType
Runs queries against the selected backend.
Parameters:
- selectors (list[Selector], default: ()) – Each selector is the output of select(). This allows querying sources coming from different engines.
- resolution (optional, default: None) – The name of the resolution point to query. If not set:
    - If querying a single source, it will use the source resolution
    - If querying 2 or more sources, it will look for a default resolution
- combine_type (Literal['concat', 'explode', 'set_agg'], default: 'concat') – How to combine the data from different sources.
    - If concat, concatenate all sources queried without any merging. Multiple rows per ID, with null values where data isn’t available
    - If explode, outer join on Matchbox ID. Multiple rows per ID, with one for every unique combination of data requested across all sources
    - If set_agg, join on Matchbox ID, group on Matchbox ID, then aggregate to nested lists of unique values. One row per ID, but all requested data is in nested arrays
- return_leaf_id (bool, default: True) – Whether matchbox IDs for source clusters should also be returned.
- return_type (ReturnTypeStr, default: 'pandas') – The form to return data in, one of “pandas” or “arrow”. Defaults to pandas for ease of use.
- threshold (optional, default: None) – The threshold to use for creating clusters. If None, uses the resolutions’ default threshold. If an integer, uses that threshold for the specified resolution, and the resolution’s cached thresholds for its ancestors.
- batch_size (optional, default: None) – The size of each batch when fetching data from the warehouse, which helps reduce memory usage and load on the database. Default is None.
Returns: Data in the requested return type (DataFrame or ArrowTable).
Examples:
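The three combine_type options are easiest to compare on a tiny dataset. The sketch below reproduces their described behaviour on plain lists of dicts; it does not call query, and the function names, field names and data are hypothetical. Treating missing values as excluded from the set_agg lists is also an assumption.

```python
from itertools import product

# Hypothetical per-source results; "id" stands in for the Matchbox ID.
source_a = [
    {"id": 1, "a_name": "Acme"},
    {"id": 1, "a_name": "ACME Ltd"},
    {"id": 2, "a_name": "Bolt"},
]
source_b = [{"id": 1, "b_city": "Hull"}]
sources = [source_a, source_b]


def fields_of(rows):
    """Non-ID field names in first-seen order."""
    seen = []
    for row in rows:
        for name in row:
            if name != "id" and name not in seen:
                seen.append(name)
    return seen


def combine_concat(sources):
    """Stack every row; fields a source lacks become None."""
    fields = fields_of([row for rows in sources for row in rows])
    return [
        {"id": row["id"], **{f: row.get(f) for f in fields}}
        for rows in sources
        for row in rows
    ]


def combine_explode(sources):
    """Outer join on ID: one row per unique combination across sources."""
    ids = []
    for rows in sources:
        for row in rows:
            if row["id"] not in ids:
                ids.append(row["id"])
    out = []
    for i in ids:
        per_source = []
        for rows in sources:
            fields = fields_of(rows)
            hits = [{f: r.get(f) for f in fields} for r in rows if r["id"] == i]
            # A source with no rows for this ID contributes one null row.
            per_source.append(hits or [dict.fromkeys(fields)])
        for combo in product(*per_source):
            merged = {"id": i}
            for part in combo:
                merged.update(part)
            if merged not in out:
                out.append(merged)
    return out


def combine_set_agg(sources):
    """Group on ID and collect unique values per field into lists."""
    fields = fields_of([row for rows in sources for row in rows])
    grouped = {}
    for rows in sources:
        for row in rows:
            bucket = grouped.setdefault(row["id"], {f: [] for f in fields})
            for f in fields:
                value = row.get(f)
                if value is not None and value not in bucket[f]:
                    bucket[f].append(value)
    return [{"id": i, **values} for i, values in grouped.items()]
```

On this data concat yields four rows, explode three, and set_agg two, which mirrors the row-count differences the parameter description spells out.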
select
select(
    *selection: SourceResolutionName
    | dict[SourceResolutionName, list[str]],
    client: Any | None = None,
) -> list[Selector]
From one location client, builds and verifies a list of selectors.
Can be used on any number of sources as long as they share the same client.
Parameters:
- selection (SourceResolutionName | dict[SourceResolutionName, list[str]], default: ()) – The source resolutions to retrieve data from.
- client (Any | None, default: None) – The client to use for the source. Datatype will depend on the source’s location type. For example, a RelationalDBLocation will require a SQLAlchemy engine. If not provided, will populate with a SQLAlchemy engine from the default warehouse set in the environment variable MB__CLIENT__DEFAULT_WAREHOUSE.
Returns:
Examples:
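The signature accepts a mix of bare source names and dictionaries mapping a source name to a list of fields. As a rough illustration of that argument shape only, the sketch below flattens such a mix into (source, fields) pairs. normalise_selection and the source names are hypothetical, and reading a bare name as "all fields" (represented by None) is an assumption the source does not confirm.

```python
from typing import Dict, List, Optional, Tuple, Union

Selection = Union[str, Dict[str, List[str]]]


def normalise_selection(
    *selection: Selection,
) -> List[Tuple[str, Optional[List[str]]]]:
    """Flatten mixed selection arguments into (source, fields) pairs."""
    pairs: List[Tuple[str, Optional[List[str]]]] = []
    for item in selection:
        if isinstance(item, str):
            # Assumption: a bare name means no field restriction.
            pairs.append((item, None))
        elif isinstance(item, dict):
            for source, fields in item.items():
                pairs.append((source, list(fields)))
        else:
            raise TypeError(f"unsupported selection: {item!r}")
    return pairs


pairs = normalise_selection("source_a", {"source_b": ["name", "postcode"]})
```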
make_model
make_model(
    name: ModelResolutionName,
    description: str,
    model_class: type[Linker] | type[Deduper],
    model_settings: dict[str, Any],
    left_data: DataFrame,
    left_resolution: ResolutionName,
    right_data: DataFrame | None = None,
    right_resolution: ResolutionName | None = None,
) -> Model
Create a unified model instance for either linking or deduping operations.
Parameters:
- name (ModelResolutionName) – Your unique identifier for the model
- description (str) – Description of the model run
- model_class (type[Linker] | type[Deduper]) – Either Linker or Deduper class
- model_settings (dict[str, Any]) – Configuration settings for the model
- left_data (DataFrame) – Primary data
- left_resolution (ResolutionName) – Resolution name for primary model or source
- right_data (DataFrame | None, default: None) – Secondary data (linking only)
- right_resolution (ResolutionName | None, default: None) – Resolution name for secondary model or source (linking only)
Returns:
- Model (Model) – Configured model instance ready for execution
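The "linking only" notes on right_data and right_resolution suggest a simple consistency check before building a model. The sketch below is one plausible reading of that constraint, not Matchbox's behaviour: check_model_inputs and its is_linker flag are hypothetical, and whether make_model itself raises in these cases is not stated in the source.

```python
from typing import Any, Optional


def check_model_inputs(
    is_linker: bool,
    right_data: Optional[Any] = None,
    right_resolution: Optional[str] = None,
) -> str:
    """Return "link" or "dedupe" after checking the right-hand arguments."""
    has_data = right_data is not None
    has_resolution = right_resolution is not None
    if is_linker:
        # Assumption: a linker needs both halves of the right-hand side.
        if not (has_data and has_resolution):
            raise ValueError("linking needs both right_data and right_resolution")
        return "link"
    # Assumption: a deduper works on the left-hand data alone.
    if has_data or has_resolution:
        raise ValueError("right_data and right_resolution are for linking only")
    return "dedupe"


dedupe_mode = check_model_inputs(is_linker=False)
link_mode = check_model_inputs(True, right_data=[{"id": 1}], right_resolution="source_b")
```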