Helpers¶

matchbox.client.helpers ¶

Core functionalities of the Matchbox client.

Modules:

comparison –

Functions to compare fields in different sources.
delete –

Functions to delete resolutions from the Matchbox server.
index –

Functions to index data sources to the Matchbox server.
selector –

Functions to select and retrieve data from the Matchbox server.

Functions:

delete_resolution –

Deletes a resolution from Matchbox.
get_source –

Get a SourceConfig for an existing source.
clean –

Clean data using DuckDB with the provided cleaning SQL.
select –

From one location client, builds and verifies a list of selectors.

delete_resolution ¶

delete_resolution(name: ResolutionName, certain: bool = False) -> None

Deletes a resolution from Matchbox.

Will delete:

The resolution itself
All descendants of the resolution
All endorsements of clusters made by those resolutions, either probabilities for models, or keys for sources

Will not delete:

The clusters themselves

Parameters:

name ¶
(ResolutionName) –

The name of the source to delete.
certain ¶
(bool, default: False ) –

Must be true to delete the source. Default is False.

get_source ¶

get_source(name: str, location: Location | None = None, extract_transform: str | None = None, key_field: str | None = None, index_fields: list[str] | None = None) -> SourceConfig

Get a SourceConfig for an existing source.

Parameters:

name ¶
(str) –

The name of the source resolution.
location ¶
(Location | None, default: None ) –

If provided, will validate the returned SourceConfig against the location.
extract_transform ¶
(str | None, default: None ) –

If provided, will validate the returned SourceConfig against the extract/transform logic.
key_field ¶
(str | None, default: None ) –

If provided, will validate the returned SourceConfig against the key field.
index_fields ¶
(list[str] | None, default: None ) –

If provided, will validate the returned SourceConfig against the index fields.

Returns:

SourceConfig –

A SourceConfig object.

clean ¶

clean(data: DataFrame, cleaning_dict: dict[str, str] | None) -> DataFrame

Clean data using DuckDB with the provided cleaning SQL.

ID is passed through automatically
If present, leaf_id and key are passed through automatically
Columns not mentioned in the cleaning_dict are passed through unchanged
Each key in cleaning_dict is an alias for a SQL expression

Parameters:

data ¶
(DataFrame) –

Raw polars dataframe to clean
cleaning_dict ¶
(dict[str, str] | None) –

A dictionary mapping field aliases to SQL expressions. The SQL expressions can reference columns in the data using their names. If None, no cleaning is applied and the original data is returned. SourceConfig.f() can be used to help reference qualified fields.

Returns:

DataFrame –

Cleaned polars dataframe

Examples:

Column passthrough behavior:

data = pl.DataFrame(
    {
        "id": [1, 2, 3],
        "name": ["John", "Jane", "Bob"],
        "age": [25, 30, 35],
        "city": ["London", "Hull", "Stratford-upon-Avon"],
    }
)
cleaning_dict = {
    "full_name": "name"  # Only references 'name' column
}
result = clean(data, cleaning_dict)
# Result columns: id, full_name, age, city
# 'name' is dropped because it was used in cleaning_dict
# 'age' and 'city' are passed through unchanged

Multiple column references:

data = pl.DataFrame(
    {
        "id": [1, 2, 3],
        "first": ["John", "Jane", "Bob"],
        "last": ["Doe", "Smith", "Johnson"],
        "salary": [50000, 60000, 55000],
    }
)
cleaning_dict = {
    "name": "first || ' ' || last",  # References both 'first' and 'last'
    "high_earner": "salary > 55000",
}
result = clean(data, cleaning_dict)
# Result columns: id, name, high_earner
# 'first', 'last', and 'salary' are dropped (used in expressions)
# No other columns to pass through

Special columns (leaf_id, key) handling:

data = pl.DataFrame(
    {
        "id": [1, 2, 3],
        "leaf_id": ["a", "b", "c"],
        "key": ["x", "y", "z"],
        "value": [10, 20, 30],
        "status": ["active", "inactive", "pending"],
    }
)
cleaning_dict = {"processed_value": "value * 2"}
result = clean(data, cleaning_dict)
# Result columns: id, leaf_id, key, processed_value, status
# 'id', 'leaf_id', 'key' always included automatically
# 'value' dropped (used in expression), 'status' passed through

No cleaning (returns original data):

data = pl.DataFrame({"id": [1, 2], "name": ["Alice", "Bob"], "score": [95, 87]})
result = clean(data, None)
# Returns exact same dataframe with all original columns

select ¶

select(*selection: SourceResolutionName | dict[SourceResolutionName, list[str]], client: Any | None = None) -> list[Selector]

From one location client, builds and verifies a list of selectors.

Can be used on any number of sources as long as they share the same client.

Parameters:

selection ¶
(SourceResolutionName | dict[SourceResolutionName, list[str]], default: () ) –

The source resolutions to retrieve data from
client ¶
(Any | None, default: None ) –

The client to use for the source. Datatype will depend on the source’s location type. For example, a RelationalDBLocation will require a SQLAlchemy engine. If not provided, will populate with a SQLAlchemy engine from the default warehouse set in the environment variable MB__CLIENT__DEFAULT_WAREHOUSE

Returns:

list[Selector] –

A list of Selector objects

Examples:

select("companies_house", client=engine)

select({"companies_house": ["crn"], "hmrc_exporters": ["name"]}, client=engine)

comparison ¶

Functions to compare fields in different sources.

Functions:

comparison –

Validates any number of SQL conditions and prepares for a WHERE clause.

comparison ¶

comparison(sql_condition: str, dialect: str = 'postgres') -> str

Validates any number of SQL conditions and prepares for a WHERE clause.

Requires all column references be explicitly declared as from “l” and “r” tables.

delete ¶

Functions to delete resolutions from the Matchbox server.

Functions:

delete_resolution –

Deletes a resolution from Matchbox.

delete_resolution ¶

delete_resolution(name: ResolutionName, certain: bool = False) -> None

Deletes a resolution from Matchbox.

Will delete:

The resolution itself
All descendants of the resolution
All endorsements of clusters made by those resolutions, either probabilities for models, or keys for sources

Will not delete:

The clusters themselves

Parameters:

name ¶
(ResolutionName) –

The name of the source to delete.
certain ¶
(bool, default: False ) –

Must be true to delete the source. Default is False.

index ¶

Functions to index data sources to the Matchbox server.

Functions:

index –

Indexes data in Matchbox.
get_source –

Get a SourceConfig for an existing source.

index ¶

index(source_config: SourceConfig, batch_size: int | None = None) -> None

Indexes data in Matchbox.

Parameters:

source_config ¶
(SourceConfig) –

A SourceConfig with client set
batch_size ¶
(int | None, default: None ) –

the size of each batch when fetching data from the warehouse, which helps reduce the load on the database. Default is None.

get_source ¶

get_source(name: str, location: Location | None = None, extract_transform: str | None = None, key_field: str | None = None, index_fields: list[str] | None = None) -> SourceConfig

Get a SourceConfig for an existing source.

Parameters:

name ¶
(str) –

The name of the source resolution.
location ¶
(Location | None, default: None ) –

If provided, will validate the returned SourceConfig against the location.
extract_transform ¶
(str | None, default: None ) –

If provided, will validate the returned SourceConfig against the extract/transform logic.
key_field ¶
(str | None, default: None ) –

If provided, will validate the returned SourceConfig against the key field.
index_fields ¶
(list[str] | None, default: None ) –

If provided, will validate the returned SourceConfig against the index fields.

Returns:

SourceConfig –

A SourceConfig object.

selector ¶

Functions to select and retrieve data from the Matchbox server.

Classes:

Selector –

A selector to choose a source and optionally a subset of columns to select.

Functions:

select –

From one location client, builds and verifies a list of selectors.
query –

Runs queries against the selected backend.
match –

Matches IDs against the selected backend.
clean –

Clean data using DuckDB with the provided cleaning SQL.

Selector ¶

Bases: BaseModel

A selector to choose a source and optionally a subset of columns to select.

Methods:

ensure_client –

Ensure that the source has client set.
ensure_fields –

Ensure that the fields are valid.
from_name_and_client –

Create a Selector from a source name and location client.

Attributes:

model_config –
source (SourceConfig) –
fields (list[SourceField]) –
qualified_key (str) –

Get the qualified key name for the selected source.
qualified_fields (list[str]) –

Get the qualified field names for the selected fields.

model_config `class-attribute` `instance-attribute` ¶

model_config = ConfigDict(arbitrary_types_allowed=True)

source `instance-attribute` ¶

source: SourceConfig

fields `instance-attribute` ¶

fields: list[SourceField]

qualified_key `property` ¶

qualified_key: str

Get the qualified key name for the selected source.

qualified_fields `property` ¶

qualified_fields: list[str]

Get the qualified field names for the selected fields.

ensure_client `classmethod` ¶

ensure_client(source: SourceConfig) -> SourceConfig

Ensure that the source has client set.

ensure_fields ¶

ensure_fields() -> Self

Ensure that the fields are valid.

from_name_and_client `classmethod` ¶

from_name_and_client(name: SourceResolutionName, client: Any, fields: list[str] | None = None) -> Selector

Create a Selector from a source name and location client.

Parameters:

name ¶
(SourceResolutionName) –

The name of the source to select from
client ¶
(Any) –

The client to use for the source
fields ¶
(list[str] | None, default: None ) –

A list of fields to select from the source

select ¶

select(*selection: SourceResolutionName | dict[SourceResolutionName, list[str]], client: Any | None = None) -> list[Selector]

From one location client, builds and verifies a list of selectors.

Can be used on any number of sources as long as they share the same client.

Parameters:

selection ¶
(SourceResolutionName | dict[SourceResolutionName, list[str]], default: () ) –

The source resolutions to retrieve data from
client ¶
(Any | None, default: None ) –

The client to use for the source. Datatype will depend on the source’s location type. For example, a RelationalDBLocation will require a SQLAlchemy engine. If not provided, will populate with a SQLAlchemy engine from the default warehouse set in the environment variable MB__CLIENT__DEFAULT_WAREHOUSE

Returns:

list[Selector] –

A list of Selector objects

Examples:

select("companies_house", client=engine)

select({"companies_house": ["crn"], "hmrc_exporters": ["name"]}, client=engine)

query ¶

query(*selectors: list[Selector], resolution: ResolutionName | None = None, combine_type: Literal['concat', 'explode', 'set_agg'] = 'concat', return_leaf_id: bool = True, return_type: ReturnTypeStr = 'pandas', threshold: int | None = None, batch_size: int | None = None) -> QueryReturnType

Runs queries against the selected backend.

Parameters:

selectors ¶
(list[Selector], default: () ) –

Each selector is the output of select(). This allows querying sources coming from different engines
resolution ¶
(optional, default: None ) –
The name of the resolution point to query If not set:
- If querying a single source, it will use the source resolution
- If querying 2 or more sources, it will look for a default resolution
combine_type ¶
(Literal['concat', 'explode', 'set_agg'], default: 'concat' ) –
How to combine the data from different sources.
- If concat, concatenate all sources queried without any merging. Multiple rows per ID, with null values where data isn’t available
- If explode, outer join on Matchbox ID. Multiple rows per ID, with one for every unique combination of data requested across all sources
- If set_agg, join on Matchbox ID, group on Matchbox ID, then aggregate to nested lists of unique values. One row per ID, but all requested data is in nested arrays
return_leaf_id ¶
(bool, default: True ) –

Whether matchbox IDs for source clusters should also be returned
return_type ¶
(ReturnTypeStr, default: 'pandas' ) –

The form to return data in, one of “pandas” or “arrow” Defaults to pandas for ease of use
threshold ¶
(optional, default: None ) –

The threshold to use for creating clusters If None, uses the resolutions’ default threshold If an integer, uses that threshold for the specified resolution, and the resolution’s cached thresholds for its ancestors
batch_size ¶
(optional, default: None ) –

The size of each batch when fetching data from the warehouse, which helps reduce memory usage and load on the database. Default is None.

Returns: Data in the requested return type (DataFrame or ArrowTable).

Examples:

query(
    select({"companies_house": ["crn", "name"]}, engine=engine),
)

query(
    select("companies_house", engine=engine1),
    select("datahub_companies", engine=engine2),
    resolution="last_linker",
)

match ¶

match(*targets: list[SourceResolutionName], source: SourceResolutionName, key: str, resolution: ResolutionName = DEFAULT_RESOLUTION, threshold: int | None = None) -> list[Match]

Matches IDs against the selected backend.

Parameters:

targets ¶
(list[SourceResolutionName], default: () ) –

Source resolutions to find keys in
source ¶
(SourceResolutionName) –

The source resolution the provided key belongs to
key ¶
(str) –

The value to match from the source. Usually a primary key
resolution ¶
(optional, default: DEFAULT_RESOLUTION ) –

The resolution to use to resolve matches against If not set, it will look for a default resolution.
threshold ¶
(optional, default: None ) –

The threshold to use for creating clusters. If None, uses the resolutions’ default threshold If an integer, uses that threshold for the specified resolution, and the resolution’s cached thresholds for its ancestors

Examples:

mb.match(
    "datahub_companies",
    "hmrc_exporters",
    source="companies_house",
    key="8534735",
    resolution="last_linker",
)

clean ¶

clean(data: DataFrame, cleaning_dict: dict[str, str] | None) -> DataFrame

Clean data using DuckDB with the provided cleaning SQL.

ID is passed through automatically
If present, leaf_id and key are passed through automatically
Columns not mentioned in the cleaning_dict are passed through unchanged
Each key in cleaning_dict is an alias for a SQL expression

Parameters:

data ¶
(DataFrame) –

Raw polars dataframe to clean
cleaning_dict ¶
(dict[str, str] | None) –

A dictionary mapping field aliases to SQL expressions. The SQL expressions can reference columns in the data using their names. If None, no cleaning is applied and the original data is returned. SourceConfig.f() can be used to help reference qualified fields.

Returns:

DataFrame –

Cleaned polars dataframe

Examples:

Column passthrough behavior:

data = pl.DataFrame(
    {
        "id": [1, 2, 3],
        "name": ["John", "Jane", "Bob"],
        "age": [25, 30, 35],
        "city": ["London", "Hull", "Stratford-upon-Avon"],
    }
)
cleaning_dict = {
    "full_name": "name"  # Only references 'name' column
}
result = clean(data, cleaning_dict)
# Result columns: id, full_name, age, city
# 'name' is dropped because it was used in cleaning_dict
# 'age' and 'city' are passed through unchanged

Multiple column references:

data = pl.DataFrame(
    {
        "id": [1, 2, 3],
        "first": ["John", "Jane", "Bob"],
        "last": ["Doe", "Smith", "Johnson"],
        "salary": [50000, 60000, 55000],
    }
)
cleaning_dict = {
    "name": "first || ' ' || last",  # References both 'first' and 'last'
    "high_earner": "salary > 55000",
}
result = clean(data, cleaning_dict)
# Result columns: id, name, high_earner
# 'first', 'last', and 'salary' are dropped (used in expressions)
# No other columns to pass through

Special columns (leaf_id, key) handling:

data = pl.DataFrame(
    {
        "id": [1, 2, 3],
        "leaf_id": ["a", "b", "c"],
        "key": ["x", "y", "z"],
        "value": [10, 20, 30],
        "status": ["active", "inactive", "pending"],
    }
)
cleaning_dict = {"processed_value": "value * 2"}
result = clean(data, cleaning_dict)
# Result columns: id, leaf_id, key, processed_value, status
# 'id', 'leaf_id', 'key' always included automatically
# 'value' dropped (used in expression), 'status' passed through

No cleaning (returns original data):

data = pl.DataFrame({"id": [1, 2], "name": ["Alice", "Bob"], "score": [95, 87]})
result = clean(data, None)
# Returns exact same dataframe with all original columns

Helpers¶

matchbox.client.helpers ¶

delete_resolution ¶

name ¶

certain ¶

get_source ¶

name ¶

location ¶

extract_transform ¶

key_field ¶

index_fields ¶

clean ¶

data ¶

cleaning_dict ¶

select ¶

selection ¶

client ¶

comparison ¶

comparison ¶

delete ¶

delete_resolution ¶

name ¶

certain ¶

index ¶

index ¶

source_config ¶

batch_size ¶

get_source ¶

name ¶

location ¶

extract_transform ¶

key_field ¶

index_fields ¶

selector ¶

Selector ¶

model_config class-attribute instance-attribute ¶

source instance-attribute ¶

fields instance-attribute ¶

qualified_key property ¶

qualified_fields property ¶

ensure_client classmethod ¶

ensure_fields ¶

from_name_and_client classmethod ¶

name ¶

client ¶

fields ¶

select ¶

selection ¶

client ¶

query ¶

selectors ¶

resolution ¶

combine_type ¶

return_leaf_id ¶

return_type ¶

threshold ¶

batch_size ¶

match ¶

targets ¶

source ¶

key ¶

resolution ¶

threshold ¶

clean ¶

data ¶

cleaning_dict ¶

`name` ¶

`certain` ¶

`name` ¶

`location` ¶

`extract_transform` ¶

`key_field` ¶

`index_fields` ¶

`data` ¶

`cleaning_dict` ¶

`selection` ¶

`client` ¶

`name` ¶

`certain` ¶

`source_config` ¶

`batch_size` ¶

`name` ¶

`location` ¶

`extract_transform` ¶

`key_field` ¶

`index_fields` ¶

model_config `class-attribute` `instance-attribute` ¶

source `instance-attribute` ¶

fields `instance-attribute` ¶

qualified_key `property` ¶

qualified_fields `property` ¶

ensure_client `classmethod` ¶

from_name_and_client `classmethod` ¶

`name` ¶

`client` ¶

`fields` ¶

`selection` ¶

`client` ¶

`selectors` ¶

`resolution` ¶

`combine_type` ¶

`return_leaf_id` ¶

`return_type` ¶

`threshold` ¶

`batch_size` ¶

`targets` ¶

`source` ¶

`key` ¶

`resolution` ¶

`threshold` ¶

`data` ¶

`cleaning_dict` ¶