API reference¶

matchbox.client is the client used to interact with the Matchbox server.

All names in matchbox.client are also accessible from the top-level matchbox module.

matchbox.client ¶

All client-side functionalities of Matchbox.

Modules:

clean –

Library of default cleaning functions.
dags –

Objects to define a DAG which indexes, deduplicates and links data.
extract –

Functions to extract data out of the Matchbox server.
helpers –

Core functionalities of the Matchbox client.
models –

Deduplication and linking methodologies.
results –

Objects representing the results of running a model client-side.
visualisation –

Visualisation utilities.

Functions:

process –

Apply cleaners to input dataframe.
index –

Indexes data in Matchbox.
match –

Matches IDs against the selected backend.
query –

Runs queries against the selected backend.
select –

From one set of credentials, builds and verifies a list of selectors.
make_model –

Create a unified model instance for either linking or deduping operations.

process ¶

process(
    data: DataFrame, pipeline: dict[str, dict[str, Any]]
) -> DataFrame

Apply cleaners to input dataframe.

Parameters:

data ¶
(DataFrame) –

The dataframe to process
pipeline ¶
(dict[str, dict[str, Any]]) –

Output of the cleaners() function

Returns:

DataFrame –

The processed data

index ¶

index(
    source_config: SourceConfig,
    batch_size: int | None = None,
) -> None

Indexes data in Matchbox.

Parameters:

source_config ¶
(SourceConfig) –

A SourceConfig with credentials set
batch_size ¶
(int | None, default: None ) –

The size of each batch when fetching data from the warehouse, which helps reduce the load on the database. Default is None.
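
The effect of batch_size can be pictured with a small generic sketch. This is not Matchbox's implementation: the `batched` helper and the `pk` rows are hypothetical, and the assumption that None means "fetch everything in one go" is an illustration only.

```python
from itertools import islice
from typing import Iterable, Iterator, Optional


def batched(
    rows: Iterable[dict], batch_size: Optional[int] = None
) -> Iterator[list]:
    """Yield rows in lists of at most batch_size items (hypothetical helper)."""
    iterator = iter(rows)
    if batch_size is None:
        # Assumed behaviour for illustration: no batching, one single fetch
        yield list(iterator)
        return
    # Pull batch_size rows at a time until the iterator is exhausted
    while batch := list(islice(iterator, batch_size)):
        yield batch


# Stand-in for rows streamed from a warehouse table
rows = [{"pk": i} for i in range(5)]
batch_lengths = [len(batch) for batch in batched(rows, batch_size=2)]
```

Only one batch is held in memory at a time, which is why a smaller batch size trades more round trips for a lower peak load.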

match ¶

match(
    *targets: list[SourceResolutionName],
    source: SourceResolutionName,
    key: str,
    resolution: ResolutionName = DEFAULT_RESOLUTION,
    threshold: int | None = None,
) -> list[Match]

Matches IDs against the selected backend.

Parameters:

targets ¶
(list[SourceResolutionName], default: () ) –

Source resolutions to find keys in
source ¶
(SourceResolutionName) –

The source resolution the provided key belongs to
key ¶
(str) –

The value to match from the source. Usually a primary key
resolution ¶
(optional, default: DEFAULT_RESOLUTION ) –

The resolution to use to resolve matches against. If not set, it will look for a default resolution.
threshold ¶
(optional, default: None ) –

The threshold to use for creating clusters. If None, uses the resolutions’ default threshold. If an integer, uses that threshold for the specified resolution, and the resolution’s cached thresholds for its ancestors.

Examples:

mb.match(
    "datahub_companies",
    "hmrc_exporters",
    source="companies_house",
    key="8534735",
    resolution="last_linker",
)

query ¶

query(
    *selectors: list[Selector],
    resolution: ResolutionName | None = None,
    combine_type: Literal[
        "concat", "explode", "set_agg"
    ] = "concat",
    return_type: ReturnTypeStr = "pandas",
    threshold: int | None = None,
    batch_size: int | None = None,
) -> QueryReturnType

Runs queries against the selected backend.

Parameters:

selectors ¶
(list[Selector], default: () ) –

Each selector is the output of select(). This allows querying sources coming from different engines.
resolution ¶
(optional, default: None ) –
The name of the resolution point to query. If not set:
- If querying a single source, it will use the source resolution
- If querying 2 or more sources, it will look for a default resolution
combine_type ¶
(Literal['concat', 'explode', 'set_agg'], default: 'concat' ) –
How to combine the data from different sources.
- If concat, concatenate all sources queried without any merging. Multiple rows per ID, with null values where data isn’t available
- If explode, outer join on Matchbox ID. Multiple rows per ID, with one for every unique combination of data requested across all sources
- If set_agg, join on Matchbox ID, group on Matchbox ID, then aggregate to nested lists of unique values. One row per ID, but all requested data is in nested arrays
return_type ¶
(ReturnTypeStr, default: 'pandas' ) –

The form to return data in, one of “pandas” or “arrow”. Defaults to pandas for ease of use.
threshold ¶
(optional, default: None ) –

The threshold to use for creating clusters. If None, uses the resolutions’ default threshold. If an integer, uses that threshold for the specified resolution, and the resolution’s cached thresholds for its ancestors.
batch_size ¶
(optional, default: None ) –

The size of each batch when fetching data from the warehouse, which helps reduce memory usage and load on the database. Default is None.

Returns: Data in the requested return type (DataFrame or ArrowTable).

Examples:

query(
    select({"companies_house": ["crn", "name"]}, credentials=engine),
)

query(
    select("companies_house", credentials=engine1),
    select("datahub_companies", credentials=engine2),
    resolution="last_linker",
)
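
The three combine_type options described above can be compared on toy data. The following is a minimal pure-Python sketch of the described semantics, not Matchbox's implementation: the rows, the `id` column standing in for the Matchbox ID, and the helper functions are all hypothetical.

```python
from itertools import product

# Hypothetical rows from two sources, keyed by a shared Matchbox ID ("id")
companies_house = [
    {"id": 1, "crn": "A1"},
    {"id": 1, "crn": "A2"},
    {"id": 2, "crn": "B1"},
]
datahub = [
    {"id": 1, "name": "Acme"},
    {"id": 3, "name": "Zed"},
]


def columns(*sources):
    """All column names across the sources, in first-seen order."""
    seen = []
    for rows in sources:
        for row in rows:
            for col in row:
                if col not in seen:
                    seen.append(col)
    return seen


def group_by_id(rows):
    groups = {}
    for row in rows:
        groups.setdefault(row["id"], []).append(row)
    return groups


def concat(*sources):
    """Stack all rows without merging; missing columns become None."""
    cols = columns(*sources)
    return [{c: row.get(c) for c in cols} for rows in sources for row in rows]


def explode(left, right):
    """Outer join on ID: one row per combination of left and right rows."""
    cols = columns(left, right)
    lg, rg = group_by_id(left), group_by_id(right)
    out = []
    for key in sorted(lg.keys() | rg.keys()):
        # An ID absent from one side pairs with a single empty row
        for l_row, r_row in product(lg.get(key, [{}]), rg.get(key, [{}])):
            merged = {"id": key, **l_row, **r_row}
            out.append({c: merged.get(c) for c in cols})
    return out


def set_agg(left, right):
    """One row per ID, with each column aggregated to a list of unique values."""
    cols = [c for c in columns(left, right) if c != "id"]
    out = {}
    for row in concat(left, right):
        agg = out.setdefault(row["id"], {"id": row["id"], **{c: [] for c in cols}})
        for c in cols:
            if row[c] is not None and row[c] not in agg[c]:
                agg[c].append(row[c])
    return [out[key] for key in sorted(out)]


concat_rows = concat(companies_house, datahub)
explode_rows = explode(companies_house, datahub)
set_agg_rows = set_agg(companies_house, datahub)
```

On this toy data, concat keeps all five input rows, explode produces four rows (two of them for ID 1), and set_agg produces exactly one row per ID.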

select ¶

select(
    *selection: SourceResolutionName
    | dict[SourceResolutionName, list[str]],
    credentials: Any | None = None,
) -> list[Selector]

From one set of credentials, builds and verifies a list of selectors.

Can be used on any number of sources as long as they share the same credentials.

Parameters:

selection ¶
(SourceResolutionName | dict[SourceResolutionName, list[str]], default: () ) –

The source resolutions to retrieve data from
credentials ¶
(Any | None, default: None ) –

The credentials to use for the source. Datatype will depend on the source’s location type. For example, a RelationalDBLocation will require a SQLAlchemy engine. If not provided, will populate with a SQLAlchemy engine from the default warehouse set in the environment variable MB__CLIENT__DEFAULT_WAREHOUSE.

Returns:

list[Selector] –

A list of Selector objects

Examples:

select("companies_house", credentials=engine)

select(
    {"companies_house": ["crn"], "hmrc_exporters": ["name"]}, credentials=engine
)
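
The fallback described for credentials can be sketched as follows. This is an illustration only: `resolve_credentials` is a hypothetical helper, and it assumes the environment variable holds a database connection string, which the reference above does not state. The real client is documented to build a SQLAlchemy engine; the sketch simply returns the string.

```python
import os


def resolve_credentials(credentials=None, env=os.environ):
    """Return explicit credentials, else fall back to the default warehouse.

    Hypothetical helper: assumes MB__CLIENT__DEFAULT_WAREHOUSE contains a
    connection string from which an engine could be created.
    """
    if credentials is not None:
        return credentials
    url = env.get("MB__CLIENT__DEFAULT_WAREHOUSE")
    if url is None:
        raise ValueError(
            "No credentials given and MB__CLIENT__DEFAULT_WAREHOUSE is not set"
        )
    # The documented behaviour is to create a SQLAlchemy engine at this point
    return url


fallback = resolve_credentials(
    env={"MB__CLIENT__DEFAULT_WAREHOUSE": "postgresql://localhost/warehouse"}
)
```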

make_model ¶

make_model(
    name: ModelResolutionName,
    description: str,
    model_class: type[Linker] | type[Deduper],
    model_settings: dict[str, Any],
    left_data: DataFrame,
    left_resolution: ResolutionName,
    right_data: DataFrame | None = None,
    right_resolution: ResolutionName | None = None,
) -> Model

Create a unified model instance for either linking or deduping operations.

Parameters:

name ¶
(ModelResolutionName) –

Your unique identifier for the model
description ¶
(str) –

Description of the model run
model_class ¶
(type[Linker] | type[Deduper]) –

Either Linker or Deduper class
model_settings ¶
(dict[str, Any]) –

Configuration settings for the model
left_data ¶
(DataFrame) –

Primary data
left_resolution ¶
(ResolutionName) –

Resolution name for primary model or source
right_data ¶
(DataFrame | None, default: None ) –

Secondary data (linking only)
right_resolution ¶
(ResolutionName | None, default: None ) –

Resolution name for secondary model or source (linking only)

Returns:

Model –

Configured model instance ready for execution
