API reference

matchbox.client is the client used to interact with the Matchbox server.

All names in matchbox.client are also accessible from the top-level matchbox module.

matchbox.client

All client-side functionality of Matchbox.

Modules:

  • clean

    Library of default cleaning functions.

  • dags

    Objects to define a DAG which indexes, deduplicates and links data.

  • helpers

    Core functionality of the Matchbox client.

  • models

    Deduplication and linking methodologies.

  • results

    Objects representing the results of running a model client-side.

  • visualisation

    Visualisation utilities.

Functions:

  • process

    Apply cleaners to the input dataframe.

  • index

    Indexes data in Matchbox.

  • match

    Matches IDs against the selected backend.

  • query

    Runs queries against the selected backend.

  • make_model

    Create a unified model instance for either linking or deduping operations.

process

process(
    data: DataFrame, pipeline: dict[str, dict[str, Any]]
) -> DataFrame

Apply cleaners to the input dataframe.

Parameters:

  • data

    (DataFrame) –

    The dataframe to process

  • pipeline

    (dict[str, dict[str, Any]]) –

    Output of the cleaners() function

Returns:

  • DataFrame

    The processed dataset
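
The idea of applying a pipeline of cleaners can be sketched in plain Python. This is a minimal illustration, not Matchbox's implementation: a list of dicts stands in for the DataFrame, the cleaners `strip_whitespace` and `lowercase` are hypothetical, and the pipeline layout (step name mapped to a dict with `"function"` and `"arguments"` keys) is an assumption about what cleaners() returns.

```python
from typing import Any

Row = dict[str, Any]
Table = list[Row]  # stand-in for a DataFrame


def strip_whitespace(table: Table, column: str) -> Table:
    """Hypothetical cleaner: trim surrounding whitespace in one column."""
    return [{**row, column: row[column].strip()} for row in table]


def lowercase(table: Table, column: str) -> Table:
    """Hypothetical cleaner: lowercase one column."""
    return [{**row, column: row[column].lower()} for row in table]


def apply_pipeline(table: Table, pipeline: dict[str, dict[str, Any]]) -> Table:
    """Apply each step in insertion order, feeding its output to the next."""
    current = table
    for step in pipeline.values():
        current = step["function"](current, **step["arguments"])
    return current


# Assumed layout: step name -> {"function": callable, "arguments": kwargs}
pipeline = {
    "strip_name": {"function": strip_whitespace, "arguments": {"column": "name"}},
    "lower_name": {"function": lowercase, "arguments": {"column": "name"}},
}

raw = [{"id": 1, "name": "  ACME Ltd "}, {"id": 2, "name": "Widgets PLC"}]
cleaned = apply_pipeline(raw, pipeline)
```

Each step returns a new table rather than mutating its input, so `raw` is left unchanged after the call.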

index

index(
    full_name: str,
    db_pk: str,
    engine: Engine,
    resolution_name: str | None = None,
    columns: list[str]
    | list[dict[str, dict[str, str]]]
    | None = None,
    batch_size: int | None = None,
) -> None

Indexes data in Matchbox.

Parameters:

  • full_name

    (str) –

    The full name of the source

  • db_pk

    (str) –

    The primary key of the source

  • engine

    (Engine) –

    The engine to connect to a data warehouse

  • resolution_name

    (str | None, default: None ) –

    A custom resolution name. If missing, will use the default name for a Source.

  • columns

    (list[str] | list[dict[str, dict[str, str]]] | None, default: None ) –

    The columns to index

  • batch_size

    (int | None, default: None ) –

    The size of each batch when fetching data from the warehouse, which helps reduce the load on the database.

Examples:

index("mb.test_orig", "id", engine=engine)
index("mb.test_cl2", "id", engine=engine, columns=["name", "age"])
index(
    "mb.test_cl2",
    "id",
    engine=engine,
    columns=[
        {"name": "name", "type": "TEXT"},
        {"name": "age", "type": "BIGINT"},
    ],
)
index("mb.test_orig", "id", engine=engine, batch_size=10_000)

match

match(
    *targets: list[Selector],
    source: list[Selector],
    source_pk: str,
    resolution_name: str = DEFAULT_RESOLUTION,
    threshold: int | None = None,
) -> list[Match]

Matches IDs against the selected backend.

Parameters:

  • targets

    (list[Selector], default: () ) –

    Each target is the output of select(). This allows matching against sources coming from different engines.

  • source

    (list[Selector]) –

    The output of using select() on a single source.

  • source_pk

    (str) –

    The primary key value to match from the source.

  • resolution_name

    (str, default: DEFAULT_RESOLUTION ) –

    The resolution name to use for filtering results. If not set, it will look for a default resolution.

  • threshold

    (int | None, default: None ) –

    The threshold to use for creating clusters. If None, uses the resolutions’ default threshold. If an integer, uses that threshold for the specified resolution, and the resolution’s cached thresholds for its ancestors.

Examples:

mb.match(
    select("datahub_companies", engine=engine),
    source=select("companies_house", engine=engine),
    source_pk="8534735",
    resolution_name="last_linker",
)

query

query(
    *selectors: list[Selector],
    resolution_name: str | None = None,
    combine_type: Literal[
        "concat", "explode", "set_agg"
    ] = "concat",
    return_type: ReturnTypeStr = "pandas",
    threshold: int | None = None,
    limit: int | None = None,
    batch_size: int | None = None,
    return_batches: bool = False,
) -> QueryReturnType | Iterator[QueryReturnType]

Runs queries against the selected backend.

Parameters:

  • selectors

    (list[Selector], default: () ) –

    Each selector is the output of select(). This allows querying sources coming from different engines.

  • resolution_name

    (str | None, default: None ) –

    The name of the resolution point to query. If not set:

    • If querying a single source, it will use the source resolution
    • If querying 2 or more sources, it will look for a default resolution

  • combine_type

    (Literal['concat', 'explode', 'set_agg'], default: 'concat' ) –

    How to combine the data from different sources.

    • If concat, concatenate all sources queried without any merging. Multiple rows per ID, with null values where data isn’t available
    • If explode, outer join on Matchbox ID. Multiple rows per ID, with one for every unique combination of data requested across all sources
    • If set_agg, join on Matchbox ID, group on Matchbox ID, then aggregate to nested lists of unique values. One row per ID, but all requested data is in nested arrays

  • return_type

    (ReturnTypeStr, default: 'pandas' ) –

    The form to return data in, one of “pandas” or “arrow”. Defaults to “pandas” for ease of use.

  • threshold

    (int | None, default: None ) –

    The threshold to use for creating clusters. If None, uses the resolutions’ default threshold. If an integer, uses that threshold for the specified resolution, and the resolution’s cached thresholds for its ancestors.

  • limit

    (int | None, default: None ) –

    The number to use in a limit clause. Useful for testing.

  • batch_size

    (int | None, default: None ) –

    The size of each batch when fetching data from the warehouse, which helps reduce memory usage and load on the database.

  • return_batches

    (bool, default: False ) –

    If True, returns an iterator of batches instead of a single combined result, which is useful for processing large datasets with limited memory.
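
The three combine_type modes described in the parameter list can be illustrated with a small pure-Python sketch. It is not Matchbox's implementation: lists of dicts stand in for query results, `id` plays the role of the Matchbox ID, and the sources, column names and helper functions are all hypothetical.

```python
from itertools import product
from typing import Any

Row = dict[str, Any]

# Two hypothetical sources that share the Matchbox ID under the key "id"
source_a = [
    {"id": 1, "a_name": "Acme"},
    {"id": 1, "a_name": "ACME Ltd"},
    {"id": 2, "a_name": "Widgets"},
]
source_b = [{"id": 1, "b_ref": "X1"}, {"id": 3, "b_ref": "Z9"}]
sources = [source_a, source_b]
columns = ["a_name", "b_ref"]


def concat(sources: list[list[Row]], columns: list[str]) -> list[Row]:
    """Stack rows from every source; columns a source lacks become None."""
    return [
        {"id": row["id"], **{c: row.get(c) for c in columns}}
        for source in sources
        for row in source
    ]


def explode(sources: list[list[Row]]) -> list[Row]:
    """Outer join on id: one row per combination of each source's rows."""
    ids = sorted({row["id"] for source in sources for row in source})
    out = []
    for id_ in ids:
        per_source = []
        for source in sources:
            cols = sorted({k for row in source for k in row if k != "id"})
            rows = [
                {k: v for k, v in row.items() if k != "id"}
                for row in source
                if row["id"] == id_
            ]
            # A source with no rows for this id contributes one all-None row
            per_source.append(rows or [dict.fromkeys(cols)])
        for combo in product(*per_source):
            merged: Row = {"id": id_}
            for part in combo:
                merged.update(part)
            out.append(merged)
    return out


def set_agg(sources: list[list[Row]], columns: list[str]) -> list[Row]:
    """One row per id; each column holds the sorted unique non-null values."""
    grouped: dict[Any, dict[str, set]] = {}
    for row in concat(sources, columns):
        slot = grouped.setdefault(row["id"], {c: set() for c in columns})
        for c in columns:
            if row[c] is not None:
                slot[c].add(row[c])
    return [
        {"id": id_, **{c: sorted(vals) for c, vals in cols.items()}}
        for id_, cols in sorted(grouped.items())
    ]
```

With this data, `concat` keeps all five input rows, `explode` produces four rows (two of them for ID 1), and `set_agg` produces one row per ID with list-valued columns.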

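Returns:

  • QueryReturnType | Iterator[QueryReturnType]

    The query results in the form given by return_type. If return_batches is True, an iterator of batches instead of a single combined result.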

Examples:

query(
    select({"companies_house": ["crn", "name"]}, engine=engine),
)
query(
    select("companies_house", engine=engine1),
    select("datahub_companies", engine=engine2),
    resolution_name="last_linker",
)
# Process large results in batches of 5000 rows
for batch in query(
    select("companies_house", engine=engine),
    batch_size=5000,
    return_batches=True,
):
    batch.head()

make_model

make_model(
    model_name: str,
    description: str,
    model_class: type[Linker] | type[Deduper],
    model_settings: dict[str, Any],
    left_data: DataFrame,
    left_resolution: str,
    right_data: DataFrame | None = None,
    right_resolution: str | None = None,
) -> Model

Create a unified model instance for either linking or deduping operations.

Parameters:

  • model_name

    (str) –

    Your unique identifier for the model

  • description

    (str) –

    Description of the model run

  • model_class

    (type[Linker] | type[Deduper]) –

    Either Linker or Deduper class

  • model_settings

    (dict[str, Any]) –

    Configuration settings for the model

  • left_data

    (DataFrame) –

    Primary dataset

  • left_resolution

    (str) –

    Resolution name for primary model or dataset

  • right_data

    (DataFrame | None, default: None ) –

    Secondary dataset (linking only)

  • right_resolution

    (str | None, default: None ) –

    Resolution name for secondary model or dataset (linking only)

Returns:

  • Model

    Configured model instance ready for execution