Helpers

matchbox.client.helpers

Core functionality of the Matchbox client.

Modules:

  • cleaner

    Functions to pre-process data sources.

  • comparison

    Functions to compare fields in different sources.

  • delete

    Functions to delete resolutions from the Matchbox server.

  • index

    Functions to index data sources to the Matchbox server.

  • selector

    Functions to select and retrieve data from the Matchbox server.

Functions:

  • cleaners

    Combine multiple cleaners in a single object to pass to process().

  • delete_resolution

    Deletes a resolution from Matchbox.

  • select

    From one set of credentials, builds and verifies a list of selectors.

cleaners

cleaners(
    *cleaner: dict[str, dict[str, Any]],
) -> dict[str, dict[str, Any]]

Combine multiple cleaners in a single object to pass to process().

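Parameters:

  • cleaner

    (dict[str, dict[str, Any]], default: () ) –

    Any number of cleaners, each the output of the cleaner() function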

Returns:

  • dict[str, dict[str, Any]]

    A representation of multiple cleaners to be passed to the process() function

Examples:

clean_pipeline = cleaners(
    cleaner(
        normalise_company_number,
        {"column": "company_number"},
    ),
    cleaner(
        normalise_postcode,
        {"column": "postcode"},
    ),
)

delete_resolution

delete_resolution(
    name: ResolutionName, certain: bool = False
) -> None

Deletes a resolution from Matchbox.

Will delete:

  • The resolution itself
  • All descendants of the resolution
  • All endorsements of clusters made by those resolutions, either probabilities for models, or keys for sources

Will not delete:

  • The clusters themselves

Parameters:

  • name

    (ResolutionName) –

    The name of the resolution to delete.

  • certain

    (bool, default: False ) –

    Must be True to delete the resolution. Default is False.

select

select(
    *selection: SourceResolutionName
    | dict[SourceResolutionName, list[str]],
    credentials: Any | None = None,
) -> list[Selector]

From one set of credentials, builds and verifies a list of selectors.

Can be used on any number of sources as long as they share the same credentials.

Parameters:

  • selection

    (SourceResolutionName | dict[SourceResolutionName, list[str]], default: () ) –

    The source resolutions to retrieve data from

  • credentials

    (Any | None, default: None ) –

    The credentials to use for the source. The datatype depends on the source’s location type. For example, a RelationalDBLocation requires a SQLAlchemy engine. If not provided, a SQLAlchemy engine is created from the default warehouse set in the environment variable MB__CLIENT__DEFAULT_WAREHOUSE.

Returns:

  • list[Selector]

    The built and verified selectors

Examples:

select("companies_house", credentials=engine)
select(
    {"companies_house": ["crn"], "hmrc_exporters": ["name"]}, credentials=engine
)

cleaner

Functions to pre-process data sources.

Functions:

  • cleaner

    Define a function to clean data.

  • cleaners

    Combine multiple cleaners in a single object to pass to process().

  • process

    Apply cleaners to input dataframe.

cleaner

Define a function to clean data.

Parameters:

  • function
    (Callable) –

    The callable implementing the cleaning behaviour

  • arguments
    (dict) –

    A dictionary of keyword arguments to pass to the cleaning function

Returns:

  • dict[str, dict[str, Any]]

    A representation of the cleaner ready to be passed to the cleaners() function

cleaners

cleaners(
    *cleaner: dict[str, dict[str, Any]],
) -> dict[str, dict[str, Any]]

Combine multiple cleaners in a single object to pass to process().

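Parameters:

  • cleaner
    (dict[str, dict[str, Any]], default: () ) –

    Any number of cleaners, each the output of the cleaner() function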

Returns:

  • dict[str, dict[str, Any]]

    A representation of multiple cleaners to be passed to the process() function

Examples:

clean_pipeline = cleaners(
    cleaner(
        normalise_company_number,
        {"column": "company_number"},
    ),
    cleaner(
        normalise_postcode,
        {"column": "postcode"},
    ),
)

process

process(
    data: DataFrame, pipeline: dict[str, dict[str, Any]]
) -> DataFrame

Apply cleaners to input dataframe.

Parameters:

  • data
    (DataFrame) –

    The dataframe to process

  • pipeline
    (dict[str, dict[str, Any]]) –

    Output of the cleaners() function

Returns:

  • DataFrame

    The processed data
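
The way cleaner(), cleaners() and process() fit together can be sketched without Matchbox installed. This is a minimal stand-in, not the library's implementation: the `sketch_*` names, the list-of-dicts data in place of a DataFrame, the two cleaning functions, and the assumption that each cleaner is keyed by its function's name are all illustrative.

```python
from typing import Any, Callable

Row = dict[str, Any]


def sketch_cleaner(function: Callable, arguments: dict) -> dict[str, dict[str, Any]]:
    # Assumed shape: one entry keyed by the function's name.
    return {function.__name__: {"function": function, "arguments": arguments}}


def sketch_cleaners(*cleaner: dict[str, dict[str, Any]]) -> dict[str, dict[str, Any]]:
    # Merge the single-entry dicts into one pipeline, keeping call order.
    combined: dict[str, dict[str, Any]] = {}
    for single in cleaner:
        combined.update(single)
    return combined


def sketch_process(data: list[Row], pipeline: dict[str, dict[str, Any]]) -> list[Row]:
    # Apply each step to the output of the previous one.
    current = data
    for step in pipeline.values():
        current = step["function"](current, **step["arguments"])
    return current


# Hypothetical cleaning functions operating on lists of row dicts.
def strip_whitespace(rows: list[Row], column: str) -> list[Row]:
    return [{**row, column: row[column].strip()} for row in rows]


def uppercase(rows: list[Row], column: str) -> list[Row]:
    return [{**row, column: row[column].upper()} for row in rows]


pipeline = sketch_cleaners(
    sketch_cleaner(strip_whitespace, {"column": "postcode"}),
    sketch_cleaner(uppercase, {"column": "postcode"}),
)
cleaned = sketch_process([{"postcode": " sw1a 1aa "}], pipeline)
```

Under the assumed name-keyed shape, passing the same function twice would overwrite the earlier entry, so each step in this sketch uses a distinct function.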

comparison

Functions to compare fields in different sources.

Functions:

  • comparison

    Validates any number of SQL conditions and prepares for a WHERE clause.

comparison

comparison(
    sql_condition: str, dialect: str = "postgres"
) -> str

Validates any number of SQL conditions and prepares for a WHERE clause.

Requires all column references be explicitly declared as from “l” and “r” tables.
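
To illustrate the “l” and “r” requirement, the sketch below approximates the check with regular expressions. It is not how comparison() validates SQL: a real validator would parse the condition in the given dialect. The function name `references_only_l_and_r` and the keyword list are hypothetical.

```python
import re

# Bare words that are SQL operators or literals rather than column names.
ALLOWED_BARE_WORDS = {
    "and", "or", "not", "is", "null", "like", "in", "between", "true", "false",
}


def references_only_l_and_r(sql_condition: str) -> bool:
    """Rough check that every column is written as l.<column> or r.<column>."""
    # Drop single-quoted literals so their contents are not read as columns.
    stripped = re.sub(r"'(?:[^']|'')*'", " ", sql_condition)

    # Every qualified reference must use the table alias l or r.
    qualified = re.findall(r"\b([A-Za-z_]\w*)\.([A-Za-z_]\w*)\b", stripped)
    if any(table not in {"l", "r"} for table, _ in qualified):
        return False

    # Whatever words remain must be keywords or function calls.
    remainder = re.sub(r"\b[A-Za-z_]\w*\.[A-Za-z_]\w*\b", " ", stripped)
    for match in re.finditer(r"\b([A-Za-z_]\w*)\b(\s*\()?", remainder):
        word, is_call = match.group(1).lower(), match.group(2)
        if is_call:
            continue
        if word not in ALLOWED_BARE_WORDS:
            return False
    return True


ok = references_only_l_and_r(
    "lower(l.company_name) = lower(r.company_name) and l.postcode = r.postcode"
)
unqualified = references_only_l_and_r("company_name = r.company_name")
wrong_alias = references_only_l_and_r("x.company_name = r.company_name")
```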

delete

Functions to delete resolutions from the Matchbox server.

Functions:

  • delete_resolution

    Deletes a resolution from Matchbox.

delete_resolution

delete_resolution(
    name: ResolutionName, certain: bool = False
) -> None

Deletes a resolution from Matchbox.

Will delete:

  • The resolution itself
  • All descendants of the resolution
  • All endorsements of clusters made by those resolutions, either probabilities for models, or keys for sources

Will not delete:

  • The clusters themselves

Parameters:

  • name
    (ResolutionName) –

    The name of the resolution to delete.

  • certain
    (bool, default: False ) –

    Must be True to delete the resolution. Default is False.
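
The scope of a deletion (the resolution plus all of its descendants) can be pictured as a graph traversal. The sketch below is a stand-in, not the server's logic: the `children` mapping, the resolution names `dedupe_ch` and `dedupe_hmrc`, and the choice to raise ValueError when certain is False are assumptions, since this page does not state what happens in that case.

```python
def descendants(children: dict[str, list[str]], name: str) -> set[str]:
    """Collect every resolution reachable downstream of `name`."""
    found: set[str] = set()
    stack = list(children.get(name, []))
    while stack:
        current = stack.pop()
        if current not in found:
            found.add(current)
            stack.extend(children.get(current, []))
    return found


def sketch_delete(
    children: dict[str, list[str]], name: str, certain: bool = False
) -> set[str]:
    """Return the set of resolutions a confirmed delete would remove."""
    doomed = {name} | descendants(children, name)
    if not certain:
        # Placeholder behaviour: refuse and report what would be deleted.
        raise ValueError(f"Pass certain=True to delete: {sorted(doomed)}")
    return doomed


# Hypothetical lineage: two sources, each deduplicated, then linked.
children = {
    "companies_house": ["dedupe_ch"],
    "hmrc_exporters": ["dedupe_hmrc"],
    "dedupe_ch": ["last_linker"],
    "dedupe_hmrc": ["last_linker"],
}
removed = sketch_delete(children, "dedupe_ch", certain=True)
```

In this lineage, deleting `dedupe_ch` also removes `last_linker`, while both sources and `dedupe_hmrc` remain.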

index

Functions to index data sources to the Matchbox server.

Functions:

  • index

    Indexes data in Matchbox.

index

index(
    source_config: SourceConfig,
    batch_size: int | None = None,
) -> None

Indexes data in Matchbox.

Parameters:

  • source_config
    (SourceConfig) –

    A SourceConfig with credentials set

  • batch_size
    (int | None, default: None ) –

    The size of each batch when fetching data from the warehouse, which helps reduce the load on the database. Default is None.
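
Batched fetching of the kind batch_size describes can be sketched with the standard library's sqlite3 module. This is an illustration of the general technique only: the helper `fetch_in_batches`, the `companies` table, and the treatment of None as "fetch everything at once" are assumptions, not Matchbox behaviour.

```python
import sqlite3
from collections.abc import Iterator
from typing import Optional


def fetch_in_batches(
    conn: sqlite3.Connection, sql: str, batch_size: Optional[int] = None
) -> Iterator[list[tuple]]:
    """Yield query results in lists of at most `batch_size` rows."""
    cursor = conn.execute(sql)
    if batch_size is None:
        # Assumed meaning of None: a single fetch of all rows.
        yield cursor.fetchall()
        return
    while True:
        rows = cursor.fetchmany(batch_size)
        if not rows:
            break
        yield rows


# A throwaway in-memory warehouse with ten rows.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE companies (crn TEXT, name TEXT)")
conn.executemany(
    "INSERT INTO companies VALUES (?, ?)",
    [(f"{i:08d}", f"Company {i}") for i in range(10)],
)

query_sql = "SELECT crn, name FROM companies"
batched_sizes = [len(b) for b in fetch_in_batches(conn, query_sql, batch_size=4)]
unbatched_sizes = [len(b) for b in fetch_in_batches(conn, query_sql)]
```

With ten rows and a batch size of four, the database hands back three smaller result sets instead of one large one.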

selector

Functions to select and retrieve data from the Matchbox server.

Classes:

  • Selector

    A selector to choose a source and optionally a subset of columns to select.

Functions:

  • select

    From one set of credentials, builds and verifies a list of selectors.

  • query

    Runs queries against the selected backend.

  • match

    Matches IDs against the selected backend.

Selector

Bases: BaseModel

A selector to choose a source and optionally a subset of columns to select.

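Methods:

  • ensure_credentials

    Ensure that the source has credentials set.

  • ensure_fields

    Ensure that the fields are valid.

  • from_name_and_credentials

    Create a Selector from a source name and location credentials.

Attributes:

  • source
    (SourceConfig) –

  • fields
    (list[SourceField]) –

  • qualified_key
    (str) –

    Get the qualified key name for the selected source.

  • qualified_fields
    (list[str]) –

    Get the qualified field names for the selected fields.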

model_config class-attribute instance-attribute
model_config = ConfigDict(arbitrary_types_allowed=True)
source instance-attribute
source: SourceConfig
fields instance-attribute
fields: list[SourceField]
qualified_key property
qualified_key: str

Get the qualified key name for the selected source.

qualified_fields property
qualified_fields: list[str]

Get the qualified field names for the selected fields.

ensure_credentials classmethod
ensure_credentials(source: SourceConfig) -> SourceConfig

Ensure that the source has credentials set.

ensure_fields
ensure_fields() -> Self

Ensure that the fields are valid.

from_name_and_credentials classmethod
from_name_and_credentials(
    name: SourceResolutionName,
    credentials: Any,
    fields: list[str] | None = None,
) -> Selector

Create a Selector from a source name and location credentials.

Parameters:

  • name
    (SourceResolutionName) –

    The name of the source to select from

  • credentials
    (Any) –

    The credentials to use for the source

  • fields
    (list[str] | None, default: None ) –

    A list of fields to select from the source

select

select(
    *selection: SourceResolutionName
    | dict[SourceResolutionName, list[str]],
    credentials: Any | None = None,
) -> list[Selector]

From one set of credentials, builds and verifies a list of selectors.

Can be used on any number of sources as long as they share the same credentials.

Parameters:

  • selection
    (SourceResolutionName | dict[SourceResolutionName, list[str]], default: () ) –

    The source resolutions to retrieve data from

  • credentials
    (Any | None, default: None ) –

    The credentials to use for the source. The datatype depends on the source’s location type. For example, a RelationalDBLocation requires a SQLAlchemy engine. If not provided, a SQLAlchemy engine is created from the default warehouse set in the environment variable MB__CLIENT__DEFAULT_WAREHOUSE.

Returns:

  • list[Selector]

    The built and verified selectors

Examples:

select("companies_house", credentials=engine)
select(
    {"companies_house": ["crn"], "hmrc_exporters": ["name"]}, credentials=engine
)
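
Because select() accepts both bare source names and dictionaries of fields, it helps to see how such mixed arguments could be flattened into (name, fields) pairs, which match the name and fields parameters of Selector.from_name_and_credentials. The helper `normalise_selection` is hypothetical and only mirrors the signature shown above; the real function also verifies each source.

```python
from typing import Optional, Union

Selection = Union[str, dict[str, list[str]]]


def normalise_selection(
    *selection: Selection,
) -> list[tuple[str, Optional[list[str]]]]:
    """Flatten names and {name: fields} dicts into (name, fields) pairs."""
    pairs: list[tuple[str, Optional[list[str]]]] = []
    for item in selection:
        if isinstance(item, str):
            # A bare name selects the source without naming specific fields.
            pairs.append((item, None))
        elif isinstance(item, dict):
            for name, fields in item.items():
                pairs.append((name, list(fields)))
        else:
            raise TypeError(f"Unsupported selection: {item!r}")
    return pairs


by_name = normalise_selection("companies_house")
by_fields = normalise_selection(
    {"companies_house": ["crn"], "hmrc_exporters": ["name"]}
)
```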

query

query(
    *selectors: list[Selector],
    resolution: ResolutionName | None = None,
    combine_type: Literal[
        "concat", "explode", "set_agg"
    ] = "concat",
    return_type: ReturnTypeStr = "pandas",
    threshold: int | None = None,
    batch_size: int | None = None,
) -> QueryReturnType

Runs queries against the selected backend.

Parameters:

  • selectors
    (list[Selector], default: () ) –

    Each selector is the output of select(). This allows querying sources coming from different engines

  • resolution
    (optional, default: None ) –

    The name of the resolution point to query. If not set:

    • If querying a single source, it will use the source resolution
    • If querying 2 or more sources, it will look for a default resolution
  • combine_type
    (Literal['concat', 'explode', 'set_agg'], default: 'concat' ) –

    How to combine the data from different sources.

    • If concat, concatenate all sources queried without any merging. Multiple rows per ID, with null values where data isn’t available
    • If explode, outer join on Matchbox ID. Multiple rows per ID, with one for every unique combination of data requested across all sources
    • If set_agg, join on Matchbox ID, group on Matchbox ID, then aggregate to nested lists of unique values. One row per ID, but all requested data is in nested arrays
  • return_type
    (ReturnTypeStr, default: 'pandas' ) –

    The form to return data in, one of “pandas” or “arrow”. Defaults to pandas for ease of use.

  • threshold
    (optional, default: None ) –

    The threshold to use for creating clusters. If None, uses the resolution’s default threshold. If an integer, uses that threshold for the specified resolution, and the cached thresholds for its ancestors.

  • batch_size
    (optional, default: None ) –

    The size of each batch when fetching data from the warehouse, which helps reduce memory usage and load on the database. Default is None.

Returns: Data in the requested return type (DataFrame or ArrowTable).

Examples:

query(
    select({"companies_house": ["crn", "name"]}, credentials=engine),
)
query(
    select("companies_house", credentials=engine1),
    select("datahub_companies", credentials=engine2),
    resolution="last_linker",
)
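
The three combine_type options described above differ mainly in how many rows each Matchbox ID produces. The sketch below reproduces the described shapes on toy data with plain Python. It is an illustration only: the `id` key, the column names and the sample rows are hypothetical, and the real query() works on dataframes rather than lists of dicts.

```python
from itertools import product


def data_columns(rows):
    """Data columns of one source in first-seen order, excluding the ID."""
    columns = []
    for row in rows:
        for column in row:
            if column != "id" and column not in columns:
                columns.append(column)
    return columns


def all_ids(sources):
    return sorted({row["id"] for rows in sources for row in rows})


def concat(sources):
    """Stack all rows; columns a source does not provide become None."""
    columns = [c for rows in sources for c in data_columns(rows)]
    return [
        {"id": row["id"], **{c: row.get(c) for c in columns}}
        for rows in sources
        for row in rows
    ]


def explode(sources):
    """Outer join on ID: one row per unique combination across sources."""
    out = []
    for id_ in all_ids(sources):
        per_source = []
        for rows in sources:
            columns = data_columns(rows)
            matching = [{c: r.get(c) for c in columns} for r in rows if r["id"] == id_]
            # An ID missing from a source still joins, with None values.
            per_source.append(matching or [dict.fromkeys(columns)])
        for combination in product(*per_source):
            merged = {"id": id_}
            for part in combination:
                merged.update(part)
            if merged not in out:
                out.append(merged)
    return out


def set_agg(sources):
    """Group on ID and collect each column's unique values into a list."""
    out = []
    for id_ in all_ids(sources):
        merged = {"id": id_}
        for rows in sources:
            for column in data_columns(rows):
                values = []
                for r in rows:
                    value = r.get(column)
                    if r["id"] == id_ and value is not None and value not in values:
                        values.append(value)
                merged[column] = values
        out.append(merged)
    return out


companies_house = [{"id": 1, "crn": "01"}, {"id": 1, "crn": "02"}]
hmrc_exporters = [{"id": 1, "name": "Acme"}, {"id": 2, "name": "Bolt"}]
sources = [companies_house, hmrc_exporters]

stacked = concat(sources)    # four rows, None where a source has no data
joined = explode(sources)    # three rows, one per combination
grouped = set_agg(sources)   # two rows, values nested in lists
```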

match

Matches IDs against the selected backend.

Parameters:

  • targets
    (list[SourceResolutionName], default: () ) –

    Source resolutions to find keys in

  • source
    (SourceResolutionName) –

    The source resolution the provided key belongs to

  • key
    (str) –

    The value to match from the source. Usually a primary key

  • resolution
    (optional, default: DEFAULT_RESOLUTION ) –

    The resolution to use to resolve matches against. If not set, it will look for a default resolution.

  • threshold
    (optional, default: None ) –

    The threshold to use for creating clusters. If None, uses the resolution’s default threshold. If an integer, uses that threshold for the specified resolution, and the cached thresholds for its ancestors.

Examples:

mb.match(
    "datahub_companies",
    "hmrc_exporters",
    source="companies_house",
    key="8534735",
    resolution="last_linker",
)
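
The threshold rule shared by query() and match() can be restated as a small function: an explicit threshold overrides only the named resolution, while its ancestors keep their cached values. The function `effective_thresholds`, the cached values and the ancestor names below are hypothetical and serve only to illustrate the rule as worded on this page.

```python
from typing import Optional


def effective_thresholds(
    resolution: str,
    cached: dict[str, int],
    threshold: Optional[int] = None,
) -> dict[str, int]:
    """Thresholds applied to a resolution and its ancestors."""
    # Start from the cached values, which include the resolution's default.
    effective = dict(cached)
    if threshold is not None:
        # An explicit threshold replaces the value for this resolution only.
        effective[resolution] = threshold
    return effective


# Hypothetical cached thresholds for a linker and its two ancestors.
cached = {"last_linker": 80, "dedupe_ch": 75, "dedupe_hmrc": 70}

defaults = effective_thresholds("last_linker", cached)
overridden = effective_thresholds("last_linker", cached, threshold=90)
```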