API reference

matchbox.client is the client used to interact with the Matchbox server.

All names in matchbox.client are also accessible from the top-level matchbox module.

matchbox.client

All client-side functionalities of Matchbox.

Modules:

  • authorisation

    Utilities for JWT API authorisation.

  • cli

    CLI package for Matchbox client.

  • dags

    Objects to define a DAG which indexes, deduplicates and links data.

  • eval

    Public evaluation helpers for Matchbox clients.

  • locations

    Interface to locations where source data is stored.

  • models

    Deduplication and linking methodologies.

  • queries

    Definition of model inputs.

  • results

    Objects representing the results of running a model client-side.

  • sources

    Interface to source data.

Classes:

  • DAG

    Self-sufficient pipeline of indexing, deduping and linking steps.

  • RelationalDBLocation

    A location for a relational database.

DAG

DAG(name: str)

Self-sufficient pipeline of indexing, deduping and linking steps.
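
For example, a minimal sketch of defining and running a pipeline (the pipeline name is illustrative, and the steps use only the methods documented below):

from matchbox.client import DAG

# "companies" is a hypothetical pipeline name
dag = DAG("companies")

# Start a fresh run, execute every step, and send results to the server
dag.new_run()
dag.run_and_sync()

# Print the DAG as a tree with status indicators
print(dag.draw())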

Methods:

  • set_downstream_to_rerun

    Mark step and downstream steps as not run.

  • source

    Create Source and add it to the DAG.

  • model

    Create Model and add it to the DAG.

  • add_resolution

    Convert a resolution from the server to a Source or Model and add to DAG.

  • get_source

    Get a source by name from the DAG.

  • get_model

    Get a model by name from the DAG.

  • query

    Create Query object.

  • draw

    Create a string representation of the DAG as a tree structure.

  • new_run

    Start a new run.

  • set_client

    Assign a client to all sources at once.

  • load_default

Attach to the default run in this collection, loading all DAG nodes.

  • load_pending

    Attach to the pending run in this collection, loading all DAG nodes.

  • run_and_sync

    Run entire DAG and send results to server.

  • set_default

    Set the current run as the default for the collection.

  • lookup_key

    Matches IDs against the selected backend.

  • extract_lookup

Return a mapping of matchbox IDs to source keys, with optional filtering.

Attributes:

name instance-attribute

nodes instance-attribute

nodes: dict[ResolutionName, Source | Model] = {}

graph instance-attribute

run property writable

run: RunID

Return the run ID if available, otherwise raise an error.

final_steps property

final_steps: list[Source | Model]

Returns all apex nodes in the DAG.

Returns:

  • list[Source | Model]

    All apex nodes in the DAG.

final_step property

final_step: Source | Model

Returns the root node in the DAG.

Returns:

  • Source | Model

    The root node of the DAG.

Raises:

  • ValueError

If the DAG does not have exactly one final step.

set_downstream_to_rerun

set_downstream_to_rerun(step_name: ResolutionName)

Mark step and downstream steps as not run.

source

source(*args, **kwargs) -> Source

Create Source and add it to the DAG.

model

model(*args, **kwargs) -> Model

Create Model and add it to the DAG.

add_resolution

add_resolution(name: ResolutionName, resolution: Resolution) -> None

Convert a resolution from the server to a Source or Model and add to DAG.

get_source

get_source(name: ResolutionName) -> Source

Get a source by name from the DAG.

Parameters:

  • name
    (ResolutionName) –

    Name of the source to retrieve.

Returns:

  • Source

    The Source object.

Raises:

  • ValueError

    If the name doesn’t exist in the DAG or isn’t a Source.

get_model

get_model(name: ResolutionName) -> Model

Get a model by name from the DAG.

Parameters:

  • name
    (ResolutionName) –

    Name of the model to retrieve.

Returns:

  • Model

    The Model object.

Raises:

  • ValueError

    If the name doesn’t exist in the DAG or isn’t a Model.
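
For example, retrieving nodes by name (both names are illustrative):

source = dag.get_source("companies_house")
model = dag.get_model("dedupe_companies")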

query

query(*args, **kwargs) -> Query

Create Query object.

draw

draw(start_time: datetime | None = None, doing: str | None = None, skipped: list[str] | None = None) -> str

Create a string representation of the DAG as a tree structure.

If start_time is provided, it will show the status of each node based on the last run time. The status indicators are:

  • ✅ Done
  • 🔄 Working
  • ⏸️ Awaiting
  • ⏭️ Skipped

Parameters:

  • start_time
    (datetime | None, default: None ) –

    Start time of the DAG run. Used to calculate node status.

  • doing
    (str | None, default: None ) –

    Name of the node currently being processed (if any).

  • skipped
    (list[str] | None, default: None ) –

    List of node names that were skipped.

Returns:

  • str

    String representation of the DAG with status indicators.
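
For example, a sketch of drawing the DAG mid-run (the node names are illustrative):

from datetime import datetime, timezone

start = datetime.now(timezone.utc)

# "linker" is the node currently running; "legacy_source" was skipped
print(dag.draw(start_time=start, doing="linker", skipped=["legacy_source"]))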

new_run

new_run() -> Self

Start a new run.

set_client

set_client(client: Any) -> Self

Assign a client to all sources at once.
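
For example, assuming the DAG's sources live in a relational database reached over SQLAlchemy (the connection string is illustrative):

from sqlalchemy import create_engine

engine = create_engine("postgresql://user:password@host/dbname")
dag.set_client(engine)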

load_default

load_default() -> Self

Attach to the default run in this collection, loading all DAG nodes.

load_pending

load_pending() -> Self

Attach to the pending run in this collection, loading all DAG nodes.

Pending is defined as the last non-default run.

run_and_sync

run_and_sync(full_rerun: bool = False, start: str | None = None, finish: str | None = None)

Run entire DAG and send results to server.

set_default

set_default() -> None

Set the current run as the default for the collection.

Makes it immutable, then moves the default pointer to it.
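
Taken together, a typical run lifecycle might look like this (a sketch using only the methods documented above):

dag.new_run()       # open a new mutable run
dag.run_and_sync()  # execute all steps and send results to the server
dag.set_default()   # make the run immutable and point the default at it

# Later, re-attach to the promoted run
dag.load_default()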

lookup_key

lookup_key(from_source: str, to_sources: list[str], key: str, threshold: int | None = None) -> dict[str, list[str]]

Matches IDs against the selected backend.

Parameters:

  • from_source
    (str) –

    Name of the source the provided key belongs to.

  • to_sources
    (list[str]) –

    Names of sources to find keys in.

  • key
    (str) –

    The value to match from the source. Usually a primary key.

  • threshold
    (int | None, default: None ) –

    The threshold to use for creating clusters. If None, uses the resolutions’ default threshold. If an integer, uses that threshold for the specified resolution, and the resolution’s cached thresholds for its ancestors.

Returns:

  • dict[str, list[str]]

    Dictionary mapping source names to list of keys within that source.

Examples:

dag.lookup_key(
    from_source="companies_house",
    to_sources=[
        "datahub_companies",
        "hmrc_exporters",
    ],
    key="8534735",
)

extract_lookup

extract_lookup(source_filter: list[str] | None = None, location_names: list[str] | None = None) -> Table

Return a mapping of matchbox IDs to source keys, with optional filtering.

Parameters:

  • source_filter
    (list[str] | None, default: None ) –

    An optional list of source resolution names to filter by.

  • location_names
    (list[str] | None, default: None ) –

    An optional list of location names to filter by.
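
For example, a sketch restricting the lookup to two sources (the names are illustrative):

table = dag.extract_lookup(
    source_filter=["companies_house", "datahub_companies"],
)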

RelationalDBLocation

RelationalDBLocation(name: str)

Bases: Location

A location for a relational database.
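
For example, a minimal sketch of wiring up a location (the location name and connection string are illustrative):

from sqlalchemy import create_engine
from matchbox.client import RelationalDBLocation

location = RelationalDBLocation("warehouse")
location.set_client(create_engine("postgresql://user:password@host/dbname"))
assert location.connect()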

Methods:

  • set_client

    Set client for location and return the location.

  • from_config

    Initialise location from a location config.

  • connect

    Establish connection to the data location.

  • validate_extract_transform

    Check that the SQL statement only contains a single data-extracting command.

  • infer_types

    Extract all data types from the ET logic.

  • execute

    Execute ET logic against location and return batches.

Attributes:

config instance-attribute

config = LocationConfig(type=location_type, name=name)

client instance-attribute

client: Engine

Retrieve client.

location_type class-attribute instance-attribute

location_type: LocationType = RDBMS

Output location type string.

client_type class-attribute instance-attribute

client_type: ClientType = SQLALCHEMY

Client type string.

set_client

set_client(client: Any) -> Self

Set client for location and return the location.

from_config

from_config(config: LocationConfig) -> Self

Initialise location from a location config.

connect

connect() -> bool

Establish connection to the data location.

Raises:

validate_extract_transform

validate_extract_transform(extract_transform: str) -> None

Check that the SQL statement only contains a single data-extracting command.

We are NOT attempting a full sanitisation of the SQL statement.

Validation is done purely to stop accidental mistakes, not malicious actors. Users should only run indexing using SourceConfigs they trust and have read, using least-privilege credentials.

Parameters:

  • extract_transform
    (str) –

    The SQL statement to validate

Raises:
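
For example (the statement is illustrative; a single data-extracting command passes validation):

location.validate_extract_transform("SELECT id, name FROM companies")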

infer_types

infer_types(extract_transform: str) -> dict[str, DataTypes]

Extract all data types from the ET logic.
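
For example, inferring the output schema of an ET statement without pulling any rows (the SQL is illustrative):

field_types = location.infer_types("SELECT id, name FROM companies")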

execute

execute(extract_transform: str, batch_size: int | None = None, rename: dict[str, str] | Callable | None = None, return_type: QueryReturnType = POLARS, keys: tuple[str, list[str]] | None = None, schema_overrides: dict[str, DataType] | None = None) -> Generator[QueryReturnClass, None, None]

Execute ET logic against location and return batches.

Parameters:

  • extract_transform
    (str) –

    The ET logic to execute.

  • batch_size
    (int | None, default: None ) –

    The size of the batches to return.

  • rename
    (dict[str, str] | Callable | None, default: None ) –

    Renaming to apply after the ET logic is executed.

    • If a dictionary is provided, it will be used to rename the columns.
    • If a callable is provided, it will take the old name as input and return the new name.
  • return_type
    (QueryReturnType, default: POLARS ) –

    The type of data to return. Defaults to “polars”.

  • keys
    (tuple[str, list[str]] | None, default: None ) –

Rule to only retrieve rows with specific keys. The first element of the tuple is a field name on which to filter; source entries are kept where that field’s value is in the list given as the second element.

Raises:
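
For example, a sketch streaming query results in batches (the statement, batch size and rename mapping are illustrative):

for batch in location.execute(
    "SELECT id, name FROM companies",
    batch_size=10_000,
    rename={"id": "company_id"},
):
    ...  # each batch is a Polars DataFrame by default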