API reference

matchbox.client is the client used to interact with the Matchbox server.

All names in matchbox.client are also accessible from the top-level matchbox module.

matchbox.client

All client-side functionalities of Matchbox.

Modules:

  • authorisation

    Utilities for JWT API authorisation.

  • dags

    Objects to define a DAG which indexes, deduplicates and links data.

  • eval

    Module implementing client-side evaluation features.

  • models

    Deduplication and linking methodologies.

  • queries

    Definition of model inputs.

  • results

    Objects representing the results of running a model client-side.

  • sources

    Interface to locations where source data is stored.

Classes:

  • DAG

    Self-sufficient pipeline of indexing, deduping and linking steps.

  • RelationalDBLocation

    A location for a relational database.

DAG

DAG(name: str)

Self-sufficient pipeline of indexing, deduping and linking steps.

Methods:

  • source

    Create Source and add it to the DAG.

  • model

    Create Model and add it to the DAG.

  • add_resolution

    Convert a resolution to a Source or Model and add to DAG.

  • get_source

    Get a source by name from the DAG.

  • get_model

    Get a model by name from the DAG.

  • query

    Create Query object.

  • draw

    Create a string representation of the DAG as a tree structure.

  • new_run

    Start a new run.

  • load_default

    Attach to default run in this collection, loading all DAG nodes.

  • run_and_sync

    Run entire DAG and send results to server.

  • set_default

    Set the current run as the default for the collection.

  • lookup_key

    Matches IDs against the selected backend.

  • extract_lookup

    Return matchbox IDs to source key mapping, optionally filtering.

Attributes:

name instance-attribute

nodes instance-attribute

nodes: dict[ResolutionName, Source | Model] = {}

graph instance-attribute

run property writable

run: RunID

Return the run ID if one is available; otherwise raise an error.

final_step property

final_step: Source | Model

Returns the root node in the DAG.

Returns:

Raises:

  • ValueError

    If the DAG does not have a final step.

source

source(*args, **kwargs) -> Source

Create Source and add it to the DAG.

model

model(*args, **kwargs) -> Model

Create Model and add it to the DAG.

add_resolution

add_resolution(name: ResolutionName, resolution: Resolution, location: Location) -> None

Convert a resolution to a Source or Model and add to DAG.

get_source

get_source(name: ResolutionName) -> Source

Get a source by name from the DAG.

Parameters:

Returns:

  • Source

    The Source object.

Raises:

  • ValueError

    If the name doesn’t exist in the DAG or isn’t a Source.

get_model

get_model(name: ResolutionName) -> Model

Get a model by name from the DAG.

Parameters:

Returns:

  • Model

    The Model object.

Raises:

  • ValueError

    If the name doesn’t exist in the DAG or isn’t a Model.

query

query(*args, **kwargs) -> Query

Create Query object.

draw

draw(start_time: datetime | None = None, doing: str | None = None, skipped: list[str] | None = None) -> str

Create a string representation of the DAG as a tree structure.

If start_time is provided, it will show the status of each node based on the last run time. The status indicators are:

  • ✅ Done
  • 🔄 Working
  • ⏸️ Awaiting
  • ⏭️ Skipped

Parameters:

  • start_time
    (datetime | None, default: None ) –

    Start time of the DAG run. Used to calculate node status.

  • doing
    (str | None, default: None ) –

    Name of the node currently being processed (if any).

  • skipped
    (list[str] | None, default: None ) –

    List of node names that were skipped.

Returns:

  • str

    String representation of the DAG with status indicators.
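
The status logic described above can be sketched without the library. This is a minimal illustration, not Matchbox's implementation: `draw_tree`, the `graph` mapping (node name to upstream node names) and the `last_run` mapping are hypothetical stand-ins, and the rule that a node counts as done when its last run is later than `start_time` is an assumption.

```python
from datetime import datetime


def draw_tree(graph, root, last_run, start_time=None, doing=None, skipped=None):
    """Render a DAG as a tree string, optionally with status indicators.

    graph:    dict mapping a node name to the names of its upstream nodes
    last_run: dict mapping a node name to the datetime it last ran (if any)
    """
    skipped = set(skipped or [])

    def status(name):
        # No start time means no status indicators at all.
        if start_time is None:
            return ""
        if name in skipped:
            return "⏭️ "
        if name == doing:
            return "🔄 "
        ran_at = last_run.get(name)
        # Assumption: "done" means the node ran after the DAG run started.
        if ran_at is not None and ran_at > start_time:
            return "✅ "
        return "⏸️ "

    lines = [f"{status(root)}{root}"]

    def walk(name, prefix):
        children = graph.get(name, [])
        for i, child in enumerate(children):
            last = i == len(children) - 1
            branch = "└── " if last else "├── "
            lines.append(f"{prefix}{branch}{status(child)}{child}")
            walk(child, prefix + ("    " if last else "│   "))

    walk(root, "")
    return "\n".join(lines)


# Hypothetical node names, for illustration only.
graph = {"linker": ["dedupe_a", "source_b"], "dedupe_a": ["source_a"]}
plain = draw_tree(graph, "linker", last_run={})
with_status = draw_tree(
    graph,
    "linker",
    last_run={"source_a": datetime(2024, 1, 1, 12, 5)},
    start_time=datetime(2024, 1, 1, 12, 0),
    doing="dedupe_a",
    skipped=["source_b"],
)
```

Called without `start_time`, the sketch produces a bare tree; with it, each node is prefixed by one of the four indicators.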

new_run

new_run() -> Self

Start a new run.

load_default

load_default(location: Location) -> Self

Attach to default run in this collection, loading all DAG nodes.

Parameters:

  • location
    (Location) –

    The Location object that will be attached to nodes coming from default Run. Can be updated per-source after instantiation if necessary.

run_and_sync

run_and_sync(full_rerun: bool = False, start: str | None = None, finish: str | None = None)

Run entire DAG and send results to server.

set_default

set_default() -> None

Set the current run as the default for the collection.

Makes it immutable, then moves the default pointer to it.
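
A run lifecycle built only from the methods documented on this page might look like the sketch below. The helper name `run_and_publish` is hypothetical, `dag` is assumed to be a `DAG` whose sources and models are already defined, and the call order (new run, then run and sync, then set default) is an assumption rather than something this reference states.

```python
def run_and_publish(dag, full_rerun=False):
    """Start a run, execute the DAG, then make that run the default.

    `dag` is expected to behave like matchbox.client.DAG: it must offer
    new_run(), run_and_sync(full_rerun=...), set_default() and a `run`
    property. Nothing here is specific to a particular server setup.
    """
    dag.new_run()                            # start a new run
    dag.run_and_sync(full_rerun=full_rerun)  # run every step, send results to server
    dag.set_default()                        # make the run immutable and default
    return dag.run                           # run ID of the published run
```

Because `set_default` makes the run immutable, a helper like this would normally be the last thing called on a run.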

lookup_key

lookup_key(from_source: str, to_sources: list[str], key: str, threshold: int | None = None) -> dict[str, list[str]]

Matches IDs against the selected backend.

Parameters:

  • from_source
    (str) –

    Name of source the provided key belongs to

  • to_sources
    (list[str]) –

    Names of sources to find keys in

  • key
    (str) –

    The value to match from the source. Usually a primary key

  • threshold
    (int | None, default: None ) –

    The threshold to use for creating clusters. If None, uses the resolutions’ default threshold. If an integer, uses that threshold for the specified resolution, and the resolution’s cached thresholds for its ancestors.

Returns:

  • dict[str, list[str]]

    Dictionary mapping source names to list of keys within that source.

Examples:

dag.lookup_key(
    from_source="companies_house",
    to_sources=[
        "datahub_companies",
        "hmrc_exporters",
    ],
    key="8534735",
)
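
The returned mapping has the shape `dict[str, list[str]]`, so it can be flattened into (source, key) pairs with plain Python. The helper `flatten_lookup` is hypothetical, and the `matches` value below is made up to show the shape; it is not real output.

```python
def flatten_lookup(matches):
    """Turn {source_name: [keys]} into sorted (source_name, key) pairs.

    Sources with an empty key list contribute no pairs.
    """
    return sorted(
        (source, key) for source, keys in matches.items() for key in keys
    )


# Illustrative value only: same shape as the documented return type.
matches = {
    "datahub_companies": ["a1", "a2"],
    "hmrc_exporters": [],
}
pairs = flatten_lookup(matches)
```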

extract_lookup

extract_lookup(source_filter: list[str] | None = None, location_names: list[str] | None = None) -> Table

Return matchbox IDs to source key mapping, optionally filtering.

Parameters:

  • source_filter
    (list[str] | None, default: None ) –

    An optional list of source resolution names to filter by.

  • location_names
    (list[str] | None, default: None ) –

    An optional list of location names to filter by.

RelationalDBLocation

RelationalDBLocation(name: str, client: Any)

Bases: Location

A location for a relational database.

Methods:

  • from_config

    Initialise location from a location config and an appropriate client.

  • connect

    Establish connection to the data location.

  • validate_extract_transform

    Check that the SQL statement only contains a single data-extracting command.

  • infer_types

    Extract all data types from the ET logic.

  • execute

    Execute ET logic against location and return batches.

Attributes:

config instance-attribute

config = LocationConfig(type=location_type, name=name)

client instance-attribute

client: Engine

location_type class-attribute instance-attribute

location_type: LocationType = RDBMS

Output location type string.

from_config

from_config(config: LocationConfig, client: Any) -> Self

Initialise location from a location config and an appropriate client.

connect

connect() -> bool

Establish connection to the data location.

Raises:

validate_extract_transform

validate_extract_transform(extract_transform: str) -> bool

Check that the SQL statement only contains a single data-extracting command.

We are NOT attempting a full sanitisation of the SQL statement.

Validation is done purely to stop accidental mistakes, not malicious actors. Users should only run indexing using SourceConfigs they trust and have read, using least-privilege credentials.

Parameters:

  • extract_transform
    (str) –

    The SQL statement to validate

Returns:

  • bool

    True if the SQL statement is valid
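
The kind of guard described above, catching accidental mistakes rather than attacks, can be illustrated with a deliberately naive check. This is a sketch and not the library's validation logic: `looks_like_single_select` and the `FORBIDDEN` keyword set are hypothetical, it returns False where the real method may raise, and it will misjudge statements that contain a semicolon inside a string literal or use a forbidden keyword as a column name.

```python
import re

# Keywords that indicate the statement does more than extract data.
FORBIDDEN = {
    "insert", "update", "delete", "drop", "alter",
    "create", "truncate", "merge", "grant",
}


def looks_like_single_select(sql: str) -> bool:
    """Return True if `sql` appears to be exactly one data-extracting statement."""
    # Naive split: a semicolon inside a string literal is also counted.
    statements = [part.strip() for part in sql.split(";") if part.strip()]
    if len(statements) != 1:
        return False

    # Whole identifiers only, so "created_at" does not match "create".
    words = re.findall(r"[a-z_]+", statements[0].lower())
    if not words or words[0] not in {"select", "with"}:
        return False

    return not FORBIDDEN.intersection(words)
```

A check like this is a convenience for spotting slips such as a pasted second statement; it is no substitute for least-privilege credentials.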

Raises:

infer_types

infer_types(extract_transform: str) -> dict[str, DataTypes]

Extract all data types from the ET logic.

execute

execute(extract_transform: str, batch_size: int | None = None, rename: dict[str, str] | Callable | None = None, return_type: QueryReturnType = POLARS, keys: tuple[str, list[str]] | None = None, schema_overrides: dict[str, DataType] | None = None) -> Generator[QueryReturnClass, None, None]

Execute ET logic against location and return batches.

Parameters:

  • extract_transform
    (str) –

    The ET logic to execute.

  • batch_size
    (int | None, default: None ) –

    The size of the batches to return.

  • rename
    (dict[str, str] | Callable | None, default: None ) –

    Renaming to apply after the ET logic is executed.

    • If a dictionary is provided, it will be used to rename the columns.
    • If a callable is provided, it will take the old name as input and return the new name.
  • return_type
    (QueryReturnType, default: POLARS ) –

    The type of data to return. Defaults to “polars”.

  • keys
    (tuple[str, list[str]] | None, default: None ) –

    Rule to retrieve only rows with specific keys. The first element of the tuple is the name of the field on which to filter; the second is the list of values to keep. Only source entries whose value in that field appears in the list are returned.
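
The two forms accepted by rename, a dictionary or a callable, can be illustrated on a plain list of column names. This is a sketch of the idea only: `apply_rename` is a hypothetical helper, and leaving columns that are missing from the dictionary unchanged is an assumption.

```python
def apply_rename(columns, rename=None):
    """Return new column names after applying a dict or callable rename."""
    if rename is None:
        return list(columns)
    if callable(rename):
        # Callable form: receives the old name, returns the new name.
        return [rename(name) for name in columns]
    # Dict form: assumed to leave unmapped columns as they are.
    return [rename.get(name, name) for name in columns]


renamed_by_dict = apply_rename(["id", "nm"], {"nm": "name"})
renamed_by_func = apply_rename(["id", "nm"], lambda old: f"src_{old}")
```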

Raises: