matchbox.server

Matchbox server.

Includes the API, and database adapters for various backends.

Modules:

  • api

    Matchbox API.

  • base

    Base classes and utilities for Matchbox database adapters.

  • postgresql

    PostgreSQL adapter for Matchbox server.

Classes:

MatchboxDBAdapter

Bases: ABC

An abstract base class for Matchbox database adapters.

Attributes:

settings instance-attribute

settings: MatchboxSettings

datasets instance-attribute

models instance-attribute

models: Countable

data instance-attribute

data: Countable

clusters instance-attribute

clusters: Countable

creates instance-attribute

creates: Countable

merges instance-attribute

merges: Countable

proposes instance-attribute

proposes: Countable

source_resolutions instance-attribute

source_resolutions: Countable

query abstractmethod

query(
    source_address: SourceAddress,
    resolution_name: str | None = None,
    threshold: int | None = None,
    limit: int = None,
) -> Table

Queries the database from an optional point of truth.

Parameters:

  • source_address
    (SourceAddress) –

    The SourceAddress object identifying the source to query.

  • resolution_name
    (str | None, default: None) –

    The resolution to use for filtering results. If not specified, the dataset resolution for the queried source is used.

  • threshold
    (int | None, default: None) –

    The threshold to use for creating clusters. If None, uses the models’ default threshold. If an integer, uses that threshold for the specified model, and the model’s cached thresholds for its ancestors.

  • limit
    (int, default: None) –

    The number to use in a LIMIT clause. Useful for testing.

Returns:

  • Table

    The resulting matchbox IDs in Arrow format

match abstractmethod

match(
    source_pk: str,
    source: SourceAddress,
    targets: list[SourceAddress],
    resolution_name: str,
    threshold: int | None = None,
) -> list[Match]

Matches an ID in a source dataset and returns the keys in the targets.

Parameters:

  • source_pk
    (str) –

    The primary key to match from the source.

  • source
    (SourceAddress) –

    The address of the source dataset.

  • targets
    (list[SourceAddress]) –

    The addresses of the target datasets.

  • resolution_name
    (str) –

    The name of the resolution to use for matching.

  • threshold
    (int | None, default: None) –

    The threshold to use for creating clusters. If None, uses the resolutions’ default threshold. If an integer, uses that threshold for the specified resolution in place of its cached value, and the resolution’s cached thresholds for its ancestors.

index abstractmethod

index(source: Source, data_hashes: Table) -> None

Indexes a source dataset from your warehouse in Matchbox.

Parameters:

  • source
    (Source) –

    The source dataset to index.

  • data_hashes
    (Table) –

    The Arrow table with the hash of each data row

get_source abstractmethod

get_source(address: SourceAddress) -> Source

Get a source from its address.

Parameters:

  • address
    (SourceAddress) –

    The address of the source.

Returns:

  • Source

    A Source object

validate_ids abstractmethod

validate_ids(ids: list[int]) -> bool

Validates that every ID in a list exists in the database.

Parameters:

  • ids
    (list[int]) –

    A list of IDs to validate.

Raises:

  • MatchboxDataNotFound

    If some items don’t exist in the target table.
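The validate-or-raise contract can be illustrated with a small stand-alone function. This is a sketch under stated assumptions: the MatchboxDataNotFound class below is a local stand-in for the exception named above, and known_ids is a hypothetical in-memory substitute for the database table. The same shape would apply to hashes.

```python
class MatchboxDataNotFound(Exception):
    """Local stand-in for the exception named in the Raises section."""


def validate_ids(ids: list[int], known_ids: set[int]) -> bool:
    """Return True if every ID is known, otherwise raise."""
    missing = sorted(set(ids) - known_ids)
    if missing:
        raise MatchboxDataNotFound(f"IDs not found: {missing}")
    return True


all_present = validate_ids([1, 2], known_ids={1, 2, 3})
```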

validate_hashes abstractmethod

validate_hashes(hashes: list[bytes]) -> bool

Validates that every hash in a list exists in the database.

Parameters:

  • hashes
    (list[bytes]) –

    A list of hashes to validate.

Raises:

  • MatchboxDataNotFound

    If some items don’t exist in the target table.

cluster_id_to_hash abstractmethod

cluster_id_to_hash(
    ids: list[int],
) -> dict[int, bytes | None]

Get a lookup of Cluster hashes from a list of IDs.

Parameters:

  • ids
    (list[int]) –

    A list of IDs to get hashes for.

Returns:

  • dict[int, bytes | None]

    A dictionary mapping IDs to hashes.

get_resolution_graph abstractmethod

get_resolution_graph() -> ResolutionGraph

Get the full resolution graph.

dump abstractmethod

dump() -> MatchboxSnapshot

Dumps the entire database to a snapshot.

Returns:

  • MatchboxSnapshot

    A MatchboxSnapshot object of type “postgres” with the database’s current state.

clear abstractmethod

clear(certain: bool) -> None

Clears all data from the database.

Parameters:

  • certain
    (bool) –

    Whether to clear the database without confirmation.
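A confirmation flag of this kind is commonly implemented as a guard at the top of the destructive method. The sketch below shows that pattern only; InMemoryStore and DeletionNotConfirmed are hypothetical names, and raising when certain is False is an assumption, because the description does not state what happens in that case.

```python
class DeletionNotConfirmed(Exception):
    """Hypothetical error for a destructive call made without confirmation."""


class InMemoryStore:
    """Hypothetical stand-in for a database adapter."""

    def __init__(self) -> None:
        self.rows = [1, 2, 3]

    def clear(self, certain: bool) -> None:
        # Refuse to delete anything unless the caller confirms explicitly.
        if not certain:
            raise DeletionNotConfirmed("Pass certain=True to clear all data.")
        self.rows.clear()


store = InMemoryStore()
try:
    store.clear(certain=False)
except DeletionNotConfirmed:
    rows_after_refusal = list(store.rows)
store.clear(certain=True)
```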

restore abstractmethod

restore(snapshot: MatchboxSnapshot, clear: bool) -> None

Restores the database from a snapshot.

Parameters:

  • snapshot
    (MatchboxSnapshot) –

    A MatchboxSnapshot object of type “postgres” with the database’s state

  • clear
    (bool) –

    Whether to clear the database before restoration

Raises:

  • TypeError

    If the snapshot is not compatible with PostgreSQL
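The dump and restore contract can be sketched as a round trip on an in-memory adapter. Everything here is illustrative: Snapshot and InMemoryAdapter are hypothetical stand-ins for MatchboxSnapshot and a concrete adapter, and the "memory" backend type does not exist in Matchbox. The sketch mirrors two documented behaviours: optionally clearing before restoration, and raising TypeError for an incompatible snapshot.

```python
import copy
from dataclasses import dataclass
from typing import Any


@dataclass
class Snapshot:
    backend_type: str
    data: Any


class InMemoryAdapter:
    backend_type = "memory"

    def __init__(self) -> None:
        self.tables: dict[str, list[str]] = {"models": [], "clusters": []}

    def dump(self) -> Snapshot:
        # Deep-copy so later writes do not alter the snapshot.
        return Snapshot(self.backend_type, copy.deepcopy(self.tables))

    def clear(self, certain: bool) -> None:
        if not certain:
            raise ValueError("Refusing to clear without certain=True")
        for rows in self.tables.values():
            rows.clear()

    def restore(self, snapshot: Snapshot, clear: bool) -> None:
        if snapshot.backend_type != self.backend_type:
            raise TypeError(f"Cannot restore a {snapshot.backend_type!r} snapshot")
        if clear:
            self.clear(certain=True)
        for name, rows in snapshot.data.items():
            self.tables.setdefault(name, []).extend(rows)


adapter = InMemoryAdapter()
adapter.tables["models"].append("m1")
snapshot = adapter.dump()
adapter.tables["models"].append("m2")  # written after the snapshot
adapter.restore(snapshot, clear=True)  # "m2" is discarded
```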

insert_model abstractmethod

insert_model(model: ModelMetadata) -> None

Writes a model to Matchbox.

Parameters:

  • model
    (ModelMetadata) –

    ModelMetadata object with the model’s metadata

Raises:

  • MatchboxDataNotFound

    If, for a linker, the source models weren’t found in the database

get_model abstractmethod

get_model(model: str) -> ModelMetadata

Get a model from the database.

set_model_results abstractmethod

set_model_results(model: str, results: Table) -> None

Set the results for a model.

get_model_results abstractmethod

get_model_results(model: str) -> Table

Get the results for a model.

set_model_truth abstractmethod

set_model_truth(model: str, truth: float) -> None

Sets the truth threshold for this model, changing the default clusters.

get_model_truth abstractmethod

get_model_truth(model: str) -> float

Gets the current truth threshold for this model.

get_model_ancestors abstractmethod

get_model_ancestors(model: str) -> list[ModelAncestor]

Gets the current truth values of all ancestors.

Returns a list of ModelAncestor objects mapping model names to their current truth thresholds.

Unlike get_model_ancestors_cache, which returns cached values, this method returns the current truth values of all ancestor models.

set_model_ancestors_cache abstractmethod

set_model_ancestors_cache(
    model: str, ancestors_cache: list[ModelAncestor]
) -> None

Updates the cached ancestor thresholds.

Parameters:

  • model
    (str) –

    The name of the model to update

  • ancestors_cache
    (list[ModelAncestor]) –

    List of ModelAncestor objects mapping model names to their truth thresholds

get_model_ancestors_cache abstractmethod

get_model_ancestors_cache(
    model: str,
) -> list[ModelAncestor]

Gets the cached ancestor thresholds, converting hashes to model names.

Returns a list of ModelAncestor objects mapping model names to their cached truth thresholds.

The cache is required because each point of truth needs to be stable: we choose when to update it, and store the ancestors’ values in the model itself.
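One use of having both current and cached ancestor thresholds is to detect when a cache has gone stale. The sketch below is an assumption-laden illustration: the ModelAncestor dataclass and its name and truth fields are hypothetical stand-ins, and stale_ancestors is not part of the documented API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelAncestor:
    """Hypothetical stand-in: a model name and its truth threshold."""

    name: str
    truth: float


def stale_ancestors(
    current: list[ModelAncestor], cached: list[ModelAncestor]
) -> list[str]:
    """Return names of ancestors whose current truth differs from the cache."""
    cached_by_name = {ancestor.name: ancestor.truth for ancestor in cached}
    return [
        ancestor.name
        for ancestor in current
        if cached_by_name.get(ancestor.name) != ancestor.truth
    ]


current = [ModelAncestor("dedupe_a", 0.9), ModelAncestor("dedupe_b", 0.85)]
cached = [ModelAncestor("dedupe_a", 0.9), ModelAncestor("dedupe_b", 0.8)]
changed = stale_ancestors(current, cached)
```

In this sketch an ancestor missing from the cache is also reported as stale, because its cached value is looked up as None.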

delete_model abstractmethod

delete_model(model: str, certain: bool) -> None

Delete a model from the database.

Parameters:

  • model
    (str) –

    The name of the model to delete.

  • certain
    (bool) –

    Whether to delete the model without confirmation.

MatchboxSettings

Bases: BaseSettings

Settings for the Matchbox application.

Attributes:

batch_size class-attribute instance-attribute

batch_size: int = Field(default=250000)

backend_type instance-attribute

backend_type: MatchboxBackends

datastore instance-attribute

matchbox.server.base

Base classes and utilities for Matchbox database adapters.

MatchboxBackends

Bases: StrEnum

The available backends for Matchbox.

Attributes:

POSTGRES class-attribute instance-attribute

POSTGRES = 'postgres'

MatchboxSnapshot

Bases: BaseModel

A snapshot of the Matchbox database.

Attributes:

backend_type instance-attribute

backend_type: MatchboxBackends

data instance-attribute

data: Any

check_serialisable classmethod

check_serialisable(value: Any) -> Any

Validate that the value can be serialised to JSON.
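A JSON-serialisability check of this kind can be sketched with the standard library alone. This is not the validator's actual body, which is not shown in this reference: the function below is a stand-alone illustration, and raising ValueError on failure is an assumption.

```python
import json
from typing import Any


def check_serialisable(value: Any) -> Any:
    """Return the value unchanged if json.dumps accepts it, else raise."""
    try:
        json.dumps(value)
    except (TypeError, ValueError) as exc:
        raise ValueError("Value must be serialisable to JSON") from exc
    return value


payload = check_serialisable({"clusters": [1, 2, 3], "name": "example"})
```

Types such as set or bytes are rejected by json.dumps with a TypeError, which this sketch converts to ValueError.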

MatchboxDatastoreSettings

Bases: BaseSettings

Settings specific to the datastore configuration.

Methods:

  • get_client

    Returns an S3 client for the datastore.

Attributes:

host class-attribute instance-attribute

host: str | None = None

port class-attribute instance-attribute

port: int | None = None

access_key_id class-attribute instance-attribute

access_key_id: SecretStr | None = None

secret_access_key class-attribute instance-attribute

secret_access_key: SecretStr | None = None

default_region class-attribute instance-attribute

default_region: str | None = None

cache_bucket_name instance-attribute

cache_bucket_name: str

get_client

get_client() -> S3Client

Returns an S3 client for the datastore.

Creates S3 buckets if they don’t exist.

MatchboxSettings

Bases: BaseSettings

Settings for the Matchbox application.

APISettings

Bases: BaseSettings

Settings for the Matchbox API.

Attributes:

api_key class-attribute instance-attribute

api_key: str | None = None

BackendManager

Manages the Matchbox backend instance and settings.

Methods:

initialise classmethod

initialise(settings: MatchboxSettings)

Initialise the backend with the given settings.

get_backend classmethod

get_backend() -> MatchboxDBAdapter

Get the backend instance.

get_settings classmethod

get_settings() -> MatchboxSettings

Get the backend settings.

Countable

Bases: Protocol

A protocol for objects that can be counted.

Methods:

  • count

    Counts the number of items in the object.

count

count() -> int

Counts the number of items in the object.

Listable

Bases: Protocol

A protocol for objects that can be listed.

Methods:

  • list

    Lists the items in the object.

list

list() -> list[str]

Lists the items in the object.

ListableAndCountable

Bases: Countable, Listable

A protocol for objects that can be counted and listed.

Methods:

  • list

    Lists the items in the object.

  • count

    Counts the number of items in the object.

list

list() -> list[str]

Lists the items in the object.

count

count() -> int

Counts the number of items in the object.
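Because these are protocols, a class satisfies them by defining the right methods and does not need to inherit from them. The sketch below re-declares the three protocols locally so it runs on its own; the runtime_checkable decorator and the ModelNames class are additions for illustration and are not taken from Matchbox.

```python
from typing import Protocol, runtime_checkable


@runtime_checkable
class Countable(Protocol):
    def count(self) -> int: ...


@runtime_checkable
class Listable(Protocol):
    def list(self) -> list[str]: ...


@runtime_checkable
class ListableAndCountable(Countable, Listable, Protocol):
    """Objects that can be both counted and listed."""


class ModelNames:
    """Hypothetical collection that matches both protocols structurally."""

    def __init__(self, names: list[str]) -> None:
        self._names = names

    def list(self) -> list[str]:
        return sorted(self._names)

    def count(self) -> int:
        return len(self._names)


names = ModelNames(["linker", "dedupe_a"])
is_both = isinstance(names, ListableAndCountable)
```

Note that isinstance against a runtime-checkable protocol only checks that the methods exist, not their signatures or return types.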

MatchboxDBAdapter

Bases: ABC

An abstract base class for Matchbox database adapters.

get_backend_settings

get_backend_settings(
    backend_type: MatchboxBackends,
) -> type[MatchboxSettings]

Get the appropriate settings class based on the backend type.

get_backend_class

get_backend_class(
    backend_type: MatchboxBackends,
) -> type[MatchboxDBAdapter]

Get the appropriate backend class based on the backend type.
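Selecting a class from a backend type is often done with a lookup table keyed by the enum. The sketch below shows that pattern under stated assumptions: Backend is a local stand-in for MatchboxBackends (declared with str and Enum so it runs on Python versions without StrEnum), PostgresAdapter is an empty placeholder, and the registry dictionary is hypothetical rather than the library's mechanism.

```python
from enum import Enum


class Backend(str, Enum):
    """Local stand-in for MatchboxBackends."""

    POSTGRES = "postgres"


class PostgresAdapter:
    """Empty placeholder for a concrete adapter class."""


# Hypothetical registry mapping each backend type to its adapter class.
_ADAPTERS: dict[Backend, type] = {Backend.POSTGRES: PostgresAdapter}


def get_backend_class(backend_type: "Backend | str") -> type:
    """Return the adapter class registered for the given backend type."""
    # Enum lookup by value raises ValueError for unknown backend names.
    return _ADAPTERS[Backend(backend_type)]


adapter_class = get_backend_class("postgres")
```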

initialise_backend

initialise_backend(settings: MatchboxSettings) -> None

Utility function to initialise the Matchbox backend based on settings.

initialise_matchbox

initialise_matchbox() -> None

Initialise the Matchbox backend based on environment variables.