matchbox.server

Matchbox server.

Includes the API, and database adapters for various backends.

Modules:

  • api

    Matchbox API.

  • base

    Base classes and utilities for Matchbox database adapters.

  • postgresql

    PostgreSQL adapter for Matchbox server.

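Classes:

  • MatchboxDBAdapter

    An abstract base class for Matchbox database adapters.

  • MatchboxServerSettings

    Settings for the Matchbox application.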

MatchboxDBAdapter

Bases: ABC

An abstract base class for Matchbox database adapters.

Methods:

  • query

    Queries the database from an optional point of truth.

  • match

    Matches an ID in a source dataset and returns the keys in the targets.

  • index

    Indexes a source dataset from your warehouse in Matchbox.

  • get_source

    Get a source from its address.

  • get_resolution_sources

    Get a list of sources queryable from a resolution.

  • validate_ids

    Validates that a list of IDs exists in the database.

  • validate_hashes

    Validates that a list of hashes exists in the database.

  • cluster_id_to_hash

    Get a lookup of Cluster hashes from a list of IDs.

  • get_resolution_graph

    Get the full resolution graph.

  • dump

    Dumps the entire database to a snapshot.

  • drop

    Hard clear the database by dropping all tables and re-creating them.

  • clear

    Soft clear the database by deleting all rows but retaining tables.

  • restore

    Restores the database from a snapshot.

  • verify

    Checks the database schema against expected and logs outcome.

  • insert_model

    Writes a model to Matchbox.

  • get_model

    Get a model from the database.

  • set_model_results

    Set the results for a model.

  • get_model_results

    Get the results for a model.

  • set_model_truth

    Sets the truth threshold for this model, changing the default clusters.

  • get_model_truth

    Gets the current truth threshold for this model.

  • get_model_ancestors

    Gets the current truth values of all ancestors.

  • set_model_ancestors_cache

    Updates the cached ancestor thresholds.

  • get_model_ancestors_cache

    Gets the cached ancestor thresholds, converting hashes to model names.

  • delete_model

    Delete a model from the database.

Attributes:

settings instance-attribute

datasets instance-attribute

models instance-attribute

models: Countable

data instance-attribute

data: Countable

clusters instance-attribute

clusters: Countable

creates instance-attribute

creates: Countable

merges instance-attribute

merges: Countable

proposes instance-attribute

proposes: Countable

source_resolutions instance-attribute

source_resolutions: Countable

query abstractmethod

query(
    source_address: SourceAddress,
    resolution_name: str | None = None,
    threshold: int | None = None,
    limit: int = None,
) -> Table

Queries the database from an optional point of truth.

Parameters:

  • source_address
    (SourceAddress) –

    The SourceAddress object identifying the source to query.

  • resolution_name
    (str | None, default: None) –

    The resolution to use for filtering results. If not specified, the dataset resolution for the queried source is used.

  • threshold
    (int | None, default: None) –

    The threshold to use for creating clusters. If None, the models’ default thresholds are used. If an integer, that threshold is used for the specified model, and the model’s cached thresholds are used for its ancestors.

  • limit
    (int, default: None) –

    The number to use in a LIMIT clause. Useful for testing.

Returns:

  • Table

    The resulting matchbox IDs in Arrow format
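To make the threshold rule concrete, here is a minimal sketch in plain Python. `resolve_thresholds` and its dictionary arguments are hypothetical and not part of Matchbox. The sketch also assumes that ancestors use their cached thresholds whether or not an override is supplied.

```python
from __future__ import annotations


def resolve_thresholds(
    model: str,
    default_truth: int,
    ancestors_cache: dict[str, int],
    threshold: int | None = None,
) -> dict[str, int]:
    """Return the threshold applied to a model and to each of its ancestors.

    Hypothetical helper: it only mirrors the rule described for `threshold`.
    """
    # Ancestors use the thresholds cached on the queried model (assumption).
    resolved = dict(ancestors_cache)
    # The queried model uses its default unless the caller overrides it.
    resolved[model] = default_truth if threshold is None else threshold
    return resolved


cache = {"dedupe_a": 80, "dedupe_b": 90}
print(resolve_thresholds("link_ab", 75, cache))
print(resolve_thresholds("link_ab", 75, cache, threshold=95))
```

Only the entry for the queried model changes between the two calls; the ancestors keep their cached values.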

match abstractmethod

Matches an ID in a source dataset and returns the keys in the targets.

Parameters:

  • source_pk
    (str) –

    The primary key to match from the source.

  • source
    (SourceAddress) –

    The address of the source dataset.

  • targets
    (list[SourceAddress]) –

    The addresses of the target datasets.

  • resolution_name
    (str) –

    The name of the resolution to use for matching.

  • threshold
    (optional, default: None) –

    The threshold to use for creating clusters. If None, the resolutions’ default thresholds are used. If an integer, that threshold is used for the specified resolution, and the resolution’s cached thresholds are used for its ancestors. The supplied threshold values are used instead of the cached thresholds.

index abstractmethod

index(source: Source, data_hashes: Table) -> None

Indexes a source dataset from your warehouse in Matchbox.

Parameters:

  • source
    (Source) –

    The source dataset to index.

  • data_hashes
    (Table) –

    The Arrow table with the hash of each data row

get_source abstractmethod

get_source(address: SourceAddress) -> Source

Get a source from its address.

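Parameters:

  • address
    (SourceAddress) –

    The address of the source to get.

Returns:

  • Source

    The source found at that address.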

get_resolution_sources abstractmethod

get_resolution_sources(
    resolution_name: str,
) -> list[Source]

Get a list of sources queryable from a resolution.

Parameters:

  • resolution_name
    (str) –

    Name of the resolution to query.

Returns:

  • list[Source]

    List of relevant Source objects.

validate_ids abstractmethod

validate_ids(ids: list[int]) -> bool

Validates that a list of IDs exists in the database.

Parameters:

  • ids
    (list[int]) –

    A list of IDs to validate.

Raises:

  • MatchboxDataNotFound

    If some items don’t exist in the target table.
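A concrete adapter might implement this contract roughly as follows. This is a sketch under assumptions: the exception class below is a local stand-in for Matchbox's `MatchboxDataNotFound`, the `known_ids` argument replaces a real database lookup, and returning `True` on success is inferred from the `bool` return type.

```python
from __future__ import annotations


class MatchboxDataNotFound(Exception):
    """Local stand-in for the Matchbox exception of the same name."""


def validate_ids(ids: list[int], known_ids: set[int]) -> bool:
    """Return True if every ID is known, otherwise raise."""
    # Collect the IDs that are absent from the (simulated) target table.
    missing = sorted(set(ids) - known_ids)
    if missing:
        raise MatchboxDataNotFound(f"IDs not found: {missing}")
    return True


known = {1, 2, 3}
print(validate_ids([1, 2], known))

try:
    validate_ids([1, 42], known)
except MatchboxDataNotFound as exc:
    print(f"Validation failed: {exc}")
```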

validate_hashes abstractmethod

validate_hashes(hashes: list[bytes]) -> bool

Validates that a list of hashes exists in the database.

Parameters:

  • hashes
    (list[bytes]) –

    A list of hashes to validate.

Raises:

  • MatchboxDataNotFound

    If some items don’t exist in the target table.

cluster_id_to_hash abstractmethod

cluster_id_to_hash(
    ids: list[int],
) -> dict[int, bytes | None]

Get a lookup of Cluster hashes from a list of IDs.

Parameters:

  • ids
    (list[int]) –

    A list of IDs to get hashes for.

Returns:

  • dict[int, bytes | None]

    A dictionary mapping IDs to hashes.
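The `bytes | None` value type suggests that an ID without a stored hash still appears in the result. The sketch below assumes that reading; the `stored` dictionary is a hypothetical stand-in for the clusters table.

```python
from __future__ import annotations


def cluster_id_to_hash(
    ids: list[int], stored: dict[int, bytes]
) -> dict[int, bytes | None]:
    """Map each requested ID to its hash, or to None if no hash is stored."""
    # Every requested ID gets a key, so callers can spot the gaps.
    return {cluster_id: stored.get(cluster_id) for cluster_id in ids}


stored = {1: b"\x0a\x1b", 2: b"\x2c\x3d"}
lookup = cluster_id_to_hash([1, 2, 99], stored)
print(lookup)
```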

get_resolution_graph abstractmethod

get_resolution_graph() -> ResolutionGraph

Get the full resolution graph.

dump abstractmethod

dump() -> MatchboxSnapshot

Dumps the entire database to a snapshot.

Returns:

  • MatchboxSnapshot

    A MatchboxSnapshot object of type “postgres” with the database’s current state.

drop abstractmethod

drop(certain: bool) -> None

Hard clear the database by dropping all tables and re-creating them.

Parameters:

  • certain
    (bool) –

    Whether to drop the database without confirmation.

clear abstractmethod

clear(certain: bool) -> None

Soft clear the database by deleting all rows but retaining tables.

Parameters:

  • certain
    (bool) –

    Whether to delete the database without confirmation.

restore abstractmethod

restore(snapshot: MatchboxSnapshot, clear: bool) -> None

Restores the database from a snapshot.

Parameters:

  • snapshot
    (MatchboxSnapshot) –

    A MatchboxSnapshot object of type “postgres” with the database’s state

  • clear
    (bool) –

    Whether to clear the database before restoration

Raises:

  • TypeError

    If the snapshot is not compatible with PostgreSQL
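The dump and restore cycle can be illustrated with an in-memory toy. `ToySnapshot` and `ToyAdapter` are hypothetical stand-ins, not Matchbox classes; the only behaviours taken from the reference are that `dump` returns a snapshot tagged with a backend type, that `restore` can clear first, and that an incompatible snapshot raises `TypeError`.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Any


@dataclass
class ToySnapshot:
    """Hypothetical stand-in for MatchboxSnapshot."""

    backend_type: str
    data: Any


class ToyAdapter:
    """In-memory stand-in for an adapter; tables are lists of rows."""

    backend_type = "toy"

    def __init__(self) -> None:
        self.tables: dict[str, list] = {}

    def dump(self) -> ToySnapshot:
        # Copy each table so later writes cannot alter the snapshot.
        data = {name: list(rows) for name, rows in self.tables.items()}
        return ToySnapshot(backend_type=self.backend_type, data=data)

    def restore(self, snapshot: ToySnapshot, clear: bool) -> None:
        # Reject snapshots taken from a different kind of backend.
        if snapshot.backend_type != self.backend_type:
            raise TypeError(f"Cannot restore a {snapshot.backend_type!r} snapshot")
        if clear:
            self.tables = {}
        for name, rows in snapshot.data.items():
            self.tables.setdefault(name, []).extend(rows)


adapter = ToyAdapter()
adapter.tables["models"] = ["dedupe_a"]
snapshot = adapter.dump()

adapter.tables["models"].append("link_ab")  # a change made after the dump
adapter.restore(snapshot, clear=True)
print(adapter.tables)
```

Restoring with `clear=True` discards the change made after the dump, so the adapter ends up in the state the snapshot captured.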

verify abstractmethod

verify() -> None

Checks the database schema against expected and logs outcome.

insert_model abstractmethod

insert_model(model: ModelMetadata) -> None

Writes a model to Matchbox.

Parameters:

  • model
    (ModelMetadata) –

    ModelMetadata object with the model’s metadata

Raises:

  • MatchboxDataNotFound

    If, for a linker, the source models weren’t found in the database

get_model abstractmethod

get_model(model: str) -> ModelMetadata

Get a model from the database.

set_model_results abstractmethod

set_model_results(model: str, results: Table) -> None

Set the results for a model.

get_model_results abstractmethod

get_model_results(model: str) -> Table

Get the results for a model.

set_model_truth abstractmethod

set_model_truth(model: str, truth: float) -> None

Sets the truth threshold for this model, changing the default clusters.

get_model_truth abstractmethod

get_model_truth(model: str) -> float

Gets the current truth threshold for this model.

get_model_ancestors abstractmethod

get_model_ancestors(model: str) -> list[ModelAncestor]

Gets the current truth values of all ancestors.

Returns a list of ModelAncestor objects mapping model names to their current truth thresholds.

Unlike get_model_ancestors_cache, which returns cached values, this method returns the current truth values of all ancestor models.

set_model_ancestors_cache abstractmethod

set_model_ancestors_cache(
    model: str, ancestors_cache: list[ModelAncestor]
) -> None

Updates the cached ancestor thresholds.

Parameters:

  • model
    (str) –

    The name of the model to update

  • ancestors_cache
    (list[ModelAncestor]) –

    List of ModelAncestor objects mapping model names to their truth thresholds

get_model_ancestors_cache abstractmethod

get_model_ancestors_cache(
    model: str,
) -> list[ModelAncestor]

Gets the cached ancestor thresholds, converting hashes to model names.

Returns a list of ModelAncestor objects mapping model names to their cached truth thresholds.

This is required because each point of truth needs to be stable, so we choose when to update it, caching the ancestor’s values in the model itself.
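The difference between current and cached ancestor thresholds is easiest to see in a toy store. Everything below is a hypothetical sketch: `ToyStore` is not a Matchbox class, and the `ModelAncestor` dataclass is a local stand-in whose `name` and `truth` fields are assumed.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class ModelAncestor:
    """Local stand-in: a model name paired with a truth threshold."""

    name: str
    truth: float


class ToyStore:
    """Keeps current truths, parent links, and per-model ancestor caches."""

    def __init__(self) -> None:
        self.truth: dict[str, float] = {}
        self.parents: dict[str, list[str]] = {}
        self.cache: dict[str, list[ModelAncestor]] = {}

    def get_model_ancestors(self, model: str) -> list[ModelAncestor]:
        # Walk up the graph and read each ancestor's *current* truth.
        seen: list[str] = []
        stack = list(self.parents.get(model, []))
        while stack:
            name = stack.pop()
            if name not in seen:
                seen.append(name)
                stack.extend(self.parents.get(name, []))
        return [ModelAncestor(n, self.truth[n]) for n in sorted(seen)]

    def set_model_ancestors_cache(
        self, model: str, ancestors_cache: list[ModelAncestor]
    ) -> None:
        self.cache[model] = list(ancestors_cache)

    def get_model_ancestors_cache(self, model: str) -> list[ModelAncestor]:
        return list(self.cache.get(model, []))


store = ToyStore()
store.truth = {"dedupe_a": 0.8, "link_ab": 0.7}
store.parents = {"link_ab": ["dedupe_a"]}

# Freeze the ancestors' thresholds as seen by link_ab right now.
store.set_model_ancestors_cache("link_ab", store.get_model_ancestors("link_ab"))

# The ancestor's truth changes later; the cache is deliberately left alone.
store.truth["dedupe_a"] = 0.9
print(store.get_model_ancestors("link_ab"))
print(store.get_model_ancestors_cache("link_ab"))
```

The cached list keeps the old value until `set_model_ancestors_cache` is called again, which is what keeps a point of truth stable.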

delete_model abstractmethod

delete_model(model: str, certain: bool) -> None

Delete a model from the database.

Parameters:

  • model
    (str) –

    The name of the model to delete.

  • certain
    (bool) –

    Whether to delete the model without confirmation.

MatchboxServerSettings

Bases: BaseSettings

Settings for the Matchbox application.

Attributes:

batch_size class-attribute instance-attribute

batch_size: int = Field(default=250000)

backend_type instance-attribute

backend_type: MatchboxBackends

datastore instance-attribute

api_key class-attribute instance-attribute

api_key: SecretStr | None = Field(default=None)

log_level class-attribute instance-attribute

log_level: LogLevelType = 'INFO'

matchbox.server.base

Base classes and utilities for Matchbox database adapters.

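Classes:

  • MatchboxBackends

    The available backends for Matchbox.

  • MatchboxSnapshot

    A snapshot of the Matchbox database.

  • MatchboxDatastoreSettings

    Settings specific to the datastore configuration.

  • MatchboxServerSettings

    Settings for the Matchbox application.

  • BackendManager

    Manages the Matchbox backend instance and settings.

  • Countable

    A protocol for objects that can be counted.

  • Listable

    A protocol for objects that can be listed.

  • ListableAndCountable

    A protocol for objects that can be counted and listed.

  • MatchboxDBAdapter

    An abstract base class for Matchbox database adapters.

Functions:

  • get_backend_settings

    Get the appropriate settings class based on the backend type.

  • get_backend_class

    Get the appropriate backend class based on the backend type.

  • settings_to_backend

    Create backend adapter with injected settings.

  • initialise_matchbox

    Initialise the Matchbox backend based on environment variables.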

MatchboxBackends

Bases: StrEnum

The available backends for Matchbox.

Attributes:

POSTGRES class-attribute instance-attribute

POSTGRES = 'postgres'

MatchboxSnapshot

Bases: BaseModel

A snapshot of the Matchbox database.

Methods:

  • check_serialisable

    Validate that the value can be serialised to JSON.

Attributes:

backend_type instance-attribute

backend_type: MatchboxBackends

data instance-attribute

data: Any

check_serialisable classmethod

check_serialisable(value: Any) -> Any

Validate that the value can be serialised to JSON.
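A serialisability check of this kind can be written with the standard library alone. The sketch below is a plain function, not the validator Matchbox attaches to the model, and raising `ValueError` on failure is an assumption.

```python
import json
from typing import Any


def check_serialisable(value: Any) -> Any:
    """Return the value unchanged if json.dumps accepts it, else raise."""
    try:
        # json.dumps raises TypeError for unsupported types and
        # ValueError for circular references.
        json.dumps(value)
    except (TypeError, ValueError) as exc:
        raise ValueError(f"Value is not JSON serialisable: {exc}") from exc
    return value


print(check_serialisable({"models": ["dedupe_a"], "rows": 3}))

try:
    check_serialisable({"hash": b"\x00\x01"})  # bytes are not valid JSON
except ValueError as exc:
    print(exc)
```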

MatchboxDatastoreSettings

Bases: BaseSettings

Settings specific to the datastore configuration.

Methods:

  • get_client

    Returns an S3 client for the datastore.

Attributes:

host class-attribute instance-attribute

host: str | None = None

port class-attribute instance-attribute

port: int | None = None

access_key_id class-attribute instance-attribute

access_key_id: SecretStr | None = None

secret_access_key class-attribute instance-attribute

secret_access_key: SecretStr | None = None

default_region class-attribute instance-attribute

default_region: str | None = None

cache_bucket_name instance-attribute

cache_bucket_name: str

get_client

get_client() -> S3Client

Returns an S3 client for the datastore.

Creates S3 buckets if they don’t exist.

MatchboxServerSettings

Bases: BaseSettings

Settings for the Matchbox application.

BackendManager

Manages the Matchbox backend instance and settings.

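Methods:

  • initialise

    Initialise the backend with the given settings.

  • get_backend

    Get the backend instance.

  • get_settings

    Get the backend settings.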

initialise classmethod

initialise(settings: MatchboxServerSettings)

Initialise the backend with the given settings.

get_backend classmethod

get_backend() -> MatchboxDBAdapter

Get the backend instance.

get_settings classmethod

get_settings() -> MatchboxServerSettings

Get the backend settings.
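A manager made only of class methods usually holds its state on the class itself, so every caller shares one backend. The sketch below shows that pattern with hypothetical `Toy*` classes. Creating the backend lazily and raising `RuntimeError` before initialisation are assumptions, not documented Matchbox behaviour.

```python
from __future__ import annotations


class ToySettings:
    """Placeholder for a settings object."""


class ToyBackend:
    """Placeholder for an adapter built from settings."""

    def __init__(self, settings: ToySettings) -> None:
        self.settings = settings


class ToyBackendManager:
    """Holds one shared backend and its settings at class level."""

    _settings: ToySettings | None = None
    _backend: ToyBackend | None = None

    @classmethod
    def initialise(cls, settings: ToySettings) -> None:
        cls._settings = settings
        cls._backend = None  # force a rebuild with the new settings

    @classmethod
    def get_settings(cls) -> ToySettings:
        if cls._settings is None:
            raise RuntimeError("Call initialise() first")
        return cls._settings

    @classmethod
    def get_backend(cls) -> ToyBackend:
        # Build the backend on first use, then reuse the same instance.
        if cls._backend is None:
            cls._backend = ToyBackend(cls.get_settings())
        return cls._backend


settings = ToySettings()
ToyBackendManager.initialise(settings)
first = ToyBackendManager.get_backend()
second = ToyBackendManager.get_backend()
print(first is second)
```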

Countable

Bases: Protocol

A protocol for objects that can be counted.

Methods:

  • count

    Counts the number of items in the object.

count

count() -> int

Counts the number of items in the object.

Listable

Bases: Protocol

A protocol for objects that can be listed.

Methods:

  • list_all

    Lists the items in the object.

list_all

list_all() -> list[str]

Lists the items in the object.

ListableAndCountable

Bases: Countable, Listable

A protocol for objects that can be counted and listed.

Methods:

  • list_all

    Lists the items in the object.

  • count

    Counts the number of items in the object.

list_all

list_all() -> list[str]

Lists the items in the object.

count

count() -> int

Counts the number of items in the object.
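Because these are protocols, a class satisfies them by having the right methods; it does not need to inherit from them. The sketch below re-declares the three protocols locally to show this. The `@runtime_checkable` decorator and the `ModelNames` and `RowCounter` classes are additions for the demonstration and are not taken from Matchbox.

```python
from __future__ import annotations

from typing import Protocol, runtime_checkable


@runtime_checkable
class Countable(Protocol):
    def count(self) -> int: ...


@runtime_checkable
class Listable(Protocol):
    def list_all(self) -> list[str]: ...


@runtime_checkable
class ListableAndCountable(Countable, Listable, Protocol):
    """Combines both protocols; adds no members of its own."""


class ModelNames:
    """Hypothetical collection that can be both listed and counted."""

    def __init__(self, names: list[str]) -> None:
        self._names = list(names)

    def count(self) -> int:
        return len(self._names)

    def list_all(self) -> list[str]:
        return list(self._names)


class RowCounter:
    """Hypothetical collection that can only be counted."""

    def __init__(self, rows: int) -> None:
        self._rows = rows

    def count(self) -> int:
        return self._rows


models = ModelNames(["dedupe_a", "link_ab"])
rows = RowCounter(3)

print(isinstance(models, ListableAndCountable))
print(isinstance(rows, Countable), isinstance(rows, Listable))
```

Neither class names a protocol among its bases; the `isinstance` checks succeed or fail purely on which methods are present.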

MatchboxDBAdapter

Bases: ABC

An abstract base class for Matchbox database adapters.

get_backend_settings

get_backend_settings(
    backend_type: MatchboxBackends,
) -> type[MatchboxServerSettings]

Get the appropriate settings class based on the backend type.

get_backend_class

get_backend_class(
    backend_type: MatchboxBackends,
) -> type[MatchboxDBAdapter]

Get the appropriate backend class based on the backend type.
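Lookups keyed on a backend type are commonly implemented as a small registry. The sketch below shows one way to do that. All names are hypothetical, `Backends` is a `str`-based `Enum` standing in for the documented `StrEnum`, and raising `ValueError` for an unknown backend is an assumption.

```python
from enum import Enum


class Backends(str, Enum):
    """Stand-in for MatchboxBackends."""

    POSTGRES = "postgres"


class ToyPostgresAdapter:
    """Placeholder for a concrete adapter class."""


# Registry mapping each backend type to the class that implements it.
_BACKEND_CLASSES = {Backends.POSTGRES: ToyPostgresAdapter}


def lookup_backend_class(backend_type: Backends) -> type:
    """Return the adapter class registered for a backend type."""
    try:
        return _BACKEND_CLASSES[backend_type]
    except KeyError:
        raise ValueError(f"Unsupported backend: {backend_type!r}") from None


# An enum member can be recovered from its string value, e.g. from config.
backend_type = Backends("postgres")
adapter_class = lookup_backend_class(backend_type)
print(adapter_class.__name__)
```

Adding a backend then means adding one enum member and one registry entry, without touching the lookup function.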

settings_to_backend

settings_to_backend(
    settings: MatchboxServerSettings,
) -> MatchboxDBAdapter

Create backend adapter with injected settings.

initialise_matchbox

initialise_matchbox() -> None

Initialise the Matchbox backend based on environment variables.