Overview

matchbox.server

Matchbox server.

Includes the API, and database adapters for various backends.

Modules:

  • api

    Matchbox API.

  • base

    Base classes and utilities for Matchbox database adapters.

  • postgresql

    PostgreSQL adapter for Matchbox server.

Classes:

MatchboxDBAdapter

Bases: ABC

An abstract base class for Matchbox database adapters.
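
As a rough illustration of what an abstract adapter implies for backend authors, the sketch below uses a cut-down, hypothetical interface (AdapterSketch) with only two of the methods, and a hypothetical in-memory subclass. None of these names come from Matchbox itself; the point is only that a subclass must implement every abstract method before it can be instantiated.

```python
from abc import ABC, abstractmethod


class AdapterSketch(ABC):
    """Hypothetical, cut-down adapter interface with two abstract methods."""

    @abstractmethod
    def set_model_truth(self, name: str, truth: float) -> None:
        """Store the truth threshold for a model."""

    @abstractmethod
    def get_model_truth(self, name: str) -> float:
        """Return the truth threshold for a model."""


class InMemoryAdapter(AdapterSketch):
    """Hypothetical backend that keeps thresholds in a dictionary."""

    def __init__(self) -> None:
        self._truth: dict[str, float] = {}

    def set_model_truth(self, name: str, truth: float) -> None:
        self._truth[name] = truth

    def get_model_truth(self, name: str) -> float:
        return self._truth[name]


def can_instantiate(cls: type) -> bool:
    """Return True if the class implements every abstract method."""
    try:
        cls()
    except TypeError:
        # ABCs raise TypeError when abstract methods are left unimplemented.
        return False
    return True
```

Here can_instantiate(AdapterSketch) is False, while can_instantiate(InMemoryAdapter) is True.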

Attributes:

settings instance-attribute

sources instance-attribute

models instance-attribute

models: Countable

data instance-attribute

data: Countable

clusters instance-attribute

clusters: Countable

creates instance-attribute

creates: Countable

merges instance-attribute

merges: Countable

proposes instance-attribute

proposes: Countable

source_resolutions instance-attribute

source_resolutions: Countable

query abstractmethod

query(
    source: SourceResolutionName,
    resolution: ResolutionName | None = None,
    threshold: int | None = None,
    limit: int | None = None,
) -> Table

Queries the database from an optional point of truth.

Parameters:

  • source
    (SourceResolutionName) –

    The SourceResolutionName string identifying the source to query.

  • resolution
    (ResolutionName | None, default: None) –

    The resolution to use for filtering results. If not specified, the source resolution for the queried source is used.

  • threshold
    (int | None, default: None) –

    The threshold to use for creating clusters. If None, uses the models’ default threshold. If an integer, uses that threshold for the specified model, and the model’s cached thresholds for its ancestors.

  • limit
    (int | None, default: None) –

    The number to use in a LIMIT clause. Useful for testing.

Returns:

  • Table

    The resulting Matchbox IDs in Arrow format.

match abstractmethod

Matches a key in a source resolution and returns the matching keys in the targets.

Parameters:

  • key
    (str) –

    The key to match from the source.

  • source
    (SourceResolutionName) –

    The name of the source resolution.

  • targets
    (list[SourceResolutionName]) –

    The names of the target source resolutions.

  • resolution
    (ResolutionName) –

    The name of the resolution to use for matching.

  • threshold
    (int | None, default: None) –

    The threshold to use for creating clusters. If None, uses the resolutions’ default threshold. If an integer, uses that threshold for the specified resolution, and the resolution’s cached thresholds for its ancestors. When a threshold is supplied, it is used instead of the resolution’s cached threshold.

index abstractmethod

index(
    source_config: SourceConfig, data_hashes: Table
) -> None

Indexes a source in your warehouse to Matchbox.

Parameters:

  • source_config
    (SourceConfig) –

    The source configuration to index.

  • data_hashes
    (Table) –

    The Arrow table with the hash of each data row

get_source_config abstractmethod

get_source_config(
    name: SourceResolutionName,
) -> SourceConfig

Get a source configuration from its resolution name.

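Parameters:

  • name
    (SourceResolutionName) –

    The name of the source resolution.

Returns:

  • SourceConfig

    The source configuration for that source resolution.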

get_resolution_source_configs abstractmethod

get_resolution_source_configs(
    name: ResolutionName,
) -> list[SourceConfig]

Get a list of source configurations queriable from a resolution.

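Parameters:

  • name
    (ResolutionName) –

    The name of the resolution.

Returns:

  • list[SourceConfig]

    The source configurations queriable from that resolution.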

validate_ids abstractmethod

validate_ids(ids: list[int]) -> bool

Validates that every ID in a list exists in the database.

Parameters:

  • ids
    (list[int]) –

    A list of IDs to validate.

Raises:

  • MatchboxDataNotFound

    If some items don’t exist in the target table.
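
A minimal sketch of this validate-or-raise contract, assuming an in-memory set of known IDs. DataNotFoundSketch is a hypothetical stand-in for MatchboxDataNotFound, and the known_ids argument exists only to keep the example self-contained.

```python
class DataNotFoundSketch(Exception):
    """Hypothetical stand-in for MatchboxDataNotFound."""


def validate_ids(ids: list[int], known_ids: set[int]) -> bool:
    """Return True if every ID is known, otherwise raise."""
    missing = sorted(set(ids) - known_ids)
    if missing:
        # Report every missing ID at once rather than failing on the first.
        raise DataNotFoundSketch(f"IDs not found: {missing}")
    return True
```

Under this reading the boolean return is only ever True, because a failed validation surfaces as an exception rather than as False.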

validate_hashes abstractmethod

validate_hashes(hashes: list[bytes]) -> bool

Validates that every hash in a list exists in the database.

Parameters:

  • hashes
    (list[bytes]) –

    A list of hashes to validate.

Raises:

  • MatchboxDataNotFound

    If some items don’t exist in the target table.

cluster_id_to_hash abstractmethod

cluster_id_to_hash(
    ids: list[int],
) -> dict[int, bytes | None]

Get a lookup of cluster hashes from a list of IDs.

Parameters:

  • ids
    (list[int]) –

    A list of IDs to get hashes for.

Returns:

  • dict[int, bytes | None]

    A dictionary mapping IDs to hashes.

get_resolution_graph abstractmethod

get_resolution_graph() -> ResolutionGraph

Get the full resolution graph.

dump abstractmethod

dump() -> MatchboxSnapshot

Dumps the entire database to a snapshot.

Returns:

  • MatchboxSnapshot

    A MatchboxSnapshot object of type “postgres” with the database’s current state.

drop abstractmethod

drop(certain: bool) -> None

Hard clear the database by dropping all tables and re-creating them.

Parameters:

  • certain
    (bool) –

    Whether to drop the database without confirmation.

clear abstractmethod

clear(certain: bool) -> None

Soft clear the database by deleting all rows but retaining tables.

Parameters:

  • certain
    (bool) –

    Whether to clear the database without confirmation.
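
The difference between a hard drop and a soft clear, together with the certain flag, can be sketched with an in-memory store. Everything here is hypothetical: the page does not say what happens when certain is False, so the sketch assumes the call is refused with an exception (DeletionNotConfirmedSketch is an invented name), and SCHEMA stands in for the real table definitions.

```python
SCHEMA = ("clusters", "models")  # hypothetical table names


class DeletionNotConfirmedSketch(Exception):
    """Hypothetical error raised when a destructive call is not confirmed."""


class InMemoryStore:
    """Hypothetical store holding rows per table."""

    def __init__(self) -> None:
        self.tables: dict[str, list] = {
            "clusters": [1, 2, 3],
            "models": ["m"],
        }

    def clear(self, certain: bool) -> None:
        """Soft clear: delete all rows but keep every existing table."""
        if not certain:
            raise DeletionNotConfirmedSketch("Pass certain=True to clear.")
        for rows in self.tables.values():
            rows.clear()

    def drop(self, certain: bool) -> None:
        """Hard clear: discard all tables and re-create them from the schema."""
        if not certain:
            raise DeletionNotConfirmedSketch("Pass certain=True to drop.")
        self.tables = {name: [] for name in SCHEMA}
```

In this sketch a table that exists outside the schema survives clear (empty) but not drop.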

restore abstractmethod

restore(snapshot: MatchboxSnapshot) -> None

Restores the database from a snapshot.

Parameters:

  • snapshot
    (MatchboxSnapshot) –

    A MatchboxSnapshot object of type “postgres” with the database’s state

Raises:

  • TypeError

    If the snapshot is not compatible with PostgreSQL
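
A dump and restore round trip, including the TypeError for an incompatible snapshot, might look like the following. SnapshotSketch and InMemoryBackend are hypothetical stand-ins for MatchboxSnapshot and a concrete adapter, and the "memory" backend name is invented for the example.

```python
import copy
from dataclasses import dataclass
from typing import Any


@dataclass(frozen=True)
class SnapshotSketch:
    """Hypothetical stand-in for MatchboxSnapshot."""

    backend_type: str
    data: Any


class InMemoryBackend:
    """Hypothetical backend whose whole state is a dictionary of tables."""

    backend_type = "memory"  # invented backend name

    def __init__(self) -> None:
        self.tables: dict[str, list] = {}

    def dump(self) -> SnapshotSketch:
        # Deep-copy so later writes do not change the snapshot.
        return SnapshotSketch(self.backend_type, copy.deepcopy(self.tables))

    def restore(self, snapshot: SnapshotSketch) -> None:
        if snapshot.backend_type != self.backend_type:
            raise TypeError(
                f"Cannot restore a {snapshot.backend_type!r} snapshot "
                f"into a {self.backend_type!r} backend."
            )
        self.tables = copy.deepcopy(snapshot.data)
```

Checking backend_type before touching any state means a rejected snapshot leaves the backend unchanged.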

insert_model abstractmethod

insert_model(model_config: ModelConfig) -> None

Writes a model to Matchbox.

Parameters:

  • model_config
    (ModelConfig) –

    ModelConfig object with the model’s metadata

Raises:

  • MatchboxDataNotFound

    If, for a linker, the source models weren’t found in the database

  • MatchboxModelConfigError

    If the model configuration is invalid, such as the resolutions sharing ancestors

get_model abstractmethod

get_model(name: ModelResolutionName) -> ModelConfig

Get a model from the database.

set_model_results abstractmethod

set_model_results(
    name: ModelResolutionName, results: Table
) -> None

Set the results for a model.

get_model_results abstractmethod

get_model_results(name: ModelResolutionName) -> Table

Get the results for a model.

set_model_truth abstractmethod

set_model_truth(
    name: ModelResolutionName, truth: float
) -> None

Sets the truth threshold for this model, changing the default clusters.

get_model_truth abstractmethod

get_model_truth(name: ModelResolutionName) -> float

Gets the current truth threshold for this model.

get_model_ancestors abstractmethod

get_model_ancestors(
    name: ModelResolutionName,
) -> list[ModelAncestor]

Gets the current truth values of all ancestors.

Returns a list of ModelAncestor objects mapping model resolution names to their current truth thresholds.

Unlike get_model_ancestors_cache, which returns cached values, this method returns the current truth values of all ancestor models.

set_model_ancestors_cache abstractmethod

set_model_ancestors_cache(
    name: ModelResolutionName,
    ancestors_cache: list[ModelAncestor],
) -> None

Updates the cached ancestor thresholds.

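Parameters:

  • name
    (ModelResolutionName) –

    The name of the model resolution.

  • ancestors_cache
    (list[ModelAncestor]) –

    The ancestor thresholds to cache.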

get_model_ancestors_cache abstractmethod

get_model_ancestors_cache(
    name: ModelResolutionName,
) -> list[ModelAncestor]

Gets the cached ancestor thresholds.

Returns a list of ModelAncestor objects mapping model resolution names to their cached truth thresholds.

This is required because each point of truth needs to be stable, so we choose when to update it, caching the ancestors’ values in the model itself.
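
The distinction between current and cached ancestor thresholds can be illustrated with a small in-memory store. AncestorSketch (with name and truth fields) and ModelStoreSketch are hypothetical; the real ModelAncestor type and the way parents are recorded may differ.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AncestorSketch:
    """Hypothetical stand-in for ModelAncestor."""

    name: str
    truth: float


class ModelStoreSketch:
    """Hypothetical store of model thresholds, parents, and caches."""

    def __init__(self) -> None:
        self.truth: dict[str, float] = {}
        self.parents: dict[str, list[str]] = {}
        self.cache: dict[str, list[AncestorSketch]] = {}

    def get_model_ancestors(self, name: str) -> list[AncestorSketch]:
        """Walk the parent graph and read each ancestor's current truth."""
        found: dict[str, AncestorSketch] = {}
        stack = list(self.parents.get(name, []))
        while stack:
            parent = stack.pop()
            if parent not in found:
                found[parent] = AncestorSketch(parent, self.truth[parent])
                stack.extend(self.parents.get(parent, []))
        return sorted(found.values(), key=lambda a: a.name)

    def set_model_ancestors_cache(
        self, name: str, ancestors_cache: list[AncestorSketch]
    ) -> None:
        self.cache[name] = list(ancestors_cache)

    def get_model_ancestors_cache(self, name: str) -> list[AncestorSketch]:
        return list(self.cache.get(name, []))
```

If an ancestor's truth changes after the cache is set, get_model_ancestors reflects the new value while get_model_ancestors_cache keeps the old one until the cache is updated deliberately.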

delete_resolution abstractmethod

delete_resolution(
    name: ResolutionName, certain: bool
) -> None

Delete a resolution from the database.

Parameters:

  • name
    (ResolutionName) –

    The name of the resolution to delete.

  • certain
    (bool) –

    Whether to delete the resolution without confirmation.

MatchboxServerSettings

Bases: BaseSettings

Settings for the Matchbox application.

Attributes:

batch_size class-attribute instance-attribute

batch_size: int = Field(default=250000)

backend_type instance-attribute

backend_type: MatchboxBackends

datastore instance-attribute

api_key class-attribute instance-attribute

api_key: SecretStr | None = Field(default=None)

log_level class-attribute instance-attribute

log_level: LogLevelType = 'INFO'

matchbox.server.base

Base classes and utilities for Matchbox database adapters.

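Classes:

  • MatchboxBackends

    The available backends for Matchbox.

  • MatchboxSnapshot

    A snapshot of the Matchbox database.

  • MatchboxDatastoreSettings

    Settings specific to the datastore configuration.

  • MatchboxServerSettings

    Settings for the Matchbox application.

  • BackendManager

    Manages the Matchbox backend instance and settings.

  • Countable

    A protocol for objects that can be counted.

  • Listable

    A protocol for objects that can be listed.

  • ListableAndCountable

    A protocol for objects that can be counted and listed.

  • MatchboxDBAdapter

    An abstract base class for Matchbox database adapters.

Functions:

  • get_backend_settings

    Get the appropriate settings class based on the backend type.

  • get_backend_class

    Get the appropriate backend class based on the backend type.

  • settings_to_backend

    Create backend adapter with injected settings.

  • initialise_matchbox

    Initialise the Matchbox backend based on environment variables.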

MatchboxBackends

Bases: StrEnum

The available backends for Matchbox.

Attributes:

POSTGRES class-attribute instance-attribute

POSTGRES = 'postgres'

MatchboxSnapshot

Bases: BaseModel

A snapshot of the Matchbox database.

Attributes:

backend_type instance-attribute

backend_type: MatchboxBackends

data instance-attribute

data: Any

check_serialisable classmethod

check_serialisable(value: Any) -> Any

Validate that the value can be serialised to JSON.
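
One simple way to perform such a check is to attempt the serialisation and convert any failure into a validation error. The sketch below is a plain function using only the standard library; how the real classmethod is wired into the model, and which exception it raises, are not stated on this page and are assumptions here.

```python
import json
from typing import Any


def check_serialisable(value: Any) -> Any:
    """Return the value unchanged if json.dumps accepts it, else raise."""
    try:
        json.dumps(value)
    except (TypeError, ValueError) as exc:
        # TypeError: unsupported type; ValueError: e.g. circular reference.
        raise ValueError("Value cannot be serialised to JSON") from exc
    return value
```

Nested dictionaries and lists of strings and numbers pass through unchanged, while values such as sets or bytes are rejected.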

MatchboxDatastoreSettings

Bases: BaseSettings

Settings specific to the datastore configuration.

Methods:

  • get_client

    Returns an S3 client for the datastore.

Attributes:

host class-attribute instance-attribute

host: str | None = None

port class-attribute instance-attribute

port: int | None = None

access_key_id class-attribute instance-attribute

access_key_id: SecretStr | None = None

secret_access_key class-attribute instance-attribute

secret_access_key: SecretStr | None = None

default_region class-attribute instance-attribute

default_region: str | None = None

cache_bucket_name instance-attribute

cache_bucket_name: str

get_client

get_client() -> S3Client

Returns an S3 client for the datastore.

Creates S3 buckets if they don’t exist.

MatchboxServerSettings

Bases: BaseSettings

Settings for the Matchbox application.

Attributes:

batch_size class-attribute instance-attribute

batch_size: int = Field(default=250000)

backend_type instance-attribute

backend_type: MatchboxBackends

datastore instance-attribute

api_key class-attribute instance-attribute

api_key: SecretStr | None = Field(default=None)

log_level class-attribute instance-attribute

log_level: LogLevelType = 'INFO'

BackendManager

Manages the Matchbox backend instance and settings.

Methods:

initialise classmethod

initialise(settings: MatchboxServerSettings)

Initialise the backend with the given settings.

get_backend classmethod

get_backend() -> MatchboxDBAdapter

Get the backend instance.

get_settings classmethod

get_settings() -> MatchboxServerSettings

Get the backend settings.

Countable

Bases: Protocol

A protocol for objects that can be counted.

Methods:

  • count

    Counts the number of items in the object.

count

count() -> int

Counts the number of items in the object.

Listable

Bases: Protocol

A protocol for objects that can be listed.

Methods:

  • list_all

    Lists the items in the object.

list_all

list_all() -> list[str]

Lists the items in the object.

ListableAndCountable

Bases: Countable, Listable

A protocol for objects that can be counted and listed.

Methods:

  • list_all

    Lists the items in the object.

  • count

    Counts the number of items in the object.

list_all

list_all() -> list[str]

Lists the items in the object.

count

count() -> int

Counts the number of items in the object.
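
These protocols describe structural types: any object with the right methods satisfies them without inheriting from them. The sketch below redeclares the three protocols locally so that it runs on its own. The runtime_checkable decorator and the explicit Protocol base on the combined class are additions for the demonstration, and SourceNames is a hypothetical implementation.

```python
from typing import Protocol, runtime_checkable


@runtime_checkable
class Countable(Protocol):
    def count(self) -> int: ...


@runtime_checkable
class Listable(Protocol):
    def list_all(self) -> list[str]: ...


@runtime_checkable
class ListableAndCountable(Countable, Listable, Protocol):
    """Combined protocol; Protocol is listed so it stays structural."""


class SourceNames:
    """Hypothetical collection that satisfies both protocols."""

    def __init__(self, names: list[str]) -> None:
        self._names = list(names)

    def count(self) -> int:
        return len(self._names)

    def list_all(self) -> list[str]:
        return list(self._names)


class OnlyCount:
    """Hypothetical object that can be counted but not listed."""

    def count(self) -> int:
        return 0
```

With runtime_checkable, isinstance only checks that the methods exist, not their signatures, so static type checking remains the stronger guarantee.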

MatchboxDBAdapter

Bases: ABC

An abstract base class for Matchbox database adapters.

Attributes:

settings instance-attribute

sources instance-attribute

models instance-attribute

models: Countable

data instance-attribute

data: Countable

clusters instance-attribute

clusters: Countable

creates instance-attribute

creates: Countable

merges instance-attribute

merges: Countable

proposes instance-attribute

proposes: Countable

source_resolutions instance-attribute

source_resolutions: Countable

query abstractmethod

query(
    source: SourceResolutionName,
    resolution: ResolutionName | None = None,
    threshold: int | None = None,
    limit: int | None = None,
) -> Table

Queries the database from an optional point of truth.

Parameters:

  • source
    (SourceResolutionName) –

    The SourceResolutionName string identifying the source to query.

  • resolution
    (ResolutionName | None, default: None) –

    The resolution to use for filtering results. If not specified, the source resolution for the queried source is used.

  • threshold
    (int | None, default: None) –

    The threshold to use for creating clusters. If None, uses the models’ default threshold. If an integer, uses that threshold for the specified model, and the model’s cached thresholds for its ancestors.

  • limit
    (int | None, default: None) –

    The number to use in a LIMIT clause. Useful for testing.

Returns:

  • Table

    The resulting Matchbox IDs in Arrow format.

match abstractmethod

Matches a key in a source resolution and returns the matching keys in the targets.

Parameters:

  • key
    (str) –

    The key to match from the source.

  • source
    (SourceResolutionName) –

    The name of the source resolution.

  • targets
    (list[SourceResolutionName]) –

    The names of the target source resolutions.

  • resolution
    (ResolutionName) –

    The name of the resolution to use for matching.

  • threshold
    (int | None, default: None) –

    The threshold to use for creating clusters. If None, uses the resolutions’ default threshold. If an integer, uses that threshold for the specified resolution, and the resolution’s cached thresholds for its ancestors. When a threshold is supplied, it is used instead of the resolution’s cached threshold.

index abstractmethod

index(
    source_config: SourceConfig, data_hashes: Table
) -> None

Indexes a source in your warehouse to Matchbox.

Parameters:

  • source_config
    (SourceConfig) –

    The source configuration to index.

  • data_hashes
    (Table) –

    The Arrow table with the hash of each data row

get_source_config abstractmethod

get_source_config(
    name: SourceResolutionName,
) -> SourceConfig

Get a source configuration from its resolution name.

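Parameters:

  • name
    (SourceResolutionName) –

    The name of the source resolution.

Returns:

  • SourceConfig

    The source configuration for that source resolution.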

get_resolution_source_configs abstractmethod

get_resolution_source_configs(
    name: ResolutionName,
) -> list[SourceConfig]

Get a list of source configurations queriable from a resolution.

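Parameters:

  • name
    (ResolutionName) –

    The name of the resolution.

Returns:

  • list[SourceConfig]

    The source configurations queriable from that resolution.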

validate_ids abstractmethod

validate_ids(ids: list[int]) -> bool

Validates that every ID in a list exists in the database.

Parameters:

  • ids
    (list[int]) –

    A list of IDs to validate.

Raises:

  • MatchboxDataNotFound

    If some items don’t exist in the target table.

validate_hashes abstractmethod

validate_hashes(hashes: list[bytes]) -> bool

Validates that every hash in a list exists in the database.

Parameters:

  • hashes
    (list[bytes]) –

    A list of hashes to validate.

Raises:

  • MatchboxDataNotFound

    If some items don’t exist in the target table.

cluster_id_to_hash abstractmethod

cluster_id_to_hash(
    ids: list[int],
) -> dict[int, bytes | None]

Get a lookup of cluster hashes from a list of IDs.

Parameters:

  • ids
    (list[int]) –

    A list of IDs to get hashes for.

Returns:

  • dict[int, bytes | None]

    A dictionary mapping IDs to hashes.

get_resolution_graph abstractmethod

get_resolution_graph() -> ResolutionGraph

Get the full resolution graph.

dump abstractmethod

dump() -> MatchboxSnapshot

Dumps the entire database to a snapshot.

Returns:

  • MatchboxSnapshot

    A MatchboxSnapshot object of type “postgres” with the database’s current state.

drop abstractmethod

drop(certain: bool) -> None

Hard clear the database by dropping all tables and re-creating them.

Parameters:

  • certain
    (bool) –

    Whether to drop the database without confirmation.

clear abstractmethod

clear(certain: bool) -> None

Soft clear the database by deleting all rows but retaining tables.

Parameters:

  • certain
    (bool) –

    Whether to clear the database without confirmation.

restore abstractmethod

restore(snapshot: MatchboxSnapshot) -> None

Restores the database from a snapshot.

Parameters:

  • snapshot
    (MatchboxSnapshot) –

    A MatchboxSnapshot object of type “postgres” with the database’s state

Raises:

  • TypeError

    If the snapshot is not compatible with PostgreSQL

insert_model abstractmethod

insert_model(model_config: ModelConfig) -> None

Writes a model to Matchbox.

Parameters:

  • model_config
    (ModelConfig) –

    ModelConfig object with the model’s metadata

Raises:

  • MatchboxDataNotFound

    If, for a linker, the source models weren’t found in the database

  • MatchboxModelConfigError

    If the model configuration is invalid, such as the resolutions sharing ancestors

get_model abstractmethod

get_model(name: ModelResolutionName) -> ModelConfig

Get a model from the database.

set_model_results abstractmethod

set_model_results(
    name: ModelResolutionName, results: Table
) -> None

Set the results for a model.

get_model_results abstractmethod

get_model_results(name: ModelResolutionName) -> Table

Get the results for a model.

set_model_truth abstractmethod

set_model_truth(
    name: ModelResolutionName, truth: float
) -> None

Sets the truth threshold for this model, changing the default clusters.

get_model_truth abstractmethod

get_model_truth(name: ModelResolutionName) -> float

Gets the current truth threshold for this model.

get_model_ancestors abstractmethod

get_model_ancestors(
    name: ModelResolutionName,
) -> list[ModelAncestor]

Gets the current truth values of all ancestors.

Returns a list of ModelAncestor objects mapping model resolution names to their current truth thresholds.

Unlike get_model_ancestors_cache, which returns cached values, this method returns the current truth values of all ancestor models.

set_model_ancestors_cache abstractmethod

set_model_ancestors_cache(
    name: ModelResolutionName,
    ancestors_cache: list[ModelAncestor],
) -> None

Updates the cached ancestor thresholds.

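Parameters:

  • name
    (ModelResolutionName) –

    The name of the model resolution.

  • ancestors_cache
    (list[ModelAncestor]) –

    The ancestor thresholds to cache.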

get_model_ancestors_cache abstractmethod

get_model_ancestors_cache(
    name: ModelResolutionName,
) -> list[ModelAncestor]

Gets the cached ancestor thresholds.

Returns a list of ModelAncestor objects mapping model resolution names to their cached truth thresholds.

This is required because each point of truth needs to be stable, so we choose when to update it, caching the ancestors’ values in the model itself.

delete_resolution abstractmethod

delete_resolution(
    name: ResolutionName, certain: bool
) -> None

Delete a resolution from the database.

Parameters:

  • name
    (ResolutionName) –

    The name of the resolution to delete.

  • certain
    (bool) –

    Whether to delete the resolution without confirmation.

get_backend_settings

get_backend_settings(
    backend_type: MatchboxBackends,
) -> type[MatchboxServerSettings]

Get the appropriate settings class based on the backend type.

get_backend_class

get_backend_class(
    backend_type: MatchboxBackends,
) -> type[MatchboxDBAdapter]

Get the appropriate backend class based on the backend type.

settings_to_backend

settings_to_backend(
    settings: MatchboxServerSettings,
) -> MatchboxDBAdapter

Create backend adapter with injected settings.
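
Taken together, the helper functions in this module suggest a lookup from backend type to class, followed by construction with the settings passed in. The sketch below shows one way such a lookup could work; the registry dictionary, the ValueError for unknown backends, and every name ending in Sketch are assumptions, not the library's implementation.

```python
from enum import Enum


class BackendsSketch(str, Enum):
    """Hypothetical stand-in for MatchboxBackends."""

    POSTGRES = "postgres"


class SettingsSketch:
    """Hypothetical settings carrying the backend type."""

    def __init__(self, backend_type: BackendsSketch) -> None:
        self.backend_type = backend_type


class PostgresAdapterSketch:
    """Hypothetical adapter that receives its settings at construction."""

    def __init__(self, settings: SettingsSketch) -> None:
        self.settings = settings


# Hypothetical registry from backend type to adapter class.
_BACKEND_CLASSES: dict[BackendsSketch, type] = {
    BackendsSketch.POSTGRES: PostgresAdapterSketch,
}


def get_backend_class(backend_type: BackendsSketch) -> type:
    """Look up the adapter class for a backend type."""
    try:
        return _BACKEND_CLASSES[backend_type]
    except KeyError as exc:
        raise ValueError(f"Unsupported backend: {backend_type!r}") from exc


def settings_to_backend(settings: SettingsSketch) -> PostgresAdapterSketch:
    """Instantiate the adapter class with the settings injected."""
    return get_backend_class(settings.backend_type)(settings)
```

Keeping the mapping in one place means adding a backend only requires a new enum member and a new registry entry.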

initialise_matchbox

initialise_matchbox() -> None

Initialise the Matchbox backend based on environment variables.