matchbox.server

Matchbox server.

Includes the API, and database adapters for various backends.

Modules:

  • api

    Matchbox API.

  • base

    Base classes and utilities for Matchbox database adapters.

  • postgresql

    PostgreSQL adapter for Matchbox server.

  • uploads

    Worker logic to process user uploads.

Classes:

MatchboxDBAdapter

Bases: ABC

An abstract base class for Matchbox database adapters.
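
Because the adapter is an ABC whose methods are abstract, a backend must implement every method before it can be instantiated. The sketch below shows that mechanism with a deliberately tiny stand-in; `AdapterSketch` and `InMemoryAdapter` are hypothetical names, and registering unknown users on first login is an assumption made for the example, not documented behaviour.

```python
from abc import ABC, abstractmethod


class AdapterSketch(ABC):
    """Hypothetical stand-in for an adapter base class with one method."""

    @abstractmethod
    def login(self, user_name: str) -> int:
        """Receive a user name and return a user ID."""


class InMemoryAdapter(AdapterSketch):
    """Toy backend that keeps users in a dictionary."""

    def __init__(self) -> None:
        self._users: dict[str, int] = {}

    def login(self, user_name: str) -> int:
        # Assumption: unknown users are registered on first login.
        return self._users.setdefault(user_name, len(self._users) + 1)


adapter = InMemoryAdapter()
alice_id = adapter.login("alice")

try:
    AdapterSketch()  # abstract classes cannot be instantiated
    instantiated = True
except TypeError:
    instantiated = False
```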

Methods:

Attributes:

settings instance-attribute

sources instance-attribute

models instance-attribute

models: Countable

data instance-attribute

data: Countable

clusters instance-attribute

clusters: Countable

creates instance-attribute

creates: Countable

merges instance-attribute

merges: Countable

proposes instance-attribute

proposes: Countable

source_resolutions instance-attribute

source_resolutions: Countable

query abstractmethod

query(source: SourceResolutionPath, point_of_truth: ResolutionPath | None = None, threshold: int | None = None, return_leaf_id: bool = False, limit: int | None = None) -> Table

Queries the database from an optional point of truth.

Parameters:

  • source
    (SourceResolutionPath) –

    The resolution path identifying the source to query.

  • point_of_truth
    (optional, default: None ) –

    The resolution path to use for filtering results. If not specified, the source resolution for the queried source is used.

  • threshold
    (optional, default: None ) –

    The threshold to use for creating clusters. If None, uses the models’ default thresholds. If an integer, uses that threshold for the specified model, and the cached thresholds for the model’s ancestors.

  • return_leaf_id
    (optional, default: False ) –

    Whether to return the cluster ID of leaves.

  • limit
    (optional, default: None ) –

    The number to use in a limit clause. Useful for testing.

Returns:

  • Table

    The resulting matchbox IDs in Arrow format
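
The threshold rule described above can be sketched as a small pure function. This is only an illustration of the stated rule: `resolve_thresholds`, the `lineage` list and the `cached` mapping are hypothetical, and nothing here reflects how an actual adapter stores or looks up thresholds.

```python
from typing import Optional


def resolve_thresholds(
    lineage: list[str],
    cached: dict[str, int],
    threshold: Optional[int] = None,
) -> dict[str, int]:
    """Pick a threshold for every resolution in a lineage.

    lineage lists the queried model first, followed by its ancestors.
    cached maps each resolution name to its stored threshold.
    """
    # Start from the cached threshold of every resolution.
    resolved = {name: cached[name] for name in lineage}
    if threshold is not None and lineage:
        # An explicit threshold only overrides the queried model.
        resolved[lineage[0]] = threshold
    return resolved


cached_thresholds = {"linker": 80, "dedupe_a": 90}
defaults = resolve_thresholds(["linker", "dedupe_a"], cached_thresholds)
overridden = resolve_thresholds(["linker", "dedupe_a"], cached_thresholds, 70)
```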

match abstractmethod

Matches an ID in a source resolution and returns the keys in the targets.

Parameters:

  • key
    (str) –

    The key to match from the source.

  • source
    (SourceResolutionPath) –

    The path of the source resolution.

  • targets
    (list[SourceResolutionPath]) –

    The paths of the target source resolutions.

  • point_of_truth
    (ResolutionPath) –

    The path of the resolution to use for matching.

  • threshold
    (optional, default: None ) –

    The threshold to use for creating clusters. If None, uses the resolutions’ default thresholds. If an integer, uses that threshold for the specified resolution, and the cached thresholds for the resolution’s ancestors. These threshold values will be used instead of the cached thresholds.

create_collection abstractmethod

create_collection(name: CollectionName) -> Collection

Create a new collection.

Parameters:

  • name
    (CollectionName) –

    The name of the collection to create.

Returns:

  • Collection

    A Collection object containing its metadata, versions, and resolutions.

get_collection abstractmethod

get_collection(name: CollectionName) -> Collection

Get collection metadata.

Parameters:

  • name
    (CollectionName) –

    The name of the collection to get.

Returns:

  • Collection

    A Collection object containing its metadata, versions, and resolutions.

list_collections abstractmethod

list_collections() -> list[CollectionName]

List all collection names.

Returns:

  • list[CollectionName]

    A list of all collection names.

delete_collection abstractmethod

delete_collection(name: CollectionName, certain: bool) -> None

Delete a collection and all its versions.

Parameters:

  • name
    (CollectionName) –

    The name of the collection to delete.

  • certain
    (bool) –

    Whether to delete the collection without confirmation.
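
The destructive methods on the adapter all take a `certain` flag. One way such a guard might work is sketched below; what a real backend does when `certain` is False is not stated here, so the `DeletionNotConfirmed` exception and the `CollectionStore` class are hypothetical.

```python
class DeletionNotConfirmed(Exception):
    """Hypothetical error raised when a deletion is not confirmed."""


class CollectionStore:
    """Toy in-memory store mapping collection names to run IDs."""

    def __init__(self) -> None:
        self._collections: dict[str, list[int]] = {}

    def create_collection(self, name: str) -> None:
        self._collections[name] = []

    def list_collections(self) -> list[str]:
        return sorted(self._collections)

    def delete_collection(self, name: str, certain: bool) -> None:
        # Assumption: refuse to delete unless the caller confirms.
        if not certain:
            raise DeletionNotConfirmed(f"Pass certain=True to delete {name!r}")
        del self._collections[name]


store = CollectionStore()
store.create_collection("companies")

try:
    store.delete_collection("companies", certain=False)
    refused = False
except DeletionNotConfirmed:
    refused = True

names_after_refusal = store.list_collections()
store.delete_collection("companies", certain=True)
```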

create_run abstractmethod

create_run(collection: CollectionName) -> Run

Create a new run.

Parameters:

  • collection
    (CollectionName) –

    The name of the collection to create the run in.

Returns:

  • Run

    A Run object containing its metadata and resolutions.

set_run_mutable abstractmethod

Set the mutability of a run.

Parameters:

  • collection
    (CollectionName) –

    The name of the collection containing the run.

  • run_id
    (RunID) –

    The ID of the run to update.

  • mutable
    (bool) –

    Whether the run should be mutable.

Returns:

  • Run

    The updated Run object.

set_run_default abstractmethod

Set the default status of a run.

Parameters:

  • collection
    (CollectionName) –

    The name of the collection containing the run.

  • run_id
    (RunID) –

    The ID of the run to update.

  • default
    (bool) –

    Whether the run should be the default run.

Returns:

  • Run

    The updated Run object.

get_run abstractmethod

Get run metadata and resolutions.

Parameters:

  • collection
    (CollectionName) –

    The name of the collection containing the run.

  • run_id
    (RunID) –

    The ID of the run to get.

Returns:

  • Run

    A Run object containing its metadata and resolutions.

delete_run abstractmethod

delete_run(collection: CollectionName, run_id: RunID, certain: bool) -> None

Delete a run and all its resolutions.

Parameters:

  • collection
    (CollectionName) –

    The name of the collection containing the run.

  • run_id
    (RunID) –

    The ID of the run to delete.

  • certain
    (bool) –

    Whether to delete the run without confirmation.

create_resolution abstractmethod

create_resolution(resolution: Resolution, path: ResolutionPath) -> None

Writes a resolution to Matchbox.

Parameters:

  • resolution
    (Resolution) –

    Resolution object with a source or model config

  • path
    (ResolutionPath) –

    The path at which to write the resolution.

Raises:

  • MatchboxModelConfigError

    If the configuration is invalid, such as the ModelConfig’s resolutions sharing ancestors

get_resolution abstractmethod

get_resolution(path: ResolutionPath, validate: ResolutionType | None = None) -> Resolution

Get a resolution from its path.

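Parameters:

  • path
    (ResolutionPath) –

    The path of the resolution to get.

  • validate
    (ResolutionType | None, default: None ) –

    An optional resolution type to validate the result against.

Returns:

  • Resolution

    The resolution at the given path.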

delete_resolution abstractmethod

delete_resolution(path: ResolutionPath, certain: bool) -> None

Delete a resolution from the database.

Parameters:

  • path
    (ResolutionPath) –

    The path of the resolution to delete.

  • certain
    (bool) –

    Whether to delete the resolution without confirmation.

insert_source_data abstractmethod

insert_source_data(path: SourceResolutionPath, data_hashes: Table) -> None

Inserts hash data for a source resolution.

Parameters:

  • path
    (SourceResolutionPath) –

    The path of the source resolution to index.

  • data_hashes
    (Table) –

    The Arrow table with the hash of each data row

insert_model_data abstractmethod

insert_model_data(path: ModelResolutionPath, results: Table) -> None

Inserts results data for a model resolution.

get_model_data abstractmethod

get_model_data(path: ModelResolutionPath) -> Table

Get the results for a model resolution.

set_model_truth abstractmethod

set_model_truth(path: ModelResolutionPath, truth: int) -> None

Sets the truth threshold for this model, changing the default clusters.

get_model_truth abstractmethod

get_model_truth(path: ModelResolutionPath) -> int

Gets the current truth threshold for this model.

validate_ids abstractmethod

validate_ids(ids: list[int]) -> bool

Validates that a list of IDs exists in the database.

Parameters:

  • ids
    (list[int]) –

    A list of IDs to validate.

Raises:

  • MatchboxDataNotFound

    If some items don’t exist in the target table.
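
A minimal sketch of this contract, assuming the known IDs are available as a set: return True when every ID is present, otherwise raise. `DataNotFound` is a hypothetical stand-in for MatchboxDataNotFound, and the `known_ids` argument replaces the database lookup a real adapter would perform.

```python
class DataNotFound(Exception):
    """Hypothetical stand-in for MatchboxDataNotFound."""


def validate_ids(ids: list[int], known_ids: set[int]) -> bool:
    """Return True if every ID is known, otherwise raise DataNotFound."""
    missing = sorted(set(ids) - known_ids)
    if missing:
        raise DataNotFound(f"IDs not found: {missing}")
    return True


known = {1, 2, 3}
all_present = validate_ids([1, 3], known)

try:
    validate_ids([2, 4, 5], known)
    message = ""
except DataNotFound as exc:
    message = str(exc)
```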

dump abstractmethod

dump() -> MatchboxSnapshot

Dumps the entire database to a snapshot.

Returns:

  • MatchboxSnapshot

    A MatchboxSnapshot object of type “postgres” with the database’s current state.

drop abstractmethod

drop(certain: bool) -> None

Hard clear the database by dropping all tables and re-creating.

Parameters:

  • certain
    (bool) –

    Whether to drop the database without confirmation.

clear abstractmethod

clear(certain: bool) -> None

Soft clear the database by deleting all rows but retaining tables.

Parameters:

  • certain
    (bool) –

    Whether to clear the database without confirmation.

restore abstractmethod

restore(snapshot: MatchboxSnapshot) -> None

Restores the database from a snapshot.

Parameters:

  • snapshot
    (MatchboxSnapshot) –

    A MatchboxSnapshot object of type “postgres” with the database’s state

Raises:

  • TypeError

    If the snapshot is not compatible with PostgreSQL
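
The dump and restore pair can be pictured as a round trip through a snapshot object, with restore rejecting snapshots from another backend. The sketch below uses a dictionary in place of a database; `SnapshotSketch` and `ToyPostgresAdapter` are hypothetical, and only the TypeError on an incompatible snapshot comes from the description above.

```python
import copy
from dataclasses import dataclass
from typing import Any


@dataclass
class SnapshotSketch:
    """Hypothetical stand-in for MatchboxSnapshot."""

    backend_type: str
    data: Any


class ToyPostgresAdapter:
    """Toy adapter holding its 'tables' in a dictionary."""

    def __init__(self) -> None:
        self.tables: dict[str, list[int]] = {"clusters": [1, 2, 3]}

    def dump(self) -> SnapshotSketch:
        # Deep copy so later writes do not change the snapshot.
        return SnapshotSketch("postgres", copy.deepcopy(self.tables))

    def restore(self, snapshot: SnapshotSketch) -> None:
        if snapshot.backend_type != "postgres":
            raise TypeError("Snapshot is not compatible with PostgreSQL")
        self.tables = copy.deepcopy(snapshot.data)


adapter = ToyPostgresAdapter()
snapshot = adapter.dump()
adapter.tables["clusters"].clear()  # simulate losing the data
adapter.restore(snapshot)

try:
    adapter.restore(SnapshotSketch("sqlite", {}))
    rejected = False
except TypeError:
    rejected = True
```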

login abstractmethod

login(user_name: str) -> int

Receives a user name and returns user ID.

insert_judgement abstractmethod

insert_judgement(judgement: Judgement) -> None

Adds an evaluation judgement to the database.

Parameters:

  • judgement
    (Judgement) –

    A representation of the proposed clusters.

get_judgements abstractmethod

get_judgements() -> tuple[Table, Table]

Retrieves all evaluation judgements.

Returns:

  • tuple[Table, Table]

    Two PyArrow tables with the judgements and their expansion. See matchbox.common.arrow for information on the schema.

compare_models abstractmethod

Compare metrics of models based on evaluation data.

Parameters:

Returns:

  • ModelComparison

    A model comparison object, listing metrics for each model.

sample_for_eval abstractmethod

sample_for_eval(n: int, path: ModelResolutionPath, user_id: int) -> Table

Sample clusters to validate.

Parameters:

  • n
    (int) –

    Number of clusters to sample

  • path
    (ModelResolutionPath) –

    Path of resolution from which to sample

  • user_id
    (int) –

    ID of user requesting the sample

Returns:

  • Table

    An Arrow table with the same schema as returned by query()

MatchboxServerSettings

Bases: BaseSettings

Settings for the Matchbox application.

Methods:

  • check_settings

    Check that legal combinations of settings are provided.

Attributes:

batch_size class-attribute instance-attribute

batch_size: int = Field(default=250000)

backend_type instance-attribute

backend_type: MatchboxBackends

datastore instance-attribute

task_runner instance-attribute

task_runner: Literal['api', 'celery']

redis_uri instance-attribute

redis_uri: str | None

uploads_expiry_minutes instance-attribute

uploads_expiry_minutes: int | None

authorisation class-attribute instance-attribute

authorisation: bool = False

public_key class-attribute instance-attribute

public_key: SecretStr | None = Field(default=None)

log_level class-attribute instance-attribute

log_level: LogLevelType = 'INFO'

check_settings

check_settings() -> Self

Check that legal combinations of settings are provided.
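
This page does not say which combinations are legal, so the following is only a sketch of the general pattern: a validation method that inspects several fields together and returns the instance. The rule shown (a 'celery' task runner needs a Redis URI) is an assumption chosen for illustration, and `SettingsSketch` is a plain dataclass rather than a BaseSettings subclass.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class SettingsSketch:
    """Hypothetical, reduced stand-in for the server settings."""

    task_runner: str = "api"
    redis_uri: Optional[str] = None

    def check_settings(self) -> "SettingsSketch":
        # Assumed rule: a Celery task runner needs a Redis URI.
        if self.task_runner == "celery" and self.redis_uri is None:
            raise ValueError("redis_uri is required when task_runner is 'celery'")
        return self


api_settings = SettingsSketch().check_settings()
celery_settings = SettingsSketch("celery", "redis://localhost:6379").check_settings()

try:
    SettingsSketch(task_runner="celery").check_settings()
    invalid_rejected = False
except ValueError:
    invalid_rejected = True
```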

matchbox.server.base

Base classes and utilities for Matchbox database adapters.

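Classes:

  • MatchboxBackends

    The available backends for Matchbox.

  • MatchboxSnapshot

    A snapshot of the Matchbox database.

  • MatchboxDatastoreSettings

    Settings specific to the datastore configuration.

  • MatchboxServerSettings

    Settings for the Matchbox application.

  • BackendManager

    Manages the Matchbox backend instance and settings.

  • Countable

    A protocol for objects that can be counted.

  • Listable

    A protocol for objects that can be listed.

  • ListableAndCountable

    A protocol for objects that can be counted and listed.

  • MatchboxDBAdapter

    An abstract base class for Matchbox database adapters.

Functions:

  • get_backend_settings

    Get the appropriate settings class based on the backend type.

  • get_backend_class

    Get the appropriate backend class based on the backend type.

  • settings_to_backend

    Create backend adapter with injected settings.

  • initialise_matchbox

    Initialise the Matchbox backend based on environment variables.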

MatchboxBackends

Bases: StrEnum

The available backends for Matchbox.

Attributes:

POSTGRES class-attribute instance-attribute

POSTGRES = 'postgres'

MatchboxSnapshot

Bases: BaseModel

A snapshot of the Matchbox database.

Methods:

  • check_serialisable

    Validate that the value can be serialised to JSON.

Attributes:

backend_type instance-attribute

backend_type: MatchboxBackends

data instance-attribute

data: Any

check_serialisable classmethod

check_serialisable(value: Any) -> Any

Validate that the value can be serialised to JSON.
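
One common way to perform such a check is to attempt a `json.dumps` call and convert any failure into a validation error. The function below is a standalone sketch of that idea, not the class method itself; raising ValueError is an assumption about how the failure is reported.

```python
import json
from typing import Any


def check_serialisable(value: Any) -> Any:
    """Return value unchanged if json.dumps accepts it, else raise."""
    try:
        json.dumps(value)
    except (TypeError, ValueError) as exc:
        # TypeError: unsupported type; ValueError: e.g. circular reference.
        raise ValueError("Value is not JSON serialisable") from exc
    return value


payload = check_serialisable({"clusters": [1, 2, 3]})

try:
    check_serialisable({"ids": {1, 2, 3}})  # sets are not valid JSON
    set_rejected = False
except ValueError:
    set_rejected = True
```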

MatchboxDatastoreSettings

Bases: BaseSettings

Settings specific to the datastore configuration.

Methods:

  • get_client

    Returns an S3 client for the datastore.

Attributes:

host class-attribute instance-attribute

host: str | None = None

port class-attribute instance-attribute

port: int | None = None

access_key_id class-attribute instance-attribute

access_key_id: SecretStr | None = None

secret_access_key class-attribute instance-attribute

secret_access_key: SecretStr | None = None

default_region class-attribute instance-attribute

default_region: str | None = None

cache_bucket_name instance-attribute

cache_bucket_name: str

get_client

get_client() -> S3Client

Returns an S3 client for the datastore.

Creates S3 buckets if they don’t exist.

BackendManager

Manages the Matchbox backend instance and settings.

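Methods:

  • initialise

    Initialise the backend with the given settings.

  • get_backend

    Get the backend instance.

  • get_settings

    Get the backend settings.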

initialise classmethod

initialise(settings: MatchboxServerSettings)

Initialise the backend with the given settings.

get_backend classmethod

get_backend() -> MatchboxDBAdapter

Get the backend instance.

get_settings classmethod

get_settings() -> MatchboxServerSettings

Get the backend settings.
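
A class whose methods are all classmethods can act as a process-wide holder for one backend instance. The sketch below shows that pattern under stated assumptions: the backend is created lazily on first access and then cached, and calling `get_backend` before `initialise` raises. `ManagerSketch` and `ToyBackend` are hypothetical and do not describe the actual implementation.

```python
from typing import Any, Optional


class ToyBackend:
    """Hypothetical backend that just remembers its settings."""

    def __init__(self, settings: dict[str, Any]) -> None:
        self.settings = settings


class ManagerSketch:
    """Holds one shared backend instance at class level."""

    _settings: Optional[dict[str, Any]] = None
    _backend: Optional[ToyBackend] = None

    @classmethod
    def initialise(cls, settings: dict[str, Any]) -> None:
        cls._settings = settings
        cls._backend = None  # drop any backend built from old settings

    @classmethod
    def get_settings(cls) -> dict[str, Any]:
        if cls._settings is None:
            raise ValueError("Manager has not been initialised")
        return cls._settings

    @classmethod
    def get_backend(cls) -> ToyBackend:
        # Assumption: build the backend on first use, then reuse it.
        if cls._backend is None:
            cls._backend = ToyBackend(cls.get_settings())
        return cls._backend


ManagerSketch.initialise({"backend_type": "postgres"})
first = ManagerSketch.get_backend()
second = ManagerSketch.get_backend()
```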

Countable

Bases: Protocol

A protocol for objects that can be counted.

Methods:

  • count

    Counts the number of items in the object.

count

count() -> int

Counts the number of items in the object.

Listable

Bases: Protocol

A protocol for objects that can be listed.

Methods:

  • list_all

    Lists the items in the object.

list_all

list_all() -> list[str]

Lists the items in the object.

ListableAndCountable

Bases: Countable, Listable

A protocol for objects that can be counted and listed.

Methods:

  • list_all

    Lists the items in the object.

  • count

    Counts the number of items in the object.

list_all

list_all() -> list[str]

Lists the items in the object.

count

count() -> int

Counts the number of items in the object.
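
Because these are protocols, any object with matching methods satisfies them without inheriting from them. The sketch below redefines the three protocols locally so that it runs on its own; `SourceNames` and `summarise` are hypothetical, and `runtime_checkable` is added here only so that `isinstance` can be used in the example.

```python
from typing import Protocol, runtime_checkable


@runtime_checkable
class Countable(Protocol):
    def count(self) -> int: ...


@runtime_checkable
class Listable(Protocol):
    def list_all(self) -> list[str]: ...


@runtime_checkable
class ListableAndCountable(Countable, Listable, Protocol):
    """Combines both protocols; Protocol must be listed again."""


class SourceNames:
    """Satisfies both protocols without inheriting from either."""

    def __init__(self, names: list[str]) -> None:
        self._names = list(names)

    def count(self) -> int:
        return len(self._names)

    def list_all(self) -> list[str]:
        return sorted(self._names)


def summarise(items: ListableAndCountable) -> str:
    return f"{items.count()} item(s): {', '.join(items.list_all())}"


sources = SourceNames(["b", "a"])
summary = summarise(sources)
```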

get_backend_settings

get_backend_settings(backend_type: MatchboxBackends) -> type[MatchboxServerSettings]

Get the appropriate settings class based on the backend type.

get_backend_class

get_backend_class(backend_type: MatchboxBackends) -> type[MatchboxDBAdapter]

Get the appropriate backend class based on the backend type.

settings_to_backend

settings_to_backend(settings: MatchboxServerSettings) -> MatchboxDBAdapter

Create backend adapter with injected settings.
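
Taken together, the three functions above suggest a lookup from backend type to settings class and adapter class, followed by construction with the settings injected. The sketch below illustrates that flow with dictionaries as registries; every class and registry name in it is hypothetical, and the enum uses a `str` mixin so that it also runs on Python versions without StrEnum.

```python
from dataclasses import dataclass
from enum import Enum


class Backends(str, Enum):
    """Hypothetical stand-in for MatchboxBackends."""

    POSTGRES = "postgres"


@dataclass
class PostgresSettingsSketch:
    backend_type: Backends = Backends.POSTGRES
    batch_size: int = 250_000


class PostgresAdapterSketch:
    def __init__(self, settings: PostgresSettingsSketch) -> None:
        self.settings = settings


# Assumed registries mapping each backend type to its classes.
_SETTINGS = {Backends.POSTGRES: PostgresSettingsSketch}
_ADAPTERS = {Backends.POSTGRES: PostgresAdapterSketch}


def get_backend_settings(backend_type: Backends) -> type:
    if backend_type not in _SETTINGS:
        raise ValueError(f"Unsupported backend type: {backend_type}")
    return _SETTINGS[backend_type]


def get_backend_class(backend_type: Backends) -> type:
    if backend_type not in _ADAPTERS:
        raise ValueError(f"Unsupported backend type: {backend_type}")
    return _ADAPTERS[backend_type]


def settings_to_backend(settings: PostgresSettingsSketch) -> PostgresAdapterSketch:
    # Look up the adapter class, then inject the settings into it.
    return get_backend_class(settings.backend_type)(settings)


settings_class = get_backend_settings(Backends("postgres"))
backend = settings_to_backend(settings_class())
```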

initialise_matchbox

initialise_matchbox() -> None

Initialise the Matchbox backend based on environment variables.