Overview
matchbox.server
Matchbox server.
Includes the API, and database adapters for various backends.
Modules:
- api – Matchbox API.
- base – Base classes and utilities for Matchbox database adapters.
- postgresql – PostgreSQL adapter for Matchbox server.
Classes:
- MatchboxDBAdapter – An abstract base class for Matchbox database adapters.
- MatchboxSettings – Settings for the Matchbox application.
MatchboxDBAdapter
Bases: ABC
An abstract base class for Matchbox database adapters.
Methods:
- query – Queries the database from an optional point of truth.
- match – Matches an ID in a source dataset and returns the keys in the targets.
- index – Indexes a source dataset from your warehouse in Matchbox.
- get_source – Get a source from its address.
- validate_ids – Validates that a list of IDs exist in the database.
- validate_hashes – Validates that a list of hashes exist in the database.
- cluster_id_to_hash – Get a lookup of Cluster hashes from a list of IDs.
- get_resolution_graph – Get the full resolution graph.
- dump – Dumps the entire database to a snapshot.
- clear – Clears all data from the database.
- restore – Restores the database from a snapshot.
- insert_model – Writes a model to Matchbox.
- get_model – Get a model from the database.
- set_model_results – Set the results for a model.
- get_model_results – Get the results for a model.
- set_model_truth – Sets the truth threshold for this model, changing the default clusters.
- get_model_truth – Gets the current truth threshold for this model.
- get_model_ancestors – Gets the current truth values of all ancestors.
- set_model_ancestors_cache – Updates the cached ancestor thresholds.
- get_model_ancestors_cache – Gets the cached ancestor thresholds, converting hashes to model names.
- delete_model – Delete a model from the database.
Attributes:
- settings (MatchboxSettings)
- datasets (ListableAndCountable)
- models (Countable)
- data (Countable)
- clusters (Countable)
- creates (Countable)
- merges (Countable)
- proposes (Countable)
- source_resolutions (Countable)
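Because the adapter is an abstract base class, a backend only becomes usable once it implements every abstract method. The sketch below illustrates that mechanism with a deliberately tiny stand-in: `AdapterSketch`, `InMemoryAdapter` and `can_instantiate` are hypothetical names, and the `get_model_truth(model: str) -> int` signature is an assumption, since this page does not show it.

```python
from abc import ABC, abstractmethod


class AdapterSketch(ABC):
    """Hypothetical stand-in for an adapter base class (two methods only)."""

    @abstractmethod
    def clear(self) -> None:
        """Remove all data held by the backend."""

    @abstractmethod
    def get_model_truth(self, model: str) -> int:
        """Return the truth threshold stored for a model (assumed signature)."""


class InMemoryAdapter(AdapterSketch):
    """Toy backend that keeps model thresholds in a dict."""

    def __init__(self) -> None:
        self.truths: dict[str, int] = {}

    def clear(self) -> None:
        self.truths.clear()

    def get_model_truth(self, model: str) -> int:
        return self.truths[model]


def can_instantiate(cls: type) -> bool:
    """Return True if the class can be constructed with no arguments."""
    try:
        cls()
    except TypeError:
        # ABCs raise TypeError while any abstract method is unimplemented.
        return False
    return True
```

Python refuses to construct `AdapterSketch` directly, while `InMemoryAdapter` can be created because it overrides both abstract methods.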
query (abstractmethod)
query(
    source_address: SourceAddress,
    resolution_name: str | None = None,
    threshold: int | None = None,
    limit: int = None,
) -> Table
Queries the database from an optional point of truth.
Parameters:
- source_address (SourceAddress) – The SourceAddress object identifying the source to query.
- resolution_name (optional, default: None) – The resolution to use for filtering results. If not specified, the dataset resolution for the queried source is used.
- threshold (optional, default: None) – The threshold to use for creating clusters. If None, uses the models’ default threshold. If an integer, uses that threshold for the specified model, and the model’s cached thresholds for its ancestors.
- limit (optional, default: None) – The number to use in a limit clause. Useful for testing.
Returns:
- Table – The resulting matchbox IDs in Arrow format.
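The threshold rule can be read as a small lookup: the queried model uses its own default unless an integer is supplied, and its ancestors use the thresholds cached on that model. The following is a minimal sketch of that reading, not the adapter's implementation. `resolve_thresholds`, the dictionaries and the model names are hypothetical, and treating the `None` case as "default for the model plus cached values for ancestors" is an assumption.

```python
from __future__ import annotations


def resolve_thresholds(
    model: str,
    default_truths: dict[str, int],
    ancestors_cache: dict[str, dict[str, int]],
    threshold: int | None = None,
) -> dict[str, int]:
    """Return the threshold applied to the model and to each of its ancestors."""
    # Ancestors use the thresholds cached on the queried model (assumption).
    resolved = dict(ancestors_cache.get(model, {}))
    # The queried model uses its default unless a threshold is supplied.
    resolved[model] = default_truths[model] if threshold is None else threshold
    return resolved


# Hypothetical lineage: one linker built on two dedupers.
default_truths = {"linker": 80}
ancestors_cache = {"linker": {"dedupe_a": 90, "dedupe_b": 85}}

by_default = resolve_thresholds("linker", default_truths, ancestors_cache)
overridden = resolve_thresholds(
    "linker", default_truths, ancestors_cache, threshold=70
)
```

In this sketch only the entry for the queried model changes between the two calls; the ancestor entries come from the cache in both.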
match (abstractmethod)
match(
    source_pk: str,
    source: SourceAddress,
    targets: list[SourceAddress],
    resolution_name: str,
    threshold: int | None = None,
) -> list[Match]
Matches an ID in a source dataset and returns the keys in the targets.
Parameters:
- source_pk (str) – The primary key to match from the source.
- source (SourceAddress) – The address of the source dataset.
- targets (list[SourceAddress]) – The addresses of the target datasets.
- resolution_name (str) – The name of the resolution to use for matching.
- threshold (optional, default: None) – The threshold to use for creating clusters. If None, uses the resolutions’ default threshold. If an integer, uses that threshold for the specified resolution, and the resolution’s cached thresholds for its ancestors. Will use these threshold values instead of the cached thresholds.
index (abstractmethod)
index(source: Source, data_hashes: Table) -> None
get_source (abstractmethod)
get_source(address: SourceAddress) -> Source
Get a source from its address.
Parameters:
- address (SourceAddress) – The address of the source
Returns:
- Source – A Source object
validate_ids (abstractmethod)
validate_hashes (abstractmethod)
cluster_id_to_hash (abstractmethod)
get_resolution_graph (abstractmethod)
get_resolution_graph() -> ResolutionGraph
Get the full resolution graph.
dump (abstractmethod)
dump() -> MatchboxSnapshot
Dumps the entire database to a snapshot.
Returns:
- MatchboxSnapshot – A MatchboxSnapshot object of type “postgres” with the database’s current state.
clear (abstractmethod)
restore (abstractmethod)
restore(snapshot: MatchboxSnapshot, clear: bool) -> None
Restores the database from a snapshot.
Parameters:
- snapshot (MatchboxSnapshot) – A MatchboxSnapshot object of type “postgres” with the database’s state
- clear (bool) – Whether to clear the database before restoration
Raises:
- TypeError – If the snapshot is not compatible with PostgreSQL
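The dump, clear and restore methods together describe a snapshot round trip. The sketch below shows that contract on a toy in-memory backend, including the TypeError raised for an incompatible snapshot. `SnapshotSketch` and `InMemoryBackend` are hypothetical stand-ins, and comparing a `backend_type` string is an assumed way of checking compatibility.

```python
from dataclasses import dataclass
from typing import Any


@dataclass
class SnapshotSketch:
    """Hypothetical stand-in for a database snapshot."""

    backend_type: str
    data: Any


class InMemoryBackend:
    """Toy backend storing rows in a dict; not a real Matchbox adapter."""

    backend_type = "memory"

    def __init__(self) -> None:
        self.rows: dict[str, list[int]] = {}

    def dump(self) -> SnapshotSketch:
        # Copy the lists so later writes do not mutate the snapshot.
        copied = {name: list(values) for name, values in self.rows.items()}
        return SnapshotSketch(self.backend_type, copied)

    def clear(self) -> None:
        self.rows.clear()

    def restore(self, snapshot: SnapshotSketch, clear: bool) -> None:
        if snapshot.backend_type != self.backend_type:
            raise TypeError(
                f"Cannot restore a {snapshot.backend_type!r} snapshot "
                f"into a {self.backend_type!r} backend"
            )
        if clear:
            self.clear()
        for name, values in snapshot.data.items():
            self.rows[name] = list(values)


backend = InMemoryBackend()
backend.rows["clusters"] = [1, 2, 3]
snapshot = backend.dump()

# Change the state after the snapshot, then roll back to it.
backend.rows["clusters"].append(4)
backend.rows["extra"] = [9]
backend.restore(snapshot, clear=True)
```

With `clear=True` the table added after the snapshot disappears; with `clear=False` this sketch would keep it and only overwrite the tables present in the snapshot.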
insert_model (abstractmethod)
insert_model(model: ModelMetadata) -> None
Writes a model to Matchbox.
Parameters:
- model (ModelMetadata) – ModelMetadata object with the model’s metadata
Raises:
- MatchboxDataNotFound – If, for a linker, the source models weren’t found in the database
set_model_results (abstractmethod)
set_model_results(model: str, results: Table) -> None
Set the results for a model.
get_model_results (abstractmethod)
get_model_results(model: str) -> Table
Get the results for a model.
set_model_truth (abstractmethod)
Sets the truth threshold for this model, changing the default clusters.
get_model_truth (abstractmethod)
Gets the current truth threshold for this model.
get_model_ancestors (abstractmethod)
Gets the current truth values of all ancestors.
Returns a list of ModelAncestor objects mapping model names to their current truth thresholds.
Unlike ancestors_cache, which returns cached values, this method returns the current truth values of all ancestor models.
set_model_ancestors_cache (abstractmethod)
set_model_ancestors_cache(
    model: str, ancestors_cache: list[ModelAncestor]
) -> None
get_model_ancestors_cache (abstractmethod)
Gets the cached ancestor thresholds, converting hashes to model names.
Returns a list of ModelAncestor objects mapping model names to their cached truth thresholds.
This is required because each point of truth needs to be stable, so we choose when to update it, caching the ancestors’ values in the model itself.
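The difference between current and cached ancestor thresholds is easiest to see when an ancestor changes after the cache was written. The sketch below is an illustration under assumptions: `AncestorSketch` (with `name` and `truth` fields) and `ModelStore` are hypothetical, and only the `set_model_ancestors_cache(model, ancestors_cache)` parameter names come from this page.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AncestorSketch:
    """Hypothetical stand-in for ModelAncestor: a model name and a threshold."""

    name: str
    truth: int


class ModelStore:
    """Toy store tracking current truths and per-model ancestor caches."""

    def __init__(self) -> None:
        self.truths: dict[str, int] = {}
        self.parents: dict[str, list[str]] = {}
        self.caches: dict[str, list[AncestorSketch]] = {}

    def get_model_ancestors(self, model: str) -> list[AncestorSketch]:
        # Always reads the ancestors' current truth values.
        return [AncestorSketch(name, self.truths[name]) for name in self.parents[model]]

    def set_model_ancestors_cache(
        self, model: str, ancestors_cache: list[AncestorSketch]
    ) -> None:
        # The cache only changes when the caller chooses to update it.
        self.caches[model] = list(ancestors_cache)

    def get_model_ancestors_cache(self, model: str) -> list[AncestorSketch]:
        return list(self.caches.get(model, []))


store = ModelStore()
store.truths = {"dedupe": 90, "linker": 80}
store.parents = {"linker": ["dedupe"]}

# Freeze the ancestor thresholds on the linker, then move the ancestor.
store.set_model_ancestors_cache("linker", store.get_model_ancestors("linker"))
store.truths["dedupe"] = 95

current = store.get_model_ancestors("linker")
cached = store.get_model_ancestors_cache("linker")
```

After the ancestor's threshold moves, `current` reflects the new value while `cached` still holds the one recorded when the cache was set, which is what keeps the linker's point of truth stable.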
MatchboxSettings
Bases: BaseSettings
Settings for the Matchbox application.
matchbox.server.base
Base classes and utilities for Matchbox database adapters.
Classes:
- MatchboxBackends – The available backends for Matchbox.
- MatchboxSnapshot – A snapshot of the Matchbox database.
- MatchboxDatastoreSettings – Settings specific to the datastore configuration.
- MatchboxSettings – Settings for the Matchbox application.
- APISettings – Settings for the Matchbox API.
- BackendManager – Manages the Matchbox backend instance and settings.
- Countable – A protocol for objects that can be counted.
- Listable – A protocol for objects that can be listed.
- ListableAndCountable – A protocol for objects that can be counted and listed.
- MatchboxDBAdapter – An abstract base class for Matchbox database adapters.
Functions:
- get_backend_settings – Get the appropriate settings class based on the backend type.
- get_backend_class – Get the appropriate backend class based on the backend type.
- initialise_backend – Utility function to initialise the Matchbox backend based on settings.
- initialise_matchbox – Initialise the Matchbox backend based on environment variables.
MatchboxBackends
MatchboxSnapshot
Bases: BaseModel
A snapshot of the Matchbox database.
Methods:
- check_serialisable – Validate that the value can be serialised to JSON.
Attributes:
- backend_type (MatchboxBackends)
- data (Any)
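A check that a value can be serialised to JSON can be as simple as attempting the serialisation and reporting failure. The sketch below shows that idea with the standard library only. `JsonSnapshotSketch` is a hypothetical dataclass standing in for the model described here, and running the check in `__post_init__` is an assumption rather than how the real validator is wired.

```python
import json
from dataclasses import dataclass
from typing import Any


@dataclass
class JsonSnapshotSketch:
    """Hypothetical snapshot whose data must be JSON serialisable."""

    backend_type: str
    data: Any

    def __post_init__(self) -> None:
        self.check_serialisable(self.data)

    @staticmethod
    def check_serialisable(value: Any) -> Any:
        """Raise ValueError if the value cannot be serialised to JSON."""
        try:
            json.dumps(value)
        except (TypeError, ValueError) as exc:
            raise ValueError("Snapshot data is not JSON serialisable") from exc
        return value


# Lists and dicts of plain values serialise; a set does not.
valid = JsonSnapshotSketch("postgres", {"clusters": [1, 2, 3]})
```

Constructing the sketch with a set inside `data` raises `ValueError`, because `json.dumps` cannot encode sets.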
MatchboxDatastoreSettings
Bases: BaseSettings
Settings specific to the datastore configuration.
Methods:
- get_client – Returns an S3 client for the datastore.
Attributes:
- host (str | None)
- port (int | None)
- access_key_id (SecretStr | None)
- secret_access_key (SecretStr | None)
- default_region (str | None)
- cache_bucket_name (str)
get_client
Returns an S3 client for the datastore.
Creates S3 buckets if they don’t exist.
MatchboxSettings
Bases: BaseSettings
Settings for the Matchbox application.
APISettings
BackendManager
Manages the Matchbox backend instance and settings.
Methods:
- initialise – Initialise the backend with the given settings.
- get_backend – Get the backend instance.
- get_settings – Get the backend settings.
initialise (classmethod)
initialise(settings: MatchboxSettings)
Initialise the backend with the given settings.
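A manager of this kind typically holds the settings and a single backend instance at class level. The sketch below is one possible shape, not the library's code: `ManagerSketch`, `SettingsSketch` and `MemoryBackend` are hypothetical, and both the lazy construction in `get_backend` and the use of classmethods for the getters are assumptions (only `initialise` is marked as a classmethod on this page).

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class SettingsSketch:
    """Hypothetical settings object."""

    backend_type: str = "memory"


class MemoryBackend:
    """Hypothetical backend built from settings."""

    def __init__(self, settings: SettingsSketch) -> None:
        self.settings = settings


class ManagerSketch:
    """Holds one settings object and one backend instance at class level."""

    _settings: SettingsSketch | None = None
    _backend: MemoryBackend | None = None

    @classmethod
    def initialise(cls, settings: SettingsSketch) -> None:
        cls._settings = settings
        cls._backend = None  # Force a rebuild with the new settings.

    @classmethod
    def get_settings(cls) -> SettingsSketch:
        if cls._settings is None:
            raise ValueError("Manager has not been initialised")
        return cls._settings

    @classmethod
    def get_backend(cls) -> MemoryBackend:
        if cls._backend is None:
            # Build the backend on first use (assumed behaviour).
            cls._backend = MemoryBackend(cls.get_settings())
        return cls._backend


ManagerSketch.initialise(SettingsSketch())
first = ManagerSketch.get_backend()
second = ManagerSketch.get_backend()
```

Repeated calls to `get_backend` return the same object until `initialise` is called again with new settings.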
Countable
Listable
ListableAndCountable
A protocol for objects that can be counted and listed.
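Protocols like these describe behaviour structurally: any object with the right methods satisfies them without inheriting from anything. The sketch below shows the pattern with `typing.Protocol`. The `*Sketch` names are hypothetical, and the method names `count` and `list_all` are assumptions, since this page does not list the protocols' methods.

```python
from typing import Protocol, runtime_checkable


@runtime_checkable
class CountableSketch(Protocol):
    def count(self) -> int: ...


@runtime_checkable
class ListableSketch(Protocol):
    def list_all(self) -> list[str]: ...


@runtime_checkable
class ListableAndCountableSketch(CountableSketch, ListableSketch, Protocol):
    """Objects that can be both counted and listed."""


class DatasetTable:
    """Satisfies both protocols without inheriting from them."""

    def __init__(self, names: list[str]) -> None:
        self._names = list(names)

    def count(self) -> int:
        return len(self._names)

    def list_all(self) -> list[str]:
        return list(self._names)


class ClusterTable:
    """Only countable: it has no list_all method."""

    def __init__(self, size: int) -> None:
        self._size = size

    def count(self) -> int:
        return self._size


datasets = DatasetTable(["companies", "exporters"])
clusters = ClusterTable(3)
```

This mirrors the attribute table for the adapter, where `datasets` is typed as listable and countable while the other attributes are only countable.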
MatchboxDBAdapter
Bases: ABC
An abstract base class for Matchbox database adapters.
get_backend_settings
get_backend_settings(
    backend_type: MatchboxBackends,
) -> type[MatchboxSettings]
Get the appropriate settings class based on the backend type.
get_backend_class
get_backend_class(
    backend_type: MatchboxBackends,
) -> type[MatchboxDBAdapter]
Get the appropriate backend class based on the backend type.
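Both lookup functions map a backend type to a class rather than to an instance. A minimal sketch of that dispatch, assuming an enum with a single "postgres" member and a plain dictionary lookup, might look like this. All names ending in `Sketch`, the two `*_for` functions and the lookup tables are hypothetical, and the real functions may resolve classes differently.

```python
from enum import Enum


class BackendsSketch(str, Enum):
    """Hypothetical enum of available backends."""

    POSTGRES = "postgres"


class PostgresSettingsSketch:
    """Hypothetical settings class for the PostgreSQL backend."""


class PostgresAdapterSketch:
    """Hypothetical adapter class for the PostgreSQL backend."""


# One table per lookup, keyed by backend type.
_SETTINGS_CLASSES = {BackendsSketch.POSTGRES: PostgresSettingsSketch}
_BACKEND_CLASSES = {BackendsSketch.POSTGRES: PostgresAdapterSketch}


def backend_settings_for(backend_type: BackendsSketch) -> type:
    """Return the settings class registered for a backend type."""
    try:
        return _SETTINGS_CLASSES[backend_type]
    except KeyError:
        raise ValueError(f"Unsupported backend: {backend_type}") from None


def backend_class_for(backend_type: BackendsSketch) -> type:
    """Return the adapter class registered for a backend type."""
    try:
        return _BACKEND_CLASSES[backend_type]
    except KeyError:
        raise ValueError(f"Unsupported backend: {backend_type}") from None


# The enum can be built from its string value, e.g. one read from configuration.
adapter_cls = backend_class_for(BackendsSketch("postgres"))
settings_cls = backend_settings_for(BackendsSketch.POSTGRES)
```

Returning classes keeps construction in the caller's hands, so settings can be loaded and validated before any backend is instantiated.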
initialise_backend
initialise_backend(settings: MatchboxSettings) -> None
Utility function to initialise the Matchbox backend based on settings.
initialise_matchbox
Initialise the Matchbox backend based on environment variables.