Overview
matchbox.server
Matchbox server.
Includes the API, and database adapters for various backends.
Modules:
- api – Matchbox API.
- base – Base classes and utilities for Matchbox database adapters.
- postgresql – PostgreSQL adapter for Matchbox server.
Classes:
- MatchboxDBAdapter – An abstract base class for Matchbox database adapters.
- MatchboxServerSettings – Settings for the Matchbox application.
MatchboxDBAdapter
Bases: ABC
An abstract base class for Matchbox database adapters.
Methods:
- query – Queries the database from an optional point of truth.
- match – Matches an ID in a source resolution and returns the keys in the targets.
- index – Indexes a source in your warehouse to Matchbox.
- get_source_config – Get a source configuration from its resolution name.
- get_resolution_source_configs – Get a list of source configurations queryable from a resolution.
- validate_ids – Validates that a list of IDs exists in the database.
- validate_hashes – Validates that a list of hashes exists in the database.
- cluster_id_to_hash – Get a lookup of Cluster hashes from a list of IDs.
- get_resolution_graph – Get the full resolution graph.
- dump – Dumps the entire database to a snapshot.
- drop – Hard clear the database by dropping all tables and re-creating them.
- clear – Soft clear the database by deleting all rows but retaining tables.
- restore – Restores the database from a snapshot.
- insert_model – Writes a model to Matchbox.
- get_model – Get a model from the database.
- set_model_results – Set the results for a model.
- get_model_results – Get the results for a model.
- set_model_truth – Sets the truth threshold for this model, changing the default clusters.
- get_model_truth – Gets the current truth threshold for this model.
- get_model_ancestors – Gets the current truth values of all ancestors.
- set_model_ancestors_cache – Updates the cached ancestor thresholds.
- get_model_ancestors_cache – Gets the cached ancestor thresholds.
- delete_resolution – Delete a resolution from the database.
Attributes:
- settings (MatchboxServerSettings)
- sources (ListableAndCountable)
- models (Countable)
- data (Countable)
- clusters (Countable)
- creates (Countable)
- merges (Countable)
- proposes (Countable)
- source_resolutions (Countable)
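Because MatchboxDBAdapter derives from ABC, a backend can only be instantiated once every abstract method has an implementation. The sketch below illustrates that contract with a cut-down stand-in. ToyAdapter, InMemoryAdapter, and their two methods are hypothetical names chosen for illustration and are not the Matchbox classes.

```python
from abc import ABC, abstractmethod
from typing import Optional


class ToyAdapter(ABC):
    """Hypothetical, cut-down stand-in for an adapter base class."""

    @abstractmethod
    def query(self, source: str, limit: Optional[int] = None) -> list:
        """Return the IDs held for a source."""

    @abstractmethod
    def dump(self) -> dict:
        """Return a snapshot of the stored state."""


class InMemoryAdapter(ToyAdapter):
    """Concrete backend: implements every abstract method."""

    def __init__(self) -> None:
        self.rows = {"companies": [1, 2, 3]}

    def query(self, source: str, limit: Optional[int] = None) -> list:
        ids = self.rows.get(source, [])
        return ids if limit is None else ids[:limit]

    def dump(self) -> dict:
        return {name: list(ids) for name, ids in self.rows.items()}


def can_instantiate(cls: type) -> bool:
    """Report whether a class can be constructed with no arguments."""
    try:
        cls()
    except TypeError:
        return False
    return True
```

Calling can_instantiate(ToyAdapter) returns False because the abstract methods are unimplemented, while InMemoryAdapter constructs normally.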
query
abstractmethod
query(
    source: SourceResolutionName,
    resolution: ResolutionName | None = None,
    threshold: int | None = None,
    limit: int = None,
) -> Table
Queries the database from an optional point of truth.
Parameters:
- source (SourceResolutionName) – The SourceResolutionName string identifying the source to query.
- resolution (optional, default: None) – The resolution to use for filtering results. If not specified, the source resolution for the queried source is used.
- threshold (optional, default: None) – The threshold to use for creating clusters. If None, uses the models’ default threshold. If an integer, uses that threshold for the specified model, and the model’s cached thresholds for its ancestors.
- limit (optional, default: None) – The number to use in a limit clause. Useful for testing.
Returns:
- Table – The resulting Matchbox IDs in Arrow format.
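The threshold rule described for query can be sketched as a small pure function. This is an illustration only: the function name, the dictionary shapes, and the example model names are assumptions, and the real adapter resolves these values inside the database.

```python
from typing import Optional


def effective_thresholds(
    model: str,
    default_truth: int,
    ancestors_cache: dict,
    threshold: Optional[int] = None,
) -> dict:
    """Pick the threshold applied to a model and to each of its ancestors.

    Hypothetical helper: ancestors always use the values cached on the
    model, while the model itself uses the override when one is given.
    """
    resolved = dict(ancestors_cache)  # copy so the cache is not mutated
    resolved[model] = default_truth if threshold is None else threshold
    return resolved


# Hypothetical model names and thresholds.
cache = {"dedupe_a": 90, "dedupe_b": 85}
with_default = effective_thresholds("linker", 80, cache)
with_override = effective_thresholds("linker", 80, cache, threshold=70)
```

With no override the model falls back to its default truth. With an override only the queried model changes, and the ancestors keep their cached values.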
match
abstractmethod
match(
    key: str,
    source: SourceResolutionName,
    targets: list[SourceResolutionName],
    resolution: ResolutionName,
    threshold: int | None = None,
) -> list[Match]
Matches an ID in a source resolution and returns the keys in the targets.
Parameters:
- key (str) – The key to match from the source.
- source (SourceResolutionName) – The name of the source resolution.
- targets (list[SourceResolutionName]) – The names of the target source resolutions.
- resolution (ResolutionName) – The name of the resolution to use for matching.
- threshold (optional, default: None) – The threshold to use for creating clusters. If None, uses the resolutions’ default threshold. If an integer, uses that threshold for the specified resolution, and the resolution’s cached thresholds for its ancestors. Will use these threshold values instead of the cached thresholds.
index
abstractmethod
index(
    source_config: SourceConfig, data_hashes: Table
) -> None
Indexes a source in your warehouse to Matchbox.
Parameters:
- source_config (SourceConfig) – The source configuration to index.
- data_hashes (Table) – The Arrow table with the hash of each data row.
get_source_config
abstractmethod
get_source_config(
    name: SourceResolutionName,
) -> SourceConfig
Get a source configuration from its resolution name.
Parameters:
- name (SourceResolutionName) – The resolution name for the source.
Returns:
- SourceConfig – A SourceConfig object.
get_resolution_source_configs
abstractmethod
get_resolution_source_configs(
    name: ResolutionName,
) -> list[SourceConfig]
Get a list of source configurations queryable from a resolution.
Parameters:
- name (ResolutionName) – Name of the resolution to query.
Returns:
- list[SourceConfig] – List of relevant SourceConfig objects.
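validate_ids
abstractmethod
Validates that a list of IDs exists in the database.
validate_hashes
abstractmethod
Validates that a list of hashes exists in the database.
cluster_id_to_hash
abstractmethod
Get a lookup of Cluster hashes from a list of IDs.
get_resolution_graph
abstractmethod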
get_resolution_graph() -> ResolutionGraph
Get the full resolution graph.
dump
abstractmethod
dump() -> MatchboxSnapshot
Dumps the entire database to a snapshot.
Returns:
- MatchboxSnapshot – A MatchboxSnapshot object of type “postgres” with the database’s current state.
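drop
abstractmethod
Hard clear the database by dropping all tables and re-creating them.
clear
abstractmethod
Soft clear the database by deleting all rows but retaining tables.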
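The difference between the hard clear (drop) and the soft clear (clear) can be shown on a throwaway SQLite database. This is a sketch of the general technique only. The one-table SCHEMA, the scratch table, and the helper names are hypothetical and do not reflect how any Matchbox backend is implemented.

```python
import sqlite3

# Hypothetical one-table schema standing in for the real tables.
SCHEMA = "CREATE TABLE clusters (id INTEGER PRIMARY KEY, hash TEXT)"


def table_names(conn: sqlite3.Connection) -> list:
    """List user tables in name order."""
    rows = conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
    )
    return [name for (name,) in rows]


def row_count(conn: sqlite3.Connection, table: str) -> int:
    return conn.execute(f'SELECT COUNT(*) FROM "{table}"').fetchone()[0]


def soft_clear(conn: sqlite3.Connection) -> None:
    """Delete every row but leave all tables in place."""
    for name in table_names(conn):
        conn.execute(f'DELETE FROM "{name}"')
    conn.commit()


def hard_clear(conn: sqlite3.Connection) -> None:
    """Drop every table, then re-create only what the schema defines."""
    for name in table_names(conn):
        conn.execute(f'DROP TABLE "{name}"')
    conn.execute(SCHEMA)
    conn.commit()


def seeded_connection() -> sqlite3.Connection:
    """Build an in-memory database with one schema table and one stray table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(SCHEMA)
    conn.execute("CREATE TABLE scratch (note TEXT)")  # not part of SCHEMA
    conn.execute("INSERT INTO clusters (hash) VALUES ('abc')")
    conn.commit()
    return conn
```

After soft_clear the stray scratch table is still present but empty, whereas hard_clear leaves only the tables that the schema defines.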
restore
abstractmethod
restore(snapshot: MatchboxSnapshot) -> None
Restores the database from a snapshot.
Parameters:
- snapshot (MatchboxSnapshot) – A MatchboxSnapshot object of type “postgres” with the database’s state.
Raises:
- TypeError – If the snapshot is not compatible with PostgreSQL.
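A dump and restore round trip, including the TypeError raised for an incompatible snapshot, might look like the following. ToySnapshot and ToyPostgresAdapter are hypothetical stand-ins. The real MatchboxSnapshot is a Pydantic model whose data layout is not documented here.

```python
import copy
from dataclasses import dataclass
from typing import Any


@dataclass(frozen=True)
class ToySnapshot:
    """Hypothetical snapshot: a backend label plus opaque data."""

    backend_type: str
    data: Any


class ToyPostgresAdapter:
    """Hypothetical adapter that only accepts its own snapshot type."""

    backend_type = "postgres"

    def __init__(self) -> None:
        self.state: dict = {}

    def dump(self) -> ToySnapshot:
        # Deep copy so later writes do not alter the snapshot.
        return ToySnapshot(self.backend_type, copy.deepcopy(self.state))

    def restore(self, snapshot: ToySnapshot) -> None:
        if snapshot.backend_type != self.backend_type:
            raise TypeError(
                f"Cannot restore a {snapshot.backend_type!r} snapshot "
                f"into a {self.backend_type!r} backend."
            )
        self.state = copy.deepcopy(snapshot.data)
```

Checking the snapshot type before touching any state means a rejected restore leaves the existing data untouched.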
insert_model
abstractmethod
insert_model(model_config: ModelConfig) -> None
Writes a model to Matchbox.
Parameters:
- model_config (ModelConfig) – ModelConfig object with the model’s metadata.
Raises:
- MatchboxDataNotFound – If, for a linker, the source models weren’t found in the database.
- MatchboxModelConfigError – If the model configuration is invalid, such as the resolutions sharing ancestors.
get_model
abstractmethod
get_model(name: ModelResolutionName) -> ModelConfig
Get a model from the database.
set_model_results
abstractmethod
set_model_results(
    name: ModelResolutionName, results: Table
) -> None
Set the results for a model.
get_model_results
abstractmethod
get_model_results(name: ModelResolutionName) -> Table
Get the results for a model.
set_model_truth
abstractmethod
set_model_truth(
    name: ModelResolutionName, truth: float
) -> None
Sets the truth threshold for this model, changing the default clusters.
get_model_truth
abstractmethod
get_model_truth(name: ModelResolutionName) -> float
Gets the current truth threshold for this model.
get_model_ancestors
abstractmethod
get_model_ancestors(
    name: ModelResolutionName,
) -> list[ModelAncestor]
Gets the current truth values of all ancestors.
Returns a list of ModelAncestor objects mapping model resolution names to their current truth thresholds.
Unlike get_model_ancestors_cache, which returns cached values, this method returns the current truth values of all ancestor models.
set_model_ancestors_cache
abstractmethod
set_model_ancestors_cache(
    name: ModelResolutionName,
    ancestors_cache: list[ModelAncestor],
) -> None
Updates the cached ancestor thresholds.
Parameters:
- name (ModelResolutionName) – The name of the model to update.
- ancestors_cache (list[ModelAncestor]) – List of ModelAncestor objects mapping model resolution names to their truth thresholds.
get_model_ancestors_cache
abstractmethod
get_model_ancestors_cache(
    name: ModelResolutionName,
) -> list[ModelAncestor]
Gets the cached ancestor thresholds.
Returns a list of ModelAncestor objects mapping model resolution names to their cached truth thresholds.
This is required because each point of truth needs to be stable, so we choose when to update it, caching the ancestors’ values in the model itself.
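The relationship between the cached ancestor thresholds and the ancestors' current truth values can be illustrated with plain dictionaries. The helper names and the name-to-threshold mapping are assumptions made for this sketch. The real methods exchange lists of ModelAncestor objects.

```python
def stale_ancestors(cached: dict, current: dict) -> dict:
    """Map ancestor name to (cached, current) where the two values differ."""
    return {
        name: (cached_truth, current.get(name))
        for name, cached_truth in cached.items()
        if current.get(name) != cached_truth
    }


def refresh_cache(cached: dict, current: dict) -> dict:
    """Return a new cache holding the current truth of each cached ancestor."""
    return {name: current.get(name, truth) for name, truth in cached.items()}


# Hypothetical ancestors: dedupe_b was re-thresholded after the cache was set.
cached = {"dedupe_a": 90, "dedupe_b": 85}
current = {"dedupe_a": 90, "dedupe_b": 80}

drift = stale_ancestors(cached, current)
refreshed = refresh_cache(cached, current)
```

Until the cache is refreshed explicitly, queries through the model keep using the cached value for dedupe_b, which is what keeps the point of truth stable.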
delete_resolution
abstractmethod
delete_resolution(
    name: ResolutionName, certain: bool
) -> None
Delete a resolution from the database.
Parameters:
- name (ResolutionName) – The name of the resolution to delete.
- certain (bool) – Whether to delete the resolution without confirmation.
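One way a certain flag can guard a destructive call is sketched below. It assumes that an unconfirmed call is rejected with an error rather than prompting. ToyDeletionNotConfirmed, toy_delete_resolution, and the dictionary store are hypothetical and are not taken from the Matchbox code.

```python
class ToyDeletionNotConfirmed(Exception):
    """Hypothetical error raised when a deletion is not confirmed."""


def toy_delete_resolution(resolutions: dict, name: str, certain: bool) -> None:
    """Remove a resolution from an in-memory store, but only when confirmed."""
    if name not in resolutions:
        raise KeyError(f"Resolution {name!r} not found.")
    if not certain:
        raise ToyDeletionNotConfirmed(
            f"Deleting {name!r} is irreversible; pass certain=True to proceed."
        )
    del resolutions[name]


# Hypothetical store mapping resolution names to their truth thresholds.
store = {"dedupe_a": 90, "linker": 80}
```

Calling toy_delete_resolution(store, "linker", certain=False) raises and leaves the store unchanged. Repeating the call with certain=True removes the entry.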
MatchboxServerSettings
Bases: BaseSettings
Settings for the Matchbox application.
Attributes:
- batch_size (int)
- backend_type (MatchboxBackends)
- datastore (MatchboxDatastoreSettings)
- api_key (SecretStr | None)
- log_level (LogLevelType)
matchbox.server.base
Base classes and utilities for Matchbox database adapters.
Classes:
- MatchboxBackends – The available backends for Matchbox.
- MatchboxSnapshot – A snapshot of the Matchbox database.
- MatchboxDatastoreSettings – Settings specific to the datastore configuration.
- MatchboxServerSettings – Settings for the Matchbox application.
- BackendManager – Manages the Matchbox backend instance and settings.
- Countable – A protocol for objects that can be counted.
- Listable – A protocol for objects that can be listed.
- ListableAndCountable – A protocol for objects that can be counted and listed.
- MatchboxDBAdapter – An abstract base class for Matchbox database adapters.
Functions:
- get_backend_settings – Get the appropriate settings class based on the backend type.
- get_backend_class – Get the appropriate backend class based on the backend type.
- settings_to_backend – Create backend adapter with injected settings.
- initialise_matchbox – Initialise the Matchbox backend based on environment variables.
MatchboxBackends
The available backends for Matchbox.
MatchboxSnapshot
Bases: BaseModel
A snapshot of the Matchbox database.
Methods:
- check_serialisable – Validate that the value can be serialised to JSON.
Attributes:
- backend_type (MatchboxBackends)
- data (Any)
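A serialisability check of the kind check_serialisable describes can be written with the standard json module. This is a sketch under assumptions: the real validator is a Pydantic hook whose body is not shown here, and check_json_serialisable is a hypothetical free function.

```python
import json
from typing import Any


def check_json_serialisable(value: Any) -> Any:
    """Return the value unchanged if json.dumps accepts it, else raise."""
    try:
        json.dumps(value)
    except (TypeError, ValueError) as exc:
        # TypeError: unsupported type; ValueError: e.g. circular reference.
        raise ValueError(f"Value is not JSON serialisable: {exc}") from exc
    return value
```

Nested dictionaries and lists of strings and numbers pass through unchanged, while a value containing a set is rejected.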
MatchboxDatastoreSettings
Bases: BaseSettings
Settings specific to the datastore configuration.
Methods:
- get_client – Returns an S3 client for the datastore.
Attributes:
- host (str | None)
- port (int | None)
- access_key_id (SecretStr | None)
- secret_access_key (SecretStr | None)
- default_region (str | None)
- cache_bucket_name (str)
get_client
Returns an S3 client for the datastore.
Creates S3 buckets if they don’t exist.
BackendManager
Manages the Matchbox backend instance and settings.
Methods:
- initialise – Initialise the backend with the given settings.
- get_backend – Get the backend instance.
- get_settings – Get the backend settings.
initialise
classmethod
initialise(settings: MatchboxServerSettings)
Initialise the backend with the given settings.
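A manager that holds one backend instance at class level could be structured roughly as follows. This is a sketch, not the Matchbox implementation: ToyBackend, ToyBackendManager, the lazy construction, and the RuntimeError on early access are all assumptions.

```python
from typing import Optional


class ToyBackend:
    """Hypothetical backend built from a settings dictionary."""

    def __init__(self, settings: dict) -> None:
        self.settings = settings


class ToyBackendManager:
    """Hypothetical manager storing settings and a lazily built backend."""

    _settings: Optional[dict] = None
    _backend: Optional[ToyBackend] = None

    @classmethod
    def initialise(cls, settings: dict) -> None:
        cls._settings = settings
        cls._backend = None  # force a rebuild on next access

    @classmethod
    def get_backend(cls) -> ToyBackend:
        if cls._settings is None:
            raise RuntimeError("Call initialise() before get_backend().")
        if cls._backend is None:
            cls._backend = ToyBackend(cls._settings)
        return cls._backend

    @classmethod
    def get_settings(cls) -> dict:
        if cls._settings is None:
            raise RuntimeError("Call initialise() before get_settings().")
        return cls._settings
```

Keeping the state on the class means every caller in the process shares the same backend instance after a single initialise call.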
Countable
A protocol for objects that can be counted.
Listable
A protocol for objects that can be listed.
ListableAndCountable
A protocol for objects that can be counted and listed.
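Structural protocols such as these can be expressed with typing.Protocol. The method names count and list_all below are assumptions, since the method listing is missing from this page, and every class name in the sketch is hypothetical.

```python
from typing import Protocol, runtime_checkable


@runtime_checkable
class ToyCountable(Protocol):
    def count(self) -> int: ...


@runtime_checkable
class ToyListable(Protocol):
    def list_all(self) -> list: ...


@runtime_checkable
class ToyListableAndCountable(ToyCountable, ToyListable, Protocol):
    """Combines both protocols; no extra members."""


class SourceCollection:
    """Satisfies both protocols without inheriting from them."""

    def __init__(self, names: list) -> None:
        self._names = list(names)

    def count(self) -> int:
        return len(self._names)

    def list_all(self) -> list:
        return list(self._names)


class ModelCollection:
    """Satisfies only the countable protocol."""

    def __init__(self, total: int) -> None:
        self._total = total

    def count(self) -> int:
        return self._total
```

This mirrors the attribute table for the adapter, where sources is typed as ListableAndCountable and the remaining collections as Countable.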
get_backend_settings
get_backend_settings(
    backend_type: MatchboxBackends,
) -> type[MatchboxServerSettings]
Get the appropriate settings class based on the backend type.
get_backend_class
get_backend_class(
    backend_type: MatchboxBackends,
) -> type[MatchboxDBAdapter]
Get the appropriate backend class based on the backend type.
settings_to_backend
settings_to_backend(
    settings: MatchboxServerSettings,
) -> MatchboxDBAdapter
Create backend adapter with injected settings.
initialise_matchbox
Initialise the Matchbox backend based on environment variables.
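The helper functions above follow a common pattern: an enum value selects a class, and the settings object is injected into the adapter's constructor. A minimal sketch of that pattern follows, assuming a simple dictionary registry. All names prefixed with Toy or toy_ and the MEMORY backend are hypothetical.

```python
from dataclasses import dataclass
from enum import Enum


class ToyBackends(str, Enum):
    POSTGRES = "postgres"
    MEMORY = "memory"  # hypothetical second backend for illustration


@dataclass
class ToySettings:
    backend_type: ToyBackends
    batch_size: int = 1000


class ToyAdapterBase:
    def __init__(self, settings: ToySettings) -> None:
        self.settings = settings


class ToyPostgresAdapter(ToyAdapterBase):
    pass


class ToyMemoryAdapter(ToyAdapterBase):
    pass


# Registry mapping each backend type to its adapter class.
_BACKEND_CLASSES = {
    ToyBackends.POSTGRES: ToyPostgresAdapter,
    ToyBackends.MEMORY: ToyMemoryAdapter,
}


def toy_get_backend_class(backend_type: ToyBackends) -> type:
    """Look up the adapter class registered for a backend type."""
    try:
        return _BACKEND_CLASSES[backend_type]
    except KeyError:
        raise ValueError(f"Unsupported backend type: {backend_type!r}") from None


def toy_settings_to_backend(settings: ToySettings) -> ToyAdapterBase:
    """Instantiate the matching adapter with the settings injected."""
    return toy_get_backend_class(settings.backend_type)(settings)
```

Resolving the class first and instantiating second keeps the lookup testable on its own and lets the settings travel with the adapter.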