PostgreSQL¶
A backend adapter for deploying Matchbox using PostgreSQL.
There are two graph-like trees in this schema:
- In the resolution subgraph, the tree is implemented as a closure table (ResolutionFrom), enabling quick querying of root-to-leaf paths at the cost of redundancy.
- In the data subgraph, the tree is implemented as an adjacency list (Contains), which means recursive queries are required to resolve it, but less data is stored.
```mermaid
erDiagram
    Sources {
        bigint resolution_id PK,FK
        string resolution_name
        string full_name
        bytes warehouse_hash
        string db_pk
    }
    SourceColumns {
        bigint column_id PK
        bigint source_id FK
        int column_index
        string column_name
        string column_type
    }
    Clusters {
        bigint cluster_id PK
        bytes cluster_hash
    }
    ClusterSourcePK {
        bigint pk_id PK
        bigint cluster_id FK
        bigint source_id FK
        string source_pk
    }
    Contains {
        bigint parent PK,FK
        bigint child PK,FK
    }
    Probabilities {
        bigint resolution PK,FK
        bigint cluster PK,FK
        smallint probability
    }
    Resolutions {
        bigint resolution_id PK
        bytes resolution_hash
        string type
        string name
        string description
        smallint truth
    }
    ResolutionFrom {
        bigint parent PK,FK
        bigint child PK,FK
        int level
        smallint truth_cache
    }
    Sources |o--|| Resolutions : ""
    Sources ||--o{ SourceColumns : ""
    Sources ||--o{ ClusterSourcePK : ""
    Clusters ||--o{ ClusterSourcePK : ""
    Clusters ||--o{ Probabilities : ""
    Clusters ||--o{ Contains : "parent"
    Contains }o--|| Clusters : "child"
    Resolutions ||--o{ Probabilities : ""
    Resolutions ||--o{ ResolutionFrom : "parent"
    ResolutionFrom }o--|| Resolutions : "child"
```
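The trade-off between the two tree encodings can be sketched in plain Python. This is an illustrative model only, not the adapter's code: a closure table materialises every ancestor/descendant pair so ancestry is a single lookup, while an adjacency list stores only direct edges and must be walked (in SQL, via a recursive query).

```python
# Closure table (ResolutionFrom-style): every ancestor/descendant pair is
# stored with its level, so finding all ancestors is one flat lookup.
closure = {
    # (parent, child): level
    ("root", "a"): 1,
    ("root", "leaf"): 2,
    ("a", "leaf"): 1,
}
ancestors_of_leaf = {parent for (parent, child) in closure if child == "leaf"}
assert ancestors_of_leaf == {"root", "a"}  # no recursion needed

# Adjacency list (Contains-style): only direct edges are stored, so
# resolving ancestry requires walking the edges (in SQL, a recursive CTE).
edges = {"a": "root", "leaf": "a"}  # child -> parent

def ancestors(node: str) -> set[str]:
    """Walk parent links until the root is reached."""
    found: set[str] = set()
    while node in edges:
        node = edges[node]
        found.add(node)
    return found

assert ancestors("leaf") == {"root", "a"}
```

The closure table stores one row per ancestor/descendant pair (redundant but fast to query); the adjacency list stores one row per direct edge (compact but recursive to resolve), matching the description above.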
matchbox.server.postgresql¶
PostgreSQL adapter for Matchbox server.
Modules:
- adapter – PostgreSQL adapter for Matchbox server.
- benchmark – Benchmarking utilities for Matchbox's PostgreSQL backend.
- db – Matchbox PostgreSQL database connection.
- mixin – A module for defining mixins for the PostgreSQL backend ORM.
- orm – ORM classes for the Matchbox PostgreSQL database.
- utils – Utilities for using the PostgreSQL backend.
Classes:
- MatchboxPostgres – A PostgreSQL adapter for Matchbox.
- MatchboxPostgresSettings – Settings for the Matchbox PostgreSQL backend.
MatchboxPostgres¶
MatchboxPostgres(settings: MatchboxPostgresSettings)
Bases: MatchboxDBAdapter
A PostgreSQL adapter for Matchbox.
Methods:
- query – Queries the database from an optional point of truth.
- match – Matches an ID in a source dataset and returns the keys in the targets.
- index – Indexes a source dataset from your warehouse in Matchbox.
- get_source – Get a source from its address.
- get_resolution_sources – Get a list of sources queryable from a resolution.
- validate_ids – Validates that a list of IDs exists in the database.
- validate_hashes – Validates that a list of hashes exists in the database.
- cluster_id_to_hash – Get a lookup of Cluster hashes from a list of IDs.
- get_resolution_graph – Get the full resolution graph.
- dump – Dumps the entire database to a snapshot.
- drop – Hard-clears the database by dropping all tables and re-creating them.
- clear – Soft-clears the database by deleting all rows but retaining tables.
- restore – Restores the database from a snapshot.
- verify – Checks the database schema against the expected schema and logs the outcome.
- insert_model – Writes a model to Matchbox.
- get_model – Get a model from the database.
- set_model_results – Set the results for a model.
- get_model_results – Get the results for a model.
- set_model_truth – Sets the truth threshold for this model, changing the default clusters.
- get_model_truth – Gets the current truth threshold for this model.
- get_model_ancestors – Gets the current truth values of all ancestors.
- set_model_ancestors_cache – Updates the cached ancestor thresholds.
- get_model_ancestors_cache – Gets the cached ancestor thresholds, converting hashes to model names.
- delete_model – Delete a model from the database.
Attributes:
models instance-attribute¶
models = FilteredResolutions(
    datasets=False, humans=False, models=True
)
source_resolutions instance-attribute¶
source_resolutions = FilteredResolutions(
    datasets=True, humans=False, models=False
)
query¶
query(
    source_address: SourceAddress,
    resolution_name: str | None = None,
    threshold: int | None = None,
    limit: int | None = None,
) -> Table
Queries the database from an optional point of truth.
Parameters:
- source_address (SourceAddress) – The SourceAddress object identifying the source to query.
- resolution_name (str, optional, default: None) – The resolution to use for filtering results. If not specified, uses the dataset resolution for the queried source.
- threshold (int, optional, default: None) – The threshold to use for creating clusters. If None, uses the model's default threshold. If an integer, uses that threshold for the specified model, and the model's cached thresholds for its ancestors.
- limit (int, optional, default: None) – The number to use in a limit clause. Useful for testing.
Returns:
- Table – The resulting Matchbox IDs in Arrow format.
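The threshold semantics above can be illustrated with a minimal sketch. This is not the adapter's implementation; it assumes only what the schema states, namely that probabilities are stored as smallint percentages (0-100) and a cluster is kept when its probability meets the threshold.

```python
def clusters_at_threshold(probabilities: dict[int, int], threshold: int) -> set[int]:
    """Keep only the clusters whose probability meets the truth threshold.

    probabilities maps cluster_id -> probability (a 0-100 percentage, as in
    the Probabilities table); threshold is an integer on the same scale.
    """
    return {cluster for cluster, prob in probabilities.items() if prob >= threshold}

# Three candidate clusters proposed by a model at 95%, 80% and 60%:
probs = {1: 95, 2: 80, 3: 60}
assert clusters_at_threshold(probs, 80) == {1, 2}
assert clusters_at_threshold(probs, 50) == {1, 2, 3}
```

Raising the threshold shrinks the set of accepted clusters; passing None (per the docstring above) would instead fall back to the model's stored default.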
match¶
match(
    source_pk: str,
    source: SourceAddress,
    targets: list[SourceAddress],
    resolution_name: str,
    threshold: int | None = None,
) -> list[Match]
Matches an ID in a source dataset and returns the keys in the targets.
Parameters:
- source_pk (str) – The primary key to match from the source.
- source (SourceAddress) – The address of the source dataset.
- targets (list[SourceAddress]) – The addresses of the target datasets.
- resolution_name (str) – The name of the resolution to use for matching.
- threshold (int, optional, default: None) – The threshold to use for creating clusters. If None, uses the resolution's default threshold. If an integer, uses that threshold for the specified resolution, and the resolution's cached thresholds for its ancestors, in place of the cached values.
index¶
index(source: Source, data_hashes: Table) -> None
Indexes a source dataset from your warehouse in Matchbox.
get_source¶
get_source(address: SourceAddress) -> Source
Get a source from its address.
Parameters:
- address (SourceAddress) – The address of the source.
Returns:
- Source – A Source object.
get_resolution_sources¶
get_resolution_sources(resolution_name: str) -> list[Source]
Get a list of sources queryable from a resolution.
validate_ids¶
Validates that a list of IDs exists in the database.
validate_hashes¶
Validates that a list of hashes exists in the database.
cluster_id_to_hash¶
Get a lookup of Cluster hashes from a list of IDs.
dump¶
dump() -> MatchboxSnapshot
Dumps the entire database to a snapshot.
Returns:
- MatchboxSnapshot – A MatchboxSnapshot object of type "postgres" with the database's current state.
drop¶
Hard-clears the database by dropping all tables and re-creating them.
clear¶
Soft-clears the database by deleting all rows but retaining tables.
restore¶
restore(
    snapshot: MatchboxSnapshot, clear: bool = False
) -> None
Restores the database from a snapshot.
Parameters:
- snapshot (MatchboxSnapshot) – A MatchboxSnapshot object of type "postgres" with the database's state.
- clear (bool) – Whether to clear the database before restoration.
Raises:
- TypeError – If the snapshot is not compatible with PostgreSQL.
insert_model¶
insert_model(model: ModelMetadata) -> None
Writes a model to Matchbox.
Parameters:
- model (ModelMetadata) – ModelMetadata object with the model's metadata.
Raises:
- MatchboxDataNotFound – If, for a linker, the source models weren't found in the database.
set_model_results¶
set_model_results(model: str, results: Table) -> None
Set the results for a model.
set_model_truth¶
Sets the truth threshold for this model, changing the default clusters.
get_model_truth¶
Gets the current truth threshold for this model.
get_model_ancestors¶
get_model_ancestors(model: str) -> list[ModelAncestor]
Gets the current truth values of all ancestors.
Returns a list of ModelAncestor objects mapping model names to their current truth thresholds.
Unlike ancestors_cache, which returns cached values, this returns the current truth values of all ancestor models.
set_model_ancestors_cache¶
set_model_ancestors_cache(
    model: str, ancestors_cache: list[ModelAncestor]
) -> None
Updates the cached ancestor thresholds.
Parameters:
- model (str) – The name of the model to update.
- ancestors_cache (list[ModelAncestor]) – List of ModelAncestor objects mapping model names to their truth thresholds.
get_model_ancestors_cache¶
get_model_ancestors_cache(
    model: str,
) -> list[ModelAncestor]
Gets the cached ancestor thresholds, converting hashes to model names.
Returns a list of ModelAncestor objects mapping model names to their cached truth thresholds.
This is required because each point of truth needs to be stable, so we choose when to update it, caching the ancestors' values in the model itself.
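The distinction between live and cached ancestor thresholds can be sketched with plain dictionaries. The names and values here are illustrative, not the adapter's API: the point is only that a model resolves against the threshold it cached when it was built, so later changes to an ancestor's truth don't silently change its outputs.

```python
# Live truth values, as get_model_ancestors would report them:
current_truth = {"deduper_1": 85, "linker_1": 70}

# Thresholds frozen when the downstream model was built, as
# get_model_ancestors_cache would report them:
ancestors_cache = {"deduper_1": 90}

def effective_threshold(model: str, cache: dict[str, int], live: dict[str, int]) -> int:
    """Prefer the cached threshold; fall back to the live truth value."""
    return cache.get(model, live[model])

# The cached value wins, keeping the point of truth stable even though
# deduper_1's live truth has since moved to 85:
assert effective_threshold("deduper_1", ancestors_cache, current_truth) == 90
```

set_model_ancestors_cache is the deliberate moment at which the cache is refreshed to match the live values.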
MatchboxPostgresSettings¶
Bases: MatchboxServerSettings
Settings for the Matchbox PostgreSQL backend.
Inherits the core settings and adds the PostgreSQL-specific settings.
Attributes:
- model_config
- batch_size (int)
- datastore (MatchboxDatastoreSettings)
- api_key (SecretStr | None)
- log_level (LogLevelType)
- backend_type (MatchboxBackends)
- postgres (MatchboxPostgresCoreSettings)
model_config class-attribute instance-attribute¶
model_config = SettingsConfigDict(
    env_prefix="MB__SERVER__",
    env_nested_delimiter="__",
    use_enum_values=True,
    env_file=".env",
    env_file_encoding="utf-8",
    extra="ignore",
)
postgres class-attribute instance-attribute¶
postgres: MatchboxPostgresCoreSettings = Field(
    default_factory=MatchboxPostgresCoreSettings
)
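The SettingsConfigDict above means environment variables are read with the MB__SERVER__ prefix and "__" as the nested-field delimiter. The sketch below re-implements that mapping in plain Python purely to show the shape of the result; in practice pydantic-settings performs this parsing (with additional behaviour such as case-insensitivity and .env loading).

```python
def parse_env(env: dict[str, str], prefix: str = "MB__SERVER__", delim: str = "__") -> dict:
    """Map prefixed, delimited environment variables to nested settings keys."""
    out: dict = {}
    for key, value in env.items():
        if not key.startswith(prefix):
            continue  # unrelated variables are ignored (extra="ignore")
        parts = key[len(prefix):].lower().split(delim)
        node = out
        for part in parts[:-1]:
            node = node.setdefault(part, {})
        node[parts[-1]] = value
    return out

env = {"MB__SERVER__POSTGRES__HOST": "localhost", "HOME": "/root"}
assert parse_env(env) == {"postgres": {"host": "localhost"}}
```

So a variable like MB__SERVER__POSTGRES__HOST lands on the nested postgres settings object, matching env_prefix and env_nested_delimiter above.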
adapter¶
PostgreSQL adapter for Matchbox server.
Classes:
- FilteredClusters – Wrapper class for filtered cluster queries.
- FilteredProbabilities – Wrapper class for filtered probability queries.
- FilteredResolutions – Wrapper class for filtered resolution queries.
- MatchboxPostgres – A PostgreSQL adapter for Matchbox.
FilteredClusters¶
Bases: BaseModel
Wrapper class for filtered cluster queries.
Methods:
- count – Counts the number of clusters in the database.
Attributes:
- has_dataset (bool | None)
FilteredProbabilities¶
Bases: BaseModel
Wrapper class for filtered probability queries.
Methods:
- count – Counts the number of probabilities in the database.
Attributes:
- over_truth (bool)
FilteredResolutions¶
Bases: BaseModel
Wrapper class for filtered resolution queries.
Methods:
- count – Counts the number of resolutions in the database.
MatchboxPostgres¶
MatchboxPostgres(settings: MatchboxPostgresSettings)
Bases: MatchboxDBAdapter
A PostgreSQL adapter for Matchbox. This class is re-exported at the package level; its methods and attributes are documented in full under matchbox.server.postgresql above.
benchmark¶
Benchmarking utilities for Matchbox's PostgreSQL backend.
Modules:
- cluster_pipeline – Script to benchmark the full dummy data generation pipeline.
- generate_tables – Generate tables for benchmarking the PostgreSQL backend.
- query – Benchmarking utilities for the PostgreSQL backend.
cluster_pipeline¶
Script to benchmark the full dummy data generation pipeline.
Functions:
- timer – Context manager to time a block of code.
Attributes:
- ROOT
- config
- left_ids
- right_ids
- probs
- all_probs
- lookup
- hm
- probs_with_ccs
- hierarchy
right_ids module-attribute¶
probs module-attribute¶
probs = generate_dummy_probabilities(
    left_ids,
    right_ids,
    [0.6, 1],
    config["link_components"],
    config["link_len"],
)
lookup module-attribute¶
lookup = table(
    {
        "id": all_probs,
        "hash": array(
            [hash_data(p) for p in to_pylist()],
            type=large_binary(),
        ),
    }
)
probs_with_ccs module-attribute¶
probs_with_ccs = attach_components_to_probabilities(
    table(
        {
            "left_id": get_hashes(probs["left_id"]),
            "right_id": get_hashes(probs["right_id"]),
            "probability": probs["probability"],
        }
    )
)
hierarchy module-attribute¶
hierarchy = to_hierarchical_clusters(
    probabilities=probs_with_ccs,
    hash_func=hash_values,
    dtype=large_binary,
)
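The pipeline above hands hash_values to to_hierarchical_clusters so each parent cluster gets a deterministic hash derived from its members. The sketch below shows one plausible way such a hash could work, sorting child hashes first so the result is independent of input order. This is an assumption for illustration only; the real hash_values may differ.

```python
import hashlib

def hash_children(child_hashes: list[bytes]) -> bytes:
    """Derive a parent hash from child hashes, order-independently.

    Sorting before hashing means the same set of children always yields
    the same parent hash, regardless of the order they arrive in.
    """
    digest = hashlib.sha256()
    for child in sorted(child_hashes):
        digest.update(child)
    return digest.digest()

a = hashlib.sha256(b"record_a").digest()
b = hashlib.sha256(b"record_b").digest()
assert hash_children([a, b]) == hash_children([b, a])
```

Order-independence matters here because cluster membership is a set: the same merge discovered in a different order must produce the same cluster_hash.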
generate_tables¶
Generate tables for benchmarking the PostgreSQL backend.
Functions:
- generate_sources – Generate the sources and source_columns tables.
- generate_resolutions – Generate the resolutions table.
- generate_resolution_from – Generate the resolution_from table.
- generate_cluster_source – Generate both the Clusters and ClusterSourcePK tables for source rows.
- generate_result_tables – Generate the probabilities, contains and clusters tables.
- generate_all_tables – Make all six PostgreSQL backend tables.
- main – Command line tool for generating data.
Attributes:
- PRESETS
PRESETS module-attribute¶
PRESETS = {
"xs": {
"source_len": 10000,
"dedupe_components": 8000,
"dedupe_len": 2000,
"link_components": 6000,
"link_len": 10000,
},
"s": {
"source_len": 100000,
"dedupe_components": 80000,
"dedupe_len": 20000,
"link_components": 60000,
"link_len": 100000,
},
"m": {
"source_len": 1000000,
"dedupe_components": 800000,
"dedupe_len": 200000,
"link_components": 600000,
"link_len": 1000000,
},
"l": {
"source_len": 10000000,
"dedupe_components": 8000000,
"dedupe_len": 2000000,
"link_components": 6000000,
"link_len": 10000000,
},
"xl": {
"source_len": 100000000,
"dedupe_components": 80000000,
"dedupe_len": 20000000,
"link_components": 60000000,
"link_len": 100000000,
},
}
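Each preset tier multiplies every count in the previous tier by ten. A quick check of that relationship, with the values copied from the dictionary above:

```python
# Values copied from two adjacent PRESETS tiers above.
xs = {"source_len": 10_000, "dedupe_components": 8_000, "dedupe_len": 2_000,
      "link_components": 6_000, "link_len": 10_000}
s = {"source_len": 100_000, "dedupe_components": 80_000, "dedupe_len": 20_000,
     "link_components": 60_000, "link_len": 100_000}

# Every field scales by exactly 10x between tiers:
for key in xs:
    assert s[key] == 10 * xs[key]
```

The same 10x step holds from "s" through "xl", so each tier benchmarks one order of magnitude more data than the last.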
generate_sources¶
generate_sources(dataset_start_id: int = 1) -> tuple[Table, Table]
generate_resolutions¶
generate_resolutions(dataset_start_id: int = 1) -> Table
generate_resolution_from¶
generate_resolution_from(dataset_start_id: int = 1) -> Table
generate_cluster_source¶
generate_cluster_source(
    range_left: int,
    range_right: int,
    source_id: int,
    cluster_start_id: int = 0,
    pk_start_id: int = 0,
) -> tuple[Table, Table]
Generate both the Clusters and ClusterSourcePK tables for source rows.
Parameters:
- range_left (int) – First ID to generate.
- range_right (int) – Last ID to generate, plus one.
- source_id (int) – Source ID for the source.
- cluster_start_id (int, default: 0) – Starting ID for clusters.
- pk_start_id (int, default: 0) – Starting ID for primary keys.
Returns:
- tuple[Table, Table] – Tuple of (Clusters table, ClusterSourcePK table).
generate_result_tables¶
generate_result_tables(
    left_ids: Iterable[int],
    right_ids: Iterable[int] | None,
    resolution_id: int,
    next_id: int,
    n_components: int,
    n_probs: int,
    prob_min: float = 0.6,
    prob_max: float = 1,
) -> tuple[list[int], Table, Table, Table, int]
Generate the probabilities, contains and clusters tables.
Parameters:
- left_ids (Iterable[int]) – List of IDs for rows to dedupe, or for left rows to link.
- right_ids (Iterable[int] | None) – List of IDs for right rows to link.
- resolution_id (int) – ID of the resolution for this dedupe or link model.
- next_id (int) – The next ID to use when generating IDs.
- n_components (int) – Number of implied connected components.
- n_probs (int) – Total number of probability edges to be generated.
- prob_min (float, default: 0.6) – Minimum value for generated probabilities.
- prob_max (float, default: 1) – Maximum value for generated probabilities.
Returns:
generate_all_tables¶
generate_all_tables(
    source_len: int,
    dedupe_components: int,
    dedupe_len: int,
    link_components: int,
    link_len: int,
    cluster_start_id: int = 0,
    dataset_start_id: int = 1,
    pk_start_id: int = 0,
) -> dict[str, Table]
Make all six PostgreSQL backend tables.
It will create two sources, one deduper for each, and one linker from each deduper.
Parameters:
- source_len (int) – Length of each data source.
- dedupe_components (int) – Number of connected components implied by each deduper.
- dedupe_len (int) – Number of probabilities generated by each deduper.
- link_components (int) – Number of connected components implied by each linker.
- link_len (int) – Number of probabilities generated by each linker.
- cluster_start_id (int, default: 0) – Starting ID for clusters.
- dataset_start_id (int, default: 1) – Starting ID for dataset resolution IDs.
- pk_start_id (int, default: 0) – Starting ID for primary keys (globally unique).
Returns:
main¶
main(
    settings: str,
    output_dir: Path,
    cluster_start_id: int,
    dataset_start_id: int,
) -> None
Command line tool for generating data.
Parameters:
- settings (str) – The key of the settings dictionary to use.
- output_dir (Path) – Where to save the output files.
- cluster_start_id (int) – The first integer to use for clusters.
- dataset_start_id (int) – The first integer to use for datasets.
query¶
Benchmarking utilities for the PostgreSQL backend.
Functions:
- compile_query_sql – Compiles the SQL for query() based on a single point of truth and dataset.
- compile_match_sql – Compiles the SQL for match() based on a single point of truth and dataset.
compile_query_sql¶
compile_query_sql(
    point_of_truth: str, source_address: SourceAddress
) -> str
Compiles the SQL for query() based on a single point of truth and dataset.
Parameters:
- point_of_truth (str) – The name of the resolution to use, like "linker_1".
- source_address (SourceAddress) – The address of the source to retrieve.
Returns:
- str – A compiled PostgreSQL query, including semicolon, ready to run on Matchbox.
compile_match_sql¶
compile_match_sql(
    source_pk: str,
    source_address: SourceAddress,
    point_of_truth: str,
) -> str
Compiles the SQL for match() based on a single point of truth and dataset.
Note this only tests the query that retrieves all valid matches for the supplied key. The actual match function goes on to merge this with the user's requested target table(s).
Parameters:
- source_pk (str) – The name of the primary key of the source table.
- source_address (SourceAddress) – The address of the source to use.
- point_of_truth (str) – The name of the resolution to use, like "linker_1".
Returns:
- str – A compiled PostgreSQL query, including semicolon, ready to run on Matchbox.
db¶
Matchbox PostgreSQL database connection.
Classes:
- MatchboxPostgresCoreSettings – PostgreSQL-specific settings for Matchbox.
- MatchboxPostgresSettings – Settings for the Matchbox PostgreSQL backend.
- MatchboxDatabase – Matchbox PostgreSQL database connection.
Attributes:
- MBDB
MatchboxPostgresCoreSettings¶
PostgreSQL-specific settings for Matchbox.
MatchboxPostgresSettings¶
Bases: MatchboxServerSettings
Settings for the Matchbox PostgreSQL backend.
Inherits the core settings and adds the PostgreSQL-specific settings.
Attributes:
- backend_type (MatchboxBackends)
- postgres (MatchboxPostgresCoreSettings)
- model_config
- batch_size (int)
- datastore (MatchboxDatastoreSettings)
- api_key (SecretStr | None)
- log_level (LogLevelType)
postgres class-attribute instance-attribute¶
postgres: MatchboxPostgresCoreSettings = Field(
    default_factory=MatchboxPostgresCoreSettings
)
model_config class-attribute instance-attribute¶
model_config = SettingsConfigDict(
    env_prefix="MB__SERVER__",
    env_nested_delimiter="__",
    use_enum_values=True,
    env_file=".env",
    env_file_encoding="utf-8",
    extra="ignore",
)
MatchboxDatabase¶
MatchboxDatabase(settings: MatchboxPostgresSettings)
Matchbox PostgreSQL database connection.
Methods:
- connection_string – Get the connection string for PostgreSQL.
- get_engine – Get the database engine.
- get_session – Get a new session.
- get_adbc_connection – Get a new ADBC connection.
- run_migrations – Create the database and all tables expected in the schema.
- clear_database – Delete all rows in every table in the database schema.
- drop_database – Drop all tables in the database schema and re-create them.
- verify_schema – Verify that the live database schema is in sync with the ORM.
Attributes:
- MatchboxBase
connection_string¶
Get the connection string for PostgreSQL.
get_adbc_connection¶
Get a new ADBC connection.
The connection must be used within a context manager.
verify_schema¶
Verify that the live database schema is in sync with the ORM.
If any differences are detected, log this as an error.
NOTE: this was originally implemented prior to Alembic. In principle, Alembic is best placed to manage any such diff; this remains for now only as an informative aid, and could be removed.
mixin¶
A module for defining mixins for the PostgreSQL backend ORM.
Classes:
- CountMixin – A mixin for counting the number of rows in a table.
Attributes:
- T
orm¶
ORM classes for the Matchbox PostgreSQL database.
Classes:
- ResolutionFrom – Resolution lineage closure table with cached truth values.
- Resolutions – Table of resolution points: models, datasets and humans.
- SourceColumns – Table for storing column details for Sources.
- ClusterSourcePK – Table for storing source primary keys for clusters.
- Sources – Table of sources of data for Matchbox.
- Contains – Cluster lineage table.
- Clusters – Table of indexed data and the clusters that match it.
- Probabilities – Table of probabilities that a cluster is correct, according to a resolution.
ResolutionFrom¶
Bases: CountMixin, MatchboxBase
Resolution lineage closure table with cached truth values.
Methods:
- count – Counts the number of rows in the table.
Attributes:
- __tablename__
- parent
- child
- level
- truth_cache
- __table_args__
parent class-attribute instance-attribute¶
parent = Column(
    BIGINT,
    ForeignKey(
        "resolutions.resolution_id", ondelete="CASCADE"
    ),
    primary_key=True,
)
child class-attribute instance-attribute¶
child = Column(
    BIGINT,
    ForeignKey(
        "resolutions.resolution_id", ondelete="CASCADE"
    ),
    primary_key=True,
)
__table_args__ class-attribute instance-attribute¶
__table_args__ = (
    CheckConstraint(
        "parent != child", name="no_self_reference"
    ),
    CheckConstraint("level > 0", name="positive_level"),
)
Resolutions¶
Bases: CountMixin, MatchboxBase
Table of resolution points: models, datasets and humans.
Resolutions produce probabilities or own data in the clusters table.
Methods:
- get_lineage – Returns all ancestors and their cached truth values from this model.
- get_lineage_to_dataset – Returns the resolution lineage and cached truth values to a dataset.
- next_id – Returns the next available resolution_id.
- count – Counts the number of rows in the table.
Attributes:
- __tablename__
- resolution_id
- resolution_hash
- type
- name
- description
- truth
- source
- probabilities
- children
- __table_args__
- ancestors (set[Resolutions]) – Returns all ancestors (parents, grandparents, etc.) of this resolution.
- descendants (set[Resolutions]) – Returns descendants (children, grandchildren, etc.) of this resolution.
resolution_hash class-attribute instance-attribute¶
source class-attribute instance-attribute¶
probabilities class-attribute instance-attribute¶
probabilities = relationship(
    "Probabilities",
    back_populates="proposed_by",
    cascade="all, delete-orphan",
    passive_deletes=True,
)
children class-attribute instance-attribute¶
children = relationship(
    "Resolutions",
    secondary=__table__,
    primaryjoin="Resolutions.resolution_id == ResolutionFrom.parent",
    secondaryjoin="Resolutions.resolution_id == ResolutionFrom.child",
    backref="parents",
)
__table_args__ class-attribute instance-attribute¶
__table_args__ = (
    CheckConstraint(
        "type IN ('model', 'dataset', 'human')",
        name="resolution_type_constraints",
    ),
    UniqueConstraint(
        "resolution_hash", name="resolutions_hash_key"
    ),
    UniqueConstraint("name", name="resolutions_name_key"),
)
ancestors property¶
ancestors: set[Resolutions]
Returns all ancestors (parents, grandparents, etc.) of this resolution.
descendants property¶
descendants: set[Resolutions]
Returns descendants (children, grandchildren, etc.) of this resolution.
get_lineage¶
Returns all ancestors and their cached truth values from this model.
get_lineage_to_dataset¶
Returns the resolution lineage and cached truth values to a dataset.
SourceColumns¶
Bases: CountMixin, MatchboxBase
Table for storing column details for Sources.
Methods:
- count – Counts the number of rows in the table.
Attributes:
- __tablename__
- column_id
- source_id
- column_index
- column_name
- column_type
- source
- __table_args__
ClusterSourcePK¶
Bases: CountMixin, MatchboxBase
Table for storing source primary keys for clusters.
Attributes:
- __tablename__
- pk_id
- cluster_id
- source_id
- source_pk
- cluster
- source
- __table_args__
cluster_id class-attribute instance-attribute¶
cluster_id = Column(
    BIGINT,
    ForeignKey("clusters.cluster_id", ondelete="CASCADE"),
    nullable=False,
)
source_id class-attribute instance-attribute¶
cluster class-attribute instance-attribute¶
source class-attribute instance-attribute¶
__table_args__ class-attribute instance-attribute¶
__table_args__ = (
    Index("ix_cluster_source_pks_cluster_id", "cluster_id"),
    Index("ix_cluster_source_pks_source_pk", "source_pk"),
    UniqueConstraint(
        "pk_id", "source_id", name="unique_pk_source"
    ),
)
Sources¶
Bases: CountMixin, MatchboxBase
Table of sources of data for Matchbox.
Methods:
- list_all – Returns all sources in the database.
- to_common_source – Convert an ORM source to a matchbox.common Source object.
- count – Counts the number of rows in the table.
Attributes:
- __tablename__
- source_id
- resolution_id
- resolution_name
- full_name
- warehouse_hash
- db_pk
- dataset_resolution
- columns
- cluster_source_pks
- clusters
- __table_args__
source_id class-attribute instance-attribute¶
resolution_id class-attribute instance-attribute¶
dataset_resolution class-attribute instance-attribute¶
columns class-attribute instance-attribute¶
columns = relationship(
    "SourceColumns",
    back_populates="source",
    cascade="all, delete-orphan",
    passive_deletes=True,
)
cluster_source_pks class-attribute instance-attribute¶
cluster_source_pks = relationship(
    "ClusterSourcePK",
    back_populates="source",
    cascade="all, delete-orphan",
    passive_deletes=True,
)
clusters class-attribute instance-attribute¶
clusters = relationship(
    "Clusters",
    secondary=__table__,
    primaryjoin="Sources.source_id == ClusterSourcePK.source_id",
    secondaryjoin="ClusterSourcePK.cluster_id == Clusters.cluster_id",
    viewonly=True,
)
__table_args__ class-attribute instance-attribute¶
__table_args__ = (
    UniqueConstraint(
        "full_name",
        "warehouse_hash",
        name="unique_source_address",
    ),
)
to_common_source¶
Convert an ORM source to a matchbox.common Source object.
Contains
¶
Bases: CountMixin
, MatchboxBase
Cluster lineage table.
Methods:
-
count
–Counts the number of rows in the table.
Attributes:
-
__tablename__
– -
parent
– -
child
– -
__table_args__
–
parent
class-attribute
instance-attribute
¶
child
class-attribute
instance-attribute
¶
__table_args__
class-attribute
instance-attribute
¶
__table_args__ = (
CheckConstraint(
"parent != child", name="no_self_containment"
),
Index("ix_contains_parent_child", "parent", "child"),
Index("ix_contains_child_parent", "child", "parent"),
)
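Contains stores the cluster hierarchy as parent → child edges, so resolving a full lineage requires walking the edges recursively (the backend does this with a recursive SQL query). A toy dict-based sketch of that walk, including a check mirroring the no_self_containment constraint:

```python
# Toy illustration of resolving cluster lineage from an adjacency list
# like the Contains table (parent -> child edges). The real backend uses
# a recursive query; this in-memory walk is only a sketch.

def descendants(edges: dict[int, set[int]], parent: int) -> set[int]:
    """Collect every cluster reachable from `parent` via Contains edges."""
    found: set[int] = set()
    stack = [parent]
    while stack:
        node = stack.pop()
        for child in edges.get(node, set()):
            if child == node:
                # Mirrors the no_self_containment CHECK constraint.
                raise ValueError("no_self_containment violated")
            if child not in found:
                found.add(child)
                stack.append(child)
    return found

# Cluster 1 contains clusters 2 and 3; cluster 2 contains leaves 4 and 5.
edges = {1: {2, 3}, 2: {4, 5}}
assert descendants(edges, 1) == {2, 3, 4, 5}
```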
Clusters
¶
Bases: CountMixin, MatchboxBase
Table of indexed data and clusters that match it.
Methods:
- count – Counts the number of rows in the table.
Attributes:
- __tablename__
- cluster_id
- cluster_hash
- source_pks
- probabilities
- children
- sources
- __table_args__
source_pks
class-attribute
instance-attribute
¶
source_pks = relationship(
"ClusterSourcePK",
back_populates="cluster",
cascade="all, delete-orphan",
passive_deletes=True,
)
probabilities
class-attribute
instance-attribute
¶
probabilities = relationship(
"Probabilities",
back_populates="proposes",
cascade="all, delete-orphan",
passive_deletes=True,
)
children
class-attribute
instance-attribute
¶
children = relationship(
"Clusters",
secondary=__table__,
primaryjoin="Clusters.cluster_id == Contains.parent",
secondaryjoin="Clusters.cluster_id == Contains.child",
backref="parents",
)
sources
class-attribute
instance-attribute
¶
sources = relationship(
"Sources",
secondary=__table__,
primaryjoin="Clusters.cluster_id == ClusterSourcePK.cluster_id",
secondaryjoin="ClusterSourcePK.source_id == Sources.source_id",
viewonly=True,
)
__table_args__
class-attribute
instance-attribute
¶
Probabilities
¶
Bases: CountMixin, MatchboxBase
Table of probabilities that a cluster is correct, according to a resolution.
Methods:
- count – Counts the number of rows in the table.
Attributes:
- __tablename__
- resolution
- cluster
- probability
- proposed_by
- proposes
- __table_args__
resolution
class-attribute
instance-attribute
¶
resolution = Column(
BIGINT,
ForeignKey(
"resolutions.resolution_id", ondelete="CASCADE"
),
primary_key=True,
)
cluster
class-attribute
instance-attribute
¶
cluster = Column(
BIGINT,
ForeignKey("clusters.cluster_id", ondelete="CASCADE"),
primary_key=True,
)
proposed_by
class-attribute
instance-attribute
¶
proposes
class-attribute
instance-attribute
¶
__table_args__
class-attribute
instance-attribute
¶
__table_args__ = (
CheckConstraint(
"probability BETWEEN 0 AND 100",
name="valid_probability",
),
Index("ix_probabilities_resolution", "resolution"),
)
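Probabilities are smallints in [0, 100], enforced by the valid_probability CHECK constraint, and a resolution accepts a cluster when its probability meets the resolution's truth threshold. A toy filter illustrating that acceptance rule (illustrative only, not the backend's query):

```python
# Probabilities are stored as smallints in [0, 100] (the valid_probability
# CHECK constraint). A resolution accepts a cluster when its probability
# meets the resolution's truth threshold. Toy filter only.

def accepted_clusters(probabilities: dict[int, int], truth: int) -> set[int]:
    for p in probabilities.values():
        if not 0 <= p <= 100:
            # Mirrors the valid_probability CHECK constraint.
            raise ValueError("valid_probability violated")
    return {cluster for cluster, p in probabilities.items() if p >= truth}

# Clusters 10 and 11 meet a truth threshold of 80; cluster 12 does not.
assert accepted_clusters({10: 95, 11: 80, 12: 60}, truth=80) == {10, 11}
```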
utils
¶
Utilities for using the PostgreSQL backend.
Modules:
- db – General utilities for the PostgreSQL backend.
- insert – Utilities for inserting data into the PostgreSQL backend.
- query – Utilities for querying and matching in the PostgreSQL backend.
- results – Utilities for querying model results from the PostgreSQL backend.
db
¶
General utilities for the PostgreSQL backend.
Functions:
- resolve_model_name – Resolves a model name to a Resolution object.
- get_resolution_graph – Retrieves the resolution graph.
- dump – Dumps the entire database to a snapshot.
- restore – Restores the database from a snapshot.
- sqa_profiled – SQLAlchemy profiler.
- compile_sql – Compiles a SQLAlchemy statement into a string.
- large_ingest – Append a PyArrow table to a PostgreSQL table using ADBC.
resolve_model_name
¶
resolve_model_name(
model: str, engine: Engine
) -> Resolutions
Resolves a model name to a Resolution object.
Parameters:
- model (str)
- engine (Engine)
Raises:
- MatchboxResolutionNotFoundError – If the model doesn’t exist.
get_resolution_graph
¶
get_resolution_graph(engine: Engine) -> ResolutionGraph
Retrieves the resolution graph.
dump
¶
dump(engine: Engine) -> MatchboxSnapshot
Dumps the entire database to a snapshot.
Parameters:
- engine (Engine) – The database engine.
Returns:
- MatchboxSnapshot – A MatchboxSnapshot object of type “postgres” with the database’s current state.
restore
¶
restore(
engine: Engine,
snapshot: MatchboxSnapshot,
batch_size: int,
) -> None
Restores the database from a snapshot.
Parameters:
- engine (Engine) – The database engine.
- snapshot (MatchboxSnapshot) – A MatchboxSnapshot object of type “postgres” with the database’s state.
- batch_size (int) – The number of records to insert in each batch.
Raises:
- ValueError – If the snapshot is missing data.
sqa_profiled
¶
SQLAlchemy profiler.
Taken directly from their docs: https://docs.sqlalchemy.org/en/20/faq/performance.html#query-profiling
compile_sql
¶
Compiles a SQLAlchemy statement into a string.
large_ingest
¶
large_ingest(
data: Table,
table_class: DeclarativeMeta,
max_chunksize: int | None = None,
upsert_keys: list[str] | None = None,
update_columns: list[str] | None = None,
)
Append a PyArrow table to a PostgreSQL table using ADBC.
It will either copy directly (and error if primary key constraints are violated), or it can be run in upsert mode by using a staging table, which is slower.
Parameters:
- data (Table) – A PyArrow table to write.
- table_class (DeclarativeMeta) – The SQLAlchemy ORM class for the table to write to.
- max_chunksize (int | None, default: None) – Size of data chunks to be read and copied.
- upsert_keys (list[str] | None, default: None) – Columns used as keys for “on conflict do update”. If passed, it will run ingest in slower upsert mode. If not passed and update_columns is passed, defaults to primary keys.
- update_columns (list[str] | None, default: None) – Columns to update when upserting. If passed, it will run ingest in slower upsert mode. If not passed and upsert_keys is passed, defaults to all other columns.
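The copy-versus-upsert split can be illustrated with an in-memory toy (a hypothetical helper, not the real ADBC path): a plain copy errors on a primary-key conflict, while upsert mode updates the chosen columns for conflicting keys.

```python
# Toy sketch of large_ingest's two modes. `table` stands in for a database
# table keyed by primary key "id"; the real function streams a PyArrow
# table over ADBC and uses a staging table for upserts.

def toy_ingest(table, rows, upsert_keys=None, update_columns=None):
    upsert = upsert_keys is not None or update_columns is not None
    for row in rows:
        key = row["id"]
        if key in table and not upsert:
            # Plain copy mode: conflicts are errors.
            raise ValueError(f"primary key conflict: {key}")
        if key in table and upsert:
            # Upsert mode: update only the chosen columns (all non-key
            # columns if update_columns is not given).
            cols = update_columns or [c for c in row if c != "id"]
            table[key].update({c: row[c] for c in cols})
        else:
            table[key] = dict(row)

table = {1: {"id": 1, "name": "a"}}
toy_ingest(table, [{"id": 1, "name": "b"}], upsert_keys=["id"])
assert table[1]["name"] == "b"
```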
insert
¶
Utilities for inserting data into the PostgreSQL backend.
Classes:
- HashIDMap – An object to help map between IDs and hashes.
Functions:
- insert_dataset – Indexes a dataset from your data warehouse within Matchbox.
- insert_model – Writes a model to Matchbox with a default truth value of 100.
- insert_results – Writes a results table to Matchbox.
HashIDMap
¶
An object to help map between IDs and hashes.
When given a set of IDs, it returns their hashes, erroring if any ID has no hash.
When given a set of hashes, it returns their IDs, creating new IDs for any unknown hashes and returning them as part of the set.
Parameters:
- start (int) – The first integer to use for new IDs.
- lookup (optional, default: None) – A lookup table to use for existing hashes.
Methods:
- get_hashes – Returns the hashes of the given IDs.
- get_ids – Returns the IDs of the given hashes, assigning new IDs for unknown hashes.
insert_dataset
¶
Indexes a dataset from your data warehouse within Matchbox.
insert_model
¶
insert_model(
model: str,
left: Resolutions,
right: Resolutions,
description: str,
engine: Engine,
) -> None
Writes a model to Matchbox with a default truth value of 100.
Parameters:
- model (str) – Name of the new model.
- left (Resolutions) – Left parent of the model.
- right (Resolutions) – Right parent of the model. Same as left in a dedupe job.
- description (str) – Model description.
- engine (Engine) – SQLAlchemy engine instance.
Raises:
- MatchboxResolutionNotFoundError – If the specified parent models don’t exist.
- MatchboxResolutionNotFoundError – If the specified model doesn’t exist.
insert_results
¶
insert_results(
resolution: Resolutions,
engine: Engine,
results: Table,
batch_size: int,
) -> None
Writes a results table to Matchbox.
The PostgreSQL backend stores clusters in a hierarchical structure, where each component references its parent component at a higher threshold.
This means two-item components are synonymous with their original pairwise probabilities.
This allows easy querying of clusters at any threshold.
Parameters:
- resolution (Resolutions) – Resolution of type model to associate results with.
- engine (Engine) – SQLAlchemy engine instance.
- results (Table) – A PyArrow results table with left_id, right_id, probability.
- batch_size (int) – Number of records to insert in each batch.
Raises:
- MatchboxResolutionNotFoundError – If the specified model doesn’t exist.
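The hierarchical storage described above is what makes querying at any threshold cheap: each cluster carries the probability at which it forms and points to its parent at a higher threshold, so a query keeps the top-most cluster that still meets the threshold. A toy sketch (plain dicts, not the backend's tables):

```python
# Toy sketch of threshold queries over hierarchical clusters: `probability`
# maps each cluster to the probability at which it forms, and `parent_of`
# maps a cluster to the cluster that merges it at a lower threshold.

def clusters_at(probability: dict[int, int],
                parent_of: dict[int, int],
                threshold: int) -> set[int]:
    valid = {c for c, p in probability.items() if p >= threshold}
    # Keep only clusters with no valid parent above them.
    return {c for c in valid if parent_of.get(c) not in valid}

# Cluster 3 merges clusters 1 and 2 at probability 70.
probability = {1: 90, 2: 85, 3: 70}
parent_of = {1: 3, 2: 3}
assert clusters_at(probability, parent_of, threshold=80) == {1, 2}
assert clusters_at(probability, parent_of, threshold=60) == {3}
```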
query
¶
Utilities for querying and matching in the PostgreSQL backend.
Functions:
- query – Queries Matchbox and the Source warehouse to retrieve linked data.
- match – Matches an ID in a source dataset and returns the keys in the targets.
Attributes:
- T
query
¶
query(
source_address: SourceAddress,
resolution_name: str | None = None,
threshold: int | None = None,
limit: int | None = None,
) -> Table
Queries Matchbox and the Source warehouse to retrieve linked data.
Takes the dictionaries of tables and fields output by selectors and queries the database for them. If a “point of truth” resolution is supplied, attaches the clusters this data belongs to.
To accomplish this, the function:
- Iterates through each selector, and
    - Retrieves its data in Matchbox according to the optional point of truth, including its hash and cluster hash
    - Retrieves its raw data from its Source’s warehouse
    - Joins the two together
- Unions the results, one row per item of data in the warehouses
Returns:
- Table – A table containing the requested data from each table, unioned together, with the hash key of each row in Matchbox.
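The join-then-union shape of those steps can be sketched with plain dicts standing in for warehouse tables and the Matchbox lookup (a toy, not the real implementation, which returns a PyArrow Table):

```python
# Toy version of query's steps: per selector, attach each row's cluster ID
# from Matchbox, then union the annotated rows across sources.

def toy_query(warehouse, cluster_of):
    """warehouse: source -> {pk -> fields}; cluster_of: (source, pk) -> ID."""
    out = []
    for source, rows in warehouse.items():
        for pk, fields in rows.items():
            # Join the warehouse row to its Matchbox cluster, then union.
            out.append({"source": source, "id": cluster_of[(source, pk)], **fields})
    return out

warehouse = {"crm": {"a1": {"name": "Acme"}}, "erp": {"b9": {"name": "ACME Ltd"}}}
cluster_of = {("crm", "a1"): 7, ("erp", "b9"): 7}
rows = toy_query(warehouse, cluster_of)
assert len(rows) == 2 and all(r["id"] == 7 for r in rows)
```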
match
¶
match(
engine: Engine,
source_pk: str,
source: SourceAddress,
targets: list[SourceAddress],
resolution_name: str,
threshold: int | None = None,
) -> list[Match]
Matches an ID in a source dataset and returns the keys in the targets.
To accomplish this, the function:
- Reconstructs the resolution lineage from the specified resolution
- Iterates through each target, and
    - Retrieves its cluster hash according to the resolution
    - Retrieves all other IDs in the cluster in the source dataset
    - Retrieves all other IDs in the cluster in the target dataset
- Returns the results as Match objects, one per target
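The core of those steps is a cluster membership lookup, which can be sketched with dicts (a toy standing in for the resolution lineage and Match objects):

```python
# Toy version of match: find the cluster containing the source primary key,
# then collect the keys that cluster has in each target dataset.

def toy_match(source_pk, source, targets, membership):
    """membership: (dataset, pk) -> cluster ID."""
    cluster = membership[(source, source_pk)]
    return {
        target: {pk for (ds, pk), c in membership.items()
                 if ds == target and c == cluster}
        for target in targets
    }

# "a1" in crm and "b9" in erp share cluster 7; "b2" sits in cluster 8.
membership = {("crm", "a1"): 7, ("erp", "b9"): 7, ("erp", "b2"): 8}
assert toy_match("a1", "crm", ["erp"], membership) == {"erp": {"b9"}}
```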
results
¶
Utilities for querying model results from the PostgreSQL backend.
Classes:
- SourceInfo – Information about a model’s sources.
Functions:
- get_model_metadata – Get metadata for a model resolution.
- get_model_results – Recover the model’s pairwise probabilities and return as a PyArrow table.
SourceInfo
¶
Bases: NamedTuple
Information about a model’s sources.
Attributes:
- left (int)
- right (int | None)
- left_ancestors (set[int])
- right_ancestors (set[int] | None)
get_model_metadata
¶
get_model_metadata(
engine: Engine, resolution: Resolutions
) -> ModelMetadata
Get metadata for a model resolution.
get_model_results
¶
get_model_results(resolution: Resolutions) -> Table
Recover the model’s pairwise probabilities and return as a PyArrow table.
For each probability this model assigned:
- Get its two immediate children
- Filter for children that aren’t parents of other clusters this model scored
- Determine left/right by tracing ancestry to source resolutions using query helpers
Parameters:
- resolution (Resolutions) – Resolution of type model to query.
Returns:
- Table – Table containing the original pairwise probabilities.