PostgreSQL

matchbox.server.postgresql

PostgreSQL adapter for Matchbox server.
Modules:

- adapter – PostgreSQL adapter for Matchbox server.
- benchmark – Benchmarking utilities for Matchbox's PostgreSQL backend.
- db – Matchbox PostgreSQL database connection.
- mixin – A module for defining mixins for the PostgreSQL backend ORM.
- orm – ORM classes for the Matchbox PostgreSQL database.
- utils – Utilities for using the PostgreSQL backend.

Classes:

- MatchboxPostgres – A PostgreSQL adapter for Matchbox.
- MatchboxPostgresSettings – Settings for the Matchbox PostgreSQL backend.
MatchboxPostgres

MatchboxPostgres(settings: MatchboxPostgresSettings)

Bases: MatchboxDBAdapter

A PostgreSQL adapter for Matchbox.
Methods:

- query – Queries the database from an optional point of truth.
- match – Matches an ID in a source dataset and returns the keys in the targets.
- index – Indexes a source dataset from your warehouse in Matchbox.
- get_source – Get a source from its name address.
- validate_ids – Validates that a list of IDs exists in the database.
- validate_hashes – Validates that a list of hashes exists in the database.
- cluster_id_to_hash – Get a lookup of Cluster hashes from a list of IDs.
- get_resolution_graph – Get the full resolution graph.
- dump – Dumps the entire database to a snapshot.
- clear – Clears all data from the database.
- restore – Restores the database from a snapshot.
- insert_model – Writes a model to Matchbox.
- get_model – Get a model from the database.
- set_model_results – Set the results for a model.
- get_model_results – Get the results for a model.
- set_model_truth – Sets the truth threshold for this model, changing the default clusters.
- get_model_truth – Gets the current truth threshold for this model.
- get_model_ancestors – Gets the current truth values of all ancestors.
- set_model_ancestors_cache – Updates the cached ancestor thresholds.
- get_model_ancestors_cache – Gets the cached ancestor thresholds, converting hashes to model names.
- delete_model – Delete a model from the database.
Attributes:

models (instance attribute):
models = FilteredResolutions(datasets=False, humans=False, models=True)

source_resolutions (instance attribute):
source_resolutions = FilteredResolutions(datasets=True, humans=False, models=False)
query

query(
    source_address: SourceAddress,
    resolution_name: str | None = None,
    threshold: int | None = None,
    limit: int | None = None,
) -> Table

Queries the database from an optional point of truth.

Parameters:

- source_address (SourceAddress) – The SourceAddress object identifying the source to query.
- resolution_name (str, optional; default: None) – The resolution to use for filtering results. If not specified, uses the dataset resolution for the queried source.
- threshold (int, optional; default: None) – The threshold to use for creating clusters. If None, uses the model's default threshold. If an integer, uses that threshold for the specified model, and the model's cached thresholds for its ancestors.
- limit (int, optional; default: None) – The number to use in a limit clause. Useful for testing.

Returns:

- Table – The resulting matchbox IDs in Arrow format.
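A minimal usage sketch follows, assuming a backend instance and a SourceAddress have already been constructed; the resolution name and threshold values are illustrative.

def fetch_matchbox_ids(backend: "MatchboxPostgres", source_address: "SourceAddress"):
    """Return matchbox IDs for one source as a PyArrow Table."""
    return backend.query(
        source_address=source_address,
        resolution_name="linker_1",  # hypothetical resolution name
        threshold=80,                # optional override of the model's threshold
        limit=100,                   # useful for testing
    )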
match

match(
    source_pk: str,
    source: SourceAddress,
    targets: list[SourceAddress],
    resolution_name: str,
    threshold: int | None = None,
) -> list[Match]

Matches an ID in a source dataset and returns the keys in the targets.

Parameters:

- source_pk (str) – The primary key to match from the source.
- source (SourceAddress) – The address of the source dataset.
- targets (list[SourceAddress]) – The addresses of the target datasets.
- resolution_name (str) – The name of the resolution to use for matching.
- threshold (int, optional; default: None) – The threshold to use for creating clusters. If None, uses the resolution's default threshold. If an integer, uses that threshold for the specified resolution and the resolution's cached thresholds for its ancestors; these values are used instead of the cached thresholds.
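A sketch of how the pieces fit together; the primary key value, resolution name, and the pre-built SourceAddress objects are assumptions.

def match_key(
    backend: "MatchboxPostgres",
    source: "SourceAddress",
    targets: list["SourceAddress"],
) -> list["Match"]:
    """Find the keys in each target that match one source primary key."""
    return backend.match(
        source_pk="company-0001",    # hypothetical primary key value
        source=source,
        targets=targets,
        resolution_name="linker_1",  # hypothetical resolution name
        threshold=None,              # fall back to the cached thresholds
    )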
index

index(source: Source, data_hashes: Table) -> None

Indexes a source dataset from your warehouse in Matchbox.

get_source

get_source(address: SourceAddress) -> Source

Get a source from its name address.

Parameters:

- address (SourceAddress) – The name address for the source.

Returns:

- Source – A Source object.
validate_ids

validate_hashes

cluster_id_to_hash

dump

dump() -> MatchboxSnapshot

Dumps the entire database to a snapshot.

Returns:

- MatchboxSnapshot – A MatchboxSnapshot object of type "postgres" with the database's current state.

clear
restore

restore(
    snapshot: MatchboxSnapshot, clear: bool = False
) -> None

Restores the database from a snapshot.

Parameters:

- snapshot (MatchboxSnapshot) – A MatchboxSnapshot object of type "postgres" with the database's state.
- clear (bool) – Whether to clear the database before restoration.

Raises:

- TypeError – If the snapshot is not compatible with PostgreSQL.
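For instance, a snapshot round trip might look like this sketch:

def snapshot_roundtrip(backend: "MatchboxPostgres") -> None:
    """Take a full snapshot, then restore it over a cleared database."""
    snapshot = backend.dump()  # MatchboxSnapshot of type "postgres"
    backend.restore(snapshot=snapshot, clear=True)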
insert_model

insert_model(model: ModelMetadata) -> None

Writes a model to Matchbox.

Parameters:

- model (ModelMetadata) – ModelMetadata object with the model's metadata.

Raises:

- MatchboxDataNotFound – If, for a linker, the source models weren't found in the database.

set_model_results

set_model_results(model: str, results: Table) -> None

Set the results for a model.

set_model_truth

Sets the truth threshold for this model, changing the default clusters.

get_model_truth

Gets the current truth threshold for this model.

get_model_ancestors

Gets the current truth values of all ancestors.

Returns a list of ModelAncestor objects mapping model names to their current truth thresholds.

Unlike ancestors_cache, which returns cached values, this property returns the current truth values of all ancestor models.

set_model_ancestors_cache

set_model_ancestors_cache(
    model: str, ancestors_cache: list[ModelAncestor]
) -> None

get_model_ancestors_cache

Gets the cached ancestor thresholds, converting hashes to model names.

Returns a list of ModelAncestor objects mapping model names to their cached truth thresholds.

This is required because each point of truth needs to be stable, so we choose when to update it, caching the ancestor's values in the model itself.
MatchboxPostgresSettings

Bases: MatchboxSettings

Settings for the Matchbox PostgreSQL backend.

Inherits the core settings and adds the PostgreSQL-specific settings.

Attributes:

- model_config
- batch_size (int)
- datastore (MatchboxDatastoreSettings)
- backend_type (MatchboxBackends)
- postgres (MatchboxPostgresCoreSettings)

model_config (class attribute):
model_config = SettingsConfigDict(
    env_prefix="MB__",
    env_nested_delimiter="__",
    use_enum_values=True,
    env_file=".env",
    env_file_encoding="utf-8",
    extra="ignore",
)

postgres (class attribute):
postgres: MatchboxPostgresCoreSettings = Field(
    default_factory=MatchboxPostgresCoreSettings
)
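Because of the MB__ prefix and the __ nested delimiter above, settings can be supplied through the environment or a .env file. The nested PostgreSQL field names in this sketch are illustrative assumptions, not confirmed names.

import os

os.environ["MB__BATCH_SIZE"] = "250000"         # maps to the batch_size attribute
os.environ["MB__POSTGRES__HOST"] = "localhost"  # hypothetical nested field name
os.environ["MB__POSTGRES__PORT"] = "5432"       # hypothetical nested field name

from matchbox.server.postgresql import MatchboxPostgresSettings

settings = MatchboxPostgresSettings()  # values picked up from MB__* variables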
adapter

PostgreSQL adapter for Matchbox server.

Classes:

- FilteredClusters – Wrapper class for filtered cluster queries.
- FilteredProbabilities – Wrapper class for filtered probability queries.
- FilteredResolutions – Wrapper class for filtered resolution queries.
- MatchboxPostgres – A PostgreSQL adapter for Matchbox.

FilteredClusters

Bases: BaseModel

Wrapper class for filtered cluster queries.

Methods:

- count – Counts the number of clusters in the database.

Attributes:

- has_dataset (bool | None)

FilteredProbabilities

Bases: BaseModel

Wrapper class for filtered probability queries.

Methods:

- count – Counts the number of probabilities in the database.

Attributes:

- over_truth (bool)

FilteredResolutions

Bases: BaseModel

Wrapper class for filtered resolution queries.

Methods:

- count – Counts the number of resolutions in the database.
MatchboxPostgres

MatchboxPostgres(settings: MatchboxPostgresSettings)

Bases: MatchboxDBAdapter

A PostgreSQL adapter for Matchbox. Its methods and attributes are identical to the MatchboxPostgres entry documented in full at the top of this page.
benchmark

Benchmarking utilities for Matchbox's PostgreSQL backend.

Modules:

- cluster_pipeline – Script to benchmark the full dummy data generation pipeline.
- generate_tables – Generate tables for benchmarking the PostgreSQL backend.
- query – Benchmarking utilities for the PostgreSQL backend.

cluster_pipeline

Script to benchmark the full dummy data generation pipeline.

Functions:

- timer – Context manager to time a block of code.

Attributes:

- ROOT
- config
- left_ids
- right_ids
- probs
- all_probs
- lookup
- hm
- probs_with_ccs
- hierarchy
right_ids (module attribute)

probs (module attribute):
probs = generate_dummy_probabilities(
    left_ids,
    right_ids,
    [0.6, 1],
    config["link_components"],
    config["link_len"],
)

lookup (module attribute):
lookup = table(
    {
        "id": all_probs,
        "hash": array(
            [hash_data(p) for p in to_pylist()],
            type=large_binary(),
        ),
    }
)

probs_with_ccs (module attribute):
probs_with_ccs = attach_components_to_probabilities(
    table(
        {
            "left_id": get_hashes(probs["left_id"]),
            "right_id": get_hashes(probs["right_id"]),
            "probability": probs["probability"],
        }
    )
)

hierarchy (module attribute):
hierarchy = to_hierarchical_clusters(
    probabilities=probs_with_ccs,
    hash_func=hash_values,
    dtype=large_binary,
)
generate_tables

Generate tables for benchmarking the PostgreSQL backend.

Functions:

- generate_sources – Generate sources table.
- generate_resolutions – Generate resolutions table.
- generate_resolution_from – Generate resolution_from table.
- generate_cluster_source – Generate cluster table containing rows for source rows.
- generate_result_tables – Generate probabilities, contains and clusters tables.
- generate_all_tables – Make all six PostgreSQL backend tables.
- main – Command line tool for generating data.

Attributes:

- PRESETS

PRESETS (module attribute):
PRESETS = {
"xs": {
"source_len": 10000,
"dedupe_components": 8000,
"dedupe_len": 2000,
"link_components": 6000,
"link_len": 10000,
},
"s": {
"source_len": 100000,
"dedupe_components": 80000,
"dedupe_len": 20000,
"link_components": 60000,
"link_len": 100000,
},
"m": {
"source_len": 1000000,
"dedupe_components": 800000,
"dedupe_len": 200000,
"link_components": 600000,
"link_len": 1000000,
},
"l": {
"source_len": 10000000,
"dedupe_components": 8000000,
"dedupe_len": 2000000,
"link_components": 6000000,
"link_len": 10000000,
},
"xl": {
"source_len": 100000000,
"dedupe_components": 80000000,
"dedupe_len": 20000000,
"link_components": 60000000,
"link_len": 100000000,
},
}
generate_sources

generate_sources(dataset_start_id: int = 1) -> Table

generate_resolutions

generate_resolutions(dataset_start_id: int = 1) -> Table

generate_resolution_from

generate_resolution_from(
    dataset_start_id: int = 1,
) -> Table

generate_cluster_source

generate_cluster_source(
    range_left: int,
    range_right: int,
    resolution_source: int,
    cluster_start_id: int = 0,
) -> Table

Generate cluster table containing rows for source rows.

Parameters:

- range_left (int) – First ID to generate.
- range_right (int) – Last ID to generate, plus one.
- resolution_source (int) – Resolution ID for the source.
- cluster_start_id (int, default: 0) – Starting ID for clusters.

Returns: PyArrow cluster table.
generate_result_tables

generate_result_tables(
    left_ids: Iterable[int],
    right_ids: Iterable[int] | None,
    resolution_id: int,
    next_id: int,
    n_components: int,
    n_probs: int,
    prob_min: float = 0.6,
    prob_max: float = 1,
) -> tuple[list[int], Table, Table, Table, int]

Generate probabilities, contains and clusters tables.

Parameters:

- left_ids (Iterable[int]) – List of IDs for rows to dedupe, or for left rows to link.
- right_ids (Iterable[int] | None) – List of IDs for right rows to link.
- resolution_id (int) – ID of the resolution for this dedupe or link model.
- next_id (int) – The next ID to use when generating IDs.
- n_components (int) – Number of implied connected components.
- n_probs (int) – Total number of probability edges to be generated.
- prob_min (float, default: 0.6) – Minimum value for probabilities to be generated.
- prob_max (float, default: 1) – Maximum value for probabilities to be generated.

Returns:

- tuple[list[int], Table, Table, Table, int]
generate_all_tables

generate_all_tables(
    source_len: int,
    dedupe_components: int,
    dedupe_len: int,
    link_components: int,
    link_len: int,
    cluster_start_id: int = 0,
    dataset_start_id: int = 1,
) -> dict[str, Table]

Make all six PostgreSQL backend tables.

It will create two sources, one deduper for each, and one linker from each deduper.

Parameters:

- source_len (int) – Length of each data source.
- dedupe_components (int) – Number of connected components implied by each deduper.
- dedupe_len (int) – Number of probabilities generated by each deduper.
- link_components (int) – Number of connected components implied by each linker.
- link_len (int) – Number of probabilities generated by each linker.
- cluster_start_id (int, default: 0) – Starting ID for clusters.
- dataset_start_id (int, default: 1) – Starting ID for dataset resolution IDs.

Returns:

- dict[str, Table] – A dictionary of PyArrow tables keyed by table name.
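A sketch of generating a small benchmark dataset with the "xs" preset; the import path is assumed from the module layout documented here.

from matchbox.server.postgresql.benchmark.generate_tables import (  # path assumed
    PRESETS,
    generate_all_tables,
)

tables = generate_all_tables(**PRESETS["xs"], cluster_start_id=0, dataset_start_id=1)
for name, table in tables.items():
    print(name, table.num_rows)  # one PyArrow table per backend table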
main

main(
    settings: str,
    output_dir: Path,
    cluster_start_id: int,
    dataset_start_id: int,
) -> None

Command line tool for generating data.

Parameters:

- settings (str) – The key of the settings dictionary to use.
- output_dir (Path) – Where to save the output files.
- cluster_start_id (int) – The first integer to use for clusters.
- dataset_start_id (int) – The first integer to use for datasets.

Examples:
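A hedged sketch, assuming main can be invoked directly as a Python function rather than only through a command line wrapper; the import path and output directory are assumptions.

from pathlib import Path

from matchbox.server.postgresql.benchmark.generate_tables import main  # path assumed

main(
    settings="xs",                      # key into the PRESETS dictionary
    output_dir=Path("benchmark_data"),  # hypothetical output directory
    cluster_start_id=0,
    dataset_start_id=1,
)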
query

Benchmarking utilities for the PostgreSQL backend.

Functions:

- compile_query_sql – Compiles the SQL for query() based on a single point of truth and dataset.
- compile_match_sql – Compiles the SQL for match() based on a single point of truth and dataset.

compile_query_sql

compile_query_sql(
    point_of_truth: str, source_address: SourceAddress
) -> str

Compiles the SQL for query() based on a single point of truth and dataset.

Parameters:

- point_of_truth (str) – The name of the resolution to use, like "linker_1".
- source_address (SourceAddress) – The address of the source to retrieve.

Returns:

- str – A compiled PostgreSQL query, including semicolon, ready to run on Matchbox.
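A sketch of inspecting the generated SQL; the import path is assumed and the SourceAddress argument is expected to exist already.

from matchbox.server.postgresql.benchmark.query import compile_query_sql  # path assumed

def show_query_sql(source_address: "SourceAddress") -> None:
    """Print the SQL that query() would run for one point of truth."""
    sql = compile_query_sql(
        point_of_truth="linker_1",  # hypothetical resolution name
        source_address=source_address,
    )
    print(sql)  # a complete PostgreSQL statement, ready to EXPLAIN or run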
compile_match_sql

compile_match_sql(
    source_pk: str,
    resolution_name: str,
    point_of_truth: str,
) -> str

Compiles the SQL for match() based on a single point of truth and dataset.

Note this only tests the query that retrieves all valid matches for the supplied key. The actual match function goes on to merge this with the user's requested target table(s).

Parameters:

- source_pk (str) – The name of the primary key of the source table.
- resolution_name (str) – The resolution name of the source table.
- point_of_truth (str) – The name of the resolution to use, like "linker_1".

Returns:

- str – A compiled PostgreSQL query, including semicolon, ready to run on Matchbox.
db

Matchbox PostgreSQL database connection.

Classes:

- MatchboxPostgresCoreSettings – PostgreSQL-specific settings for Matchbox.
- MatchboxPostgresSettings – Settings for the Matchbox PostgreSQL backend.
- MatchboxDatabase – Matchbox PostgreSQL database connection.

Attributes:

- MBDB

MatchboxPostgresCoreSettings

Bases: BaseModel

PostgreSQL-specific settings for Matchbox.
MatchboxPostgresSettings

Bases: MatchboxSettings

Settings for the Matchbox PostgreSQL backend. Inherits the core settings and adds the PostgreSQL-specific settings; its attributes are identical to the MatchboxPostgresSettings entry documented in full at the top of this page.
MatchboxDatabase

MatchboxDatabase(settings: MatchboxPostgresSettings)

Matchbox PostgreSQL database connection.

Methods:

- connect – Connect to the database.
- get_engine – Get the database engine.
- get_session – Get a new session.
- create_database – Create the database.
- clear_database – Clear the database.

Attributes:

- settings
- engine (Engine | None)
- SessionLocal (sessionmaker | None)
- MatchboxBase
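A connection sketch follows; the import path and the exact behaviour of get_session (assumed here to return a plain SQLAlchemy Session) are assumptions.

from matchbox.server.postgresql.db import (  # path assumed
    MatchboxDatabase,
    MatchboxPostgresSettings,
)

settings = MatchboxPostgresSettings()
db = MatchboxDatabase(settings)
db.connect()
db.create_database()        # create the schema on first run

engine = db.get_engine()    # SQLAlchemy Engine
session = db.get_session()  # assumed to be a plain SQLAlchemy Session
try:
    pass                    # use the session here
finally:
    session.close()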
mixin

A module for defining mixins for the PostgreSQL backend ORM.

Classes:

- CountMixin – A mixin for counting the number of rows in a table.

Attributes:

- T
orm

ORM classes for the Matchbox PostgreSQL database.

Classes:

- ResolutionFrom – Resolution lineage closure table with cached truth values.
- Resolutions – Table of resolution points: models, datasets and humans.
- Sources – Table of sources of data for Matchbox.
- Contains – Cluster lineage table.
- Clusters – Table of indexed data and clusters that match it.
- Probabilities – Table of probabilities that a cluster is correct, according to a resolution.
ResolutionFrom

Bases: CountMixin, MatchboxBase

Resolution lineage closure table with cached truth values.

Methods:

- count – Counts the number of rows in the table.

Attributes:

- __tablename__
- parent
- child
- level
- truth_cache
- __table_args__

parent (class attribute):
parent = Column(
    BIGINT,
    ForeignKey(
        "resolutions.resolution_id", ondelete="CASCADE"
    ),
    primary_key=True,
)

child (class attribute):
child = Column(
    BIGINT,
    ForeignKey(
        "resolutions.resolution_id", ondelete="CASCADE"
    ),
    primary_key=True,
)

__table_args__ (class attribute):
__table_args__ = (
    CheckConstraint(
        "parent != child", name="no_self_reference"
    ),
    CheckConstraint("level > 0", name="positive_level"),
)
Resolutions

Bases: CountMixin, MatchboxBase

Table of resolution points: models, datasets and humans.

Resolutions produce probabilities or own data in the clusters table.

Methods:

- count – Counts the number of rows in the table.
- get_lineage – Returns all ancestors and their cached truth values from this model.
- get_lineage_to_dataset – Returns the resolution lineage and cached truth values to a dataset.
- next_id – Returns the next available resolution_id.

Attributes:

- __tablename__
- resolution_id
- resolution_hash
- type
- name
- description
- truth
- source
- probabilities
- children
- __table_args__
- ancestors (set[Resolutions]) – Returns all ancestors (parents, grandparents, etc.) of this resolution.
- descendants (set[Resolutions]) – Returns descendants (children, grandchildren, etc.) of this resolution.

probabilities (class attribute):
probabilities = relationship(
    "Probabilities",
    back_populates="proposed_by",
    cascade="all, delete-orphan",
)

children (class attribute):
children = relationship(
    "Resolutions",
    secondary=__table__,
    primaryjoin="Resolutions.resolution_id == ResolutionFrom.parent",
    secondaryjoin="Resolutions.resolution_id == ResolutionFrom.child",
    backref="parents",
)

__table_args__ (class attribute):
__table_args__ = (
    CheckConstraint(
        "type IN ('model', 'dataset', 'human')",
        name="resolution_type_constraints",
    ),
    UniqueConstraint(
        "resolution_hash", name="resolutions_hash_key"
    ),
    UniqueConstraint("name", name="resolutions_name_key"),
)

ancestors (property):
ancestors: set[Resolutions]
Returns all ancestors (parents, grandparents, etc.) of this resolution. A usage sketch follows this class.

descendants (property):
descendants: set[Resolutions]
Returns descendants (children, grandchildren, etc.) of this resolution.

get_lineage

Returns all ancestors and their cached truth values from this model.

get_lineage_to_dataset

Returns the resolution lineage and cached truth values to a dataset.
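A sketch of walking a resolution's lineage with the ancestors property; the ORM import path is assumed and the model name is hypothetical.

from sqlalchemy import select
from sqlalchemy.orm import Session

from matchbox.server.postgresql.orm import Resolutions  # path assumed

def print_ancestors(engine, model_name: str) -> None:
    """Print the name and truth threshold of every ancestor resolution."""
    with Session(engine) as session:
        resolution = session.execute(
            select(Resolutions).where(Resolutions.name == model_name)
        ).scalar_one()
        for ancestor in resolution.ancestors:  # set[Resolutions]
            print(ancestor.name, ancestor.truth)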
Sources

Bases: CountMixin, MatchboxBase

Table of sources of data for Matchbox.

Attributes:

- __tablename__
- resolution_id
- resolution_name
- full_name
- warehouse_hash
- id
- column_names
- column_types
- dataset_resolution
- clusters
- __table_args__

resolution_id (class attribute):
resolution_id = Column(
    BIGINT,
    ForeignKey(
        "resolutions.resolution_id", ondelete="CASCADE"
    ),
    primary_key=True,
)

__table_args__ (class attribute):
__table_args__ = (
    UniqueConstraint(
        "full_name",
        "warehouse_hash",
        name="unique_source_address",
    ),
)
Contains

Bases: CountMixin, MatchboxBase

Cluster lineage table.

Methods:

- count – Counts the number of rows in the table.

Attributes:

- __tablename__
- parent
- child
- __table_args__

__table_args__ (class attribute):
__table_args__ = (
    CheckConstraint(
        "parent != child", name="no_self_containment"
    ),
    Index("ix_contains_parent_child", "parent", "child"),
    Index("ix_contains_child_parent", "child", "parent"),
)
Clusters

Bases: CountMixin, MatchboxBase

Table of indexed data and clusters that match it.

Attributes:

- __tablename__
- cluster_id
- cluster_hash
- dataset
- source_pk
- source
- probabilities
- children
- __table_args__

probabilities (class attribute):
probabilities = relationship(
    "Probabilities",
    back_populates="proposes",
    cascade="all, delete-orphan",
)

children (class attribute):
children = relationship(
    "Clusters",
    secondary=__table__,
    primaryjoin="Clusters.cluster_id == Contains.parent",
    secondaryjoin="Clusters.cluster_id == Contains.child",
    backref="parents",
)
Probabilities

Bases: CountMixin, MatchboxBase

Table of probabilities that a cluster is correct, according to a resolution.

Methods:

- count – Counts the number of rows in the table.

Attributes:

- __tablename__
- resolution
- cluster
- probability
- proposed_by
- proposes
- __table_args__

resolution (class attribute):
resolution = Column(
    BIGINT,
    ForeignKey(
        "resolutions.resolution_id", ondelete="CASCADE"
    ),
    primary_key=True,
)

cluster (class attribute):
cluster = Column(
    BIGINT,
    ForeignKey("clusters.cluster_id", ondelete="CASCADE"),
    primary_key=True,
)
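As a sketch, the documented resolution and probability columns support threshold queries like the following; the ORM import path is assumed.

from sqlalchemy import func, select
from sqlalchemy.orm import Session

from matchbox.server.postgresql.orm import Probabilities  # path assumed

def count_accepted(engine, resolution_id: int, threshold: int) -> int:
    """Count probabilities a resolution proposed at or above a threshold."""
    with Session(engine) as session:
        return session.execute(
            select(func.count())
            .select_from(Probabilities)
            .where(
                Probabilities.resolution == resolution_id,
                Probabilities.probability >= threshold,
            )
        ).scalar_one()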
utils

Utilities for using the PostgreSQL backend.

Modules:

- db – General utilities for the PostgreSQL backend.
- insert – Utilities for inserting data into the PostgreSQL backend.
- query – Utilities for querying and matching in the PostgreSQL backend.
- results – Utilities for querying model results from the PostgreSQL backend.

db

General utilities for the PostgreSQL backend.

Functions:

- resolve_model_name – Resolves a model name to a Resolution object.
- get_resolution_graph – Retrieves the resolution graph.
- dump – Dumps the entire database to a snapshot.
- restore – Restores the database from a snapshot.
- sqa_profiled – SQLAlchemy profiler.
- batched – Batch data into lists of length n. The last batch may be shorter.
- data_to_batch – Constructs a batches function for any dataframe and table.
- isolate_table – Creates an isolated copy of a SQLAlchemy table.
- hash_to_hex_decode – A workaround for PostgreSQL so we can compile the query and use ConnectorX.
- batch_ingest – Batch ingest records into a database table.
resolve_model_name

resolve_model_name(
    model: str, engine: Engine
) -> Resolutions

Resolves a model name to a Resolution object.

Raises:

- MatchboxResolutionNotFoundError – If the model doesn't exist.

get_resolution_graph

get_resolution_graph(engine: Engine) -> ResolutionGraph

Retrieves the resolution graph.
dump

dump(engine: Engine) -> MatchboxSnapshot

Dumps the entire database to a snapshot.

Parameters:

- engine (Engine) – The database engine.

Returns:

- MatchboxSnapshot – A MatchboxSnapshot object of type "postgres" with the database's current state.

restore

restore(
    engine: Engine,
    snapshot: MatchboxSnapshot,
    batch_size: int,
) -> None

Restores the database from a snapshot.

Parameters:

- engine (Engine) – The database engine.
- snapshot (MatchboxSnapshot) – A MatchboxSnapshot object of type "postgres" with the database's state.
- batch_size (int) – The number of records to insert in each batch.

Raises:

- ValueError – If the snapshot is missing data.
sqa_profiled

SQLAlchemy profiler.

Taken directly from the SQLAlchemy docs: https://docs.sqlalchemy.org/en/20/faq/performance.html#query-profiling

batched

Batch data into lists of length n. The last batch may be shorter.
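The exact signature isn't shown here; the sketch below assumes batched(data, n) takes an iterable and a batch size, and that the import path follows the module layout above.

from matchbox.server.postgresql.utils.db import batched  # path assumed

for chunk in batched(range(7), 3):  # assumed argument order: (data, n)
    print(chunk)  # expected: [0, 1, 2], then [3, 4, 5], then [6]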
data_to_batch

Constructs a batches function for any dataframe and table.

isolate_table

Creates an isolated copy of a SQLAlchemy table.

This is used to prevent pg_bulk_ingest from attempting to drop unrelated tables in the same schema. The function creates a new Table instance with:

- A fresh MetaData instance
- Copied columns
- Recreated indices properly bound to the new table

Parameters:

- table (DeclarativeMeta) – The DeclarativeMeta class whose table should be isolated.

Returns:

- tuple[MetaData, Table] – A tuple of the isolated SQLAlchemy MetaData and a new SQLAlchemy Table instance with all columns and indices.

hash_to_hex_decode

A workaround for PostgreSQL so we can compile the query and use ConnectorX.
insert

Utilities for inserting data into the PostgreSQL backend.

Classes:

- HashIDMap – An object to help map between IDs and hashes.

Functions:

- insert_dataset – Indexes a dataset from your data warehouse within Matchbox.
- insert_model – Writes a model to Matchbox with a default truth value of 100.
- insert_results – Writes a results table to Matchbox.

HashIDMap

An object to help map between IDs and hashes.

When given a set of IDs, returns their hashes. If any ID doesn't have a hash, it will error.

When given a set of hashes, it will return their IDs. If any don't have IDs, it will create one and return it as part of the set.

Parameters:

- start (int) – The first integer to use for new IDs.
- lookup (optional; default: None) – A lookup table to use for existing hashes.

Methods:

- get_hashes – Returns the hashes of the given IDs.
- get_ids – Returns the IDs of the given hashes, assigning new IDs for unknown hashes.
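A round-trip sketch; the import path and the container type for the hashes (for example, a PyArrow array) are assumptions.

from matchbox.server.postgresql.utils.insert import HashIDMap  # path assumed

def assign_ids(hashes):
    """Map hashes to IDs and back, minting new IDs where needed."""
    hm = HashIDMap(start=1)         # new IDs assigned from 1 upwards
    ids = hm.get_ids(hashes)        # unknown hashes get fresh IDs
    return ids, hm.get_hashes(ids)  # round-trips back to the original hashes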
insert_dataset

insert_dataset(
    source: Source,
    data_hashes: Table,
    engine: Engine,
    batch_size: int,
) -> None

Indexes a dataset from your data warehouse within Matchbox.

insert_model

insert_model(
    model: str,
    left: Resolutions,
    right: Resolutions,
    description: str,
    engine: Engine,
) -> None

Writes a model to Matchbox with a default truth value of 100.

Parameters:

- model (str) – Name of the new model.
- left (Resolutions) – Left parent of the model.
- right (Resolutions) – Right parent of the model. Same as left in a dedupe job.
- description (str) – Model description.
- engine (Engine) – SQLAlchemy engine instance.

Raises:

- MatchboxResolutionNotFoundError – If the specified parent models don't exist.
- MatchboxResolutionNotFoundError – If the specified model doesn't exist.
insert_results

insert_results(
    resolution: Resolutions,
    engine: Engine,
    results: Table,
    batch_size: int,
) -> None

Writes a results table to Matchbox.

The PostgreSQL backend stores clusters in a hierarchical structure, where each component references its parent component at a higher threshold.

This means two-item components are synonymous with their original pairwise probabilities.

This allows easy querying of clusters at any threshold.

Parameters:

- resolution (Resolutions) – Resolution of type model to associate results with.
- engine (Engine) – SQLAlchemy engine instance.
- results (Table) – A PyArrow results table with left_id, right_id, probability.
- batch_size (int) – Number of records to insert in each batch.

Raises:

- MatchboxResolutionNotFoundError – If the specified model doesn't exist.
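A sketch of the expected results table shape; the ID values and the 0-100 probability scale are assumptions, as is the import path.

import pyarrow as pa

from matchbox.server.postgresql.utils.insert import insert_results  # path assumed

def write_results(resolution: "Resolutions", engine: "Engine", batch_size: int) -> None:
    """Write a tiny pairwise results table for one model resolution."""
    results = pa.table(
        {
            "left_id": [1, 2],        # hypothetical cluster IDs
            "right_id": [3, 4],
            "probability": [95, 80],  # assumed 0-100 truth scale
        }
    )
    insert_results(
        resolution=resolution, engine=engine, results=results, batch_size=batch_size
    )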
query

Utilities for querying and matching in the PostgreSQL backend.

Functions:

- query – Queries Matchbox and the Source warehouse to retrieve linked data.
- match – Matches an ID in a source dataset and returns the keys in the targets.

Attributes:

- T

query

query(
    engine: Engine,
    source_address: SourceAddress,
    resolution_name: str | None = None,
    threshold: int | None = None,
    limit: int = None,
) -> Table

Queries Matchbox and the Source warehouse to retrieve linked data.

Takes the dictionaries of tables and fields outputted by selectors and queries the database for them. If a "point of truth" resolution is supplied, will attach the clusters this data belongs to.

To accomplish this, the function:

- Iterates through each selector, and
- Retrieves its data in Matchbox according to the optional point of truth, including its hash and cluster hash
- Retrieves its raw data from its Source's warehouse
- Joins the two together
- Unions the results, one row per item of data in the warehouses

Returns:

- Table – A table containing the requested data from each table, unioned together, with the hash key of each row in Matchbox.
match

match(
    engine: Engine,
    source_pk: str,
    source: SourceAddress,
    targets: list[SourceAddress],
    resolution_name: str,
    threshold: int | None = None,
) -> list[Match]

Matches an ID in a source dataset and returns the keys in the targets.

To accomplish this, the function:

- Reconstructs the resolution lineage from the specified resolution
- Iterates through each target, and
- Retrieves its cluster hash according to the resolution
- Retrieves all other IDs in the cluster in the source dataset
- Retrieves all other IDs in the cluster in the target dataset
- Returns the results as Match objects, one per target
results

Utilities for querying model results from the PostgreSQL backend.

Classes:

- SourceInfo – Information about a model's sources.

Functions:

- get_model_metadata – Get metadata for a model resolution.
- get_model_results – Recover the model's pairwise probabilities and return as a PyArrow table.

SourceInfo

Bases: NamedTuple

Information about a model's sources.

Attributes:

- left (int)
- right (int | None)
- left_ancestors (set[int])
- right_ancestors (set[int] | None)
get_model_metadata

get_model_metadata(
    engine: Engine, resolution: Resolutions
) -> ModelMetadata

Get metadata for a model resolution.

get_model_results

get_model_results(
    engine: Engine, resolution: Resolutions
) -> Table

Recover the model's pairwise probabilities and return as a PyArrow table.

For each probability this model assigned:

- Get its two immediate children
- Filter for children that aren't parents of other clusters this model scored
- Determine left/right by tracing ancestry to source resolutions using query helpers

Parameters:

- engine (Engine) – SQLAlchemy engine.
- resolution (Resolutions) – Resolution of type model to query.

Returns:

- Table – Table containing the original pairwise probabilities.