
PostgreSQL

matchbox.server.postgresql

PostgreSQL adapter for Matchbox server.

Modules:

  • adapter

    PostgreSQL adapter for Matchbox server.

  • benchmark

    Benchmarking utilities for Matchbox’s PostgreSQL backend.

  • db

    Matchbox PostgreSQL database connection.

  • mixin

    A module for defining mixins for the PostgreSQL backend ORM.

  • orm

    ORM classes for the Matchbox PostgreSQL database.

  • utils

    Utilities for using the PostgreSQL backend.

Classes:

__all__ module-attribute

__all__ = ['MatchboxPostgres', 'MatchboxPostgresSettings']

MatchboxPostgres

MatchboxPostgres(settings: MatchboxPostgresSettings)

Bases: MatchboxDBAdapter

A PostgreSQL adapter for Matchbox.

Methods:

Attributes:

settings instance-attribute

settings = settings

datasets instance-attribute

datasets = Sources

models instance-attribute

models = FilteredResolutions(
    datasets=False, humans=False, models=True
)

source_resolutions instance-attribute

source_resolutions = FilteredResolutions(
    datasets=True, humans=False, models=False
)

data instance-attribute

data = FilteredClusters(has_dataset=True)

clusters instance-attribute

clusters = FilteredClusters(has_dataset=False)

merges instance-attribute

merges = Contains

creates instance-attribute

creates = FilteredProbabilities(over_truth=True)

proposes instance-attribute

proposes = FilteredProbabilities()

query

query(
    source_address: SourceAddress,
    resolution_name: str | None = None,
    threshold: int | None = None,
    limit: int | None = None,
) -> Table

Queries the database from an optional point of truth.

Parameters:

  • source_address
    (SourceAddress) –

    the SourceAddress object identifying the source to query

  • resolution_name
    (optional, default: None ) –

    the resolution to use for filtering results. If not specified, the dataset resolution for the queried source will be used

  • threshold
    (optional, default: None ) –

    the threshold to use for creating clusters. If None, uses the model’s default threshold. If an integer, uses that threshold for the specified model, and the model’s cached thresholds for its ancestors

  • limit
    (optional, default: None ) –

    the number to use in a limit clause. Useful for testing

Returns:

  • Table

    The resulting matchbox IDs in Arrow format

match

match(
    source_pk: str,
    source: SourceAddress,
    targets: list[SourceAddress],
    resolution_name: str,
    threshold: int | None = None,
) -> list[Match]

Matches an ID in a source dataset and returns the keys in the targets.

Parameters:

  • source_pk
    (str) –

    The primary key to match from the source.

  • source
    (SourceAddress) –

    The address of the source dataset.

  • targets
    (list[SourceAddress]) –

    The addresses of the target datasets.

  • resolution_name
    (str) –

    The name of the resolution to use for matching.

  • threshold
    (optional, default: None ) –

    the threshold to use for creating clusters. If None, uses the resolution’s default threshold. If an integer, uses that threshold for the specified resolution, and the resolution’s cached thresholds for its ancestors. These values are used instead of the cached thresholds

index

index(source: Source, data_hashes: Table) -> None

Indexes a source dataset from your warehouse into Matchbox.

Parameters:

  • source
    (Source) –

    The source dataset to index.

  • data_hashes
    (Table) –

    The Arrow table with the hash of each data row

get_source

get_source(address: SourceAddress) -> Source

Get a source from its name address.

Parameters:

  • address
    (SourceAddress) –

    The name address for the source

Returns:

  • Source

    A Source object

validate_ids

validate_ids(ids: list[int]) -> None

Validates a list of IDs exist in the database.

Parameters:

  • ids
    (list[int]) –

    A list of IDs to validate.

Raises:

  • MatchboxDataNotFound

    If some items don’t exist in the target table.

validate_hashes

validate_hashes(hashes: list[bytes]) -> None

Validates a list of hashes exist in the database.

Parameters:

  • hashes
    (list[bytes]) –

    A list of hashes to validate.

Raises:

  • MatchboxDataNotFound

    If some items don’t exist in the target table.

cluster_id_to_hash

cluster_id_to_hash(
    ids: list[int],
) -> dict[int, bytes | None]

Get a lookup of Cluster hashes from a list of IDs.

Parameters:

  • ids
    (list[int]) –

    A list of IDs to get hashes for.

Returns:

  • dict[int, bytes | None]

    A dictionary mapping IDs to hashes.

get_resolution_graph

get_resolution_graph() -> ResolutionGraph

Get the full resolution graph.

dump

dump() -> MatchboxSnapshot

Dumps the entire database to a snapshot.

Returns:

  • MatchboxSnapshot

    A MatchboxSnapshot object of type “postgres” with the database’s current state.

clear

clear(certain: bool = False) -> None

Clears all data from the database.

Parameters:

  • certain
    (bool) –

    Whether to clear the database without confirmation.

restore

restore(
    snapshot: MatchboxSnapshot, clear: bool = False
) -> None

Restores the database from a snapshot.

Parameters:

  • snapshot
    (MatchboxSnapshot) –

    A MatchboxSnapshot object of type “postgres” with the database’s state

  • clear
    (bool) –

    Whether to clear the database before restoration

Raises:

  • TypeError

    If the snapshot is not compatible with PostgreSQL

insert_model

insert_model(model: ModelMetadata) -> None

Writes a model to Matchbox.

Parameters:

  • model
    (ModelMetadata) –

    ModelMetadata object with the model’s metadata

Raises:

  • MatchboxDataNotFound

    If, for a linker, the source models weren’t found in the database

get_model

get_model(model: str) -> ModelMetadata

Get a model from the database.

set_model_results

set_model_results(model: str, results: Table) -> None

Set the results for a model.

get_model_results

get_model_results(model: str) -> Table

Get the results for a model.

set_model_truth

set_model_truth(model: str, truth: int) -> None

Sets the truth threshold for this model, changing the default clusters.

get_model_truth

get_model_truth(model: str) -> int

Gets the current truth threshold for this model.

get_model_ancestors

get_model_ancestors(model: str) -> list[ModelAncestor]

Gets the current truth values of all ancestors.

Returns a list of ModelAncestor objects mapping model names to their current truth thresholds.

Unlike ancestors_cache which returns cached values, this property returns the current truth values of all ancestor models.

set_model_ancestors_cache

set_model_ancestors_cache(
    model: str, ancestors_cache: list[ModelAncestor]
) -> None

Updates the cached ancestor thresholds.

Parameters:

  • model
    (str) –

    The name of the model to update

  • ancestors_cache
    (list[ModelAncestor]) –

    List of ModelAncestor objects mapping model names to their truth thresholds

get_model_ancestors_cache

get_model_ancestors_cache(
    model: str,
) -> list[ModelAncestor]

Gets the cached ancestor thresholds, converting hashes to model names.

Returns a list of ModelAncestor objects mapping model names to their cached truth thresholds.

This is required because each point of truth needs to be stable, so we choose when to update it, caching the ancestor’s values in the model itself.

delete_model

delete_model(model: str, certain: bool = False) -> None

Delete a model from the database.

Parameters:

  • model
    (str) –

    The name of the model to delete.

  • certain
    (bool) –

    Whether to delete the model without confirmation.

MatchboxPostgresSettings

Bases: MatchboxSettings

Settings for the Matchbox PostgreSQL backend.

Inherits the core settings and adds the PostgreSQL-specific settings.

Attributes:

model_config class-attribute instance-attribute

model_config = SettingsConfigDict(
    env_prefix="MB__",
    env_nested_delimiter="__",
    use_enum_values=True,
    env_file=".env",
    env_file_encoding="utf-8",
    extra="ignore",
)

batch_size class-attribute instance-attribute

batch_size: int = Field(default=250000)

datastore instance-attribute

backend_type class-attribute instance-attribute

backend_type: MatchboxBackends = POSTGRES

postgres class-attribute instance-attribute

postgres: MatchboxPostgresCoreSettings = Field(
    default_factory=MatchboxPostgresCoreSettings
)
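Given the SettingsConfigDict above (env_prefix "MB__", env_nested_delimiter "__"), backend settings can be supplied through the environment or a .env file. A hypothetical example, with placeholder values; the nested postgres fields follow MatchboxPostgresCoreSettings:

```shell
MB__BATCH_SIZE=250000
MB__POSTGRES__HOST=localhost
MB__POSTGRES__PORT=5432
MB__POSTGRES__USER=matchbox_user
MB__POSTGRES__PASSWORD=change-me
MB__POSTGRES__DATABASE=matchbox
MB__POSTGRES__DB_SCHEMA=mb
```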

adapter

PostgreSQL adapter for Matchbox server.

Classes:

Attributes:

  • T
  • P

T module-attribute

T = TypeVar('T')

P module-attribute

P = ParamSpec('P')

FilteredClusters

Bases: BaseModel

Wrapper class for filtered cluster queries.

Methods:

  • count

    Counts the number of clusters in the database.

Attributes:

has_dataset class-attribute instance-attribute
has_dataset: bool | None = None
count
count() -> int

Counts the number of clusters in the database.

FilteredProbabilities

Bases: BaseModel

Wrapper class for filtered probability queries.

Methods:

  • count

    Counts the number of probabilities in the database.

Attributes:

over_truth class-attribute instance-attribute
over_truth: bool = False
count
count() -> int

Counts the number of probabilities in the database.

FilteredResolutions

Bases: BaseModel

Wrapper class for filtered resolution queries.

Methods:

  • count

    Counts the number of resolutions in the database.

Attributes:

datasets class-attribute instance-attribute
datasets: bool = False
humans class-attribute instance-attribute
humans: bool = False
models class-attribute instance-attribute
models: bool = False
count
count() -> int

Counts the number of resolutions in the database.


benchmark

Benchmarking utilities for Matchbox’s PostgreSQL backend.

Modules:

  • cluster_pipeline

    Script to benchmark the full dummy data generation pipeline.

  • generate_tables

    Generate tables for benchmarking PostgreSQL backend.

  • query

    Benchmarking utilities for the PostgreSQL backend.

cluster_pipeline

Script to benchmark the full dummy data generation pipeline.

Functions:

  • timer

    Context manager to time a block of code.

Attributes:

ROOT module-attribute
ROOT = parent
config module-attribute
config = PRESETS['l']
left_ids module-attribute
left_ids = tuple(range(config['dedupe_components']))
right_ids module-attribute
right_ids = tuple(
    range(
        config["dedupe_components"],
        config["dedupe_components"] * 2,
    )
)
probs module-attribute
probs = generate_dummy_probabilities(
    left_ids,
    right_ids,
    [0.6, 1],
    config["link_components"],
    config["link_len"],
)
all_probs module-attribute
all_probs = concat_arrays(
    [combine_chunks(), combine_chunks()]
)
lookup module-attribute
lookup = table(
    {
        "id": all_probs,
        "hash": array(
            [hash_data(p) for p in to_pylist()],
            type=large_binary(),
        ),
    }
)
hm module-attribute
probs_with_ccs module-attribute
probs_with_ccs = attach_components_to_probabilities(
    table(
        {
            "left_id": get_hashes(probs["left_id"]),
            "right_id": get_hashes(probs["right_id"]),
            "probability": probs["probability"],
        }
    )
)
hierarchy module-attribute
timer
timer(description: str)

Context manager to time a block of code.
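A timing context manager like this can be built on contextlib; the sketch below is a minimal stand-in, and the actual implementation may differ in its output format:

```python
import time
from contextlib import contextmanager


@contextmanager
def timer(description: str):
    """Print the wall-clock time taken by the wrapped block."""
    start = time.perf_counter()
    yield
    elapsed = time.perf_counter() - start
    print(f"{description}: {elapsed:.3f} seconds")
```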

generate_tables

Generate tables for benchmarking PostgreSQL backend.

Functions:

Attributes:

PRESETS module-attribute
PRESETS = {
    "xs": {
        "source_len": 10000,
        "dedupe_components": 8000,
        "dedupe_len": 2000,
        "link_components": 6000,
        "link_len": 10000,
    },
    "s": {
        "source_len": 100000,
        "dedupe_components": 80000,
        "dedupe_len": 20000,
        "link_components": 60000,
        "link_len": 100000,
    },
    "m": {
        "source_len": 1000000,
        "dedupe_components": 800000,
        "dedupe_len": 200000,
        "link_components": 600000,
        "link_len": 1000000,
    },
    "l": {
        "source_len": 10000000,
        "dedupe_components": 8000000,
        "dedupe_len": 2000000,
        "link_components": 6000000,
        "link_len": 10000000,
    },
    "xl": {
        "source_len": 100000000,
        "dedupe_components": 80000000,
        "dedupe_len": 20000000,
        "link_components": 60000000,
        "link_len": 100000000,
    },
}
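The presets follow a single scaling pattern: each size multiplies the "xs" counts by a power of ten. A hypothetical helper (not part of the module) makes the ratios explicit:

```python
def make_preset(scale: int) -> dict[str, int]:
    """Reproduce a PRESETS entry from a power-of-ten scale factor.

    scale=1 gives "xs", 10 gives "s", 100 gives "m", and so on.
    """
    return {
        "source_len": 10_000 * scale,
        "dedupe_components": 8_000 * scale,
        "dedupe_len": 2_000 * scale,
        "link_components": 6_000 * scale,
        "link_len": 10_000 * scale,
    }
```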
generate_sources
generate_sources(dataset_start_id: int = 1) -> Table

Generate sources table.

Parameters:

  • dataset_start_id
    (int, default: 1 ) –

    Starting ID for dataset resolution IDs

Returns:

  • Table

    PyArrow sources table

generate_resolutions
generate_resolutions(dataset_start_id: int = 1) -> Table

Generate resolutions table.

Parameters:

  • dataset_start_id
    (int, default: 1 ) –

    Starting ID for dataset resolution IDs

Returns:

  • Table

    PyArrow resolutions table

generate_resolution_from
generate_resolution_from(
    dataset_start_id: int = 1,
) -> Table

Generate resolution_from table.

Parameters:

  • dataset_start_id
    (int, default: 1 ) –

    Starting ID for dataset resolution IDs

Returns:

  • Table

    PyArrow resolution_from table

generate_cluster_source
generate_cluster_source(
    range_left: int,
    range_right: int,
    resolution_source: int,
    cluster_start_id: int = 0,
) -> Table

Generate cluster table containing rows for source rows.

Parameters:

  • range_left
    (int) –

    first ID to generate

  • range_right
    (int) –

    last ID to generate, plus one

  • resolution_source
    (int) –

    resolution ID for the source

  • cluster_start_id
    (int, default: 0 ) –

    Starting ID for clusters

Returns:

  • Table

    PyArrow cluster table

generate_result_tables
generate_result_tables(
    left_ids: Iterable[int],
    right_ids: Iterable[int] | None,
    resolution_id: int,
    next_id: int,
    n_components: int,
    n_probs: int,
    prob_min: float = 0.6,
    prob_max: float = 1,
) -> tuple[list[int], Table, Table, Table, int]

Generate probabilities, contains and clusters tables.

Parameters:

  • left_ids
    (Iterable[int]) –

    list of IDs for rows to dedupe, or for left rows to link

  • right_ids
    (Iterable[int] | None) –

    list of IDs for right rows to link

  • resolution_id
    (int) –

    ID of resolution for this dedupe or link model

  • next_id
    (int) –

    the next ID to use when generating IDs

  • n_components
    (int) –

    number of implied connected components

  • n_probs
    (int) –

    total number of probability edges to be generated

  • prob_min
    (float, default: 0.6 ) –

    minimum value for probabilities to be generated

  • prob_max
    (float, default: 1 ) –

    maximum value for probabilities to be generated

Returns:

  • tuple[list[int], Table, Table, Table, int]

    A tuple with a list of top-level clusters; three PyArrow tables for probabilities, contains and clusters; and the next ID to use for future calls

generate_all_tables

Make all six PostgreSQL backend tables.

It will create two sources, one deduper for each, and one linker over the two dedupers.

Parameters:

  • source_len
    (int) –

    length of each data source

  • dedupe_components
    (int) –

    number of connected components implied by each deduper

  • dedupe_len
    (int) –

    probabilities generated by each deduper

  • link_components
    (int) –

    number of connected components implied by each linker

  • link_len
    (int) –

    probabilities generated by each linker

  • cluster_start_id
    (int, default: 0 ) –

    Starting ID for clusters

  • dataset_start_id
    (int, default: 1 ) –

    Starting ID for dataset resolution IDs

Returns:

  • dict[str, Table]

    A dictionary where keys are table names and values are PyArrow tables

main

Command line tool for generating data.

Parameters:

  • settings
    (str) –

    The key of the settings dictionary to use

  • output_dir
    (Path) –

    Where to save the output files

  • cluster_start_id
    (int) –

    The first integer to use for clusters

  • dataset_start_id
    (int) –

    The first integer to use for datasets

Examples:

generate_tables.py --settings xl --output-dir data/v4 --dataset-start-id 6742 --cluster-start-id 7
generate_tables.py -s s -o data/v4 -d 1 -c 0

query

Benchmarking utilities for the PostgreSQL backend.

Functions:

  • compile_query_sql

    Compiles the SQL for query() based on a single point of truth and dataset.

  • compile_match_sql

    Compiles the SQL for match() based on a single point of truth and dataset.

compile_query_sql
compile_query_sql(
    point_of_truth: str, source_address: SourceAddress
) -> str

Compiles the SQL for query() based on a single point of truth and dataset.

Parameters:

  • point_of_truth
    (str) –

    The name of the resolution to use, like “linker_1”

  • source_address
    (SourceAddress) –

    The address of the source to retrieve

Returns:

  • str

    A compiled PostgreSQL query, including semicolon, ready to run on Matchbox

compile_match_sql
compile_match_sql(
    source_pk: str,
    resolution_name: str,
    point_of_truth: str,
) -> str

Compiles the SQL for match() based on a single point of truth and dataset.

Note this only tests the query that retrieves all valid matches for the supplied key. The actual match function goes on to merge this with the user’s requested target table(s).

Parameters:

  • source_pk
    (str) –

    The name of the primary key of the source table

  • resolution_name
    (str) –

    The resolution name of the source table

  • point_of_truth
    (str) –

    The name of the resolution to use, like “linker_1”

Returns:

  • str

    A compiled PostgreSQL query, including semicolon, ready to run on Matchbox

db

Matchbox PostgreSQL database connection.

Classes:

Attributes:

MBDB module-attribute

MatchboxPostgresCoreSettings

Bases: BaseModel

PostgreSQL-specific settings for Matchbox.

Attributes:

host instance-attribute
host: str
port instance-attribute
port: int
user instance-attribute
user: str
password instance-attribute
password: str
database instance-attribute
database: str
db_schema instance-attribute
db_schema: str


MatchboxDatabase

MatchboxDatabase(settings: MatchboxPostgresSettings)

Matchbox PostgreSQL database connection.

Methods:

Attributes:

settings instance-attribute
settings = settings
engine instance-attribute
engine: Engine | None = None
SessionLocal instance-attribute
SessionLocal: sessionmaker | None = None
MatchboxBase instance-attribute
MatchboxBase = declarative_base(
    metadata=MetaData(schema=db_schema)
)
connect
connect()

Connect to the database.

get_engine
get_engine() -> Engine

Get the database engine.

get_session
get_session()

Get a new session.

create_database
create_database()

Create the database.

clear_database
clear_database()

Clear the database.

mixin

A module for defining mixins for the PostgreSQL backend ORM.

Classes:

  • CountMixin

    A mixin for counting the number of rows in a table.

Attributes:

  • T

T module-attribute

T = TypeVar('T')

CountMixin

A mixin for counting the number of rows in a table.

Methods:

  • count

    Counts the number of rows in the table.

count classmethod
count() -> int

Counts the number of rows in the table.
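The pattern behind this mixin can be sketched without the real backend. The actual mixin is a SQLAlchemy classmethod running against PostgreSQL; the stand-in below swaps in stdlib sqlite3 purely to show the shape, and the class names here are hypothetical:

```python
import sqlite3


class CountMixin:
    """Adds a classmethod that counts rows of the class's table."""

    __tablename__: str
    _conn: sqlite3.Connection  # the real mixin obtains a session internally

    @classmethod
    def count(cls) -> int:
        # Issue a COUNT(*) against the table named by the class.
        cur = cls._conn.execute(f"SELECT COUNT(*) FROM {cls.__tablename__}")
        return cur.fetchone()[0]


class Clusters(CountMixin):
    __tablename__ = "clusters"
    _conn = sqlite3.connect(":memory:")


Clusters._conn.execute("CREATE TABLE clusters (cluster_id INTEGER)")
Clusters._conn.executemany(
    "INSERT INTO clusters VALUES (?)", [(1,), (2,), (3,)]
)
```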

orm

ORM classes for the Matchbox PostgreSQL database.

Classes:

  • ResolutionFrom

    Resolution lineage closure table with cached truth values.

  • Resolutions

    Table of resolution points: models, datasets and humans.

  • Sources

    Table of sources of data for Matchbox.

  • Contains

    Cluster lineage table.

  • Clusters

    Table of indexed data and clusters that match it.

  • Probabilities

    Table of probabilities that a cluster is correct, according to a resolution.

ResolutionFrom

Bases: CountMixin, MatchboxBase

Resolution lineage closure table with cached truth values.

Methods:

  • count

    Counts the number of rows in the table.

Attributes:

__tablename__ class-attribute instance-attribute
__tablename__ = 'resolution_from'
parent class-attribute instance-attribute
parent = Column(
    BIGINT,
    ForeignKey(
        "resolutions.resolution_id", ondelete="CASCADE"
    ),
    primary_key=True,
)
child class-attribute instance-attribute
child = Column(
    BIGINT,
    ForeignKey(
        "resolutions.resolution_id", ondelete="CASCADE"
    ),
    primary_key=True,
)
level class-attribute instance-attribute
level = Column(INTEGER, nullable=False)
truth_cache class-attribute instance-attribute
truth_cache = Column(SMALLINT, nullable=True)
__table_args__ class-attribute instance-attribute
__table_args__ = (
    CheckConstraint(
        "parent != child", name="no_self_reference"
    ),
    CheckConstraint("level > 0", name="positive_level"),
)
count classmethod
count() -> int

Counts the number of rows in the table.
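The closure-table idea can be illustrated with a hypothetical helper that computes the rows to insert when a new leaf resolution gains a parent (assuming, as when a new model is inserted, that the child has no descendants yet):

```python
def closure_rows(
    existing: dict[tuple[str, str], int],
    parent: str,
    child: str,
) -> dict[tuple[str, str], int]:
    """New (parent, child) -> level rows for a lineage closure table.

    Adds the direct edge at level 1, plus one row per ancestor of
    `parent`, each one level deeper, so every ancestor-descendant
    pair is materialised.
    """
    new = {(parent, child): 1}
    for (ancestor, descendant), level in existing.items():
        if descendant == parent:
            new[(ancestor, child)] = level + 1
    return new
```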

Resolutions

Bases: CountMixin, MatchboxBase

Table of resolution points: models, datasets and humans.

Resolutions produce probabilities or own data in the clusters table.

Methods:

  • count

    Counts the number of rows in the table.

  • get_lineage

    Returns all ancestors and their cached truth values from this model.

  • get_lineage_to_dataset

    Returns the resolution lineage and cached truth values to a dataset.

  • next_id

    Returns the next available resolution_id.

Attributes:

__tablename__ class-attribute instance-attribute
__tablename__ = 'resolutions'
resolution_id class-attribute instance-attribute
resolution_id = Column(BIGINT, primary_key=True)
resolution_hash class-attribute instance-attribute
resolution_hash = Column(BYTEA, nullable=False)
type class-attribute instance-attribute
type = Column(TEXT, nullable=False)
name class-attribute instance-attribute
name = Column(TEXT, nullable=False)
description class-attribute instance-attribute
description = Column(TEXT)
truth class-attribute instance-attribute
truth = Column(SMALLINT)
source class-attribute instance-attribute
source = relationship(
    "Sources",
    back_populates="dataset_resolution",
    uselist=False,
)
probabilities class-attribute instance-attribute
probabilities = relationship(
    "Probabilities",
    back_populates="proposed_by",
    cascade="all, delete-orphan",
)
children class-attribute instance-attribute
children = relationship(
    "Resolutions",
    secondary=__table__,
    primaryjoin="Resolutions.resolution_id == ResolutionFrom.parent",
    secondaryjoin="Resolutions.resolution_id == ResolutionFrom.child",
    backref="parents",
)
__table_args__ class-attribute instance-attribute
__table_args__ = (
    CheckConstraint(
        "type IN ('model', 'dataset', 'human')",
        name="resolution_type_constraints",
    ),
    UniqueConstraint(
        "resolution_hash", name="resolutions_hash_key"
    ),
    UniqueConstraint("name", name="resolutions_name_key"),
)
ancestors property
ancestors: set[Resolutions]

Returns all ancestors (parents, grandparents, etc.) of this resolution.

descendants property
descendants: set[Resolutions]

Returns descendants (children, grandchildren, etc.) of this resolution.

count classmethod
count() -> int

Counts the number of rows in the table.

get_lineage
get_lineage() -> dict[int, float]

Returns all ancestors and their cached truth values from this model.

get_lineage_to_dataset
get_lineage_to_dataset(
    dataset: Resolutions,
) -> tuple[bytes, dict[int, float]]

Returns the resolution lineage and cached truth values to a dataset.

next_id classmethod
next_id() -> int

Returns the next available resolution_id.

Sources

Bases: CountMixin, MatchboxBase

Table of sources of data for Matchbox.

Methods:

  • count

    Counts the number of rows in the table.

  • list

    Returns all sources in the database.

Attributes:

__tablename__ class-attribute instance-attribute
__tablename__ = 'sources'
resolution_id class-attribute instance-attribute
resolution_id = Column(
    BIGINT,
    ForeignKey(
        "resolutions.resolution_id", ondelete="CASCADE"
    ),
    primary_key=True,
)
resolution_name class-attribute instance-attribute
resolution_name = Column(TEXT, nullable=False)
full_name class-attribute instance-attribute
full_name = Column(TEXT, nullable=False)
warehouse_hash class-attribute instance-attribute
warehouse_hash = Column(BYTEA, nullable=False)
id class-attribute instance-attribute
id = Column(TEXT, nullable=False)
column_names class-attribute instance-attribute
column_names = Column(ARRAY(TEXT), nullable=False)
column_types class-attribute instance-attribute
column_types = Column(ARRAY(TEXT), nullable=False)
dataset_resolution class-attribute instance-attribute
dataset_resolution = relationship(
    "Resolutions", back_populates="source"
)
clusters class-attribute instance-attribute
clusters = relationship("Clusters", back_populates="source")
__table_args__ class-attribute instance-attribute
__table_args__ = (
    UniqueConstraint(
        "full_name",
        "warehouse_hash",
        name="unique_source_address",
    ),
)
count classmethod
count() -> int

Counts the number of rows in the table.

list classmethod
list() -> list[Sources]

Returns all sources in the database.

Contains

Bases: CountMixin, MatchboxBase

Cluster lineage table.

Methods:

  • count

    Counts the number of rows in the table.

Attributes:

__tablename__ class-attribute instance-attribute
__tablename__ = 'contains'
parent class-attribute instance-attribute
parent = Column(
    BIGINT,
    ForeignKey("clusters.cluster_id", ondelete="CASCADE"),
    primary_key=True,
)
child class-attribute instance-attribute
child = Column(
    BIGINT,
    ForeignKey("clusters.cluster_id", ondelete="CASCADE"),
    primary_key=True,
)
__table_args__ class-attribute instance-attribute
__table_args__ = (
    CheckConstraint(
        "parent != child", name="no_self_containment"
    ),
    Index("ix_contains_parent_child", "parent", "child"),
    Index("ix_contains_child_parent", "child", "parent"),
)
count classmethod
count() -> int

Counts the number of rows in the table.

Clusters

Bases: CountMixin, MatchboxBase

Table of indexed data and clusters that match it.

Methods:

  • count

    Counts the number of rows in the table.

  • next_id

    Returns the next available cluster_id.

Attributes:

__tablename__ class-attribute instance-attribute
__tablename__ = 'clusters'
cluster_id class-attribute instance-attribute
cluster_id = Column(BIGINT, primary_key=True)
cluster_hash class-attribute instance-attribute
cluster_hash = Column(BYTEA, nullable=False)
dataset class-attribute instance-attribute
dataset = Column(
    BIGINT,
    ForeignKey("sources.resolution_id"),
    nullable=True,
)
source_pk class-attribute instance-attribute
source_pk = Column(ARRAY(TEXT), index=True, nullable=True)
source class-attribute instance-attribute
source = relationship('Sources', back_populates='clusters')
probabilities class-attribute instance-attribute
probabilities = relationship(
    "Probabilities",
    back_populates="proposes",
    cascade="all, delete-orphan",
)
children class-attribute instance-attribute
children = relationship(
    "Clusters",
    secondary=__table__,
    primaryjoin="Clusters.cluster_id == Contains.parent",
    secondaryjoin="Clusters.cluster_id == Contains.child",
    backref="parents",
)
__table_args__ class-attribute instance-attribute
__table_args__ = (
    Index(
        "ix_clusters_id_gin",
        source_pk,
        postgresql_using="gin",
    ),
    UniqueConstraint(
        "cluster_hash", name="clusters_hash_key"
    ),
)
count classmethod
count() -> int

Counts the number of rows in the table.

next_id classmethod
next_id() -> int

Returns the next available cluster_id.

Probabilities

Bases: CountMixin, MatchboxBase

Table of probabilities that a cluster is correct, according to a resolution.

Methods:

  • count

    Counts the number of rows in the table.

Attributes:

__tablename__ class-attribute instance-attribute
__tablename__ = 'probabilities'
resolution class-attribute instance-attribute
resolution = Column(
    BIGINT,
    ForeignKey(
        "resolutions.resolution_id", ondelete="CASCADE"
    ),
    primary_key=True,
)
cluster class-attribute instance-attribute
cluster = Column(
    BIGINT,
    ForeignKey("clusters.cluster_id", ondelete="CASCADE"),
    primary_key=True,
)
probability class-attribute instance-attribute
probability = Column(SMALLINT, nullable=False)
proposed_by class-attribute instance-attribute
proposed_by = relationship(
    "Resolutions", back_populates="probabilities"
)
proposes class-attribute instance-attribute
proposes = relationship(
    "Clusters", back_populates="probabilities"
)
__table_args__ class-attribute instance-attribute
__table_args__ = (
    CheckConstraint(
        "probability BETWEEN 0 AND 100",
        name="valid_probability",
    ),
)
count classmethod
count() -> int

Counts the number of rows in the table.

utils

Utilities for using the PostgreSQL backend.

Modules:

  • db

    General utilities for the PostgreSQL backend.

  • insert

    Utilities for inserting data into the PostgreSQL backend.

  • query

    Utilities for querying and matching in the PostgreSQL backend.

  • results

    Utilities for querying model results from the PostgreSQL backend.

db

General utilities for the PostgreSQL backend.

Functions:

  • resolve_model_name

    Resolves a model name to a Resolution object.

  • get_resolution_graph

    Retrieves the resolution graph.

  • dump

    Dumps the entire database to a snapshot.

  • restore

    Restores the database from a snapshot.

  • sqa_profiled

    SQLAlchemy profiler.

  • batched

    Batch data into lists of length n. The last batch may be shorter.

  • data_to_batch

    Constructs a batches function for any dataframe and table.

  • isolate_table

    Creates an isolated copy of a SQLAlchemy table.

  • hash_to_hex_decode

A workaround for PostgreSQL so we can compile the query and use ConnectorX.

  • batch_ingest

    Batch ingest records into a database table.

resolve_model_name
resolve_model_name(
    model: str, engine: Engine
) -> Resolutions

Resolves a model name to a Resolution object.

Parameters:

  • model
    (str) –

    The name of the model to resolve.

  • engine
    (Engine) –

    The database engine.

Raises:

get_resolution_graph
get_resolution_graph(engine: Engine) -> ResolutionGraph

Retrieves the resolution graph.

dump
dump(engine: Engine) -> MatchboxSnapshot

Dumps the entire database to a snapshot.

Parameters:

  • engine
    (Engine) –

    The database engine.

Returns:

  • MatchboxSnapshot

    A MatchboxSnapshot object of type “postgres” with the database’s current state.

restore
restore(
    engine: Engine,
    snapshot: MatchboxSnapshot,
    batch_size: int,
) -> None

Restores the database from a snapshot.

Parameters:

  • engine
    (Engine) –

    The database engine.

  • snapshot
    (MatchboxSnapshot) –

    A MatchboxSnapshot object of type “postgres” with the database’s state

  • batch_size
    (int) –

    The number of records to insert in each batch

Raises:

sqa_profiled
sqa_profiled()

SQLAlchemy profiler.

Taken directly from their docs: https://docs.sqlalchemy.org/en/20/faq/performance.html#query-profiling

batched
batched(iterable: Iterable, n: int) -> Iterable

Batch data into lists of length n. The last batch may be shorter.
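This is the classic batching recipe; a plausible sketch of such a helper (the backend's actual implementation may differ) is:

```python
from itertools import islice
from typing import Iterable, Iterator, TypeVar

T = TypeVar("T")

def batched(iterable: Iterable[T], n: int) -> Iterator[list[T]]:
    """Batch data into lists of length n. The last batch may be shorter."""
    if n < 1:
        raise ValueError("n must be at least one")
    it = iter(iterable)
    # islice consumes up to n items per pass; an empty list ends the loop.
    while chunk := list(islice(it, n)):
        yield chunk
```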

data_to_batch
data_to_batch(
    records: list[tuple], table: Table, batch_size: int
) -> Callable[[str], tuple[Any]]

Constructs a batch-yielding function for any dataframe and table.

isolate_table
isolate_table(
    table: DeclarativeMeta,
) -> tuple[MetaData, Table]

Creates an isolated copy of a SQLAlchemy table.

This is used to prevent pg_bulk_ingest from attempting to drop unrelated tables in the same schema. The function creates a new Table instance with:

  • A fresh MetaData instance
  • Copied columns
  • Recreated indices properly bound to the new table

Parameters:

  • table
    (DeclarativeMeta) –

    The DeclarativeMeta class whose table should be isolated

Returns:

  • tuple[MetaData, Table]

    A tuple of:

    • The isolated SQLAlchemy MetaData
    • A new SQLAlchemy Table instance with all columns and indices
hash_to_hex_decode
hash_to_hex_decode(hash: bytes) -> bytes

A workaround for PostgreSQL so we can compile the query and use ConnectorX.

batch_ingest
batch_ingest(
    records: list[tuple[Any]],
    table: DeclarativeMeta,
    conn: Connection,
    batch_size: int,
) -> None

Batch ingest records into a database table.

We isolate the table and metadata because pg_bulk_ingest will try to drop unrelated tables if they’re in the same schema.

insert

Utilities for inserting data into the PostgreSQL backend.

Classes:

  • HashIDMap

    An object to help map between IDs and hashes.

Functions:

  • insert_dataset

    Indexes a dataset from your data warehouse within Matchbox.

  • insert_model

    Writes a model to Matchbox with a default truth value of 100.

  • insert_results

    Writes a results table to Matchbox.

HashIDMap
HashIDMap(start: int, lookup: Table | None = None)

An object to help map between IDs and hashes.

When given a set of IDs, it returns their hashes. If any ID doesn’t have a hash, it raises an error.

When given a set of hashes, it returns their IDs. If any hash doesn’t have an ID, it creates one and returns it as part of the set.
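The real class stores its lookup as PyArrow arrays (see the `lookup` attribute below); the contract it describes can be sketched in plain Python, purely for illustration:

```python
class HashIDMapSketch:
    """Assigns a stable integer ID to each hash, minting new IDs on demand."""

    def __init__(self, start: int):
        self.next_int = start
        self._hash_to_id: dict[bytes, int] = {}

    def get_ids(self, hashes: list[bytes]) -> list[int]:
        """Return IDs for hashes, creating new IDs for unknown hashes."""
        ids = []
        for h in hashes:
            if h not in self._hash_to_id:
                self._hash_to_id[h] = self.next_int
                self.next_int += 1
            ids.append(self._hash_to_id[h])
        return ids

    def get_hashes(self, ids: list[int]) -> list[bytes]:
        """Return hashes for IDs; an unknown ID is an error."""
        id_to_hash = {v: k for k, v in self._hash_to_id.items()}
        return [id_to_hash[i] for i in ids]  # raises KeyError if unknown
```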

Parameters:

  • start
    (int) –

    The first integer to use for new IDs

  • lookup
    (optional, default: None ) –

    A lookup table to use for existing hashes

Methods:

  • get_hashes

    Returns the hashes of the given IDs.

  • get_ids

    Returns the IDs of the given hashes, assigning new IDs for unknown hashes.

Attributes:

next_int instance-attribute
next_int = start
lookup instance-attribute
lookup = from_arrays(
    [
        array([], type=uint64()),
        array([], type=large_binary()),
        array([], type=bool_()),
    ],
    names=["id", "hash", "new"],
)
get_hashes
get_hashes(ids: UInt64Array) -> LargeBinaryArray

Returns the hashes of the given IDs.

get_ids
get_ids(hashes: BinaryArray) -> UInt64Array

Returns the IDs of the given hashes, assigning new IDs for unknown hashes.

insert_dataset
insert_dataset(
    source: Source,
    data_hashes: Table,
    engine: Engine,
    batch_size: int,
) -> None

Indexes a dataset from your data warehouse within Matchbox.

insert_model
insert_model(
    model: str,
    left: Resolutions,
    right: Resolutions,
    description: str,
    engine: Engine,
) -> None

Writes a model to Matchbox with a default truth value of 100.

Parameters:

  • model
    (str) –

    Name of the new model

  • left
    (Resolutions) –

    Left parent of the model

  • right
    (Resolutions) –

    Right parent of the model. Same as left in a dedupe job

  • description
    (str) –

    Model description

  • engine
    (Engine) –

    SQLAlchemy engine instance

Raises:

  • MatchboxResolutionNotFoundError

    If the specified parent models don’t exist.

  • MatchboxResolutionNotFoundError

    If the specified model doesn’t exist.

insert_results
insert_results(
    resolution: Resolutions,
    engine: Engine,
    results: Table,
    batch_size: int,
) -> None

Writes a results table to Matchbox.

The PostgreSQL backend stores clusters in a hierarchical structure, where each component references its parent component at a higher threshold.

This means two-item components are synonymous with their original pairwise probabilities.

This allows easy querying of clusters at any threshold.
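To illustrate the hierarchy (toy data, not the actual schema), resolving a record's cluster at a given threshold amounts to climbing parent links while the parent component formed at or above that threshold:

```python
# Toy hierarchy (illustrative only): child cluster id -> (parent id,
# probability at which that parent component forms).
parents = {1: (10, 90), 10: (20, 70)}

def resolve_cluster(cluster_id: int, threshold: int) -> int:
    """Climb parent links while the parent formed at or above the threshold."""
    while cluster_id in parents:
        parent, probability = parents[cluster_id]
        if probability < threshold:
            break
        cluster_id = parent
    return cluster_id
```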

Parameters:

  • resolution
    (Resolutions) –

    Resolution of type model to associate results with

  • engine
    (Engine) –

    SQLAlchemy engine instance

  • results
    (Table) –

    A PyArrow results table with left_id, right_id, probability

  • batch_size
    (int) –

    Number of records to insert in each batch

Raises:

  • MatchboxResolutionNotFoundError

    If the specified model doesn’t exist.

query

Utilities for querying and matching in the PostgreSQL backend.

Functions:

  • query

    Queries Matchbox and the Source warehouse to retrieve linked data.

  • match

    Matches an ID in a source dataset and returns the keys in the targets.

Attributes:

  • T
T module-attribute
T = TypeVar('T')
query
query(
    engine: Engine,
    source_address: SourceAddress,
    resolution_name: str | None = None,
    threshold: int | None = None,
    limit: int | None = None,
) -> Table

Queries Matchbox and the Source warehouse to retrieve linked data.

Takes the dictionaries of tables and fields output by selectors and queries the database for them. If a “point of truth” resolution is supplied, attaches the clusters this data belongs to.

To accomplish this, the function:

  • Iterates through each selector, and
    • Retrieves its data in Matchbox according to the optional point of truth, including its hash and cluster hash
    • Retrieves its raw data from its Source’s warehouse
    • Joins the two together
  • Unions the results, one row per item of data in the warehouses
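As a toy illustration of the join step (hypothetical data and names, not the real API):

```python
# Hypothetical example: Matchbox maps each source primary key to a cluster id,
# while the warehouse holds the raw columns for that key.
matchbox_rows = {"pk1": 101, "pk2": 101, "pk3": 202}
warehouse_rows = {
    "pk1": {"name": "Acme"},
    "pk2": {"name": "ACME Ltd"},
    "pk3": {"name": "Widget Co"},
}

# Join the two on the primary key: one output row per warehouse row,
# each carrying the cluster id Matchbox resolved for it.
joined = [
    {"source_pk": pk, "id": matchbox_rows[pk], **row}
    for pk, row in warehouse_rows.items()
]
```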

Returns:

  • Table

    A table containing the requested data from each table, unioned together, with the hash key of each row in Matchbox.

match
match(
    engine: Engine,
    source_pk: str,
    source: SourceAddress,
    targets: list[SourceAddress],
    resolution_name: str,
    threshold: int | None = None,
) -> list[Match]

Matches an ID in a source dataset and returns the keys in the targets.

To accomplish this, the function:

  • Reconstructs the resolution lineage from the specified resolution
  • Iterates through each target, and
    • Retrieves its cluster hash according to the resolution
    • Retrieves all other IDs in the cluster in the source dataset
    • Retrieves all other IDs in the cluster in the target dataset
  • Returns the results as Match objects, one per target
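The steps above can be pictured with a toy cluster index (hypothetical names and data, purely a sketch of the contract):

```python
# Toy cluster index (illustrative only): cluster id -> {dataset -> primary keys}.
clusters = {7: {"crm": {"a1", "a2"}, "billing": {"b9"}}}

def match_sketch(source_pk: str, source: str, targets: list[str]) -> dict:
    """Find the cluster containing source_pk, then list keys per target."""
    for members in clusters.values():
        if source_pk in members.get(source, set()):
            return {t: members.get(t, set()) for t in targets}
    # No cluster contains the key: every target matches nothing.
    return {t: set() for t in targets}
```

The real function additionally reconstructs the resolution lineage and applies the threshold before looking up cluster membership.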

results

Utilities for querying model results from the PostgreSQL backend.

Classes:

  • SourceInfo

    Information about a model’s sources.

Functions:

  • get_model_metadata

    Get metadata for a model resolution.

  • get_model_results

    Recover the model’s pairwise probabilities and return as a PyArrow table.

SourceInfo

Bases: NamedTuple

Information about a model’s sources.

Attributes:

left instance-attribute
left: int
right instance-attribute
right: int | None
left_ancestors instance-attribute
left_ancestors: set[int]
right_ancestors instance-attribute
right_ancestors: set[int] | None
get_model_metadata
get_model_metadata(
    engine: Engine, resolution: Resolutions
) -> ModelMetadata

Get metadata for a model resolution.

get_model_results
get_model_results(
    engine: Engine, resolution: Resolutions
) -> Table

Recover the model’s pairwise probabilities and return as a PyArrow table.

For each probability this model assigned:

  • Get its two immediate children
  • Filter for children that aren’t parents of other clusters this model scored
  • Determine left/right by tracing ancestry to source resolutions using query helpers

Parameters:

  • engine
    (Engine) –

    SQLAlchemy engine

  • resolution
    (Resolutions) –

    Resolution of type model to query

Returns:

  • Table

    Table containing the original pairwise probabilities