
PostgreSQL

A backend adapter for deploying Matchbox using PostgreSQL.

There are two graph-like trees in place here.

  • In the resolution subgraph, the tree is implemented as a closure table, enabling fast querying of root-to-leaf paths at the cost of redundancy
  • In the data subgraph, the tree is implemented as an adjacency list, which requires recursive queries to resolve but stores less data
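
The trade-off can be sketched with an in-memory SQLite database. The table and column names below mirror the ORM but are illustrative only:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Adjacency list: one row per edge; recursion needed to walk paths.
    CREATE TABLE contains (parent INTEGER, child INTEGER);
    INSERT INTO contains VALUES (1, 2), (2, 3);

    -- Closure table: one row per ancestor/descendant pair (redundant).
    CREATE TABLE resolution_from (parent INTEGER, child INTEGER, level INTEGER);
    INSERT INTO resolution_from VALUES (1, 2, 1), (2, 3, 1), (1, 3, 2);
""")

# Closure table: all descendants of node 1 in a single lookup.
closure = conn.execute(
    "SELECT child FROM resolution_from WHERE parent = 1 ORDER BY child"
).fetchall()

# Adjacency list: the same answer requires a recursive CTE.
adjacency = conn.execute("""
    WITH RECURSIVE descendants(id) AS (
        SELECT child FROM contains WHERE parent = 1
        UNION
        SELECT c.child FROM contains c JOIN descendants d ON c.parent = d.id
    )
    SELECT id FROM descendants ORDER BY id
""").fetchall()

assert closure == adjacency == [(2,), (3,)]
```

Both queries return the same descendants; the closure table pays in extra rows, the adjacency list pays at query time.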
erDiagram
    Sources {
        bigint resolution_id PK,FK
        string resolution_name
        string full_name
        bytes warehouse_hash
        string db_pk
    }
    SourceColumns {
        bigint column_id PK
        bigint source_id FK
        int column_index
        string column_name
        string column_type
    }
    Clusters {
        bigint cluster_id PK
        bytes cluster_hash
    }
    ClusterSourcePK {
        bigint pk_id PK
        bigint cluster_id FK
        bigint source_id FK
        string source_pk
    }
    Contains {
        bigint parent PK,FK
        bigint child PK,FK
    }
    Probabilities {
        bigint resolution PK,FK
        bigint cluster PK,FK
        smallint probability
    }
    Resolutions {
        bigint resolution_id PK
        bytes resolution_hash
        string type
        string name
        string description
        smallint truth
    }
    ResolutionFrom {
        bigint parent PK,FK
        bigint child PK,FK
        int level
        smallint truth_cache
    }

    Sources |o--|| Resolutions : ""
    Sources ||--o{ SourceColumns : ""
    Sources ||--o{ ClusterSourcePK : ""
    Clusters ||--o{ ClusterSourcePK : ""
    Clusters ||--o{ Probabilities : ""
    Clusters ||--o{ Contains : "parent"
    Contains }o--|| Clusters : "child"
    Resolutions ||--o{ Probabilities : ""
    Resolutions ||--o{ ResolutionFrom : "parent"
    ResolutionFrom }o--|| Resolutions : "child"

matchbox.server.postgresql

PostgreSQL adapter for Matchbox server.

Modules:

  • adapter

    PostgreSQL adapter for Matchbox server.

  • benchmark

    Benchmarking utilities for Matchbox’s PostgreSQL backend.

  • db

    Matchbox PostgreSQL database connection.

  • mixin

    A module for defining mixins for the PostgreSQL backend ORM.

  • orm

    ORM classes for the Matchbox PostgreSQL database.

  • utils

    Utilities for using the PostgreSQL backend.

Classes:

__all__ module-attribute

__all__ = ['MatchboxPostgres', 'MatchboxPostgresSettings']

MatchboxPostgres

MatchboxPostgres(settings: MatchboxPostgresSettings)

Bases: MatchboxDBAdapter

A PostgreSQL adapter for Matchbox.

Methods:

  • query

    Queries the database from an optional point of truth.

  • match

    Matches an ID in a source dataset and returns the keys in the targets.

  • index

    Indexes a source dataset from your warehouse in Matchbox.

  • get_source

    Get a source from its address.

  • get_resolution_sources

    Get a list of sources queryable from a resolution.

  • validate_ids

    Validates that a list of IDs exists in the database.

  • validate_hashes

    Validates that a list of hashes exists in the database.

  • cluster_id_to_hash

    Get a lookup of Cluster hashes from a list of IDs.

  • get_resolution_graph

    Get the full resolution graph.

  • dump

    Dumps the entire database to a snapshot.

  • drop

    Hard clear the database by dropping all tables and re-creating.

  • clear

    Soft clear the database by deleting all rows but retaining tables.

  • restore

    Restores the database from a snapshot.

  • verify

    Checks the database schema against expected and logs outcome.

  • insert_model

    Writes a model to Matchbox.

  • get_model

    Get a model from the database.

  • set_model_results

    Set the results for a model.

  • get_model_results

    Get the results for a model.

  • set_model_truth

    Sets the truth threshold for this model, changing the default clusters.

  • get_model_truth

    Gets the current truth threshold for this model.

  • get_model_ancestors

    Gets the current truth values of all ancestors.

  • set_model_ancestors_cache

    Updates the cached ancestor thresholds.

  • get_model_ancestors_cache

    Gets the cached ancestor thresholds, converting hashes to model names.

  • delete_model

    Delete a model from the database.

Attributes:

settings instance-attribute

settings = settings

datasets instance-attribute

datasets = Sources

models instance-attribute

models = FilteredResolutions(
    datasets=False, humans=False, models=True
)

source_resolutions instance-attribute

source_resolutions = FilteredResolutions(
    datasets=True, humans=False, models=False
)

data instance-attribute

data = FilteredClusters(has_dataset=True)

clusters instance-attribute

clusters = FilteredClusters(has_dataset=False)

merges instance-attribute

merges = Contains

creates instance-attribute

creates = FilteredProbabilities(over_truth=True)

proposes instance-attribute

proposes = FilteredProbabilities()

query

query(
    source_address: SourceAddress,
    resolution_name: str | None = None,
    threshold: int | None = None,
    limit: int | None = None,
) -> Table

Queries the database from an optional point of truth.

Parameters:

  • source_address
    (SourceAddress) –

    the SourceAddress object identifying the source to query

  • resolution_name
    (optional, default: None ) –

    the resolution to use for filtering results. If not specified, the dataset resolution for the queried source is used

  • threshold
    (optional, default: None ) –

    the threshold to use for creating clusters. If None, uses the model’s default threshold. If an integer, uses that threshold for the specified model, and the model’s cached thresholds for its ancestors

  • limit
    (optional, default: None ) –

    the number to use in a limit clause. Useful for testing

Returns:

  • Table

    The resulting Matchbox IDs in Arrow format
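
Threshold filtering can be sketched as follows. This assumes probabilities are scaled 0–100, consistent with the `smallint` columns in the schema, and uses toy dictionaries rather than the real tables:

```python
# Hypothetical probability rows, scaled 0-100 to fit a smallint column.
probabilities = [
    {"cluster": 1, "probability": 95},
    {"cluster": 2, "probability": 70},
    {"cluster": 3, "probability": 60},
]

def over_threshold(probs: list[dict], threshold: int) -> list[int]:
    """Keep only clusters whose probability meets the threshold."""
    return [p["cluster"] for p in probs if p["probability"] >= threshold]

assert over_threshold(probabilities, 80) == [1]
assert over_threshold(probabilities, 60) == [1, 2, 3]
```

Lowering the threshold admits more proposed clusters; raising it keeps only the most confident ones.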

match

Matches an ID in a source dataset and returns the keys in the targets.

Parameters:

  • source_pk
    (str) –

    The primary key to match from the source.

  • source
    (SourceAddress) –

    The address of the source dataset.

  • targets
    (list[SourceAddress]) –

    The addresses of the target datasets.

  • resolution_name
    (str) –

    The name of the resolution to use for matching.

  • threshold
    (optional, default: None ) –

    the threshold to use for creating clusters. If None, uses the resolution’s default threshold. If an integer, uses that threshold for the specified resolution, and the resolution’s cached thresholds for its ancestors. These values are used instead of the cached thresholds

index

index(source: Source, data_hashes: Table) -> None

Indexes a source dataset from your warehouse in Matchbox.

Parameters:

  • source
    (Source) –

    The source dataset to index.

  • data_hashes
    (Table) –

    The Arrow table with the hash of each data row

get_source

get_source(address: SourceAddress) -> Source

Get a source from its address.

Parameters:

  • address
    (SourceAddress) –

    The address of the source to retrieve.

Returns:

  • Source

    The requested Source object.

get_resolution_sources

get_resolution_sources(
    resolution_name: str,
) -> list[Source]

Get a list of sources queryable from a resolution.

Parameters:

  • resolution_name
    (str) –

    Name of the resolution to query.

Returns:

  • list[Source]

    List of relevant Source objects.

validate_ids

validate_ids(ids: list[int]) -> None

Validates that a list of IDs exists in the database.

Parameters:

  • ids
    (list[int]) –

    A list of IDs to validate.

Raises:

  • MatchboxDataNotFound

    If some items don’t exist in the target table.

validate_hashes

validate_hashes(hashes: list[bytes]) -> None

Validates that a list of hashes exists in the database.

Parameters:

  • hashes
    (list[bytes]) –

    A list of hashes to validate.

Raises:

  • MatchboxDataNotFound

    If some items don’t exist in the target table.

cluster_id_to_hash

cluster_id_to_hash(
    ids: list[int],
) -> dict[int, bytes | None]

Get a lookup of Cluster hashes from a list of IDs.

Parameters:

  • ids
    (list[int]) –

    A list of IDs to get hashes for.

Returns:

  • dict[int, bytes | None]

    A dictionary mapping IDs to hashes.

get_resolution_graph

get_resolution_graph() -> ResolutionGraph

Get the full resolution graph.

dump

dump() -> MatchboxSnapshot

Dumps the entire database to a snapshot.

Returns:

  • MatchboxSnapshot

    A MatchboxSnapshot object of type “postgres” with the database’s current state.

drop

drop(certain: bool = False) -> None

Hard clear the database by dropping all tables and re-creating.

Parameters:

  • certain
    (bool) –

    Whether to drop the database without confirmation.

clear

clear(certain: bool = False) -> None

Soft clear the database by deleting all rows but retaining tables.

Parameters:

  • certain
    (bool) –

    Whether to delete the database without confirmation.

restore

restore(
    snapshot: MatchboxSnapshot, clear: bool = False
) -> None

Restores the database from a snapshot.

Parameters:

  • snapshot
    (MatchboxSnapshot) –

    A MatchboxSnapshot object of type “postgres” with the database’s state

  • clear
    (bool) –

    Whether to clear the database before restoration

Raises:

  • TypeError

    If the snapshot is not compatible with PostgreSQL

verify

verify() -> None

Checks the database schema against expected and logs outcome.

insert_model

insert_model(model: ModelMetadata) -> None

Writes a model to Matchbox.

Parameters:

  • model
    (ModelMetadata) –

    ModelMetadata object with the model’s metadata

Raises:

  • MatchboxDataNotFound

    If, for a linker, the source models weren’t found in the database

get_model

get_model(model: str) -> ModelMetadata

Get a model from the database.

set_model_results

set_model_results(model: str, results: Table) -> None

Set the results for a model.

get_model_results

get_model_results(model: str) -> Table

Get the results for a model.

set_model_truth

set_model_truth(model: str, truth: int) -> None

Sets the truth threshold for this model, changing the default clusters.

get_model_truth

get_model_truth(model: str) -> int

Gets the current truth threshold for this model.

get_model_ancestors

get_model_ancestors(model: str) -> list[ModelAncestor]

Gets the current truth values of all ancestors.

Returns a list of ModelAncestor objects mapping model names to their current truth thresholds.

Unlike ancestors_cache which returns cached values, this property returns the current truth values of all ancestor models.

set_model_ancestors_cache

set_model_ancestors_cache(
    model: str, ancestors_cache: list[ModelAncestor]
) -> None

Updates the cached ancestor thresholds.

Parameters:

  • model
    (str) –

    The name of the model to update

  • ancestors_cache
    (list[ModelAncestor]) –

    List of ModelAncestor objects mapping model names to their truth thresholds

get_model_ancestors_cache

get_model_ancestors_cache(
    model: str,
) -> list[ModelAncestor]

Gets the cached ancestor thresholds, converting hashes to model names.

Returns a list of ModelAncestor objects mapping model names to their cached truth thresholds.

This is required because each point of truth needs to be stable, so we choose when to update it, caching the ancestor’s values in the model itself.
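A toy sketch of why the cache exists, using hypothetical dictionaries in place of the real ORM (the actual methods return ModelAncestor objects):

```python
# Hypothetical in-memory model store: the deduper's truth has been raised
# to 80 since the linker last cached it at 75.
models = {
    "deduper": {"truth": 80, "ancestors_cache": {}},
    "linker": {"truth": 90, "ancestors_cache": {"deduper": 75}},
}

def get_model_ancestors(name: str) -> dict[str, int]:
    # Live truth values: change whenever an ancestor's truth is updated.
    return {"deduper": models["deduper"]["truth"]} if name == "linker" else {}

def get_model_ancestors_cache(name: str) -> dict[str, int]:
    # Frozen values: only change when the cache is explicitly refreshed.
    return models[name]["ancestors_cache"]

# The linker's point of truth stays stable at the cached value until its
# cache is deliberately updated, even though the live value has moved.
assert get_model_ancestors("linker") == {"deduper": 80}
assert get_model_ancestors_cache("linker") == {"deduper": 75}
```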

delete_model

delete_model(model: str, certain: bool = False) -> None

Delete a model from the database.

Parameters:

  • model
    (str) –

    The name of the model to delete.

  • certain
    (bool) –

    Whether to delete the model without confirmation.

MatchboxPostgresSettings

Bases: MatchboxServerSettings

Settings for the Matchbox PostgreSQL backend.

Inherits the core settings and adds the PostgreSQL-specific settings.

Attributes:

model_config class-attribute instance-attribute

model_config = SettingsConfigDict(
    env_prefix="MB__SERVER__",
    env_nested_delimiter="__",
    use_enum_values=True,
    env_file=".env",
    env_file_encoding="utf-8",
    extra="ignore",
)

batch_size class-attribute instance-attribute

batch_size: int = Field(default=250000)

datastore instance-attribute

api_key class-attribute instance-attribute

api_key: SecretStr | None = Field(default=None)

log_level class-attribute instance-attribute

log_level: LogLevelType = 'INFO'

backend_type class-attribute instance-attribute

backend_type: MatchboxBackends = POSTGRES

postgres class-attribute instance-attribute

postgres: MatchboxPostgresCoreSettings = Field(
    default_factory=MatchboxPostgresCoreSettings
)

adapter

PostgreSQL adapter for Matchbox server.

Classes:

Attributes:

  • T
  • P

T module-attribute

T = TypeVar('T')

P module-attribute

P = ParamSpec('P')

FilteredClusters

Bases: BaseModel

Wrapper class for filtered cluster queries.

Methods:

  • count

    Counts the number of clusters in the database.

Attributes:

has_dataset class-attribute instance-attribute
has_dataset: bool | None = None
count
count() -> int

Counts the number of clusters in the database.

FilteredProbabilities

Bases: BaseModel

Wrapper class for filtered probability queries.

Methods:

  • count

    Counts the number of probabilities in the database.

Attributes:

over_truth class-attribute instance-attribute
over_truth: bool = False
count
count() -> int

Counts the number of probabilities in the database.

FilteredResolutions

Bases: BaseModel

Wrapper class for filtered resolution queries.

Methods:

  • count

    Counts the number of resolutions in the database.

Attributes:

datasets class-attribute instance-attribute
datasets: bool = False
humans class-attribute instance-attribute
humans: bool = False
models class-attribute instance-attribute
models: bool = False
count
count() -> int

Counts the number of resolutions in the database.

MatchboxPostgres

MatchboxPostgres(settings: MatchboxPostgresSettings)

Bases: MatchboxDBAdapter

A PostgreSQL adapter for Matchbox. See the full method and attribute reference above.

benchmark

Benchmarking utilities for Matchbox’s PostgreSQL backend.

Modules:

  • cluster_pipeline

    Script to benchmark the full dummy data generation pipeline.

  • generate_tables

    Generate tables for benchmarking PostgreSQL backend.

  • query

    Benchmarking utilities for the PostgreSQL backend.

cluster_pipeline

Script to benchmark the full dummy data generation pipeline.

Functions:

  • timer

    Context manager to time a block of code.

Attributes:

ROOT module-attribute
ROOT = parent
config module-attribute
config = PRESETS['l']
left_ids module-attribute
left_ids = tuple(range(config['dedupe_components']))
right_ids module-attribute
right_ids = tuple(
    range(
        config["dedupe_components"],
        config["dedupe_components"] * 2,
    )
)
probs module-attribute
probs = generate_dummy_probabilities(
    left_ids,
    right_ids,
    [0.6, 1],
    config["link_components"],
    config["link_len"],
)
all_probs module-attribute
all_probs = concat_arrays(
    [combine_chunks(), combine_chunks()]
)
lookup module-attribute
lookup = table(
    {
        "id": all_probs,
        "hash": array(
            [hash_data(p) for p in to_pylist()],
            type=large_binary(),
        ),
    }
)
hm module-attribute
probs_with_ccs module-attribute
probs_with_ccs = attach_components_to_probabilities(
    table(
        {
            "left_id": get_hashes(probs["left_id"]),
            "right_id": get_hashes(probs["right_id"]),
            "probability": probs["probability"],
        }
    )
)
hierarchy module-attribute
timer
timer(description: str)

Context manager to time a block of code.
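
A minimal sketch of what such a timing context manager typically looks like (the real implementation may format its output differently):

```python
import time
from contextlib import contextmanager

durations: list[float] = []

@contextmanager
def timer(description: str):
    """Time the enclosed block and report the elapsed seconds."""
    start = time.perf_counter()
    yield
    elapsed = time.perf_counter() - start
    durations.append(elapsed)
    print(f"{description}: {elapsed:.3f}s")

with timer("sleep"):
    time.sleep(0.01)
```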

generate_tables

Generate tables for benchmarking PostgreSQL backend.

Functions:

Attributes:

PRESETS module-attribute
PRESETS = {
    "xs": {
        "source_len": 10000,
        "dedupe_components": 8000,
        "dedupe_len": 2000,
        "link_components": 6000,
        "link_len": 10000,
    },
    "s": {
        "source_len": 100000,
        "dedupe_components": 80000,
        "dedupe_len": 20000,
        "link_components": 60000,
        "link_len": 100000,
    },
    "m": {
        "source_len": 1000000,
        "dedupe_components": 800000,
        "dedupe_len": 200000,
        "link_components": 600000,
        "link_len": 1000000,
    },
    "l": {
        "source_len": 10000000,
        "dedupe_components": 8000000,
        "dedupe_len": 2000000,
        "link_components": 6000000,
        "link_len": 10000000,
    },
    "xl": {
        "source_len": 100000000,
        "dedupe_components": 80000000,
        "dedupe_len": 20000000,
        "link_components": 60000000,
        "link_len": 100000000,
    },
}
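
Each preset scales the same ratios by a factor of ten. A sketch deriving them from `source_len` (the ratios are read off the table above, an observation rather than a documented contract):

```python
def preset(source_len: int) -> dict[str, int]:
    """Derive a benchmark preset from the source length."""
    return {
        "source_len": source_len,
        "dedupe_components": source_len * 8 // 10,  # 0.8x
        "dedupe_len": source_len * 2 // 10,         # 0.2x
        "link_components": source_len * 6 // 10,    # 0.6x
        "link_len": source_len,                     # 1.0x
    }

# Reproduces the "xs" preset exactly.
assert preset(10_000) == {
    "source_len": 10_000,
    "dedupe_components": 8_000,
    "dedupe_len": 2_000,
    "link_components": 6_000,
    "link_len": 10_000,
}
```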
generate_sources
generate_sources(
    dataset_start_id: int = 1,
) -> tuple[Table, Table]

Generate sources and source_columns tables.

Parameters:

  • dataset_start_id
    (int, default: 1 ) –

    Starting ID for dataset resolution IDs

Returns:

  • tuple[Table, Table]

    Tuple of (sources_table, source_columns_table) as PyArrow tables

generate_resolutions
generate_resolutions(dataset_start_id: int = 1) -> Table

Generate resolutions table.

Parameters:

  • dataset_start_id
    (int, default: 1 ) –

    Starting ID for dataset resolution IDs

Returns:

  • Table

    PyArrow resolutions table

generate_resolution_from
generate_resolution_from(
    dataset_start_id: int = 1,
) -> Table

Generate resolution_from table.

Parameters:

  • dataset_start_id
    (int, default: 1 ) –

    Starting ID for dataset resolution IDs

Returns:

  • Table

    PyArrow resolution_from table

generate_cluster_source
generate_cluster_source(
    range_left: int,
    range_right: int,
    source_id: int,
    cluster_start_id: int = 0,
    pk_start_id: int = 0,
) -> tuple[Table, Table]

Generate both Clusters and ClusterSourcePK tables for source rows.

Parameters:

  • range_left
    (int) –

    first ID to generate

  • range_right
    (int) –

    last ID to generate, plus one

  • source_id
    (int) –

    source ID for the source

  • cluster_start_id
    (int, default: 0 ) –

    Starting ID for clusters

  • pk_start_id
    (int, default: 0 ) –

    Starting ID for primary keys

Returns:

  • tuple[Table, Table]

    Tuple of (Clusters table, ClusterSourcePK table)

generate_result_tables
generate_result_tables(
    left_ids: Iterable[int],
    right_ids: Iterable[int] | None,
    resolution_id: int,
    next_id: int,
    n_components: int,
    n_probs: int,
    prob_min: float = 0.6,
    prob_max: float = 1,
) -> tuple[list[int], Table, Table, Table, int]

Generate probabilities, contains and clusters tables.

Parameters:

  • left_ids
    (Iterable[int]) –

    list of IDs for rows to dedupe, or for left rows to link

  • right_ids
    (Iterable[int] | None) –

    list of IDs for right rows to link

  • resolution_id
    (int) –

    ID of resolution for this dedupe or link model

  • next_id
    (int) –

    the next ID to use when generating IDs

  • n_components
    (int) –

    number of implied connected components

  • n_probs
    (int) –

    total number of probability edges to be generated

  • prob_min
    (float, default: 0.6 ) –

    minimum value for probabilities to be generated

  • prob_max
    (float, default: 1 ) –

    maximum value for probabilities to be generated

Returns:

  • tuple[list[int], Table, Table, Table, int]

    A tuple of: a list of top-level clusters; three PyArrow tables for probabilities, contains, and clusters; and the next ID to use in future calls

generate_all_tables

Make all six PostgreSQL backend tables.

It will create two sources, one deduper for each, and one linker from each deduper.

Parameters:

  • source_len
    (int) –

    length of each data source

  • dedupe_components
    (int) –

    number of connected components implied by each deduper

  • dedupe_len
    (int) –

    probabilities generated by each deduper

  • link_components
    (int) –

    number of connected components implied by each linker

  • link_len
    (int) –

    probabilities generated by each linker

  • cluster_start_id
    (int, default: 0 ) –

    Starting ID for clusters

  • dataset_start_id
    (int, default: 1 ) –

    Starting ID for dataset resolution IDs

  • pk_start_id
    (int, default: 0 ) –

    Starting ID for primary keys (globally unique)

Returns:

  • dict[str, Table]

    A dictionary where keys are table names and values are PyArrow tables

main

Command line tool for generating data.

Parameters:

  • settings
    (str) –

    The key of the settings dictionary to use

  • output_dir
    (Path) –

    Where to save the output files

  • cluster_start_id
    (int) –

    The first integer to use for clusters

  • dataset_start_id
    (int) –

    The first integer to use for datasets

Examples:

generate_tables.py \
    --settings xl \
    --output-dir data/v4 \
    --dataset-start-id 6742 \
    --cluster-start-id 7
generate_tables.py -s s -o data/v4 -d 1 -c 0

query

Benchmarking utilities for the PostgreSQL backend.

Functions:

  • compile_query_sql

    Compiles the SQL for query() based on a single point of truth and dataset.

  • compile_match_sql

    Compiles the SQL for match() based on a single point of truth and dataset.

compile_query_sql
compile_query_sql(
    point_of_truth: str, source_address: SourceAddress
) -> str

Compiles the SQL for query() based on a single point of truth and dataset.

Parameters:

  • point_of_truth
    (str) –

    The name of the resolution to use, like “linker_1”

  • source_address
    (SourceAddress) –

    The address of the source to retrieve

Returns:

  • str

    A compiled PostgreSQL query, including semicolon, ready to run on Matchbox

compile_match_sql

Compiles the SQL for match() based on a single point of truth and dataset.

Note this only tests the query that retrieves all valid matches for the supplied key. The actual match function goes on to merge this with the user’s requested target table(s).

Parameters:

  • source_pk
    (str) –

    The name of the primary key of the source table

  • source_address
    (SourceAddress) –

    The address of the source to use

  • point_of_truth
    (str) –

    The name of the resolution to use, like “linker_1”

Returns:

  • str

    A compiled PostgreSQL query, including semicolon, ready to run on Matchbox

db

Matchbox PostgreSQL database connection.

Classes:

Attributes:

MBDB module-attribute

MatchboxPostgresCoreSettings

Bases: BaseModel

PostgreSQL-specific settings for Matchbox.

Methods:

Attributes:

host instance-attribute
host: str
port instance-attribute
port: int
user instance-attribute
user: str
password instance-attribute
password: str
database instance-attribute
database: str
db_schema instance-attribute
db_schema: str
alembic_config class-attribute instance-attribute
alembic_config: Path = Field(
    default=Path(
        "src/matchbox/server/postgresql/alembic.ini"
    )
)
get_alembic_config
get_alembic_config() -> Config

Get the Alembic config.

MatchboxPostgresSettings

Bases: MatchboxServerSettings

Settings for the Matchbox PostgreSQL backend.

Inherits the core settings and adds the PostgreSQL-specific settings.

Attributes:

backend_type class-attribute instance-attribute
backend_type: MatchboxBackends = POSTGRES
postgres class-attribute instance-attribute
postgres: MatchboxPostgresCoreSettings = Field(
    default_factory=MatchboxPostgresCoreSettings
)
model_config class-attribute instance-attribute
model_config = SettingsConfigDict(
    env_prefix="MB__SERVER__",
    env_nested_delimiter="__",
    use_enum_values=True,
    env_file=".env",
    env_file_encoding="utf-8",
    extra="ignore",
)
batch_size class-attribute instance-attribute
batch_size: int = Field(default=250000)
datastore instance-attribute
api_key class-attribute instance-attribute
api_key: SecretStr | None = Field(default=None)
log_level class-attribute instance-attribute
log_level: LogLevelType = 'INFO'
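Because the settings above declare `env_prefix="MB__SERVER__"` and `env_nested_delimiter="__"`, nested PostgreSQL settings resolve from environment variables such as `MB__SERVER__POSTGRES__HOST`. A minimal sketch of that resolution logic, purely for illustration (this is not pydantic-settings’ code, and the sample values are hypothetical):

```python
# Illustrative sketch of nested env-var resolution with
# env_prefix="MB__SERVER__" and env_nested_delimiter="__".
def resolve_env(environ: dict[str, str]) -> dict:
    prefix, delim = "MB__SERVER__", "__"
    settings: dict = {}
    for key, value in environ.items():
        if not key.startswith(prefix):
            continue
        # split the suffix on the nested delimiter to build a nested dict
        parts = key[len(prefix):].lower().split(delim)
        node = settings
        for part in parts[:-1]:
            node = node.setdefault(part, {})
        node[parts[-1]] = value
    return settings

env = {
    "MB__SERVER__POSTGRES__HOST": "localhost",
    "MB__SERVER__POSTGRES__PORT": "5432",
    "MB__SERVER__LOG_LEVEL": "DEBUG",
}
print(resolve_env(env))
# {'postgres': {'host': 'localhost', 'port': '5432'}, 'log_level': 'DEBUG'}
```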

MatchboxDatabase

MatchboxDatabase(settings: MatchboxPostgresSettings)

Matchbox PostgreSQL database connection.

Methods:

Attributes:

settings instance-attribute
settings = settings
MatchboxBase instance-attribute
MatchboxBase = declarative_base(
    metadata=MetaData(schema=db_schema)
)
alembic_config instance-attribute
alembic_config = get_alembic_config()
connection_string
connection_string(driver: bool = True) -> str

Get the connection string for PostgreSQL.

get_engine
get_engine() -> Engine

Get the database engine.

get_session
get_session()

Get a new session.

get_adbc_connection
get_adbc_connection() -> Generator[Connection, Any, Any]

Get a new ADBC connection.

The connection must be used within a context manager.

run_migrations
run_migrations()

Create the database and all tables expected in the schema.

clear_database
clear_database()

Delete all rows in every table in the database schema.

drop_database
drop_database()

Drop all tables in the database schema and re-create them.

verify_schema
verify_schema()

Verify that the live database schema is in sync with the ORM.

If any differences are detected, log this as an error.

NOTE: this was originally implemented prior to Alembic. In principle Alembic is best placed to manage any such diff; this remains for now only as an informative aid and could be removed.

mixin

A module for defining mixins for the PostgreSQL backend ORM.

Classes:

  • CountMixin

    A mixin for counting the number of rows in a table.

Attributes:

  • T

T module-attribute

T = TypeVar('T')

CountMixin

A mixin for counting the number of rows in a table.

Methods:

  • count

    Counts the number of rows in the table.

count classmethod
count() -> int

Counts the number of rows in the table.
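The mixin’s contract is simple: any ORM class that mixes it in gains a `count()` classmethod running `SELECT count(*)` against its table. A minimal, self-contained analogue, with SQLite standing in for PostgreSQL and all names hypothetical:

```python
# Illustrative analogue of a row-counting mixin: classes declare
# __tablename__ and inherit a count() classmethod.
import sqlite3

CONN = sqlite3.connect(":memory:")

class CountMixinSketch:
    @classmethod
    def count(cls) -> int:
        # run SELECT count(*) against the class's table
        row = CONN.execute(f"SELECT count(*) FROM {cls.__tablename__}").fetchone()
        return row[0]

class Clusters(CountMixinSketch):
    __tablename__ = "clusters"

CONN.execute("CREATE TABLE clusters (cluster_id INTEGER PRIMARY KEY)")
CONN.executemany("INSERT INTO clusters VALUES (?)", [(1,), (2,), (3,)])
print(Clusters.count())  # 3
```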

orm

ORM classes for the Matchbox PostgreSQL database.

Classes:

  • ResolutionFrom

    Resolution lineage closure table with cached truth values.

  • Resolutions

    Table of resolution points: models, datasets and humans.

  • SourceColumns

    Table for storing column details for Sources.

  • ClusterSourcePK

    Table for storing source primary keys for clusters.

  • Sources

    Table of sources of data for Matchbox.

  • Contains

    Cluster lineage table.

  • Clusters

    Table of indexed data and clusters that match it.

  • Probabilities

    Table of probabilities that a cluster is correct, according to a resolution.

ResolutionFrom

Bases: CountMixin, MatchboxBase

Resolution lineage closure table with cached truth values.

Methods:

  • count

    Counts the number of rows in the table.

Attributes:

__tablename__ class-attribute instance-attribute
__tablename__ = 'resolution_from'
parent class-attribute instance-attribute
parent = Column(
    BIGINT,
    ForeignKey(
        "resolutions.resolution_id", ondelete="CASCADE"
    ),
    primary_key=True,
)
child class-attribute instance-attribute
child = Column(
    BIGINT,
    ForeignKey(
        "resolutions.resolution_id", ondelete="CASCADE"
    ),
    primary_key=True,
)
level class-attribute instance-attribute
level = Column(INTEGER, nullable=False)
truth_cache class-attribute instance-attribute
truth_cache = Column(SMALLINT, nullable=True)
__table_args__ class-attribute instance-attribute
__table_args__ = (
    CheckConstraint(
        "parent != child", name="no_self_reference"
    ),
    CheckConstraint("level > 0", name="positive_level"),
)
count classmethod
count() -> int

Counts the number of rows in the table.
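As noted in the overview, a closure table materialises every (ancestor, descendant, level) pair, so fetching a resolution’s full lineage is a single filter rather than a recursive walk. A toy sketch with hypothetical resolution names, mirroring the `parent`/`child`/`level` columns:

```python
# Build a closure table from direct parent -> child edges (in
# topological order), then query all ancestors with one filter.
edges = [("dataset_a", "deduper_1"), ("deduper_1", "linker_1")]

closure = []  # rows of (parent, child, level)
for parent, child in edges:
    closure.append((parent, child, 1))
    # extend every existing path that ends at `parent`
    closure += [(p, child, lvl + 1) for p, c, lvl in list(closure) if c == parent]

# all ancestors of linker_1, no recursion needed at query time
ancestors = {p for p, c, _ in closure if c == "linker_1"}
print(sorted(ancestors))  # ['dataset_a', 'deduper_1']
```

The redundancy mentioned in the overview is visible here: the transitive row `("dataset_a", "linker_1", 2)` is stored explicitly.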

Resolutions

Bases: CountMixin, MatchboxBase

Table of resolution points: models, datasets and humans.

Resolutions produce probabilities or own data in the clusters table.

Methods:

  • get_lineage

    Returns all ancestors and their cached truth values from this model.

  • get_lineage_to_dataset

    Returns the resolution lineage and cached truth values to a dataset.

  • next_id

    Returns the next available resolution_id.

  • count

    Counts the number of rows in the table.

Attributes:

__tablename__ class-attribute instance-attribute
__tablename__ = 'resolutions'
resolution_id class-attribute instance-attribute
resolution_id = Column(BIGINT, primary_key=True)
resolution_hash class-attribute instance-attribute
resolution_hash = Column(BYTEA, nullable=False)
type class-attribute instance-attribute
type = Column(TEXT, nullable=False)
name class-attribute instance-attribute
name = Column(TEXT, nullable=False)
description class-attribute instance-attribute
description = Column(TEXT)
truth class-attribute instance-attribute
truth = Column(SMALLINT)
source class-attribute instance-attribute
source = relationship(
    "Sources",
    back_populates="dataset_resolution",
    uselist=False,
)
probabilities class-attribute instance-attribute
probabilities = relationship(
    "Probabilities",
    back_populates="proposed_by",
    cascade="all, delete-orphan",
    passive_deletes=True,
)
children class-attribute instance-attribute
children = relationship(
    "Resolutions",
    secondary=__table__,
    primaryjoin="Resolutions.resolution_id == ResolutionFrom.parent",
    secondaryjoin="Resolutions.resolution_id == ResolutionFrom.child",
    backref="parents",
)
__table_args__ class-attribute instance-attribute
__table_args__ = (
    CheckConstraint(
        "type IN ('model', 'dataset', 'human')",
        name="resolution_type_constraints",
    ),
    UniqueConstraint(
        "resolution_hash", name="resolutions_hash_key"
    ),
    UniqueConstraint("name", name="resolutions_name_key"),
)
ancestors property
ancestors: set[Resolutions]

Returns all ancestors (parents, grandparents, etc.) of this resolution.

descendants property
descendants: set[Resolutions]

Returns all descendants (children, grandchildren, etc.) of this resolution.

get_lineage
get_lineage() -> dict[int, float]

Returns all ancestors and their cached truth values from this model.

get_lineage_to_dataset
get_lineage_to_dataset(
    dataset: Resolutions,
) -> tuple[bytes, dict[int, float]]

Returns the resolution lineage and cached truth values to a dataset.

next_id classmethod
next_id() -> int

Returns the next available resolution_id.

count classmethod
count() -> int

Counts the number of rows in the table.

SourceColumns

Bases: CountMixin, MatchboxBase

Table for storing column details for Sources.

Methods:

  • count

    Counts the number of rows in the table.

Attributes:

__tablename__ class-attribute instance-attribute
__tablename__ = 'source_columns'
column_id class-attribute instance-attribute
column_id = Column(BIGINT, primary_key=True)
source_id class-attribute instance-attribute
source_id = Column(
    BIGINT,
    ForeignKey("sources.source_id", ondelete="CASCADE"),
    nullable=False,
)
column_index class-attribute instance-attribute
column_index = Column(INTEGER, nullable=False)
column_name class-attribute instance-attribute
column_name = Column(TEXT, nullable=False)
column_type class-attribute instance-attribute
column_type = Column(TEXT, nullable=False)
source class-attribute instance-attribute
source = relationship('Sources', back_populates='columns')
__table_args__ class-attribute instance-attribute
__table_args__ = (
    UniqueConstraint(
        "source_id",
        "column_index",
        name="unique_column_index",
    ),
    Index("ix_source_columns_source_id", "source_id"),
)
count classmethod
count() -> int

Counts the number of rows in the table.

ClusterSourcePK

Bases: CountMixin, MatchboxBase

Table for storing source primary keys for clusters.

Methods:

  • next_id

    Returns the next available pk_id.

  • count

    Counts the number of rows in the table.

Attributes:

__tablename__ class-attribute instance-attribute
__tablename__ = 'cluster_source_pks'
pk_id class-attribute instance-attribute
pk_id = Column(BIGINT, primary_key=True)
cluster_id class-attribute instance-attribute
cluster_id = Column(
    BIGINT,
    ForeignKey("clusters.cluster_id", ondelete="CASCADE"),
    nullable=False,
)
source_id class-attribute instance-attribute
source_id = Column(
    BIGINT,
    ForeignKey("sources.source_id", ondelete="CASCADE"),
    nullable=False,
)
source_pk class-attribute instance-attribute
source_pk = Column(TEXT, nullable=False)
cluster class-attribute instance-attribute
cluster = relationship(
    "Clusters", back_populates="source_pks"
)
source class-attribute instance-attribute
source = relationship(
    "Sources", back_populates="cluster_source_pks"
)
__table_args__ class-attribute instance-attribute
__table_args__ = (
    Index("ix_cluster_source_pks_cluster_id", "cluster_id"),
    Index("ix_cluster_source_pks_source_pk", "source_pk"),
    UniqueConstraint(
        "pk_id", "source_id", name="unique_pk_source"
    ),
)
next_id classmethod
next_id() -> int

Returns the next available pk_id.

count classmethod
count() -> int

Counts the number of rows in the table.

Sources

Bases: CountMixin, MatchboxBase

Table of sources of data for Matchbox.

Methods:

  • list_all

    Returns all sources in the database.

  • to_common_source

    Convert ORM source to a matchbox.common Source object.

  • count

    Counts the number of rows in the table.

Attributes:

__tablename__ class-attribute instance-attribute
__tablename__ = 'sources'
source_id class-attribute instance-attribute
source_id = Column(
    BIGINT, Identity(start=1), primary_key=True
)
resolution_id class-attribute instance-attribute
resolution_id = Column(
    BIGINT,
    ForeignKey(
        "resolutions.resolution_id", ondelete="CASCADE"
    ),
)
resolution_name class-attribute instance-attribute
resolution_name = Column(TEXT, nullable=False)
full_name class-attribute instance-attribute
full_name = Column(TEXT, nullable=False)
warehouse_hash class-attribute instance-attribute
warehouse_hash = Column(BYTEA, nullable=False)
db_pk class-attribute instance-attribute
db_pk = Column(TEXT, nullable=False)
dataset_resolution class-attribute instance-attribute
dataset_resolution = relationship(
    "Resolutions", back_populates="source"
)
columns class-attribute instance-attribute
columns = relationship(
    "SourceColumns",
    back_populates="source",
    cascade="all, delete-orphan",
    passive_deletes=True,
)
cluster_source_pks class-attribute instance-attribute
cluster_source_pks = relationship(
    "ClusterSourcePK",
    back_populates="source",
    cascade="all, delete-orphan",
    passive_deletes=True,
)
clusters class-attribute instance-attribute
clusters = relationship(
    "Clusters",
    secondary=__table__,
    primaryjoin="Sources.source_id == ClusterSourcePK.source_id",
    secondaryjoin="ClusterSourcePK.cluster_id == Clusters.cluster_id",
    viewonly=True,
)
__table_args__ class-attribute instance-attribute
__table_args__ = (
    UniqueConstraint(
        "full_name",
        "warehouse_hash",
        name="unique_source_address",
    ),
)
list_all classmethod
list_all() -> list[Sources]

Returns all sources in the database.

to_common_source
to_common_source() -> list[Source]

Convert ORM source to a matchbox.common Source object.

count classmethod
count() -> int

Counts the number of rows in the table.

Contains

Bases: CountMixin, MatchboxBase

Cluster lineage table.

Methods:

  • count

    Counts the number of rows in the table.

Attributes:

__tablename__ class-attribute instance-attribute
__tablename__ = 'contains'
parent class-attribute instance-attribute
parent = Column(
    BIGINT,
    ForeignKey("clusters.cluster_id", ondelete="CASCADE"),
    primary_key=True,
)
child class-attribute instance-attribute
child = Column(
    BIGINT,
    ForeignKey("clusters.cluster_id", ondelete="CASCADE"),
    primary_key=True,
)
__table_args__ class-attribute instance-attribute
__table_args__ = (
    CheckConstraint(
        "parent != child", name="no_self_containment"
    ),
    Index("ix_contains_parent_child", "parent", "child"),
    Index("ix_contains_child_parent", "child", "parent"),
)
count classmethod
count() -> int

Counts the number of rows in the table.
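In contrast to the resolution closure table, Contains is a plain adjacency list: only direct parent/child edges are stored, so resolving a cluster down to its leaf members requires a recursive walk (a recursive CTE in SQL). A toy sketch with hypothetical cluster IDs:

```python
# Adjacency list of parent -> children edges, as in the Contains table.
contains = {10: [4, 5], 4: [1, 2], 5: [3]}

def leaves(cluster: int) -> set[int]:
    """Recursively collect the leaf clusters under `cluster`."""
    children = contains.get(cluster)
    if not children:  # no children: the cluster is itself a leaf
        return {cluster}
    out: set[int] = set()
    for child in children:
        out |= leaves(child)
    return out

print(sorted(leaves(10)))  # [1, 2, 3]
```

This is the trade-off the overview describes: less stored data, but recursive queries at read time.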

Clusters

Bases: CountMixin, MatchboxBase

Table of indexed data and clusters that match it.

Methods:

  • next_id

    Returns the next available cluster_id.

  • count

    Counts the number of rows in the table.

Attributes:

__tablename__ class-attribute instance-attribute
__tablename__ = 'clusters'
cluster_id class-attribute instance-attribute
cluster_id = Column(BIGINT, primary_key=True)
cluster_hash class-attribute instance-attribute
cluster_hash = Column(BYTEA, nullable=False)
source_pks class-attribute instance-attribute
source_pks = relationship(
    "ClusterSourcePK",
    back_populates="cluster",
    cascade="all, delete-orphan",
    passive_deletes=True,
)
probabilities class-attribute instance-attribute
probabilities = relationship(
    "Probabilities",
    back_populates="proposes",
    cascade="all, delete-orphan",
    passive_deletes=True,
)
children class-attribute instance-attribute
children = relationship(
    "Clusters",
    secondary=__table__,
    primaryjoin="Clusters.cluster_id == Contains.parent",
    secondaryjoin="Clusters.cluster_id == Contains.child",
    backref="parents",
)
sources class-attribute instance-attribute
sources = relationship(
    "Sources",
    secondary=__table__,
    primaryjoin="Clusters.cluster_id == ClusterSourcePK.cluster_id",
    secondaryjoin="ClusterSourcePK.source_id == Sources.source_id",
    viewonly=True,
)
__table_args__ class-attribute instance-attribute
__table_args__ = (
    UniqueConstraint(
        "cluster_hash", name="clusters_hash_key"
    ),
)
next_id classmethod
next_id() -> int

Returns the next available cluster_id.

count classmethod
count() -> int

Counts the number of rows in the table.

Probabilities

Bases: CountMixin, MatchboxBase

Table of probabilities that a cluster is correct, according to a resolution.

Methods:

  • count

    Counts the number of rows in the table.

Attributes:

__tablename__ class-attribute instance-attribute
__tablename__ = 'probabilities'
resolution class-attribute instance-attribute
resolution = Column(
    BIGINT,
    ForeignKey(
        "resolutions.resolution_id", ondelete="CASCADE"
    ),
    primary_key=True,
)
cluster class-attribute instance-attribute
cluster = Column(
    BIGINT,
    ForeignKey("clusters.cluster_id", ondelete="CASCADE"),
    primary_key=True,
)
probability class-attribute instance-attribute
probability = Column(SMALLINT, nullable=False)
proposed_by class-attribute instance-attribute
proposed_by = relationship(
    "Resolutions", back_populates="probabilities"
)
proposes class-attribute instance-attribute
proposes = relationship(
    "Clusters", back_populates="probabilities"
)
__table_args__ class-attribute instance-attribute
__table_args__ = (
    CheckConstraint(
        "probability BETWEEN 0 AND 100",
        name="valid_probability",
    ),
    Index("ix_probabilities_resolution", "resolution"),
)
count classmethod
count() -> int

Counts the number of rows in the table.

utils

Utilities for using the PostgreSQL backend.

Modules:

  • db

    General utilities for the PostgreSQL backend.

  • insert

    Utilities for inserting data into the PostgreSQL backend.

  • query

    Utilities for querying and matching in the PostgreSQL backend.

  • results

    Utilities for querying model results from the PostgreSQL backend.

db

General utilities for the PostgreSQL backend.

Functions:

resolve_model_name
resolve_model_name(
    model: str, engine: Engine
) -> Resolutions

Resolves a model name to a Resolution object.

Parameters:

  • model
    (str) –

    The name of the model to resolve.

  • engine
    (Engine) –

    The database engine.

Raises:

get_resolution_graph
get_resolution_graph(engine: Engine) -> ResolutionGraph

Retrieves the resolution graph.

dump
dump(engine: Engine) -> MatchboxSnapshot

Dumps the entire database to a snapshot.

Parameters:

  • engine
    (Engine) –

    The database engine.

Returns:

  • MatchboxSnapshot

    A MatchboxSnapshot object of type “postgres” with the database’s current state.

restore
restore(
    engine: Engine,
    snapshot: MatchboxSnapshot,
    batch_size: int,
) -> None

Restores the database from a snapshot.

Parameters:

  • engine
    (Engine) –

    The database engine.

  • snapshot
    (MatchboxSnapshot) –

    A MatchboxSnapshot object of type “postgres” with the database’s state

  • batch_size
    (int) –

    The number of records to insert in each batch

Raises:

sqa_profiled
sqa_profiled()

SQLAlchemy profiler.

Taken directly from their docs: https://docs.sqlalchemy.org/en/20/faq/performance.html#query-profiling

compile_sql
compile_sql(stmt: Select) -> str

Compiles a SQLAlchemy statement into a string.

Parameters:

  • stmt
    (Select) –

    The SQLAlchemy statement to compile.

Returns:

  • str

    The compiled SQL statement as a string.

large_ingest
large_ingest(
    data: Table,
    table_class: DeclarativeMeta,
    max_chunksize: int | None = None,
    upsert_keys: list[str] | None = None,
    update_columns: list[str] | None = None,
)

Append a PyArrow table to a PostgreSQL table using ADBC.

It either copies directly (erroring if primary key constraints are violated), or it can be run in a slower upsert mode using a staging table.

Parameters:

  • data
    (Table) –

    A PyArrow table to write.

  • table_class
    (DeclarativeMeta) –

    The SQLAlchemy ORM class for the table to write to.

  • max_chunksize
    (int | None, default: None ) –

    Size of data chunks to be read and copied.

  • upsert_keys
    (list[str] | None, default: None ) –

    Columns used as keys for “on conflict do update”. If passed, it will run ingest in slower upsert mode. If not passed and update_columns is passed, defaults to primary keys.

  • update_columns
    (list[str] | None, default: None ) –

    Columns to update when upserting. If passed, it will run ingest in slower upsert mode. If not passed and upsert_keys is passed, defaults to all other columns.

insert

Utilities for inserting data into the PostgreSQL backend.

Classes:

  • HashIDMap

    An object to help map between IDs and hashes.

Functions:

  • insert_dataset

    Indexes a dataset from your data warehouse within Matchbox.

  • insert_model

    Writes a model to Matchbox with a default truth value of 100.

  • insert_results

    Writes a results table to Matchbox.

HashIDMap
HashIDMap(start: int, lookup: Table = None)

An object to help map between IDs and hashes.

When given a set of IDs, returns their hashes. If any ID doesn’t have a hash, it will error.

When given a set of hashes, it will return their IDs. If any don’t have IDs, it will create one and return it as part of the set.

Parameters:

  • start
    (int) –

    The first integer to use for new IDs

  • lookup
    (optional, default: None ) –

    A lookup table to use for existing hashes

Methods:

  • get_hashes

    Returns the hashes of the given IDs.

  • get_ids

    Returns the IDs of the given hashes, assigning new IDs for unknown hashes.

Attributes:

next_int instance-attribute
next_int = start
lookup instance-attribute
lookup = from_arrays(
    [
        array([], type=uint64()),
        array([], type=large_binary()),
        array([], type=bool_()),
    ],
    names=["id", "hash", "new"],
)
get_hashes
get_hashes(ids: UInt64Array) -> LargeBinaryArray

Returns the hashes of the given IDs.

get_ids
get_ids(hashes: BinaryArray) -> UInt64Array

Returns the IDs of the given hashes, assigning new IDs for unknown hashes.

insert_dataset
insert_dataset(
    source: Source, data_hashes: Table, batch_size: int
) -> None

Indexes a dataset from your data warehouse within Matchbox.

insert_model
insert_model(
    model: str,
    left: Resolutions,
    right: Resolutions,
    description: str,
    engine: Engine,
) -> None

Writes a model to Matchbox with a default truth value of 100.

Parameters:

  • model
    (str) –

    Name of the new model

  • left
    (Resolutions) –

    Left parent of the model

  • right
    (Resolutions) –

    Right parent of the model. Same as left in a dedupe job

  • description
    (str) –

    Model description

  • engine
    (Engine) –

    SQLAlchemy engine instance

Raises:

  • MatchboxResolutionNotFoundError

    If the specified parent models don’t exist.

  • MatchboxResolutionNotFoundError

    If the specified model doesn’t exist.

insert_results
insert_results(
    resolution: Resolutions,
    engine: Engine,
    results: Table,
    batch_size: int,
) -> None

Writes a results table to Matchbox.

The PostgreSQL backend stores clusters in a hierarchical structure, where each component references its parent component at a higher threshold.

This means two-item components are synonymous with their original pairwise probabilities.

This allows easy querying of clusters at any threshold.

Parameters:

  • resolution
    (Resolutions) –

    Resolution of type model to associate results with

  • engine
    (Engine) –

    SQLAlchemy engine instance

  • results
    (Table) –

    A PyArrow results table with left_id, right_id, probability

  • batch_size
    (int) –

    Number of records to insert in each batch

Raises:

  • MatchboxResolutionNotFoundError

    If the specified model doesn’t exist.
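The hierarchy described above is what makes threshold queries cheap: because each component references its parents at a higher threshold, “clusters valid at threshold t” reduces to a filter on the stored probabilities. A toy illustration with hypothetical cluster IDs and scores:

```python
# Hypothetical probabilities one model assigned (0-100 scale, as in the
# Probabilities table): 12 is the parent merging clusters 10 and 11.
probabilities = {
    10: 95,  # {a, b}
    11: 80,  # {c, d}
    12: 70,  # {a, b, c, d}
}

def clusters_at(threshold: int) -> set[int]:
    """Clusters the model endorses at or above `threshold`."""
    return {c for c, p in probabilities.items() if p >= threshold}

print(sorted(clusters_at(75)))  # [10, 11]
print(sorted(clusters_at(60)))  # [10, 11, 12]
```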

query

Utilities for querying and matching in the PostgreSQL backend.

Functions:

  • query

    Queries Matchbox and the Source warehouse to retrieve linked data.

  • match

    Matches an ID in a source dataset and returns the keys in the targets.

Attributes:

  • T
T module-attribute
T = TypeVar('T')
query
query(
    source_address: SourceAddress,
    resolution_name: str | None = None,
    threshold: int | None = None,
    limit: int = None,
) -> Table

Queries Matchbox and the Source warehouse to retrieve linked data.

Takes the dictionaries of tables and fields outputted by selectors and queries the database for them. If a “point of truth” resolution is supplied, will attach the clusters this data belongs to.

To accomplish this, the function:

  • Iterates through each selector, and
    • Retrieves its data in Matchbox according to the optional point of truth, including its hash and cluster hash
    • Retrieves its raw data from its Source’s warehouse
    • Joins the two together
  • Unions the results, one row per item of data in the warehouses

Returns:

  • Table

    A table containing the requested data from each table, unioned together, with the hash key of each row in Matchbox

match
match(
    engine: Engine,
    source_pk: str,
    source: SourceAddress,
    targets: list[SourceAddress],
    resolution_name: str,
    threshold: int | None = None,
) -> list[Match]

Matches an ID in a source dataset and returns the keys in the targets.

To accomplish this, the function:

  • Reconstructs the resolution lineage from the specified resolution
  • Iterates through each target, and
    • Retrieves its cluster hash according to the resolution
    • Retrieves all other IDs in the cluster in the source dataset
    • Retrieves all other IDs in the cluster in the target dataset
  • Returns the results as Match objects, one per target
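The steps above can be sketched with in-memory data: find the cluster that contains the supplied source key, then read off every key the same cluster holds in each target dataset. All dataset names, keys, and the `cluster_members` structure are hypothetical, not the backend’s actual representation:

```python
# cluster_id -> {dataset_name: set of source primary keys}
cluster_members = {
    1: {"crm": {"c-42"}, "billing": {"b-7", "b-8"}},
    2: {"crm": {"c-99"}, "billing": set()},
}

def match_sketch(source_pk: str, source: str, targets: list[str]) -> dict[str, list[str]]:
    """Return the target keys sharing a cluster with (source, source_pk)."""
    for members in cluster_members.values():
        if source_pk in members.get(source, set()):
            return {t: sorted(members.get(t, set())) for t in targets}
    return {t: [] for t in targets}  # no cluster found

print(match_sketch("c-42", "crm", ["billing"]))  # {'billing': ['b-7', 'b-8']}
```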

results

Utilities for querying model results from the PostgreSQL backend.

Classes:

  • SourceInfo

    Information about a model’s sources.

Functions:

SourceInfo

Bases: NamedTuple

Information about a model’s sources.

Attributes:

left instance-attribute
left: int
right instance-attribute
right: int | None
left_ancestors instance-attribute
left_ancestors: set[int]
right_ancestors instance-attribute
right_ancestors: set[int] | None
get_model_metadata
get_model_metadata(
    engine: Engine, resolution: Resolutions
) -> ModelMetadata

Get metadata for a model resolution.

get_model_results
get_model_results(resolution: Resolutions) -> Table

Recover the model’s pairwise probabilities and return as a PyArrow table.

For each probability this model assigned:

  • Get its two immediate children
  • Filter for children that aren’t parents of other clusters this model scored
  • Determine left/right by tracing ancestry to source resolutions using query helpers

Parameters:

  • resolution
    (Resolutions) –

    Resolution of type model to query

Returns:

  • Table

    Table containing the original pairwise probabilities