PostgreSQL

A backend adapter for deploying Matchbox using PostgreSQL.

The schema contains two tree-like graphs.

  • In the resolution subgraph, the tree is implemented as a closure table, which enables quick querying of root-to-leaf paths at the cost of some redundancy.
  • In the data subgraph, the tree is implemented as an adjacency list, which stores less data but requires recursive queries to resolve.
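
The trade-off between the two representations can be illustrated with a minimal, in-memory sketch. It does not use the Matchbox ORM: the toy edges and the helper names (descendants_adjacency, build_closure, descendants_closure) are hypothetical and exist only for illustration.

```python
from __future__ import annotations

from collections import defaultdict

# Hypothetical toy tree: 1 -> 2 -> 3 and 1 -> 4, stored as (parent, child) edges.
adjacency = [(1, 2), (2, 3), (1, 4)]


def descendants_adjacency(edges: list[tuple[int, int]], root: int) -> set[int]:
    """Resolve descendants from an adjacency list by walking level by level."""
    children = defaultdict(list)
    for parent, child in edges:
        children[parent].append(child)
    found: set[int] = set()
    stack = [root]
    while stack:
        node = stack.pop()
        for child in children[node]:
            if child not in found:
                found.add(child)
                stack.append(child)
    return found


def build_closure(edges: list[tuple[int, int]]) -> set[tuple[int, int, int]]:
    """Materialise every (ancestor, descendant, level) triple up front.

    Assumes the edges form a tree or DAG; a cycle would never terminate.
    """
    closure: set[tuple[int, int, int]] = set()
    nodes = {node for edge in edges for node in edge}
    for node in nodes:
        frontier = {node}
        level = 0
        while frontier:
            level += 1
            frontier = {c for p, c in edges if p in frontier}
            closure.update((node, c, level) for c in frontier)
    return closure


def descendants_closure(closure: set[tuple[int, int, int]], root: int) -> set[int]:
    """Resolve descendants from a closure table with a single filter."""
    return {descendant for ancestor, descendant, _ in closure if ancestor == root}
```

In this toy case the closure table holds four rows for three edges, which is the redundancy paid for answering the lookup without recursion.
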
erDiagram
    SourceConfigs {
        bigint source_config_id PK
        bigint resolution_id FK
        string location_type
        string location_uri
        string extract_transform
    }
    SourceFields {
        bigint field_id PK
        bigint source_config_id FK
        int index
        string name
        string type
        bool is_key
    }
    Clusters {
        bigint cluster_id PK
        bytes cluster_hash
    }
    ClusterSourceKey {
        bigint key_id PK
        bigint cluster_id FK
        bigint source_config_id FK
        string key
    }
    Contains {
        bigint parent PK,FK
        bigint child PK,FK
    }
    PKSpace {
        bigint id
        bigint next_cluster_id
        bigint next_cluster_keys_id
    }
    Probabilities {
        bigint resolution PK,FK
        bigint cluster PK,FK
        smallint probability
    }
    Resolutions {
        bigint resolution_id PK
        string name
        string description
        string type
        bytes hash
        smallint truth
    }
    ResolutionFrom {
        bigint parent PK,FK
        bigint child PK,FK
        int level
        smallint truth_cache
    }

    SourceConfigs |o--|| Resolutions : ""
    SourceConfigs ||--o{ SourceFields : ""
    SourceConfigs ||--o{ ClusterSourceKey : ""
    Clusters ||--o{ ClusterSourceKey : ""
    Clusters ||--o{ Probabilities : ""
    Clusters ||--o{ Contains : "parent"
    Contains }o--|| Clusters : "child"
    Resolutions ||--o{ Probabilities : ""
    Resolutions ||--o{ ResolutionFrom : "parent"
    ResolutionFrom }o--|| Resolutions : "child"

matchbox.server.postgresql

PostgreSQL adapter for Matchbox server.

Modules:

  • adapter

    PostgreSQL adapter for Matchbox server.

  • db

    Matchbox PostgreSQL database connection.

  • mixin

    A module for defining mixins for the PostgreSQL backend ORM.

  • orm

    ORM classes for the Matchbox PostgreSQL database.

  • utils

    Utilities for using the PostgreSQL backend.

Classes:

__all__ module-attribute

__all__ = ['MatchboxPostgres', 'MatchboxPostgresSettings']

MatchboxPostgres

MatchboxPostgres(settings: MatchboxPostgresSettings)

Bases: MatchboxDBAdapter

A PostgreSQL adapter for Matchbox.

Methods:

Attributes:

settings instance-attribute

settings = settings

sources instance-attribute

sources = SourceConfigs

models instance-attribute

models = FilteredResolutions(
    sources=False, humans=False, models=True
)

source_resolutions instance-attribute

source_resolutions = FilteredResolutions(
    sources=True, humans=False, models=False
)

data instance-attribute

data = FilteredClusters(has_source=True)

clusters instance-attribute

clusters = FilteredClusters(has_source=False)

merges instance-attribute

merges = Contains

creates instance-attribute

creates = FilteredProbabilities(over_truth=True)

proposes instance-attribute

proposes = FilteredProbabilities()

query

query(
    source: SourceResolutionName,
    resolution: ResolutionName | None = None,
    threshold: int | None = None,
    limit: int | None = None,
) -> Table

Queries the database from an optional point of truth.

Parameters:

  • source
    (SourceResolutionName) –

    the SourceResolutionName string identifying the source to query

  • resolution
    (optional, default: None ) –

    the resolution to use for filtering results. If not specified, the source resolution for the queried source is used.

  • threshold
    (optional, default: None ) –

    the threshold to use for creating clusters. If None, uses the models’ default threshold. If an integer, uses that threshold for the specified model, and the model’s cached thresholds for its ancestors.

  • limit
    (optional, default: None ) –

    the number to use in a limit clause. Useful for testing

Returns:

  • Table

    The resulting matchbox IDs in Arrow format
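
The threshold rule described above can be sketched as a small pure function. This is an illustration of the documented behaviour, not the adapter's code: effective_thresholds, its arguments, and the resolution names used in the usage note are all hypothetical.

```python
from __future__ import annotations


def effective_thresholds(
    model: str,
    default_truth: int,
    ancestors_cache: dict[str, int],
    threshold: int | None = None,
) -> dict[str, int]:
    """Pick the threshold applied at the queried model and at each ancestor."""
    # None falls back to the model's own default; an integer overrides it.
    own = default_truth if threshold is None else threshold
    # Ancestors use the values cached on the queried model in both cases.
    return {model: own, **ancestors_cache}
```

For example, effective_thresholds("linker", 80, {"dedupe_a": 90}, threshold=70) applies 70 to the hypothetical "linker" resolution and keeps the cached 90 for its ancestor.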

match

Matches an ID in a source resolution and returns the keys in the targets.

Parameters:

  • key
    (str) –

    The key to match from the source.

  • source
    (SourceResolutionName) –

    The name of the source resolution.

  • targets
    (list[SourceResolutionName]) –

    The names of the target source resolutions.

  • resolution
    (ResolutionName) –

    The name of the resolution to use for matching.

  • threshold
    (optional, default: None ) –

    the threshold to use for creating clusters. If None, uses the resolutions’ default threshold. If an integer, uses that threshold for the specified resolution, and the resolution’s cached thresholds for its ancestors. Will use these threshold values instead of the cached thresholds.

index

index(
    source_config: SourceConfig, data_hashes: Table
) -> None

Indexes a source in your warehouse to Matchbox.

Parameters:

  • source_config
    (SourceConfig) –

    The source configuration to index.

  • data_hashes
    (Table) –

    The Arrow table with the hash of each data row

get_source_config

get_source_config(
    name: SourceResolutionName,
) -> SourceConfig

Get a source configuration from its resolution name.

Parameters:

Returns:

get_resolution_source_configs

get_resolution_source_configs(
    name: ModelResolutionName,
) -> list[SourceConfig]

Get a list of source configurations queriable from a resolution.

Parameters:

Returns:

validate_ids

validate_ids(ids: list[int]) -> None

Validates that every ID in a list exists in the database.

Parameters:

  • ids
    (list[int]) –

    A list of IDs to validate.

Raises:

  • MatchboxDataNotFound

    If some items don’t exist in the target table.

validate_hashes

validate_hashes(hashes: list[bytes]) -> None

Validates that every hash in a list exists in the database.

Parameters:

  • hashes
    (list[bytes]) –

    A list of hashes to validate.

Raises:

  • MatchboxDataNotFound

    If some items don’t exist in the target table.

cluster_id_to_hash

cluster_id_to_hash(
    ids: list[int],
) -> dict[int, bytes | None]

Get a lookup of Cluster hashes from a list of IDs.

Parameters:

  • ids
    (list[int]) –

    A list of IDs to get hashes for.

Returns:

  • dict[int, bytes | None]

    A dictionary mapping IDs to hashes.
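
The return type dict[int, bytes | None] suggests that an ID without a stored hash maps to None. A minimal sketch of that shape, assuming a plain dictionary (the hypothetical stored argument) in place of the clusters table:

```python
from __future__ import annotations


def cluster_id_to_hash(
    ids: list[int], stored: dict[int, bytes]
) -> dict[int, bytes | None]:
    """Build an ID-to-hash lookup; IDs with no stored hash map to None."""
    # Assumption: unknown IDs are reported as None rather than raising.
    return {cluster_id: stored.get(cluster_id) for cluster_id in ids}
```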

get_resolution_graph

get_resolution_graph() -> ResolutionGraph

Get the full resolution graph.

dump

dump() -> MatchboxSnapshot

Dumps the entire database to a snapshot.

Returns:

  • MatchboxSnapshot

    A MatchboxSnapshot object of type “postgres” with the database’s current state.

drop

drop(certain: bool = False) -> None

Hard clear the database by dropping all tables and re-creating them.

Parameters:

  • certain
    (bool) –

    Whether to drop the database without confirmation.

clear

clear(certain: bool = False) -> None

Soft clear the database by deleting all rows but retaining tables.

Parameters:

  • certain
    (bool) –

    Whether to delete the database without confirmation.

restore

restore(snapshot: MatchboxSnapshot) -> None

Restores the database from a snapshot.

Parameters:

  • snapshot
    (MatchboxSnapshot) –

    A MatchboxSnapshot object of type “postgres” with the database’s state

Raises:

  • TypeError

    If the snapshot is not compatible with PostgreSQL

insert_model

insert_model(model_config: ModelConfig) -> None

Writes a model to Matchbox.

Parameters:

  • model_config
    (ModelConfig) –

    ModelConfig object with the model’s metadata

Raises:

  • MatchboxDataNotFound

    If, for a linker, the source models weren’t found in the database

  • MatchboxModelConfigError

    If the model configuration is invalid, such as the resolutions sharing ancestors

get_model

get_model(name: ModelResolutionName) -> ModelConfig

Get a model from the database.

set_model_results

set_model_results(
    name: ModelResolutionName, results: Table
) -> None

Set the results for a model.

get_model_results

get_model_results(name: ModelResolutionName) -> Table

Get the results for a model.

set_model_truth

set_model_truth(
    name: ModelResolutionName, truth: int
) -> None

Sets the truth threshold for this model, changing the default clusters.

get_model_truth

get_model_truth(name: ModelResolutionName) -> int

Gets the current truth threshold for this model.

get_model_ancestors

get_model_ancestors(
    name: ModelResolutionName,
) -> list[ModelAncestor]

Gets the current truth values of all ancestors.

Returns a list of ModelAncestor objects mapping model resolution names to their current truth thresholds.

Unlike ancestors_cache, which returns cached values, this method returns the current truth values of all ancestor models.

set_model_ancestors_cache

set_model_ancestors_cache(
    name: ModelResolutionName,
    ancestors_cache: list[ModelAncestor],
) -> None

Updates the cached ancestor thresholds.

Parameters:

get_model_ancestors_cache

get_model_ancestors_cache(
    name: ModelResolutionName,
) -> list[ModelAncestor]

Gets the cached ancestor thresholds.

Returns a list of ModelAncestor objects mapping model resolution names to their cached truth thresholds.

This is required because each point of truth needs to be stable, so we choose when to update it, caching the ancestors’ values in the model itself.

delete_resolution

delete_resolution(
    name: ResolutionName, certain: bool = False
) -> None

Delete a resolution from the database.

Parameters:

  • name
    (ResolutionName) –

    The name of the resolution to delete.

  • certain
    (bool) –

    Whether to delete the resolution without confirmation.

MatchboxPostgresSettings

Bases: MatchboxServerSettings

Settings for the Matchbox PostgreSQL backend.

Inherits the core settings and adds the PostgreSQL-specific settings.

Attributes:

model_config class-attribute instance-attribute

model_config = SettingsConfigDict(
    env_prefix="MB__SERVER__",
    env_nested_delimiter="__",
    use_enum_values=True,
    env_file=".env",
    env_file_encoding="utf-8",
    extra="ignore",
)

batch_size class-attribute instance-attribute

batch_size: int = Field(default=250000)

datastore instance-attribute

api_key class-attribute instance-attribute

api_key: SecretStr | None = Field(default=None)

log_level class-attribute instance-attribute

log_level: LogLevelType = 'INFO'

backend_type class-attribute instance-attribute

backend_type: MatchboxBackends = POSTGRES

postgres class-attribute instance-attribute

postgres: MatchboxPostgresCoreSettings = Field(
    default_factory=MatchboxPostgresCoreSettings
)
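
With env_prefix="MB__SERVER__" and env_nested_delimiter="__", a variable such as MB__SERVER__POSTGRES__HOST should populate the nested postgres.host setting. The sketch below imitates that mapping in plain Python so it can run without pydantic-settings installed. The nest_env helper is hypothetical, and unlike the real library it leaves every value as a string.

```python
from __future__ import annotations


def nest_env(
    env: dict[str, str],
    prefix: str = "MB__SERVER__",
    delimiter: str = "__",
) -> dict:
    """Group prefixed environment variables into a nested dictionary."""
    nested: dict = {}
    for name, value in env.items():
        if not name.upper().startswith(prefix):
            continue  # Not a Matchbox server setting.
        # Strip the prefix, then split the remainder into a path of field names.
        path = name[len(prefix):].lower().split(delimiter)
        node = nested
        for part in path[:-1]:
            node = node.setdefault(part, {})
        node[path[-1]] = value
    return nested


settings = nest_env(
    {
        "MB__SERVER__BATCH_SIZE": "1000",
        "MB__SERVER__POSTGRES__HOST": "localhost",
        "MB__SERVER__POSTGRES__DB_SCHEMA": "mb",
        "UNRELATED": "ignored",
    }
)
```

Here settings contains a top-level batch_size entry and a nested postgres dictionary with host and db_schema. Single underscores inside a field name, as in DB_SCHEMA, are untouched because only the double underscore acts as the delimiter.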

adapter

PostgreSQL adapter for Matchbox server.

Classes:

Attributes:

  • T
  • P

T module-attribute

T = TypeVar('T')

P module-attribute

P = ParamSpec('P')

FilteredClusters

Bases: BaseModel

Wrapper class for filtered cluster queries.

Methods:

  • count

    Counts the number of clusters in the database.

Attributes:

has_source class-attribute instance-attribute
has_source: bool | None = None
count
count() -> int

Counts the number of clusters in the database.

FilteredProbabilities

Bases: BaseModel

Wrapper class for filtered probability queries.

Methods:

  • count

    Counts the number of probabilities in the database.

Attributes:

over_truth class-attribute instance-attribute
over_truth: bool = False
count
count() -> int

Counts the number of probabilities in the database.

FilteredResolutions

Bases: BaseModel

Wrapper class for filtered resolution queries.

Methods:

  • count

    Counts the number of resolutions in the database.

Attributes:

sources class-attribute instance-attribute
sources: bool = False
humans class-attribute instance-attribute
humans: bool = False
models class-attribute instance-attribute
models: bool = False
count
count() -> int

Counts the number of resolutions in the database.

MatchboxPostgres

MatchboxPostgres(settings: MatchboxPostgresSettings)

Bases: MatchboxDBAdapter

A PostgreSQL adapter for Matchbox.

Methods:

Attributes:

settings instance-attribute
settings = settings
sources instance-attribute
sources = SourceConfigs
models instance-attribute
models = FilteredResolutions(
    sources=False, humans=False, models=True
)
source_resolutions instance-attribute
source_resolutions = FilteredResolutions(
    sources=True, humans=False, models=False
)
data instance-attribute
data = FilteredClusters(has_source=True)
clusters instance-attribute
clusters = FilteredClusters(has_source=False)
merges instance-attribute
merges = Contains
creates instance-attribute
creates = FilteredProbabilities(over_truth=True)
proposes instance-attribute
proposes = FilteredProbabilities()
query
query(
    source: SourceResolutionName,
    resolution: ResolutionName | None = None,
    threshold: int | None = None,
    limit: int | None = None,
) -> Table

Queries the database from an optional point of truth.

Parameters:

  • source
    (SourceResolutionName) –

    the SourceResolutionName string identifying the source to query

  • resolution
    (optional, default: None ) –

    the resolution to use for filtering results. If not specified, the source resolution for the queried source is used.

  • threshold
    (optional, default: None ) –

    the threshold to use for creating clusters. If None, uses the models’ default threshold. If an integer, uses that threshold for the specified model, and the model’s cached thresholds for its ancestors.

  • limit
    (optional, default: None ) –

    the number to use in a limit clause. Useful for testing

Returns:

  • Table

    The resulting matchbox IDs in Arrow format

match

Matches an ID in a source resolution and returns the keys in the targets.

Parameters:

  • key
    (str) –

    The key to match from the source.

  • source
    (SourceResolutionName) –

    The name of the source resolution.

  • targets
    (list[SourceResolutionName]) –

    The names of the target source resolutions.

  • resolution
    (ResolutionName) –

    The name of the resolution to use for matching.

  • threshold
    (optional, default: None ) –

    the threshold to use for creating clusters. If None, uses the resolutions’ default threshold. If an integer, uses that threshold for the specified resolution, and the resolution’s cached thresholds for its ancestors. Will use these threshold values instead of the cached thresholds.

index
index(
    source_config: SourceConfig, data_hashes: Table
) -> None

Indexes a source in your warehouse to Matchbox.

Parameters:

  • source_config
    (SourceConfig) –

    The source configuration to index.

  • data_hashes
    (Table) –

    The Arrow table with the hash of each data row

get_source_config
get_source_config(
    name: SourceResolutionName,
) -> SourceConfig

Get a source configuration from its resolution name.

Parameters:

Returns:

get_resolution_source_configs
get_resolution_source_configs(
    name: ModelResolutionName,
) -> list[SourceConfig]

Get a list of source configurations queriable from a resolution.

Parameters:

Returns:

validate_ids
validate_ids(ids: list[int]) -> None

Validates that every ID in a list exists in the database.

Parameters:

  • ids
    (list[int]) –

    A list of IDs to validate.

Raises:

  • MatchboxDataNotFound

    If some items don’t exist in the target table.

validate_hashes
validate_hashes(hashes: list[bytes]) -> None

Validates that every hash in a list exists in the database.

Parameters:

  • hashes
    (list[bytes]) –

    A list of hashes to validate.

Raises:

  • MatchboxDataNotFound

    If some items don’t exist in the target table.

cluster_id_to_hash
cluster_id_to_hash(
    ids: list[int],
) -> dict[int, bytes | None]

Get a lookup of Cluster hashes from a list of IDs.

Parameters:

  • ids
    (list[int]) –

    A list of IDs to get hashes for.

Returns:

  • dict[int, bytes | None]

    A dictionary mapping IDs to hashes.

get_resolution_graph
get_resolution_graph() -> ResolutionGraph

Get the full resolution graph.

dump
dump() -> MatchboxSnapshot

Dumps the entire database to a snapshot.

Returns:

  • MatchboxSnapshot

    A MatchboxSnapshot object of type “postgres” with the database’s current state.

drop
drop(certain: bool = False) -> None

Hard clear the database by dropping all tables and re-creating them.

Parameters:

  • certain
    (bool) –

    Whether to drop the database without confirmation.

clear
clear(certain: bool = False) -> None

Soft clear the database by deleting all rows but retaining tables.

Parameters:

  • certain
    (bool) –

    Whether to delete the database without confirmation.

restore
restore(snapshot: MatchboxSnapshot) -> None

Restores the database from a snapshot.

Parameters:

  • snapshot
    (MatchboxSnapshot) –

    A MatchboxSnapshot object of type “postgres” with the database’s state

Raises:

  • TypeError

    If the snapshot is not compatible with PostgreSQL

insert_model
insert_model(model_config: ModelConfig) -> None

Writes a model to Matchbox.

Parameters:

  • model_config
    (ModelConfig) –

    ModelConfig object with the model’s metadata

Raises:

  • MatchboxDataNotFound

    If, for a linker, the source models weren’t found in the database

  • MatchboxModelConfigError

    If the model configuration is invalid, such as the resolutions sharing ancestors

get_model
get_model(name: ModelResolutionName) -> ModelConfig

Get a model from the database.

set_model_results
set_model_results(
    name: ModelResolutionName, results: Table
) -> None

Set the results for a model.

get_model_results
get_model_results(name: ModelResolutionName) -> Table

Get the results for a model.

set_model_truth
set_model_truth(
    name: ModelResolutionName, truth: int
) -> None

Sets the truth threshold for this model, changing the default clusters.

get_model_truth
get_model_truth(name: ModelResolutionName) -> int

Gets the current truth threshold for this model.

get_model_ancestors
get_model_ancestors(
    name: ModelResolutionName,
) -> list[ModelAncestor]

Gets the current truth values of all ancestors.

Returns a list of ModelAncestor objects mapping model resolution names to their current truth thresholds.

Unlike ancestors_cache, which returns cached values, this method returns the current truth values of all ancestor models.

set_model_ancestors_cache
set_model_ancestors_cache(
    name: ModelResolutionName,
    ancestors_cache: list[ModelAncestor],
) -> None

Updates the cached ancestor thresholds.

Parameters:

get_model_ancestors_cache
get_model_ancestors_cache(
    name: ModelResolutionName,
) -> list[ModelAncestor]

Gets the cached ancestor thresholds.

Returns a list of ModelAncestor objects mapping model resolution names to their cached truth thresholds.

This is required because each point of truth needs to be stable, so we choose when to update it, caching the ancestors’ values in the model itself.

delete_resolution
delete_resolution(
    name: ResolutionName, certain: bool = False
) -> None

Delete a resolution from the database.

Parameters:

  • name
    (ResolutionName) –

    The name of the resolution to delete.

  • certain
    (bool) –

    Whether to delete the resolution without confirmation.

db

Matchbox PostgreSQL database connection.

Classes:

Attributes:

MBDB module-attribute

MatchboxPostgresCoreSettings

Bases: BaseModel

PostgreSQL-specific settings for Matchbox.

Methods:

Attributes:

host instance-attribute
host: str
port instance-attribute
port: int
user instance-attribute
user: str
password instance-attribute
password: str
database instance-attribute
database: str
db_schema instance-attribute
db_schema: str
alembic_config class-attribute instance-attribute
alembic_config: Path = Field(
    default=Path(
        "src/matchbox/server/postgresql/alembic.ini"
    )
)
get_alembic_config
get_alembic_config() -> Config

Get the Alembic config.

MatchboxPostgresSettings

Bases: MatchboxServerSettings

Settings for the Matchbox PostgreSQL backend.

Inherits the core settings and adds the PostgreSQL-specific settings.

Attributes:

backend_type class-attribute instance-attribute
backend_type: MatchboxBackends = POSTGRES
postgres class-attribute instance-attribute
postgres: MatchboxPostgresCoreSettings = Field(
    default_factory=MatchboxPostgresCoreSettings
)
model_config class-attribute instance-attribute
model_config = SettingsConfigDict(
    env_prefix="MB__SERVER__",
    env_nested_delimiter="__",
    use_enum_values=True,
    env_file=".env",
    env_file_encoding="utf-8",
    extra="ignore",
)
batch_size class-attribute instance-attribute
batch_size: int = Field(default=250000)
datastore instance-attribute
api_key class-attribute instance-attribute
api_key: SecretStr | None = Field(default=None)
log_level class-attribute instance-attribute
log_level: LogLevelType = 'INFO'

MatchboxDatabase

MatchboxDatabase(settings: MatchboxPostgresSettings)

Matchbox PostgreSQL database connection.

Methods:

Attributes:

settings instance-attribute
settings = settings
MatchboxBase instance-attribute
MatchboxBase = declarative_base(
    metadata=MetaData(schema=db_schema)
)
alembic_config instance-attribute
alembic_config = get_alembic_config()
sorted_tables property
sorted_tables: list[Table]

Return a list of SQLAlchemy tables in order of creation.

connection_string
connection_string(driver: bool = True) -> str

Get the connection string for PostgreSQL.
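
A connection string of this kind could be assembled as in the sketch below. It assumes that driver=True adds a SQLAlchemy-style "+driver" suffix to the scheme. The driver name "psycopg" and the standalone function signature are assumptions for illustration, not taken from the Matchbox source.

```python
from urllib.parse import quote_plus


def connection_string(
    user: str,
    password: str,
    host: str,
    port: int,
    database: str,
    driver: bool = True,
    driver_name: str = "psycopg",  # Assumed driver; the real one may differ.
) -> str:
    """Build a PostgreSQL URL, optionally naming the driver in the scheme."""
    scheme = f"postgresql+{driver_name}" if driver else "postgresql"
    # Percent-encode credentials so characters such as "@" do not break the URL.
    return (
        f"{scheme}://{quote_plus(user)}:{quote_plus(password)}"
        f"@{host}:{port}/{database}"
    )
```

A driverless URL is the form usually expected by tools that are not SQLAlchemy, which may be why the flag exists.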

get_engine
get_engine() -> Engine

Get the database engine.

get_session
get_session() -> Session

Get a new session.

get_adbc_connection
get_adbc_connection() -> Generator[Connection, Any, Any]

Get a new ADBC connection.

The connection must be used within a context manager.

run_migrations
run_migrations()

Create the database and all tables expected in the schema.

clear_database
clear_database()

Delete all rows in every table in the database schema.

  • TRUNCATE tables that are part of the core ORM (preserves structure)
  • DROP tables that are not in the ORM (removes temporary/test tables)
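
The two rules above amount to a per-table decision, sketched here as a function that only plans SQL statements and does not execute them. The plan_clear name, the quoting style, and the CASCADE keyword are assumptions for illustration.

```python
def plan_clear(
    existing_tables: list[str], orm_tables: set[str], schema: str
) -> list[str]:
    """Return one statement per table: TRUNCATE for ORM tables, DROP otherwise."""
    statements = []
    for table in existing_tables:
        qualified = f'"{schema}"."{table}"'
        if table in orm_tables:
            # Core ORM table: remove the rows, keep the structure.
            statements.append(f"TRUNCATE TABLE {qualified} CASCADE;")
        else:
            # Temporary or test table: remove it entirely.
            statements.append(f"DROP TABLE {qualified} CASCADE;")
    return statements
```
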
drop_database
drop_database()

Drop all tables in the database schema and re-create them.

mixin

A module for defining mixins for the PostgreSQL backend ORM.

Classes:

  • CountMixin

    A mixin for counting the number of rows in a table.

Attributes:

  • T

T module-attribute

T = TypeVar('T')

CountMixin

A mixin for counting the number of rows in a table.

Methods:

  • count

    Counts the number of rows in the table.

count classmethod
count() -> int

Counts the number of rows in the table.
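
The mixin pattern can be shown with the standard library alone. The sketch below uses an in-memory SQLite database and a module-level connection instead of the SQLAlchemy session the real backend presumably uses, so the connection handling and the SQL text are assumptions.

```python
import sqlite3

# Stand-in database with three rows in a "clusters" table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE clusters (cluster_id INTEGER PRIMARY KEY)")
conn.executemany("INSERT INTO clusters VALUES (?)", [(1,), (2,), (3,)])


class CountMixin:
    """Adds a count() classmethod to any class that defines __tablename__."""

    __tablename__: str

    @classmethod
    def count(cls) -> int:
        # The table name comes from the class definition, never from user input.
        (total,) = conn.execute(
            f"SELECT COUNT(*) FROM {cls.__tablename__}"
        ).fetchone()
        return total


class Clusters(CountMixin):
    __tablename__ = "clusters"
```

Because count is a classmethod, it is called on the table class itself, as in Clusters.count().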

orm

ORM classes for the Matchbox PostgreSQL database.

Classes:

  • ResolutionFrom

    Resolution lineage closure table with cached truth values.

  • Resolutions

    Table of resolution points: models, sources and humans.

  • PKSpace

    Table used to reserve ranges of primary keys.

  • SourceFields

    Table for storing column details for SourceConfigs.

  • ClusterSourceKey

    Table for storing source primary keys for clusters.

  • SourceConfigs

    Table of source_configs of data for Matchbox.

  • Contains

    Cluster lineage table.

  • Clusters

    Table of indexed data and clusters that match it.

  • Probabilities

    Table of probabilities that a cluster is correct, according to a resolution.

  • Results

    Table of results for a resolution.

ResolutionFrom

Bases: CountMixin, MatchboxBase

Resolution lineage closure table with cached truth values.

Methods:

  • count

    Counts the number of rows in the table.

Attributes:

__tablename__ class-attribute instance-attribute
__tablename__ = 'resolution_from'
parent class-attribute instance-attribute
parent = Column(
    BIGINT,
    ForeignKey(
        "resolutions.resolution_id", ondelete="CASCADE"
    ),
    primary_key=True,
)
child class-attribute instance-attribute
child = Column(
    BIGINT,
    ForeignKey(
        "resolutions.resolution_id", ondelete="CASCADE"
    ),
    primary_key=True,
)
level class-attribute instance-attribute
level = Column(INTEGER, nullable=False)
truth_cache class-attribute instance-attribute
truth_cache = Column(SMALLINT, nullable=True)
__table_args__ class-attribute instance-attribute
__table_args__ = (
    CheckConstraint(
        "parent != child", name="no_self_reference"
    ),
    CheckConstraint("level > 0", name="positive_level"),
)
count classmethod
count() -> int

Counts the number of rows in the table.
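
A closure table has to be extended whenever a resolution is added. The sketch below shows one way to do that in memory, keeping the shortest level when two paths reach the same ancestor. The add_resolution helper and the dictionary layout are hypothetical; they mirror the parent, child and level columns and the two check constraints, not the backend's actual insert logic.

```python
from __future__ import annotations


def add_resolution(
    closure: dict[tuple[int, int], int], new_id: int, parent_ids: list[int]
) -> None:
    """Add closure rows linking new_id to its parents and all their ancestors."""
    for parent in parent_ids:
        if parent == new_id:
            # Mirrors the "parent != child" check constraint.
            raise ValueError("a resolution cannot be its own parent")
        # Ancestors of the parent, with their distance to the parent.
        inherited = [
            (ancestor, level)
            for (ancestor, descendant), level in closure.items()
            if descendant == parent
        ]
        for ancestor, level in [(parent, 0), *inherited]:
            key = (ancestor, new_id)
            # Levels start at 1, mirroring the "level > 0" check constraint.
            closure[key] = min(level + 1, closure.get(key, level + 1))


closure: dict[tuple[int, int], int] = {}
add_resolution(closure, new_id=2, parent_ids=[1])
add_resolution(closure, new_id=3, parent_ids=[2])
```

After the two calls, closure holds the direct rows (1, 2) and (2, 3) at level 1 and the derived row (1, 3) at level 2.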

Resolutions

Bases: CountMixin, MatchboxBase

Table of resolution points: models, sources and humans.

Resolutions produce probabilities or own data in the clusters table.

Methods:

  • get_lineage

    Returns lineage ordered by priority.

  • from_name

    Resolves a model resolution name to a Resolution object.

  • count

    Counts the number of rows in the table.

Attributes:

__tablename__ class-attribute instance-attribute
__tablename__ = 'resolutions'
resolution_id class-attribute instance-attribute
resolution_id = Column(
    BIGINT, primary_key=True, autoincrement=True
)
name class-attribute instance-attribute
name = Column(TEXT, nullable=False)
description class-attribute instance-attribute
description = Column(TEXT, nullable=True)
type class-attribute instance-attribute
type = Column(TEXT, nullable=False)
hash class-attribute instance-attribute
hash = Column(BYTEA, nullable=True)
truth class-attribute instance-attribute
truth = Column(SMALLINT, nullable=True)
source_config class-attribute instance-attribute
source_config = relationship(
    "SourceConfigs",
    back_populates="source_resolution",
    uselist=False,
)
probabilities class-attribute instance-attribute
probabilities = relationship(
    "Probabilities",
    back_populates="proposed_by",
    passive_deletes=True,
)
results class-attribute instance-attribute
results = relationship(
    "Results",
    back_populates="proposed_by",
    passive_deletes=True,
)
children class-attribute instance-attribute
children = relationship(
    "Resolutions",
    secondary=__table__,
    primaryjoin="Resolutions.resolution_id == ResolutionFrom.parent",
    secondaryjoin="Resolutions.resolution_id == ResolutionFrom.child",
    backref="parents",
)
__table_args__ class-attribute instance-attribute
__table_args__ = (
    CheckConstraint(
        "type IN ('model', 'source', 'human')",
        name="resolution_type_constraints",
    ),
    UniqueConstraint("name", name="resolutions_name_key"),
)
ancestors property
ancestors: set[Resolutions]

Returns all ancestors (parents, grandparents, etc.) of this resolution.

descendants property
descendants: set[Resolutions]

Returns descendants (children, grandchildren, etc.) of this resolution.

get_lineage
get_lineage(
    sources: list[SourceConfigs] | None = None,
    threshold: int | None = None,
) -> list[tuple[int, int, float | None]]

Returns lineage ordered by priority.

Highest priority (lowest level) first, then by resolution_id for stability.

Parameters:

  • sources
    (list[SourceConfigs] | None, default: None ) –

    If provided, only return lineage paths that lead to these sources

  • threshold
    (int | None, default: None ) –

    If provided, override this resolution’s threshold

Returns:

  • list[tuple[int, int, float | None]]

    List of tuples (resolution_id, source_config_id, threshold) ordered by priority.
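
The ordering rule (lowest level first, then resolution_id as a stable tie-breaker) can be sketched as a sort over rows that still carry their level. The order_lineage helper and the sample rows are hypothetical.

```python
from __future__ import annotations


def order_lineage(
    rows: list[tuple[int, int, float | None, int]],
) -> list[tuple[int, int, float | None]]:
    """Sort (resolution_id, source_config_id, threshold, level) rows by priority."""
    # Lowest level first, then resolution_id so equal levels have a fixed order.
    ordered = sorted(rows, key=lambda row: (row[3], row[0]))
    # Drop the level: callers only receive the three documented fields.
    return [(rid, sid, threshold) for rid, sid, threshold, _ in ordered]


lineage = order_lineage(
    [
        (7, 1, 0.9, 2),
        (3, 1, None, 1),
        (2, 1, 0.8, 1),
    ]
)
```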

from_name classmethod
from_name(
    name: ResolutionName,
    res_type: Literal["model", "source", "human"]
    | None = None,
    session: Session | None = None,
) -> Resolutions

Resolves a model resolution name to a Resolution object.

Parameters:

  • name
    (ResolutionName) –

    The name of the model to resolve.

  • res_type
    (Literal['model', 'source', 'human'] | None, default: None ) –

    A resolution type to use as filter.

  • session
    (Session | None, default: None ) –

    A session to get the resolution for updates.

Raises:

count classmethod
count() -> int

Counts the number of rows in the table.

PKSpace

Bases: MatchboxBase

Table used to reserve ranges of primary keys.

Methods:

  • initialise

    Create PKSpace tracking row if not exists.

  • reserve_block

    Atomically get next available ID for table, and increment it.

Attributes:

__tablename__ class-attribute instance-attribute
__tablename__ = 'pk_space'
id class-attribute instance-attribute
id = Column(BIGINT, primary_key=True)
next_cluster_id class-attribute instance-attribute
next_cluster_id = Column(BIGINT, nullable=False)
next_cluster_keys_id class-attribute instance-attribute
next_cluster_keys_id = Column(BIGINT, nullable=False)
initialise classmethod
initialise() -> None

Create PKSpace tracking row if not exists.

reserve_block classmethod
reserve_block(
    table: Literal["clusters", "cluster_keys"],
    block_size: int,
) -> int

Atomically get next available ID for table, and increment it.
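
Reserving a block means reading the next free ID and advancing the counter in one indivisible step, so that two writers never receive overlapping ranges. The sketch below models this in memory with a lock. In PostgreSQL the same effect could come from a single UPDATE ... RETURNING statement, but that detail and the starting value of 1 are assumptions, and the class name is hypothetical.

```python
import threading


class InMemoryPKSpace:
    """In-memory stand-in for the pk_space tracking row."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._next = {"clusters": 1, "cluster_keys": 1}

    def reserve_block(self, table: str, block_size: int) -> int:
        """Return the first ID of a freshly reserved block of block_size IDs."""
        if block_size < 1:
            raise ValueError("block_size must be at least 1")
        with self._lock:
            # Read and advance under the lock so blocks never overlap.
            start = self._next[table]
            self._next[table] = start + block_size
        return start


space = InMemoryPKSpace()
first = space.reserve_block("clusters", 100)
second = space.reserve_block("clusters", 50)
```

The caller owns the IDs from the returned value up to, but not including, the returned value plus block_size.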

SourceFields

Bases: CountMixin, MatchboxBase

Table for storing column details for SourceConfigs.

Methods:

  • count

    Counts the number of rows in the table.

Attributes:

__tablename__ class-attribute instance-attribute
__tablename__ = 'source_fields'
field_id class-attribute instance-attribute
field_id = Column(BIGINT, primary_key=True)
source_config_id class-attribute instance-attribute
source_config_id = Column(
    BIGINT,
    ForeignKey(
        "source_configs.source_config_id",
        ondelete="CASCADE",
    ),
    nullable=False,
)
index class-attribute instance-attribute
index = Column(INTEGER, nullable=False)
name class-attribute instance-attribute
name = Column(TEXT, nullable=False)
type class-attribute instance-attribute
type = Column(TEXT, nullable=False)
is_key class-attribute instance-attribute
is_key = Column(BOOLEAN, nullable=False)
source_config class-attribute instance-attribute
source_config = relationship(
    "SourceConfigs",
    back_populates="fields",
    foreign_keys=[source_config_id],
)
__table_args__ class-attribute instance-attribute
__table_args__ = (
    UniqueConstraint(
        "source_config_id", "index", name="unique_index"
    ),
    Index(
        "ix_source_columns_source_config_id",
        "source_config_id",
    ),
    Index(
        "ix_unique_key_field",
        "source_config_id",
        unique=True,
        postgresql_where=text("is_key = true"),
    ),
)
count classmethod
count() -> int

Counts the number of rows in the table.
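
The ix_unique_key_field index is a partial unique index: uniqueness on source_config_id is enforced only for rows where is_key is true, so each source configuration can have many fields but at most one key field. SQLite also supports partial indexes, which allows a self-contained demonstration. The table below is a simplified stand-in for source_fields, and the field names are made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE source_fields (
        field_id INTEGER PRIMARY KEY,
        source_config_id INTEGER NOT NULL,
        name TEXT NOT NULL,
        is_key INTEGER NOT NULL
    )
    """
)
# Unique only among key rows; non-key rows are not covered by the index.
conn.execute(
    "CREATE UNIQUE INDEX ix_unique_key_field "
    "ON source_fields (source_config_id) WHERE is_key = 1"
)

insert = "INSERT INTO source_fields (source_config_id, name, is_key) VALUES (?, ?, ?)"
conn.execute(insert, (1, "id", 1))
conn.execute(insert, (1, "company_name", 0))
conn.execute(insert, (1, "postcode", 0))

try:
    # A second key field for the same source configuration violates the index.
    conn.execute(insert, (1, "other_id", 1))
    second_key_allowed = True
except sqlite3.IntegrityError:
    second_key_allowed = False
```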

ClusterSourceKey

Bases: CountMixin, MatchboxBase

Table for storing source primary keys for clusters.

Methods:

  • count

    Counts the number of rows in the table.

Attributes:

__tablename__ class-attribute instance-attribute
__tablename__ = 'cluster_keys'
key_id class-attribute instance-attribute
key_id = Column(BIGINT, primary_key=True)
cluster_id class-attribute instance-attribute
cluster_id = Column(
    BIGINT,
    ForeignKey("clusters.cluster_id", ondelete="CASCADE"),
    nullable=False,
)
source_config_id class-attribute instance-attribute
source_config_id = Column(
    BIGINT,
    ForeignKey(
        "source_configs.source_config_id",
        ondelete="CASCADE",
    ),
    nullable=False,
)
key class-attribute instance-attribute
key = Column(TEXT, nullable=False)
cluster class-attribute instance-attribute
cluster = relationship('Clusters', back_populates='keys')
source_config class-attribute instance-attribute
source_config = relationship(
    "SourceConfigs", back_populates="cluster_keys"
)
__table_args__ class-attribute instance-attribute
__table_args__ = (
    Index("ix_cluster_keys_cluster_id", "cluster_id"),
    Index("ix_cluster_keys_keys", "key"),
    UniqueConstraint(
        "key_id",
        "source_config_id",
        name="unique_keys_source",
    ),
)
count classmethod
count() -> int

Counts the number of rows in the table.

SourceConfigs

SourceConfigs(
    key_field: SourceFields | None = None,
    index_fields: list[SourceFields] | None = None,
    **kwargs,
)

Bases: CountMixin, MatchboxBase

Table of source configurations for data in Matchbox.

Methods:

  • list_all

    Returns all source_configs in the database.

  • from_dto

    Create a SourceConfigs instance from a matchbox.common SourceConfig object.

  • to_dto

    Convert ORM source to a matchbox.common SourceConfig object.

  • count

    Counts the number of rows in the table.

Attributes:

__tablename__ class-attribute instance-attribute
__tablename__ = 'source_configs'
source_config_id class-attribute instance-attribute
source_config_id = Column(
    BIGINT, Identity(start=1), primary_key=True
)
resolution_id class-attribute instance-attribute
resolution_id = Column(
    BIGINT,
    ForeignKey(
        "resolutions.resolution_id", ondelete="CASCADE"
    ),
    nullable=False,
)
location_type class-attribute instance-attribute
location_type = Column(TEXT, nullable=False)
location_uri class-attribute instance-attribute
location_uri = Column(TEXT, nullable=False)
extract_transform class-attribute instance-attribute
extract_transform = Column(TEXT, nullable=False)
name property
name: str

Get the name of the related resolution.

source_resolution class-attribute instance-attribute
source_resolution = relationship(
    "Resolutions", back_populates="source_config"
)
fields class-attribute instance-attribute
fields = relationship(
    "SourceFields",
    back_populates="source_config",
    passive_deletes=True,
    cascade="all, delete-orphan",
)
key_field class-attribute instance-attribute
key_field = relationship(
    "SourceFields",
    primaryjoin="and_(SourceConfigs.source_config_id == SourceFields.source_config_id, SourceFields.is_key == True)",
    viewonly=True,
    uselist=False,
)
index_fields class-attribute instance-attribute
index_fields = relationship(
    "SourceFields",
    primaryjoin="and_(SourceConfigs.source_config_id == SourceFields.source_config_id, SourceFields.is_key == False)",
    viewonly=True,
    order_by="SourceFields.index",
    collection_class=list,
)
cluster_keys class-attribute instance-attribute
cluster_keys = relationship(
    "ClusterSourceKey",
    back_populates="source_config",
    passive_deletes=True,
)
clusters class-attribute instance-attribute
clusters = relationship(
    "Clusters",
    secondary=__table__,
    primaryjoin="SourceConfigs.source_config_id == ClusterSourceKey.source_config_id",
    secondaryjoin="ClusterSourceKey.cluster_id == Clusters.cluster_id",
    viewonly=True,
)
list_all classmethod
list_all() -> list[SourceConfigs]

Returns all source_configs in the database.

from_dto classmethod
from_dto(
    resolution: Resolutions, source_config: SourceConfig
) -> SourceConfigs

Create a SourceConfigs instance from a matchbox.common SourceConfig object.

to_dto
to_dto() -> SourceConfig

Convert ORM source to a matchbox.common SourceConfig object.

count classmethod
count() -> int

Counts the number of rows in the table.

Contains

Bases: CountMixin, MatchboxBase

Cluster lineage table.

Methods:

  • count

    Counts the number of rows in the table.

Attributes:

__tablename__ class-attribute instance-attribute
__tablename__ = 'contains'
root class-attribute instance-attribute
root = Column(
    BIGINT,
    ForeignKey("clusters.cluster_id", ondelete="CASCADE"),
    primary_key=True,
)
leaf class-attribute instance-attribute
leaf = Column(
    BIGINT,
    ForeignKey("clusters.cluster_id", ondelete="CASCADE"),
    primary_key=True,
)
__table_args__ class-attribute instance-attribute
__table_args__ = (
    CheckConstraint(
        "root != leaf", name="no_self_containment"
    ),
    Index("ix_contains_root_leaf", "root", "leaf"),
    Index("ix_contains_leaf_root", "leaf", "root"),
)
count classmethod
count() -> int

Counts the number of rows in the table.
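The Contains table stores one row per (root, leaf) pair, forbids a cluster from containing itself, and is indexed in both directions. The sketch below reproduces that shape in an in-memory SQLite database as a stand-in for PostgreSQL. The contains_demo table name, the helper functions and the cluster IDs are assumptions made for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE contains_demo (
        root INTEGER NOT NULL,
        leaf INTEGER NOT NULL,
        PRIMARY KEY (root, leaf),
        CONSTRAINT no_self_containment CHECK (root != leaf)
    );
    CREATE INDEX ix_contains_leaf_root ON contains_demo (leaf, root);
    INSERT INTO contains_demo (root, leaf)
    VALUES (10, 1), (10, 2), (11, 3), (20, 1), (20, 2), (20, 3);
    """
)


def leaves_of(root: int) -> list:
    """All leaves contained in a root cluster (served by the primary key)."""
    query = "SELECT leaf FROM contains_demo WHERE root = ? ORDER BY leaf"
    return [row[0] for row in conn.execute(query, (root,)).fetchall()]


def roots_of(leaf: int) -> list:
    """All root clusters that contain a leaf (served by the leaf/root index)."""
    query = "SELECT root FROM contains_demo WHERE leaf = ? ORDER BY root"
    return [row[0] for row in conn.execute(query, (leaf,)).fetchall()]


try:
    conn.execute("INSERT INTO contains_demo (root, leaf) VALUES (5, 5)")
    self_containment_allowed = True
except sqlite3.IntegrityError:
    # The CHECK constraint rejects a cluster that contains itself.
    self_containment_allowed = False
```

Having an index in each column order means both the "which leaves are in this cluster" and the "which clusters hold this leaf" lookups can be answered from an index.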

Clusters

Bases: CountMixin, MatchboxBase

Table of indexed data and clusters that match it.

Methods:

  • count

    Counts the number of rows in the table.

Attributes:

__tablename__ class-attribute instance-attribute
__tablename__ = 'clusters'
cluster_id class-attribute instance-attribute
cluster_id = Column(BIGINT, primary_key=True)
cluster_hash class-attribute instance-attribute
cluster_hash = Column(BYTEA, nullable=False)
keys class-attribute instance-attribute
keys = relationship(
    "ClusterSourceKey",
    back_populates="cluster",
    passive_deletes=True,
)
probabilities class-attribute instance-attribute
probabilities = relationship(
    "Probabilities",
    back_populates="proposes",
    passive_deletes=True,
)
leaves class-attribute instance-attribute
leaves = relationship(
    "Clusters",
    secondary=__table__,
    primaryjoin="Clusters.cluster_id == Contains.root",
    secondaryjoin="Clusters.cluster_id == Contains.leaf",
    backref="roots",
)
source_configs class-attribute instance-attribute
source_configs = relationship(
    "SourceConfigs",
    secondary=__table__,
    primaryjoin="Clusters.cluster_id == ClusterSourceKey.cluster_id",
    secondaryjoin="ClusterSourceKey.source_config_id == SourceConfigs.source_config_id",
    viewonly=True,
)
__table_args__ class-attribute instance-attribute
__table_args__ = (
    UniqueConstraint(
        "cluster_hash", name="clusters_hash_key"
    ),
)
count classmethod
count() -> int

Counts the number of rows in the table.

Probabilities

Bases: CountMixin, MatchboxBase

Table of probabilities that a cluster is correct, according to a resolution.

Methods:

  • count

    Counts the number of rows in the table.

Attributes:

__tablename__ class-attribute instance-attribute
__tablename__ = 'probabilities'
resolution_id class-attribute instance-attribute
resolution_id = Column(
    BIGINT,
    ForeignKey(
        "resolutions.resolution_id", ondelete="CASCADE"
    ),
    primary_key=True,
)
cluster_id class-attribute instance-attribute
cluster_id = Column(
    BIGINT,
    ForeignKey("clusters.cluster_id", ondelete="CASCADE"),
    primary_key=True,
)
probability class-attribute instance-attribute
probability = Column(SMALLINT, nullable=False)
proposed_by class-attribute instance-attribute
proposed_by = relationship(
    "Resolutions", back_populates="probabilities"
)
proposes class-attribute instance-attribute
proposes = relationship(
    "Clusters", back_populates="probabilities"
)
__table_args__ class-attribute instance-attribute
__table_args__ = (
    CheckConstraint(
        "probability BETWEEN 0 AND 100",
        name="valid_probability",
    ),
    Index("ix_probabilities_resolution", "resolution_id"),
)
count classmethod
count() -> int

Counts the number of rows in the table.

Results

Bases: CountMixin, MatchboxBase

Table of results for a resolution.

Stores the raw left/right probabilities created by a model.

Methods:

  • count

    Counts the number of rows in the table.

Attributes:

__tablename__ class-attribute instance-attribute
__tablename__ = 'results'
result_id class-attribute instance-attribute
result_id = Column(
    BIGINT, primary_key=True, autoincrement=True
)
resolution_id class-attribute instance-attribute
resolution_id = Column(
    BIGINT,
    ForeignKey(
        "resolutions.resolution_id", ondelete="CASCADE"
    ),
    nullable=False,
)
left_id class-attribute instance-attribute
left_id = Column(
    BIGINT,
    ForeignKey("clusters.cluster_id", ondelete="CASCADE"),
    nullable=False,
)
right_id class-attribute instance-attribute
right_id = Column(
    BIGINT,
    ForeignKey("clusters.cluster_id", ondelete="CASCADE"),
    nullable=False,
)
probability class-attribute instance-attribute
probability = Column(SMALLINT, nullable=False)
proposed_by class-attribute instance-attribute
proposed_by = relationship(
    "Resolutions", back_populates="results"
)
__table_args__ class-attribute instance-attribute
__table_args__ = (
    Index("ix_results_resolution", "resolution_id"),
    CheckConstraint(
        "probability BETWEEN 0 AND 100",
        name="valid_probability",
    ),
    UniqueConstraint(
        "resolution_id", "left_id", "right_id"
    ),
)
count classmethod
count() -> int

Counts the number of rows in the table.

utils

Utilities for using the PostgreSQL backend.

Modules:

  • db

    General utilities for the PostgreSQL backend.

  • insert

    Utilities for inserting data into the PostgreSQL backend.

  • query

    Utilities for querying and matching in the PostgreSQL backend.

  • results

    Utilities for querying model results from the PostgreSQL backend.

db

General utilities for the PostgreSQL backend.

Functions:

get_resolution_graph
get_resolution_graph() -> ResolutionGraph

Retrieves the resolution graph.

dump
dump() -> MatchboxSnapshot

Dumps the entire database to a snapshot.

Returns:

  • MatchboxSnapshot

    A MatchboxSnapshot object of type “postgres” with the database’s current state.

restore
restore(
    snapshot: MatchboxSnapshot, batch_size: int
) -> None

Restores the database from a snapshot.

Parameters:

  • snapshot
    (MatchboxSnapshot) –

    A MatchboxSnapshot object of type “postgres” with the database’s state

  • batch_size
    (int) –

    The number of records to insert in each batch

Raises:

sqa_profiled
sqa_profiled()

SQLAlchemy profiler.

Taken directly from the SQLAlchemy docs: https://docs.sqlalchemy.org/en/20/faq/performance.html#query-profiling
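The recipe on that page wraps a block of code in a cProfile session and prints statistics sorted by cumulative time. The following is a sketch in the same spirit using only the standard library. The name profiled_sketch and the choice to yield the report buffer are assumptions for illustration and may differ from what sqa_profiled does.

```python
import contextlib
import cProfile
import io
import pstats


@contextlib.contextmanager
def profiled_sketch():
    """Profile the enclosed block and print cumulative-time statistics."""
    profiler = cProfile.Profile()
    report = io.StringIO()
    profiler.enable()
    try:
        yield report  # the buffer is filled in once the block exits
    finally:
        profiler.disable()
        stats = pstats.Stats(profiler, stream=report).sort_stats("cumulative")
        stats.print_stats()
        print(report.getvalue())


# Usage: wrap the code to be measured, for example a block of ORM queries.
with profiled_sketch() as report:
    total = sum(i * i for i in range(1000))
```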

compile_sql
compile_sql(stmt: Select) -> str

Compiles a SQLAlchemy statement into a string.

Parameters:

  • stmt
    (Select) –

    The SQLAlchemy statement to compile.

Returns:

  • str

    The compiled SQL statement as a string.

large_ingest
large_ingest(
    data: Table,
    table_class: DeclarativeMeta,
    max_chunksize: int | None = None,
    upsert_keys: list[str] | None = None,
    update_columns: list[str] | None = None,
)

Append a PyArrow table to a PostgreSQL table using ADBC.

It either copies the data directly, raising an error if a primary key constraint is violated, or runs in upsert mode through a staging table, which is slower.

Parameters:

  • data
    (Table) –

    A PyArrow table to write.

  • table_class
    (DeclarativeMeta) –

    The SQLAlchemy ORM class for the table to write to.

  • max_chunksize
    (int | None, default: None ) –

    Size of data chunks to be read and copied.

  • upsert_keys
    (list[str] | None, default: None ) –

    Columns used as keys for “on conflict do update”. If passed, the ingest runs in the slower upsert mode. If omitted while update_columns is passed, defaults to the primary keys.

  • update_columns
    (list[str] | None, default: None ) –

    Columns to update when upserting. If passed, the ingest runs in the slower upsert mode. If omitted while upsert_keys is passed, defaults to all other columns.
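The way the two upsert parameters default to each other can be sketched as a small pure-Python helper. This is an illustration of the rules described above, not the function's actual code. The helper names resolve_upsert_columns and build_upsert_sql, the staging table name, and the generated SQL text are all assumptions.

```python
from typing import Optional, Sequence


def resolve_upsert_columns(
    all_columns: Sequence[str],
    primary_keys: Sequence[str],
    upsert_keys: Optional[Sequence[str]] = None,
    update_columns: Optional[Sequence[str]] = None,
):
    """Return (keys, updates) for upsert mode, or None for a direct copy."""
    if upsert_keys is None and update_columns is None:
        return None  # neither argument passed: plain copy, no staging table
    keys = list(upsert_keys) if upsert_keys is not None else list(primary_keys)
    if update_columns is not None:
        updates = list(update_columns)
    else:
        updates = [col for col in all_columns if col not in keys]
    return keys, updates


def build_upsert_sql(target, staging, all_columns, keys, updates) -> str:
    """Build PostgreSQL-style SQL that merges a staging table into a target."""
    cols = ", ".join(all_columns)
    assignments = ", ".join(f"{col} = EXCLUDED.{col}" for col in updates)
    return (
        f"INSERT INTO {target} ({cols}) SELECT {cols} FROM {staging} "
        f"ON CONFLICT ({', '.join(keys)}) DO UPDATE SET {assignments}"
    )


columns = ["resolution_id", "cluster_id", "probability"]
pks = ["resolution_id", "cluster_id"]

direct = resolve_upsert_columns(columns, pks)
keys, updates = resolve_upsert_columns(columns, pks, update_columns=["probability"])
sql = build_upsert_sql("probabilities", "staging_probabilities", columns, keys, updates)
```

Passing only update_columns makes the keys fall back to the primary keys, while passing neither argument selects the faster direct copy.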

ingest_to_temporary_table
ingest_to_temporary_table(
    table_name: str,
    schema_name: str,
    data: Table,
    column_types: dict[str, type[TypeEngine]],
    max_chunksize: int | None = None,
) -> Generator[Table, None, None]

Context manager to ingest Arrow data to a temporary table with explicit types.

Parameters:

  • table_name
    (str) –

    Base name for the temporary table

  • schema_name
    (str) –

    Schema where the temporary table will be created

  • data
    (Table) –

    PyArrow table containing the data to ingest

  • column_types
    (dict[str, type[TypeEngine]]) –

    Map of column names to SQLAlchemy types

  • max_chunksize
    (int | None, default: None ) –

    Optional maximum chunk size for batches

Yields:

  • Table

    A SQLAlchemy Table object representing the temporary table
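The general pattern of a context manager that creates a uniquely named table, loads data into it, hands it to the caller and drops it afterwards can be sketched as follows. The sketch uses an in-memory SQLite database and a list of tuples in place of PostgreSQL and a PyArrow table. The name temporary_table_sketch, the random suffix and the string column types are assumptions for illustration.

```python
import contextlib
import sqlite3
import uuid


@contextlib.contextmanager
def temporary_table_sketch(conn, base_name, column_types, rows):
    """Create a uniquely named table, fill it, yield its name, then drop it."""
    table_name = f"{base_name}_{uuid.uuid4().hex}"
    columns_sql = ", ".join(f'"{name}" {sql_type}' for name, sql_type in column_types.items())
    placeholders = ", ".join("?" for _ in column_types)
    conn.execute(f'CREATE TABLE "{table_name}" ({columns_sql})')
    try:
        conn.executemany(f'INSERT INTO "{table_name}" VALUES ({placeholders})', rows)
        yield table_name
    finally:
        # Runs whether or not the caller's block raised.
        conn.execute(f'DROP TABLE IF EXISTS "{table_name}"')


conn = sqlite3.connect(":memory:")
data = [("a", b"\x01"), ("b", b"\x02")]
with temporary_table_sketch(conn, "hashes", {"key": "TEXT", "hash": "BLOB"}, data) as name:
    row_count = conn.execute(f'SELECT COUNT(*) FROM "{name}"').fetchall()[0][0]

remaining = conn.execute(
    "SELECT COUNT(*) FROM sqlite_master WHERE name = ?", (name,)
).fetchall()[0][0]
```

Dropping the table in a finally clause keeps the schema clean even when the work done inside the with block fails.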

insert

Utilities for inserting data into the PostgreSQL backend.

Functions:

insert_source
insert_source(
    source_config: SourceConfig,
    data_hashes: Table,
    batch_size: int,
) -> None

Indexes a source within Matchbox.

insert_model

Writes a model to Matchbox with a default truth value of 100.

Parameters:

Raises:

  • MatchboxResolutionNotFoundError

    If the specified parent models don’t exist.

  • MatchboxResolutionAlreadyExists

    If the specified model already exists.

insert_results
insert_results(
    resolution: Resolutions, results: Table, batch_size: int
) -> None

Writes a results table to Matchbox.

The PostgreSQL backend stores clusters in a hierarchical structure, where each component references its parent component at a higher threshold.

This means two-item components are synonymous with their original pairwise probabilities.

This allows easy querying of clusters at any threshold.
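To make the idea of components at different thresholds concrete, the sketch below groups pairwise results into connected components using a union-find structure, once per threshold. It illustrates the concept only and is not the backend's insertion logic. The function name components_at_threshold and the sample pairs are assumptions.

```python
def components_at_threshold(pairs, threshold):
    """Group IDs linked by pairs whose probability is at least the threshold."""
    parent = {}

    def find(node):
        parent.setdefault(node, node)
        while parent[node] != node:
            parent[node] = parent[parent[node]]  # path halving
            node = parent[node]
        return node

    for left, right, probability in pairs:
        if probability >= threshold:
            root_left, root_right = find(left), find(right)
            if root_left != root_right:
                parent[root_right] = root_left

    groups = {}
    for node in list(parent):
        groups.setdefault(find(node), set()).add(node)
    return sorted(sorted(group) for group in groups.values())


# (left_id, right_id, probability) with probabilities stored as 0 to 100.
pairs = [(1, 2, 90), (2, 3, 70), (4, 5, 80)]

at_90 = components_at_threshold(pairs, 90)
at_80 = components_at_threshold(pairs, 80)
at_70 = components_at_threshold(pairs, 70)
```

At the highest threshold the only component is the original pair (1, 2). Lowering the threshold to 70 grows it into (1, 2, 3), which is the sense in which a larger component sits above the smaller one it absorbed.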

Parameters:

  • resolution
    (Resolutions) –

    Resolution of type model to associate results with

  • results
    (Table) –

    A PyArrow results table with left_id, right_id, probability

  • batch_size
    (int) –

    Number of records to insert in each batch

Raises:

  • MatchboxResolutionNotFoundError

    If the specified model doesn’t exist.

query

Utilities for querying and matching in the PostgreSQL backend.

Functions:

  • get_source_config

    Converts the named source to a SourceConfigs ORM object.

  • query

    Queries Matchbox to retrieve linked data for a source.

  • get_parent_clusters_and_leaves

    Query clusters and their leaves for all parent resolutions.

  • match

    Matches an ID in a source resolution and returns the keys in the targets.

Attributes:

  • T
T module-attribute
T = TypeVar('T')
get_source_config
get_source_config(
    name: SourceResolutionName, session: Session
) -> SourceConfigs

Converts the named source to a SourceConfigs ORM object.

query
query(
    source: SourceResolutionName,
    resolution: ResolutionName | None = None,
    threshold: int | None = None,
    limit: int | None = None,
) -> Table

Queries Matchbox to retrieve linked data for a source.

Retrieves all linked data for a given source, resolving through the resolution hierarchy if needed.

  • Simple case: If querying the same resolution as the source, just select cluster IDs and keys directly from ClusterSourceKey
  • Hierarchy case: Uses the unified query builder to traverse up the resolution hierarchy, applying COALESCE priority logic to determine which parent cluster each source record belongs to
  • Priority resolution: When multiple model resolutions could assign a record to different clusters, COALESCE ensures higher-priority resolutions win

Returns all records with their final resolved cluster IDs.
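The COALESCE priority logic can be illustrated with a small self-contained query. The sketch below uses an in-memory SQLite database as a stand-in for PostgreSQL. The table names, the resolution names linker and deduper, and the sample data are assumptions, and the real query builder is more general than this fixed two-level join.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE source_keys (source_key TEXT, cluster_id INTEGER);
    CREATE TABLE assignments (resolution TEXT, leaf INTEGER, root INTEGER);

    INSERT INTO source_keys VALUES
        ('a', 1), ('b', 2), ('c', 3), ('d', 4), ('e', 5), ('f', 6);

    -- The deduper merges leaves 1+2 and 4+5; the linker merges 1+2+3.
    INSERT INTO assignments VALUES
        ('deduper', 1, 10), ('deduper', 2, 10),
        ('deduper', 4, 11), ('deduper', 5, 11),
        ('linker', 1, 20), ('linker', 2, 20), ('linker', 3, 20);
    """
)

# COALESCE takes the first non-NULL value: the higher-priority linker wins,
# then the deduper, and finally the record's own source cluster.
resolved = conn.execute(
    """
    SELECT s.source_key,
           COALESCE(linked.root, deduped.root, s.cluster_id) AS cluster_id
    FROM source_keys AS s
    LEFT JOIN assignments AS linked
        ON linked.leaf = s.cluster_id AND linked.resolution = 'linker'
    LEFT JOIN assignments AS deduped
        ON deduped.leaf = s.cluster_id AND deduped.resolution = 'deduper'
    ORDER BY s.source_key
    """
).fetchall()
```

Records a, b and c take the linker's cluster, d and e fall back to the deduper's cluster, and f keeps its own source cluster because no model assigned it anywhere.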

get_parent_clusters_and_leaves
get_parent_clusters_and_leaves(
    resolution: Resolutions,
) -> dict[int, dict[str, list[dict]]]

Query clusters and their leaves for all parent resolutions.

For a given resolution, find all its parent resolutions and return complete cluster compositions.

  • Parent discovery: Queries ResolutionFrom to find all direct parent resolutions (level 1)
  • Cluster building: For each parent, runs the full unified query to get all cluster assignments with both root and leaf information
  • Aggregation: Collects all leaf nodes belonging to each root cluster across all parent resolutions

Returns a dictionary mapping cluster IDs to their complete leaf compositions and metadata.
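The aggregation step can be sketched as grouping flat (root, leaf) rows into a dictionary keyed by root cluster ID. The source does not specify the shape of the inner dictionaries, so the "leaves" key, the leaf_id and key fields, and the function name aggregate_leaves are assumptions chosen for illustration.

```python
from collections import defaultdict


def aggregate_leaves(rows):
    """Group (root_id, leaf_id, key) rows by root cluster ID."""
    clusters = defaultdict(lambda: {"leaves": []})
    for root_id, leaf_id, key in rows:
        clusters[root_id]["leaves"].append({"leaf_id": leaf_id, "key": key})
    return dict(clusters)


# Rows as they might come back from querying two parent resolutions.
rows = [
    (10, 1, "a"),
    (10, 2, "b"),
    (11, 4, "d"),
    (10, 3, "c"),
]
composition = aggregate_leaves(rows)
```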

match
match(
    key: str,
    source: SourceResolutionName,
    targets: list[SourceResolutionName],
    resolution: ResolutionName,
    threshold: int | None = None,
) -> list[Match]

Matches an ID in a source resolution and returns the keys in the targets.

Given a specific key in a source, find what it matches to in target sources through a resolution hierarchy.

  • Target cluster identification: Uses COALESCE priority CTE to determine which cluster the input key belongs to at the resolution level
  • Matching leaves discovery: Builds UNION ALL query with branches for:
    • Direct cluster members (source-only case)
    • Members connected through each model resolution in the hierarchy
  • Cross-reference: Joins the target cluster with all possible matching leaves, filtering for the requested target sources

Organises matches by source configuration and returns structured Match objects for each target.
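The final organising step can be sketched in plain Python: find the cluster that holds the input key, then collect the keys of every requested target that share it. The MatchSketch class, its fields and the sample source names are assumptions for illustration. The real Match object and the SQL that produces the rows are not reproduced here.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class MatchSketch:
    cluster: Optional[int]
    source: str
    source_keys: set
    target: str
    target_keys: set


def match_sketch(key, source, targets, leaves):
    """leaves holds (cluster_id, source_name, key) rows at the chosen resolution."""
    cluster = next((c for c, s, k in leaves if s == source and k == key), None)
    members = [(s, k) for c, s, k in leaves if cluster is not None and c == cluster]
    return [
        MatchSketch(
            cluster=cluster,
            source=source,
            source_keys={k for s, k in members if s == source},
            target=target,
            target_keys={k for s, k in members if s == target},
        )
        for target in targets
    ]


leaves = [
    (20, "crm", "c1"),
    (20, "crm", "c2"),
    (20, "billing", "b9"),
    (21, "billing", "b3"),
    (20, "web", "w4"),
]
matches = match_sketch("c1", "crm", ["billing", "web", "hr"], leaves)
```

One result is produced per requested target, including targets such as hr that have no keys in the matched cluster.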

results

Utilities for querying model results from the PostgreSQL backend.

Classes:

  • SourceInfo

    Information about a model’s sources.

Functions:

SourceInfo

Bases: NamedTuple

Information about a model’s sources.

Attributes:

left instance-attribute
left: int
right instance-attribute
right: int | None
left_ancestors instance-attribute
left_ancestors: set[int]
right_ancestors instance-attribute
right_ancestors: set[int] | None
get_model_config
get_model_config(resolution: Resolutions) -> ModelConfig

Get metadata for a model resolution.