
PostgreSQL

A backend adapter for deploying Matchbox using PostgreSQL.

The schema contains two tree-like graphs.

  • In the resolution subgraph, the tree is implemented as a closure table, which enables quick querying of root-to-leaf paths at the cost of redundancy.
  • In the data subgraph, the tree is implemented as an adjacency list, which stores less data but requires recursive queries to resolve.
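
The trade-off between the two layouts can be sketched with Python's built-in sqlite3 module standing in for PostgreSQL. The table and column names used here (`adjacency`, `closure`, `ancestor`, `descendant`) are illustrative assumptions and are not Matchbox's schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    -- Adjacency list: one row per direct parent-child edge.
    CREATE TABLE adjacency (parent INTEGER, child INTEGER);
    INSERT INTO adjacency VALUES (1, 2), (2, 3);

    -- Closure table: one row per ancestor-descendant pair, at any depth.
    CREATE TABLE closure (ancestor INTEGER, descendant INTEGER, level INTEGER);
    INSERT INTO closure VALUES (1, 2, 1), (2, 3, 1), (1, 3, 2);
    """
)

# Closure table: every descendant of node 1 comes from one non-recursive lookup.
closure_rows = conn.execute(
    "SELECT descendant FROM closure WHERE ancestor = ? ORDER BY descendant", (1,)
).fetchall()

# Adjacency list: the same answer needs a recursive common table expression.
adjacency_rows = conn.execute(
    """
    WITH RECURSIVE descendants(node) AS (
        SELECT child FROM adjacency WHERE parent = ?
        UNION ALL
        SELECT a.child FROM adjacency AS a
        JOIN descendants AS d ON a.parent = d.node
    )
    SELECT node FROM descendants ORDER BY node
    """,
    (1,),
).fetchall()

closure_descendants = [row[0] for row in closure_rows]
adjacency_descendants = [row[0] for row in adjacency_rows]
```

Both queries return the same descendants, but the closure table stores three rows for a two-edge chain where the adjacency list stores two.
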
erDiagram
    SourceConfigs {
        bigint source_config_id PK
        bigint resolution_id FK
        string location_type
        string location_name
        string extract_transform
    }
    SourceFields {
        bigint field_id PK
        bigint source_config_id FK
        int index
        string name
        string type
        bool is_key
    }
    Clusters {
        bigint cluster_id PK
        bytes cluster_hash
    }
    ClusterSourceKey {
        bigint key_id PK
        bigint cluster_id FK
        bigint source_config_id FK
        string key
    }
    Contains {
        bigint parent PK,FK
        bigint child PK,FK
    }
    PKSpace {
        bigint id
        bigint next_cluster_id
        bigint next_cluster_keys_id
    }
    Probabilities {
        bigint resolution PK,FK
        bigint cluster PK,FK
        smallint probability
    }
    Resolutions {
        bigint resolution_id PK
        string name
        string description
        string type
        bytes hash
        smallint truth
    }
    ResolutionFrom {
        bigint parent PK,FK
        bigint child PK,FK
        int level
        smallint truth_cache
    }
    Users {
        bigint user_id PK
        string name
    }
    EvalJudgements {
        bigint judgement_id PK
        bigint user_id FK
        bigint endorsed_cluster_id FK
        bigint shown_cluster_id FK
        datetime timestamp
    }

    SourceConfigs |o--|| Resolutions : ""
    SourceConfigs ||--o{ SourceFields : ""
    SourceConfigs ||--o{ ClusterSourceKey : ""
    Clusters ||--o{ ClusterSourceKey : ""
    Clusters ||--o{ Probabilities : ""
    Clusters ||--o{ EvalJudgements : "endorsed_cluster_id"
    Clusters ||--o{ EvalJudgements : "shown_cluster_id" 
    Clusters ||--o{ Contains : "parent"
    Contains }o--|| Clusters : "child"
    Resolutions ||--o{ Probabilities : ""
    Resolutions ||--o{ ResolutionFrom : "parent"
    ResolutionFrom }o--|| Resolutions : "child"
    Users ||--o{ EvalJudgements : ""

matchbox.server.postgresql

PostgreSQL adapter for Matchbox server.

Modules:

  • adapter

    PostgreSQL adapter for Matchbox server.

  • db

    Matchbox PostgreSQL database connection.

  • mixin

    A module for defining mixins for the PostgreSQL backend ORM.

  • orm

    ORM classes for the Matchbox PostgreSQL database.

  • utils

    Utilities for using the PostgreSQL backend.

Classes:

__all__ module-attribute

__all__ = ['MatchboxPostgres', 'MatchboxPostgresSettings']

MatchboxPostgres

MatchboxPostgres(settings: MatchboxPostgresSettings)

Bases: MatchboxDBAdapter

A PostgreSQL adapter for Matchbox.

Methods:

Attributes:

settings instance-attribute

settings = settings

sources instance-attribute

sources = SourceConfigs

models instance-attribute

models = FilteredResolutions(sources=False, models=True)

data instance-attribute

data = FilteredClusters(has_source=True)

clusters instance-attribute

clusters = FilteredClusters(has_source=False)

creates instance-attribute

creates = FilteredProbabilities(over_truth=True)

merges instance-attribute

merges = Contains

proposes instance-attribute

proposes = FilteredProbabilities()

source_resolutions instance-attribute

source_resolutions = FilteredResolutions(sources=True, models=False)

query

query(source: SourceResolutionPath, point_of_truth: ResolutionPath | None = None, threshold: int | None = None, return_leaf_id: bool = False, limit: int | None = None) -> Table

Queries the database from an optional point of truth.

Parameters:

  • source
    (SourceResolutionPath) –

    the resolution path identifying the source to query

  • point_of_truth
    (optional, default: None ) –

    the resolution path to use for filtering results. If not specified, the source resolution for the queried source is used.

  • threshold
    (optional, default: None ) –

    the threshold to use for creating clusters. If None, uses the models’ default threshold. If an integer, uses that threshold for the specified model, and the model’s cached thresholds for its ancestors.

  • return_leaf_id
    (optional, default: False ) –

    whether to return cluster ID of leaves

  • limit
    (optional, default: None ) –

    the number to use in a limit clause. Useful for testing

Returns:

  • Table

    The resulting matchbox IDs in Arrow format
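
One reading of the threshold rule described above is sketched below in plain Python. The function name, its arguments, and the example resolution names are all hypothetical; this illustrates the documented behaviour and is not the adapter's implementation.

```python
from __future__ import annotations


def resolve_thresholds(
    point_of_truth: str,
    default_truth: int,
    ancestor_truth_cache: dict[str, int],
    threshold: int | None = None,
) -> dict[str, int]:
    """Return the threshold applied to each resolution in a lineage."""
    # Ancestors keep the thresholds cached against the point of truth.
    thresholds = dict(ancestor_truth_cache)
    # Only the point of truth itself is affected by an explicit override.
    thresholds[point_of_truth] = default_truth if threshold is None else threshold
    return thresholds


# Hypothetical lineage: a linker built on top of two dedupers.
cached = {"dedupe_a": 75, "dedupe_b": 90}
defaults = resolve_thresholds("linker", 80, cached)
overridden = resolve_thresholds("linker", 80, cached, threshold=60)
```

With `threshold=None` the linker uses its default of 80. With `threshold=60` only the linker's value changes, and both ancestors keep their cached values.
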

match

Matches an ID in a source resolution and returns the keys in the targets.

Parameters:

  • key
    (str) –

    The key to match from the source.

  • source
    (SourceResolutionPath) –

    The path of the source resolution.

  • targets
    (list[SourceResolutionPath]) –

    The paths of the target source resolutions.

  • point_of_truth
    (ResolutionPath) –

    The path of the resolution to use for matching.

  • threshold
    (optional, default: None ) –

    the threshold to use for creating clusters. If None, uses the resolutions’ default threshold. If an integer, uses that threshold for the specified resolution, and the resolution’s cached thresholds for its ancestors. These threshold values are used instead of the cached thresholds.

create_collection

create_collection(name: CollectionName) -> Collection

Create a new collection.

Parameters:

Returns:

  • Collection

    A Collection object containing its metadata, versions, and resolutions.

get_collection

get_collection(name: CollectionName) -> Collection

Get collection metadata.

Parameters:

Returns:

  • Collection

    A Collection object containing its metadata, versions, and resolutions.

list_collections

list_collections() -> list[CollectionName]

List all collection names.

Returns:

delete_collection

delete_collection(name: CollectionName, certain: bool) -> None

Delete a collection and all its versions.

Parameters:

  • name
    (CollectionName) –

    The name of the collection to delete.

  • certain
    (bool) –

    Whether to delete the collection without confirmation.

create_run

create_run(collection: CollectionName) -> Run

Create a new run.

Parameters:

  • collection
    (CollectionName) –

    The name of the collection to create the run in.

Returns:

  • Run

    A Run object containing its metadata and resolutions.

set_run_mutable

Set the mutability of a run.

Parameters:

  • collection
    (CollectionName) –

    The name of the collection containing the run.

  • run_id
    (RunID) –

    The ID of the run to update.

  • mutable
    (bool) –

    Whether the run should be mutable.

Returns:

  • Run

    The updated Run object.

set_run_default

Set the default status of a run.

Parameters:

  • collection
    (CollectionName) –

    The name of the collection containing the run.

  • run_id
    (RunID) –

    The ID of the run to update.

  • default
    (bool) –

    Whether the run should be the default run.

Returns:

  • Run

    The updated Run object.

get_run

Get run metadata and resolutions.

Parameters:

  • collection
    (CollectionName) –

    The name of the collection containing the run.

  • run_id
    (RunID) –

    The ID of the run to get.

Returns:

  • Run

    A Run object containing its metadata and resolutions.

delete_run

delete_run(collection: CollectionName, run_id: RunID, certain: bool) -> None

Delete a run and all its resolutions.

Parameters:

  • collection
    (CollectionName) –

    The name of the collection containing the run.

  • run_id
    (RunID) –

    The ID of the run to delete.

  • certain
    (bool) –

    Whether to delete the run without confirmation.

create_resolution

create_resolution(resolution: Resolution, path: ResolutionPath) -> None

Writes a resolution to Matchbox.

Parameters:

  • resolution
    (Resolution) –

    Resolution object with a source or model config

  • path
    (ResolutionPath) –

    The path of the resolution to write

Raises:

  • MatchboxModelConfigError

    If the configuration is invalid, such as the ModelConfig’s resolutions sharing ancestors

get_resolution

get_resolution(path: ResolutionPath, validate: ResolutionType | None = None) -> Resolution

Get a resolution from its path.

Parameters:

Returns:

delete_resolution

delete_resolution(path: ResolutionPath, certain: bool) -> None

Delete a resolution from the database.

Parameters:

  • path
    (ResolutionPath) –

    The path of the resolution to delete.

  • certain
    (bool) –

    Whether to delete the resolution without confirmation.

insert_source_data

insert_source_data(path: SourceResolutionPath, data_hashes: Table) -> None

Inserts hash data for a source resolution.

Parameters:

  • path
    (SourceResolutionPath) –

    The path of the source resolution to index.

  • data_hashes
    (Table) –

    The Arrow table with the hash of each data row

insert_model_data

insert_model_data(path: ModelResolutionPath, results: Table) -> None

Inserts results data for a model resolution.

get_model_data

get_model_data(path: ModelResolutionPath) -> Table

Get the results for a model resolution.

set_model_truth

set_model_truth(path: ModelResolutionPath, truth: int) -> None

Sets the truth threshold for this model, changing the default clusters.

get_model_truth

get_model_truth(path: ModelResolutionPath) -> int

Gets the current truth threshold for this model.

validate_ids

validate_ids(ids: list[int]) -> bool

Validates that a list of IDs exists in the database.

Parameters:

  • ids
    (list[int]) –

    A list of IDs to validate.

Raises:

  • MatchboxDataNotFound

    If some items don’t exist in the target table.

dump

dump() -> MatchboxSnapshot

Dumps the entire database to a snapshot.

Returns:

  • MatchboxSnapshot

    A MatchboxSnapshot object of type “postgres” with the database’s current state.

drop

drop(certain: bool) -> None

Hard clear the database by dropping all tables and re-creating.

Parameters:

  • certain
    (bool) –

    Whether to drop the database without confirmation.

clear

clear(certain: bool) -> None

Soft clear the database by deleting all rows but retaining tables.

Parameters:

  • certain
    (bool) –

    Whether to delete the database without confirmation.

restore

restore(snapshot: MatchboxSnapshot) -> None

Restores the database from a snapshot.

Parameters:

  • snapshot
    (MatchboxSnapshot) –

    A MatchboxSnapshot object of type “postgres” with the database’s state

Raises:

  • TypeError

    If the snapshot is not compatible with PostgreSQL

login

login(user_name: str) -> int

Receives a user name and returns user ID.

insert_judgement

insert_judgement(judgement: Judgement) -> None

Adds an evaluation judgement to the database.

Parameters:

  • judgement
    (Judgement) –

    representation of the proposed clusters.

get_judgements

get_judgements() -> tuple[Table, Table]

Retrieves all evaluation judgements.

Returns:

  • tuple[Table, Table]

    Two PyArrow tables with the judgements and their expansion. See matchbox.common.arrow for information on the schema.

compare_models

Compare metrics of models based on evaluation data.

Parameters:

Returns:

  • ModelComparison

    A model comparison object, listing metrics for each model.

sample_for_eval

sample_for_eval(n: int, path: ModelResolutionPath, user_id: int) -> Table

Sample clusters to validate.

Parameters:

  • n
    (int) –

    Number of clusters to sample

  • path
    (ModelResolutionPath) –

    Path of resolution from which to sample

  • user_id
    (int) –

    ID of user requesting the sample

Returns:

  • Table

    An Arrow table with the same schema as returned by query()

MatchboxPostgresSettings

Bases: MatchboxServerSettings

Settings for the Matchbox PostgreSQL backend.

Inherits the core settings and adds the PostgreSQL-specific settings.

Methods:

  • check_settings

    Check that legal combinations of settings are provided.

Attributes:

model_config class-attribute instance-attribute

model_config = SettingsConfigDict(env_prefix='MB__SERVER__', env_nested_delimiter='__', use_enum_values=True, env_file='.env', env_file_encoding='utf-8', extra='ignore')
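
The `env_prefix` and `env_nested_delimiter` values above determine how environment variable names map onto settings fields. The sketch below mimics that naming convention in plain Python. It is not how pydantic-settings is implemented, the helper name is hypothetical, and the example values are invented for illustration.

```python
from __future__ import annotations


def nest_environment(
    environ: dict[str, str], prefix: str = "MB__SERVER__", delimiter: str = "__"
) -> dict:
    """Group prefixed environment variables into a nested dictionary."""
    nested: dict = {}
    for name, value in environ.items():
        if not name.startswith(prefix):
            continue  # Not a Matchbox server setting.
        # Strip the prefix, then split the remainder into nested keys.
        *parents, leaf = name[len(prefix):].lower().split(delimiter)
        target = nested
        for key in parents:
            target = target.setdefault(key, {})
        target[leaf] = value
    return nested


example = nest_environment(
    {
        "MB__SERVER__BATCH_SIZE": "250000",
        "MB__SERVER__POSTGRES__HOST": "localhost",
        "MB__SERVER__POSTGRES__PORT": "5432",
        "PATH": "/usr/bin",
    }
)
```

Here `MB__SERVER__POSTGRES__HOST` lands under `postgres.host`, while `BATCH_SIZE` stays a top-level key because it contains only a single underscore. The real settings class additionally validates and converts the values, which this sketch leaves as strings.
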

batch_size class-attribute instance-attribute

batch_size: int = Field(default=250000)

datastore instance-attribute

task_runner instance-attribute

task_runner: Literal['api', 'celery']

redis_uri instance-attribute

redis_uri: str | None

uploads_expiry_minutes instance-attribute

uploads_expiry_minutes: int | None

authorisation class-attribute instance-attribute

authorisation: bool = False

public_key class-attribute instance-attribute

public_key: SecretStr | None = Field(default=None)

log_level class-attribute instance-attribute

log_level: LogLevelType = 'INFO'

backend_type class-attribute instance-attribute

backend_type: MatchboxBackends = POSTGRES

postgres class-attribute instance-attribute

check_settings

check_settings() -> Self

Check that legal combinations of settings are provided.

adapter

PostgreSQL adapter for Matchbox server.

Classes:

Attributes:

  • T
  • P

T module-attribute

T = TypeVar('T')

P module-attribute

P = ParamSpec('P')

FilteredClusters

Bases: BaseModel

Wrapper class for filtered cluster queries.

Methods:

  • count

    Counts the number of clusters in the database.

Attributes:

has_source class-attribute instance-attribute
has_source: bool | None = None
count
count() -> int

Counts the number of clusters in the database.

FilteredProbabilities

Bases: BaseModel

Wrapper class for filtered probability queries.

Methods:

  • count

    Counts the number of probabilities in the database.

Attributes:

over_truth class-attribute instance-attribute
over_truth: bool = False
count
count() -> int

Counts the number of probabilities in the database.

FilteredResolutions

Bases: BaseModel

Wrapper class for filtered resolution queries.

Methods:

  • count

    Counts the number of resolutions in the database.

Attributes:

sources class-attribute instance-attribute
sources: bool = False
models class-attribute instance-attribute
models: bool = False
count
count() -> int

Counts the number of resolutions in the database.

MatchboxPostgres

MatchboxPostgres(settings: MatchboxPostgresSettings)

Bases: MatchboxDBAdapter

A PostgreSQL adapter for Matchbox.

Methods:

Attributes:

settings instance-attribute
settings = settings
sources instance-attribute
sources = SourceConfigs
models instance-attribute
models = FilteredResolutions(sources=False, models=True)
data instance-attribute
data = FilteredClusters(has_source=True)
clusters instance-attribute
clusters = FilteredClusters(has_source=False)
creates instance-attribute
creates = FilteredProbabilities(over_truth=True)
merges instance-attribute
merges = Contains
proposes instance-attribute
proposes = FilteredProbabilities()
source_resolutions instance-attribute
source_resolutions = FilteredResolutions(sources=True, models=False)
query
query(source: SourceResolutionPath, point_of_truth: ResolutionPath | None = None, threshold: int | None = None, return_leaf_id: bool = False, limit: int | None = None) -> Table

Queries the database from an optional point of truth.

Parameters:

  • source
    (SourceResolutionPath) –

    the resolution path identifying the source to query

  • point_of_truth
    (optional, default: None ) –

    the resolution path to use for filtering results. If not specified, the source resolution for the queried source is used.

  • threshold
    (optional, default: None ) –

    the threshold to use for creating clusters. If None, uses the models’ default threshold. If an integer, uses that threshold for the specified model, and the model’s cached thresholds for its ancestors.

  • return_leaf_id
    (optional, default: False ) –

    whether to return cluster ID of leaves

  • limit
    (optional, default: None ) –

    the number to use in a limit clause. Useful for testing

Returns:

  • Table

    The resulting matchbox IDs in Arrow format

match

Matches an ID in a source resolution and returns the keys in the targets.

Parameters:

  • key
    (str) –

    The key to match from the source.

  • source
    (SourceResolutionPath) –

    The path of the source resolution.

  • targets
    (list[SourceResolutionPath]) –

    The paths of the target source resolutions.

  • point_of_truth
    (ResolutionPath) –

    The path of the resolution to use for matching.

  • threshold
    (optional, default: None ) –

    the threshold to use for creating clusters. If None, uses the resolutions’ default threshold. If an integer, uses that threshold for the specified resolution, and the resolution’s cached thresholds for its ancestors. These threshold values are used instead of the cached thresholds.

create_collection
create_collection(name: CollectionName) -> Collection

Create a new collection.

Parameters:

Returns:

  • Collection

    A Collection object containing its metadata, versions, and resolutions.

get_collection
get_collection(name: CollectionName) -> Collection

Get collection metadata.

Parameters:

Returns:

  • Collection

    A Collection object containing its metadata, versions, and resolutions.

list_collections
list_collections() -> list[CollectionName]

List all collection names.

Returns:

delete_collection
delete_collection(name: CollectionName, certain: bool) -> None

Delete a collection and all its versions.

Parameters:

  • name
    (CollectionName) –

    The name of the collection to delete.

  • certain
    (bool) –

    Whether to delete the collection without confirmation.

create_run
create_run(collection: CollectionName) -> Run

Create a new run.

Parameters:

  • collection
    (CollectionName) –

    The name of the collection to create the run in.

Returns:

  • Run

    A Run object containing its metadata and resolutions.

set_run_mutable

Set the mutability of a run.

Parameters:

  • collection
    (CollectionName) –

    The name of the collection containing the run.

  • run_id
    (RunID) –

    The ID of the run to update.

  • mutable
    (bool) –

    Whether the run should be mutable.

Returns:

  • Run

    The updated Run object.

set_run_default

Set the default status of a run.

Parameters:

  • collection
    (CollectionName) –

    The name of the collection containing the run.

  • run_id
    (RunID) –

    The ID of the run to update.

  • default
    (bool) –

    Whether the run should be the default run.

Returns:

  • Run

    The updated Run object.

get_run

Get run metadata and resolutions.

Parameters:

  • collection
    (CollectionName) –

    The name of the collection containing the run.

  • run_id
    (RunID) –

    The ID of the run to get.

Returns:

  • Run

    A Run object containing its metadata and resolutions.

delete_run
delete_run(collection: CollectionName, run_id: RunID, certain: bool) -> None

Delete a run and all its resolutions.

Parameters:

  • collection
    (CollectionName) –

    The name of the collection containing the run.

  • run_id
    (RunID) –

    The ID of the run to delete.

  • certain
    (bool) –

    Whether to delete the run without confirmation.

create_resolution
create_resolution(resolution: Resolution, path: ResolutionPath) -> None

Writes a resolution to Matchbox.

Parameters:

  • resolution
    (Resolution) –

    Resolution object with a source or model config

  • path
    (ResolutionPath) –

    The path of the resolution to write

Raises:

  • MatchboxModelConfigError

    If the configuration is invalid, such as the ModelConfig’s resolutions sharing ancestors

get_resolution
get_resolution(path: ResolutionPath, validate: ResolutionType | None = None) -> Resolution

Get a resolution from its path.

Parameters:

Returns:

delete_resolution
delete_resolution(path: ResolutionPath, certain: bool) -> None

Delete a resolution from the database.

Parameters:

  • path
    (ResolutionPath) –

    The path of the resolution to delete.

  • certain
    (bool) –

    Whether to delete the resolution without confirmation.

insert_source_data
insert_source_data(path: SourceResolutionPath, data_hashes: Table) -> None

Inserts hash data for a source resolution.

Parameters:

  • path
    (SourceResolutionPath) –

    The path of the source resolution to index.

  • data_hashes
    (Table) –

    The Arrow table with the hash of each data row

insert_model_data
insert_model_data(path: ModelResolutionPath, results: Table) -> None

Inserts results data for a model resolution.

get_model_data
get_model_data(path: ModelResolutionPath) -> Table

Get the results for a model resolution.

set_model_truth
set_model_truth(path: ModelResolutionPath, truth: int) -> None

Sets the truth threshold for this model, changing the default clusters.

get_model_truth
get_model_truth(path: ModelResolutionPath) -> int

Gets the current truth threshold for this model.

validate_ids
validate_ids(ids: list[int]) -> bool

Validates that a list of IDs exists in the database.

Parameters:

  • ids
    (list[int]) –

    A list of IDs to validate.

Raises:

  • MatchboxDataNotFound

    If some items don’t exist in the target table.

dump
dump() -> MatchboxSnapshot

Dumps the entire database to a snapshot.

Returns:

  • MatchboxSnapshot

    A MatchboxSnapshot object of type “postgres” with the database’s current state.

drop
drop(certain: bool) -> None

Hard clear the database by dropping all tables and re-creating.

Parameters:

  • certain
    (bool) –

    Whether to drop the database without confirmation.

clear
clear(certain: bool) -> None

Soft clear the database by deleting all rows but retaining tables.

Parameters:

  • certain
    (bool) –

    Whether to delete the database without confirmation.

restore
restore(snapshot: MatchboxSnapshot) -> None

Restores the database from a snapshot.

Parameters:

  • snapshot
    (MatchboxSnapshot) –

    A MatchboxSnapshot object of type “postgres” with the database’s state

Raises:

  • TypeError

    If the snapshot is not compatible with PostgreSQL

login
login(user_name: str) -> int

Receives a user name and returns user ID.

insert_judgement
insert_judgement(judgement: Judgement) -> None

Adds an evaluation judgement to the database.

Parameters:

  • judgement
    (Judgement) –

    representation of the proposed clusters.

get_judgements
get_judgements() -> tuple[Table, Table]

Retrieves all evaluation judgements.

Returns:

  • tuple[Table, Table]

    Two PyArrow tables with the judgements and their expansion. See matchbox.common.arrow for information on the schema.

compare_models

Compare metrics of models based on evaluation data.

Parameters:

Returns:

  • ModelComparison

    A model comparison object, listing metrics for each model.

sample_for_eval
sample_for_eval(n: int, path: ModelResolutionPath, user_id: int) -> Table

Sample clusters to validate.

Parameters:

  • n
    (int) –

    Number of clusters to sample

  • path
    (ModelResolutionPath) –

    Path of resolution from which to sample

  • user_id
    (int) –

    ID of user requesting the sample

Returns:

  • Table

    An Arrow table with the same schema as returned by query()

db

Matchbox PostgreSQL database connection.

Classes:

Attributes:

MBDB module-attribute

MatchboxPostgresCoreSettings

Bases: BaseModel

PostgreSQL-specific settings for Matchbox.

Methods:

Attributes:

host instance-attribute
host: str
port instance-attribute
port: int
user instance-attribute
user: str
password instance-attribute
password: str
database instance-attribute
database: str
db_schema instance-attribute
db_schema: str
alembic_config class-attribute instance-attribute
alembic_config: Path = Field(default=Path('src/matchbox/server/postgresql/alembic.ini'))
get_alembic_config
get_alembic_config() -> Config

Get the Alembic config.

MatchboxPostgresSettings

Bases: MatchboxServerSettings

Settings for the Matchbox PostgreSQL backend.

Inherits the core settings and adds the PostgreSQL-specific settings.

Methods:

  • check_settings

    Check that legal combinations of settings are provided.

Attributes:

backend_type class-attribute instance-attribute
backend_type: MatchboxBackends = POSTGRES
postgres class-attribute instance-attribute
model_config class-attribute instance-attribute
model_config = SettingsConfigDict(env_prefix='MB__SERVER__', env_nested_delimiter='__', use_enum_values=True, env_file='.env', env_file_encoding='utf-8', extra='ignore')
batch_size class-attribute instance-attribute
batch_size: int = Field(default=250000)
datastore instance-attribute
task_runner instance-attribute
task_runner: Literal['api', 'celery']
redis_uri instance-attribute
redis_uri: str | None
uploads_expiry_minutes instance-attribute
uploads_expiry_minutes: int | None
authorisation class-attribute instance-attribute
authorisation: bool = False
public_key class-attribute instance-attribute
public_key: SecretStr | None = Field(default=None)
log_level class-attribute instance-attribute
log_level: LogLevelType = 'INFO'
check_settings
check_settings() -> Self

Check that legal combinations of settings are provided.

MatchboxDatabase

MatchboxDatabase(settings: MatchboxPostgresSettings)

Matchbox PostgreSQL database connection.

Methods:

Attributes:

settings instance-attribute
settings = settings
MatchboxBase instance-attribute
MatchboxBase = declarative_base(metadata=MetaData(schema=db_schema))
alembic_config instance-attribute
alembic_config = get_alembic_config()
sorted_tables property
sorted_tables: list[Table]

Return a list of SQLAlchemy tables in order of creation.
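
"Order of creation" presumably means an order in which every table appears after the tables its foreign keys reference. SQLAlchemy's own `MetaData.sorted_tables` sorts by foreign key dependency, though whether this property delegates to it is not stated here. The idea can be sketched with the standard library's `graphlib`, using a subset of the foreign keys documented in the ORM section:

```python
from graphlib import TopologicalSorter

# Each table maps to the set of tables its foreign keys reference.
foreign_keys = {
    "collections": set(),
    "runs": {"collections"},
    "resolutions": {"runs"},
    "resolution_from": {"resolutions"},
}

# static_order() yields every table after the tables it depends on.
creation_order = list(TopologicalSorter(foreign_keys).static_order())
```

Creating tables in this order, and dropping them in the reverse order, avoids referencing a table that does not yet exist.
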

connection_string
connection_string(driver: bool = True) -> str

Get the connection string for PostgreSQL.
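
A minimal sketch of how such a string might be assembled from the fields of MatchboxPostgresCoreSettings follows. The function name, the `+psycopg` driver suffix, and the percent-encoding of credentials are assumptions made for illustration; the actual method may differ on all three.

```python
from urllib.parse import quote


def build_connection_string(
    host: str,
    port: int,
    user: str,
    password: str,
    database: str,
    driver: bool = True,
) -> str:
    """Assemble a PostgreSQL URL, optionally naming a driver in the scheme."""
    # The "+psycopg" suffix is an assumed driver name for illustration only.
    scheme = "postgresql+psycopg" if driver else "postgresql"
    # Percent-encode credentials so characters such as "@" or ":" stay unambiguous.
    return (
        f"{scheme}://{quote(user, safe='')}:{quote(password, safe='')}"
        f"@{host}:{port}/{database}"
    )


with_driver = build_connection_string("localhost", 5432, "mb", "p@ss", "matchbox")
without_driver = build_connection_string(
    "localhost", 5432, "mb", "p@ss", "matchbox", driver=False
)
```

The `driver` flag only changes the URL scheme. A driver-qualified scheme is the form SQLAlchemy engines expect, while tools that speak plain libpq URLs generally want the unqualified form.
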

get_engine
get_engine() -> Engine

Get the database engine.

get_session
get_session() -> Session

Get a new session.

get_adbc_connection
get_adbc_connection() -> Generator[PoolProxiedConnection, Any, Any]

Get a new ADBC connection wrapped by a SQLAlchemy pool proxy.

The connection must be used within a context manager.
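
The Generator return type suggests a generator-based context manager. The sketch below shows that pattern with `contextlib.contextmanager`, using sqlite3 as a stand-in for an ADBC connection; the function name and the cleanup behaviour are assumptions, not Matchbox's code.

```python
import sqlite3
from collections.abc import Generator
from contextlib import contextmanager


@contextmanager
def get_connection() -> Generator[sqlite3.Connection, None, None]:
    """Yield a connection and always close it afterwards."""
    conn = sqlite3.connect(":memory:")
    try:
        yield conn
    finally:
        # Runs whether the with-block exits normally or raises.
        conn.close()


with get_connection() as conn:
    value = conn.execute("SELECT 1").fetchone()[0]
```

Calling the function without `with` returns a context manager object rather than a connection, which is why the context manager requirement matters.
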

run_migrations
run_migrations()

Create the database and all tables expected in the schema.

clear_database
clear_database()

Delete all rows in every table in the database schema.

  • TRUNCATE tables that are part of the core ORM (preserves structure)
  • DROP tables that are not in the ORM (removes temporary/test tables)
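
The two rules above can be sketched with sqlite3 standing in for PostgreSQL. The `ORM_TABLES` set and the `scratch_upload` table are hypothetical, and because SQLite has no TRUNCATE statement an unqualified DELETE takes its place.

```python
import sqlite3

ORM_TABLES = {"clusters", "resolutions"}  # hypothetical subset of the ORM

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE clusters (cluster_id INTEGER PRIMARY KEY);
    CREATE TABLE resolutions (resolution_id INTEGER PRIMARY KEY);
    CREATE TABLE scratch_upload (id INTEGER PRIMARY KEY);
    INSERT INTO clusters VALUES (1), (2);
    INSERT INTO scratch_upload VALUES (1);
    """
)


def list_tables(connection: sqlite3.Connection) -> list:
    """Return the names of all tables in the database, sorted by name."""
    rows = connection.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
    )
    return [row[0] for row in rows]


for table in list_tables(conn):
    if table in ORM_TABLES:
        # Known to the ORM: empty the table but keep its structure.
        conn.execute(f'DELETE FROM "{table}"')
    else:
        # Unknown to the ORM: remove the table entirely.
        conn.execute(f'DROP TABLE "{table}"')
conn.commit()

remaining_tables = list_tables(conn)
remaining_rows = conn.execute("SELECT COUNT(*) FROM clusters").fetchone()[0]
```

After the loop, the ORM tables still exist but are empty, and the stray table is gone.
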
drop_database
drop_database()

Drop all tables in the database schema and re-create them.

mixin

A module for defining mixins for the PostgreSQL backend ORM.

Classes:

  • CountMixin

    A mixin for counting the number of rows in a table.

Attributes:

  • T

T module-attribute

T = TypeVar('T')

CountMixin

A mixin for counting the number of rows in a table.

Methods:

  • count

    Counts the number of rows in the table.

count classmethod
count() -> int

Counts the number of rows in the table.
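
The mixin pattern can be sketched without SQLAlchemy: a classmethod reads the table name from whichever class inherits it. The module-level sqlite3 connection and the `users` table name below are assumptions for illustration, and the real mixin works through the ORM rather than raw SQL.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE users (user_id INTEGER PRIMARY KEY, name TEXT);
    INSERT INTO users (name) VALUES ('alice'), ('bob');
    """
)


class CountMixin:
    """Adds a row-counting classmethod to any class that names a table."""

    __tablename__: str

    @classmethod
    def count(cls) -> int:
        # cls is the concrete table class, so one method serves every table.
        query = f'SELECT COUNT(*) FROM "{cls.__tablename__}"'
        return conn.execute(query).fetchone()[0]


class Users(CountMixin):
    __tablename__ = "users"


user_count = Users.count()
```

Because `count` is a classmethod, it is called on the table class itself and needs no instance.
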

orm

ORM classes for the Matchbox PostgreSQL database.

Classes:

  • Collections

    Named collections of resolutions and runs.

  • Runs

    Runs of collections of resolutions.

  • ResolutionFrom

    Resolution lineage closure table with cached truth values.

  • Resolutions

    Table of resolution points corresponding to models and sources.

  • PKSpace

    Table used to reserve ranges of primary keys.

  • SourceFields

    Table for storing column details for SourceConfigs.

  • ClusterSourceKey

    Table for storing source primary keys for clusters.

  • SourceConfigs

    Table of source_configs of data for Matchbox.

  • ModelConfigs

    Table of model configs for Matchbox.

  • Contains

    Cluster lineage table.

  • Clusters

    Table of indexed data and clusters that match it.

  • Users

    Table of identities of human validators.

  • EvalJudgements

    Table of evaluation judgements produced by human validators.

  • Probabilities

    Table of probabilities that a cluster is correct, according to a resolution.

  • Results

    Table of results for a resolution.

Collections

Bases: CountMixin, MatchboxBase

Named collections of resolutions and runs.

Methods:

  • from_name

    Resolve a collection name to a Collections object.

  • to_dto

    Convert ORM collection to a matchbox.common Collection object.

  • count

    Counts the number of rows in the table.

Attributes:

__tablename__ class-attribute instance-attribute
__tablename__ = 'collections'
collection_id class-attribute instance-attribute
collection_id = Column(BIGINT, primary_key=True, autoincrement=True)
name class-attribute instance-attribute
name = Column(TEXT, nullable=False)
runs class-attribute instance-attribute
runs = relationship('Runs', back_populates='collection')
__table_args__ class-attribute instance-attribute
__table_args__ = (UniqueConstraint('name', name='collections_name_key'),)
from_name classmethod
from_name(name: CollectionName, session: Session | None = None) -> Collections

Resolve a collection name to a Collections object.

Parameters:

  • name
    (CollectionName) –

    The name of the collection to resolve.

  • session
    (Session | None, default: None ) –

    Optional session to use for the query.

Raises:

to_dto
to_dto() -> Collection

Convert ORM collection to a matchbox.common Collection object.

count classmethod
count() -> int

Counts the number of rows in the table.

Runs

Bases: CountMixin, MatchboxBase

Runs of collections of resolutions.

Methods:

  • from_id

    Resolve a collection name and run ID to a Runs object.

  • to_dto

    Convert ORM run to a matchbox.common Run object.

  • count

    Counts the number of rows in the table.

Attributes:

__tablename__ class-attribute instance-attribute
__tablename__ = 'runs'
run_id class-attribute instance-attribute
run_id = Column(BIGINT, primary_key=True, autoincrement=True)
collection_id class-attribute instance-attribute
collection_id = Column(BIGINT, ForeignKey('collections.collection_id', ondelete='CASCADE'), nullable=False)
is_mutable class-attribute instance-attribute
is_mutable = Column(BOOLEAN, default=False)
is_default class-attribute instance-attribute
is_default = Column(BOOLEAN, default=False)
collection class-attribute instance-attribute
collection = relationship('Collections', back_populates='runs')
resolutions class-attribute instance-attribute
resolutions = relationship('Resolutions', back_populates='run')
__table_args__ class-attribute instance-attribute
__table_args__ = (UniqueConstraint('collection_id', 'run_id', name='unique_run_id'), Index('ix_default_run_collection', 'collection_id', unique=True, postgresql_where=text('is_default = true')))
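
The `ix_default_run_collection` index is a partial unique index: uniqueness on `collection_id` is enforced only for rows where `is_default` is true, so each collection can have at most one default run. SQLite also supports partial indexes, which allows a small stand-in demonstration. The table definition below is a simplified assumption and not the real DDL.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE runs (
        run_id INTEGER PRIMARY KEY,
        collection_id INTEGER NOT NULL,
        is_default BOOLEAN DEFAULT 0
    );
    -- Uniqueness applies only to rows where is_default is true.
    CREATE UNIQUE INDEX ix_default_run_collection
        ON runs (collection_id) WHERE is_default = 1;
    """
)

# Any number of non-default runs may share a collection.
conn.execute("INSERT INTO runs (collection_id, is_default) VALUES (1, 0)")
conn.execute("INSERT INTO runs (collection_id, is_default) VALUES (1, 0)")
# The first default run for collection 1 is accepted.
conn.execute("INSERT INTO runs (collection_id, is_default) VALUES (1, 1)")

try:
    # A second default run for the same collection violates the index.
    conn.execute("INSERT INTO runs (collection_id, is_default) VALUES (1, 1)")
    second_default_rejected = False
except sqlite3.IntegrityError:
    second_default_rejected = True

run_count = conn.execute("SELECT COUNT(*) FROM runs").fetchone()[0]
```
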
from_id classmethod
from_id(collection: CollectionName, run_id: RunID, session: Session | None = None) -> Runs

Resolve a collection name and run ID to a Runs object.

Parameters:

  • collection
    (CollectionName) –

    The name of the collection containing the run.

  • run_id
    (RunID) –

    The ID of the run within that collection.

  • session
    (Session | None, default: None ) –

    Optional session to use for the query.

Raises:

to_dto
to_dto() -> Run

Convert ORM run to a matchbox.common Run object.

count classmethod
count() -> int

Counts the number of rows in the table.

ResolutionFrom

Bases: CountMixin, MatchboxBase

Resolution lineage closure table with cached truth values.

Methods:

  • count

    Counts the number of rows in the table.

Attributes:

__tablename__ class-attribute instance-attribute
__tablename__ = 'resolution_from'
parent class-attribute instance-attribute
parent = Column(BIGINT, ForeignKey('resolutions.resolution_id', ondelete='CASCADE'), primary_key=True)
child class-attribute instance-attribute
child = Column(BIGINT, ForeignKey('resolutions.resolution_id', ondelete='CASCADE'), primary_key=True)
level class-attribute instance-attribute
level = Column(INTEGER, nullable=False)
truth_cache class-attribute instance-attribute
truth_cache = Column(SMALLINT, nullable=True)
__table_args__ class-attribute instance-attribute
__table_args__ = (CheckConstraint('parent != child', name='no_self_reference'), CheckConstraint('level > 0', name='positive_level'))
count classmethod
count() -> int

Counts the number of rows in the table.
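
As a rough illustration of what a closure table holds, the sketch below expands direct parent-to-child edges into `(parent, child, level)` rows, where level 1 is a direct parent. The function name, the integer IDs, and the choice to keep the shortest path when an ancestor is reachable several ways are assumptions, not the backend's implementation.

```python
def closure_rows(direct_edges):
    """Expand direct (parent, child) edges into closure rows (parent, child, level).

    Assumes the graph is acyclic, as a lineage should be.
    """
    parents_of = {}
    for parent, child in direct_edges:
        parents_of.setdefault(child, set()).add(parent)

    rows = {}
    for child in parents_of:
        frontier = [(p, 1) for p in parents_of[child]]
        while frontier:
            ancestor, level = frontier.pop()
            key = (ancestor, child)
            # Assumption: keep the shortest path if an ancestor is reachable several ways.
            if key not in rows or level < rows[key]:
                rows[key] = level
                frontier.extend((gp, level + 1) for gp in parents_of.get(ancestor, ()))
    return sorted((p, c, lvl) for (p, c), lvl in rows.items())


# Hypothetical lineage: resolution 3 is built from 1; resolution 4 from 3 and 2.
rows = closure_rows([(1, 3), (3, 4), (2, 4)])

# With the closure rows in place, all ancestors of 4 come from one non-recursive scan.
ancestors_of_4 = {parent for parent, child, _ in rows if child == 4}
```

The redundancy is visible in the output: the pair `(1, 4)` is stored explicitly at level 2 even though it is implied by `(1, 3)` and `(3, 4)`.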

Resolutions

Bases: CountMixin, MatchboxBase

Table of resolution points corresponding to models and sources.

Resolutions produce probabilities or own data in the clusters table.

Methods:

  • get_lineage

    Returns lineage ordered by priority.

  • from_path

    Resolves a resolution path to a Resolutions object.

  • from_dto

    Create a Resolutions instance from a Resolution DTO object.

  • to_dto

    Convert ORM resolution to a matchbox.common Resolution object.

  • count

    Counts the number of rows in the table.

Attributes:

__tablename__ class-attribute instance-attribute
__tablename__ = 'resolutions'
resolution_id class-attribute instance-attribute
resolution_id = Column(BIGINT, primary_key=True, autoincrement=True)
run_id class-attribute instance-attribute
run_id = Column(BIGINT, ForeignKey('runs.run_id', ondelete='CASCADE'), nullable=False)
name class-attribute instance-attribute
name = Column(TEXT, nullable=False)
description class-attribute instance-attribute
description = Column(TEXT, nullable=True)
type class-attribute instance-attribute
type = Column(TEXT, nullable=False)
hash class-attribute instance-attribute
hash = Column(BYTEA, nullable=True)
truth class-attribute instance-attribute
truth = Column(SMALLINT, nullable=True)
source_config class-attribute instance-attribute
source_config = relationship('SourceConfigs', back_populates='source_resolution', uselist=False)
model_config class-attribute instance-attribute
model_config = relationship('ModelConfigs', back_populates='model_resolution', uselist=False)
probabilities class-attribute instance-attribute
probabilities = relationship('Probabilities', back_populates='proposed_by', passive_deletes=True)
results class-attribute instance-attribute
results = relationship('Results', back_populates='proposed_by', passive_deletes=True)
children class-attribute instance-attribute
children = relationship('Resolutions', secondary=__table__, primaryjoin='Resolutions.resolution_id == ResolutionFrom.parent', secondaryjoin='Resolutions.resolution_id == ResolutionFrom.child', backref='parents')
run class-attribute instance-attribute
run = relationship('Runs', back_populates='resolutions')
__table_args__ class-attribute instance-attribute
__table_args__ = (CheckConstraint("type IN ('model', 'source')", name='resolution_type_constraints'), UniqueConstraint('run_id', 'name', name='resolutions_name_key'))
ancestors property
ancestors: set[Resolutions]

Returns all ancestors (parents, grandparents, etc.) of this resolution.

descendants property
descendants: set[Resolutions]

Returns descendants (children, grandchildren, etc.) of this resolution.

get_lineage
get_lineage(sources: list[SourceConfigs] | None = None, threshold: int | None = None) -> list[tuple[int, int, float | None]]

Returns lineage ordered by priority.

Highest priority (lowest level) first, then by resolution_id for stability.

Parameters:

  • sources
    (list[SourceConfigs] | None, default: None ) –

    If provided, only return lineage paths that lead to these sources

  • threshold
    (int | None, default: None ) –

    If provided, override this resolution’s threshold

Returns:

  • list[tuple[int, int, float | None]]

    List of tuples (resolution_id, source_config_id, threshold) ordered by priority.
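
The ordering rule can be sketched in plain Python. The row dictionaries, IDs, and threshold values below are hypothetical; only the sort key (level first, then `resolution_id`) and the shape of the returned tuples come from the description above.

```python
# Hypothetical lineage rows; "level" is the distance recorded in the closure table.
lineage = [
    {"resolution_id": 9, "level": 2, "source_config_id": 3, "threshold": None},
    {"resolution_id": 5, "level": 1, "source_config_id": 2, "threshold": 0.8},
    {"resolution_id": 4, "level": 1, "source_config_id": 1, "threshold": 0.9},
]

# Lowest level first (highest priority), ties broken by resolution_id for a stable order.
ordered = sorted(lineage, key=lambda row: (row["level"], row["resolution_id"]))

result = [
    (row["resolution_id"], row["source_config_id"], row["threshold"])
    for row in ordered
]
```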

from_path classmethod
from_path(path: ResolutionPath, res_type: ResolutionType | None = None, session: Session | None = None) -> Resolutions

Resolves a resolution path to a Resolutions object.

Parameters:

  • path
    (ResolutionPath) –

    The path of the resolution to resolve.

  • res_type
    (ResolutionType | None, default: None ) –

    A resolution type to use as filter.

  • session
    (Session | None, default: None ) –

    A session to get the resolution for updates.

Raises:

from_dto classmethod

Create a Resolutions instance from a Resolution DTO object.

The resolution will be added to the session and flushed (but not committed).

For model resolutions, lineage entries will be created automatically.

Parameters:

  • resolution
    (Resolution) –

    The Resolution DTO to convert

  • path
    (ResolutionPath) –

    The full resolution path

  • session
    (Session) –

    Database session (caller must commit)

Returns:

  • Resolutions

    A Resolutions ORM instance with ID and relationships established

to_dto
to_dto() -> Resolution

Convert ORM resolution to a matchbox.common Resolution object.

count classmethod
count() -> int

Counts the number of rows in the table.

PKSpace

Bases: MatchboxBase

Table used to reserve ranges of primary keys.

Methods:

  • initialise

    Create the PKSpace tracking row if it does not exist.

  • reserve_block

    Atomically get next available ID for table, and increment it.

Attributes:

__tablename__ class-attribute instance-attribute
__tablename__ = 'pk_space'
id class-attribute instance-attribute
id = Column(BIGINT, primary_key=True)
next_cluster_id class-attribute instance-attribute
next_cluster_id = Column(BIGINT, nullable=False)
next_cluster_keys_id class-attribute instance-attribute
next_cluster_keys_id = Column(BIGINT, nullable=False)
initialise classmethod
initialise() -> None

Create the PKSpace tracking row if it does not exist.

reserve_block classmethod
reserve_block(table: Literal['clusters', 'cluster_keys'], block_size: int) -> int

Atomically get next available ID for table, and increment it.
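
A minimal sketch of block reservation, using `sqlite3` in place of PostgreSQL. It assumes a single tracking row with `id = 1`, counters starting at 1, and that the function returns the first ID of the reserved block; none of these details are confirmed by the reference above.

```python
import sqlite3

COUNTERS = {"clusters": "next_cluster_id", "cluster_keys": "next_cluster_keys_id"}


def reserve_block(conn, table, block_size):
    """Return the first ID of a newly reserved block and advance the counter."""
    column = COUNTERS[table]  # KeyError for unknown tables; also guards the f-strings below
    if block_size < 1:
        raise ValueError("block_size must be positive")
    # BEGIN IMMEDIATE takes the write lock up front, so the read and the update
    # cannot interleave with another writer.
    conn.execute("BEGIN IMMEDIATE")
    try:
        (start,) = conn.execute(
            f"SELECT {column} FROM pk_space WHERE id = 1"
        ).fetchone()
        conn.execute(
            f"UPDATE pk_space SET {column} = {column} + ? WHERE id = 1",
            (block_size,),
        )
        conn.execute("COMMIT")
    except Exception:
        conn.execute("ROLLBACK")
        raise
    return start


# isolation_level=None disables implicit transactions so BEGIN/COMMIT are explicit.
conn = sqlite3.connect(":memory:", isolation_level=None)
conn.execute(
    "CREATE TABLE pk_space ("
    " id INTEGER PRIMARY KEY,"
    " next_cluster_id INTEGER NOT NULL,"
    " next_cluster_keys_id INTEGER NOT NULL)"
)
conn.execute("INSERT INTO pk_space VALUES (1, 1, 1)")

first = reserve_block(conn, "clusters", 100)   # reserves IDs 1..100
second = reserve_block(conn, "clusters", 50)   # reserves IDs 101..150
keys_counter = conn.execute(
    "SELECT next_cluster_keys_id FROM pk_space WHERE id = 1"
).fetchone()[0]
```

In PostgreSQL the same effect is typically achieved with a single `UPDATE ... RETURNING` statement or a `SELECT ... FOR UPDATE` inside a transaction.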

SourceFields

Bases: CountMixin, MatchboxBase

Table for storing column details for SourceConfigs.

Methods:

  • count

    Counts the number of rows in the table.

Attributes:

__tablename__ class-attribute instance-attribute
__tablename__ = 'source_fields'
field_id class-attribute instance-attribute
field_id = Column(BIGINT, primary_key=True)
source_config_id class-attribute instance-attribute
source_config_id = Column(BIGINT, ForeignKey('source_configs.source_config_id', ondelete='CASCADE'), nullable=False)
index class-attribute instance-attribute
index = Column(INTEGER, nullable=False)
name class-attribute instance-attribute
name = Column(TEXT, nullable=False)
type class-attribute instance-attribute
type = Column(TEXT, nullable=False)
is_key class-attribute instance-attribute
is_key = Column(BOOLEAN, nullable=False)
source_config class-attribute instance-attribute
source_config = relationship('SourceConfigs', back_populates='fields', foreign_keys=[source_config_id])
__table_args__ class-attribute instance-attribute
__table_args__ = (UniqueConstraint('source_config_id', 'index', name='unique_index'), Index('ix_source_columns_source_config_id', 'source_config_id'), Index('ix_unique_key_field', 'source_config_id', unique=True, postgresql_where=text('is_key = true')))
count classmethod
count() -> int

Counts the number of rows in the table.

ClusterSourceKey

Bases: CountMixin, MatchboxBase

Table for storing source primary keys for clusters.

Methods:

  • count

    Counts the number of rows in the table.

Attributes:

__tablename__ class-attribute instance-attribute
__tablename__ = 'cluster_keys'
key_id class-attribute instance-attribute
key_id = Column(BIGINT, primary_key=True)
cluster_id class-attribute instance-attribute
cluster_id = Column(BIGINT, ForeignKey('clusters.cluster_id', ondelete='CASCADE'), nullable=False)
source_config_id class-attribute instance-attribute
source_config_id = Column(BIGINT, ForeignKey('source_configs.source_config_id', ondelete='CASCADE'), nullable=False)
key class-attribute instance-attribute
key = Column(TEXT, nullable=False)
cluster class-attribute instance-attribute
cluster = relationship('Clusters', back_populates='keys')
source_config class-attribute instance-attribute
source_config = relationship('SourceConfigs', back_populates='cluster_keys')
__table_args__ class-attribute instance-attribute
__table_args__ = (Index('ix_cluster_keys_cluster_id', 'cluster_id'), Index('ix_cluster_keys_keys', 'key'), Index('ix_cluster_keys_source_config_id', 'source_config_id'), UniqueConstraint('key_id', 'source_config_id', name='unique_keys_source'))
count classmethod
count() -> int

Counts the number of rows in the table.

SourceConfigs

SourceConfigs(key_field: SourceFields | None = None, index_fields: list[SourceFields] | None = None, **kwargs)

Bases: CountMixin, MatchboxBase

Table of source configurations for Matchbox.

Methods:

  • list_all

    Returns all source_configs in the database.

  • from_dto

    Create a SourceConfigs instance from a SourceConfig DTO object.

  • to_dto

    Convert ORM source to a matchbox.common.SourceConfig object.

  • count

    Counts the number of rows in the table.

Attributes:

__tablename__ class-attribute instance-attribute
__tablename__ = 'source_configs'
source_config_id class-attribute instance-attribute
source_config_id = Column(BIGINT, Identity(start=1), primary_key=True)
resolution_id class-attribute instance-attribute
resolution_id = Column(BIGINT, ForeignKey('resolutions.resolution_id', ondelete='CASCADE'), nullable=False)
location_type class-attribute instance-attribute
location_type = Column(TEXT, nullable=False)
location_name class-attribute instance-attribute
location_name = Column(TEXT, nullable=False)
extract_transform class-attribute instance-attribute
extract_transform = Column(TEXT, nullable=False)
name property
name: str

Get the name of the related resolution.

source_resolution class-attribute instance-attribute
source_resolution = relationship('Resolutions', back_populates='source_config')
fields class-attribute instance-attribute
fields = relationship('SourceFields', back_populates='source_config', passive_deletes=True, cascade='all, delete-orphan')
key_field class-attribute instance-attribute
key_field = relationship('SourceFields', primaryjoin='and_(SourceConfigs.source_config_id == SourceFields.source_config_id, SourceFields.is_key == True)', viewonly=True, uselist=False)
index_fields class-attribute instance-attribute
index_fields = relationship('SourceFields', primaryjoin='and_(SourceConfigs.source_config_id == SourceFields.source_config_id, SourceFields.is_key == False)', viewonly=True, order_by='SourceFields.index', collection_class=list)
cluster_keys class-attribute instance-attribute
cluster_keys = relationship('ClusterSourceKey', back_populates='source_config', passive_deletes=True)
clusters class-attribute instance-attribute
clusters = relationship('Clusters', secondary=__table__, primaryjoin='SourceConfigs.source_config_id == ClusterSourceKey.source_config_id', secondaryjoin='ClusterSourceKey.cluster_id == Clusters.cluster_id', viewonly=True)
list_all classmethod
list_all() -> list[SourceConfigs]

Returns all source_configs in the database.

from_dto classmethod
from_dto(config: SourceConfig) -> SourceConfigs

Create a SourceConfigs instance from a SourceConfig DTO object.

to_dto
to_dto() -> SourceConfig

Convert ORM source to a matchbox.common.SourceConfig object.

count classmethod
count() -> int

Counts the number of rows in the table.

ModelConfigs

Bases: CountMixin, MatchboxBase

Table of model configs for Matchbox.

Methods:

  • list_all

    Returns all model_configs in the database.

  • from_dto

    Create a ModelConfigs instance from a ModelConfig DTO object.

  • to_dto

    Convert ORM model config to a matchbox.common.ModelConfig object.

  • count

    Counts the number of rows in the table.

Attributes:

__tablename__ class-attribute instance-attribute
__tablename__ = 'model_configs'
model_config_id class-attribute instance-attribute
model_config_id = Column(BIGINT, Identity(start=1), primary_key=True)
resolution_id class-attribute instance-attribute
resolution_id = Column(BIGINT, ForeignKey('resolutions.resolution_id', ondelete='CASCADE'), nullable=False)
model_class class-attribute instance-attribute
model_class = Column(TEXT, nullable=False)
model_settings class-attribute instance-attribute
model_settings = Column(JSONB, nullable=False)
left_query class-attribute instance-attribute
left_query = Column(JSONB, nullable=False)
right_query class-attribute instance-attribute
right_query = Column(JSONB, nullable=True)
name property
name: str

Get the name of the related resolution.

model_resolution class-attribute instance-attribute
model_resolution = relationship('Resolutions', back_populates='model_config')
list_all classmethod
list_all() -> list[SourceConfigs]

Returns all model_configs in the database.

from_dto classmethod
from_dto(config: ModelConfig) -> ModelConfigs

Create a ModelConfigs instance from a ModelConfig DTO object.

to_dto
to_dto() -> ModelConfig

Convert ORM model config to a matchbox.common.ModelConfig object.

count classmethod
count() -> int

Counts the number of rows in the table.

Contains

Bases: CountMixin, MatchboxBase

Cluster lineage table.

Methods:

  • count

    Counts the number of rows in the table.

Attributes:

__tablename__ class-attribute instance-attribute
__tablename__ = 'contains'
root class-attribute instance-attribute
root = Column(BIGINT, ForeignKey('clusters.cluster_id', ondelete='CASCADE'), primary_key=True)
leaf class-attribute instance-attribute
leaf = Column(BIGINT, ForeignKey('clusters.cluster_id', ondelete='CASCADE'), primary_key=True)
__table_args__ class-attribute instance-attribute
__table_args__ = (CheckConstraint('root != leaf', name='no_self_containment'), Index('ix_contains_root_leaf', 'root', 'leaf'), Index('ix_contains_leaf_root', 'leaf', 'root'))
count classmethod
count() -> int

Counts the number of rows in the table.

Clusters

Bases: CountMixin, MatchboxBase

Table of indexed data and clusters that match it.

Methods:

  • count

    Counts the number of rows in the table.

Attributes:

__tablename__ class-attribute instance-attribute
__tablename__ = 'clusters'
cluster_id class-attribute instance-attribute
cluster_id = Column(BIGINT, primary_key=True)
cluster_hash class-attribute instance-attribute
cluster_hash = Column(BYTEA, nullable=False)
keys class-attribute instance-attribute
keys = relationship('ClusterSourceKey', back_populates='cluster', passive_deletes=True)
probabilities class-attribute instance-attribute
probabilities = relationship('Probabilities', back_populates='proposes', passive_deletes=True)
leaves class-attribute instance-attribute
leaves = relationship('Clusters', secondary=__table__, primaryjoin='Clusters.cluster_id == Contains.root', secondaryjoin='Clusters.cluster_id == Contains.leaf', backref='roots')
source_configs class-attribute instance-attribute
source_configs = relationship('SourceConfigs', secondary=__table__, primaryjoin='Clusters.cluster_id == ClusterSourceKey.cluster_id', secondaryjoin='ClusterSourceKey.source_config_id == SourceConfigs.source_config_id', viewonly=True)
__table_args__ class-attribute instance-attribute
__table_args__ = (UniqueConstraint('cluster_hash', name='clusters_hash_key'),)
count classmethod
count() -> int

Counts the number of rows in the table.

Users

Bases: CountMixin, MatchboxBase

Table of identities of human validators.

Methods:

  • count

    Counts the number of rows in the table.

Attributes:

__tablename__ class-attribute instance-attribute
__tablename__ = 'users'
user_id class-attribute instance-attribute
user_id = Column(BIGINT, primary_key=True)
name class-attribute instance-attribute
name = Column(TEXT, nullable=False)
judgements class-attribute instance-attribute
judgements = relationship('EvalJudgements', back_populates='user')
__table_args__ class-attribute instance-attribute
__table_args__ = (UniqueConstraint('name', name='user_name_unique'),)
count classmethod
count() -> int

Counts the number of rows in the table.

EvalJudgements

Bases: CountMixin, MatchboxBase

Table of evaluation judgements produced by human validators.

Methods:

  • count

    Counts the number of rows in the table.

Attributes:

__tablename__ class-attribute instance-attribute
__tablename__ = 'eval_judgements'
judgement_id class-attribute instance-attribute
judgement_id = Column(BIGINT, primary_key=True)
user_id class-attribute instance-attribute
user_id = Column(BIGINT, ForeignKey('users.user_id', ondelete='CASCADE'), nullable=False)
endorsed_cluster_id class-attribute instance-attribute
endorsed_cluster_id = Column(BIGINT, ForeignKey('clusters.cluster_id', ondelete='CASCADE'), nullable=False)
shown_cluster_id class-attribute instance-attribute
shown_cluster_id = Column(BIGINT, ForeignKey('clusters.cluster_id', ondelete='CASCADE'), nullable=False)
timestamp class-attribute instance-attribute
timestamp = Column(DateTime(timezone=True), nullable=False)
user class-attribute instance-attribute
user = relationship('Users', back_populates='judgements')
count classmethod
count() -> int

Counts the number of rows in the table.

Probabilities

Bases: CountMixin, MatchboxBase

Table of probabilities that a cluster is correct, according to a resolution.

Methods:

  • count

    Counts the number of rows in the table.

Attributes:

__tablename__ class-attribute instance-attribute
__tablename__ = 'probabilities'
resolution_id class-attribute instance-attribute
resolution_id = Column(BIGINT, ForeignKey('resolutions.resolution_id', ondelete='CASCADE'), primary_key=True)
cluster_id class-attribute instance-attribute
cluster_id = Column(BIGINT, ForeignKey('clusters.cluster_id', ondelete='CASCADE'), primary_key=True)
probability class-attribute instance-attribute
probability = Column(SMALLINT, nullable=False)
proposed_by class-attribute instance-attribute
proposed_by = relationship('Resolutions', back_populates='probabilities')
proposes class-attribute instance-attribute
proposes = relationship('Clusters', back_populates='probabilities')
__table_args__ class-attribute instance-attribute
__table_args__ = (CheckConstraint('probability BETWEEN 0 AND 100', name='valid_probability'), Index('ix_probabilities_resolution', 'resolution_id'))
count classmethod
count() -> int

Counts the number of rows in the table.

Results

Bases: CountMixin, MatchboxBase

Table of results for a resolution.

Stores the raw left/right probabilities created by a model.

Methods:

  • count

    Counts the number of rows in the table.

Attributes:

__tablename__ class-attribute instance-attribute
__tablename__ = 'results'
result_id class-attribute instance-attribute
result_id = Column(BIGINT, primary_key=True, autoincrement=True)
resolution_id class-attribute instance-attribute
resolution_id = Column(BIGINT, ForeignKey('resolutions.resolution_id', ondelete='CASCADE'), nullable=False)
left_id class-attribute instance-attribute
left_id = Column(BIGINT, ForeignKey('clusters.cluster_id', ondelete='CASCADE'), nullable=False)
right_id class-attribute instance-attribute
right_id = Column(BIGINT, ForeignKey('clusters.cluster_id', ondelete='CASCADE'), nullable=False)
probability class-attribute instance-attribute
probability = Column(SMALLINT, nullable=False)
proposed_by class-attribute instance-attribute
proposed_by = relationship('Resolutions', back_populates='results')
__table_args__ class-attribute instance-attribute
__table_args__ = (Index('ix_results_resolution', 'resolution_id'), CheckConstraint('probability BETWEEN 0 AND 100', name='valid_probability'), UniqueConstraint('resolution_id', 'left_id', 'right_id'))
count classmethod
count() -> int

Counts the number of rows in the table.

utils

Utilities for using the PostgreSQL backend.

Modules:

  • db

    General utilities for the PostgreSQL backend.

  • evaluation

    Evaluation logic for PostgreSQL adapter.

  • insert

    Utilities for inserting data into the PostgreSQL backend.

  • query

    Utilities for querying and matching in the PostgreSQL backend.

  • results

    Utilities for querying model results from the PostgreSQL backend.

db

General utilities for the PostgreSQL backend.

Functions:

  • dump

    Dumps the entire database to a snapshot.

  • restore

    Restores the database from a snapshot.

  • sqa_profiled

    SQLAlchemy profiler.

  • compile_sql

    Compiles a SQLAlchemy statement into a string.

  • large_append

    Append a PyArrow table to a PostgreSQL table using ADBC.

  • ingest_to_temporary_table

    Context manager to ingest Arrow data to a temporary table with explicit types.

dump
dump() -> MatchboxSnapshot

Dumps the entire database to a snapshot.

Returns:

  • MatchboxSnapshot

    A MatchboxSnapshot object of type “postgres” with the database’s current state.

restore

Restores the database from a snapshot.

Parameters:

  • snapshot
    (MatchboxSnapshot) –

    A MatchboxSnapshot object of type “postgres” with the database’s state

  • batch_size
    (int) –

    The number of records to insert in each batch

Raises:

sqa_profiled
sqa_profiled()

SQLAlchemy profiler.

Taken directly from the SQLAlchemy docs: https://docs.sqlalchemy.org/en/20/faq/performance.html#query-profiling

compile_sql
compile_sql(stmt: Select) -> str

Compiles a SQLAlchemy statement into a string.

Parameters:

  • stmt
    (Select) –

    The SQLAlchemy statement to compile.

Returns:

  • str

    The compiled SQL statement as a string.

large_append
large_append(data: Table, table_class: DeclarativeMeta, adbc_connection: PoolProxiedConnection, max_chunksize: int | None = None)

Append a PyArrow table to a PostgreSQL table using ADBC.

This function does not support upserting and will raise an error if keys clash. It does not auto-commit; committing is the responsibility of the caller.

Parameters:

  • data
    (Table) –

    A PyArrow table to write.

  • table_class
    (DeclarativeMeta) –

    The SQLAlchemy ORM class for the table to write to.

  • adbc_connection
    (PoolProxiedConnection) –

    An ADBC connection from the pool. This is returned by MBDB.get_adbc_connection() and needs to be used via a context manager.

  • max_chunksize
    (int | None, default: None ) –

    Size of data chunks to be read and copied.

ingest_to_temporary_table
ingest_to_temporary_table(table_name: str, schema_name: str, data: Table, column_types: dict[str, type[TypeEngine]], max_chunksize: int | None = None) -> Generator[Table, None, None]

Context manager to ingest Arrow data to a temporary table with explicit types.

Parameters:

  • table_name
    (str) –

    Base name for the temporary table

  • schema_name
    (str) –

    Schema where the temporary table will be created

  • data
    (Table) –

    PyArrow table containing the data to ingest

  • column_types
    (dict[str, type[TypeEngine]]) –

    Map of column names to SQLAlchemy types

  • max_chunksize
    (int | None, default: None ) –

    Optional maximum chunk size for batches

Returns:

  • Table

    A SQLAlchemy Table object representing the temporary table

evaluation

Evaluation logic for PostgreSQL adapter.

Functions:

insert_judgement
insert_judgement(judgement: Judgement)

Record judgement to server.

get_judgements
get_judgements() -> tuple[DataFrame, DataFrame]

Get all judgements from server.

sample
sample(n: int, resolution_path: ModelResolutionPath, user_id: int)

Sample some clusters from a resolution.

compare_models
compare_models(resolutions: list[ModelResolutionPath], judgements: DataFrame, expansion: DataFrame)

Compare models on the basis of precision and recall.

insert

Utilities for inserting data into the PostgreSQL backend.

Functions:

insert_hashes
insert_hashes(path: SourceResolutionPath, data_hashes: Table, batch_size: int) -> None

Indexes hash data for a source within Matchbox.

Parameters:

  • path
    (SourceResolutionPath) –

    The path of the source resolution

  • data_hashes
    (Table) –

    Arrow table containing hash data

  • batch_size
    (int) –

    Batch size for bulk operations

insert_results
insert_results(path: ModelResolutionPath, results: Table, batch_size: int) -> None

Writes a results table to Matchbox.

The PostgreSQL backend stores clusters in a hierarchical structure, where each component references its parent component at a higher threshold.

This means two-item components are synonymous with their original pairwise probabilities.

This allows easy querying of clusters at any threshold.

Parameters:

  • path
    (ModelResolutionPath) –

    The path of the model resolution to upload results for

  • results
    (Table) –

    A PyArrow results table with left_id, right_id, probability

  • batch_size
    (int) –

    Number of records to insert in each batch

Raises:

  • MatchboxResolutionNotFoundError

    If the specified model doesn’t exist.
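
The relationship between pairwise probabilities and threshold-dependent clusters can be sketched with a small union-find. This is an illustration of the idea only, not the backend's insertion logic; the function name, the sample pairs, and the inclusive `>=` comparison are assumptions.

```python
def components_at(pairs, threshold):
    """Connected components formed by pairs with probability >= threshold."""
    parent = {}

    def find(node):
        parent.setdefault(node, node)
        while parent[node] != node:
            parent[node] = parent[parent[node]]  # path halving
            node = parent[node]
        return node

    for left, right, probability in pairs:
        if probability >= threshold:
            root_left, root_right = find(left), find(right)
            if root_left != root_right:
                parent[root_right] = root_left

    groups = {}
    for node in list(parent):
        groups.setdefault(find(node), set()).add(node)
    return sorted(sorted(group) for group in groups.values())


# Hypothetical results rows: (left_id, right_id, probability as an integer 0..100).
pairs = [(1, 2, 95), (2, 3, 80), (4, 5, 90)]

at_90 = components_at(pairs, 90)    # only the two strongest pairs qualify
at_80 = components_at(pairs, 80)    # lowering the threshold merges 3 into {1, 2}
at_100 = components_at(pairs, 100)  # nothing qualifies
```

At threshold 90 each component is exactly one of the original pairs, which is the sense in which two-item components correspond to pairwise probabilities. Lowering the threshold to 80 grows `[1, 2]` into `[1, 2, 3]`.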

query

Utilities for querying and matching in the PostgreSQL backend.

Functions:

  • build_unified_query

    Build a query to resolve cluster assignments across resolution hierarchies.

  • query

    Queries Matchbox to retrieve linked data for a source.

  • get_parent_clusters_and_leaves

    Query clusters and their leaves for all parent resolutions.

  • match

    Matches an ID in a source resolution and returns the keys in the targets.

Attributes:

  • T
T module-attribute
T = TypeVar('T')
build_unified_query
build_unified_query(resolution: Resolutions, sources: list[SourceConfigs] | None = None, threshold: int | None = None, level: Literal['leaf', 'key'] = 'leaf', get_hashes: bool = False) -> Select

Build a query to resolve cluster assignments across resolution hierarchies.

This function creates SQL that determines which cluster each source record belongs to by traversing up a resolution hierarchy and applying priority-based cluster selection.

The query uses COALESCE to implement a priority system where higher-level resolutions can “claim” records, with lower levels only processing unclaimed records:

COALESCE(highest_priority_cluster, medium_priority_cluster, ..., source_cluster)

The query is assembled in five steps:

  1. Lineage discovery: Queries the resolution hierarchy to find all ancestor resolutions, ordered by priority (lowest level = highest priority)
  2. Source filtering: When sources is provided, constrains results to only include clusters from those specific source configurations
  3. Threshold application: Applies probability thresholds to determine which clusters qualify at each resolution level
  4. Subquery construction: For each model resolution in the lineage, builds a subquery that finds qualifying clusters via the Contains→Probabilities join. Each joined subquery adds a new cluster column which is then merged via…
  5. COALESCE assembly: Joins all subqueries to source data and uses COALESCE to select the highest-priority cluster assignment for each record

The level changes the data returned:

  • "leaf": Returns both root and leaf cluster IDs. For unmerged source clusters, the root and leaf properties will be the same.
  • "key": In addition to the above, it also returns the source key. This will give more rows than "leaf" because it needs a row for every key attached to a leaf.

Additionally, if get_hashes is set to True, the root and leaf hashes are returned.
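
The COALESCE priority rule can be demonstrated on toy data. The sketch below uses `sqlite3` and three hypothetical tables (`leaves`, `high_priority`, `low_priority`) standing in for the source data and two per-resolution subqueries; it is a simplified illustration, not the SQL that `build_unified_query` emits.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE leaves (leaf_id INTEGER PRIMARY KEY);
    CREATE TABLE high_priority (leaf_id INTEGER, cluster_id INTEGER);
    CREATE TABLE low_priority (leaf_id INTEGER, cluster_id INTEGER);
    INSERT INTO leaves VALUES (1), (2), (3), (4);
    INSERT INTO high_priority VALUES (1, 100), (2, 100);
    INSERT INTO low_priority VALUES (2, 200), (3, 200);
    """
)

# Each LEFT JOIN contributes one candidate cluster column; COALESCE picks the
# first non-NULL candidate, falling back to the leaf's own ID.
rows = conn.execute(
    """
    SELECT l.leaf_id,
           COALESCE(hi.cluster_id, lo.cluster_id, l.leaf_id) AS root_id
    FROM leaves AS l
    LEFT JOIN high_priority AS hi ON hi.leaf_id = l.leaf_id
    LEFT JOIN low_priority AS lo ON lo.leaf_id = l.leaf_id
    ORDER BY l.leaf_id
    """
).fetchall()
```

Leaf 2 appears in both candidate tables and is claimed by the higher-priority cluster 100. Leaf 4 is claimed by neither, so its root equals its leaf ID, matching the unmerged case described above.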

query
query(source: SourceResolutionPath, point_of_truth: ResolutionPath | None = None, threshold: int | None = None, return_leaf_id: bool = False, limit: int = None) -> Table

Queries Matchbox to retrieve linked data for a source.

Retrieves all linked data for a given source, resolving through hierarchy if needed.

  • Simple case: If querying the same resolution as the source, just select cluster IDs and keys directly from ClusterSourceKey
  • Hierarchy case: Uses the unified query builder to traverse up the resolution hierarchy, applying COALESCE priority logic to determine which parent cluster each source record belongs to
  • Priority resolution: When multiple model resolutions could assign a record to different clusters, COALESCE ensures higher-priority resolutions win

Returns all records with their final resolved cluster IDs.

get_parent_clusters_and_leaves
get_parent_clusters_and_leaves(resolution: Resolutions) -> dict[int, dict[str, list[dict]]]

Query clusters and their leaves for all parent resolutions.

For a given resolution, find all its parent resolutions and return complete cluster compositions.

  • Parent discovery: Queries ResolutionFrom to find all direct parent resolutions (level 1)
  • Cluster building: For each parent, runs the full unified query to get all cluster assignments with both root and leaf information
  • Aggregation: Collects all leaf nodes belonging to each root cluster across all parent resolutions

Returns a dictionary mapping cluster IDs to their complete leaf compositions and metadata.
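
The aggregation step amounts to grouping leaf rows by their root cluster. In the sketch below the row format and the inner `"leaves"` key are hypothetical; the real function's inner keys and metadata are not specified here.

```python
from collections import defaultdict


def group_leaves(rows):
    """Group (root, leaf) rows into {root_id: {"leaves": [{"leaf_id": ...}, ...]}}."""
    clusters = defaultdict(lambda: {"leaves": []})
    for row in rows:
        clusters[row["root_id"]]["leaves"].append({"leaf_id": row["leaf_id"]})
    return dict(clusters)


# Hypothetical output of the unified query for one parent resolution.
rows = [
    {"root_id": 10, "leaf_id": 1},
    {"root_id": 10, "leaf_id": 2},
    {"root_id": 11, "leaf_id": 3},
]
grouped = group_leaves(rows)
```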

match
match(key: str, source: SourceResolutionPath, targets: list[SourceResolutionPath], point_of_truth: ResolutionPath, threshold: int | None = None) -> list[Match]

Matches an ID in a source resolution and returns the keys in the targets.

Given a specific key in a source, find what it matches to in target sources through a resolution hierarchy.

  • Target cluster identification: Uses COALESCE priority CTE to determine which cluster the input key belongs to at the resolution level
  • Matching leaves discovery: Builds UNION ALL query with branches for:
    • Direct cluster members (source-only case)
    • Members connected through each model resolution in the hierarchy
  • Cross-reference: Joins the target cluster with all possible matching leaves, filtering for the requested target sources

Organises matches by source configuration and returns structured Match objects for each target.
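
Stripped of the SQL, the matching logic is: find the cluster that holds the input key, then collect the keys in that cluster belonging to the requested targets. The sketch below works on an in-memory list of `(source, key, cluster_id)` tuples; the function name, source names, and data are all hypothetical, and it returns plain dictionaries rather than Match objects.

```python
def match_keys(assignments, key, source, targets):
    """Return {target: set of keys} sharing a cluster with (source, key)."""
    # Step 1: identify the cluster the input key resolves to.
    cluster = next(
        (c for s, k, c in assignments if s == source and k == key), None
    )
    matches = {target: set() for target in targets}
    if cluster is None:
        return matches  # unknown key: every target gets an empty set
    # Step 2: collect keys in the same cluster, keeping only the requested targets.
    for s, k, c in assignments:
        if c == cluster and s in matches:
            matches[s].add(k)
    return matches


# Hypothetical resolved assignments: (source name, source key, final cluster ID).
assignments = [
    ("crm", "a1", 10),
    ("erp", "x9", 10),
    ("erp", "x7", 10),
    ("web", "w3", 11),
    ("web", "w4", 10),
]

found = match_keys(assignments, "a1", "crm", ["erp", "web"])
missing = match_keys(assignments, "zz", "crm", ["erp", "web"])
```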

results

Utilities for querying model results from the PostgreSQL backend.

Classes:

  • SourceInfo

    Information about a model’s sources.

Functions:

SourceInfo

Bases: NamedTuple

Information about a model’s sources.

Attributes:

left instance-attribute
left: int
right instance-attribute
right: int | None
left_ancestors instance-attribute
left_ancestors: set[int]
right_ancestors instance-attribute
right_ancestors: set[int] | None
get_model_config
get_model_config(resolution: Resolutions) -> ModelConfig

Get metadata for a model resolution.