PostgreSQL

A backend adapter for deploying Matchbox using PostgreSQL.

Two graph-like trees are stored here.

  • In the resolution subgraph, the tree is implemented as a closure table, which makes root-to-leaf path queries fast at the cost of redundant rows
  • In the data subgraph, the tree is implemented as a modified closure table that stores only the “root” and “leaf” relationships for each model
    • The leaves are the leaf IDs
    • The roots are the model’s proposed cluster IDs at a given threshold
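
As a rough illustration of the closure-table trade-off described in the first bullet, the sketch below expands direct parent-to-child edges into one row per ancestor/descendant pair. The lineage names and the helper `build_closure` are hypothetical and not part of Matchbox; only the general technique is shown.

```python
def build_closure(edges: list[tuple[str, str]]) -> set[tuple[str, str, int]]:
    """Expand direct parent->child edges into (ancestor, descendant, level) rows."""
    children: dict[str, list[str]] = {}
    for parent, child in edges:
        children.setdefault(parent, []).append(child)

    closure: set[tuple[str, str, int]] = set()

    def walk(ancestor: str, node: str, level: int) -> None:
        # Record every node reachable from `ancestor`, with its distance.
        for child in children.get(node, []):
            closure.add((ancestor, child, level))
            walk(ancestor, child, level + 1)

    for parent in children:
        walk(parent, parent, 1)
    return closure


# Hypothetical lineage: two sources are deduplicated, then linked together.
edges = [
    ("source_a", "dedupe_a"),
    ("source_b", "dedupe_b"),
    ("dedupe_a", "linker"),
    ("dedupe_b", "linker"),
]
closure = build_closure(edges)

# Finding every ancestor of "linker" is a flat filter, with no recursion at
# query time. The cost is redundancy: more closure rows than direct edges.
ancestors = sorted(a for a, d, _ in closure if d == "linker")
```

Here four direct edges become six closure rows, which is the redundancy the first bullet refers to.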
erDiagram
    Collections {
        bigint collection_id PK
        text name
    }
    Runs {
        bigint run_id PK
        bigint collection_id FK
        boolean is_mutable
        boolean is_default
    }
    Resolutions {
        bigint resolution_id PK
        bigint run_id FK
        text name
        text description
        text type
        bytea fingerprint
        smallint truth
        enum upload_stage
    }
    ResolutionFrom {
        bigint parent PK,FK
        bigint child PK,FK
        integer level
        smallint truth_cache
    }
    SourceConfigs {
        bigint source_config_id PK
        bigint resolution_id FK
        text location_type
        text location_name
        text extract_transform
    }
    SourceFields {
        bigint field_id PK
        bigint source_config_id FK
        integer index
        text name
        text type
        boolean is_key
    }
    ModelConfigs {
        bigint model_config_id PK
        bigint resolution_id FK
        text model_class
        jsonb model_settings
        jsonb left_query
        jsonb right_query
    }
    Clusters {
        bigint cluster_id PK
        bytea cluster_hash
    }
    ClusterSourceKey {
        bigint key_id PK
        bigint cluster_id FK
        bigint source_config_id FK
        text key
    }
    Contains {
        bigint root PK,FK
        bigint leaf PK,FK
    }
    PKSpace {
        bigint id PK
        bigint next_cluster_id
        bigint next_cluster_keys_id
    }
    Probabilities {
        bigint resolution_id PK,FK
        bigint cluster_id PK,FK
        smallint probability
    }
    Results {
        bigint result_id PK
        bigint resolution_id FK
        bigint left_id FK
        bigint right_id FK
        smallint probability
    }
    Users {
        bigint user_id PK
        text name
        text email
    }
    Groups {
        bigint group_id PK
        text name
        text description
        boolean is_system
    }
    UserGroups {
        bigint user_id PK,FK
        bigint group_id PK,FK
    }
    Permissions {
        bigint permission_id PK
        text permission
        bigint group_id FK
        bigint collection_id FK
        boolean is_system
    }
    EvalJudgements {
        bigint judgement_id PK
        bigint user_id FK
        bigint endorsed_cluster_id FK
        bigint shown_cluster_id FK
        datetime timestamp
    }

    Collections ||--o{ Runs : ""
    Collections ||--o{ Permissions : ""
    Runs ||--o{ Resolutions : ""
    Resolutions ||--o{ ResolutionFrom : "parent"
    ResolutionFrom }o--|| Resolutions : "child"
    Resolutions |o--|| SourceConfigs : ""
    Resolutions |o--|| ModelConfigs : ""
    Resolutions ||--o{ Probabilities : ""
    Resolutions ||--o{ Results : ""
    SourceConfigs ||--o{ SourceFields : ""
    SourceConfigs ||--o{ ClusterSourceKey : ""
    Clusters ||--o{ ClusterSourceKey : ""
    Clusters ||--o{ Contains : "root"
    Contains }o--|| Clusters : "leaf"
    Clusters ||--o{ Probabilities : ""
    Clusters ||--o{ Results : "left_id"
    Clusters ||--o{ Results : "right_id"
    Clusters ||--o{ EvalJudgements : "endorsed_cluster_id"
    Clusters ||--o{ EvalJudgements : "shown_cluster_id"
    Users ||--o{ UserGroups : ""
    Users ||--o{ EvalJudgements : ""
    Groups ||--o{ UserGroups : ""
    Groups ||--o{ Permissions : ""

matchbox.server.postgresql

PostgreSQL adapter for Matchbox server.

Modules:

  • adapter

    Composed PostgreSQL adapter for Matchbox server.

  • db

    Matchbox PostgreSQL database connection.

  • mixin

    A module for defining mixins for the PostgreSQL backend ORM.

  • orm

    ORM classes for the Matchbox PostgreSQL database.

  • utils

    Utilities for using the PostgreSQL backend.

Classes:

__all__ module-attribute

__all__ = ['MatchboxPostgres', 'MatchboxPostgresSettings']

MatchboxPostgres

MatchboxPostgres(settings: MatchboxPostgresSettings)

Bases: MatchboxPostgresQueryMixin, MatchboxPostgresEvaluationMixin, MatchboxPostgresCollectionsMixin, MatchboxPostgresAdminMixin, MatchboxDBAdapter

A PostgreSQL adapter for Matchbox.

Methods:

Attributes:

settings instance-attribute

settings = settings

sources instance-attribute

sources = SourceConfigs

models instance-attribute

models = FilteredResolutions(sources=False, models=True)

source_clusters instance-attribute

source_clusters = FilteredClusters(has_source=True)

model_clusters instance-attribute

model_clusters = FilteredClusters(has_source=False)

all_clusters instance-attribute

all_clusters = FilteredClusters()

creates instance-attribute

creates = FilteredProbabilities(over_truth=True)

merges instance-attribute

merges = Contains

proposes instance-attribute

proposes = FilteredProbabilities()

source_resolutions instance-attribute

source_resolutions = FilteredResolutions(sources=True, models=False)

query

query(source: SourceResolutionPath, point_of_truth: ResolutionPath | None = None, threshold: int | None = None, return_leaf_id: bool = False, limit: int | None = None) -> Table

match

match(key: str, source: SourceResolutionPath, targets: list[SourceResolutionPath], point_of_truth: ResolutionPath, threshold: int | None = None) -> list[Match]

create_collection

create_collection(name: CollectionName) -> Collection

get_collection

get_collection(name: CollectionName) -> Collection

list_collections

list_collections() -> list[CollectionName]

delete_collection

delete_collection(name: CollectionName, certain: bool) -> None

create_run

create_run(collection: CollectionName) -> Run

set_run_mutable

set_run_mutable(collection: CollectionName, run_id: RunID, mutable: bool) -> Run

set_run_default

set_run_default(collection: CollectionName, run_id: RunID, default: bool) -> Run

get_run

get_run(collection: CollectionName, run_id: RunID) -> Run

delete_run

delete_run(collection: CollectionName, run_id: RunID, certain: bool) -> None

create_resolution

create_resolution(resolution: Resolution, path: ResolutionPath) -> None

get_resolution

get_resolution(path: ResolutionPath) -> Resolution

update_resolution

update_resolution(resolution: Resolution, path: ResolutionPath) -> None

delete_resolution

delete_resolution(path: ResolutionPath, certain: bool) -> None

lock_resolution_data

lock_resolution_data(path: ResolutionPath) -> None

unlock_resolution_data

unlock_resolution_data(path: ResolutionPath, complete: bool = False) -> None

get_resolution_stage

get_resolution_stage(path: ResolutionPath) -> UploadStage

insert_source_data

insert_source_data(path: SourceResolutionPath, data_hashes: Table) -> None

insert_model_data

insert_model_data(path: ModelResolutionPath, results: Table) -> None

get_model_data

get_model_data(path: ModelResolutionPath) -> Table

validate_ids

validate_ids(ids: list[int]) -> bool

dump

dump() -> MatchboxSnapshot

drop

drop(certain: bool) -> None

clear

clear(certain: bool) -> None

restore

restore(snapshot: MatchboxSnapshot) -> None

delete_orphans

delete_orphans() -> int

login

login(user: User) -> User

get_user_groups

get_user_groups(user_name: str) -> list[GroupName]

list_groups

list_groups() -> list[Group]

get_group

get_group(name: GroupName) -> Group

create_group

create_group(group: Group) -> None

delete_group

delete_group(name: GroupName, certain: bool = False) -> None

add_user_to_group

add_user_to_group(user_name: str, group_name: GroupName) -> None

remove_user_from_group

remove_user_from_group(user_name: str, group_name: GroupName) -> None

check_permission

check_permission(user_name: str, permission: PermissionType, resource: Literal[SYSTEM] | CollectionName) -> bool

get_permissions

get_permissions(resource: Literal[SYSTEM] | CollectionName) -> list[PermissionGrant]

grant_permission

grant_permission(group_name: GroupName, permission: PermissionType, resource: Literal[SYSTEM] | CollectionName) -> None

revoke_permission

revoke_permission(group_name: GroupName, permission: PermissionType, resource: Literal[SYSTEM] | CollectionName) -> None

insert_judgement

insert_judgement(judgement: Judgement) -> None

get_judgements

get_judgements(tag: str | None = None) -> tuple[Table, Table]

sample_for_eval

sample_for_eval(n: int, path: ModelResolutionPath, user_name: str) -> Table

Sample some clusters from a resolution.

MatchboxPostgresSettings

Bases: MatchboxServerSettings

Settings for the Matchbox PostgreSQL backend.

Inherits the core settings and adds the PostgreSQL-specific settings.

Methods:

  • check_settings

    Check that legal combinations of settings are provided.

Attributes:

model_config class-attribute instance-attribute

model_config = SettingsConfigDict(env_prefix='MB__SERVER__', env_nested_delimiter='__', use_enum_values=True, env_file='.env', env_file_encoding='utf-8', extra='ignore')
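
The `env_prefix` and `env_nested_delimiter` values above control how environment variable names map onto settings fields. The sketch below mimics that naming convention using only the standard library. It is not pydantic-settings itself; the helper `nest_env`, the lower-casing of names, and the example values are assumptions made for illustration.

```python
def nest_env(env: dict[str, str], prefix: str, delimiter: str) -> dict:
    """Group prefixed environment variables into a nested dict of settings."""
    nested: dict = {}
    for name, value in env.items():
        if not name.upper().startswith(prefix):
            continue  # variables without the prefix are not settings
        # Drop the prefix, then split the remainder into a path of field names.
        *parents, leaf = name[len(prefix):].lower().split(delimiter)
        target = nested
        for part in parents:
            target = target.setdefault(part, {})
        target[leaf] = value
    return nested


# Hypothetical environment; the values are placeholders.
env = {
    "MB__SERVER__BATCH_SIZE": "250000",
    "MB__SERVER__POSTGRES__HOST": "localhost",
    "MB__SERVER__POSTGRES__PORT": "5432",
    "UNRELATED": "ignored",
}
settings = nest_env(env, prefix="MB__SERVER__", delimiter="__")
```

Under these assumptions, `MB__SERVER__POSTGRES__HOST` ends up at `settings["postgres"]["host"]`, while the single underscore in `BATCH_SIZE` is kept as part of the field name.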

batch_size class-attribute instance-attribute

batch_size: int = Field(default=250000)

datastore instance-attribute

task_runner instance-attribute

task_runner: Literal['api', 'celery']

redis_uri instance-attribute

redis_uri: str | None

uploads_expiry_minutes instance-attribute

uploads_expiry_minutes: int | None

authorisation class-attribute instance-attribute

authorisation: bool = True

public_key class-attribute instance-attribute

public_key: SecretStr | None = Field(default=None)

log_level class-attribute instance-attribute

log_level: LogLevelType = 'INFO'

backend_type class-attribute instance-attribute

backend_type: MatchboxBackends = POSTGRES

postgres class-attribute instance-attribute

check_settings

check_settings() -> Self

Check that legal combinations of settings are provided.

adapter

Composed PostgreSQL adapter for Matchbox server.

Modules:

  • admin

    Admin PostgreSQL mixin for Matchbox server.

  • collections

    Collections PostgreSQL mixin for Matchbox server.

  • eval

    Evaluation PostgreSQL mixin for Matchbox server.

  • main

    Composed PostgreSQL adapter for Matchbox server.

  • query

    Query PostgreSQL mixin for Matchbox server.

Classes:

__all__ module-attribute

__all__ = ('MatchboxPostgres', 'MatchboxPostgresSettings')

MatchboxPostgres

MatchboxPostgres(settings: MatchboxPostgresSettings)

Bases: MatchboxPostgresQueryMixin, MatchboxPostgresEvaluationMixin, MatchboxPostgresCollectionsMixin, MatchboxPostgresAdminMixin, MatchboxDBAdapter

A PostgreSQL adapter for Matchbox.

Methods:

Attributes:

settings instance-attribute
settings = settings
sources instance-attribute
sources = SourceConfigs
models instance-attribute
models = FilteredResolutions(sources=False, models=True)
source_clusters instance-attribute
source_clusters = FilteredClusters(has_source=True)
model_clusters instance-attribute
model_clusters = FilteredClusters(has_source=False)
all_clusters instance-attribute
all_clusters = FilteredClusters()
creates instance-attribute
creates = FilteredProbabilities(over_truth=True)
merges instance-attribute
merges = Contains
proposes instance-attribute
proposes = FilteredProbabilities()
source_resolutions instance-attribute
source_resolutions = FilteredResolutions(sources=True, models=False)
query
query(source: SourceResolutionPath, point_of_truth: ResolutionPath | None = None, threshold: int | None = None, return_leaf_id: bool = False, limit: int | None = None) -> Table
match
match(key: str, source: SourceResolutionPath, targets: list[SourceResolutionPath], point_of_truth: ResolutionPath, threshold: int | None = None) -> list[Match]
create_collection
create_collection(name: CollectionName) -> Collection
get_collection
get_collection(name: CollectionName) -> Collection
list_collections
list_collections() -> list[CollectionName]
delete_collection
delete_collection(name: CollectionName, certain: bool) -> None
create_run
create_run(collection: CollectionName) -> Run
set_run_mutable
set_run_mutable(collection: CollectionName, run_id: RunID, mutable: bool) -> Run
set_run_default
set_run_default(collection: CollectionName, run_id: RunID, default: bool) -> Run
get_run
get_run(collection: CollectionName, run_id: RunID) -> Run
delete_run
delete_run(collection: CollectionName, run_id: RunID, certain: bool) -> None
create_resolution
create_resolution(resolution: Resolution, path: ResolutionPath) -> None
get_resolution
get_resolution(path: ResolutionPath) -> Resolution
update_resolution
update_resolution(resolution: Resolution, path: ResolutionPath) -> None
delete_resolution
delete_resolution(path: ResolutionPath, certain: bool) -> None
lock_resolution_data
lock_resolution_data(path: ResolutionPath) -> None
unlock_resolution_data
unlock_resolution_data(path: ResolutionPath, complete: bool = False) -> None
get_resolution_stage
get_resolution_stage(path: ResolutionPath) -> UploadStage
insert_source_data
insert_source_data(path: SourceResolutionPath, data_hashes: Table) -> None
insert_model_data
insert_model_data(path: ModelResolutionPath, results: Table) -> None
get_model_data
get_model_data(path: ModelResolutionPath) -> Table
validate_ids
validate_ids(ids: list[int]) -> bool
dump
dump() -> MatchboxSnapshot
drop
drop(certain: bool) -> None
clear
clear(certain: bool) -> None
restore
restore(snapshot: MatchboxSnapshot) -> None
delete_orphans
delete_orphans() -> int
login
login(user: User) -> User
get_user_groups
get_user_groups(user_name: str) -> list[GroupName]
list_groups
list_groups() -> list[Group]
get_group
get_group(name: GroupName) -> Group
create_group
create_group(group: Group) -> None
delete_group
delete_group(name: GroupName, certain: bool = False) -> None
add_user_to_group
add_user_to_group(user_name: str, group_name: GroupName) -> None
remove_user_from_group
remove_user_from_group(user_name: str, group_name: GroupName) -> None
check_permission
check_permission(user_name: str, permission: PermissionType, resource: Literal[SYSTEM] | CollectionName) -> bool
get_permissions
get_permissions(resource: Literal[SYSTEM] | CollectionName) -> list[PermissionGrant]
grant_permission
grant_permission(group_name: GroupName, permission: PermissionType, resource: Literal[SYSTEM] | CollectionName) -> None
revoke_permission
revoke_permission(group_name: GroupName, permission: PermissionType, resource: Literal[SYSTEM] | CollectionName) -> None
insert_judgement
insert_judgement(judgement: Judgement) -> None
get_judgements
get_judgements(tag: str | None = None) -> tuple[Table, Table]
sample_for_eval
sample_for_eval(n: int, path: ModelResolutionPath, user_name: str) -> Table

Sample some clusters from a resolution.

MatchboxPostgresSettings

Bases: MatchboxServerSettings

Settings for the Matchbox PostgreSQL backend.

Inherits the core settings and adds the PostgreSQL-specific settings.

Methods:

  • check_settings

    Check that legal combinations of settings are provided.

Attributes:

model_config class-attribute instance-attribute
model_config = SettingsConfigDict(env_prefix='MB__SERVER__', env_nested_delimiter='__', use_enum_values=True, env_file='.env', env_file_encoding='utf-8', extra='ignore')
batch_size class-attribute instance-attribute
batch_size: int = Field(default=250000)
datastore instance-attribute
task_runner instance-attribute
task_runner: Literal['api', 'celery']
redis_uri instance-attribute
redis_uri: str | None
uploads_expiry_minutes instance-attribute
uploads_expiry_minutes: int | None
authorisation class-attribute instance-attribute
authorisation: bool = True
public_key class-attribute instance-attribute
public_key: SecretStr | None = Field(default=None)
log_level class-attribute instance-attribute
log_level: LogLevelType = 'INFO'
backend_type class-attribute instance-attribute
backend_type: MatchboxBackends = POSTGRES
postgres class-attribute instance-attribute
check_settings
check_settings() -> Self

Check that legal combinations of settings are provided.

admin

Admin PostgreSQL mixin for Matchbox server.

Classes:

MatchboxPostgresAdminMixin

Admin mixin for the PostgreSQL adapter for Matchbox.

Methods:

login
login(user: User) -> User
get_user_groups
get_user_groups(user_name: str) -> list[GroupName]
list_groups
list_groups() -> list[Group]
get_group
get_group(name: GroupName) -> Group
create_group
create_group(group: Group) -> None
delete_group
delete_group(name: GroupName, certain: bool = False) -> None
add_user_to_group
add_user_to_group(user_name: str, group_name: GroupName) -> None
remove_user_from_group
remove_user_from_group(user_name: str, group_name: GroupName) -> None
check_permission
check_permission(user_name: str, permission: PermissionType, resource: Literal[SYSTEM] | CollectionName) -> bool
get_permissions
get_permissions(resource: Literal[SYSTEM] | CollectionName) -> list[PermissionGrant]
grant_permission
grant_permission(group_name: GroupName, permission: PermissionType, resource: Literal[SYSTEM] | CollectionName) -> None
revoke_permission
revoke_permission(group_name: GroupName, permission: PermissionType, resource: Literal[SYSTEM] | CollectionName) -> None
validate_ids
validate_ids(ids: list[int]) -> bool
dump
dump() -> MatchboxSnapshot
drop
drop(certain: bool) -> None
clear
clear(certain: bool) -> None
restore
restore(snapshot: MatchboxSnapshot) -> None
delete_orphans
delete_orphans() -> int

collections

Collections PostgreSQL mixin for Matchbox server.

Classes:

MatchboxPostgresCollectionsMixin

Collections mixin for the PostgreSQL adapter for Matchbox.

Methods:

create_collection
create_collection(name: CollectionName) -> Collection
get_collection
get_collection(name: CollectionName) -> Collection
list_collections
list_collections() -> list[CollectionName]
delete_collection
delete_collection(name: CollectionName, certain: bool) -> None
create_run
create_run(collection: CollectionName) -> Run
set_run_mutable
set_run_mutable(collection: CollectionName, run_id: RunID, mutable: bool) -> Run
set_run_default
set_run_default(collection: CollectionName, run_id: RunID, default: bool) -> Run
get_run
get_run(collection: CollectionName, run_id: RunID) -> Run
delete_run
delete_run(collection: CollectionName, run_id: RunID, certain: bool) -> None
create_resolution
create_resolution(resolution: Resolution, path: ResolutionPath) -> None
get_resolution
get_resolution(path: ResolutionPath) -> Resolution
update_resolution
update_resolution(resolution: Resolution, path: ResolutionPath) -> None
delete_resolution
delete_resolution(path: ResolutionPath, certain: bool) -> None
lock_resolution_data
lock_resolution_data(path: ResolutionPath) -> None
unlock_resolution_data
unlock_resolution_data(path: ResolutionPath, complete: bool = False) -> None
get_resolution_stage
get_resolution_stage(path: ResolutionPath) -> UploadStage
insert_source_data
insert_source_data(path: SourceResolutionPath, data_hashes: Table) -> None
insert_model_data
insert_model_data(path: ModelResolutionPath, results: Table) -> None
get_model_data
get_model_data(path: ModelResolutionPath) -> Table

eval

Evaluation PostgreSQL mixin for Matchbox server.

Classes:

MatchboxPostgresEvaluationMixin

Evaluation mixin for the PostgreSQL adapter for Matchbox.

Methods:

insert_judgement
insert_judgement(judgement: Judgement) -> None
get_judgements
get_judgements(tag: str | None = None) -> tuple[Table, Table]
sample_for_eval
sample_for_eval(n: int, path: ModelResolutionPath, user_name: str) -> Table

Sample some clusters from a resolution.

main

Composed PostgreSQL adapter for Matchbox server.

Classes:

FilteredClusters

Bases: BaseModel

Wrapper class for filtered cluster queries.

Methods:

  • count

    Counts the number of clusters in the database.

Attributes:

has_source class-attribute instance-attribute
has_source: bool | None = None
count
count() -> int

Counts the number of clusters in the database.

FilteredProbabilities

Bases: BaseModel

Wrapper class for filtered probability queries.

Methods:

  • count

    Counts the number of probabilities in the database.

Attributes:

over_truth class-attribute instance-attribute
over_truth: bool = False
count
count() -> int

Counts the number of probabilities in the database.

FilteredResolutions

Bases: BaseModel

Wrapper class for filtered resolution queries.

Methods:

  • count

    Counts the number of resolutions in the database.

Attributes:

sources class-attribute instance-attribute
sources: bool = False
models class-attribute instance-attribute
models: bool = False
count
count() -> int

Counts the number of resolutions in the database.

MatchboxPostgres
MatchboxPostgres(settings: MatchboxPostgresSettings)

Bases: MatchboxPostgresQueryMixin, MatchboxPostgresEvaluationMixin, MatchboxPostgresCollectionsMixin, MatchboxPostgresAdminMixin, MatchboxDBAdapter

A PostgreSQL adapter for Matchbox.

Methods:

Attributes:

settings instance-attribute
settings = settings
sources instance-attribute
sources = SourceConfigs
models instance-attribute
models = FilteredResolutions(sources=False, models=True)
source_clusters instance-attribute
source_clusters = FilteredClusters(has_source=True)
model_clusters instance-attribute
model_clusters = FilteredClusters(has_source=False)
all_clusters instance-attribute
all_clusters = FilteredClusters()
creates instance-attribute
creates = FilteredProbabilities(over_truth=True)
merges instance-attribute
merges = Contains
proposes instance-attribute
proposes = FilteredProbabilities()
source_resolutions instance-attribute
source_resolutions = FilteredResolutions(sources=True, models=False)
query
query(source: SourceResolutionPath, point_of_truth: ResolutionPath | None = None, threshold: int | None = None, return_leaf_id: bool = False, limit: int | None = None) -> Table
match
match(key: str, source: SourceResolutionPath, targets: list[SourceResolutionPath], point_of_truth: ResolutionPath, threshold: int | None = None) -> list[Match]
create_collection
create_collection(name: CollectionName) -> Collection
get_collection
get_collection(name: CollectionName) -> Collection
list_collections
list_collections() -> list[CollectionName]
delete_collection
delete_collection(name: CollectionName, certain: bool) -> None
create_run
create_run(collection: CollectionName) -> Run
set_run_mutable
set_run_mutable(collection: CollectionName, run_id: RunID, mutable: bool) -> Run
set_run_default
set_run_default(collection: CollectionName, run_id: RunID, default: bool) -> Run
get_run
get_run(collection: CollectionName, run_id: RunID) -> Run
delete_run
delete_run(collection: CollectionName, run_id: RunID, certain: bool) -> None
create_resolution
create_resolution(resolution: Resolution, path: ResolutionPath) -> None
get_resolution
get_resolution(path: ResolutionPath) -> Resolution
update_resolution
update_resolution(resolution: Resolution, path: ResolutionPath) -> None
delete_resolution
delete_resolution(path: ResolutionPath, certain: bool) -> None
lock_resolution_data
lock_resolution_data(path: ResolutionPath) -> None
unlock_resolution_data
unlock_resolution_data(path: ResolutionPath, complete: bool = False) -> None
get_resolution_stage
get_resolution_stage(path: ResolutionPath) -> UploadStage
insert_source_data
insert_source_data(path: SourceResolutionPath, data_hashes: Table) -> None
insert_model_data
insert_model_data(path: ModelResolutionPath, results: Table) -> None
get_model_data
get_model_data(path: ModelResolutionPath) -> Table
validate_ids
validate_ids(ids: list[int]) -> bool
dump
dump() -> MatchboxSnapshot
drop
drop(certain: bool) -> None
clear
clear(certain: bool) -> None
restore
restore(snapshot: MatchboxSnapshot) -> None
delete_orphans
delete_orphans() -> int
login
login(user: User) -> User
get_user_groups
get_user_groups(user_name: str) -> list[GroupName]
list_groups
list_groups() -> list[Group]
get_group
get_group(name: GroupName) -> Group
create_group
create_group(group: Group) -> None
delete_group
delete_group(name: GroupName, certain: bool = False) -> None
add_user_to_group
add_user_to_group(user_name: str, group_name: GroupName) -> None
remove_user_from_group
remove_user_from_group(user_name: str, group_name: GroupName) -> None
check_permission
check_permission(user_name: str, permission: PermissionType, resource: Literal[SYSTEM] | CollectionName) -> bool
get_permissions
get_permissions(resource: Literal[SYSTEM] | CollectionName) -> list[PermissionGrant]
grant_permission
grant_permission(group_name: GroupName, permission: PermissionType, resource: Literal[SYSTEM] | CollectionName) -> None
revoke_permission
revoke_permission(group_name: GroupName, permission: PermissionType, resource: Literal[SYSTEM] | CollectionName) -> None
insert_judgement
insert_judgement(judgement: Judgement) -> None
get_judgements
get_judgements(tag: str | None = None) -> tuple[Table, Table]
sample_for_eval
sample_for_eval(n: int, path: ModelResolutionPath, user_name: str) -> Table

Sample some clusters from a resolution.

query

Query PostgreSQL mixin for Matchbox server.

Classes:

MatchboxPostgresQueryMixin

Query mixin for the PostgreSQL adapter for Matchbox.

Methods:

query
query(source: SourceResolutionPath, point_of_truth: ResolutionPath | None = None, threshold: int | None = None, return_leaf_id: bool = False, limit: int | None = None) -> Table
match
match(key: str, source: SourceResolutionPath, targets: list[SourceResolutionPath], point_of_truth: ResolutionPath, threshold: int | None = None) -> list[Match]

db

Matchbox PostgreSQL database connection.

Classes:

Attributes:

MBDB module-attribute

MatchboxPostgresCoreSettings

Bases: BaseModel

PostgreSQL-specific settings for Matchbox.

Methods:

Attributes:

host instance-attribute
host: str
port instance-attribute
port: int
user instance-attribute
user: str
password instance-attribute
password: str
database instance-attribute
database: str
db_schema instance-attribute
db_schema: str
alembic_config class-attribute instance-attribute
alembic_config: Path = Field(default=Path('src/matchbox/server/postgresql/alembic.ini'))
get_alembic_config
get_alembic_config() -> Config

Get the Alembic config.

MatchboxPostgresSettings

Bases: MatchboxServerSettings

Settings for the Matchbox PostgreSQL backend.

Inherits the core settings and adds the PostgreSQL-specific settings.

Methods:

  • check_settings

    Check that legal combinations of settings are provided.

Attributes:

backend_type class-attribute instance-attribute
backend_type: MatchboxBackends = POSTGRES
postgres class-attribute instance-attribute
model_config class-attribute instance-attribute
model_config = SettingsConfigDict(env_prefix='MB__SERVER__', env_nested_delimiter='__', use_enum_values=True, env_file='.env', env_file_encoding='utf-8', extra='ignore')
batch_size class-attribute instance-attribute
batch_size: int = Field(default=250000)
datastore instance-attribute
task_runner instance-attribute
task_runner: Literal['api', 'celery']
redis_uri instance-attribute
redis_uri: str | None
uploads_expiry_minutes instance-attribute
uploads_expiry_minutes: int | None
authorisation class-attribute instance-attribute
authorisation: bool = True
public_key class-attribute instance-attribute
public_key: SecretStr | None = Field(default=None)
log_level class-attribute instance-attribute
log_level: LogLevelType = 'INFO'
check_settings
check_settings() -> Self

Check that legal combinations of settings are provided.

MatchboxDatabase

MatchboxDatabase(settings: MatchboxPostgresSettings)

Matchbox PostgreSQL database connection.

Methods:

Attributes:

settings instance-attribute
settings = settings
MatchboxBase instance-attribute
MatchboxBase = declarative_base(metadata=MetaData(schema=db_schema))
alembic_config instance-attribute
alembic_config = get_alembic_config()
sorted_tables property
sorted_tables: list[Table]

Return a list of SQLAlchemy tables in order of creation.

connection_string
connection_string(driver: bool = True) -> str

Get the connection string for PostgreSQL.
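
The page does not show how the string is assembled, so the following is only a sketch of one plausible shape for a `driver` flag. The function `build_connection_string`, the `postgresql+psycopg` scheme, and the credentials are all assumptions, not the Matchbox implementation.

```python
from urllib.parse import quote_plus


def build_connection_string(
    user: str,
    password: str,
    host: str,
    port: int,
    database: str,
    driver: bool = True,
) -> str:
    """Build a PostgreSQL URL, optionally naming a SQLAlchemy driver."""
    # Assumption: the flag toggles a SQLAlchemy-style "+driver" suffix.
    scheme = "postgresql+psycopg" if driver else "postgresql"
    # Percent-encode credentials so characters such as "@" do not break the URL.
    credentials = f"{quote_plus(user)}:{quote_plus(password)}"
    return f"{scheme}://{credentials}@{host}:{port}/{database}"


# Placeholder credentials for illustration only.
plain = build_connection_string(
    "mb", "p@ss", "localhost", 5432, "matchbox", driver=False
)
with_driver = build_connection_string("mb", "p@ss", "localhost", 5432, "matchbox")
```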

get_engine
get_engine() -> Engine

Get the database engine.

get_session
get_session() -> Session

Get a new session.

get_adbc_connection
get_adbc_connection() -> Generator[PoolProxiedConnection, Any, Any]

Get a new ADBC connection wrapped by a SQLAlchemy pool proxy.

The connection must be used within a context manager.
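
The `Generator` return type suggests a generator-based context manager. The sketch below shows that general pattern with a stand-in `FakeConnection` class and a hypothetical `get_connection` function; it does not use ADBC or SQLAlchemy and is not the Matchbox code.

```python
from collections.abc import Generator
from contextlib import contextmanager


class FakeConnection:
    """Stand-in for a pooled database connection (illustrative only)."""

    def __init__(self) -> None:
        self.closed = False

    def close(self) -> None:
        self.closed = True


@contextmanager
def get_connection() -> Generator[FakeConnection, None, None]:
    """Yield a connection and always release it when the block exits."""
    conn = FakeConnection()
    try:
        yield conn
    finally:
        # Runs on normal exit and on exceptions, so the connection is not leaked.
        conn.close()


with get_connection() as conn:
    was_open_inside = not conn.closed
```

Using the connection inside a `with` block is what guarantees the cleanup step runs.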

run_migrations
run_migrations() -> None

Create the database and all tables expected in the schema.

clear_database
clear_database() -> None

Delete all rows in every table in the database schema.

  • TRUNCATE tables that are part of the core ORM (preserves structure)
  • DROP tables that are not in the ORM (removes temporary/test tables)
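
The two rules above amount to partitioning the tables found in the schema by whether the ORM knows about them. A minimal sketch of that decision follows; the helper `plan_clear` and the table names are hypothetical, and the real method may order or execute the statements differently.

```python
def plan_clear(existing: set[str], orm_tables: set[str]) -> list[tuple[str, str]]:
    """Decide, per table, whether to empty it or remove it entirely."""
    plan = [("TRUNCATE", name) for name in sorted(existing & orm_tables)]
    plan += [("DROP", name) for name in sorted(existing - orm_tables)]
    return plan


# Hypothetical schema contents: two ORM tables plus a leftover temporary table.
existing = {"clusters", "contains", "tmp_upload"}
orm_tables = {"clusters", "contains", "resolutions"}
plan = plan_clear(existing, orm_tables)
```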
drop_database
drop_database() -> None

Drop all tables in the database schema and recreate them.

vacuum_analyze
vacuum_analyze(*table_names: str) -> None

Run VACUUM ANALYZE on specified tables.

VACUUM ANALYZE reclaims storage and updates statistics for the query planner. PostgreSQL may not fully utilise indexes until VACUUM ANALYZE is run. According to https://www.postgresql.org/docs/current/sql-vacuum.html, VACUUM ANALYZE is recommended over just ANALYZE for optimal performance.

Parameters:

  • *table_names
    (str, default: () ) –

    Fully qualified table names to vacuum. If none provided, vacuums the entire database.
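
As a sketch of how the variadic argument could translate into SQL, the helper below builds one statement per table, or a single database-wide statement when no names are given. The function `vacuum_statements` and the schema name `mb` are assumptions for illustration.

```python
def vacuum_statements(*table_names: str) -> list[str]:
    """Return the VACUUM ANALYZE statements for the given tables."""
    if not table_names:
        # No tables named: a bare VACUUM ANALYZE covers the whole database.
        return ["VACUUM ANALYZE"]
    return [f"VACUUM ANALYZE {name}" for name in table_names]


whole_database = vacuum_statements()
two_tables = vacuum_statements("mb.clusters", "mb.contains")
```

Note that PostgreSQL does not allow `VACUUM` inside a transaction block, so with SQLAlchemy such statements would typically be run on a connection configured with `execution_options(isolation_level="AUTOCOMMIT")`.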

mixin

A module for defining mixins for the PostgreSQL backend ORM.

Classes:

  • CountMixin

    A mixin for counting the number of rows in a table.

Attributes:

  • T

T module-attribute

T = TypeVar('T')

CountMixin

A mixin for counting the number of rows in a table.

Methods:

  • count

    Counts the number of rows in the table.

count classmethod
count() -> int

Counts the number of rows in the table.
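
To show the shape of such a mixin, here is a minimal sketch that uses the standard library's `sqlite3` in place of SQLAlchemy and PostgreSQL. The module-level `conn` and the simplified table are assumptions; the real mixin works against the ORM session.

```python
import sqlite3

# In-memory database standing in for the real PostgreSQL connection.
conn = sqlite3.connect(":memory:")


class CountMixin:
    """Adds a row-counting classmethod to any class that names its table."""

    __tablename__: str

    @classmethod
    def count(cls) -> int:
        query = f"SELECT COUNT(*) FROM {cls.__tablename__}"
        return conn.execute(query).fetchone()[0]


class Collections(CountMixin):
    __tablename__ = "collections"


conn.execute(
    "CREATE TABLE collections ("
    "collection_id INTEGER PRIMARY KEY, name TEXT NOT NULL UNIQUE)"
)
conn.executemany(
    "INSERT INTO collections (name) VALUES (?)", [("alpha",), ("beta",)]
)
total = Collections.count()
```

Because `count` is a classmethod, it is called on the table class itself rather than on a row instance.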

orm

ORM classes for the Matchbox PostgreSQL database.

Classes:

  • Collections

    Named collections of resolutions and runs.

  • Runs

    Runs of collections of resolutions.

  • ResolutionFrom

    Resolution lineage closure table with cached truth values.

  • Resolutions

    Table of resolution points corresponding to models and sources.

  • PKSpace

    Table used to reserve ranges of primary keys.

  • SourceFields

    Table for storing column details for SourceConfigs.

  • ClusterSourceKey

    Table for storing source primary keys for clusters.

  • SourceConfigs

    Table of source configs for Matchbox.

  • ModelConfigs

    Table of model configs for Matchbox.

  • Contains

    Cluster lineage table.

  • Clusters

    Table of indexed data and clusters that match it.

  • UserGroups

    Association table for user-group membership.

  • Users

    Table of user identities.

  • Groups

    Groups for permission management.

  • Permissions

    Permissions granted to groups on resources.

  • EvalJudgements

    Table of evaluation judgements produced by human validators.

  • Probabilities

    Table of probabilities that a cluster is correct, according to a resolution.

  • Results

    Table of results for a resolution.

Collections

Bases: CountMixin, MatchboxBase

Named collections of resolutions and runs.

Methods:

  • from_name

    Resolve a collection name to a Collections object.

  • to_dto

    Convert ORM collection to a matchbox.common Collection object.

  • count

    Counts the number of rows in the table.

Attributes:

__tablename__ class-attribute instance-attribute
__tablename__ = 'collections'
collection_id class-attribute instance-attribute
collection_id: Mapped[int] = mapped_column(BIGINT, primary_key=True, autoincrement=True)
name class-attribute instance-attribute
name: Mapped[str] = mapped_column(TEXT, nullable=False)
runs class-attribute instance-attribute
runs: Mapped[list[Runs]] = relationship(back_populates='collection')
permissions class-attribute instance-attribute
permissions: Mapped[list[Permissions]] = relationship(back_populates='collection', passive_deletes=True)
__table_args__ class-attribute instance-attribute
__table_args__ = (UniqueConstraint('name', name='collections_name_key'),)
from_name classmethod
from_name(name: CollectionName, session: Session | None = None) -> Collections

Resolve a collection name to a Collections object.

Parameters:

  • name
    (CollectionName) –

    The name of the collection to resolve.

  • session
    (Session | None, default: None ) –

    Optional session to use for the query.

Raises:

to_dto
to_dto() -> Collection

Convert ORM collection to a matchbox.common Collection object.

count classmethod
count() -> int

Counts the number of rows in the table.

Runs

Bases: CountMixin, MatchboxBase

Runs of collections of resolutions.

Methods:

  • from_id

    Resolve a collection name and run ID to a Runs object.

  • to_dto

    Convert ORM run to a matchbox.common Run object.

  • count

    Counts the number of rows in the table.

Attributes:

__tablename__ class-attribute instance-attribute
__tablename__ = 'runs'
run_id class-attribute instance-attribute
run_id: Mapped[int] = mapped_column(BIGINT, primary_key=True, autoincrement=True)
collection_id class-attribute instance-attribute
collection_id: Mapped[int] = mapped_column(BIGINT, ForeignKey('collections.collection_id', ondelete='CASCADE'), nullable=False)
is_mutable class-attribute instance-attribute
is_mutable: Mapped[bool] = mapped_column(BOOLEAN, default=False, nullable=True)
is_default class-attribute instance-attribute
is_default: Mapped[bool] = mapped_column(BOOLEAN, default=False, nullable=True)
collection class-attribute instance-attribute
collection: Mapped[Collections] = relationship(back_populates='runs')
resolutions class-attribute instance-attribute
resolutions: Mapped[list[Resolutions]] = relationship(back_populates='run')
__table_args__ class-attribute instance-attribute
__table_args__ = (UniqueConstraint('collection_id', 'run_id', name='unique_run_id'), Index('ix_default_run_collection', 'collection_id', unique=True, postgresql_where=text('is_default = true')))
from_id classmethod
from_id(collection: CollectionName, run_id: RunID, session: Session | None = None) -> Runs

Resolve a collection name and run ID to a Runs object.

Parameters:

  • collection
    (CollectionName) –

    The name of the collection containing the run.

  • run_id
    (RunID) –

    The ID of the run within that collection.

  • session
    (Session | None, default: None ) –

    Optional session to use for the query.

Raises:

to_dto
to_dto() -> Run

Convert ORM run to a matchbox.common Run object.

count classmethod
count() -> int

Counts the number of rows in the table.
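
The `ix_default_run_collection` entry in `__table_args__` above is a partial unique index: uniqueness on `collection_id` applies only to rows where `is_default = true`, so a collection can hold many runs but at most one default. The following is a minimal sketch of that behaviour, assuming the stdlib `sqlite3` module (which also supports partial indexes) as a stand-in for PostgreSQL and a simplified `runs` table; it is not the backend's own DDL.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Simplified stand-in for the runs table (assumption: only the relevant columns).
conn.execute(
    "CREATE TABLE runs ("
    " run_id INTEGER PRIMARY KEY,"
    " collection_id INTEGER NOT NULL,"
    " is_default BOOLEAN DEFAULT 0)"
)
# Partial unique index: only rows with is_default = 1 take part in uniqueness.
conn.execute(
    "CREATE UNIQUE INDEX ix_default_run_collection"
    " ON runs (collection_id) WHERE is_default = 1"
)

conn.execute("INSERT INTO runs (collection_id, is_default) VALUES (1, 1)")
conn.execute("INSERT INTO runs (collection_id, is_default) VALUES (1, 0)")
conn.execute("INSERT INTO runs (collection_id, is_default) VALUES (1, 0)")

# A second default run for the same collection violates the index.
try:
    conn.execute("INSERT INTO runs (collection_id, is_default) VALUES (1, 1)")
    second_default_allowed = True
except sqlite3.IntegrityError:
    second_default_allowed = False

run_count = conn.execute("SELECT COUNT(*) FROM runs").fetchone()[0]
```

Non-default runs are unaffected by the index, which is why the two `is_default = 0` rows for the same collection are accepted.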

ResolutionFrom

Bases: CountMixin, MatchboxBase

Resolution lineage closure table with cached truth values.

Methods:

  • count

    Counts the number of rows in the table.

Attributes:

__tablename__ class-attribute instance-attribute
__tablename__ = 'resolution_from'
parent class-attribute instance-attribute
parent: Mapped[int] = mapped_column(BIGINT, ForeignKey('resolutions.resolution_id', ondelete='CASCADE'), primary_key=True)
child class-attribute instance-attribute
child: Mapped[int] = mapped_column(BIGINT, ForeignKey('resolutions.resolution_id', ondelete='CASCADE'), primary_key=True)
level class-attribute instance-attribute
level: Mapped[int] = mapped_column(INTEGER, nullable=False)
truth_cache class-attribute instance-attribute
truth_cache: Mapped[int | None] = mapped_column(SMALLINT, nullable=True)
__table_args__ class-attribute instance-attribute
__table_args__ = (CheckConstraint('parent != child', name='no_self_reference'), CheckConstraint('level > 0', name='positive_level'))
count classmethod
count() -> int

Counts the number of rows in the table.
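
To illustrate how a closure table with a `level` column answers lineage questions without recursion, here is a minimal sketch using the stdlib `sqlite3` module. The helper `add_edge` and the resolution IDs 1, 2 and 3 are hypothetical, and the sketch only handles appending a new child to a simple chain; it is not how the backend populates `resolution_from`.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Mirrors the documented columns and check constraints, without foreign keys.
conn.execute(
    "CREATE TABLE resolution_from ("
    " parent INTEGER NOT NULL,"
    " child INTEGER NOT NULL,"
    " level INTEGER NOT NULL,"
    " PRIMARY KEY (parent, child),"
    " CHECK (parent != child),"
    " CHECK (level > 0))"
)


def add_edge(conn, parent, child):
    """Link parent -> child and copy the parent's ancestors down to the child."""
    ancestors = conn.execute(
        "SELECT parent, level FROM resolution_from WHERE child = ?", (parent,)
    ).fetchall()
    # Direct link at level 1; each inherited ancestor sits one level further away.
    rows = [(parent, child, 1)] + [(a, child, lvl + 1) for a, lvl in ancestors]
    conn.executemany("INSERT INTO resolution_from VALUES (?, ?, ?)", rows)


add_edge(conn, 1, 2)
add_edge(conn, 2, 3)

# All ancestors of resolution 3 in one non-recursive query, nearest first.
ancestors_of_3 = conn.execute(
    "SELECT parent, level FROM resolution_from WHERE child = 3 ORDER BY level"
).fetchall()
```

The redundancy is visible in the row `(1, 3, 2)`: the indirect relationship is stored explicitly, which is what makes root-to-leaf lookups a single indexed read.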

Resolutions

Bases: CountMixin, MatchboxBase

Table of resolution points corresponding to models and sources.

Resolutions produce probabilities or own data in the clusters table.

Methods:

  • get_lineage

    Returns lineage ordered by priority.

  • from_path

    Resolves a resolution path to a Resolutions object.

  • from_dto

    Create a Resolutions instance from a Resolution DTO object.

  • to_dto

    Convert ORM resolution to a matchbox.common Resolution object.

  • count

    Counts the number of rows in the table.

Attributes:

__tablename__ class-attribute instance-attribute
__tablename__ = 'resolutions'
resolution_id class-attribute instance-attribute
resolution_id: Mapped[int] = mapped_column(BIGINT, primary_key=True, autoincrement=True)
run_id class-attribute instance-attribute
run_id: Mapped[int] = mapped_column(BIGINT, ForeignKey('runs.run_id', ondelete='CASCADE'), nullable=False)
upload_stage class-attribute instance-attribute
upload_stage: Mapped[UploadStage] = mapped_column(Enum(UploadStage, native_enum=True, name='upload_stages', schema='mb'), nullable=False, default=READY)
name class-attribute instance-attribute
name: Mapped[str] = mapped_column(TEXT, nullable=False)
description class-attribute instance-attribute
description: Mapped[str | None] = mapped_column(TEXT, nullable=True)
type class-attribute instance-attribute
type: Mapped[str] = mapped_column(TEXT, nullable=False)
fingerprint class-attribute instance-attribute
fingerprint: Mapped[bytes] = mapped_column(BYTEA, nullable=False)
truth class-attribute instance-attribute
truth: Mapped[int | None] = mapped_column(SMALLINT, nullable=True)
source_config class-attribute instance-attribute
source_config: Mapped[Optional[SourceConfigs]] = relationship(back_populates='source_resolution', uselist=False)
model_config class-attribute instance-attribute
model_config: Mapped[Optional[ModelConfigs]] = relationship(back_populates='model_resolution', uselist=False)
probabilities class-attribute instance-attribute
probabilities: Mapped[list[Probabilities]] = relationship(back_populates='proposed_by', passive_deletes=True)
results class-attribute instance-attribute
results: Mapped[list[Results]] = relationship(back_populates='proposed_by', passive_deletes=True)
children class-attribute instance-attribute
children: Mapped[list[Resolutions]] = relationship(secondary=__table__, primaryjoin='Resolutions.resolution_id == ResolutionFrom.parent', secondaryjoin='Resolutions.resolution_id == ResolutionFrom.child', backref='parents')
run class-attribute instance-attribute
run: Mapped[Runs] = relationship(back_populates='resolutions')
__table_args__ class-attribute instance-attribute
__table_args__ = (CheckConstraint("type IN ('model', 'source')", name='resolution_type_constraints'), UniqueConstraint('run_id', 'name', name='resolutions_name_key'))
ancestors property
ancestors: set[Resolutions]

Returns all ancestors (parents, grandparents, etc.) of this resolution.

descendants property
descendants: set[Resolutions]

Returns descendants (children, grandchildren, etc.) of this resolution.

get_lineage
get_lineage(sources: list[SourceConfigs] | None = None, threshold: int | None = None) -> list[tuple[int, int, float | None]]

Returns lineage ordered by priority.

Highest priority (lowest level) first, then by resolution_id for stability.

Parameters:

  • sources
    (list[SourceConfigs] | None, default: None ) –

    If provided, only return lineage paths that lead to these sources

  • threshold
    (int | None, default: None ) –

    If provided, override this resolution’s threshold

Returns:

  • list[tuple[int, int, float | None]]

    List of tuples (resolution_id, source_config_id, threshold) ordered by priority.
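
The ordering rule described above (lowest level first, ties broken by `resolution_id`) can be sketched in plain Python. The rows and threshold values below are hypothetical placeholders, and the real method builds its result from the database rather than from an in-memory list.

```python
# Hypothetical lineage rows: (resolution_id, source_config_id, threshold, level).
rows = [
    (7, 2, 80, 2),
    (3, 1, None, 1),
    (5, 2, 90, 1),
    (4, 1, 75, 2),
]

# Sort by level first (lowest level = highest priority), then by resolution_id
# so that rows at the same level always come back in the same order.
ordered = sorted(rows, key=lambda r: (r[3], r[0]))

# Drop the level to get (resolution_id, source_config_id, threshold) tuples.
lineage = [(res_id, src_id, thr) for res_id, src_id, thr, _ in ordered]
```

The secondary sort key matters only for determinism: without it, two resolutions at the same level could swap places between calls.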

from_path classmethod
from_path(path: ResolutionPath, res_type: ResolutionType | None = None, session: Session | None = None, for_update: bool = False) -> Resolutions

Resolves a resolution path to a Resolutions object.

Parameters:

  • path
    (ResolutionPath) –

    The path of the resolution to resolve.

  • res_type
    (ResolutionType | None, default: None ) –

    A resolution type to use as filter.

  • session
    (Session | None, default: None ) –

    A session to get the resolution for updates.

  • for_update
    (bool, default: False ) –

    Locks the row until updated.

Raises:

from_dto classmethod

Create a Resolutions instance from a Resolution DTO object.

The resolution will be added to the session and flushed (but not committed).

For model resolutions, lineage entries will be created automatically.

Parameters:

  • resolution
    (Resolution) –

    The Resolution DTO to convert

  • path
    (ResolutionPath) –

    The full resolution path

  • session
    (Session) –

    Database session (caller must commit)

Returns:

  • Resolutions

    A Resolutions ORM instance with ID and relationships established

to_dto
to_dto() -> Resolution

Convert ORM resolution to a matchbox.common Resolution object.

count classmethod
count() -> int

Counts the number of rows in the table.

PKSpace

Bases: MatchboxBase

Table used to reserve ranges of primary keys.

Methods:

  • initialise

    Create PKSpace tracking row if not exists.

  • reserve_block

    Atomically get next available ID for table, and increment it.

Attributes:

__tablename__ class-attribute instance-attribute
__tablename__ = 'pk_space'
id class-attribute instance-attribute
id: Mapped[int] = mapped_column(BIGINT, primary_key=True)
next_cluster_id class-attribute instance-attribute
next_cluster_id: Mapped[int] = mapped_column(BIGINT, nullable=False)
next_cluster_keys_id class-attribute instance-attribute
next_cluster_keys_id: Mapped[int] = mapped_column(BIGINT, nullable=False)
initialise classmethod
initialise() -> None

Create PKSpace tracking row if not exists.

reserve_block classmethod
reserve_block(table: Literal['clusters', 'cluster_keys'], block_size: int) -> int

Atomically get next available ID for table, and increment it.
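
A block reservation of this kind can be sketched as a counter row that is bumped and read inside one transaction. The version below is an illustration only, assuming the stdlib `sqlite3` module, a single tracking row with `id = 1`, and counters that start at 1; how the real method achieves atomicity in PostgreSQL is not shown in this reference.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE pk_space ("
    " id INTEGER PRIMARY KEY,"
    " next_cluster_id INTEGER NOT NULL,"
    " next_cluster_keys_id INTEGER NOT NULL)"
)
# Assumption: one tracking row, with both counters starting at 1.
conn.execute("INSERT INTO pk_space VALUES (1, 1, 1)")
conn.commit()

# Whitelist mapping from the documented table literals to counter columns.
COLUMNS = {"clusters": "next_cluster_id", "cluster_keys": "next_cluster_keys_id"}


def reserve_block(conn, table, block_size):
    """Return the first ID of a freshly reserved block of block_size IDs."""
    if block_size < 1:
        raise ValueError("block_size must be at least 1")
    column = COLUMNS[table]  # whitelist lookup, so the f-strings below are safe
    with conn:  # commits on success, rolls back on error
        conn.execute(
            f"UPDATE pk_space SET {column} = {column} + ? WHERE id = 1",
            (block_size,),
        )
        next_id = conn.execute(
            f"SELECT {column} FROM pk_space WHERE id = 1"
        ).fetchone()[0]
    # The block is everything between the old counter value and the new one.
    return next_id - block_size


first = reserve_block(conn, "clusters", 100)
second = reserve_block(conn, "clusters", 50)
keys_first = reserve_block(conn, "cluster_keys", 10)
```

Reserving a whole block up front lets a caller assign primary keys client-side for a bulk insert without asking the database for each one.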

SourceFields

Bases: CountMixin, MatchboxBase

Table for storing column details for SourceConfigs.

Methods:

  • count

    Counts the number of rows in the table.

Attributes:

__tablename__ class-attribute instance-attribute
__tablename__ = 'source_fields'
field_id class-attribute instance-attribute
field_id: Mapped[int] = mapped_column(BIGINT, primary_key=True)
source_config_id class-attribute instance-attribute
source_config_id: Mapped[int] = mapped_column(BIGINT, ForeignKey('source_configs.source_config_id', ondelete='CASCADE'), nullable=False)
index class-attribute instance-attribute
index: Mapped[int] = mapped_column(INTEGER, nullable=False)
name class-attribute instance-attribute
name: Mapped[str] = mapped_column(TEXT, nullable=False)
type class-attribute instance-attribute
type: Mapped[str] = mapped_column(TEXT, nullable=False)
is_key class-attribute instance-attribute
is_key: Mapped[bool] = mapped_column(BOOLEAN, nullable=False)
source_config class-attribute instance-attribute
source_config: Mapped[SourceConfigs] = relationship(back_populates='fields', foreign_keys=[source_config_id])
__table_args__ class-attribute instance-attribute
__table_args__ = (UniqueConstraint('source_config_id', 'index', name='unique_index'), Index('ix_source_columns_source_config_id', 'source_config_id'), Index('ix_unique_key_field', 'source_config_id', unique=True, postgresql_where=text('is_key = true')))
count classmethod
count() -> int

Counts the number of rows in the table.

ClusterSourceKey

Bases: CountMixin, MatchboxBase

Table for storing source primary keys for clusters.

Methods:

  • count

    Counts the number of rows in the table.

Attributes:

__tablename__ class-attribute instance-attribute
__tablename__ = 'cluster_keys'
key_id class-attribute instance-attribute
key_id: Mapped[int] = mapped_column(BIGINT, primary_key=True)
cluster_id class-attribute instance-attribute
cluster_id: Mapped[int] = mapped_column(BIGINT, ForeignKey('clusters.cluster_id', ondelete='CASCADE'), nullable=False)
source_config_id class-attribute instance-attribute
source_config_id: Mapped[int] = mapped_column(BIGINT, ForeignKey('source_configs.source_config_id', ondelete='CASCADE'), nullable=False)
key class-attribute instance-attribute
key: Mapped[str] = mapped_column(TEXT, nullable=False)
cluster class-attribute instance-attribute
cluster: Mapped[Clusters] = relationship(back_populates='keys')
source_config class-attribute instance-attribute
source_config: Mapped[SourceConfigs] = relationship(back_populates='cluster_keys')
__table_args__ class-attribute instance-attribute
__table_args__ = (Index('ix_cluster_keys_cluster_id', 'cluster_id'), Index('ix_cluster_keys_keys', 'key'), Index('ix_cluster_keys_source_config_id', 'source_config_id'), UniqueConstraint('key_id', 'source_config_id', name='unique_keys_source'))
count classmethod
count() -> int

Counts the number of rows in the table.

SourceConfigs

SourceConfigs(key_field: SourceFields | None = None, index_fields: list[SourceFields] | None = None, **kwargs: Any)

Bases: CountMixin, MatchboxBase

Table of source_configs of data for Matchbox.

Methods:

  • list_all

    Returns all source_configs in the database.

  • from_dto

    Create a SourceConfigs instance from a SourceConfig DTO object.

  • to_dto

    Convert ORM source to a matchbox.common.SourceConfig object.

  • count

    Counts the number of rows in the table.

Attributes:

__tablename__ class-attribute instance-attribute
__tablename__ = 'source_configs'
source_config_id class-attribute instance-attribute
source_config_id: Mapped[int] = mapped_column(BIGINT, Identity(start=1), primary_key=True)
resolution_id class-attribute instance-attribute
resolution_id: Mapped[int] = mapped_column(BIGINT, ForeignKey('resolutions.resolution_id', ondelete='CASCADE'), nullable=False)
location_type class-attribute instance-attribute
location_type: Mapped[str] = mapped_column(TEXT, nullable=False)
location_name class-attribute instance-attribute
location_name: Mapped[str] = mapped_column(TEXT, nullable=False)
extract_transform class-attribute instance-attribute
extract_transform: Mapped[str] = mapped_column(TEXT, nullable=False)
name property
name: str

Get the name of the related resolution.

source_resolution class-attribute instance-attribute
source_resolution: Mapped[Resolutions] = relationship(back_populates='source_config')
fields class-attribute instance-attribute
fields: Mapped[list[SourceFields]] = relationship(back_populates='source_config', passive_deletes=True, cascade='all, delete-orphan')
key_field class-attribute instance-attribute
key_field: Mapped[Optional[SourceFields]] = relationship(primaryjoin='and_(SourceConfigs.source_config_id == SourceFields.source_config_id, SourceFields.is_key == True)', viewonly=True, uselist=False)
index_fields class-attribute instance-attribute
index_fields: Mapped[list[SourceFields]] = relationship(primaryjoin='and_(SourceConfigs.source_config_id == SourceFields.source_config_id, SourceFields.is_key == False)', viewonly=True, order_by='SourceFields.index', collection_class=list)
cluster_keys class-attribute instance-attribute
cluster_keys: Mapped[list[ClusterSourceKey]] = relationship(back_populates='source_config', passive_deletes=True)
clusters class-attribute instance-attribute
clusters: Mapped[list[Clusters]] = relationship(secondary=__table__, primaryjoin='SourceConfigs.source_config_id == ClusterSourceKey.source_config_id', secondaryjoin='ClusterSourceKey.cluster_id == Clusters.cluster_id', viewonly=True)
list_all classmethod
list_all() -> list[SourceConfigs]

Returns all source_configs in the database.

from_dto classmethod
from_dto(config: SourceConfig) -> SourceConfigs

Create a SourceConfigs instance from a SourceConfig DTO object.

to_dto
to_dto() -> SourceConfig

Convert ORM source to a matchbox.common.SourceConfig object.

count classmethod
count() -> int

Counts the number of rows in the table.

ModelConfigs

Bases: CountMixin, MatchboxBase

Table of model configs for Matchbox.

Methods:

  • list_all

    Returns all model_configs in the database.

  • from_dto

    Create a ModelConfigs instance from a ModelConfig DTO object.

  • to_dto

    Convert ORM model config to a matchbox.common.ModelConfig object.

  • count

    Counts the number of rows in the table.

Attributes:

__tablename__ class-attribute instance-attribute
__tablename__ = 'model_configs'
model_config_id class-attribute instance-attribute
model_config_id: Mapped[int] = mapped_column(BIGINT, Identity(start=1), primary_key=True)
resolution_id class-attribute instance-attribute
resolution_id: Mapped[int] = mapped_column(BIGINT, ForeignKey('resolutions.resolution_id', ondelete='CASCADE'), nullable=False)
model_class class-attribute instance-attribute
model_class: Mapped[str] = mapped_column(TEXT, nullable=False)
model_settings class-attribute instance-attribute
model_settings: Mapped[dict] = mapped_column(JSONB, nullable=False)
left_query class-attribute instance-attribute
left_query: Mapped[dict] = mapped_column(JSONB, nullable=False)
right_query class-attribute instance-attribute
right_query: Mapped[dict | None] = mapped_column(JSONB, nullable=True)
name property
name: str

Get the name of the related resolution.

model_resolution class-attribute instance-attribute
model_resolution: Mapped[Resolutions] = relationship(back_populates='model_config')
list_all classmethod
list_all() -> list[SourceConfigs]

Returns all model_configs in the database.

from_dto classmethod
from_dto(config: ModelConfig) -> ModelConfigs

Create a ModelConfigs instance from a ModelConfig DTO object.

to_dto
to_dto() -> ModelConfig

Convert ORM model config to a matchbox.common.ModelConfig object.

count classmethod
count() -> int

Counts the number of rows in the table.

Contains

Bases: CountMixin, MatchboxBase

Cluster lineage table.

Methods:

  • count

    Counts the number of rows in the table.

Attributes:

__tablename__ class-attribute instance-attribute
__tablename__ = 'contains'
root class-attribute instance-attribute
root: Mapped[int] = mapped_column(BIGINT, ForeignKey('clusters.cluster_id', ondelete='CASCADE'), primary_key=True)
leaf class-attribute instance-attribute
leaf: Mapped[int] = mapped_column(BIGINT, ForeignKey('clusters.cluster_id', ondelete='CASCADE'), primary_key=True)
__table_args__ class-attribute instance-attribute
__table_args__ = (CheckConstraint('root != leaf', name='no_self_containment'), Index('ix_contains_root_leaf', 'root', 'leaf'), Index('ix_contains_leaf_root', 'leaf', 'root'))
count classmethod
count() -> int

Counts the number of rows in the table.
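
Because `contains` stores only root-to-leaf pairs, both directions of lookup are flat reads with no intermediate levels to walk. A minimal sketch, assuming the stdlib `sqlite3` module and hypothetical cluster IDs (10 and 11 as proposed clusters, 1 to 3 as leaves):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Mirrors the documented columns and check constraint, without foreign keys.
conn.execute(
    "CREATE TABLE contains ("
    " root INTEGER NOT NULL,"
    " leaf INTEGER NOT NULL,"
    " PRIMARY KEY (root, leaf),"
    " CHECK (root != leaf))"
)
# Reverse index so that leaf -> root lookups are as cheap as root -> leaf.
conn.execute("CREATE INDEX ix_contains_leaf_root ON contains (leaf, root)")

conn.executemany(
    "INSERT INTO contains VALUES (?, ?)",
    [(10, 1), (10, 2), (11, 1), (11, 2), (11, 3)],
)

# Which leaves make up cluster 11?
leaves_of_11 = [
    row[0]
    for row in conn.execute("SELECT leaf FROM contains WHERE root = 11 ORDER BY leaf")
]
# Which clusters contain leaf 1?
roots_of_1 = [
    row[0]
    for row in conn.execute("SELECT root FROM contains WHERE leaf = 1 ORDER BY root")
]
```

The two composite indexes in `__table_args__` correspond to these two query shapes.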

Clusters

Bases: CountMixin, MatchboxBase

Table of indexed data and clusters that match it.

Methods:

  • count

    Counts the number of rows in the table.

Attributes:

__tablename__ class-attribute instance-attribute
__tablename__ = 'clusters'
cluster_id class-attribute instance-attribute
cluster_id: Mapped[int] = mapped_column(BIGINT, primary_key=True)
cluster_hash class-attribute instance-attribute
cluster_hash: Mapped[bytes] = mapped_column(BYTEA, nullable=False)
keys class-attribute instance-attribute
keys: Mapped[list[ClusterSourceKey]] = relationship(back_populates='cluster', passive_deletes=True)
probabilities class-attribute instance-attribute
probabilities: Mapped[list[Probabilities]] = relationship(back_populates='proposes', passive_deletes=True)
leaves class-attribute instance-attribute
leaves: Mapped[list[Clusters]] = relationship(secondary=__table__, primaryjoin='Clusters.cluster_id == Contains.root', secondaryjoin='Clusters.cluster_id == Contains.leaf', backref='roots')
source_configs class-attribute instance-attribute
source_configs: Mapped[list[SourceConfigs]] = relationship(secondary=__table__, primaryjoin='Clusters.cluster_id == ClusterSourceKey.cluster_id', secondaryjoin='ClusterSourceKey.source_config_id == SourceConfigs.source_config_id', viewonly=True)
__table_args__ class-attribute instance-attribute
__table_args__ = (UniqueConstraint('cluster_hash', name='clusters_hash_key'),)
count classmethod
count() -> int

Counts the number of rows in the table.

UserGroups

Bases: MatchboxBase

Association table for user-group membership.

Attributes:

__tablename__ class-attribute instance-attribute
__tablename__ = 'user_groups'
user_id class-attribute instance-attribute
user_id: Mapped[int] = mapped_column(BIGINT, ForeignKey('users.user_id', ondelete='CASCADE'), primary_key=True)
group_id class-attribute instance-attribute
group_id: Mapped[int] = mapped_column(BIGINT, ForeignKey('groups.group_id', ondelete='CASCADE'), primary_key=True)

Users

Bases: CountMixin, MatchboxBase

Table of user identities.

Methods:

  • count

    Counts the number of rows in the table.

Attributes:

__tablename__ class-attribute instance-attribute
__tablename__ = 'users'
user_id class-attribute instance-attribute
user_id: Mapped[int] = mapped_column(BIGINT, primary_key=True)
name class-attribute instance-attribute
name: Mapped[str] = mapped_column(TEXT, nullable=False)
email class-attribute instance-attribute
email: Mapped[str] = mapped_column(TEXT, nullable=True)
judgements class-attribute instance-attribute
judgements: Mapped[list[EvalJudgements]] = relationship(back_populates='user')
groups class-attribute instance-attribute
groups: Mapped[list[Groups]] = relationship(secondary=__table__, back_populates='members')
__table_args__ class-attribute instance-attribute
__table_args__ = (UniqueConstraint('name', name='user_name_unique'),)
count classmethod
count() -> int

Counts the number of rows in the table.

Groups

Bases: CountMixin, MatchboxBase

Groups for permission management.

Methods:

  • count

    Counts the number of rows in the table.

Attributes:

__tablename__ class-attribute instance-attribute
__tablename__ = 'groups'
group_id class-attribute instance-attribute
group_id: Mapped[int] = mapped_column(BIGINT, primary_key=True, autoincrement=True)
name class-attribute instance-attribute
name: Mapped[str] = mapped_column(TEXT, nullable=False)
description class-attribute instance-attribute
description: Mapped[str | None] = mapped_column(TEXT, nullable=True)
is_system class-attribute instance-attribute
is_system: Mapped[bool] = mapped_column(BOOLEAN, default=False, nullable=False)
members class-attribute instance-attribute
members: Mapped[list[Users]] = relationship(secondary=__table__, back_populates='groups')
permissions class-attribute instance-attribute
permissions: Mapped[list[Permissions]] = relationship(back_populates='group', passive_deletes=True)
__table_args__ class-attribute instance-attribute
__table_args__ = (UniqueConstraint('name', name='groups_name_key'),)
count classmethod
count() -> int

Counts the number of rows in the table.

Permissions

Bases: CountMixin, MatchboxBase

Permissions granted to groups on resources.

Each resource type should have its own column. This creates many nulls, but they are cheap in PostgreSQL and the table stays small, and it avoids a polymorphic association.

Methods:

  • count

    Counts the number of rows in the table.

Attributes:

__tablename__ class-attribute instance-attribute
__tablename__ = 'permissions'
permission_id class-attribute instance-attribute
permission_id: Mapped[int] = mapped_column(BIGINT, primary_key=True, autoincrement=True)
permission class-attribute instance-attribute
permission: Mapped[str] = mapped_column(TEXT, nullable=False)
group_id class-attribute instance-attribute
group_id: Mapped[int] = mapped_column(BIGINT, ForeignKey('groups.group_id', ondelete='CASCADE'), nullable=False)
collection_id class-attribute instance-attribute
collection_id: Mapped[int | None] = mapped_column(BIGINT, ForeignKey('collections.collection_id', ondelete='CASCADE'), nullable=True)
is_system class-attribute instance-attribute
is_system: Mapped[bool | None] = mapped_column(BOOLEAN, nullable=True)
group class-attribute instance-attribute
group: Mapped[Groups] = relationship(back_populates='permissions')
collection class-attribute instance-attribute
collection: Mapped[Collections | None] = relationship(back_populates='permissions')
__table_args__ class-attribute instance-attribute
__table_args__ = (CheckConstraint("permission IN ('read', 'write', 'admin')", name='valid_permission'), CheckConstraint('(collection_id IS NOT NULL AND is_system IS NULL) OR (collection_id IS NULL AND is_system = true)', name='exactly_one_resource'), UniqueConstraint('permission', 'group_id', 'collection_id', 'is_system', name='unique_permission_grant'))
count classmethod
count() -> int

Counts the number of rows in the table.
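
The `valid_permission` and `exactly_one_resource` check constraints can be exercised with a small sketch. It assumes the stdlib `sqlite3` module in place of PostgreSQL, omits foreign keys and the unique constraint, and uses a hypothetical helper `try_grant`; the group and collection IDs are made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE permissions ("
    " permission_id INTEGER PRIMARY KEY,"
    " permission TEXT NOT NULL"
    "  CHECK (permission IN ('read', 'write', 'admin')),"
    " group_id INTEGER NOT NULL,"
    " collection_id INTEGER,"
    " is_system BOOLEAN,"
    " CHECK ((collection_id IS NOT NULL AND is_system IS NULL)"
    "     OR (collection_id IS NULL AND is_system = 1)))"
)


def try_grant(permission, group_id, collection_id, is_system):
    """Return True if the row is accepted by the table's CHECK constraints."""
    try:
        conn.execute(
            "INSERT INTO permissions"
            " (permission, group_id, collection_id, is_system)"
            " VALUES (?, ?, ?, ?)",
            (permission, group_id, collection_id, is_system),
        )
        return True
    except sqlite3.IntegrityError:
        return False


collection_grant = try_grant("read", 1, 42, None)  # collection-scoped grant
system_grant = try_grant("admin", 1, None, True)   # system-scoped grant
both_set = try_grant("write", 1, 42, True)         # two resources at once
bad_name = try_grant("delete", 1, 42, None)        # not a valid permission
```

Adding a new resource type under this design means adding one nullable column and extending the check, rather than introducing a type discriminator.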

EvalJudgements

Bases: CountMixin, MatchboxBase

Table of evaluation judgements produced by human validators.

Methods:

  • count

    Counts the number of rows in the table.

Attributes:

__tablename__ class-attribute instance-attribute
__tablename__ = 'eval_judgements'
judgement_id class-attribute instance-attribute
judgement_id: Mapped[int] = mapped_column(BIGINT, primary_key=True)
user_id class-attribute instance-attribute
user_id: Mapped[int] = mapped_column(BIGINT, ForeignKey('users.user_id', ondelete='CASCADE'), nullable=False)
endorsed_cluster_id class-attribute instance-attribute
endorsed_cluster_id: Mapped[int] = mapped_column(BIGINT, ForeignKey('clusters.cluster_id', ondelete='CASCADE'), nullable=False)
shown_cluster_id class-attribute instance-attribute
shown_cluster_id: Mapped[int] = mapped_column(BIGINT, ForeignKey('clusters.cluster_id', ondelete='CASCADE'), nullable=False)
tag class-attribute instance-attribute
tag: Mapped[str] = mapped_column(TEXT, nullable=True)
timestamp class-attribute instance-attribute
timestamp: Mapped[DateTime] = mapped_column(DateTime(timezone=True), nullable=False)
user class-attribute instance-attribute
user: Mapped[Users] = relationship(back_populates='judgements')
count classmethod
count() -> int

Counts the number of rows in the table.

Probabilities

Bases: CountMixin, MatchboxBase

Table of probabilities that a cluster is correct, according to a resolution.

Methods:

  • count

    Counts the number of rows in the table.

Attributes:

__tablename__ class-attribute instance-attribute
__tablename__ = 'probabilities'
resolution_id class-attribute instance-attribute
resolution_id: Mapped[int] = mapped_column(BIGINT, ForeignKey('resolutions.resolution_id', ondelete='CASCADE'), primary_key=True)
cluster_id class-attribute instance-attribute
cluster_id: Mapped[int] = mapped_column(BIGINT, ForeignKey('clusters.cluster_id', ondelete='CASCADE'), primary_key=True)
probability class-attribute instance-attribute
probability: Mapped[int] = mapped_column(SMALLINT, nullable=False)
proposed_by class-attribute instance-attribute
proposed_by: Mapped[Resolutions] = relationship(back_populates='probabilities')
proposes class-attribute instance-attribute
proposes: Mapped[Clusters] = relationship(back_populates='probabilities')
__table_args__ class-attribute instance-attribute
__table_args__ = (CheckConstraint('probability BETWEEN 0 AND 100', name='valid_probability'), Index('ix_probabilities_resolution', 'resolution_id'))
count classmethod
count() -> int

Counts the number of rows in the table.

Results

Bases: CountMixin, MatchboxBase

Table of results for a resolution.

Stores the raw left/right probabilities created by a model.

Methods:

  • count

    Counts the number of rows in the table.

Attributes:

__tablename__ class-attribute instance-attribute
__tablename__ = 'results'
result_id class-attribute instance-attribute
result_id: Mapped[int] = mapped_column(BIGINT, primary_key=True, autoincrement=True)
resolution_id class-attribute instance-attribute
resolution_id: Mapped[int] = mapped_column(BIGINT, ForeignKey('resolutions.resolution_id', ondelete='CASCADE'), nullable=False)
left_id class-attribute instance-attribute
left_id: Mapped[int] = mapped_column(BIGINT, ForeignKey('clusters.cluster_id', ondelete='CASCADE'), nullable=False)
right_id class-attribute instance-attribute
right_id: Mapped[int] = mapped_column(BIGINT, ForeignKey('clusters.cluster_id', ondelete='CASCADE'), nullable=False)
probability class-attribute instance-attribute
probability: Mapped[int] = mapped_column(SMALLINT, nullable=False)
proposed_by class-attribute instance-attribute
proposed_by: Mapped[Resolutions] = relationship(back_populates='results')
__table_args__ class-attribute instance-attribute
__table_args__ = (Index('ix_results_resolution', 'resolution_id'), CheckConstraint('probability BETWEEN 0 AND 100', name='valid_probability'), UniqueConstraint('resolution_id', 'left_id', 'right_id'))
count classmethod
count() -> int

Counts the number of rows in the table.

utils

Utilities for using the PostgreSQL backend.

Modules:

  • db

    General utilities for the PostgreSQL backend.

  • insert

    Utilities for inserting data into the PostgreSQL backend.

  • query

    Utilities for querying and matching in the PostgreSQL backend.

  • results

    Utilities for querying model results from the PostgreSQL backend.

db

General utilities for the PostgreSQL backend.

Functions:

  • dump

    Dumps the entire database to a snapshot.

  • restore

    Restores the database from a snapshot.

  • sqa_profiled

    SQLAlchemy profiler.

  • compile_sql

    Compiles a SQLAlchemy statement into a string.

  • large_append

    Append a PyArrow table to a PostgreSQL table using ADBC.

  • ingest_to_temporary_table

    Context manager to ingest Arrow data to a temporary table with explicit types.

dump
dump() -> MatchboxSnapshot

Dumps the entire database to a snapshot.

Returns:

  • MatchboxSnapshot

    A MatchboxSnapshot object of type “postgres” with the database’s current state.

restore

Restores the database from a snapshot.

Parameters:

  • snapshot
    (MatchboxSnapshot) –

    A MatchboxSnapshot object of type “postgres” with the database’s state

  • batch_size
    (int) –

    The number of records to insert in each batch

Raises:

sqa_profiled
sqa_profiled() -> Generator[None, None, None]

SQLAlchemy profiler.

Taken directly from the SQLAlchemy docs: https://docs.sqlalchemy.org/en/20/faq/performance.html#query-profiling

compile_sql
compile_sql(stmt: Select) -> str

Compiles a SQLAlchemy statement into a string.

Parameters:

  • stmt
    (Select) –

    The SQLAlchemy statement to compile.

Returns:

  • str

    The compiled SQL statement as a string.

large_append
large_append(data: Table, table_class: DeclarativeMeta, adbc_connection: PoolProxiedConnection, max_chunksize: int | None = None) -> None

Append a PyArrow table to a PostgreSQL table using ADBC.

This function does not support upserting and will raise an error if keys clash. It does not auto-commit; committing is the responsibility of the caller.

Parameters:

  • data
    (Table) –

    A PyArrow table to write.

  • table_class
    (DeclarativeMeta) –

    The SQLAlchemy ORM class for the table to write to.

  • adbc_connection
    (PoolProxiedConnection) –

    An ADBC connection from the pool. This is returned by MBDB.get_adbc_connection() and needs to be used via a context manager.

  • max_chunksize
    (int | None, default: None ) –

    Size of data chunks to be read and copied.

ingest_to_temporary_table
ingest_to_temporary_table(table_name: str, schema_name: str, data: Table, column_types: dict[str, type[TypeEngine]], max_chunksize: int | None = None) -> Generator[Table, None, None]

Context manager to ingest Arrow data to a temporary table with explicit types.

Parameters:

  • table_name
    (str) –

    Base name for the temporary table

  • schema_name
    (str) –

    Schema where the temporary table will be created

  • data
    (Table) –

    PyArrow table containing the data to ingest

  • column_types
    (dict[str, type[TypeEngine]]) –

    Map of column names to SQLAlchemy types

  • max_chunksize
    (int | None, default: None ) –

    Optional maximum chunk size for batches

Returns:

  • Table

    A SQLAlchemy Table object representing the temporary table
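
The general shape of a context manager that stages data in a temporary table and cleans up afterwards can be sketched as follows. This is an illustration only: it assumes the stdlib `sqlite3` module and plain row tuples instead of ADBC and Arrow, yields a table name rather than a SQLAlchemy Table, and the name `temporary_table` is hypothetical.

```python
import sqlite3
from contextlib import contextmanager


@contextmanager
def temporary_table(conn, table_name, column_types, rows):
    """Create and fill a temporary table, yield its name, then drop it."""
    # Explicit column types, in the spirit of the column_types parameter.
    columns = ", ".join(
        f'"{name}" {sql_type}' for name, sql_type in column_types.items()
    )
    placeholders = ", ".join("?" for _ in column_types)
    conn.execute(f'CREATE TEMP TABLE "{table_name}" ({columns})')
    try:
        conn.executemany(
            f'INSERT INTO "{table_name}" VALUES ({placeholders})', rows
        )
        yield table_name
    finally:
        # Runs on normal exit and on error, so the table never outlives the block.
        conn.execute(f'DROP TABLE IF EXISTS "{table_name}"')


conn = sqlite3.connect(":memory:")
with temporary_table(
    conn, "staging", {"id": "INTEGER", "hash": "BLOB"}, [(1, b"a"), (2, b"b")]
) as name:
    staged = conn.execute(f'SELECT COUNT(*) FROM "{name}"').fetchone()[0]

# After the block exits, the temporary table is gone.
remaining = conn.execute(
    "SELECT COUNT(*) FROM sqlite_temp_master WHERE name = 'staging'"
).fetchone()[0]
```

Putting the drop in a `finally` clause is the key design choice: callers can join against the staged data inside the `with` block without having to remember any cleanup.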

insert

Utilities for inserting data into the PostgreSQL backend.

Functions:

insert_hashes
insert_hashes(path: SourceResolutionPath, data_hashes: Table, batch_size: int) -> None

Indexes hash data for a source within Matchbox.

Parameters:

  • path
    (SourceResolutionPath) –

    The path of the source resolution

  • data_hashes
    (Table) –

    Arrow table containing hash data

  • batch_size
    (int) –

    Batch size for bulk operations

Raises:

insert_results
insert_results(path: ModelResolutionPath, results: Table, batch_size: int) -> None

Writes a results table to Matchbox.

The PostgreSQL backend stores clusters in a hierarchical structure, where each component references its parent component at a higher threshold.

As a result, a two-item component is synonymous with its original pairwise probability.

The hierarchy also makes it easy to query clusters at any threshold.
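The relationship between pairwise probabilities and components at a threshold can be illustrated with a small union-find sketch. It is not how the backend stores or builds the hierarchy. The function components_at_threshold is hypothetical, and it assumes integer probabilities and that a pair qualifies when its probability is greater than or equal to the threshold.

```python
from __future__ import annotations


def components_at_threshold(
    pairs: list[tuple[int, int, int]], threshold: int
) -> set[frozenset[int]]:
    """Connected components formed by pairs whose probability meets the threshold."""
    parent: dict[int, int] = {}

    def find(node: int) -> int:
        parent.setdefault(node, node)
        while parent[node] != node:
            parent[node] = parent[parent[node]]  # path halving
            node = parent[node]
        return node

    for left_id, right_id, probability in pairs:
        if probability >= threshold:  # assumption: inclusive comparison
            left_root, right_root = find(left_id), find(right_id)
            if left_root != right_root:
                parent[right_root] = left_root

    groups: dict[int, set[int]] = {}
    for node in list(parent):
        groups.setdefault(find(node), set()).add(node)
    return {frozenset(members) for members in groups.values()}


# Rows shaped like the results table: left_id, right_id, probability.
results = [(1, 2, 90), (2, 3, 70), (4, 5, 80)]

at_80 = components_at_threshold(results, 80)
at_70 = components_at_threshold(results, 70)
```

At threshold 80 the components are the two-item pairs {1, 2} and {4, 5}, each matching one original row. Lowering the threshold to 70 merges {1, 2} with 3 into a larger component.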

Parameters:

  • path
    (ModelResolutionPath) –

    The path of the model resolution to upload results for

  • results
    (Table) –

    A PyArrow results table with left_id, right_id, probability

  • batch_size
    (int) –

    Number of records to insert in each batch

Raises:

query

Utilities for querying and matching in the PostgreSQL backend.

Functions:

  • build_unified_query

    Build a query to resolve cluster assignments across resolution hierarchies.

  • query

    Queries Matchbox to retrieve linked data for a source.

  • get_parent_clusters_and_leaves

    Query clusters and their leaves for all parent resolutions.

  • match

    Matches an ID in a source resolution and returns the keys in the targets.

Attributes:

  • T
T module-attribute
T = TypeVar('T')
build_unified_query
build_unified_query(resolution: Resolutions, sources: list[SourceConfigs] | None = None, threshold: int | None = None, level: Literal['leaf', 'key'] = 'leaf', get_hashes: bool = False) -> Select

Build a query to resolve cluster assignments across resolution hierarchies.

This function creates SQL that determines which cluster each source record belongs to by traversing up a resolution hierarchy and applying priority-based cluster selection.

The query uses COALESCE to implement a priority system where higher-level resolutions can “claim” records, with lower levels only processing unclaimed records:

COALESCE(highest_priority_cluster, medium_priority_cluster, ..., source_cluster)
  1. Lineage discovery: Queries the resolution hierarchy to find all ancestor resolutions, ordered by priority (lowest level = highest priority)
  2. Source filtering: When sources is provided, constrains results to only include clusters from those specific source configurations
  3. Threshold application: Applies probability thresholds to determine which clusters qualify at each resolution level
  4. Subquery construction: For each model resolution in the lineage, builds a subquery that finds qualifying clusters via the Contains→Probabilities join. Each joined subquery adds a new cluster column which is then merged via…
  5. COALESCE assembly: Joins all subqueries to source data and uses COALESCE to select the highest-priority cluster assignment for each record
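The COALESCE priority step can be sketched in isolation. The example below uses sqlite3 as a stand-in for PostgreSQL, with hypothetical tables: source_leaves holds the source clusters, model_b plays the resolution being queried (highest priority) and model_a plays one of its ancestors. The real query joins through Contains and Probabilities and applies thresholds, which this toy omits.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE source_leaves (leaf_id INTEGER PRIMARY KEY);
    CREATE TABLE model_a (leaf_id INTEGER, root_id INTEGER);
    CREATE TABLE model_b (leaf_id INTEGER, root_id INTEGER);

    INSERT INTO source_leaves VALUES (1), (2), (3), (4), (5), (6);
    -- Lower-priority ancestor: clusters {1, 2} and {5, 6}.
    INSERT INTO model_a VALUES (1, 10), (2, 10), (5, 11), (6, 11);
    -- Highest-priority resolution: claims leaves 1, 2 and 3.
    INSERT INTO model_b VALUES (1, 20), (2, 20), (3, 20);
    """
)

rows = conn.execute(
    """
    SELECT
        s.leaf_id,
        -- First non-NULL wins: model_b, then model_a, then the source cluster itself.
        COALESCE(b.root_id, a.root_id, s.leaf_id) AS root_id
    FROM source_leaves AS s
    LEFT JOIN model_b AS b ON b.leaf_id = s.leaf_id
    LEFT JOIN model_a AS a ON a.leaf_id = s.leaf_id
    ORDER BY s.leaf_id
    """
).fetchall()
```

Leaves 1 to 3 are claimed by model_b, leaves 5 and 6 fall through to model_a, and leaf 4 is unclaimed, so it keeps its own source cluster ID.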

The level changes the data returned:

  • "leaf": Returns both root and leaf cluster IDs. For unmerged source clusters, the root and leaf properties will be the same.
  • "key": In addition to the above, it also returns the source key. This will give more rows than "leaf" because it needs a row for every key attached to a leaf.

Additionally, if get_hashes is set to True, the root and leaf hashes are returned.
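The difference in row counts between the two levels can be shown with toy data. The tuples and the keys_by_leaf mapping below are hypothetical and only mimic the shape of the output, not the actual column names.

```python
# "leaf" level: one row per (root, leaf) pair.
# Leaf 4 is an unmerged source cluster, so its root and leaf are the same.
leaf_rows = [(20, 1), (20, 2), (4, 4)]

# Hypothetical source keys attached to each leaf; leaf 2 has two keys.
keys_by_leaf = {1: ["a1"], 2: ["b7", "b8"], 4: ["c3"]}

# "key" level: one row per key attached to each leaf, so at least as many rows.
key_rows = [
    (root_id, leaf_id, key)
    for root_id, leaf_id in leaf_rows
    for key in keys_by_leaf[leaf_id]
]
```

Here the "leaf" level has three rows and the "key" level has four, because leaf 2 carries two source keys.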

query
query(source: SourceResolutionPath, point_of_truth: ResolutionPath | None = None, threshold: int | None = None, return_leaf_id: bool = False, limit: int = None) -> Table

Queries Matchbox to retrieve linked data for a source.

Retrieves all linked data for a given source, resolving through the hierarchy if needed.

  • Simple case: If querying the same resolution as the source, just select cluster IDs and keys directly from ClusterSourceKey
  • Hierarchy case: Uses the unified query builder to traverse up the resolution hierarchy, applying COALESCE priority logic to determine which parent cluster each source record belongs to
  • Priority resolution: When multiple model resolutions could assign a record to different clusters, COALESCE ensures higher-priority resolutions win

Returns all records with their final resolved cluster IDs.

get_parent_clusters_and_leaves
get_parent_clusters_and_leaves(resolution: Resolutions) -> dict[int, dict[str, list[dict]]]

Query clusters and their leaves for all parent resolutions.

For a given resolution, find all its parent resolutions and return complete cluster compositions.

  • Parent discovery: Queries ResolutionFrom to find all direct parent resolutions (level 1)
  • Cluster building: For each parent, runs the full unified query to get all cluster assignments with both root and leaf information
  • Aggregation: Collects all leaf nodes belonging to each root cluster across all parent resolutions

Returns a dictionary mapping cluster IDs to their complete leaf compositions and metadata.
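The aggregation step can be sketched as a plain grouping of (root, leaf) rows. This is a guess at the general shape only: the function group_leaves_by_root, the "leaves" key and the "leaf_id" field are hypothetical, chosen to fit the dict[int, dict[str, list[dict]]] return type.

```python
from __future__ import annotations


def group_leaves_by_root(
    rows_by_parent: dict[str, list[tuple[int, int]]],
) -> dict[int, dict[str, list[dict]]]:
    """Collect the leaves of every root cluster across all parent resolutions."""
    clusters: dict[int, dict[str, list[dict]]] = {}
    for rows in rows_by_parent.values():
        for root_id, leaf_id in rows:
            cluster = clusters.setdefault(root_id, {"leaves": []})
            cluster["leaves"].append({"leaf_id": leaf_id})
    return clusters


# Hypothetical (root_id, leaf_id) rows, one list per parent resolution.
rows_by_parent = {
    "parent_a": [(10, 1), (10, 2)],
    "parent_b": [(11, 5), (11, 6)],
}
clusters = group_leaves_by_root(rows_by_parent)
```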

match
match(key: str, source: SourceResolutionPath, targets: list[SourceResolutionPath], point_of_truth: ResolutionPath, threshold: int | None = None) -> list[Match]

Matches an ID in a source resolution and returns the keys in the targets.

Given a specific key in a source, find what it matches to in target sources through a resolution hierarchy.

  • Target cluster identification: Uses a COALESCE priority CTE to determine which cluster the input key belongs to at the resolution level
  • Matching leaves discovery: Builds a UNION ALL query with branches for:
    • Direct cluster members (source-only case)
    • Members connected through each model resolution in the hierarchy
  • Cross-reference: Joins the target cluster with all possible matching leaves, filtering for the requested target sources

Organises matches by source configuration and returns structured Match objects for each target.
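The overall flow (find the cluster for the input key, then collect the keys that share it in each target) can be sketched in plain Python. This is a simplification under stated assumptions: the rows are taken as already resolved at the point of truth, the function match_key and the source names are hypothetical, and it returns a dictionary rather than Match objects.

```python
from __future__ import annotations

from collections import defaultdict


def match_key(
    key: str,
    source: str,
    targets: list[str],
    resolved_rows: list[tuple[int, str, str]],
) -> dict[str, set[str]]:
    """Return the keys in each target that share a cluster with the given key."""
    # Step 1: which cluster(s) does the input key belong to?
    cluster_ids = {
        cluster_id
        for cluster_id, row_source, row_key in resolved_rows
        if row_source == source and row_key == key
    }
    # Step 2: collect keys in those clusters, filtered to the requested targets.
    matches: dict[str, set[str]] = defaultdict(set)
    for cluster_id, row_source, row_key in resolved_rows:
        if cluster_id in cluster_ids and row_source in targets:
            matches[row_source].add(row_key)
    # Every requested target gets an entry, even when nothing matched.
    return {target: matches.get(target, set()) for target in targets}


# Hypothetical rows: (cluster_id, source name, source key).
resolved = [
    (20, "crm", "a1"),
    (20, "erp", "b7"),
    (20, "erp", "b8"),
    (20, "web", "w2"),
    (4, "erp", "b9"),
]
matched = match_key("a1", "crm", ["erp", "web", "hr"], resolved)
```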

results

Utilities for querying model results from the PostgreSQL backend.

Classes:

  • SourceInfo

    Information about a model’s sources.

Functions:

SourceInfo

Bases: NamedTuple

Information about a model’s sources.

Attributes:

left instance-attribute
left: int
right instance-attribute
right: int | None
left_ancestors instance-attribute
left_ancestors: set[int]
right_ancestors instance-attribute
right_ancestors: set[int] | None
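For reference, the attributes listed above correspond to a NamedTuple of the following shape. The class body is reconstructed from the attribute list. The example values, and the reading that right and right_ancestors are None when a model has a single input, are assumptions.

```python
from __future__ import annotations

from typing import NamedTuple


class SourceInfo(NamedTuple):
    """Information about a model's sources."""

    left: int
    right: int | None
    left_ancestors: set[int]
    right_ancestors: set[int] | None


# Hypothetical IDs: a model with only a left input.
single_input = SourceInfo(left=3, right=None, left_ancestors={1}, right_ancestors=None)

# Hypothetical IDs: a model joining two inputs.
two_inputs = SourceInfo(left=3, right=4, left_ancestors={1}, right_ancestors={2})

# NamedTuple instances unpack positionally in field order.
left, right, left_ancestors, right_ancestors = two_inputs
```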
get_model_config
get_model_config(resolution: Resolutions) -> ModelConfig

Get metadata for a model resolution.