PostgreSQL¶
A backend adapter for deploying Matchbox using PostgreSQL.
There are two graph-like trees in this schema:
- In the resolution subgraph, the tree is implemented as a closure table (ResolutionFrom), enabling quick querying of root-to-leaf paths at the cost of redundancy.
- In the data subgraph, the tree is implemented as an adjacency list (Contains), which means recursive queries are required to resolve it, but less data is stored.
```mermaid
erDiagram
    Sources {
        bigint resolution_id PK,FK
        string resolution_name
        string full_name
        bytes warehouse_hash
        string db_pk
    }
    SourceColumns {
        bigint column_id PK
        bigint source_id FK
        int column_index
        string column_name
        string column_type
    }
    Clusters {
        bigint cluster_id PK
        bytes cluster_hash
    }
    ClusterSourcePK {
        bigint pk_id PK
        bigint cluster_id FK
        bigint source_id FK
        string source_pk
    }
    Contains {
        bigint parent PK,FK
        bigint child PK,FK
    }
    Probabilities {
        bigint resolution PK,FK
        bigint cluster PK,FK
        smallint probability
    }
    Resolutions {
        bigint resolution_id PK
        bytes resolution_hash
        string type
        string name
        string description
        smallint truth
    }
    ResolutionFrom {
        bigint parent PK,FK
        bigint child PK,FK
        int level
        smallint truth_cache
    }
    Sources |o--|| Resolutions : ""
    Sources ||--o{ SourceColumns : ""
    Sources ||--o{ ClusterSourcePK : ""
    Clusters ||--o{ ClusterSourcePK : ""
    Clusters ||--o{ Probabilities : ""
    Clusters ||--o{ Contains : "parent"
    Contains }o--|| Clusters : "child"
    Resolutions ||--o{ Probabilities : ""
    Resolutions ||--o{ ResolutionFrom : "parent"
    ResolutionFrom }o--|| Resolutions : "child"
```
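The trade-off between the two tree encodings can be sketched in plain Python. This is an illustrative model only, not the adapter's code: a closure table materialises every ancestor/descendant pair so ancestry is a single lookup, while an adjacency list stores only direct edges and must be walked (in SQL, via a recursive query).

```python
# Closure table (ResolutionFrom-style): every ancestor/descendant pair is
# stored with its level, so finding all ancestors is one flat lookup.
closure = {
    # (parent, child): level
    ("root", "a"): 1,
    ("root", "leaf"): 2,
    ("a", "leaf"): 1,
}
ancestors_of_leaf = {parent for (parent, child) in closure if child == "leaf"}
assert ancestors_of_leaf == {"root", "a"}  # no recursion needed

# Adjacency list (Contains-style): only direct edges are stored, so
# resolving ancestry requires walking the edges (in SQL, a recursive CTE).
edges = {"a": "root", "leaf": "a"}  # child -> parent

def ancestors(node: str) -> set[str]:
    """Walk parent links until the root is reached."""
    found: set[str] = set()
    while node in edges:
        node = edges[node]
        found.add(node)
    return found

assert ancestors("leaf") == {"root", "a"}
```

The closure table stores one row per ancestor/descendant pair (redundant but fast to query); the adjacency list stores one row per direct edge (compact but recursive to resolve), matching the description above.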
matchbox.server.postgresql¶
PostgreSQL adapter for Matchbox server.
Modules:
- adapter – PostgreSQL adapter for Matchbox server.
- benchmark – Benchmarking utilities for Matchbox's PostgreSQL backend.
- db – Matchbox PostgreSQL database connection.
- mixin – A module for defining mixins for the PostgreSQL backend ORM.
- orm – ORM classes for the Matchbox PostgreSQL database.
- utils – Utilities for using the PostgreSQL backend.
Classes:
- MatchboxPostgres – A PostgreSQL adapter for Matchbox.
- MatchboxPostgresSettings – Settings for the Matchbox PostgreSQL backend.
MatchboxPostgres¶
MatchboxPostgres(settings: MatchboxPostgresSettings)
Bases: MatchboxDBAdapter
A PostgreSQL adapter for Matchbox.
Methods:
- query – Queries the database from an optional point of truth.
- match – Matches an ID in a source dataset and returns the keys in the targets.
- index – Indexes a source dataset from your warehouse in Matchbox.
- get_source – Get a source from its address.
- get_resolution_sources – Get a list of sources queryable from a resolution.
- validate_ids – Validates that a list of IDs exists in the database.
- validate_hashes – Validates that a list of hashes exists in the database.
- cluster_id_to_hash – Get a lookup of Cluster hashes from a list of IDs.
- get_resolution_graph – Get the full resolution graph.
- dump – Dumps the entire database to a snapshot.
- drop – Hard-clears the database by dropping all tables and re-creating them.
- clear – Soft-clears the database by deleting all rows but retaining tables.
- restore – Restores the database from a snapshot.
- verify – Checks the database schema against the expected schema and logs the outcome.
- insert_model – Writes a model to Matchbox.
- get_model – Get a model from the database.
- set_model_results – Set the results for a model.
- get_model_results – Get the results for a model.
- set_model_truth – Sets the truth threshold for this model, changing the default clusters.
- get_model_truth – Gets the current truth threshold for this model.
- get_model_ancestors – Gets the current truth values of all ancestors.
- set_model_ancestors_cache – Updates the cached ancestor thresholds.
- get_model_ancestors_cache – Gets the cached ancestor thresholds, converting hashes to model names.
- delete_model – Delete a model from the database.
Attributes:
models instance-attribute¶
models = FilteredResolutions(
    datasets=False, humans=False, models=True
)
source_resolutions instance-attribute¶
source_resolutions = FilteredResolutions(
    datasets=True, humans=False, models=False
)
query¶
query(
    source_address: SourceAddress,
    resolution_name: str | None = None,
    threshold: int | None = None,
    limit: int | None = None,
) -> Table
Queries the database from an optional point of truth.
Parameters:
- source_address (SourceAddress) – The SourceAddress object identifying the source to query.
- resolution_name (str, optional, default: None) – The resolution to use for filtering results. If not specified, uses the dataset resolution for the queried source.
- threshold (int, optional, default: None) – The threshold to use for creating clusters. If None, uses the model's default threshold. If an integer, uses that threshold for the specified model, and the model's cached thresholds for its ancestors.
- limit (int, optional, default: None) – The number to use in a limit clause. Useful for testing.
Returns:
- Table – The resulting Matchbox IDs in Arrow format.
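The threshold semantics above can be illustrated with a minimal sketch. This is not the adapter's implementation; it assumes only what the schema states, namely that probabilities are stored as smallint percentages (0-100) and a cluster is kept when its probability meets the threshold.

```python
def clusters_at_threshold(probabilities: dict[int, int], threshold: int) -> set[int]:
    """Keep only the clusters whose probability meets the truth threshold.

    probabilities maps cluster_id -> probability (a 0-100 percentage, as in
    the Probabilities table); threshold is an integer on the same scale.
    """
    return {cluster for cluster, prob in probabilities.items() if prob >= threshold}

# Three candidate clusters proposed by a model at 95%, 80% and 60%:
probs = {1: 95, 2: 80, 3: 60}
assert clusters_at_threshold(probs, 80) == {1, 2}
assert clusters_at_threshold(probs, 50) == {1, 2, 3}
```

Raising the threshold shrinks the set of accepted clusters; passing None (per the docstring above) would instead fall back to the model's stored default.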
match¶
match(
    source_pk: str,
    source: SourceAddress,
    targets: list[SourceAddress],
    resolution_name: str,
    threshold: int | None = None,
) -> list[Match]
Matches an ID in a source dataset and returns the keys in the targets.
Parameters:
- source_pk (str) – The primary key to match from the source.
- source (SourceAddress) – The address of the source dataset.
- targets (list[SourceAddress]) – The addresses of the target datasets.
- resolution_name (str) – The name of the resolution to use for matching.
- threshold (int, optional, default: None) – The threshold to use for creating clusters. If None, uses the resolution's default threshold. If an integer, uses that threshold for the specified resolution, and the resolution's cached thresholds for its ancestors, in place of the cached values.
index¶
index(source: Source, data_hashes: Table) -> None
Indexes a source dataset from your warehouse in Matchbox.
get_source¶
get_source(address: SourceAddress) -> Source
Get a source from its address.
Parameters:
- address (SourceAddress) – The address of the source.
Returns:
- Source – A Source object.
get_resolution_sources¶
get_resolution_sources(resolution_name: str) -> list[Source]
Get a list of sources queryable from a resolution.
validate_ids¶
Validates that a list of IDs exists in the database.
validate_hashes¶
Validates that a list of hashes exists in the database.
cluster_id_to_hash¶
Get a lookup of Cluster hashes from a list of IDs.
dump¶
dump() -> MatchboxSnapshot
Dumps the entire database to a snapshot.
Returns:
- MatchboxSnapshot – A MatchboxSnapshot object of type "postgres" with the database's current state.
drop¶
Hard-clears the database by dropping all tables and re-creating them.
clear¶
Soft-clears the database by deleting all rows but retaining tables.
restore¶
restore(
    snapshot: MatchboxSnapshot, clear: bool = False
) -> None
Restores the database from a snapshot.
Parameters:
- snapshot (MatchboxSnapshot) – A MatchboxSnapshot object of type "postgres" with the database's state.
- clear (bool) – Whether to clear the database before restoration.
Raises:
- TypeError – If the snapshot is not compatible with PostgreSQL.
insert_model¶
insert_model(model: ModelMetadata) -> None
Writes a model to Matchbox.
Parameters:
- model (ModelMetadata) – ModelMetadata object with the model's metadata.
Raises:
- MatchboxDataNotFound – If, for a linker, the source models weren't found in the database.
set_model_results¶
set_model_results(model: str, results: Table) -> None
Set the results for a model.
set_model_truth¶
Sets the truth threshold for this model, changing the default clusters.
get_model_truth¶
Gets the current truth threshold for this model.
get_model_ancestors¶
get_model_ancestors(model: str) -> list[ModelAncestor]
Gets the current truth values of all ancestors.
Returns a list of ModelAncestor objects mapping model names to their current truth thresholds.
Unlike ancestors_cache, which returns cached values, this returns the current truth values of all ancestor models.
set_model_ancestors_cache¶
set_model_ancestors_cache(
    model: str, ancestors_cache: list[ModelAncestor]
) -> None
Updates the cached ancestor thresholds.
Parameters:
- model (str) – The name of the model to update.
- ancestors_cache (list[ModelAncestor]) – List of ModelAncestor objects mapping model names to their truth thresholds.
get_model_ancestors_cache¶
get_model_ancestors_cache(
    model: str,
) -> list[ModelAncestor]
Gets the cached ancestor thresholds, converting hashes to model names.
Returns a list of ModelAncestor objects mapping model names to their cached truth thresholds.
This is required because each point of truth needs to be stable, so we choose when to update it, caching the ancestors' values in the model itself.
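The distinction between live and cached ancestor thresholds can be sketched with plain dictionaries. The names and values here are illustrative, not the adapter's API: the point is only that a model resolves against the threshold it cached when it was built, so later changes to an ancestor's truth don't silently change its outputs.

```python
# Live truth values, as get_model_ancestors would report them:
current_truth = {"deduper_1": 85, "linker_1": 70}

# Thresholds frozen when the downstream model was built, as
# get_model_ancestors_cache would report them:
ancestors_cache = {"deduper_1": 90}

def effective_threshold(model: str, cache: dict[str, int], live: dict[str, int]) -> int:
    """Prefer the cached threshold; fall back to the live truth value."""
    return cache.get(model, live[model])

# The cached value wins, keeping the point of truth stable even though
# deduper_1's live truth has since moved to 85:
assert effective_threshold("deduper_1", ancestors_cache, current_truth) == 90
```

set_model_ancestors_cache is the deliberate moment at which the cache is refreshed to match the live values.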
MatchboxPostgresSettings¶
Bases: MatchboxServerSettings
Settings for the Matchbox PostgreSQL backend.
Inherits the core settings and adds the PostgreSQL-specific settings.
Attributes:
- model_config
- batch_size (int)
- datastore (MatchboxDatastoreSettings)
- api_key (SecretStr | None)
- log_level (LogLevelType)
- backend_type (MatchboxBackends)
- postgres (MatchboxPostgresCoreSettings)
model_config class-attribute instance-attribute¶
model_config = SettingsConfigDict(
    env_prefix="MB__SERVER__",
    env_nested_delimiter="__",
    use_enum_values=True,
    env_file=".env",
    env_file_encoding="utf-8",
    extra="ignore",
)
postgres class-attribute instance-attribute¶
postgres: MatchboxPostgresCoreSettings = Field(
    default_factory=MatchboxPostgresCoreSettings
)
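The SettingsConfigDict above means environment variables are read with the MB__SERVER__ prefix and "__" as the nested-field delimiter. The sketch below re-implements that mapping in plain Python purely to show the shape of the result; in practice pydantic-settings performs this parsing (with additional behaviour such as case-insensitivity and .env loading).

```python
def parse_env(env: dict[str, str], prefix: str = "MB__SERVER__", delim: str = "__") -> dict:
    """Map prefixed, delimited environment variables to nested settings keys."""
    out: dict = {}
    for key, value in env.items():
        if not key.startswith(prefix):
            continue  # unrelated variables are ignored (extra="ignore")
        parts = key[len(prefix):].lower().split(delim)
        node = out
        for part in parts[:-1]:
            node = node.setdefault(part, {})
        node[parts[-1]] = value
    return out

env = {"MB__SERVER__POSTGRES__HOST": "localhost", "HOME": "/root"}
assert parse_env(env) == {"postgres": {"host": "localhost"}}
```

So a variable like MB__SERVER__POSTGRES__HOST lands on the nested postgres settings object, matching env_prefix and env_nested_delimiter above.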
adapter¶
PostgreSQL adapter for Matchbox server.
Classes:
- FilteredClusters – Wrapper class for filtered cluster queries.
- FilteredProbabilities – Wrapper class for filtered probability queries.
- FilteredResolutions – Wrapper class for filtered resolution queries.
- MatchboxPostgres – A PostgreSQL adapter for Matchbox.
FilteredClusters¶
Bases: BaseModel
Wrapper class for filtered cluster queries.
Methods:
- count – Counts the number of clusters in the database.
Attributes:
- has_dataset (bool | None)
FilteredProbabilities¶
Bases: BaseModel
Wrapper class for filtered probability queries.
Methods:
- count – Counts the number of probabilities in the database.
Attributes:
- over_truth (bool)
FilteredResolutions¶
Bases: BaseModel
Wrapper class for filtered resolution queries.
Methods:
- count – Counts the number of resolutions in the database.
MatchboxPostgres¶
MatchboxPostgres(settings: MatchboxPostgresSettings)
Bases: MatchboxDBAdapter
A PostgreSQL adapter for Matchbox. This class is re-exported at the package level; its methods and attributes are documented in full under matchbox.server.postgresql above.
benchmark¶
Benchmarking utilities for Matchbox's PostgreSQL backend.
Modules:
- cluster_pipeline – Script to benchmark the full dummy data generation pipeline.
- generate_tables – Generate tables for benchmarking the PostgreSQL backend.
- query – Benchmarking utilities for the PostgreSQL backend.
cluster_pipeline¶
Script to benchmark the full dummy data generation pipeline.
Functions:
- timer – Context manager to time a block of code.
Attributes:
- ROOT
- config
- left_ids
- right_ids
- probs
- all_probs
- lookup
- hm
- probs_with_ccs
- hierarchy
right_ids module-attribute¶
probs module-attribute¶
probs = generate_dummy_probabilities(
    left_ids,
    right_ids,
    [0.6, 1],
    config["link_components"],
    config["link_len"],
)
lookup module-attribute¶
lookup = table(
    {
        "id": all_probs,
        "hash": array(
            [hash_data(p) for p in to_pylist()],
            type=large_binary(),
        ),
    }
)
probs_with_ccs module-attribute¶
probs_with_ccs = attach_components_to_probabilities(
    table(
        {
            "left_id": get_hashes(probs["left_id"]),
            "right_id": get_hashes(probs["right_id"]),
            "probability": probs["probability"],
        }
    )
)
hierarchy module-attribute¶
hierarchy = to_hierarchical_clusters(
    probabilities=probs_with_ccs,
    hash_func=hash_values,
    dtype=large_binary,
)
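The pipeline above hands hash_values to to_hierarchical_clusters so each parent cluster gets a deterministic hash derived from its members. The sketch below shows one plausible way such a hash could work, sorting child hashes first so the result is independent of input order. This is an assumption for illustration only; the real hash_values may differ.

```python
import hashlib

def hash_children(child_hashes: list[bytes]) -> bytes:
    """Derive a parent hash from child hashes, order-independently.

    Sorting before hashing means the same set of children always yields
    the same parent hash, regardless of the order they arrive in.
    """
    digest = hashlib.sha256()
    for child in sorted(child_hashes):
        digest.update(child)
    return digest.digest()

a = hashlib.sha256(b"record_a").digest()
b = hashlib.sha256(b"record_b").digest()
assert hash_children([a, b]) == hash_children([b, a])
```

Order-independence matters here because cluster membership is a set: the same merge discovered in a different order must produce the same cluster_hash.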
generate_tables¶
Generate tables for benchmarking the PostgreSQL backend.
Functions:
- generate_sources – Generate the sources and source_columns tables.
- generate_resolutions – Generate the resolutions table.
- generate_resolution_from – Generate the resolution_from table.
- generate_cluster_source – Generate both the Clusters and ClusterSourcePK tables for source rows.
- generate_result_tables – Generate the probabilities, contains and clusters tables.
- generate_all_tables – Make all six PostgreSQL backend tables.
- main – Command line tool for generating data.
Attributes:
- PRESETS
PRESETS module-attribute¶
PRESETS = {
"xs": {
"source_len": 10000,
"dedupe_components": 8000,
"dedupe_len": 2000,
"link_components": 6000,
"link_len": 10000,
},
"s": {
"source_len": 100000,
"dedupe_components": 80000,
"dedupe_len": 20000,
"link_components": 60000,
"link_len": 100000,
},
"m": {
"source_len": 1000000,
"dedupe_components": 800000,
"dedupe_len": 200000,
"link_components": 600000,
"link_len": 1000000,
},
"l": {
"source_len": 10000000,
"dedupe_components": 8000000,
"dedupe_len": 2000000,
"link_components": 6000000,
"link_len": 10000000,
},
"xl": {
"source_len": 100000000,
"dedupe_components": 80000000,
"dedupe_len": 20000000,
"link_components": 60000000,
"link_len": 100000000,
},
}
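Each preset tier multiplies every count in the previous tier by ten. A quick check of that relationship, with the values copied from the dictionary above:

```python
# Values copied from two adjacent PRESETS tiers above.
xs = {"source_len": 10_000, "dedupe_components": 8_000, "dedupe_len": 2_000,
      "link_components": 6_000, "link_len": 10_000}
s = {"source_len": 100_000, "dedupe_components": 80_000, "dedupe_len": 20_000,
     "link_components": 60_000, "link_len": 100_000}

# Every field scales by exactly 10x between tiers:
for key in xs:
    assert s[key] == 10 * xs[key]
```

The same 10x step holds from "s" through "xl", so each tier benchmarks one order of magnitude more data than the last.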
generate_sources¶
generate_sources(dataset_start_id: int = 1) -> tuple[Table, Table]
generate_resolutions¶
generate_resolutions(dataset_start_id: int = 1) -> Table
generate_resolution_from¶
generate_resolution_from(dataset_start_id: int = 1) -> Table
generate_cluster_source¶
generate_cluster_source(
    range_left: int,
    range_right: int,
    source_id: int,
    cluster_start_id: int = 0,
    pk_start_id: int = 0,
) -> tuple[Table, Table]
Generate both the Clusters and ClusterSourcePK tables for source rows.
Parameters:
- range_left (int) – First ID to generate.
- range_right (int) – Last ID to generate, plus one.
- source_id (int) – Source ID for the source.
- cluster_start_id (int, default: 0) – Starting ID for clusters.
- pk_start_id (int, default: 0) – Starting ID for primary keys.
Returns:
- tuple[Table, Table] – Tuple of (Clusters table, ClusterSourcePK table).
generate_result_tables¶
generate_result_tables(
    left_ids: Iterable[int],
    right_ids: Iterable[int] | None,
    resolution_id: int,
    next_id: int,
    n_components: int,
    n_probs: int,
    prob_min: float = 0.6,
    prob_max: float = 1,
) -> tuple[list[int], Table, Table, Table, int]
Generate the probabilities, contains and clusters tables.
Parameters:
- left_ids (Iterable[int]) – List of IDs for rows to dedupe, or for left rows to link.
- right_ids (Iterable[int] | None) – List of IDs for right rows to link.
- resolution_id (int) – ID of the resolution for this dedupe or link model.
- next_id (int) – The next ID to use when generating IDs.
- n_components (int) – Number of implied connected components.
- n_probs (int) – Total number of probability edges to be generated.
- prob_min (float, default: 0.6) – Minimum value for generated probabilities.
- prob_max (float, default: 1) – Maximum value for generated probabilities.
Returns:
generate_all_tables¶
generate_all_tables(
    source_len: int,
    dedupe_components: int,
    dedupe_len: int,
    link_components: int,
    link_len: int,
    cluster_start_id: int = 0,
    dataset_start_id: int = 1,
    pk_start_id: int = 0,
) -> dict[str, Table]
Make all six PostgreSQL backend tables.
It will create two sources, one deduper for each, and one linker from each deduper.
Parameters:
- source_len (int) – Length of each data source.
- dedupe_components (int) – Number of connected components implied by each deduper.
- dedupe_len (int) – Number of probabilities generated by each deduper.
- link_components (int) – Number of connected components implied by each linker.
- link_len (int) – Number of probabilities generated by each linker.
- cluster_start_id (int, default: 0) – Starting ID for clusters.
- dataset_start_id (int, default: 1) – Starting ID for dataset resolution IDs.
- pk_start_id (int, default: 0) – Starting ID for primary keys (globally unique).
Returns:
main¶
main(
    settings: str,
    output_dir: Path,
    cluster_start_id: int,
    dataset_start_id: int,
) -> None
Command line tool for generating data.
Parameters:
- settings (str) – The key of the settings dictionary to use.
- output_dir (Path) – Where to save the output files.
- cluster_start_id (int) – The first integer to use for clusters.
- dataset_start_id (int) – The first integer to use for datasets.
query¶
Benchmarking utilities for the PostgreSQL backend.
Functions:
- compile_query_sql – Compiles the SQL for query() based on a single point of truth and dataset.
- compile_match_sql – Compiles the SQL for match() based on a single point of truth and dataset.
compile_query_sql¶
compile_query_sql(
    point_of_truth: str, source_address: SourceAddress
) -> str
Compiles the SQL for query() based on a single point of truth and dataset.
Parameters:
- point_of_truth (str) – The name of the resolution to use, like "linker_1".
- source_address (SourceAddress) – The address of the source to retrieve.
Returns:
- str – A compiled PostgreSQL query, including semicolon, ready to run on Matchbox.
compile_match_sql¶
compile_match_sql(
    source_pk: str,
    source_address: SourceAddress,
    point_of_truth: str,
) -> str
Compiles the SQL for match() based on a single point of truth and dataset.
Note this only tests the query that retrieves all valid matches for the supplied key. The actual match function goes on to merge this with the user's requested target table(s).
Parameters:
- source_pk (str) – The name of the primary key of the source table.
- source_address (SourceAddress) – The address of the source to use.
- point_of_truth (str) – The name of the resolution to use, like "linker_1".
Returns:
- str – A compiled PostgreSQL query, including semicolon, ready to run on Matchbox.
db¶
Matchbox PostgreSQL database connection.
Classes:
- MatchboxPostgresCoreSettings – PostgreSQL-specific settings for Matchbox.
- MatchboxPostgresSettings – Settings for the Matchbox PostgreSQL backend.
- MatchboxDatabase – Matchbox PostgreSQL database connection.
Attributes:
- MBDB
MatchboxPostgresCoreSettings¶
PostgreSQL-specific settings for Matchbox.
MatchboxPostgresSettings¶
Bases: MatchboxServerSettings
Settings for the Matchbox PostgreSQL backend.
Inherits the core settings and adds the PostgreSQL-specific settings.
Attributes:
- backend_type (MatchboxBackends)
- postgres (MatchboxPostgresCoreSettings)
- model_config
- batch_size (int)
- datastore (MatchboxDatastoreSettings)
- api_key (SecretStr | None)
- log_level (LogLevelType)
postgres class-attribute instance-attribute¶
postgres: MatchboxPostgresCoreSettings = Field(
    default_factory=MatchboxPostgresCoreSettings
)
model_config class-attribute instance-attribute¶
model_config = SettingsConfigDict(
    env_prefix="MB__SERVER__",
    env_nested_delimiter="__",
    use_enum_values=True,
    env_file=".env",
    env_file_encoding="utf-8",
    extra="ignore",
)
MatchboxDatabase¶
MatchboxDatabase(settings: MatchboxPostgresSettings)
Matchbox PostgreSQL database connection.
Methods:
- connection_string – Get the connection string for PostgreSQL.
- get_engine – Get the database engine.
- get_session – Get a new session.
- get_adbc_connection – Get a new ADBC connection.
- run_migrations – Create the database and all tables expected in the schema.
- clear_database – Delete all rows in every table in the database schema.
- drop_database – Drop all tables in the database schema and re-create them.
- verify_schema – Verify that the live database schema is in sync with the ORM.
Attributes:
- MatchboxBase
connection_string¶
Get the connection string for PostgreSQL.
get_adbc_connection¶
Get a new ADBC connection.
The connection must be used within a context manager.
verify_schema¶
Verify that the live database schema is in sync with the ORM.
If any differences are detected, log this as an error.
NOTE: this was originally implemented prior to Alembic. In principle, Alembic is best placed to manage any such diff; this remains for now only as an informative aid, and could be removed.
mixin¶
A module for defining mixins for the PostgreSQL backend ORM.
Classes:
- CountMixin – A mixin for counting the number of rows in a table.
Attributes:
- T
orm¶
ORM classes for the Matchbox PostgreSQL database.
Classes:
- ResolutionFrom – Resolution lineage closure table with cached truth values.
- Resolutions – Table of resolution points: models, datasets and humans.
- SourceColumns – Table for storing column details for Sources.
- ClusterSourcePK – Table for storing source primary keys for clusters.
- Sources – Table of sources of data for Matchbox.
- Contains – Cluster lineage table.
- Clusters – Table of indexed data and the clusters that match it.
- Probabilities – Table of probabilities that a cluster is correct, according to a resolution.
ResolutionFrom¶
Bases: CountMixin, MatchboxBase
Resolution lineage closure table with cached truth values.
Methods:
- count – Counts the number of rows in the table.
Attributes:
- __tablename__
- parent
- child
- level
- truth_cache
- __table_args__
parent class-attribute instance-attribute¶
parent = Column(
    BIGINT,
    ForeignKey(
        "resolutions.resolution_id", ondelete="CASCADE"
    ),
    primary_key=True,
)
child class-attribute instance-attribute¶
child = Column(
    BIGINT,
    ForeignKey(
        "resolutions.resolution_id", ondelete="CASCADE"
    ),
    primary_key=True,
)
__table_args__ class-attribute instance-attribute¶
__table_args__ = (
    CheckConstraint(
        "parent != child", name="no_self_reference"
    ),
    CheckConstraint("level > 0", name="positive_level"),
)
Resolutions¶
Bases: CountMixin, MatchboxBase
Table of resolution points: models, datasets and humans.
Resolutions produce probabilities or own data in the clusters table.
Methods:
- get_lineage – Returns all ancestors and their cached truth values from this model.
- get_lineage_to_dataset – Returns the resolution lineage and cached truth values to a dataset.
- next_id – Returns the next available resolution_id.
- count – Counts the number of rows in the table.
Attributes:
- __tablename__
- resolution_id
- resolution_hash
- type
- name
- description
- truth
- source
- probabilities
- children
- __table_args__
- ancestors (set[Resolutions]) – Returns all ancestors (parents, grandparents, etc.) of this resolution.
- descendants (set[Resolutions]) – Returns descendants (children, grandchildren, etc.) of this resolution.
resolution_hash class-attribute instance-attribute¶
source class-attribute instance-attribute¶
probabilities class-attribute instance-attribute¶
probabilities = relationship(
    "Probabilities",
    back_populates="proposed_by",
    cascade="all, delete-orphan",
    passive_deletes=True,
)
children class-attribute instance-attribute¶
children = relationship(
    "Resolutions",
    secondary=__table__,
    primaryjoin="Resolutions.resolution_id == ResolutionFrom.parent",
    secondaryjoin="Resolutions.resolution_id == ResolutionFrom.child",
    backref="parents",
)
__table_args__ class-attribute instance-attribute¶
__table_args__ = (
    CheckConstraint(
        "type IN ('model', 'dataset', 'human')",
        name="resolution_type_constraints",
    ),
    UniqueConstraint(
        "resolution_hash", name="resolutions_hash_key"
    ),
    UniqueConstraint("name", name="resolutions_name_key"),
)
ancestors property¶
ancestors: set[Resolutions]
Returns all ancestors (parents, grandparents, etc.) of this resolution.
descendants property¶
descendants: set[Resolutions]
Returns descendants (children, grandchildren, etc.) of this resolution.
get_lineage¶
Returns all ancestors and their cached truth values from this model.
get_lineage_to_dataset¶
Returns the resolution lineage and cached truth values to a dataset.
SourceColumns¶
Bases: CountMixin, MatchboxBase
Table for storing column details for Sources.
Methods:
- count – Counts the number of rows in the table.
Attributes:
- __tablename__
- column_id
- source_id
- column_index
- column_name
- column_type
- source
- __table_args__
ClusterSourcePK¶
Bases: CountMixin, MatchboxBase
Table for storing source primary keys for clusters.
Attributes:
- __tablename__
- pk_id
- cluster_id
- source_id
- source_pk
- cluster
- source
- __table_args__
cluster_id class-attribute instance-attribute¶
cluster_id = Column(
    BIGINT,
    ForeignKey("clusters.cluster_id", ondelete="CASCADE"),
    nullable=False,
)
source_id class-attribute instance-attribute¶
cluster class-attribute instance-attribute¶
source class-attribute instance-attribute¶
__table_args__ class-attribute instance-attribute¶
__table_args__ = (
    Index("ix_cluster_source_pks_cluster_id", "cluster_id"),
    Index("ix_cluster_source_pks_source_pk", "source_pk"),
    UniqueConstraint(
        "pk_id", "source_id", name="unique_pk_source"
    ),
)
Sources¶
Bases: CountMixin, MatchboxBase
Table of sources of data for Matchbox.
Methods:
- list_all – Returns all sources in the database.
- to_common_source – Convert an ORM source to a matchbox.common Source object.
- count – Counts the number of rows in the table.
Attributes:
- __tablename__
- source_id
- resolution_id
- resolution_name
- full_name
- warehouse_hash
- db_pk
- dataset_resolution
- columns
- cluster_source_pks
- clusters
- __table_args__
source_id class-attribute instance-attribute¶
resolution_id class-attribute instance-attribute¶
dataset_resolution class-attribute instance-attribute¶
columns class-attribute instance-attribute¶
columns = relationship(
    "SourceColumns",
    back_populates="source",
    cascade="all, delete-orphan",
    passive_deletes=True,
)
cluster_source_pks class-attribute instance-attribute¶
cluster_source_pks = relationship(
    "ClusterSourcePK",
    back_populates="source",
    cascade="all, delete-orphan",
    passive_deletes=True,
)
clusters class-attribute instance-attribute¶
clusters = relationship(
    "Clusters",
    secondary=__table__,
    primaryjoin="Sources.source_id == ClusterSourcePK.source_id",
    secondaryjoin="ClusterSourcePK.cluster_id == Clusters.cluster_id",
    viewonly=True,
)
__table_args__ class-attribute instance-attribute¶
__table_args__ = (
    UniqueConstraint(
        "full_name",
        "warehouse_hash",
        name="unique_source_address",
    ),
)
to_common_source¶
Convert an ORM source to a matchbox.common Source object.
Contains
¶
Bases: CountMixin
, MatchboxBase
Cluster lineage table.
Methods:
-
count
–Counts the number of rows in the table.
Attributes:
-
__tablename__
– -
parent
– -
child
– -
__table_args__
–
parent
class-attribute
instance-attribute
¶
child
class-attribute
instance-attribute
¶
__table_args__
class-attribute
instance-attribute
¶
__table_args__ = (
CheckConstraint(
"parent != child", name="no_self_containment"
),
Index("ix_contains_parent_child", "parent", "child"),
Index("ix_contains_child_parent", "child", "parent"),
)
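Contains stores the cluster hierarchy as parent → child edges, so resolving a full lineage requires walking the edges recursively (the backend does this with a recursive SQL query). A toy dict-based sketch of that walk, including a check mirroring the no_self_containment constraint:

```python
# Toy illustration of resolving cluster lineage from an adjacency list
# like the Contains table (parent -> child edges). The real backend uses
# a recursive query; this in-memory walk is only a sketch.

def descendants(edges: dict[int, set[int]], parent: int) -> set[int]:
    """Collect every cluster reachable from `parent` via Contains edges."""
    found: set[int] = set()
    stack = [parent]
    while stack:
        node = stack.pop()
        for child in edges.get(node, set()):
            if child == node:
                # Mirrors the no_self_containment CHECK constraint.
                raise ValueError("no_self_containment violated")
            if child not in found:
                found.add(child)
                stack.append(child)
    return found

# Cluster 1 contains clusters 2 and 3; cluster 2 contains leaves 4 and 5.
edges = {1: {2, 3}, 2: {4, 5}}
assert descendants(edges, 1) == {2, 3, 4, 5}
```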
Clusters
¶
Bases: CountMixin, MatchboxBase
Table of indexed data and clusters that match it.
Methods:
- count – Counts the number of rows in the table.
Attributes:
- __tablename__
- cluster_id
- cluster_hash
- source_pks
- probabilities
- children
- sources
- __table_args__
source_pks
class-attribute
instance-attribute
¶
source_pks = relationship(
"ClusterSourcePK",
back_populates="cluster",
cascade="all, delete-orphan",
passive_deletes=True,
)
probabilities
class-attribute
instance-attribute
¶
probabilities = relationship(
"Probabilities",
back_populates="proposes",
cascade="all, delete-orphan",
passive_deletes=True,
)
children
class-attribute
instance-attribute
¶
children = relationship(
"Clusters",
secondary=__table__,
primaryjoin="Clusters.cluster_id == Contains.parent",
secondaryjoin="Clusters.cluster_id == Contains.child",
backref="parents",
)
sources
class-attribute
instance-attribute
¶
sources = relationship(
"Sources",
secondary=__table__,
primaryjoin="Clusters.cluster_id == ClusterSourcePK.cluster_id",
secondaryjoin="ClusterSourcePK.source_id == Sources.source_id",
viewonly=True,
)
__table_args__
class-attribute
instance-attribute
¶
Probabilities
¶
Bases: CountMixin, MatchboxBase
Table of probabilities that a cluster is correct, according to a resolution.
Methods:
- count – Counts the number of rows in the table.
Attributes:
- __tablename__
- resolution
- cluster
- probability
- proposed_by
- proposes
- __table_args__
resolution
class-attribute
instance-attribute
¶
resolution = Column(
BIGINT,
ForeignKey(
"resolutions.resolution_id", ondelete="CASCADE"
),
primary_key=True,
)
cluster
class-attribute
instance-attribute
¶
cluster = Column(
BIGINT,
ForeignKey("clusters.cluster_id", ondelete="CASCADE"),
primary_key=True,
)
proposed_by
class-attribute
instance-attribute
¶
proposes
class-attribute
instance-attribute
¶
__table_args__
class-attribute
instance-attribute
¶
__table_args__ = (
CheckConstraint(
"probability BETWEEN 0 AND 100",
name="valid_probability",
),
Index("ix_probabilities_resolution", "resolution"),
)
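Probabilities are smallints in [0, 100], enforced by the valid_probability CHECK constraint, and a resolution accepts a cluster when its probability meets the resolution's truth threshold. A toy filter illustrating that acceptance rule (illustrative only, not the backend's query):

```python
# Probabilities are stored as smallints in [0, 100] (the valid_probability
# CHECK constraint). A resolution accepts a cluster when its probability
# meets the resolution's truth threshold. Toy filter only.

def accepted_clusters(probabilities: dict[int, int], truth: int) -> set[int]:
    for p in probabilities.values():
        if not 0 <= p <= 100:
            # Mirrors the valid_probability CHECK constraint.
            raise ValueError("valid_probability violated")
    return {cluster for cluster, p in probabilities.items() if p >= truth}

# Clusters 10 and 11 meet a truth threshold of 80; cluster 12 does not.
assert accepted_clusters({10: 95, 11: 80, 12: 60}, truth=80) == {10, 11}
```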
utils
¶
Utilities for using the PostgreSQL backend.
Modules:
- db – General utilities for the PostgreSQL backend.
- insert – Utilities for inserting data into the PostgreSQL backend.
- query – Utilities for querying and matching in the PostgreSQL backend.
- results – Utilities for querying model results from the PostgreSQL backend.
db
¶
General utilities for the PostgreSQL backend.
Functions:
- resolve_model_name – Resolves a model name to a Resolution object.
- get_resolution_graph – Retrieves the resolution graph.
- dump – Dumps the entire database to a snapshot.
- restore – Restores the database from a snapshot.
- sqa_profiled – SQLAlchemy profiler.
- compile_sql – Compiles a SQLAlchemy statement into a string.
- large_ingest – Append a PyArrow table to a PostgreSQL table using ADBC.
resolve_model_name
¶
resolve_model_name(
model: str, engine: Engine
) -> Resolutions
Resolves a model name to a Resolution object.
Parameters:
- model (str)
- engine (Engine)
Raises:
- MatchboxResolutionNotFoundError – If the model doesn’t exist.
get_resolution_graph
¶
get_resolution_graph(engine: Engine) -> ResolutionGraph
Retrieves the resolution graph.
dump
¶
dump(engine: Engine) -> MatchboxSnapshot
Dumps the entire database to a snapshot.
Parameters:
- engine (Engine) – The database engine.
Returns:
- MatchboxSnapshot – A MatchboxSnapshot object of type “postgres” with the database’s current state.
restore
¶
restore(
engine: Engine,
snapshot: MatchboxSnapshot,
batch_size: int,
) -> None
Restores the database from a snapshot.
Parameters:
- engine (Engine) – The database engine.
- snapshot (MatchboxSnapshot) – A MatchboxSnapshot object of type “postgres” with the database’s state.
- batch_size (int) – The number of records to insert in each batch.
Raises:
- ValueError – If the snapshot is missing data.
sqa_profiled
¶
SQLAlchemy profiler.
Taken directly from their docs: https://docs.sqlalchemy.org/en/20/faq/performance.html#query-profiling
compile_sql
¶
Compiles a SQLAlchemy statement into a string.
large_ingest
¶
large_ingest(
data: Table,
table_class: DeclarativeMeta,
max_chunksize: int | None = None,
upsert_keys: list[str] | None = None,
update_columns: list[str] | None = None,
)
Append a PyArrow table to a PostgreSQL table using ADBC.
It will either copy directly (and error if primary key constraints are violated), or it can be run in upsert mode by using a staging table, which is slower.
Parameters:
- data (Table) – A PyArrow table to write.
- table_class (DeclarativeMeta) – The SQLAlchemy ORM class for the table to write to.
- max_chunksize (int | None, default: None) – Size of data chunks to be read and copied.
- upsert_keys (list[str] | None, default: None) – Columns used as keys for “on conflict do update”. If passed, it will run ingest in slower upsert mode. If not passed and update_columns is passed, defaults to primary keys.
- update_columns (list[str] | None, default: None) – Columns to update when upserting. If passed, it will run ingest in slower upsert mode. If not passed and upsert_keys is passed, defaults to all other columns.
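The copy-versus-upsert split can be illustrated with an in-memory toy (a hypothetical helper, not the real ADBC path): a plain copy errors on a primary-key conflict, while upsert mode updates the chosen columns for conflicting keys.

```python
# Toy sketch of large_ingest's two modes. `table` stands in for a database
# table keyed by primary key "id"; the real function streams a PyArrow
# table over ADBC and uses a staging table for upserts.

def toy_ingest(table, rows, upsert_keys=None, update_columns=None):
    upsert = upsert_keys is not None or update_columns is not None
    for row in rows:
        key = row["id"]
        if key in table and not upsert:
            # Plain copy mode: conflicts are errors.
            raise ValueError(f"primary key conflict: {key}")
        if key in table and upsert:
            # Upsert mode: update only the chosen columns (all non-key
            # columns if update_columns is not given).
            cols = update_columns or [c for c in row if c != "id"]
            table[key].update({c: row[c] for c in cols})
        else:
            table[key] = dict(row)

table = {1: {"id": 1, "name": "a"}}
toy_ingest(table, [{"id": 1, "name": "b"}], upsert_keys=["id"])
assert table[1]["name"] == "b"
```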
insert
¶
Utilities for inserting data into the PostgreSQL backend.
Classes:
- HashIDMap – An object to help map between IDs and hashes.
Functions:
- insert_dataset – Indexes a dataset from your data warehouse within Matchbox.
- insert_model – Writes a model to Matchbox with a default truth value of 100.
- insert_results – Writes a results table to Matchbox.
HashIDMap
¶
An object to help map between IDs and hashes.
When given a set of IDs, it returns their hashes, erroring if any ID has no hash.
When given a set of hashes, it returns their IDs, creating new IDs for any unknown hashes and returning them as part of the set.
Parameters:
- start (int) – The first integer to use for new IDs.
- lookup (optional, default: None) – A lookup table to use for existing hashes.
Methods:
- get_hashes – Returns the hashes of the given IDs.
- get_ids – Returns the IDs of the given hashes, assigning new IDs for unknown hashes.
insert_dataset
¶
Indexes a dataset from your data warehouse within Matchbox.
insert_model
¶
insert_model(
model: str,
left: Resolutions,
right: Resolutions,
description: str,
engine: Engine,
) -> None
Writes a model to Matchbox with a default truth value of 100.
Parameters:
- model (str) – Name of the new model.
- left (Resolutions) – Left parent of the model.
- right (Resolutions) – Right parent of the model. Same as left in a dedupe job.
- description (str) – Model description.
- engine (Engine) – SQLAlchemy engine instance.
Raises:
- MatchboxResolutionNotFoundError – If the specified parent models don’t exist.
- MatchboxResolutionNotFoundError – If the specified model doesn’t exist.
insert_results
¶
insert_results(
resolution: Resolutions,
engine: Engine,
results: Table,
batch_size: int,
) -> None
Writes a results table to Matchbox.
The PostgreSQL backend stores clusters in a hierarchical structure, where each component references its parent component at a higher threshold.
This means two-item components are synonymous with their original pairwise probabilities.
This allows easy querying of clusters at any threshold.
Parameters:
- resolution (Resolutions) – Resolution of type model to associate results with.
- engine (Engine) – SQLAlchemy engine instance.
- results (Table) – A PyArrow results table with left_id, right_id, probability.
- batch_size (int) – Number of records to insert in each batch.
Raises:
- MatchboxResolutionNotFoundError – If the specified model doesn’t exist.
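The hierarchical storage described above is what makes querying at any threshold cheap: each cluster carries the probability at which it forms and points to its parent at a higher threshold, so a query keeps the top-most cluster that still meets the threshold. A toy sketch (plain dicts, not the backend's tables):

```python
# Toy sketch of threshold queries over hierarchical clusters: `probability`
# maps each cluster to the probability at which it forms, and `parent_of`
# maps a cluster to the cluster that merges it at a lower threshold.

def clusters_at(probability: dict[int, int],
                parent_of: dict[int, int],
                threshold: int) -> set[int]:
    valid = {c for c, p in probability.items() if p >= threshold}
    # Keep only clusters with no valid parent above them.
    return {c for c in valid if parent_of.get(c) not in valid}

# Cluster 3 merges clusters 1 and 2 at probability 70.
probability = {1: 90, 2: 85, 3: 70}
parent_of = {1: 3, 2: 3}
assert clusters_at(probability, parent_of, threshold=80) == {1, 2}
assert clusters_at(probability, parent_of, threshold=60) == {3}
```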
query
¶
Utilities for querying and matching in the PostgreSQL backend.
Functions:
- query – Queries Matchbox and the Source warehouse to retrieve linked data.
- match – Matches an ID in a source dataset and returns the keys in the targets.
Attributes:
- T
query
¶
query(
source_address: SourceAddress,
resolution_name: str | None = None,
threshold: int | None = None,
limit: int | None = None,
) -> Table
Queries Matchbox and the Source warehouse to retrieve linked data.
Takes the dictionaries of tables and fields output by selectors and queries the database for them. If a “point of truth” resolution is supplied, attaches the clusters this data belongs to.
To accomplish this, the function:
- Iterates through each selector, and
    - Retrieves its data in Matchbox according to the optional point of truth, including its hash and cluster hash
    - Retrieves its raw data from its Source’s warehouse
    - Joins the two together
- Unions the results, one row per item of data in the warehouses
Returns:
- Table – A table containing the requested data from each table, unioned together, with the hash key of each row in Matchbox.
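The join-then-union shape of those steps can be sketched with plain dicts standing in for warehouse tables and the Matchbox lookup (a toy, not the real implementation, which returns a PyArrow Table):

```python
# Toy version of query's steps: per selector, attach each row's cluster ID
# from Matchbox, then union the annotated rows across sources.

def toy_query(warehouse, cluster_of):
    """warehouse: source -> {pk -> fields}; cluster_of: (source, pk) -> ID."""
    out = []
    for source, rows in warehouse.items():
        for pk, fields in rows.items():
            # Join the warehouse row to its Matchbox cluster, then union.
            out.append({"source": source, "id": cluster_of[(source, pk)], **fields})
    return out

warehouse = {"crm": {"a1": {"name": "Acme"}}, "erp": {"b9": {"name": "ACME Ltd"}}}
cluster_of = {("crm", "a1"): 7, ("erp", "b9"): 7}
rows = toy_query(warehouse, cluster_of)
assert len(rows) == 2 and all(r["id"] == 7 for r in rows)
```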
match
¶
match(
engine: Engine,
source_pk: str,
source: SourceAddress,
targets: list[SourceAddress],
resolution_name: str,
threshold: int | None = None,
) -> list[Match]
Matches an ID in a source dataset and returns the keys in the targets.
To accomplish this, the function:
- Reconstructs the resolution lineage from the specified resolution
- Iterates through each target, and
    - Retrieves its cluster hash according to the resolution
    - Retrieves all other IDs in the cluster in the source dataset
    - Retrieves all other IDs in the cluster in the target dataset
- Returns the results as Match objects, one per target
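The core of those steps is a cluster membership lookup, which can be sketched with dicts (a toy standing in for the resolution lineage and Match objects):

```python
# Toy version of match: find the cluster containing the source primary key,
# then collect the keys that cluster has in each target dataset.

def toy_match(source_pk, source, targets, membership):
    """membership: (dataset, pk) -> cluster ID."""
    cluster = membership[(source, source_pk)]
    return {
        target: {pk for (ds, pk), c in membership.items()
                 if ds == target and c == cluster}
        for target in targets
    }

# "a1" in crm and "b9" in erp share cluster 7; "b2" sits in cluster 8.
membership = {("crm", "a1"): 7, ("erp", "b9"): 7, ("erp", "b2"): 8}
assert toy_match("a1", "crm", ["erp"], membership) == {"erp": {"b9"}}
```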
results
¶
Utilities for querying model results from the PostgreSQL backend.
Classes:
- SourceInfo – Information about a model’s sources.
Functions:
- get_model_metadata – Get metadata for a model resolution.
- get_model_results – Recover the model’s pairwise probabilities and return as a PyArrow table.
SourceInfo
¶
Bases: NamedTuple
Information about a model’s sources.
Attributes:
- left (int)
- right (int | None)
- left_ancestors (set[int])
- right_ancestors (set[int] | None)
get_model_metadata
¶
get_model_metadata(
engine: Engine, resolution: Resolutions
) -> ModelMetadata
Get metadata for a model resolution.
get_model_results
¶
get_model_results(resolution: Resolutions) -> Table
Recover the model’s pairwise probabilities and return as a PyArrow table.
For each probability this model assigned:
- Get its two immediate children
- Filter for children that aren’t parents of other clusters this model scored
- Determine left/right by tracing ancestry to source resolutions using query helpers
Parameters:
- resolution (Resolutions) – Resolution of type model to query.
Returns:
- Table – Table containing the original pairwise probabilities.