PostgreSQL
A backend adapter for deploying Matchbox using PostgreSQL.
The schema contains two graph-like trees.
- In the resolution subgraph, the tree is implemented as a closure table, which enables quick querying of root-to-leaf paths at the cost of redundancy.
- In the data subgraph, the tree is implemented as an adjacency list, which stores less data but requires recursive queries to resolve.
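The trade-off between the two tree layouts can be illustrated with a small, self-contained sketch. It uses SQLite from the Python standard library as a stand-in for PostgreSQL, and the table names `edges` and `closure` are hypothetical, not the names used by Matchbox.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    -- Adjacency list: one row per direct parent -> child edge.
    CREATE TABLE edges (parent INTEGER, child INTEGER);
    INSERT INTO edges VALUES (1, 2), (2, 3), (2, 4);

    -- Closure table: one row per ancestor -> descendant path, with its depth.
    CREATE TABLE closure (parent INTEGER, child INTEGER, level INTEGER);
    INSERT INTO closure VALUES
        (1, 2, 1), (2, 3, 1), (2, 4, 1),
        (1, 3, 2), (1, 4, 2);
    """
)


def descendants_adjacency(root: int) -> list:
    """Walk the adjacency list with a recursive CTE."""
    rows = conn.execute(
        """
        WITH RECURSIVE walk(node) AS (
            SELECT child FROM edges WHERE parent = ?
            UNION ALL
            SELECT e.child FROM edges e JOIN walk w ON e.parent = w.node
        )
        SELECT node FROM walk ORDER BY node
        """,
        (root,),
    ).fetchall()
    return [row[0] for row in rows]


def descendants_closure(root: int) -> list:
    """Read every descendant with one non-recursive lookup."""
    rows = conn.execute(
        "SELECT child FROM closure WHERE parent = ? ORDER BY child", (root,)
    ).fetchall()
    return [row[0] for row in rows]


edge_rows = conn.execute("SELECT COUNT(*) FROM edges").fetchall()[0][0]
closure_rows = conn.execute("SELECT COUNT(*) FROM closure").fetchall()[0][0]
```

Both functions return the same descendants, but the closure table stores five rows for a tree that the adjacency list describes in three. That is the redundancy the first bullet refers to.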
erDiagram
SourceConfigs {
bigint source_config_id PK
bigint resolution_id FK
string location_type
string location_uri
string extract_transform
}
SourceFields {
bigint field_id PK
bigint source_config_id FK
int index
string name
string type
bool is_key
}
Clusters {
bigint cluster_id PK
bytes cluster_hash
}
ClusterSourceKey {
bigint key_id PK
bigint cluster_id FK
bigint source_config_id FK
string key
}
Contains {
bigint parent PK,FK
bigint child PK,FK
}
PKSpace {
bigint id
bigint next_cluster_id
bigint next_cluster_keys_id
}
Probabilities {
bigint resolution PK,FK
bigint cluster PK,FK
smallint probability
}
Resolutions {
bigint resolution_id PK
string name
string description
string type
bytes hash
smallint truth
}
ResolutionFrom {
bigint parent PK,FK
bigint child PK,FK
int level
smallint truth_cache
}
SourceConfigs |o--|| Resolutions : ""
SourceConfigs ||--o{ SourceFields : ""
SourceConfigs ||--o{ ClusterSourceKey : ""
Clusters ||--o{ ClusterSourceKey : ""
Clusters ||--o{ Probabilities : ""
Clusters ||--o{ Contains : "parent"
Contains }o--|| Clusters : "child"
Resolutions ||--o{ Probabilities : ""
Resolutions ||--o{ ResolutionFrom : "parent"
ResolutionFrom }o--|| Resolutions : "child"
matchbox.server.postgresql
PostgreSQL adapter for Matchbox server.
Modules:
- adapter – PostgreSQL adapter for Matchbox server.
- db – Matchbox PostgreSQL database connection.
- mixin – A module for defining mixins for the PostgreSQL backend ORM.
- orm – ORM classes for the Matchbox PostgreSQL database.
- utils – Utilities for using the PostgreSQL backend.
Classes:
- MatchboxPostgres – A PostgreSQL adapter for Matchbox.
- MatchboxPostgresSettings – Settings for the Matchbox PostgreSQL backend.
MatchboxPostgres
MatchboxPostgres(settings: MatchboxPostgresSettings)
Bases: MatchboxDBAdapter
A PostgreSQL adapter for Matchbox.
Methods:
- query – Queries the database from an optional point of truth.
- match – Matches an ID in a source resolution and returns the keys in the targets.
- index – Indexes a source in your warehouse to Matchbox.
- get_source_config – Get a source configuration from its resolution name.
- get_resolution_source_configs – Get a list of source configurations queriable from a resolution.
- validate_ids – Validates a list of IDs exist in the database.
- validate_hashes – Validates a list of hashes exist in the database.
- cluster_id_to_hash – Get a lookup of Cluster hashes from a list of IDs.
- get_resolution_graph – Get the full resolution graph.
- dump – Dumps the entire database to a snapshot.
- drop – Hard clear the database by dropping all tables and re-creating.
- clear – Soft clear the database by deleting all rows but retaining tables.
- restore – Restores the database from a snapshot.
- insert_model – Writes a model to Matchbox.
- get_model – Get a model from the database.
- set_model_results – Set the results for a model.
- get_model_results – Get the results for a model.
- set_model_truth – Sets the truth threshold for this model, changing the default clusters.
- get_model_truth – Gets the current truth threshold for this model.
- get_model_ancestors – Gets the current truth values of all ancestors.
- set_model_ancestors_cache – Updates the cached ancestor thresholds.
- get_model_ancestors_cache – Gets the cached ancestor thresholds.
- delete_resolution – Delete a resolution from the database.
Attributes:
models
instance-attribute
models = FilteredResolutions(
    sources=False, humans=False, models=True
)
source_resolutions
instance-attribute
source_resolutions = FilteredResolutions(
    sources=True, humans=False, models=False
)
query
query(
    source: SourceResolutionName,
    resolution: ResolutionName | None = None,
    threshold: int | None = None,
    limit: int | None = None,
) -> Table
Queries the database from an optional point of truth.
Parameters:
- source (SourceResolutionName) – The SourceResolutionName string identifying the source to query.
- resolution (optional, default: None) – The resolution to use for filtering results. If not specified, the source resolution for the queried source is used.
- threshold (optional, default: None) – The threshold to use for creating clusters. If None, uses the models’ default threshold. If an integer, uses that threshold for the specified model, and the model’s cached thresholds for its ancestors.
- limit (optional, default: None) – The number to use in a LIMIT clause. Useful for testing.
Returns:
- Table – The resulting Matchbox IDs in Arrow format.
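The threshold rule described for query can be read as a small lookup. The sketch below is one interpretation of that description, not the adapter's implementation: the helper name `effective_thresholds`, its arguments, and the resolution names in the usage note are all hypothetical.

```python
from typing import Optional


def effective_thresholds(
    model: str,
    default_truth: int,
    ancestors_cache: dict,
    threshold: Optional[int] = None,
) -> dict:
    """Return the threshold applied to a model and to each of its ancestors.

    Ancestors always use their cached thresholds. The queried model uses
    the override when one is given, otherwise its own default truth.
    """
    thresholds = dict(ancestors_cache)  # copy so the cache is not mutated
    thresholds[model] = default_truth if threshold is None else threshold
    return thresholds


cache = {"dedupe_a": 90, "dedupe_b": 85}
defaults = effective_thresholds("linker", 80, cache)
overridden = effective_thresholds("linker", 80, cache, threshold=70)
```

Passing threshold=70 changes only the entry for the queried model. The ancestors keep their cached values in both calls.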
match
match(
    key: str,
    source: SourceResolutionName,
    targets: list[SourceResolutionName],
    resolution: ResolutionName,
    threshold: int | None = None,
) -> list[Match]
Matches an ID in a source resolution and returns the keys in the targets.
Parameters:
- key (str) – The key to match from the source.
- source (SourceResolutionName) – The name of the source resolution.
- targets (list[SourceResolutionName]) – The names of the target source resolutions.
- resolution (ResolutionName) – The name of the resolution to use for matching.
- threshold (optional, default: None) – The threshold to use for creating clusters. If None, uses the resolutions’ default threshold. If an integer, uses that threshold for the specified resolution, and the resolution’s cached thresholds for its ancestors. Will use these threshold values instead of the cached thresholds.
index
index(
    source_config: SourceConfig, data_hashes: Table
) -> None
Indexes a source in your warehouse to Matchbox.
Parameters:
- source_config (SourceConfig) – The source configuration to index.
- data_hashes (Table) – The Arrow table with the hash of each data row.
get_source_config
get_source_config(
    name: SourceResolutionName,
) -> SourceConfig
Get a source configuration from its resolution name.
Parameters:
- name (SourceResolutionName) – The resolution name for the source.
Returns:
- SourceConfig – A SourceConfig object.
get_resolution_source_configs
get_resolution_source_configs(
    name: ModelResolutionName,
) -> list[SourceConfig]
Get a list of source configurations queriable from a resolution.
Parameters:
- name (ResolutionName) – Name of the resolution to query.
Returns:
- list[SourceConfig] – List of relevant SourceConfig objects.
validate_ids
validate_hashes
cluster_id_to_hash
dump
dump() -> MatchboxSnapshot
Dumps the entire database to a snapshot.
Returns:
- MatchboxSnapshot – A MatchboxSnapshot object of type “postgres” with the database’s current state.
drop
clear
restore
restore(snapshot: MatchboxSnapshot) -> None
Restores the database from a snapshot.
Parameters:
- snapshot (MatchboxSnapshot) – A MatchboxSnapshot object of type “postgres” with the database’s state.
Raises:
- TypeError – If the snapshot is not compatible with PostgreSQL.
insert_model
insert_model(model_config: ModelConfig) -> None
Writes a model to Matchbox.
Parameters:
- model_config (ModelConfig) – ModelConfig object with the model’s metadata.
Raises:
- MatchboxDataNotFound – If, for a linker, the source models weren’t found in the database.
- MatchboxModelConfigError – If the model configuration is invalid, such as the resolutions sharing ancestors.
set_model_results
set_model_results(
    name: ModelResolutionName, results: Table
) -> None
Set the results for a model.
get_model_results
get_model_results(name: ModelResolutionName) -> Table
Get the results for a model.
set_model_truth
set_model_truth(
    name: ModelResolutionName, truth: int
) -> None
Sets the truth threshold for this model, changing the default clusters.
get_model_truth
get_model_truth(name: ModelResolutionName) -> int
Gets the current truth threshold for this model.
get_model_ancestors
get_model_ancestors(
    name: ModelResolutionName,
) -> list[ModelAncestor]
Gets the current truth values of all ancestors.
Returns a list of ModelAncestor objects mapping model resolution names to their current truth thresholds.
Unlike get_model_ancestors_cache, which returns cached values, this method returns the current truth values of all ancestor models.
set_model_ancestors_cache
set_model_ancestors_cache(
    name: ModelResolutionName,
    ancestors_cache: list[ModelAncestor],
) -> None
Updates the cached ancestor thresholds.
Parameters:
- name (ModelResolutionName) – The name of the model to update.
- ancestors_cache (list[ModelAncestor]) – List of ModelAncestor objects mapping model resolution names to their truth thresholds.
get_model_ancestors_cache
get_model_ancestors_cache(
    name: ModelResolutionName,
) -> list[ModelAncestor]
Gets the cached ancestor thresholds.
Returns a list of ModelAncestor objects mapping model resolution names to their cached truth thresholds.
This is required because each point of truth needs to be stable, so we choose when to update it, caching the ancestors’ values in the model itself.
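To see why the cached and current ancestor thresholds are kept apart, the following sketch compares the two lists and reports which ancestors have drifted since the cache was written. The `Ancestor` dataclass is a hypothetical stand-in for ModelAncestor (assumed here to carry a name and a truth value), and `stale_ancestors` is not part of the adapter.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Ancestor:
    """Hypothetical stand-in for ModelAncestor: a name and a truth threshold."""

    name: str
    truth: int


def stale_ancestors(current, cached):
    """Return the names of ancestors whose current truth differs from the cache."""
    cached_by_name = {ancestor.name: ancestor.truth for ancestor in cached}
    return sorted(
        ancestor.name
        for ancestor in current
        if cached_by_name.get(ancestor.name) != ancestor.truth
    )


current = [Ancestor("dedupe_a", 90), Ancestor("dedupe_b", 80)]
cached = [Ancestor("dedupe_a", 90), Ancestor("dedupe_b", 75)]
drifted = stale_ancestors(current, cached)
```

Here the second ancestor's truth moved from 75 to 80 after the cache was written. A model that relies on the cache keeps producing the same clusters until the cache is deliberately updated.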
delete_resolution
delete_resolution(
    name: ResolutionName, certain: bool = False
) -> None
Delete a resolution from the database.
Parameters:
- name (ResolutionName) – The name of the resolution to delete.
- certain (bool) – Whether to delete the model without confirmation.
MatchboxPostgresSettings
Bases: MatchboxServerSettings
Settings for the Matchbox PostgreSQL backend.
Inherits the core settings and adds the PostgreSQL-specific settings.
Attributes:
- model_config
- batch_size (int)
- datastore (MatchboxDatastoreSettings)
- api_key (SecretStr | None)
- log_level (LogLevelType)
- backend_type (MatchboxBackends)
- postgres (MatchboxPostgresCoreSettings)
model_config
class-attribute instance-attribute
model_config = SettingsConfigDict(
    env_prefix="MB__SERVER__",
    env_nested_delimiter="__",
    use_enum_values=True,
    env_file=".env",
    env_file_encoding="utf-8",
    extra="ignore",
)
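The env_prefix and env_nested_delimiter values above determine how environment variable names map onto settings fields. The sketch below emulates only that naming convention with the standard library. It is not pydantic-settings itself, it leaves every value as a string, and the nested field name `host` is an assumption, since the fields of MatchboxPostgresCoreSettings are not listed here.

```python
def parse_env(env, prefix="MB__SERVER__", delimiter="__"):
    """Turn prefixed environment variables into a nested settings dict."""
    settings = {}
    for name, value in env.items():
        if not name.upper().startswith(prefix):
            continue  # not a Matchbox server setting
        # Strip the prefix, then split the remainder on the nested delimiter.
        path = name[len(prefix):].lower().split(delimiter)
        node = settings
        for part in path[:-1]:
            node = node.setdefault(part, {})
        node[path[-1]] = value
    return settings


example_env = {
    "MB__SERVER__BATCH_SIZE": "250000",
    "MB__SERVER__POSTGRES__HOST": "localhost",  # "host" is a hypothetical field
    "PATH": "/usr/bin",
}
parsed = parse_env(example_env)
```

A single underscore, as in BATCH_SIZE, stays part of the field name. Only the double underscore separates nesting levels.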
postgres
class-attribute instance-attribute
postgres: MatchboxPostgresCoreSettings = Field(
    default_factory=MatchboxPostgresCoreSettings
)
adapter
PostgreSQL adapter for Matchbox server.
Classes:
- FilteredClusters – Wrapper class for filtered cluster queries.
- FilteredProbabilities – Wrapper class for filtered probability queries.
- FilteredResolutions – Wrapper class for filtered resolution queries.
- MatchboxPostgres – A PostgreSQL adapter for Matchbox.
Attributes:
FilteredClusters
Bases: BaseModel
Wrapper class for filtered cluster queries.
Methods:
- count – Counts the number of clusters in the database.
Attributes:
- has_source (bool | None)
FilteredProbabilities
Bases: BaseModel
Wrapper class for filtered probability queries.
Methods:
- count – Counts the number of probabilities in the database.
Attributes:
- over_truth (bool)
FilteredResolutions
Bases: BaseModel
Wrapper class for filtered resolution queries.
Methods:
- count – Counts the number of resolutions in the database.
Attributes:
MatchboxPostgres
MatchboxPostgres(settings: MatchboxPostgresSettings)
Bases: MatchboxDBAdapter
A PostgreSQL adapter for Matchbox. This is the same class documented under matchbox.server.postgresql; see that entry for its methods and attributes.
db
Matchbox PostgreSQL database connection.
Classes:
- MatchboxPostgresCoreSettings – PostgreSQL-specific settings for Matchbox.
- MatchboxPostgresSettings – Settings for the Matchbox PostgreSQL backend.
- MatchboxDatabase – Matchbox PostgreSQL database connection.
Attributes:
- MBDB
MatchboxPostgresCoreSettings
MatchboxPostgresSettings
Bases: MatchboxServerSettings
Settings for the Matchbox PostgreSQL backend.
Inherits the core settings and adds the PostgreSQL-specific settings.
Attributes:
- backend_type (MatchboxBackends)
- postgres (MatchboxPostgresCoreSettings)
- model_config
- batch_size (int)
- datastore (MatchboxDatastoreSettings)
- api_key (SecretStr | None)
- log_level (LogLevelType)
postgres
class-attribute instance-attribute
postgres: MatchboxPostgresCoreSettings = Field(
    default_factory=MatchboxPostgresCoreSettings
)
model_config
class-attribute instance-attribute
model_config = SettingsConfigDict(
    env_prefix="MB__SERVER__",
    env_nested_delimiter="__",
    use_enum_values=True,
    env_file=".env",
    env_file_encoding="utf-8",
    extra="ignore",
)
MatchboxDatabase
MatchboxDatabase(settings: MatchboxPostgresSettings)
Matchbox PostgreSQL database connection.
Methods:
- connection_string – Get the connection string for PostgreSQL.
- get_engine – Get the database engine.
- get_session – Get a new session.
- get_adbc_connection – Get a new ADBC connection.
- run_migrations – Create the database and all tables expected in the schema.
- clear_database – Delete all rows in every table in the database schema.
- drop_database – Drop all tables in the database schema and re-create them.
Attributes:
- settings
- MatchboxBase
- alembic_config
- sorted_tables (list[Table]) – Return a list of SQLAlchemy tables in order of creation.
MatchboxBase
instance-attribute
sorted_tables
property
sorted_tables: list[Table]
Return a list of SQLAlchemy tables in order of creation.
connection_string
Get the connection string for PostgreSQL.
get_adbc_connection
Get a new ADBC connection.
The connection must be used within a context manager.
clear_database
Delete all rows in every table in the database schema.
- TRUNCATE tables that are part of the core ORM (preserves structure)
- DROP tables that are not in the ORM (removes temporary/test tables)
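The two-way split described for clear_database can be sketched as a pure function that decides, per table, whether to truncate or drop. This is an illustration under assumptions rather than the method's actual code: `plan_clear` is a hypothetical helper, and the use of CASCADE is a guess at how foreign keys might be handled.

```python
def plan_clear(existing_tables, orm_tables):
    """Build SQL that truncates ORM tables and drops everything else."""
    orm = set(orm_tables)
    statements = []
    for table in existing_tables:
        if table in orm:
            # Core ORM table: keep the structure, remove the rows.
            statements.append(f'TRUNCATE TABLE "{table}" CASCADE;')
        else:
            # Not in the ORM: treat as a temporary or test table.
            statements.append(f'DROP TABLE "{table}" CASCADE;')
    return statements


plan = plan_clear(
    existing_tables=["clusters", "tmp_ingest"],
    orm_tables=["clusters", "resolutions"],
)
```

Tables that appear in the ORM but do not yet exist in the database are ignored by this sketch. The plan only covers tables that are actually present.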
mixin
A module for defining mixins for the PostgreSQL backend ORM.
Classes:
- CountMixin – A mixin for counting the number of rows in a table.
Attributes:
- T
orm
ORM classes for the Matchbox PostgreSQL database.
Classes:
- ResolutionFrom – Resolution lineage closure table with cached truth values.
- Resolutions – Table of resolution points: models, sources and humans.
- PKSpace – Table used to reserve ranges of primary keys.
- SourceFields – Table for storing column details for SourceConfigs.
- ClusterSourceKey – Table for storing source primary keys for clusters.
- SourceConfigs – Table of source_configs of data for Matchbox.
- Contains – Cluster lineage table.
- Clusters – Table of indexed data and clusters that match it.
- Probabilities – Table of probabilities that a cluster is correct, according to a resolution.
- Results – Table of results for a resolution.
ResolutionFrom
Bases: CountMixin, MatchboxBase
Resolution lineage closure table with cached truth values.
Methods:
- count – Counts the number of rows in the table.
Attributes:
- __tablename__
- parent
- child
- level
- truth_cache
- __table_args__
parent
class-attribute instance-attribute
parent = Column(
    BIGINT,
    ForeignKey(
        "resolutions.resolution_id", ondelete="CASCADE"
    ),
    primary_key=True,
)
child
class-attribute instance-attribute
child = Column(
    BIGINT,
    ForeignKey(
        "resolutions.resolution_id", ondelete="CASCADE"
    ),
    primary_key=True,
)
__table_args__
class-attribute instance-attribute
__table_args__ = (
    CheckConstraint(
        "parent != child", name="no_self_reference"
    ),
    CheckConstraint("level > 0", name="positive_level"),
)
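The effect of the two check constraints and the composite primary key can be tried out in isolation. The sketch below recreates them in SQLite from the Python standard library rather than PostgreSQL, and the table name `resolution_from` is an assumption, since the actual __tablename__ is not shown.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE resolution_from (
        parent INTEGER NOT NULL,
        child INTEGER NOT NULL,
        level INTEGER NOT NULL,
        PRIMARY KEY (parent, child),
        CONSTRAINT no_self_reference CHECK (parent != child),
        CONSTRAINT positive_level CHECK (level > 0)
    )
    """
)


def try_insert(parent: int, child: int, level: int) -> bool:
    """Insert one closure row; return False if any constraint rejects it."""
    try:
        conn.execute(
            "INSERT INTO resolution_from VALUES (?, ?, ?)",
            (parent, child, level),
        )
        return True
    except sqlite3.IntegrityError:
        return False


direct_edge = try_insert(1, 2, 1)      # valid parent -> child at level 1
self_reference = try_insert(2, 2, 1)   # violates no_self_reference
zero_level = try_insert(1, 3, 0)       # violates positive_level
duplicate_path = try_insert(1, 2, 2)   # violates the (parent, child) key
```

Because parent and child together form the primary key, each ancestor-descendant pair can appear only once, whatever its level.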
Resolutions
Bases: CountMixin, MatchboxBase
Table of resolution points: models, sources and humans.
Resolutions produce probabilities or own data in the clusters table.
Methods:
- get_lineage – Returns lineage ordered by priority.
- from_name – Resolves a model resolution name to a Resolution object.
- count – Counts the number of rows in the table.
Attributes:
- __tablename__
- resolution_id
- name
- description
- type
- hash
- truth
- source_config
- probabilities
- results
- children
- __table_args__
- ancestors (set[Resolutions]) – Returns all ancestors (parents, grandparents, etc.) of this resolution.
- descendants (set[Resolutions]) – Returns descendants (children, grandchildren, etc.) of this resolution.
resolution_id
class-attribute instance-attribute
source_config
class-attribute instance-attribute
probabilities
class-attribute instance-attribute
probabilities = relationship(
    "Probabilities",
    back_populates="proposed_by",
    passive_deletes=True,
)
results
class-attribute instance-attribute
children
class-attribute instance-attribute
children = relationship(
    "Resolutions",
    secondary=__table__,
    primaryjoin="Resolutions.resolution_id == ResolutionFrom.parent",
    secondaryjoin="Resolutions.resolution_id == ResolutionFrom.child",
    backref="parents",
)
__table_args__
class-attribute instance-attribute
__table_args__ = (
    CheckConstraint(
        "type IN ('model', 'source', 'human')",
        name="resolution_type_constraints",
    ),
    UniqueConstraint("name", name="resolutions_name_key"),
)
ancestors
property
ancestors: set[Resolutions]
Returns all ancestors (parents, grandparents, etc.) of this resolution.
descendants
property
descendants: set[Resolutions]
Returns descendants (children, grandchildren, etc.) of this resolution.
get_lineage
get_lineage(
    sources: list[SourceConfigs] | None = None,
    threshold: int | None = None,
) -> list[tuple[int, int, float | None]]
Returns lineage ordered by priority.
Highest priority (lowest level) first, then by resolution_id for stability.
Parameters:
- sources (list[SourceConfigs] | None, default: None) – If provided, only return lineage paths that lead to these sources.
- threshold (int | None, default: None) – If provided, override this resolution’s threshold.
Returns:
from_name
classmethod
from_name(
    name: ResolutionName,
    res_type: Literal["model", "source", "human"] | None = None,
    session: Session | None = None,
) -> Resolutions
Resolves a model resolution name to a Resolution object.
Parameters:
- name (ResolutionName) – The name of the model to resolve.
- res_type (Literal['model', 'source', 'human'] | None, default: None) – A resolution type to use as filter.
- session (Session | None, default: None) – A session to get the resolution for updates.
Raises:
- MatchboxResolutionNotFoundError – If the model doesn’t exist.
PKSpace
Bases: MatchboxBase
Table used to reserve ranges of primary keys.
Methods:
- initialise – Create PKSpace tracking row if not exists.
- reserve_block – Atomically get next available ID for table, and increment it.
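Reserving a block of primary keys amounts to reading a counter and advancing it inside one transaction, so that two writers never receive overlapping ranges. The following is a minimal sketch of that idea using SQLite from the standard library. The table layout mirrors the PKSpace entity in the diagram, but the table name `pk_space`, the function signature, and the locking strategy are assumptions, not the PostgreSQL implementation.

```python
import sqlite3

# isolation_level=None lets the sketch issue BEGIN/COMMIT explicitly.
conn = sqlite3.connect(":memory:", isolation_level=None)
conn.execute(
    "CREATE TABLE pk_space (id INTEGER PRIMARY KEY, next_cluster_id INTEGER NOT NULL)"
)
conn.execute("INSERT INTO pk_space VALUES (1, 1)")


def reserve_block(size: int) -> int:
    """Reserve `size` consecutive IDs and return the first one."""
    if size < 1:
        raise ValueError("size must be at least 1")
    conn.execute("BEGIN IMMEDIATE")  # take the write lock before reading
    try:
        (start,) = conn.execute(
            "SELECT next_cluster_id FROM pk_space WHERE id = 1"
        ).fetchone()
        conn.execute(
            "UPDATE pk_space SET next_cluster_id = ? WHERE id = 1",
            (start + size,),
        )
        conn.execute("COMMIT")
    except Exception:
        conn.execute("ROLLBACK")
        raise
    return start


first = reserve_block(100)   # IDs 1..100
second = reserve_block(50)   # IDs 101..150
third = reserve_block(1)     # ID 151
```

The caller can then assign IDs within its block without touching the counter again, which keeps contention on the tracking row low during large inserts.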
Attributes:
SourceFields
Bases: CountMixin, MatchboxBase
Table for storing column details for SourceConfigs.
Methods:
- count – Counts the number of rows in the table.
Attributes:
- __tablename__
- field_id
- source_config_id
- index
- name
- type
- is_key
- source_config
- __table_args__
source_config_id
class-attribute instance-attribute
source_config_id = Column(
    BIGINT,
    ForeignKey(
        "source_configs.source_config_id",
        ondelete="CASCADE",
    ),
    nullable=False,
)
source_config
class-attribute instance-attribute
source_config = relationship(
    "SourceConfigs",
    back_populates="fields",
    foreign_keys=[source_config_id],
)
__table_args__
class-attribute instance-attribute
__table_args__ = (
    UniqueConstraint(
        "source_config_id", "index", name="unique_index"
    ),
    Index(
        "ix_source_columns_source_config_id",
        "source_config_id",
    ),
    Index(
        "ix_unique_key_field",
        "source_config_id",
        unique=True,
        postgresql_where=text("is_key = true"),
    ),
)
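The partial unique index ix_unique_key_field allows many fields per source configuration but only one with is_key set. The sketch below reproduces that behaviour in SQLite, which also supports partial indexes. The table name `source_fields` and the sample field names are assumptions made for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE source_fields (
        field_id INTEGER PRIMARY KEY,
        source_config_id INTEGER NOT NULL,
        "index" INTEGER NOT NULL,
        name TEXT NOT NULL,
        is_key BOOLEAN NOT NULL,
        UNIQUE (source_config_id, "index")
    );
    -- Uniqueness is enforced only over rows where is_key is true.
    CREATE UNIQUE INDEX ix_unique_key_field
        ON source_fields (source_config_id) WHERE is_key = 1;
    """
)


def add_field(source_config_id: int, index: int, name: str, is_key: bool) -> bool:
    """Insert a field; return False if a uniqueness rule rejects it."""
    try:
        conn.execute(
            'INSERT INTO source_fields (source_config_id, "index", name, is_key) '
            "VALUES (?, ?, ?, ?)",
            (source_config_id, index, name, is_key),
        )
        return True
    except sqlite3.IntegrityError:
        return False


results = [
    add_field(1, 0, "id", True),         # first key field for source 1
    add_field(1, 1, "name", False),      # any number of non-key fields
    add_field(1, 2, "postcode", False),
    add_field(1, 3, "other_id", True),   # second key field: rejected
    add_field(2, 0, "id", True),         # a different source may have its own key
    add_field(1, 1, "duplicate", False), # reuses index 1 for source 1: rejected
]
```

Non-key rows are simply left out of the partial index, so they never collide with one another on source_config_id.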
ClusterSourceKey
Bases: CountMixin, MatchboxBase
Table for storing source primary keys for clusters.
Methods:
- count – Counts the number of rows in the table.
Attributes:
- __tablename__
- key_id
- cluster_id
- source_config_id
- key
- cluster
- source_config
- __table_args__
cluster_id
class-attribute instance-attribute
cluster_id = Column(
    BIGINT,
    ForeignKey("clusters.cluster_id", ondelete="CASCADE"),
    nullable=False,
)
source_config_id
class-attribute instance-attribute
source_config_id = Column(
    BIGINT,
    ForeignKey(
        "source_configs.source_config_id",
        ondelete="CASCADE",
    ),
    nullable=False,
)
cluster
class-attribute instance-attribute
source_config
class-attribute instance-attribute
__table_args__
class-attribute instance-attribute
__table_args__ = (
    Index("ix_cluster_keys_cluster_id", "cluster_id"),
    Index("ix_cluster_keys_keys", "key"),
    UniqueConstraint(
        "key_id",
        "source_config_id",
        name="unique_keys_source",
    ),
)
SourceConfigs
SourceConfigs(
    key_field: SourceFields | None = None,
    index_fields: list[SourceFields] | None = None,
    **kwargs,
)
Bases: CountMixin, MatchboxBase
Table of source_configs of data for Matchbox.
Methods:
- list_all – Returns all source_configs in the database.
- from_dto – Create a SourceConfigs instance from a CommonSource object.
- to_dto – Convert ORM source to a matchbox.common SourceConfig object.
- count – Counts the number of rows in the table.
Attributes:
- __tablename__
- source_config_id
- resolution_id
- location_type
- location_uri
- extract_transform
- name (str) – Get the name of the related resolution.
- source_resolution
- fields
- key_field
- index_fields
- cluster_keys
- clusters
source_config_id
class-attribute instance-attribute
resolution_id
class-attribute instance-attribute
resolution_id = Column(
    BIGINT,
    ForeignKey(
        "resolutions.resolution_id", ondelete="CASCADE"
    ),
    nullable=False,
)
extract_transform
class-attribute instance-attribute
source_resolution
class-attribute instance-attribute
fields
class-attribute instance-attribute
fields = relationship(
    "SourceFields",
    back_populates="source_config",
    passive_deletes=True,
    cascade="all, delete-orphan",
)
key_field
class-attribute instance-attribute
key_field = relationship(
    "SourceFields",
    primaryjoin="and_(SourceConfigs.source_config_id == SourceFields.source_config_id, SourceFields.is_key == True)",
    viewonly=True,
    uselist=False,
)
index_fields
class-attribute instance-attribute
index_fields = relationship(
    "SourceFields",
    primaryjoin="and_(SourceConfigs.source_config_id == SourceFields.source_config_id, SourceFields.is_key == False)",
    viewonly=True,
    order_by="SourceFields.index",
    collection_class=list,
)
cluster_keys
class-attribute instance-attribute
cluster_keys = relationship(
    "ClusterSourceKey",
    back_populates="source_config",
    passive_deletes=True,
)
clusters
class-attribute instance-attribute
clusters = relationship(
    "Clusters",
    secondary=__table__,
    primaryjoin="SourceConfigs.source_config_id == ClusterSourceKey.source_config_id",
    secondaryjoin="ClusterSourceKey.cluster_id == Clusters.cluster_id",
    viewonly=True,
)
list_all
classmethod
list_all() -> list[SourceConfigs]
Returns all source_configs in the database.
from_dto
classmethod
from_dto(
    resolution: Resolutions, source_config: SourceConfig
) -> SourceConfigs
Create a SourceConfigs instance from a CommonSource object.
Contains
Bases: CountMixin, MatchboxBase
Cluster lineage table.
Methods:
- count – Counts the number of rows in the table.
Attributes:
- __tablename__
- root
- leaf
- __table_args__
root
class-attribute instance-attribute
leaf
class-attribute instance-attribute
__table_args__
class-attribute instance-attribute
__table_args__ = (
    CheckConstraint(
        "root != leaf", name="no_self_containment"
    ),
    Index("ix_contains_root_leaf", "root", "leaf"),
    Index("ix_contains_leaf_root", "leaf", "root"),
)
Clusters
Bases: CountMixin, MatchboxBase
Table of indexed data and clusters that match it.
Methods:
- count – Counts the number of rows in the table.
Attributes:
- __tablename__
- cluster_id
- cluster_hash
- keys
- probabilities
- leaves
- source_configs
- __table_args__
keys
class-attribute instance-attribute
probabilities
class-attribute instance-attribute
leaves
class-attribute instance-attribute
leaves = relationship(
    "Clusters",
    secondary=__table__,
    primaryjoin="Clusters.cluster_id == Contains.root",
    secondaryjoin="Clusters.cluster_id == Contains.leaf",
    backref="roots",
)
source_configs
class-attribute instance-attribute
source_configs = relationship(
    "SourceConfigs",
    secondary=__table__,
    primaryjoin="Clusters.cluster_id == ClusterSourceKey.cluster_id",
    secondaryjoin="ClusterSourceKey.source_config_id == SourceConfigs.source_config_id",
    viewonly=True,
)
__table_args__
class-attribute instance-attribute
Probabilities
¶
Bases: CountMixin
, MatchboxBase
Table of probabilities that a cluster is correct, according to a resolution.
Methods:
-
count
–Counts the number of rows in the table.
Attributes:
-
__tablename__
– -
resolution_id
– -
cluster_id
– -
probability
– -
proposed_by
– -
proposes
– -
__table_args__
–
resolution_id
class-attribute
instance-attribute
¶
resolution_id = Column(
BIGINT,
ForeignKey(
"resolutions.resolution_id", ondelete="CASCADE"
),
primary_key=True,
)
cluster_id
class-attribute
instance-attribute
¶
cluster_id = Column(
BIGINT,
ForeignKey("clusters.cluster_id", ondelete="CASCADE"),
primary_key=True,
)
proposed_by
class-attribute
instance-attribute
¶
proposes
class-attribute
instance-attribute
¶
__table_args__
class-attribute
instance-attribute
¶
__table_args__ = (
CheckConstraint(
"probability BETWEEN 0 AND 100",
name="valid_probability",
),
Index("ix_probabilities_resolution", "resolution_id"),
)
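The effect of the composite primary key and the valid_probability constraint can be tried out in isolation. The following is a minimal sketch using an in-memory SQLite database as a stand-in for PostgreSQL; the column nullability is an assumption, the foreign keys are omitted, and the try_insert helper is purely illustrative.

```python
import sqlite3

# In-memory SQLite stands in for PostgreSQL; foreign keys are left out.
conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE probabilities (
        resolution_id INTEGER NOT NULL,
        cluster_id INTEGER NOT NULL,
        probability SMALLINT,
        PRIMARY KEY (resolution_id, cluster_id),
        CONSTRAINT valid_probability CHECK (probability BETWEEN 0 AND 100)
    )
    """
)
conn.execute(
    "CREATE INDEX ix_probabilities_resolution ON probabilities (resolution_id)"
)


def try_insert(resolution_id: int, cluster_id: int, probability: int) -> bool:
    """Illustrative helper: True if the row is accepted, False if rejected."""
    try:
        conn.execute(
            "INSERT INTO probabilities VALUES (?, ?, ?)",
            (resolution_id, cluster_id, probability),
        )
        return True
    except sqlite3.IntegrityError:
        return False


accepted = try_insert(1, 10, 85)        # within 0..100
out_of_range = try_insert(1, 11, 101)   # violates valid_probability
duplicate_key = try_insert(1, 10, 90)   # same (resolution_id, cluster_id) pair
row_count = conn.execute("SELECT COUNT(*) FROM probabilities").fetchone()[0]
```

Probabilities are stored as whole numbers between 0 and 100, so each resolution holds at most one integer score per cluster.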
Results
¶
Bases: CountMixin, MatchboxBase
Table of results for a resolution.
Stores the raw left/right probabilities created by a model.
Methods:
-
count
–Counts the number of rows in the table.
Attributes:
-
__tablename__
– -
result_id
– -
resolution_id
– -
left_id
– -
right_id
– -
probability
– -
proposed_by
– -
__table_args__
–
result_id
class-attribute
instance-attribute
¶
resolution_id
class-attribute
instance-attribute
¶
resolution_id = Column(
BIGINT,
ForeignKey(
"resolutions.resolution_id", ondelete="CASCADE"
),
nullable=False,
)
left_id
class-attribute
instance-attribute
¶
right_id
class-attribute
instance-attribute
¶
proposed_by
class-attribute
instance-attribute
¶
__table_args__
class-attribute
instance-attribute
¶
__table_args__ = (
Index("ix_results_resolution", "resolution_id"),
CheckConstraint(
"probability BETWEEN 0 AND 100",
name="valid_probability",
),
UniqueConstraint(
"resolution_id", "left_id", "right_id"
),
)
utils
¶
Utilities for using the PostgreSQL backend.
Modules:
-
db
–General utilities for the PostgreSQL backend.
-
insert
–Utilities for inserting data into the PostgreSQL backend.
-
query
–Utilities for querying and matching in the PostgreSQL backend.
-
results
–Utilities for querying model results from the PostgreSQL backend.
db
¶
General utilities for the PostgreSQL backend.
Functions:
-
get_resolution_graph
–Retrieves the resolution graph.
-
dump
–Dumps the entire database to a snapshot.
-
restore
–Restores the database from a snapshot.
-
sqa_profiled
–SQLAlchemy profiler.
-
compile_sql
–Compiles a SQLAlchemy statement into a string.
-
large_ingest
–Append a PyArrow table to a PostgreSQL table using ADBC.
-
ingest_to_temporary_table
–Context manager to ingest Arrow data to a temporary table with explicit types.
dump
¶
dump() -> MatchboxSnapshot
Dumps the entire database to a snapshot.
Returns:
-
MatchboxSnapshot
–A MatchboxSnapshot object of type “postgres” with the database’s current state.
restore
¶
restore(
snapshot: MatchboxSnapshot, batch_size: int
) -> None
Restores the database from a snapshot.
Parameters:
-
snapshot
¶MatchboxSnapshot
) –A MatchboxSnapshot object of type “postgres” with the database’s state
-
batch_size
¶int
) –The number of records to insert in each batch
Raises:
-
ValueError
–If the snapshot is missing data
sqa_profiled
¶
SQLAlchemy profiler.
Taken directly from the SQLAlchemy documentation: https://docs.sqlalchemy.org/en/20/faq/performance.html#query-profiling
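For orientation, the recipe in that FAQ is a small context manager built on cProfile and pstats. The sketch below follows the same pattern; the name profiled and the optional stream parameter are assumptions made here so the report can be captured, and the exact signature of sqa_profiled may differ.

```python
import contextlib
import cProfile
import io
import pstats
import sys


@contextlib.contextmanager
def profiled(stream=None):
    """Profile the enclosed block and write a cumulative-time report."""
    out = stream if stream is not None else sys.stdout
    profiler = cProfile.Profile()
    profiler.enable()
    try:
        yield
    finally:
        profiler.disable()
        buffer = io.StringIO()
        stats = pstats.Stats(profiler, stream=buffer).sort_stats("cumulative")
        stats.print_stats()
        out.write(buffer.getvalue())


# Usage: wrap any block of work, such as a query, to see where time is spent.
report = io.StringIO()
with profiled(report):
    total = sum(i * i for i in range(1000))
```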
compile_sql
¶
large_ingest
¶
large_ingest(
data: Table,
table_class: DeclarativeMeta,
max_chunksize: int | None = None,
upsert_keys: list[str] | None = None,
update_columns: list[str] | None = None,
)
Append a PyArrow table to a PostgreSQL table using ADBC.
It either copies the data directly, raising an error if primary key constraints are violated, or runs in upsert mode through a staging table, which is slower.
Parameters:
-
data
¶Table
) –A PyArrow table to write.
-
table_class
¶DeclarativeMeta
) –The SQLAlchemy ORM class for the table to write to.
-
max_chunksize
¶int | None
, default:None
) –Size of data chunks to be read and copied.
-
upsert_keys
¶list[str] | None
, default:None
) –Columns used as keys for “on conflict do update”. If passed, the ingest runs in the slower upsert mode. If not passed and update_columns is passed, defaults to the primary keys.
-
update_columns
¶list[str] | None
, default:None
) –Columns to update when upserting. If passed, the ingest runs in the slower upsert mode. If not passed and upsert_keys is passed, defaults to all other columns.
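The interaction between upsert_keys and update_columns described above can be summarised in a few lines. This is a sketch of the documented defaults only: resolve_upsert_plan is a hypothetical helper, not part of Matchbox, and "all other columns" is read here as every column that is not an upsert key.

```python
from __future__ import annotations


def resolve_upsert_plan(
    columns: list[str],
    primary_keys: list[str],
    upsert_keys: list[str] | None = None,
    update_columns: list[str] | None = None,
) -> dict:
    """Hypothetical helper mirroring the documented defaults."""
    # Neither argument passed: plain copy, which errors on key conflicts.
    if upsert_keys is None and update_columns is None:
        return {"mode": "copy", "keys": [], "update": []}
    # Either argument passed: upsert mode via a staging table.
    keys = list(upsert_keys) if upsert_keys is not None else list(primary_keys)
    if update_columns is not None:
        update = list(update_columns)
    else:
        update = [col for col in columns if col not in keys]
    return {"mode": "upsert", "keys": keys, "update": update}


cols = ["resolution_id", "cluster_id", "probability"]
pks = ["resolution_id", "cluster_id"]

plan_copy = resolve_upsert_plan(cols, pks)
plan_keys_only = resolve_upsert_plan(cols, pks, upsert_keys=pks)
plan_updates_only = resolve_upsert_plan(cols, pks, update_columns=["probability"])
```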
ingest_to_temporary_table
¶
ingest_to_temporary_table(
table_name: str,
schema_name: str,
data: Table,
column_types: dict[str, type[TypeEngine]],
max_chunksize: int | None = None,
) -> Generator[Table, None, None]
Context manager to ingest Arrow data to a temporary table with explicit types.
Parameters:
-
table_name
¶str
) –Base name for the temporary table
-
schema_name
¶str
) –Schema where the temporary table will be created
-
data
¶Table
) –PyArrow table containing the data to ingest
-
column_types
¶dict[str, type[TypeEngine]]
) –Map of column names to SQLAlchemy types
-
max_chunksize
¶int | None
, default:None
) –Optional maximum chunk size for batches
Yields:
-
Table
–A SQLAlchemy Table object representing the temporary table
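The general shape of such a context manager can be sketched with the standard library alone. The example below uses SQLite and plain tuples instead of PostgreSQL, ADBC and PyArrow, yields a table name rather than a SQLAlchemy Table, and has no schema_name; temporary_table and its parameters are illustrative assumptions, not the Matchbox implementation.

```python
import contextlib
import sqlite3
import uuid


@contextlib.contextmanager
def temporary_table(conn, base_name, column_types, rows):
    """Create a uniquely named temp table, load rows, yield its name, drop it."""
    name = f"{base_name}_{uuid.uuid4().hex[:8]}"
    columns_sql = ", ".join(
        f'"{col}" {sql_type}' for col, sql_type in column_types.items()
    )
    conn.execute(f'CREATE TEMP TABLE "{name}" ({columns_sql})')
    try:
        placeholders = ", ".join("?" for _ in column_types)
        conn.executemany(f'INSERT INTO "{name}" VALUES ({placeholders})', rows)
        yield name
    finally:
        # The table is removed whether or not the enclosed block raised.
        conn.execute(f'DROP TABLE IF EXISTS "{name}"')


conn = sqlite3.connect(":memory:")
rows = [("a", b"\x01"), ("b", b"\x02")]
with temporary_table(conn, "hashes", {"key_text": "TEXT", "hash": "BLOB"}, rows) as tmp:
    loaded = conn.execute(f'SELECT COUNT(*) FROM "{tmp}"').fetchone()[0]

remaining = conn.execute(
    "SELECT COUNT(*) FROM sqlite_temp_master WHERE type = 'table' AND name = ?",
    (tmp,),
).fetchone()[0]
```

Giving each call a unique suffix lets concurrent ingests share a base name without colliding.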
insert
¶
Utilities for inserting data into the PostgreSQL backend.
Functions:
-
insert_source
–Indexes a source within Matchbox.
-
insert_model
–Writes a model to Matchbox with a default truth value of 100.
-
insert_results
–Writes a results table to Matchbox.
insert_source
¶
insert_source(
source_config: SourceConfig,
data_hashes: Table,
batch_size: int,
) -> None
Indexes a source within Matchbox.
insert_model
¶
insert_model(
name: ModelResolutionName,
left: Resolutions,
right: Resolutions,
description: str,
) -> None
Writes a model to Matchbox with a default truth value of 100.
Parameters:
-
name
¶ModelResolutionName
) –Name of the new model
-
left
¶Resolutions
) –Left parent of the model
-
right
¶Resolutions
) –Right parent of the model. Same as left in a dedupe job
-
description
¶str
) –Model description
Raises:
-
MatchboxResolutionNotFoundError
–If the specified parent models don’t exist.
-
MatchboxResolutionAlreadyExists
–If the specified model already exists.
insert_results
¶
insert_results(
resolution: Resolutions, results: Table, batch_size: int
) -> None
Writes a results table to Matchbox.
The PostgreSQL backend stores clusters in a hierarchical structure, where each component references its parent component at a higher threshold.
This means two-item components are synonymous with their original pairwise probabilities.
This allows easy querying of clusters at any threshold.
Parameters:
-
resolution
¶Resolutions
) –Resolution of type model to associate results with
-
results
¶Table
) –A PyArrow results table with left_id, right_id, probability
-
batch_size
¶int
) –Number of records to insert in each batch
Raises:
-
MatchboxResolutionNotFoundError
–If the specified model doesn’t exist.
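To illustrate the idea of components that form at successive thresholds, here is a simplified sketch that merges pairwise results in descending order of probability using a union-find structure. It is a generic single-linkage-style illustration under assumed inputs (tuples of left_id, right_id and an integer probability), not the algorithm insert_results actually uses; build_components and clusters_at are hypothetical names.

```python
def build_components(pairs):
    """Merge pairs from highest to lowest probability.

    Returns a list of (members, probability), one entry per merge, where
    probability is the threshold at which that component first exists.
    """
    parent = {}    # union-find parent pointers over leaf ids
    current = {}   # union-find root -> members of its current component
    components = []

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for left, right, prob in sorted(pairs, key=lambda p: -p[2]):
        a, b = find(left), find(right)
        if a == b:
            continue  # already joined at a higher threshold
        members = current.get(a, frozenset([a])) | current.get(b, frozenset([b]))
        parent[b] = a
        current[a] = members
        current.pop(b, None)
        components.append((members, prob))
    return components


def clusters_at(components, threshold):
    """Largest components whose probability meets the threshold."""
    selected = [members for members, prob in components if prob >= threshold]
    return [m for m in selected if not any(m < other for other in selected)]


pairs = [(1, 2, 90), (2, 3, 80), (4, 5, 70), (1, 3, 60)]
components = build_components(pairs)
at_85 = clusters_at(components, 85)
at_50 = clusters_at(components, 50)
```

In this sketch the two-item component {1, 2} carries probability 90, the same as its original pair, and lowering the threshold only ever grows or adds clusters.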
query
¶
Utilities for querying and matching in the PostgreSQL backend.
Functions:
-
get_source_config
–Converts the named source to a SourceConfigs ORM object.
-
query
–Queries Matchbox to retrieve linked data for a source.
-
get_parent_clusters_and_leaves
–Query clusters and their leaves for all parent resolutions.
-
match
–Matches an ID in a source resolution and returns the keys in the targets.
Attributes:
-
T
–
get_source_config
¶
get_source_config(
name: SourceResolutionName, session: Session
) -> SourceConfigs
Converts the named source to a SourceConfigs ORM object.
query
¶
query(
source: SourceResolutionName,
resolution: ResolutionName | None = None,
threshold: int | None = None,
limit: int | None = None,
) -> Table
Queries Matchbox to retrieve linked data for a source.
Retrieves all linked data for a given source, resolving through hierarchy if needed.
- Simple case: If querying the same resolution as the source, just select cluster IDs and keys directly from ClusterSourceKey
- Hierarchy case: Uses the unified query builder to traverse up the resolution hierarchy, applying COALESCE priority logic to determine which parent cluster each source record belongs to
- Priority resolution: When multiple model resolutions could assign a record to different clusters, COALESCE ensures higher-priority resolutions win
Returns all records with their final resolved cluster IDs.
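The COALESCE priority logic can be shown on a toy dataset. The sketch below uses SQLite and invented tables (source_keys, linker, deduper); which resolution outranks which in Matchbox is not stated here, so the ordering inside COALESCE is an assumption for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE source_keys (source_key TEXT, cluster_id INTEGER);
    CREATE TABLE linker (leaf INTEGER, cluster_id INTEGER);   -- assumed higher priority
    CREATE TABLE deduper (leaf INTEGER, cluster_id INTEGER);  -- assumed lower priority
    INSERT INTO source_keys VALUES ('x', 1), ('y', 2), ('z', 3);
    INSERT INTO linker VALUES (1, 100);
    INSERT INTO deduper VALUES (1, 200), (2, 200);
    """
)

# COALESCE returns the first non-NULL argument, so the linker's cluster wins
# when present, then the deduper's, and finally the source cluster itself.
resolved = conn.execute(
    """
    SELECT k.source_key,
           COALESCE(l.cluster_id, d.cluster_id, k.cluster_id) AS resolved_id
    FROM source_keys AS k
    LEFT JOIN linker AS l ON l.leaf = k.cluster_id
    LEFT JOIN deduper AS d ON d.leaf = k.cluster_id
    ORDER BY k.source_key
    """
).fetchall()
```

Key x is claimed by both resolutions and takes the linker's cluster, y falls back to the deduper, and z keeps its source cluster.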
get_parent_clusters_and_leaves
¶
Query clusters and their leaves for all parent resolutions.
For a given resolution, find all its parent resolutions and return complete cluster compositions.
- Parent discovery: Queries ResolutionFrom to find all direct parent resolutions (level 1)
- Cluster building: For each parent, runs the full unified query to get all cluster assignments with both root and leaf information
- Aggregation: Collects all leaf nodes belonging to each root cluster across all parent resolutions
Returns a dictionary mapping cluster IDs to their complete leaf compositions and metadata.
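Two of the steps above lend themselves to a small illustration: finding direct parents in a closure table by filtering on level 1, and grouping (root, leaf) rows into a dictionary. The sketch below uses SQLite and made-up IDs; direct_parents and aggregate_leaves are hypothetical helpers, and the metadata returned by the real function is not modelled.

```python
import sqlite3
from collections import defaultdict

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE resolution_from (parent INTEGER, child INTEGER, level INTEGER)"
)
# Closure table: every ancestor/descendant pair is stored with its distance.
conn.executemany(
    "INSERT INTO resolution_from VALUES (?, ?, ?)",
    [(1, 3, 1), (2, 3, 1), (3, 4, 1), (1, 4, 2), (2, 4, 2)],
)


def direct_parents(child_id):
    """Direct parents are the closure-table rows at level 1."""
    rows = conn.execute(
        "SELECT parent FROM resolution_from "
        "WHERE child = ? AND level = 1 ORDER BY parent",
        (child_id,),
    )
    return [parent for (parent,) in rows]


def aggregate_leaves(assignments):
    """Group (root_id, leaf_id) rows into {root_id: sorted leaf ids}."""
    leaves = defaultdict(set)
    for root_id, leaf_id in assignments:
        leaves[root_id].add(leaf_id)
    return {root: sorted(members) for root, members in leaves.items()}


parents_of_4 = direct_parents(4)
parents_of_3 = direct_parents(3)
grouped = aggregate_leaves([(10, 1), (10, 2), (11, 3), (10, 2)])
```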
match
¶
match(
key: str,
source: SourceResolutionName,
targets: list[SourceResolutionName],
resolution: ResolutionName,
threshold: int | None = None,
) -> list[Match]
Matches an ID in a source resolution and returns the keys in the targets.
Given a specific key in a source, find what it matches to in target sources through a resolution hierarchy.
- Target cluster identification: Uses COALESCE priority CTE to determine which cluster the input key belongs to at the resolution level
- Matching leaves discovery: Builds a UNION ALL query with branches for:
    - Direct cluster members (source-only case)
    - Members connected through each model resolution in the hierarchy
- Cross-reference: Joins the target cluster with all possible matching leaves, filtering for the requested target sources
Organises matches by source configuration and returns structured Match objects for each target.
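The final organising step can be pictured in plain Python. In the sketch below the cluster assignments are given as a dictionary rather than computed by SQL, and MatchSketch is a hypothetical stand-in whose fields are assumptions; the real Match object may be shaped differently.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class MatchSketch:
    """Hypothetical stand-in for a Match; field names are assumptions."""

    cluster: int | None
    source: str
    source_id: set[str]
    target: str
    target_id: set[str]


def match_sketch(key, source, targets, assignments):
    """assignments maps (source_name, key) to a resolved cluster ID."""
    # Step 1: which cluster does the input key belong to?
    cluster = assignments.get((source, key))
    # Step 2: every (source, key) pair that resolved to the same cluster.
    members = [
        pair for pair, assigned in assignments.items()
        if cluster is not None and assigned == cluster
    ]
    # Step 3: one result per requested target, even when nothing matched.
    return [
        MatchSketch(
            cluster=cluster,
            source=source,
            source_id={k for s, k in members if s == source},
            target=target,
            target_id={k for s, k in members if s == target},
        )
        for target in targets
    ]


assignments = {
    ("crm", "c1"): 7,
    ("crm", "c2"): 7,
    ("erp", "e9"): 7,
    ("web", "w4"): 8,
}
matches = match_sketch("c1", "crm", ["erp", "web"], assignments)
```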
results
¶
Utilities for querying model results from the PostgreSQL backend.
Classes:
-
SourceInfo
–Information about a model’s sources.
Functions:
-
get_model_config
–Get metadata for a model resolution.
SourceInfo
¶
Bases: NamedTuple
Information about a model’s sources.
Attributes:
-
left
(int
) – -
right
(int | None
) – -
left_ancestors
(set[int]
) – -
right_ancestors
(set[int] | None
) –
get_model_config
¶
get_model_config(resolution: Resolutions) -> ModelConfig
Get metadata for a model resolution.