Models

matchbox.common.factories.models

Factory functions for generating model testkits and data for testing.

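Classes:

  • MockDeduper

    Mock deduper that does nothing.

  • MockLinker

    Mock linker that does nothing.

  • ModelTestkit

    A testkit of data and metadata for a Model.

Functions:

  • component_report

    Fast reporting on connected components using rustworkx.

  • validate_components

    Validate that probability edges create valid components.

  • calculate_min_max_edges

    Calculate min and max edges for a graph.

  • generate_dummy_probabilities

    Generate dummy probabilities data with guaranteed isolated components.

  • generate_entity_probabilities

    Generate probabilities that will recover entity relationships.

  • model_factory

    Generate a complete model testkit.

  • query_to_model_factory

    Turns raw queries from Matchbox into ModelTestkits.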

MockDeduper

Bases: Deduper

Mock deduper that does nothing.

Methods:

  • prepare

    Mock prepare method.

  • dedupe

    Mock dedupe method.

Attributes:

settings instance-attribute

settings: DeduperSettings

prepare

prepare(left: DataFrame) -> None

Mock prepare method.

dedupe

dedupe(left: DataFrame) -> DataFrame

Mock dedupe method.

MockLinker

Bases: Linker

Mock linker that does nothing.

Methods:

  • prepare

    Mock prepare method.

  • link

    Mock link method.

Attributes:

settings instance-attribute

settings: LinkerSettings

prepare

prepare(left: DataFrame, right: DataFrame) -> None

Mock prepare method.

link

link(left: DataFrame, right: DataFrame) -> DataFrame

Mock link method.

ModelTestkit

Bases: BaseModel

A testkit of data and metadata for a Model.

Methods:

  • into_dag

    Turn model into kwargs for dag.model(), detaching from original DAG.

  • init_query_lookup

    Initialize the query lookup table.

Attributes:

model instance-attribute

model: Model

left_data instance-attribute

left_data: Table

left_query instance-attribute

left_query: Query

left_clusters instance-attribute

left_clusters: dict[int, ClusterEntity]

right_data instance-attribute

right_data: Table | None

right_query instance-attribute

right_query: Query | None

right_clusters instance-attribute

right_clusters: dict[int, ClusterEntity] | None

probabilities instance-attribute

probabilities: DataFrame

name property

name: str

Return the full name of the Model.

data property

data: Table

Return a PyArrow table in the same format as matchbox queries.

entities property writable

entities: tuple[ClusterEntity, ...]

ClusterEntities that were generated by the model.

threshold property writable

threshold: int

Threshold for the model.

into_dag

into_dag() -> dict

Turn model into kwargs for dag.model(), detaching from original DAG.

init_query_lookup

init_query_lookup() -> ModelTestkit

Initialize the query lookup table.

component_report

component_report(all_nodes: list[Any], table: DataFrame) -> dict

Fast reporting on connected components using rustworkx.

Parameters:

  • all_nodes

    (list[Any]) –

    list of identities of inputs being matched

  • table

    (DataFrame) –

    Polars dataframe with ‘left’, ‘right’ columns

Returns:

  • dict

    dictionary containing basic component statistics
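To illustrate the kind of statistics such a report might contain, here is a minimal sketch that counts connected components with a standard-library union-find rather than rustworkx. The function name `sketch_component_report`, the plain list of edge pairs in place of a Polars dataframe, and the keys of the returned dictionary are all assumptions made for illustration; they are not the library's API.

```python
from collections.abc import Hashable
from typing import Any


def sketch_component_report(
    all_nodes: list[Hashable], edges: list[tuple[Hashable, Hashable]]
) -> dict[str, Any]:
    """Summarise connected components (hypothetical helper, not matchbox code).

    Assumes every node that appears in an edge is also listed in all_nodes.
    """
    parent = {node: node for node in all_nodes}

    def find(node: Hashable) -> Hashable:
        # Walk up to the root, shortening the path as we go
        while parent[node] != node:
            parent[node] = parent[parent[node]]
            node = parent[node]
        return node

    # Merge the two components joined by each edge
    for left, right in edges:
        parent[find(left)] = find(right)

    # Count how many nodes ended up under each root
    sizes: dict[Hashable, int] = {}
    for node in all_nodes:
        root = find(node)
        sizes[root] = sizes.get(root, 0) + 1

    return {
        "num_components": len(sizes),
        "total_nodes": len(all_nodes),
        "total_edges": len(edges),
        "min_component_size": min(sizes.values(), default=0),
        "max_component_size": max(sizes.values(), default=0),
    }


report = sketch_component_report([1, 2, 3, 4, 5, 6], [(1, 2), (2, 3), (4, 5)])
```

Nodes that appear in no edge are counted as single-node components, which is why `all_nodes` is passed separately from the edge list.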

validate_components

Validate that probability edges create valid components.

Each component should be a subset of exactly one source entity.

Parameters:

calculate_min_max_edges

calculate_min_max_edges(left_nodes: int, right_nodes: int, num_components: int, deduplicate: bool) -> tuple[int, int]

Calculate min and max edges for a graph.

Parameters:

  • left_nodes

    (int) –

    number of nodes in left source

  • right_nodes

    (int) –

    number of nodes in right source

  • num_components

    (int) –

    number of requested components

  • deduplicate

    (bool) –

    whether edges are for deduplication

Returns:

  • tuple[int, int]

    Two-tuple representing min and max edges
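The bounds can be reasoned about per component: a connected component needs at least a spanning tree, and at most every allowed pair. The sketch below works this through under stated assumptions, namely that nodes are spread as evenly as possible across components and that any remainder goes to the first components on each side. The name `sketch_min_max_edges` is hypothetical, and the library's own allocation of nodes to components may differ.

```python
def sketch_min_max_edges(
    left_nodes: int, right_nodes: int, num_components: int, deduplicate: bool
) -> tuple[int, int]:
    """Bound the edge count for evenly sized components (illustrative only)."""
    smallest_side = left_nodes if deduplicate else min(left_nodes, right_nodes)
    if num_components < 1 or smallest_side < num_components:
        raise ValueError("Each component needs at least one node per side")

    def split(total: int) -> list[int]:
        # Spread nodes evenly, giving the remainder to the first components
        base, extra = divmod(total, num_components)
        return [base + (1 if i < extra else 0) for i in range(num_components)]

    min_edges = max_edges = 0
    if deduplicate:
        # Only the left nodes matter; right_nodes is ignored in this sketch
        for n in split(left_nodes):
            min_edges += n - 1  # spanning tree over n nodes
            max_edges += n * (n - 1) // 2  # every unordered pair
    else:
        for n_left, n_right in zip(split(left_nodes), split(right_nodes)):
            min_edges += n_left + n_right - 1  # spanning tree over both sides
            max_edges += n_left * n_right  # complete bipartite graph
    return min_edges, max_edges
```

For example, deduplicating 10 nodes into 2 components gives two groups of 5, so the bounds are 2 × 4 = 8 and 2 × 10 = 20 edges.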

generate_dummy_probabilities cached

generate_dummy_probabilities(left_values: tuple[int], right_values: tuple[int] | None, prob_range: tuple[float, float], num_components: int, total_rows: int | None = None, seed: int = 42) -> DataFrame

Generate dummy probabilities data with guaranteed isolated components.

While much of the factory system uses generate_entity_probabilities, this function is still used for PostgreSQL benchmarking and is designed to perform well at scale.

Parameters:

  • left_values

    (tuple[int]) –

    Tuple of integers to use for left column

  • right_values

    (tuple[int] | None) –

    Tuple of integers to use for right column. If None, assume we are generating probabilities for deduplication

  • prob_range

    (tuple[float, float]) –

    Tuple of (min_prob, max_prob) to constrain probabilities

  • num_components

    (int) –

    Number of distinct connected components to generate

  • total_rows

    (int | None, default: None ) –

    Total number of rows to generate

  • seed

    (int, default: 42 ) –

    Random seed for reproducibility

Returns:

  • DataFrame

    Polars dataframe with ‘left_id’, ‘right_id’, and ‘probability’ columns
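One way to guarantee isolated components is to partition the input values into disjoint groups and only ever draw edges inside a group. The sketch below shows this for the deduplication case, under several assumptions: the name `sketch_dummy_probabilities` is hypothetical, it returns a tuple of rows instead of a Polars dataframe, and it emits only the minimum chain of edges per group. Taking tuples as input keeps the arguments hashable, which is what allows the result to be cached.

```python
import random
from functools import lru_cache


@lru_cache(maxsize=None)
def sketch_dummy_probabilities(
    values: tuple[int, ...],
    prob_range: tuple[float, float],
    num_components: int,
    seed: int = 42,
) -> tuple[tuple[int, int, float], ...]:
    """Build (left_id, right_id, probability) rows with isolated components.

    Illustrative only: deduplication case, minimum number of edges.
    """
    rng = random.Random(seed)
    low, high = prob_range

    # Deal values round-robin into num_components disjoint groups
    groups = [values[i::num_components] for i in range(num_components)]

    rows = []
    for group in groups:
        # Chain consecutive members: the fewest edges that connect the group
        for left, right in zip(group, group[1:]):
            rows.append((left, right, round(rng.uniform(low, high), 2)))
    return tuple(rows)


rows = sketch_dummy_probabilities(tuple(range(10)), (0.8, 1.0), num_components=2)
```

Because no edge ever crosses between groups, the number of connected components is fixed by construction and does not depend on the random draw.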

generate_entity_probabilities

generate_entity_probabilities(left_entities: frozenset[ClusterEntity], right_entities: frozenset[ClusterEntity] | None, source_entities: frozenset[SourceEntity], prob_range: tuple[float, float] = (0.8, 1.0), seed: int = 42) -> DataFrame

Generate probabilities that will recover entity relationships.

Compares ClusterEntity objects against ground truth SourceEntities by checking whether their EntityReferences are subsets of the source entities. Initially focused on generating fully connected, correct probabilities only.

Parameters:

  • left_entities

    (frozenset[ClusterEntity]) –

    Set of ClusterEntity objects from left input

  • right_entities

    (frozenset[ClusterEntity] | None) –

    Set of ClusterEntity objects from right input. If None, assume we are deduplicating left_entities.

  • source_entities

    (frozenset[SourceEntity]) –

    Ground truth set of SourceEntities

  • prob_range

    (tuple[float, float], default: (0.8, 1.0) ) –

    Range of probabilities to assign to matches. All matches will be assigned a random probability in this range.

  • seed

    (int, default: 42 ) –

    Random seed for reproducibility

Returns:

  • DataFrame

    Polars dataframe with ‘left_id’, ‘right_id’, and ‘probability’ columns
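The subset check described above can be sketched with plain frozensets. In this simplified model, which is an assumption for illustration and not the library's data structures, each cluster is a frozenset of (source name, key) references keyed by an integer ID, and each ground-truth entity is a frozenset of the same kind. Only the deduplication case is shown, and `sketch_entity_probabilities` is a hypothetical name.

```python
import random
from itertools import combinations

Reference = tuple[str, str]  # (source name, key), a simplification


def sketch_entity_probabilities(
    clusters: dict[int, frozenset[Reference]],
    truth: list[frozenset[Reference]],
    prob_range: tuple[float, float] = (0.8, 1.0),
    seed: int = 42,
) -> list[tuple[int, int, float]]:
    """Pair up clusters that belong to the same ground-truth entity."""
    rng = random.Random(seed)
    rows = []
    for source_entity in truth:
        # A cluster belongs to an entity when its references are a subset of it
        members = sorted(
            cid for cid, refs in clusters.items() if refs <= source_entity
        )
        # Fully connect the members: one edge per unordered pair
        for left_id, right_id in combinations(members, 2):
            rows.append((left_id, right_id, rng.uniform(*prob_range)))
    return rows


clusters = {
    1: frozenset({("crm", "a1")}),
    2: frozenset({("crm", "a2")}),
    3: frozenset({("crm", "b1")}),
}
truth = [
    frozenset({("crm", "a1"), ("crm", "a2")}),
    frozenset({("crm", "b1")}),
]
rows = sketch_entity_probabilities(clusters, truth)
```

Here clusters 1 and 2 fall under the same ground-truth entity and receive one edge, while cluster 3 stands alone and receives none.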

model_factory

model_factory(name: ModelResolutionName | None = None, dag: DAG | None = None, description: str | None = None, left_testkit: SourceTestkit | ModelTestkit | None = None, right_testkit: SourceTestkit | ModelTestkit | None = None, true_entities: tuple[SourceEntity, ...] | None = None, model_type: ModelType | None = None, n_true_entities: int | None = None, prob_range: tuple[float, float] = (0.8, 1.0), seed: int = 42) -> ModelTestkit

Generate a complete model testkit.

Allows autoconfiguration with minimal settings, or more nuanced control.

Can either be used to generate a model in a pipeline, interconnected with existing SourceTestkit or ModelTestkit objects, or generate a standalone model with random data.

Parameters:

  • name

    (ModelResolutionName | None, default: None ) –

    Name of the model

  • dag

    (DAG | None, default: None ) –

    DAG containing this model. Overridden by dag of left testkit if present.

  • description

    (str | None, default: None ) –

    Description of the model

  • left_testkit

    (SourceTestkit | ModelTestkit | None, default: None ) –

    A SourceTestkit or ModelTestkit for the left source

  • right_testkit

    (SourceTestkit | ModelTestkit | None, default: None ) –

    If creating a linker, a SourceTestkit or ModelTestkit for the right source

  • true_entities

    (tuple[SourceEntity, ...] | None, default: None ) –

    Ground truth SourceEntity objects to use for generating probabilities. Must be supplied if sources are given

  • model_type

    (ModelType | None, default: None ) –

    Type of the model, one of ‘deduper’ or ‘linker’. Defaults to deduper. Ignored if left_testkit or right_testkit are provided.

  • n_true_entities

    (int | None, default: None ) –

    Base number of entities to generate when using default configs. Defaults to 10. Ignored if left_testkit or right_testkit are provided.

  • prob_range

    (tuple[float, float], default: (0.8, 1.0) ) –

    Range of probabilities to generate

  • seed

    (int, default: 42 ) –

    Random seed for reproducibility

Returns:

  • ModelTestkit

    A model testkit with generated data

Raises:

  • ValueError
    • If probabilities are not in increasing order or not between 0 and 1
    • If sources are provided without true entities
  • UserWarning

    If some arguments are ignored due to sources or true entities
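The first ValueError condition amounts to a bounds check on prob_range. A minimal sketch of such a check follows; the helper name `check_prob_range` is hypothetical, and treating equal bounds as valid is an assumption, since the documentation does not say whether the order must be strict.

```python
def check_prob_range(prob_range: tuple[float, float]) -> tuple[float, float]:
    """Validate a (min_prob, max_prob) pair (hypothetical helper)."""
    low, high = prob_range
    # Assumption: equal bounds are accepted; strictness is not documented
    if not 0 <= low <= high <= 1:
        raise ValueError(
            f"Probabilities must be increasing values between 0 and 1, got {prob_range}"
        )
    return low, high


check_prob_range((0.8, 1.0))  # passes
```

A range such as (0.9, 0.5) or (0.5, 1.2) would raise under this check.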

query_to_model_factory

query_to_model_factory(left_query: Query, left_data: Table, left_keys: dict[SourceResolutionName, str], true_entities: tuple[SourceEntity, ...], name: ModelResolutionName | None = None, description: str | None = None, right_query: Query | None = None, right_data: Table | None = None, right_keys: dict[SourceResolutionName, str] | None = None, prob_range: tuple[float, float] = (0.8, 1.0), seed: int = 42) -> ModelTestkit

Turns raw queries from Matchbox into ModelTestkits.

Parameters:

  • left_query

    (Query) –

    Query generating left data

  • left_data

    (Table) –

    PyArrow table with left query data

  • left_keys

    (dict[SourceResolutionName, str]) –

    Dictionary mapping source resolution names to key field names in left query

  • true_entities

    (tuple[SourceEntity, ...]) –

    Ground truth SourceEntity objects to use for generating probabilities

  • name

    (ModelResolutionName | None, default: None ) –

    Name of the model

  • description

    (str | None, default: None ) –

    Description of the model

  • right_query

    (Query | None, default: None ) –

    Query generating right data

  • right_data

    (Table | None, default: None ) –

    PyArrow table with right query data, if creating a linker

  • right_keys

    (dict[SourceResolutionName, str] | None, default: None ) –

    Dictionary mapping source resolution names to key field names in right query

  • prob_range

    (tuple[float, float], default: (0.8, 1.0) ) –

    Range of probabilities to generate

  • seed

    (int, default: 42 ) –

    Random seed for reproducibility

Returns: