Models

matchbox.common.factories.models

Factory functions for generating model testkits and data for testing.

Classes:

  • ModelTestkit

    A testkit of data and metadata for a Model.

Functions:

ModelTestkit

Bases: BaseModel

A testkit of data and metadata for a Model.

Methods:

Attributes:

model instance-attribute

model: Model

left_query instance-attribute

left_query: Table

left_clusters instance-attribute

left_clusters: dict[int, ClusterEntity]

right_query instance-attribute

right_query: Table | None

right_clusters instance-attribute

right_clusters: dict[int, ClusterEntity] | None

probabilities instance-attribute

probabilities: Table

name property

name: str

Return the full name of the Model.

entities property writable

entities: tuple[ClusterEntity, ...]

ClusterEntities that were generated by the model.

threshold property writable

threshold: int

Threshold for the model.

mock property

mock: Mock

Create a mock Model object with this testkit’s configuration.

query property

query: Table

Return a PyArrow table in the same format as matchbox.query().

init_query_lookup

init_query_lookup() -> ModelTestkit

Initialize the query lookup table.

component_report

component_report(
    all_nodes: list[Any], table: Table
) -> dict

Fast reporting on connected components using rustworkx.

Parameters:

  • all_nodes

    (list[Any]) –

    list of identities of inputs being matched

  • table

    (Table) –

    PyArrow table with ‘left’, ‘right’ columns

Returns:

  • dict

    dictionary containing basic component statistics
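The kind of summary described here can be illustrated without rustworkx. The snippet below is a minimal sketch, not the library's implementation: it swaps rustworkx for a plain union-find, takes edges as (left, right) tuples rather than a PyArrow table, and both the function name and the keys of the returned dictionary are assumptions made for illustration.

```python
from collections import Counter
from collections.abc import Hashable
from typing import Any


def sketch_component_report(
    all_nodes: list[Hashable], edges: list[tuple[Hashable, Hashable]]
) -> dict[str, Any]:
    """Summarise connected components (hypothetical helper, stdlib only)."""
    # Every node starts as the root of its own component
    parent = {node: node for node in all_nodes}

    def find(node: Hashable) -> Hashable:
        # Walk up to the root, shortening the path along the way
        while parent[node] != node:
            parent[node] = parent[parent[node]]
            node = parent[node]
        return node

    # Merge the components joined by each edge
    for left, right in edges:
        root_left, root_right = find(left), find(right)
        if root_left != root_right:
            parent[root_right] = root_left

    # Count how many nodes share each root
    sizes = Counter(find(node) for node in all_nodes)
    return {
        "num_components": len(sizes),
        "total_nodes": len(all_nodes),
        "total_edges": len(edges),
        "min_component_size": min(sizes.values(), default=0),
        "max_component_size": max(sizes.values(), default=0),
    }


report = sketch_component_report(
    all_nodes=[1, 2, 3, 4, 5],
    edges=[(1, 2), (2, 3), (4, 5)],
)
```

Passing all_nodes separately from the edges means that inputs which never appear in an edge are still counted, each as a component of size one.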

validate_components

Validate that probability edges create valid components.

Each component should be a subset of exactly one source entity.
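The subset rule can be sketched with plain sets. This is an illustration under assumptions, not the library's code: components and source entities are modelled as sets of integer IDs, and the function name and its boolean return value are hypothetical.

```python
def sketch_validate_components(
    components: list[set[int]], source_entities: list[set[int]]
) -> bool:
    """Return True if every component sits inside exactly one source entity."""
    for component in components:
        # Source entities that fully contain this component
        containing = [entity for entity in source_entities if component <= entity]
        if len(containing) != 1:
            return False
    return True


entities = [{1, 2, 3}, {4, 5}]

# Each component is contained in a single source entity
valid = sketch_validate_components([{1, 2}, {4, 5}], entities)

# The component {3, 4} straddles two source entities
invalid = sketch_validate_components([{3, 4}], entities)
```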

Parameters:

calculate_min_max_edges

calculate_min_max_edges(
    left_nodes: int,
    right_nodes: int,
    num_components: int,
    deduplicate: bool,
) -> tuple[int, int]

Calculate min and max edges for a graph.

Parameters:

  • left_nodes

    (int) –

    number of nodes in left source

  • right_nodes

    (int) –

    number of nodes in right source

  • num_components

    (int) –

    number of requested components

  • deduplicate

    (bool) –

    whether edges are for deduplication

Returns:

  • tuple[int, int]

    Two-tuple representing min and max edges
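One way to reason about these bounds is sketched below. It is not taken from the library's implementation and rests on stated assumptions: nodes are spread as evenly as possible across components, a deduplication component of size s needs at least s - 1 edges and at most s(s - 1)/2, and a linking component with l left and r right nodes needs at least l + r - 1 edges and at most l * r. The function name is hypothetical.

```python
def sketch_min_max_edges(
    left_nodes: int, right_nodes: int, num_components: int, deduplicate: bool
) -> tuple[int, int]:
    """Edge bounds under an even split of nodes (illustrative assumption)."""
    smallest_side = left_nodes if deduplicate else min(left_nodes, right_nodes)
    if num_components < 1 or num_components > smallest_side:
        raise ValueError("Each component needs at least one node per side")

    def split(total: int) -> list[int]:
        # Spread nodes evenly; the first `extra` components get one more
        base, extra = divmod(total, num_components)
        return [base + 1 if i < extra else base for i in range(num_components)]

    min_edges, max_edges = 0, 0
    if deduplicate:
        for size in split(left_nodes):
            min_edges += size - 1  # spanning tree
            max_edges += size * (size - 1) // 2  # complete graph
    else:
        for n_left, n_right in zip(split(left_nodes), split(right_nodes)):
            min_edges += n_left + n_right - 1  # spanning tree
            max_edges += n_left * n_right  # complete bipartite graph
    return min_edges, max_edges


dedupe_bounds = sketch_min_max_edges(10, 0, num_components=2, deduplicate=True)
link_bounds = sketch_min_max_edges(4, 6, num_components=2, deduplicate=False)
```

Under these assumptions, ten nodes deduplicated into two components give bounds of (8, 20), and linking four left nodes to six right nodes in two components gives (8, 12).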

generate_dummy_probabilities cached

generate_dummy_probabilities(
    left_values: tuple[int],
    right_values: tuple[int] | None,
    prob_range: tuple[float, float],
    num_components: int,
    total_rows: int | None = None,
    seed: int = 42,
) -> Table

Generate dummy Arrow probabilities data with guaranteed isolated components.

While much of the factory system uses generate_entity_probabilities, this function is still used in PostgreSQL benchmarking and is designed to perform well at scale.

Parameters:

  • left_values

    (tuple[int]) –

    Tuple of integers to use for left column

  • right_values

    (tuple[int] | None) –

    Tuple of integers to use for right column. If None, assume we are generating probabilities for deduplication

  • prob_range

    (tuple[float, float]) –

    Tuple of (min_prob, max_prob) to constrain probabilities

  • num_components

    (int) –

    Number of distinct connected components to generate

  • total_rows

    (int | None, default: None ) –

    Total number of rows to generate

  • seed

    (int, default: 42 ) –

    Random seed for reproducibility

Returns:

  • Table

    PyArrow Table with ‘left_id’, ‘right_id’, and ‘probability’ columns
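The "cached" label and the tuple-typed left_values and right_values plausibly go together: caches from Python's functools key on argument hashes, so a list argument would raise a TypeError. The sketch below shows that behaviour with a hypothetical stand-in function. It does not reproduce the real generator, and the assumption that the library uses a functools cache is inferred from the label only.

```python
import random
from functools import lru_cache


@lru_cache(maxsize=None)
def sketch_cached_pairs(
    left_values: tuple[int, ...], seed: int = 42
) -> tuple[tuple[int, int], ...]:
    """Pair up shuffled values (hypothetical stand-in for an expensive call)."""
    rng = random.Random(seed)
    shuffled = list(left_values)
    rng.shuffle(shuffled)
    # Pair neighbours: (0, 1), (2, 3), ...
    return tuple(zip(shuffled[::2], shuffled[1::2]))


first = sketch_cached_pairs((1, 2, 3, 4))
second = sketch_cached_pairs((1, 2, 3, 4))  # served from the cache
cache_hits = sketch_cached_pairs.cache_info().hits

try:
    sketch_cached_pairs([1, 2, 3, 4])  # lists are unhashable
    list_accepted = True
except TypeError:
    list_accepted = False
```

The second call returns the cached object without recomputing it, while the list argument is rejected before the function body runs.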

generate_entity_probabilities

generate_entity_probabilities(
    left_entities: frozenset[ClusterEntity],
    right_entities: frozenset[ClusterEntity] | None,
    source_entities: frozenset[SourceEntity],
    prob_range: tuple[float, float] = (0.8, 1.0),
    seed: int = 42,
) -> Table

Generate probabilities that will recover entity relationships.

Compares ClusterEntity objects against ground truth SourceEntities by checking whether their EntityReferences are subsets of the source entities. For now, it generates only fully connected, correct probabilities.

Parameters:

  • left_entities

    (frozenset[ClusterEntity]) –

    Set of ClusterEntity objects from left input

  • right_entities

    (frozenset[ClusterEntity] | None) –

    Set of ClusterEntity objects from right input. If None, assume we are deduplicating left_entities.

  • source_entities

    (frozenset[SourceEntity]) –

    Ground truth set of SourceEntities

  • prob_range

    (tuple[float, float], default: (0.8, 1.0) ) –

    Range of probabilities to assign to matches. All matches will be assigned a random probability in this range.

  • seed

    (int, default: 42 ) –

    Random seed for reproducibility

Returns:

  • Table

    PyArrow Table with ‘left_id’, ‘right_id’, and ‘probability’ columns

model_factory

model_factory(
    name: str | None = None,
    description: str | None = None,
    left_testkit: SourceTestkit
    | ModelTestkit
    | None = None,
    right_testkit: SourceTestkit
    | ModelTestkit
    | None = None,
    true_entities: tuple[SourceEntity, ...] | None = None,
    model_type: Literal["deduper", "linker"] | None = None,
    n_true_entities: int | None = None,
    prob_range: tuple[float, float] = (0.8, 1.0),
    seed: int = 42,
) -> ModelTestkit

Generate a complete model testkit.

Allows autoconfiguration with minimal settings, or more nuanced control.

Can either be used to generate a model in a pipeline, interconnected with existing SourceTestkit or ModelTestkit objects, or generate a standalone model with random data.

Parameters:

  • name

    (str | None, default: None ) –

    Name of the model

  • description

    (str | None, default: None ) –

    Description of the model

  • left_testkit

    (SourceTestkit | ModelTestkit | None, default: None ) –

    A SourceTestkit or ModelTestkit for the left source

  • right_testkit

    (SourceTestkit | ModelTestkit | None, default: None ) –

    If creating a linker, a SourceTestkit or ModelTestkit for the right source

  • true_entities

    (tuple[SourceEntity, ...] | None, default: None ) –

    Ground truth SourceEntity objects to use for generating probabilities. Must be supplied if sources are given

  • model_type

    (Literal['deduper', 'linker'] | None, default: None ) –

    Type of the model, one of ‘deduper’ or ‘linker’. Defaults to deduper. Ignored if left_testkit or right_testkit are provided.

  • n_true_entities

    (int | None, default: None ) –

    Base number of entities to generate when using default configs. Defaults to 10. Ignored if left_testkit or right_testkit are provided.

  • prob_range

    (tuple[float, float], default: (0.8, 1.0) ) –

    Range of probabilities to generate

  • seed

    (int, default: 42 ) –

    Random seed for reproducibility

Returns:

  • ModelTestkit

    A model testkit with generated data

Raises:

  • ValueError
    • If probabilities are not in increasing order and between 0 and 1
    • If sources are provided without true entities
  • UserWarning

    If some arguments are ignored due to sources or true entities
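The first ValueError condition can be sketched as a small check on prob_range. This is an assumption about how such a check might look, not the library's code: the helper name is hypothetical, and treating equal bounds as acceptable is a choice made here, since the text above does not say whether "increasing" is strict.

```python
def sketch_check_prob_range(prob_range: tuple[float, float]) -> None:
    """Raise ValueError unless 0 <= low <= high <= 1 (illustrative check)."""
    low, high = prob_range
    if not 0 <= low <= high <= 1:
        raise ValueError(
            f"prob_range must be increasing and within [0, 1], got {prob_range}"
        )


def is_valid_prob_range(prob_range: tuple[float, float]) -> bool:
    """Convenience wrapper that reports validity instead of raising."""
    try:
        sketch_check_prob_range(prob_range)
    except ValueError:
        return False
    return True


default_ok = is_valid_prob_range((0.8, 1.0))
reversed_ok = is_valid_prob_range((1.0, 0.8))
out_of_bounds_ok = is_valid_prob_range((0.5, 1.2))
```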

query_to_model_factory

query_to_model_factory(
    left_resolution: str,
    left_query: Table,
    left_source_pks: dict[str, str],
    true_entities: tuple[SourceEntity, ...],
    name: str | None = None,
    description: str | None = None,
    right_resolution: str | None = None,
    right_query: Table | None = None,
    right_source_pks: dict[str, str] | None = None,
    prob_range: tuple[float, float] = (0.8, 1.0),
    seed: int = 42,
) -> ModelTestkit

Turns raw queries from Matchbox into ModelTestkits.

Parameters:

  • left_resolution

    (str) –

    Name of the resolution used for the left query

  • left_query

    (Table) –

    PyArrow table with left query data

  • left_source_pks

    (dict[str, str]) –

    Dictionary mapping source names to primary key column names in left query

  • true_entities

    (tuple[SourceEntity, ...]) –

    Ground truth SourceEntity objects to use for generating probabilities

  • name

    (str | None, default: None ) –

    Name of the model

  • description

    (str | None, default: None ) –

    Description of the model

  • right_resolution

    (str | None, default: None ) –

    Name of the resolution used for the right query

  • right_query

    (Table | None, default: None ) –

    PyArrow table with right query data, if creating a linker

  • right_source_pks

    (dict[str, str] | None, default: None ) –

    Dictionary mapping source names to primary key column names in right query

  • prob_range

    (tuple[float, float], default: (0.8, 1.0) ) –

    Range of probabilities to generate

  • seed

    (int, default: 42 ) –

    Random seed for reproducibility

Returns: