Models

matchbox.common.factories.models

Factory functions for generating model testkits and data for testing.

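Classes:

  • MockDeduper

    Mock deduper that does nothing.

  • MockLinker

    Mock linker that does nothing.

  • ModelTestkit

    A testkit of data and metadata for a Model.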

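Functions:

  • component_report

    Fast reporting on connected components.

  • validate_components

    Validate that score edges create valid components.

  • calculate_min_max_edges

    Calculate min and max edges for a graph.

  • generate_dummy_scores

    Generate dummy scores data with guaranteed isolated components.

  • generate_entity_scores

    Generate scores that will recover entity relationships.

  • model_factory

    Generate a complete model testkit.

  • query_to_model_factory

    Turns raw queries from Matchbox into ModelTestkits.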

MockDeduper

Bases: Deduper



Mock deduper that does nothing.

Methods:

  • prepare

    Mock prepare method.

  • dedupe

    Mock dedupe method.

Attributes:

settings instance-attribute

settings: DeduperSettings

prepare

prepare(data: DataFrame) -> None

Mock prepare method.

dedupe

dedupe(data: DataFrame) -> DataFrame

Mock dedupe method.

MockLinker

Bases: Linker



Mock linker that does nothing.

Methods:

  • prepare

    Mock prepare method.

  • link

    Mock link method.

Attributes:

settings instance-attribute

settings: LinkerSettings

prepare

prepare(left: DataFrame, right: DataFrame) -> None

Mock prepare method.

link

link(left: DataFrame, right: DataFrame) -> DataFrame

Mock link method.

ModelTestkit

Bases: BaseModel



A testkit of data and metadata for a Model.

Methods:

  • init_query_lookup

    Initialise query lookup and derived entities using all model edges.

  • fake_run

    Set model results without running model.

  • into_dag

    Turn model into kwargs for dag.model(), detaching from original DAG.

Attributes:

model instance-attribute

model: Model

left_data instance-attribute

left_data: Table

left_query instance-attribute

left_query: Query

left_clusters instance-attribute

left_clusters: dict[int, ClusterEntity]

right_data instance-attribute

right_data: Table | None

right_query instance-attribute

right_query: Query | None

right_clusters instance-attribute

right_clusters: dict[int, ClusterEntity] | None

scores class-attribute instance-attribute

scores: DataFrame = Field(frozen=True)

name property

name: str

Return the full name of the Model.

path property

Return the model step path.

data property

data: Table

Return a PyArrow table in the same format as matchbox queries.

entities property

entities: tuple[ClusterEntity, ...]

ClusterEntities that were generated by the model.

init_query_lookup

init_query_lookup() -> ModelTestkit

Initialise query lookup and derived entities using all model edges.

fake_run

fake_run() -> Self

Set model results without running model.

into_dag

into_dag() -> dict

Turn model into kwargs for dag.model(), detaching from original DAG.

component_report

component_report(all_nodes: list[int], table: DataFrame) -> dict

Fast reporting on connected components.

Parameters:

  • all_nodes

    (list[int]) –

    list of identities of inputs being matched

  • table

    (DataFrame) –

    Polars dataframe matching SCHEMA_MODEL_EDGES

Returns:

  • dict

    dictionary containing basic component statistics
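
As a rough illustration of what a connected-components report involves, the sketch below runs a small union-find over plain (left, right) pairs. It does not use Polars or SCHEMA_MODEL_EDGES. The function name sketch_component_report and the keys of the returned dictionary are assumptions made for this example, not the actual output of component_report.

```python
from collections import Counter


def sketch_component_report(
    all_nodes: list[int], edges: list[tuple[int, int]]
) -> dict:
    """Hypothetical stand-in: summarise connected components with union-find."""
    # Every node starts as its own component. Edges must only use known nodes.
    parent = {node: node for node in all_nodes}

    def find(node: int) -> int:
        # Walk up to the root, shortening the path along the way.
        while parent[node] != node:
            parent[node] = parent[parent[node]]
            node = parent[node]
        return node

    for left, right in edges:
        root_left, root_right = find(left), find(right)
        if root_left != root_right:
            parent[root_right] = root_left

    sizes = Counter(find(node) for node in all_nodes)
    return {
        # Key names are illustrative only.
        "num_components": len(sizes),
        "largest_component": max(sizes.values(), default=0),
        "isolated_nodes": sum(1 for size in sizes.values() if size == 1),
    }


report = sketch_component_report([1, 2, 3, 4, 5], [(1, 2), (2, 3)])
```

Here nodes 1, 2 and 3 form one component, while 4 and 5 stay isolated because no edge touches them.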

validate_components

Validate that score edges create valid components.

Each component should be a subset of exactly one source entity.
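
The subset rule above can be sketched with plain Python sets. This is only an illustration under stated assumptions: records are represented as integers, components and source entities as sets, and the name sketch_components_valid is hypothetical. The real function's inputs and return value are not shown in this page.

```python
def sketch_components_valid(
    components: list[set[int]], source_entities: list[set[int]]
) -> bool:
    """Hypothetical check: each component sits inside exactly one source entity."""
    for component in components:
        # Collect every source entity that fully contains this component.
        containing = [entity for entity in source_entities if component <= entity]
        if len(containing) != 1:
            return False
    return True


truth = [{1, 2, 3}, {4, 5}]
ok = sketch_components_valid([{1, 2}, {4, 5}], truth)
bad = sketch_components_valid([{2, 4}], truth)  # mixes two source entities
```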

Parameters:

calculate_min_max_edges

calculate_min_max_edges(left_nodes: int, right_nodes: int, num_components: int, deduplicate: bool) -> tuple[int, int]

Calculate min and max edges for a graph.

Parameters:

  • left_nodes

    (int) –

    number of nodes in left source

  • right_nodes

    (int) –

    number of nodes in right source

  • num_components

    (int) –

    number of requested components

  • deduplicate

    (bool) –

    whether edges are for deduplication

Returns:

  • tuple[int, int]

    Two-tuple representing min and max edges
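
To make the bounds concrete, here is a minimal sketch of one way such limits could be derived. It assumes nodes are spread as evenly as possible across components, that the i-th component takes the i-th share from each side, and that only left_nodes matters when deduplicating. The names split_evenly and sketch_min_max_edges are hypothetical, and the library's own calculation may differ.

```python
def split_evenly(total: int, parts: int) -> list[int]:
    """Split `total` nodes into `parts` groups whose sizes differ by at most one."""
    base, extra = divmod(total, parts)
    return [base + 1 if i < extra else base for i in range(parts)]


def sketch_min_max_edges(
    left_nodes: int, right_nodes: int, num_components: int, deduplicate: bool
) -> tuple[int, int]:
    """Hypothetical bounds on edge counts for evenly sized components."""
    smallest_side = left_nodes if deduplicate else min(left_nodes, right_nodes)
    if not 1 <= num_components <= smallest_side:
        raise ValueError("each component needs at least one node from every input")

    left_sizes = split_evenly(left_nodes, num_components)
    if deduplicate:
        # A connected graph on n nodes needs n - 1 edges and allows n(n - 1) / 2.
        min_edges = sum(n - 1 for n in left_sizes)
        max_edges = sum(n * (n - 1) // 2 for n in left_sizes)
        return min_edges, max_edges

    right_sizes = split_evenly(right_nodes, num_components)
    pairs = list(zip(left_sizes, right_sizes))
    # A connected bipartite graph needs l + r - 1 edges and allows l * r.
    min_edges = sum(left + right - 1 for left, right in pairs)
    max_edges = sum(left * right for left, right in pairs)
    return min_edges, max_edges


link_bounds = sketch_min_max_edges(10, 10, 2, deduplicate=False)
dedupe_bounds = sketch_min_max_edges(6, 0, 2, deduplicate=True)
```

With two components of five left and five right nodes each, a spanning tree needs 9 edges per component and a complete bipartite graph has 25, which gives the link bounds of 18 and 50.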

generate_dummy_scores cached

generate_dummy_scores(left_values: tuple[int], right_values: tuple[int] | None, score_range: tuple[float, float], num_components: int, total_rows: int | None = None, seed: int = 42) -> DataFrame

Generate dummy scores data with guaranteed isolated components.

While much of the factory system uses generate_entity_scores, this function is still in use in PostgreSQL benchmarking, and has been designed to be performant at scale.

Parameters:

  • left_values

    (tuple[int]) –

    Tuple of integers to use for left column

  • right_values

    (tuple[int] | None) –

    Tuple of integers to use for right column. If None, assume we are generating scores for deduplication

  • score_range

    (tuple[float, float]) –

    Tuple of (min_score, max_score) to constrain scores

  • num_components

    (int) –

    Number of distinct connected components to generate

  • total_rows

    (int | None, default: None ) –

    Total number of rows to generate

  • seed

    (int, default: 42 ) –

    Random seed for reproducibility

Returns:

  • DataFrame

    Polars dataframe with ‘left_id’, ‘right_id’, and ‘score’ columns
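
One way to guarantee isolated components is to partition the ids first and only ever create edges inside a partition. The sketch below shows that idea with the standard library only. It returns a list of tuples rather than a Polars dataframe, ignores total_rows by emitting just the minimum number of edges per component, and uses round-robin partitioning. All of these choices, and the name sketch_dummy_scores, are assumptions for illustration rather than the behaviour of generate_dummy_scores.

```python
import random
from typing import Optional


def sketch_dummy_scores(
    left_values: tuple[int, ...],
    right_values: Optional[tuple[int, ...]],
    score_range: tuple[float, float],
    num_components: int,
    seed: int = 42,
) -> list[tuple[int, int, float]]:
    """Hypothetical generator of (left_id, right_id, score) rows."""
    rng = random.Random(seed)
    low, high = score_range
    smallest_side = (
        len(left_values)
        if right_values is None
        else min(len(left_values), len(right_values))
    )
    if not 1 <= num_components <= smallest_side:
        raise ValueError("each component needs at least one node from every input")

    rows = []
    for component in range(num_components):
        # Round-robin slices never share an id, so components stay isolated.
        lefts = left_values[component::num_components]
        if right_values is None:
            # Deduplication: chain the members of the component together.
            pairs = list(zip(lefts, lefts[1:]))
        else:
            # Linking: a spanning tree across the two sides of the component.
            rights = right_values[component::num_components]
            pairs = [(left, rights[0]) for left in lefts]
            pairs += [(lefts[0], right) for right in rights[1:]]
        rows.extend((left, right, rng.uniform(low, high)) for left, right in pairs)
    return rows


dedupe_rows = sketch_dummy_scores(tuple(range(6)), None, (0.8, 1.0), 2)
link_rows = sketch_dummy_scores((1, 2, 3, 4), (10, 11, 12), (0.8, 1.0), 2)
```

Because the random generator is seeded, calling the function twice with the same arguments yields identical rows.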

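generate_entity_scores

generate_entity_scores(left_entities: frozenset[ClusterEntity], right_entities: frozenset[ClusterEntity] | None, source_entities: frozenset[SourceEntity], score_range: tuple[float, float] = (0.8, 1.0), seed: int = 42) -> DataFrame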

Generate scores that will recover entity relationships.

Compares ClusterEntity objects against ground truth SourceEntities by checking whether their EntityReferences are subsets of the source entities. For now, it generates only fully connected, correct scores.

Parameters:

  • left_entities

    (frozenset[ClusterEntity]) –

    Set of ClusterEntity objects from left input

  • right_entities

    (frozenset[ClusterEntity] | None) –

    Set of ClusterEntity objects from right input. If None, assume we are deduplicating left_entities.

  • source_entities

    (frozenset[SourceEntity]) –

    Ground truth set of SourceEntities

  • score_range

    (tuple[float, float], default: (0.8, 1.0) ) –

    Range of scores to assign to matches. All matches will be assigned a random score in this range.

  • seed

    (int, default: 42 ) –

    Random seed for reproducibility

Returns:

  • DataFrame

    Polars dataframe with ‘left_id’, ‘right_id’, and ‘score’ columns
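
The subset comparison described above can be sketched without any Matchbox types. In this illustration, each entity is a frozenset of record keys indexed by an integer id, and two entities are scored as a match when the same ground truth set contains both. The name sketch_entity_scores, the plain-dict inputs, and the list-of-tuples output are all assumptions, not the library's data model.

```python
import random
from itertools import combinations, product
from typing import Optional


def sketch_entity_scores(
    left: dict[int, frozenset[str]],
    right: Optional[dict[int, frozenset[str]]],
    source_entities: tuple[frozenset[str], ...],
    score_range: tuple[float, float] = (0.8, 1.0),
    seed: int = 42,
) -> list[tuple[int, int, float]]:
    """Hypothetical scorer: pair entities that belong to the same true entity."""
    rng = random.Random(seed)
    low, high = score_range

    def truth_of(refs: frozenset[str]) -> Optional[int]:
        # Position of the source entity that contains every reference, if any.
        for index, entity in enumerate(source_entities):
            if refs <= entity:
                return index
        return None

    if right is None:
        # Deduplication: compare each left entity with every other one once.
        right = left
        candidates = combinations(sorted(left), 2)
    else:
        candidates = product(sorted(left), sorted(right))

    rows = []
    for left_id, right_id in candidates:
        left_truth = truth_of(left[left_id])
        if left_truth is not None and left_truth == truth_of(right[right_id]):
            rows.append((left_id, right_id, rng.uniform(low, high)))
    return rows


truth = (frozenset({"a1", "a2", "b1"}), frozenset({"a3", "b2"}))
left = {1: frozenset({"a1"}), 2: frozenset({"a2"}), 3: frozenset({"a3"})}
right = {10: frozenset({"b1"}), 11: frozenset({"b2"})}

dedupe_scores = sketch_entity_scores(left, None, truth)
link_scores = sketch_entity_scores(left, right, truth)
```

Every pair inside a true entity receives a score, which is what makes the result fully connected.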

model_factory

model_factory(name: ModelStepName | None = None, dag: DAG | None = None, description: str | None = None, left_testkit: SourceTestkit | ResolverTestkit | None = None, right_testkit: SourceTestkit | ResolverTestkit | None = None, true_entities: tuple[SourceEntity, ...] | None = None, model_type: ModelType | None = None, n_true_entities: int | None = None, score_range: tuple[float, float] = (0.8, 1.0), seed: int = 42) -> ModelTestkit

Generate a complete model testkit.

Allows autoconfiguration with minimal settings, or more nuanced control.

Can either be used to generate a model in a pipeline, interconnected with existing testkit objects, or generate a standalone model with random data.

Parameters:

  • name

    (ModelStepName | None, default: None ) –

    Name of the model. Defaults to a randomly generated word suffixed with ‘_model’.

  • dag

    (DAG | None, default: None ) –

    DAG containing this model. Overridden by dag of left testkit if present.

  • description

    (str | None, default: None ) –

    Description of the model

  • left_testkit

    (SourceTestkit | ResolverTestkit | None, default: None ) –

    A SourceTestkit or ResolverTestkit for the left source

  • right_testkit

    (SourceTestkit | ResolverTestkit | None, default: None ) –

    If creating a linker, a SourceTestkit or ResolverTestkit for the right source

  • true_entities

    (tuple[SourceEntity, ...] | None, default: None ) –

    Ground truth SourceEntity objects to use for generating scores. Must be supplied if sources are given

  • model_type

    (ModelType | None, default: None ) –

    Type of the model, one of ‘deduper’ or ‘linker’. Defaults to deduper. Ignored if left_testkit or right_testkit are provided.

  • n_true_entities

    (int | None, default: None ) –

    Base number of entities to generate when using default configs. Defaults to 10. Ignored if left_testkit or right_testkit are provided.

  • score_range

    (tuple[float, float], default: (0.8, 1.0) ) –

    Range of scores to generate

  • seed

    (int, default: 42 ) –

    Random seed for reproducibility

Returns:

  • ModelTestkit ( ModelTestkit ) –

    A model testkit with generated data

Raises:

  • ValueError
    • If score_range is not in increasing order or does not lie between 0 and 1
    • If sources are provided without true entities
  • UserWarning

    If some arguments are ignored due to sources or true entities
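
The score_range condition listed under Raises could be checked along the following lines. This is a sketch of the rule as documented, not the library's code: the helper name check_score_range is hypothetical, and treating "increasing" as strictly increasing is an assumption.

```python
def check_score_range(score_range: tuple[float, float]) -> None:
    """Hypothetical validator mirroring the documented ValueError condition."""
    low, high = score_range
    # Assumption: the lower bound must be strictly below the upper bound.
    if not 0 <= low < high <= 1:
        raise ValueError(
            f"score_range must be increasing and within [0, 1], got {score_range}"
        )


check_score_range((0.8, 1.0))  # the documented default passes

try:
    check_score_range((0.9, 0.5))
    rejected = False
except ValueError:
    rejected = True
```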

query_to_model_factory

query_to_model_factory(left_query: Query, left_data: Table, left_keys: dict[SourceStepName, str], true_entities: tuple[SourceEntity, ...], name: ModelStepName | None = None, description: str | None = None, right_query: Query | None = None, right_data: Table | None = None, right_keys: dict[SourceStepName, str] | None = None, score_range: tuple[float, float] = (0.8, 1.0), seed: int = 42) -> ModelTestkit

Turns raw queries from Matchbox into ModelTestkits.

Parameters:

  • left_query

    (Query) –

    Query generating left data

  • left_data

    (Table) –

    PyArrow table with left query data

  • left_keys

    (dict[SourceStepName, str]) –

    Dictionary mapping source step names to key field names in left query

  • true_entities

    (tuple[SourceEntity, ...]) –

    Ground truth SourceEntity objects to use for generating scores

  • name

    (ModelStepName | None, default: None ) –

    Name of the model. Defaults to a randomly generated word suffixed with ‘_model’.

  • description

    (str | None, default: None ) –

    Description of the model

  • right_query

    (Query | None, default: None ) –

    Query generating right data

  • right_data

    (Table | None, default: None ) –

    PyArrow table with right query data, if creating a linker

  • right_keys

    (dict[SourceStepName, str] | None, default: None ) –

    Dictionary mapping source step names to key field names in right query

  • score_range

    (tuple[float, float], default: (0.8, 1.0) ) –

    Range of scores to generate

  • seed

    (int, default: 42 ) –

    Random seed for reproducibility

Returns: