Models

matchbox.common.factories.models

Factory functions for generating model testkits and data for testing.

Classes:

- MockDeduper – Mock deduper that does nothing.
- MockLinker – Mock linker that does nothing.
- ModelTestkit – A testkit of data and metadata for a Model.

Functions:

- component_report – Fast reporting on connected components.
- validate_components – Validate that score edges create valid components.
- calculate_min_max_edges – Calculate min and max edges for a graph.
- generate_dummy_scores – Generate dummy Arrow scores data with guaranteed isolated components.
- generate_entity_scores – Generate scores that will recover entity relationships.
- model_factory – Generate a complete model testkit.
- query_to_model_factory – Turn raw queries from Matchbox into ModelTestkits.
MockDeduper

Bases: Deduper
Mock deduper that does nothing.
MockLinker

Bases: Linker
Mock linker that does nothing.
ModelTestkit

Bases: BaseModel
A testkit of data and metadata for a Model.
Methods:

- init_query_lookup – Initialise query lookup and derived entities using all model edges.
- fake_run – Set model results without running the model.
- into_dag – Turn the model into kwargs for dag.model(), detaching it from the original DAG.

Attributes:

- model (Model)
- left_data (Table)
- left_query (Query)
- left_clusters (dict[int, ClusterEntity])
- right_data (Table | None)
- right_query (Query | None)
- right_clusters (dict[int, ClusterEntity] | None)
- scores (DataFrame)
- name (str) – Return the full name of the Model.
- path (ModelStepPath) – Return the model step path.
- data (Table) – Return a PyArrow table in the same format as matchbox queries.
- entities (tuple[ClusterEntity, ...]) – ClusterEntities that were generated by the model.
entities
property

entities: tuple[ClusterEntity, ...]

ClusterEntities that were generated by the model.
init_query_lookup

init_query_lookup() -> ModelTestkit

Initialise query lookup and derived entities using all model edges.
component_report

Fast reporting on connected components.
validate_components

validate_components(edges: list[tuple[int, int]], entities: set[ClusterEntity], source_entities: set[SourceEntity]) -> bool

Validate that score edges create valid components.

Each component should be a subset of exactly one source entity.
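The check above can be sketched with a union-find over the score edges. This is a stdlib-only illustration of the rule, not the library's implementation: nodes are plain ints and `source_lookup` (a hypothetical stand-in for the SourceEntity objects) maps each node to its true source entity.

```python
def validate_components_sketch(
    edges: list[tuple[int, int]],
    source_lookup: dict[int, str],
) -> bool:
    """Sketch: every connected component must stay within one source entity."""
    parent: dict[int, int] = {}

    def find(x: int) -> int:
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    # Union the endpoints of every score edge.
    for a, b in edges:
        parent[find(a)] = find(b)

    # A component is invalid if two of its nodes belong to different sources.
    component_source: dict[int, str] = {}
    for node, source in source_lookup.items():
        root = find(node)
        if component_source.setdefault(root, source) != source:
            return False
    return True
```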
calculate_min_max_edges

calculate_min_max_edges(left_nodes: int, right_nodes: int, num_components: int, deduplicate: bool) -> tuple[int, int]

Calculate min and max edges for a graph.
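One plausible way to derive these bounds, assuming nodes split evenly across components (a sketch only; the real function may distribute nodes differently): each component needs a spanning tree's worth of edges at minimum, and a complete (or complete bipartite) graph's worth at maximum.

```python
def calculate_min_max_edges_sketch(
    left_nodes: int, right_nodes: int, num_components: int, deduplicate: bool
) -> tuple[int, int]:
    """Sketch of edge-count bounds under an even split into components."""
    if deduplicate:
        # Deduplication: edges within a single node set; right_nodes unused.
        size = left_nodes // num_components
        min_edges = (size - 1) * num_components              # spanning tree each
        max_edges = size * (size - 1) // 2 * num_components  # complete graph each
    else:
        # Linking: bipartite edges between left and right node sets.
        left = left_nodes // num_components
        right = right_nodes // num_components
        min_edges = (left + right - 1) * num_components  # spanning tree each
        max_edges = left * right * num_components        # complete bipartite each
    return min_edges, max_edges
```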
generate_dummy_scores
cached

generate_dummy_scores(left_values: tuple[int], right_values: tuple[int] | None, score_range: tuple[float, float], num_components: int, total_rows: int | None = None, seed: int = 42) -> DataFrame

Generate dummy Arrow scores data with guaranteed isolated components.

While much of the factory system uses generate_entity_scores, this function is still used in PostgreSQL benchmarking and has been designed to be performant at scale.

Parameters:

- left_values (tuple[int]) – Tuple of integers to use for the left column
- right_values (tuple[int] | None) – Tuple of integers to use for the right column. If None, assume we are generating scores for deduplication
- score_range (tuple[float, float]) – Tuple of (min_score, max_score) to constrain scores
- num_components (int) – Number of distinct connected components to generate
- total_rows (int | None, default: None) – Total number of rows to generate
- seed (int, default: 42) – Random seed for reproducibility
- Returns:

- DataFrame – Polars dataframe with ‘left_id’, ‘right_id’, and ‘score’ columns
generate_entity_scores

generate_entity_scores(left_entities: frozenset[ClusterEntity], right_entities: frozenset[ClusterEntity] | None, source_entities: frozenset[SourceEntity], score_range: tuple[float, float] = (0.8, 1.0), seed: int = 42) -> DataFrame

Generate scores that will recover entity relationships.

Compares ClusterEntity objects against ground truth SourceEntities by checking whether their EntityReferences are subsets of the source entities. Initially focused on generating fully connected, correct scores only.

Parameters:

- left_entities (frozenset[ClusterEntity]) – Set of ClusterEntity objects from the left input
- right_entities (frozenset[ClusterEntity] | None) – Set of ClusterEntity objects from the right input. If None, assume we are deduplicating left_entities.
- source_entities (frozenset[SourceEntity]) – Ground truth set of SourceEntities
- score_range (tuple[float, float], default: (0.8, 1.0)) – Range of scores to assign to matches. All matches will be assigned a random score in this range.
- seed (int, default: 42) – Random seed for reproducibility

Returns:

- DataFrame – PyArrow Table with ‘left_id’, ‘right_id’, and ‘score’ columns
model_factory

model_factory(name: ModelStepName | None = None, dag: DAG | None = None, description: str | None = None, left_testkit: SourceTestkit | ResolverTestkit | None = None, right_testkit: SourceTestkit | ResolverTestkit | None = None, true_entities: tuple[SourceEntity, ...] | None = None, model_type: ModelType | None = None, n_true_entities: int | None = None, score_range: tuple[float, float] = (0.8, 1.0), seed: int = 42) -> ModelTestkit

Generate a complete model testkit.

Allows autoconfiguration with minimal settings, or more nuanced control. Can either be used to generate a model in a pipeline, interconnected with existing testkit objects, or to generate a standalone model with random data.

Parameters:

- name (ModelStepName | None, default: None) – Name of the model. Defaults to a randomly generated word suffixed with ‘_model’.
- dag (DAG | None, default: None) – DAG containing this model. Overridden by the DAG of the left testkit if present.
- description (str | None, default: None) – Description of the model
- left_testkit (SourceTestkit | ResolverTestkit | None, default: None) – A SourceTestkit or ResolverTestkit for the left source
- right_testkit (SourceTestkit | ResolverTestkit | None, default: None) – If creating a linker, a SourceTestkit or ResolverTestkit for the right source
- true_entities (tuple[SourceEntity, ...] | None, default: None) – Ground truth SourceEntity objects to use for generating scores. Must be supplied if sources are given
- model_type (ModelType | None, default: None) – Type of the model, one of ‘deduper’ or ‘linker’. Defaults to ‘deduper’. Ignored if left_testkit or right_testkit are provided.
- n_true_entities (int | None, default: None) – Base number of entities to generate when using default configs. Defaults to 10. Ignored if left_testkit or right_testkit are provided.
- score_range (tuple[float, float], default: (0.8, 1.0)) – Range of scores to generate
- seed (int, default: 42) – Random seed for reproducibility

Returns:

- ModelTestkit – A model testkit with generated data

Raises:

- ValueError – If scores are not in increasing order and between 0 and 1, or if sources are provided without true entities
- UserWarning – If some arguments are ignored due to sources or true entities
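The documented score_range ValueError can be reproduced in isolation with a hypothetical helper (a sketch of the rule as stated above, not the factory's actual validation code):

```python
def validate_score_range(score_range: tuple[float, float]) -> None:
    """Sketch: scores must be in increasing order and between 0 and 1."""
    low, high = score_range
    if not (0 <= low <= high <= 1):
        raise ValueError(
            f"score_range must be increasing and within [0, 1], got {score_range}"
        )
```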
query_to_model_factory

query_to_model_factory(left_query: Query, left_data: Table, left_keys: dict[SourceStepName, str], true_entities: tuple[SourceEntity, ...], name: ModelStepName | None = None, description: str | None = None, right_query: Query | None = None, right_data: Table | None = None, right_keys: dict[SourceStepName, str] | None = None, score_range: tuple[float, float] = (0.8, 1.0), seed: int = 42) -> ModelTestkit

Turns raw queries from Matchbox into ModelTestkits.

Parameters:

- left_query (Query) – Query generating the left data
- left_data (Table) – PyArrow table with the left query data
- left_keys (dict[SourceStepName, str]) – Dictionary mapping source step names to key field names in the left query
- true_entities (tuple[SourceEntity, ...]) – Ground truth SourceEntity objects to use for generating scores
- name (ModelStepName | None, default: None) – Name of the model. Defaults to a randomly generated word suffixed with ‘_model’.
- description (str | None, default: None) – Description of the model
- right_query (Query | None, default: None) – Query generating the right data
- right_data (Table | None, default: None) – PyArrow table with the right query data, if creating a linker
- right_keys (dict[SourceStepName, str] | None, default: None) – Dictionary mapping source step names to key field names in the right query
- score_range (tuple[float, float], default: (0.8, 1.0)) – Range of scores to generate
- seed (int, default: 42) – Random seed for reproducibility

Returns:

- ModelTestkit – ModelTestkit with the processed data
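How a left_keys-style mapping is consumed can be sketched over plain row dicts (hypothetical names, stdlib-only; the real function reads PyArrow tables): each source step name points at the column holding that source's key values.

```python
def keys_by_source_sketch(
    rows: list[dict[str, str]],
    keys: dict[str, str],
) -> dict[str, set[str]]:
    """Sketch: collect the key values present per source step."""
    found: dict[str, set[str]] = {name: set() for name in keys}
    for row in rows:
        for source_name, key_field in keys.items():
            value = row.get(key_field)
            if value is not None:
                found[source_name].add(value)
    return found
```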