Models
matchbox.common.factories.models
Factory functions for generating model testkits and data for testing.
Classes:
- ModelTestkit – A testkit of data and metadata for a Model.
Functions:
- component_report – Fast reporting on connected components using rustworkx.
- validate_components – Validate that probability edges create valid components.
- calculate_min_max_edges – Calculate min and max edges for a graph.
- generate_dummy_probabilities – Generate dummy Arrow probabilities data with guaranteed isolated components.
- generate_entity_probabilities – Generate probabilities that will recover entity relationships.
- model_factory – Generate a complete model testkit.
- query_to_model_factory – Turns raw queries from Matchbox into ModelTestkits.
ModelTestkit
Bases: BaseModel
A testkit of data and metadata for a Model.
Methods:
- init_query_lookup – Initialize the query lookup table.
Attributes:
- model (Model)
- left_query (Table)
- left_clusters (dict[int, ClusterEntity])
- right_query (Table | None)
- right_clusters (dict[int, ClusterEntity] | None)
- probabilities (Table)
- name (str) – Return the full name of the Model.
- entities (tuple[ClusterEntity, ...]) – ClusterEntities that were generated by the model.
- threshold (int) – Threshold for the model.
- mock (Mock) – Create a mock Model object with this testkit’s configuration.
- query (Table) – Return a PyArrow table in the same format as matchbox.query().
entities
property
writable
entities: tuple[ClusterEntity, ...]
ClusterEntities that were generated by the model.
component_report
validate_components
validate_components(
edges: list[tuple[int, int]],
entities: set[ClusterEntity],
source_entities: set[SourceEntity],
) -> bool
Validate that probability edges create valid components.
Each component should be a subset of exactly one source entity.
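The rule above can be illustrated without any Matchbox types. What follows is a minimal sketch, not the library's implementation: it assumes each entity is reduced to an integer ID, that `source_members` (a hypothetical stand-in for the source entities) lists the set of IDs belonging to each ground-truth entity, and it uses a small union-find where the real module relies on rustworkx.

```python
def components_are_valid(
    edges: list[tuple[int, int]],
    source_members: list[set[int]],
) -> bool:
    """Return True if every connected component sits inside exactly one source entity."""
    parent: dict[int, int] = {}

    def find(node: int) -> int:
        # Register unseen nodes as their own root, then walk up to the root.
        parent.setdefault(node, node)
        while parent[node] != node:
            parent[node] = parent[parent[node]]  # path halving
            node = parent[node]
        return node

    # Merge the two endpoints of every edge into one component.
    for left, right in edges:
        parent[find(left)] = find(right)

    # Group nodes by their root to recover the components.
    components: dict[int, set[int]] = {}
    for node in list(parent):
        components.setdefault(find(node), set()).add(node)

    # Each component must be a subset of exactly one ground-truth entity.
    return all(
        sum(component <= members for members in source_members) == 1
        for component in components.values()
    )
```

An edge that joins IDs from two different ground-truth entities produces a component that is a subset of neither, so the check fails.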
Parameters:
calculate_min_max_edges
calculate_min_max_edges(
left_nodes: int,
right_nodes: int,
num_components: int,
deduplicate: bool,
) -> tuple[int, int]
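The page gives no formula for this function, so the following is only a sketch of one plausible reading of the signature, not the library's code. It assumes nodes are spread as evenly as possible across components, that the minimum is a spanning tree per component, and that the maximum is a complete graph (deduplication) or a complete bipartite graph (linking) per component. The name `min_max_edges_sketch` is hypothetical.

```python
def min_max_edges_sketch(
    left_nodes: int,
    right_nodes: int,
    num_components: int,
    deduplicate: bool,
) -> tuple[int, int]:
    """Estimate the fewest and most edges that keep num_components connected."""

    def split(total: int, parts: int) -> list[int]:
        # Spread `total` nodes over `parts` components as evenly as possible.
        base, extra = divmod(total, parts)
        return [base + 1 if i < extra else base for i in range(parts)]

    left_sizes = split(left_nodes, num_components)

    if deduplicate:
        # Only left nodes exist; edges run between nodes of the same side.
        min_edges = sum(max(n - 1, 0) for n in left_sizes)
        max_edges = sum(n * (n - 1) // 2 for n in left_sizes)
        return min_edges, max_edges

    # Linking: every edge joins one left node to one right node.
    # Assumes each component receives at least one node from each side.
    right_sizes = split(right_nodes, num_components)
    pairs = list(zip(left_sizes, right_sizes))
    min_edges = sum(max(l + r - 1, 0) for l, r in pairs)
    max_edges = sum(l * r for l, r in pairs)
    return min_edges, max_edges
```

Under these assumptions, bounds like these tell a generator whether a requested row count can be met without merging or splitting components.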
generate_dummy_probabilities
cached
generate_dummy_probabilities(
left_values: tuple[int],
right_values: tuple[int] | None,
prob_range: tuple[float, float],
num_components: int,
total_rows: int | None = None,
seed: int = 42,
) -> Table
Generate dummy Arrow probabilities data with guaranteed isolated components.
While much of the factory system uses generate_entity_probabilities, this function is still in use in PostgreSQL benchmarking, and has been designed to be performant at scale.
Parameters:
- left_values (tuple[int]) – Tuple of integers to use for left column
- right_values (tuple[int] | None) – Tuple of integers to use for right column. If None, assume we are generating probabilities for deduplication
- prob_range (tuple[float, float]) – Tuple of (min_prob, max_prob) to constrain probabilities
- num_components (int) – Number of distinct connected components to generate
- total_rows (int | None, default: None) – Total number of rows to generate
- seed (int, default: 42) – Random seed for reproducibility
Returns:
- Table – PyArrow Table with ‘left_id’, ‘right_id’, and ‘probability’ columns
generate_entity_probabilities
generate_entity_probabilities(
left_entities: frozenset[ClusterEntity],
right_entities: frozenset[ClusterEntity] | None,
source_entities: frozenset[SourceEntity],
prob_range: tuple[float, float] = (0.8, 1.0),
seed: int = 42,
) -> Table
Generate probabilities that will recover entity relationships.
Compares ClusterEntity objects against ground truth SourceEntities by checking whether their EntityReferences are subsets of the source entities. For now, it generates only fully connected, correct probabilities.
Parameters:
- left_entities (frozenset[ClusterEntity]) – Set of ClusterEntity objects from left input
- right_entities (frozenset[ClusterEntity] | None) – Set of ClusterEntity objects from right input. If None, assume we are deduplicating left_entities.
- source_entities (frozenset[SourceEntity]) – Ground truth set of SourceEntities
- prob_range (tuple[float, float], default: (0.8, 1.0)) – Range of probabilities to assign to matches. All matches will be assigned a random probability in this range.
- seed (int, default: 42) – Random seed for reproducibility
Returns:
- Table – PyArrow Table with ‘left_id’, ‘right_id’, and ‘probability’ columns
model_factory
model_factory(
name: str | None = None,
description: str | None = None,
left_testkit: SourceTestkit | ModelTestkit | None = None,
right_testkit: SourceTestkit | ModelTestkit | None = None,
true_entities: tuple[SourceEntity, ...] | None = None,
model_type: Literal["deduper", "linker"] | None = None,
n_true_entities: int | None = None,
prob_range: tuple[float, float] = (0.8, 1.0),
seed: int = 42,
) -> ModelTestkit
Generate a complete model testkit.
Allows autoconfiguration with minimal settings, or more nuanced control.
Can either be used to generate a model in a pipeline, interconnected with existing SourceTestkit or ModelTestkit objects, or generate a standalone model with random data.
Parameters:
- name (str | None, default: None) – Name of the model
- description (str | None, default: None) – Description of the model
- left_testkit (SourceTestkit | ModelTestkit | None, default: None) – A SourceTestkit or ModelTestkit for the left source
- right_testkit (SourceTestkit | ModelTestkit | None, default: None) – If creating a linker, a SourceTestkit or ModelTestkit for the right source
- true_entities (tuple[SourceEntity, ...] | None, default: None) – Ground truth SourceEntity objects to use for generating probabilities. Must be supplied if sources are given
- model_type (Literal['deduper', 'linker'] | None, default: None) – Type of the model, one of ‘deduper’ or ‘linker’. Defaults to deduper. Ignored if left_testkit or right_testkit are provided.
- n_true_entities (int | None, default: None) – Base number of entities to generate when using default configs. Defaults to 10. Ignored if left_testkit or right_testkit are provided.
- prob_range (tuple[float, float], default: (0.8, 1.0)) – Range of probabilities to generate
- seed (int, default: 42) – Random seed for reproducibility
Returns:
- ModelTestkit – A model testkit with generated data
Raises:
- ValueError –
  - If probabilities are not in increasing order and between 0 and 1
  - If sources are provided without true entities
- UserWarning – If some arguments are ignored due to sources or true entities
query_to_model_factory
query_to_model_factory(
left_resolution: str,
left_query: Table,
left_source_pks: dict[str, str],
true_entities: tuple[SourceEntity, ...],
name: str | None = None,
description: str | None = None,
right_resolution: str | None = None,
right_query: Table | None = None,
right_source_pks: dict[str, str] | None = None,
prob_range: tuple[float, float] = (0.8, 1.0),
seed: int = 42,
) -> ModelTestkit
Turns raw queries from Matchbox into ModelTestkits.
Parameters:
- left_resolution (str) – Name of the resolution used for the left query
- left_query (Table) – PyArrow table with left query data
- left_source_pks (dict[str, str]) – Dictionary mapping source names to primary key column names in left query
- true_entities (tuple[SourceEntity, ...]) – Ground truth SourceEntity objects to use for generating probabilities
- name (str | None, default: None) – Name of the model
- description (str | None, default: None) – Description of the model
- right_resolution (str | None, default: None) – Name of the resolution used for the right query
- right_query (Table | None, default: None) – PyArrow table with right query data, if creating a linker
- right_source_pks (dict[str, str] | None, default: None) – Dictionary mapping source names to primary key column names in right query
- prob_range (tuple[float, float], default: (0.8, 1.0)) – Range of probabilities to generate
- seed (int, default: 42) – Random seed for reproducibility
Returns:
- ModelTestkit – ModelTestkit with the processed data