Entities
matchbox.common.factories.entities
¶
Classes and functions for generating and comparing entities.
These underpin the entity resolution process, which is the core of the source and model testkit factory system.
Classes:
- VariationRule – Abstract base class for variation rules.
- SuffixRule – Add a suffix to a value.
- PrefixRule – Add a prefix to a value.
- ReplaceRule – Replace occurrences of a string with another.
- FeatureConfig – Configuration for generating a feature with variations.
- EntityReference – Reference to an entity’s presence in specific sources.
- EntityIDMixin – Mixin providing common ID-based functionality for entity classes.
- SourceKeyMixin – Mixin providing common source key functionality for entity classes.
- ClusterEntity – Represents a merged entity mid-pipeline.
- SourceEntity – Represents a single entity across all sources.
Functions:
- infer_data_type – Infer an appropriate Matchbox type from a Faker configuration.
- query_to_cluster_entities – Convert a query result to a set of ClusterEntities.
- generate_entities – Generate base entities with their ground truth values from generator.
- scores_to_results_entities – Convert scores to ClusterEntity objects based on a threshold.
- diff_entities – Compare two lists of ClusterEntity with detailed diff information.
VariationRule
¶
Bases: BaseModel, Generic[T], ABC
Abstract base class for variation rules.
Methods:
- apply – Apply the variation to a value.
Attributes:
SuffixRule
¶
Bases: VariationRule[str]
Add a suffix to a value.
Methods:
- apply – Apply the variation to a value.
Attributes:
PrefixRule
¶
Bases: VariationRule[str]
Add a prefix to a value.
Methods:
- apply – Apply the variation to a value.
Attributes:
ReplaceRule
¶
Bases: VariationRule[str]
Replace occurrences of a string with another.
Methods:
- apply – Apply the variation to a value.
Attributes:
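To illustrate how the three string rules above behave, here is a minimal standalone sketch. The classes `Rule`, `Suffix`, `Prefix` and `Replace`, and their field names, are hypothetical stand-ins rather than the Matchbox classes; only the `apply` method name is taken from this page.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass


class Rule(ABC):
    """Hypothetical stand-in for VariationRule: a single apply() method."""

    @abstractmethod
    def apply(self, value: str) -> str:
        """Return a varied copy of the value."""


@dataclass(frozen=True)
class Suffix(Rule):
    suffix: str  # assumed field name

    def apply(self, value: str) -> str:
        return value + self.suffix


@dataclass(frozen=True)
class Prefix(Rule):
    prefix: str  # assumed field name

    def apply(self, value: str) -> str:
        return self.prefix + value


@dataclass(frozen=True)
class Replace(Rule):
    old: str  # assumed field names
    new: str

    def apply(self, value: str) -> str:
        return value.replace(self.old, self.new)


# Each rule produces one variant of the same base value.
base = "Acme Ltd"
rules = (Suffix(" UK"), Prefix("The "), Replace("Ltd", "Limited"))
variants = [rule.apply(base) for rule in rules]
print(variants)
```

Keeping each rule as a small immutable object means a tuple of rules can be stored on a configuration object and applied in order to the same base value.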
FeatureConfig
¶
Bases: BaseModel
Configuration for generating a feature with variations.
Methods:
- add_variations – Add a variation rule to the feature.
- protected_names – Ensure name is not a reserved keyword.
- string_to_strenum – Convert string to DataTypes enum.
Attributes:
- name (str)
- base_generator (str)
- parameters (tuple | None)
- unique (bool)
- drop_base (bool)
- variations (tuple[VariationRule, ...])
- datatype (DataTypes)
parameters
class-attribute
instance-attribute
¶
parameters: tuple | None = Field(default=None, description='Parameters for the generator. A tuple of tuples passed to the generator.')
unique
class-attribute
instance-attribute
¶
unique: bool = Field(default=True, description="Whether the generator enforces uniqueness in the generated data. For example, using unique=True with the 'boolean' generator will error if more than two values are generated.")
drop_base
class-attribute
instance-attribute
¶
drop_base: bool = Field(default=False, description='Whether the base case is dropped.')
variations
class-attribute
instance-attribute
¶
variations: tuple[VariationRule, ...] = Field(default_factory=tuple)
datatype
class-attribute
instance-attribute
¶
datatype: DataTypes = Field(default_factory=lambda data: infer_data_type(data['base_generator'], data['parameters']))
add_variations
¶
add_variations(*rule: VariationRule) -> FeatureConfig
Add a variation rule to the feature.
protected_names
classmethod
¶
Ensure name is not a reserved keyword.
EntityReference
¶
EntityReference(mapping: dict[SourceStepName, frozenset[str]] | None = None)
Bases: frozendict
Reference to an entity’s presence in specific sources.
Maps source step names to sets of primary keys.
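The mapping described above can be pictured with a small standalone sketch. `Reference` is a hypothetical stand-in built on the standard library, not the real `EntityReference` (which derives from `frozendict`), and the source names `"crn"` and `"duns"` are made up. The sketch assumes the point of an immutable mapping is that references can be hashed and compared by content.

```python
from __future__ import annotations

from collections.abc import Iterator, Mapping


class Reference(Mapping):
    """Hypothetical stand-in: read-only, hashable map of source name -> keys."""

    def __init__(self, mapping: dict[str, frozenset[str]] | None = None) -> None:
        # Copy and freeze the key sets so later changes to the input cannot leak in.
        self._data = {name: frozenset(keys) for name, keys in (mapping or {}).items()}

    def __getitem__(self, name: str) -> frozenset[str]:
        return self._data[name]

    def __iter__(self) -> Iterator[str]:
        return iter(self._data)

    def __len__(self) -> int:
        return len(self._data)

    def __hash__(self) -> int:
        # Order-independent hash over (source, keys) pairs.
        return hash(frozenset(self._data.items()))


ref_a = Reference({"crn": frozenset({"1", "2"}), "duns": frozenset({"9"})})
ref_b = Reference({"duns": frozenset({"9"}), "crn": frozenset({"2", "1"})})

# Same content in a different order: equal, and collapsed to one set member.
unique_refs = {ref_a, ref_b}
print(ref_a == ref_b, len(unique_refs))
```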
EntityIDMixin
¶
SourceKeyMixin
¶
Mixin providing common source key functionality for entity classes.
Implements methods for accessing and retrieving source keys.
Methods:
- get_keys – Get keys for a specific source.
- get_values – Get all unique values for this entity across sources.
Attributes:
- keys (EntityReference)
get_keys
¶
get_keys(name: SourceStepName) -> set[str]
get_values
¶
get_values(sources: dict[SourceStepName, SourceTestkit]) -> dict[SourceStepName, dict[str, list[str]]]
Get all unique values for this entity across sources.
Each source may have its own variations/transformations of the base data, so we maintain separation between sources.
Parameters:
- sources (dict[SourceStepName, SourceTestkit]) – Dictionary of source step name to source data
Returns:
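One way to picture the per-source separation described for `get_values` is the sketch below. It is a standalone approximation under stated assumptions: each source is a plain list of row dictionaries with a hypothetical `"key"` column standing in for a `SourceTestkit`, and the entity's keys are a plain dictionary standing in for `EntityReference`.

```python
from __future__ import annotations

Row = dict[str, str]


def collect_values(
    entity_keys: dict[str, frozenset[str]],
    sources: dict[str, list[Row]],
    key_field: str = "key",  # assumed name of the key column
) -> dict[str, dict[str, list[str]]]:
    """Gather the distinct field values this entity takes in each source."""
    result: dict[str, dict[str, list[str]]] = {}
    for name, rows in sources.items():
        wanted = entity_keys.get(name, frozenset())
        if not wanted:
            continue  # the entity does not appear in this source
        fields: dict[str, list[str]] = {}
        for row in rows:
            if row[key_field] not in wanted:
                continue
            for field, value in row.items():
                if field == key_field:
                    continue
                seen = fields.setdefault(field, [])
                if value not in seen:  # keep first-seen order, drop duplicates
                    seen.append(value)
        result[name] = fields
    return result


entity_keys = {"crn": frozenset({"1", "2"}), "duns": frozenset({"9"})}
sources = {
    "crn": [
        {"key": "1", "company_name": "Acme Ltd"},
        {"key": "2", "company_name": "Acme Limited"},
        {"key": "3", "company_name": "Other Co"},
    ],
    "duns": [{"key": "9", "company_name": "Acme"}],
}
values = collect_values(entity_keys, sources)
print(values)
```

Values are grouped under each source name rather than pooled, so a variation that exists only in one source stays attributed to that source.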
ClusterEntity
¶
Bases: BaseModel, EntityIDMixin, SourceKeyMixin
Represents a merged entity mid-pipeline.
Methods:
- is_subset_of_source_entity – Check if this ClusterEntity’s references are a subset of a SourceEntity’s.
- similarity_ratio – Return ratio of shared keys to total keys across all sources.
- get_keys – Get keys for a specific source.
- get_values – Get all unique values for this entity across sources.
Attributes:
- id (int)
- keys (EntityReference)
is_subset_of_source_entity
¶
is_subset_of_source_entity(source_entity: SourceEntity) -> bool
Check if this ClusterEntity’s references are a subset of a SourceEntity’s.
similarity_ratio
¶
similarity_ratio(other: ClusterEntity) -> float
Return ratio of shared keys to total keys across all sources.
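The two comparison methods above can be sketched with plain set operations. This is an illustrative approximation, not the Matchbox implementation: it assumes keys are compared as (source, key) pairs, that the ratio is shared pairs divided by the union of pairs, and that two empty references give 0.0. The `Keys` alias and function names are hypothetical.

```python
from __future__ import annotations

# Source name -> primary keys; a stand-in for EntityReference.
Keys = dict[str, frozenset[str]]


def as_pairs(keys: Keys) -> set[tuple[str, str]]:
    """Flatten a reference into (source, key) pairs."""
    return {(source, key) for source, source_keys in keys.items() for key in source_keys}


def is_subset(cluster: Keys, source_entity: Keys) -> bool:
    """True if every (source, key) pair in the cluster belongs to the source entity."""
    return as_pairs(cluster) <= as_pairs(source_entity)


def similarity_ratio(a: Keys, b: Keys) -> float:
    """Shared pairs divided by all pairs seen in either reference."""
    pairs_a, pairs_b = as_pairs(a), as_pairs(b)
    total = pairs_a | pairs_b
    if not total:
        return 0.0  # assumption: no keys at all counts as no similarity
    return len(pairs_a & pairs_b) / len(total)


truth: Keys = {"crn": frozenset({"1", "2"}), "duns": frozenset({"9"})}
partial: Keys = {"crn": frozenset({"1"})}
other: Keys = {"crn": frozenset({"1"}), "duns": frozenset({"9"}), "cdms": frozenset({"x"})}

print(is_subset(partial, truth))       # True
print(is_subset(other, truth))         # False
print(similarity_ratio(truth, other))  # 0.5
```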
get_keys
¶
get_keys(name: SourceStepName) -> set[str]
get_values
¶
get_values(sources: dict[SourceStepName, SourceTestkit]) -> dict[SourceStepName, dict[str, list[str]]]
Get all unique values for this entity across sources.
Each source may have its own variations/transformations of the base data, so we maintain separation between sources.
Parameters:
- sources (dict[SourceStepName, SourceTestkit]) – Dictionary of source step name to source data
Returns:
SourceEntity
¶
Bases: BaseModel, EntityIDMixin, SourceKeyMixin
Represents a single entity across all sources.
Methods:
- add_source_reference – Add or update a source reference.
- to_cluster_entity – Convert this SourceEntity to a ClusterEntity with the specified sources.
- get_keys – Get keys for a specific source.
- get_values – Get all unique values for this entity across sources.
Attributes:
- id (int)
- base_values (dict[str, Any])
- keys (EntityReference)
- total_unique_variations (int)
base_values
class-attribute
instance-attribute
¶
keys
class-attribute
instance-attribute
¶
keys: EntityReference = Field(description='Source to keys mapping', default=EntityReference(mapping=frozenset()))
total_unique_variations
class-attribute
instance-attribute
¶
total_unique_variations: int = Field(default=0)
add_source_reference
¶
add_source_reference(name: SourceStepName, keys: list[str]) -> None
Add or update a source reference.
Parameters:
- name (SourceStepName) – Source name
- keys (list[str]) – List of primary keys for this source
to_cluster_entity
¶
to_cluster_entity(*names: SourceStepName) -> ClusterEntity | None
Convert this SourceEntity to a ClusterEntity with the specified sources.
This method makes diffing straightforward. Testing whether ClusterEntity objects are subsets of SourceEntity objects is a weaker, logically more fragile test than directly comparing sets of ClusterEntity objects for equality. Converting first lets the test be written with plain set operations:
actual: set[ClusterEntity] = ...
expected: set[ClusterEntity] = {
    s.to_cluster_entity("source1", "source2")
    for s in source_entities
}
is_identical = expected == actual
missing = expected - actual
extra = actual - expected
Parameters:
- *names (SourceStepName, default: ()) – Names of sources to include in the ClusterEntity
Returns:
- ClusterEntity | None – ClusterEntity containing only the specified sources’ keys, or None if none of the specified sources are present in this entity.
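The restriction to named sources, and the set comparison it enables, can be tried out with the standalone sketch below. `restrict` is a hypothetical helper that mimics the documented return behaviour on plain dictionaries; it is not the Matchbox method. Dropping `None` results before comparing is an assumption made here for the example.

```python
from __future__ import annotations

Keys = dict[str, frozenset[str]]  # source name -> primary keys


def restrict(keys: Keys, *names: str) -> frozenset | None:
    """Keep only the named sources; None if the entity appears in none of them."""
    kept = {name: keys[name] for name in names if name in keys}
    # A frozenset of items is hashable, so results can be collected in a set.
    return frozenset(kept.items()) if kept else None


source_entities: list[Keys] = [
    {"source1": frozenset({"a1"}), "source2": frozenset({"b1"}), "source3": frozenset({"c1"})},
    {"source1": frozenset({"a2"})},
    {"source3": frozenset({"c3"})},  # absent from both requested sources
]

# Assumption: entities with no presence in the requested sources are excluded.
expected = {restrict(s, "source1", "source2") for s in source_entities} - {None}
actual = {
    frozenset({("source1", frozenset({"a1"})), ("source2", frozenset({"b1"}))}),
    frozenset({("source1", frozenset({"a9"}))}),
}

is_identical = expected == actual
missing = expected - actual
extra = actual - expected
print(is_identical, len(missing), len(extra))
```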
get_keys
¶
get_keys(name: SourceStepName) -> set[str]
get_values
¶
get_values(sources: dict[SourceStepName, SourceTestkit]) -> dict[SourceStepName, dict[str, list[str]]]
Get all unique values for this entity across sources.
Each source may have its own variations/transformations of the base data, so we maintain separation between sources.
Parameters:
- sources (dict[SourceStepName, SourceTestkit]) – Dictionary of source step name to source data
Returns:
infer_data_type
¶
infer_data_type(base: str, parameters: tuple | None) -> DataTypes
query_to_cluster_entities
¶
query_to_cluster_entities(data: Table | DataFrame | DataFrame, keys: dict[SourceStepName, str]) -> set[ClusterEntity]
Convert a query result to a set of ClusterEntities.
Useful for turning a real query from a real model step in Matchbox into
a set of ClusterEntities that can be used in LinkedSourcesTestkit.diff_entities().
Parameters:
- data (Table | DataFrame | DataFrame) – A PyArrow table or DataFrame representing a query result
- keys (dict[SourceStepName, str]) – Mapping of source step names to key field names
Returns:
- set[ClusterEntity] – A set of ClusterEntity objects
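The conversion can be approximated as grouping rows by cluster and collecting each source's keys. The sketch below works on a list of row dictionaries instead of a table, and assumes a cluster identifier column named `"id"`; that column name, the field names, and the helper itself are hypothetical, not part of the documented API.

```python
from __future__ import annotations

from collections import defaultdict


def rows_to_clusters(
    rows: list[dict],
    keys: dict[str, str],   # source name -> key field name, as in the signature above
    id_field: str = "id",   # assumed name of the cluster identifier column
) -> dict[int, dict[str, frozenset[str]]]:
    """Group rows by cluster id and collect the non-null keys seen per source."""
    grouped: dict[int, dict[str, set[str]]] = defaultdict(lambda: defaultdict(set))
    for row in rows:
        for source, field in keys.items():
            value = row.get(field)
            if value is not None:
                grouped[row[id_field]][source].add(value)
    return {
        cluster_id: {source: frozenset(vals) for source, vals in per_source.items()}
        for cluster_id, per_source in grouped.items()
    }


rows = [
    {"id": 1, "crn_key": "a", "duns_key": None},
    {"id": 1, "crn_key": None, "duns_key": "x"},
    {"id": 2, "crn_key": "b", "duns_key": None},
]
clusters = rows_to_clusters(rows, {"crn": "crn_key", "duns": "duns_key"})
print(clusters)
```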
generate_entities
cached
¶
generate_entities(generator: Faker, features: tuple[FeatureConfig, ...], n: int) -> tuple[SourceEntity]
Generate base entities with their ground truth values from generator.
scores_to_results_entities
¶
scores_to_results_entities(scores: DataFrame, left_clusters: tuple[ClusterEntity, ...], right_clusters: tuple[ClusterEntity, ...] | None = None, threshold: float = 0.0) -> tuple[ClusterEntity, ...]
Convert scores to ClusterEntity objects based on a threshold.
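As a rough picture of threshold-based conversion, the sketch below merges clusters whose pairwise score reaches the threshold, using a small union-find. Everything here is an assumption for illustration: scores are given as `(left_id, right_id, score)` tuples rather than a DataFrame, the comparison is `>=`, clusters are plain sets of keys, and unmatched clusters are returned unchanged. The real function may differ on each of these points.

```python
from __future__ import annotations


def merge_by_threshold(
    scores: list[tuple[int, int, float]],
    clusters: dict[int, set[str]],
    threshold: float = 0.0,
) -> list[list[str]]:
    """Merge clusters connected by a score at or above the threshold."""
    parent = {cluster_id: cluster_id for cluster_id in clusters}

    def find(x: int) -> int:
        # Walk to the root, halving the path along the way.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for left, right, score in scores:
        if score >= threshold:  # assumption: inclusive comparison
            parent[find(left)] = find(right)

    merged: dict[int, set[str]] = {}
    for cluster_id, keys in clusters.items():
        merged.setdefault(find(cluster_id), set()).update(keys)
    # Sort for a deterministic result.
    return sorted(sorted(keys) for keys in merged.values())


clusters = {1: {"a"}, 2: {"b"}, 3: {"c"}, 4: {"d"}}
scores = [(1, 2, 0.9), (2, 3, 0.4), (3, 4, 0.8)]
merged = merge_by_threshold(scores, clusters, threshold=0.5)
print(merged)
```

Raising the threshold can only split the result into more, smaller groups, which is a useful sanity check when testing a model step against known entities.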
diff_entities
¶
diff_entities(expected: list[ClusterEntity], actual: list[ClusterEntity]) -> tuple[bool, dict]
Compare two lists of ClusterEntity with detailed diff information.
Parameters:
- expected (list[ClusterEntity]) – Expected ClusterEntity list
- actual (list[ClusterEntity]) – Actual ClusterEntity list
Returns:
- tuple[bool, dict] – A tuple containing:
  - Boolean: True if lists are identical, False otherwise
  - Dictionary that counts the number of actual entities that fall into the following criteria:
    - ‘perfect’: Match an expected entity exactly
    - ‘subset’: Are a subset of an expected entity
    - ‘superset’: Are a superset of an expected entity
    - ‘wrong’: Don’t match any expected entity
    - ‘invalid’: Contain keys not present in any expected entity
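The five categories can be made concrete with a small classifier over sets of (source, key) pairs. This is a sketch under assumptions, not the Matchbox implementation: entities are represented as frozensets of pairs, and the order in which the categories are checked (perfect, then invalid, then subset, superset, wrong) is a guess, since the page does not state a precedence.

```python
from __future__ import annotations

Entity = frozenset  # frozenset of (source, key) pairs; stand-in for ClusterEntity


def classify(expected: list[Entity], actual: list[Entity]) -> tuple[bool, dict[str, int]]:
    """Count how each actual entity relates to the expected entities."""
    known = set().union(*expected)  # every pair that appears in some expected entity
    counts = dict.fromkeys(("perfect", "subset", "superset", "wrong", "invalid"), 0)
    for entity in actual:
        if entity in expected:
            counts["perfect"] += 1
        elif not entity <= known:
            counts["invalid"] += 1   # contains a pair no expected entity has
        elif any(entity < e for e in expected):
            counts["subset"] += 1    # proper subset of an expected entity
        elif any(entity > e for e in expected):
            counts["superset"] += 1  # proper superset of an expected entity
        else:
            counts["wrong"] += 1
    return set(expected) == set(actual), counts


expected = [
    Entity({("crn", "1"), ("crn", "2")}),
    Entity({("crn", "3")}),
    Entity({("crn", "4"), ("duns", "9")}),
]
actual = [
    Entity({("crn", "1"), ("crn", "2")}),  # perfect
    Entity({("crn", "4")}),                # subset
    Entity({("crn", "3"), ("duns", "9")}), # superset
    Entity({("crn", "1"), ("crn", "4")}),  # wrong
    Entity({("crn", "99")}),               # invalid
]
identical, counts = classify(expected, actual)
print(identical, counts)
```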