Entities
matchbox.common.factories.entities
¶
Classes and functions for generating and comparing entities.
These underpin the entity resolution process, which is the core of the source and model testkit factory system.
Classes:
-
VariationRule
–Abstract base class for variation rules.
-
SuffixRule
–Add a suffix to a value.
-
PrefixRule
–Add a prefix to a value.
-
ReplaceRule
–Replace occurrences of a string with another.
-
FeatureConfig
–Configuration for generating a feature with variations.
-
EntityReference
–Reference to an entity’s presence in specific sources.
-
EntityIDMixin
–Mixin providing common ID-based functionality for entity classes.
-
SourceKeyMixin
–Mixin providing common source key functionality for entity classes.
-
ClusterEntity
–Represents a merged entity mid-pipeline.
-
SourceEntity
–Represents a single entity across all sources.
Functions:
-
infer_data_type
–Infer an appropriate Matchbox type from a Faker configuration.
-
query_to_cluster_entities
–Convert a query result to a set of ClusterEntities.
-
generate_entities
–Generate base entities with their ground truth values from generator.
-
probabilities_to_results_entities
–Convert probabilities to ClusterEntity objects based on a threshold.
-
diff_results
–Compare two lists of ClusterEntity with detailed diff information.
VariationRule
¶
SuffixRule
¶
Bases: VariationRule
Add a suffix to a value.
Methods:
-
apply
–Apply the suffix to the value.
Attributes:
PrefixRule
¶
Bases: VariationRule
Add a prefix to a value.
Methods:
-
apply
–Apply the prefix to the value.
Attributes:
ReplaceRule
¶
Bases: VariationRule
Replace occurrences of a string with another.
Methods:
-
apply
–Apply the replacement to the value.
Attributes:
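The three concrete rules share one shape: each holds a small piece of configuration and exposes `apply`. The following is a minimal stand-alone sketch of that pattern using plain dataclasses rather than the Matchbox classes. The class names and the field names (`suffix`, `prefix`, `old`, `new`) are assumptions made for illustration, not confirmed attributes of the library.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass


class VariationRuleSketch(ABC):
    """Hypothetical stand-in for an abstract variation rule."""

    @abstractmethod
    def apply(self, value: str) -> str:
        """Return a varied copy of value."""


@dataclass(frozen=True)
class SuffixSketch(VariationRuleSketch):
    suffix: str  # assumed field name

    def apply(self, value: str) -> str:
        return f"{value}{self.suffix}"


@dataclass(frozen=True)
class PrefixSketch(VariationRuleSketch):
    prefix: str  # assumed field name

    def apply(self, value: str) -> str:
        return f"{self.prefix}{value}"


@dataclass(frozen=True)
class ReplaceSketch(VariationRuleSketch):
    old: str  # assumed field name
    new: str  # assumed field name

    def apply(self, value: str) -> str:
        return value.replace(self.old, self.new)


# Apply each rule to the same base value to get one variant per rule.
rules = (SuffixSketch(" Ltd"), PrefixSketch("The "), ReplaceSketch("Acme", "ACME"))
variants = [rule.apply("Acme Widgets") for rule in rules]
```

Making the rules frozen keeps them hashable, which matters if a collection of rules is used as part of a cache key.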
FeatureConfig
¶
Bases: BaseModel
Configuration for generating a feature with variations.
Methods:
-
add_variations
–Add a variation rule to the feature.
-
protected_names
–Ensure name is not a reserved keyword.
-
string_to_strenum
–Convert string to DataTypes enum.
Attributes:
-
name
(str
) – -
base_generator
(str
) – -
parameters
(tuple | None
) – -
unique
(bool
) – -
drop_base
(bool
) – -
variations
(tuple[VariationRule, ...]
) – -
datatype
(DataTypes
) –
parameters
class-attribute
instance-attribute
¶
parameters: tuple | None = Field(
default=None,
description="Parameters for the generator. A tuple of tuples passed to the generator.",
)
unique
class-attribute
instance-attribute
¶
unique: bool = Field(
default=True,
description="Whether the generator enforces uniqueness in the generated data. For example, using unique=True with the 'boolean' generator will error if more than two values are generated.",
)
drop_base
class-attribute
instance-attribute
¶
drop_base: bool = Field(
default=False,
description="Whether the base case is dropped.",
)
variations
class-attribute
instance-attribute
¶
variations: tuple[VariationRule, ...] = Field(
default_factory=tuple
)
datatype
class-attribute
instance-attribute
¶
datatype: DataTypes = Field(
default_factory=lambda data: infer_data_type(
data["base_generator"], data["parameters"]
)
)
add_variations
¶
add_variations(*rule: VariationRule) -> FeatureConfig
Add a variation rule to the feature.
protected_names
classmethod
¶
Ensure name is not a reserved keyword.
EntityReference
¶
EntityReference(
mapping: dict[SourceResolutionName, frozenset[str]]
| None = None,
)
Bases: frozendict
Reference to an entity’s presence in specific sources.
Maps source resolution names to sets of primary keys.
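To make the shape of this mapping concrete, here is a small sketch that uses a plain `dict` of `frozenset` values in place of `EntityReference`. The source names `crn` and `duns` and the helper `is_subset` are hypothetical; the subset rule shown is one plausible reading of how references compare, not the library's confirmed behaviour.

```python
# Plain-dict stand-ins for an entity reference: source name -> primary keys.
full = {"crn": frozenset({"a1", "a2"}), "duns": frozenset({"d9"})}
partial = {"crn": frozenset({"a1"})}


def is_subset(
    inner: dict[str, frozenset[str]], outer: dict[str, frozenset[str]]
) -> bool:
    """True if every key in inner also appears under the same source in outer."""
    return all(
        keys <= outer.get(source, frozenset()) for source, keys in inner.items()
    )


partial_in_full = is_subset(partial, full)
full_in_partial = is_subset(full, partial)
```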
EntityIDMixin
¶
SourceKeyMixin
¶
Mixin providing common source key functionality for entity classes.
Implements methods for accessing and retrieving source keys.
Methods:
-
get_keys
–Get keys for a specific source.
-
get_values
–Get all unique values for this entity across sources.
Attributes:
-
keys
(EntityReference
) –
get_keys
¶
get_keys(name: SourceResolutionName) -> set[str]
Get keys for a specific source.
Parameters:
-
name
(SourceResolutionName
) –Name of the source
Returns:
get_values
¶
get_values(
sources: dict[SourceResolutionName, SourceTestkit],
) -> dict[SourceResolutionName, dict[str, list[str]]]
Get all unique values for this entity across sources.
Each source may have its own variations/transformations of the base data, so we maintain separation between sources.
Parameters:
-
sources
(dict[SourceResolutionName, SourceTestkit]
) –Dictionary of source resolution name to source data
Returns:
ClusterEntity
¶
Bases: BaseModel
, EntityIDMixin
, SourceKeyMixin
Represents a merged entity mid-pipeline.
Methods:
-
is_subset_of_source_entity
–Check if this ClusterEntity’s references are a subset of a SourceEntity’s.
-
similarity_ratio
–Return ratio of shared keys to total keys across all sources.
-
get_keys
–Get keys for a specific source.
-
get_values
–Get all unique values for this entity across sources.
Attributes:
-
id
(int
) – -
keys
(EntityReference
) –
is_subset_of_source_entity
¶
is_subset_of_source_entity(
source_entity: SourceEntity,
) -> bool
Check if this ClusterEntity’s references are a subset of a SourceEntity’s.
similarity_ratio
¶
similarity_ratio(other: ClusterEntity) -> float
Return ratio of shared keys to total keys across all sources.
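As a rough illustration of this ratio, the sketch below counts keys shared by both entities and divides by the number of distinct keys held by either, source by source. Reading "total keys" as the per-source union is an assumption, as are the function name `similarity_ratio_sketch` and the use of plain dictionaries in place of `ClusterEntity`.

```python
def similarity_ratio_sketch(
    a: dict[str, frozenset[str]], b: dict[str, frozenset[str]]
) -> float:
    """Shared keys divided by distinct keys, summed over all sources."""
    empty: frozenset[str] = frozenset()
    sources = set(a) | set(b)
    shared = sum(len(a.get(s, empty) & b.get(s, empty)) for s in sources)
    total = sum(len(a.get(s, empty) | b.get(s, empty)) for s in sources)
    # Assumption: two entities with no keys at all have zero similarity.
    return shared / total if total else 0.0


left = {"crn": frozenset({"a1", "a2"}), "duns": frozenset({"d9"})}
right = {"crn": frozenset({"a1"}), "duns": frozenset({"d9", "d8"})}
ratio = similarity_ratio_sketch(left, right)  # 2 shared keys out of 4 distinct
```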
get_keys
¶
get_keys(name: SourceResolutionName) -> set[str]
Get keys for a specific source.
Parameters:
-
name
(SourceResolutionName
) –Name of the source
Returns:
get_values
¶
get_values(
sources: dict[SourceResolutionName, SourceTestkit],
) -> dict[SourceResolutionName, dict[str, list[str]]]
Get all unique values for this entity across sources.
Each source may have its own variations/transformations of the base data, so we maintain separation between sources.
Parameters:
-
sources
(dict[SourceResolutionName, SourceTestkit]
) –Dictionary of source resolution name to source data
Returns:
SourceEntity
¶
Bases: BaseModel
, EntityIDMixin
, SourceKeyMixin
Represents a single entity across all sources.
Methods:
-
add_source_reference
–Add or update a source reference.
-
to_cluster_entity
–Convert this SourceEntity to a ClusterEntity with the specified sources.
-
get_keys
–Get keys for a specific source.
-
get_values
–Get all unique values for this entity across sources.
Attributes:
-
id
(int
) – -
base_values
(dict[str, Any]
) – -
keys
(EntityReference
) – -
total_unique_variations
(int
) –
base_values
class-attribute
instance-attribute
¶
keys
class-attribute
instance-attribute
¶
keys: EntityReference = Field(
description="Source to keys mapping",
default=EntityReference(mapping=frozenset()),
)
total_unique_variations
class-attribute
instance-attribute
¶
total_unique_variations: int = Field(default=0)
add_source_reference
¶
add_source_reference(
name: SourceResolutionName, keys: list[str]
) -> None
Add or update a source reference.
Parameters:
-
name
(SourceResolutionName
) –Source name
-
keys
(list[str]
) –List of primary keys for this source
to_cluster_entity
¶
to_cluster_entity(
*names: SourceResolutionName,
) -> ClusterEntity | None
Convert this SourceEntity to a ClusterEntity with the specified sources.
This method simplifies diffing. Testing whether ClusterEntity objects are subsets of SourceEntity objects is a weaker, logically more fragile test than comparing sets of ClusterEntity objects for equality. Converting to ClusterEntity lets the test be written as plain set operations:
actual: set[ClusterEntity] = ...
expected: set[ClusterEntity] = {
s.to_cluster_entity("source1", "source2")
for s in source_entities
}
is_identical = expected == actual
missing = expected - actual
extra = actual - expected
Parameters:
-
*names
(SourceResolutionName
, default:()
) –Names of sources to include in the ClusterEntity
Returns:
-
ClusterEntity | None
–ClusterEntity containing only the specified sources’ keys, or None if none of the specified sources are present in this entity.
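The projection described here can be sketched without the Matchbox classes. In the example below, an entity is a plain dictionary of source name to keys, and the hypothetical helper `project` keeps only the requested sources, returning `None` when none are present. Representing the projected entity as a `frozenset` of items is an assumption made so that the results are hashable and can be compared as sets.

```python
from typing import Optional


def project(
    keys: dict[str, frozenset[str]], *names: str
) -> Optional[frozenset]:
    """Keep only the named sources; None if the entity appears in none of them."""
    kept = {name: keys[name] for name in names if name in keys}
    return frozenset(kept.items()) if kept else None


ground_truth = [
    {"source1": frozenset({"1"}), "source2": frozenset({"7"}), "source3": frozenset({"x"})},
    {"source3": frozenset({"y"})},  # absent from source1 and source2
]

# Project every ground-truth entity, dropping those with no presence.
projected = [project(entity, "source1", "source2") for entity in ground_truth]
expected = {p for p in projected if p is not None}

actual = {
    frozenset({("source1", frozenset({"1"})), ("source2", frozenset({"7"}))}),
}

is_identical = expected == actual
missing = expected - actual
extra = actual - expected
```

Filtering out `None` before building the set keeps entities that exist only in other sources from polluting the comparison.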
get_keys
¶
get_keys(name: SourceResolutionName) -> set[str]
Get keys for a specific source.
Parameters:
-
name
(SourceResolutionName
) –Name of the source
Returns:
get_values
¶
get_values(
sources: dict[SourceResolutionName, SourceTestkit],
) -> dict[SourceResolutionName, dict[str, list[str]]]
Get all unique values for this entity across sources.
Each source may have its own variations/transformations of the base data, so we maintain separation between sources.
Parameters:
-
sources
(dict[SourceResolutionName, SourceTestkit]
) –Dictionary of source resolution name to source data
Returns:
query_to_cluster_entities
¶
query_to_cluster_entities(
query: Table | DataFrame,
keys: dict[SourceResolutionName, str],
) -> set[ClusterEntity]
Convert a query result to a set of ClusterEntities.
Useful for turning a real query from a real model resolution in Matchbox into a set of ClusterEntities that can be used in LinkedSourcesTestkit.diff_results().
Parameters:
-
query
(Table | DataFrame
) –A PyArrow table or DataFrame representing a query result
-
keys
(dict[SourceResolutionName, str]
) –Mapping of source resolution names to key field names
Returns:
-
set[ClusterEntity]
–A set of ClusterEntity objects
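The grouping idea behind this conversion can be sketched with plain Python rows instead of a PyArrow table or DataFrame. The sketch assumes each row carries a cluster identifier in a column named `id` and one key column per source; the column names, source names and the helper `group_keys` are all hypothetical and may not match the real query schema.

```python
from collections import defaultdict

# Hypothetical query result: one row per record, with a cluster id column.
rows = [
    {"id": 1, "crn_key": "a1", "duns_key": None},
    {"id": 1, "crn_key": "a2", "duns_key": "d9"},
    {"id": 2, "crn_key": "a3", "duns_key": None},
]
# Mapping of source name to the column holding that source's keys.
key_columns = {"crn": "crn_key", "duns": "duns_key"}


def group_keys(
    rows: list[dict], key_columns: dict[str, str]
) -> dict[int, dict[str, frozenset[str]]]:
    """Collect, for each cluster id, the keys it holds in each source."""
    grouped: dict = defaultdict(lambda: defaultdict(set))
    for row in rows:
        for source, column in key_columns.items():
            if row.get(column) is not None:  # skip sources with no key in this row
                grouped[row["id"]][source].add(row[column])
    return {
        cluster_id: {source: frozenset(keys) for source, keys in per_source.items()}
        for cluster_id, per_source in grouped.items()
    }


clusters = group_keys(rows, key_columns)
```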
generate_entities
cached
¶
generate_entities(
generator: Faker,
features: tuple[FeatureConfig, ...],
n: int,
) -> tuple[SourceEntity]
Generate base entities with their ground truth values from generator.
probabilities_to_results_entities
¶
probabilities_to_results_entities(
probabilities: Table,
left_clusters: tuple[ClusterEntity, ...],
right_clusters: tuple[ClusterEntity, ...] | None = None,
threshold: float | int = 0,
) -> tuple[ClusterEntity, ...]
Convert probabilities to ClusterEntity objects based on a threshold.
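One common way to turn pairwise probabilities into clusters is to link every pair at or above the threshold and take connected components. The sketch below does this with a small union-find over integer indices. It is an illustration of the general technique only: whether the library uses `>=` or `>`, how it scales probabilities, and how it handles `right_clusters` are not stated here, and the name `merge_by_threshold` is hypothetical.

```python
def merge_by_threshold(
    pairs: list[tuple[int, int, float]], n: int, threshold: float
) -> set[frozenset[int]]:
    """Group items 0..n-1 into components linked by pairs at or above threshold."""
    parent = list(range(n))

    def find(i: int) -> int:
        # Walk up to the root, halving the path as we go.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for left, right, probability in pairs:
        if probability >= threshold:  # assumption: inclusive threshold
            root_left, root_right = find(left), find(right)
            if root_left != root_right:
                parent[root_right] = root_left

    components: dict[int, set[int]] = {}
    for i in range(n):
        components.setdefault(find(i), set()).add(i)
    return {frozenset(members) for members in components.values()}


pairs = [(0, 1, 0.9), (1, 2, 0.4), (3, 4, 0.8)]
clusters = merge_by_threshold(pairs, n=5, threshold=0.5)
```

Items that never clear the threshold, such as item 2 here, remain as singleton clusters rather than being dropped.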
diff_results
¶
diff_results(
expected: list[ClusterEntity],
actual: list[ClusterEntity],
) -> tuple[bool, dict]
Compare two lists of ClusterEntity with detailed diff information.
Parameters:
-
expected
(list[ClusterEntity]
) –Expected ClusterEntity list
-
actual
(list[ClusterEntity]
) –Actual ClusterEntity list
Returns:
-
tuple[bool, dict]
–A tuple containing:
- Boolean: True if the lists are identical, False otherwise
- Dictionary counting the actual entities that fall into each of the following categories:
- ‘perfect’: Match an expected entity exactly
- ‘subset’: Are a subset of an expected entity
- ‘superset’: Are a superset of an expected entity
- ‘wrong’: Don’t match any expected entity
- ‘invalid’: Contain keys not present in any expected entity
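The five categories can be illustrated with a small classifier over plain frozensets of `(source, key)` pairs. This is a sketch under stated assumptions, not the library's implementation: the order in which the categories are checked (perfect, then invalid, subset, superset, wrong) is a guess, and the names `classify` and `entity` are hypothetical.

```python
from collections import Counter


def entity(*keys: str) -> frozenset:
    """Hypothetical entity: a frozenset of (source, key) pairs in one source."""
    return frozenset(("crn", key) for key in keys)


def classify(expected: list[frozenset], actual: list[frozenset]) -> tuple[bool, dict]:
    """Count how each actual entity relates to the expected entities."""
    known = set().union(*expected)  # every key seen in any expected entity
    counts: Counter = Counter()
    for candidate in actual:
        if candidate in expected:
            counts["perfect"] += 1
        elif not candidate <= known:
            counts["invalid"] += 1  # holds a key no expected entity has
        elif any(candidate < e for e in expected):
            counts["subset"] += 1
        elif any(candidate > e for e in expected):
            counts["superset"] += 1
        else:
            counts["wrong"] += 1
    return set(expected) == set(actual), dict(counts)


expected = [entity("a1", "a2"), entity("a3"), entity("a4", "a5")]
actual = [
    entity("a1", "a2"),  # exact match
    entity("a3", "a4"),  # contains all of the expected entity {a3}
    entity("a5"),        # part of the expected entity {a4, a5}
    entity("a1", "a4"),  # straddles two expected entities
    entity("a9"),        # key unknown to every expected entity
]
identical, report = classify(expected, actual)
```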