Entities
matchbox.common.factories.entities
Classes and functions for generating and comparing entities.
These underpin the entity resolution process, which is the core of the source and model testkit factory system.
Classes:
- VariationRule – Abstract base class for variation rules.
- SuffixRule – Add a suffix to a value.
- PrefixRule – Add a prefix to a value.
- ReplaceRule – Replace occurrences of a string with another.
- FeatureConfig – Configuration for generating a feature with variations.
- EntityReference – Reference to an entity’s presence in specific sources.
- EntityIDMixin – Mixin providing common ID-based functionality for entity classes.
- SourcePKMixin – Mixin providing common source primary key functionality for entity classes.
- ClusterEntity – Represents a merged entity mid-pipeline.
- SourceEntity – Represents a single entity across all sources.
Functions:
- infer_sql_type – Infer an appropriate SQL type from a Faker configuration.
- query_to_cluster_entities – Convert a query result to a set of ClusterEntities.
- generate_entities – Generate base entities with their ground truth values.
- probabilities_to_results_entities – Convert probabilities to ClusterEntity objects based on a threshold.
- diff_results – Compare two lists of ClusterEntity with detailed diff information.
VariationRule
SuffixRule
Bases: VariationRule
Add a suffix to a value.
Methods:
- apply – Apply the suffix to the value.
Attributes:
PrefixRule
Bases: VariationRule
Add a prefix to a value.
Methods:
- apply – Apply the prefix to the value.
Attributes:
ReplaceRule
Bases: VariationRule
Replace occurrences of a string with another.
Methods:
- apply – Apply the replacement to the value.
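The three rules share one interface: an `apply` method that takes a value and returns a varied copy. The following is a minimal stand-alone sketch of that pattern, not the library's implementation. The attribute names (`suffix`, `prefix`, `old`, `new`) and the `str -> str` signature of `apply` are assumptions made for illustration.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass


class VariationRule(ABC):
    """Stand-in for the abstract base class: one method that transforms a value."""

    @abstractmethod
    def apply(self, value: str) -> str: ...


@dataclass(frozen=True)
class SuffixRule(VariationRule):
    suffix: str  # assumed attribute name

    def apply(self, value: str) -> str:
        return f"{value}{self.suffix}"


@dataclass(frozen=True)
class PrefixRule(VariationRule):
    prefix: str  # assumed attribute name

    def apply(self, value: str) -> str:
        return f"{self.prefix}{value}"


@dataclass(frozen=True)
class ReplaceRule(VariationRule):
    old: str  # assumed attribute names
    new: str

    def apply(self, value: str) -> str:
        return value.replace(self.old, self.new)


rules = (SuffixRule(" Ltd"), PrefixRule("The "), ReplaceRule("Company", "Co"))
variants = [rule.apply("Acme Company") for rule in rules]
print(variants)
```

Each rule yields one variant of the same base value, which is how a single ground-truth entity can appear under several spellings.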
Attributes:
FeatureConfig
Bases: BaseModel
Configuration for generating a feature with variations.
Methods:
- add_variations – Add a variation rule to the feature.
- protected_names – Ensure name is not a reserved keyword.
Attributes:
- name (str)
- base_generator (str)
- parameters (tuple)
- unique (bool)
- drop_base (bool)
- variations (tuple[VariationRule, ...])
- sql_type (str)
parameters
class-attribute instance-attribute
parameters: tuple = Field(
    default_factory=tuple,
    description="Parameters for the generator. A tuple of tuples passed to the generator.",
)
unique
class-attribute instance-attribute
unique: bool = Field(
    default=True,
    description="Whether the generator enforces uniqueness in the generated data. For example, using unique=True with the 'boolean' generator will error if more than two values are generated.",
)
drop_base
class-attribute instance-attribute
drop_base: bool = Field(
    default=False,
    description="Whether the base case is dropped.",
)
variations
class-attribute instance-attribute
variations: tuple[VariationRule, ...] = Field(
    default_factory=tuple
)
sql_type
class-attribute instance-attribute
sql_type: str = Field(
    default_factory=lambda data: infer_sql_type(
        data["base_generator"], data["parameters"]
    )
)
add_variations
add_variations(*rule: VariationRule) -> FeatureConfig
Add a variation rule to the feature.
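Because `variations` is a tuple and `add_variations` returns a `FeatureConfig`, one plausible reading is that the method returns a copy with the extra rules appended. The sketch below illustrates that reading with a frozen dataclass rather than a Pydantic model. `FeatureConfigSketch`, the `render` helper, and the interpretation of `drop_base` as "omit the unvaried value" are all assumptions, not the library's code.

```python
from dataclasses import dataclass, replace
from typing import Callable

Rule = Callable[[str], str]  # stand-in for a VariationRule's apply method


@dataclass(frozen=True)
class FeatureConfigSketch:
    """Hypothetical, simplified stand-in for FeatureConfig."""

    name: str
    base_generator: str
    variations: tuple[Rule, ...] = ()
    drop_base: bool = False

    def add_variations(self, *rule: Rule) -> "FeatureConfigSketch":
        # Assumption: the original config is left untouched and a copy is returned.
        return replace(self, variations=self.variations + rule)

    def render(self, base_value: str) -> list[str]:
        # Hypothetical helper: the base value followed by one value per rule,
        # with the base value omitted when drop_base is set.
        varied = [rule(base_value) for rule in self.variations]
        return varied if self.drop_base else [base_value, *varied]


base = FeatureConfigSketch(name="company_name", base_generator="company")
config = base.add_variations(lambda v: f"{v} Ltd", str.upper)
print(config.render("Acme"))
```

Returning a copy keeps a shared base configuration safe to reuse across several sources, each with its own set of variations.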
EntityReference
Bases: frozendict
Reference to an entity’s presence in specific sources.
Maps dataset names to sets of primary keys.
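To make the "dataset name to set of primary keys" shape concrete, here is a small stand-alone sketch of an immutable, hashable mapping built on the standard library. It does not use `frozendict`, and `EntityReferenceSketch` is a hypothetical name. The point it illustrates is only that immutability makes references safe to hash and compare.

```python
from __future__ import annotations

from collections.abc import Iterator, Mapping


class EntityReferenceSketch(Mapping):
    """Immutable, hashable mapping of dataset name -> frozenset of primary keys."""

    def __init__(self, mapping: Mapping[str, set[str]] | None = None) -> None:
        # Freeze every PK set so the whole reference can be hashed.
        self._data = {name: frozenset(pks) for name, pks in (mapping or {}).items()}

    def __getitem__(self, name: str) -> frozenset[str]:
        return self._data[name]

    def __iter__(self) -> Iterator[str]:
        return iter(self._data)

    def __len__(self) -> int:
        return len(self._data)

    def __hash__(self) -> int:
        return hash(frozenset(self._data.items()))


a = EntityReferenceSketch({"companies_house": {"1", "2"}})
b = EntityReferenceSketch({"companies_house": {"2", "1"}})
print(a == b, len({a, b}))
```

Two references with the same datasets and keys compare equal and collapse to one element in a set, regardless of insertion order.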
EntityIDMixin
SourcePKMixin
Mixin providing common source primary key functionality for entity classes.
Implements methods for accessing and retrieving source primary keys.
Methods:
- get_source_pks – Get PKs for a specific source.
- get_values – Get all unique values for this entity across sources.
Attributes:
get_source_pks
get_source_pks(source_name: str) -> set[str]
get_values
Get all unique values for this entity across sources.
Each source may have its own variations/transformations of the base data, so we maintain separation between sources.
Parameters:
- sources (dict[str, SourceTestkit]) – Dictionary of source name to source data
Returns:
ClusterEntity
Bases: BaseModel, EntityIDMixin, SourcePKMixin
Represents a merged entity mid-pipeline.
Methods:
- get_source_pks – Get PKs for a specific source.
- get_values – Get all unique values for this entity across sources.
- is_subset_of_source_entity – Check if this ClusterEntity’s references are a subset of a SourceEntity’s.
- similarity_ratio – Return ratio of shared PKs to total PKs across all datasets.
Attributes:
- id (int)
- source_pks (EntityReference)
get_source_pks
get_source_pks(source_name: str) -> set[str]
get_values
Get all unique values for this entity across sources.
Each source may have its own variations/transformations of the base data, so we maintain separation between sources.
Parameters:
- sources (dict[str, SourceTestkit]) – Dictionary of source name to source data
Returns:
is_subset_of_source_entity
is_subset_of_source_entity(
    source_entity: SourceEntity,
) -> bool
Check if this ClusterEntity’s references are a subset of a SourceEntity’s.
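A subset check of this kind can be sketched over plain dictionaries. The version below assumes "subset" means that every dataset in the cluster also appears in the source entity, and that each of the cluster's PK sets is contained in the matching source set. The function name and the dictionary representation are illustrative only.

```python
def is_subset(
    cluster_pks: dict[str, frozenset[str]],
    source_pks: dict[str, frozenset[str]],
) -> bool:
    """True if every PK the cluster holds is also held by the source entity."""
    return all(
        pks <= source_pks.get(dataset, frozenset())
        for dataset, pks in cluster_pks.items()
    )


truth = {"crn": frozenset({"1", "2"}), "duns": frozenset({"9"})}
partial = {"crn": frozenset({"1"})}
mixed = {"crn": frozenset({"1", "7"})}  # "7" belongs to some other entity

print(is_subset(partial, truth), is_subset(mixed, truth))
```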
similarity_ratio
similarity_ratio(other: ClusterEntity) -> float
Return ratio of shared PKs to total PKs across all datasets.
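One way to read "shared PKs to total PKs" is as an intersection-over-union count summed across datasets. That reading is an assumption, and the library may define the ratio differently. A minimal sketch under that assumption:

```python
def similarity_ratio(
    a: dict[str, frozenset[str]],
    b: dict[str, frozenset[str]],
) -> float:
    """Shared PKs divided by all distinct PKs, summed over every dataset."""
    empty: frozenset[str] = frozenset()
    datasets = set(a) | set(b)
    shared = sum(len(a.get(d, empty) & b.get(d, empty)) for d in datasets)
    total = sum(len(a.get(d, empty) | b.get(d, empty)) for d in datasets)
    return shared / total if total else 0.0


left = {"crn": frozenset({"1", "2"}), "duns": frozenset({"9"})}
right = {"crn": frozenset({"2", "3"})}

# One shared PK ("2") out of four distinct PKs ("1", "2", "3" in crn, "9" in duns).
print(similarity_ratio(left, right))
```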
SourceEntity
Bases: BaseModel, EntityIDMixin, SourcePKMixin
Represents a single entity across all sources.
Methods:
- get_source_pks – Get PKs for a specific source.
- get_values – Get all unique values for this entity across sources.
- add_source_reference – Add or update a source reference.
- to_cluster_entity – Convert this SourceEntity to a ClusterEntity with the specified datasets.
Attributes:
- id (int)
- base_values (dict[str, Any])
- source_pks (EntityReference)
- total_unique_variations (int)
base_values
class-attribute instance-attribute
source_pks
class-attribute instance-attribute
source_pks: EntityReference = Field(
    description="Dataset to PKs mapping",
    default=EntityReference(mapping=frozenset()),
)
total_unique_variations
class-attribute instance-attribute
total_unique_variations: int = Field(default=0)
get_source_pks
get_source_pks(source_name: str) -> set[str]
get_values
Get all unique values for this entity across sources.
Each source may have its own variations/transformations of the base data, so we maintain separation between sources.
Parameters:
- sources (dict[str, SourceTestkit]) – Dictionary of source name to source data
Returns:
add_source_reference
to_cluster_entity
to_cluster_entity(*names: str) -> ClusterEntity | None
Convert this SourceEntity to a ClusterEntity with the specified datasets.
This method makes diffing straightforward. Testing whether ClusterEntity objects are subsets of SourceEntity objects is a weaker and more fragile test than comparing sets of ClusterEntity objects for equality. Converting each SourceEntity first lets the test be expressed with plain set operations:
actual: set[ClusterEntity] = ...
expected: set[ClusterEntity] = {
    s.to_cluster_entity("dataset1", "dataset2")
    for s in source_entities
}
is_identical = expected == actual
missing = expected - actual
extra = actual - expected
Parameters:
Returns:
- ClusterEntity | None – ClusterEntity containing only the specified datasets’ PKs, or None if none of the specified datasets are present in this entity.
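The projection described above can be sketched without the library: keep only the named datasets, and return `None` when none of them is present. `ClusterSketch` and `to_cluster` are hypothetical stand-ins. The sketch also omits the `id` attribute, so how the real `ClusterEntity` treats `id` in equality is not shown here.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class ClusterSketch:
    """Hashable stand-in for ClusterEntity holding (dataset, PKs) pairs only."""

    source_pks: frozenset[tuple[str, frozenset[str]]]


def to_cluster(source_pks: dict[str, set[str]], *names: str) -> ClusterSketch | None:
    # Keep only the requested datasets that this entity actually appears in.
    kept = {(n, frozenset(source_pks[n])) for n in names if n in source_pks}
    return ClusterSketch(frozenset(kept)) if kept else None


entity = {"dataset1": {"a1"}, "dataset2": {"b7", "b8"}, "dataset3": {"c3"}}
projected = to_cluster(entity, "dataset1", "dataset2")
absent = to_cluster(entity, "dataset9")
print(projected, absent)
```

Because the stand-in is frozen and hashable, projected entities can be collected into sets and compared with `==`, `-`, and other set operators.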
query_to_cluster_entities
query_to_cluster_entities(
    query: Table | DataFrame, source_pks: dict[str, str]
) -> set[ClusterEntity]
Convert a query result to a set of ClusterEntities.
Useful for turning a real query from a real model resolution in Matchbox into a set of ClusterEntities that can be used in LinkedSourcesTestkit.diff_results().
Parameters:
- query (Table | DataFrame) – A PyArrow table or DataFrame representing a query result
- source_pks (dict[str, str]) – Mapping of source names to primary key column names
Returns:
- set[ClusterEntity] – A set of ClusterEntity objects
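The conversion amounts to grouping rows by cluster and collecting each source's primary keys. The sketch below shows that grouping over a list of dictionaries instead of a PyArrow table. It assumes the query result carries a cluster identifier in a column named `id` and leaves a source's PK column null on rows from other sources. Both are assumptions, as is the helper name `rows_to_clusters`.

```python
from collections import defaultdict


def rows_to_clusters(
    rows: list[dict], source_pks: dict[str, str]
) -> dict[int, dict[str, frozenset[str]]]:
    """Group rows by cluster id, collecting the PKs seen for each source."""
    grouped: dict[int, dict[str, set[str]]] = defaultdict(lambda: defaultdict(set))
    for row in rows:
        for source_name, pk_column in source_pks.items():
            pk = row.get(pk_column)
            if pk is not None:  # null means the row is not from this source
                grouped[row["id"]][source_name].add(str(pk))
    return {
        cluster_id: {name: frozenset(pks) for name, pks in refs.items()}
        for cluster_id, refs in grouped.items()
    }


rows = [
    {"id": 1, "crn_pk": "A1", "duns_pk": None},
    {"id": 1, "crn_pk": None, "duns_pk": "D9"},
    {"id": 2, "crn_pk": "A2", "duns_pk": None},
]
clusters = rows_to_clusters(rows, {"crn": "crn_pk", "duns": "duns_pk"})
print(clusters)
```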
generate_entities
cached
generate_entities(
    generator: Faker,
    features: tuple[FeatureConfig, ...],
    n: int,
) -> tuple[SourceEntity]
Generate base entities with their ground truth values.
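The "cached" label suggests that repeated calls with the same arguments reuse an earlier result, which would also explain why `features` is a hashable tuple rather than a list. The sketch below shows that pattern with `functools.lru_cache`. Both the caching mechanism and the use of a seeded `random.Random` in place of a Faker instance are assumptions made to keep the example self-contained.

```python
import random
from functools import lru_cache


@lru_cache(maxsize=None)
def generate_base_values(
    seed: int, features: tuple[str, ...], n: int
) -> tuple[dict[str, str], ...]:
    """Build n entities, each with one ground-truth value per feature."""
    rng = random.Random(seed)  # seeded so the same inputs give the same entities
    return tuple(
        {feature: f"{feature}_{rng.randrange(10**6):06d}" for feature in features}
        for _ in range(n)
    )


first = generate_base_values(42, ("company_name", "crn"), 3)
second = generate_base_values(42, ("company_name", "crn"), 3)
print(first is second)  # the second call is served from the cache
```

All arguments must be hashable for the cache lookup to work, so passing a list of features would raise a `TypeError`.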
probabilities_to_results_entities
probabilities_to_results_entities(
    probabilities: Table,
    left_clusters: tuple[ClusterEntity, ...],
    right_clusters: tuple[ClusterEntity, ...] | None = None,
    threshold: float | int = 0,
) -> tuple[ClusterEntity, ...]
Convert probabilities to ClusterEntity objects based on a threshold.
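One common way to turn pairwise probabilities into merged clusters is to keep the pairs at or above the threshold and take connected components. The sketch below does this with a small union-find. It is an illustration of that general technique, not a description of this function's internals. The edge format, the `>=` comparison, and the choice to keep unmerged clusters in the output are all assumptions.

```python
def merge_by_threshold(
    edges: list[tuple[int, int, float]],
    clusters: dict[int, frozenset[str]],
    threshold: float,
) -> set[frozenset[str]]:
    """Merge clusters linked by an edge whose probability meets the threshold."""
    parent = {cluster_id: cluster_id for cluster_id in clusters}

    def find(x: int) -> int:
        # Follow parent pointers to the root, halving the path as we go.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for left, right, probability in edges:
        if probability >= threshold:
            parent[find(left)] = find(right)

    merged: dict[int, frozenset[str]] = {}
    for cluster_id, pks in clusters.items():
        root = find(cluster_id)
        merged[root] = merged.get(root, frozenset()) | pks
    return set(merged.values())


clusters = {1: frozenset({"a"}), 2: frozenset({"b"}), 3: frozenset({"c"})}
edges = [(1, 2, 0.9), (2, 3, 0.4)]
result = merge_by_threshold(edges, clusters, threshold=0.5)
print(result)
```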
diff_results
diff_results(
    expected: list[ClusterEntity],
    actual: list[ClusterEntity],
) -> tuple[bool, dict]
Compare two lists of ClusterEntity with detailed diff information.
Parameters:
- expected (list[ClusterEntity]) – Expected ClusterEntity list
- actual (list[ClusterEntity]) – Actual ClusterEntity list
Returns:
- tuple[bool, dict] – A tuple containing:
  - Boolean: True if lists are identical, False otherwise
  - Dictionary that counts the number of actual entities that fall into the following criteria:
    - ‘perfect’: Match an expected entity exactly
    - ‘subset’: Are a subset of an expected entity
    - ‘superset’: Are a superset of an expected entity
    - ‘wrong’: Don’t match any expected entity
    - ‘invalid’: Contain source_pks not present in any expected entity
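The five categories can be illustrated with entities modelled as frozensets of `(source, pk)` pairs. This is a rough sketch of the classification only. The order in which the checks are applied (invalid first, then perfect, subset, superset, wrong) is an assumption, and `classify` is a hypothetical name.

```python
Entity = frozenset[tuple[str, str]]


def classify(expected: list[Entity], actual: list[Entity]) -> tuple[bool, dict[str, int]]:
    """Count how each actual entity relates to the expected entities."""
    counts = dict.fromkeys(("perfect", "subset", "superset", "wrong", "invalid"), 0)
    known = frozenset().union(*expected)  # every (source, pk) pair that is expected
    for entity in actual:
        if not entity <= known:
            counts["invalid"] += 1
        elif entity in expected:
            counts["perfect"] += 1
        elif any(entity < e for e in expected):
            counts["subset"] += 1
        elif any(entity > e for e in expected):
            counts["superset"] += 1
        else:
            counts["wrong"] += 1
    return set(expected) == set(actual), counts


expected = [
    frozenset({("crn", "1"), ("duns", "9")}),
    frozenset({("crn", "2")}),
    frozenset({("crn", "3"), ("crn", "4")}),
]
actual = [
    frozenset({("crn", "1"), ("duns", "9")}),  # matches exactly
    frozenset({("crn", "3")}),                 # part of an expected entity
    frozenset({("crn", "2"), ("crn", "4")}),   # contains an expected entity
    frozenset({("crn", "1"), ("crn", "3")}),   # valid PKs, wrong grouping
    frozenset({("crn", "99")}),                # PK not expected anywhere
]
identical, counts = classify(expected, actual)
print(identical, counts)
```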