Entities

matchbox.common.factories.entities

Classes and functions for generating and comparing entities.

These underpin the entity resolution process, which is the core of the source and model testkit factory system.

Classes:

  • VariationRule

    Abstract base class for variation rules.

  • SuffixRule

    Add a suffix to a value.

  • PrefixRule

    Add a prefix to a value.

  • ReplaceRule

    Replace occurrences of a string with another.

  • FeatureConfig

    Configuration for generating a feature with variations.

  • EntityReference

    Reference to an entity’s presence in specific sources.

  • EntityIDMixin

    Mixin providing common ID-based functionality for entity classes.

  • SourceKeyMixin

    Mixin providing common source key functionality for entity classes.

  • ClusterEntity

    Represents a merged entity mid-pipeline.

  • SourceEntity

    Represents a single entity across all sources.

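Functions:

  • infer_data_type

    Infer an appropriate Matchbox type from a Faker configuration.

  • query_to_cluster_entities

    Convert a query result to a set of ClusterEntities.

  • generate_entities

    Generate base entities with their ground truth values from generator.

  • probabilities_to_results_entities

    Convert probabilities to ClusterEntity objects based on a threshold.

  • diff_results

    Compare two lists of ClusterEntity with detailed diff information.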

VariationRule

Bases: BaseModel, ABC

Abstract base class for variation rules.

Methods:

  • apply

    Apply the variation to a value.

Attributes:

  • type (str) –

    Return the type of variation.

type abstractmethod property

type: str

Return the type of variation.

apply abstractmethod

apply(value: str) -> str

Apply the variation to a value.

SuffixRule

Bases: VariationRule

Add a suffix to a value.

Methods:

  • apply

    Apply the suffix to the value.

Attributes:

suffix instance-attribute

suffix: str

type property

type: str

Return the type of variation.

apply

apply(value: str) -> str

Apply the suffix to the value.

PrefixRule

Bases: VariationRule

Add a prefix to a value.

Methods:

  • apply

    Apply the prefix to the value.

Attributes:

prefix instance-attribute

prefix: str

type property

type: str

Return the type of variation.

apply

apply(value: str) -> str

Apply the prefix to the value.

ReplaceRule

Bases: VariationRule

Replace occurrences of a string with another.

Methods:

  • apply

    Apply the replacement to the value.

Attributes:

old instance-attribute

old: str

new instance-attribute

new: str

type property

type: str

Return the type of variation.

apply

apply(value: str) -> str

Apply the replacement to the value.
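
The three concrete rules share one small interface: a `type` label and an `apply` method. The sketch below illustrates that shape using only the standard library. It is a simplified stand-in, not the Matchbox implementation: the documented classes derive from pydantic's `BaseModel`, and the strings returned by `type` here are assumed labels.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass


class VariationRule(ABC):
    """Stand-in base class; the documented one is also a pydantic BaseModel."""

    @property
    @abstractmethod
    def type(self) -> str:
        """Return the type of variation."""

    @abstractmethod
    def apply(self, value: str) -> str:
        """Apply the variation to a value."""


@dataclass(frozen=True)
class SuffixRule(VariationRule):
    suffix: str

    @property
    def type(self) -> str:
        return "suffix"  # assumed label

    def apply(self, value: str) -> str:
        return value + self.suffix


@dataclass(frozen=True)
class PrefixRule(VariationRule):
    prefix: str

    @property
    def type(self) -> str:
        return "prefix"  # assumed label

    def apply(self, value: str) -> str:
        return self.prefix + value


@dataclass(frozen=True)
class ReplaceRule(VariationRule):
    old: str
    new: str

    @property
    def type(self) -> str:
        return "replace"  # assumed label

    def apply(self, value: str) -> str:
        return value.replace(self.old, self.new)


# Each rule produces one variation of the same base value.
rules = (ReplaceRule("Limited", "Ltd"), SuffixRule(" UK"))
variants = [rule.apply("Acme Limited") for rule in rules]
```

Here `variants` holds one string per rule, each derived independently from the base value rather than chained.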

FeatureConfig

Bases: BaseModel

Configuration for generating a feature with variations.

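Methods:

  • add_variations

    Add a variation rule to the feature.

  • protected_names

    Ensure name is not a reserved keyword.

  • string_to_strenum

    Convert string to DataTypes enum.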

Attributes:

name instance-attribute

name: str

base_generator instance-attribute

base_generator: str

parameters class-attribute instance-attribute

parameters: tuple | None = Field(
    default=None,
    description="Parameters for the generator. A tuple of tuples passed to the generator.",
)

unique class-attribute instance-attribute

unique: bool = Field(
    default=True,
    description="Whether the generator enforces uniqueness in the generated data. For example, using unique=True with the 'boolean' generator will error if more than two values are generated.",
)

drop_base class-attribute instance-attribute

drop_base: bool = Field(
    default=False,
    description="Whether the base case is dropped.",
)

variations class-attribute instance-attribute

variations: tuple[VariationRule, ...] = Field(
    default_factory=tuple
)

datatype class-attribute instance-attribute

datatype: DataTypes = Field(
    default_factory=lambda data: infer_data_type(
        data["base_generator"], data["parameters"]
    )
)

add_variations

add_variations(*rule: VariationRule) -> FeatureConfig

Add one or more variation rules to the feature.

protected_names classmethod

protected_names(value: str) -> str

Ensure name is not a reserved keyword.

string_to_strenum classmethod

string_to_strenum(value: str) -> DataTypes

Convert string to DataTypes enum.

EntityReference

EntityReference(
    mapping: dict[SourceResolutionName, frozenset[str]]
    | None = None,
)

Bases: frozendict

Reference to an entity’s presence in specific sources.

Maps source resolution names to sets of primary keys.

EntityIDMixin

Mixin providing common ID-based functionality for entity classes.

Implements integer conversion and comparison operators for sorting based on the entity’s ID.

Attributes:

id instance-attribute

id: int
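
One way such a mixin could work is sketched below: integer conversion returns the ID, and the ordering operators compare IDs so that instances sort naturally. The class names `IDOrdered` and `Record` are hypothetical, and the exact set of operators the real mixin defines is an assumption.

```python
from dataclasses import dataclass


class IDOrdered:
    """Hypothetical mixin: order and convert instances by their integer id."""

    id: int

    def __int__(self) -> int:
        return self.id

    def __lt__(self, other) -> bool:
        # int(other) also works when `other` is a plain int
        return int(self) < int(other)

    def __gt__(self, other) -> bool:
        return int(self) > int(other)


@dataclass
class Record(IDOrdered):
    """Hypothetical entity class that picks up ordering from the mixin."""

    id: int
    label: str


records = [Record(3, "c"), Record(1, "a"), Record(2, "b")]
ordered = sorted(records)  # sorted() relies on __lt__
```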

SourceKeyMixin

Mixin providing common source key functionality for entity classes.

Implements methods for accessing and retrieving source keys.

Methods:

  • get_keys

    Get keys for a specific source.

  • get_values

    Get all unique values for this entity across sources.

Attributes:

keys instance-attribute

get_keys

Get keys for a specific source.

Parameters:

Returns:

  • set[str]

    Set of keys, empty if source not found

get_values

Get all unique values for this entity across sources.

Each source may have its own variations/transformations of the base data, so we maintain separation between sources.

Parameters:

Returns:

ClusterEntity

Bases: BaseModel, EntityIDMixin, SourceKeyMixin

Represents a merged entity mid-pipeline.

Methods:

  • is_subset_of_source_entity

    Check if this ClusterEntity’s references are a subset of a SourceEntity’s.

  • similarity_ratio

    Return ratio of shared keys to total keys across all sources.

  • get_keys

    Get keys for a specific source.

  • get_values

    Get all unique values for this entity across sources.

Attributes:

id class-attribute instance-attribute

id: int = Field(default_factory=lambda: getrandbits(63))

keys instance-attribute

is_subset_of_source_entity

is_subset_of_source_entity(
    source_entity: SourceEntity,
) -> bool

Check if this ClusterEntity’s references are a subset of a SourceEntity’s.

similarity_ratio

similarity_ratio(other: ClusterEntity) -> float

Return ratio of shared keys to total keys across all sources.
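
The two comparisons above can be illustrated with plain dictionaries that map a source name to a frozenset of keys. This is a sketch under stated assumptions, not the library's code: the function names are hypothetical, and the ratio is assumed to be the count of keys shared per source divided by the count of distinct keys per source, summed over all sources.

```python
Keys = dict[str, frozenset[str]]  # source name -> keys in that source


def is_subset(cluster: Keys, source: Keys) -> bool:
    """True if every key the cluster holds also appears in the source entity."""
    return all(
        keys <= source.get(name, frozenset()) for name, keys in cluster.items()
    )


def similarity_ratio(a: Keys, b: Keys) -> float:
    """Assumed definition: shared keys / total distinct keys, over all sources."""
    shared = total = 0
    for name in set(a) | set(b):
        keys_a = a.get(name, frozenset())
        keys_b = b.get(name, frozenset())
        shared += len(keys_a & keys_b)
        total += len(keys_a | keys_b)
    return shared / total if total else 0.0


a = {"crn": frozenset({"1", "2"}), "duns": frozenset({"9"})}
b = {"crn": frozenset({"2", "3"}), "duns": frozenset({"9"})}

ratio = similarity_ratio(a, b)  # 2 shared keys out of 4 distinct keys
partial = is_subset({"crn": frozenset({"2"})}, b)
```

With these inputs the entities share `"2"` in `crn` and `"9"` in `duns`, out of four distinct keys, so `ratio` is 0.5.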

get_keys

Get keys for a specific source.

Parameters:

Returns:

  • set[str]

    Set of keys, empty if source not found

get_values

Get all unique values for this entity across sources.

Each source may have its own variations/transformations of the base data, so we maintain separation between sources.

Parameters:

Returns:

SourceEntity

Bases: BaseModel, EntityIDMixin, SourceKeyMixin

Represents a single entity across all sources.

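Methods:

  • add_source_reference

    Add or update a source reference.

  • to_cluster_entity

    Convert this SourceEntity to a ClusterEntity with the specified sources.

  • get_keys

    Get keys for a specific source.

  • get_values

    Get all unique values for this entity across sources.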

Attributes:

id class-attribute instance-attribute

id: int = Field(default_factory=lambda: getrandbits(63))

base_values class-attribute instance-attribute

base_values: dict[str, Any] = Field(
    description="Feature name -> base value"
)

keys class-attribute instance-attribute

keys: EntityReference = Field(
    description="Source to keys mapping",
    default=EntityReference(mapping=frozenset()),
)

total_unique_variations class-attribute instance-attribute

total_unique_variations: int = Field(default=0)

add_source_reference

add_source_reference(
    name: SourceResolutionName, keys: list[str]
) -> None

Add or update a source reference.

Parameters:

to_cluster_entity

to_cluster_entity(
    *names: SourceResolutionName,
) -> ClusterEntity | None

Convert this SourceEntity to a ClusterEntity with the specified sources.

This method makes diffing straightforward. Checking whether ClusterEntity objects are subsets of SourceEntity objects is a weaker and logically more fragile test than comparing sets of ClusterEntity objects for equality. Converting first lets the test be written as plain set operations:

actual: set[ClusterEntity] = ...
expected: set[ClusterEntity] = {
    s.to_cluster_entity("source1", "source2")
    for s in source_entities
}

is_identical = expected == actual
missing = expected - actual
extra = actual - expected

Parameters:

Returns:

  • ClusterEntity | None

    ClusterEntity containing only the specified sources’ keys, or None if none of the specified sources are present in this entity.

get_keys

Get keys for a specific source.

Parameters:

Returns:

  • set[str]

    Set of keys, empty if source not found

get_values

Get all unique values for this entity across sources.

Each source may have its own variations/transformations of the base data, so we maintain separation between sources.

Parameters:

Returns:

infer_data_type

infer_data_type(
    base: str, parameters: tuple | None
) -> DataTypes

Infer an appropriate Matchbox type from a Faker configuration.

Parameters:

  • base

    (str) –

    Faker generator type

  • parameters

    (tuple | None) –

    Parameters for the generator

Returns:

query_to_cluster_entities

query_to_cluster_entities(
    query: Table | DataFrame,
    keys: dict[SourceResolutionName, str],
) -> set[ClusterEntity]

Convert a query result to a set of ClusterEntities.

Useful for turning a real query from a real model resolution in Matchbox into a set of ClusterEntities that can be used in LinkedSourcesTestkit.diff_results().

Parameters:

  • query

    (Table | DataFrame) –

    A PyArrow table or DataFrame representing a query result

  • keys

    (dict[SourceResolutionName, str]) –

    Mapping of source resolution names to key field names

Returns:

generate_entities cached

generate_entities(
    generator: Faker,
    features: tuple[FeatureConfig, ...],
    n: int,
) -> tuple[SourceEntity]

Generate base entities with their ground truth values from generator.

probabilities_to_results_entities

probabilities_to_results_entities(
    probabilities: Table,
    left_clusters: tuple[ClusterEntity, ...],
    right_clusters: tuple[ClusterEntity, ...] | None = None,
    threshold: float | int = 0,
) -> tuple[ClusterEntity, ...]

Convert probabilities to ClusterEntity objects based on a threshold.

diff_results

Compare two lists of ClusterEntity with detailed diff information.

Parameters:

Returns:

  • tuple[bool, dict]

    A tuple containing:

    • Boolean: True if lists are identical, False otherwise
    • Dictionary that counts the number of actual entities that fall into the following criteria:
      • ‘perfect’: Match an expected entity exactly
      • ‘subset’: Are a subset of an expected entity
      • ‘superset’: Are a superset of an expected entity
      • ‘wrong’: Don’t match any expected entity
      • ‘invalid’: Contain keys not present in any expected entity
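
The five categories can be made concrete with a small standard-library sketch. It treats each entity as a frozenset of (source, key) pairs rather than a ClusterEntity. The function names, the argument order, and the precedence between categories (invalid first, then perfect, subset, superset, wrong) are all assumptions for illustration, not the behaviour of `diff_results` itself.

```python
from collections import Counter

Entity = frozenset  # stand-in: a frozenset of (source, key) pairs


def classify(entity: Entity, expected: set, known: frozenset) -> str:
    """Place one actual entity into a single category (assumed precedence)."""
    if not entity <= known:
        return "invalid"  # holds a key that no expected entity contains
    if entity in expected:
        return "perfect"
    if any(entity < e for e in expected):
        return "subset"
    if any(entity > e for e in expected):
        return "superset"
    return "wrong"


def diff_sketch(expected: list, actual: list) -> tuple[bool, dict]:
    """Hypothetical helper returning (identical, counts per category)."""
    expected_set = set(expected)
    known = frozenset().union(*expected)  # every key seen in expected
    counts = Counter(classify(e, expected_set, known) for e in actual)
    return expected_set == set(actual), dict(counts)


expected = [
    Entity({("a", "1"), ("b", "1")}),
    Entity({("a", "2"), ("b", "2")}),
]
actual = [
    Entity({("a", "1"), ("b", "1")}),              # exact match
    Entity({("a", "2")}),                          # part of an expected entity
    Entity({("a", "1"), ("b", "1"), ("a", "2")}),  # contains an expected entity
    Entity({("a", "1"), ("b", "2")}),              # known keys, wrong grouping
    Entity({("a", "9")}),                          # unknown key
]

identical, report = diff_sketch(expected, actual)
```

In this example each actual entity lands in a different category, so `report` contains a count of one for each of the five labels and `identical` is False.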