Entities

matchbox.common.factories.entities

Classes and functions for generating and comparing entities.

These underpin the entity resolution process, which is the core of the source and model testkit factory system.

Classes:

  • VariationRule

    Abstract base class for variation rules.

  • SuffixRule

    Add a suffix to a value.

  • PrefixRule

    Add a prefix to a value.

  • ReplaceRule

    Replace occurrences of a string with another.

  • FeatureConfig

    Configuration for generating a feature with variations.

  • EntityReference

    Reference to an entity’s presence in specific sources.

  • EntityIDMixin

    Mixin providing common ID-based functionality for entity classes.

  • SourcePKMixin

    Mixin providing common source primary key functionality for entity classes.

  • ClusterEntity

    Represents a merged entity mid-pipeline.

  • SourceEntity

    Represents a single entity across all sources.

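Functions:

  • infer_sql_type

    Infer an appropriate SQL type from a Faker configuration.

  • query_to_cluster_entities

    Convert a query result to a set of ClusterEntities.

  • generate_entities

    Generate base entities with their ground truth values.

  • probabilities_to_results_entities

    Convert probabilities to ClusterEntity objects based on a threshold.

  • diff_results

    Compare two lists of ClusterEntity with detailed diff information.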

VariationRule

Bases: BaseModel, ABC

Abstract base class for variation rules.

Methods:

  • apply

    Apply the variation to a value.

Attributes:

  • type (str) –

    Return the type of variation.

type abstractmethod property

type: str

Return the type of variation.

apply abstractmethod

apply(value: str) -> str

Apply the variation to a value.

SuffixRule

Bases: VariationRule

Add a suffix to a value.

Methods:

  • apply

    Apply the suffix to the value.

Attributes:

suffix instance-attribute

suffix: str

type property

type: str

Return the type of variation.

apply

apply(value: str) -> str

Apply the suffix to the value.

PrefixRule

Bases: VariationRule

Add a prefix to a value.

Methods:

  • apply

    Apply the prefix to the value.

Attributes:

prefix instance-attribute

prefix: str

type property

type: str

Return the type of variation.

apply

apply(value: str) -> str

Apply the prefix to the value.

ReplaceRule

Bases: VariationRule

Replace occurrences of a string with another.

Methods:

  • apply

    Apply the replacement to the value.

Attributes:

old instance-attribute

old: str

new instance-attribute

new: str

type property

type: str

Return the type of variation.

apply

apply(value: str) -> str

Apply the replacement to the value.
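
The three concrete rules share one contract: a type label plus a string-to-string apply. The following is a minimal sketch of that contract using only the standard library. The real classes are Pydantic models, so the dataclass stand-ins, the type labels, and the sample values here are assumptions made for illustration.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass


class VariationRule(ABC):
    """Stand-in for the abstract base: a labelled string transform."""

    @property
    @abstractmethod
    def type(self) -> str: ...

    @abstractmethod
    def apply(self, value: str) -> str: ...


@dataclass(frozen=True)
class SuffixRule(VariationRule):
    suffix: str

    @property
    def type(self) -> str:
        return "suffix"  # assumed label, not taken from the library

    def apply(self, value: str) -> str:
        return f"{value}{self.suffix}"


@dataclass(frozen=True)
class PrefixRule(VariationRule):
    prefix: str

    @property
    def type(self) -> str:
        return "prefix"  # assumed label

    def apply(self, value: str) -> str:
        return f"{self.prefix}{value}"


@dataclass(frozen=True)
class ReplaceRule(VariationRule):
    old: str
    new: str

    @property
    def type(self) -> str:
        return "replace"  # assumed label

    def apply(self, value: str) -> str:
        return value.replace(self.old, self.new)


rules = (SuffixRule(" Ltd"), PrefixRule("The "), ReplaceRule("Company", "Co"))
variants = [rule.apply("Acme Company") for rule in rules]
```

Each rule is applied to the same base value independently, so the three variants do not compound.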

FeatureConfig

Bases: BaseModel

Configuration for generating a feature with variations.

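Methods:

  • add_variations

    Add a variation rule to the feature.

  • protected_names

    Ensure name is not a reserved keyword.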

Attributes:

name instance-attribute

name: str

base_generator instance-attribute

base_generator: str

parameters class-attribute instance-attribute

parameters: tuple = Field(
    default_factory=tuple,
    description="Parameters for the generator. A tuple of tuples passed to the generator.",
)

unique class-attribute instance-attribute

unique: bool = Field(
    default=True,
    description="Whether the generator enforces uniqueness in the generated data. For example, using unique=True with the 'boolean' generator will error if more than two values are generated.",
)

drop_base class-attribute instance-attribute

drop_base: bool = Field(
    default=False,
    description="Whether the base case is dropped.",
)

variations class-attribute instance-attribute

variations: tuple[VariationRule, ...] = Field(
    default_factory=tuple
)

sql_type class-attribute instance-attribute

sql_type: str = Field(
    default_factory=lambda data: infer_sql_type(
        data["base_generator"], data["parameters"]
    )
)

add_variations

add_variations(*rule: VariationRule) -> FeatureConfig

Add a variation rule to the feature.

protected_names

protected_names(value: str) -> str

Ensure name is not a reserved keyword.
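
One way to picture how variations and drop_base interact is sketched below. The helper expand_feature is hypothetical and not part of matchbox. It assumes each variation is applied to the base value on its own, and that drop_base removes the unmodified base value from the output.

```python
from typing import Callable


def expand_feature(
    base: str,
    variations: tuple[Callable[[str], str], ...],
    drop_base: bool = False,
) -> list[str]:
    """Hypothetical helper: list the values one base value could yield."""
    # Assumption: the base case comes first unless it is dropped.
    values = [] if drop_base else [base]
    # Assumption: every variation is applied to the base value, not chained.
    values.extend(rule(base) for rule in variations)
    return values


rules = (lambda v: v + " Ltd", str.upper)
with_base = expand_feature("Acme", rules)
without_base = expand_feature("Acme", rules, drop_base=True)
```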

EntityReference

EntityReference(
    mapping: dict[str, frozenset[str]] | None = None,
)

Bases: frozendict

Reference to an entity’s presence in specific sources.

Maps dataset names to sets of primary keys.

EntityIDMixin

Mixin providing common ID-based functionality for entity classes.

Implements integer conversion and comparison operators for sorting based on the entity’s ID.

Attributes:

id instance-attribute

id: int
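
As a rough illustration of ID-based ordering, the sketch below defines only integer conversion and a less-than operator, which is enough for sorted() to work. The class name HasID is hypothetical, and the real mixin may define more operators than shown.

```python
class HasID:
    """Hypothetical stand-in for a class that mixes in ID-based ordering."""

    def __init__(self, id: int) -> None:
        self.id = id

    def __int__(self) -> int:
        # Integer conversion returns the entity's ID.
        return self.id

    def __lt__(self, other: "HasID") -> bool:
        # Ordering compares IDs only; sorted() relies on __lt__.
        return int(self) < int(other)


entities = [HasID(30), HasID(10), HasID(20)]
ordered_ids = [int(entity) for entity in sorted(entities)]
```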

SourcePKMixin

Mixin providing common source primary key functionality for entity classes.

Implements methods for accessing and retrieving source primary keys.

Methods:

  • get_source_pks

    Get PKs for a specific source.

  • get_values

    Get all unique values for this entity across sources.

Attributes:

source_pks instance-attribute

source_pks: EntityReference

get_source_pks

get_source_pks(source_name: str) -> set[str]

Get PKs for a specific source.

Parameters:

  • source_name
    (str) –

    Name of the dataset

Returns:

  • set[str]

    Set of primary keys, empty if dataset not found
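
The lookup behaviour described above can be sketched with a plain dictionary standing in for EntityReference. The dataset names and keys are invented for illustration, and the free function below only mirrors the documented contract of the method.

```python
# Plain-dict stand-in for an EntityReference: dataset name -> frozenset of PKs.
source_pks: dict[str, frozenset[str]] = {
    "companies": frozenset({"c_001", "c_002"}),
    "exporters": frozenset({"e_917"}),
}


def get_source_pks(reference: dict[str, frozenset[str]], source_name: str) -> set[str]:
    """Return the PKs for one dataset, or an empty set if it is absent."""
    return set(reference.get(source_name, frozenset()))


found = get_source_pks(source_pks, "exporters")
missing = get_source_pks(source_pks, "not_a_dataset")
```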

get_values

get_values(
    sources: dict[str, SourceTestkit],
) -> dict[str, dict[str, list[str]]]

Get all unique values for this entity across sources.

Each source may have its own variations/transformations of the base data, so we maintain separation between sources.

Parameters:

Returns:

  • dict[str, dict[str, list[str]]]

    Dictionary mapping: source_name -> { feature_name -> [unique values for that feature in that source] }

ClusterEntity

Bases: BaseModel, EntityIDMixin, SourcePKMixin

Represents a merged entity mid-pipeline.

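Methods:

  • get_source_pks

    Get PKs for a specific source.

  • get_values

    Get all unique values for this entity across sources.

  • is_subset_of_source_entity

    Check if this ClusterEntity’s references are a subset of a SourceEntity’s.

  • similarity_ratio

    Return ratio of shared PKs to total PKs across all datasets.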

Attributes:

id class-attribute instance-attribute

id: int = Field(default_factory=lambda: getrandbits(63))

source_pks instance-attribute

source_pks: EntityReference

get_source_pks

get_source_pks(source_name: str) -> set[str]

Get PKs for a specific source.

Parameters:

  • source_name
    (str) –

    Name of the dataset

Returns:

  • set[str]

    Set of primary keys, empty if dataset not found

get_values

get_values(
    sources: dict[str, SourceTestkit],
) -> dict[str, dict[str, list[str]]]

Get all unique values for this entity across sources.

Each source may have its own variations/transformations of the base data, so we maintain separation between sources.

Parameters:

Returns:

  • dict[str, dict[str, list[str]]]

    Dictionary mapping: source_name -> { feature_name -> [unique values for that feature in that source] }

is_subset_of_source_entity

is_subset_of_source_entity(
    source_entity: SourceEntity,
) -> bool

Check if this ClusterEntity’s references are a subset of a SourceEntity’s.
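
A subset check over references might look like the sketch below, again with plain dictionaries standing in for EntityReference. It assumes "subset" means that, for every dataset the cluster mentions, its keys are all present in the source entity for that dataset.

```python
Reference = dict[str, frozenset[str]]


def is_subset(cluster: Reference, source: Reference) -> bool:
    """Assumed semantics: every cluster PK appears under the same dataset in source."""
    return all(
        pks <= source.get(name, frozenset())
        for name, pks in cluster.items()
    )


source = {"a": frozenset({"1", "2"}), "b": frozenset({"9"})}
partial = {"a": frozenset({"1"})}
foreign = {"a": frozenset({"1"}), "c": frozenset({"5"})}
```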

similarity_ratio

similarity_ratio(other: ClusterEntity) -> float

Return ratio of shared PKs to total PKs across all datasets.
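
The phrase "shared PKs to total PKs" can be read as intersection over union, summed per dataset. The sketch below implements that reading. It is an assumption, since the exact formula is not given here, and returning 0.0 for two empty references is likewise a choice made for the example.

```python
Reference = dict[str, frozenset[str]]


def similarity_ratio(left: Reference, right: Reference) -> float:
    """Assumed formula: sum of per-dataset intersections over sum of unions."""
    shared = 0
    total = 0
    for name in left.keys() | right.keys():
        left_pks = left.get(name, frozenset())
        right_pks = right.get(name, frozenset())
        shared += len(left_pks & right_pks)
        total += len(left_pks | right_pks)
    # Assumption: two empty references have no similarity.
    return shared / total if total else 0.0


a = {"x": frozenset({"1", "2"}), "y": frozenset({"3"})}
b = {"x": frozenset({"2"}), "y": frozenset({"3", "4"})}
ratio = similarity_ratio(a, b)  # 2 shared keys out of 4 distinct keys
```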

SourceEntity

Bases: BaseModel, EntityIDMixin, SourcePKMixin

Represents a single entity across all sources.

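Methods:

  • get_source_pks

    Get PKs for a specific source.

  • get_values

    Get all unique values for this entity across sources.

  • add_source_reference

    Add or update a source reference.

  • to_cluster_entity

    Convert this SourceEntity to a ClusterEntity with the specified datasets.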

Attributes:

id class-attribute instance-attribute

id: int = Field(default_factory=lambda: getrandbits(63))

base_values class-attribute instance-attribute

base_values: dict[str, Any] = Field(
    description="Feature name -> base value"
)

source_pks class-attribute instance-attribute

source_pks: EntityReference = Field(
    description="Dataset to PKs mapping",
    default=EntityReference(mapping=frozenset()),
)

total_unique_variations class-attribute instance-attribute

total_unique_variations: int = Field(default=0)

get_source_pks

get_source_pks(source_name: str) -> set[str]

Get PKs for a specific source.

Parameters:

  • source_name
    (str) –

    Name of the dataset

Returns:

  • set[str]

    Set of primary keys, empty if dataset not found

get_values

get_values(
    sources: dict[str, SourceTestkit],
) -> dict[str, dict[str, list[str]]]

Get all unique values for this entity across sources.

Each source may have its own variations/transformations of the base data, so we maintain separation between sources.

Parameters:

Returns:

  • dict[str, dict[str, list[str]]]

    Dictionary mapping: source_name -> { feature_name -> [unique values for that feature in that source] }

add_source_reference

add_source_reference(name: str, pks: list[str]) -> None

Add or update a source reference.

Parameters:

  • name
    (str) –

    Dataset name

  • pks
    (list[str]) –

    List of primary keys for this dataset

to_cluster_entity

to_cluster_entity(*names: str) -> ClusterEntity | None

Convert this SourceEntity to a ClusterEntity with the specified datasets.

This method makes diffing straightforward. Testing whether ClusterEntity objects are subsets of SourceEntity objects is a weaker, logically more fragile test than comparing sets of ClusterEntity objects for equality. Converting first lets the test be written with plain set operations:

actual: set[ClusterEntity] = ...
expected: set[ClusterEntity] = {
    s.to_cluster_entity("dataset1", "dataset2")
    for s in source_entities
}

is_identical = expected == actual
missing = expected - actual
extra = actual - expected

Parameters:

  • *names
    (str, default: () ) –

    Names of datasets to include in the ClusterEntity

Returns:

  • ClusterEntity | None

    ClusterEntity containing only the specified datasets’ PKs, or None if none of the specified datasets are present in this entity.
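
The set-based diffing pattern can be tried without matchbox installed. In the sketch below a cluster is represented as a frozenset of (dataset, pk) pairs, which is hashable and compares by content. The helper to_cluster and all sample data are hypothetical stand-ins rather than the library's implementation.

```python
def to_cluster(source_pks: dict[str, set[str]], *names: str):
    """Hypothetical stand-in: keep only the named datasets, or None if none match."""
    pairs = frozenset(
        (name, pk) for name in names for pk in source_pks.get(name, ())
    )
    return pairs or None


source_entities = [
    {"dataset1": {"a1"}, "dataset2": {"b1", "b2"}},
    {"dataset1": {"a2"}, "dataset2": {"b3"}},
]
expected = {to_cluster(s, "dataset1", "dataset2") for s in source_entities}

# Pretend a model recovered the first entity but missed a link in the second.
actual = {
    frozenset({("dataset1", "a1"), ("dataset2", "b1"), ("dataset2", "b2")}),
    frozenset({("dataset1", "a2")}),
}

is_identical = expected == actual
missing = expected - actual
extra = actual - expected
```

Because equality is defined on the keys rather than on an ID, the partially linked second entity shows up in both missing and extra.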

infer_sql_type

infer_sql_type(base: str, parameters: tuple) -> str

Infer an appropriate SQL type from a Faker configuration.

Parameters:

  • base

    (str) –

    Faker generator type

  • parameters

    (tuple) –

    Parameters for the generator

Returns:

  • str

    A SQL type string

query_to_cluster_entities

query_to_cluster_entities(
    query: Table | DataFrame, source_pks: dict[str, str]
) -> set[ClusterEntity]

Convert a query result to a set of ClusterEntities.

Useful for turning a real query from a real model resolution in Matchbox into a set of ClusterEntities that can be used in LinkedSourcesTestkit.diff_results().

Parameters:

  • query

    (Table | DataFrame) –

    A PyArrow table or DataFrame representing a query result

  • source_pks

    (dict[str, str]) –

    Mapping of source names to primary key column names

Returns:

  • set[ClusterEntity]

    A set of ClusterEntity objects

generate_entities cached

generate_entities(
    generator: Faker,
    features: tuple[FeatureConfig, ...],
    n: int,
) -> tuple[SourceEntity]

Generate base entities with their ground truth values.

probabilities_to_results_entities

probabilities_to_results_entities(
    probabilities: Table,
    left_clusters: tuple[ClusterEntity, ...],
    right_clusters: tuple[ClusterEntity, ...] | None = None,
    threshold: float | int = 0,
) -> tuple[ClusterEntity, ...]

Convert probabilities to ClusterEntity objects based on a threshold.
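
One plausible way to turn pairwise probabilities into merged clusters is to union every pair that meets the threshold and then collect the connected components. The sketch below does this with a small union-find over plain tuples and dictionaries. The function name, the input shapes, and the use of >= for the threshold comparison are all assumptions, and the real function may behave differently.

```python
def merge_above_threshold(
    pairs: list[tuple[int, int, float]],
    clusters: dict[int, frozenset[str]],
    threshold: float,
) -> set[frozenset[str]]:
    """Hypothetical sketch: merge clusters linked by pairs at or above threshold."""
    parent = {cluster_id: cluster_id for cluster_id in clusters}

    def find(node: int) -> int:
        # Walk to the root, halving the path as we go.
        while parent[node] != node:
            parent[node] = parent[parent[node]]
            node = parent[node]
        return node

    for left, right, probability in pairs:
        if probability >= threshold:  # assumed comparison
            parent[find(left)] = find(right)

    merged: dict[int, frozenset[str]] = {}
    for cluster_id, pks in clusters.items():
        root = find(cluster_id)
        merged[root] = merged.get(root, frozenset()) | pks
    return set(merged.values())


clusters = {1: frozenset({"a"}), 2: frozenset({"b"}), 3: frozenset({"c"})}
pairs = [(1, 2, 0.9), (2, 3, 0.4)]
result = merge_above_threshold(pairs, clusters, threshold=0.5)
```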

diff_results

Compare two lists of ClusterEntity with detailed diff information.

Parameters:

Returns:

  • tuple[bool, dict]

    A tuple containing:

    • Boolean: True if lists are identical, False otherwise
    • Dictionary that counts the number of actual entities that fall into the following criteria:
      • ‘perfect’: Match an expected entity exactly
      • ‘subset’: Are a subset of an expected entity
      • ‘superset’: Are a superset of an expected entity
      • ‘wrong’: Don’t match any expected entity
      • ‘invalid’: Contain source_pks not present in any expected entity
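
The five categories can be illustrated with a small classifier over frozensets of (dataset, pk) pairs. This is a sketch of the idea only. The function name classify, the order in which the categories are tested, and the strict subset and superset comparisons are assumptions, not the library's logic.

```python
from collections import Counter

Cluster = frozenset  # stand-in: a frozenset of (dataset, pk) pairs


def classify(actual: list[Cluster], expected: list[Cluster]) -> tuple[bool, dict]:
    """Hypothetical sketch: count actual clusters per category."""
    known = set().union(*expected)  # every pair seen in any expected cluster
    counts: Counter = Counter()
    for entity in actual:
        if not entity <= known:
            counts["invalid"] += 1
        elif entity in expected:
            counts["perfect"] += 1
        elif any(entity < other for other in expected):
            counts["subset"] += 1
        elif any(entity > other for other in expected):
            counts["superset"] += 1
        else:
            counts["wrong"] += 1
    return set(actual) == set(expected), dict(counts)


A1, A2, A3 = ("a", "1"), ("a", "2"), ("a", "3")
B1, B2 = ("b", "1"), ("b", "2")
expected = [Cluster({A1, B1}), Cluster({A2, B2}), Cluster({A3})]
actual = [
    Cluster({A1, B1}),      # matches an expected cluster exactly
    Cluster({A2}),          # strictly inside an expected cluster
    Cluster({A3, B2}),      # strictly contains an expected cluster
    Cluster({A1, B2}),      # known keys, but no containment either way
    Cluster({("a", "9")}),  # key not present in any expected cluster
]
identical, report = classify(actual, expected)
```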