Entities

matchbox.common.factories.entities ¶

Classes and functions for generating and comparing entities.

These underpin the entity resolution process, which is the core of the source and model testkit factory system.

Classes:

VariationRule –

Abstract base class for variation rules.
SuffixRule –

Add a suffix to a value.
PrefixRule –

Add a prefix to a value.
ReplaceRule –

Replace occurrences of a string with another.
FeatureConfig –

Configuration for generating a feature with variations.
EntityReference –

Reference to an entity’s presence in specific sources.
EntityIDMixin –

Mixin providing common ID-based functionality for entity classes.
SourceKeyMixin –

Mixin providing common source key functionality for entity classes.
ClusterEntity –

Represents a merged entity mid-pipeline.
SourceEntity –

Represents a single entity across all sources.

Functions:

infer_data_type –

Infer an appropriate Matchbox type from a Faker configuration.
query_to_cluster_entities –

Convert a query result to a set of ClusterEntities.
generate_entities –

Generate base entities with their ground truth values from generator.
probabilities_to_results_entities –

Convert probabilities to ClusterEntity objects based on a threshold.
diff_results –

Compare two lists of ClusterEntity with detailed diff information.

VariationRule ¶

Bases: BaseModel, ABC

Abstract base class for variation rules.

Methods:

apply –

Apply the variation to a value.

Attributes:

type (str) –

Return the type of variation.

type `abstractmethod` `property` ¶

type: str

Return the type of variation.

apply `abstractmethod` ¶

apply(value: str) -> str

Apply the variation to a value.

SuffixRule ¶

Bases: VariationRule

Add a suffix to a value.

Methods:

apply –

Apply the suffix to the value.

Attributes:

suffix (str) –
type (str) –

Return the type of variation.

suffix `instance-attribute` ¶

suffix: str

type `property` ¶

type: str

Return the type of variation.

apply ¶

apply(value: str) -> str

Apply the suffix to the value.

PrefixRule ¶

Bases: VariationRule

Add a prefix to a value.

Methods:

apply –

Apply the prefix to the value.

Attributes:

prefix (str) –
type (str) –

Return the type of variation.

prefix `instance-attribute` ¶

prefix: str

type `property` ¶

type: str

Return the type of variation.

apply ¶

apply(value: str) -> str

Apply the prefix to the value.

ReplaceRule ¶

Bases: VariationRule

Replace occurrences of a string with another.

Methods:

apply –

Apply the replacement to the value.

Attributes:

old (str) –
new (str) –
type (str) –

Return the type of variation.

old `instance-attribute` ¶

old: str

new `instance-attribute` ¶

new: str

type `property` ¶

type: str

Return the type of variation.

apply ¶

apply(value: str) -> str

Apply the replacement to the value.
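
The three concrete rules can be pictured with a minimal stand-in. This is a sketch only: it uses stdlib dataclasses rather than Pydantic's BaseModel, and the strings returned by `type` are assumed labels, not values confirmed by the library.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass


class VariationRule(ABC):
    """Stand-in for the abstract base (the real class also derives from BaseModel)."""

    @property
    @abstractmethod
    def type(self) -> str:
        """Return the type of variation."""

    @abstractmethod
    def apply(self, value: str) -> str:
        """Apply the variation to a value."""


@dataclass(frozen=True)
class SuffixRule(VariationRule):
    suffix: str

    @property
    def type(self) -> str:
        return "suffix"  # assumed label

    def apply(self, value: str) -> str:
        return f"{value}{self.suffix}"


@dataclass(frozen=True)
class PrefixRule(VariationRule):
    prefix: str

    @property
    def type(self) -> str:
        return "prefix"  # assumed label

    def apply(self, value: str) -> str:
        return f"{self.prefix}{value}"


@dataclass(frozen=True)
class ReplaceRule(VariationRule):
    old: str
    new: str

    @property
    def type(self) -> str:
        return "replace"  # assumed label

    def apply(self, value: str) -> str:
        # str.replace substitutes every occurrence of `old`
        return value.replace(self.old, self.new)


# Hypothetical base value and rules, invented for illustration
base = "Acme Limited"
rules = (SuffixRule(" UK"), PrefixRule("The "), ReplaceRule("Limited", "Ltd"))
variants = [rule.apply(base) for rule in rules]
```

Each rule is applied to the base value independently here, giving one variant per rule; whether the library also chains rules is not stated in this reference.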

FeatureConfig ¶

Bases: BaseModel

Configuration for generating a feature with variations.

Methods:

add_variations –

Add a variation rule to the feature.
protected_names –

Ensure name is not a reserved keyword.
string_to_strenum –

Convert string to DataTypes enum.

Attributes:

name (str) –
base_generator (str) –
parameters (tuple | None) –
unique (bool) –
drop_base (bool) –
variations (tuple[VariationRule, ...]) –
datatype (DataTypes) –

name `instance-attribute` ¶

name: str

base_generator `instance-attribute` ¶

base_generator: str

parameters `class-attribute` `instance-attribute` ¶

parameters: tuple | None = Field(
    default=None,
    description="Parameters for the generator. A tuple of tuples passed to the generator.",
)

unique `class-attribute` `instance-attribute` ¶

unique: bool = Field(
    default=True,
    description="Whether the generator enforces uniqueness in the generated data. For example, using unique=True with the 'boolean' generator will error if more than two values are generated.",
)

drop_base `class-attribute` `instance-attribute` ¶

drop_base: bool = Field(
    default=False,
    description="Whether the base case is dropped.",
)

variations `class-attribute` `instance-attribute` ¶

variations: tuple[VariationRule, ...] = Field(
    default_factory=tuple
)

datatype `class-attribute` `instance-attribute` ¶

datatype: DataTypes = Field(
    default_factory=lambda data: infer_data_type(
        data["base_generator"], data["parameters"]
    )
)

add_variations ¶

add_variations(*rule: VariationRule) -> FeatureConfig

Add a variation rule to the feature.

protected_names `classmethod` ¶

protected_names(value: str) -> str

Ensure name is not a reserved keyword.

string_to_strenum `classmethod` ¶

string_to_strenum(value: str) -> DataTypes

Convert string to DataTypes enum.

EntityReference ¶

EntityReference(
    mapping: dict[SourceResolutionName, frozenset[str]]
    | None = None,
)

Bases: frozendict

Reference to an entity’s presence in specific sources.

Maps source resolution names to sets of primary keys.

EntityIDMixin ¶

Mixin providing common ID-based functionality for entity classes.

Implements integer conversion and comparison operators for sorting based on the entity’s ID.

Attributes:

id (int) –

id `instance-attribute` ¶

id: int
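
Integer conversion and ID-based ordering could look roughly like the following. `IDOrdered` is a hypothetical name used for illustration, not the library's class, and equality is deliberately left undefined because this reference does not say how the mixin treats it.

```python
class IDOrdered:
    """Hypothetical stand-in: converts to int and orders by `id`."""

    def __init__(self, id: int) -> None:
        self.id = id

    def __int__(self) -> int:
        return self.id

    # Comparisons accept either another IDOrdered or a plain int,
    # since int() works on both.
    def __lt__(self, other) -> bool:
        return int(self) < int(other)

    def __le__(self, other) -> bool:
        return int(self) <= int(other)

    def __gt__(self, other) -> bool:
        return int(self) > int(other)

    def __ge__(self, other) -> bool:
        return int(self) >= int(other)


entities = [IDOrdered(30), IDOrdered(10), IDOrdered(20)]
sorted_ids = [int(e) for e in sorted(entities)]
```

Sorting relies only on `__lt__`, so a list of such objects sorts by ID without a `key=` argument.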

SourceKeyMixin ¶

Mixin providing common source key functionality for entity classes.

Implements methods for accessing and retrieving source keys.

Methods:

get_keys –

Get keys for a specific source.
get_values –

Get all unique values for this entity across sources.

Attributes:

keys (EntityReference) –

keys `instance-attribute` ¶

keys: EntityReference

get_keys ¶

get_keys(name: SourceResolutionName) -> set[str]

Get keys for a specific source.

Parameters:

name ¶
(SourceResolutionName) –

Name of the source

Returns:

set[str] –

Set of keys, empty if source not found
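
The documented fallback (an empty set when the source is unknown) can be sketched with a plain dict standing in for EntityReference. The source names and keys below are invented, and the free function is a stand-in for the method.

```python
# Hypothetical reference: source resolution name -> frozenset of keys
reference: dict[str, frozenset[str]] = {
    "source1": frozenset({"a-001", "a-002"}),
    "source2": frozenset({"b-9"}),
}


def get_keys(reference: dict[str, frozenset[str]], name: str) -> set[str]:
    """Return the keys recorded for `name`, or an empty set if it is absent."""
    return set(reference.get(name, frozenset()))


present = get_keys(reference, "source1")
absent = get_keys(reference, "source3")
```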

get_values ¶

get_values(
    sources: dict[SourceResolutionName, SourceTestkit],
) -> dict[SourceResolutionName, dict[str, list[str]]]

Get all unique values for this entity across sources.

Each source may have its own variations/transformations of the base data, so we maintain separation between sources.

Parameters:

sources ¶
(dict[SourceResolutionName, SourceTestkit]) –

Dictionary of source resolution name to source data

Returns:

dict[SourceResolutionName, dict[str, list[str]]] –

Dictionary mapping: source_name -> { feature_name -> [unique values for that feature in that source] }

ClusterEntity ¶

Bases: BaseModel, EntityIDMixin, SourceKeyMixin

Represents a merged entity mid-pipeline.

Methods:

is_subset_of_source_entity –

Check if this ClusterEntity’s references are a subset of a SourceEntity’s.
similarity_ratio –

Return ratio of shared keys to total keys across all sources.
get_keys –

Get keys for a specific source.
get_values –

Get all unique values for this entity across sources.

Attributes:

id (int) –
keys (EntityReference) –

id `class-attribute` `instance-attribute` ¶

id: int = Field(default_factory=lambda: getrandbits(63))

keys `instance-attribute` ¶

keys: EntityReference

is_subset_of_source_entity ¶

is_subset_of_source_entity(
    source_entity: SourceEntity,
) -> bool

Check if this ClusterEntity’s references are a subset of a SourceEntity’s.

similarity_ratio ¶

similarity_ratio(other: ClusterEntity) -> float

Return ratio of shared keys to total keys across all sources.
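
One plausible reading of "shared keys to total keys" is a Jaccard-style ratio summed over sources. The sketch below assumes that reading, including the choice of the union as the denominator and 0.0 for two empty entities; the library may define the ratio differently. Plain dicts stand in for EntityReference.

```python
def similarity_ratio(
    a: dict[str, frozenset[str]], b: dict[str, frozenset[str]]
) -> float:
    """Shared keys divided by all distinct keys, accumulated per source."""
    shared = 0
    total = 0
    for source in set(a) | set(b):
        keys_a = a.get(source, frozenset())
        keys_b = b.get(source, frozenset())
        shared += len(keys_a & keys_b)
        total += len(keys_a | keys_b)  # assumption: union as denominator
    return shared / total if total else 0.0


# Hypothetical entities, invented for illustration
left = {"source1": frozenset({"1", "2"}), "source2": frozenset({"x"})}
right = {"source1": frozenset({"2", "3"})}
ratio = similarity_ratio(left, right)  # 1 shared key out of 4 distinct keys
```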

get_keys ¶

get_keys(name: SourceResolutionName) -> set[str]

Get keys for a specific source.

Parameters:

name ¶
(SourceResolutionName) –

Name of the source

Returns:

set[str] –

Set of keys, empty if source not found

get_values ¶

get_values(
    sources: dict[SourceResolutionName, SourceTestkit],
) -> dict[SourceResolutionName, dict[str, list[str]]]

Get all unique values for this entity across sources.

Each source may have its own variations/transformations of the base data, so we maintain separation between sources.

Parameters:

sources ¶
(dict[SourceResolutionName, SourceTestkit]) –

Dictionary of source resolution name to source data

Returns:

dict[SourceResolutionName, dict[str, list[str]]] –

Dictionary mapping: source_name -> { feature_name -> [unique values for that feature in that source] }

SourceEntity ¶

Bases: BaseModel, EntityIDMixin, SourceKeyMixin

Represents a single entity across all sources.

Methods:

add_source_reference –

Add or update a source reference.
to_cluster_entity –

Convert this SourceEntity to a ClusterEntity with the specified sources.
get_keys –

Get keys for a specific source.
get_values –

Get all unique values for this entity across sources.

Attributes:

id (int) –
base_values (dict[str, Any]) –
keys (EntityReference) –
total_unique_variations (int) –

id `class-attribute` `instance-attribute` ¶

id: int = Field(default_factory=lambda: getrandbits(63))

base_values `class-attribute` `instance-attribute` ¶

base_values: dict[str, Any] = Field(
    description="Feature name -> base value"
)

keys `class-attribute` `instance-attribute` ¶

keys: EntityReference = Field(
    description="Source to keys mapping",
    default=EntityReference(mapping=frozenset()),
)

total_unique_variations `class-attribute` `instance-attribute` ¶

total_unique_variations: int = Field(default=0)

add_source_reference ¶

add_source_reference(
    name: SourceResolutionName, keys: list[str]
) -> None

Add or update a source reference.

Parameters:

name ¶
(SourceResolutionName) –

Source name
keys ¶
(list[str]) –

List of primary keys for this source

to_cluster_entity ¶

to_cluster_entity(
    *names: SourceResolutionName,
) -> ClusterEntity | None

Convert this SourceEntity to a ClusterEntity with the specified sources.

This method simplifies diffing. Testing whether ClusterEntity objects are subsets of SourceEntity objects is a weaker, more fragile test than comparing two sets of ClusterEntity objects for equality. Converting each SourceEntity first lets the test be written with plain set operations:

actual: set[ClusterEntity] = ...
expected: set[ClusterEntity] = {
    s.to_cluster_entity("source1", "source2")
    for s in source_entities
}

is_identical = expected == actual
missing = expected - actual
extra = actual - expected

Parameters:

*names ¶
(SourceResolutionName, default: () ) –

Names of sources to include in the ClusterEntity

Returns:

ClusterEntity | None –

ClusterEntity containing only the specified sources’ keys, or None if none of the specified sources are present in this entity.
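
The filtering behaviour described above can be sketched on a plain dict of keys. `restrict_to_sources` is a hypothetical helper, not part of the library, and the source names are invented.

```python
from typing import Optional


def restrict_to_sources(
    keys: dict[str, frozenset[str]], *names: str
) -> Optional[dict[str, frozenset[str]]]:
    """Keep only the requested sources; None if none of them are present."""
    kept = {name: keys[name] for name in names if name in keys}
    return kept or None


# Hypothetical entity keys, invented for illustration
entity_keys = {
    "source1": frozenset({"a-1"}),
    "source2": frozenset({"b-1", "b-2"}),
    "source3": frozenset({"c-1"}),
}
two_sources = restrict_to_sources(entity_keys, "source1", "source2")
no_sources = restrict_to_sources(entity_keys, "source9")
```

Returning None rather than an empty entity lets callers distinguish "not present in these sources" from a real cluster. A set built from such results may therefore contain None and may need filtering before comparison.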

get_keys ¶

get_keys(name: SourceResolutionName) -> set[str]

Get keys for a specific source.

Parameters:

name ¶
(SourceResolutionName) –

Name of the source

Returns:

set[str] –

Set of keys, empty if source not found

get_values ¶

get_values(
    sources: dict[SourceResolutionName, SourceTestkit],
) -> dict[SourceResolutionName, dict[str, list[str]]]

Get all unique values for this entity across sources.

Each source may have its own variations/transformations of the base data, so we maintain separation between sources.

Parameters:

sources ¶
(dict[SourceResolutionName, SourceTestkit]) –

Dictionary of source resolution name to source data

Returns:

dict[SourceResolutionName, dict[str, list[str]]] –

Dictionary mapping: source_name -> { feature_name -> [unique values for that feature in that source] }

infer_data_type ¶

infer_data_type(
    base: str, parameters: tuple | None
) -> DataTypes

Infer an appropriate Matchbox type from a Faker configuration.

Parameters:

base ¶
(str) –

Faker generator type
parameters ¶
(tuple | None) –

Parameters for the generator

Returns:

DataTypes –

A Matchbox DataType

query_to_cluster_entities ¶

query_to_cluster_entities(
    query: Table | DataFrame | DataFrame,
    keys: dict[SourceResolutionName, str],
) -> set[ClusterEntity]

Convert a query result to a set of ClusterEntities.

Useful for turning a real query from a real model resolution in Matchbox into a set of ClusterEntities that can be used in LinkedSourcesTestkit.diff_results().

Parameters:

query ¶
(Table | DataFrame | DataFrame) –

A PyArrow table or DataFrame representing a query result
keys ¶
(dict[SourceResolutionName, str]) –

Mapping of source resolution names to key field names

Returns:

set[ClusterEntity] –

A set of ClusterEntity objects
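
The conversion can be pictured as grouping rows by cluster and collecting each source's keys. The sketch below assumes the query result carries a cluster identifier in a column named `id` and that missing keys are null; both are assumptions, as are the field names. Rows are plain dicts rather than a PyArrow table or DataFrame.

```python
from collections import defaultdict


def rows_to_clusters(
    rows: list[dict], keys: dict[str, str]
) -> dict[int, dict[str, frozenset[str]]]:
    """Group rows by cluster id, collecting non-null keys per source."""
    clusters: dict = defaultdict(lambda: defaultdict(set))
    for row in rows:
        for source, field in keys.items():
            value = row.get(field)
            if value is not None:  # a row may belong to only one source
                clusters[row["id"]][source].add(str(value))
    return {
        cluster_id: {source: frozenset(vals) for source, vals in by_source.items()}
        for cluster_id, by_source in clusters.items()
    }


# Hypothetical query result and key-field mapping
rows = [
    {"id": 1, "source1_key": "a-1", "source2_key": None},
    {"id": 1, "source1_key": None, "source2_key": "b-7"},
    {"id": 2, "source1_key": "a-2", "source2_key": None},
]
key_fields = {"source1": "source1_key", "source2": "source2_key"}
clusters = rows_to_clusters(rows, key_fields)
```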

generate_entities `cached` ¶

generate_entities(
    generator: Faker,
    features: tuple[FeatureConfig, ...],
    n: int,
) -> tuple[SourceEntity]

Generate base entities with their ground truth values from generator.

probabilities_to_results_entities ¶

probabilities_to_results_entities(
    probabilities: Table,
    left_clusters: tuple[ClusterEntity, ...],
    right_clusters: tuple[ClusterEntity, ...] | None = None,
    threshold: float | int = 0,
) -> tuple[ClusterEntity, ...]

Convert probabilities to ClusterEntity objects based on a threshold.
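
This reference does not describe the algorithm. One common way to turn thresholded pairwise probabilities into clusters is to treat each retained pair as an edge and take connected components. The union-find sketch below illustrates that general technique under stated assumptions (pairs as tuples, `>=` as the comparison) and is not a description of the library's implementation.

```python
def cluster_pairs(
    pairs: list[tuple[int, int, float]], threshold: float
) -> set[frozenset[int]]:
    """Connected components over pairs whose probability meets the threshold."""
    parent: dict[int, int] = {}

    def find(x: int) -> int:
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for left, right, probability in pairs:
        # Register both ids so unlinked ones survive as singletons
        root_left, root_right = find(left), find(right)
        if probability >= threshold:  # assumption: inclusive comparison
            parent[root_left] = root_right

    groups: dict[int, set[int]] = {}
    for node in list(parent):
        groups.setdefault(find(node), set()).add(node)
    return {frozenset(members) for members in groups.values()}


# Hypothetical probabilities: (left id, right id, probability)
pairs = [(1, 2, 0.9), (2, 3, 0.8), (4, 5, 0.3)]
components = cluster_pairs(pairs, threshold=0.5)
```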

diff_results ¶

diff_results(
    expected: list[ClusterEntity],
    actual: list[ClusterEntity],
) -> tuple[bool, dict]

Compare two lists of ClusterEntity with detailed diff information.

Parameters:

expected ¶
(list[ClusterEntity]) –

Expected ClusterEntity list
actual ¶
(list[ClusterEntity]) –

Actual ClusterEntity list

Returns:

tuple[bool, dict] –

A tuple containing:
- Boolean: True if lists are identical, False otherwise
- Dictionary that counts the number of actual entities that fall into the following criteria:
  - ‘perfect’: Match an expected entity exactly
  - ‘subset’: Are a subset of an expected entity
  - ‘superset’: Are a superset of an expected entity
  - ‘wrong’: Don’t match any expected entity
  - ‘invalid’: Contain keys not present in any expected entity
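
The five categories can be illustrated with a small classifier. In this sketch each entity is a frozenset of (source, key) pairs rather than a ClusterEntity, and the order in which the categories are tested is an assumption; the library may resolve overlapping cases differently or return a different dictionary when the lists are identical.

```python
def classify(
    expected: list[frozenset], actual: list[frozenset]
) -> tuple[bool, dict[str, int]]:
    """Count actual entities per category; precedence order is assumed."""
    expected_set, actual_set = set(expected), set(actual)
    known_keys = frozenset().union(*expected_set)
    counts = {"perfect": 0, "subset": 0, "superset": 0, "wrong": 0, "invalid": 0}
    for entity in actual_set:
        if entity in expected_set:
            counts["perfect"] += 1
        elif not entity <= known_keys:
            counts["invalid"] += 1  # holds a key no expected entity has
        elif any(entity < e for e in expected_set):
            counts["subset"] += 1
        elif any(entity > e for e in expected_set):
            counts["superset"] += 1
        else:
            counts["wrong"] += 1
    return expected_set == actual_set, counts


# Hypothetical entities, invented for illustration
expected = [
    frozenset({("a", "1"), ("b", "1")}),
    frozenset({("a", "2")}),
    frozenset({("a", "3"), ("b", "3")}),
]
actual = [
    frozenset({("a", "1"), ("b", "1")}),  # perfect
    frozenset({("a", "3")}),              # subset of the third expected entity
    frozenset({("a", "2"), ("b", "3")}),  # superset of the second
    frozenset({("a", "1"), ("a", "3")}),  # wrong: spans two expected entities
    frozenset({("a", "9")}),              # invalid: unknown key
]
identical, counts = classify(expected, actual)
```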
