Sources

matchbox.common.factories.sources

Factories for generating sources and linked source testkits for testing.

SourceTestkitParameters

Bases: BaseModel

Configuration for generating a source.

Attributes:

features class-attribute instance-attribute

features: tuple[FeatureConfig, ...] = Field(
    default_factory=tuple
)

full_name instance-attribute

full_name: str

engine class-attribute instance-attribute

engine: Engine = Field(
    default=create_engine("sqlite:///:memory:")
)

n_true_entities class-attribute instance-attribute

n_true_entities: int | None = Field(default=None)

repetition class-attribute instance-attribute

repetition: int = Field(default=0)

SourceTestkit

Bases: BaseModel

A testkit of data and metadata for a SourceConfig.

Methods:

  • to_warehouse

    Write the data to the SourceConfig’s engine.

Attributes:

source_config class-attribute instance-attribute

source_config: SourceConfig = Field(
    description="The real generated SourceConfig object."
)

features class-attribute instance-attribute

features: tuple[FeatureConfig, ...] | None = Field(
    description="The features used to generate the data. If None, the source data was not generated, but set manually.",
    default=None,
)

data class-attribute instance-attribute

data: Table = Field(
    description="The PyArrow table of generated data."
)

data_hashes class-attribute instance-attribute

data_hashes: Table = Field(
    description="A PyArrow table of hashes for the data."
)

entities class-attribute instance-attribute

entities: tuple[ClusterEntity, ...] = Field(
    description="ClusterEntities that were generated from the source."
)

name property

name: str

Return the resolution name of the SourceConfig.

mock property

mock: Mock

Create a mock SourceConfig object with this testkit’s configuration.

query property

query: Table

Return a PyArrow table in the same format as matchbox.query().

query_backend property

query_backend: Table

Return a PyArrow table in the same format as the SCHEMA_MB_IDS DTO.

to_warehouse

to_warehouse(engine: Engine | None) -> None

Write the data to the SourceConfig’s engine.

As the SourceConfig won’t have an engine set by default, one can be supplied here.

LinkedSourcesTestkit

Bases: BaseModel

Container for multiple related SourceConfig testkits with entity tracking.

Methods:

  • find_entities

    Find entities matching appearance criteria.

  • true_entity_subset

    Return a subset of true entities that appear in the given sources.

  • diff_results

    Diff a results table of probabilities against the true SourceEntities.

Attributes:

true_entities class-attribute instance-attribute

true_entities: set[SourceEntity] = Field(
    default_factory=set
)

sources instance-attribute

sources: dict[str, SourceTestkit]

find_entities

find_entities(
    min_appearances: dict[str, int] | None = None,
    max_appearances: dict[str, int] | None = None,
) -> list[SourceEntity]

Find entities matching appearance criteria.
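The appearance criteria can be pictured as per-source bounds on how many keys an entity has in each source. The sketch below is an illustration only, not matchbox's implementation: it models entities as plain dictionaries of counts, and the helper name `find_by_appearances` and the inclusive-bound semantics are assumptions.

```python
def find_by_appearances(appearances, min_appearances=None, max_appearances=None):
    """Return entity IDs whose per-source counts fall within the given bounds.

    `appearances` maps a hypothetical entity ID to {source name: key count}.
    Bounds are treated as inclusive, which is an assumption of this sketch.
    """
    matches = []
    for entity_id, counts in appearances.items():
        above_min = all(
            counts.get(source, 0) >= minimum
            for source, minimum in (min_appearances or {}).items()
        )
        below_max = all(
            counts.get(source, 0) <= maximum
            for source, maximum in (max_appearances or {}).items()
        )
        if above_min and below_max:
            matches.append(entity_id)
    return matches


# Hypothetical appearance counts for three entities across two sources
appearances = {
    1: {"crn": 2, "duns": 1},
    2: {"crn": 1},
    3: {"crn": 3, "duns": 2},
}

at_least_two_crn = find_by_appearances(appearances, min_appearances={"crn": 2})
two_crn_one_duns = find_by_appearances(
    appearances, min_appearances={"crn": 2}, max_appearances={"duns": 1}
)
absent_from_duns = find_by_appearances(appearances, max_appearances={"duns": 0})
```

A source missing from an entity's counts is treated as zero appearances, so a maximum of 0 selects entities absent from that source.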

true_entity_subset

true_entity_subset(
    *sources: SourceResolutionName,
) -> list[ClusterEntity]

Return a subset of true entities that appear in the given sources.

diff_results

diff_results(
    probabilities: Table,
    sources: list[str],
    left_clusters: tuple[ClusterEntity, ...],
    right_clusters: tuple[ClusterEntity, ...] | None = None,
    threshold: int | float = 0,
) -> tuple[bool, dict]

Diff a results table of probabilities against the true SourceEntities.

Parameters:

  • probabilities
    (Table) –

    Probabilities table to diff

  • sources
    (list[str]) –

    Subset of the LinkedSourcesTestkit.sources that represents the true sources to compare against

  • left_clusters
    (tuple[ClusterEntity, ...]) –

    ClusterEntity objects from the object used as an input to the process that produced the probabilities table. Should be a SourceTestkit.entities or ModelTestkit.entities.

  • right_clusters
    (tuple[ClusterEntity, ...] | None, default: None ) –

    ClusterEntity objects from the object used as an input to the process that produced the probabilities table. Should be a SourceTestkit.entities or ModelTestkit.entities.

  • threshold
    (int | float, default: 0 ) –

    Threshold for considering a match true

Returns:

  • tuple[bool, dict]

    A tuple of whether the results are identical, and a report dictionary. See diff_results() for the report format.

make_features_hashable

make_features_hashable(
    func: Callable[P, R],
) -> Callable[P, R]

Decorator to allow configuring source_factory with dicts.

This retains the hashability of FeatureConfig while still making it simple to use the factory without special objects.
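The reason hashability matters is that cached functions need hashable arguments, while lists and dicts are not hashable. The sketch below shows the general pattern with the standard library only. It is not the matchbox decorator: `freeze`, `make_hashable_args` and `describe` are hypothetical names, and plain tuples stand in for FeatureConfig objects.

```python
from functools import cache, wraps
from typing import Any, Callable


def freeze(value: Any) -> Any:
    """Recursively convert dicts and lists into hashable tuples."""
    if isinstance(value, dict):
        return tuple(sorted((k, freeze(v)) for k, v in value.items()))
    if isinstance(value, (list, tuple)):
        return tuple(freeze(v) for v in value)
    return value


def make_hashable_args(func: Callable) -> Callable:
    """Freeze every argument before passing it to the wrapped function."""

    @wraps(func)
    def wrapper(*args, **kwargs):
        frozen_args = tuple(freeze(a) for a in args)
        frozen_kwargs = {k: freeze(v) for k, v in kwargs.items()}
        return func(*frozen_args, **frozen_kwargs)

    return wrapper


calls = []  # records each time the cached body actually runs


@make_hashable_args
@cache
def describe(features):
    calls.append(features)
    return len(features)


# Lists of dicts would raise TypeError if passed to a cached function directly
first = describe([{"name": "company_name"}, {"name": "crn"}])
second = describe([{"name": "company_name"}, {"name": "crn"}])
```

The second call is served from the cache, so the body runs once even though the caller passed unhashable lists of dicts both times.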

generate_rows

generate_rows(
    generator: Faker,
    selected_entities: tuple[SourceEntity, ...],
    features: tuple[FeatureConfig, ...],
) -> tuple[
    dict[str, list],
    dict[int, list[str]],
    dict[int, list[str]],
    dict[int, bytes],
]

Generate raw data rows with unique keys and shared IDs.

This function generates rows of data plus maps between three types of identifiers:

1. `id`: matchbox's unique identifier for each row, shared across rows with
    identical feature values
2. `key`: the source's unique identifier for the row. It's like a primary key
    in a database, but not guaranteed to be unique across different entities
3. `entity`: the identifier of the SourceEntity that generated the row.
    This identifies the true linked data in the factory system.

This function will therefore return:

* raw_data: A dictionary of column arrays for DataFrame creation
* entity_keys: A dictionary that maps which keys belong to each source entity
* id_keys: A dictionary that maps which keys share the same row content,
    with the same `id`
* id_hashes: A dictionary that maps `id`s to hash values for each unique
    row content

The key insight:

* entity_* groups by "who generated this row"
* id_* groups by "what content does this row have"

Example with two entities generating data:

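| id | key | company_name |
|----|-----|--------------|
| 1  | a   | alpha co     |
| 2  | b   | alpha ltd    |
| 1  | c   | alpha co     |
| 2  | d   | alpha ltd    |
| 3  | e   | beta co      |
| 4  | f   | beta ltd     |
| 3  | g   | beta co      |
| 4  | h   | beta ltd     |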

What does this table look like as raw data?

raw_data = {
    "id": [1, 2, 1, 2, 3, 4, 3, 4],
    "key": ["a", "b", "c", "d", "e", "f", "g", "h"],
    "company_name": [
        "alpha co",
        "alpha ltd",
        "alpha co",
        "alpha ltd",
        "beta co",
        "beta ltd",
        "beta co",
        "beta ltd",
    ],
}

Which keys came from each source entity?

entity_keys = {
    1: ["a", "b", "c", "d"],  # All keys entity 1 produced
    2: ["e", "f", "g", "h"],  # All keys entity 2 produced
}

Which keys have identical content?

id_keys = {
    1: ["a", "c"],  # Both have "alpha co" content
    2: ["b", "d"],  # Both have "alpha ltd" content
    3: ["e", "g"],  # Both have "beta co" content
    4: ["f", "h"],  # Both have "beta ltd" content
}
id_hashes = {
    1: b"hash1",  # Hash of "alpha co"
    2: b"hash2",  # Hash of "alpha ltd"
    3: b"hash3",  # Hash of "beta co"
    4: b"hash4",  # Hash of "beta ltd"
}
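To make the relationship between the three identifiers concrete, the following sketch rebuilds the four mappings from the example rows using only the standard library. It is not the body of generate_rows: the input `rows` tuple layout is hypothetical, and SHA-256 of the company name stands in for whatever row hash matchbox actually computes.

```python
import hashlib
from collections import defaultdict

# Hypothetical input: (entity, key, company_name) rows matching the example table
rows = [
    (1, "a", "alpha co"),
    (1, "b", "alpha ltd"),
    (1, "c", "alpha co"),
    (1, "d", "alpha ltd"),
    (2, "e", "beta co"),
    (2, "f", "beta ltd"),
    (2, "g", "beta co"),
    (2, "h", "beta ltd"),
]

raw_data = {"id": [], "key": [], "company_name": []}
entity_keys = defaultdict(list)  # entity -> keys it produced
id_keys = defaultdict(list)  # id -> keys sharing the same content
id_hashes = {}  # id -> hash of the row content
content_to_id = {}  # hash -> id, so identical content reuses an id

for entity, key, company_name in rows:
    # Assumption: SHA-256 of the content stands in for the real row hash
    digest = hashlib.sha256(company_name.encode("utf-8")).digest()
    # Assign the next sequential id the first time this content is seen
    row_id = content_to_id.setdefault(digest, len(content_to_id) + 1)

    raw_data["id"].append(row_id)
    raw_data["key"].append(key)
    raw_data["company_name"].append(company_name)

    entity_keys[entity].append(key)
    id_keys[row_id].append(key)
    id_hashes[row_id] = digest
```

Grouping by `entity` answers "who generated this row", while grouping by `row_id` answers "what content does this row have", which is why the two mappings differ for the same keys.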

generate_source cached

generate_source(
    generator: Faker,
    n_true_entities: int,
    features: tuple[FeatureConfig, ...],
    repetition: int,
    seed_entities: tuple[SourceEntity, ...] | None = None,
) -> tuple[
    Table, Table, dict[int, set[str]], dict[int, set[str]]
]

Generate raw data as PyArrow tables with entity tracking.

Returns:

  • Table
    • data: PyArrow table with generated data
  • Table
    • data_hashes: PyArrow table with hash groups
  • dict[int, set[str]]
    • entity_keys: SourceEntity ID -> set of keys mapping
  • dict[int, set[str]]
    • id_keys: Unique row ID -> set of keys mapping for identical rows

source_factory cached

source_factory(
    features: list[FeatureConfig]
    | list[dict]
    | None = None,
    full_name: str | None = None,
    engine: Engine | None = None,
    n_true_entities: int = 10,
    repetition: int = 0,
    seed: int = 42,
) -> SourceTestkit

Generate a complete source testkit from configured features.

source_from_tuple

source_from_tuple(
    data_tuple: tuple[dict[str, Any], ...],
    data_keys: tuple[Any],
    full_name: str | None = None,
    engine: Engine | None = None,
    seed: int = 42,
) -> SourceTestkit

Generate a complete source testkit from dummy data.

linked_sources_factory cached

linked_sources_factory(
    source_parameters: tuple[SourceTestkitParameters, ...]
    | None = None,
    n_true_entities: int | None = None,
    engine: Engine | None = None,
    seed: int = 42,
) -> LinkedSourcesTestkit

Generate a set of linked sources with tracked entities.

Parameters:

  • source_parameters

    (tuple[SourceTestkitParameters, ...] | None, default: None ) –

    Optional tuple of source testkit parameters

  • n_true_entities

    (int | None, default: None ) –

    Optional number of true entities to generate. If provided, overrides any n_true_entities in source configs. If not provided, each SourceTestkitParameters must specify its own n_true_entities.

  • engine

    (Engine | None, default: None ) –

    Optional SQLAlchemy engine to use for all sources. If provided, overrides any engine in source configs.

  • seed

    (int, default: 42 ) –

    Random seed for reproducibility
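The override rule for n_true_entities described above can be sketched as follows. This is an illustration under stated assumptions, not the factory's code: `Params` is a hypothetical stand-in for SourceTestkitParameters, `resolve_n_true_entities` is a hypothetical helper, and raising ValueError for a missing value is an assumption about how the requirement is enforced.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Params:
    """Hypothetical stand-in for SourceTestkitParameters."""

    full_name: str
    n_true_entities: Optional[int] = None


def resolve_n_true_entities(params, override=None):
    """Pick the entity count for each source, preferring the factory-level value."""
    resolved = {}
    for p in params:
        n = override if override is not None else p.n_true_entities
        if n is None:
            # Assumption: a missing value is an error when no override is given
            raise ValueError(f"{p.full_name} must specify n_true_entities")
        resolved[p.full_name] = n
    return resolved


params = (Params("crn", n_true_entities=5), Params("duns"))

# A factory-level value overrides every per-source setting
with_override = resolve_n_true_entities(params, override=10)

# Without it, each source needs its own value, so "duns" fails here
try:
    resolve_n_true_entities(params)
    missing_raised = False
except ValueError:
    missing_raised = True
```

The engine parameter is documented with the same precedence: a value passed to the factory replaces any engine set on the individual source parameters.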