Sources

matchbox.common.factories.sources

Factories for generating sources and linked source testkits for testing.

SourceTestkitParameters

Bases: BaseModel

Configuration for generating a source.

Attributes:

features class-attribute instance-attribute

features: tuple[FeatureConfig, ...] = Field(default_factory=tuple)

name instance-attribute

name: str

engine class-attribute instance-attribute

engine: Engine = Field(default=create_engine('sqlite:///:memory:'))

n_true_entities class-attribute instance-attribute

n_true_entities: int | None = Field(default=None)

repetition class-attribute instance-attribute

repetition: int = Field(default=0)

SourceTestkit

Bases: BaseModel

A testkit of data and metadata for a SourceConfig.

Methods:

  • cast_table

    Ensure that the data matches the query schema.

  • into_dag

    Turn source into kwargs for dag.source(), detaching from original DAG.

  • write_to_location

    Write the data to the SourceConfig’s location.

Attributes:

source class-attribute instance-attribute

source: Source = Field(description='The Source object containing config and convenience methods.')

features class-attribute instance-attribute

features: tuple[FeatureConfig, ...] | None = Field(description='The features used to generate the data. If None, the source data was not generated, but set manually.', default=None)

data class-attribute instance-attribute

data: Table = Field(description='The generated data, corresponding to the output of queries.')

data_hashes class-attribute instance-attribute

data_hashes: Table = Field(description='A PyArrow table of hashes for the data.')

entities class-attribute instance-attribute

entities: tuple[ClusterEntity, ...] = Field(description='ClusterEntities that were generated from the source.')

name property

name: str

Return the source name.

source_config property

source_config: SourceConfig

Return the SourceConfig from the source.

cast_table

cast_table(value: Table) -> Table

Ensure that the data matches the query schema.

into_dag

into_dag() -> dict

Turn source into kwargs for dag.source(), detaching from original DAG.

write_to_location

write_to_location(set_client: Any | None = None) -> Self

Write the data to the SourceConfig’s location.

Parameters:

  • set_client
    (Any | None, default: None ) –

    Client to replace the existing source client.

LinkedSourcesTestkit

Bases: BaseModel

Container for multiple related SourceConfig testkits with entity tracking.

Methods:

  • find_entities

    Find entities matching appearance criteria.

  • true_entity_subset

    Return a subset of true entities that appear in the given sources.

  • diff_results

    Diff a set of probability results against the true SourceEntities.

  • write_to_location

    Write the data to the SourceConfig’s location.

Attributes:

dag instance-attribute

dag: DAG

true_entities class-attribute instance-attribute

true_entities: set[SourceEntity] = Field(default_factory=set)

sources instance-attribute

find_entities

find_entities(min_appearances: dict[str, int] | None = None, max_appearances: dict[str, int] | None = None) -> list[SourceEntity]

Find entities matching appearance criteria.
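
The appearance criteria can be illustrated with a minimal, library-free sketch. It assumes that each dictionary maps a source name to a required count of keys in that source, which is one reading of the signature above rather than confirmed behaviour. The function `find_by_appearances`, the `entities` mapping, and the source names are all hypothetical.

```python
def find_by_appearances(entities, min_appearances=None, max_appearances=None):
    """Return entity names whose per-source key counts satisfy both bounds.

    `entities` is a hypothetical mapping: entity name -> {source name: set of keys}.
    """
    matches = []
    for name, keys_by_source in entities.items():
        counts = {source: len(keys) for source, keys in keys_by_source.items()}
        # A source the entity never appears in counts as zero appearances.
        if any(counts.get(s, 0) < n for s, n in (min_appearances or {}).items()):
            continue
        if any(counts.get(s, 0) > n for s, n in (max_appearances or {}).items()):
            continue
        matches.append(name)
    return matches


# Hypothetical entities and source names, for illustration only.
entities = {
    "e1": {"crn": {"a", "b"}, "cdms": {"x"}},
    "e2": {"crn": {"c"}},
    "e3": {"cdms": {"y", "z"}},
}

in_crn = find_by_appearances(entities, min_appearances={"crn": 1})
only_crn = find_by_appearances(
    entities, min_appearances={"crn": 1}, max_appearances={"cdms": 0}
)
```

Here `in_crn` selects entities with at least one key in `crn`, while `only_crn` additionally excludes anything that appears in `cdms`.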

true_entity_subset

true_entity_subset(*sources: SourceResolutionName) -> list[ClusterEntity]

Return a subset of true entities that appear in the given sources.

diff_results

Diff a set of probability results against the true SourceEntities.

Parameters:

  • probabilities
    (DataFrame) –

    Probabilities table to diff

  • sources
    (list[SourceResolutionName]) –

    Subset of the LinkedSourcesTestkit.sources that represents the true sources to compare against

  • left_clusters
    (tuple[ClusterEntity, ...]) –

    ClusterEntity objects from the object used as an input to the process that produced the probabilities table. Should be a SourceTestkit.entities or ModelTestkit.entities.

  • right_clusters
    (tuple[ClusterEntity, ...] | None, default: None ) –

    ClusterEntity objects from the object used as an input to the process that produced the probabilities table. Should be a SourceTestkit.entities or ModelTestkit.entities.

  • threshold
    (int | float, default: 0 ) –

    Threshold for considering a match true

Returns:

  • tuple[bool, dict]

    A tuple of whether the results are identical, and a report dictionary. See diff_results() for the report format.

write_to_location

write_to_location() -> Self

Write the data to the SourceConfig’s location.

make_features_hashable

make_features_hashable(func: Callable[P, R]) -> Callable[P, R]

Decorator to allow configuring source_factory with dicts.

This retains the hashability of FeatureConfig while still making it simple to use the factory without special objects.
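
The idea can be sketched without matchbox itself: a cached function needs hashable arguments, so a decorator converts a list of dicts into a tuple of frozen objects before the cache sees it. This is an illustrative sketch, not the library's implementation. `Feature`, `hashable_features`, and `build_source` are hypothetical stand-ins, and the field names on `Feature` are assumptions.

```python
from dataclasses import dataclass
from functools import cache, wraps
from typing import Callable


@dataclass(frozen=True)
class Feature:
    """Hypothetical stand-in for FeatureConfig: frozen, therefore hashable."""

    name: str
    base_generator: str


def hashable_features(func: Callable) -> Callable:
    """Convert a list of dicts or Feature objects into a tuple of Features."""

    @wraps(func)
    def wrapper(features=None, **kwargs):
        if features is not None:
            features = tuple(
                Feature(**f) if isinstance(f, dict) else f for f in features
            )
        return func(features, **kwargs)

    return wrapper


@hashable_features
@cache
def build_source(features, n_true_entities=10):
    """Toy cached factory: returns the column names it would generate."""
    return [f.name for f in features or ()]


# Plain dicts work even though lists and dicts are not hashable themselves.
columns = build_source(
    [{"name": "company_name", "base_generator": "company"}], n_true_entities=5
)
```

The decorator sits outside `cache`, so the conversion happens before the cache key is computed.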

generate_rows

generate_rows(generator: Faker, selected_entities: tuple[SourceEntity, ...], features: tuple[FeatureConfig, ...], repetition: int) -> tuple[dict[str, list], dict[int, list[str]], dict[int, list[str]], dict[int, bytes]]

Generate raw data rows with unique keys and shared IDs.

This function generates rows of data plus maps between three types of identifiers:

1. `id`: matchbox's unique identifier for each row, shared across rows with
    identical feature values
2. `key`: the source's unique identifier for the row. It resembles a primary key
    in a database, but is not guaranteed to be unique across different entities
3. `entity`: the identifier of the SourceEntity that generated the row.
    This identifies the true linked data in the factory system.

This function will therefore return:

* raw_data: A dictionary of column arrays for DataFrame creation
* entity_keys: A dictionary that maps which keys belong to each source entity
* id_keys: A dictionary that maps which keys share the same row content,
    with the same `id`
* id_hashes: A dictionary that maps `id`s to hash values for each unique
    row content

The key insight:

* entity_* groups by "who generated this row"
* id_* groups by "what content does this row have"

Example with two entities generating data:

id  key  company_name
1   a    alpha co
2   b    alpha ltd
1   c    alpha co
2   d    alpha ltd
3   e    beta co
4   f    beta ltd
3   g    beta co
4   h    beta ltd

What does this table look like as raw data?

raw_data = {
    "id": [1, 2, 1, 2, 3, 4, 3, 4],
    "key": ["a", "b", "c", "d", "e", "f", "g", "h"],
    "company_name": [
        "alpha co",
        "alpha ltd",
        "alpha co",
        "alpha ltd",
        "beta co",
        "beta ltd",
        "beta co",
        "beta ltd",
    ],
}

Which keys came from each source entity?

entity_keys = {
    1: ["a", "b", "c", "d"],  # All keys entity 1 produced
    2: ["e", "f", "g", "h"],  # All keys entity 2 produced
}

Which keys have identical content?

id_keys = {
    1: ["a", "c"],  # Both have "alpha co" content
    2: ["b", "d"],  # Both have "alpha ltd" content
    3: ["e", "g"],  # Both have "beta co" content
    4: ["f", "h"],  # Both have "beta ltd" content
}
id_hashes = {
    1: b"hash1",  # Hash of "alpha co"
    2: b"hash2",  # Hash of "alpha ltd"
    3: b"hash3",  # Hash of "beta co"
    4: b"hash4",  # Hash of "beta ltd"
}
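
The four structures above can be rebuilt from a flat list of rows in a few lines. The sketch below is not the body of `generate_rows`. It assumes ids are assigned in order of first appearance of each distinct content, and it uses SHA-256 as a placeholder hash, since the actual hashing scheme is not stated here.

```python
import hashlib
from collections import defaultdict

# Rows as (entity, key, company_name), matching the example table.
rows = [
    (1, "a", "alpha co"),
    (1, "b", "alpha ltd"),
    (1, "c", "alpha co"),
    (1, "d", "alpha ltd"),
    (2, "e", "beta co"),
    (2, "f", "beta ltd"),
    (2, "g", "beta co"),
    (2, "h", "beta ltd"),
]

content_to_id = {}  # distinct row content -> id
raw_data = {"id": [], "key": [], "company_name": []}
entity_keys = defaultdict(list)  # "who generated this row"
id_keys = defaultdict(list)  # "what content does this row have"
id_hashes = {}

for entity, key, company_name in rows:
    # Reuse the id for content seen before, otherwise assign the next one.
    row_id = content_to_id.setdefault(company_name, len(content_to_id) + 1)
    raw_data["id"].append(row_id)
    raw_data["key"].append(key)
    raw_data["company_name"].append(company_name)
    entity_keys[entity].append(key)
    id_keys[row_id].append(key)
    # Placeholder hash of the row content (assumption: SHA-256).
    id_hashes[row_id] = hashlib.sha256(company_name.encode()).digest()
```

Every key appears exactly once in `entity_keys` and exactly once in `id_keys`, so the two dictionaries are different partitions of the same set of keys.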

generate_source cached

generate_source(generator: Faker, n_true_entities: int, features: tuple[FeatureConfig, ...], repetition: int, seed_entities: tuple[SourceEntity, ...] | None = None) -> tuple[Table, Table, dict[int, set[str]], dict[int, set[str]]]

Generate raw data as PyArrow tables with entity tracking.

Returns:

  • Table
    • data: PyArrow table with generated data
  • Table
    • data_hashes: PyArrow table with hash groups
  • dict[int, set[str]]
    • entity_keys: SourceEntity ID -> set of keys mapping
  • dict[int, set[str]]
    • id_keys: Unique row ID -> set of keys mapping for identical rows

source_factory cached

source_factory(features: list[FeatureConfig] | list[dict] | None = None, name: SourceResolutionName | None = None, location_name: str = 'dbname', dag: DAG | None = None, engine: Engine | None = None, n_true_entities: int = 10, repetition: int = 0, seed: int = 42) -> SourceTestkit

Generate a complete source testkit from configured features.

SourceConfigs created with the factory system can only use a RelationalDBLocation, and the data at that location will be stored in a single table.

Parameters:

  • features

    (list[FeatureConfig] | list[dict] | None, default: None ) –

    List of FeatureConfig objects or dictionaries to use for generating the source data. If None, defaults to a set of common features.

  • name

    (SourceResolutionName | None, default: None ) –

    Name of the source. If None, a unique name is generated. This will be used as the name of the table in the RelationalDBLocation, but also in the SourceResolutionName for the source.

  • location_name

    (str, default: 'dbname' ) –

    Name of the location for the source.

  • dag

    (DAG | None, default: None ) –

    DAG containing the source.

  • engine

    (Engine | None, default: None ) –

    SQLAlchemy engine to use for the source’s RelationalDBLocation. If None, an in-memory SQLite engine is created.

  • n_true_entities

    (int, default: 10 ) –

    Number of true entities to generate. Defaults to 10.

  • repetition

    (int, default: 0 ) –

    Number of times to repeat the generated data. Defaults to 0.

  • seed

    (int, default: 42 ) –

    Random seed for reproducibility. Defaults to 42.

source_from_tuple

source_from_tuple(data_tuple: tuple[dict[str, Any], ...], data_keys: tuple[Any], name: str | None = None, location_name: str = 'dbname', dag: DAG | None = None, engine: Engine | None = None, seed: int = 42) -> SourceTestkit

Generate a complete source testkit from dummy data.

linked_sources_factory cached

linked_sources_factory(source_parameters: tuple[SourceTestkitParameters, ...] | None = None, n_true_entities: int | None = None, engine: Engine | None = None, dag: DAG | None = None, seed: int = 42) -> LinkedSourcesTestkit

Generate a set of linked sources with tracked entities.

Parameters:

  • source_parameters

    (tuple[SourceTestkitParameters, ...] | None, default: None ) –

    Optional tuple of source testkit parameters

  • n_true_entities

    (int | None, default: None ) –

    Optional number of true entities to generate. If provided, overrides any n_true_entities in source configs. If not provided, each SourceTestkitParameters must specify its own n_true_entities.

  • engine

    (Engine | None, default: None ) –

    Optional SQLAlchemy engine to use for all sources. If provided, overrides any engine in source configs.

  • dag

    (DAG | None, default: None ) –

    DAG containing sources

  • seed

    (int, default: 42 ) –

    Random seed for reproducibility
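
The precedence rule for n_true_entities can be sketched as a small resolver. This is one reading of the parameter description, not the factory's code. `Params` is a hypothetical stand-in for SourceTestkitParameters, and raising `ValueError` when a count is missing is an assumption.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Params:
    """Hypothetical stand-in for SourceTestkitParameters."""

    name: str
    n_true_entities: Optional[int] = None


def resolve_entity_counts(source_parameters, n_true_entities=None):
    """Return the entity count to use for each source, keyed by source name."""
    if n_true_entities is not None:
        # The factory-level value overrides whatever each source specifies.
        return {p.name: n_true_entities for p in source_parameters}
    missing = [p.name for p in source_parameters if p.n_true_entities is None]
    if missing:
        # Assumption: a missing per-source count is treated as an error.
        raise ValueError(f"n_true_entities not set for: {missing}")
    return {p.name: p.n_true_entities for p in source_parameters}


# Hypothetical source names, for illustration only.
params = (Params("crn", n_true_entities=10), Params("cdms", n_true_entities=5))

per_source = resolve_entity_counts(params)
overridden = resolve_entity_counts(params, n_true_entities=3)
```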