Sources

matchbox.common.factories.sources

Factories for generating sources and linked source testkits for testing.

SourceConfig

Bases: BaseModel

Configuration for generating a source.

Attributes:

features class-attribute instance-attribute

features: tuple[FeatureConfig, ...] = Field(
    default_factory=tuple
)

full_name instance-attribute

full_name: str

engine class-attribute instance-attribute

engine: Engine = Field(
    default=create_engine("sqlite:///:memory:")
)

n_true_entities class-attribute instance-attribute

n_true_entities: int | None = Field(default=None)

repetition class-attribute instance-attribute

repetition: int = Field(default=0)

SourceTestkit

Bases: BaseModel

A testkit of data and metadata for a Source.

Methods:

  • to_warehouse

    Write the data to the Source’s engine.

Attributes:

source class-attribute instance-attribute

source: Source = Field(
    description="The real generated Source object."
)

features class-attribute instance-attribute

features: tuple[FeatureConfig, ...] = Field(
    description="The features used to generate the data."
)

data class-attribute instance-attribute

data: Table = Field(
    description="The PyArrow table of generated data."
)

data_hashes class-attribute instance-attribute

data_hashes: Table = Field(
    description="A PyArrow table of hashes for the data."
)

entities class-attribute instance-attribute

entities: tuple[ClusterEntity, ...] = Field(
    description="ClusterEntities that were generated from the source."
)

name property

name: str

Return the resolution name of the Source.

mock property

mock: Mock

Create a mock Source object with this testkit’s configuration.

query property

query: Table

Return a PyArrow table in the same format as matchbox.query().

query_backend property

query_backend: Table

Return a PyArrow table in the same format as the SCHEMA_MB_IDS DTO.

to_warehouse

to_warehouse(engine: Engine | None) -> None

Write the data to the Source’s engine.

The Source won’t have an engine set by default, so one can be supplied here.

LinkedSourcesTestkit

Bases: BaseModel

Container for multiple related Source testkits with entity tracking.

Methods:

  • find_entities

    Find entities matching appearance criteria.

  • true_entity_subset

    Return a subset of true entities that appear in the given sources.

  • diff_results

    Diff a table of result probabilities against the true SourceEntities.

Attributes:

true_entities class-attribute instance-attribute

true_entities: set[SourceEntity] = Field(
    default_factory=set
)

sources instance-attribute

sources: dict[str, SourceTestkit]

find_entities

find_entities(
    min_appearances: dict[str, int] | None = None,
    max_appearances: dict[str, int] | None = None,
) -> list[SourceEntity]

Find entities matching appearance criteria.
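The filtering idea can be illustrated with a small standalone sketch. It is not the library's implementation: it assumes that each key in min_appearances and max_appearances is a source name and each value is a bound on how many times an entity appears in that source. The function find_by_appearances, the plain-dict entity layout, and the source names "crn" and "duns" are all hypothetical.

```python
def find_by_appearances(entities, min_appearances=None, max_appearances=None):
    """Return IDs of entities whose per-source appearance counts fit the bounds.

    entities: hypothetical layout, entity ID -> {source name: list of PKs}.
    """
    min_appearances = min_appearances or {}
    max_appearances = max_appearances or {}
    matches = []
    for entity_id, pks_by_source in entities.items():
        counts = {name: len(pks) for name, pks in pks_by_source.items()}
        # A source the entity never appears in counts as zero appearances.
        ok_min = all(counts.get(s, 0) >= n for s, n in min_appearances.items())
        ok_max = all(counts.get(s, 0) <= n for s, n in max_appearances.items())
        if ok_min and ok_max:
            matches.append(entity_id)
    return matches


# Hypothetical entities and source names, for illustration only.
entities = {
    1: {"crn": ["a", "b"], "duns": ["x"]},
    2: {"crn": ["c"]},
    3: {"duns": ["y", "z"]},
}

# Entities present in both sources.
in_both = find_by_appearances(entities, min_appearances={"crn": 1, "duns": 1})

# Entities present in "crn" but absent from "duns".
crn_only = find_by_appearances(
    entities, min_appearances={"crn": 1}, max_appearances={"duns": 0}
)
```

Under these assumptions, a max bound of 0 is a way to ask for entities that are missing from a source.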

true_entity_subset

true_entity_subset(*sources: str) -> list[ClusterEntity]

Return a subset of true entities that appear in the given sources.

diff_results

diff_results(
    probabilities: Table,
    sources: list[str],
    left_clusters: tuple[ClusterEntity, ...],
    right_clusters: tuple[ClusterEntity, ...] | None = None,
    threshold: int | float = 0,
) -> tuple[bool, dict]

Diff a table of result probabilities against the true SourceEntities.

Parameters:

  • probabilities
    (Table) –

    Probabilities table to diff

  • sources
    (list[str]) –

    Subset of the LinkedSourcesTestkit.sources that represents the true sources to compare against

  • left_clusters
    (tuple[ClusterEntity, ...]) –

    ClusterEntity objects from the object used as an input to the process that produced the probabilities table. Should be a SourceTestkit.entities or ModelTestkit.entities.

  • right_clusters
    (tuple[ClusterEntity, ...] | None, default: None ) –

    ClusterEntity objects from the object used as an input to the process that produced the probabilities table. Should be a SourceTestkit.entities or ModelTestkit.entities.

  • threshold
    (int | float, default: 0 ) –

    Threshold for considering a match true

Returns:

  • tuple[bool, dict]

    A tuple of whether the results are identical, and a report dictionary. See diff_results() for the report format.

make_features_hashable

make_features_hashable(
    func: Callable[P, R],
) -> Callable[P, R]

Decorator to allow configuring source_factory with dicts.

This retains the hashability of FeatureConfig while still making it simple to use the factory without special objects.

generate_rows

generate_rows(
    generator: Faker,
    selected_entities: tuple[SourceEntity, ...],
    features: tuple[FeatureConfig, ...],
) -> tuple[
    dict[str, list],
    dict[int, list[str]],
    dict[int, list[str]],
]

Generate raw data rows. Adds an ID shared by identical rows, and a PK unique to every row.

Returns a tuple of:

  • raw_data: Dictionary of column arrays for DataFrame creation
  • entity_pks: Maps SourceEntity.id to the list of PKs where that entity appears
  • id_pks: Maps each ID to the list of PKs where that row appears

For example, if this is the raw data:

id  pk  company_name
1   1   alpha co
2   2   alpha ltd
1   3   alpha co
2   4   alpha ltd
3   5   beta co
4   6   beta ltd
3   7   beta co
4   8   beta ltd

Entity PKs would be this, because there are two true SourceEntities:

{
    1: [1, 2, 3, 4],
    2: [5, 6, 7, 8],
}

And ID PKs would be this, because there are four unique rows:

{
    1: [1, 3],
    2: [2, 4],
    3: [5, 7],
    4: [6, 8],
}
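The two mappings in this example can be reproduced with a short standalone sketch. It only illustrates the grouping logic and does not call generate_rows. The id_to_entity lookup is an assumption added for illustration, since the raw data itself does not record which true entity a row belongs to.

```python
from collections import defaultdict

# Rows from the example above: (id, pk, company_name).
rows = [
    (1, 1, "alpha co"),
    (2, 2, "alpha ltd"),
    (1, 3, "alpha co"),
    (2, 4, "alpha ltd"),
    (3, 5, "beta co"),
    (4, 6, "beta ltd"),
    (3, 7, "beta co"),
    (4, 8, "beta ltd"),
]

# Assumption for illustration: which true entity each row ID belongs to.
id_to_entity = {1: 1, 2: 1, 3: 2, 4: 2}

entity_pks = defaultdict(list)
id_pks = defaultdict(list)
for row_id, pk, _name in rows:
    id_pks[row_id].append(pk)  # identical rows share an ID
    entity_pks[id_to_entity[row_id]].append(pk)  # all variants of one entity

entity_pks = dict(entity_pks)
id_pks = dict(id_pks)
```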

generate_source cached

generate_source(
    generator: Faker,
    n_true_entities: int,
    features: tuple[FeatureConfig, ...],
    repetition: int,
    seed_entities: tuple[SourceEntity, ...] | None = None,
) -> tuple[
    Table, Table, dict[int, set[str]], dict[int, set[str]]
]

Generate raw data as PyArrow tables with entity tracking.

Returns:

  • Table
    • data: PyArrow table with generated data
  • Table
    • data_hashes: PyArrow table with hash groups
  • dict[int, set[str]]
    • entity_pks: SourceEntity ID -> set of PKs mapping
  • dict[int, set[str]]
    • row_pks: Results row ID -> set of PKs mapping for identical rows

source_factory cached

source_factory(
    features: list[FeatureConfig]
    | list[dict]
    | None = None,
    full_name: str | None = None,
    engine: Engine | None = None,
    n_true_entities: int = 10,
    repetition: int = 0,
    seed: int = 42,
) -> SourceTestkit

Generate a complete source testkit.

linked_sources_factory cached

linked_sources_factory(
    source_configs: tuple[SourceConfig, ...] | None = None,
    n_true_entities: int | None = None,
    engine: Engine | None = None,
    seed: int = 42,
) -> LinkedSourcesTestkit

Generate a set of linked sources with tracked entities.

Parameters:

  • source_configs

    (tuple[SourceConfig, ...] | None, default: None ) –

    Optional tuple of source configurations

  • n_true_entities

    (int | None, default: None ) –

    Optional number of true entities to generate. If provided, overrides any n_true_entities in source configs. If not provided, each SourceConfig must specify its own n_true_entities.

  • engine

    (Engine | None, default: None ) –

    Optional SQLAlchemy engine to use for all sources. If provided, overrides any engine in source configs.

  • seed

    (int, default: 42 ) –

    Random seed for reproducibility