Sources

matchbox.common.factories.sources ¶

Factories for generating sources and linked source testkits for testing.

Classes:

SourceTestkitParameters –

Configuration for generating a source.
SourceTestkit –

A testkit of data and metadata for a SourceConfig.
LinkedSourcesTestkit –

Container for multiple related SourceConfig testkits with entity tracking.

Functions:

make_features_hashable –

Decorator to allow configuring source_factory with dicts.
generate_rows –

Generate raw data rows with unique keys and shared IDs.
generate_source –

Generate raw data as PyArrow tables with entity tracking.
source_factory –

Generate a complete source testkit from configured features.
source_from_tuple –

Generate a complete source testkit from dummy data.
linked_sources_factory –

Generate a set of linked sources with tracked entities.

SourceTestkitParameters ¶

Bases: BaseModel

Configuration for generating a source.

Attributes:

features (tuple[FeatureConfig, ...]) –
name (str) –
engine (Engine) –
n_true_entities (int | None) –
repetition (int) –

features `class-attribute` `instance-attribute` ¶

features: tuple[FeatureConfig, ...] = Field(
    default_factory=tuple
)

name `instance-attribute` ¶

name: str

engine `class-attribute` `instance-attribute` ¶

engine: Engine = Field(
    default=create_engine("sqlite:///:memory:")
)

n_true_entities `class-attribute` `instance-attribute` ¶

n_true_entities: int | None = Field(default=None)

repetition `class-attribute` `instance-attribute` ¶

repetition: int = Field(default=0)

SourceTestkit ¶

Bases: BaseModel

A testkit of data and metadata for a SourceConfig.

Methods:

write_to_location –

Write the data to the SourceConfig’s location.

Attributes:

source_config (SourceConfig) –
features (tuple[FeatureConfig, ...] | None) –
data (Table) –
data_hashes (Table) –
entities (tuple[ClusterEntity, ...]) –
name (str) –

Return the resolution name of the SourceConfig.
mock (Mock) –

Create a mock SourceConfig object with this testkit’s configuration.
query (Table) –

Return a PyArrow table in the same format as matchbox.query().
query_backend (Table) –

Return a PyArrow table in the same format as the SCHEMA_QUERY DTO.

source_config `class-attribute` `instance-attribute` ¶

source_config: SourceConfig = Field(
    description="The real generated SourceConfig object."
)

features `class-attribute` `instance-attribute` ¶

features: tuple[FeatureConfig, ...] | None = Field(
    description="The features used to generate the data. If None, the source data was not generated, but set manually.",
    default=None,
)

data `class-attribute` `instance-attribute` ¶

data: Table = Field(
    description="The PyArrow table of generated data."
)

data_hashes `class-attribute` `instance-attribute` ¶

data_hashes: Table = Field(
    description="A PyArrow table of hashes for the data."
)

entities `class-attribute` `instance-attribute` ¶

entities: tuple[ClusterEntity, ...] = Field(
    description="ClusterEntities that were generated from the source."
)

name `property` ¶

name: str

Return the resolution name of the SourceConfig.

mock `property` ¶

mock: Mock

Create a mock SourceConfig object with this testkit’s configuration.

query `property` ¶

query: Table

Return a PyArrow table in the same format as matchbox.query().

query_backend `property` ¶

query_backend: Table

Return a PyArrow table in the same format as the SCHEMA_QUERY DTO.

write_to_location ¶

write_to_location(
    client: Any, set_client: bool = False
) -> None

Write the data to the SourceConfig’s location.

The client isn’t set in testkits, so it must be provided here.

Parameters:

client ¶
(Any) –

Client to use for the location.
set_client ¶
(bool, default: False ) –

Whether to set the client on the SourceConfig. Offered here for convenience as it’s often the next step.

LinkedSourcesTestkit ¶

Bases: BaseModel

Container for multiple related SourceConfig testkits with entity tracking.

Methods:

find_entities –

Find entities matching appearance criteria.
true_entity_subset –

Return a subset of true entities that appear in the given sources.
diff_results –

Diff a table of probabilities against the true SourceEntities.
write_to_location –

Write the data to the SourceConfig’s location.

Attributes:

true_entities (set[SourceEntity]) –
sources (dict[str, SourceTestkit]) –

true_entities `class-attribute` `instance-attribute` ¶

true_entities: set[SourceEntity] = Field(
    default_factory=set
)

sources `instance-attribute` ¶

sources: dict[str, SourceTestkit]

find_entities ¶

find_entities(
    min_appearances: dict[str, int] | None = None,
    max_appearances: dict[str, int] | None = None,
) -> list[SourceEntity]

Find entities matching appearance criteria.
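
To make the appearance criteria concrete, here is a minimal sketch of how such a filter could work. It is not the library's implementation: the dictionary-based entity layout, the source names `source_a` and `source_b`, and the helper `filter_by_appearances` are hypothetical, and it assumes both bounds are inclusive counts of keys per source.

```python
def filter_by_appearances(entities, min_appearances=None, max_appearances=None):
    """Keep entities whose key counts per source satisfy both bounds (assumed inclusive)."""
    matches = []
    for entity in entities:
        # Count how many keys the entity has in each source; absent sources count as 0
        counts = {source: len(keys) for source, keys in entity["keys"].items()}
        if min_appearances and any(
            counts.get(source, 0) < n for source, n in min_appearances.items()
        ):
            continue
        if max_appearances and any(
            counts.get(source, 0) > n for source, n in max_appearances.items()
        ):
            continue
        matches.append(entity)
    return matches


# Hypothetical toy entities: each maps a source name to the keys it produced there
entities = [
    {"id": 1, "keys": {"source_a": ["a", "b"], "source_b": ["x"]}},
    {"id": 2, "keys": {"source_a": ["c"]}},
    {"id": 3, "keys": {"source_b": ["y", "z"]}},
]

in_both = filter_by_appearances(
    entities, min_appearances={"source_a": 1, "source_b": 1}
)
only_in_a = filter_by_appearances(
    entities, min_appearances={"source_a": 1}, max_appearances={"source_b": 0}
)
```

Under these assumptions, `in_both` contains only entity 1 and `only_in_a` contains only entity 2.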

true_entity_subset ¶

true_entity_subset(
    *sources: SourceResolutionName,
) -> list[ClusterEntity]

Return a subset of true entities that appear in the given sources.

diff_results ¶

diff_results(
    probabilities: Table,
    sources: list[SourceResolutionName],
    left_clusters: tuple[ClusterEntity, ...],
    right_clusters: tuple[ClusterEntity, ...] | None = None,
    threshold: int | float = 0,
) -> tuple[bool, dict]

Diff a table of probabilities against the true SourceEntities.

Parameters:

probabilities ¶
(Table) –

Probabilities table to diff
sources ¶
(list[SourceResolutionName]) –

Subset of the LinkedSourcesTestkit.sources that represents the true sources to compare against
left_clusters ¶
(tuple[ClusterEntity, ...]) –

ClusterEntity objects from the object used as an input to the process that produced the probabilities table. Should be a SourceTestkit.entities or ModelTestkit.entities.
right_clusters ¶
(tuple[ClusterEntity, ...] | None, default: None ) –

ClusterEntity objects from the object used as an input to the process that produced the probabilities table. Should be a SourceTestkit.entities or ModelTestkit.entities.
threshold ¶
(int | float, default: 0 ) –

Threshold for considering a match true

Returns:

tuple[bool, dict] –

A tuple of whether the results are identical, and a report dictionary. See diff_results() for the report format.

write_to_location ¶

write_to_location(
    client: Any, set_client: bool = False
) -> None

Write the data to the SourceConfig’s location.

The client isn’t set in testkits, so it must be provided here.

Parameters:

client ¶
(Any) –

Client to use for the location.
set_client ¶
(bool, default: False ) –

Whether to set the client on the SourceConfig. Offered here for convenience as it’s often the next step.

make_features_hashable ¶

make_features_hashable(
    func: Callable[P, R],
) -> Callable[P, R]

Decorator to allow configuring source_factory with dicts.

This retains the hashability of FeatureConfig while still making it simple to use the factory without special objects.

generate_rows ¶

generate_rows(
    generator: Faker,
    selected_entities: tuple[SourceEntity, ...],
    features: tuple[FeatureConfig, ...],
    repetition: int,
) -> tuple[
    dict[str, list],
    dict[int, list[str]],
    dict[int, list[str]],
    dict[int, bytes],
]

Generate raw data rows with unique keys and shared IDs.

This function generates rows of data plus maps between three types of identifiers:

1. `id`: matchbox's unique identifier for each row, shared across rows with
    identical feature values.
2. `key`: the source's unique identifier for the row. It's like a primary key
    in a database, but not guaranteed to be unique across different entities.
3. `entity`: the identifier of the SourceEntity that generated the row.
    This identifies the true linked data in the factory system.

This function will therefore return:

* raw_data: A dictionary of column arrays for DataFrame creation
* entity_keys: A dictionary that maps which keys belong to each source entity
* id_keys: A dictionary that maps which keys share the same row content,
    with the same `id`
* id_hashes: A dictionary that maps `id`s to hash values for each unique
    row content

The key insight:

* entity_* groups by "who generated this row"
* id_* groups by "what content does this row have"

Example with two entities generating data:

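| id | key | company_name |
|----|-----|--------------|
| 1 | a | alpha co |
| 2 | b | alpha ltd |
| 1 | c | alpha co |
| 2 | d | alpha ltd |
| 3 | e | beta co |
| 4 | f | beta ltd |
| 3 | g | beta co |
| 4 | h | beta ltd |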

What does this table look like as raw data?

raw_data = {
    "id": [1, 2, 1, 2, 3, 4, 3, 4],
    "key": ["a", "b", "c", "d", "e", "f", "g", "h"],
    "company_name": [
        "alpha co",
        "alpha ltd",
        "alpha co",
        "alpha ltd",
        "beta co",
        "beta ltd",
        "beta co",
        "beta ltd",
    ],
}

Which keys came from each source entity?

entity_keys = {
    1: ["a", "b", "c", "d"],  # All keys entity 1 produced
    2: ["e", "f", "g", "h"],  # All keys entity 2 produced
}

Which keys have identical content?

id_keys = {
    1: ["a", "c"],  # Both have "alpha co" content
    2: ["b", "d"],  # Both have "alpha ltd" content
    3: ["e", "g"],  # Both have "beta co" content
    4: ["f", "h"],  # Both have "beta ltd" content
}
id_hashes = {
    1: b"hash1",  # Hash of "alpha co"
    2: b"hash2",  # Hash of "alpha ltd"
    3: b"hash3",  # Hash of "beta co"
    4: b"hash4",  # Hash of "beta ltd"
}
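
The three mappings above can be derived from the rows themselves. The following is a minimal sketch rather than the library's implementation: it assumes SHA-256 over the single feature value as the row hash and assigns `id`s in first-seen order, both chosen purely for illustration.

```python
import hashlib
from collections import defaultdict

# (entity, key, company_name) rows mirroring the worked example above
rows = [
    (1, "a", "alpha co"),
    (1, "b", "alpha ltd"),
    (1, "c", "alpha co"),
    (1, "d", "alpha ltd"),
    (2, "e", "beta co"),
    (2, "f", "beta ltd"),
    (2, "g", "beta co"),
    (2, "h", "beta ltd"),
]

entity_keys = defaultdict(list)  # who generated this row
id_keys = defaultdict(list)      # what content does this row have
id_hashes = {}                   # id -> hash of the row content
content_to_id = {}               # helper: content hash -> id (assumed scheme)

for entity, key, company_name in rows:
    content_hash = hashlib.sha256(company_name.encode("utf-8")).digest()
    # Reuse the id if this content was seen before, otherwise allocate the next one
    row_id = content_to_id.setdefault(content_hash, len(content_to_id) + 1)
    entity_keys[entity].append(key)
    id_keys[row_id].append(key)
    id_hashes[row_id] = content_hash

raw_data = {
    "id": [content_to_id[hashlib.sha256(name.encode("utf-8")).digest()] for _, _, name in rows],
    "key": [key for _, key, _ in rows],
    "company_name": [name for _, _, name in rows],
}
```

Grouping by entity and grouping by content are independent passes over the same keys, which is why a key appears exactly once in each mapping.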

generate_source `cached` ¶

generate_source(
    generator: Faker,
    n_true_entities: int,
    features: tuple[FeatureConfig, ...],
    repetition: int,
    seed_entities: tuple[SourceEntity, ...] | None = None,
) -> tuple[
    Table, Table, dict[int, set[str]], dict[int, set[str]]
]

Generate raw data as PyArrow tables with entity tracking.

Returns:

Table –

data: PyArrow table with generated data
Table –

data_hashes: PyArrow table with hash groups
dict[int, set[str]] –

entity_keys: mapping of SourceEntity ID to the set of its keys
dict[int, set[str]] –

id_keys: mapping of unique row ID to the set of keys for identical rows

source_factory `cached` ¶

source_factory(
    features: list[FeatureConfig]
    | list[dict]
    | None = None,
    name: SourceResolutionName | None = None,
    location_name: str = "dbname",
    engine: Engine | None = None,
    n_true_entities: int = 10,
    repetition: int = 0,
    seed: int = 42,
) -> SourceTestkit

Generate a complete source testkit from configured features.

SourceConfigs created with the factory system can only use a RelationalDBLocation, and the data at that location will be stored in a single table.

Parameters:

features ¶
(list[FeatureConfig] | list[dict] | None, default: None ) –

List of FeatureConfig objects or dictionaries to use for generating the source data. If None, defaults to a set of common features.
name ¶
(SourceResolutionName | None, default: None ) –

Name of the source. If None, a unique name is generated. This will be used as the name of the table in the RelationalDBLocation, but also as the SourceResolutionName for the source.
location_name ¶
(str, default: 'dbname' ) –

Name of the location for the source.
engine ¶
(Engine | None, default: None ) –

SQLAlchemy engine to use for the source’s RelationalDBLocation. If None, an in-memory SQLite engine is created.
n_true_entities ¶
(int, default: 10 ) –

Number of true entities to generate. Defaults to 10.
repetition ¶
(int, default: 0 ) –

Number of times to repeat the generated data. Defaults to 0.
seed ¶
(int, default: 42 ) –

Random seed for reproducibility. Defaults to 42.

source_from_tuple ¶

source_from_tuple(
    data_tuple: tuple[dict[str, Any], ...],
    data_keys: tuple[Any],
    name: str | None = None,
    location_name: str = "dbname",
    engine: Engine | None = None,
    seed: int = 42,
) -> SourceTestkit

Generate a complete source testkit from dummy data.

linked_sources_factory `cached` ¶

linked_sources_factory(
    source_parameters: tuple[SourceTestkitParameters, ...]
    | None = None,
    n_true_entities: int | None = None,
    engine: Engine | None = None,
    seed: int = 42,
) -> LinkedSourcesTestkit

Generate a set of linked sources with tracked entities.

Parameters:

source_parameters ¶
(tuple[SourceTestkitParameters, ...] | None, default: None ) –

Optional tuple of source testkit parameters
n_true_entities ¶
(int | None, default: None ) –

Optional number of true entities to generate. If provided, overrides any n_true_entities in source configs. If not provided, each SourceTestkitParameters must specify its own n_true_entities.
engine ¶
(Engine | None, default: None ) –

Optional SQLAlchemy engine to use for all sources. If provided, overrides any engine in source configs.
seed ¶
(int, default: 42 ) –

Random seed for reproducibility
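
The precedence rule described for n_true_entities can be sketched as a small resolution step. This is an assumption-laden illustration, not the factory's code: `resolve_n_true_entities` is a hypothetical helper, the per-source values are passed as a plain dict, and raising `ValueError` for a missing value is a guess at the failure mode.

```python
from typing import Optional


def resolve_n_true_entities(
    global_n: Optional[int], per_source_n: dict[str, Optional[int]]
) -> dict[str, int]:
    """Decide how many true entities each source uses (hypothetical helper)."""
    if global_n is not None:
        # A factory-level value overrides whatever each source configured
        return {name: global_n for name in per_source_n}
    missing = [name for name, n in per_source_n.items() if n is None]
    if missing:
        # Assumed behaviour: without a global value, every source must set its own
        raise ValueError(f"n_true_entities not set for: {', '.join(missing)}")
    return dict(per_source_n)


overridden = resolve_n_true_entities(5, {"source_a": None, "source_b": 3})
per_source = resolve_n_true_entities(None, {"source_a": 2, "source_b": 3})
```

The same "factory-level value wins" pattern is described for engine, so a comparable check would apply there.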
