Sources
matchbox.common.factories.sources
Factories for generating sources and linked source testkits for testing.
Classes:
- SourceTestkitParameters – Configuration for generating a source.
- SourceTestkit – A testkit of data and metadata for a SourceConfig.
- LinkedSourcesTestkit – Container for multiple related SourceConfig testkits with entity tracking.
Functions:
- make_features_hashable – Decorator to allow configuring source_factory with dicts.
- generate_rows – Generate raw data rows with unique keys and shared IDs.
- generate_source – Generate raw data as PyArrow tables with entity tracking.
- source_factory – Generate a complete source testkit from configured features.
- source_from_tuple – Generate a complete source testkit from dummy data.
- linked_sources_factory – Generate a set of linked sources with tracked entities.
SourceTestkitParameters
Bases: BaseModel
Configuration for generating a source.
Attributes:
- features (tuple[FeatureConfig, ...])
- full_name (str)
- engine (Engine)
- n_true_entities (int | None)
- repetition (int)
SourceTestkit
Bases: BaseModel
A testkit of data and metadata for a SourceConfig.
Methods:
- to_warehouse – Write the data to the SourceConfig’s engine.
Attributes:
- source_config (SourceConfig)
- features (tuple[FeatureConfig, ...] | None)
- data (Table)
- data_hashes (Table)
- entities (tuple[ClusterEntity, ...])
- name (str) – Return the resolution name of the SourceConfig.
- mock (Mock) – Create a mock SourceConfig object with this testkit’s configuration.
- query (Table) – Return a PyArrow table in the same format as matchbox.query().
- query_backend (Table) – Return a PyArrow table in the same format as the SCHEMA_MB_IDS DTO.
source_config
class-attribute
instance-attribute
source_config: SourceConfig = Field(
    description="The real generated SourceConfig object."
)
features
class-attribute
instance-attribute
features: tuple[FeatureConfig, ...] | None = Field(
    description="The features used to generate the data. If None, the source data was not generated, but set manually.",
    default=None,
)
data
class-attribute
instance-attribute
data_hashes
class-attribute
instance-attribute
entities
class-attribute
instance-attribute
entities: tuple[ClusterEntity, ...] = Field(
    description="ClusterEntities that were generated from the source."
)
query_backend
property
Return a PyArrow table in the same format as the SCHEMA_MB_IDS DTO.
to_warehouse
Write the data to the SourceConfig’s engine.
As the SourceConfig won’t have an engine set by default, one can be supplied here.
LinkedSourcesTestkit
Bases: BaseModel
Container for multiple related SourceConfig testkits with entity tracking.
Methods:
- find_entities – Find entities matching appearance criteria.
- true_entity_subset – Return a subset of true entities that appear in the given sources.
- diff_results – Diff a results table of probabilities against the true SourceEntities.
Attributes:
- true_entities (set[SourceEntity])
- sources (dict[str, SourceTestkit])
true_entities
class-attribute
instance-attribute
true_entities: set[SourceEntity] = Field(
    default_factory=set
)
find_entities
find_entities(
    min_appearances: dict[str, int] | None = None,
    max_appearances: dict[str, int] | None = None,
) -> list[SourceEntity]
Find entities matching appearance criteria.
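The appearance criteria are not spelled out here, so the following is only a rough sketch of one plausible reading: each dictionary maps a source name to a minimum or maximum number of keys an entity may have in that source. The function name `find_entities_sketch` and the plain-dict `appearances` structure are hypothetical stand-ins, not part of matchbox.

```python
def find_entities_sketch(appearances, min_appearances=None, max_appearances=None):
    """Filter entities by how often they appear in each source.

    `appearances` is a hypothetical structure mapping
    entity id -> {source name -> number of keys in that source}.
    """
    min_appearances = min_appearances or {}
    max_appearances = max_appearances or {}
    matches = []
    for entity_id, counts in appearances.items():
        # A source the entity never appears in counts as zero appearances.
        meets_min = all(counts.get(s, 0) >= n for s, n in min_appearances.items())
        meets_max = all(counts.get(s, 0) <= n for s, n in max_appearances.items())
        if meets_min and meets_max:
            matches.append(entity_id)
    return matches


appearances = {
    1: {"crn": 2, "cdms": 1},
    2: {"crn": 1},
    3: {"crn": 3, "cdms": 4},
}
in_both = find_entities_sketch(appearances, min_appearances={"crn": 1, "cdms": 1})
rare_in_cdms = find_entities_sketch(appearances, max_appearances={"cdms": 1})
```

With no criteria supplied, the sketch returns every entity, which mirrors the `None` defaults in the signature above.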
true_entity_subset
true_entity_subset(
    *sources: SourceResolutionName,
) -> list[ClusterEntity]
Return a subset of true entities that appear in the given sources.
diff_results
diff_results(
    probabilities: Table,
    sources: list[str],
    left_clusters: tuple[ClusterEntity, ...],
    right_clusters: tuple[ClusterEntity, ...] | None = None,
    threshold: int | float = 0,
) -> tuple[bool, dict]
Diff a results table of probabilities against the true SourceEntities.
Parameters:
- probabilities (Table) – Probabilities table to diff
- sources (list[str]) – Subset of the LinkedSourcesTestkit.sources that represents the true sources to compare against
- left_clusters (tuple[ClusterEntity, ...]) – ClusterEntity objects from the object used as an input to the process that produced the probabilities table. Should be a SourceTestkit.entities or ModelTestkit.entities.
- right_clusters (tuple[ClusterEntity, ...] | None, default: None) – ClusterEntity objects from the object used as an input to the process that produced the probabilities table. Should be a SourceTestkit.entities or ModelTestkit.entities.
- threshold (int | float, default: 0) – Threshold for considering a match true
Returns:
- tuple[bool, dict] – A tuple of whether the results are identical, and a report dictionary. See diff_results() for the report format.
make_features_hashable
Decorator to allow configuring source_factory with dicts.
This retains the hashability of FeatureConfig while still making it simple to use the factory without special objects.
generate_rows
generate_rows(
    generator: Faker,
    selected_entities: tuple[SourceEntity, ...],
    features: tuple[FeatureConfig, ...],
) -> tuple[
    dict[str, list],
    dict[int, list[str]],
    dict[int, list[str]],
    dict[int, bytes],
]
Generate raw data rows with unique keys and shared IDs.
This function generates rows of data plus maps between three types of identifiers:
1. `id`: matchbox's unique identifier for each row, shared across rows with identical feature values.
2. `key`: the source's unique identifier for the row. It's like a primary key in a database, but not guaranteed to be unique across different entities.
3. `entity`: the identifier of the SourceEntity that generated the row. This identifies the true linked data in the factory system.
This function will therefore return:
* raw_data: A dictionary of column arrays for DataFrame creation
* entity_keys: A dictionary that maps which keys belong to each source entity
* id_keys: A dictionary that maps which keys share the same row content, with the same `id`
* id_hashes: A dictionary that maps `id`s to hash values for each unique row content
The key insight:
* entity_* groups by "who generated this row"
* id_* groups by "what content does this row have"
Example with two entities generating data:
| id | key | company_name |
|----|-----|--------------|
| 1  | a   | alpha co     |
| 2  | b   | alpha ltd    |
| 1  | c   | alpha co     |
| 2  | d   | alpha ltd    |
| 3  | e   | beta co      |
| 4  | f   | beta ltd     |
| 3  | g   | beta co      |
| 4  | h   | beta ltd     |
What does this table look like as raw data?
raw_data = {
    "id": [1, 2, 1, 2, 3, 4, 3, 4],
    "key": ["a", "b", "c", "d", "e", "f", "g", "h"],
    "company_name": [
        "alpha co",
        "alpha ltd",
        "alpha co",
        "alpha ltd",
        "beta co",
        "beta ltd",
        "beta co",
        "beta ltd",
    ],
}
Which keys came from each source entity?
entity_keys = {
    1: ["a", "b", "c", "d"],  # All keys entity 1 produced
    2: ["e", "f", "g", "h"],  # All keys entity 2 produced
}
Which keys have identical content?
id_keys = {
    1: ["a", "c"],  # Both have "alpha co" content
    2: ["b", "d"],  # Both have "alpha ltd" content
    3: ["e", "g"],  # Both have "beta co" content
    4: ["f", "h"],  # Both have "beta ltd" content
}
id_hashes = {
    1: b"hash1",  # Hash of "alpha co"
    2: b"hash2",  # Hash of "alpha ltd"
    3: b"hash3",  # Hash of "beta co"
    4: b"hash4",  # Hash of "beta ltd"
}
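To make the two groupings concrete, here is a minimal sketch that rebuilds the same four structures from a flat list of rows. It is an illustration only, not how `generate_rows` is implemented: the function `group_rows`, the `(entity, key, company_name)` row layout, and the use of SHA-256 as the content hash are all assumptions.

```python
import hashlib
from collections import defaultdict


def group_rows(rows):
    """Group (entity, key, company_name) rows by entity and by content."""
    content_to_id = {}  # row content -> id, assigned in order of first appearance
    raw_data = {"id": [], "key": [], "company_name": []}
    entity_keys = defaultdict(list)  # "who generated this row"
    id_keys = defaultdict(list)  # "what content does this row have"
    id_hashes = {}

    for entity, key, name in rows:
        # Rows with identical content share one id.
        row_id = content_to_id.setdefault(name, len(content_to_id) + 1)
        raw_data["id"].append(row_id)
        raw_data["key"].append(key)
        raw_data["company_name"].append(name)
        entity_keys[entity].append(key)
        id_keys[row_id].append(key)
        # Assumption: SHA-256 stands in for whatever hash matchbox really uses.
        id_hashes[row_id] = hashlib.sha256(name.encode()).digest()

    return raw_data, dict(entity_keys), dict(id_keys), id_hashes


rows = [
    (1, "a", "alpha co"),
    (1, "b", "alpha ltd"),
    (1, "c", "alpha co"),
    (1, "d", "alpha ltd"),
    (2, "e", "beta co"),
    (2, "f", "beta ltd"),
    (2, "g", "beta co"),
    (2, "h", "beta ltd"),
]
raw_data, entity_keys, id_keys, id_hashes = group_rows(rows)
```

Every key lands in exactly one entity group and exactly one id group, which is why the two maps can be used independently to answer "who" and "what" questions about a row.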
generate_source
cached
generate_source(
    generator: Faker,
    n_true_entities: int,
    features: tuple[FeatureConfig, ...],
    repetition: int,
    seed_entities: tuple[SourceEntity, ...] | None = None,
) -> tuple[
    Table, Table, dict[int, set[str]], dict[int, set[str]]
]
Generate raw data as PyArrow tables with entity tracking.
Returns:
source_factory
cached
source_factory(
    features: list[FeatureConfig]
    | list[dict]
    | None = None,
    full_name: str | None = None,
    engine: Engine | None = None,
    n_true_entities: int = 10,
    repetition: int = 0,
    seed: int = 42,
) -> SourceTestkit
Generate a complete source testkit from configured features.
source_from_tuple
source_from_tuple(
    data_tuple: tuple[dict[str, Any], ...],
    data_keys: tuple[Any],
    full_name: str | None = None,
    engine: Engine | None = None,
    seed: int = 42,
) -> SourceTestkit
Generate a complete source testkit from dummy data.
linked_sources_factory
cached
linked_sources_factory(
    source_parameters: tuple[SourceTestkitParameters, ...]
    | None = None,
    n_true_entities: int | None = None,
    engine: Engine | None = None,
    seed: int = 42,
) -> LinkedSourcesTestkit
Generate a set of linked sources with tracked entities.
Parameters:
- source_parameters (tuple[SourceTestkitParameters, ...] | None, default: None) – Optional tuple of source testkit parameters
- n_true_entities (int | None, default: None) – Optional number of true entities to generate. If provided, overrides any n_true_entities in source configs. If not provided, each SourceTestkitParameters must specify its own n_true_entities.
- engine (Engine | None, default: None) – Optional SQLAlchemy engine to use for all sources. If provided, overrides any engine in source configs.
- seed (int, default: 42) – Random seed for reproducibility
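The override rule for n_true_entities can be sketched in isolation. The code below is an illustration under stated assumptions, not the library's code: `ParamsSketch` is a hypothetical stand-in for SourceTestkitParameters, `resolve_n_true_entities` is a hypothetical helper, and raising `ValueError` for a missing value is a guess at how "must specify its own" might be enforced.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ParamsSketch:
    """Hypothetical stand-in for SourceTestkitParameters."""

    full_name: str
    n_true_entities: Optional[int] = None


def resolve_n_true_entities(source_parameters, n_true_entities=None):
    """Return the effective entity count per source name."""
    if n_true_entities is not None:
        # A factory-level value overrides whatever each source specifies.
        return {p.full_name: n_true_entities for p in source_parameters}
    missing = [p.full_name for p in source_parameters if p.n_true_entities is None]
    if missing:
        # Assumption: the real factory may signal this differently.
        raise ValueError(f"n_true_entities not set for: {missing}")
    return {p.full_name: p.n_true_entities for p in source_parameters}


params = (
    ParamsSketch(full_name="crn", n_true_entities=10),
    ParamsSketch(full_name="cdms", n_true_entities=5),
)
per_source = resolve_n_true_entities(params)
overridden = resolve_n_true_entities(params, n_true_entities=3)
```

The same precedence applies to engine, where a factory-level value replaces any engine set on individual sources.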