Sources
matchbox.common.factories.sources
Factories for generating sources and linked source testkits for testing.
Classes:
- SourceTestkitParameters – Configuration for generating a source.
- SourceTestkit – A testkit of data and metadata for a SourceConfig.
- LinkedSourcesTestkit – Container for multiple related SourceConfig testkits with entity tracking.
Functions:
- make_features_hashable – Decorator to allow configuring source_factory with dicts.
- generate_rows – Generate raw data rows with unique keys and shared IDs.
- generate_source – Generate raw data as PyArrow tables with entity tracking.
- source_factory – Generate a complete source testkit from configured features.
- source_from_tuple – Generate a complete source testkit from dummy data.
- linked_sources_factory – Generate a set of linked sources with tracked entities.
SourceTestkitParameters
Bases: BaseModel
Configuration for generating a source.
Attributes:
- features (tuple[FeatureConfig, ...])
- name (str)
- engine (Engine)
- n_true_entities (int | None)
- repetition (int)
SourceTestkit
Bases: BaseModel
A testkit of data and metadata for a SourceConfig.
Methods:
- cast_table – Ensure that the data matches the query schema.
- into_dag – Turn source into kwargs for dag.source(), detaching from original DAG.
- write_to_location – Write the data to the SourceConfig’s location.
Attributes:
- source (Source)
- features (tuple[FeatureConfig, ...] | None)
- data (Table)
- data_hashes (Table)
- entities (tuple[ClusterEntity, ...])
- name (str) – Return the source name.
- source_config (SourceConfig) – Return the SourceConfig from the source.
source
class-attribute
instance-attribute
source: Source = Field(description='The Source object containing config and convenience methods.')
features
class-attribute
instance-attribute
features: tuple[FeatureConfig, ...] | None = Field(description='The features used to generate the data. If None, the source data was not generated, but set manually.', default=None)
data
class-attribute
instance-attribute
data_hashes
class-attribute
instance-attribute
entities
class-attribute
instance-attribute
entities: tuple[ClusterEntity, ...] = Field(description='ClusterEntities that were generated from the source.')
LinkedSourcesTestkit
Bases: BaseModel
Container for multiple related SourceConfig testkits with entity tracking.
Methods:
- find_entities – Find entities matching appearance criteria.
- true_entity_subset – Return a subset of true entities that appear in the given sources.
- diff_results – Diff a results table of probabilities against the true SourceEntities.
- write_to_location – Write the data to the SourceConfig’s location.
Attributes:
- dag (DAG)
- true_entities (set[SourceEntity])
- sources (dict[SourceResolutionName, SourceTestkit])
true_entities
class-attribute
instance-attribute
true_entities: set[SourceEntity] = Field(default_factory=set)
find_entities
find_entities(min_appearances: dict[str, int] | None = None, max_appearances: dict[str, int] | None = None) -> list[SourceEntity]
Find entities matching appearance criteria.
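As a rough illustration of what "appearance criteria" could mean, the sketch below filters stand-in entities by how many keys they hold in each source. `FakeEntity`, its `keys` mapping, the source names `source_a` and `source_b`, and the inclusive bounds are all assumptions made for illustration; they are not matchbox's SourceEntity or its actual filtering logic.

```python
from dataclasses import dataclass, field


@dataclass
class FakeEntity:
    """Hypothetical stand-in for a SourceEntity: maps source name -> keys."""

    id: int
    keys: dict = field(default_factory=dict)


def find_entities_sketch(entities, min_appearances=None, max_appearances=None):
    """Keep entities whose per-source key counts fall within the given bounds."""
    result = []
    for entity in entities:
        keep = True
        # Assumed rule: an entity needs at least `minimum` keys in the source.
        for source, minimum in (min_appearances or {}).items():
            if len(entity.keys.get(source, set())) < minimum:
                keep = False
        # Assumed rule: an entity may have at most `maximum` keys in the source.
        for source, maximum in (max_appearances or {}).items():
            if len(entity.keys.get(source, set())) > maximum:
                keep = False
        if keep:
            result.append(entity)
    return result


entities = [
    FakeEntity(1, {"source_a": {"a", "b"}, "source_b": {"x"}}),
    FakeEntity(2, {"source_a": {"c"}}),
]
# Entities with at least two keys in source_a.
matches = find_entities_sketch(entities, min_appearances={"source_a": 2})
# Entities that never appear in source_b.
absent = find_entities_sketch(entities, max_appearances={"source_b": 0})
```

Under these assumptions, a `max_appearances` of 0 is a convenient way to select entities that are missing from a source entirely.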
true_entity_subset
true_entity_subset(*sources: SourceResolutionName) -> list[ClusterEntity]
Return a subset of true entities that appear in the given sources.
diff_results
diff_results(probabilities: DataFrame, sources: list[SourceResolutionName], left_clusters: tuple[ClusterEntity, ...], right_clusters: tuple[ClusterEntity, ...] | None = None, threshold: int | float = 0) -> tuple[bool, dict]
Diff a results table of probabilities against the true SourceEntities.
Parameters:
- probabilities (DataFrame) – Probabilities table to diff
- sources (list[SourceResolutionName]) – Subset of the LinkedSourcesTestkit.sources that represents the true sources to compare against
- left_clusters (tuple[ClusterEntity, ...]) – ClusterEntity objects from the object used as an input to the process that produced the probabilities table. Should be a SourceTestkit.entities or ModelTestkit.entities.
- right_clusters (tuple[ClusterEntity, ...] | None, default: None) – ClusterEntity objects from the object used as an input to the process that produced the probabilities table. Should be a SourceTestkit.entities or ModelTestkit.entities.
- threshold (int | float, default: 0) – Threshold for considering a match true
Returns:
- tuple[bool, dict] – A tuple of whether the results are identical, and a report dictionary. See diff_results() for the report format.
make_features_hashable
Decorator to allow configuring source_factory with dicts.
This retains the hashability of FeatureConfig while still making it simple to use the factory without special objects.
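The idea can be sketched with the standard library alone. The decorator below is a hypothetical stand-in, not matchbox's implementation: it converts any dicts passed as `features` into sorted tuples of items so that a cached function underneath receives only hashable arguments. The dict key `name` and the function `build` are illustrative assumptions.

```python
from functools import cache, wraps


def make_hashable_sketch(func):
    """Convert dict entries in `features` to hashable tuples before calling func."""

    @wraps(func)
    def wrapper(features=None, **kwargs):
        if features is not None:
            features = tuple(
                tuple(sorted(f.items())) if isinstance(f, dict) else f
                for f in features
            )
        return func(features, **kwargs)

    return wrapper


calls = []  # Records every time the cached body actually runs.


@make_hashable_sketch
@cache
def build(features=None, seed=42):
    """Hypothetical cached factory: returns the number of features."""
    calls.append(features)
    return len(features or ())


# Lists of dicts are unhashable, yet both calls work and the second hits the cache.
first = build([{"name": "company_name"}], seed=1)
second = build([{"name": "company_name"}], seed=1)
```

Without the wrapper, passing a list of dicts straight to a function decorated with `functools.cache` raises `TypeError`, because the cache key must be hashable.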
generate_rows
generate_rows(generator: Faker, selected_entities: tuple[SourceEntity, ...], features: tuple[FeatureConfig, ...], repetition: int) -> tuple[dict[str, list], dict[int, list[str]], dict[int, list[str]], dict[int, bytes]]
Generate raw data rows with unique keys and shared IDs.
This function generates rows of data plus maps between three types of identifiers:
1. `id`: matchbox's unique identifier for each row, shared across rows with
   identical feature values
2. `key`: the source's unique identifier for the row. It's like a primary key
   in a database, but not guaranteed to be unique across different entities
3. `entity`: the identifier of the SourceEntity that generated the row.
   This identifies the true linked data in the factory system.
This function will therefore return:
* raw_data: A dictionary of column arrays for DataFrame creation
* entity_keys: A dictionary that maps which keys belong to each source entity
* id_keys: A dictionary that maps which keys share the same row content,
with the same `id`
* id_hashes: A dictionary that maps `id`s to hash values for each unique
row content
The key insight:
* entity_* groups by "who generated this row"
* id_* groups by "what content does this row have"
Example with two entities generating data:
| id | key | company_name |
|----|-----|--------------|
| 1  | a   | alpha co     |
| 2  | b   | alpha ltd    |
| 1  | c   | alpha co     |
| 2  | d   | alpha ltd    |
| 3  | e   | beta co      |
| 4  | f   | beta ltd     |
| 3  | g   | beta co      |
| 4  | h   | beta ltd     |
What does this table look like as raw data?
raw_data = {
    "id": [1, 2, 1, 2, 3, 4, 3, 4],
    "key": ["a", "b", "c", "d", "e", "f", "g", "h"],
    "company_name": [
        "alpha co",
        "alpha ltd",
        "alpha co",
        "alpha ltd",
        "beta co",
        "beta ltd",
        "beta co",
        "beta ltd",
    ],
}
Which keys came from each source entity?
entity_keys = {
    1: ["a", "b", "c", "d"],  # All keys entity 1 produced
    2: ["e", "f", "g", "h"],  # All keys entity 2 produced
}
Which keys have identical content?
id_keys = {
    1: ["a", "c"],  # Both have "alpha co" content
    2: ["b", "d"],  # Both have "alpha ltd" content
    3: ["e", "g"],  # Both have "beta co" content
    4: ["f", "h"],  # Both have "beta ltd" content
}
id_hashes = {
    1: b"hash1",  # Hash of "alpha co"
    2: b"hash2",  # Hash of "alpha ltd"
    3: b"hash3",  # Hash of "beta co"
    4: b"hash4",  # Hash of "beta ltd"
}
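To make the two groupings concrete, the following sketch rebuilds `entity_keys`, `id_keys` and `id_hashes` from the example rows using only the standard library. It is not the body of generate_rows: the sequential `id` assignment and the use of SHA-256 over the company name are assumptions chosen for illustration.

```python
import hashlib
from collections import defaultdict

# (entity, key, company_name) for each row of the example table.
rows = [
    (1, "a", "alpha co"),
    (1, "b", "alpha ltd"),
    (1, "c", "alpha co"),
    (1, "d", "alpha ltd"),
    (2, "e", "beta co"),
    (2, "f", "beta ltd"),
    (2, "g", "beta co"),
    (2, "h", "beta ltd"),
]

entity_keys = defaultdict(list)  # "who generated this row"
id_keys = defaultdict(list)      # "what content does this row have"
id_hashes = {}
content_to_id = {}               # Assumption: ids are handed out in first-seen order.

for entity, key, company_name in rows:
    entity_keys[entity].append(key)
    if company_name not in content_to_id:
        content_to_id[company_name] = len(content_to_id) + 1
    row_id = content_to_id[company_name]
    id_keys[row_id].append(key)
    # Assumption: SHA-256 stands in for whatever hash the library uses.
    id_hashes[row_id] = hashlib.sha256(company_name.encode()).digest()
```

Rows "a" and "c" share an `id` because their content is identical, while all four of "a" to "d" share an entity because the same SourceEntity produced them.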
generate_source
cached
generate_source(generator: Faker, n_true_entities: int, features: tuple[FeatureConfig, ...], repetition: int, seed_entities: tuple[SourceEntity, ...] | None = None) -> tuple[Table, Table, dict[int, set[str]], dict[int, set[str]]]
Generate raw data as PyArrow tables with entity tracking.
Returns:
source_factory
cached
source_factory(features: list[FeatureConfig] | list[dict] | None = None, name: SourceResolutionName | None = None, location_name: str = 'dbname', dag: DAG | None = None, engine: Engine | None = None, n_true_entities: int = 10, repetition: int = 0, seed: int = 42) -> SourceTestkit
Generate a complete source testkit from configured features.
SourceConfigs created with the factory system can only use a RelationalDBLocation, and the data at that location will be stored in a single table.
Parameters:
- features (list[FeatureConfig] | list[dict] | None, default: None) – List of FeatureConfig objects or dictionaries to use for generating the source data. If None, defaults to a set of common features.
- name (SourceResolutionName | None, default: None) – Name of the source. If None, a unique name is generated. This will be used as the name of the table in the RelationalDBLocation, but also in the SourceResolutionName for the source.
- location_name (str, default: 'dbname') – Name of the location for the source.
- dag (DAG | None, default: None) – DAG containing the source.
- engine (Engine | None, default: None) – SQLAlchemy engine to use for the source’s RelationalDBLocation. If None, an in-memory SQLite engine is created.
- n_true_entities (int, default: 10) – Number of true entities to generate. Defaults to 10.
- repetition (int, default: 0) – Number of times to repeat the generated data. Defaults to 0.
- seed (int, default: 42) – Random seed for reproducibility. Defaults to 42.
source_from_tuple
source_from_tuple(data_tuple: tuple[dict[str, Any], ...], data_keys: tuple[Any], name: str | None = None, location_name: str = 'dbname', dag: DAG | None = None, engine: Engine | None = None, seed: int = 42) -> SourceTestkit
Generate a complete source testkit from dummy data.
linked_sources_factory
cached
linked_sources_factory(source_parameters: tuple[SourceTestkitParameters, ...] | None = None, n_true_entities: int | None = None, engine: Engine | None = None, dag: DAG | None = None, seed: int = 42) -> LinkedSourcesTestkit
Generate a set of linked sources with tracked entities.
Parameters:
- source_parameters (tuple[SourceTestkitParameters, ...] | None, default: None) – Optional tuple of source testkit parameters
- n_true_entities (int | None, default: None) – Optional number of true entities to generate. If provided, overrides any n_true_entities in source configs. If not provided, each SourceTestkitParameters must specify its own n_true_entities.
- engine (Engine | None, default: None) – Optional SQLAlchemy engine to use for all sources. If provided, overrides any engine in source configs.
- dag (DAG | None, default: None) – DAG containing sources
- seed (int, default: 42) – Random seed for reproducibility
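The override rule for n_true_entities described above can be sketched as follows. `ParamsSketch` and `resolve_n_true_entities` are hypothetical names, and raising `ValueError` when a value is missing is an assumption; the source only states that each SourceTestkitParameters must specify its own value in that case.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ParamsSketch:
    """Hypothetical stand-in for SourceTestkitParameters."""

    name: str
    n_true_entities: Optional[int] = None


def resolve_n_true_entities(source_parameters, n_true_entities=None):
    """Return the entity count to use for each source, keyed by source name."""
    # A factory-level value overrides whatever the individual sources specify.
    if n_true_entities is not None:
        return {p.name: n_true_entities for p in source_parameters}
    # Otherwise every source must carry its own value (assumed to be an error if not).
    missing = [p.name for p in source_parameters if p.n_true_entities is None]
    if missing:
        raise ValueError(f"n_true_entities not set for: {missing}")
    return {p.name: p.n_true_entities for p in source_parameters}


params = (ParamsSketch("source_a", 5), ParamsSketch("source_b"))
overridden = resolve_n_true_entities(params, n_true_entities=10)
per_source = resolve_n_true_entities((ParamsSketch("source_a", 5),))
```

The same precedence is documented for `engine`: a value passed to the factory replaces any engine set on the individual source parameters.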