Sources
matchbox.common.factories.sources
Factories for generating sources and linked source testkits for testing.
Classes:
- SourceConfig – Configuration for generating a source.
- SourceTestkit – A testkit of data and metadata for a Source.
- LinkedSourcesTestkit – Container for multiple related Source testkits with entity tracking.
Functions:
- make_features_hashable – Decorator to allow configuring source_factory with dicts.
- generate_rows – Generate raw data rows. Adds an ID shared by unique rows, and a PK for every row.
- generate_source – Generate raw data as PyArrow tables with entity tracking.
- source_factory – Generate a complete source testkit.
- linked_sources_factory – Generate a set of linked sources with tracked entities.
SourceConfig
Bases: BaseModel
Configuration for generating a source.
Attributes:
- features (tuple[FeatureConfig, ...])
- full_name (str)
- engine (Engine)
- n_true_entities (int | None)
- repetition (int)
SourceTestkit
Bases: BaseModel
A testkit of data and metadata for a Source.
Methods:
- to_warehouse – Write the data to the Source’s engine.
Attributes:
- source (Source)
- features (tuple[FeatureConfig, ...])
- data (Table)
- data_hashes (Table)
- entities (tuple[ClusterEntity, ...])
- name (str) – Return the resolution name of the Source.
- mock (Mock) – Create a mock Source object with this testkit’s configuration.
- query (Table) – Return a PyArrow table in the same format as matchbox.query().
- query_backend (Table) – Return a PyArrow table in the same format as the SCHEMA_MB_IDS DTO.
source (class-attribute, instance-attribute)
features (class-attribute, instance-attribute)
features: tuple[FeatureConfig, ...] = Field(
    description="The features used to generate the data."
)
data (class-attribute, instance-attribute)
data_hashes (class-attribute, instance-attribute)
entities (class-attribute, instance-attribute)
entities: tuple[ClusterEntity, ...] = Field(
    description="ClusterEntities that were generated from the source."
)
query_backend (property)
Return a PyArrow table in the same format as the SCHEMA_MB_IDS DTO.
to_warehouse
Write the data to the Source’s engine.
As the Source won’t have an engine set by default, one can be supplied.
LinkedSourcesTestkit
Bases: BaseModel
Container for multiple related Source testkits with entity tracking.
Methods:
- find_entities – Find entities matching appearance criteria.
- true_entity_subset – Return a subset of true entities that appear in the given sources.
- diff_results – Diff a table of probabilities against the true SourceEntities.
Attributes:
- true_entities (set[SourceEntity])
- sources (dict[str, SourceTestkit])
true_entities (class-attribute, instance-attribute)
true_entities: set[SourceEntity] = Field(
    default_factory=set
)
find_entities
find_entities(
    min_appearances: dict[str, int] | None = None,
    max_appearances: dict[str, int] | None = None,
) -> list[SourceEntity]
Find entities matching appearance criteria.
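The reference does not spell out how the appearance criteria are evaluated. The following is a minimal sketch of one plausible reading, assuming each dictionary maps a source name to a minimum or maximum number of appearances of the entity in that source. The function name `find_matching`, the `appearances` structure and the source names `crn` and `duns` are all hypothetical and are not part of matchbox.

```python
def find_matching(appearances, min_appearances=None, max_appearances=None):
    """Return the IDs of entities whose per-source counts satisfy both bounds.

    `appearances` maps an entity ID to {source_name: appearance_count}.
    This is an illustrative stand-in, not the matchbox implementation.
    """
    min_appearances = min_appearances or {}
    max_appearances = max_appearances or {}
    matches = []
    for entity_id, counts in appearances.items():
        # A source missing from `counts` is treated as zero appearances.
        meets_min = all(counts.get(s, 0) >= n for s, n in min_appearances.items())
        meets_max = all(counts.get(s, 0) <= n for s, n in max_appearances.items())
        if meets_min and meets_max:
            matches.append(entity_id)
    return matches


# Hypothetical data: three entities spread over two sources.
appearances = {
    "e1": {"crn": 2, "duns": 0},
    "e2": {"crn": 1, "duns": 1},
    "e3": {"crn": 0, "duns": 3},
}
in_both = find_matching(appearances, min_appearances={"crn": 1, "duns": 1})
```

Under this reading, omitting both arguments applies no filter and returns every entity.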
true_entity_subset
true_entity_subset(*sources: str) -> list[ClusterEntity]
Return a subset of true entities that appear in the given sources.
diff_results
diff_results(
    probabilities: Table,
    sources: list[str],
    left_clusters: tuple[ClusterEntity, ...],
    right_clusters: tuple[ClusterEntity, ...] | None = None,
    threshold: int | float = 0,
) -> tuple[bool, dict]
Diff a table of probabilities against the true SourceEntities.
Parameters:
- probabilities (Table) – Probabilities table to diff.
- sources (list[str]) – Subset of the LinkedSourcesTestkit.sources that represents the true sources to compare against.
- left_clusters (tuple[ClusterEntity, ...]) – ClusterEntity objects from the object used as an input to the process that produced the probabilities table. Should be a SourceTestkit.entities or ModelTestkit.entities.
- right_clusters (tuple[ClusterEntity, ...] | None, default: None) – ClusterEntity objects from the object used as an input to the process that produced the probabilities table. Should be a SourceTestkit.entities or ModelTestkit.entities.
- threshold (int | float, default: 0) – Threshold for considering a match true.
Returns:
- tuple[bool, dict] – A tuple of whether the results are identical, and a report dictionary. See diff_results() for the report format.
make_features_hashable
Decorator to allow configuring source_factory with dicts.
This retains the hashability of FeatureConfig while still making it simple to use the factory without special objects.
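To illustrate why hashability matters for a cached factory, here is a minimal sketch of the general pattern: a decorator that freezes dicts and lists into tuples before the call reaches an `lru_cache`-wrapped function. The names `_freeze`, `make_hashable_args` and `describe_features` are hypothetical, and the sketch uses plain tuples where the real decorator presumably produces FeatureConfig objects.

```python
from functools import lru_cache, wraps


def _freeze(value):
    """Recursively convert dicts and lists into hashable tuples."""
    if isinstance(value, dict):
        # Sort by key so that equal dicts always freeze to the same tuple.
        return tuple(sorted((k, _freeze(v)) for k, v in value.items()))
    if isinstance(value, (list, tuple)):
        return tuple(_freeze(v) for v in value)
    return value


def make_hashable_args(func):
    """Hypothetical decorator: freeze arguments before calling a cached function."""

    @wraps(func)
    def wrapper(*args, **kwargs):
        frozen_args = tuple(_freeze(a) for a in args)
        frozen_kwargs = {k: _freeze(v) for k, v in kwargs.items()}
        return func(*frozen_args, **frozen_kwargs)

    return wrapper


@make_hashable_args
@lru_cache(maxsize=None)
def describe_features(features=None):
    """Stand-in for a cached factory: return the feature names it received."""
    if features is None:
        return ()
    # Each frozen feature is a tuple of (key, value) pairs.
    return tuple(dict(feature)["name"] for feature in features)
```

Without the outer decorator, passing a list of dicts to the cached function would raise `TypeError: unhashable type`. With it, repeated calls with equal dicts hit the cache.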
generate_rows
generate_rows(
    generator: Faker,
    selected_entities: tuple[SourceEntity, ...],
    features: tuple[FeatureConfig, ...],
) -> tuple[
    dict[str, list],
    dict[int, list[str]],
    dict[int, list[str]],
]
Generate raw data rows. Adds an ID shared by unique rows, and a PK for every row.
Returns a tuple of:
- raw_data: Dictionary of column arrays for DataFrame creation
- entity_pks: Maps SourceEntity.id to the set of PKs where that entity appears
- id_pks: Maps each ID to the set of PKs where that row appears
For example, if this is the raw data:
| id | pk | company_name |
|----|----|--------------|
| 1  | 1  | alpha co     |
| 2  | 2  | alpha ltd    |
| 1  | 3  | alpha co     |
| 2  | 4  | alpha ltd    |
| 3  | 5  | beta co      |
| 4  | 6  | beta ltd     |
| 3  | 7  | beta co      |
| 4  | 8  | beta ltd     |
Entity PKs would be this, because there are two true SourceEntities:
{ 1: [1, 2, 3, 4], 2: [5, 6, 7, 8], }
And ID PKs would be this, because there are four unique rows:
{ 1: [1, 3], 2: [2, 4], 3: [5, 7], 4: [6, 8], }
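The two mappings in this example can be reproduced with a short, self-contained sketch. It does not call matchbox: the helper `build_pk_maps` and the `id_to_entity` lookup are hypothetical, and the assignment of row IDs to entities is assumed from the example (IDs 1 and 2 are variants of one entity, IDs 3 and 4 of the other).

```python
from collections import defaultdict

# Rows from the example table: (id, pk, company_name).
rows = [
    (1, 1, "alpha co"),
    (2, 2, "alpha ltd"),
    (1, 3, "alpha co"),
    (2, 4, "alpha ltd"),
    (3, 5, "beta co"),
    (4, 6, "beta ltd"),
    (3, 7, "beta co"),
    (4, 8, "beta ltd"),
]

# Assumption: which true entity each unique row ID belongs to.
id_to_entity = {1: 1, 2: 1, 3: 2, 4: 2}


def build_pk_maps(rows, id_to_entity):
    """Group PKs by true entity and by unique row ID."""
    entity_pks = defaultdict(list)
    id_pks = defaultdict(list)
    for row_id, pk, _company_name in rows:
        id_pks[row_id].append(pk)
        entity_pks[id_to_entity[row_id]].append(pk)
    return dict(entity_pks), dict(id_pks)


entity_pks, id_pks = build_pk_maps(rows, id_to_entity)
```

Every PK appears exactly once in each mapping, so both partition the same eight rows at different granularities.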
generate_source (cached)
generate_source(
    generator: Faker,
    n_true_entities: int,
    features: tuple[FeatureConfig, ...],
    repetition: int,
    seed_entities: tuple[SourceEntity, ...] | None = None,
) -> tuple[
    Table, Table, dict[int, set[str]], dict[int, set[str]]
]
Generate raw data as PyArrow tables with entity tracking.
Returns:
source_factory (cached)
source_factory(
    features: list[FeatureConfig]
    | list[dict]
    | None = None,
    full_name: str | None = None,
    engine: Engine | None = None,
    n_true_entities: int = 10,
    repetition: int = 0,
    seed: int = 42,
) -> SourceTestkit
Generate a complete source testkit.
linked_sources_factory (cached)
linked_sources_factory(
    source_configs: tuple[SourceConfig, ...] | None = None,
    n_true_entities: int | None = None,
    engine: Engine | None = None,
    seed: int = 42,
) -> LinkedSourcesTestkit
Generate a set of linked sources with tracked entities.
Parameters:
- source_configs (tuple[SourceConfig, ...] | None, default: None) – Optional tuple of source configurations.
- n_true_entities (int | None, default: None) – Optional number of true entities to generate. If provided, overrides any n_true_entities in source configs. If not provided, each SourceConfig must specify its own n_true_entities.
- engine (Engine | None, default: None) – Optional SQLAlchemy engine to use for all sources. If provided, overrides any engine in source configs.
- seed (int, default: 42) – Random seed for reproducibility.
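The override rule for n_true_entities can be sketched in isolation. This is an illustration of the documented behaviour only, not the matchbox implementation: `ConfigSketch` is a hypothetical stand-in for SourceConfig, `resolve_entity_counts` is a hypothetical helper, and raising `ValueError` for a missing count is an assumption.

```python
from dataclasses import dataclass, replace
from typing import Optional


@dataclass(frozen=True)
class ConfigSketch:
    """Hypothetical stand-in for SourceConfig with only the fields used here."""

    full_name: str
    n_true_entities: Optional[int] = None


def resolve_entity_counts(configs, n_true_entities=None):
    """Apply the factory-level count if given, else require one per config."""
    resolved = []
    for config in configs:
        if n_true_entities is not None:
            # The factory-level value overrides whatever the config holds.
            resolved.append(replace(config, n_true_entities=n_true_entities))
        elif config.n_true_entities is None:
            # Assumption: a missing count is reported as a ValueError.
            raise ValueError(f"{config.full_name} must set n_true_entities")
        else:
            resolved.append(config)
    return tuple(resolved)


configs = (ConfigSketch("source_a", 5), ConfigSketch("source_b"))
overridden = resolve_entity_counts(configs, n_true_entities=20)
```

The engine parameter is documented as following the same override pattern, so the same shape of logic would apply to it.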