Sources
    Factories for generating sources and linked source testkits for testing.
Classes:
- SourceTestkitParameters – Configuration for generating a source.
- SourceTestkit – A testkit of data and metadata for a SourceConfig.
- LinkedSourcesTestkit – Container for multiple related SourceConfig testkits with entity tracking.
Functions:
- make_features_hashable – Decorator to allow configuring source_factory with dicts.
- generate_rows – Generate raw data rows with unique keys and shared IDs.
- generate_source – Generate raw data as PyArrow tables with entity tracking.
- source_factory – Generate a complete source testkit from configured features.
- source_from_tuple – Generate a complete source testkit from dummy data.
- linked_sources_factory – Generate a set of linked sources with tracked entities.
    
SourceTestkitParameters
Bases: BaseModel
Configuration for generating a source.
Attributes:
- features (tuple[FeatureConfig, ...])
- name (str)
- engine (Engine)
- n_true_entities (int | None)
- repetition (int)
    
SourceTestkit
Bases: BaseModel
A testkit of data and metadata for a SourceConfig.
Methods:
- cast_table – Ensure that the data matches the query schema.
- into_dag – Turn source into kwargs for dag.source(), detaching from original DAG.
- write_to_location – Write the data to the SourceConfig's location.
Attributes:
- source (Source)
- features (tuple[FeatureConfig, ...] | None)
- data (Table)
- data_hashes (Table)
- entities (tuple[ClusterEntity, ...])
- name (str) – Return the source name.
- source_config (SourceConfig) – Return the SourceConfig from the source.
source: Source = Field(description='The Source object containing config and convenience methods.')
features: tuple[FeatureConfig, ...] | None = Field(description='The features used to generate the data. If None, the source data was not generated, but set manually.', default=None)
entities: tuple[ClusterEntity, ...] = Field(description='ClusterEntities that were generated from the source.')
    
LinkedSourcesTestkit
Bases: BaseModel
Container for multiple related SourceConfig testkits with entity tracking.
Methods:
- find_entities – Find entities matching appearance criteria.
- true_entity_subset – Return a subset of true entities that appear in the given sources.
- diff_results – Diff a table of probabilities against the true SourceEntities.
- write_to_location – Write the data to the SourceConfig's location.
Attributes:
- dag (DAG)
- true_entities (set[SourceEntity])
- sources (dict[SourceResolutionName, SourceTestkit])
true_entities: set[SourceEntity] = Field(default_factory=set)
find_entities(min_appearances: dict[str, int] | None = None, max_appearances: dict[str, int] | None = None) -> list[SourceEntity]
Find entities matching appearance criteria.
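The appearance criteria can be read as per-source lower and upper bounds on how many keys an entity contributes. The following is a minimal sketch of that reading, not the library's implementation: `ToyEntity`, its `keys` attribute, and the choice to count a missing source as zero appearances are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class ToyEntity:
    """Hypothetical stand-in for SourceEntity: maps source name -> keys produced."""

    id: int
    keys: dict[str, set[str]]


def find_entities_sketch(entities, min_appearances=None, max_appearances=None):
    """Keep entities whose per-source key counts fall within the given bounds."""
    min_appearances = min_appearances or {}
    max_appearances = max_appearances or {}
    matches = []
    for entity in entities:
        # Count how many keys this entity has in each source
        counts = {name: len(keys) for name, keys in entity.keys.items()}
        # Assumption: a source the entity never appears in counts as zero
        meets_min = all(counts.get(s, 0) >= n for s, n in min_appearances.items())
        meets_max = all(counts.get(s, 0) <= n for s, n in max_appearances.items())
        if meets_min and meets_max:
            matches.append(entity)
    return matches


entities = [
    ToyEntity(1, {"crn": {"a", "b"}, "cdms": {"x"}}),
    ToyEntity(2, {"crn": {"c"}}),
    ToyEntity(3, {"cdms": {"y", "z"}}),
]

# Entities present at least once in both sources
in_both = find_entities_sketch(entities, min_appearances={"crn": 1, "cdms": 1})
# Entities present in "crn" but absent from "cdms"
crn_only = find_entities_sketch(
    entities, min_appearances={"crn": 1}, max_appearances={"cdms": 0}
)
```

Here `in_both` contains only entity 1 and `crn_only` contains only entity 2; the source names "crn" and "cdms" are placeholders.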
true_entity_subset(*sources: SourceResolutionName) -> list[ClusterEntity]
Return a subset of true entities that appear in the given sources.
diff_results(probabilities: DataFrame, sources: list[SourceResolutionName], left_clusters: tuple[ClusterEntity, ...], right_clusters: tuple[ClusterEntity, ...] | None = None, threshold: int | float = 0) -> tuple[bool, dict]
Diff a table of probabilities against the true SourceEntities.
Parameters:
- probabilities (DataFrame) – Probabilities table to diff.
- sources (list[SourceResolutionName]) – Subset of the LinkedSourcesTestkit.sources that represents the true sources to compare against.
- left_clusters (tuple[ClusterEntity, ...]) – ClusterEntity objects from the object used as an input to the process that produced the probabilities table. Should be a SourceTestkit.entities or ModelTestkit.entities.
- right_clusters (tuple[ClusterEntity, ...] | None, default: None) – ClusterEntity objects from the object used as an input to the process that produced the probabilities table. Should be a SourceTestkit.entities or ModelTestkit.entities.
- threshold (int | float, default: 0) – Threshold for considering a match true.
Returns:
- tuple[bool, dict] – A tuple of whether the results are identical, and a report dictionary. See diff_results() for the report format.
make_features_hashable
Decorator to allow configuring source_factory with dicts.
This retains the hashability of FeatureConfig while still making it simple to use the factory without special objects.
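The idea can be illustrated with a small sketch: a decorator converts a list of dicts into a tuple of frozen (hashable) objects before the call reaches a cached function. This is not the library's code. `ToyFeature`, its fields, `make_hashable_sketch`, and `toy_factory` are hypothetical names standing in for FeatureConfig, the real decorator, and source_factory.

```python
from dataclasses import dataclass
from functools import cache, wraps


@dataclass(frozen=True)
class ToyFeature:
    """Hypothetical stand-in for FeatureConfig; frozen so instances are hashable."""

    name: str
    base_generator: str


def make_hashable_sketch(func):
    """Convert a list of dicts or ToyFeature objects into a tuple of ToyFeature."""

    @wraps(func)
    def wrapper(features=None, **kwargs):
        if features is not None:
            features = tuple(
                ToyFeature(**f) if isinstance(f, dict) else f for f in features
            )
        return func(features, **kwargs)

    return wrapper


@make_hashable_sketch
@cache
def toy_factory(features=None, seed=42):
    # Returns a new dict per cache miss, so a cache hit is visible via identity
    return {"features": features, "seed": seed}


# Lists and dicts are unhashable, yet both calls reach the cache as equal tuples
first = toy_factory(features=[{"name": "company", "base_generator": "company"}])
second = toy_factory(features=[{"name": "company", "base_generator": "company"}])
```

Because the decorator sits outside `cache`, callers can pass plain dicts while the cached function only ever sees hashable arguments; `first is second` holds because the second call is a cache hit.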
generate_rows(generator: Faker, selected_entities: tuple[SourceEntity, ...], features: tuple[FeatureConfig, ...], repetition: int) -> tuple[dict[str, list], dict[int, list[str]], dict[int, list[str]], dict[int, bytes]]
Generate raw data rows with unique keys and shared IDs.
This function generates rows of data plus maps between three types of identifiers:
1. `id`: matchbox's unique identifier for each row, shared across rows with
    identical feature values
2. `key`: the source's unique identifier for the row. It's like a primary key
    in a database, but not guaranteed to be unique across different entities
3. `entity`: the identifier of the SourceEntity that generated the row.
    This identifies the true linked data in the factory system.
This function will therefore return:
* raw_data: A dictionary of column arrays for DataFrame creation
* entity_keys: A dictionary that maps which keys belong to each source entity
* id_keys: A dictionary that maps which keys share the same row content,
    with the same `id`
* id_hashes: A dictionary that maps `id`s to hash values for each unique
    row content
The key insight:
* entity_* groups by "who generated this row"
* id_* groups by "what content does this row have"
Example with two entities generating data:
| id | key | company_name | 
|---|---|---|
| 1 | a | alpha co | 
| 2 | b | alpha ltd | 
| 1 | c | alpha co | 
| 2 | d | alpha ltd | 
| 3 | e | beta co | 
| 4 | f | beta ltd | 
| 3 | g | beta co | 
| 4 | h | beta ltd | 
What does this table look like as raw data?
raw_data = {
    "id": [1, 2, 1, 2, 3, 4, 3, 4],
    "key": ["a", "b", "c", "d", "e", "f", "g", "h"],
    "company_name": [
        "alpha co",
        "alpha ltd",
        "alpha co",
        "alpha ltd",
        "beta co",
        "beta ltd",
        "beta co",
        "beta ltd",
    ],
}
Which keys came from each source entity?
entity_keys = {
    1: ["a", "b", "c", "d"],  # All keys entity 1 produced
    2: ["e", "f", "g", "h"],  # All keys entity 2 produced
}
Which keys have identical content?
id_keys = {
    1: ["a", "c"],  # Both have "alpha co" content
    2: ["b", "d"],  # Both have "alpha ltd" content
    3: ["e", "g"],  # Both have "beta co" content
    4: ["f", "h"],  # Both have "beta ltd" content
}
id_hashes = {
    1: b"hash1",  # Hash of "alpha co"
    2: b"hash2",  # Hash of "alpha ltd"
    3: b"hash3",  # Hash of "beta co"
    4: b"hash4",  # Hash of "beta ltd"
}
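The three maps above can be rebuilt from the example rows in plain Python. This is a sketch of the grouping logic only, under two assumptions: each row is tagged with the entity that produced it, and `hashlib.sha256` stands in for whatever hash matchbox actually uses.

```python
import hashlib
from collections import defaultdict

# (entity, key, company_name) rows matching the example table above
rows = [
    (1, "a", "alpha co"),
    (1, "b", "alpha ltd"),
    (1, "c", "alpha co"),
    (1, "d", "alpha ltd"),
    (2, "e", "beta co"),
    (2, "f", "beta ltd"),
    (2, "g", "beta co"),
    (2, "h", "beta ltd"),
]

entity_keys = defaultdict(list)  # "who generated this row"
id_keys = defaultdict(list)  # "what content does this row have"
id_hashes = {}
content_to_id = {}  # assigns ids in order of first appearance

for entity, key, company_name in rows:
    if company_name not in content_to_id:
        content_to_id[company_name] = len(content_to_id) + 1
    row_id = content_to_id[company_name]
    entity_keys[entity].append(key)
    id_keys[row_id].append(key)
    # Stand-in hash of the row content (assumption, not matchbox's hash)
    id_hashes[row_id] = hashlib.sha256(company_name.encode()).digest()
```

Rows with identical content share an `id` even though their `key` values differ, which is why `id_keys[1]` collects both "a" and "c".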
cached
generate_source(generator: Faker, n_true_entities: int, features: tuple[FeatureConfig, ...], repetition: int, seed_entities: tuple[SourceEntity, ...] | None = None) -> tuple[Table, Table, dict[int, set[str]], dict[int, set[str]]]
Generate raw data as PyArrow tables with entity tracking.
Returns:
cached
source_factory(features: list[FeatureConfig] | list[dict] | None = None, name: SourceResolutionName | None = None, location_name: str = 'dbname', dag: DAG | None = None, engine: Engine | None = None, n_true_entities: int = 10, repetition: int = 0, seed: int = 42) -> SourceTestkit
Generate a complete source testkit from configured features.
SourceConfigs created with the factory system can only use a RelationalDBLocation, and the data at that location will be stored in a single table.
Parameters:
- features (list[FeatureConfig] | list[dict] | None, default: None) – List of FeatureConfig objects or dictionaries to use for generating the source data. If None, defaults to a set of common features.
- name (SourceResolutionName | None, default: None) – Name of the source. If None, a unique name is generated. This will be used as the name of the table in the RelationalDBLocation, but also in the SourceResolutionName for the source.
- location_name (str, default: 'dbname') – Name of the location for the source.
- dag (DAG | None, default: None) – DAG containing the source.
- engine (Engine | None, default: None) – SQLAlchemy engine to use for the source's RelationalDBLocation. If None, an in-memory SQLite engine is created.
- n_true_entities (int, default: 10) – Number of true entities to generate. Defaults to 10.
- repetition (int, default: 0) – Number of times to repeat the generated data. Defaults to 0.
- seed (int, default: 42) – Random seed for reproducibility. Defaults to 42.
source_from_tuple(data_tuple: tuple[dict[str, Any], ...], data_keys: tuple[Any], name: str | None = None, location_name: str = 'dbname', dag: DAG | None = None, engine: Engine | None = None, seed: int = 42) -> SourceTestkit
Generate a complete source testkit from dummy data.
cached
linked_sources_factory(source_parameters: tuple[SourceTestkitParameters, ...] | None = None, n_true_entities: int | None = None, engine: Engine | None = None, dag: DAG | None = None, seed: int = 42) -> LinkedSourcesTestkit
Generate a set of linked sources with tracked entities.
Parameters:
- source_parameters (tuple[SourceTestkitParameters, ...] | None, default: None) – Optional tuple of source testkit parameters.
- n_true_entities (int | None, default: None) – Optional number of true entities to generate. If provided, overrides any n_true_entities in source configs. If not provided, each SourceTestkitParameters must specify its own n_true_entities.
- engine (Engine | None, default: None) – Optional SQLAlchemy engine to use for all sources. If provided, overrides any engine in source configs.
- dag (DAG | None, default: None) – DAG containing sources.
- seed (int, default: 42) – Random seed for reproducibility.