Overview

matchbox.common.factories

Factory functions for the testkit system.

Modules:

  • dags

    DAG container for testkits.

  • entities

    Classes and functions for generating and comparing entities.

  • models

    Factory functions for generating model testkits and data for testing.

  • sources

    Factories for generating sources and linked source testkits for testing.

Using the system

The factory system aims to provide *Testkit objects that facilitate three groups of testing scenarios:

  • Realistic mock Source and Model objects to test client-side connectivity functions
  • Realistic mock data to test server-side adapter functions
  • Realistic mock pipelines with controlled completeness to test client-side methodologies

Three broad functions are provided: source_factory(), linked_sources_factory(), and model_factory().

Underneath, these factories and objects use a system of SourceEntity and ClusterEntity objects to share data. The source entities are the true answer, and the cluster entities are the data as it is merged while moving through the system. A comprehensive set of comparators has been implemented to make this simple to implement, understand, and read in unit tests.

All factory functions are configured to provide a sensible, useful default.

The system has been designed to be as hashable as possible to enable caching. Often you’ll need to provide tuples where you might normally provide lists.
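The reason tuples matter can be shown with a minimal sketch. DemoFeature and demo_factory below are hypothetical stand-ins, not part of matchbox; they only illustrate how a cached function accepts hashable arguments and rejects lists.

```python
from dataclasses import dataclass
from functools import cache


@dataclass(frozen=True)
class DemoFeature:
    """Hypothetical stand-in for a feature config; frozen makes it hashable."""

    name: str
    base_generator: str


@cache
def demo_factory(features: tuple[DemoFeature, ...]) -> tuple[str, ...]:
    """Hypothetical cached factory: returns one label per feature."""
    return tuple(f"{f.name}:{f.base_generator}" for f in features)


features = (DemoFeature(name="company_name", base_generator="company"),)

first = demo_factory(features)
second = demo_factory(features)  # Same arguments, so the cached result is reused

try:
    demo_factory(list(features))  # A list cannot be hashed into a cache key
except TypeError as err:
    list_error = str(err)
```

Passing the same tuple twice returns the cached object, while the list raises a TypeError before the function body runs.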

There are some common patterns you might consider using when editing or extending tests.

Client-side connectivity

We can use the factories to test inserting or retrieving isolated Source or Model objects.

Perhaps you’re testing the API and want to put a realistic Source in the ingestion pipeline.

source_testkit = source_factory()

# Setup store
store = MetadataStore()
update_id = store.cache_source(source_testkit.source)

Or you’re testing the client handler and want to mock the API.

@patch("matchbox.client.helpers.index.Source")
def test_my_api(MockSource: Mock, matchbox_api: MockRouter):
    source_testkit = source_factory(
        features=[{"name": "company_name", "base_generator": "company"}]
    )
    MockSource.return_value = source_testkit.mock
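The patching pattern above can be tried in isolation. This is a self-contained sketch using only unittest.mock; DemoSource, demo_helpers, and index_rows are hypothetical names standing in for the real client code, and a plain Mock stands in for the testkit's mock.

```python
from unittest.mock import Mock, patch


class DemoSource:
    """Hypothetical class that would normally reach a real warehouse."""

    def to_rows(self) -> list[dict]:
        raise RuntimeError("No warehouse available in unit tests")


class demo_helpers:
    """Hypothetical namespace playing the role of the patched module."""

    Source = DemoSource


def index_rows() -> int:
    """Hypothetical client handler: builds a Source and counts its rows."""
    return len(demo_helpers.Source().to_rows())


# Stand-in for the mock object a testkit would supply
fake_source = Mock()
fake_source.to_rows.return_value = [{"company_name": "Acme Ltd"}]

with patch.object(demo_helpers, "Source") as MockSource:
    MockSource.return_value = fake_source  # Constructing Source now yields the fake
    row_count = index_rows()
```

Because the handler looks up Source through the patched namespace, it receives the fake object and never touches the real class.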

source_factory() can be configured with a powerful range of FeatureConfig objects, including a variety of rules which distort and duplicate the data in predictable ways. These use Faker to generate data.

source_factory(
    n_true_entities=1_000,
    features=(
        FeatureConfig(
            name="name",
            base_generator="first_name_female",
            drop_base=False,
            variations=(PrefixRule(prefix="Ms "),),
        ),
        FeatureConfig(
            name="title",
            base_generator="job",
            drop_base=True,
            variations=(
                SuffixRule(suffix=" MBE"),
                ReplaceRule(old="Manager", new="Leader"),
            ),
        ),
    ),
    repetition=3,
)
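To make the effect of variation rules concrete, here is a rough sketch of how such rules might transform a single base value. The Demo* classes and expand() are hypothetical and are not matchbox's implementation. Two assumptions are made: each rule is applied independently to the base value, and drop_base=True means the unmodified base value is excluded from the output.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DemoPrefixRule:
    prefix: str

    def apply(self, value: str) -> str:
        return self.prefix + value


@dataclass(frozen=True)
class DemoSuffixRule:
    suffix: str

    def apply(self, value: str) -> str:
        return value + self.suffix


@dataclass(frozen=True)
class DemoReplaceRule:
    old: str
    new: str

    def apply(self, value: str) -> str:
        return value.replace(self.old, self.new)


def expand(base: str, variations: tuple, drop_base: bool) -> list[str]:
    """Return every value one base value yields under the given rules."""
    varied = [rule.apply(base) for rule in variations]
    # Assumption: drop_base removes the undistorted value from the output
    return varied if drop_base else [base, *varied]


names = expand("Alice", (DemoPrefixRule(prefix="Ms "),), drop_base=False)
titles = expand(
    "Sales Manager",
    (DemoSuffixRule(suffix=" MBE"), DemoReplaceRule(old="Manager", new="Leader")),
    drop_base=True,
)
```

Under these assumptions, names holds the base value plus one prefixed variant, and titles holds only the two distorted variants.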

Server-side adapters

The factories can generate data suitable for MatchboxDBAdapter.index(), MatchboxDBAdapter.insert_model(), or MatchboxDBAdapter.set_model_results(). Between these functions, we can set up any backend in any configuration we need to test the other adapter methods.

Adding a Source.

source_testkit = source_factory()
backend.index(
    source=source_testkit.source,
    data_hashes=source_testkit.data_hashes
)

Adding a Model.

model_testkit = model_factory()
backend.insert_model(model=model_testkit.model.metadata)

Inserting results.

model_testkit = model_factory()
backend.set_model_results(
    model=model_testkit.model.metadata.full_name, 
    results=model_testkit.probabilities
)

linked_sources_factory() and model_factory() can be used together to create broader systems of data that connect – or don’t – in controlled ways.

linked_testkit = linked_sources_factory()

for source_testkit in linked_testkit.sources.values():
    backend.index(
        source=source_testkit.source,
        data_hashes=source_testkit.data_hashes
    )

model_testkit = model_factory(
    left_testkit=linked_testkit.sources["crn"],
    true_entities=linked_testkit.true_entities,
)

backend.insert_model(model=model_testkit.model.metadata)
backend.set_model_results(
    model=model_testkit.model.metadata.full_name, 
    results=model_testkit.probabilities
)

Methodologies

Configure the true state of your data with linked_sources_factory(). Its default is a set of three tables of ten unique company entities.

  • CRN (company name, CRN ID) contains all entities with three unique variations of the company’s name
  • CDMS (CRN ID, DUNS ID) contains all entities repeated twice
  • DUNS (company name, DUNS ID) contains half the entities

linked_sources_factory() can be configured using tuples of SourceConfig objects. Using these you can create complex sets of interweaving sources for methodologies to be tested against.

model_factory() is designed so you can chain together known processes in any order before applying your real methodology. LinkedSourcesTestkit.diff_results() makes any probabilistic output comparable with the true source entities, and gives a detailed diff to help you debug.

linked_testkit: LinkedSourcesTestkit = linked_sources_factory()

# Create perfect deduped models first
left_deduped: ModelTestkit = model_factory(
    left_testkit=linked_testkit.sources["crn"],
    true_entities=linked_testkit.true_entities,
)
right_deduped: ModelTestkit = model_factory(
    left_testkit=linked_testkit.sources["cdms"],
    true_entities=linked_testkit.true_entities,
)

# Create a model and generate probabilities
model: Model = make_model(
    left_data=left_deduped.query,
    right_data=right_deduped.query,
    ...
)
results: Results = model.run()

# Diff, assert, and log the message if it fails
identical, report = linked_testkit.diff_results(
    probabilities=results.probabilities,  # Your methodology's output
    left_clusters=left_deduped.entities,  # Output of left deduper -- left input to your methodology
    right_clusters=right_deduped.entities,  # Output of right deduper -- right input to your methodology
    sources=("crn", "cdms"),
    threshold=0,
)

assert identical, report
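The idea behind an (identical, report) pair can be illustrated with a small set-based sketch. demo_diff is a hypothetical helper, not the logic of LinkedSourcesTestkit.diff_results(). It assumes clusters can be represented as sets of record keys and ignores probabilities and thresholds entirely.

```python
from collections.abc import Iterable


def demo_diff(
    predicted: Iterable[set[str]], truth: Iterable[set[str]]
) -> tuple[bool, dict]:
    """Compare predicted clusters with true entities, both as sets of keys."""
    predicted_set = {frozenset(cluster) for cluster in predicted}
    truth_set = {frozenset(entity) for entity in truth}
    report = {
        # Clusters that exactly match a true entity
        "perfect": len(predicted_set & truth_set),
        # True entities that no predicted cluster reproduces
        "missing": sorted(sorted(e) for e in truth_set - predicted_set),
        # Predicted clusters that match no true entity
        "spurious": sorted(sorted(c) for c in predicted_set - truth_set),
    }
    return predicted_set == truth_set, report


truth = [{"crn:1", "cdms:1"}, {"crn:2", "cdms:2"}]
predicted = [{"crn:1", "cdms:1"}, {"crn:2"}, {"cdms:2"}]  # Second entity not linked

demo_identical, demo_report = demo_diff(predicted, truth)
```

Returning the report alongside the boolean lets a test write assert demo_identical, demo_report, so a failure prints which entities were missed or wrongly formed.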