Overview

matchbox.common.factories

Factory functions for the testkit system.

Modules:

  • dags

    DAG container for testkits.

  • entities

    Classes and functions for generating and comparing entities.

  • models

    Factory functions for generating model testkits and data for testing.

  • scenarios

    Scenario factories for creating TestkitDAG scenarios.

  • sources

    Factories for generating sources and linked source testkits for testing.

Using the system

The factory system aims to provide *Testkit objects that facilitate three groups of testing scenarios:

  • Realistic mock SourceConfig and Model objects to test client-side connectivity functions
  • Realistic mock data to test server-side adapter functions
  • Realistic mock pipelines with controlled completeness to test client-side methodologies

Three broad factory functions are provided:

  • source_factory() generates a single source testkit
  • linked_sources_factory() generates a system of source testkits linked by shared true entities
  • model_factory() generates a model testkit from source or model testkits

Underneath, these factories and objects use a system of SourceEntity and ClusterEntity objects to share data. The source entities are the true answer, and the cluster entities are the merging data as it moves through the system. A comprehensive set of comparators has been implemented to make this simple to use, understand, and read in unit tests.

All factory functions are configured to provide a sensible, useful default.

The system has been designed to be as hashable as possible to enable caching. Often you’ll need to provide tuples where you might normally provide lists.
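For example, features are passed as a tuple of FeatureConfig objects rather than a list. A minimal sketch, assuming the import paths from the module layout above:

from matchbox.common.factories.entities import FeatureConfig
from matchbox.common.factories.sources import source_factory

# A tuple, not a list: tuples are hashable, so repeated factory calls
# with the same configuration can be served from the cache
source_testkit = source_factory(
    features=(FeatureConfig(name="company_name", base_generator="company"),)
)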

There are some common patterns you might consider using when editing or extending tests.

Client-side connectivity

We can use the factories to test inserting or retrieving isolated SourceConfig or Model objects.

Perhaps you’re testing the API and want to put a realistic SourceConfig in the ingestion pipeline.

source_testkit = source_factory()

# Set up the upload tracker
tracker = InMemoryUploadTracker()
upload_id = tracker.add_source(source_testkit.source_config)

Or you’re testing the client handler and want to mock the API.

from unittest.mock import Mock, patch

from respx import MockRouter


@patch("matchbox.client.helpers.index.SourceConfig")
def test_my_api(MockSource: Mock, matchbox_api: MockRouter):
    source_testkit = source_factory(
        features=[{"name": "company_name", "base_generator": "company"}]
    )
    MockSource.return_value = source_testkit.mock

source_factory() can be configured with a powerful range of FeatureConfig objects, including a variety of rules which distort and duplicate the data in predictable ways. These use Faker to generate data.

source_factory(
    n_true_entities=1_000,
    features=(
        FeatureConfig(
            name="name",
            base_generator="first_name_female",
            drop_base=False,
            variations=(PrefixRule(prefix="Ms "),),
        ),
        FeatureConfig(
            name="title",
            base_generator="job",
            drop_base=True,
            variations=(
                SuffixRule(suffix=" MBE"),
                ReplaceRule(old="Manager", new="Leader"),
            ),
        ),
    ),
    repetition=3,
)

Server-side adapters

The factories can generate data suitable for MatchboxDBAdapter.index(), MatchboxDBAdapter.insert_model(), or MatchboxDBAdapter.set_model_results(). Between these functions, we can set up any backend in any configuration we need to test the other adapter methods.

Adding a SourceConfig.

source_testkit = source_factory()
backend.index(
    source_config=source_testkit.source_config,
    data_hashes=source_testkit.data_hashes,
)

Adding a Model.

model_testkit = model_factory()
backend.insert_model(model_config=model_testkit.model.model_config)

Inserting results.

model_testkit = model_factory()
backend.set_model_results(
    name=model_testkit.model.model_config.name, 
    results=model_testkit.probabilities
)

linked_sources_factory() and model_factory() can be used together to create broader systems of data that connect – or don’t – in controlled ways.

linked_testkit = linked_sources_factory()

for source_testkit in linked_testkit.sources.values():
    backend.index(
        source_config=source_testkit.source_config,
        data_hashes=source_testkit.data_hashes,
    )

model_testkit = model_factory(
    left_testkit=linked_testkit.sources["crn"],
    true_entities=linked_testkit.true_entities,
)

backend.insert_model(model_config=model_testkit.model.model_config)
backend.set_model_results(
    name=model_testkit.model.model_config.name, 
    results=model_testkit.probabilities
)

Methodologies

Configure the true state of your data with linked_sources_factory(). Its default is a set of three tables of ten unique company entities.

  • CRN (company name, CRN ID) contains all entities with three unique variations of the company’s name
  • CDMS (CRN ID, DUNS ID) contains all entities repeated twice
  • DUNS (company name, DUNS ID) contains half the entities

linked_sources_factory() can be configured using tuples of SourceTestkitParameters objects. Using these, you can create complex sets of interweaving sources for methodologies to be tested against, as in the sketch below.
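As a hedged illustration of the shape this takes, here is a two-source configuration. The source_parameters keyword, the SourceTestkitParameters field names, and the import paths are assumptions based on the module layout above, not a definitive signature.

from matchbox.common.factories.entities import FeatureConfig
from matchbox.common.factories.sources import (
    SourceTestkitParameters,
    linked_sources_factory,
)

# Two sources sharing a company_name feature but covering different
# numbers of the true entities. Field and keyword names are assumptions.
linked_testkit = linked_sources_factory(
    source_parameters=(
        SourceTestkitParameters(
            name="companies_house",
            features=(
                FeatureConfig(name="company_name", base_generator="company"),
            ),
            n_true_entities=1_000,
        ),
        SourceTestkitParameters(
            name="exporters",
            features=(
                FeatureConfig(name="company_name", base_generator="company"),
            ),
            n_true_entities=500,
        ),
    )
)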

The model_factory() is designed so you can chain together known processes in any order, before using your real methodology. LinkedSourcesTestkit.diff_results() will make any probabilistic output comparable with the true source entities, and give a detailed diff to help you debug.

linked_testkit: LinkedSourcesTestkit = linked_sources_factory()

# Create perfect deduped models first
left_deduped: ModelTestkit = model_factory(
    left_testkit=linked_testkit.sources["crn"],
    true_entities=linked_testkit.true_entities,
)
right_deduped: ModelTestkit = model_factory(
    left_testkit=linked_testkit.sources["cdms"],
    true_entities=linked_testkit.true_entities,
)

# Create a model and generate probabilities
model: Model = make_model(
    left_data=left_deduped.query,
    right_data=right_deduped.query,
    ...
)
results: Results = model.run()

# Diff, assert, and log the message if it fails
identical, report = linked_testkit.diff_results(
    probabilities=results.probabilities,  # Your methodology's output
    left_clusters=left_deduped.entities,  # Output of left deduper -- left input to your methodology
    right_clusters=right_deduped.entities,  # Output of right deduper -- right input to your methodology
    sources=("crn", "cdms"),
    threshold=0,
)

assert identical, report

Testing with scenarios

For more complex integration tests, the factory system provides scenarios. These allow you to stand up a fully populated backend with a single context manager, setup_scenario(). This is particularly useful for testing database adapters and end-to-end methodologies.

The main usage pattern is to call setup_scenario() with a backend adapter and a named scenario. The context manager yields a TestkitDAG containing all the sources and models created for the scenario, giving you access to the ground truth.

from matchbox.common.factories import setup_scenario

def test_my_adapter_function(my_backend_adapter):
    with setup_scenario(my_backend_adapter, "link") as dag:
        # The backend is now populated with the 'link' scenario
        # dag.sources contains the source testkits
        # dag.models contains the model testkits

        # Now you can call the function you want to test
        results = my_backend_adapter.query(resolution="final_join")

        # You can use the dag to verify the results
        assert len(results) > 0

The scenario system is cached, so subsequent runs of the same scenario are significantly faster.

Available scenarios

The following built-in scenarios are available. They are built on top of each other, so link includes all the steps from dedupe, which includes index, and so on.

  • bare: Creates a set of linked sources and writes them to the data warehouse, but does not interact with the matchbox backend.
  • index: Takes the bare scenario and indexes all the sources in the matchbox backend.
  • dedupe: Takes the index scenario and adds perfectly deduplicated models for each source.
  • probabilistic_dedupe: Like dedupe, but the models produce probabilistic scores rather than perfect matches.
  • link: Takes the dedupe scenario and adds linking models between the deduplicated sources, culminating in a final_join resolution.
  • alt_dedupe: A specialised scenario with two alternative deduplication models for the same source.
  • convergent: A specialised scenario where two different sources index to almost identical data.

Creating new scenarios

You can create your own scenarios by writing a builder function and registering it with the @register_scenario decorator. This allows you to build reusable, complex data setups for your tests.
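A minimal sketch of the pattern, assuming a builder receives the backend adapter, populates it, and returns the TestkitDAG that setup_scenario() yields. The import paths, builder signature, and dag.add_source() call are assumptions; check the scenarios module for the exact contract.

from matchbox.common.factories import register_scenario, source_factory
from matchbox.common.factories.dags import TestkitDAG

@register_scenario("one_indexed_source")
def build_one_indexed_source(backend, **kwargs):
    # Assumed contract: stand up the data, index it in the backend, and
    # return the DAG so tests have access to the ground truth
    dag = TestkitDAG()
    source_testkit = source_factory()
    backend.index(
        source_config=source_testkit.source_config,
        data_hashes=source_testkit.data_hashes,
    )
    dag.add_source(source_testkit)  # illustrative method name
    return dag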