# Overview

## matchbox.common.factories
Factory functions for the testkit system.
Modules:

- `dags` – DAG container for testkits.
- `entities` – Classes and functions for generating and comparing entities.
- `models` – Factory functions for generating model testkits and data for testing.
- `scenarios` – Scenario factories for creating `TestkitDAG` scenarios.
- `sources` – Factories for generating sources and linked source testkits for testing.
### Using the system

The factory system aims to provide `*Testkit` objects that facilitate three groups of testing scenarios:

- Realistic mock `SourceConfig` and `Model` objects to test client-side connectivity functions
- Realistic mock data to test server-side adapter functions
- Realistic mock pipelines with controlled completeness to test client-side methodologies
Three broad functions are provided:

- `source_factory()` generates `SourceTestkit` objects, which contain dummy `SourceConfig`s and their associated data
- `linked_sources_factory()` generates `LinkedSourcesTestkit` objects, which contain a collection of interconnected `SourceTestkit` objects, and the true entities this data describes
- `model_factory()` generates `ModelTestkit` objects, which mock probabilities that can connect both `SourceTestkit` and other `ModelTestkit` objects in ways that fail and succeed predictably
Underneath, these factories and objects use a system of `SourceEntity` and `ClusterEntity` objects to share data. The source entities are the true answer; the cluster entities are the merging data as it moves through the system. A comprehensive set of comparators has been implemented to make comparisons simple to write, understand, and read in unit tests.
All factory functions are configured to provide a sensible, useful default.
The system has been designed to be as hashable as possible to enable caching. Often you’ll need to provide tuples where you might normally provide lists.
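For example, a cacheable `features` argument is a tuple of `FeatureConfig` objects rather than a list (the values here are illustrative):

```python
# Tuples keep the arguments hashable, so factory calls can be cached
source_testkit = source_factory(
    features=(
        FeatureConfig(name="company_name", base_generator="company"),
    ),
)
```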
There are some common patterns you might consider using when editing or extending tests.
### Client-side connectivity

We can use the factories to test inserting or retrieving isolated `SourceConfig` or `Model` objects.

Perhaps you’re testing the API and want to put a realistic `SourceConfig` in the ingestion pipeline.
```python
source_testkit = source_factory()

# Set up the upload tracker
tracker = InMemoryUploadTracker()
upload_id = tracker.add_source(source_testkit.source_config)
```
Or you’re testing the client handler and want to mock the API.
```python
from unittest.mock import Mock, patch

from respx import MockRouter


@patch("matchbox.client.helpers.index.SourceConfig")
def test_my_api(MockSource: Mock, matchbox_api: MockRouter):
    source_testkit = source_factory(
        features=[{"name": "company_name", "base_generator": "company"}]
    )
    MockSource.return_value = source_testkit.mock
```
`source_factory()` can be configured with a powerful range of `FeatureConfig` objects, including a variety of rules which distort and duplicate the data in predictable ways. These use Faker to generate data.
```python
source_factory(
    n_true_entities=1_000,
    features=(
        FeatureConfig(
            name="name",
            base_generator="first_name_female",
            drop_base=False,
            variations=(PrefixRule(prefix="Ms "),),
        ),
        FeatureConfig(
            name="title",
            base_generator="job",
            drop_base=True,
            variations=(
                SuffixRule(suffix=" MBE"),
                ReplaceRule(old="Manager", new="Leader"),
            ),
        ),
    ),
    repetition=3,
)
```
### Server-side adapters

The factories can generate data suitable for `MatchboxDBAdapter.index()`, `MatchboxDBAdapter.insert_model()`, or `MatchboxDBAdapter.set_model_results()`. Between these functions, we can set up any backend in any configuration we need to test the other adapter methods.

Adding a `SourceConfig`.
```python
source_testkit = source_factory()

backend.index(
    source_config=source_testkit.source_config,
    data_hashes=source_testkit.data_hashes,
)
```
Adding a `Model`.
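Mirroring the pattern used in the combined example later in this section:

```python
model_testkit = model_factory()

backend.insert_model(model_config=model_testkit.model.model_config)
```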
Inserting results.
```python
model_testkit = model_factory()

backend.set_model_results(
    name=model_testkit.model.model_config.name,
    results=model_testkit.probabilities,
)
```
`linked_sources_factory()` and `model_factory()` can be used together to create broader systems of data that connect – or don’t – in controlled ways.
```python
linked_testkit = linked_sources_factory()

for source_testkit in linked_testkit.sources.values():
    backend.index(
        source_config=source_testkit.source_config,
        data_hashes=source_testkit.data_hashes,
    )

model_testkit = model_factory(
    left_testkit=linked_testkit.sources["crn"],
    true_entities=linked_testkit.true_entities,
)

backend.insert_model(model_config=model_testkit.model.model_config)
backend.set_model_results(
    name=model_testkit.model.model_config.name,
    results=model_testkit.probabilities,
)
```
### Methodologies

Configure the true state of your data with `linked_sources_factory()`. Its default is a set of three tables of ten unique company entities.

- CRN (company name, CRN ID) contains all entities, each with three unique variations of the company’s name
- CDMS (CRN ID, DUNS ID) contains all entities, each repeated twice
- DUNS (company name, DUNS ID) contains half the entities
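The returned testkit keys these sources by name and carries the ground truth they describe, as used throughout the examples on this page:

```python
linked_testkit = linked_sources_factory()

# Individual source testkits, keyed by name
crn_testkit = linked_testkit.sources["crn"]

# The true entities that all three tables describe
true_entities = linked_testkit.true_entities
```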
`linked_sources_factory()` can be configured using tuples of `SourceTestkitParameters` objects. Using these you can create complex sets of interweaving sources for methodologies to be tested against.
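As a sketch only: `SourceTestkitParameters` is named above, but the field names and keyword argument shown here are assumptions for illustration, not the confirmed API.

```python
# A sketch: field and keyword names below are assumptions
custom_sources = (
    SourceTestkitParameters(  # one entry per source to generate
        name="companies",  # hypothetical source name
        features=(
            FeatureConfig(name="company_name", base_generator="company"),
        ),
        n_true_entities=100,
    ),
)

linked_testkit = linked_sources_factory(source_parameters=custom_sources)
```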
The `model_factory()` is designed so you can chain together known processes in any order, before using your real methodology. `LinkedSourcesTestkit.diff_results()` will make any probabilistic output comparable with the true source entities, and give a detailed diff to help you debug.
```python
linked_testkit: LinkedSourcesTestkit = linked_sources_factory()

# Create perfect deduped models first
left_deduped: ModelTestkit = model_factory(
    left_testkit=linked_testkit.sources["crn"],
    true_entities=linked_testkit.true_entities,
)
right_deduped: ModelTestkit = model_factory(
    left_testkit=linked_testkit.sources["cdms"],
    true_entities=linked_testkit.true_entities,
)

# Create a model and generate probabilities
model: Model = make_model(
    left_data=left_deduped.query,
    right_data=right_deduped.query,
    ...
)
results: Results = model.run()

# Diff, assert, and log the message if it fails
identical, report = linked_testkit.diff_results(
    probabilities=results.probabilities,  # Your methodology's output
    left_clusters=left_deduped.entities,  # Output of left deduper -- left input to your methodology
    right_clusters=right_deduped.entities,  # Output of right deduper -- right input to your methodology
    sources=("crn", "cdms"),
    threshold=0,
)

assert identical, report
```
### Testing with scenarios

For more complex integration tests, the factory system provides a scenario system. This allows you to stand up a fully populated backend with a single context manager, `setup_scenario()`. This is particularly useful for testing database adapters and end-to-end methodologies.

The main usage pattern is to call `setup_scenario()` with a backend adapter and a named scenario. The context manager yields a `TestkitDAG` containing all the sources and models created for the scenario, giving you access to the ground truth.
```python
from matchbox.common.factories import setup_scenario


def test_my_adapter_function(my_backend_adapter):
    with setup_scenario(my_backend_adapter, "link") as dag:
        # The backend is now populated with the 'link' scenario
        # dag.sources contains the source testkits
        # dag.models contains the model testkits

        # Now you can call the function you want to test
        results = my_backend_adapter.query(resolution="final_join")

        # You can use the dag to verify the results
        assert len(results) > 0
```
The scenario system is cached, so subsequent runs of the same scenario are significantly faster.
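Because scenarios build on each other, you can pick the lightest one your test needs. For example (the scenario names are listed below):

```python
def test_query_on_indexed_sources(my_backend_adapter):
    # Use the lighter "index" scenario: sources are indexed, but no
    # models have been created yet
    with setup_scenario(my_backend_adapter, "index") as dag:
        assert dag.sources
```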
#### Available scenarios

The following built-in scenarios are available. They are built on top of each other, so `link` includes all the steps from `dedupe`, which includes `index`, and so on.

- `bare`: Creates a set of linked sources and writes them to the data warehouse, but does not interact with the matchbox backend.
- `index`: Takes the `bare` scenario and indexes all the sources in the matchbox backend.
- `dedupe`: Takes the `index` scenario and adds perfectly deduplicated models for each source.
- `probabilistic_dedupe`: Like `dedupe`, but the models produce probabilistic scores rather than perfect matches.
- `link`: Takes the `dedupe` scenario and adds linking models between the deduplicated sources, culminating in a `final_join` resolution.
- `alt_dedupe`: A specialised scenario with two alternative deduplication models for the same source.
- `convergent`: A specialised scenario where two different sources index to almost identical data.
#### Creating new scenarios

You can create your own scenarios by writing a builder function and registering it with the `@register_scenario` decorator. This allows you to build reusable, complex data setups for your tests.
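A minimal sketch of the shape this might take. The decorator comes from the text above, but the import path, registration key, and builder signature shown here are assumptions for illustration:

```python
from matchbox.common.factories.scenarios import register_scenario  # import path assumed


@register_scenario("my_scenario")  # hypothetical scenario name
def build_my_scenario(backend, **kwargs):  # signature is an assumption
    """Stand up a custom backend state for tests."""
    # Use the factories documented above to index sources, insert models
    # and set results against the backend, then return the TestkitDAG
    # describing what was created so tests can assert against it.
    ...
```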