Overview¶
matchbox.common.factories¶
Factory functions for the testkit system.
Modules:
- dags – DAG container for testkits.
- entities – Classes and functions for generating and comparing entities.
- models – Factory functions for generating model testkits and data for testing.
- sources – Factories for generating sources and linked source testkits for testing.
Using the system¶
The factory system aims to provide *Testkit objects that facilitate three groups of testing scenarios:
- Realistic mock Source and Model objects to test client-side connectivity functions
- Realistic mock data to test server-side adapter functions
- Realistic mock pipelines with controlled completeness to test client-side methodologies
Three broad functions are provided:
- source_factory() generates SourceTestkit objects, which contain dummy Source objects and associated data
- linked_sources_factory() generates LinkedSourcesTestkit objects, which contain a collection of interconnected SourceTestkit objects, and the true entities this data describes
- model_factory() generates ModelTestkit objects, which mock probabilities that can connect both SourceTestkit and other ModelTestkit objects in ways that fail and succeed predictably
Underneath, these factories and objects use a system of SourceEntity and ClusterEntity objects to share data. The source entities are the true answer, and the cluster entities are the merging data as it moves through the system. A comprehensive set of comparators has been implemented so that unit tests comparing the two are simple to write, understand, and read.
All factory functions are configured to provide a sensible, useful default.
The system has been designed to be as hashable as possible to enable caching. Often you’ll need to provide tuples where you might normally provide lists.
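The preference for tuples follows from how caching works in Python: cache keys must be hashable, and lists are not. The sketch below illustrates this with the standard library's functools.cache. The cached_factory function is a hypothetical stand-in invented for this example, not one of the matchbox factories, and says nothing about how they cache internally.

```python
from functools import cache


@cache
def cached_factory(features: tuple[str, ...], n_true_entities: int = 10) -> tuple[str, ...]:
    """Hypothetical stand-in for a cached factory; not matchbox's implementation."""
    return tuple(f"{feature}_{i}" for feature in features for i in range(n_true_entities))


# Tuples are hashable, so the second identical call is served from the cache.
first = cached_factory(("company_name", "crn"), n_true_entities=2)
second = cached_factory(("company_name", "crn"), n_true_entities=2)
cache_hits = cached_factory.cache_info().hits

# Lists are unhashable, so the same call with a list fails before the body runs.
try:
    cached_factory(["company_name", "crn"], n_true_entities=2)
    list_rejected = False
except TypeError:
    list_rejected = True
```

If a test builds its configuration as a list, wrapping it in tuple(...) before calling a cached function avoids the TypeError.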
There are some common patterns you might consider using when editing or extending tests.
Client-side connectivity¶
We can use the factories to test inserting or retrieving isolated Source or Model objects.
Perhaps you’re testing the API and want to put a realistic Source in the ingestion pipeline.
from matchbox.common.factories.sources import source_factory

source_testkit = source_factory()
# Set up the store and cache the generated source
store = MetadataStore()
update_id = store.cache_source(source_testkit.source)
Or you’re testing the client handler and want to mock the API.
from unittest.mock import Mock, patch

@patch("matchbox.client.helpers.index.Source")
def test_my_api(MockSource: Mock, matchbox_api: MockRouter):
    source_testkit = source_factory(
        features=[{"name": "company_name", "base_generator": "company"}]
    )
    # Calls to the patched Source now return the testkit's mock
    MockSource.return_value = source_testkit.mock
source_factory() can be configured with a range of FeatureConfig objects, including a variety of rules which distort and duplicate the data in predictable ways. These use Faker to generate data.
source_factory(
    n_true_entities=1_000,
    features=(
        FeatureConfig(
            name="name",
            base_generator="first_name_female",
            drop_base=False,
            variations=(PrefixRule(prefix="Ms "),),
        ),
        FeatureConfig(
            name="title",
            base_generator="job",
            drop_base=True,
            variations=(
                SuffixRule(suffix=" MBE"),
                ReplaceRule(old="Manager", new="Leader"),
            ),
        ),
    ),
    repetition=3,
)
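To make "distort in predictable ways" concrete, here is a minimal sketch of how prefix, suffix, and replace rules could act on a single base value. The Prefix, Suffix, and Replace classes and the expand() helper are hypothetical stand-ins written for this example; they are not matchbox's PrefixRule, SuffixRule, or ReplaceRule, and the reading of drop_base used here is an assumption. Fixed strings stand in for the values Faker would generate.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Prefix:
    """Hypothetical stand-in for a prefix rule."""
    prefix: str

    def apply(self, value: str) -> str:
        return self.prefix + value


@dataclass(frozen=True)
class Suffix:
    """Hypothetical stand-in for a suffix rule."""
    suffix: str

    def apply(self, value: str) -> str:
        return value + self.suffix


@dataclass(frozen=True)
class Replace:
    """Hypothetical stand-in for a replace rule."""
    old: str
    new: str

    def apply(self, value: str) -> str:
        return value.replace(self.old, self.new)


def expand(base: str, variations: tuple, drop_base: bool) -> tuple[str, ...]:
    """Return every value that one base value yields under the given rules.

    Assumption: drop_base=True keeps only the distorted values, while
    drop_base=False keeps the base value alongside them.
    """
    varied = tuple(rule.apply(base) for rule in variations)
    return varied if drop_base else (base, *varied)


# Fixed inputs stand in for Faker-generated names and job titles.
names = expand("Alice", (Prefix("Ms "),), drop_base=False)
titles = expand("Sales Manager", (Suffix(" MBE"), Replace("Manager", "Leader")), drop_base=True)
```

Frozen dataclasses are hashable, so a tuple of such rules can itself be used as a cache key, which fits the tuple-based configuration shown above.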
Server-side adapters¶
The factories can generate data suitable for MatchboxDBAdapter.index(), MatchboxDBAdapter.insert_model(), or MatchboxDBAdapter.set_model_results(). Between these functions, we can set up any backend in any configuration we need to test the other adapter methods.
Adding a Source.
source_testkit = source_factory()
backend.index(
    source=source_testkit.source,
    data_hashes=source_testkit.data_hashes,
)
Adding a Model.
model_testkit = model_factory()
backend.insert_model(model=model_testkit.model.metadata)
Inserting results.
model_testkit = model_factory()
backend.set_model_results(
    model=model_testkit.model.metadata.full_name,
    results=model_testkit.probabilities,
)
linked_sources_factory() and model_factory() can be used together to create broader systems of data that connect – or don’t – in controlled ways.
linked_testkit = linked_sources_factory()
# Index every generated source
for source_testkit in linked_testkit.sources.values():
    backend.index(
        source=source_testkit.source,
        data_hashes=source_testkit.data_hashes,
    )
# Build a model over one source, using the known true entities
model_testkit = model_factory(
    left_testkit=linked_testkit.sources["crn"],
    true_entities=linked_testkit.true_entities,
)
backend.insert_model(model=model_testkit.model.metadata)
backend.set_model_results(
    model=model_testkit.model.metadata.full_name,
    results=model_testkit.probabilities,
)
Methodologies¶
Configure the true state of your data with linked_sources_factory(). Its default is a set of three tables describing ten unique company entities.
- CRN (company name, CRN ID) contains all entities with three unique variations of the company’s name
- CDMS (CRN ID, DUNS ID) contains all entities repeated twice
- DUNS (company name, DUNS ID) contains half the entities
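The shape of that default can be pictured with a toy reconstruction in plain Python. This is only one reading of the three bullets above: the entity IDs, name suffixes, and row layout are assumptions made for illustration, not the values or structures that linked_sources_factory() actually produces.

```python
# Toy reconstruction of the default layout, standard library only.
# Assumptions: IDs, suffixes and row shapes are illustrative, not matchbox output.
N_TRUE_ENTITIES = 10
entities = [
    {"id": i, "company_name": f"Company {i}", "crn": f"CRN{i:03d}", "duns": f"DUNS{i:03d}"}
    for i in range(N_TRUE_ENTITIES)
]

# CRN: every entity, once per name variation (three hypothetical suffixes).
suffixes = (" Limited", " UK", " Company")
crn_rows = [
    {"entity": e["id"], "company_name": e["company_name"] + s, "crn": e["crn"]}
    for e in entities
    for s in suffixes
]

# CDMS: every entity, with each row appearing twice.
cdms_rows = [
    {"entity": e["id"], "crn": e["crn"], "duns": e["duns"]}
    for e in entities
    for _ in range(2)
]

# DUNS: only half of the entities.
duns_rows = [
    {"entity": e["id"], "company_name": e["company_name"], "duns": e["duns"]}
    for e in entities[: N_TRUE_ENTITIES // 2]
]

row_counts = {"crn": len(crn_rows), "cdms": len(cdms_rows), "duns": len(duns_rows)}
```

In this layout CDMS is the only table holding both identifiers, so linking CRN to DUNS records relies on it, and half of the entities never appear in DUNS at all.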
linked_sources_factory() can be configured using tuples of SourceConfig objects. Using these you can create complex sets of interweaving sources for methodologies to be tested against.
model_factory() is designed so you can chain together known processes in any order before using your real methodology. LinkedSourcesTestkit.diff_results() makes any probabilistic output comparable with the true source entities, and gives a detailed diff to help you debug.
linked_testkit: LinkedSourcesTestkit = linked_sources_factory()
# Create perfect deduped models first
left_deduped: ModelTestkit = model_factory(
    left_testkit=linked_testkit.sources["crn"],
    true_entities=linked_testkit.true_entities,
)
right_deduped: ModelTestkit = model_factory(
    left_testkit=linked_testkit.sources["cdms"],
    true_entities=linked_testkit.true_entities,
)
# Create a model and generate probabilities
model: Model = make_model(
    left_data=left_deduped.query,
    right_data=right_deduped.query,
    # ... (remaining arguments elided in the source)
)
results: Results = model.run()
# Diff, assert, and log the message if it fails
identical, report = linked_testkit.diff_results(
    probabilities=results.probabilities,  # Your methodology's output
    left_clusters=left_deduped.entities,  # Output of left deduper -- left input to your methodology
    right_clusters=right_deduped.entities,  # Output of right deduper -- right input to your methodology
    sources=("crn", "cdms"),
    threshold=0,
)
assert identical, report
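The idea behind such a diff can be sketched without matchbox: threshold the pairwise probabilities, group the connected records into clusters, and compare those clusters with the known truth. The functions below are hypothetical illustrations of that idea using a small union-find; they are not how LinkedSourcesTestkit.diff_results() is implemented, and the record keys and report fields are invented for the example.

```python
def clusters_from_probabilities(
    pairs: list[tuple[str, str, float]], threshold: float
) -> set[frozenset[str]]:
    """Group record keys connected by pairs scoring at or above the threshold."""
    parent: dict[str, str] = {}

    def find(key: str) -> str:
        parent.setdefault(key, key)
        while parent[key] != key:
            parent[key] = parent[parent[key]]  # path halving
            key = parent[key]
        return key

    for left, right, probability in pairs:
        root_left, root_right = find(left), find(right)  # registers both keys
        if probability >= threshold and root_left != root_right:
            parent[root_right] = root_left

    groups: dict[str, set[str]] = {}
    for key in list(parent):
        groups.setdefault(find(key), set()).add(key)
    return {frozenset(members) for members in groups.values()}


def diff(predicted: set[frozenset[str]], truth: set[frozenset[str]]) -> tuple[bool, dict]:
    """Return (identical, report), where the report lists clusters found on one side only."""
    report = {
        "missing": sorted(sorted(c) for c in truth - predicted),
        "unexpected": sorted(sorted(c) for c in predicted - truth),
    }
    return not report["missing"] and not report["unexpected"], report


# Hypothetical truth: two entities, each spanning one CRN and one CDMS record.
truth = {frozenset({"crn_1", "cdms_1"}), frozenset({"crn_2", "cdms_2"})}
pairs = [("crn_1", "cdms_1", 0.9), ("crn_2", "cdms_2", 0.4)]

strict_ok, strict_report = diff(clusters_from_probabilities(pairs, threshold=0.5), truth)
lenient_ok, lenient_report = diff(clusters_from_probabilities(pairs, threshold=0.0), truth)
```

With the stricter threshold the weak second link is dropped, so the report names the entity that was not recovered and the two fragments produced in its place, which is the kind of detail that makes a failing assertion easy to debug.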