Models
matchbox.client.models
Deduplication and linking methodologies.
Modules:
- comparison – Functions to compare fields in different sources.
- dedupers – Deduplication methodologies.
- linkers – Linking methodologies.
- models – Functions and classes to define, run and register models.
Classes:
- Model – Unified model class for both linking and deduping operations.
Functions:
- add_model_class – Add custom deduper or linker.
Model
Model(name: str, dag: DAG, model_class: type[Deduper], model_settings: DeduperSettings | dict, left_query: Query, right_query: None = None, truth: float = 1.0, description: str | None = None)
Model(dag: DAG, name: str, model_class: type[Linker], model_settings: LinkerSettings | dict, left_query: Query, right_query: Query, truth: float = 1.0, description: str | None = None)
Model(dag: DAG, name: str, model_class: type[Deduper] | type[Linker] | str, model_settings: DeduperSettings | LinkerSettings | dict, left_query: Query, right_query: Query | None = None, truth: float = 1.0, description: str | None = None)
Unified model class for both linking and deduping operations.
Parameters:
- dag (DAG) – DAG containing this model.
- name (str) – Unique name for the model.
- truth (float, default: 1.0) – Truth threshold. Defaults to 1.0. Can be set later after analysis.
- model_class (type[Deduper] | type[Linker] | str) – Class of Linker or Deduper, or its name.
- model_settings (DeduperSettings | LinkerSettings | dict) – Appropriate settings object to pass to model class.
- left_query (Query) – The query that will get the data to deduplicate, or the data to link on the left.
- right_query (Query | None, default: None) – The query that will get the data to link on the right.
- description (str | None, default: None) – Optional description of the model.
Methods:
- to_resolution – Convert to Resolution for API calls.
- from_resolution – Reconstruct from Resolution.
- delete – Delete the model from the database.
- run – Execute the model pipeline and return results.
- sync – Send the model config, truth and results to the server.
- download_results – Retrieve results associated with the model from the database.
- query – Generate a query for this model.
Attributes:
- last_run (datetime | None)
- dag
- name
- description
- left_query
- right_query
- results (Results | None)
- model_class (type[Linker | Deduper])
- model_instance
- model_type (ModelType)
- model_settings
- config (ModelConfig) – Generate config DTO from Model.
- dependencies (list[ResolutionPath]) – Returns all resolution paths this model needs as implied by the queries.
- parents (list[ResolutionPath]) – Returns all resolution paths directly input to this model.
- resolution_path (ModelResolutionPath) – Returns the model resolution path.
- truth (float | None) – Returns the truth threshold for the model as a float.
model_type (instance-attribute)
model_type: ModelType = LINKER if issubclass(model_class, Linker) else DEDUPER
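The expression above picks the model type from the class hierarchy of model_class. A minimal sketch of that dispatch follows. The ModelType, Linker and Deduper definitions here are stand-ins written for illustration (including the enum values, which are assumptions), not the matchbox classes themselves.

```python
from enum import Enum


class ModelType(Enum):
    """Stand-in enum; the member values are assumed for illustration."""

    LINKER = "linker"
    DEDUPER = "deduper"


class Linker:
    """Stand-in for the linker base class."""


class Deduper:
    """Stand-in for the deduper base class."""


class MyLinker(Linker):
    """Hypothetical custom linker."""


class MyDeduper(Deduper):
    """Hypothetical custom deduper."""


def infer_model_type(model_class: type) -> ModelType:
    # Same shape as the documented expression: any Linker subclass is a
    # LINKER, everything else is treated as a DEDUPER.
    return ModelType.LINKER if issubclass(model_class, Linker) else ModelType.DEDUPER


print(infer_model_type(MyLinker))
print(infer_model_type(MyDeduper))
```

Note that the else branch does not check for Deduper, so in this sketch any class that is not a Linker subclass falls through to DEDUPER.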
dependencies (property)
dependencies: list[ResolutionPath]
Returns all resolution paths this model needs as implied by the queries.
parents (property)
parents: list[ResolutionPath]
Returns all resolution paths directly input to this model.
from_resolution (classmethod)
from_resolution(resolution: Resolution, resolution_name: str, dag: DAG) -> Model
Reconstruct from Resolution.
run
run(for_validation: bool = False, full_rerun: bool = False) -> Results
add_model_class
Add custom deduper or linker.
comparison
Functions to compare fields in different sources.
Functions:
- comparison – Validates any number of SQL conditions and prepares for a WHERE clause.
dedupers
Deduplication methodologies.
Modules:
- base – Base class for deduplication methodologies.
- naive – A deduplication methodology based on a deterministic set of conditions.
Classes:
- NaiveDeduper – A simple deduper that deduplicates based on a set of boolean conditions.
NaiveDeduper
Bases: Deduper
A simple deduper that deduplicates based on a set of boolean conditions.
base
Base class for deduplication methodologies.
Classes:
- DeduperSettings – A data class to enforce basic settings dictionary shapes.
- Deduper – A base class for dedupers.
naive
A deduplication methodology based on a deterministic set of conditions.
Classes:
- NaiveSettings – A data class to enforce the Naive deduper’s settings dictionary shape.
- NaiveDeduper – A simple deduper that deduplicates based on a set of boolean conditions.
NaiveSettings
Bases: DeduperSettings
A data class to enforce the Naive deduper’s settings dictionary shape.
Attributes:
- unique_fields (list[str])
- id (str)
linkers
Linking methodologies.
Modules:
- base – Base class for linkers.
- deterministic – A linking methodology based on a deterministic set of conditions.
- splinklinker – A linking methodology leveraging Splink.
- weighteddeterministic – A linking methodology that applies different weights to field comparisons.
Classes:
- DeterministicLinker – A deterministic linker that links based on a set of boolean conditions.
- SplinkLinker – A linker that leverages Bayesian record linkage using Splink.
- WeightedDeterministicLinker – A deterministic linker that applies different weights to field comparisons.
DeterministicLinker
Bases: Linker
A deterministic linker that links based on a set of boolean conditions.
SplinkLinker
WeightedDeterministicLinker
Bases: Linker
A deterministic linker that applies different weights to field comparisons.
base
Base class for linkers.
Classes:
- LinkerSettings – A data class to enforce basic settings dictionary shapes.
- Linker – A base class for linkers.
deterministic
A linking methodology based on a deterministic set of conditions.
Classes:
- DeterministicSettings – A data class to enforce the Deterministic linker’s settings dictionary shape.
- DeterministicLinker – A deterministic linker that links based on a set of boolean conditions.
DeterministicSettings
Bases: LinkerSettings
A data class to enforce the Deterministic linker’s settings dictionary shape.
Methods:
- validate_comparison – Turn single string into list of one string.
Attributes:
comparisons (class-attribute, instance-attribute)
comparisons: list[str]
A list of valid ON clauses to compare fields between the left and the right data.
Use left.field and right.field to refer to columns in the respective sources.
Each comparison will be treated as OR logic, but more efficiently than using an OR condition in the SQL WHERE clause.
For example:
    [
        "left.company_number = right.company_number",
        "left.name = right.name and left.postcode = right.postcode",
    ]
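The OR logic described above means a pair is linked if any one comparison holds. The sketch below shows that semantics in plain Python. It is an illustration only: the real comparisons are SQL strings, whereas here each comparison is a hypothetical Python predicate, and evaluating them one at a time and merging the results is an assumption about how OR can be avoided, not a description of the library's internals.

```python
from itertools import product

left = [
    {"id": 1, "company_number": "001", "name": "Acme", "postcode": "AB1"},
    {"id": 2, "company_number": None, "name": "Bolt", "postcode": "CD2"},
]
right = [
    {"id": 10, "company_number": "001", "name": "ACME Ltd", "postcode": "AB1"},
    {"id": 20, "company_number": "999", "name": "Bolt", "postcode": "CD2"},
    {"id": 30, "company_number": "777", "name": "Bolt", "postcode": "ZZ9"},
]

# Python equivalents of the two example ON clauses.
comparisons = [
    lambda lrow, rrow: lrow["company_number"] is not None
    and lrow["company_number"] == rrow["company_number"],
    lambda lrow, rrow: lrow["name"] == rrow["name"]
    and lrow["postcode"] == rrow["postcode"],
]


def link(left_rows, right_rows, predicates):
    """Return sorted (left_id, right_id) pairs matched by any predicate."""
    matches = set()
    for predicate in predicates:
        # Each comparison is evaluated on its own; the set union gives OR.
        for lrow, rrow in product(left_rows, right_rows):
            if predicate(lrow, rrow):
                matches.add((lrow["id"], rrow["id"]))
    return sorted(matches)


pairs = link(left, right, comparisons)
print(pairs)  # [(1, 10), (2, 20)]
```

Pair (1, 10) matches on company number alone and pair (2, 20) on name plus postcode, so each comparison contributes independently.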
left_id (class-attribute, instance-attribute)
left_id: str = Field(default='id', description='The unique ID field in the left data')
splinklinker
A linking methodology leveraging Splink.
Classes:
- SplinkLinkerFunction – A method of splink.Linker.training used to train the linker.
- SplinkSettings – A data class to enforce the Splink linker’s settings dictionary shape.
- SplinkLinker – A linker that leverages Bayesian record linkage using Splink.
SplinkLinkerFunction
Bases: BaseModel
A method of splink.Linker.training used to train the linker.
Methods:
- validate_function_and_arguments – Ensure the function and arguments are valid.
SplinkSettings
Bases: LinkerSettings
A data class to enforce the Splink linker’s settings dictionary shape.
Methods:
- check_ids_match – Ensure left_id and right_id match.
- check_link_only – Ensure link_type is set to “link_only”.
- add_enforced_settings – Ensure ID is the only field we link on.
- load_linker_settings – Load serialised settings into SettingsCreator.
- serialise_settings – Convert Splink settings to string.
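The check_ids_match rule listed above (left_id and right_id must be the same) can be illustrated with a standard-library dataclass. This is a sketch under assumptions: IdPairSettings is a hypothetical stand-in, and the real settings class is validated differently.

```python
from dataclasses import dataclass


@dataclass
class IdPairSettings:
    """Hypothetical stand-in holding only the two ID field names."""

    left_id: str = "id"
    right_id: str = "id"

    def __post_init__(self) -> None:
        # Reject mismatched ID fields as soon as the object is built.
        if self.left_id != self.right_id:
            raise ValueError(
                f"left_id ({self.left_id!r}) and right_id "
                f"({self.right_id!r}) must match"
            )


ok = IdPairSettings(left_id="hash", right_id="hash")

try:
    IdPairSettings(left_id="hash", right_id="id")
except ValueError as err:
    error_message = str(err)
    print(error_message)
```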
Attributes:
- model_config
- linker_training_functions (list[SplinkLinkerFunction])
- linker_settings (SettingsCreator)
- threshold (float | None)
- left_id (str)
- right_id (str)
model_config (class-attribute, instance-attribute)
linker_training_functions (class-attribute, instance-attribute)
linker_training_functions: list[SplinkLinkerFunction] = Field(description='\n A list of dictionaries where keys are the names of methods for\n splink.Linker.training and values are dictionaries encoding the arguments of\n those methods. Each function will be run in the order supplied.\n\n Example:\n \n >>> linker_training_functions=[\n ... {\n ... "function": "estimate_probability_two_random_records_match",\n ... "arguments": {\n ... "deterministic_matching_rules": """\n ... l.company_name = r.company_name\n ... """,\n ... "recall": 0.7,\n ... },\n ... },\n ... {\n ... "function": "estimate_u_using_random_sampling",\n ... "arguments": {"max_pairs": 1e6},\n ... }\n ... ]\n \n ')
linker_settings (class-attribute, instance-attribute)
linker_settings: SettingsCreator = Field(description='\n A valid Splink SettingsCreator.\n\n See Splink\'s documentation for a full description of available settings.\n https://moj-analytical-services.github.io/splink/api_docs/settings_dict_guide.html\n\n * link_type must be set to "link_only"\n * unique_id_name is overridden to the value of left_id and right_id,\n which must match\n\n Example:\n\n >>> from splink import SettingsCreator, block_on\n ... import splink.comparison_library as cl\n ... import splink.comparison_template_library as ctl\n ... \n ... splink_settings = SettingsCreator(\n ... retain_matching_columns=False,\n ... retain_intermediate_calculation_columns=False,\n ... blocking_rules_to_generate_predictions=[\n ... block_on("company_name"),\n ... block_on("postcode"),\n ... ],\n ... comparisons=[\n ... cl.jaro_winkler_at_thresholds(\n ... "company_name", \n ... [0.9, 0.6], \n ... term_frequency_adjustments=True\n ... ),\n ... ctl.postcode_comparison("postcode"), \n ... ]\n ... ) \n ')
threshold (class-attribute, instance-attribute)
threshold: float | None = Field(default=None, gt=0, le=1)
The probability at or above which matches will be kept.
None is used to indicate no threshold.
Inclusive, so a value of 1 will keep only exact matches across all comparisons.
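The threshold rules above (None disables filtering, the bound is inclusive, and valid values satisfy 0 < threshold <= 1) can be sketched as a small filter. The function apply_threshold and the (left_id, right_id, probability) tuple layout are assumptions made for this example, not part of the documented API.

```python
def apply_threshold(pairs, threshold=None):
    """Keep scored pairs whose probability is at or above the threshold."""
    if threshold is None:
        # No threshold: every scored pair is kept.
        return list(pairs)
    if not 0 < threshold <= 1:
        raise ValueError("threshold must satisfy 0 < threshold <= 1")
    # Inclusive comparison: a pair exactly on the threshold is kept.
    return [pair for pair in pairs if pair[2] >= threshold]


scored = [("a", "x", 1.0), ("b", "y", 0.85), ("c", "z", 0.4)]

print(apply_threshold(scored))        # all three pairs
print(apply_threshold(scored, 0.85))  # pairs a-x and b-y
print(apply_threshold(scored, 1))     # only pair a-x
```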
left_id (class-attribute, instance-attribute)
left_id: str = Field(default='id', description='The unique ID field in the left data')
right_id (class-attribute, instance-attribute)
right_id: str = Field(default='id', description='The unique ID field in the right data')
add_enforced_settings
add_enforced_settings() -> SplinkSettings
Ensure ID is the only field we link on.
weighteddeterministic
A linking methodology that applies different weights to field comparisons.
Classes:
- WeightedComparison – A valid comparison and a weight to give it.
- WeightedDeterministicSettings – A data class to enforce the Weighted linker’s settings dictionary shape.
- WeightedDeterministicLinker – A deterministic linker that applies different weights to field comparisons.
WeightedComparison
Bases: BaseModel
A valid comparison and a weight to give it.
Methods:
- validate_comparison – Validate the comparison string.
Attributes:
- comparison (str)
- weight (float)
comparison (class-attribute, instance-attribute)
comparison: str
A valid ON clause to compare fields between the left and the right data.
Use left.field and right.field to refer to fields in the respective sources.
For example:
    "left.company_name = right.company_name"
WeightedDeterministicSettings
Bases: LinkerSettings
A data class to enforce the Weighted linker’s settings dictionary shape.
Example:
    {
        left_id: "hash",
        right_id: "hash",
        weighted_comparisons: [
            ("l.company_name = r.company_name", 0.7),
            ("l.postcode = r.postcode", 0.7),
            ("l.company_id = r.company_id", 1),
        ],
        threshold: 0.8,
    }
Attributes:
- weighted_comparisons (list[WeightedComparison])
- threshold (float)
- left_id (str)
- right_id (str)
weighted_comparisons (class-attribute, instance-attribute)
weighted_comparisons: list[WeightedComparison] = Field(description='A list of tuples in the form of a comparison, and a weight.')
threshold (class-attribute, instance-attribute)
threshold: float = Field(ge=0, le=1)
The probability at or above which matches will be kept.
Inclusive, so a value of 1 will keep only exact matches across all comparisons.
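This page does not state how the weights are combined into a probability. The sketch below assumes one plausible rule, the sum of the weights of the comparisons that matched divided by the sum of all weights, and applies it to the weights from the example settings (0.7, 0.7, 1) with a threshold of 0.8. Both the formula and the helper names are assumptions for illustration, not the linker's confirmed behaviour.

```python
def weighted_score(matched: list[bool], weights: list[float]) -> float:
    """Assumed scoring rule: matched weight as a share of total weight."""
    total = sum(weights)
    hit = sum(weight for is_match, weight in zip(matched, weights) if is_match)
    return hit / total


def keeps(matched: list[bool], weights: list[float], threshold: float) -> bool:
    # Inclusive bound, as the threshold description states.
    return weighted_score(matched, weights) >= threshold


# Order: company_name, postcode, company_id (weights from the example).
weights = [0.7, 0.7, 1]
threshold = 0.8

print(weighted_score([True, True, True], weights))
print(weighted_score([True, False, True], weights))
print(keeps([False, False, True], weights, threshold))
```

Under this assumed rule a threshold of 1 keeps only pairs where every comparison matched, which agrees with the threshold description above.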
models
Functions and classes to define, run and register models.
Classes:
- Model – Unified model class for both linking and deduping operations.
Functions:
- add_model_class – Add custom deduper or linker.