Models
matchbox.client.models
¶
Deduplication and linking methodologies.
Modules:
-
dedupers
–Deduplication methodologies.
-
linkers
–Linking methodologies.
-
models
–Functions and classes to define, run and register models.
dedupers
¶
Deduplication methodologies.
Modules:
-
base
–Base class for deduplication methodologies.
-
naive
–A deduplication methodology based on a deterministic set of conditions.
Classes:
-
NaiveDeduper
–A simple deduper that deduplicates based on a set of boolean conditions.
NaiveDeduper
¶
Bases: Deduper
A simple deduper that deduplicates based on a set of boolean conditions.
Methods:
-
from_settings
–Create a NaiveDeduper from a settings dictionary.
-
prepare
–Prepare the deduper for deduplication.
-
dedupe
–Deduplicate the dataframe.
Attributes:
from_settings
classmethod
¶
from_settings(
id: str, unique_fields: list[str]
) -> NaiveDeduper
Create a NaiveDeduper from a settings dictionary.
base
¶
Base class for deduplication methodologies.
Classes:
-
DeduperSettings
–A data class to enforce basic settings dictionary shapes.
-
Deduper
–A base class for dedupers.
DeduperSettings
¶
Deduper
¶
Bases: BaseModel, ABC
A base class for dedupers.
Methods:
-
from_settings
–Create a Deduper from a settings dictionary.
-
prepare
–Prepare the deduper for deduplication.
-
dedupe
–Deduplicate the dataframe.
Attributes:
naive
¶
A deduplication methodology based on a deterministic set of conditions.
Classes:
-
NaiveSettings
–A data class to enforce the Naive deduper’s settings dictionary shape.
-
NaiveDeduper
–A simple deduper that deduplicates based on a set of boolean conditions.
NaiveSettings
¶
Bases: DeduperSettings
A data class to enforce the Naive deduper’s settings dictionary shape.
Attributes:
-
id
(str
) –
-
unique_fields
(list[str]
) –
NaiveDeduper
¶
Bases: Deduper
A simple deduper that deduplicates based on a set of boolean conditions.
Methods:
-
from_settings
–Create a NaiveDeduper from a settings dictionary.
-
prepare
–Prepare the deduper for deduplication.
-
dedupe
–Deduplicate the dataframe.
Attributes:
from_settings
classmethod
¶
from_settings(
id: str, unique_fields: list[str]
) -> NaiveDeduper
Create a NaiveDeduper from a settings dictionary.
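As a rough illustration of what deduplicating on a set of unique_fields can mean, the sketch below groups records by the values of the chosen fields and emits every pair of IDs that share a group. It uses only the standard library. The helper `naive_dedupe_pairs` and the sample records are hypothetical, and this is not matchbox's implementation.

```python
from collections import defaultdict
from itertools import combinations


def naive_dedupe_pairs(records, id_field, unique_fields):
    """Return sorted (id, id) pairs whose unique_fields all match.

    Hypothetical helper for illustration; not part of matchbox.
    """
    groups = defaultdict(list)
    for record in records:
        # Two records count as duplicates when every unique field is equal.
        key = tuple(record[field] for field in unique_fields)
        groups[key].append(record[id_field])

    pairs = []
    for ids in groups.values():
        # Every pair of IDs inside one group is a candidate duplicate.
        pairs.extend(combinations(sorted(ids), 2))
    return sorted(pairs)


records = [
    {"id": "a1", "company_name": "Acme", "postcode": "AB1 2CD"},
    {"id": "a2", "company_name": "Acme", "postcode": "AB1 2CD"},
    {"id": "a3", "company_name": "Acme", "postcode": "ZZ9 9ZZ"},
]
pairs = naive_dedupe_pairs(
    records, id_field="id", unique_fields=["company_name", "postcode"]
)
```

Here only `a1` and `a2` agree on both fields, so they form the single candidate pair.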
linkers
¶
Linking methodologies.
Modules:
-
base
–Base class for linkers.
-
deterministic
–A linking methodology based on a deterministic set of conditions.
-
splinklinker
–A linking methodology leveraging Splink.
-
weighteddeterministic
–A linking methodology that applies different weights to field comparisons.
Classes:
-
DeterministicLinker
–A deterministic linker that links based on a set of boolean conditions.
-
SplinkLinker
–A linker that leverages Bayesian record linkage using Splink.
-
WeightedDeterministicLinker
–A deterministic linker that applies different weights to field comparisons.
DeterministicLinker
¶
Bases: Linker
A deterministic linker that links based on a set of boolean conditions.
Methods:
-
from_settings
–Create a DeterministicLinker from a settings dictionary.
-
prepare
–Prepare the linker for linking.
-
link
–Link the left and right dataframes.
Attributes:
from_settings
classmethod
¶
from_settings(
left_id: str, right_id: str, comparisons: str
) -> DeterministicLinker
Create a DeterministicLinker from a settings dictionary.
SplinkLinker
¶
Bases: Linker
A linker that leverages Bayesian record linkage using Splink.
Methods:
-
from_settings
–Create a SplinkLinker from a settings dictionary.
-
prepare
–Prepare the linker for linking.
-
link
–Link the left and right dataframes.
Attributes:
WeightedDeterministicLinker
¶
Bases: Linker
A deterministic linker that applies different weights to field comparisons.
Methods:
-
from_settings
–Create a WeightedDeterministicLinker from a settings dictionary.
-
prepare
–Prepare the linker for linking.
-
link
–Link the left and right dataframes.
Attributes:
base
¶
Base class for linkers.
Classes:
-
LinkerSettings
–A data class to enforce basic settings dictionary shapes.
-
Linker
–A base class for linkers.
LinkerSettings
¶
Bases: BaseModel
A data class to enforce basic settings dictionary shapes.
Attributes:
Linker
¶
Bases: BaseModel, ABC
A base class for linkers.
Methods:
-
from_settings
–Create a Linker from a settings dictionary.
-
prepare
–Prepare the linker for linking.
-
link
–Link the left and right dataframes.
Attributes:
deterministic
¶
A linking methodology based on a deterministic set of conditions.
Classes:
-
DeterministicSettings
–A data class to enforce the Deterministic linker’s settings dictionary shape.
-
DeterministicLinker
–A deterministic linker that links based on a set of boolean conditions.
DeterministicSettings
¶
Bases: LinkerSettings
A data class to enforce the Deterministic linker’s settings dictionary shape.
Methods:
-
validate_comparison
–Validate the comparison string.
Attributes:
left_id
class-attribute
instance-attribute
¶
left_id: str = Field(
description="The unique ID column in the left dataset"
)
right_id
class-attribute
instance-attribute
¶
right_id: str = Field(
description="The unique ID column in the right dataset"
)
comparisons
class-attribute
instance-attribute
¶
comparisons: str = Field(
description="""
    A valid ON clause to compare fields between the left and
    the right data.

    Use left.field and right.field to refer to columns in the
    respective sources.

    For example:

    "left.name = right.name and left.company_id = right.id"
"""
)
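To make the shape of the comparisons string concrete, here is a minimal standard-library sketch that handles only the simplest case: equalities joined by `and`. The helpers `parse_equalities` and `deterministic_link` are hypothetical. A real ON clause is evaluated by a SQL engine and can express far more than this sketch supports.

```python
import re

# Matches one "left.<column> = right.<column>" equality.
_EQUALITY = re.compile(r"^\s*left\.(\w+)\s*=\s*right\.(\w+)\s*$")


def parse_equalities(comparisons):
    """Split AND-ed equalities into (left_column, right_column) pairs."""
    pairs = []
    clauses = re.split(r"\s+and\s+", comparisons.strip(), flags=re.IGNORECASE)
    for clause in clauses:
        match = _EQUALITY.match(clause)
        if match is None:
            raise ValueError(f"Unsupported clause in this sketch: {clause!r}")
        pairs.append(match.groups())
    return pairs


def deterministic_link(left, right, left_id, right_id, comparisons):
    """Return (left_id, right_id) pairs for rows satisfying every equality."""
    columns = parse_equalities(comparisons)
    links = []
    for left_row in left:
        for right_row in right:
            if all(left_row[lc] == right_row[rc] for lc, rc in columns):
                links.append((left_row[left_id], right_row[right_id]))
    return links


left = [
    {"pk": 1, "name": "Acme", "company_id": 10},
    {"pk": 2, "name": "Bolt", "company_id": 20},
]
right = [
    {"pk": 7, "name": "Acme", "id": 10},
    {"pk": 8, "name": "Bolt", "id": 99},
]
links = deterministic_link(
    left,
    right,
    left_id="pk",
    right_id="pk",
    comparisons="left.name = right.name and left.company_id = right.id",
)
```

The nested loop is quadratic and only suitable for a toy example; it is there to show which rows a conjunction of equalities selects.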
DeterministicLinker
¶
Bases: Linker
A deterministic linker that links based on a set of boolean conditions.
Methods:
-
from_settings
–Create a DeterministicLinker from a settings dictionary.
-
prepare
–Prepare the linker for linking.
-
link
–Link the left and right dataframes.
Attributes:
from_settings
classmethod
¶
from_settings(
left_id: str, right_id: str, comparisons: str
) -> DeterministicLinker
Create a DeterministicLinker from a settings dictionary.
splinklinker
¶
A linking methodology leveraging Splink.
Classes:
-
SplinkLinkerFunction
–A method of splink.Linker.training used to train the linker.
-
SplinkSettings
–A data class to enforce the Splink linker’s settings dictionary shape.
-
SplinkLinker
–A linker that leverages Bayesian record linkage using Splink.
SplinkLinkerFunction
¶
Bases: BaseModel
A method of splink.Linker.training used to train the linker.
Methods:
-
validate_function_and_arguments
–Ensure the function and arguments are valid.
Attributes:
SplinkSettings
¶
Bases: LinkerSettings
A data class to enforce the Splink linker’s settings dictionary shape.
Methods:
-
check_ids_match
–Ensure left_id and right_id match.
-
check_link_only
–Ensure link_type is set to “link_only”.
-
add_enforced_settings
–Ensure ID is the only field we link on.
Attributes:
-
left_id
(str
) –
-
right_id
(str
) –
-
model_config
–
-
database_api
(Type[DuckDBAPI]
) –
-
linker_training_functions
(list[SplinkLinkerFunction]
) –
-
linker_settings
(SettingsCreator
) –
-
threshold
(float | None
) –
left_id
class-attribute
instance-attribute
¶
left_id: str = Field(
description="The unique ID column in the left dataset"
)
right_id
class-attribute
instance-attribute
¶
right_id: str = Field(
description="The unique ID column in the right dataset"
)
model_config
class-attribute
instance-attribute
¶
database_api
class-attribute
instance-attribute
¶
database_api: Type[DuckDBAPI] = Field(
default=DuckDBAPI,
description="The Splink DB API, to choose between DuckDB (default) and Spark (untested)",
)
linker_training_functions
class-attribute
instance-attribute
¶
linker_training_functions: list[SplinkLinkerFunction] = (
Field(
description='''
    A list of dictionaries where keys are the names of methods for
    splink.Linker.training and values are dictionaries encoding the arguments of
    those methods. Each function will be run in the order supplied.

    Example:

        >>> linker_training_functions=[
        ...     {
        ...         "function": "estimate_probability_two_random_records_match",
        ...         "arguments": {
        ...             "deterministic_matching_rules": """
        ...                 l.company_name = r.company_name
        ...             """,
        ...             "recall": 0.7,
        ...         },
        ...     },
        ...     {
        ...         "function": "estimate_u_using_random_sampling",
        ...         "arguments": {"max_pairs": 1e6},
        ...     }
        ... ]
'''
)
)
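The description above says each function is run in the order supplied. A minimal sketch of that dispatch pattern follows, using a hypothetical `FakeTraining` stand-in that only records calls so the example runs without Splink. `run_training_functions` is likewise an assumption about how such a list could be consumed, not matchbox's code.

```python
class FakeTraining:
    """Hypothetical stand-in for a training object; it only records calls."""

    def __init__(self):
        self.calls = []

    def estimate_probability_two_random_records_match(
        self, deterministic_matching_rules, recall
    ):
        self.calls.append(("estimate_probability_two_random_records_match", recall))

    def estimate_u_using_random_sampling(self, max_pairs):
        self.calls.append(("estimate_u_using_random_sampling", max_pairs))


def run_training_functions(training, functions):
    """Call each named method with its arguments, in list order."""
    for spec in functions:
        method = getattr(training, spec["function"], None)
        if not callable(method):
            raise ValueError(f"Unknown training function: {spec['function']!r}")
        method(**spec["arguments"])


linker_training_functions = [
    {
        "function": "estimate_probability_two_random_records_match",
        "arguments": {
            "deterministic_matching_rules": "l.company_name = r.company_name",
            "recall": 0.7,
        },
    },
    {
        "function": "estimate_u_using_random_sampling",
        "arguments": {"max_pairs": 1e6},
    },
]

training = FakeTraining()
run_training_functions(training, linker_training_functions)
```

Looking the method up by name and failing early on an unknown name is the same kind of check the validate_function_and_arguments method is described as performing.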
linker_settings
class-attribute
instance-attribute
¶
linker_settings: SettingsCreator = Field(
description="""
    A valid Splink SettingsCreator.

    See Splink's documentation for a full description of available settings.
    https://moj-analytical-services.github.io/splink/api_docs/settings_dict_guide.html

    * link_type must be set to "link_only"
    * unique_id_column_name is overridden to the value of left_id and right_id,
      which must match

    Example:

        >>> from splink import SettingsCreator, block_on
        ... import splink.comparison_library as cl
        ... import splink.comparison_template_library as ctl
        ...
        ... splink_settings = SettingsCreator(
        ...     retain_matching_columns=False,
        ...     retain_intermediate_calculation_columns=False,
        ...     blocking_rules_to_generate_predictions=[
        ...         block_on("company_name"),
        ...         block_on("postcode"),
        ...     ],
        ...     comparisons=[
        ...         cl.jaro_winkler_at_thresholds(
        ...             "company_name",
        ...             [0.9, 0.6],
        ...             term_frequency_adjustments=True
        ...         ),
        ...         ctl.postcode_comparison("postcode"),
        ...     ]
        ... )
"""
)
threshold
class-attribute
instance-attribute
¶
threshold: float | None = Field(
default=None,
description="""
    The probability above which matches will be kept.

    None is used to indicate no threshold.

    Inclusive, so a value of 1 will keep only exact matches across all
    comparisons.
""",
gt=0,
le=1,
)
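The constraints on threshold (greater than 0, at most 1, or None for no threshold) and its inclusive behaviour can be sketched as a small filter. `keep_matches` and the scored pairs below are hypothetical and only mirror the documented rules.

```python
def keep_matches(scored_pairs, threshold=None):
    """Keep (left_id, right_id, probability) tuples at or above threshold.

    Hypothetical helper: None disables filtering, mirroring the field's
    documented meaning; the bounds mirror gt=0 and le=1.
    """
    if threshold is None:
        return list(scored_pairs)
    if not 0 < threshold <= 1:
        raise ValueError("threshold must satisfy 0 < threshold <= 1")
    # Inclusive comparison: a probability equal to the threshold is kept.
    return [pair for pair in scored_pairs if pair[2] >= threshold]


scored = [("a", "x", 1.0), ("b", "y", 0.85), ("c", "z", 0.4)]
kept_all = keep_matches(scored)
kept_some = keep_matches(scored, threshold=0.85)
kept_exact = keep_matches(scored, threshold=1)
```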
add_enforced_settings
¶
add_enforced_settings() -> SplinkSettings
Ensure ID is the only field we link on.
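The three checks listed for SplinkSettings can be summarised in a standard-library sketch. `SketchSplinkSettings` is a hypothetical stand-in: it uses a plain dict where the real class takes a SettingsCreator, and it only mirrors the rules stated in the field descriptions (matching IDs, link_type of "link_only", and the unique_id_column_name override).

```python
from dataclasses import dataclass, field


@dataclass
class SketchSplinkSettings:
    """Stand-in for the documented checks; a dict replaces SettingsCreator."""

    left_id: str
    right_id: str
    linker_settings: dict = field(default_factory=dict)

    def __post_init__(self):
        # Mirrors check_ids_match: both sides must use the same ID column.
        if self.left_id != self.right_id:
            raise ValueError("left_id and right_id must match")
        # Mirrors check_link_only: only "link_only" is accepted.
        if self.linker_settings.get("link_type") != "link_only":
            raise ValueError('link_type must be set to "link_only"')
        # Mirrors the documented override of unique_id_column_name,
        # working on a copy so the caller's dict is left untouched.
        self.linker_settings = {
            **self.linker_settings,
            "unique_id_column_name": self.left_id,
        }


settings = SketchSplinkSettings(
    left_id="hash",
    right_id="hash",
    linker_settings={"link_type": "link_only", "unique_id_column_name": "id"},
)
```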
SplinkLinker
¶
Bases: Linker
A linker that leverages Bayesian record linkage using Splink.
Methods:
-
from_settings
–Create a SplinkLinker from a settings dictionary.
-
prepare
–Prepare the linker for linking.
-
link
–Link the left and right dataframes.
Attributes:
weighteddeterministic
¶
A linking methodology that applies different weights to field comparisons.
Classes:
-
WeightedComparison
–A valid comparison and a weight to give it.
-
WeightedDeterministicSettings
–A data class to enforce the Weighted linker’s settings dictionary shape.
-
WeightedDeterministicLinker
–A deterministic linker that applies different weights to field comparisons.
WeightedComparison
¶
Bases: BaseModel
A valid comparison and a weight to give it.
Methods:
-
validate_comparison
–Validate the comparison string.
Attributes:
-
comparison
(str
) –
-
weight
(float
) –
comparison
class-attribute
instance-attribute
¶
comparison: str = Field(
description="""
    A valid ON clause to compare fields between the left and
    the right data.

    Use left.field and right.field to refer to columns in the
    respective sources.

    For example:

    "left.company_name = right.company_name"
"""
)
WeightedDeterministicSettings
¶
Bases: LinkerSettings
A data class to enforce the Weighted linker’s settings dictionary shape.
Example
>>> {
...     left_id: "hash",
...     right_id: "hash",
...     weighted_comparisons: [
...         ("l.company_name = r.company_name", 0.7),
...         ("l.postcode = r.postcode", 0.7),
...         ("l.company_id = r.company_id", 1),
...     ],
...     threshold: 0.8,
... }
Attributes:
-
left_id
(str
) –
-
right_id
(str
) –
-
weighted_comparisons
(list[WeightedComparison]
) –
-
threshold
(float
) –
left_id
class-attribute
instance-attribute
¶
left_id: str = Field(
description="The unique ID column in the left dataset"
)
right_id
class-attribute
instance-attribute
¶
right_id: str = Field(
description="The unique ID column in the right dataset"
)
weighted_comparisons
class-attribute
instance-attribute
¶
weighted_comparisons: list[WeightedComparison] = Field(
description="A list of tuples in the form of a comparison, and a weight."
)
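One way to read a list of weighted comparisons is as a score built from the weights of the comparisons that hold. The sketch below assumes the score is the matched weight divided by the total weight; that normalisation is an assumption made for illustration, not a statement of how matchbox combines weights. Comparisons are written as Python predicates rather than SQL so the example runs with the standard library alone, and `weighted_score` and `same` are hypothetical helpers.

```python
def weighted_score(left_row, right_row, weighted_comparisons):
    """Return matched weight divided by total weight (assumed formula)."""
    total = sum(weight for _, weight in weighted_comparisons)
    matched = sum(
        weight
        for predicate, weight in weighted_comparisons
        if predicate(left_row, right_row)
    )
    return matched / total


def same(column):
    """Build a predicate that tests equality on one column."""
    return lambda lhs, rhs: lhs[column] == rhs[column]


weighted_comparisons = [
    (same("company_name"), 0.7),
    (same("postcode"), 0.7),
    (same("company_id"), 1),
]
threshold = 0.8

left_row = {"company_name": "Acme", "postcode": "AB1 2CD", "company_id": 10}
exact = {"company_name": "Acme", "postcode": "AB1 2CD", "company_id": 10}
partial = {"company_name": "Acme", "postcode": "AB1 2CD", "company_id": 99}

exact_score = weighted_score(left_row, exact, weighted_comparisons)
partial_score = weighted_score(left_row, partial, weighted_comparisons)
```

Under this assumed formula the pair that agrees on every field clears the 0.8 threshold, while the pair that differs on company_id does not.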
WeightedDeterministicLinker
¶
Bases: Linker
A deterministic linker that applies different weights to field comparisons.
Methods:
-
from_settings
–Create a WeightedDeterministicLinker from a settings dictionary.
-
prepare
–Prepare the linker for linking.
-
link
–Link the left and right dataframes.
Attributes:
models
¶
Functions and classes to define, run and register models.
Classes:
-
Model
–Unified model class for both linking and deduping operations.
Functions:
-
make_model
–Create a unified model instance for either linking or deduping operations.
Model
¶
Model(
metadata: ModelMetadata,
model_instance: Linker | Deduper,
left_data: DataFrame,
right_data: DataFrame | None = None,
)
Unified model class for both linking and deduping operations.
Methods:
-
insert_model
–Insert the model into the backend database.
-
delete
–Delete the model from the database.
-
run
–Execute the model pipeline and return results.
Attributes:
-
metadata
–
-
model_instance
–
-
left_data
–
-
right_data
–
results
(Results
) –Retrieve results associated with the model from the database.
-
truth
(float
) –Retrieve the truth threshold for the model.
-
ancestors
(dict[str, float]
) –Retrieve the ancestors of the model.
-
ancestors_cache
(dict[str, float]
) –Retrieve the ancestors cache of the model.
make_model
¶
make_model(
model_name: str,
description: str,
model_class: type[Linker] | type[Deduper],
model_settings: dict[str, Any],
left_data: DataFrame,
left_resolution: str,
right_data: DataFrame | None = None,
right_resolution: str | None = None,
) -> Model
Create a unified model instance for either linking or deduping operations.
Parameters:
-
model_name
(str
) –Your unique identifier for the model
-
description
(str
) –Description of the model run
-
model_class
(type[Linker] | type[Deduper]
) –Either Linker or Deduper class
-
model_settings
(dict[str, Any]
) –Configuration settings for the model
-
left_data
(DataFrame
) –Primary dataset
-
left_resolution
(str
) –Resolution name for primary model or dataset
-
right_data
(DataFrame | None, default: None
) –Secondary dataset (linking only)
-
right_resolution
(str | None, default: None
) –Resolution name for secondary model or dataset (linking only)
Returns:
-
Model
(Model
) –Configured model instance ready for execution
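Because right_data and right_resolution are documented as linking only, one plausible reading is that deduplication takes a single dataset and linking takes two. The sketch below encodes that reading with a hypothetical helper, `infer_model_kind`. It is not part of matchbox, and the requirement that both right-hand arguments are supplied together is an assumption.

```python
def infer_model_kind(right_data=None, right_resolution=None):
    """Classify a call as "deduper" or "linker" from its optional arguments.

    Hypothetical helper: assumes right_data and right_resolution are
    supplied together for linking and both omitted for deduplication.
    """
    has_data = right_data is not None
    has_resolution = right_resolution is not None
    if has_data != has_resolution:
        raise ValueError("right_data and right_resolution must be given together")
    return "linker" if has_data else "deduper"


dedupe_kind = infer_model_kind()
link_kind = infer_model_kind(right_data=[{"id": 1}], right_resolution="companies")
```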