Models

matchbox.client.models

Deduplication and linking methodologies.

Modules:

  • comparison

    Functions to compare fields in different sources.

  • dedupers

    Deduplication methodologies.

  • linkers

    Linking methodologies.

  • models

    Functions and classes to define, run and register models.

Classes:

  • Model

    Unified model class for both linking and deduping operations.

Functions:

  • add_model_class

    Add custom deduper or linker.

Model

Unified model class for both linking and deduping operations.

Parameters:

  • dag

    (DAG) –

    DAG containing this model.

  • name

    (str) –

    Unique name for the model.

  • truth

    (float, default: 1.0 ) –

    Truth threshold. Defaults to 1.0. Can be set later after analysis.

  • model_class

    (type[Deduper] | type[Linker] | str) –

    Class of Linker or Deduper, or its name.

  • model_settings

    (DeduperSettings | LinkerSettings | dict) –

    Appropriate settings object to pass to model class.

  • left_query

    (Query) –

    The query that will get the data to deduplicate, or the data to link on the left.

  • right_query

    (Query | None, default: None ) –

    The query that will get the data to link on the right.

  • description

    (str | None, default: None ) –

    Optional description of the model.

Methods:

  • to_resolution

    Convert to Resolution for API calls.

  • from_resolution

    Reconstruct from Resolution.

  • delete

    Delete the model from the database.

  • run

    Execute the model pipeline and return results.

  • sync

    Send the model config, truth and results to the server.

  • download_results

    Retrieve results associated with the model from the database.

  • query

    Generate a query for this model.

Attributes:

last_run instance-attribute

last_run: datetime | None = None

dag instance-attribute

dag = dag

name instance-attribute

name = name

description instance-attribute

description = description

left_query instance-attribute

left_query = left_query

right_query instance-attribute

right_query = right_query

results instance-attribute

results: Results | None = None

model_class instance-attribute

model_class: type[Linker | Deduper] = _MODEL_CLASSES[model_class]

model_instance instance-attribute

model_instance = model_class(settings=model_settings)

model_type instance-attribute

model_type: ModelType = LINKER if issubclass(model_class, Linker) else DEDUPER

model_settings instance-attribute

model_settings = SettingsClass(**model_settings)

config property

config: ModelConfig

Generate config DTO from Model.

dependencies property

dependencies: list[ResolutionPath]

Returns all resolution paths this model needs as implied by the queries.

parents property

parents: list[ResolutionPath]

Returns all resolution paths directly input to this model.

resolution_path property

resolution_path: ModelResolutionPath

Returns the model resolution path.

truth property writable

truth: float | None

Returns the truth threshold for the model as a float.

to_resolution

to_resolution() -> Resolution

Convert to Resolution for API calls.

from_resolution classmethod

from_resolution(resolution: Resolution, resolution_name: str, dag: DAG) -> Model

Reconstruct from Resolution.

delete

delete(certain: bool = False) -> bool

Delete the model from the database.

run

run(for_validation: bool = False, full_rerun: bool = False) -> Results

Execute the model pipeline and return results.

Parameters:

  • for_validation
    (bool, default: False ) –

    Whether to download and store extra data to explore and score results.

  • full_rerun
    (bool, default: False ) –

    Whether to force a re-run even if the results are cached.
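
The interaction between cached results and full_rerun can be pictured with a small stand-in. This is only a sketch under assumptions: CachedRunner and its compute callable are hypothetical and are not part of matchbox, and the real run method may decide what counts as cached differently.

```python
class CachedRunner:
    """Hypothetical stand-in showing how a full_rerun flag can bypass cached results."""

    def __init__(self, compute):
        self._compute = compute  # assumed: a callable that produces fresh results
        self.results = None
        self.run_count = 0

    def run(self, full_rerun: bool = False):
        # Reuse earlier results unless the caller forces a fresh run.
        if self.results is not None and not full_rerun:
            return self.results
        self.results = self._compute()
        self.run_count += 1
        return self.results


runner = CachedRunner(lambda: ["pair-a", "pair-b"])
runner.run()                  # computes
runner.run()                  # served from the stored results
runner.run(full_rerun=True)   # computes again
```

After these three calls the stand-in has computed results twice, which is the behaviour the full_rerun description suggests.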

sync

sync() -> None

Send the model config, truth and results to the server.

download_results

download_results() -> Results

Retrieve results associated with the model from the database.

query

query(*sources, **kwargs) -> Query

Generate a query for this model.

add_model_class

add_model_class(ModelClass: type[Linker] | type[Deduper]) -> None

Add custom deduper or linker.
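
The model_class attribute shown earlier is looked up in a _MODEL_CLASSES mapping, which suggests a registry that add_model_class extends. The sketch below illustrates that pattern with plain Python. The Linker and Deduper classes here are simplified stand-ins, resolve_model_class and model_type are hypothetical helper names, and keying the registry by class name is an assumption.

```python
from __future__ import annotations

from abc import ABC, abstractmethod


class Linker(ABC):
    """Simplified stand-in for the Linker base class."""

    @abstractmethod
    def link(self, left, right): ...


class Deduper(ABC):
    """Simplified stand-in for the Deduper base class."""

    @abstractmethod
    def dedupe(self, data): ...


# Assumed: the registry maps a class name to the class itself.
_MODEL_CLASSES: dict[str, type] = {}


def add_model_class(model_class: type) -> None:
    """Register a custom linker or deduper under its class name."""
    if not issubclass(model_class, (Linker, Deduper)):
        raise ValueError("model_class must subclass Linker or Deduper")
    _MODEL_CLASSES[model_class.__name__] = model_class


def resolve_model_class(model_class: type | str) -> type:
    """Accept either a class or its registered name and return the class."""
    name = model_class if isinstance(model_class, str) else model_class.__name__
    return _MODEL_CLASSES[name]


def model_type(model_class: type) -> str:
    """Classify a class as a linker or a deduper, as the model_type attribute does."""
    return "linker" if issubclass(model_class, Linker) else "deduper"


class ExactDeduper(Deduper):
    """Hypothetical custom deduper used only for this example."""

    def dedupe(self, data):
        return data


add_model_class(ExactDeduper)
```

Once registered, the class can be referred to either directly or by the string "ExactDeduper", which matches the model_class parameter accepting a class or its name.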

comparison

Functions to compare fields in different sources.

Functions:

  • comparison

    Validates any number of SQL conditions and prepares for a WHERE clause.

comparison

comparison(sql_condition: str, dialect: str = 'postgres') -> str

Validates any number of SQL conditions and prepares for a WHERE clause.

Requires all column references be explicitly declared as from “l” and “r” tables.
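
To make the “l” and “r” requirement concrete, here is a deliberately simplified check written with the standard library. It is not how comparison validates SQL (the dialect parameter suggests a real SQL parser is involved); unqualified_columns, the keyword list and the regular expression are all assumptions made for illustration.

```python
import re

# Words that are SQL syntax rather than column names (deliberately incomplete).
_KEYWORDS = {"and", "or", "not", "is", "null", "in", "like", "between", "true", "false"}

# Matches a string literal, a qualified reference (alias.column), or a bare word.
_TOKEN = re.compile(r"'(?:[^']|'')*'|([A-Za-z_]\w*)\.([A-Za-z_]\w*)|([A-Za-z_]\w*)")


def unqualified_columns(sql_condition: str) -> list[str]:
    """Return column references that are not prefixed with l. or r."""
    problems = []
    for match in _TOKEN.finditer(sql_condition):
        alias, column, bare = match.groups()
        if alias is not None:
            # Qualified reference: only the l and r aliases are accepted.
            if alias.lower() not in {"l", "r"}:
                problems.append(f"{alias}.{column}")
        elif bare is not None and bare.lower() not in _KEYWORDS:
            # A bare word followed by "(" is treated as a function call.
            rest = sql_condition[match.end():].lstrip()
            if not rest.startswith("("):
                problems.append(bare)
    return problems
```

For example, "l.name = r.name and l.postcode = r.postcode" yields no problems, while "l.name = name" reports the bare name column.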

dedupers

Deduplication methodologies.

Modules:

  • base

    Base class for deduplication methodologies.

  • naive

    A deduplication methodology based on a deterministic set of conditions.

Classes:

  • NaiveDeduper

    A simple deduper that deduplicates based on a set of boolean conditions.

NaiveDeduper

Bases: Deduper

A simple deduper that deduplicates based on a set of boolean conditions.

Methods:

  • prepare

    Prepare the deduper for deduplication.

  • dedupe

    Deduplicate the dataframe.

Attributes:

settings instance-attribute
settings: NaiveSettings
prepare
prepare(data: DataFrame) -> None

Prepare the deduper for deduplication.

dedupe
dedupe(data: DataFrame) -> DataFrame

Deduplicate the dataframe.

base

Base class for deduplication methodologies.

Classes:

  • DeduperSettings

    A data class to enforce basic settings dictionary shapes.

  • Deduper

    A base class for dedupers.

DeduperSettings

Bases: BaseModel

A data class to enforce basic settings dictionary shapes.

Attributes:

id class-attribute instance-attribute
id: str = Field(default='id', description='A unique ID field in the data to dedupe')
Deduper

Bases: BaseModel, ABC

A base class for dedupers.

Methods:

  • prepare

    Prepare the deduper for deduplication.

  • dedupe

    Deduplicate the dataframe.

Attributes:

settings instance-attribute
settings: DeduperSettings
prepare abstractmethod
prepare(data: DataFrame) -> None

Prepare the deduper for deduplication.

dedupe abstractmethod
dedupe(data: DataFrame) -> DataFrame

Deduplicate the dataframe.

naive

A deduplication methodology based on a deterministic set of conditions.

Classes:

  • NaiveSettings

    A data class to enforce the Naive deduper’s settings dictionary shape.

  • NaiveDeduper

    A simple deduper that deduplicates based on a set of boolean conditions.

NaiveSettings

Bases: DeduperSettings

A data class to enforce the Naive deduper’s settings dictionary shape.

Attributes:

unique_fields class-attribute instance-attribute
unique_fields: list[str] = Field(description='A list of fields that will form a unique, deduplicated record')
id class-attribute instance-attribute
id: str = Field(default='id', description='A unique ID field in the data to dedupe')
NaiveDeduper

Bases: Deduper

A simple deduper that deduplicates based on a set of boolean conditions.

Methods:

  • prepare

    Prepare the deduper for deduplication.

  • dedupe

    Deduplicate the dataframe.

Attributes:

settings instance-attribute
settings: NaiveSettings
prepare
prepare(data: DataFrame) -> None

Prepare the deduper for deduplication.

dedupe
dedupe(data: DataFrame) -> DataFrame

Deduplicate the dataframe.
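
The idea behind unique_fields can be sketched without a dataframe library: records that agree on every listed field are grouped, and each pair of IDs within a group is reported as a duplicate. This is an illustration only. naive_dedupe is a hypothetical function working on plain dictionaries, whereas the real dedupe method takes and returns a DataFrame whose exact columns are not described here.

```python
from collections import defaultdict
from itertools import combinations


def naive_dedupe(records, unique_fields, id_field="id"):
    """Pair up the IDs of records whose unique_fields values are all equal."""
    groups = defaultdict(list)
    for record in records:
        key = tuple(record[field] for field in unique_fields)
        groups[key].append(record[id_field])

    pairs = []
    for ids in groups.values():
        # Every pair inside a group is treated as a certain duplicate.
        pairs.extend(combinations(sorted(ids), 2))
    return sorted(pairs)


records = [
    {"id": 1, "name": "Acme Ltd", "postcode": "AB1 2CD"},
    {"id": 2, "name": "Acme Ltd", "postcode": "AB1 2CD"},
    {"id": 3, "name": "Acme Ltd", "postcode": "ZZ9 9ZZ"},
    {"id": 4, "name": "Acme Ltd", "postcode": "AB1 2CD"},
]
duplicate_pairs = naive_dedupe(records, unique_fields=["name", "postcode"])
# duplicate_pairs == [(1, 2), (1, 4), (2, 4)]
```

Record 3 shares a name with the others but has a different postcode, so it is not paired with anything.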

linkers

Linking methodologies.

Modules:

  • base

    Base class for linkers.

  • deterministic

    A linking methodology based on a deterministic set of conditions.

  • splinklinker

    A linking methodology leveraging Splink.

  • weighteddeterministic

    A linking methodology that applies different weights to field comparisons.

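Classes:

  • DeterministicLinker

    A deterministic linker that links based on a set of boolean conditions.

  • SplinkLinker

    A linker that leverages Bayesian record linkage using Splink.

  • WeightedDeterministicLinker

    A deterministic linker that applies different weights to field comparisons.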

DeterministicLinker

Bases: Linker

A deterministic linker that links based on a set of boolean conditions.

Methods:

  • prepare

    Prepare the linker for linking.

  • link

    Link the left and right dataframes.

Attributes:

settings instance-attribute
prepare
prepare(left: DataFrame, right: DataFrame) -> None

Prepare the linker for linking.

link
link(left: DataFrame, right: DataFrame) -> DataFrame

Link the left and right dataframes.

SplinkLinker

Bases: Linker

A linker that leverages Bayesian record linkage using Splink.

Methods:

  • prepare

    Prepare the linker for linking.

  • link

    Link the left and right dataframes.

Attributes:

settings instance-attribute
settings: SplinkSettings
prepare
prepare(left: DataFrame, right: DataFrame) -> None

Prepare the linker for linking.

link
link(left: DataFrame = None, right: DataFrame = None) -> DataFrame

Link the left and right dataframes.

WeightedDeterministicLinker

Bases: Linker

A deterministic linker that applies different weights to field comparisons.

Methods:

  • prepare

    Prepare the linker for linking.

  • link

    Link the left and right dataframes.

Attributes:

settings instance-attribute
prepare
prepare(left: DataFrame, right: DataFrame) -> None

Prepare the linker for linking.

link
link(left: DataFrame, right: DataFrame) -> DataFrame

Link the left and right dataframes.

base

Base class for linkers.

Classes:

  • LinkerSettings

    A data class to enforce basic settings dictionary shapes.

  • Linker

    A base class for linkers.

LinkerSettings

Bases: BaseModel

A data class to enforce basic settings dictionary shapes.

Attributes:

left_id class-attribute instance-attribute
left_id: str = Field(default='id', description='The unique ID field in the left data')
right_id class-attribute instance-attribute
right_id: str = Field(default='id', description='The unique ID field in the right data')
Linker

Bases: BaseModel, ABC

A base class for linkers.

Methods:

  • prepare

    Prepare the linker for linking.

  • link

    Link the left and right dataframes.

Attributes:

settings instance-attribute
settings: LinkerSettings
prepare abstractmethod
prepare(left: DataFrame, right: DataFrame) -> None

Prepare the linker for linking.

link
link(left: DataFrame, right: DataFrame) -> DataFrame

Link the left and right dataframes.

deterministic

A linking methodology based on a deterministic set of conditions.

Classes:

  • DeterministicSettings

    A data class to enforce the Deterministic linker’s settings dictionary shape.

  • DeterministicLinker

    A deterministic linker that links based on a set of boolean conditions.

DeterministicSettings

Bases: LinkerSettings

A data class to enforce the Deterministic linker’s settings dictionary shape.

Methods:

  • validate_comparison

    Turn single string into list of one string.

Attributes:

comparisons class-attribute instance-attribute
comparisons: list[str] = Field(description='\n            A list of valid ON clause to compare fields between the left and \n            the right data.\n\n            Use left.field and right.field to refer to columns in the respective \n            sources.\n\n            Each comparison will be treated as OR logic, but more efficiently than using\n            an OR condition in the SQL WHERE clause.\n\n            For example:\n\n            [   \n                "left.company_number = right.company_number",\n                "left.name = right.name and left.postcode = right.postcode",\n            ]\n        ')
left_id class-attribute instance-attribute
left_id: str = Field(default='id', description='The unique ID field in the left data')
right_id class-attribute instance-attribute
right_id: str = Field(default='id', description='The unique ID field in the right data')
validate_comparison classmethod
validate_comparison(value: str | list[str]) -> list[str]

Turn single string into list of one string.
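
The coercion described here is small enough to show in full. The function below is a stand-alone sketch with a hypothetical name, as_comparison_list; the real validate_comparison is a classmethod on the settings class and may perform further checks.

```python
def as_comparison_list(value):
    """Wrap a lone comparison string in a list; pass lists through unchanged."""
    if isinstance(value, str):
        return [value]
    return list(value)


single = as_comparison_list("left.name = right.name")
several = as_comparison_list(["left.a = right.a", "left.b = right.b"])
```

This lets a caller supply one comparison without wrapping it in a list first.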

DeterministicLinker

Bases: Linker

A deterministic linker that links based on a set of boolean conditions.

Methods:

  • prepare

    Prepare the linker for linking.

  • link

    Link the left and right dataframes.

Attributes:

settings instance-attribute
prepare
prepare(left: DataFrame, right: DataFrame) -> None

Prepare the linker for linking.

link
link(left: DataFrame, right: DataFrame) -> DataFrame

Link the left and right dataframes.
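
The comparisons setting treats each entry as an alternative (OR logic) while avoiding an OR in a single join condition. One way to get that effect, sketched here with SQLite from the standard library, is to run one join per comparison and combine the results with UNION. This is an assumption about the general technique, not a description of the linker's internals. The table names are hypothetical, and the sketch uses l and r aliases for brevity rather than the left and right prefixes the setting expects.

```python
import sqlite3

comparisons = [
    "l.company_number = r.company_number",
    "l.name = r.name and l.postcode = r.postcode",
]

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE left_data (id INTEGER, company_number TEXT, name TEXT, postcode TEXT);
    CREATE TABLE right_data (id INTEGER, company_number TEXT, name TEXT, postcode TEXT);
    INSERT INTO left_data VALUES
        (1, '001', 'Acme', 'AB1'),
        (2, '002', 'Bolt', 'CD2'),
        (3, '003', 'Cog', 'EF3');
    INSERT INTO right_data VALUES
        (10, '001', 'Acme Ltd', 'AB1'),
        (20, '999', 'Bolt', 'CD2'),
        (30, '998', 'Cog', 'XX0');
    """
)

# One join per comparison; UNION removes pairs matched by more than one rule.
# The conditions are interpolated directly, so they must come from a trusted source.
query = "\nUNION\n".join(
    f"SELECT l.id, r.id FROM left_data AS l JOIN right_data AS r ON {condition}"
    for condition in comparisons
)
links = sorted(conn.execute(query).fetchall())
# links == [(1, 10), (2, 20)]
```

Pair (1, 10) is found by the company number rule and pair (2, 20) by the name and postcode rule; record 3 satisfies neither.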

splinklinker

A linking methodology leveraging Splink.

Classes:

  • SplinkLinkerFunction

    A method of splink.Linker.training used to train the linker.

  • SplinkSettings

    A data class to enforce the Splink linker’s settings dictionary shape.

  • SplinkLinker

    A linker that leverages Bayesian record linkage using Splink.

SplinkLinkerFunction

Bases: BaseModel

A method of splink.Linker.training used to train the linker.

Methods:

  • validate_function_and_arguments

    Ensure the function and arguments are valid.

Attributes:

function instance-attribute
function: str
arguments instance-attribute
arguments: dict[str, Any]
validate_function_and_arguments
validate_function_and_arguments() -> SplinkLinkerFunction

Ensure the function and arguments are valid.
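
A check of this kind can be sketched with inspect.signature: confirm the named method exists, then confirm every supplied argument is one of its parameters. The code below is illustrative only. FakeTraining is a hypothetical stand-in rather than Splink's training object, and check_function_and_arguments is an assumed name, not the validator itself.

```python
import inspect


class FakeTraining:
    """Hypothetical stand-in for an object exposing training methods."""

    def estimate_u_using_random_sampling(self, max_pairs, seed=None):
        pass


def check_function_and_arguments(target, function, arguments):
    """Raise ValueError unless `function` exists on `target` and accepts `arguments`."""
    method = getattr(target, function, None)
    if not callable(method):
        raise ValueError(f"{function!r} is not a method of {target.__name__}")
    params = inspect.signature(method).parameters
    unknown = set(arguments) - set(params)
    if unknown:
        raise ValueError(f"unexpected arguments for {function}: {sorted(unknown)}")


# Passes silently: the method exists and max_pairs is one of its parameters.
check_function_and_arguments(
    FakeTraining, "estimate_u_using_random_sampling", {"max_pairs": 1e6}
)
```

A misspelled function name or an argument the method does not accept raises ValueError before any training is attempted.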

SplinkSettings

Bases: LinkerSettings

A data class to enforce the Splink linker’s settings dictionary shape.

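Methods:

  • check_ids_match

    Ensure left_id and right_id match.

  • check_link_only

    Ensure link_type is set to “link_only”.

  • add_enforced_settings

    Ensure ID is the only field we link on.

  • load_linker_settings

    Load serialised settings into SettingsCreator.

  • serialise_settings

    Convert Splink settings to string.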

Attributes:

model_config class-attribute instance-attribute
model_config = ConfigDict(arbitrary_types_allowed=True)
linker_training_functions class-attribute instance-attribute
linker_training_functions: list[SplinkLinkerFunction] = Field(description='\n            A list of dictionaries where keys are the names of methods for\n            splink.Linker.training and values are dictionaries encoding the arguments of\n            those methods. Each function will be run in the order supplied.\n\n            Example:\n            \n                >>> linker_training_functions=[\n                ...     {\n                ...         "function": "estimate_probability_two_random_records_match",\n                ...         "arguments": {\n                ...             "deterministic_matching_rules": """\n                ...                 l.company_name = r.company_name\n                ...             """,\n                ...             "recall": 0.7,\n                ...         },\n                ...     },\n                ...     {\n                ...         "function": "estimate_u_using_random_sampling",\n                ...         "arguments": {"max_pairs": 1e6},\n                ...     }\n                ... ]\n            \n        ')
linker_settings class-attribute instance-attribute
linker_settings: SettingsCreator = Field(description='\n            A valid Splink SettingsCreator.\n\n            See Splink\'s documentation for a full description of available settings.\n            https://moj-analytical-services.github.io/splink/api_docs/settings_dict_guide.html\n\n            * link_type must be set to "link_only"\n            * unique_id_name is overridden to the value of left_id and right_id,\n                which must match\n\n            Example:\n\n                >>> from splink import SettingsCreator, block_on\n                ... import splink.comparison_library as cl\n                ... import splink.comparison_template_library as ctl\n                ... \n                ... splink_settings = SettingsCreator(\n                ...     retain_matching_columns=False,\n                ...     retain_intermediate_calculation_columns=False,\n                ...     blocking_rules_to_generate_predictions=[\n                ...         block_on("company_name"),\n                ...         block_on("postcode"),\n                ...     ],\n                ...     comparisons=[\n                ...         cl.jaro_winkler_at_thresholds(\n                ...             "company_name", \n                ...             [0.9, 0.6], \n                ...             term_frequency_adjustments=True\n                ...         ),\n                ...         ctl.postcode_comparison("postcode"), \n                ...     ]\n                ... )         \n        ')
threshold class-attribute instance-attribute
threshold: float | None = Field(default=None, description='\n            The probability above which matches will be kept.\n\n            None is used to indicate no threshold.\n            \n            Inclusive, so a value of 1 will keep only exact matches across all \n            comparisons.\n        ', gt=0, le=1)
left_id class-attribute instance-attribute
left_id: str = Field(default='id', description='The unique ID field in the left data')
right_id class-attribute instance-attribute
right_id: str = Field(default='id', description='The unique ID field in the right data')
check_ids_match
check_ids_match() -> SplinkSettings

Ensure left_id and right_id match.

check_link_only
check_link_only() -> SplinkSettings

Ensure link_type is set to “link_only”.

add_enforced_settings
add_enforced_settings() -> SplinkSettings

Ensure ID is the only field we link on.

load_linker_settings
load_linker_settings(value: str | SettingsCreator) -> SettingsCreator

Load serialised settings into SettingsCreator.

serialise_settings
serialise_settings(value: SettingsCreator, info: Any) -> str

Convert Splink settings to string.

SplinkLinker

Bases: Linker

A linker that leverages Bayesian record linkage using Splink.

Methods:

  • prepare

    Prepare the linker for linking.

  • link

    Link the left and right dataframes.

Attributes:

settings instance-attribute
settings: SplinkSettings
prepare
prepare(left: DataFrame, right: DataFrame) -> None

Prepare the linker for linking.

link
link(left: DataFrame = None, right: DataFrame = None) -> DataFrame

Link the left and right dataframes.

weighteddeterministic

A linking methodology that applies different weights to field comparisons.

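Classes:

  • WeightedComparison

    A valid comparison and a weight to give it.

  • WeightedDeterministicSettings

    A data class to enforce the Weighted linker’s settings dictionary shape.

  • WeightedDeterministicLinker

    A deterministic linker that applies different weights to field comparisons.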

WeightedComparison

Bases: BaseModel

A valid comparison and a weight to give it.

Methods:

  • validate_comparison

    Validate the comparison string.

Attributes:

comparison class-attribute instance-attribute
comparison: str = Field(description='\n            A valid ON clause to compare fields between the left and \n            the right data.\n\n            Use left.field and right.field to refer to fields in the \n            respective sources.\n\n            For example:\n\n            "left.company_name = right.company_name"\n        ')
weight class-attribute instance-attribute
weight: float = Field(description='\n            A weight to give this comparison. Use 1 for all comparisons to give\n            uniform weight to each.\n        ')
validate_comparison classmethod
validate_comparison(v: str) -> str

Validate the comparison string.

WeightedDeterministicSettings

Bases: LinkerSettings

A data class to enforce the Weighted linker’s settings dictionary shape.

Example

    {
        left_id: "hash",
        right_id: "hash",
        weighted_comparisons: [
            ("l.company_name = r.company_name", 0.7),
            ("l.postcode = r.postcode", 0.7),
            ("l.company_id = r.company_id", 1),
        ],
        threshold: 0.8,
    }

Attributes:

weighted_comparisons class-attribute instance-attribute
weighted_comparisons: list[WeightedComparison] = Field(description='A list of tuples in the form of a comparison, and a weight.')
threshold class-attribute instance-attribute
threshold: float = Field(description='\n            The probability above which matches will be kept. \n            \n            Inclusive, so a value of 1 will keep only exact matches across all \n            comparisons.\n        ', ge=0, le=1)
left_id class-attribute instance-attribute
left_id: str = Field(default='id', description='The unique ID field in the left data')
right_id class-attribute instance-attribute
right_id: str = Field(default='id', description='The unique ID field in the right data')
WeightedDeterministicLinker

Bases: Linker

A deterministic linker that applies different weights to field comparisons.

Methods:

  • prepare

    Prepare the linker for linking.

  • link

    Link the left and right dataframes.

Attributes:

settings instance-attribute
prepare
prepare(left: DataFrame, right: DataFrame) -> None

Prepare the linker for linking.

link
link(left: DataFrame, right: DataFrame) -> DataFrame

Link the left and right dataframes.
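
The reference does not say how the weights are combined into the probability that threshold is compared against. As one plausible reading, the sketch below assumes the score is the weight of the comparisons that matched divided by the total weight, and keeps pairs whose score meets the threshold inclusively. weighted_score and the candidate labels are hypothetical, and a different combination rule would change which pairs survive.

```python
def weighted_score(matches, weights):
    """Weighted share of comparisons that matched (assumed scoring rule)."""
    total = sum(weights)
    return sum(w for m, w in zip(matches, weights) if m) / total


weights = [0.7, 0.7, 1.0]  # company_name, postcode, company_id
threshold = 0.8

candidates = {
    "all three agree": [True, True, True],
    "name and postcode only": [True, True, False],
    "company_id only": [False, False, True],
}
kept = [
    label
    for label, matches in candidates.items()
    if weighted_score(matches, weights) >= threshold  # inclusive, as documented
]
# kept == ["all three agree"]
```

Under this assumed rule, name and postcode together score 1.4 / 2.4, roughly 0.58, so only the pair agreeing on every comparison clears a threshold of 0.8.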

models

Functions and classes to define, run and register models.

Classes:

  • Model

    Unified model class for both linking and deduping operations.

Functions:

  • add_model_class

    Add custom deduper or linker.

Model

Unified model class for both linking and deduping operations.

Parameters:

  • dag
    (DAG) –

    DAG containing this model.

  • name
    (str) –

    Unique name for the model.

  • truth
    (float, default: 1.0 ) –

    Truth threshold. Defaults to 1.0. Can be set later after analysis.

  • model_class
    (type[Deduper] | type[Linker] | str) –

    Class of Linker or Deduper, or its name.

  • model_settings
    (DeduperSettings | LinkerSettings | dict) –

    Appropriate settings object to pass to model class.

  • left_query
    (Query) –

    The query that will get the data to deduplicate, or the data to link on the left.

  • right_query
    (Query | None, default: None ) –

    The query that will get the data to link on the right.

  • description
    (str | None, default: None ) –

    Optional description of the model.

Methods:

  • to_resolution

    Convert to Resolution for API calls.

  • from_resolution

    Reconstruct from Resolution.

  • delete

    Delete the model from the database.

  • run

    Execute the model pipeline and return results.

  • sync

    Send the model config, truth and results to the server.

  • download_results

    Retrieve results associated with the model from the database.

  • query

    Generate a query for this model.

Attributes:

last_run instance-attribute
last_run: datetime | None = None
dag instance-attribute
dag = dag
name instance-attribute
name = name
description instance-attribute
description = description
left_query instance-attribute
left_query = left_query
right_query instance-attribute
right_query = right_query
results instance-attribute
results: Results | None = None
model_class instance-attribute
model_class: type[Linker | Deduper] = _MODEL_CLASSES[model_class]
model_instance instance-attribute
model_instance = model_class(settings=model_settings)
model_type instance-attribute
model_type: ModelType = LINKER if issubclass(model_class, Linker) else DEDUPER
model_settings instance-attribute
model_settings = SettingsClass(**model_settings)
config property
config: ModelConfig

Generate config DTO from Model.

dependencies property
dependencies: list[ResolutionPath]

Returns all resolution paths this model needs as implied by the queries.

parents property
parents: list[ResolutionPath]

Returns all resolution paths directly input to this model.

resolution_path property
resolution_path: ModelResolutionPath

Returns the model resolution path.

truth property writable
truth: float | None

Returns the truth threshold for the model as a float.

to_resolution
to_resolution() -> Resolution

Convert to Resolution for API calls.

from_resolution classmethod
from_resolution(resolution: Resolution, resolution_name: str, dag: DAG) -> Model

Reconstruct from Resolution.

delete
delete(certain: bool = False) -> bool

Delete the model from the database.

run
run(for_validation: bool = False, full_rerun: bool = False) -> Results

Execute the model pipeline and return results.

Parameters:

  • for_validation
    (bool, default: False ) –

    Whether to download and store extra data to explore and score results.

  • full_rerun
    (bool, default: False ) –

    Whether to force a re-run even if the results are cached.

sync
sync() -> None

Send the model config, truth and results to the server.

download_results
download_results() -> Results

Retrieve results associated with the model from the database.

query
query(*sources, **kwargs) -> Query

Generate a query for this model.

add_model_class

add_model_class(ModelClass: type[Linker] | type[Deduper]) -> None

Add custom deduper or linker.