Models

matchbox.client.models

Deduplication and linking methodologies.

Modules:

  • comparison

    Functions to compare fields in different sources.

  • dedupers

    Deduplication methodologies.

  • linkers

    Linking methodologies.

  • models

    Functions and classes to define, run and register models.

Classes:

  • Model

    Unified model class for both linking and deduping operations.

Functions:

  • add_model_class

    Add custom deduper or linker.

Model

Unified model class for both linking and deduping operations.

Parameters:

  • dag

    (DAG) –

    DAG containing this model.

  • name

    (str) –

    Unique name for the model.

  • truth

    (float, default: 1.0 ) –

    Truth threshold. Defaults to 1.0. Can be set later after analysis.

  • model_class

    (type[Deduper] | type[Linker] | str) –

    Class of Linker or Deduper, or its name.

  • model_settings

    (DeduperSettings | LinkerSettings | dict) –

    Appropriate settings object to pass to model class.

  • left_query

    (Query) –

    The query that will get the data to deduplicate, or the data to link on the left.

  • right_query

    (Query | None, default: None ) –

    The query that will get the data to link on the right.

  • description

    (str | None, default: None ) –

    Optional description of the model.

Methods:

  • to_resolution

    Convert to Resolution for API calls.

  • from_resolution

    Reconstruct from Resolution.

  • delete

    Delete the model from the database.

  • run

    Execute the model pipeline and return results.

  • sync

    Send the model config and results to the server.

  • download_results

    Retrieve results associated with the model from the database.

  • query

    Generate a query for this model.

Attributes:

dag instance-attribute

dag = dag

name instance-attribute

name = name

description instance-attribute

description = description

left_query instance-attribute

left_query = left_query

right_query instance-attribute

right_query = right_query

results instance-attribute

results: ModelResults | None = None

model_class instance-attribute

model_class: type[Linker | Deduper] = _MODEL_CLASSES[model_class]

model_instance instance-attribute

model_instance = model_class(settings=model_settings)

model_type instance-attribute

model_type: ModelType = LINKER if issubclass(model_class, Linker) else DEDUPER

model_settings instance-attribute

model_settings = SettingsClass(**model_settings)

config property

config: ModelConfig

Generate config DTO from Model.

sources property

Set of source names upstream of this node.

resolution_path property

resolution_path: ModelResolutionPath

Returns the model resolution path.

truth property writable

truth: float | None

Returns the truth threshold for the model as a float.

to_resolution

to_resolution() -> Resolution

Convert to Resolution for API calls.

from_resolution classmethod

from_resolution(resolution: Resolution, resolution_name: str, dag: DAG) -> Model

Reconstruct from Resolution.

delete

delete(certain: bool = False) -> bool

Delete the model from the database.

run

run(for_validation: bool = False, cache_queries: bool = False) -> ModelResults

Execute the model pipeline and return results.

Parameters:

  • for_validation
    (bool, default: False ) –

    Whether to download and store extra data to explore and score results.

  • cache_queries
    (bool, default: False ) –

    Whether to cache query results on first run and re-use them subsequently.

sync

sync() -> None

Send the model config and results to the server.

Not resistant to race conditions: only one client should call sync at a time.

download_results

download_results() -> ModelResults

Retrieve results associated with the model from the database.

query

query(*sources: Source, **kwargs: Any) -> Query

Generate a query for this model.

add_model_class

add_model_class(ModelClass: type[Linker] | type[Deduper]) -> None

Add custom deduper or linker.
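
The model_class parameter accepts either a class or its name, and the attribute listing above resolves it through a name-keyed lookup. The following is a minimal sketch of how such a registry could behave, using stand-in base classes. The names _REGISTRY, register_model_class and resolve_model_class are hypothetical and chosen for illustration; this is not matchbox's implementation.

```python
from __future__ import annotations


class Linker:
    """Stand-in base class, for this sketch only."""


class Deduper:
    """Stand-in base class, for this sketch only."""


# Hypothetical registry keyed by class name.
_REGISTRY: dict[str, type] = {}


def register_model_class(model_class: type) -> None:
    """Make a custom linker or deduper resolvable by its class name."""
    if not issubclass(model_class, (Linker, Deduper)):
        raise TypeError("model_class must subclass Linker or Deduper")
    _REGISTRY[model_class.__name__] = model_class


def resolve_model_class(model_class: type | str) -> type:
    """Accept either a class or its registered name and return the class."""
    name = model_class if isinstance(model_class, str) else model_class.__name__
    return _REGISTRY[name]


class MyLinker(Linker):
    """Hypothetical custom methodology."""


register_model_class(MyLinker)
```

Under these assumptions, a custom class has to be registered once before it can be referred to by name.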

comparison

Functions to compare fields in different sources.

Functions:

  • comparison

    Validates any number of SQL conditions and prepares for a WHERE clause.

comparison

comparison(sql_condition: str, dialect: str = 'postgres') -> str

Validates any number of SQL conditions and prepares for a WHERE clause.

Requires every column reference to be explicitly qualified with the “l” or “r” table alias.
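
To illustrate the qualification rule, here is a rough, self-contained sketch that flags column references not prefixed with l. or r. It is a regex heuristic written for this example, not a SQL parser and not how comparison() itself validates input. The function name unqualified_columns and the keyword list are assumptions.

```python
import re

# Small, incomplete set of SQL words to ignore; an assumption for this sketch.
_SQL_WORDS = {"and", "or", "not", "is", "null", "like", "in", "between", "true", "false"}

# An identifier, optionally qualified as table.column, not preceded by a word char or dot.
_IDENTIFIER = re.compile(r"(?<![\w.])[A-Za-z_]\w*(?:\.[A-Za-z_]\w*)?")


def unqualified_columns(sql_condition: str) -> list[str]:
    """Return identifiers that are not qualified with the l or r alias."""
    # Blank out string literals so their contents are not read as columns.
    stripped = re.sub(r"'[^']*'", "''", sql_condition)
    offenders = []
    for match in _IDENTIFIER.finditer(stripped):
        token = match.group(0)
        # Identifiers directly followed by "(" are treated as function names.
        if stripped[match.end():].lstrip().startswith("("):
            continue
        if token.lower() in _SQL_WORDS:
            continue
        table, _, column = token.partition(".")
        if not column or table not in {"l", "r"}:
            offenders.append(token)
    return offenders
```

An empty result means every reference the heuristic could see was qualified; anything returned would need an l. or r. prefix.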

dedupers

Deduplication methodologies.

Modules:

  • base

    Base class for deduplication methodologies.

  • naive

    A deduplication methodology based on a deterministic set of conditions.

Classes:

  • NaiveDeduper

    A simple deduper that deduplicates based on a set of boolean conditions.

NaiveDeduper

Bases: Deduper

A simple deduper that deduplicates based on a set of boolean conditions.

Methods:

  • prepare

    Prepare the deduper for deduplication.

  • dedupe

    Deduplicate the dataframe.

Attributes:

settings instance-attribute
settings: NaiveSettings
prepare
prepare(data: DataFrame) -> None

Prepare the deduper for deduplication.

dedupe
dedupe(data: DataFrame) -> DataFrame

Deduplicate the dataframe.

base

Base class for deduplication methodologies.

Classes:

  • DeduperSettings

    A data class to enforce basic settings dictionary shapes.

  • Deduper

    A base class for dedupers.

DeduperSettings

Bases: BaseModel

A data class to enforce basic settings dictionary shapes.

Attributes:

id class-attribute instance-attribute
id: str = Field(default='id', description='A unique ID field in the data to dedupe')
Deduper

Bases: BaseModel, ABC

A base class for dedupers.

Methods:

  • prepare

    Prepare the deduper for deduplication.

  • dedupe

    Deduplicate the dataframe.

Attributes:

settings instance-attribute
settings: DeduperSettings
prepare abstractmethod
prepare(data: DataFrame) -> None

Prepare the deduper for deduplication.

dedupe abstractmethod
dedupe(data: DataFrame) -> DataFrame

Deduplicate the dataframe.

naive

A deduplication methodology based on a deterministic set of conditions.

Classes:

  • NaiveSettings

    A data class to enforce the Naive deduper’s settings dictionary shape.

  • NaiveDeduper

    A simple deduper that deduplicates based on a set of boolean conditions.

NaiveSettings

Bases: DeduperSettings

A data class to enforce the Naive deduper’s settings dictionary shape.

Attributes:

unique_fields class-attribute instance-attribute
unique_fields: list[str] = Field(description='A list of fields that will form a unique, deduplicated record')
id class-attribute instance-attribute
id: str = Field(default='id', description='A unique ID field in the data to dedupe')
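
The two settings above suggest a simple rule: records that agree on every field in unique_fields are duplicates of one another. A minimal sketch of that idea in plain Python follows. The function naive_dedupe and its return shape (sorted ID pairs) are assumptions made for illustration; the documented dedupe method works on DataFrames and its output columns are not specified here.

```python
from collections import defaultdict
from itertools import combinations


def naive_dedupe(records, unique_fields, id_field="id"):
    """Return sorted ID pairs for records that agree on every unique field."""
    groups = defaultdict(list)
    for record in records:
        # Records sharing this key are considered the same entity.
        key = tuple(record[field] for field in unique_fields)
        groups[key].append(record[id_field])
    pairs = []
    for ids in groups.values():
        # Every pair within a group is a duplicate pair.
        pairs.extend(combinations(sorted(ids), 2))
    return sorted(pairs)


records = [
    {"id": 1, "name": "Acme Ltd", "postcode": "AB1 2CD"},
    {"id": 2, "name": "Acme Ltd", "postcode": "AB1 2CD"},
    {"id": 3, "name": "Acme Ltd", "postcode": "ZZ9 9ZZ"},
]
duplicate_pairs = naive_dedupe(records, ["name", "postcode"])
```

Here only records 1 and 2 share both name and postcode, so they form the single duplicate pair.
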
NaiveDeduper

Bases: Deduper

A simple deduper that deduplicates based on a set of boolean conditions.

Methods:

  • prepare

    Prepare the deduper for deduplication.

  • dedupe

    Deduplicate the dataframe.

Attributes:

settings instance-attribute
settings: NaiveSettings
prepare
prepare(data: DataFrame) -> None

Prepare the deduper for deduplication.

dedupe
dedupe(data: DataFrame) -> DataFrame

Deduplicate the dataframe.

linkers

Linking methodologies.

Modules:

  • base

    Base class for linkers.

  • deterministic

    A linking methodology based on a deterministic set of conditions.

  • splinklinker

    A linking methodology leveraging Splink.

  • weighteddeterministic

    A linking methodology that applies different weights to field comparisons.

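Classes:

  • DeterministicLinker

    A deterministic linker that links based on a set of boolean conditions.

  • SplinkLinker

    A linker that leverages Bayesian record linkage using Splink.

  • WeightedDeterministicLinker

    A deterministic linker that applies different weights to field comparisons.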

DeterministicLinker

Bases: Linker

A deterministic linker that links based on a set of boolean conditions.

Uses DuckDB as the SQL backend, enabling rich SQL operations while maintaining a Polars DataFrame interface. Supports both parallel matching (single round) and sequential matching (multiple rounds where matched records are removed after each round).

Methods:

  • prepare

    Prepare the linker for linking.

  • link

    Link the left and right dataframes.

Attributes:

settings instance-attribute
prepare
prepare(left: DataFrame, right: DataFrame) -> None

Prepare the linker for linking.

link
link(left: DataFrame, right: DataFrame) -> DataFrame

Link the left and right dataframes.

If comparisons is a flat list, applies all comparisons in parallel. If comparisons is a nested list, applies each round sequentially, removing matched records from the pool after each round.

SplinkLinker

Bases: Linker

A linker that leverages Bayesian record linkage using Splink.

Methods:

  • prepare

    Prepare the linker for linking.

  • link

    Link the left and right dataframes.

Attributes:

settings instance-attribute
settings: SplinkSettings
prepare
prepare(left: DataFrame, right: DataFrame) -> None

Prepare the linker for linking.

link
link(left: DataFrame = None, right: DataFrame = None) -> DataFrame

Link the left and right dataframes.

WeightedDeterministicLinker

Bases: Linker

A deterministic linker that applies different weights to field comparisons.

Methods:

  • prepare

    Prepare the linker for linking.

  • link

    Link the left and right dataframes.

Attributes:

settings instance-attribute
prepare
prepare(left: DataFrame, right: DataFrame) -> None

Prepare the linker for linking.

link
link(left: DataFrame, right: DataFrame) -> DataFrame

Link the left and right dataframes.

base

Base class for linkers.

Classes:

  • LinkerSettings

    A data class to enforce basic settings dictionary shapes.

  • Linker

    A base class for linkers.

LinkerSettings

Bases: BaseModel

A data class to enforce basic settings dictionary shapes.

Attributes:

left_id class-attribute instance-attribute
left_id: str = Field(default='id', description='The unique ID field in the left data')
right_id class-attribute instance-attribute
right_id: str = Field(default='id', description='The unique ID field in the right data')
Linker

Bases: BaseModel, ABC

A base class for linkers.

Methods:

  • prepare

    Prepare the linker for linking.

  • link

    Link the left and right dataframes.

Attributes:

settings instance-attribute
settings: LinkerSettings
prepare abstractmethod
prepare(left: DataFrame, right: DataFrame) -> None

Prepare the linker for linking.

link
link(left: DataFrame, right: DataFrame) -> DataFrame

Link the left and right dataframes.

deterministic

A linking methodology based on a deterministic set of conditions.

Classes:

  • DeterministicSettings

    A data class to enforce the Deterministic linker’s settings dictionary shape.

  • DeterministicLinker

    A deterministic linker that links based on a set of boolean conditions.

DeterministicSettings

Bases: LinkerSettings

A data class to enforce the Deterministic linker’s settings dictionary shape.

Methods:

  • validate_comparison

    Normalise to list of lists format.

Attributes:

comparisons class-attribute instance-attribute
comparisons: list[str] | list[list[str]]

Field description:

Comparison rules for matching using DuckDB SQL syntax.

Can be specified as:

  • A flat list of strings: all comparisons applied in parallel (OR logic)

  • A nested list of lists: sequential rounds of matching

Flat list (parallel):

    [
        "left.company_number = right.company_number",
        "left.name = right.name",
    ]

All comparisons are applied to the full datasets and the results are unioned.

Nested list (sequential rounds):

    [
        [
            "left.company_number = right.company_number",
            "left.name = right.name",
        ],
        [
            "left.name_normalised = right.name_normalised",
            "left.website = right.website",
        ],
    ]

Each inner list is a "round". Within each round, comparisons use OR logic. After each round, matched records are removed from the pool before the next round.

Use left.field and right.field to refer to columns in the respective sources. Supports all DuckDB SQL operations and functions.

left_id class-attribute instance-attribute
left_id: str = Field(default='id', description='The unique ID field in the left data')
right_id class-attribute instance-attribute
right_id: str = Field(default='id', description='The unique ID field in the right data')
validate_comparison classmethod
validate_comparison(value: str | list[str] | list[list[str]]) -> list[list[str]]

Normalise to list of lists format.
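
Given the signature above, the normalisation can be pictured as follows: a single string becomes one round with one comparison, a flat list becomes one round, and a nested list is kept as is. This is a sketch under those assumptions; normalise_comparisons is a hypothetical stand-alone function, not the validator's actual body.

```python
def normalise_comparisons(value):
    """Coerce a str, list[str] or list[list[str]] into list[list[str]]."""
    if isinstance(value, str):
        # One comparison, one round.
        return [[value]]
    if all(isinstance(item, str) for item in value):
        # A flat list is treated as a single round.
        return [list(value)]
    # Already nested: copy each round.
    return [list(round_) for round_ in value]
```

Normalising up front means the linking step only ever has to iterate over rounds, whichever form the user supplied.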

DeterministicLinker

Bases: Linker

A deterministic linker that links based on a set of boolean conditions.

Uses DuckDB as the SQL backend, enabling rich SQL operations while maintaining a Polars DataFrame interface. Supports both parallel matching (single round) and sequential matching (multiple rounds where matched records are removed after each round).

Methods:

  • prepare

    Prepare the linker for linking.

  • link

    Link the left and right dataframes.

Attributes:

settings instance-attribute
prepare
prepare(left: DataFrame, right: DataFrame) -> None

Prepare the linker for linking.

link
link(left: DataFrame, right: DataFrame) -> DataFrame

Link the left and right dataframes.

If comparisons is a flat list, applies all comparisons in parallel. If comparisons is a nested list, applies each round sequentially, removing matched records from the pool after each round.
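
The difference between parallel and sequential matching can be sketched in plain Python. In this illustration, Python predicates stand in for the SQL comparison strings, records are dictionaries rather than DataFrames, and matched records are assumed to be removed from both sides after each round. The function link_in_rounds and the sample data are hypothetical, and DuckDB is not used.

```python
def link_in_rounds(left, right, rounds):
    """Apply each round with OR logic, dropping matched records between rounds."""
    left_pool, right_pool = list(left), list(right)
    matches = set()
    for comparisons in rounds:
        found = {
            (l["id"], r["id"])
            for l in left_pool
            for r in right_pool
            if any(compare(l, r) for compare in comparisons)
        }
        matches |= found
        # Assumption: a record matched in this round leaves the pool on its side.
        matched_left = {left_id for left_id, _ in found}
        matched_right = {right_id for _, right_id in found}
        left_pool = [l for l in left_pool if l["id"] not in matched_left]
        right_pool = [r for r in right_pool if r["id"] not in matched_right]
    return matches


def same_number(l, r):
    return l["number"] is not None and l["number"] == r["number"]


def same_name(l, r):
    return l["name"] == r["name"]


left = [
    {"id": "L1", "number": "001", "name": "acme"},
    {"id": "L2", "number": None, "name": "bolt"},
]
right = [
    {"id": "R1", "number": "001", "name": "acme ltd"},
    {"id": "R2", "number": "999", "name": "bolt"},
    {"id": "R3", "number": "002", "name": "acme"},
]

# One round containing both comparisons: parallel matching.
parallel = link_in_rounds(left, right, [[same_number, same_name]])
# Two rounds: the name comparison only sees records the number comparison left behind.
sequential = link_in_rounds(left, right, [[same_number], [same_name]])
```

In the sequential case L1 is claimed by R1 in the first round, so the weaker name match between L1 and R3 never appears.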

splinklinker

A linking methodology leveraging Splink.

Classes:

  • SplinkLinkerFunction

    A method of splink.Linker.training used to train the linker.

  • SplinkSettings

    A data class to enforce the Splink linker’s settings dictionary shape.

  • SplinkLinker

    A linker that leverages Bayesian record linkage using Splink.

SplinkLinkerFunction

Bases: BaseModel

A method of splink.Linker.training used to train the linker.

Methods:

  • validate_function_and_arguments

    Ensure the function and arguments are valid.

Attributes:

function instance-attribute
function: str
arguments instance-attribute
arguments: dict[str, Any]
validate_function_and_arguments
validate_function_and_arguments() -> SplinkLinkerFunction

Ensure the function and arguments are valid.
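
One way to picture this kind of check is to look the function name up on a target object and compare the supplied argument names with the method's signature. The sketch below does that against a stand-in class. FakeTraining, its estimate_u method and check_function_and_arguments are hypothetical names; none of this is Splink's API or the validator's actual code.

```python
import inspect


class FakeTraining:
    """Stand-in for an object exposing training methods, for this sketch only."""

    def estimate_u(self, max_pairs: float = 1e6) -> None:
        """Hypothetical training method."""


def check_function_and_arguments(target: type, function: str, arguments: dict) -> None:
    """Raise ValueError if the method is missing or given unknown arguments."""
    method = getattr(target, function, None)
    if not callable(method):
        raise ValueError(f"{function!r} is not a method of {target.__name__}")
    allowed = set(inspect.signature(method).parameters) - {"self"}
    unknown = set(arguments) - allowed
    if unknown:
        raise ValueError(f"unexpected arguments for {function!r}: {sorted(unknown)}")


# Passes silently: the method exists and accepts max_pairs.
check_function_and_arguments(FakeTraining, "estimate_u", {"max_pairs": 1000})
```

Validating names at settings time surfaces typos before any training is attempted.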

SplinkSettings

Bases: LinkerSettings

A data class to enforce the Splink linker’s settings dictionary shape.

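Methods:

  • check_ids_match

    Ensure left_id and right_id match.

  • check_link_only

    Ensure link_type is set to “link_only”.

  • add_enforced_settings

    Ensure ID is the only field we link on.

  • load_linker_settings

    Load serialised settings into SettingsCreator.

  • serialise_settings

    Convert Splink settings to string.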

Attributes:

model_config class-attribute instance-attribute
model_config = ConfigDict(arbitrary_types_allowed=True)
linker_training_functions class-attribute instance-attribute
linker_training_functions: list[SplinkLinkerFunction]

Field description:

A list of dictionaries where keys are the names of methods for splink.Linker.training and values are dictionaries encoding the arguments of those methods. Each function will be run in the order supplied.

Example:

    linker_training_functions=[
        {
            "function": "estimate_probability_two_random_records_match",
            "arguments": {
                "deterministic_matching_rules": """
                    l.company_name = r.company_name
                """,
                "recall": 0.7,
            },
        },
        {
            "function": "estimate_u_using_random_sampling",
            "arguments": {"max_pairs": 1e6},
        }
    ]

linker_settings class-attribute instance-attribute
linker_settings: SettingsCreator

Field description:

A valid Splink SettingsCreator.

See Splink's documentation for a full description of available settings.
https://moj-analytical-services.github.io/splink/api_docs/settings_dict_guide.html

  • link_type must be set to "link_only"

  • unique_id_name is overridden to the value of left_id and right_id, which must match

Example:

    from splink import SettingsCreator, block_on
    import splink.comparison_library as cl
    import splink.comparison_template_library as ctl

    splink_settings = SettingsCreator(
        retain_matching_columns=False,
        retain_intermediate_calculation_columns=False,
        blocking_rules_to_generate_predictions=[
            block_on("company_name"),
            block_on("postcode"),
        ],
        comparisons=[
            cl.jaro_winkler_at_thresholds(
                "company_name",
                [0.9, 0.6],
                term_frequency_adjustments=True
            ),
            ctl.postcode_comparison("postcode"),
        ]
    )

threshold class-attribute instance-attribute
threshold: float | None = Field(default=None, gt=0, le=1)

Field description:

The probability above which matches will be kept.

None is used to indicate no threshold.

Inclusive, so a value of 1 will keep only exact matches across all comparisons.

left_id class-attribute instance-attribute
left_id: str = Field(default='id', description='The unique ID field in the left data')
right_id class-attribute instance-attribute
right_id: str = Field(default='id', description='The unique ID field in the right data')
check_ids_match
check_ids_match() -> SplinkSettings

Ensure left_id and right_id match.

check_link_only
check_link_only() -> SplinkSettings

Ensure link_type is set to “link_only”.

add_enforced_settings
add_enforced_settings() -> SplinkSettings

Ensure ID is the only field we link on.

load_linker_settings
load_linker_settings(value: str | SettingsCreator) -> SettingsCreator

Load serialised settings into SettingsCreator.

serialise_settings
serialise_settings(value: SettingsCreator, info: SerializationInfo) -> str

Convert Splink settings to string.

SplinkLinker

Bases: Linker

A linker that leverages Bayesian record linkage using Splink.

Methods:

  • prepare

    Prepare the linker for linking.

  • link

    Link the left and right dataframes.

Attributes:

settings instance-attribute
settings: SplinkSettings
prepare
prepare(left: DataFrame, right: DataFrame) -> None

Prepare the linker for linking.

link
link(left: DataFrame = None, right: DataFrame = None) -> DataFrame

Link the left and right dataframes.

weighteddeterministic

A linking methodology that applies different weights to field comparisons.

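Classes:

  • WeightedComparison

    A valid comparison and a weight to give it.

  • WeightedDeterministicSettings

    A data class to enforce the Weighted linker’s settings dictionary shape.

  • WeightedDeterministicLinker

    A deterministic linker that applies different weights to field comparisons.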

WeightedComparison

Bases: BaseModel

A valid comparison and a weight to give it.

Methods:

  • validate_comparison

    Validate the comparison string.

Attributes:

comparison class-attribute instance-attribute
comparison: str

Field description:

A valid ON clause to compare fields between the left and the right data.

Use left.field and right.field to refer to fields in the respective sources.

For example:

    "left.company_name = right.company_name"

weight class-attribute instance-attribute
weight: float

Field description:

A weight to give this comparison. Use 1 for all comparisons to give uniform weight to each.

validate_comparison classmethod
validate_comparison(v: str) -> str

Validate the comparison string.

WeightedDeterministicSettings

Bases: LinkerSettings

A data class to enforce the Weighted linker’s settings dictionary shape.

Example

    {
        left_id: "hash",
        right_id: "hash",
        weighted_comparisons: [
            ("l.company_name = r.company_name", 0.7),
            ("l.postcode = r.postcode", 0.7),
            ("l.company_id = r.company_id", 1),
        ],
        threshold: 0.8,
    }

Attributes:

weighted_comparisons class-attribute instance-attribute
weighted_comparisons: list[WeightedComparison] = Field(description='A list of tuples in the form of a comparison, and a weight.')
threshold class-attribute instance-attribute
threshold: float = Field(ge=0, le=1)

Field description:

The probability above which matches will be kept.

Inclusive, so a value of 1 will keep only exact matches across all comparisons.

left_id class-attribute instance-attribute
left_id: str = Field(default='id', description='The unique ID field in the left data')
right_id class-attribute instance-attribute
right_id: str = Field(default='id', description='The unique ID field in the right data')
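
How the weights and the threshold interact is not spelled out above. One plausible reading, assumed here and not confirmed by this page, is that a pair's score is the sum of the weights of the comparisons that hold, divided by the sum of all weights, and that pairs scoring at or above the threshold are kept. The sketch below uses Python predicates in place of SQL strings; weighted_score and field_equal are hypothetical names.

```python
def field_equal(field):
    """Build a predicate that is true when both records share a field value."""
    def compare(left_record, right_record):
        return left_record[field] == right_record[field]
    return compare


def weighted_score(left_record, right_record, weighted_comparisons):
    """Assumed scoring: matched weight divided by total weight."""
    total = sum(weight for _, weight in weighted_comparisons)
    matched = sum(
        weight
        for compare, weight in weighted_comparisons
        if compare(left_record, right_record)
    )
    return matched / total


weighted_comparisons = [
    (field_equal("company_name"), 0.7),
    (field_equal("postcode"), 0.7),
    (field_equal("company_id"), 1),
]
threshold = 0.8

left_record = {"company_name": "acme", "postcode": "AB1 2CD", "company_id": "001"}
full_match = dict(left_record)
name_and_postcode_only = {**left_record, "company_id": "999"}

keep_full = weighted_score(left_record, full_match, weighted_comparisons) >= threshold
keep_partial = (
    weighted_score(left_record, name_and_postcode_only, weighted_comparisons) >= threshold
)
```

Under this assumed scoring, agreeing on name and postcode alone gives 1.4 out of 2.4, which falls below a threshold of 0.8, while agreeing on all three fields scores 1.
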
WeightedDeterministicLinker

Bases: Linker

A deterministic linker that applies different weights to field comparisons.

Methods:

  • prepare

    Prepare the linker for linking.

  • link

    Link the left and right dataframes.

Attributes:

settings instance-attribute
prepare
prepare(left: DataFrame, right: DataFrame) -> None

Prepare the linker for linking.

link
link(left: DataFrame, right: DataFrame) -> DataFrame

Link the left and right dataframes.

models

Functions and classes to define, run and register models.

Classes:

  • Model

    Unified model class for both linking and deduping operations.

Functions:

  • add_model_class

    Add custom deduper or linker.

  • post_run

    Decorator to ensure that a method is called after model run.
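
A decorator of the kind post_run describes could be sketched as a guard that checks for results before delegating to the wrapped method. The version below is an illustration only: the name post_run_sketch, the SketchModel class, the check on a results attribute and the choice of RuntimeError are all assumptions, not the library's code.

```python
import functools


def post_run_sketch(method):
    """Refuse to call the wrapped method until results exist."""
    @functools.wraps(method)
    def wrapper(self, *args, **kwargs):
        # Assumption: an unset results attribute means the model has not run.
        if self.results is None:
            raise RuntimeError("Run the model before calling this method.")
        return method(self, *args, **kwargs)
    return wrapper


class SketchModel:
    """Hypothetical minimal model used to exercise the decorator."""

    def __init__(self):
        self.results = None

    def run(self):
        self.results = [("a", "b")]
        return self.results

    @post_run_sketch
    def count_results(self):
        return len(self.results)
```

Calling count_results on a fresh SketchModel raises, while calling it after run returns the number of stored results.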

Model

Unified model class for both linking and deduping operations.

Parameters:

  • dag
    (DAG) –

    DAG containing this model.

  • name
    (str) –

    Unique name for the model.

  • truth
    (float, default: 1.0 ) –

    Truth threshold. Defaults to 1.0. Can be set later after analysis.

  • model_class
    (type[Deduper] | type[Linker] | str) –

    Class of Linker or Deduper, or its name.

  • model_settings
    (DeduperSettings | LinkerSettings | dict) –

    Appropriate settings object to pass to model class.

  • left_query
    (Query) –

    The query that will get the data to deduplicate, or the data to link on the left.

  • right_query
    (Query | None, default: None ) –

    The query that will get the data to link on the right.

  • description
    (str | None, default: None ) –

    Optional description of the model.

Methods:

  • to_resolution

    Convert to Resolution for API calls.

  • from_resolution

    Reconstruct from Resolution.

  • delete

    Delete the model from the database.

  • run

    Execute the model pipeline and return results.

  • sync

    Send the model config and results to the server.

  • download_results

    Retrieve results associated with the model from the database.

  • query

    Generate a query for this model.

Attributes:

dag instance-attribute
dag = dag
name instance-attribute
name = name
description instance-attribute
description = description
left_query instance-attribute
left_query = left_query
right_query instance-attribute
right_query = right_query
results instance-attribute
results: ModelResults | None = None
model_class instance-attribute
model_class: type[Linker | Deduper] = _MODEL_CLASSES[model_class]
model_instance instance-attribute
model_instance = model_class(settings=model_settings)
model_type instance-attribute
model_type: ModelType = LINKER if issubclass(model_class, Linker) else DEDUPER
model_settings instance-attribute
model_settings = SettingsClass(**model_settings)
config property
config: ModelConfig

Generate config DTO from Model.

sources property

Set of source names upstream of this node.

resolution_path property
resolution_path: ModelResolutionPath

Returns the model resolution path.

truth property writable
truth: float | None

Returns the truth threshold for the model as a float.

to_resolution
to_resolution() -> Resolution

Convert to Resolution for API calls.

from_resolution classmethod
from_resolution(resolution: Resolution, resolution_name: str, dag: DAG) -> Model

Reconstruct from Resolution.

delete
delete(certain: bool = False) -> bool

Delete the model from the database.

run
run(for_validation: bool = False, cache_queries: bool = False) -> ModelResults

Execute the model pipeline and return results.

Parameters:

  • for_validation
    (bool, default: False ) –

    Whether to download and store extra data to explore and score results.

  • cache_queries
    (bool, default: False ) –

    Whether to cache query results on first run and re-use them subsequently.

sync
sync() -> None

Send the model config and results to the server.

Not resistant to race conditions: only one client should call sync at a time.

download_results
download_results() -> ModelResults

Retrieve results associated with the model from the database.

query
query(*sources: Source, **kwargs: Any) -> Query

Generate a query for this model.

add_model_class

add_model_class(ModelClass: type[Linker] | type[Deduper]) -> None

Add custom deduper or linker.

post_run

post_run(method: Callable[..., T]) -> Callable[..., T]

Decorator to ensure that a method is called after model run.

Raises: