Models

matchbox.client.models

Deduplication and linking methodologies.

Modules:

  • dedupers

    Deduplication methodologies.

  • linkers

    Linking methodologies.

  • models

    Functions and classes to define, run and register models.

dedupers

Deduplication methodologies.

Modules:

  • base

    Base class for deduplication methodologies.

  • naive

    A deduplication methodology based on a deterministic set of conditions.

Classes:

  • NaiveDeduper

    A simple deduper that deduplicates based on a set of boolean conditions.

NaiveDeduper

Bases: Deduper

A simple deduper that deduplicates based on a set of boolean conditions.

Methods:

  • from_settings

    Create a NaiveDeduper from a settings dictionary.

  • prepare

    Prepare the deduper for deduplication.

  • dedupe

    Deduplicate the dataframe.

Attributes:

settings instance-attribute
settings: NaiveSettings
from_settings classmethod
from_settings(
    id: str, unique_fields: list[str]
) -> NaiveDeduper

Create a NaiveDeduper from a settings dictionary.

prepare
prepare(data: DataFrame) -> None

Prepare the deduper for deduplication.

dedupe
dedupe(data: DataFrame) -> DataFrame

Deduplicate the dataframe.

base

Base class for deduplication methodologies.

Classes:

  • DeduperSettings

    A data class to enforce basic settings dictionary shapes.

  • Deduper

    A base class for dedupers.

DeduperSettings

Bases: BaseModel

A data class to enforce basic settings dictionary shapes.

Attributes:

id class-attribute instance-attribute
id: str = Field(
    description="A unique ID column in the table to dedupe"
)
Deduper

Bases: BaseModel, ABC

A base class for dedupers.

Methods:

  • from_settings

    Create a Deduper from a settings dictionary.

  • prepare

    Prepare the deduper for deduplication.

  • dedupe

    Deduplicate the dataframe.

Attributes:

settings instance-attribute
settings: DeduperSettings
from_settings abstractmethod classmethod
from_settings() -> Deduper

Create a Deduper from a settings dictionary.

prepare abstractmethod
prepare(data: DataFrame) -> None

Prepare the deduper for deduplication.

dedupe abstractmethod
dedupe(data: DataFrame) -> DataFrame

Deduplicate the dataframe.

naive

A deduplication methodology based on a deterministic set of conditions.

Classes:

  • NaiveSettings

    A data class to enforce the Naive deduper’s settings dictionary shape.

  • NaiveDeduper

    A simple deduper that deduplicates based on a set of boolean conditions.

NaiveSettings

Bases: DeduperSettings

A data class to enforce the Naive deduper’s settings dictionary shape.

Attributes:

id class-attribute instance-attribute
id: str = Field(
    description="A unique ID column in the table to dedupe"
)
unique_fields class-attribute instance-attribute
unique_fields: list[str] = Field(
    description="A list of columns that will form a unique, deduplicated record"
)
NaiveDeduper

Bases: Deduper

A simple deduper that deduplicates based on a set of boolean conditions.

Methods:

  • from_settings

    Create a NaiveDeduper from a settings dictionary.

  • prepare

    Prepare the deduper for deduplication.

  • dedupe

    Deduplicate the dataframe.

Attributes:

settings instance-attribute
settings: NaiveSettings
from_settings classmethod
from_settings(
    id: str, unique_fields: list[str]
) -> NaiveDeduper

Create a NaiveDeduper from a settings dictionary.

prepare
prepare(data: DataFrame) -> None

Prepare the deduper for deduplication.

dedupe
dedupe(data: DataFrame) -> DataFrame

Deduplicate the dataframe.
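
As a rough illustration of deduplicating on a deterministic set of conditions, the sketch below groups records that agree on every column in unique_fields and pairs up their IDs. It works on plain dictionaries rather than a DataFrame, and the names naive_dedupe_pairs and id_field are hypothetical. It is an assumption about the general technique, not matchbox's implementation.

```python
from collections import defaultdict
from itertools import combinations


def naive_dedupe_pairs(records, id_field, unique_fields):
    """Return sorted ID pairs of records that agree on all unique_fields.

    Hypothetical helper for illustration only; not part of matchbox.
    """
    # Bucket record IDs by the tuple of values in the unique fields.
    groups = defaultdict(list)
    for record in records:
        key = tuple(record[field] for field in unique_fields)
        groups[key].append(record[id_field])

    # Every pair of IDs sharing a bucket is treated as a duplicate pair.
    pairs = []
    for ids in groups.values():
        pairs.extend(combinations(sorted(ids), 2))
    return sorted(pairs)


records = [
    {"pk": 1, "company_name": "Acme", "postcode": "AB1 2CD"},
    {"pk": 2, "company_name": "Acme", "postcode": "AB1 2CD"},
    {"pk": 3, "company_name": "Acme", "postcode": "ZZ9 9ZZ"},
    {"pk": 4, "company_name": "Bolt", "postcode": "EF3 4GH"},
]

duplicate_pairs = naive_dedupe_pairs(
    records, id_field="pk", unique_fields=["company_name", "postcode"]
)
print(duplicate_pairs)
```

Only records 1 and 2 agree on both fields, so they form the single duplicate pair.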

linkers

Linking methodologies.

Modules:

  • base

    Base class for linkers.

  • deterministic

    A linking methodology based on a deterministic set of conditions.

  • splinklinker

    A linking methodology leveraging Splink.

  • weighteddeterministic

    A linking methodology that applies different weights to field comparisons.

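Classes:

  • DeterministicLinker

    A deterministic linker that links based on a set of boolean conditions.

  • SplinkLinker

    A linker that leverages Bayesian record linkage using Splink.

  • WeightedDeterministicLinker

    A deterministic linker that applies different weights to field comparisons.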

DeterministicLinker

Bases: Linker

A deterministic linker that links based on a set of boolean conditions.

Methods:

  • from_settings

    Create a DeterministicLinker from a settings dictionary.

  • prepare

    Prepare the linker for linking.

  • link

    Link the left and right dataframes.

Attributes:

settings instance-attribute
settings: DeterministicSettings
from_settings classmethod
from_settings(
    left_id: str, right_id: str, comparisons: str
) -> DeterministicLinker

Create a DeterministicLinker from a settings dictionary.

prepare
prepare(left: DataFrame, right: DataFrame) -> None

Prepare the linker for linking.

link
link(left: DataFrame, right: DataFrame) -> DataFrame

Link the left and right dataframes.

SplinkLinker

Bases: Linker

A linker that leverages Bayesian record linkage using Splink.

Methods:

  • from_settings

    Create a SplinkLinker from a settings dictionary.

  • prepare

    Prepare the linker for linking.

  • link

    Link the left and right dataframes.

Attributes:

settings instance-attribute
settings: SplinkSettings
from_settings classmethod
from_settings(
    left_id: str,
    right_id: str,
    linker_training_functions: list[dict[str, Any]],
    linker_settings: SettingsCreator,
    threshold: float,
) -> SplinkLinker

Create a SplinkLinker from a settings dictionary.

prepare
prepare(left: DataFrame, right: DataFrame) -> None

Prepare the linker for linking.

link
link(
    left: DataFrame = None, right: DataFrame = None
) -> DataFrame

Link the left and right dataframes.

WeightedDeterministicLinker

Bases: Linker

A deterministic linker that applies different weights to field comparisons.

Methods:

  • from_settings

    Create a WeightedDeterministicLinker from a settings dictionary.

  • prepare

    Prepare the linker for linking.

  • link

    Link the left and right dataframes.

Attributes:

settings instance-attribute
settings: WeightedDeterministicSettings
from_settings classmethod
from_settings(
    left_id: str,
    right_id: str,
    weighted_comparisons: list[dict[str, Any]],
    threshold: float,
) -> WeightedDeterministicLinker

Create a WeightedDeterministicLinker from a settings dictionary.

prepare
prepare(left: DataFrame, right: DataFrame) -> None

Prepare the linker for linking.

link
link(left: DataFrame, right: DataFrame) -> DataFrame

Link the left and right dataframes.

base

Base class for linkers.

Classes:

  • LinkerSettings

    A data class to enforce basic settings dictionary shapes.

  • Linker

    A base class for linkers.

LinkerSettings

Bases: BaseModel

A data class to enforce basic settings dictionary shapes.

Attributes:

left_id class-attribute instance-attribute
left_id: str = Field(
    description="The unique ID column in the left dataset"
)
right_id class-attribute instance-attribute
right_id: str = Field(
    description="The unique ID column in the right dataset"
)
Linker

Bases: BaseModel, ABC

A base class for linkers.

Methods:

  • from_settings

    Create a Linker from a settings dictionary.

  • prepare

    Prepare the linker for linking.

  • link

    Link the left and right dataframes.

Attributes:

settings instance-attribute
settings: LinkerSettings
from_settings abstractmethod classmethod
from_settings() -> Linker

Create a Linker from a settings dictionary.

prepare abstractmethod
prepare(left: DataFrame, right: DataFrame) -> None

Prepare the linker for linking.

link
link(left: DataFrame, right: DataFrame) -> DataFrame

Link the left and right dataframes.

deterministic

A linking methodology based on a deterministic set of conditions.

Classes:

  • DeterministicSettings

    A data class to enforce the Deterministic linker’s settings dictionary shape.

  • DeterministicLinker

    A deterministic linker that links based on a set of boolean conditions.

DeterministicSettings

Bases: LinkerSettings

A data class to enforce the Deterministic linker’s settings dictionary shape.

Methods:

  • validate_comparison

    Validate the comparison string.

Attributes:

left_id class-attribute instance-attribute
left_id: str = Field(
    description="The unique ID column in the left dataset"
)
right_id class-attribute instance-attribute
right_id: str = Field(
    description="The unique ID column in the right dataset"
)
comparisons class-attribute instance-attribute
comparisons: str = Field(
    description="""
        A valid ON clause to compare fields between the left and
        the right data.

        Use left.field and right.field to refer to columns in the
        respective sources.

        For example:

        "left.name = right.name and left.company_id = right.id"
    """
)
validate_comparison classmethod
validate_comparison(v: str) -> str

Validate the comparison string.
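
To make the comparisons format concrete, here is a minimal sketch that applies the example ON clause to two small tables using Python's built-in sqlite3 module. The helper link_on, the table names left_table and right_table, and the rewriting of the left. and right. prefixes to short aliases are all assumptions made for this illustration; the source does not say how matchbox evaluates the clause.

```python
import re
import sqlite3


def link_on(conn, comparisons, left_id, right_id):
    """Join left_table and right_table on a comparisons string.

    Hypothetical helper: the clause is interpolated without sanitising,
    which is acceptable for a sketch but not for untrusted input.
    """
    # Map the documented left./right. prefixes onto short table aliases.
    on_clause = re.sub(r"\bleft\.", "l.", comparisons)
    on_clause = re.sub(r"\bright\.", "r.", on_clause)
    query = (
        f"SELECT l.{left_id}, r.{right_id} "
        f"FROM left_table AS l JOIN right_table AS r ON {on_clause} "
        f"ORDER BY 1, 2"
    )
    return conn.execute(query).fetchall()


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE left_table (id INTEGER, name TEXT, company_id INTEGER)")
conn.execute("CREATE TABLE right_table (id INTEGER, name TEXT)")
conn.executemany(
    "INSERT INTO left_table VALUES (?, ?, ?)",
    [(1, "Acme", 10), (2, "Bolt", 20), (3, "Acme", 99)],
)
conn.executemany(
    "INSERT INTO right_table VALUES (?, ?)",
    [(10, "Acme"), (20, "Crux")],
)

links = link_on(
    conn,
    "left.name = right.name and left.company_id = right.id",
    left_id="id",
    right_id="id",
)
print(links)
```

Only left record 1 satisfies both conditions against right record 10, so a single link is returned.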

DeterministicLinker

Bases: Linker

A deterministic linker that links based on a set of boolean conditions.

Methods:

  • from_settings

    Create a DeterministicLinker from a settings dictionary.

  • prepare

    Prepare the linker for linking.

  • link

    Link the left and right dataframes.

Attributes:

settings instance-attribute
settings: DeterministicSettings
from_settings classmethod
from_settings(
    left_id: str, right_id: str, comparisons: str
) -> DeterministicLinker

Create a DeterministicLinker from a settings dictionary.

prepare
prepare(left: DataFrame, right: DataFrame) -> None

Prepare the linker for linking.

link
link(left: DataFrame, right: DataFrame) -> DataFrame

Link the left and right dataframes.

splinklinker

A linking methodology leveraging Splink.

Classes:

  • SplinkLinkerFunction

    A method of splink.Linker.training used to train the linker.

  • SplinkSettings

    A data class to enforce the Splink linker’s settings dictionary shape.

  • SplinkLinker

    A linker that leverages Bayesian record linkage using Splink.

SplinkLinkerFunction

Bases: BaseModel

A method of splink.Linker.training used to train the linker.

Methods:

  • validate_function_and_arguments

    Ensure the function and arguments are valid.

Attributes:

function instance-attribute
function: str
arguments instance-attribute
arguments: dict[str, Any]
validate_function_and_arguments
validate_function_and_arguments() -> SplinkLinkerFunction

Ensure the function and arguments are valid.

SplinkSettings

Bases: LinkerSettings

A data class to enforce the Splink linker’s settings dictionary shape.

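Methods:

  • check_ids_match

    Ensure left_id and right_id match.

  • check_link_only

    Ensure link_type is set to “link_only”.

  • add_enforced_settings

    Ensure ID is the only field we link on.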

Attributes:

left_id class-attribute instance-attribute
left_id: str = Field(
    description="The unique ID column in the left dataset"
)
right_id class-attribute instance-attribute
right_id: str = Field(
    description="The unique ID column in the right dataset"
)
model_config class-attribute instance-attribute
model_config = ConfigDict(arbitrary_types_allowed=True)
database_api class-attribute instance-attribute
database_api: Type[DuckDBAPI] = Field(
    default=DuckDBAPI,
    description="The Splink DB API, to choose between DuckDB (default) and Spark (untested)",
)
linker_training_functions class-attribute instance-attribute
linker_training_functions: list[SplinkLinkerFunction] = (
    Field(
        description='''
            A list of dictionaries where keys are the names of methods for
            splink.Linker.training and values are dictionaries encoding the arguments of
            those methods. Each function will be run in the order supplied.

            Example:

                >>> linker_training_functions=[
                ...     {
                ...         "function": "estimate_probability_two_random_records_match",
                ...         "arguments": {
                ...             "deterministic_matching_rules": """
                ...                 l.company_name = r.company_name
                ...             """,
                ...             "recall": 0.7,
                ...         },
                ...     },
                ...     {
                ...         "function": "estimate_u_using_random_sampling",
                ...         "arguments": {"max_pairs": 1e6},
                ...     }
                ... ]
        '''
    )
)
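
The "run in the order supplied" behaviour can be pictured as a simple name-based dispatch. The sketch below is an assumption about the general pattern, not matchbox's or Splink's code: FakeTraining is a hypothetical stand-in that only records calls, and run_training_functions is a hypothetical helper.

```python
class FakeTraining:
    """Hypothetical stand-in for an object exposing training methods.

    It only records which methods were called; it is NOT Splink's API.
    """

    def __init__(self):
        self.calls = []

    def estimate_probability_two_random_records_match(
        self, deterministic_matching_rules, recall
    ):
        self.calls.append(("estimate_probability_two_random_records_match", recall))

    def estimate_u_using_random_sampling(self, max_pairs):
        self.calls.append(("estimate_u_using_random_sampling", max_pairs))


def run_training_functions(training, functions):
    """Call each named method in order, unpacking its arguments dictionary."""
    for spec in functions:
        method = getattr(training, spec["function"], None)
        if not callable(method):
            raise ValueError(f"Unknown training function: {spec['function']}")
        method(**spec["arguments"])


linker_training_functions = [
    {
        "function": "estimate_probability_two_random_records_match",
        "arguments": {
            "deterministic_matching_rules": "l.company_name = r.company_name",
            "recall": 0.7,
        },
    },
    {
        "function": "estimate_u_using_random_sampling",
        "arguments": {"max_pairs": 1e6},
    },
]

training = FakeTraining()
run_training_functions(training, linker_training_functions)
print(training.calls)
```

Looking the method up by name keeps the settings dictionary serialisable, at the cost of only discovering a misspelt function name at validation or run time.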
linker_settings class-attribute instance-attribute
linker_settings: SettingsCreator = Field(
    description="""
        A valid Splink SettingsCreator.

        See Splink's documentation for a full description of available settings.
        https://moj-analytical-services.github.io/splink/api_docs/settings_dict_guide.html

        * link_type must be set to "link_only"
        * unique_id_column_name is overridden to the value of left_id and right_id,
            which must match

        Example:

            >>> from splink import SettingsCreator, block_on
            ... import splink.comparison_library as cl
            ... import splink.comparison_template_library as ctl
            ...
            ... splink_settings = SettingsCreator(
            ...     retain_matching_columns=False,
            ...     retain_intermediate_calculation_columns=False,
            ...     blocking_rules_to_generate_predictions=[
            ...         block_on("company_name"),
            ...         block_on("postcode"),
            ...     ],
            ...     comparisons=[
            ...         cl.jaro_winkler_at_thresholds(
            ...             "company_name",
            ...             [0.9, 0.6],
            ...             term_frequency_adjustments=True
            ...         ),
            ...         ctl.postcode_comparison("postcode"),
            ...     ]
            ... )
    """
)
threshold class-attribute instance-attribute
threshold: float | None = Field(
    default=None,
    description="""
        The probability above which matches will be kept.

        None is used to indicate no threshold.

        Inclusive, so a value of 1 will keep only exact matches across all
        comparisons.
    """,
    gt=0,
    le=1,
)
check_ids_match
check_ids_match() -> SplinkSettings

Ensure left_id and right_id match.

check_link_only
check_link_only() -> SplinkSettings

Ensure link_type is set to “link_only”.

add_enforced_settings
add_enforced_settings() -> SplinkSettings

Ensure ID is the only field we link on.

SplinkLinker

Bases: Linker

A linker that leverages Bayesian record linkage using Splink.

Methods:

  • from_settings

    Create a SplinkLinker from a settings dictionary.

  • prepare

    Prepare the linker for linking.

  • link

    Link the left and right dataframes.

Attributes:

settings instance-attribute
settings: SplinkSettings
from_settings classmethod
from_settings(
    left_id: str,
    right_id: str,
    linker_training_functions: list[dict[str, Any]],
    linker_settings: SettingsCreator,
    threshold: float,
) -> SplinkLinker

Create a SplinkLinker from a settings dictionary.

prepare
prepare(left: DataFrame, right: DataFrame) -> None

Prepare the linker for linking.

link
link(
    left: DataFrame = None, right: DataFrame = None
) -> DataFrame

Link the left and right dataframes.

weighteddeterministic

A linking methodology that applies different weights to field comparisons.

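Classes:

  • WeightedComparison

    A valid comparison and a weight to give it.

  • WeightedDeterministicSettings

    A data class to enforce the Weighted linker’s settings dictionary shape.

  • WeightedDeterministicLinker

    A deterministic linker that applies different weights to field comparisons.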

WeightedComparison

Bases: BaseModel

A valid comparison and a weight to give it.

Methods:

  • validate_comparison

    Validate the comparison string.

Attributes:

comparison class-attribute instance-attribute
comparison: str = Field(
    description="""
        A valid ON clause to compare fields between the left and
        the right data.

        Use left.field and right.field to refer to columns in the
        respective sources.

        For example:

        "left.company_name = right.company_name"
    """
)
weight class-attribute instance-attribute
weight: float = Field(
    description="""
        A weight to give this comparison. Use 1 for all comparisons to give
        uniform weight to each.
    """
)
validate_comparison classmethod
validate_comparison(v: str) -> str

Validate the comparison string.

WeightedDeterministicSettings

Bases: LinkerSettings

A data class to enforce the Weighted linker’s settings dictionary shape.

Example

    {
        left_id: "hash",
        right_id: "hash",
        weighted_comparisons: [
            ("l.company_name = r.company_name", 0.7),
            ("l.postcode = r.postcode", 0.7),
            ("l.company_id = r.company_id", 1),
        ],
        threshold: 0.8,
    }

Attributes:

left_id class-attribute instance-attribute
left_id: str = Field(
    description="The unique ID column in the left dataset"
)
right_id class-attribute instance-attribute
right_id: str = Field(
    description="The unique ID column in the right dataset"
)
weighted_comparisons class-attribute instance-attribute
weighted_comparisons: list[WeightedComparison] = Field(
    description="A list of tuples in the form of a comparison, and a weight."
)
threshold class-attribute instance-attribute
threshold: float = Field(
    description="""
        The probability above which matches will be kept.

        Inclusive, so a value of 1 will keep only exact matches across all
        comparisons.
    """,
    ge=0,
    le=1,
)
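
One way to read the weights and threshold together is as a normalised score: the weight of the comparisons a pair satisfies, divided by the total weight, kept when it is at or above the threshold. The sketch below works under that assumption. The scoring formula is not stated in the source, the helpers weighted_score and keep_links are hypothetical, and comparisons are simplified to field-equality checks instead of ON clauses.

```python
def weighted_score(left, right, weighted_fields):
    """Share of the total weight carried by fields on which two records agree.

    Assumed formula for illustration; not taken from matchbox.
    """
    total = sum(weight for _, weight in weighted_fields)
    agreed = sum(
        weight for field, weight in weighted_fields if left[field] == right[field]
    )
    return agreed / total


def keep_links(left, candidates, weighted_fields, threshold):
    """Keep candidate IDs whose score is at or above the threshold (inclusive)."""
    return [
        candidate["id"]
        for candidate in candidates
        if weighted_score(left, candidate, weighted_fields) >= threshold
    ]


weighted_fields = [("company_name", 0.7), ("postcode", 0.7), ("company_id", 1)]

left = {"id": "L1", "company_name": "Acme", "postcode": "AB1 2CD", "company_id": 7}
candidates = [
    {"id": "R1", "company_name": "Acme", "postcode": "ZZ9 9ZZ", "company_id": 7},
    {"id": "R2", "company_name": "Acme", "postcode": "AB1 2CD", "company_id": 8},
    {"id": "R3", "company_name": "Acme", "postcode": "AB1 2CD", "company_id": 7},
]

kept = keep_links(left, candidates, weighted_fields, threshold=0.7)
print(kept)
```

Under this reading, a threshold of 1 keeps only R3, the candidate that agrees on every comparison, which is consistent with the description of the threshold as inclusive.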
WeightedDeterministicLinker

Bases: Linker

A deterministic linker that applies different weights to field comparisons.

Methods:

  • from_settings

    Create a WeightedDeterministicLinker from a settings dictionary.

  • prepare

    Prepare the linker for linking.

  • link

    Link the left and right dataframes.

Attributes:

settings instance-attribute
settings: WeightedDeterministicSettings
from_settings classmethod
from_settings(
    left_id: str,
    right_id: str,
    weighted_comparisons: list[dict[str, Any]],
    threshold: float,
) -> WeightedDeterministicLinker

Create a WeightedDeterministicLinker from a settings dictionary.

prepare
prepare(left: DataFrame, right: DataFrame) -> None

Prepare the linker for linking.

link
link(left: DataFrame, right: DataFrame) -> DataFrame

Link the left and right dataframes.

models

Functions and classes to define, run and register models.

Classes:

  • Model

    Unified model class for both linking and deduping operations.

Functions:

  • make_model

    Create a unified model instance for either linking or deduping operations.

Model

Model(
    metadata: ModelMetadata,
    model_instance: Linker | Deduper,
    left_data: DataFrame,
    right_data: DataFrame | None = None,
)

Unified model class for both linking and deduping operations.

Methods:

  • insert_model

    Insert the model into the backend database.

  • delete

    Delete the model from the database.

  • run

    Execute the model pipeline and return results.

Attributes:

metadata instance-attribute
metadata = metadata
model_instance instance-attribute
model_instance = model_instance
left_data instance-attribute
left_data = left_data
right_data instance-attribute
right_data = right_data
results property writable
results: Results

Retrieve results associated with the model from the database.

truth property writable
truth: float

Retrieve the truth threshold for the model.

ancestors property
ancestors: dict[str, float]

Retrieve the ancestors of the model.

ancestors_cache property writable
ancestors_cache: dict[str, float]

Retrieve the ancestors cache of the model.

insert_model
insert_model() -> None

Insert the model into the backend database.

delete
delete(certain: bool = False) -> bool

Delete the model from the database.

run
run() -> Results

Execute the model pipeline and return results.

make_model

make_model(
    model_name: str,
    description: str,
    model_class: type[Linker] | type[Deduper],
    model_settings: dict[str, Any],
    left_data: DataFrame,
    left_resolution: str,
    right_data: DataFrame | None = None,
    right_resolution: str | None = None,
) -> Model

Create a unified model instance for either linking or deduping operations.

Parameters:

  • model_name
    (str) –

    Your unique identifier for the model

  • description
    (str) –

    Description of the model run

  • model_class
    (type[Linker] | type[Deduper]) –

    The Linker or Deduper subclass to use, for example DeterministicLinker or NaiveDeduper

  • model_settings
    (dict[str, Any]) –

    Configuration settings for the model

  • left_data
    (DataFrame) –

    Primary dataset

  • left_resolution
    (str) –

    Resolution name for primary model or dataset

  • right_data
    (DataFrame | None, default: None ) –

    Secondary dataset (linking only)

  • right_resolution
    (str | None, default: None ) –

    Resolution name for secondary model or dataset (linking only)

Returns:

  • Model ( Model ) –

    Configured model instance ready for execution
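
The idea of one model type serving both operations can be sketched as a dispatch on whether a right-hand dataset is supplied. Everything below is a hypothetical stand-in: ToyDeduper, ToyLinker and run_model are not matchbox classes, lists of dictionaries replace DataFrames, and the assumption that running a model calls prepare before dedupe or link is not confirmed by the source.

```python
class ToyDeduper:
    """Hypothetical deduper exposing the prepare/dedupe interface."""

    def prepare(self, data):
        self.prepared = True

    def dedupe(self, data):
        # Pair up records that share a name.
        return [
            (a["id"], b["id"])
            for i, a in enumerate(data)
            for b in data[i + 1:]
            if a["name"] == b["name"]
        ]


class ToyLinker:
    """Hypothetical linker exposing the prepare/link interface."""

    def prepare(self, left, right):
        self.prepared = True

    def link(self, left, right):
        # Link left and right records that share a name.
        return [(a["id"], b["id"]) for a in left for b in right if a["name"] == b["name"]]


def run_model(model_instance, left_data, right_data=None):
    """Dedupe when only left_data is given, otherwise link left to right."""
    if right_data is None:
        model_instance.prepare(left_data)
        return model_instance.dedupe(left_data)
    model_instance.prepare(left_data, right_data)
    return model_instance.link(left_data, right_data)


left = [{"id": 1, "name": "Acme"}, {"id": 2, "name": "Acme"}, {"id": 3, "name": "Bolt"}]
right = [{"id": 10, "name": "Bolt"}]

dedupe_results = run_model(ToyDeduper(), left)
link_results = run_model(ToyLinker(), left, right)
print(dedupe_results, link_results)
```

This mirrors the signature above, where right_data and right_resolution default to None and are described as linking only.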