Models¶

matchbox.client.models ¶

Deduplication and linking methodologies.

Modules:

comparison –

Functions to compare fields in different sources.
dedupers –

Deduplication methodologies.
linkers –

Linking methodologies.
models –

Functions and classes to define, run and register models.

Classes:

Model –

Unified model class for both linking and deduping operations.

Functions:

add_model_class –

Add custom deduper or linker.

Model ¶

Model(name: str, dag: DAG, model_class: type[Deduper], model_settings: DeduperSettings | dict, left_query: Query, right_query: None = None, truth: float = 1.0, description: str | None = None)

Model(dag: DAG, name: str, model_class: type[Linker], model_settings: LinkerSettings | dict, left_query: Query, right_query: Query, truth: float = 1.0, description: str | None = None)

Model(dag: DAG, name: str, model_class: type[Deduper] | type[Linker] | str, model_settings: DeduperSettings | LinkerSettings | dict, left_query: Query, right_query: Query | None = None, truth: float = 1.0, description: str | None = None)

Unified model class for both linking and deduping operations.

Parameters:

dag ¶
(DAG) –

DAG containing this model.
name ¶
(str) –

Unique name for the model.
truth ¶
(float, default: 1.0 ) –

Truth threshold. Defaults to 1.0. Can be set later after analysis.
model_class ¶
(type[Deduper] | type[Linker] | str) –

Class of Linker or Deduper, or its name.
model_settings ¶
(DeduperSettings | LinkerSettings | dict) –

Appropriate settings object to pass to model class.
left_query ¶
(Query) –

The query that will get the data to deduplicate, or the data to link on the left.
right_query ¶
(Query | None, default: None ) –

The query that will get the data to link on the right.
description ¶
(str | None, default: None ) –

Optional description of the model.

Methods:

to_resolution –

Convert to Resolution for API calls.
from_resolution –

Reconstruct from Resolution.
delete –

Delete the model from the database.
run –

Execute the model pipeline and return results.
sync –

Send the model config and results to the server.
download_results –

Retrieve results associated with the model from the database.
query –

Generate a query for this model.

Attributes:

dag –
name –
description –
left_query –
right_query –
results (ModelResults | None) –
model_class (type[Linker | Deduper]) –
model_instance –
model_type (ModelType) –
model_settings –
config (ModelConfig) –

Generate config DTO from Model.
sources (set[SourceResolutionName]) –

Set of source names upstream of this node.
resolution_path (ModelResolutionPath) –

Returns the model resolution path.
truth (float | None) –

Returns the truth threshold for the model as a float.

dag `instance-attribute` ¶

dag = dag

name `instance-attribute` ¶

name = name

description `instance-attribute` ¶

description = description

left_query `instance-attribute` ¶

left_query = left_query

right_query `instance-attribute` ¶

right_query = right_query

results `instance-attribute` ¶

results: ModelResults | None = None

model_class `instance-attribute` ¶

model_class: type[Linker | Deduper] = _MODEL_CLASSES[model_class]

model_instance `instance-attribute` ¶

model_instance = model_class(settings=model_settings)

model_type `instance-attribute` ¶

model_type: ModelType = LINKER if issubclass(model_class, Linker) else DEDUPER

model_settings `instance-attribute` ¶

model_settings = SettingsClass(**model_settings)

config `property` ¶

config: ModelConfig

Generate config DTO from Model.

sources `property` ¶

sources: set[SourceResolutionName]

Set of source names upstream of this node.

resolution_path `property` ¶

resolution_path: ModelResolutionPath

Returns the model resolution path.

truth `property` `writable` ¶

truth: float | None

Returns the truth threshold for the model as a float.

to_resolution ¶

to_resolution() -> Resolution

Convert to Resolution for API calls.

from_resolution `classmethod` ¶

from_resolution(resolution: Resolution, resolution_name: str, dag: DAG) -> Model

Reconstruct from Resolution.

delete ¶

delete(certain: bool = False) -> bool

Delete the model from the database.

run ¶

run(for_validation: bool = False, cache_queries: bool = False) -> ModelResults

Execute the model pipeline and return results.

Parameters:

for_validation ¶
(bool, default: False ) –

Whether to download and store extra data to explore and score results.
cache_queries ¶
(bool, default: False ) –

Whether to cache query results on first run and re-use them subsequently.

sync ¶

sync() -> None

Send the model config and results to the server.

Not resistant to race conditions: only one client should call sync at a time.

download_results ¶

download_results() -> ModelResults

Retrieve results associated with the model from the database.

query ¶

query(*sources: Source, **kwargs: Any) -> Query

Generate a query for this model.

add_model_class ¶

add_model_class(ModelClass: type[Linker] | type[Deduper]) -> None

Add custom deduper or linker.
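
The `model_class` attribute above is resolved through a `_MODEL_CLASSES` lookup, which suggests a name-keyed registry. The following is a minimal sketch of that pattern, not matchbox's implementation: it assumes the registry is a plain dict keyed by class name, and the `Linker`, `Deduper`, `MyLinker` and `resolve` names are hypothetical stand-ins.

```python
from __future__ import annotations


class Linker:
    """Stand-in base class for linkers (hypothetical)."""


class Deduper:
    """Stand-in base class for dedupers (hypothetical)."""


# Assumed shape: a plain dict keyed by class name.
_MODEL_CLASSES: dict[str, type] = {}


def add_model_class(model_class: type) -> None:
    """Register a custom linker or deduper under its class name."""
    if not issubclass(model_class, (Linker, Deduper)):
        raise ValueError("Expected a subclass of Linker or Deduper.")
    _MODEL_CLASSES[model_class.__name__] = model_class


def resolve(model_class: type | str) -> type:
    """Accept either a class or its registered name."""
    if isinstance(model_class, str):
        name = model_class
    else:
        name = model_class.__name__
    return _MODEL_CLASSES[name]


class MyLinker(Linker):
    """A user-defined linker (hypothetical)."""


add_model_class(MyLinker)

# Mirrors the model_type expression shown in the attribute listing.
model_type = "linker" if issubclass(resolve("MyLinker"), Linker) else "deduper"
```

Registering by name is what would let a `Model` accept `model_class` either as a class or as a string.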

comparison ¶

Functions to compare fields in different sources.

Functions:

comparison –

Validates any number of SQL conditions and prepares for a WHERE clause.

comparison ¶

comparison(sql_condition: str, dialect: str = 'postgres') -> str

Validates any number of SQL conditions and prepares for a WHERE clause.

Requires all column references be explicitly declared as from “l” and “r” tables.
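
To illustrate the "l" and "r" requirement, here is a deliberately simplified check written with a regular expression. It is an assumption-laden sketch, not how `comparison` works internally: it does not parse SQL, it ignores the `dialect` argument, and it can be fooled by dotted text inside string literals. The name `check_table_aliases` is hypothetical.

```python
import re

# Matches qualified references such as l.name or r.postcode.
_QUALIFIED = re.compile(r"\b([A-Za-z_][A-Za-z0-9_]*)\.([A-Za-z_][A-Za-z0-9_]*)\b")


def check_table_aliases(sql_condition: str) -> str:
    """Return the condition unchanged if every qualifier is 'l' or 'r'."""
    aliases = {match.group(1) for match in _QUALIFIED.finditer(sql_condition)}
    if not aliases:
        raise ValueError("No qualified column references found.")
    unexpected = aliases - {"l", "r"}
    if unexpected:
        raise ValueError(f"Unexpected table aliases: {sorted(unexpected)}")
    return sql_condition


condition = check_table_aliases("l.company_name = r.company_name")
```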

dedupers ¶

Deduplication methodologies.

Modules:

base –

Base class for deduplication methodologies.
naive –

A deduplication methodology based on a deterministic set of conditions.

Classes:

NaiveDeduper –

A simple deduper that deduplicates based on a set of boolean conditions.

NaiveDeduper ¶

Bases: Deduper

A simple deduper that deduplicates based on a set of boolean conditions.

Methods:

prepare –

Prepare the deduper for deduplication.
dedupe –

Deduplicate the dataframe.

Attributes:

settings (NaiveSettings) –

settings `instance-attribute` ¶

settings: NaiveSettings

prepare ¶

prepare(data: DataFrame) -> None

Prepare the deduper for deduplication.

dedupe ¶

dedupe(data: DataFrame) -> DataFrame

Deduplicate the dataframe.

base ¶

Base class for deduplication methodologies.

Classes:

DeduperSettings –

A data class to enforce basic settings dictionary shapes.
Deduper –

A base class for dedupers.

DeduperSettings ¶

Bases: BaseModel

A data class to enforce basic settings dictionary shapes.

Attributes:

id (str) –

id `class-attribute` `instance-attribute` ¶

id: str = Field(default='id', description='A unique ID field in the data to dedupe')

Deduper ¶

Bases: BaseModel, ABC

A base class for dedupers.

Methods:

prepare –

Prepare the deduper for deduplication.
dedupe –

Deduplicate the dataframe.

Attributes:

settings (DeduperSettings) –

settings `instance-attribute` ¶

settings: DeduperSettings

prepare `abstractmethod` ¶

prepare(data: DataFrame) -> None

Prepare the deduper for deduplication.

dedupe `abstractmethod` ¶

dedupe(data: DataFrame) -> DataFrame

Deduplicate the dataframe.

naive ¶

A deduplication methodology based on a deterministic set of conditions.

Classes:

NaiveSettings –

A data class to enforce the Naive deduper’s settings dictionary shape.
NaiveDeduper –

A simple deduper that deduplicates based on a set of boolean conditions.

NaiveSettings ¶

Bases: DeduperSettings

A data class to enforce the Naive deduper’s settings dictionary shape.

Attributes:

unique_fields (list[str]) –
id (str) –

unique_fields `class-attribute` `instance-attribute` ¶

unique_fields: list[str] = Field(description='A list of fields that will form a unique, deduplicated record')

id `class-attribute` `instance-attribute` ¶

id: str = Field(default='id', description='A unique ID field in the data to dedupe')

NaiveDeduper ¶

Bases: Deduper

A simple deduper that deduplicates based on a set of boolean conditions.

Methods:

prepare –

Prepare the deduper for deduplication.
dedupe –

Deduplicate the dataframe.

Attributes:

settings (NaiveSettings) –

settings `instance-attribute` ¶

settings: NaiveSettings

prepare ¶

prepare(data: DataFrame) -> None

Prepare the deduper for deduplication.

dedupe ¶

dedupe(data: DataFrame) -> DataFrame

Deduplicate the dataframe.
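
The idea of deduplicating on `unique_fields` can be sketched in plain Python. This is an illustration under assumptions, not the library's code: it works on a list of dicts rather than a DataFrame, and the output shape (pairs of IDs) and the `naive_dedupe` name are hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def naive_dedupe(records, unique_fields, id_field="id"):
    """Pair up the IDs of records that agree on every unique field."""
    groups = defaultdict(list)
    for record in records:
        key = tuple(record[field] for field in unique_fields)
        groups[key].append(record[id_field])

    pairs = []
    for ids in groups.values():
        # Every pair within a group is treated as a duplicate.
        pairs.extend(combinations(sorted(ids), 2))
    return sorted(pairs)


records = [
    {"id": 1, "name": "Acme Ltd", "postcode": "AB1 2CD"},
    {"id": 2, "name": "Acme Ltd", "postcode": "AB1 2CD"},
    {"id": 3, "name": "Bolt plc", "postcode": "EF3 4GH"},
]
pairs = naive_dedupe(records, unique_fields=["name", "postcode"])
```

Here records 1 and 2 share both fields, so they form the only pair.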

linkers ¶

Linking methodologies.

Modules:

base –

Base class for linkers.
deterministic –

A linking methodology based on a deterministic set of conditions.
splinklinker –

A linking methodology leveraging Splink.
weighteddeterministic –

A linking methodology that applies different weights to field comparisons.

Classes:

DeterministicLinker –

A deterministic linker that links based on a set of boolean conditions.
SplinkLinker –

A linker that leverages Bayesian record linkage using Splink.
WeightedDeterministicLinker –

A deterministic linker that applies different weights to field comparisons.

DeterministicLinker ¶

Bases: Linker

A deterministic linker that links based on a set of boolean conditions.

Uses DuckDB as the SQL backend, enabling rich SQL operations while maintaining a Polars DataFrame interface. Supports both parallel matching (single round) and sequential matching (multiple rounds where matched records are removed after each round).

Methods:

prepare –

Prepare the linker for linking.
link –

Link the left and right dataframes.

Attributes:

settings (DeterministicSettings) –

settings `instance-attribute` ¶

settings: DeterministicSettings

prepare ¶

prepare(left: DataFrame, right: DataFrame) -> None

Prepare the linker for linking.

link ¶

link(left: DataFrame, right: DataFrame) -> DataFrame

Link the left and right dataframes.

If comparisons is a flat list, applies all comparisons in parallel. If comparisons is a nested list, applies each round sequentially, removing matched records from the pool after each round.

SplinkLinker ¶

Bases: Linker

A linker that leverages Bayesian record linkage using Splink.

Methods:

prepare –

Prepare the linker for linking.
link –

Link the left and right dataframes.

Attributes:

settings (SplinkSettings) –

settings `instance-attribute` ¶

settings: SplinkSettings

prepare ¶

prepare(left: DataFrame, right: DataFrame) -> None

Prepare the linker for linking.

link ¶

link(left: DataFrame = None, right: DataFrame = None) -> DataFrame

Link the left and right dataframes.

WeightedDeterministicLinker ¶

Bases: Linker

A deterministic linker that applies different weights to field comparisons.

Methods:

prepare –

Prepare the linker for linking.
link –

Link the left and right dataframes.

Attributes:

settings (WeightedDeterministicSettings) –

settings `instance-attribute` ¶

settings: WeightedDeterministicSettings

prepare ¶

prepare(left: DataFrame, right: DataFrame) -> None

Prepare the linker for linking.

link ¶

link(left: DataFrame, right: DataFrame) -> DataFrame

Link the left and right dataframes.

base ¶

Base class for linkers.

Classes:

LinkerSettings –

A data class to enforce basic settings dictionary shapes.
Linker –

A base class for linkers.

LinkerSettings ¶

Bases: BaseModel

A data class to enforce basic settings dictionary shapes.

Attributes:

left_id (str) –
right_id (str) –

left_id `class-attribute` `instance-attribute` ¶

left_id: str = Field(default='id', description='The unique ID field in the left data')

right_id `class-attribute` `instance-attribute` ¶

right_id: str = Field(default='id', description='The unique ID field in the right data')

Linker ¶

Bases: BaseModel, ABC

A base class for linkers.

Methods:

prepare –

Prepare the linker for linking.
link –

Link the left and right dataframes.

Attributes:

settings (LinkerSettings) –

settings `instance-attribute` ¶

settings: LinkerSettings

prepare `abstractmethod` ¶

prepare(left: DataFrame, right: DataFrame) -> None

Prepare the linker for linking.

link `abstractmethod` ¶

link(left: DataFrame, right: DataFrame) -> DataFrame

Link the left and right dataframes.

deterministic ¶

A linking methodology based on a deterministic set of conditions.

Classes:

DeterministicSettings –

A data class to enforce the Deterministic linker’s settings dictionary shape.
DeterministicLinker –

A deterministic linker that links based on a set of boolean conditions.

DeterministicSettings ¶

Bases: LinkerSettings

A data class to enforce the Deterministic linker’s settings dictionary shape.

Methods:

validate_comparison –

Normalise to list of lists format.

Attributes:

comparisons (list[str] | list[list[str]]) –
left_id (str) –
right_id (str) –

comparisons `class-attribute` `instance-attribute` ¶

comparisons: list[str] | list[list[str]] = Field(description=...)

Comparison rules for matching, using DuckDB SQL syntax.

Can be specified as:

- A flat list of strings: all comparisons applied in parallel (OR logic)
- A nested list of lists: sequential rounds of matching

Flat list (parallel):

    [
        "left.company_number = right.company_number",
        "left.name = right.name",
    ]

All comparisons are applied to the full datasets and the results are unioned.

Nested list (sequential rounds):

    [
        [
            "left.company_number = right.company_number",
            "left.name = right.name",
        ],
        [
            "left.name_normalised = right.name_normalised",
            "left.website = right.website",
        ],
    ]

Each inner list is a "round". Within each round, comparisons use OR logic. After each round, matched records are removed from the pool before the next round.

Use left.field and right.field to refer to columns in the respective sources. Supports all DuckDB SQL operations and functions.

left_id `class-attribute` `instance-attribute` ¶

left_id: str = Field(default='id', description='The unique ID field in the left data')

right_id `class-attribute` `instance-attribute` ¶

right_id: str = Field(default='id', description='The unique ID field in the right data')

validate_comparison `classmethod` ¶

validate_comparison(value: str | list[str] | list[list[str]]) -> list[list[str]]

Normalise to list of lists format.
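
Going by the signature, the validator accepts a single string, a flat list or a nested list and always returns a nested list. A minimal sketch of such a normalisation, assuming a flat list becomes one round (the function name is hypothetical):

```python
def normalise_comparisons(value):
    """Normalise str | list[str] | list[list[str]] to list[list[str]]."""
    if isinstance(value, str):
        # A single comparison is one round with one rule.
        return [[value]]
    if all(isinstance(item, str) for item in value):
        # A flat list is a single round applied in parallel.
        return [list(value)]
    # Already nested: each inner list is its own round.
    return [list(round_) for round_ in value]


flat = normalise_comparisons(["left.name = right.name", "left.id = right.id"])
nested = normalise_comparisons([["left.id = right.id"], ["left.name = right.name"]])
```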

DeterministicLinker ¶

Bases: Linker

A deterministic linker that links based on a set of boolean conditions.

Uses DuckDB as the SQL backend, enabling rich SQL operations while maintaining a Polars DataFrame interface. Supports both parallel matching (single round) and sequential matching (multiple rounds where matched records are removed after each round).

Methods:

prepare –

Prepare the linker for linking.
link –

Link the left and right dataframes.

Attributes:

settings (DeterministicSettings) –

settings `instance-attribute` ¶

settings: DeterministicSettings

prepare ¶

prepare(left: DataFrame, right: DataFrame) -> None

Prepare the linker for linking.

link ¶

link(left: DataFrame, right: DataFrame) -> DataFrame

Link the left and right dataframes.

If comparisons is a flat list, applies all comparisons in parallel. If comparisons is a nested list, applies each round sequentially, removing matched records from the pool after each round.
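
The difference between parallel and sequential matching can be sketched without DuckDB. The example below is an illustration only: it uses Python predicates in place of SQL conditions and lists of dicts in place of DataFrames, and it assumes matched records are removed from both sides of the pool. All names are hypothetical.

```python
def link_in_rounds(left, right, rounds, id_field="id"):
    """Apply each round of rules in turn, shrinking the pool as records match."""
    left_pool, right_pool = list(left), list(right)
    matches = []
    for rules in rounds:
        # Within a round, a pair matches if any rule holds (OR logic).
        round_matches = [
            (lrec[id_field], rrec[id_field])
            for lrec in left_pool
            for rrec in right_pool
            if any(rule(lrec, rrec) for rule in rules)
        ]
        matches.extend(round_matches)
        # Assumption: matched records leave the pool on both sides.
        seen_left = {lid for lid, _ in round_matches}
        seen_right = {rid for _, rid in round_matches}
        left_pool = [rec for rec in left_pool if rec[id_field] not in seen_left]
        right_pool = [rec for rec in right_pool if rec[id_field] not in seen_right]
    return matches


def same_number(lrec, rrec):
    return lrec["number"] is not None and lrec["number"] == rrec["number"]


def same_name(lrec, rrec):
    return lrec["name"] == rrec["name"]


left = [
    {"id": "L1", "number": "001", "name": "acme"},
    {"id": "L2", "number": None, "name": "bolt"},
]
right = [
    {"id": "R1", "number": "001", "name": "acme ltd"},
    {"id": "R2", "number": "999", "name": "bolt"},
    {"id": "R3", "number": "777", "name": "acme"},
]

parallel = link_in_rounds(left, right, [[same_number, same_name]])
sequential = link_in_rounds(left, right, [[same_number], [same_name]])
```

In the parallel case L1 matches both R1 (by number) and R3 (by name). In the sequential case L1 is removed after the first round, so the weaker name match to R3 never happens.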

splinklinker ¶

A linking methodology leveraging Splink.

Classes:

SplinkLinkerFunction –

A method of splink.Linker.training used to train the linker.
SplinkSettings –

A data class to enforce the Splink linker’s settings dictionary shape.
SplinkLinker –

A linker that leverages Bayesian record linkage using Splink.

SplinkLinkerFunction ¶

Bases: BaseModel

A method of splink.Linker.training used to train the linker.

Methods:

validate_function_and_arguments –

Ensure the function and arguments are valid.

Attributes:

function (str) –
arguments (dict[str, Any]) –

function `instance-attribute` ¶

function: str

arguments `instance-attribute` ¶

arguments: dict[str, Any]

validate_function_and_arguments ¶

validate_function_and_arguments() -> SplinkLinkerFunction

Ensure the function and arguments are valid.
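
One way such a check could work is to look the function name up on the training object and bind the supplied arguments to its signature. The sketch below assumes that approach and uses a hypothetical `Training` stand-in rather than Splink itself, so it shows the technique and not the library's actual validation.

```python
import inspect


class Training:
    """Hypothetical stand-in for the object exposing training methods."""

    def estimate_u_using_random_sampling(self, max_pairs, seed=None):
        """Placeholder with an illustrative signature."""


def validate_function_and_arguments(function: str, arguments: dict) -> None:
    """Raise ValueError if the method is unknown or the arguments do not fit."""
    method = getattr(Training, function, None)
    if function.startswith("_") or not callable(method):
        raise ValueError(f"{function!r} is not a training method.")
    try:
        # None stands in for self; only the argument names are checked.
        inspect.signature(method).bind(None, **arguments)
    except TypeError as exc:
        raise ValueError(f"Invalid arguments for {function!r}: {exc}") from exc


validate_function_and_arguments(
    "estimate_u_using_random_sampling", {"max_pairs": 1e6}
)
```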

SplinkSettings ¶

Bases: LinkerSettings

A data class to enforce the Splink linker’s settings dictionary shape.

Methods:

check_ids_match –

Ensure left_id and right_id match.
check_link_only –

Ensure link_type is set to “link_only”.
add_enforced_settings –

Ensure ID is the only field we link on.
load_linker_settings –

Load serialised settings into SettingsCreator.
serialise_settings –

Convert Splink settings to string.

Attributes:

model_config –
linker_training_functions (list[SplinkLinkerFunction]) –
linker_settings (SettingsCreator) –
threshold (float | None) –
left_id (str) –
right_id (str) –

model_config `class-attribute` `instance-attribute` ¶

model_config = ConfigDict(arbitrary_types_allowed=True)

linker_training_functions `class-attribute` `instance-attribute` ¶

linker_training_functions: list[SplinkLinkerFunction] = Field(description=...)

A list of dictionaries, each naming a method of splink.Linker.training under "function" and encoding that method's arguments as a dictionary under "arguments". Each function will be run in the order supplied.

Example:

    >>> linker_training_functions=[
    ...     {
    ...         "function": "estimate_probability_two_random_records_match",
    ...         "arguments": {
    ...             "deterministic_matching_rules": """
    ...                 l.company_name = r.company_name
    ...             """,
    ...             "recall": 0.7,
    ...         },
    ...     },
    ...     {
    ...         "function": "estimate_u_using_random_sampling",
    ...         "arguments": {"max_pairs": 1e6},
    ...     }
    ... ]

linker_settings `class-attribute` `instance-attribute` ¶

linker_settings: SettingsCreator = Field(description=...)

A valid Splink SettingsCreator.

See Splink's documentation for a full description of available settings.
https://moj-analytical-services.github.io/splink/api_docs/settings_dict_guide.html

* link_type must be set to "link_only"
* unique_id_name is overridden to the value of left_id and right_id, which must match

Example:

    >>> from splink import SettingsCreator, block_on
    ... import splink.comparison_library as cl
    ... import splink.comparison_template_library as ctl
    ...
    ... splink_settings = SettingsCreator(
    ...     retain_matching_columns=False,
    ...     retain_intermediate_calculation_columns=False,
    ...     blocking_rules_to_generate_predictions=[
    ...         block_on("company_name"),
    ...         block_on("postcode"),
    ...     ],
    ...     comparisons=[
    ...         cl.jaro_winkler_at_thresholds(
    ...             "company_name",
    ...             [0.9, 0.6],
    ...             term_frequency_adjustments=True
    ...         ),
    ...         ctl.postcode_comparison("postcode"),
    ...     ]
    ... )

threshold `class-attribute` `instance-attribute` ¶

threshold: float | None = Field(default=None, description=..., gt=0, le=1)

The probability above which matches will be kept.

None is used to indicate no threshold.

Inclusive, so a value of 1 will keep only exact matches across all comparisons.

left_id `class-attribute` `instance-attribute` ¶

left_id: str = Field(default='id', description='The unique ID field in the left data')

right_id `class-attribute` `instance-attribute` ¶

right_id: str = Field(default='id', description='The unique ID field in the right data')

check_ids_match ¶

check_ids_match() -> SplinkSettings

Ensure left_id and right_id match.

check_link_only ¶

check_link_only() -> SplinkSettings

Ensure link_type is set to “link_only”.

add_enforced_settings ¶

add_enforced_settings() -> SplinkSettings

Ensure ID is the only field we link on.

load_linker_settings ¶

load_linker_settings(value: str | SettingsCreator) -> SettingsCreator

Load serialised settings into SettingsCreator.

serialise_settings ¶

serialise_settings(value: SettingsCreator, info: SerializationInfo) -> str

Convert Splink settings to string.

SplinkLinker ¶

Bases: Linker

A linker that leverages Bayesian record linkage using Splink.

Methods:

prepare –

Prepare the linker for linking.
link –

Link the left and right dataframes.

Attributes:

settings (SplinkSettings) –

settings `instance-attribute` ¶

settings: SplinkSettings

prepare ¶

prepare(left: DataFrame, right: DataFrame) -> None

Prepare the linker for linking.

link ¶

link(left: DataFrame = None, right: DataFrame = None) -> DataFrame

Link the left and right dataframes.

weighteddeterministic ¶

A linking methodology that applies different weights to field comparisons.

Classes:

WeightedComparison –

A valid comparison and a weight to give it.
WeightedDeterministicSettings –

A data class to enforce the Weighted linker’s settings dictionary shape.
WeightedDeterministicLinker –

A deterministic linker that applies different weights to field comparisons.

WeightedComparison ¶

Bases: BaseModel

A valid comparison and a weight to give it.

Methods:

validate_comparison –

Validate the comparison string.

Attributes:

comparison (str) –
weight (float) –

comparison `class-attribute` `instance-attribute` ¶

comparison: str = Field(description=...)

A valid ON clause to compare fields between the left and the right data.

Use left.field and right.field to refer to fields in the respective sources.

For example:

    "left.company_name = right.company_name"

weight `class-attribute` `instance-attribute` ¶

weight: float = Field(description=...)

A weight to give this comparison. Use 1 for all comparisons to give uniform weight to each.

validate_comparison `classmethod` ¶

validate_comparison(v: str) -> str

Validate the comparison string.

WeightedDeterministicSettings ¶

Bases: LinkerSettings

A data class to enforce the Weighted linker’s settings dictionary shape.

Example

    {
        left_id: "hash",
        right_id: "hash",
        weighted_comparisons: [
            ("l.company_name = r.company_name", 0.7),
            ("l.postcode = r.postcode", 0.7),
            ("l.company_id = r.company_id", 1),
        ],
        threshold: 0.8,
    }

Attributes:

weighted_comparisons (list[WeightedComparison]) –
threshold (float) –
left_id (str) –
right_id (str) –

weighted_comparisons `class-attribute` `instance-attribute` ¶

weighted_comparisons: list[WeightedComparison] = Field(description='A list of tuples in the form of a comparison, and a weight.')

threshold `class-attribute` `instance-attribute` ¶

threshold: float = Field(description=..., ge=0, le=1)

The probability above which matches will be kept.

Inclusive, so a value of 1 will keep only exact matches across all comparisons.

left_id `class-attribute` `instance-attribute` ¶

left_id: str = Field(default='id', description='The unique ID field in the left data')

right_id `class-attribute` `instance-attribute` ¶

right_id: str = Field(default='id', description='The unique ID field in the right data')

WeightedDeterministicLinker ¶

Bases: Linker

A deterministic linker that applies different weights to field comparisons.

Methods:

prepare –

Prepare the linker for linking.
link –

Link the left and right dataframes.

Attributes:

settings (WeightedDeterministicSettings) –

settings `instance-attribute` ¶

settings: WeightedDeterministicSettings

prepare ¶

prepare(left: DataFrame, right: DataFrame) -> None

Prepare the linker for linking.

link ¶

link(left: DataFrame, right: DataFrame) -> DataFrame

Link the left and right dataframes.
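
The interaction between weights and the threshold can be sketched as follows. The scoring rule here is an assumption rather than the confirmed implementation: the score is taken to be the matched weight divided by the total weight, which is consistent with a threshold of 1 keeping only pairs that match on every comparison. Python predicates stand in for SQL conditions, and all names are hypothetical.

```python
def weighted_score(left, right, comparisons):
    """Share of total weight carried by the comparisons that match."""
    total = sum(weight for _, weight in comparisons)
    matched = sum(weight for rule, weight in comparisons if rule(left, right))
    return matched / total


def same(field):
    """Build a predicate that checks equality on one field."""
    return lambda left, right: left[field] == right[field]


# Weights taken from the settings example for this module.
comparisons = [
    (same("company_name"), 0.7),
    (same("postcode"), 0.7),
    (same("company_id"), 1),
]
threshold = 0.8

left = {"company_name": "acme", "postcode": "AB1 2CD", "company_id": "001"}
exact = dict(left)
partial = {**left, "company_id": "999"}

# The threshold is inclusive, hence >= rather than >.
keep_exact = weighted_score(left, exact, comparisons) >= threshold
keep_partial = weighted_score(left, partial, comparisons) >= threshold
```

Under this assumed rule the partial pair scores 1.4 / 2.4 and falls below the 0.8 threshold, while the exact pair scores 1 and is kept.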

models ¶

Functions and classes to define, run and register models.

Classes:

Model –

Unified model class for both linking and deduping operations.

Functions:

add_model_class –

Add custom deduper or linker.
post_run –

Decorator to ensure that a method is called after model run.

Model ¶

Model(name: str, dag: DAG, model_class: type[Deduper], model_settings: DeduperSettings | dict, left_query: Query, right_query: None = None, truth: float = 1.0, description: str | None = None)

Model(dag: DAG, name: str, model_class: type[Linker], model_settings: LinkerSettings | dict, left_query: Query, right_query: Query, truth: float = 1.0, description: str | None = None)

Model(dag: DAG, name: str, model_class: type[Deduper] | type[Linker] | str, model_settings: DeduperSettings | LinkerSettings | dict, left_query: Query, right_query: Query | None = None, truth: float = 1.0, description: str | None = None)

Unified model class for both linking and deduping operations.

Parameters:

dag ¶
(DAG) –

DAG containing this model.
name ¶
(str) –

Unique name for the model.
truth ¶
(float, default: 1.0 ) –

Truth threshold. Defaults to 1.0. Can be set later after analysis.
model_class ¶
(type[Deduper] | type[Linker] | str) –

Class of Linker or Deduper, or its name.
model_settings ¶
(DeduperSettings | LinkerSettings | dict) –

Appropriate settings object to pass to model class.
left_query ¶
(Query) –

The query that will get the data to deduplicate, or the data to link on the left.
right_query ¶
(Query | None, default: None ) –

The query that will get the data to link on the right.
description ¶
(str | None, default: None ) –

Optional description of the model.

Methods:

to_resolution –

Convert to Resolution for API calls.
from_resolution –

Reconstruct from Resolution.
delete –

Delete the model from the database.
run –

Execute the model pipeline and return results.
sync –

Send the model config and results to the server.
download_results –

Retrieve results associated with the model from the database.
query –

Generate a query for this model.

Attributes:

dag –
name –
description –
left_query –
right_query –
results (ModelResults | None) –
model_class (type[Linker | Deduper]) –
model_instance –
model_type (ModelType) –
model_settings –
config (ModelConfig) –

Generate config DTO from Model.
sources (set[SourceResolutionName]) –

Set of source names upstream of this node.
resolution_path (ModelResolutionPath) –

Returns the model resolution path.
truth (float | None) –

Returns the truth threshold for the model as a float.

dag `instance-attribute` ¶

dag = dag

name `instance-attribute` ¶

name = name

description `instance-attribute` ¶

description = description

left_query `instance-attribute` ¶

left_query = left_query

right_query `instance-attribute` ¶

right_query = right_query

results `instance-attribute` ¶

results: ModelResults | None = None

model_class `instance-attribute` ¶

model_class: type[Linker | Deduper] = _MODEL_CLASSES[model_class]

model_instance `instance-attribute` ¶

model_instance = model_class(settings=model_settings)

model_type `instance-attribute` ¶

model_type: ModelType = LINKER if issubclass(model_class, Linker) else DEDUPER

model_settings `instance-attribute` ¶

model_settings = SettingsClass(**model_settings)

config `property` ¶

config: ModelConfig

Generate config DTO from Model.

sources `property` ¶

sources: set[SourceResolutionName]

Set of source names upstream of this node.

resolution_path `property` ¶

resolution_path: ModelResolutionPath

Returns the model resolution path.

truth `property` `writable` ¶

truth: float | None

Returns the truth threshold for the model as a float.

to_resolution ¶

to_resolution() -> Resolution

Convert to Resolution for API calls.

from_resolution `classmethod` ¶

from_resolution(resolution: Resolution, resolution_name: str, dag: DAG) -> Model

Reconstruct from Resolution.

delete ¶

delete(certain: bool = False) -> bool

Delete the model from the database.

run ¶

run(for_validation: bool = False, cache_queries: bool = False) -> ModelResults

Execute the model pipeline and return results.

Parameters:

for_validation ¶
(bool, default: False ) –

Whether to download and store extra data to explore and score results.
cache_queries ¶
(bool, default: False ) –

Whether to cache query results on first run and re-use them subsequently.

sync ¶

sync() -> None

Send the model config and results to the server.

Not resistant to race conditions: only one client should call sync at a time.

download_results ¶

download_results() -> ModelResults

Retrieve results associated with the model from the database.

query ¶

query(*sources: Source, **kwargs: Any) -> Query

Generate a query for this model.

add_model_class ¶

add_model_class(ModelClass: type[Linker] | type[Deduper]) -> None

Add custom deduper or linker.

post_run ¶

post_run(method: Callable[..., T]) -> Callable[..., T]

Decorator to ensure that a method is called after model run.

Raises:

RuntimeError –

If run hasn’t happened.
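
A decorator with this contract could look like the sketch below. It assumes that "run has happened" is detected by `results` no longer being None, which matches the `results: ModelResults | None = None` attribute but is not confirmed by this page. `ToyModel` and `count_results` are hypothetical.

```python
from functools import wraps
from typing import Callable, TypeVar

T = TypeVar("T")


def post_run(method: Callable[..., T]) -> Callable[..., T]:
    """Refuse to call the wrapped method until the model has results."""

    @wraps(method)
    def wrapper(self, *args, **kwargs) -> T:
        # Assumption: an unrun model is one whose results are still None.
        if self.results is None:
            raise RuntimeError("The model must be run before calling this method.")
        return method(self, *args, **kwargs)

    return wrapper


class ToyModel:
    """Minimal stand-in for a model with a results attribute."""

    def __init__(self) -> None:
        self.results = None

    def run(self) -> list:
        self.results = [("a", "b")]
        return self.results

    @post_run
    def count_results(self) -> int:
        return len(self.results)


model = ToyModel()
model.run()
count = model.count_results()
```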
