Models
matchbox.client.models
Deduplication and linking methodologies.
Modules:
- comparison – Functions to compare fields in different sources.
- dedupers – Deduplication methodologies.
- linkers – Linking methodologies.
- models – Functions and classes to define, run and register models.
Classes:
- Model – Unified model class for both linking and deduping operations.
Functions:
- add_model_class – Add custom deduper or linker.
Model
Model(name: str, dag: DAG, model_class: type[Deduper], model_settings: DeduperSettings | dict, left_query: Query, right_query: None = None, description: str | None = None)
Model(dag: DAG, name: str, model_class: type[Linker], model_settings: LinkerSettings | dict, left_query: Query, right_query: Query, description: str | None = None)
Model(dag: DAG, name: str, model_class: type[Deduper] | type[Linker] | str, model_settings: DeduperSettings | LinkerSettings | dict, left_query: Query, right_query: Query | None = None, description: str | None = None)
Bases: StepABC
Unified model class for both linking and deduping operations.
Parameters:
- dag (DAG) – DAG containing this model.
- name (str) – Unique name for the model.
- model_class (type[Deduper] | type[Linker] | str) – Class of Linker or Deduper, or its name.
- model_settings (DeduperSettings | LinkerSettings | dict) – Appropriate settings object to pass to model class.
- left_query (Query) – The query that will get the data to deduplicate, or the data to link on the left.
- right_query (Query | None, default: None) – The query that will get the data to link on the right.
- description (str | None, default: None) – Optional description of the model.
Methods:
- delete – Delete this step and its associated data from the backend.
- download – Fetch remote data for this step and store it locally.
- sync – Send step config and local data to the server.
- to_dto – Convert to Step DTO for API calls.
- from_dto – Reconstruct from Step DTO.
- compute_scores – Run model instance against data.
- run – Execute the model pipeline and return results.
- resolver – Create a resolver rooted at this model and add it to the DAG.
- clear_data – Suppress data clearing for models.
Attributes:
- dag
- name
- description
- local_data (DataFrame | None) – The locally computed results for this step.
- cache_path (Path) – Path within the DAG cache for storing this step’s local data.
- left_query
- right_query
- model_class (type[Linker | Deduper])
- model_instance
- model_type (ModelType)
- model_settings
- results (DataFrame | None) – The locally computed model scores. Alias for local_data.
- config (ModelConfig) – Generate config DTO from Model.
- sources (set[SourceStepName]) – Set of source names upstream of this node.
- path (ModelStepPath) – Return the model step path.
cache_path
property
cache_path: Path
Path within the DAG cache for storing this step’s local data.
model_type
instance-attribute
model_type: ModelType = LINKER if issubclass(model_class, Linker) else DEDUPER
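The expression above derives the model type from the class hierarchy. A minimal sketch of that rule follows; `ModelType`, `Linker` and `Deduper` here are local stand-ins (their real definitions live in matchbox and are not reproduced), and `MyLinker`, `MyDeduper` and `infer_model_type` are hypothetical names used only for illustration.

```python
from enum import Enum


class ModelType(Enum):
    """Stand-in for matchbox's ModelType; the member values are assumed."""

    LINKER = "linker"
    DEDUPER = "deduper"


class Linker:
    """Stand-in base class for linkers."""


class Deduper:
    """Stand-in base class for dedupers."""


class MyLinker(Linker):
    """Hypothetical custom linker."""


class MyDeduper(Deduper):
    """Hypothetical custom deduper."""


def infer_model_type(model_class: type) -> ModelType:
    # Mirrors the documented expression: any class that is not a Linker
    # subclass is treated as a deduper.
    return ModelType.LINKER if issubclass(model_class, Linker) else ModelType.DEDUPER
```

Note that under this rule the fallback branch is the deduper, so a class that subclasses neither base would also be reported as a deduper.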
results
property
writable
The locally computed model scores. Alias for local_data.
delete
Delete this step and its associated data from the backend.
sync
Send step config and local data to the server.
Not resistant to race conditions: only one client should call sync at a time.
from_dto
classmethod
Reconstruct from Step DTO.
compute_scores
Run model instance against data.
run
run(left_data: DataFrame | None = None, right_data: DataFrame | None = None, low_memory: bool = False) -> DataFrame
Execute the model pipeline and return results.
Parameters:
- left_data (optional, default: None) – Pre-fetched query data to deduplicate if the model is a deduper, or link on the left if the model is a linker.
- right_data (optional, default: None) – Pre-fetched query data to link on the right, if the model is a linker. If the model is a deduper, this argument is ignored.
- low_memory (bool, default: False) – If True, it will not download data from the server to support evaluation.
add_model_class
Add custom deduper or linker.
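This page does not show the signature of add_model_class, so the following is only a sketch of the general idea: a name-to-class registry that lets model_class be given either as a class or as its name. `register_model_class`, `resolve_model_class` and `MyLinker` are hypothetical names, not matchbox APIs.

```python
from __future__ import annotations

# Hypothetical registry mapping class names to classes.
_MODEL_CLASSES: dict[str, type] = {}


def register_model_class(model_class: type) -> None:
    """Index a custom deduper or linker class by its class name."""
    _MODEL_CLASSES[model_class.__name__] = model_class


def resolve_model_class(model_class: type | str) -> type:
    """Accept either a class or the name of a registered class."""
    if isinstance(model_class, str):
        try:
            return _MODEL_CLASSES[model_class]
        except KeyError:
            raise ValueError(f"Unknown model class: {model_class!r}") from None
    return model_class


class MyLinker:
    """Hypothetical custom linker."""


register_model_class(MyLinker)
```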
comparison
Functions to compare fields in different sources.
Functions:
- comparison – Validates any number of SQL conditions and prepares for a WHERE clause.
dedupers
Deduplication methodologies.
Modules:
- base – Base class for deduplication methodologies.
- naive – A deduplication methodology based on a deterministic set of conditions.
Classes:
- NaiveDeduper – A simple deduper that deduplicates based on a set of boolean conditions.
NaiveDeduper
Bases: Deduper
A simple deduper that deduplicates based on a set of boolean conditions.
base
Base class for deduplication methodologies.
Classes:
- DeduperSettings – A data class to enforce basic settings dictionary shapes.
- Deduper – A base class for dedupers.
DeduperSettings
Bases: BaseModel
A data class to enforce basic settings dictionary shapes.
naive
A deduplication methodology based on a deterministic set of conditions.
Classes:
- NaiveSettings – A data class to enforce the Naive deduper’s settings dictionary shape.
- NaiveDeduper – A simple deduper that deduplicates based on a set of boolean conditions.
NaiveSettings
Bases: DeduperSettings
A data class to enforce the Naive deduper’s settings dictionary shape.
Attributes:
- unique_fields (list[str])
- id (str)
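The page does not describe how unique_fields is used. One plausible reading, sketched below as an assumption, is that two records count as duplicates when they agree on every field in unique_fields, with id naming the identifier column. `naive_dedupe` and the sample records are hypothetical and use plain dictionaries rather than DataFrames.

```python
from collections import defaultdict
from itertools import combinations


def naive_dedupe(records, unique_fields, id_field="id"):
    """Pair up records that agree on every field in unique_fields."""
    groups = defaultdict(list)
    for record in records:
        # Records sharing the same tuple of values land in the same group.
        key = tuple(record[field] for field in unique_fields)
        groups[key].append(record[id_field])
    pairs = []
    for ids in groups.values():
        # Every pair of IDs within a group is a candidate duplicate.
        pairs.extend(combinations(sorted(ids), 2))
    return sorted(pairs)


records = [
    {"id": 1, "name": "Acme Ltd", "postcode": "AB1 2CD"},
    {"id": 2, "name": "Acme Ltd", "postcode": "AB1 2CD"},
    {"id": 3, "name": "Acme Ltd", "postcode": "ZZ9 9ZZ"},
    {"id": 4, "name": "Bolt plc", "postcode": "AB1 2CD"},
]
pairs = naive_dedupe(records, unique_fields=["name", "postcode"])
```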
NaiveDeduper
Bases: Deduper
A simple deduper that deduplicates based on a set of boolean conditions.
linkers
Linking methodologies.
Modules:
- base – Base class for linkers.
- deterministic – A linking methodology based on a deterministic set of conditions.
- splinklinker – A linking methodology leveraging Splink.
- weighteddeterministic – A linking methodology that applies different weights to field comparisons.
Classes:
- DeterministicLinker – A deterministic linker that links based on a set of boolean conditions.
- SplinkLinker – A linker that leverages Bayesian record linkage using Splink.
- WeightedDeterministicLinker – A deterministic linker that applies different weights to field comparisons.
DeterministicLinker
Bases: Linker
A deterministic linker that links based on a set of boolean conditions.
Uses DuckDB as the SQL backend, enabling rich SQL operations while maintaining a Polars DataFrame interface. Supports both parallel matching (single round) and sequential matching (multiple rounds where matched records are removed after each round).
link
Link the left and right dataframes.
If comparisons is a flat list, applies all comparisons in parallel. If comparisons is a nested list, applies each round sequentially, removing matched records from the pool after each round.
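The difference between parallel and sequential matching can be sketched in plain Python. This is not the DuckDB-backed implementation: Python predicates stand in for the SQL comparison strings, `link_in_rounds` is a hypothetical helper, and removing matched records from both sides of the pool is an assumption about what "removed from the pool" means.

```python
def link_in_rounds(left, right, rounds):
    """Apply rounds of OR-ed predicates, dropping matched records between rounds.

    left and right map record IDs to row dicts. Each round is a list of
    callables taking (left_row, right_row) and returning a bool.
    """
    left_pool, right_pool = dict(left), dict(right)
    matches = []
    for comparisons in rounds:
        # Within a round, a pair matches if any comparison holds (OR logic).
        round_matches = [
            (left_id, right_id)
            for left_id, left_row in left_pool.items()
            for right_id, right_row in right_pool.items()
            if any(compare(left_row, right_row) for compare in comparisons)
        ]
        matches.extend(round_matches)
        # Matched records leave the pool before the next round.
        for left_id, right_id in round_matches:
            left_pool.pop(left_id, None)
            right_pool.pop(right_id, None)
    return matches


def same_number(left_row, right_row):
    return left_row["number"] == right_row["number"]


def same_name(left_row, right_row):
    return left_row["name"] == right_row["name"]


left = {
    1: {"number": "A1", "name": "acme"},
    2: {"number": "B2", "name": "bolt"},
}
right = {
    10: {"number": "A1", "name": "acme ltd"},
    20: {"number": "XX", "name": "bolt"},
    30: {"number": "YY", "name": "acme"},
}

# A flat list behaves like a single round.
parallel = link_in_rounds(left, right, [[same_number, same_name]])
# A nested list runs one round after another.
sequential = link_in_rounds(left, right, [[same_number], [same_name]])
```

In this toy data the pair (1, 30) appears only in the parallel run: in the sequential run, record 1 is matched on number in the first round and is no longer available when names are compared.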
SplinkLinker
Bases: Linker
A linker that leverages Bayesian record linkage using Splink.
link
Link the left and right dataframes.
WeightedDeterministicLinker
Bases: Linker
A deterministic linker that applies different weights to field comparisons.
base
Base class for linkers.
Classes:
- LinkerSettings – A data class to enforce basic settings dictionary shapes.
- Linker – A base class for linkers.
LinkerSettings
Bases: BaseModel
A data class to enforce basic settings dictionary shapes.
Linker
Bases: BaseModel, ABC
A base class for linkers.
deterministic
A linking methodology based on a deterministic set of conditions.
Classes:
- DeterministicSettings – A data class to enforce the Deterministic linker’s settings dictionary shape.
- DeterministicLinker – A deterministic linker that links based on a set of boolean conditions.
DeterministicSettings
Bases: LinkerSettings
A data class to enforce the Deterministic linker’s settings dictionary shape.
Methods:
- validate_comparison – Normalise to list of lists format.
comparisons
class-attribute
instance-attribute
comparisons: list[str] | list[list[str]] = Field(description='\n Comparison rules for matching using DuckDB SQL syntax.\n \n Can be specified as:\n - A flat list of strings: All comparisons applied in parallel (OR logic)\n - A nested list of lists: Sequential rounds of matching\n \n Flat list (parallel):\n [\n "left.company_number = right.company_number",\n "left.name = right.name",\n ]\n All comparisons applied to full datasets, results unioned.\n \n Nested list (sequential rounds):\n [\n [\n "left.company_number = right.company_number",\n "left.name = right.name",\n ],\n [\n "left.name_normalised = right.name_normalised",\n "left.website = right.website",\n ],\n ]\n Each inner list is a "round". Within each round, comparisons use OR \n logic. After each round, matched records are removed from the pool \n before the next round.\n \n Use left.field and right.field to refer to columns in the respective \n sources. Supports all DuckDB SQL operations and functions.\n ')
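Because comparisons accepts either shape, a validator can normalise both into a list of rounds. The sketch below shows one way to do that; `normalise_comparisons` is a hypothetical helper rather than the library's validate_comparison, and rejecting empty input is an assumption.

```python
def normalise_comparisons(comparisons):
    """Return comparisons as a list of rounds (a list of lists of strings)."""
    if not comparisons:
        raise ValueError("At least one comparison is required")
    if all(isinstance(item, str) for item in comparisons):
        # Flat list: wrap it so it becomes a single round.
        return [list(comparisons)]
    if all(
        isinstance(item, list) and item and all(isinstance(c, str) for c in item)
        for item in comparisons
    ):
        # Already nested: copy each round unchanged.
        return [list(item) for item in comparisons]
    raise TypeError("Expected list[str] or list[list[str]]")


flat = [
    "left.company_number = right.company_number",
    "left.name = right.name",
]
nested = [
    ["left.company_number = right.company_number"],
    ["left.name_normalised = right.name_normalised"],
]
```

After normalisation the linking code only has to handle one shape: `normalise_comparisons(flat)` yields a single round, while `normalise_comparisons(nested)` keeps its two rounds.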
left_id
class-attribute
instance-attribute
left_id: str = Field(default='id', description='The unique ID field in the left data')
DeterministicLinker
Bases: Linker
A deterministic linker that links based on a set of boolean conditions.
Uses DuckDB as the SQL backend, enabling rich SQL operations while maintaining a Polars DataFrame interface. Supports both parallel matching (single round) and sequential matching (multiple rounds where matched records are removed after each round).
link
Link the left and right dataframes.
If comparisons is a flat list, applies all comparisons in parallel. If comparisons is a nested list, applies each round sequentially, removing matched records from the pool after each round.
splinklinker
A linking methodology leveraging Splink.
Classes:
- SplinkLinkerFunction – A method of splink.Linker.training used to train the linker.
- SplinkSettings – A data class to enforce the Splink linker’s settings dictionary shape.
- SplinkLinker – A linker that leverages Bayesian record linkage using Splink.
SplinkLinkerFunction
Bases: BaseModel
A method of splink.Linker.training used to train the linker.
Methods:
- validate_function_and_arguments – Ensure the function and arguments are valid.
SplinkSettings
Bases: LinkerSettings
A data class to enforce the Splink linker’s settings dictionary shape.
Methods:
- check_ids_match – Ensure left_id and right_id match.
- check_link_only – Ensure link_type is set to “link_only”.
- add_enforced_settings – Ensure ID is the only field we link on.
- load_linker_settings – Load serialised settings into SettingsCreator.
- serialise_settings – Convert Splink settings to string.
Attributes:
- model_config
- linker_training_functions (list[SplinkLinkerFunction])
- linker_settings (SettingsCreator)
- threshold (float | None)
- left_id (str)
- right_id (str)
model_config
class-attribute
instance-attribute
linker_training_functions
class-attribute
instance-attribute
linker_training_functions: list[SplinkLinkerFunction] = Field(description='\n A list of dictionaries where keys are the names of methods for\n splink.Linker.training and values are dictionaries encoding the arguments of\n those methods. Each function will be run in the order supplied.\n\n Example:\n \n >>> linker_training_functions=[\n ... {\n ... "function": "estimate_probability_two_random_records_match",\n ... "arguments": {\n ... "deterministic_matching_rules": """\n ... l.company_name = r.company_name\n ... """,\n ... "recall": 0.7,\n ... },\n ... },\n ... {\n ... "function": "estimate_u_using_random_sampling",\n ... "arguments": {"max_pairs": 1e6},\n ... }\n ... ]\n \n ')
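Each entry pairs a method name with its keyword arguments, so it can be checked before any training runs. The sketch below shows one way to do such a check with `inspect.signature`. It is not the library's validate_function_and_arguments: `TrainingStandIn` is a hypothetical class with simplified signatures, used so the example runs without Splink installed.

```python
import inspect


class TrainingStandIn:
    """Hypothetical stand-in for the object exposing Splink's training methods."""

    def estimate_u_using_random_sampling(self, max_pairs=1e6):
        ...

    def estimate_probability_two_random_records_match(
        self, deterministic_matching_rules, recall
    ):
        ...


def validate_training_function(spec):
    """Check that spec names a known method and that its arguments bind to it."""
    method = getattr(TrainingStandIn, spec["function"], None)
    if not callable(method):
        raise ValueError(f"Unknown training function: {spec['function']!r}")
    try:
        # Bind against the signature without calling the method;
        # None stands in for self.
        inspect.signature(method).bind(None, **spec["arguments"])
    except TypeError as exc:
        raise ValueError(
            f"Invalid arguments for {spec['function']!r}: {exc}"
        ) from exc
    return spec


spec = {
    "function": "estimate_u_using_random_sampling",
    "arguments": {"max_pairs": 1e6},
}
validated = validate_training_function(spec)
```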
linker_settings
class-attribute
instance-attribute
linker_settings: SettingsCreator = Field(description='\n A valid Splink SettingsCreator.\n\n See Splink\'s documentation for a full description of available settings.\n https://moj-analytical-services.github.io/splink/api_docs/settings_dict_guide.html\n\n * link_type must be set to "link_only"\n * unique_id_name is overridden to the value of left_id and right_id,\n which must match\n\n Example:\n\n >>> from splink import SettingsCreator, block_on\n ... import splink.comparison_library as cl\n ... import splink.comparison_template_library as ctl\n ... \n ... splink_settings = SettingsCreator(\n ... retain_matching_columns=False,\n ... retain_intermediate_calculation_columns=False,\n ... blocking_rules_to_generate_predictions=[\n ... block_on("company_name"),\n ... block_on("postcode"),\n ... ],\n ... comparisons=[\n ... cl.jaro_winkler_at_thresholds(\n ... "company_name", \n ... [0.9, 0.6], \n ... term_frequency_adjustments=True\n ... ),\n ... ctl.postcode_comparison("postcode"), \n ... ]\n ... ) \n ')
threshold
class-attribute
instance-attribute
threshold: float | None = Field(default=None, description='\n The score above which matches will be kept.\n\n None is used to indicate no threshold.\n \n Inclusive, so a value of 1 will keep only exact matches across all \n comparisons.\n ', gt=0, le=1)
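The threshold semantics described above (None disables filtering, the bound is inclusive, and valid values satisfy 0 < threshold <= 1) can be illustrated with a small filter. `apply_threshold` is a hypothetical helper operating on plain tuples, not part of matchbox.

```python
def apply_threshold(scored_pairs, threshold=None):
    """Keep (left_id, right_id, score) tuples whose score meets the threshold."""
    if threshold is None:
        # No threshold: every scored pair is kept.
        return list(scored_pairs)
    if not 0 < threshold <= 1:
        raise ValueError("threshold must satisfy 0 < threshold <= 1")
    # Inclusive comparison: a score equal to the threshold is kept.
    return [pair for pair in scored_pairs if pair[2] >= threshold]


scored = [(1, 10, 0.95), (2, 20, 0.5), (3, 30, 1.0)]
```

With this data, `apply_threshold(scored, 0.95)` keeps the first and third pairs, and `apply_threshold(scored, 1)` keeps only the pair scored exactly 1.0.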
left_id
class-attribute
instance-attribute
left_id: str = Field(default='id', description='The unique ID field in the left data')
right_id
class-attribute
instance-attribute
right_id: str = Field(default='id', description='The unique ID field in the right data')
add_enforced_settings
add_enforced_settings() -> SplinkSettings
Ensure ID is the only field we link on.
SplinkLinker
Bases: Linker
A linker that leverages Bayesian record linkage using Splink.
link
Link the left and right dataframes.
weighteddeterministic
A linking methodology that applies different weights to field comparisons.
Classes:
- WeightedComparison – A valid comparison and a weight to give it.
- WeightedDeterministicSettings – A data class to enforce the Weighted linker’s settings dictionary shape.
- WeightedDeterministicLinker – A deterministic linker that applies different weights to field comparisons.
WeightedComparison
Bases: BaseModel
A valid comparison and a weight to give it.
Methods:
- validate_comparison – Validate the comparison string.
Attributes:
- comparison (str)
- weight (float)
comparison
class-attribute
instance-attribute
comparison: str = Field(description='\n A valid ON clause to compare fields between the left and \n the right data.\n\n Use left.field and right.field to refer to fields in the \n respective sources.\n\n For example:\n\n "left.company_name = right.company_name"\n ')
WeightedDeterministicSettings
Bases: LinkerSettings
A data class to enforce the Weighted linker’s settings dictionary shape.
Example
    {
        left_id: "hash",
        right_id: "hash",
        weighted_comparisons: [
            ("l.company_name = r.company_name", 0.7),
            ("l.postcode = r.postcode", 0.7),
            ("l.company_id = r.company_id", 1),
        ],
        threshold: 0.8,
    }
Attributes:
- weighted_comparisons (list[WeightedComparison])
- threshold (float)
- left_id (str)
- right_id (str)
weighted_comparisons
class-attribute
instance-attribute
weighted_comparisons: list[WeightedComparison] = Field(description='A list of tuples in the form of a comparison, and a weight.')
threshold
class-attribute
instance-attribute
threshold: float = Field(description='\n The score above which matches will be kept. \n \n Inclusive, so a value of 1 will keep only exact matches across all \n comparisons.\n ', ge=0, le=1)
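The page does not state how the weights are combined into a score. The description, in which a threshold of 1 keeps only pairs matching on every comparison, is consistent with a weight-normalised sum, so the sketch below assumes that formula. `weighted_score` and `keep_match` are hypothetical helpers and may not reflect the actual implementation.

```python
def weighted_score(outcomes, weights):
    """Share of total weight carried by the comparisons that matched, in [0, 1]."""
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must sum to a positive number")
    matched = sum(weight for hit, weight in zip(outcomes, weights) if hit)
    return matched / total


def keep_match(outcomes, weights, threshold):
    """Inclusive threshold: a score equal to the threshold is kept."""
    return weighted_score(outcomes, weights) >= threshold


# Weights in the order: company_name, postcode, company_id.
weights = [0.7, 0.7, 1]
all_match = weighted_score([True, True, True], weights)
name_and_postcode = weighted_score([True, True, False], weights)
```

Under this assumed formula, a pair matching on every comparison scores 1, while a pair matching only on name and postcode scores 1.4 / 2.4, which falls below a threshold of 0.8.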
WeightedDeterministicLinker
Bases: Linker
A deterministic linker that applies different weights to field comparisons.
models
Functions and classes to define, run and register models.
Classes:
- Model – Unified model class for both linking and deduping operations.
Functions:
- add_model_class – Add custom deduper or linker.
Model
Model(name: str, dag: DAG, model_class: type[Deduper], model_settings: DeduperSettings | dict, left_query: Query, right_query: None = None, description: str | None = None)
Model(dag: DAG, name: str, model_class: type[Linker], model_settings: LinkerSettings | dict, left_query: Query, right_query: Query, description: str | None = None)
Model(dag: DAG, name: str, model_class: type[Deduper] | type[Linker] | str, model_settings: DeduperSettings | LinkerSettings | dict, left_query: Query, right_query: Query | None = None, description: str | None = None)
Bases: StepABC
Unified model class for both linking and deduping operations.
Parameters:
- dag (DAG) – DAG containing this model.
- name (str) – Unique name for the model.
- model_class (type[Deduper] | type[Linker] | str) – Class of Linker or Deduper, or its name.
- model_settings (DeduperSettings | LinkerSettings | dict) – Appropriate settings object to pass to model class.
- left_query (Query) – The query that will get the data to deduplicate, or the data to link on the left.
- right_query (Query | None, default: None) – The query that will get the data to link on the right.
- description (str | None, default: None) – Optional description of the model.
Methods:
- to_dto – Convert to Step DTO for API calls.
- from_dto – Reconstruct from Step DTO.
- compute_scores – Run model instance against data.
- run – Execute the model pipeline and return results.
- resolver – Create a resolver rooted at this model and add it to the DAG.
- clear_data – Suppress data clearing for models.
- delete – Delete this step and its associated data from the backend.
- download – Fetch remote data for this step and store it locally.
- sync – Send step config and local data to the server.
Attributes:
- left_query
- right_query
- model_class (type[Linker | Deduper])
- model_instance
- model_type (ModelType)
- model_settings
- results (DataFrame | None) – The locally computed model scores. Alias for local_data.
- config (ModelConfig) – Generate config DTO from Model.
- sources (set[SourceStepName]) – Set of source names upstream of this node.
- path (ModelStepPath) – Return the model step path.
- dag
- name
- description
- local_data (DataFrame | None) – The locally computed results for this step.
- cache_path (Path) – Path within the DAG cache for storing this step’s local data.
model_type
instance-attribute
model_type: ModelType = LINKER if issubclass(model_class, Linker) else DEDUPER
results
property
writable
The locally computed model scores. Alias for local_data.
cache_path
property
cache_path: Path
Path within the DAG cache for storing this step’s local data.
from_dto
classmethod
Reconstruct from Step DTO.
compute_scores
Run model instance against data.
run
run(left_data: DataFrame | None = None, right_data: DataFrame | None = None, low_memory: bool = False) -> DataFrame
Execute the model pipeline and return results.
Parameters:
- left_data (optional, default: None) – Pre-fetched query data to deduplicate if the model is a deduper, or link on the left if the model is a linker.
- right_data (optional, default: None) – Pre-fetched query data to link on the right, if the model is a linker. If the model is a deduper, this argument is ignored.
- low_memory (bool, default: False) – If True, it will not download data from the server to support evaluation.
resolver
resolver(*other_models: Model, name: str, resolver_class: type[ResolverMethod] | str, resolver_settings: ResolverSettings | dict[str, Any] | None = None, description: str | None = None) -> Resolver
Create a resolver rooted at this model and add it to the DAG.
delete
Delete this step and its associated data from the backend.
sync
Send step config and local data to the server.
Not resistant to race conditions: only one client should call sync at a time.