API reference
matchbox.client is the client for interacting with the Matchbox server.
All names in matchbox.client are also accessible from the top-level matchbox module.
matchbox.client
All client-side functionalities of Matchbox.
Modules:
- cli – CLI package for Matchbox client.
- dags – Objects to define a DAG which indexes, deduplicates and links data.
- eval – Public evaluation helpers for Matchbox clients.
- locations – Interface to locations where source data is stored.
- models – Deduplication and linking methodologies.
- queries – Definition of model inputs.
- resolvers – Resolver methodologies and resolver DAG nodes.
- results – Objects representing the results of running a model client-side.
- sources – Interface to source data.
- steps – Base class for client-side DAG step nodes.
Classes:
- DAG – Self-sufficient pipeline of indexing, deduping and linking steps.
- RelationalDBLocation – A location for a relational database.
DAG
DAG(name: CollectionName, admin_group: GroupName = PUBLIC)
Self-sufficient pipeline of indexing, deduping and linking steps.
Parameters:
- name (CollectionName) – The name of the DAG, and therefore the collection it will connect to.
- admin_group (GroupName, default: PUBLIC) – The name of the group that will be given admin permission over the DAG. Defaults to public, where anyone can modify, delete or run it.
Methods:
- list_all – List available DAG names on the server.
- source – Create Source and add it to the DAG.
- model – Create Model and add it to the DAG.
- resolver – Create a resolver and add it to the DAG.
- add_step – Add a step to the DAG.
- get_source – Get a source by name from the DAG.
- get_model – Get a model by name from the DAG.
- get_resolver – Get a resolver by name from the DAG.
- query – Create Query object.
- draw – Create a string representation of the DAG.
- new_run – Start a new run.
- set_client – Assign a client to all sources at once.
- load_default – Attach to default run in this collection, loading all DAG nodes.
- load_pending – Attach to the pending run in this collection, loading all DAG nodes.
- run_and_sync – Run entire DAG and send results to server.
- set_default – Set the current run as the default for the collection.
- lookup_key – Matches IDs against the selected backend.
- get_matches – Return ResolverMatches, optionally filtering.
Attributes:
- name (CollectionName)
- admin_group (GroupName)
- nodes (dict[StepName, Source | Model | Resolver])
- graph (dict[StepName, list[StepName]])
- run (RunID) – Return run ID if available, else error.
- sequence (list[StepName]) – Return nodes in topological execution order.
- final_steps (list[Source | Model | Resolver]) – Returns all apex nodes in the DAG.
- default_resolver (Resolver) – Return the default resolver for this DAG.
sequence
property
final_steps
property
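The sequence and final_steps properties can be pictured with a small standard-library sketch. It assumes, purely for illustration, that graph maps each step name to the names of the steps it depends on. The step names and helper functions are hypothetical, and this is not Matchbox's own implementation.

```python
from graphlib import TopologicalSorter

# Hypothetical graph: each step name maps to the steps it depends on (assumption).
graph = {
    "source_a": [],
    "source_b": [],
    "dedupe_a": ["source_a"],
    "link_ab": ["dedupe_a", "source_b"],
    "resolver": ["link_ab"],
}


def execution_order(graph: dict[str, list[str]]) -> list[str]:
    """Return step names so that every step comes after its dependencies."""
    return list(TopologicalSorter(graph).static_order())


def apex_nodes(graph: dict[str, list[str]]) -> list[str]:
    """Return the steps that no other step depends on."""
    depended_on = {dep for deps in graph.values() for dep in deps}
    return [name for name in graph if name not in depended_on]


order = execution_order(graph)
apexes = apex_nodes(graph)
```

Here order lists sources before the models that consume them, and apexes contains only the step that nothing else depends on.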
get_source
Get a source by name from the DAG.
Parameters:
Returns:
- Source – The Source object.
Raises:
- ValueError – If the name doesn’t exist in the DAG or isn’t a Source.
get_model
Get a model by name from the DAG.
Parameters:
Returns:
- Model – The Model object.
Raises:
- ValueError – If the name doesn’t exist in the DAG or isn’t a Model.
draw
draw(status: DAGExecutionStatus | None = None, mode: Literal['tree', 'list'] = 'tree') -> str
Create a string representation of the DAG.
In tree mode, nodes are shown in a dependency tree.
In list mode, nodes are shown in execution order as a numbered list.
If status is provided, it will show the status of each node.
The status indicators are:
- ✅ Done
- 🔄 Working
- ⏸️ Awaiting
- ⏭️ Skipped
Node type indicators are:
- 💎 Resolver
- ⚙️ Model
- 📄 Source
Parameters:
- status (DAGExecutionStatus | None, default: None) – Object describing the status of each node.
- mode (Literal['tree', 'list'], default: 'tree') – “tree” renders the DAG as a tree structure (default). “list” renders nodes in flat execution order.
Returns:
- str – String representation of the DAG with status indicators.
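As a rough illustration of list mode, the sketch below numbers nodes in execution order and prefixes each with the indicators listed above. The exact layout draw produces is not documented here, so the line format, the plain-dict status argument and the step names are all assumptions rather than the library's behaviour.

```python
# Indicator tables copied from the description above.
STATUS_ICONS = {"done": "✅", "working": "🔄", "awaiting": "⏸️", "skipped": "⏭️"}
KIND_ICONS = {"resolver": "💎", "model": "⚙️", "source": "📄"}


def draw_list(order, kinds, status=None):
    """Render steps as a numbered list (hypothetical layout, not Matchbox's)."""
    lines = []
    for position, name in enumerate(order, start=1):
        # Only show a status icon when a status mapping is supplied.
        prefix = f"{STATUS_ICONS[status[name]]} " if status else ""
        lines.append(f"{position}. {prefix}{KIND_ICONS[kinds[name]]} {name}")
    return "\n".join(lines)


# Hypothetical step names and statuses.
order = ["companies", "dedupe_companies", "final"]
kinds = {"companies": "source", "dedupe_companies": "model", "final": "resolver"}
status = {"companies": "done", "dedupe_companies": "working", "final": "awaiting"}

plain = draw_list(order, kinds)
with_status = draw_list(order, kinds, status)
```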
load_default
load_default() -> Self
Attach to default run in this collection, loading all DAG nodes.
load_pending
load_pending() -> Self
Attach to the pending run in this collection, loading all DAG nodes.
Pending is defined as the last non-default run.
run_and_sync
run_and_sync(start: str | None = None, finish: str | None = None, low_memory: bool = False, batch_size: int | None = None, profile: bool = False) -> None
Run entire DAG and send results to server.
Parameters:
- start (str | None, default: None) – Name of the first node to run.
- finish (str | None, default: None) – Name of the last node to run.
- low_memory (bool, default: False) – Whether to delete the data for each node after it is run.
- batch_size (int | None, default: None) – The size used for internal batching. Overrides the environment variable if set.
- profile (bool, default: False) – Whether to log memory usage at INFO level.
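One way to read start and finish is as an inclusive window over the execution order. The sketch below shows that reading with a hypothetical helper and step names. How run_and_sync actually selects nodes is not specified here, so treat this as an assumption.

```python
def select_steps(sequence, start=None, finish=None):
    """Return the slice of `sequence` from `start` to `finish`, both inclusive."""
    first = sequence.index(start) if start is not None else 0
    last = sequence.index(finish) if finish is not None else len(sequence) - 1
    if first > last:
        raise ValueError("start must not come after finish in the execution order")
    return sequence[first : last + 1]


# Hypothetical execution order.
sequence = ["source_a", "source_b", "dedupe_a", "link_ab", "resolver"]

everything = select_steps(sequence)
middle = select_steps(sequence, start="dedupe_a", finish="link_ab")
```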
set_default
Set the current run as the default for the collection.
Makes it immutable, then moves the default pointer to it.
lookup_key
get_matches
get_matches(resolver: ResolverStepName | None = None, source_filter: list[SourceStepName] | None = None, location_names: list[str] | None = None) -> ResolverMatches
Return ResolverMatches, optionally filtering.
Parameters:
- resolver (ResolverStepName | None, default: None) – Name of resolver to query within DAG. If not provided, will look for an apex.
- source_filter (list[SourceStepName] | None, default: None) – An optional list of source step names to filter by.
- location_names (list[str] | None, default: None) – An optional list of location names to filter by.
RelationalDBLocation
RelationalDBLocation(name: str)
Bases: Location
A location for a relational database.
Methods:
- set_client – Set client for location and return the location.
- from_config – Initialise location from a location config.
- connect – Establish connection to the data location.
- validate_extract_transform – Check that the SQL statement only contains a single data-extracting command.
- infer_types – Extract all data types from the ET logic.
- execute – Execute ET logic against location and return batches.
Attributes:
- config
- client (Engine | Connection | None) – Retrieve client.
- location_type (LocationType) – Output location type string.
- client_type (ClientType | None) – Determine client type from the client.
location_type
class-attribute
instance-attribute
location_type: LocationType = RDBMS
Output location type string.
from_config
classmethod
from_config(config: LocationConfig) -> Self
Initialise location from a location config.
validate_extract_transform
validate_extract_transform(extract_transform: str) -> None
Check that the SQL statement only contains a single data-extracting command.
We are NOT attempting a full sanitisation of the SQL statement.
Validation is done purely to stop accidental mistakes, not malicious actors.
Users should only run indexing using SourceConfigs they trust and have read, using least-privilege credentials.
Parameters:
Raises:
- ParseError – If the SQL statement cannot be parsed.
- MatchboxSourceExtractTransformError – If validation requirements are not met.
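The idea of a "single data-extracting command" check can be sketched with the standard library alone. This is a deliberately naive stand-in: it splits on semicolons and inspects the first keyword, so it mishandles semicolons inside string literals and does not inspect what follows a WITH clause. The exception class and function name are hypothetical, and the real method may well rely on a proper SQL parser instead.

```python
import re


class ExtractTransformError(ValueError):
    """Hypothetical stand-in for MatchboxSourceExtractTransformError."""


def check_single_select(sql: str) -> None:
    """Raise unless `sql` looks like exactly one SELECT or WITH statement."""
    # Naive split: a semicolon inside a string literal would be miscounted.
    statements = [part.strip() for part in sql.split(";") if part.strip()]
    if len(statements) != 1:
        raise ExtractTransformError("expected exactly one SQL statement")
    first_keyword = re.match(r"[A-Za-z]+", statements[0])
    if first_keyword is None or first_keyword.group(0).lower() not in {"select", "with"}:
        raise ExtractTransformError("statement must start with SELECT or WITH")


# Passes silently: one data-extracting statement.
check_single_select("SELECT id, name FROM companies;")
```

Like the method it imitates, a check of this kind guards against accidental mistakes only and is no substitute for least-privilege credentials.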
infer_types
Extract all data types from the ET logic.
execute
execute(extract_transform: str, batch_size: int | None = None, rename: dict[str, str] | Callable | None = None, return_type: Literal[POLARS] = ..., keys: tuple[str, list[str]] | None = None, schema_overrides: dict[str, DataType] | None = None) -> Generator[DataFrame, None, None]
execute(extract_transform: str, batch_size: int | None = None, rename: dict[str, str] | Callable | None = None, return_type: QueryReturnType = POLARS, keys: tuple[str, list[str]] | None = None, schema_overrides: dict[str, DataType] | None = None) -> Generator[QueryReturnClass, None, None]
Execute ET logic against location and return batches.
Parameters:
- extract_transform (str) – The ET logic to execute.
- batch_size (int | None, default: None) – The size used for internal batching. Overrides the environment variable if set.
- rename (dict[str, str] | Callable | None, default: None) – Renaming to apply after the ET logic is executed.
  - If a dictionary is provided, it will be used to rename the columns.
  - If a callable is provided, it will take the old name as input and return the new name.
- return_type (QueryReturnType, default: POLARS) – The type of data to return. Defaults to “polars”.
- keys (tuple[str, list[str]] | None, default: None) – Rule to only retrieve rows by specific keys. The first element of the tuple is the name of the field on which to filter. Only source entries whose value in that field appears in the second element, the list of values, are retrieved.
Raises:
- AttributeError – If the client is not set.
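The rename and keys parameters can be illustrated on plain Python data. The helpers below are hypothetical and operate on lists and dicts rather than on the DataFrame batches that execute returns. Leaving unmapped columns unchanged when a dictionary is given is an assumption.

```python
from __future__ import annotations

from collections.abc import Callable


def rename_columns(
    columns: list[str],
    rename: dict[str, str] | Callable[[str], str] | None = None,
) -> list[str]:
    """Apply a dict or callable renaming to column names."""
    if rename is None:
        return list(columns)
    if callable(rename):
        # The callable receives the old name and returns the new one.
        return [rename(name) for name in columns]
    # Assumption: columns missing from the dict keep their original name.
    return [rename.get(name, name) for name in columns]


def filter_by_keys(
    rows: list[dict[str, str]],
    keys: tuple[str, list[str]] | None = None,
) -> list[dict[str, str]]:
    """Keep only rows whose value in the key field is among the given values."""
    if keys is None:
        return list(rows)
    field, allowed = keys
    allowed_values = set(allowed)
    return [row for row in rows if row[field] in allowed_values]


# Hypothetical data.
rows = [{"id": "1", "nm": "Acme"}, {"id": "2", "nm": "Birch"}, {"id": "3", "nm": "Cobalt"}]
renamed = rename_columns(["id", "nm"], {"nm": "name"})
uppercased = rename_columns(["id", "nm"], str.upper)
selected = filter_by_keys(rows, ("id", ["1", "3"]))
```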