API reference
matchbox.client is the client used to interact with the Matchbox server.
All names in matchbox.client are also accessible from the top-level matchbox module.
matchbox.client
All client-side functionality of Matchbox.
Modules:
- authorisation – Utilities for JWT API authorisation.
- cli – CLI package for Matchbox client.
- dags – Objects to define a DAG which indexes, deduplicates and links data.
- eval – Public evaluation helpers for Matchbox clients.
- locations – Interface to locations where source data is stored.
- models – Deduplication and linking methodologies.
- queries – Definition of model inputs.
- results – Objects representing the results of running a model client-side.
- sources – Interface to source data.
Classes:
- DAG – Self-sufficient pipeline of indexing, deduping and linking steps.
- RelationalDBLocation – A location for a relational database.
DAG
DAG(name: str)
Self-sufficient pipeline of indexing, deduping and linking steps.
Methods:
- set_downstream_to_rerun – Mark step and downstream steps as not run.
- source – Create Source and add it to the DAG.
- model – Create Model and add it to the DAG.
- add_resolution – Convert a resolution from the server to a Source or Model and add to DAG.
- get_source – Get a source by name from the DAG.
- get_model – Get a model by name from the DAG.
- query – Create Query object.
- draw – Create a string representation of the DAG as a tree structure.
- new_run – Start a new run.
- set_client – Assign a client to all sources at once.
- load_default – Attach to default run in this collection, loading all DAG nodes.
- load_pending – Attach to the pending run in this collection, loading all DAG nodes.
- run_and_sync – Run entire DAG and send results to server.
- set_default – Set the current run as the default for the collection.
- lookup_key – Matches IDs against the selected backend.
- extract_lookup – Return matchbox IDs to source key mapping, optionally filtering.
Attributes:
- name (CollectionName)
- nodes (dict[ResolutionName, Source | Model])
- graph (dict[ResolutionName, list[ResolutionName]])
- run (RunID) – Return run ID if available, else error.
- final_steps (list[Source | Model]) – Returns all apex nodes in the DAG.
- final_step (Source | Model) – Returns the root node in the DAG.
final_steps
property
Returns all apex nodes in the DAG.
final_step
property
Returns the root node in the DAG.
Returns:
- Source | Model – The single final step of the DAG.
Raises:
- ValueError – If the DAG does not have exactly one final step.
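The difference between final_steps and final_step can be sketched in plain Python. This is an illustration only, not Matchbox's implementation: it assumes (the reference does not say) that graph maps each node name to the names of the nodes it depends on, and apex_nodes and single_apex are hypothetical helper names.

```python
def apex_nodes(graph: dict[str, list[str]]) -> list[str]:
    """Return the nodes that no other node depends on.

    Assumption: graph[name] lists the upstream dependencies of `name`.
    """
    depended_on = {dep for deps in graph.values() for dep in deps}
    return [name for name in graph if name not in depended_on]


def single_apex(graph: dict[str, list[str]]) -> str:
    """Return the only apex node, mirroring the documented ValueError."""
    apexes = apex_nodes(graph)
    if len(apexes) != 1:
        raise ValueError(f"Expected exactly one final step, found {len(apexes)}")
    return apexes[0]


# Hypothetical DAG: two sources, one deduper per source, one linker on top.
example_graph = {
    "source_a": [],
    "source_b": [],
    "dedupe_a": ["source_a"],
    "dedupe_b": ["source_b"],
    "link_ab": ["dedupe_a", "dedupe_b"],
}
```

With this shape, a DAG that holds only two unlinked sources has two apex nodes, so asking for a single final step fails with ValueError.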
set_downstream_to_rerun
set_downstream_to_rerun(step_name: ResolutionName)
Mark step and downstream steps as not run.
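As a rough sketch of what "step and downstream steps" covers, the following walks a dependency graph outward from one step. It is not Matchbox's code: it assumes the graph maps each node to its upstream dependencies, and downstream_of is a hypothetical helper name.

```python
from collections import deque


def downstream_of(graph: dict[str, list[str]], step_name: str) -> set[str]:
    """Return step_name plus every node that depends on it, directly or not.

    Assumption: graph[name] lists the upstream dependencies of `name`.
    """
    # Invert the edges so each node points at the nodes that consume it.
    dependants: dict[str, list[str]] = {name: [] for name in graph}
    for name, deps in graph.items():
        for dep in deps:
            dependants.setdefault(dep, []).append(name)

    # Breadth-first walk from the starting step.
    seen = {step_name}
    queue = deque([step_name])
    while queue:
        current = queue.popleft()
        for child in dependants.get(current, []):
            if child not in seen:
                seen.add(child)
                queue.append(child)
    return seen
```

Every name in the returned set would need to be marked as not run, since its inputs may have changed.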
add_resolution
add_resolution(name: ResolutionName, resolution: Resolution) -> None
Convert a resolution from the server to a Source or Model and add to DAG.
get_source
get_source(name: ResolutionName) -> Source
Get a source by name from the DAG.
Parameters:
- name (ResolutionName) – The name of the source to retrieve.
Returns:
- Source – The Source object.
Raises:
- ValueError – If the name doesn’t exist in the DAG or isn’t a Source.
get_model
get_model(name: ResolutionName) -> Model
Get a model by name from the DAG.
Parameters:
- name (ResolutionName) – The name of the model to retrieve.
Returns:
- Model – The Model object.
Raises:
- ValueError – If the name doesn’t exist in the DAG or isn’t a Model.
draw
draw(start_time: datetime | None = None, doing: str | None = None, skipped: list[str] | None = None) -> str
Create a string representation of the DAG as a tree structure.
If start_time is provided, it will show the status of each node
based on the last run time. The status indicators are:
- ✅ Done
- 🔄 Working
- ⏸️ Awaiting
- ⏭️ Skipped
Parameters:
- start_time (datetime | None, default: None) – Start time of the DAG run. Used to calculate node status.
- doing (str | None, default: None) – Name of the node currently being processed (if any).
- skipped (list[str] | None, default: None) – List of node names that were skipped.
Returns:
- str – String representation of the DAG with status indicators.
load_default
load_default() -> Self
Attach to default run in this collection, loading all DAG nodes.
load_pending
load_pending() -> Self
Attach to the pending run in this collection, loading all DAG nodes.
Pending is defined as the last non-default run.
run_and_sync
Run entire DAG and send results to server.
set_default
Set the current run as the default for the collection.
Makes it immutable, then moves the default pointer to it.
lookup_key
lookup_key(from_source: str, to_sources: list[str], key: str, threshold: int | None = None) -> dict[str, list[str]]
Matches IDs against the selected backend.
Parameters:
- from_source (str) – Name of the source the provided key belongs to.
- to_sources (list[str]) – Names of the sources to find keys in.
- key (str) – The value to match from the source. Usually a primary key.
- threshold (int | None, default: None) – The threshold to use for creating clusters. If None, uses the resolutions’ default threshold. If an integer, uses that threshold for the specified resolution, and the resolution’s cached thresholds for its ancestors.
Returns:
- dict[str, list[str]]
extract_lookup
extract_lookup(source_filter: list[str] | None = None, location_names: list[str] | None = None) -> Table
Return matchbox IDs to source key mapping, optionally filtering.
RelationalDBLocation
RelationalDBLocation(name: str)
Bases: Location
A location for a relational database.
Methods:
- set_client – Set client for location and return the location.
- from_config – Initialise location from a location config.
- connect – Establish connection to the data location.
- validate_extract_transform – Check that the SQL statement only contains a single data-extracting command.
- infer_types – Extract all data types from the ET logic.
- execute – Execute ET logic against location and return batches.
Attributes:
- config
- client (Engine) – Retrieve client.
- location_type (LocationType) – Output location type string.
- client_type (ClientType) – Client type string.
location_type
class-attribute
instance-attribute
location_type: LocationType = RDBMS
Output location type string.
client_type
class-attribute
instance-attribute
client_type: ClientType = SQLALCHEMY
Client type string.
from_config
from_config(config: LocationConfig) -> Self
Initialise location from a location config.
validate_extract_transform
validate_extract_transform(extract_transform: str) -> None
Check that the SQL statement only contains a single data-extracting command.
We are NOT attempting a full sanitisation of the SQL statement. Validation is done purely to stop accidental mistakes, not malicious actors. Users should only run indexing using SourceConfigs they trust and have read, using least-privilege credentials.
Parameters:
- extract_transform (str) – The SQL statement to validate.
Raises:
- ParseError – If the SQL statement cannot be parsed.
- MatchboxSourceExtractTransformError – If validation requirements are not met.
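To give a feel for the kind of accidental mistake such a check guards against, here is a deliberately naive, string-based stand-in. It is not how Matchbox validates SQL (the documented ParseError suggests real parsing is involved), looks_like_single_select is a hypothetical name, and the approach would misjudge statements with semicolons inside string literals.

```python
def looks_like_single_select(sql: str) -> bool:
    """Naive check: exactly one statement, and it starts with SELECT or WITH.

    Illustration only; it does not parse SQL and is not a security control.
    """
    # Split on semicolons and ignore empty fragments such as a trailing ";".
    statements = [part.strip() for part in sql.split(";") if part.strip()]
    if len(statements) != 1:
        return False
    first_word = statements[0].split(None, 1)[0].lower()
    return first_word in {"select", "with"}
```

A statement such as "SELECT 1; DROP TABLE t" is rejected because it contains two commands, which matches the stated goal of catching mistakes. As the note above says, this kind of check is no substitute for trusted configs and least-privilege credentials.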
infer_types
Extract all data types from the ET logic.
execute
execute(extract_transform: str, batch_size: int | None = None, rename: dict[str, str] | Callable | None = None, return_type: QueryReturnType = POLARS, keys: tuple[str, list[str]] | None = None, schema_overrides: dict[str, DataType] | None = None) -> Generator[QueryReturnClass, None, None]
Execute ET logic against location and return batches.
Parameters:
- extract_transform (str) – The ET logic to execute.
- batch_size (int | None, default: None) – The size of the batches to return.
- rename (dict[str, str] | Callable | None, default: None) – Renaming to apply after the ET logic is executed.
  - If a dictionary is provided, it will be used to rename the columns.
  - If a callable is provided, it will take the old name as input and return the new name.
- return_type (QueryReturnType, default: POLARS) – The type of data to return. Defaults to “polars”.
- keys (tuple[str, list[str]] | None, default: None) – Rule to only retrieve rows by specific keys. The first element of the tuple is the field name on which to filter; the second is the list of values to keep. Only source entries whose key field is in that list are retrieved.
Raises:
- AttributeError – If the client is not set.