API reference
matchbox.client is the client used to interact with the Matchbox server.
All names in matchbox.client are also accessible from the top-level matchbox module.
matchbox.client
All client-side functionalities of Matchbox.
Modules:
- authorisation – Utilities for JWT API authorisation.
- dags – Objects to define a DAG which indexes, deduplicates and links data.
- eval – Module implementing client-side evaluation features.
- models – Deduplication and linking methodologies.
- queries – Definition of model inputs.
- results – Objects representing the results of running a model client-side.
- sources – Interface to locations where source data is stored.
Classes:
- DAG – Self-sufficient pipeline of indexing, deduping and linking steps.
- RelationalDBLocation – A location for a relational database.
DAG
DAG(name: str)
Self-sufficient pipeline of indexing, deduping and linking steps.
Methods:
- source – Create Source and add it to the DAG.
- model – Create Model and add it to the DAG.
- add_resolution – Convert a resolution to a Source or Model and add to DAG.
- get_source – Get a source by name from the DAG.
- get_model – Get a model by name from the DAG.
- query – Create Query object.
- draw – Create a string representation of the DAG as a tree structure.
- new_run – Start a new run.
- load_default – Attach to default run in this collection, loading all DAG nodes.
- run_and_sync – Run entire DAG and send results to server.
- set_default – Set the current run as the default for the collection.
- lookup_key – Matches IDs against the selected backend.
- extract_lookup – Return matchbox IDs to source key mapping, optionally filtering.
Attributes:
- name (CollectionName)
- nodes (dict[ResolutionName, Source | Model])
- graph (dict[ResolutionName, list[ResolutionName]])
- run (RunID) – Return run ID if available, else error.
- final_step (Source | Model) – Returns the root node in the DAG.
final_step
property
Returns the root node in the DAG.
Returns:
- Source | Model – The root node in the DAG.
Raises:
- ValueError – If the DAG does not have a final step.
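To illustrate what "root node" means here, the following is a minimal sketch, not Matchbox's implementation. It assumes a graph shaped like the graph attribute, with each node name mapped to the names of the nodes it takes as inputs. The helper name find_final_step and the sample node names are hypothetical.

```python
def find_final_step(graph: dict[str, list[str]]) -> str:
    """Return the one node that no other node depends on.

    Assumes `graph` maps each node name to the names of its inputs.
    """
    # Every name that appears as an input of some other node
    used_as_input = {dep for deps in graph.values() for dep in deps}
    # Candidates for the final step are nodes that nothing else consumes
    roots = [name for name in graph if name not in used_as_input]
    if len(roots) != 1:
        raise ValueError(f"Expected exactly one final step, found {len(roots)}")
    return roots[0]


# Hypothetical pipeline: two sources, one deduper each, one linker on top
example_graph = {
    "source_a": [],
    "source_b": [],
    "dedupe_a": ["source_a"],
    "dedupe_b": ["source_b"],
    "link_ab": ["dedupe_a", "dedupe_b"],
}
final = find_final_step(example_graph)
print(final)
```

In this sketch a graph with several unconsumed nodes, or none, has no single final step, which is the situation in which a ValueError is raised.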
add_resolution
add_resolution(name: ResolutionName, resolution: Resolution, location: Location) -> None
Convert a resolution to a Source or Model and add to DAG.
get_source
get_source(name: ResolutionName) -> Source
Get a source by name from the DAG.
Parameters:
- name (ResolutionName) – The name of the source to retrieve.
Returns:
- Source – The Source object.
Raises:
- ValueError – If the name doesn’t exist in the DAG or isn’t a Source.
get_model
get_model(name: ResolutionName) -> Model
Get a model by name from the DAG.
Parameters:
- name (ResolutionName) – The name of the model to retrieve.
Returns:
- Model – The Model object.
Raises:
- ValueError – If the name doesn’t exist in the DAG or isn’t a Model.
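The lookup-and-check behaviour described for get_source and get_model can be sketched as below. This is an illustration under stated assumptions rather than the library's code: the Source and Model classes are bare stand-ins, and get_typed_node is a hypothetical helper.

```python
from dataclasses import dataclass


@dataclass
class Source:  # stand-in for illustration, not matchbox's Source
    name: str


@dataclass
class Model:  # stand-in for illustration, not matchbox's Model
    name: str


def get_typed_node(nodes: dict[str, object], name: str, kind: type):
    """Fetch `name` from `nodes`, insisting it is an instance of `kind`."""
    node = nodes.get(name)
    if node is None:
        raise ValueError(f"'{name}' does not exist in the DAG")
    if not isinstance(node, kind):
        raise ValueError(f"'{name}' is not a {kind.__name__}")
    return node


# Hypothetical node registry keyed by resolution name
nodes = {
    "companies": Source("companies"),
    "dedupe_companies": Model("dedupe_companies"),
}
model = get_typed_node(nodes, "dedupe_companies", Model)
print(model)
```

Both failure modes, a missing name and a name of the wrong kind, surface as the same exception type, matching the single ValueError documented above.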
draw
draw(start_time: datetime | None = None, doing: str | None = None, skipped: list[str] | None = None) -> str
Create a string representation of the DAG as a tree structure.
If start_time is provided, it will show the status of each node based on the last run time. The status indicators are:
- ✅ Done
- 🔄 Working
- ⏸️ Awaiting
- ⏭️ Skipped
Parameters:
- start_time (datetime | None, default: None) – Start time of the DAG run. Used to calculate node status.
- doing (str | None, default: None) – Name of the node currently being processed (if any).
- skipped (list[str] | None, default: None) – List of node names that were skipped.
Returns:
- str – String representation of the DAG with status indicators.
load_default
Attach to default run in this collection, loading all DAG nodes.
run_and_sync
Run entire DAG and send results to server.
set_default
Set the current run as the default for the collection.
Makes it immutable, then moves the default pointer to it.
lookup_key
lookup_key(from_source: str, to_sources: list[str], key: str, threshold: int | None = None) -> dict[str, list[str]]
Matches IDs against the selected backend.
Parameters:
- from_source (str) – Name of the source the provided key belongs to.
- to_sources (list[str]) – Names of the sources to find keys in.
- key (str) – The value to match from the source. Usually a primary key.
- threshold (optional, default: None) – The threshold to use for creating clusters. If None, uses the resolutions’ default threshold. If an integer, uses that threshold for the specified resolution, and the resolution’s cached thresholds for its ancestors.
Returns:
Examples:
extract_lookup
extract_lookup(source_filter: list[str] | None = None, location_names: list[str] | None = None) -> Table
Return matchbox IDs to source key mapping, optionally filtering.
RelationalDBLocation
Bases: Location
A location for a relational database.
Methods:
- from_config – Initialise location from a location config and an appropriate client.
- connect – Establish connection to the data location.
- validate_extract_transform – Check that the SQL statement only contains a single data-extracting command.
- infer_types – Extract all data types from the ET logic.
- execute – Execute ET logic against location and return batches.
Attributes:
- config
- client (Engine)
- location_type (LocationType) – Output location type string.
location_type
class-attribute
instance-attribute
location_type: LocationType = RDBMS
Output location type string.
from_config
from_config(config: LocationConfig, client: Any) -> Self
Initialise location from a location config and an appropriate client.
validate_extract_transform
validate_extract_transform(extract_transform: str) -> bool
Check that the SQL statement only contains a single data-extracting command.
We are NOT attempting a full sanitisation of the SQL statement.
Validation is done purely to stop accidental mistakes, not malicious actors.
Users should only run indexing using SourceConfigs they trust and have read, using least-privilege credentials.
Parameters:
Returns:
- bool – True if the SQL statement is valid.
Raises:
- ParseError – If the SQL statement cannot be parsed.
- MatchboxSourceExtractTransformError – If validation requirements are not met.
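The idea of guarding against accidental mistakes, not attackers, can be illustrated with a deliberately naive check. The sketch below is not how Matchbox validates SQL: it uses no SQL parser, the function name is hypothetical, it raises a plain ValueError in place of the library's own exceptions, and it would be confused by semicolons inside string literals or comments.

```python
import re


def is_single_extract_statement(sql: str) -> bool:
    """Naively check that `sql` is exactly one SELECT-style statement."""
    # Split on semicolons and drop empty fragments such as a trailing ";"
    statements = [part.strip() for part in sql.split(";") if part.strip()]
    if len(statements) != 1:
        raise ValueError(f"Expected exactly one statement, found {len(statements)}")

    # Look at the leading keyword only; assumed here to be SELECT or WITH
    match = re.match(r"[A-Za-z]+", statements[0])
    keyword = match.group(0).upper() if match else ""
    if keyword not in {"SELECT", "WITH"}:
        raise ValueError(f"Expected a data-extracting statement, got '{keyword}'")
    return True


ok = is_single_extract_statement("SELECT id, name FROM companies;")
print(ok)
```

A check like this catches a pasted second statement or a stray DELETE, but it offers no protection against a deliberately crafted query, which is why least-privilege credentials remain the real safeguard.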
infer_types
Extract all data types from the ET logic.
execute
execute(extract_transform: str, batch_size: int | None = None, rename: dict[str, str] | Callable | None = None, return_type: QueryReturnType = POLARS, keys: tuple[str, list[str]] | None = None, schema_overrides: dict[str, DataType] | None = None) -> Generator[QueryReturnClass, None, None]
Execute ET logic against location and return batches.
Parameters:
- extract_transform (str) – The ET logic to execute.
- batch_size (int | None, default: None) – The size of the batches to return.
- rename (dict[str, str] | Callable | None, default: None) – Renaming to apply after the ET logic is executed.
  - If a dictionary is provided, it will be used to rename the columns.
  - If a callable is provided, it will take the old name as input and return the new name.
- return_type (QueryReturnType, default: POLARS) – The type of data to return. Defaults to “polars”.
- keys (tuple[str, list[str]] | None, default: None) – Rule to only retrieve rows by specific keys. The first element of the tuple is a field name on which to filter; only source entries whose value for that field is in the list given as the second element are retrieved.
Raises:
- AttributeError – If the client is not set.