DAGs
matchbox.client.dags
Objects to define a DAG which indexes, deduplicates and links data.
Classes:
- DAG – Self-sufficient pipeline of indexing, deduping and linking steps.
DAG
DAG(name: str)
Self-sufficient pipeline of indexing, deduping and linking steps.
Methods:
- set_downstream_to_rerun – Mark step and downstream steps as not run.
- source – Create Source and add it to the DAG.
- model – Create Model and add it to the DAG.
- add_resolution – Convert a resolution from the server to a Source or Model and add to DAG.
- get_source – Get a source by name from the DAG.
- get_model – Get a model by name from the DAG.
- query – Create Query object.
- draw – Create a string representation of the DAG as a tree structure.
- new_run – Start a new run.
- set_client – Assign a client to all sources at once.
- load_default – Attach to default run in this collection, loading all DAG nodes.
- load_pending – Attach to the pending run in this collection, loading all DAG nodes.
- run_and_sync – Run entire DAG and send results to server.
- set_default – Set the current run as the default for the collection.
- lookup_key – Matches IDs against the selected backend.
- extract_lookup – Return matchbox IDs to source key mapping, optionally filtering.
Attributes:
- name (CollectionName)
- nodes (dict[ResolutionName, Source | Model])
- graph (dict[ResolutionName, list[ResolutionName]])
- run (RunID) – Return run ID if available, else error.
- final_steps (list[Source | Model]) – Returns all apex nodes in the DAG.
- final_step (Source | Model) – Returns the root node in the DAG.
final_steps
property
Returns all apex nodes in the DAG.
final_step
property
Returns the root node in the DAG.
Returns:
- Source | Model – The root node in the DAG.
Raises:
- ValueError – If the DAG does not have exactly one final step.
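As a rough illustration of what an apex node is, the sketch below finds the nodes that no other node depends on in a plain adjacency dict. It assumes the graph maps each step name to the names of the steps it depends on, which this page does not confirm. `apex_nodes` and `single_apex` are hypothetical helpers written for this example, not part of matchbox.

```python
def apex_nodes(graph: dict[str, list[str]]) -> list[str]:
    """Return the nodes that no other node lists as a dependency."""
    depended_on = {dep for deps in graph.values() for dep in deps}
    return [name for name in graph if name not in depended_on]


def single_apex(graph: dict[str, list[str]]) -> str:
    """Return the only apex node, or raise ValueError if there is not exactly one."""
    apexes = apex_nodes(graph)
    if len(apexes) != 1:
        raise ValueError(f"Expected exactly one final step, found {len(apexes)}")
    return apexes[0]


# Hypothetical DAG: two sources, each deduplicated, then linked together.
graph = {
    "source_a": [],
    "source_b": [],
    "dedupe_a": ["source_a"],
    "dedupe_b": ["source_b"],
    "link_ab": ["dedupe_a", "dedupe_b"],
}
final = single_apex(graph)
```

With two unlinked sources and nothing else, `single_apex` would raise, which matches the ValueError condition described for final_step.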
set_downstream_to_rerun
set_downstream_to_rerun(step_name: ResolutionName)
Mark step and downstream steps as not run.
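One way to picture "step and downstream steps" is a breadth-first walk over the inverted dependency map. The sketch below is an assumption-laden illustration, not the library's implementation: it assumes the graph maps each step to the steps it depends on, and `downstream_of` and the `has_run` dict are hypothetical names.

```python
from collections import deque


def downstream_of(graph: dict[str, list[str]], start: str) -> set[str]:
    """Return `start` plus every node that depends on it, directly or indirectly."""
    # Invert the dependency map: node -> nodes that consume it.
    consumers: dict[str, list[str]] = {name: [] for name in graph}
    for name, deps in graph.items():
        for dep in deps:
            consumers.setdefault(dep, []).append(name)

    seen = {start}
    queue = deque([start])
    while queue:
        current = queue.popleft()
        for child in consumers.get(current, []):
            if child not in seen:
                seen.add(child)
                queue.append(child)
    return seen


# Hypothetical DAG and run flags.
graph = {
    "source_a": [],
    "source_b": [],
    "dedupe_a": ["source_a"],
    "dedupe_b": ["source_b"],
    "link_ab": ["dedupe_a", "dedupe_b"],
}
has_run = {name: True for name in graph}
for name in downstream_of(graph, "source_a"):
    has_run[name] = False  # mark the step and everything downstream as not run
```

Here changing `source_a` invalidates `dedupe_a` and `link_ab`, while `source_b` and `dedupe_b` keep their run status.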
add_resolution
add_resolution(name: ResolutionName, resolution: Resolution) -> None
Convert a resolution from the server to a Source or Model and add to DAG.
get_source
get_source(name: ResolutionName) -> Source
Get a source by name from the DAG.
Parameters:
- name (ResolutionName) – The name of the source to retrieve.
Returns:
- Source – The Source object.
Raises:
- ValueError – If the name doesn’t exist in the DAG or isn’t a Source.
get_model
get_model(name: ResolutionName) -> Model
Get a model by name from the DAG.
Parameters:
- name (ResolutionName) – The name of the model to retrieve.
Returns:
- Model – The Model object.
Raises:
- ValueError – If the name doesn’t exist in the DAG or isn’t a Model.
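The two error conditions shared by get_source and get_model (unknown name, or wrong node type) can be sketched as a typed lookup over a dict of nodes. Everything here is a stand-in for illustration: `FakeSource`, `FakeModel` and `get_typed` are hypothetical and are not matchbox classes or functions.

```python
class FakeSource:
    """Stand-in for a source node."""


class FakeModel:
    """Stand-in for a model node."""


def get_typed(nodes: dict[str, object], name: str, expected: type):
    """Return nodes[name] if it exists and has the expected type."""
    if name not in nodes:
        raise ValueError(f"'{name}' does not exist in the DAG")
    node = nodes[name]
    if not isinstance(node, expected):
        raise ValueError(f"'{name}' is not a {expected.__name__}")
    return node


nodes = {"companies": FakeSource(), "dedupe_companies": FakeModel()}
source = get_typed(nodes, "companies", FakeSource)
```

Asking for `"companies"` as a `FakeModel`, or for a name that is not in `nodes`, raises ValueError in both cases.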
draw
draw(start_time: datetime | None = None, doing: str | None = None, skipped: list[str] | None = None) -> str
Create a string representation of the DAG as a tree structure.
If start_time is provided, it will show the status of each node
based on the last run time. The status indicators are:
- ✅ Done
- 🔄 Working
- ⏸️ Awaiting
- ⏭️ Skipped
Parameters:
- start_time (datetime | None, default: None) – Start time of the DAG run. Used to calculate node status.
- doing (str | None, default: None) – Name of the node currently being processed (if any).
- skipped (list[str] | None, default: None) – List of node names that were skipped.
Returns:
- str – String representation of the DAG with status indicators.
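To make the status rules concrete, here is a minimal sketch of rendering a dependency tree with the four indicators. It is not the library's renderer: `draw_tree`, the `last_run` mapping, the indentation style, and the precedence of skipped over working over done are all assumptions made for this example.

```python
from datetime import datetime

DONE, WORKING, AWAITING, SKIPPED = "✅", "🔄", "⏸️", "⏭️"


def draw_tree(graph, root, last_run, start_time=None, doing=None, skipped=None):
    """Render `root` and its dependencies as an indented tree."""
    skipped = skipped or []
    lines = []

    def status(name):
        if start_time is None:
            return ""  # no run in progress, so no indicators
        if name in skipped:
            return SKIPPED + " "
        if name == doing:
            return WORKING + " "
        ran_at = last_run.get(name)
        if ran_at is not None and ran_at >= start_time:
            return DONE + " "  # ran since this DAG run started
        return AWAITING + " "

    def walk(name, depth):
        lines.append("  " * depth + status(name) + name)
        for dep in graph.get(name, []):
            walk(dep, depth + 1)

    walk(root, 0)
    return "\n".join(lines)


# Hypothetical DAG, keyed by step name with the steps each one depends on.
graph = {
    "link_ab": ["dedupe_a", "dedupe_b"],
    "dedupe_a": ["source_a"],
    "dedupe_b": ["source_b"],
    "source_a": [],
    "source_b": [],
}
start = datetime(2024, 1, 1, 12, 0)
last_run = {
    "source_a": datetime(2024, 1, 1, 12, 1),
    "dedupe_a": datetime(2024, 1, 1, 12, 2),
}
rendered = draw_tree(
    graph, "link_ab", last_run, start_time=start, doing="dedupe_b", skipped=["source_b"]
)
print(rendered)
```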
load_default
load_default() -> Self
Attach to default run in this collection, loading all DAG nodes.
load_pending
load_pending() -> Self
Attach to the pending run in this collection, loading all DAG nodes.
Pending is defined as the last non-default run.
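The definition of "pending" can be read as a simple selection over a collection's runs. The sketch below assumes that "last" means the highest run ID and that each run carries an `is_default` flag. Neither is confirmed here, and `RunRecord` and `pick_pending` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RunRecord:
    run_id: int
    is_default: bool


def pick_pending(runs: list[RunRecord]) -> RunRecord:
    """Return the most recent run that is not the default."""
    candidates = [run for run in runs if not run.is_default]
    if not candidates:
        raise LookupError("No pending run: every run is the default")
    return max(candidates, key=lambda run: run.run_id)


runs = [
    RunRecord(run_id=1, is_default=False),
    RunRecord(run_id=2, is_default=True),
    RunRecord(run_id=3, is_default=False),
]
pending = pick_pending(runs)
```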
run_and_sync
Run entire DAG and send results to server.
set_default
Set the current run as the default for the collection.
Makes it immutable, then moves the default pointer to it.
lookup_key
lookup_key(from_source: str, to_sources: list[str], key: str, threshold: int | None = None) -> dict[str, list[str]]
Matches IDs against the selected backend.
Parameters:
- from_source (str) – Name of the source the provided key belongs to.
- to_sources (list[str]) – Names of the sources to find keys in.
- key (str) – The value to match from the source. Usually a primary key.
- threshold (int | None, default: None) – The threshold to use for creating clusters. If None, uses the resolutions’ default threshold. If an integer, uses that threshold for the specified resolution, and the resolution’s cached thresholds for its ancestors.
Returns:
Examples:
extract_lookup
extract_lookup(source_filter: list[str] | None = None, location_names: list[str] | None = None) -> Table