DAGs
matchbox.client.dags
¶
Objects to define a DAG which indexes, deduplicates and links data.
Classes:
-
DAGNodeExecutionStatus–Enumeration of node execution statuses.
-
DAG–Self-sufficient pipeline of indexing, deduping and linking steps.
Attributes:
DAGExecutionStatus
module-attribute
¶
DAGExecutionStatus: TypeAlias = dict[str, DAGNodeExecutionStatus]
DAGNodeExecutionStatus
¶
Bases: StrEnum
flowchart TD
matchbox.client.dags.DAGNodeExecutionStatus[DAGNodeExecutionStatus]
click matchbox.client.dags.DAGNodeExecutionStatus href "" "matchbox.client.dags.DAGNodeExecutionStatus"
Enumeration of node execution statuses.
Attributes:
DAG
¶
DAG(name: CollectionName, admin_group: GroupName = PUBLIC)
Self-sufficient pipeline of indexing, deduping and linking steps.
Parameters:
-
(name¶CollectionName) –The name of the DAG, and therefore the collection it will connect to
-
(admin_group¶GroupName, default:PUBLIC) –The name of the group that will be given admin permission over the DAG. Defaults to public, where anyone can modify, delete or run it
Methods:
-
list_all–List available DAG names on the server.
-
source–Create Source and add it to the DAG.
-
model–Create Model and add it to the DAG.
-
resolver–Create a resolver and add it to the DAG.
-
add_step–Add a step to the DAG.
-
get_source–Get a source by name from the DAG.
-
get_model–Get a model by name from the DAG.
-
get_resolver–Get a resolver by name from the DAG.
-
query–Create Query object.
-
draw–Create a string representation of the DAG.
-
new_run–Start a new run.
-
set_client–Assign a client to all sources at once.
-
load_default–Attach to default run in this collection, loading all DAG nodes.
-
load_pending–Attach to the pending run in this collection, loading all DAG nodes.
-
run_and_sync–Run entire DAG and send results to server.
-
set_default–Set the current run as the default for the collection.
-
lookup_key–Matches IDs against the selected backend.
-
get_matches–Return ResolverMatches, optionally filtering.
Attributes:
-
name(CollectionName) – -
admin_group(GroupName) – -
nodes(dict[StepName, Source | Model | Resolver]) – -
graph(dict[StepName, list[StepName]]) – -
run(RunID) –Return run ID if available, else error.
-
sequence(list[StepName]) –Return nodes in topological execution order.
-
final_steps(list[Source | Model | Resolver]) –Returns all apex nodes in the DAG.
-
default_resolver(Resolver) –Return the default resolver for this DAG.
sequence
property
¶
final_steps
property
¶
get_source
¶
Get a source by name from the DAG.
Parameters:
Returns:
-
Source–The Source object.
Raises:
-
ValueError–If the name doesn’t exist in the DAG or isn’t a Source.
get_model
¶
Get a model by name from the DAG.
Parameters:
Returns:
-
Model–The Model object.
Raises:
-
ValueError–If the name doesn’t exist in the DAG or isn’t a Model.
draw
¶
draw(status: DAGExecutionStatus | None = None, mode: Literal['tree', 'list'] = 'tree') -> str
Create a string representation of the DAG.
In tree mode, nodes are shown in a dependency tree.
In list mode, nodes are shown in execution order as a numbered list.
If status is provided, it will show the status of each node.
The status indicators are:
- ✅ Done
- 🔄 Working
- ⏸️ Awaiting
- ⏭️ Skipped
Node type indicators are:
- 💎 Resolver
- ⚙️ Model
- 📄 Source
Parameters:
-
(status¶DAGExecutionStatus | None, default:None) –Object describing the status of each node.
-
(mode¶Literal['tree', 'list'], default:'tree') –“tree” renders the DAG as a tree structure (default). “list” renders nodes in flat execution order.
Returns:
-
str–String representation of the DAG with status indicators.
load_default
¶
load_default() -> Self
Attach to default run in this collection, loading all DAG nodes.
load_pending
¶
load_pending() -> Self
Attach to the pending run in this collection, loading all DAG nodes.
Pending is defined as the last non-default run.
run_and_sync
¶
run_and_sync(start: str | None = None, finish: str | None = None, low_memory: bool = False, batch_size: int | None = None, profile: bool = False) -> None
Run entire DAG and send results to server.
Parameters:
-
(start¶str | None, default:None) –Name of first node to run
-
(finish¶str | None, default:None) –Name of last node to run
-
(low_memory¶bool, default:False) –Whether to delete data for each node after it is run
-
(batch_size¶int | None, default:None) –The size used for internal batching. Overrides environment variable if set.
-
(profile¶bool, default:False) –whether to log to INFO level the memory usage
set_default
¶
Set the current run as the default for the collection.
Makes it immutable, then moves the default pointer to it.
lookup_key
¶
get_matches
¶
get_matches(resolver: ResolverStepName | None = None, source_filter: list[SourceStepName] | None = None, location_names: list[str] | None = None) -> ResolverMatches
Return ResolverMatches, optionally filtering.
Parameters:
-
(resolver¶ResolverStepName | None, default:None) –Name of resolver to query within DAG. If not provided, will look for an apex.
-
(source_filter¶list[SourceStepName] | None, default:None) –An optional list of source step names to filter by.
-
(location_names¶list[str] | None, default:None) –An optional list of location names to filter by.