Sources
matchbox.client.sources

Interface to source data.

Classes:

- Source: Client-side wrapper for source configs.

Source
Source(dag: DAG, location: Location, name: str, extract_transform: str, key_field: str, index_fields: list[str], description: str | None = None, infer_types: bool = True, validate_etl: bool = True)
Source(dag: DAG, location: Location, name: str, extract_transform: str, key_field: SourceField, index_fields: list[SourceField], description: str | None = None, infer_types: bool = False, validate_etl: bool = True)
Source(dag: DAG, location: Location, name: str, extract_transform: str, key_field: str | SourceField, index_fields: list[str] | list[SourceField], description: str | None = None, infer_types: bool = False, validate_etl: bool = True)
Bases: StepABC
Client-side wrapper for source configs.
Parameters:

- dag (DAG): DAG containing the source.
- location (Location): The location where the source data is stored.
- name (str): The name of the source.
- description (str | None, default: None): An optional description of the source.
- extract_transform (str): The extract/transform logic to apply to the source data.
- key_field (str | SourceField): The name of the field to use as the key, or a SourceField instance defining the key field. This is the unique identifier used to refer to matched data in the source.
- index_fields (list[str] | list[SourceField]): The names of the fields to use as index fields, or a list of SourceField instances defining the index fields. These are the fields you plan to match on.
- infer_types (bool, default: False): Whether to infer data types for the fields from the source. If False, you must provide SourceField instances for key_field and index_fields.
- validate_etl (bool, default: True): Whether to validate the extract/transform query. If True, query validation is performed. Should be False when loading sources from the server.
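A minimal construction sketch may help here. The DAG, location, table, and field names below are invented for illustration; they do not come from this page:

```python
# Hypothetical sketch only: assumes `dag` and `location` already exist,
# and that the warehouse has a `companies` table with these columns.
companies = Source(
    dag=dag,
    location=location,
    name="companies",
    extract_transform="select id, company_name, postcode from companies;",
    key_field="id",  # unique identifier for each row in the source
    index_fields=["company_name", "postcode"],  # fields you plan to match on
    infer_types=True,  # infer SourceField types from the warehouse
)
```

With infer_types=True, plain field names suffice; with the default of False, you would pass SourceField instances for key_field and index_fields instead.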
Methods:

- to_dto: Convert to Step DTO for API calls.
- from_dto: Reconstruct from Step DTO.
- fetch: Apply the extract/transform logic to the source and return batches lazily.
- sample: Peek at the top n entries in a source.
- run: Hash a dataset from its warehouse, ready to be inserted, and cache hashes.
- qualify_field: Qualify a field name with the source name.
- f: Qualify one or more field names with the source name.
- query: Generate a query for this source.
- clear_data: Drop locally computed data.
- delete: Delete this step and its associated data from the backend.
- download: Fetch remote data for this step and store it locally.
- sync: Send step config and local data to the server.
Attributes:

- location
- extract_transform
- key_field
- index_fields
- hashes (DataFrame | None): The locally computed hashes. Alias for local_data.
- config (SourceConfig): Generate SourceConfig from Source.
- sources (set[SourceStepName]): Set of source names upstream of this node.
- path (SourceStepPath): Return the source step path.
- prefix (str): Get the prefix for the source.
- qualified_key (str): Get the qualified key for the source.
- qualified_index_fields (list[str]): Get the qualified index fields for the source.
- dag
- name
- description
- local_data (DataFrame | None): The locally computed results for this step.
index_fields
instance-attribute
index_fields = tuple(remote_fields[field] for field in index_fields)

hashes
property (writable)
The locally computed hashes. Alias for local_data.
qualified_index_fields
property
Get the qualified index fields for the source.

from_dto
classmethod
Reconstruct from Step DTO.
fetch
fetch(qualify_names: bool = False, batch_size: int | None = None, return_type: Literal[POLARS] = ..., keys: list[str] | None = None) -> Generator[DataFrame, None, None]
fetch(qualify_names: bool = False, batch_size: int | None = None, return_type: Literal[PANDAS] = ..., keys: list[str] | None = None) -> Generator[DataFrame, None, None]
fetch(qualify_names: bool = False, batch_size: int | None = None, return_type: Literal[ARROW] = ..., keys: list[str] | None = None) -> Generator[Table, None, None]
fetch(qualify_names: bool = False, batch_size: int | None = None, return_type: QueryReturnType = POLARS, keys: list[str] | None = None) -> Generator[QueryReturnClass, None, None]
Apply the extract/transform logic to the source and return batches lazily.
Parameters:

- qualify_names (bool, default: False): If True, qualify the names of the columns with the source name.
- batch_size (int | None, default: None): The size of each batch when fetching data in batches.
- return_type (QueryReturnType, default: POLARS): The type of data to return. Defaults to "polars".
- keys (list[str] | None, default: None): List of keys to select a subset of all source entries.
Returns:

- Generator: The requested data in the specified format, as an iterator of tables.
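As a hedged sketch of batched fetching (here `companies` is an assumed, already-constructed Source instance, not something defined on this page):

```python
# Hypothetical sketch: stream the source lazily in batches of 10,000 rows.
# Each yielded item is one batch in the default polars return type.
for batch in companies.fetch(batch_size=10_000, qualify_names=True):
    print(batch.shape)  # column names are qualified with the source name
```

Because fetch returns a generator, no batch is materialised until you iterate, which keeps memory bounded for large sources.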
sample
sample(n: int = 100, return_type: QueryReturnType = POLARS) -> None
Peek at the top n entries in a source.
run
run(batch_size: int | None = None) -> DataFrame
Hash a dataset from its warehouse, ready to be inserted, and cache hashes.
Hashes the index fields defined in the source based on the extract/transform logic. Does not hash the key field.
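A hedged usage sketch (assuming an already-constructed Source instance, here called `companies`):

```python
# Hypothetical sketch: hash the index fields locally and cache the result.
hashes = companies.run(batch_size=50_000)
# The cached result is then also available via the `hashes` alias
# for local_data.
print(companies.hashes is not None)
```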
Parameters:

- batch_size (int | None, default: None): The size of each batch when processing data in batches.
qualify_field
Qualify a field name with the source name.

f
Qualify one or more field names with the source name.
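To illustrate what qualification means, here is a standalone sketch of the naming convention; the underscore separator is an assumption for illustration only, so check Source.prefix for the actual prefix your version uses:

```python
# Standalone illustration, NOT matchbox itself: assumes qualified names
# take the form "<source name>_<field>". Verify against Source.prefix.
def qualify_field(source_name: str, field: str) -> str:
    return f"{source_name}_{field}"

def qualify_many(source_name: str, fields: list[str]) -> list[str]:
    return [qualify_field(source_name, field) for field in fields]

print(qualify_many("companies", ["company_name", "postcode"]))
# ['companies_company_name', 'companies_postcode']
```

Qualified names let you refer to a field unambiguously once several sources are joined in one query, which is why fetch offers a qualify_names flag.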
delete
Delete this step and its associated data from the backend.
sync
Send step config and local data to the server.
Not resistant to race conditions: only one client should call sync at a time.
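Putting run and sync together, a hedged end-to-end sketch (again with an assumed, already-constructed Source instance named `companies`):

```python
# Hypothetical sketch: compute hashes locally, then push the step config
# and local data to the server. Only one client should call sync() at a
# time, since the operation is not resistant to race conditions.
companies.run()
companies.sync()
```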