Sources
matchbox.client.sources
Interface to source data.
Classes:
-
Source–Client-side wrapper for source configs.
Source
Source(dag: DAG, location: Location, name: str, extract_transform: str, key_field: str, index_fields: list[str], description: str | None = None, infer_types: bool = True, validate_etl: bool = True)
Source(dag: DAG, location: Location, name: str, extract_transform: str, key_field: SourceField, index_fields: list[SourceField], description: str | None = None, infer_types: bool = False, validate_etl: bool = True)
Source(dag: DAG, location: Location, name: str, extract_transform: str, key_field: str | SourceField, index_fields: list[str] | list[SourceField], description: str | None = None, infer_types: bool = False, validate_etl: bool = True)
Client-side wrapper for source configs.
Parameters:
-
dag(DAG) –DAG containing the source.
-
location(Location) –The location where the source data is stored.
-
name(str) –The name of the source.
-
description(str | None, default:None) –An optional description of the source.
-
extract_transform(str) –The extract/transform logic to apply to the source data.
-
key_field(str | SourceField) –The name of the field to use as the key, or a SourceField instance defining the key field. This is the unique identifier used to refer to matched data in the source.
-
index_fields(list[str] | list[SourceField]) –The names of the fields to use as index fields, or a list of SourceField instances defining the index fields. These are the fields you plan to match on.
-
infer_types(bool, default:False) –Whether to infer data types for the fields from the source. If False, you must provide SourceField instances for key_field and index_fields.
-
validate_etl(bool, default:True) –Whether to validate the extract/transform query. If True, query validation is performed. Set this to False when loading sources from the server.
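The rule that string field names are only accepted when types are inferred can be sketched as below. This is a minimal illustration under stated assumptions, not matchbox's implementation: the `SourceField` dataclass is a hypothetical stand-in whose `name` and `type` attributes are assumed, and `check_fields` is a made-up helper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SourceField:
    """Hypothetical stand-in for matchbox's SourceField (attributes are assumed)."""

    name: str
    type: str


def check_fields(key_field, index_fields, infer_types=False):
    """Raise TypeError if plain names are passed while infer_types is False."""
    fields = [key_field, *index_fields]
    all_typed = all(isinstance(field, SourceField) for field in fields)
    if not infer_types and not all_typed:
        raise TypeError(
            "With infer_types=False, pass SourceField instances, not field names."
        )
    return True


# Plain names are fine when types are inferred from the source.
check_fields("id", ["company_name"], infer_types=True)

# Without inference, every field must carry its own type.
check_fields(
    SourceField("id", "String"),
    [SourceField("company_name", "String")],
)
```

Passing `"id"` and `["company_name"]` with `infer_types=False` would raise `TypeError` in this sketch.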
Methods:
-
to_resolution–Convert to Resolution for API calls.
-
from_resolution–Reconstruct from Resolution.
-
fetch–Applies the extract/transform logic to the source and returns the results.
-
run–Hash a dataset from its warehouse, ready to be inserted, and cache hashes.
-
qualify_field–Qualify field names with the source name.
-
f–Qualify one or more field names with the source name.
-
sync–Send the source config and hashes to the server.
-
query–Generate a query for this source.
Attributes:
-
last_run(datetime | None) –
-
location–
-
dag–
-
name–
-
description–
-
extract_transform–
-
key_field–
-
index_fields–
-
config(SourceConfig) –Generate SourceConfig from Source.
-
prefix(str) –Get the prefix for the source.
-
qualified_key(str) –Get the qualified key for the source.
-
qualified_index_fields(list[str]) –Get the qualified index fields for the source.
index_fields
instance-attribute
index_fields = tuple((remote_fields[field]) for field in index_fields)
qualified_index_fields
property
Get the qualified index fields for the source.
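Qualifying a field name means prefixing it with the source name so that columns from different sources do not collide. The sketch below assumes an underscore separator and uses a hypothetical `qualify` helper; the actual prefix is whatever the `prefix` property returns.

```python
def qualify(source_name: str, field: str, sep: str = "_") -> str:
    """Prefix a field name with its source name (separator is an assumption)."""
    return f"{source_name}{sep}{field}"


index_fields = ["company_name", "postcode"]
qualified = [qualify("companies", field) for field in index_fields]
print(qualified)  # ['companies_company_name', 'companies_postcode']
```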
from_resolution
classmethod
from_resolution(resolution: Resolution, resolution_name: str, dag: DAG, location: Location) -> Source
Reconstruct from Resolution.
fetch
fetch(qualify_names: bool = False, batch_size: int | None = None, return_type: QueryReturnType = POLARS, keys: list[str] | None = None) -> Generator[QueryReturnClass, None, None]
Applies the extract/transform logic to the source and returns the results.
Parameters:
-
qualify_names(bool, default:False) –If True, qualify the names of the columns with the source name.
-
batch_size(int | None, default:None) –Indicate the size of each batch when fetching data in batches.
-
return_type(QueryReturnType, default:POLARS) –The type of data to return. Defaults to “polars”.
-
keys(list[str] | None, default:None) –List of keys to select a subset of all source entries.
Returns:
-
Generator[QueryReturnClass, None, None]–The requested data in the specified format, as an iterator of tables.
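Because fetch returns a generator, callers iterate over it rather than receiving a single table. The batching behaviour can be sketched with a stand-in function; `fetch_batches` and the list-of-dicts row format are hypothetical and only mimic the shape of the real method, which yields tables in the requested return type.

```python
from typing import Dict, Iterator, List, Optional


def fetch_batches(
    rows: List[Dict[str, str]], batch_size: Optional[int] = None
) -> Iterator[List[Dict[str, str]]]:
    """Stand-in for a batched fetch: yield everything at once, or in slices."""
    if batch_size is None:
        yield rows
        return
    for start in range(0, len(rows), batch_size):
        yield rows[start : start + batch_size]


rows = [{"id": str(i)} for i in range(5)]

# Iterate over the generator; each item is one batch.
batch_lengths = [len(batch) for batch in fetch_batches(rows, batch_size=2)]
print(batch_lengths)  # [2, 2, 1]
```

With `batch_size=None` the sketch yields a single batch holding all rows, so the calling loop is the same in both cases.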
run
run(batch_size: int | None = None, full_rerun: bool = False) -> Table
Hash a dataset from its warehouse, ready to be inserted, and cache hashes.
Hashes the index fields defined in the source based on the extract/transform logic.
Does not hash the key field.
Parameters:
-
batch_size(int | None, default:None) –If set, process data in batches internally. Indicates the size of each batch.
-
full_rerun(bool, default:False) –Whether to force a re-run even if the hashes are cached.
Returns:
-
Table–A PyArrow Table containing source keys and their hashes.
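The idea of hashing the index fields while leaving the key untouched can be sketched as follows. This is an illustration only: the source does not specify the hashing scheme, so SHA-256 over the joined values is an assumption, `hash_rows` is a hypothetical helper, and the result is a plain list of dicts rather than a PyArrow Table.

```python
import hashlib


def hash_rows(rows, key_field, index_fields):
    """Pair each row's key with a SHA-256 digest of its index field values."""
    result = []
    for row in rows:
        # Join index values with a unit separator so ("ab", "c") != ("a", "bc").
        payload = "\x1f".join(str(row[field]) for field in index_fields)
        digest = hashlib.sha256(payload.encode("utf-8")).hexdigest()
        # The key is carried through as-is; it is not part of the hash.
        result.append({"key": row[key_field], "hash": digest})
    return result


rows = [
    {"id": "1", "company_name": "Acme", "postcode": "AB1 2CD"},
    {"id": "2", "company_name": "Acme", "postcode": "AB1 2CD"},
    {"id": "3", "company_name": "Bolt", "postcode": "EF3 4GH"},
]
hashed = hash_rows(rows, key_field="id", index_fields=["company_name", "postcode"])
```

In this sketch rows 1 and 2 share a hash because their index fields match, while their keys stay distinct.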