Sources

matchbox.client.sources ¶

Interface to source data.

Classes:

Source –

Client-side wrapper for source configs.

Functions:

post_run –

Decorator to ensure that a method is only called after the source has been run.

Source ¶

Source(dag: DAG, location: Location, name: str, extract_transform: str, key_field: str, index_fields: list[str], description: str | None = None, infer_types: bool = True, validate_etl: bool = True)

Source(dag: DAG, location: Location, name: str, extract_transform: str, key_field: SourceField, index_fields: list[SourceField], description: str | None = None, infer_types: bool = False, validate_etl: bool = True)

Source(dag: DAG, location: Location, name: str, extract_transform: str, key_field: str | SourceField, index_fields: list[str] | list[SourceField], description: str | None = None, infer_types: bool = False, validate_etl: bool = True)

Client-side wrapper for source configs.

Parameters:

dag ¶
(DAG) –

DAG containing the source.
location ¶
(Location) –

The location where the source data is stored.
name ¶
(str) –

The name of the source.
description ¶
(str | None, default: None ) –

An optional description of the source.
extract_transform ¶
(str) –

The extract/transform logic to apply to the source data.
key_field ¶
(str | SourceField) –

The name of the field to use as the key, or a SourceField instance defining the key field. This is the unique identifier we’ll use to refer to matched data in the source.
index_fields ¶
(list[str] | list[SourceField]) –

The names of the fields to use as index fields, or a list of SourceField instances defining the index fields. These are the fields you plan to match on.
infer_types ¶
(bool, default: False ) –

Whether to infer data types for the fields from the source. If False, you must provide SourceField instances for key_field and index_fields.
validate_etl ¶
(bool, default: True ) –

Whether to validate the extract/transform query. If True, query validation is performed. Set it to False when loading sources from the server.

Methods:

to_resolution –

Convert to Resolution for API calls.
from_resolution –

Reconstruct from Resolution.
fetch –

Applies the extract/transform logic to the source and returns the results.
run –

Hash a dataset from its warehouse, ready to be inserted, and cache hashes.
qualify_field –

Qualify field names with the source name.
f –

Qualify one or more field names with the source name.
sync –

Send the source config and hashes to the server.
query –

Generate a query for this source.

Attributes:

location –
dag –
name –
description –
extract_transform –
hashes (Table | None) –
key_field –
index_fields –
config (SourceConfig) –

Generate SourceConfig from Source.
sources (set[SourceResolutionName]) –

Set of source names upstream of this node.
prefix (str) –

Get the prefix for the source.
qualified_key (str) –

Get the qualified key for the source.
qualified_index_fields (list[str]) –

Get the qualified index fields for the source.

location `instance-attribute` ¶

location = location

dag `instance-attribute` ¶

dag = dag

name `instance-attribute` ¶

name = name

description `instance-attribute` ¶

description = description

extract_transform `instance-attribute` ¶

extract_transform = extract_transform

hashes `instance-attribute` ¶

hashes: Table | None = None

key_field `instance-attribute` ¶

key_field = SourceField(name=key_field, type=STRING)

index_fields `instance-attribute` ¶

index_fields = tuple((remote_fields[field]) for field in index_fields)

config `property` ¶

config: SourceConfig

Generate SourceConfig from Source.

sources `property` ¶

sources: set[SourceResolutionName]

Set of source names upstream of this node.

prefix `property` ¶

prefix: str

Get the prefix for the source.

qualified_key `property` ¶

qualified_key: str

Get the qualified key for the source.

qualified_index_fields `property` ¶

qualified_index_fields: list[str]

Get the qualified index fields for the source.

to_resolution ¶

to_resolution() -> Resolution

Convert to Resolution for API calls.

from_resolution `classmethod` ¶

from_resolution(resolution: Resolution, resolution_name: str, dag: DAG, location: Location) -> Source

Reconstruct from Resolution.

fetch ¶

fetch(qualify_names: bool = False, batch_size: int | None = None, return_type: QueryReturnType = POLARS, keys: list[str] | None = None) -> Generator[QueryReturnClass, None, None]

Applies the extract/transform logic to the source and returns the results.

Parameters:

qualify_names ¶
(bool, default: False ) –

If True, qualify the names of the columns with the source name.
batch_size ¶
(int | None, default: None ) –

Indicate the size of each batch when fetching data in batches.
return_type ¶
(QueryReturnType, default: POLARS ) –

The type of data to return. Defaults to “polars”.
keys ¶
(list[str] | None, default: None ) –

List of keys to select a subset of all source entries.

Returns:

Generator[QueryReturnClass, None, None] –

The requested data in the specified format, as an iterator of tables.
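Because `fetch` returns a generator, callers iterate over batches instead of receiving a single table. The pattern can be sketched in plain Python as below. `fetch_batches` and the in-memory `rows` list are hypothetical stand-ins, not part of matchbox, and yielding everything as one batch when `batch_size` is None is an assumption about the behaviour.

```python
from collections.abc import Generator, Iterable
from itertools import islice
from typing import Optional


def fetch_batches(
    rows: Iterable[dict], batch_size: Optional[int] = None
) -> Generator[list, None, None]:
    """Yield rows in batches; hypothetical stand-in for a batched fetch."""
    iterator = iter(rows)
    if batch_size is None:
        # Assumption: with no batch size, everything arrives as one batch.
        yield list(iterator)
        return
    # Keep slicing until the iterator is exhausted (empty list is falsy).
    while batch := list(islice(iterator, batch_size)):
        yield batch


rows = [{"id": str(i)} for i in range(5)]
batch_lengths = [len(batch) for batch in fetch_batches(rows, batch_size=2)]
```

With real sources, each yielded item would be a table in the requested `return_type` rather than a list of dicts.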

run ¶

run(batch_size: int | None = None) -> Table

Hash a dataset from its warehouse, ready to be inserted, and cache hashes.

Hashes the index fields defined in the source based on the extract/transform logic.

Does not hash the key field.

Parameters:

batch_size ¶
(int | None, default: None ) –

If set, process data in batches internally. Indicates the size of each batch.

Returns:

Table –

A PyArrow Table containing source keys and their hashes.
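To illustrate what hashing the index fields but not the key field means, here is a minimal sketch using only the standard library. The helper `hash_rows`, the choice of SHA-256, the separator, and the dict output are all assumptions for illustration; matchbox's actual hashing scheme and table schema may differ.

```python
import hashlib
from collections import defaultdict


def hash_rows(rows, key_field, index_fields):
    """Group source keys by a hash of their index field values.

    Hypothetical helper: the key field is carried through untouched,
    and only the index fields feed the hash.
    """
    keys_by_hash = defaultdict(list)
    for row in rows:
        # Join values with a separator unlikely to appear in the data.
        payload = "\x1f".join(str(row[field]) for field in index_fields)
        digest = hashlib.sha256(payload.encode("utf-8")).hexdigest()
        keys_by_hash[digest].append(row[key_field])
    return dict(keys_by_hash)


rows = [
    {"id": "1", "name": "Acme Ltd", "postcode": "AB1 2CD"},
    {"id": "2", "name": "Acme Ltd", "postcode": "AB1 2CD"},
    {"id": "3", "name": "Widget Co", "postcode": "EF3 4GH"},
]
hashes = hash_rows(rows, key_field="id", index_fields=["name", "postcode"])
```

In this sketch, rows 1 and 2 share identical index values and so end up under the same hash, while their distinct keys are preserved.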

qualify_field ¶

qualify_field(field: str) -> str

Qualify field names with the source name.

Parameters:

field ¶
(str) –

The field name to qualify.

Returns:

str –

A single qualified field.

f ¶

f(fields: str | Iterable[str]) -> str | list[str]

Qualify one or more field names with the source name.

Parameters:

fields ¶
(str | Iterable[str]) –

The field name to qualify, or a list of field names.

Returns:

str | list[str] –

A single qualified field, or a list of qualified field names.
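The relationship between `prefix`, `qualify_field` and `f` can be sketched as follows. `FieldQualifier` is a hypothetical stand-in class, and the prefix format (source name plus an underscore) is an assumption rather than something this page confirms.

```python
class FieldQualifier:
    """Hypothetical stand-in showing name qualification; not matchbox's Source."""

    def __init__(self, name):
        self.name = name

    @property
    def prefix(self):
        # Assumption: the prefix is the source name plus an underscore.
        return f"{self.name}_"

    def qualify_field(self, field):
        return self.prefix + field

    def f(self, fields):
        # A single string returns a string; any other iterable returns a list.
        if isinstance(fields, str):
            return self.qualify_field(fields)
        return [self.qualify_field(field) for field in fields]


source = FieldQualifier("companies")
single = source.f("name")
several = source.f(["name", "postcode"])
```

Checking for `str` first matters because a string is itself iterable; without that check, `f("name")` would qualify each character separately.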

sync ¶

sync() -> None

Send the source config and hashes to the server.

Not resistant to race conditions: only one client should call sync at a time.

query ¶

query(**kwargs: Any) -> Query

Generate a query for this source.

post_run ¶

post_run(method: Callable[..., T]) -> Callable[..., T]

Decorator to ensure that a method is only called after the source has been run.

Raises:

RuntimeError –

If the source has not been run yet.
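A guard decorator of this kind could look like the sketch below. It assumes that a completed run is signalled by `hashes` no longer being None, which fits the `hashes: Table | None = None` attribute above but is not confirmed by this page. `DemoSource` is a hypothetical class used only to exercise the decorator.

```python
import functools
from typing import Callable, TypeVar

T = TypeVar("T")


def post_run(method: Callable[..., T]) -> Callable[..., T]:
    """Raise RuntimeError unless the instance has been run first."""

    @functools.wraps(method)
    def wrapper(self, *args, **kwargs):
        # Assumption: a completed run is signalled by `hashes` being set.
        if self.hashes is None:
            raise RuntimeError("The source must be run before calling this method.")
        return method(self, *args, **kwargs)

    return wrapper


class DemoSource:
    """Hypothetical class used only to exercise the decorator."""

    def __init__(self):
        self.hashes = None

    def run(self):
        # Stand-in for computing and caching hashes.
        self.hashes = ["placeholder"]

    @post_run
    def sync(self):
        return "synced"
```

Calling `DemoSource().sync()` before `run()` raises `RuntimeError`; after `run()`, the wrapped method executes normally.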
