Sources

matchbox.client.sources

Interface to source data.

Classes:

  • Source

    Client-side wrapper for source configs.

Source

Client-side wrapper for source configs.

Parameters:

  • dag

    (DAG) –

    DAG containing the source.

  • location

    (Location) –

    The location where the source data is stored.

  • name

    (str) –

    The name of the source.

  • description

    (str | None, default: None ) –

    An optional description of the source.

  • extract_transform

    (str) –

    The extract/transform logic to apply to the source data.

  • key_field

    (str | SourceField) –

    The name of the field to use as the key, or a SourceField instance defining the key field. This is the unique identifier we’ll use to refer to matched data in the source.

  • index_fields

    (list[str] | list[SourceField]) –

    The names of the fields to use as index fields, or a list of SourceField instances defining the index fields. These are the fields you plan to match on.

  • infer_types

    (bool, default: False ) –

    Whether to infer data types for the fields from the source. If False, you must provide SourceField instances for key_field and index_fields.

  • validate_etl

    (bool, default: True ) –

    Whether to validate the extract/transform query. If True, the query is validated. Set to False when loading sources from the server.
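
The interaction between `infer_types` and the `str | SourceField` unions can be pictured with a small stand-in. This is only a sketch and does not import matchbox: `FieldSpec`, `normalise_index_fields` and `remote_types` are hypothetical names, and the exact error raised by the real constructor is an assumption.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FieldSpec:
    """Hypothetical stand-in for SourceField: a field name plus a type label."""

    name: str
    type: str = "STRING"


def normalise_index_fields(index_fields, infer_types, remote_types=None):
    """Turn a mix of names and FieldSpec objects into a tuple of FieldSpec.

    remote_types is an assumed mapping of field name to type label,
    standing in for whatever the source location reports.
    """
    remote_types = remote_types or {}
    specs = []
    for field in index_fields:
        if isinstance(field, FieldSpec):
            # Already fully specified, so keep it as given.
            specs.append(field)
        elif infer_types:
            # Plain name: look the type up from the (assumed) remote schema.
            specs.append(FieldSpec(name=field, type=remote_types[field]))
        else:
            # Plain name without inference: the type is unknown.
            raise ValueError(
                f"Field {field!r} needs an explicit type when infer_types is False"
            )
    return tuple(specs)
```

With `infer_types=True`, plain names are resolved against the remote schema; with `infer_types=False`, only fully specified fields are accepted.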

Methods:

  • to_resolution

    Convert to Resolution for API calls.

  • from_resolution

    Reconstruct from Resolution.

  • fetch

    Applies the extract/transform logic to the source and returns the results.

  • run

    Hash a dataset from its warehouse, ready to be inserted, and cache hashes.

  • qualify_field

    Qualify a single field name with the source name.

  • f

    Qualify one or more field names with the source name.

  • sync

    Send the source config and hashes to the server.

  • query

    Generate a query for this source.

Attributes:

last_run instance-attribute

last_run: datetime | None = None

location instance-attribute

location = location

dag instance-attribute

dag = dag

name instance-attribute

name = name

description instance-attribute

description = description

extract_transform instance-attribute

extract_transform = extract_transform

key_field instance-attribute

key_field = SourceField(name=key_field, type=STRING)

index_fields instance-attribute

index_fields = tuple((remote_fields[field]) for field in index_fields)

config property

config: SourceConfig

Generate SourceConfig from Source.

prefix property

prefix: str

Get the prefix for the source.

qualified_key property

qualified_key: str

Get the qualified key for the source.

qualified_index_fields property

qualified_index_fields: list[str]

Get the qualified index fields for the source.

to_resolution

to_resolution() -> Resolution

Convert to Resolution for API calls.

from_resolution classmethod

from_resolution(resolution: Resolution, resolution_name: str, dag: DAG, location: Location) -> Source

Reconstruct from Resolution.

fetch

fetch(qualify_names: bool = False, batch_size: int | None = None, return_type: QueryReturnType = POLARS, keys: list[str] | None = None) -> Generator[QueryReturnClass, None, None]

Applies the extract/transform logic to the source and returns the results.

Parameters:

  • qualify_names
    (bool, default: False ) –

    If True, qualify the names of the columns with the source name.

  • batch_size
    (int | None, default: None ) –

    The size of each batch when fetching data in batches.

  • return_type
    (QueryReturnType, default: POLARS ) –

    The type of data to return. Defaults to “polars”.

  • keys
    (list[str] | None, default: None ) –

    List of keys to select a subset of all source entries.

Returns:

  • Generator[QueryReturnClass, None, None]

    The requested data in the specified format, as an iterator of tables.

run

run(batch_size: int | None = None, full_rerun: bool = False) -> Table

Hash a dataset from its warehouse, ready to be inserted, and cache hashes.

Hashes the index fields defined in the source based on the extract/transform logic.

Does not hash the key field.

Parameters:

  • batch_size
    (int | None, default: None ) –

    If set, process data in batches internally. Indicates the size of each batch.

  • full_rerun
    (bool, default: False ) –

    Whether to force a re-run even if the hashes are cached.

Returns:

  • Table

    A PyArrow Table containing source keys and their hashes.
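
The idea of hashing the index fields while leaving the key field unhashed can be sketched as follows. The page does not describe the hashing scheme, so SHA-256 over the joined field values is purely an assumption, `hash_rows` is a hypothetical helper, and a dictionary stands in for the PyArrow Table.

```python
import hashlib


def hash_rows(rows, key_field, index_fields):
    """Map each hash of the index-field values to the keys that share it.

    The key field itself is never hashed; it is only collected.
    """
    keys_by_hash: dict[str, list[str]] = {}
    for row in rows:
        # Join index values with a separator unlikely to appear in the data.
        payload = "\x1f".join(str(row[field]) for field in index_fields)
        digest = hashlib.sha256(payload.encode("utf-8")).hexdigest()
        keys_by_hash.setdefault(digest, []).append(row[key_field])
    return keys_by_hash
```

Rows with identical index-field values end up under the same hash, which is why one hash can be associated with several source keys.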

qualify_field

qualify_field(field: str) -> str

Qualify a single field name with the source name.

Parameters:

  • field
    (str) –

    The field name to qualify.

Returns:

  • str

    A single qualified field.

f

f(fields: str | Iterable[str]) -> str | list[str]

Qualify one or more field names with the source name.

Parameters:

  • fields
    (str | Iterable[str]) –

    The field name to qualify, or an iterable of field names.

Returns:

  • str | list[str]

    A single qualified field, or a list of qualified field names.

sync

sync() -> None

Send the source config and hashes to the server.

query

query(**kwargs) -> Query

Generate a query for this source.