Queries

matchbox.client.queries

Definition of model inputs.

Classes:

  • Query

    Queriable input to a model.

Query

Query(*sources: Source, dag: DAG, resolver: Resolver | None = None, combine_type: QueryCombineType = CONCAT, cleaning: dict[str, str] | None = None)

Queriable input to a model.

Parameters:

  • sources

    (Source, default: () ) –

    Sources to query from, passed as positional arguments.

  • dag

    (DAG) –

    DAG containing sources and models.

  • resolver

    (Resolver | None, default: None ) –

    Resolver used to resolve the sources. It may be omitted when querying from a single source.

  • combine_type

    (QueryCombineType, default: CONCAT ) –

    How to combine the data from different sources.

    • If concat, concatenate all queried sources without any merging. Multiple rows per ID, with null values where data isn’t available.
    • If explode, outer join on Matchbox ID. Multiple rows per ID, with one for every unique combination of the requested data across all sources.
    • If set_agg, join on Matchbox ID, group by Matchbox ID, then aggregate to nested lists of unique values. One row per ID, with all requested data in nested arrays.
  • cleaning

    (dict[str, str] | None, default: None ) –

    A dictionary mapping each output column name to the SQL expression that populates that new column.
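The three combine_type behaviours described above can be pictured with a small pure-Python sketch. This is not the library's implementation: the rows, column names, and helper functions are hypothetical, and the handling of IDs missing from a source in set_agg (an empty list) is an assumption.

```python
from itertools import product

# Hypothetical rows already keyed by Matchbox ID; column names are made up.
source_a = [
    {"id": 1, "a_name": "Acme"},
    {"id": 1, "a_name": "Acme Ltd"},
    {"id": 2, "a_name": "Bolt"},
]
source_b = [
    {"id": 1, "b_postcode": "AB1"},
    {"id": 3, "b_postcode": "CD2"},
]
COLUMNS = ["a_name", "b_postcode"]


def values_by_id(source, column):
    """Group one column's values by ID, preserving first-seen order."""
    grouped = {}
    for row in source:
        grouped.setdefault(row["id"], []).append(row[column])
    return grouped


def concat(*sources):
    """Stack rows from every source, filling absent columns with None."""
    return [
        {"id": row["id"], **{col: row.get(col) for col in COLUMNS}}
        for source in sources
        for row in source
    ]


def explode(a, b):
    """Outer join on ID: one row per unique combination of values."""
    a_vals = values_by_id(a, "a_name")
    b_vals = values_by_id(b, "b_postcode")
    rows = []
    for id_ in sorted(a_vals.keys() | b_vals.keys()):
        # dict.fromkeys drops duplicate values while keeping their order.
        names = dict.fromkeys(a_vals.get(id_, [None]))
        postcodes = dict.fromkeys(b_vals.get(id_, [None]))
        for name, postcode in product(names, postcodes):
            rows.append({"id": id_, "a_name": name, "b_postcode": postcode})
    return rows


def set_agg(a, b):
    """One row per ID, each column holding a list of unique values."""
    a_vals = values_by_id(a, "a_name")
    b_vals = values_by_id(b, "b_postcode")
    return [
        {
            "id": id_,
            "a_name": list(dict.fromkeys(a_vals.get(id_, []))),
            "b_postcode": list(dict.fromkeys(b_vals.get(id_, []))),
        }
        for id_ in sorted(a_vals.keys() | b_vals.keys())
    ]
```

With this toy data, concat yields five rows (one per input row), explode yields four (two for ID 1, one each for IDs 2 and 3), and set_agg yields three (one per ID).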

Methods:

  • from_config

    Create query from config.

  • data_raw

    Fetches raw query data by joining source data and matchbox matches.

  • data

    Returns final data from defined query.

  • deduper

    Create deduper for data in this query.

  • linker

    Create linker for data in this query and another query.

Attributes:

raw_data instance-attribute

raw_data: DataFrame | None = None

leaf_id instance-attribute

leaf_id: DataFrame | None = None

dag instance-attribute

dag = dag

sources instance-attribute

sources = sources

resolver instance-attribute

resolver = resolver

combine_type instance-attribute

combine_type = combine_type

cleaning instance-attribute

cleaning = cleaning

config property

config: QueryConfig

The query configuration for the current DAG.

from_config classmethod

from_config(config: QueryConfig, dag: DAG) -> Self

Create query from config.

The relevant sources and model must already have been added to the DAG.

Parameters:

  • config
    (QueryConfig) –

    The QueryConfig to reconstruct from.

  • dag
    (DAG) –

    The DAG containing the sources and model.

Returns:

  • Self

    A reconstructed Query instance.

data_raw

data_raw(return_type: Literal[POLARS] = ..., cache_leaf_ids: bool = False) -> DataFrame
data_raw(return_type: Literal[PANDAS] = ..., cache_leaf_ids: bool = False) -> DataFrame
data_raw(return_type: Literal[ARROW] = ..., cache_leaf_ids: bool = False) -> Table

Fetches raw query data by joining source data and matchbox matches.

Parameters:

  • return_type
    (optional, default: POLARS ) –

    Type of dataframe to return. Defaults to “polars”; the other options are “pandas” and “arrow”.

  • cache_leaf_ids
    (bool, default: False ) –

    Whether the matchbox IDs for source clusters should be saved as a byproduct in the leaf_id attribute.

Returns: Data in the requested return type

Raises:

  • MatchboxEmptyServerResponse

    If no data was returned by the server.
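The three data_raw signatures listed above read like typing overloads, where the value of return_type decides the declared return class. The sketch below shows that pattern in isolation. It assumes nothing about the library's internals: ReturnType, the three frame classes, and the hard-coded rows are hypothetical stand-ins.

```python
from enum import Enum
from typing import Literal, overload


class ReturnType(str, Enum):
    """Hypothetical stand-in for the library's return-type enum."""

    POLARS = "polars"
    PANDAS = "pandas"
    ARROW = "arrow"


# Stand-in result classes so the sketch needs no third-party packages.
class PolarsFrame(list): ...
class PandasFrame(list): ...
class ArrowTable(list): ...


@overload
def data_raw(return_type: Literal[ReturnType.POLARS] = ...) -> PolarsFrame: ...
@overload
def data_raw(return_type: Literal[ReturnType.PANDAS]) -> PandasFrame: ...
@overload
def data_raw(return_type: Literal[ReturnType.ARROW]) -> ArrowTable: ...


def data_raw(return_type=ReturnType.POLARS):
    """Return the same rows wrapped in the class matching return_type."""
    rows = [{"id": 1}]  # placeholder for data fetched from a server
    if return_type is ReturnType.POLARS:
        return PolarsFrame(rows)
    if return_type is ReturnType.PANDAS:
        return PandasFrame(rows)
    return ArrowTable(rows)
```

A type checker can then infer, for example, that `data_raw(ReturnType.ARROW)` returns the table class rather than a dataframe, while the runtime behaviour lives in the single undecorated definition.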

data

data(raw_data: DataFrame | None = None, return_type: QueryReturnType = POLARS, cache_leaf_ids: bool = False) -> QueryReturnClass

Returns final data from defined query.

Parameters:

  • raw_data
    (DataFrame | None, default: None ) –

    If passed, will only apply cleaning instead of fetching raw data.

  • return_type
    (QueryReturnType, default: POLARS ) –

    Type of dataframe to return. Defaults to “polars”; the other options are “pandas” and “arrow”.

  • cache_leaf_ids
    (bool, default: False ) –

    Whether the matchbox IDs for source clusters should be saved as a byproduct in the leaf_id attribute. This argument is ignored if pre-fetched raw data is passed.

Returns: Data in the requested return type
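The relationship between raw_data and cleaning can be sketched as a two-stage flow: fetch raw rows unless they are supplied, then evaluate each cleaning expression as a new column. In this minimal illustration the standard-library sqlite3 module stands in for whichever SQL engine the library uses. The table layout, the fetch_raw helper, the CLEANING dictionary, and the choice to keep only id alongside the cleaned columns are all assumptions.

```python
import sqlite3

# Hypothetical cleaning dictionary: output column name -> SQL expression.
CLEANING = {"name_clean": "upper(trim(name))"}


def fetch_raw():
    """Placeholder for the fetch step; returns (id, name) tuples."""
    return [(1, "  Acme Ltd "), (2, "bolt")]


def apply_cleaning(rows, cleaning):
    """Evaluate each SQL expression against the raw rows as a new column."""
    con = sqlite3.connect(":memory:")
    con.execute("CREATE TABLE raw (id INTEGER, name TEXT)")
    con.executemany("INSERT INTO raw VALUES (?, ?)", rows)
    select_list = ", ".join(
        f'{expr} AS "{column}"' for column, expr in cleaning.items()
    )
    cursor = con.execute(f"SELECT id, {select_list} FROM raw ORDER BY id")
    columns = [description[0] for description in cursor.description]
    result = [dict(zip(columns, row)) for row in cursor]
    con.close()
    return result


def data(raw_data=None):
    """Fetch only when no pre-fetched rows are given, then clean."""
    rows = fetch_raw() if raw_data is None else raw_data
    return apply_cleaning(rows, CLEANING)
```

Calling `data()` fetches and cleans in one step, while `data(raw_data=rows)` reuses rows fetched earlier, which is handy when trying several cleaning dictionaries against the same download.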

deduper

deduper(name: str, model_class: Deduper, model_settings: DeduperSettings, description: str | None = None) -> Model

Create deduper for data in this query.

linker

linker(other_query: Self, name: str, model_class: Linker, model_settings: LinkerSettings, description: str | None = None) -> Model

Create linker for data in this query and another query.