Queries

matchbox.client.queries

Definition of model inputs.

Classes:

  • Query

    Queriable input to a model.

Functions:

  • clean

    Clean data using DuckDB with the provided cleaning SQL.

Query

Query(*sources: Source, dag: DAG, model: Model | None = None, combine_type: QueryCombineType = CONCAT, threshold: float | None = None, cleaning: dict[str, str] | None = None)

Queriable input to a model.

Parameters:

  • sources

    (Source, default: () ) –

    Sources to query from, passed as positional arguments

  • dag

    (DAG) –

    DAG containing sources and models.

  • model

    (Model | None, default: None ) –

    Model to use to resolve sources. It can only be missing if querying from a single source.

  • combine_type

    (QueryCombineType, default: CONCAT ) –

    How to combine the data from different sources. Default is concat.

    • If concat, concatenate all sources queried without any merging. Multiple rows per ID, with null values where data isn’t available
    • If explode, outer join on Matchbox ID. Multiple rows per ID, with one for every unique combination of data requested across all sources
    • If set_agg, join on Matchbox ID, group on Matchbox ID, then aggregate to nested lists of unique values. One row per ID, but all requested data is in nested arrays
  • threshold

    (float | None, default: None ) –

    The threshold to use for creating clusters. If None, uses the resolutions’ default threshold. If a value is given, uses that threshold for the specified resolution, and the resolution’s cached thresholds for its ancestors.

  • cleaning

    (dict[str, str] | None, default: None ) –

    A dictionary mapping an output column name to a SQL expression that will populate a new column.
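
The three combine_type behaviours described above can be illustrated with a minimal pure-Python sketch. This is not Matchbox's implementation: the two toy sources and the field names a_name and b_postcode are hypothetical, and how Matchbox represents missing values under set_agg is an assumption here (empty lists).

```python
from itertools import product

# Two hypothetical sources whose rows already carry a shared Matchbox ID.
source_a = [
    {"id": 1, "a_name": "Acme"},
    {"id": 1, "a_name": "ACME Ltd"},
    {"id": 2, "a_name": "Bolt"},
]
source_b = [
    {"id": 1, "b_postcode": "AB1"},
    {"id": 3, "b_postcode": "CD2"},
]


def values_by_id(rows, field):
    """Group the values of one field by Matchbox ID."""
    grouped = {}
    for row in rows:
        grouped.setdefault(row["id"], []).append(row[field])
    return grouped


a_by_id = values_by_id(source_a, "a_name")
b_by_id = values_by_id(source_b, "b_postcode")
all_ids = sorted(set(a_by_id) | set(b_by_id))

# concat: stack the sources without merging; fields from the other source are None.
concat = [
    {"id": r["id"], "a_name": r["a_name"], "b_postcode": None} for r in source_a
] + [
    {"id": r["id"], "a_name": None, "b_postcode": r["b_postcode"]} for r in source_b
]

# explode: outer join on ID, one row per combination of values across sources.
explode = [
    {"id": i, "a_name": a, "b_postcode": b}
    for i in all_ids
    for a, b in product(a_by_id.get(i, [None]), b_by_id.get(i, [None]))
]

# set_agg: one row per ID, with the unique values of each field in a list.
set_agg = [
    {
        "id": i,
        "a_name": sorted(set(a_by_id.get(i, []))),
        "b_postcode": sorted(set(b_by_id.get(i, []))),
    }
    for i in all_ids
]
```

With this toy data, concat keeps one row per source row, explode multiplies rows wherever an ID has several values in a source, and set_agg collapses everything to one row per ID.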

Methods:

  • from_config

    Create query from config.

  • run

    Runs queries against the selected backend.

  • clean

    Change cleaning dictionary and re-apply cleaning, if raw data was cached.

  • deduper

    Create deduper for data in this query.

  • linker

    Create linker for data in this query and another query.

Attributes:

last_run instance-attribute

last_run: datetime | None = None

raw_data instance-attribute

raw_data: DataFrame | None = None

dag instance-attribute

dag = dag

sources instance-attribute

sources = sources

model instance-attribute

model = model

combine_type instance-attribute

combine_type = combine_type

threshold instance-attribute

threshold = threshold

cleaning instance-attribute

cleaning = cleaning

config property

config: QueryConfig

The query configuration for the current DAG.

from_config classmethod

from_config(config: QueryConfig, dag: DAG) -> Self

Create query from config.

The relevant sources and model must already have been added to the DAG.

Parameters:

  • config
    (QueryConfig) –

    The QueryConfig to reconstruct from.

  • dag
    (DAG) –

    The DAG containing the sources and model.

Returns:

  • Self

    A reconstructed Query instance.

run

Runs queries against the selected backend.

Parameters:

  • return_type
    (optional, default: POLARS ) –

    Type of dataframe returned, defaults to “polars”. Other options are “pandas” and “arrow”.

  • return_leaf_id
    (optional, default: False ) –

    Whether matchbox IDs for source clusters should be saved as a byproduct in the leaf_ids attribute.

  • batch_size
    (optional, default: None ) –

    The size of each batch when fetching data from the warehouse, which helps reduce memory usage and load on the database. Default is None.

  • full_rerun
    (bool, default: False ) –

    Whether to force a re-run of the query

  • cache_raw
    (bool, default: False ) –

    Whether to store the pre-cleaned data to iterate on cleaning.

Returns: Data in the requested return type

Raises:

  • MatchboxEmptyServerResponse

    If no data was returned by the server.

clean

Change cleaning dictionary and re-apply cleaning, if raw data was cached.

Parameters:

  • cleaning
    (dict[str, str] | None) –

    A dictionary mapping field aliases to SQL expressions. The SQL expressions can reference columns in the data using their names. If None, no cleaning is applied and the original data is returned. SourceConfig.f() can be used to help reference qualified fields.

  • return_type
    (optional, default: POLARS ) –

    Type of dataframe returned, defaults to “polars”. Other options are “pandas” and “arrow”.
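
The cache_raw option of run and the clean method are documented as working together, so a cleaning loop might look like the sketch below. It assumes query is an already constructed Query; the helper name iterate_cleaning is hypothetical and not part of Matchbox.

```python
def iterate_cleaning(query, cleaning_attempts):
    """Fetch data once, then try several cleaning dictionaries on the cached raw data.

    `query` is assumed to be an existing Query; this helper is illustrative only.
    """
    # Keep the pre-cleaned data so later clean() calls do not need to re-query.
    query.run(cache_raw=True)
    # Re-apply each candidate cleaning dictionary to the cached raw data.
    return [query.clean(cleaning=cleaning) for cleaning in cleaning_attempts]
```

A call such as `iterate_cleaning(query, [{"name": "lower(name)"}, None])` would return one dataframe per attempt, where name stands in for whatever column the query actually returns.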

deduper

deduper(name: str, model_class: Deduper, model_settings: DeduperSettings, description: str | None = None) -> Model

Create deduper for data in this query.

linker

linker(other_query: Self, name: str, model_class: Linker, model_settings: LinkerSettings, description: str | None = None) -> Model

Create linker for data in this query and another query.

clean

clean(data: DataFrame, cleaning_dict: dict[str, str] | None) -> DataFrame

Clean data using DuckDB with the provided cleaning SQL.

  • The id column is passed through automatically
  • If present, leaf_id and key are passed through automatically
  • Columns not referenced by any expression in cleaning_dict are passed through unchanged; columns that are referenced are dropped from the result
  • Each key in cleaning_dict is the output column name (alias) for its SQL expression

Parameters:

  • data

    (DataFrame) –

    Raw polars dataframe to clean

  • cleaning_dict

    (dict[str, str] | None) –

    A dictionary mapping field aliases to SQL expressions. The SQL expressions can reference columns in the data using their names. If None, no cleaning is applied and the original data is returned. SourceConfig.f() can be used to help reference qualified fields.

Returns:

  • DataFrame

    Cleaned polars dataframe

Examples:

Column passthrough behavior:

import polars as pl
from matchbox.client.queries import clean

data = pl.DataFrame(
    {
        "id": [1, 2, 3],
        "name": ["John", "Jane", "Bob"],
        "age": [25, 30, 35],
        "city": ["London", "Hull", "Stratford-upon-Avon"],
    }
)
cleaning_dict = {
    "full_name": "name"  # Only references 'name' column
}
result = clean(data, cleaning_dict)
# Result columns: id, full_name, age, city
# 'name' is dropped because it was used in cleaning_dict
# 'age' and 'city' are passed through unchanged

Multiple column references:

data = pl.DataFrame(
    {
        "id": [1, 2, 3],
        "first": ["John", "Jane", "Bob"],
        "last": ["Doe", "Smith", "Johnson"],
        "salary": [50000, 60000, 55000],
    }
)
cleaning_dict = {
    "name": "first || ' ' || last",  # References both 'first' and 'last'
    "high_earner": "salary > 55000",
}
result = clean(data, cleaning_dict)
# Result columns: id, name, high_earner
# 'first', 'last', and 'salary' are dropped (used in expressions)
# No other columns to pass through

Special columns (leaf_id, key) handling:

data = pl.DataFrame(
    {
        "id": [1, 2, 3],
        "leaf_id": ["a", "b", "c"],
        "key": ["x", "y", "z"],
        "value": [10, 20, 30],
        "status": ["active", "inactive", "pending"],
    }
)
cleaning_dict = {"processed_value": "value * 2"}
result = clean(data, cleaning_dict)
# Result columns: id, leaf_id, key, processed_value, status
# 'id', 'leaf_id', 'key' always included automatically
# 'value' dropped (used in expression), 'status' passed through

No cleaning (returns original data):

data = pl.DataFrame({"id": [1, 2], "name": ["Alice", "Bob"], "score": [95, 87]})
result = clean(data, None)
# Returns exact same dataframe with all original columns
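
The passthrough rules shown in the examples above can be summarised as a small helper that predicts the output column names. This is a sketch only: the function name output_columns is hypothetical, it does not call DuckDB, and it finds referenced columns with a crude regular-expression scan, whereas a real implementation would need to parse the SQL properly.

```python
import re

# Columns that the documentation says are always passed through when present.
SPECIAL = ("id", "leaf_id", "key")


def output_columns(columns, cleaning_dict):
    """Predict the column names clean() returns, following the documented rules."""
    if cleaning_dict is None:
        # No cleaning: the original columns come back untouched.
        return list(columns)

    referenced = set()
    for expression in cleaning_dict.values():
        # Drop single-quoted string literals so their contents are not mistaken for columns.
        without_literals = re.sub(r"'[^']*'", "", expression)
        # Crude scan: treat every identifier-like token as a possible column reference.
        referenced.update(re.findall(r"[A-Za-z_][A-Za-z0-9_]*", without_literals))

    special = [c for c in SPECIAL if c in columns]
    passthrough = [
        c
        for c in columns
        if c not in SPECIAL and c not in referenced and c not in cleaning_dict
    ]
    # Special columns first, then the cleaned aliases, then untouched columns.
    return special + list(cleaning_dict) + passthrough
```

For instance, `output_columns(["id", "name", "age", "city"], {"full_name": "name"})` reproduces the column list from the first example.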