Queries
matchbox.client.queries
Definition of model inputs.
Classes:
- Query – Queryable input to a model.
Query
Query(*sources: Source, dag: DAG, resolver: Resolver | None = None, combine_type: QueryCombineType = CONCAT, cleaning: dict[str, str] | None = None)
Queryable input to a model.
Parameters:
- sources (Source, default: ()) – List of sources to query from.
- dag (DAG) – DAG containing sources and models.
- resolver (optional, default: None) – Resolver to use to resolve sources. It can be omitted when querying from a single source.
- combine_type (optional, default: CONCAT) – How to combine the data from different sources.
  - If concat, concatenate all sources queried without any merging. Multiple rows per ID, with null values where data isn't available.
  - If explode, outer join on Matchbox ID. Multiple rows per ID, with one for every unique combination of data requested across all sources.
  - If set_agg, join on Matchbox ID, group on Matchbox ID, then aggregate to nested lists of unique values. One row per ID, but all requested data is in nested arrays.
- cleaning (optional, default: None) – A dictionary mapping an output column name to a SQL expression that will populate a new column.
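The three combine_type options can be easier to compare on a toy example. The following is a minimal pure-Python sketch of the row shapes each option is described as producing. It does not call matchbox: the column names, the `id` key standing in for the Matchbox ID, and the helper functions `concat`, `explode` and `set_agg` are all hypothetical and exist only for illustration.

```python
from itertools import product

# Hypothetical rows from two sources, already tagged with a Matchbox ID ("id").
source_a = [
    {"id": 1, "a_name": "Acme"},
    {"id": 1, "a_name": "ACME Ltd"},
    {"id": 2, "a_name": "Bolt"},
]
source_b = [{"id": 1, "b_city": "Leeds"}]

COLUMNS = ["a_name", "b_city"]


def concat(*sources):
    """Stack rows from every source; columns a source lacks become None."""
    return [
        {"id": row["id"], **{c: row.get(c) for c in COLUMNS}}
        for src in sources
        for row in src
    ]


def explode(*sources):
    """Outer join on id: one row per unique combination of values."""
    ids = sorted({row["id"] for src in sources for row in src})
    out, seen = [], set()
    for id_ in ids:
        per_source = []
        for src in sources:
            parts = [
                {k: v for k, v in row.items() if k != "id"}
                for row in src
                if row["id"] == id_
            ]
            # A source with no rows for this id contributes only nulls.
            per_source.append(parts or [{}])
        for combo in product(*per_source):
            merged = {c: None for c in COLUMNS}
            for part in combo:
                merged.update(part)
            key = (id_, *(merged[c] for c in COLUMNS))
            if key not in seen:
                seen.add(key)
                out.append({"id": id_, **merged})
    return out


def set_agg(*sources):
    """Group on id and collect the unique non-null values of each column."""
    grouped = {}
    for src in sources:
        for row in src:
            slot = grouped.setdefault(row["id"], {c: [] for c in COLUMNS})
            for c in COLUMNS:
                value = row.get(c)
                if value is not None and value not in slot[c]:
                    slot[c].append(value)
    return [{"id": id_, **values} for id_, values in sorted(grouped.items())]


concat_rows = concat(source_a, source_b)
explode_rows = explode(source_a, source_b)
set_agg_rows = set_agg(source_a, source_b)
```

With this toy data, `concat_rows` has one row per input row, `explode_rows` has one row per combination within each ID, and `set_agg_rows` has exactly one row per ID with list-valued columns.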
Methods:
- from_config – Create query from config.
- data_raw – Fetches raw query data by joining source data and matchbox matches.
- data – Returns final data from defined query.
- deduper – Create deduper for data in this query.
- linker – Create linker for data in this query and another query.
Attributes:
- raw_data (DataFrame | None)
- leaf_id (DataFrame | None)
- dag
- sources
- resolver
- combine_type
- cleaning
- config (QueryConfig) – The query configuration for the current DAG.
from_config
classmethod
from_config(config: QueryConfig, dag: DAG) -> Self
Create query from config.
The DAG must already contain the relevant sources and model.
Parameters:
- config (QueryConfig) – The QueryConfig to reconstruct from.
- dag (DAG) – The DAG containing the sources and model.
Returns:
- Self – A reconstructed Query instance.
data_raw
data_raw(return_type: Literal[POLARS] = ..., cache_leaf_ids: bool = False) -> DataFrame
data_raw(return_type: Literal[PANDAS] = ..., cache_leaf_ids: bool = False) -> DataFrame
data_raw(return_type: Literal[ARROW] = ..., cache_leaf_ids: bool = False) -> Table
data_raw(return_type: QueryReturnType = POLARS, cache_leaf_ids: bool = False) -> QueryReturnClass
Fetches raw query data by joining source data and matchbox matches.
Parameters:
- return_type (optional, default: POLARS) – Type of dataframe returned, defaults to “polars”. Other options are “pandas” and “arrow”.
- cache_leaf_ids (optional, default: False) – Whether matchbox IDs for source clusters should be saved as a byproduct in the leaf_ids attribute.
Returns: Data in the requested return type
Raises:
- MatchboxEmptyServerResponse – If no data was returned by the server.
data
data(raw_data: DataFrame | None = None, return_type: QueryReturnType = POLARS, cache_leaf_ids: bool = False) -> QueryReturnClass
Returns final data from defined query.
Parameters:
- raw_data (DataFrame | None, default: None) – If passed, will only apply cleaning instead of fetching raw data.
- return_type (optional, default: POLARS) – Type of dataframe returned, defaults to “polars”. Other options are “pandas” and “arrow”.
- cache_leaf_ids (optional, default: False) – Whether matchbox IDs for source clusters should be saved as a byproduct in the leaf_ids attribute. If pre-fetched raw data is passed, this argument is ignored.
Returns: Data in the requested return type
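To illustrate the idea of a cleaning dictionary, where each key is an output column name and each value is a SQL expression, here is a minimal sketch that applies such a mapping to pre-fetched rows. It is not how matchbox applies cleaning: it uses the standard-library sqlite3 module as a stand-in SQL engine, and the table name `raw`, the columns `name` and `postcode`, and the helper `apply_cleaning` are assumptions made for the example. The SQL dialect matchbox expects, and whether uncleaned columns are kept in the output, are not stated here.

```python
import sqlite3

# Hypothetical cleaning mapping: output column name -> SQL expression.
cleaning = {
    "name_clean": "lower(trim(name))",
    "postcode_area": "substr(postcode, 1, 2)",
}


def apply_cleaning(rows, cleaning):
    """Evaluate each cleaning expression against the rows (illustrative only)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE raw (id INTEGER, name TEXT, postcode TEXT)")
    conn.executemany("INSERT INTO raw VALUES (?, ?, ?)", rows)
    # Each dictionary entry becomes "<expression> AS <output column>".
    select_list = ", ".join(
        f'{expr} AS "{col}"' for col, expr in cleaning.items()
    )
    cursor = conn.execute(f"SELECT id, {select_list} FROM raw ORDER BY id")
    names = [d[0] for d in cursor.description]
    result = [dict(zip(names, row)) for row in cursor.fetchall()]
    conn.close()
    return result


# Stand-in for raw data that was fetched earlier.
raw = [(1, "  Acme Ltd ", "LS1 4AP"), (2, "BOLT", "M1 1AA")]
cleaned = apply_cleaning(raw, cleaning)
```

The design point this mirrors is that fetching and cleaning are separable: the same raw rows can be re-cleaned with a different mapping without fetching them again.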