Queries
matchbox.client.queries
Definition of model inputs.
Classes:
- Query – Queriable input to a model.
Query
Query(*sources: Source, dag: DAG, model: Model | None = None, combine_type: QueryCombineType = CONCAT, threshold: float | None = None, cleaning: dict[str, str] | None = None)
Queriable input to a model.
Parameters:
- sources (Source, default: ()) – List of sources to query from.
- dag (DAG) – DAG containing sources and models.
- model (optional, default: None) – Model to use to resolve sources. It can only be omitted when querying from a single source.
- combine_type (optional, default: CONCAT) – How to combine the data from different sources. Default is concat.
  - If concat, concatenate all sources queried without any merging. Multiple rows per ID, with null values where data isn’t available.
  - If explode, outer join on Matchbox ID. Multiple rows per ID, with one for every unique combination of data requested across all sources.
  - If set_agg, join on Matchbox ID, group on Matchbox ID, then aggregate to nested lists of unique values. One row per ID, but all requested data is in nested arrays.
- threshold (optional, default: None) – The threshold to use for creating clusters. If None, uses the resolutions’ default threshold. If a value is given, uses that threshold for the specified resolution, and the resolution’s cached thresholds for its ancestors.
- cleaning (optional, default: None) – A dictionary mapping an output column name to a SQL expression that will populate a new column.
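The three combine_type options can be easier to compare on toy data. The following is a minimal pure-Python sketch of the row shapes described above, not the library's implementation: the rows, the column names (a_name, b_postcode) and the helper functions are hypothetical, and details such as how nulls appear inside aggregated lists are assumptions.

```python
from itertools import product

# Toy rows keyed by a Matchbox ID; sources and column names are hypothetical.
source_a = [
    {"id": 1, "a_name": "Acme"},
    {"id": 1, "a_name": "ACME Ltd"},
    {"id": 2, "a_name": "Beta"},
]
source_b = [
    {"id": 1, "b_postcode": "AB1"},
    {"id": 3, "b_postcode": "ZZ9"},
]
COLUMNS = ["a_name", "b_postcode"]


def concat(*sources):
    """Stack all rows, filling columns a source lacks with None."""
    return [
        {"id": row["id"], **{c: row.get(c) for c in COLUMNS}}
        for rows in sources
        for row in rows
    ]


def set_agg(*sources):
    """One row per ID; each column is a sorted list of unique non-null values."""
    grouped = {}
    for row in concat(*sources):
        cols = grouped.setdefault(row["id"], {c: set() for c in COLUMNS})
        for c in COLUMNS:
            if row[c] is not None:
                cols[c].add(row[c])
    return [
        {"id": i, **{c: sorted(v) for c, v in cols.items()}}
        for i, cols in sorted(grouped.items())
    ]


def explode(*sources):
    """One row per unique combination of values for each ID (outer join)."""
    out = []
    for agg in set_agg(*sources):
        # A column with no values still contributes a single null to the product.
        choices = [agg[c] or [None] for c in COLUMNS]
        for combo in product(*choices):
            out.append({"id": agg["id"], **dict(zip(COLUMNS, combo))})
    return out
```

On this data, concat yields five rows (one per input row), explode yields four (two for ID 1, one each for IDs 2 and 3), and set_agg yields three (one per ID).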
Methods:
- from_config – Create query from config.
- data_raw – Fetches raw query data by joining source data and matchbox matches.
- data – Returns final data from defined query.
- deduper – Create deduper for data in this query.
- linker – Create linker for data in this query and another query.
Attributes:
- raw_data (DataFrame | None)
- dag
- sources
- model
- combine_type
- threshold
- cleaning
- config (QueryConfig) – The query configuration for the current DAG.
from_config
classmethod
from_config(config: QueryConfig, dag: DAG) -> Self
Create query from config.
The DAG must have had relevant sources and model added already.
Parameters:
- config (QueryConfig) – The QueryConfig to reconstruct from.
- dag (DAG) – The DAG containing the sources and model.
Returns:
- Self – A reconstructed Query instance.
data_raw
data_raw(return_type: QueryReturnType = POLARS, return_leaf_id: bool = False) -> QueryReturnClass
Fetches raw query data by joining source data and matchbox matches.
Parameters:
- return_type (optional, default: POLARS) – Type of dataframe returned, defaults to “polars”. Other options are “pandas” and “arrow”.
- return_leaf_id (optional, default: False) – Whether matchbox IDs for source clusters should be saved as a byproduct in the leaf_ids attribute.
Returns: Data in the requested return type
Raises:
- MatchboxEmptyServerResponse – If no data was returned by the server.
data
data(raw_data: DataFrame | None = None, return_type: QueryReturnType = POLARS, return_leaf_id: bool = False) -> QueryReturnClass
Returns final data from defined query.
Parameters:
- raw_data (DataFrame | None, default: None) – If passed, will only apply cleaning instead of fetching raw data.
- return_type (optional, default: POLARS) – Type of dataframe returned, defaults to “polars”. Other options are “pandas” and “arrow”.
- return_leaf_id (optional, default: False) – Whether matchbox IDs for source clusters should be saved as a byproduct in the leaf_ids attribute. If pre-fetched raw data is passed, this argument is ignored.
Returns: Data in the requested return type
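To illustrate what applying cleaning to pre-fetched raw data might look like, here is a rough sketch of a cleaning dictionary (output column name mapped to a SQL expression) evaluated over rows that are already in memory. It uses Python's built-in sqlite3 purely as a stand-in: the SQL engine and dialect the library uses, the raw table layout, the column names and the apply_cleaning helper are all assumptions made for this example. Which columns the library keeps alongside the cleaned ones is not stated here, so the sketch selects only what the dictionary lists.

```python
import sqlite3

# Hypothetical pre-fetched raw rows: (id, name). Column names are invented.
raw_rows = [(1, " Acme Ltd "), (2, "beta plc")]

# Output column name -> SQL expression, the shape the cleaning parameter describes.
cleaning = {
    "id": "id",
    "company_name": "upper(trim(name))",
}


def apply_cleaning(rows, cleaning):
    """Evaluate each cleaning expression against the raw rows (SQLite stand-in)."""
    con = sqlite3.connect(":memory:")
    con.execute("CREATE TABLE raw (id INTEGER, name TEXT)")
    con.executemany("INSERT INTO raw VALUES (?, ?)", rows)
    # Build one SELECT with an aliased expression per output column.
    select = ", ".join(f'{expr} AS "{col}"' for col, expr in cleaning.items())
    cur = con.execute(f"SELECT {select} FROM raw ORDER BY id")
    names = [d[0] for d in cur.description]
    result = [dict(zip(names, r)) for r in cur.fetchall()]
    con.close()
    return result


cleaned = apply_cleaning(raw_rows, cleaning)
```

Because the raw rows are passed in rather than fetched, the same data can be cleaned repeatedly with different dictionaries without another round trip, which is the pattern the raw_data parameter suggests.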