Queries

matchbox.client.queries ¶

Definition of model inputs.

Classes:

CacheMode –

Settings determining what query data gets cached.
Query –

Queriable input to a model.

CacheMode ¶

Bases: StrEnum

Settings determining what query data gets cached.

Attributes:

OFF –
RAW –
CLEAN –

OFF `class-attribute` `instance-attribute` ¶

OFF = 'off'

RAW `class-attribute` `instance-attribute` ¶

RAW = 'raw'

CLEAN `class-attribute` `instance-attribute` ¶

CLEAN = 'clean'
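As a minimal sketch of what the `StrEnum` base implies, the block below defines a local stand-in with the same three members. It does not import matchbox, and the fallback class for interpreters older than Python 3.11 is an assumption added for portability.

```python
try:
    from enum import StrEnum  # available from Python 3.11
except ImportError:  # fallback for older interpreters (assumption, for portability)
    from enum import Enum

    class StrEnum(str, Enum):
        pass


class CacheMode(StrEnum):
    """Local stand-in mirroring the three documented members."""

    OFF = "off"
    RAW = "raw"
    CLEAN = "clean"


# Members compare equal to their string values...
is_raw = CacheMode.RAW == "raw"
# ...and can be looked up from a plain string, e.g. one read from a config file.
mode_from_text = CacheMode("clean")
```

Because members are strings, a value such as "raw" can likely be passed wherever a CacheMode is expected, though that depends on how the receiving function validates its input.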

Query ¶

Query(*sources: Source, dag: DAG, model: Model | None = None, combine_type: QueryCombineType = CONCAT, threshold: float | None = None, cleaning: dict[str, str] | None = None)

Queriable input to a model.

Parameters:

sources ¶
(Source, default: () ) –

List of sources to query from
dag ¶
(DAG) –

DAG containing sources and models.
model ¶
(optional, default: None ) –

Model to use to resolve sources. It can only be missing if querying from a single source.
combine_type ¶
(optional, default: CONCAT ) –
How to combine the data from different sources. Default is concat.
- If concat, concatenate all queried sources without any merging. Multiple rows per ID, with null values where data isn’t available.
- If explode, outer join on Matchbox ID. Multiple rows per ID, with one for every unique combination of data requested across all sources.
- If set_agg, join on Matchbox ID, group on Matchbox ID, then aggregate to nested lists of unique values. One row per ID, but all requested data is in nested arrays.
threshold ¶
(optional, default: None ) –

The threshold to use for creating clusters. If None, uses the resolutions’ default thresholds. If set, uses that threshold for the specified resolution, and the resolution’s cached thresholds for its ancestors.
cleaning ¶
(optional, default: None ) –

A dictionary mapping an output column name to a SQL expression that will populate a new column.
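The three combine_type options described above can be illustrated with a small pure-Python sketch. The toy data, the function names and the handling of missing values are all assumptions made for illustration; the library itself works on dataframes and may differ in detail.

```python
from itertools import product

# Two toy sources whose rows already share an entity ID (hypothetical data).
source_a = [{"id": 1, "name": "Acme"}, {"id": 1, "name": "ACME Ltd"}]
source_b = [{"id": 1, "postcode": "AB1 2CD"}, {"id": 2, "postcode": "EF3 4GH"}]


def concat(a, b):
    """Stack rows from both sources, filling absent columns with None."""
    columns = ["id", "name", "postcode"]
    return [{c: row.get(c) for c in columns} for row in a + b]


def explode(a, b):
    """Outer join on ID: one row per combination of values within each ID."""
    out = []
    for i in sorted({r["id"] for r in a + b}):
        # A side with no rows for this ID contributes a single None placeholder.
        left = [r for r in a if r["id"] == i] or [{"name": None}]
        right = [r for r in b if r["id"] == i] or [{"postcode": None}]
        for l, r in product(left, right):
            out.append({"id": i, "name": l["name"], "postcode": r["postcode"]})
    return out


def set_agg(a, b):
    """One row per ID, with the unique values of each column in a list."""
    out = []
    for i in sorted({r["id"] for r in a + b}):
        names = sorted({r["name"] for r in a if r["id"] == i})
        postcodes = sorted({r["postcode"] for r in b if r["id"] == i})
        out.append({"id": i, "name": names, "postcode": postcodes})
    return out


concat_rows = concat(source_a, source_b)
explode_rows = explode(source_a, source_b)
set_agg_rows = set_agg(source_a, source_b)
```

With this data, concat yields four rows, explode yields three (two for ID 1, one for ID 2), and set_agg yields one row per ID.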

Methods:

from_config –

Create query from config.
set_cache_mode –

Configures caching behaviour of query operations.
run –

Runs queries against the selected backend.
clean –

Change cleaning dictionary and re-apply cleaning, if raw data was cached.
deduper –

Create deduper for data in this query.
linker –

Create linker for data in this query and another query.

Attributes:

raw_data (DataFrame | None) –
data (DataFrame | None) –
dag –
sources –
model –
combine_type –
threshold –
cleaning –
config (QueryConfig) –

The query configuration for the current DAG.

raw_data `instance-attribute` ¶

raw_data: DataFrame | None = None

data `instance-attribute` ¶

data: DataFrame | None = None

dag `instance-attribute` ¶

dag = dag

sources `instance-attribute` ¶

sources = sources

model `instance-attribute` ¶

model = model

combine_type `instance-attribute` ¶

combine_type = combine_type

threshold `instance-attribute` ¶

threshold = threshold

cleaning `instance-attribute` ¶

cleaning = cleaning

config `property` ¶

config: QueryConfig

The query configuration for the current DAG.

from_config `classmethod` ¶

from_config(config: QueryConfig, dag: DAG) -> Self

Create query from config.

The DAG must have had relevant sources and model added already.

Parameters:

config ¶
(QueryConfig) –

The QueryConfig to reconstruct from.
dag ¶
(DAG) –

The DAG containing the sources and model.

Returns:

Self –

A reconstructed Query instance.

set_cache_mode ¶

set_cache_mode(mode: CacheMode = OFF) -> Self

Configures caching behaviour of query operations.

If “off” (default), nothing is cached.
If “raw”, data is cached as fetched from the source.
If “clean”, the result of applying the cleaning dict is cached as well.
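To make the interplay between the cache modes and the raw_data and data attributes more concrete, here is a toy model of the behaviour described above. ToyQuery, its constructor arguments and its internals are hypothetical and are not the matchbox implementation; only the mode names, the attribute names and the reuse_cache flag come from this page.

```python
class ToyQuery:
    """Hypothetical stand-in that imitates the documented caching rules."""

    def __init__(self, fetch, clean_fn):
        self._fetch = fetch        # callable returning "raw" rows
        self._clean_fn = clean_fn  # callable applied to the raw rows
        self.cache_mode = "off"
        self.raw_data = None
        self.data = None

    def set_cache_mode(self, mode="off"):
        self.cache_mode = mode
        return self  # returning self allows chaining, as `-> Self` suggests

    def run(self, reuse_cache=False):
        # Re-use cached raw rows only when asked to and when they exist.
        if reuse_cache and self.raw_data is not None:
            raw = self.raw_data
        else:
            raw = self._fetch()
        cleaned = self._clean_fn(raw)
        if self.cache_mode in ("raw", "clean"):
            self.raw_data = raw
        if self.cache_mode == "clean":
            self.data = cleaned
        return cleaned


fetch_calls = []


def fetch():
    fetch_calls.append(1)  # record each trip to the (pretend) source
    return [" a ", " b "]


query = ToyQuery(fetch, lambda rows: [r.strip() for r in rows]).set_cache_mode("raw")
first = query.run()
second = query.run(reuse_cache=True)  # served from raw_data; fetch is not called again
```

In "raw" mode only raw_data is populated; data stays None until the mode is "clean".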

run ¶

run(return_type: QueryReturnType = POLARS, return_leaf_id: bool = False, batch_size: int | None = None, reuse_cache: bool = False) -> QueryReturnClass

Runs queries against the selected backend.

Parameters:

return_type ¶
(optional, default: POLARS ) –

Type of dataframe returned, defaults to “polars”. Other options are “pandas” and “arrow”.
return_leaf_id ¶
(optional, default: False ) –

Whether Matchbox IDs for source clusters should be saved as a byproduct in the leaf_ids attribute.
batch_size ¶
(optional, default: None ) –

The size of each batch when fetching data from the warehouse, which helps reduce memory usage and load on the database. Default is None.
reuse_cache ¶
(bool, default: False ) –

Whether to re-use raw cached data if available.

Returns: Data in the requested return type.

Raises:

MatchboxEmptyServerResponse –

If no data was returned by the server.
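The idea behind batch_size can be sketched with the standard library alone. The helper below is hypothetical and uses an in-memory SQLite table in place of a warehouse; treating None as "fetch everything at once" is an assumption, since this page only states that None is the default.

```python
import sqlite3


def fetch_in_batches(cursor, batch_size=None):
    """Yield lists of rows, at most batch_size rows at a time."""
    if batch_size is None:
        # Assumption: no batch size means a single fetch of all rows.
        yield cursor.fetchall()
        return
    while True:
        batch = cursor.fetchmany(batch_size)
        if not batch:
            break
        yield batch


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (id INTEGER)")
conn.executemany("INSERT INTO t VALUES (?)", [(i,) for i in range(5)])

cursor = conn.execute("SELECT id FROM t ORDER BY id")
batch_sizes = [len(batch) for batch in fetch_in_batches(cursor, batch_size=2)]
```

Only one batch is held in memory at a time, which is the memory saving the parameter description refers to.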

clean ¶

clean(cleaning: dict[str, str] | None, return_type: QueryReturnType = POLARS) -> QueryReturnClass

Change cleaning dictionary and re-apply cleaning, if raw data was cached.

Parameters:

cleaning ¶
(dict[str, str] | None) –

A dictionary mapping field aliases to SQL expressions. The SQL expressions can reference columns in the data using their names. If None, no cleaning is applied and the original data is returned. SourceConfig.f() can be used to help reference qualified fields.
return_type ¶
(optional, default: POLARS ) –

Type of dataframe returned, defaults to “polars”. Other options are “pandas” and “arrow”.
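A cleaning dictionary maps each output column name to a SQL expression. The sketch below shows that mapping being turned into a SELECT list. It uses the standard-library sqlite3 module as a stand-in, since the SQL dialect matchbox evaluates these expressions in is not stated here; the table name, column names and expressions are hypothetical.

```python
import sqlite3

# Output column name -> SQL expression over the raw columns (hypothetical).
cleaning = {
    "name_clean": "lower(trim(name))",
    "postcode_clean": "replace(postcode, ' ', '')",
}

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE raw (id INTEGER, name TEXT, postcode TEXT)")
conn.execute("INSERT INTO raw VALUES (1, '  Acme Ltd ', 'AB1 2CD')")

# Build "expression AS alias" for each entry.
# Expressions are interpolated as-is, so they must come from a trusted author.
select_list = ", ".join(f'{expr} AS "{alias}"' for alias, expr in cleaning.items())
cursor = conn.execute(f"SELECT id, {select_list} FROM raw")

columns = [description[0] for description in cursor.description]
rows = cursor.fetchall()
```

Here the result has the columns id, name_clean and postcode_clean. Whether the library keeps or drops the original columns after cleaning is not covered by this sketch.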

deduper ¶

deduper(name: str, model_class: Deduper, model_settings: DeduperSettings, description: str | None = None) -> Model

Create deduper for data in this query.

linker ¶

linker(other_query: Self, name: str, model_class: Linker, model_settings: LinkerSettings, description: str | None = None) -> Model

Create linker for data in this query and another query.
