Helpers¶
matchbox.client.helpers
¶
Core functionalities of the Matchbox client.
Modules:
-
cleaner
–Functions to pre-process data sources.
-
comparison
–Functions to compare fields in different datasets.
-
index
–Functions to index data sources to the Matchbox server.
-
selector
–Functions to select and retrieve data from the Matchbox server.
Functions:
-
cleaners
–Combine multiple cleaners in a single object to pass to
process()
. -
select
–From one engine, builds and verifies a list of selectors.
cleaners
¶
select
¶
select(
*selection: str | dict[str, str],
engine: Engine | None = None,
only_indexed: bool = True,
) -> list[Selector]
From one engine, builds and verifies a list of selectors.
Parameters:
-
selection
¶str | dict[str, str]
, default:()
) –Full source names and optionally a subset of columns to select
-
engine
¶Engine | None
, default:None
) –The engine to connect to the data warehouse hosting the source. If not provided, will use a connection string from the
MB__CLIENT__DEFAULT_WAREHOUSE
environment variable. -
only_indexed
¶bool
, default:True
) –Whether you intend to select indexed columns only. Will raise a warning if True and non-indexed columns are selected. Defaults to True. Non-indexed columns should only be selected if you’re querying data for a purpose other than matching
Returns:
Examples:
cleaner
¶
Functions to pre-process data sources.
Functions:
-
cleaner
–Define a function to clean a dataset.
-
cleaners
–Combine multiple cleaners in a single object to pass to
process()
. -
process
–Apply cleaners to input dataframe.
cleaner
¶
comparison
¶
Functions to compare fields in different datasets.
Functions:
-
comparison
–Validates any number of SQL conditions and prepares for a WHERE clause.
index
¶
Functions to index data sources to the Matchbox server.
Functions:
-
index
–Indexes data in Matchbox.
index
¶
index(
full_name: str,
db_pk: str,
engine: Engine,
resolution_name: str | None = None,
columns: list[str]
| list[dict[str, dict[str, str]]]
| None = None,
batch_size: int | None = None,
) -> None
Indexes data in Matchbox.
Parameters:
-
full_name
¶str
) –the full name of the source
-
db_pk
¶str
) –the primary key of the source
-
engine
¶Engine
) –the engine to connect to a data warehouse
-
resolution_name
¶str | None
, default:None
) –a custom resolution name If missing, will use the default name for a
Source
-
columns
¶list[str] | list[dict[str, dict[str, str]]] | None
, default:None
) –the columns to index
-
batch_size
¶int | None
, default:None
) –the size of each batch when fetching data from the warehouse, which helps reduce the load on the database. Default is None.
Examples:
selector
¶
Functions to select and retrieve data from the Matchbox server.
Classes:
-
Selector
–A selector to choose a source and optionally a subset of columns to select.
Functions:
-
select
–From one engine, builds and verifies a list of selectors.
-
query
–Runs queries against the selected backend.
-
match
–Matches IDs against the selected backend.
Selector
¶
select
¶
select(
*selection: str | dict[str, str],
engine: Engine | None = None,
only_indexed: bool = True,
) -> list[Selector]
From one engine, builds and verifies a list of selectors.
Parameters:
-
selection
¶str | dict[str, str]
, default:()
) –Full source names and optionally a subset of columns to select
-
engine
¶Engine | None
, default:None
) –The engine to connect to the data warehouse hosting the source. If not provided, will use a connection string from the
MB__CLIENT__DEFAULT_WAREHOUSE
environment variable. -
only_indexed
¶bool
, default:True
) –Whether you intend to select indexed columns only. Will raise a warning if True and non-indexed columns are selected. Defaults to True. Non-indexed columns should only be selected if you’re querying data for a purpose other than matching
Returns:
Examples:
query
¶
query(
*selectors: list[Selector],
resolution_name: str | None = None,
combine_type: Literal[
"concat", "explode", "set_agg"
] = "concat",
return_type: ReturnTypeStr = "pandas",
threshold: int | None = None,
limit: int | None = None,
batch_size: int | None = None,
return_batches: bool = False,
) -> QueryReturnType | Iterator[QueryReturnType]
Runs queries against the selected backend.
Parameters:
-
selectors
¶list[Selector]
, default:()
) –Each selector is the output of
select()
. This allows querying sources coming from different engines -
resolution_name
¶optional
, default:None
) –The name of the resolution point to query If not set:
- If querying a single source, it will use the source resolution
- If querying 2 or more sources, it will look for a default resolution
-
combine_type
¶Literal['concat', 'explode', 'set_agg']
, default:'concat'
) –How to combine the data from different sources.
- If
concat
, concatenate all sources queried without any merging. Multiple rows per ID, with null values where data isn’t available - If
explode
, outer join on Matchbox ID. Multiple rows per ID, with one for every unique combination of data requested across all sources - If
set_agg
, join on Matchbox ID, group on Matchbox ID, then aggregate to nested lists of unique values. One row per ID, but all requested data is in nested arrays
- If
-
return_type
¶ReturnTypeStr
, default:'pandas'
) –The form to return data in, one of “pandas” or “arrow” Defaults to pandas for ease of use
-
threshold
¶optional
, default:None
) –The threshold to use for creating clusters If None, uses the resolutions’ default threshold If an integer, uses that threshold for the specified resolution, and the resolution’s cached thresholds for its ancestors
-
limit
¶optional
, default:None
) –The number to use in a limit clause. Useful for testing
-
batch_size
¶optional
, default:None
) –The size of each batch when fetching data from the warehouse, which helps reduce memory usage and load on the database. Default is None.
-
return_batches
¶optional
, default:False
) –If True, returns an iterator of batches instead of a single combined result, which is useful for processing large datasets with limited memory. Default is False.
Returns:
-
QueryReturnType | Iterator[QueryReturnType]
–If return_batches is False: Data in the requested return type (DataFrame or ArrowTable)
-
QueryReturnType | Iterator[QueryReturnType]
–If return_batches is True: An iterator yielding batches in the requested return type
Examples:
match
¶
match(
*targets: list[Selector],
source: list[Selector],
source_pk: str,
resolution_name: str = DEFAULT_RESOLUTION,
threshold: int | None = None,
) -> list[Match]
Matches IDs against the selected backend.
Parameters:
-
targets
¶list[Selector]
, default:()
) –Each target is the output of
select()
. This allows matching against sources coming from different engines -
source
¶list[Selector]
) –The output of using
select()
on a single source. -
source_pk
¶str
) –The primary key value to match from the source.
-
resolution_name
¶optional
, default:DEFAULT_RESOLUTION
) –The resolution name to use for filtering results. If not set, it will look for a default resolution.
-
threshold
¶optional
, default:None
) –The threshold to use for creating clusters. If None, uses the resolutions’ default threshold If an integer, uses that threshold for the specified resolution, and the resolution’s cached thresholds for its ancestors
Examples: