Skip to content

Database

matchbox.common.db

Common database utilities for Matchbox.

Functions:

Attributes:

ReturnTypeStr module-attribute

ReturnTypeStr = Literal['arrow', 'pandas', 'polars']

QueryReturnType module-attribute

QueryReturnType = Table | DataFrame | DataFrame

sql_to_df

sql_to_df(
    stmt: str,
    connection: Engine | Connection,
    return_type: Literal["arrow", "pandas", "polars"],
    *,
    return_batches: Literal[False] = False,
    batch_size: int | None = None,
    rename: dict[str, str] | Callable | None = None,
    schema_overrides: dict[str, Any] | None = None,
    execute_options: dict[str, Any] | None = None,
) -> QueryReturnType
sql_to_df(
    stmt: str,
    connection: Engine | Connection,
    return_type: Literal["arrow", "pandas", "polars"],
    *,
    return_batches: Literal[True],
    batch_size: int | None = None,
    rename: dict[str, str] | Callable | None = None,
    schema_overrides: dict[str, Any] | None = None,
    execute_options: dict[str, Any] | None = None,
) -> Iterator[QueryReturnType]
sql_to_df(
    stmt: str,
    connection: Engine | Connection,
    return_type: ReturnTypeStr = "pandas",
    *,
    return_batches: bool = False,
    batch_size: int | None = None,
    rename: dict[str, str] | Callable | None = None,
    schema_overrides: dict[str, Any] | None = None,
    execute_options: dict[str, Any] | None = None,
) -> QueryReturnType | Iterator[QueryReturnType]

Executes the given SQLAlchemy statement or SQL string using Polars.

Parameters:

  • stmt

    (str) –

    A SQL string to be executed.

  • connection

    (Engine | Connection) –

    A SQLAlchemy Engine object or ADBC connection.

  • return_type

    (str, default: 'pandas' ) –

    The type of the return value. One of “arrow”, “pandas”, or “polars”.

  • return_batches

    (bool, default: False ) –

    If True, return an iterator that yields each batch separately. If False, return a single DataFrame with all results. Default is False.

  • batch_size

    (int | None, default: None ) –

    Indicate the size of each batch when processing data in batches. Default is None.

  • rename

    (dict[str, str] | Callable | None, default: None ) –

    A dictionary mapping old column names to new column names, or a callable that takes a DataFrame and returns a DataFrame with renamed columns. Default is None.

  • schema_overrides

    (dict[str, Any] | None, default: None ) –

    A dictionary mapping column names to dtypes. Default is None.

  • execute_options

    (dict[str, Any] | None, default: None ) –

    These options will be passed through into the underlying query execution method as kwargs. Default is None.

Returns:

Raises:

  • ValueError
    • If the connection is not properly configured or if an unsupported return type is specified.
    • If batch_size and return_batches are either both set or both unset.

validate_sql_for_data_extraction

validate_sql_for_data_extraction(sql: str) -> bool

Validates that the SQL statement only contains a single data-extracting command.

Parameters:

  • sql

    (str) –

    The SQL statement to validate

Returns:

  • bool ( bool ) –

    True if the SQL statement is valid

Raises: