Skip to content

Database

matchbox.common.db

Common database utilities for Matchbox.

Functions:

Attributes:

ReturnTypeStr module-attribute

ReturnTypeStr = Literal['arrow', 'pandas', 'polars']

QueryReturnType module-attribute

QueryReturnType = Table | DataFrame | DataFrame

sql_to_df

sql_to_df(
    stmt: str,
    connection: Engine | Connection,
    return_type: Literal["arrow", "pandas", "polars"],
    *,
    return_batches: Literal[False] = False,
    batch_size: int | None = None,
    rename: dict[str, str] | Callable | None = None,
    schema_overrides: dict[str, Any] | None = None,
    execute_options: dict[str, Any] | None = None,
) -> QueryReturnType
sql_to_df(
    stmt: str,
    connection: Engine | Connection,
    return_type: Literal["arrow", "pandas", "polars"],
    *,
    return_batches: Literal[True],
    batch_size: int | None = None,
    rename: dict[str, str] | Callable | None = None,
    schema_overrides: dict[str, Any] | None = None,
    execute_options: dict[str, Any] | None = None,
) -> Iterator[QueryReturnType]
sql_to_df(
    stmt: str,
    connection: Engine | Connection,
    return_type: ReturnTypeStr = "pandas",
    *,
    return_batches: bool = False,
    batch_size: int | None = None,
    rename: dict[str, str] | Callable | None = None,
    schema_overrides: dict[str, Any] | None = None,
    execute_options: dict[str, Any] | None = None,
) -> QueryReturnType | Iterator[QueryReturnType]

Executes the given SQLAlchemy statement or SQL string using Polars.

Parameters:

  • stmt

    (str) –

    A SQL string to be executed.

  • connection

    (Engine | Connection) –

    A SQLAlchemy Engine object or ADBC connection.

  • return_type

    (str, default: 'pandas' ) –

    The type of the return value. One of “arrow”, “pandas”, or “polars”.

  • return_batches

    (bool, default: False ) –

    If True, return an iterator that yields each batch separately. If False, return a single DataFrame with all results. Default is False.

  • batch_size

    (int | None, default: None ) –

    Indicate the size of each batch when processing data in batches. Default is None.

  • rename

    (dict[str, str] | Callable | None, default: None ) –

    A dictionary mapping old column names to new column names, or a callable that takes a DataFrame and returns a DataFrame with renamed columns. Default is None.

  • schema_overrides

    (dict[str, Any] | None, default: None ) –

    A dictionary mapping column names to dtypes. Default is None.

  • execute_options

    (dict[str, Any] | None, default: None ) –

    These options will be passed through into the underlying query execution method as kwargs. Default is None.

Returns:

Raises:

  • ValueError
    • If the connection is not properly configured or if an unsupported return type is specified.
    • If batch_size and return_batches are either both set or both unset.

validate_sql_for_data_extraction

validate_sql_for_data_extraction(sql: str) -> bool

Validates that the SQL statement only contains a single data-extracting command.

Parameters:

  • sql

    (str) –

    The SQL statement to validate

Returns:

  • bool ( bool ) –

    True if the SQL statement is valid

Raises:

clean_uri

clean_uri(
    uri: str | AnyUrl,
    strip_driver: bool = True,
    strip_credentials: bool = True,
    strip_query_fragment: bool = True,
) -> AnyUrl

Clean a database URI.

Optionally removes driver, credentials, and/or query/fragment components.

Parameters:

  • uri

    (str | AnyUrl) –

    A database URI as a string or AnyUrl object

  • strip_driver

    (bool, default: True ) –

    Whether to strip the driver component (e.g., ‘postgresql+psycopg2’ -> ‘postgresql’)

  • strip_credentials

    (bool, default: True ) –

    Whether to strip username and password components

  • strip_query_fragment

    (bool, default: True ) –

    Whether to strip query and fragment components

Returns:

  • AnyUrl

    An AnyUrl object with the specified components removed