Sources

matchbox.common.sources

Classes and functions for working with data sources in Matchbox.

Classes:

- SourceColumn – A column in a dataset that can be indexed in the Matchbox database.
- SourceAddress – A unique identifier for a dataset in a warehouse.
- Source – A dataset that can be, or has been, indexed on the backend.
- Match – A match between primary keys in the Matchbox database.

Functions:

- b64_bytes_validator – Ensure that a value is a base64 encoded string or bytes.
- needs_engine – Decorator to check that an engine is set.

Attributes:

SerialisableBytes (module-attribute)
SerialisableBytes = Annotated[
bytes,
PlainValidator(b64_bytes_validator),
PlainSerializer(lambda v: hash_to_base64(v)),
WithJsonSchema(
{
"type": "string",
"format": "base64",
"description": "Base64 encoded bytes",
}
),
]
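The annotation above exists so that raw bytes survive a JSON round trip: the serialiser emits a base64 string, and the validator turns it back into bytes. A dependency-free sketch of that round trip (assuming `hash_to_base64` is a thin wrapper over standard base64 encoding):

```python
import base64

raw = b"\xde\xad\xbe\xef"

# What the PlainSerializer would emit into JSON: a base64 string.
as_json_value = base64.b64encode(raw).decode()

# What the PlainValidator would recover when the model is parsed back.
restored = base64.b64decode(as_json_value)

assert restored == raw
```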
SourceColumn

Bases: BaseModel

A column in a dataset that can be indexed in the Matchbox database.
SourceAddress

Bases: BaseModel

A unique identifier for a dataset in a warehouse.

Methods:

- compose – Generate a SourceAddress from a SQLAlchemy Engine and full source name.
- format_column – Outputs a full SQLAlchemy column representation.

Attributes:

- full_name (str)
- warehouse_hash (SerialisableBytes)
- pretty (str) – Return a pretty representation of the address.
- warehouse_hash_b64 (str) – Return the warehouse hash as a base64 encoded string.

warehouse_hash_b64 (property)

warehouse_hash_b64: str

Return the warehouse hash as a base64 encoded string.
compose (classmethod)

compose(engine: Engine, full_name: str) -> SourceAddress

Generate a SourceAddress from a SQLAlchemy Engine and full source name.
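A minimal sketch of what composing an address could look like: the warehouse hash is assumed here to be a digest of the engine's connection URL, so two sources on the same warehouse share a `warehouse_hash`, while `full_name` identifies the table within it. The hashing scheme and class below are illustrative, not Matchbox's actual implementation.

```python
import base64
import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)
class SourceAddressSketch:
    full_name: str          # e.g. "schema.table"
    warehouse_hash: bytes   # identifies the warehouse the table lives in

    @classmethod
    def compose(cls, connection_url: str, full_name: str) -> "SourceAddressSketch":
        # Hash the connection string so that two sources on the same
        # warehouse end up with the same warehouse_hash.
        digest = hashlib.sha256(connection_url.encode()).digest()
        return cls(full_name=full_name, warehouse_hash=digest)

    @property
    def warehouse_hash_b64(self) -> str:
        # Base64 keeps the hash JSON-serialisable, mirroring the
        # SerialisableBytes annotation above.
        return base64.b64encode(self.warehouse_hash).decode()


addr = SourceAddressSketch.compose("postgresql://host/db", "public.companies")
```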
Source

Bases: BaseModel

A dataset that can be, or has been, indexed on the backend.

Methods:

- set_engine – Adds an engine and uses it to validate current columns.
- get_remote_columns – Returns a dictionary of column names and SQLAlchemy types.
- default_columns – Returns a new source with default columns.
- to_table – Returns the dataset as a SQLAlchemy Table object.
- check_columns – Check that columns are available in the warehouse and correctly typed.
- to_arrow – Returns the dataset as a PyArrow Table or an iterator of PyArrow Tables.
- to_polars – Returns the dataset as a Polars DataFrame or an iterator of DataFrames.
- to_pandas – Returns the dataset as a pandas DataFrame or an iterator of DataFrames.
- hash_data – Retrieve and hash a dataset from its warehouse, ready to be inserted.

Attributes:

- address (SourceAddress)
- resolution_name (str)
- db_pk (str)
- columns (tuple[SourceColumn, ...] | None)
- engine (Engine | None) – The SQLAlchemy Engine used to connect to the dataset.

resolution_name (class-attribute, instance-attribute)
get_remote_columns

Returns a dictionary of column names and SQLAlchemy types.

default_columns

default_columns() -> Source

Returns a new source with default columns.

Default columns are all columns from the source warehouse other than self.db_pk.
All other attributes are copied, and its engine (if present) is set.
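The column-selection rule described above can be sketched in a few lines: take every column the warehouse reports for the table and drop the primary key. The function and the column mapping below are illustrative stand-ins, not the real API.

```python
def select_default_columns(remote_columns: dict[str, str], db_pk: str) -> dict[str, str]:
    """Return all remote columns except the primary key."""
    return {name: dtype for name, dtype in remote_columns.items() if name != db_pk}


cols = select_default_columns(
    {"id": "BIGINT", "name": "TEXT", "postcode": "TEXT"},
    db_pk="id",
)
# cols == {"name": "TEXT", "postcode": "TEXT"}
```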
check_columns

Check that columns are available in the warehouse and correctly typed.
to_arrow

to_arrow(
    fields: list[str] | None = None,
    pks: list[T] | None = None,
    limit: int | None = None,
    *,
    return_batches: bool = False,
    batch_size: int | None = None,
    schema_overrides: dict[str, Any] | None = None,
    execute_options: dict[str, Any] | None = None,
) -> Table | Iterator[Table]

Returns the dataset as a PyArrow Table or an iterator of PyArrow Tables.

Parameters:

- fields (list[str] | None, default: None) – List of column names to retrieve. If None, retrieves all columns.
- pks (list[T] | None, default: None) – List of primary keys to filter by. If None, retrieves all rows.
- limit (int | None, default: None) – Maximum number of rows to retrieve. If None, retrieves all rows.
- return_batches (bool, default: False) – If True, return an iterator that yields each batch separately; if False, return a single Table with all results.
- batch_size (int | None, default: None) – Indicates the size of each batch when processing data in batches.
- schema_overrides (dict[str, Any] | None, default: None) – A dictionary mapping column names to dtypes.
- execute_options (dict[str, Any] | None, default: None) – These options are passed through to the underlying query execution method as kwargs.

Returns:

- Table | Iterator[Table] – The requested data in PyArrow format. If return_batches is False, a PyArrow Table; if True, an iterator of PyArrow Tables.
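The return_batches / batch_size contract can be sketched with plain Python: when `return_batches` is True the caller receives an iterator of chunks and controls memory use; when False the chunks are materialised into one result up front. The `fetch_rows` function is a stand-in for the real warehouse query, not part of Matchbox.

```python
from collections.abc import Iterator


def fetch_rows(n: int) -> list[int]:
    # Stand-in for a warehouse read; returns n "rows".
    return list(range(n))


def to_batches(rows: list[int], batch_size: int) -> Iterator[list[int]]:
    # Yield fixed-size chunks; the last chunk may be smaller.
    for start in range(0, len(rows), batch_size):
        yield rows[start : start + batch_size]


def read(return_batches: bool = False, batch_size: int = 4):
    batches = to_batches(fetch_rows(10), batch_size)
    if return_batches:
        return batches          # iterator: caller processes one batch at a time
    combined: list[int] = []
    for batch in batches:       # materialise everything before returning
        combined.extend(batch)
    return combined


assert read() == list(range(10))
assert [len(b) for b in read(return_batches=True)] == [4, 4, 2]
```

The same shape applies to to_polars and to_pandas; only the container type of each batch differs.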
to_polars

to_polars(
    fields: list[str] | None = None,
    pks: list[T] | None = None,
    limit: int | None = None,
    *,
    return_batches: bool = False,
    batch_size: int | None = None,
    schema_overrides: dict[str, Any] | None = None,
    execute_options: dict[str, Any] | None = None,
) -> DataFrame | Iterator[DataFrame]

Returns the dataset as a Polars DataFrame or an iterator of Polars DataFrames.

Parameters:

- fields (list[str] | None, default: None) – List of column names to retrieve. If None, retrieves all columns.
- pks (list[T] | None, default: None) – List of primary keys to filter by. If None, retrieves all rows.
- limit (int | None, default: None) – Maximum number of rows to retrieve. If None, retrieves all rows.
- return_batches (bool, default: False) – If True, return an iterator that yields each batch separately; if False, return a single DataFrame with all results.
- batch_size (int | None, default: None) – Indicates the size of each batch when processing data in batches.
- schema_overrides (dict[str, Any] | None, default: None) – A dictionary mapping column names to dtypes.
- execute_options (dict[str, Any] | None, default: None) – These options are passed through to the underlying query execution method as kwargs.

Returns:

- DataFrame | Iterator[DataFrame] – The requested data in Polars format. If return_batches is False, a Polars DataFrame; if True, an iterator of Polars DataFrames.
to_pandas

to_pandas(
    fields: list[str] | None = None,
    pks: list[T] | None = None,
    limit: int | None = None,
    *,
    return_batches: bool = False,
    batch_size: int | None = None,
    schema_overrides: dict[str, Any] | None = None,
    execute_options: dict[str, Any] | None = None,
) -> DataFrame | Iterator[DataFrame]

Returns the dataset as a pandas DataFrame or an iterator of DataFrames.

Parameters:

- fields (list[str] | None, default: None) – List of column names to retrieve. If None, retrieves all columns.
- pks (list[T] | None, default: None) – List of primary keys to filter by. If None, retrieves all rows.
- limit (int | None, default: None) – Maximum number of rows to retrieve. If None, retrieves all rows.
- return_batches (bool, default: False) – If True, return an iterator that yields each batch separately; if False, return a single DataFrame with all results.
- batch_size (int | None, default: None) – Indicates the size of each batch when processing data in batches.
- schema_overrides (dict[str, Any] | None, default: None) – A dictionary mapping column names to dtypes.
- execute_options (dict[str, Any] | None, default: None) – These options are passed through to the underlying query execution method as kwargs.

Returns:

- DataFrame | Iterator[DataFrame] – The requested data in pandas format. If return_batches is False, a pandas DataFrame; if True, an iterator of pandas DataFrames.
hash_data

hash_data(
    *,
    batch_size: int | None = None,
    schema_overrides: dict[str, Any] | None = None,
    execute_options: dict[str, Any] | None = None,
) -> Table

Retrieve and hash a dataset from its warehouse, ready to be inserted.

Parameters:

- batch_size (int | None, default: None) – If set, process data in batches internally. Indicates the size of each batch.
- schema_overrides (dict[str, Any] | None, default: None) – A dictionary mapping column names to dtypes.
- execute_options (dict[str, Any] | None, default: None) – These options are passed through to the underlying query execution method as kwargs.

Returns:

- Table – A PyArrow Table containing source primary keys and their hashes.
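A rough sketch of what "hash a dataset" could mean here: for each row, digest the indexed (non-pk) column values and pair the digest with the row's primary key, giving the pk-to-hash pairs described in the return value. The concatenation scheme and digest choice below are assumptions for illustration, not Matchbox's actual hashing.

```python
import hashlib


def hash_rows(rows: list[dict], db_pk: str) -> list[tuple[str, bytes]]:
    """Pair each row's primary key with a hash of its other columns."""
    out = []
    for row in rows:
        # Hash all columns except the primary key, in a stable key order,
        # joined with a separator unlikely to appear in the data.
        payload = "\x1f".join(f"{k}={row[k]}" for k in sorted(row) if k != db_pk)
        out.append((str(row[db_pk]), hashlib.sha256(payload.encode()).digest()))
    return out


hashed = hash_rows(
    [
        {"id": 1, "name": "Acme Ltd", "postcode": "AB1 2CD"},
        {"id": 2, "name": "Acme Limited", "postcode": "AB1 2CD"},
    ],
    db_pk="id",
)
# Rows with identical non-pk values would share a hash; these two differ.
```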
Match

Bases: BaseModel

A match between primary keys in the Matchbox database.

Methods:

- found_or_none – Ensure that a match has sources and a cluster if a target was found.

Attributes:

- cluster (int | None)
- source (SourceAddress)
- source_id (set[str])
- target (SourceAddress)
- target_id (set[str])
b64_bytes_validator

Ensure that a value is a base64 encoded string or bytes.