Hashing

matchbox.common.hash

Utilities for hashing data and creating unique identifiers.

Classes:

  • HashMethod

    Supported hash methods for row hashing.

  • IntMap

    A data structure to map integers without collisions within a dedicated space.

Functions:

  • hash_to_base64

    Converts a hash to a base64 string.

  • base64_to_hash

    Converts a base64 string to a hash.

  • prep_for_hash

    Encodes strings so they can be hashed; otherwise passes items through unchanged.

  • hash_data

    Hash the given data using the globally defined hash function.

  • hash_values

    Returns a single hash of a tuple of items, ordered by value.

  • process_column_for_hashing

    Process a column for hashing based on its type.

  • hash_rows

    Hash all rows in a dataframe.

  • hash_arrow_table

    Computes a content hash of an Arrow table invariant to row and field order.

Attributes:

HashableItem module-attribute

HashableItem = TypeVar('HashableItem', bytes, bool, str, int, float, bytearray)

HASH_FUNC module-attribute

HASH_FUNC = sha256

HashMethod

Bases: StrEnum

Supported hash methods for row hashing.

Attributes:

XXH3_128 class-attribute instance-attribute

XXH3_128 = 'xxh3_128'

SHA256 class-attribute instance-attribute

SHA256 = 'sha256'

IntMap

IntMap(salt: int = 42)

A data structure to map integers without collisions within a dedicated space.

A stand-in for hashing integers within pa.int64.

Takes unordered sets of integers and maps them to an ID that 1) is a negative integer; 2) does not collide with IDs generated by other instances of this class, as long as each is initialised with a different salt.

The fact that IDs are always negative makes it possible to build a hierarchy where IDs are themselves parts of other sets, and easy to distinguish integers mapped to raw data points (which will be non-negative) from integers that are IDs (which will be negative). The salt makes it possible to use a parallel execution model in which each worker maintains its own separate ID space, as long as the workers operate on disjoint subsets of positive integers.

Parameters:

  • salt

    (optional, default: 42 ) –

    A positive integer to salt the Cantor pairing function

Methods:

  • index

    Index a set of integers.

  • has_mapping

    Check if index for values already exists.

Attributes:

mapping instance-attribute

mapping: dict[frozenset[int], int] = {}

salt instance-attribute

salt: int = salt

index

index(*values: int) -> int

Index a set of integers.

Parameters:

  • values
    (int, default: () ) –

    the integers in the set you want to index

Returns:

  • int

    The old or new ID corresponding to the set

has_mapping

has_mapping(*values: int) -> bool

Check if index for values already exists.

Parameters:

  • values
    (int, default: () ) –

    the integers in the set you want to index

Returns:

  • bool

    Boolean indicating whether index for values already exists
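The behaviour described above can be sketched in plain Python. This is an illustrative reimplementation, not the library's actual code: the use of a fresh counter paired with the salt via the Cantor pairing function is an assumption about how salt-disjoint negative IDs could be produced.

```python
import itertools


def cantor_pair(a: int, b: int) -> int:
    """Cantor pairing function: a bijection from pairs of
    non-negative integers to single non-negative integers."""
    return (a + b) * (a + b + 1) // 2 + b


class IntMapSketch:
    """Sketch of the IntMap behaviour: maps frozensets of integers
    to unique negative IDs, salted so that instances initialised
    with different salts never collide."""

    def __init__(self, salt: int = 42):
        self.salt = salt
        self.mapping: dict[frozenset[int], int] = {}
        self._counter = itertools.count()

    def index(self, *values: int) -> int:
        key = frozenset(values)  # unordered: (1, 2) and (2, 1) are the same set
        if key not in self.mapping:
            # Pair a fresh counter value with the salt, then negate and
            # offset by one so the resulting ID is always negative.
            self.mapping[key] = -cantor_pair(next(self._counter), self.salt) - 1
        return self.mapping[key]

    def has_mapping(self, *values: int) -> bool:
        return frozenset(values) in self.mapping
```

Because the Cantor pairing function is a bijection, two instances with different salts can never emit the same ID, which is what allows parallel workers to keep disjoint ID spaces.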

hash_to_base64

hash_to_base64(hash: bytes) -> str

Converts a hash to a base64 string.

base64_to_hash

base64_to_hash(b64: str) -> bytes

Converts a base64 string to a hash.
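A plausible stdlib implementation of this pair of converters; the use of standard (rather than URL-safe) base64 is an assumption not stated in the signatures above.

```python
import base64


def hash_to_base64(hash: bytes) -> str:
    # Encode the raw digest bytes as a base64 ASCII string.
    return base64.b64encode(hash).decode("utf-8")


def base64_to_hash(b64: str) -> bytes:
    # Decode the base64 string back to the original digest bytes.
    return base64.b64decode(b64)
```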

prep_for_hash

prep_for_hash(item: HashableItem) -> bytes

Encodes strings so they can be hashed, otherwises, passes through.

hash_data

hash_data(data: HashableItem) -> bytes

Hash the given data using the globally defined hash function.

This function ties into the existing hashing utilities.
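A minimal sketch of how these two utilities could fit together, assuming UTF-8 encoding for strings and stringification of other scalars (both illustrative choices; only `HASH_FUNC = sha256` is documented above):

```python
from hashlib import sha256

HASH_FUNC = sha256  # matches the module attribute documented above


def prep_for_hash(item) -> bytes:
    # Bytes-like values pass through; strings are UTF-8 encoded;
    # other hashable scalars are stringified first (an assumption).
    if isinstance(item, (bytes, bytearray)):
        return bytes(item)
    if isinstance(item, str):
        return item.encode("utf-8")
    return str(item).encode("utf-8")


def hash_data(data) -> bytes:
    # Hash the prepared bytes with the globally defined hash function.
    return HASH_FUNC(prep_for_hash(data)).digest()
```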

hash_values

hash_values(*values: tuple[T, ...]) -> bytes

Returns a single hash of a tuple of items, ordered by value.

The values are sorted before hashing, so different orderings of the same items produce the same hash.
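One way to achieve this order invariance, sketched with sha256 (hashing each value individually before combining avoids the ambiguity of concatenating variable-length encodings; the exact scheme here is an assumption):

```python
from hashlib import sha256


def hash_values(*values) -> bytes:
    # Hash each value individually, then hash the sorted digests so
    # any ordering of the same items yields the same result.
    digests = sorted(sha256(str(v).encode("utf-8")).digest() for v in values)
    return sha256(b"".join(digests)).digest()
```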

process_column_for_hashing

process_column_for_hashing(column_name: str, schema_type: DataType) -> Expr

Process a column for hashing based on its type.

Parameters:

  • column_name

    (str) –

    The column name

  • schema_type

    (DataType) –

    The polars schema type of the column

Returns:

  • Expr

    A polars expression for processing the column
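The real function returns a polars `Expr`; the type-dispatch idea can be shown with a stdlib stand-in that normalises a column's values to bytes according to its type. The dispatch rules here are illustrative, not the library's:

```python
def process_column(column: list, dtype: type) -> list[bytes]:
    # Stdlib stand-in for the polars-expression version: choose a
    # byte encoding for the column based on its schema type.
    if dtype is bytes:
        return list(column)                              # already bytes
    if dtype is str:
        return [v.encode("utf-8") for v in column]       # encode text
    return [str(v).encode("utf-8") for v in column]      # stringify the rest
```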

hash_rows

hash_rows(df: DataFrame, columns: list[str], method: HashMethod = XXH3_128) -> Series

Hash all rows in a dataframe.

Parameters:

  • df

    (DataFrame) –

    The DataFrame to hash rows from

  • columns

    (list[str]) –

    The column names to include in the hash

  • method

    (HashMethod, default: XXH3_128 ) –

    The hash method to use

Returns:

  • Series

    List of row hashes as bytes
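The real implementation operates on a polars `DataFrame`; the same idea can be sketched stdlib-only over a list of dicts, using sha256 in place of the default XXH3_128 (both the serialisation and the length-prefixing are illustrative choices):

```python
from hashlib import sha256


def hash_rows(rows: list[dict], columns: list[str]) -> list[bytes]:
    # Serialise the selected columns of each row deterministically,
    # then hash each row to a digest.
    out = []
    for row in rows:
        hasher = sha256()
        for col in columns:
            # Length-prefix each field to avoid concatenation ambiguity.
            encoded = str(row[col]).encode("utf-8")
            hasher.update(len(encoded).to_bytes(4, "big"))
            hasher.update(encoded)
        out.append(hasher.digest())
    return out
```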

hash_arrow_table

hash_arrow_table(table: Table, method: HashMethod = XXH3_128, as_sorted_list: list[str] | None = None) -> bytes

Computes a content hash of an Arrow table invariant to row and field order.

This is used to content-address an Arrow table for caching.

Parameters:

  • table

    (Table) –

    The pyarrow Table to hash

  • method

    (HashMethod, default: XXH3_128 ) –

    The method to use for hashing rows (XXH3_128 or SHA256)

  • as_sorted_list

    (list[str] | None, default: None ) –

    Optional list of column names to hash as a sorted list. For example, [“left_id”, “right_id”] will create a “sorted_list” column and drop the original columns to ensure (1,2) and (2,1) hash to the same value. Works with 2 or more columns.

    Note: if list columns are combined with a column that’s nullable, list + null value returns null. See Polars’ concat_list documentation for more details.

Returns:

  • bytes

    Bytes representing the content hash of the table