Hashing

matchbox.common.hash

Utilities for hashing data and creating unique identifiers.

Classes:

  • HashMethod

    Supported hash methods for row hashing.

  • IntMap

    A data structure to map integers without collisions within a dedicated space.

Functions:

Attributes:

HashableItem module-attribute

HashableItem = TypeVar(
    "HashableItem", bytes, bool, str, int, float, bytearray
)

HASH_FUNC module-attribute

HASH_FUNC = sha256

HashMethod

Bases: StrEnum

Supported hash methods for row hashing.

Attributes:

XXH3_128 class-attribute instance-attribute

XXH3_128 = 'xxh3_128'

SHA256 class-attribute instance-attribute

SHA256 = 'sha256'

IntMap

IntMap(salt: int = 42)

A data structure to map integers without collisions within a dedicated space.

A stand-in for hashing integers within pa.int64.

Takes unordered sets of integers and maps each set to an ID that 1) is a negative integer; 2) does not collide with IDs generated by other instances of this class, as long as they are initialised with a different salt.

Because IDs are always negative, it’s possible to build a hierarchy where IDs are themselves members of other sets, and it’s easy to distinguish integers that map to raw data points (which are non-negative) from integers that are IDs (which are negative). The salt supports a parallel execution model in which each worker maintains its own ID space, as long as each worker operates on disjoint subsets of positive integers.

Parameters:

  • salt

    (optional, default: 42 ) –

    A positive integer to salt the Cantor pairing function

Methods:

  • index

    Index a set of integers.

  • has_mapping

    Check if index for values already exists.

Attributes:

mapping instance-attribute

mapping: dict[frozenset[int], int] = {}

salt instance-attribute

salt: int = salt

index

index(*values: int) -> int

Index a set of integers.

Parameters:

  • values
    (int, default: () ) –

    the integers in the set you want to index

Returns:

  • int

    The old or new ID corresponding to the set

has_mapping

has_mapping(*values: int) -> bool

Check if index for values already exists.

Parameters:

  • values
    (int, default: () ) –

    the integers in the set you want to check

Returns:

  • bool

    Boolean indicating whether index for values already exists
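
The behaviour described for IntMap can be sketched with a negated Cantor pairing of the salt and a per-instance counter. This is a minimal illustration under stated assumptions, not the library's implementation: the class name `SketchIntMap`, the helper `_salted_id`, and the choice of a 1-based counter are all hypothetical.

```python
class SketchIntMap:
    """Illustrative stand-in for IntMap; not the library's implementation."""

    def __init__(self, salt: int = 42) -> None:
        if salt < 0:
            raise ValueError("salt must be non-negative")
        self.salt = salt
        self.mapping: dict[frozenset[int], int] = {}

    def _salted_id(self, counter: int) -> int:
        # Cantor pairing of (salt, counter), negated so IDs are always negative.
        s = self.salt + counter
        return -(s * (s + 1) // 2 + counter)

    def index(self, *values: int) -> int:
        # frozenset makes the lookup independent of argument order.
        key = frozenset(values)
        if key not in self.mapping:
            self.mapping[key] = self._salted_id(len(self.mapping) + 1)
        return self.mapping[key]

    def has_mapping(self, *values: int) -> bool:
        return frozenset(values) in self.mapping


worker_a = SketchIntMap(salt=10)
worker_b = SketchIntMap(salt=11)

pair_id = worker_a.index(1, 2)
same_id = worker_a.index(2, 1)          # same set, same ID
nested_id = worker_a.index(pair_id, 3)  # an ID used as a member of another set
other_id = worker_b.index(4, 5)         # different salt, separate ID space
```

Because the Cantor pairing function is a bijection on pairs of non-negative integers, two instances with different salts cannot produce the same ID in this sketch.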

hash_to_base64

hash_to_base64(hash: bytes) -> str

Converts a hash to a base64 string.

base64_to_hash

base64_to_hash(b64: str) -> bytes

Converts a base64 string to a hash.
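
The two conversions are inverses of each other. A minimal sketch of the round trip, assuming the standard base64 alphabet and UTF-8 decoding (the library may use a different variant); the `sketch_` function names are hypothetical:

```python
import base64
import hashlib


def sketch_hash_to_base64(digest: bytes) -> str:
    # Encode raw digest bytes as a base64 string.
    return base64.b64encode(digest).decode("utf-8")


def sketch_base64_to_hash(text: str) -> bytes:
    # Decode a base64 string back into raw digest bytes.
    return base64.b64decode(text)


digest = hashlib.sha256(b"matchbox").digest()
encoded = sketch_hash_to_base64(digest)
decoded = sketch_base64_to_hash(encoded)
```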

prep_for_hash

prep_for_hash(item: HashableItem) -> bytes

Encodes strings so they can be hashed; otherwise, passes the item through.

hash_data

hash_data(data: HashableItem) -> bytes

Hash the given data using the globally defined hash function.

This function ties into the existing hashing utilities.
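
As a rough sketch of how preparation and hashing fit together, assuming HASH_FUNC is SHA-256 as listed above and strings are encoded as UTF-8 (an assumption). The `sketch_` helpers are hypothetical and only handle strings and bytes-like input, not every type in HashableItem:

```python
import hashlib
from typing import Union


def sketch_prep_for_hash(item: Union[str, bytes, bytearray]) -> Union[bytes, bytearray]:
    # Strings are encoded; bytes-like input passes through unchanged.
    if isinstance(item, str):
        return item.encode("utf-8")
    return item


def sketch_hash_data(data: Union[str, bytes, bytearray]) -> bytes:
    # Hash the prepared bytes with SHA-256.
    return hashlib.sha256(sketch_prep_for_hash(data)).digest()


from_text = sketch_hash_data("abc")
from_bytes = sketch_hash_data(b"abc")
```

In this sketch a string and its UTF-8 encoding hash to the same 32-byte digest.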

hash_values

hash_values(*values: tuple[T, ...]) -> bytes

Returns a single hash of a tuple of items, ordered by value.

The values are ordered before hashing so that different orderings of the same values produce the same hash.
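
One way to get an order-independent hash is to sort the values, hash each one, then fold the digests together. The sketch below assumes bytes inputs and SHA-256; the function name `sketch_hash_values` and the folding scheme are illustrative assumptions rather than the library's method:

```python
import hashlib


def sketch_hash_values(*values: bytes) -> bytes:
    if not values:
        raise ValueError("at least one value is required")
    # Sorting first makes the result independent of argument order.
    digests = [hashlib.sha256(v).digest() for v in sorted(values)]
    combined = digests[0]
    for digest in digests[1:]:
        # Fold each subsequent digest into the running hash.
        combined = hashlib.sha256(combined + digest).digest()
    return combined


forward = sketch_hash_values(b"alpha", b"beta", b"gamma")
shuffled = sketch_hash_values(b"gamma", b"alpha", b"beta")
different = sketch_hash_values(b"alpha", b"beta")
```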

process_column_for_hashing

process_column_for_hashing(
    column_name: str, schema_type: DataType
) -> Expr

Process a column for hashing based on its type.

Parameters:

  • column_name

    (str) –

    The column name

  • schema_type

    (DataType) –

    The polars schema type of the column

Returns:

  • Expr

    A polars expression for processing the column

hash_rows

hash_rows(
    df: DataFrame,
    columns: list[str],
    method: HashMethod = XXH3_128,
) -> Series

Hash all rows in a dataframe.

Parameters:

  • df

    (DataFrame) –

    The DataFrame to hash rows from

  • columns

    (list[str]) –

    The column names to include in the hash

  • method

    (HashMethod, default: XXH3_128 ) –

    The hash method to use

Returns:

  • Series

    Series of row hashes as bytes

hash_arrow_table

hash_arrow_table(
    table: Table, method: HashMethod = XXH3_128
) -> bytes

Computes a content hash of an Arrow table invariant to row and field order.

This is used to content-address an Arrow table for caching.

Parameters:

  • table

    (Table) –

    The pyarrow Table to hash

  • method

    (HashMethod, default: XXH3_128 ) –

    The method to use for hashing rows (XXH3_128 or SHA256)

Returns:

  • bytes

    Bytes representing the content hash of the table
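
To illustrate what invariance to row and field order means, here is a pure-Python stand-in that works on a list of dicts instead of a pyarrow Table. It is a sketch under assumptions: the function name `sketch_table_hash`, the `repr`-based row serialisation, and the use of SHA-256 are hypothetical choices, not the library's approach.

```python
import hashlib


def sketch_table_hash(rows: list[dict[str, object]]) -> bytes:
    row_digests = []
    for row in rows:
        # Sort fields by name so field order does not affect the row digest.
        canonical = repr(sorted(row.items())).encode("utf-8")
        row_digests.append(hashlib.sha256(canonical).digest())
    # Sort the row digests so row order does not affect the table digest.
    return hashlib.sha256(b"".join(sorted(row_digests))).digest()


table_a = [{"id": 1, "name": "a"}, {"id": 2, "name": "b"}]
table_b = [{"name": "b", "id": 2}, {"name": "a", "id": 1}]  # reordered rows and fields
table_c = [{"id": 1, "name": "a"}, {"id": 2, "name": "c"}]  # different content

hash_a = sketch_table_hash(table_a)
hash_b = sketch_table_hash(table_b)
hash_c = sketch_table_hash(table_c)
```

Here `hash_a` and `hash_b` match because only the ordering differs, while `hash_c` differs because a value changed.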

fields_to_value_ordered_hash

fields_to_value_ordered_hash(
    data: DataFrame, fields: list[str]
) -> Series

Returns the rowwise hash ordered by the row’s values, ignoring field order.

This function is used to add a field to a dataframe that represents the hash of each of its rows, but where the order of the row values doesn’t change the hash value. Field order is ignored in favour of value order.

This is primarily used to give a consistent hash to a new cluster no matter whether its parent hashes were used in the left or right table.
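
The left/right case can be sketched in plain Python, with rows as dicts of parent hashes rather than a DataFrame. The function name `sketch_value_ordered_hash`, the field names `left` and `right`, and the use of SHA-256 are assumptions for illustration only:

```python
import hashlib


def sketch_value_ordered_hash(
    rows: list[dict[str, bytes]], fields: list[str]
) -> list[bytes]:
    hashes = []
    for row in rows:
        hasher = hashlib.sha256()
        # Order by value, not by field, so swapping fields changes nothing.
        for value in sorted(row[field] for field in fields):
            hasher.update(hashlib.sha256(value).digest())
        hashes.append(hasher.digest())
    return hashes


parent_1 = hashlib.sha256(b"cluster-1").digest()
parent_2 = hashlib.sha256(b"cluster-2").digest()

rows = [
    {"left": parent_1, "right": parent_2},
    {"left": parent_2, "right": parent_1},  # same parents, swapped sides
]
row_hashes = sketch_value_ordered_hash(rows, ["left", "right"])
```

Both rows yield the same hash in this sketch, which is the property that gives a new cluster a consistent identity regardless of which table each parent came from.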