Hashing

matchbox.common.hash

Utilities for hashing data and creating unique identifiers.

Classes:

  • HashMethod

    Supported hash methods for row hashing.

  • IntMap

    A data structure to map integers without collisions within a dedicated space.

Functions:

  • hash_to_base64

    Converts a hash to a base64 string.

  • base64_to_hash

    Converts a base64 string to a hash.

  • prep_for_hash

    Encodes strings so they can be hashed; otherwise passes items through unchanged.

  • hash_data

    Hash the given data using the globally defined hash function.

  • hash_values

    Returns a single hash of a tuple of items, ordered by value.

  • process_column_for_hashing

    Process a column for hashing based on its type.

  • hash_rows

    Hash all rows in a dataframe.

  • hash_arrow_table

    Computes a content hash of an Arrow table invariant to row and field order.

Attributes:

HashableItem module-attribute

HashableItem = TypeVar('HashableItem', bytes, bool, str, int, float, bytearray)

HASH_FUNC module-attribute

HASH_FUNC = sha256

HashMethod

Bases: StrEnum

Supported hash methods for row hashing.

Attributes:

XXH3_128 class-attribute instance-attribute

XXH3_128 = 'xxh3_128'

SHA256 class-attribute instance-attribute

SHA256 = 'sha256'

IntMap

IntMap(salt: int = 42)

A data structure to map integers without collisions within a dedicated space.

A stand-in for hashing integers within pa.int64.

Takes unordered sets of integers and maps them to an ID that 1) is a negative integer; 2) does not collide with IDs generated by other instances of this class, as long as each is initialised with a different salt.

The fact that IDs are always negative makes it possible to build a hierarchy where IDs are themselves parts of other sets, and easy to distinguish integers mapped to raw data points (which will be non-negative) from integers that are IDs (which will be negative). The salt makes it possible to use a parallel execution model in which each worker maintains its own separate ID space, as long as the workers operate on disjoint subsets of positive integers.

Parameters:

  • salt

    (optional, default: 42 ) –

    A positive integer to salt the Cantor pairing function

Methods:

  • index

    Index a set of integers.

  • has_mapping

    Check if index for values already exists.

Attributes:

mapping instance-attribute

mapping: dict[frozenset[int], int] = {}

salt instance-attribute

salt: int = salt

index

index(*values: int) -> int

Index a set of integers.

Parameters:

  • values
    (int, default: () ) –

    the integers in the set you want to index

Returns:

  • int

    The old or new ID corresponding to the set

has_mapping

has_mapping(*values: int) -> bool

Check if index for values already exists.

Parameters:

  • values
    (int, default: () ) –

    the integers in the set you want to index

Returns:

  • bool

    Boolean indicating whether index for values already exists
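The behaviour described above can be sketched in plain Python. This is an illustrative reimplementation, not the library's actual code: the use of a fresh counter paired with the salt via the Cantor pairing function is an assumption about how salt-disjoint negative IDs could be produced.

```python
import itertools


def cantor_pair(a: int, b: int) -> int:
    """Cantor pairing function: a bijection from pairs of
    non-negative integers to single non-negative integers."""
    return (a + b) * (a + b + 1) // 2 + b


class IntMapSketch:
    """Sketch of the IntMap behaviour: maps frozensets of integers
    to unique negative IDs, salted so that instances initialised
    with different salts never collide."""

    def __init__(self, salt: int = 42):
        self.salt = salt
        self.mapping: dict[frozenset[int], int] = {}
        self._counter = itertools.count()

    def index(self, *values: int) -> int:
        key = frozenset(values)  # unordered: (1, 2) and (2, 1) are the same set
        if key not in self.mapping:
            # Pair a fresh counter value with the salt, then negate and
            # offset by one so the resulting ID is always negative.
            self.mapping[key] = -cantor_pair(next(self._counter), self.salt) - 1
        return self.mapping[key]

    def has_mapping(self, *values: int) -> bool:
        return frozenset(values) in self.mapping
```

Because the Cantor pairing function is a bijection, two instances with different salts can never emit the same ID, which is what allows parallel workers to keep disjoint ID spaces.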

hash_to_base64

hash_to_base64(hash: bytes) -> str

Converts a hash to a base64 string.

base64_to_hash

base64_to_hash(b64: str) -> bytes

Converts a base64 string to a hash.
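A plausible stdlib implementation of this pair of converters; the use of standard (rather than URL-safe) base64 is an assumption not stated in the signatures above.

```python
import base64


def hash_to_base64(hash: bytes) -> str:
    # Encode the raw digest bytes as a base64 ASCII string.
    return base64.b64encode(hash).decode("utf-8")


def base64_to_hash(b64: str) -> bytes:
    # Decode the base64 string back to the original digest bytes.
    return base64.b64decode(b64)
```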

prep_for_hash

prep_for_hash(item: HashableItem) -> bytes

Encodes strings so they can be hashed, otherwises, passes through.

hash_data

hash_data(data: HashableItem) -> bytes

Hash the given data using the globally defined hash function.

This function ties into the existing hashing utilities.
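A minimal sketch of how these two utilities could fit together, assuming UTF-8 encoding for strings and stringification of other scalars (both illustrative choices; only `HASH_FUNC = sha256` is documented above):

```python
from hashlib import sha256

HASH_FUNC = sha256  # matches the module attribute documented above


def prep_for_hash(item) -> bytes:
    # Bytes-like values pass through; strings are UTF-8 encoded;
    # other hashable scalars are stringified first (an assumption).
    if isinstance(item, (bytes, bytearray)):
        return bytes(item)
    if isinstance(item, str):
        return item.encode("utf-8")
    return str(item).encode("utf-8")


def hash_data(data) -> bytes:
    # Hash the prepared bytes with the globally defined hash function.
    return HASH_FUNC(prep_for_hash(data)).digest()
```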

hash_values

hash_values(*values: tuple[T, ...]) -> bytes

Returns a single hash of a tuple of items, ordered by value.

The values are sorted before hashing, so different orderings of the same items produce the same hash.
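One way to achieve this order invariance, sketched with sha256 (hashing each value individually before combining avoids the ambiguity of concatenating variable-length encodings; the exact scheme here is an assumption):

```python
from hashlib import sha256


def hash_values(*values) -> bytes:
    # Hash each value individually, then hash the sorted digests so
    # any ordering of the same items yields the same result.
    digests = sorted(sha256(str(v).encode("utf-8")).digest() for v in values)
    return sha256(b"".join(digests)).digest()
```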

process_column_for_hashing

process_column_for_hashing(column_name: str, schema_type: DataType) -> Expr

Process a column for hashing based on its type.

Parameters:

  • column_name

    (str) –

    The column name

  • schema_type

    (DataType) –

    The polars schema type of the column

Returns:

  • Expr

    A polars expression for processing the column
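The real function returns a polars `Expr`; the type-dispatch idea can be shown with a stdlib stand-in that normalises a column's values to bytes according to its type. The dispatch rules here are illustrative, not the library's:

```python
def process_column(column: list, dtype: type) -> list[bytes]:
    # Stdlib stand-in for the polars-expression version: choose a
    # byte encoding for the column based on its schema type.
    if dtype is bytes:
        return list(column)                              # already bytes
    if dtype is str:
        return [v.encode("utf-8") for v in column]       # encode text
    return [str(v).encode("utf-8") for v in column]      # stringify the rest
```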

hash_rows

hash_rows(df: DataFrame, columns: list[str], method: HashMethod = XXH3_128) -> Series

Hash all rows in a dataframe.

Parameters:

  • df

    (DataFrame) –

    The DataFrame to hash rows from

  • columns

    (list[str]) –

    The column names to include in the hash

  • method

    (HashMethod, default: XXH3_128 ) –

    The hash method to use

Returns:

  • Series

    List of row hashes as bytes
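The real implementation operates on a polars `DataFrame`; the same idea can be sketched stdlib-only over a list of dicts, using sha256 in place of the default XXH3_128 (both the serialisation and the length-prefixing are illustrative choices):

```python
from hashlib import sha256


def hash_rows(rows: list[dict], columns: list[str]) -> list[bytes]:
    # Serialise the selected columns of each row deterministically,
    # then hash each row to a digest.
    out = []
    for row in rows:
        hasher = sha256()
        for col in columns:
            # Length-prefix each field to avoid concatenation ambiguity.
            encoded = str(row[col]).encode("utf-8")
            hasher.update(len(encoded).to_bytes(4, "big"))
            hasher.update(encoded)
        out.append(hasher.digest())
    return out
```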

hash_arrow_table

hash_arrow_table(table: Table, method: HashMethod = XXH3_128, as_sorted_list: list[str] | None = None) -> bytes

Computes a content hash of an Arrow table invariant to row and field order.

This is used to content-address an Arrow table for caching.

Parameters:

  • table

    (Table) –

    The pyarrow Table to hash

  • method

    (HashMethod, default: XXH3_128 ) –

    The method to use for hashing rows (XXH3_128 or SHA256)

  • as_sorted_list

    (list[str] | None, default: None ) –

    Optional list of column names to hash as a sorted list. For example, [“left_id”, “right_id”] will create a “sorted_list” column and drop the original columns to ensure (1,2) and (2,1) hash to the same value. Works with 2 or more columns.

    Note: if list columns are combined with a column that’s nullable, list + null value returns null. See Polars’ concat_list documentation for more details.

Returns:

  • bytes

    Bytes representing the content hash of the table