Hashing

matchbox.common.hash

Utilities for hashing data and creating unique identifiers.

Classes:

  • IntMap

    A data structure to map integers without collisions within a dedicated space.

Functions:

  • hash_to_base64

    Converts a hash to a base64 string.

  • base64_to_hash

    Converts a base64 string to a hash.

  • prep_for_hash

    Encodes strings so they can be hashed; otherwise, passes the item through unchanged.

  • hash_data

    Hash the given data using the globally defined hash function.

  • hash_values

    Returns a single hash of a tuple of items ordered by its values.

  • columns_to_value_ordered_hash

    Returns the rowwise hash ordered by the row’s values, ignoring column order.

Attributes:

HashableItem module-attribute

HashableItem = TypeVar(
    "HashableItem", bytes, bool, str, int, float, bytearray
)

HASH_FUNC module-attribute

HASH_FUNC = sha256

IntMap

IntMap(salt: int = 42)

A data structure to map integers without collisions within a dedicated space.

A stand-in for hashing integers within pa.int64.

Takes unordered sets of integers and maps them to an ID that 1) is a negative integer; 2) does not collide with IDs generated by other instances of this class, as long as they are initialised with a different salt.

Because IDs are always negative, it’s possible to build a hierarchy where IDs are themselves parts of other sets, and it’s easy to distinguish integers mapped to raw data points (which will be non-negative) from integers that are IDs (which will be negative). The salt makes it possible to work with a parallel execution model, where each worker maintains its own separate ID space, as long as each worker operates on disjoint subsets of positive integers.

Parameters:

  • salt

    (optional, default: 42 ) –

    A positive integer to salt the Cantor pairing function

Methods:

  • index

    Index a set of integers.

  • has_mapping

    Check if an index for the values already exists.

Attributes:

mapping instance-attribute

mapping: dict[frozenset[int], int] = {}

salt instance-attribute

salt: int = salt

index

index(*values: int) -> int

Index a set of integers.

Parameters:

  • values
    (int, default: () ) –

    the integers in the set you want to index

Returns:

  • int

    The old or new ID corresponding to the set

has_mapping

has_mapping(*values: int) -> bool

Check if an index for the values already exists.

Parameters:

  • values
    (int, default: () ) –

    the integers in the set you want to check

Returns:

  • bool

    Boolean indicating whether index for values already exists
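
The behaviour described for IntMap can be illustrated with a minimal stand-in. The class name `SketchIntMap` and the `_new_id` helper are hypothetical, and the exact way the salt enters the Cantor pairing function is an assumption; this is a sketch of the idea, not the library's implementation.

```python
class SketchIntMap:
    """Hypothetical stand-in for IntMap, for illustration only."""

    def __init__(self, salt: int = 42) -> None:
        if salt < 0:
            raise ValueError("salt must be a non-negative integer")
        self.salt = salt
        self.mapping: dict[frozenset[int], int] = {}

    def _new_id(self) -> int:
        # Assumption: pair the salt with a per-instance counter.
        # The Cantor pairing function is injective on pairs of
        # non-negative integers, so different salts give different values.
        n = len(self.mapping)
        s = self.salt
        paired = (s + n) * (s + n + 1) // 2 + n
        # Shift by one and negate so every ID is strictly negative.
        return -(paired + 1)

    def has_mapping(self, *values: int) -> bool:
        # A frozenset key makes the lookup independent of argument order.
        return frozenset(values) in self.mapping

    def index(self, *values: int) -> int:
        key = frozenset(values)
        if key not in self.mapping:
            self.mapping[key] = self._new_id()
        return self.mapping[key]


worker_a = SketchIntMap(salt=1)
worker_b = SketchIntMap(salt=2)

id_a = worker_a.index(1, 2, 3)
same_id = worker_a.index(3, 2, 1)    # same set, different order
parent_id = worker_a.index(id_a, 4)  # an ID used as a member of another set
id_b = worker_b.index(7, 8)          # a different worker on disjoint integers
```

Here `same_id` equals `id_a`, all IDs are negative, and `id_b` differs from both IDs produced by `worker_a` because the two instances use different salts.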

hash_to_base64

hash_to_base64(hash: bytes) -> str

Converts a hash to a base64 string.

base64_to_hash

base64_to_hash(b64: str) -> bytes

Converts a base64 string to a hash.
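
As a rough illustration of the round trip between these two functions, the sketch below uses the standard `base64` module. The `sketch_` function names are hypothetical, and the choice of the standard (rather than URL-safe) alphabet is an assumption, since the page does not say which one the library uses.

```python
import base64
import hashlib


def sketch_hash_to_base64(digest: bytes) -> str:
    # Assumption: standard base64 alphabet, decoded to a UTF-8 string.
    return base64.b64encode(digest).decode("utf-8")


def sketch_base64_to_hash(b64: str) -> bytes:
    return base64.b64decode(b64)


digest = hashlib.sha256(b"example").digest()
encoded = sketch_hash_to_base64(digest)
decoded = sketch_base64_to_hash(encoded)
```

A 32-byte SHA-256 digest encodes to a 44-character string, and decoding it returns the original bytes.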

prep_for_hash

prep_for_hash(item: HashableItem) -> bytes

Encodes strings so they can be hashed; otherwise, passes the item through unchanged.

hash_data

hash_data(data: HashableItem) -> bytes

Hash the given data using the globally defined hash function.

This function ties into the existing hashing utilities.
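
A minimal sketch of how prep_for_hash and hash_data might fit together, assuming strings are encoded as UTF-8 and that the returned bytes are the raw digest. The `sketch_` names are hypothetical; only `HASH_FUNC = sha256` comes from this page.

```python
import hashlib

HASH_FUNC = hashlib.sha256


def sketch_prep_for_hash(item):
    # Strings are encoded (UTF-8 is an assumption); anything else
    # is passed through unchanged, as the description says.
    if isinstance(item, str):
        return item.encode("utf-8")
    return item


def sketch_hash_data(data) -> bytes:
    # Assumption: the result is the raw digest, not a hex string.
    return HASH_FUNC(sketch_prep_for_hash(data)).digest()


from_str = sketch_hash_data("abc")
from_bytes = sketch_hash_data(b"abc")
```

Under these assumptions a string and its UTF-8 bytes hash to the same 32-byte value. Types that pass through unchanged, such as int, would need their own conversion to bytes before `sha256` could accept them; the sketch does not cover that case.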

hash_values

hash_values(*values: tuple[T, ...]) -> bytes

Returns a single hash of a tuple of items ordered by its values.

The values are sorted before hashing, because different orderings of the same values must produce the same hash.
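
The order-independence can be sketched as follows. The function name `sketch_hash_values` is hypothetical, and the sketch assumes the inputs are fixed-length bytes values (for example, existing hashes) that are sorted and concatenated before hashing; the library may combine them differently.

```python
import hashlib


def sketch_hash_values(*values: bytes) -> bytes:
    # Sorting first means the argument order cannot affect the result.
    # Plain concatenation is only unambiguous for fixed-length inputs.
    return hashlib.sha256(b"".join(sorted(values))).digest()


h1 = hashlib.sha256(b"left").digest()
h2 = hashlib.sha256(b"right").digest()

forward = sketch_hash_values(h1, h2)
backward = sketch_hash_values(h2, h1)
```

Both calls return the same digest, since the inputs are sorted before being combined.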

columns_to_value_ordered_hash

columns_to_value_ordered_hash(
    data: DataFrame, columns: list[str]
) -> Series

Returns the rowwise hash ordered by the row’s values, ignoring column order.

This function is used to add a column to a dataframe that represents the hash of each of its rows, but where the order of the row values doesn’t change the hash value. Column order is ignored in favour of value order.

This is primarily used to give a consistent hash to a new cluster no matter whether its parent hashes were used in the left or right table.
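
To show the left/right symmetry without depending on a dataframe library, the sketch below works on a plain list of row dictionaries instead of a DataFrame. The function name, the row representation, and the sort-then-concatenate scheme are all assumptions made for illustration.

```python
import hashlib


def sketch_rowwise_value_ordered_hash(
    rows: list[dict[str, bytes]], columns: list[str]
) -> list[bytes]:
    hashes = []
    for row in rows:
        # Order by value, not by column, so swapping columns is a no-op.
        ordered = sorted(row[col] for col in columns)
        hashes.append(hashlib.sha256(b"".join(ordered)).digest())
    return hashes


parent_1 = hashlib.sha256(b"cluster-1").digest()
parent_2 = hashlib.sha256(b"cluster-2").digest()

rows = [
    {"left": parent_1, "right": parent_2},
    {"left": parent_2, "right": parent_1},  # same parents, swapped sides
]
row_hashes = sketch_rowwise_value_ordered_hash(rows, ["left", "right"])
```

Both rows receive the same hash, which is the property that lets a new cluster get a consistent hash regardless of which table each parent came from.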