Hashing

matchbox.common.hash

Utilities for hashing data and creating unique identifiers.

Classes:

  • HashMethod

    Supported hash methods for row hashing.

Functions:

  • hash_to_base64

    Converts a hash to a base64 string.

  • base64_to_hash

    Converts a base64 string to a hash, or returns a hash as is.

  • prep_for_hash

    Encodes strings so they can be hashed; otherwise, passes the item through.

  • hash_data

    Hash the given data using the globally defined hash function.

  • hash_values

    Returns a single hash of a tuple of items ordered by its values.

  • process_column_for_hashing

    Process a column for hashing based on its type.

  • hash_rows

    Hash all rows in a dataframe.

  • hash_arrow_table

    Computes a content hash of an Arrow table invariant to row and field order.

  • hash_model_results

    Fingerprint model results.

  • hash_clusters

    Fingerprint resolver cluster assignments by cluster membership semantics.

Attributes:

HashableItem module-attribute

HashableItem = TypeVar('HashableItem', bytes, bool, str, int, float, bytearray)

HASH_FUNC module-attribute

HASH_FUNC = sha256

HashMethod

Bases: StrEnum



Supported hash methods for row hashing.

Attributes:

XXH3_128 class-attribute instance-attribute

XXH3_128 = 'xxh3_128'

SHA256 class-attribute instance-attribute

SHA256 = 'sha256'

hash_to_base64

hash_to_base64(hash: bytes) -> str

Converts a hash to a base64 string.

base64_to_hash

base64_to_hash(value: str | bytes) -> bytes

Converts a base64 string to a hash, or returns a hash as is.
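The round trip between raw digest bytes and a base64 string can be sketched with the standard library alone. The helper names `to_base64` and `from_base64` are hypothetical stand-ins, and the use of the standard (rather than URL-safe) base64 alphabet is an assumption; the library's actual encoding may differ.

```python
import base64
import hashlib
from typing import Union


def to_base64(digest: bytes) -> str:
    """Encode raw digest bytes as an ASCII base64 string."""
    return base64.b64encode(digest).decode("ascii")


def from_base64(value: Union[str, bytes]) -> bytes:
    """Decode a base64 string; return bytes unchanged."""
    if isinstance(value, bytes):
        return value  # already a hash, pass it through as is
    return base64.b64decode(value)


digest = hashlib.sha256(b"matchbox").digest()
encoded = to_base64(digest)

# Decoding the string recovers the digest; bytes input is returned untouched.
assert from_base64(encoded) == digest
assert from_base64(digest) == digest
```

Accepting both `str` and `bytes` on the decoding side lets callers normalise a value without first checking which form they hold.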

prep_for_hash

prep_for_hash(item: HashableItem) -> bytes

Encodes strings so they can be hashed; otherwise, passes the item through.

hash_data

hash_data(data: HashableItem) -> bytes

Hash the given data using the globally defined hash function.

The hash function used is the module-level HASH_FUNC (sha256).
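A minimal sketch of how string preparation and hashing might fit together is shown here. The names `prep` and `hash_item` are hypothetical, UTF-8 is an assumed encoding, and binding `HASH_FUNC` to `hashlib.sha256` is an assumption based on the module attribute listed above.

```python
import hashlib

# Assumption: the documented HASH_FUNC = sha256 refers to hashlib.sha256.
HASH_FUNC = hashlib.sha256


def prep(item):
    """Encode str as UTF-8 bytes; return any other item unchanged."""
    if isinstance(item, str):
        return item.encode("utf-8")
    return item


def hash_item(data) -> bytes:
    """Hash prepared data and return the raw digest bytes."""
    # In this sketch only str, bytes and bytearray are usable here:
    # hashlib raises TypeError for objects that are not bytes-like.
    return HASH_FUNC(prep(data)).digest()


# A string and its UTF-8 encoding produce the same digest.
assert hash_item("abc") == hash_item(b"abc")
```

How the library treats `int`, `float`, and `bool` inputs is not covered by this sketch.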

hash_values

hash_values(*values: tuple[T, ...]) -> bytes

Returns a single hash of a tuple of items ordered by its values.

The values are sorted before hashing so that the same values supplied in a different order produce the same hash.
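The order-invariance described above can be illustrated by sorting the inputs before feeding them to the hasher. This is a sketch under assumptions: `hash_sorted` is a hypothetical name, the inputs are restricted to `bytes`, and the length prefix is a design choice of this example rather than a documented detail of the library.

```python
import hashlib


def hash_sorted(*values: bytes) -> bytes:
    """Hash a group of byte strings independently of argument order."""
    hasher = hashlib.sha256()
    for value in sorted(values):
        # Length-prefix each value so (b"ab", b"c") and (b"a", b"bc")
        # do not collapse into the same byte stream.
        hasher.update(len(value).to_bytes(8, "big"))
        hasher.update(value)
    return hasher.digest()


a, b, c = b"alpha", b"beta", b"gamma"

# Any permutation of the same values yields the same digest.
assert hash_sorted(a, b, c) == hash_sorted(c, a, b)
# A different set of values yields a different digest.
assert hash_sorted(a, b) != hash_sorted(a, c)
```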

process_column_for_hashing

process_column_for_hashing(column_name: str, schema_type: DataType) -> Expr

Process a column for hashing based on its type.

Parameters:

  • column_name

    (str) –

    The column name

  • schema_type

    (DataType) –

    The polars schema type of the column

Returns:

  • Expr

    A polars expression for processing the column

hash_rows

hash_rows(df: DataFrame, columns: list[str], method: HashMethod = XXH3_128) -> Series

Hash all rows in a dataframe.

Parameters:

  • df

    (DataFrame) –

    The DataFrame to hash rows from

  • columns

    (list[str]) –

    The column names to include in the hash

  • method

    (HashMethod, default: XXH3_128 ) –

    The hash method to use

Returns:

  • Series

    A Series of row hashes as bytes

hash_arrow_table

hash_arrow_table(table: Table, method: HashMethod = XXH3_128, as_sorted_list: list[str] | None = None) -> bytes

Computes a content hash of an Arrow table invariant to row and field order.

This is used to content-address an Arrow table for caching.

Parameters:

  • table

    (Table) –

    The pyarrow Table to hash

  • method

    (HashMethod, default: XXH3_128 ) –

    The method to use for hashing rows (XXH3_128 or SHA256)

  • as_sorted_list

    (list[str] | None, default: None ) –

    Optional list of column names to hash as a sorted list. For example, ["left_id", "right_id"] creates a "sorted_list" column and drops the original columns, so that (1, 2) and (2, 1) hash to the same value. Works with two or more columns.

    Note: if a list column is combined with a nullable column, concatenating a list with a null value yields null. See the Polars concat_list documentation for more details.

Returns:

  • bytes

    Bytes representing the content hash of the table
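The invariances described for hash_arrow_table can be illustrated without pyarrow or Polars. The sketch below is not the library's implementation: it models a table as a list of dicts, and `row_digest` and `table_digest` are hypothetical names. It shows one way to get a digest that ignores row order and field order, and how merging columns into a sorted list makes (1, 2) and (2, 1) equivalent.

```python
import hashlib


def row_digest(row: dict, sorted_cols=()) -> bytes:
    """Hash one row independently of field order."""
    # Merge the named columns into one sorted list and drop the originals.
    merged = sorted(row[col] for col in sorted_cols)
    rest = {k: v for k, v in row.items() if k not in sorted_cols}
    # Sorting by field name removes any dependence on field order.
    canonical = repr((merged, sorted(rest.items())))
    return hashlib.sha256(canonical.encode("utf-8")).digest()


def table_digest(rows, sorted_cols=()) -> bytes:
    """Hash all rows independently of row order."""
    hasher = hashlib.sha256()
    # Sorting the per-row digests removes any dependence on row order.
    for digest in sorted(row_digest(r, sorted_cols) for r in rows):
        hasher.update(digest)
    return hasher.digest()


t1 = [
    {"left_id": 1, "right_id": 2, "score": 0.9},
    {"left_id": 3, "right_id": 4, "score": 0.5},
]
# Same pairs with ids swapped, fields reordered, and rows reordered.
t2 = [
    {"score": 0.5, "right_id": 3, "left_id": 4},
    {"score": 0.9, "right_id": 1, "left_id": 2},
]

pair = ["left_id", "right_id"]
assert table_digest(t1, pair) == table_digest(t2, pair)
# Without the sorted list, the swapped ids count as different content.
assert table_digest(t1) != table_digest(t2)
```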

hash_model_results

hash_model_results(results: Table) -> bytes

Fingerprint model results.

hash_clusters

hash_clusters(assignments: Table) -> bytes

Fingerprint resolver cluster assignments by cluster membership semantics.

This hash is invariant to:
  • row ordering
  • parent_id relabeling
  • child row ordering within a parent cluster
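One way to obtain these invariances is to discard the parent labels and hash only the sets of children. The following is a stdlib-only sketch, not the library's code: assignments are modelled as `(parent_id, child_id)` tuples, and both the `child_id` name and `clusters_digest` are hypothetical.

```python
import hashlib
from collections import defaultdict


def clusters_digest(assignments) -> bytes:
    """Hash cluster membership, ignoring parent labels and row order."""
    members = defaultdict(set)
    for parent_id, child_id in assignments:
        members[parent_id].add(child_id)
    # Keep only the membership: each cluster becomes a sorted tuple of
    # children, and the clusters themselves are sorted.
    canonical = sorted(tuple(sorted(children)) for children in members.values())
    return hashlib.sha256(repr(canonical).encode("utf-8")).digest()


a = [(10, 1), (10, 2), (20, 3)]
b = [(99, 3), (7, 2), (7, 1)]    # relabelled parents, shuffled rows
c = [(10, 1), (20, 2), (20, 3)]  # child 2 moved to the other cluster

assert clusters_digest(a) == clusters_digest(b)
assert clusters_digest(a) != clusters_digest(c)
```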