Hashing
matchbox.common.hash
Utilities for hashing data and creating unique identifiers.
Classes:
- HashMethod – Supported hash methods for row hashing.
Functions:
- hash_to_base64 – Converts a hash to a base64 string.
- base64_to_hash – Converts a base64 string to a hash, or returns a hash as is.
- prep_for_hash – Encodes strings so they can be hashed; otherwise passes the item through.
- hash_data – Hash the given data using the globally defined hash function.
- hash_values – Returns a single hash of a tuple of items ordered by its values.
- process_column_for_hashing – Process a column for hashing based on its type.
- hash_rows – Hash all rows in a dataframe.
- hash_arrow_table – Computes a content hash of an Arrow table invariant to row and field order.
- hash_model_results – Fingerprint model results.
- hash_clusters – Fingerprint resolver cluster assignments by cluster membership semantics.
Attributes:
HashableItem
module-attribute
HashMethod
Bases: StrEnum
Supported hash methods for row hashing.
Attributes:
base64_to_hash
Converts a base64 string to a hash, or returns a hash as is.
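The round trip between raw hash bytes and a base64 string can be sketched with the standard library alone. This is only an illustration of the idea: the names `to_b64` and `from_b64` are hypothetical, and the use of standard (rather than URL-safe) base64 is an assumption, not something this page confirms about hash_to_base64 or base64_to_hash.

```python
import base64
import hashlib
from typing import Union


def to_b64(digest: bytes) -> str:
    """Encode raw hash bytes as a base64 string (standard alphabet assumed)."""
    return base64.b64encode(digest).decode("utf-8")


def from_b64(value: Union[str, bytes]) -> bytes:
    """Decode a base64 string to bytes; bytes are returned unchanged."""
    if isinstance(value, bytes):
        return value
    return base64.b64decode(value)


digest = hashlib.sha256(b"example").digest()
encoded = to_b64(digest)
decoded = from_b64(encoded)
```

Accepting both types on the decoding side lets callers pass either representation without checking which one they hold.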
prep_for_hash
prep_for_hash(item: HashableItem) -> bytes
Encodes strings so they can be hashed; otherwise passes the item through.
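A minimal sketch of this "encode strings, pass everything else through" behaviour is shown below. The name `prep_sketch` is hypothetical, UTF-8 is an assumed encoding, and the sketch only considers str and bytes inputs; the full set of types covered by HashableItem is not listed on this page.

```python
from typing import Union


def prep_sketch(item: Union[str, bytes]) -> bytes:
    """Return bytes ready for hashing: strings are encoded, bytes pass through."""
    if isinstance(item, str):
        return item.encode("utf-8")  # assumed encoding
    return item


raw = b"already-bytes"
prepared_text = prep_sketch("café")
prepared_raw = prep_sketch(raw)
```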
hash_data
hash_data(data: HashableItem) -> bytes
Hash the given data using the globally defined hash function.
This function ties into the existing hashing utilities.
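One way to read "globally defined hash function" is a single module-level constant that every helper uses, so the algorithm can be changed in one place. The sketch below assumes that reading. `HASH_FUNC` and `hash_data_sketch` are illustrative names, and SHA-256 is an assumed choice rather than the library's confirmed default.

```python
import hashlib
from typing import Union

# Assumption: one module-level hash constructor shared by all helpers.
HASH_FUNC = hashlib.sha256


def hash_data_sketch(data: Union[str, bytes]) -> bytes:
    """Hash str or bytes input with the shared hash function."""
    raw = data.encode("utf-8") if isinstance(data, str) else data
    return HASH_FUNC(raw).digest()


from_text = hash_data_sketch("abc")
from_bytes = hash_data_sketch(b"abc")
```

Because strings are encoded before hashing, the text and bytes forms of the same content produce the same digest.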
hash_values
Returns a single hash of a tuple of items ordered by its values.
The values are sorted before hashing, because the same values supplied in a different order must produce the same hash.
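The order-independence described here can be sketched as "sort, hash each value, then combine". The function name `hash_sorted_values` and the use of SHA-256 are assumptions made for illustration; the library's actual combination scheme is not described on this page.

```python
import hashlib


def hash_sorted_values(*values: str) -> bytes:
    """Combine several strings into one digest that ignores argument order."""
    combined = hashlib.sha256()
    for value in sorted(values):
        # Hash each value separately first, so that ("ab", "c") and
        # ("a", "bc") do not collapse into the same byte stream.
        combined.update(hashlib.sha256(value.encode("utf-8")).digest())
    return combined.digest()


forward = hash_sorted_values("alpha", "beta", "gamma")
shuffled = hash_sorted_values("gamma", "alpha", "beta")
```

Sorting requires that the values are mutually comparable, so this sketch restricts itself to strings.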
process_column_for_hashing
process_column_for_hashing(column_name: str, schema_type: DataType) -> Expr
hash_rows
hash_arrow_table
hash_arrow_table(table: Table, method: HashMethod = XXH3_128, as_sorted_list: list[str] | None = None) -> bytes
Computes a content hash of an Arrow table invariant to row and field order.
This is used to content-address an Arrow table for caching.
Parameters:
- table (Table) – The pyarrow Table to hash.
- method (HashMethod, default: XXH3_128) – The method to use for hashing rows (XXH3_128 or SHA256).
- as_sorted_list (list[str] | None, default: None) – Optional list of column names to hash as a sorted list. For example, ["left_id", "right_id"] will create a "sorted_list" column and drop the original columns to ensure (1,2) and (2,1) hash to the same value. Works with 2 or more columns.
  Note: if list columns are combined with a column that's nullable, list + null value returns null. See Polars' concat_list documentation for more details.
Returns:
- bytes – Bytes representing the content hash of the table.
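The two invariances, plus the sorted-list option, can be illustrated without pyarrow or Polars. The sketch below is not the library's implementation: it models a table as a list of dicts, uses SHA-256 instead of XXH3_128, does not handle nulls, and the names `row_digest` and `table_digest` are hypothetical.

```python
import hashlib


def row_digest(row: dict, sorted_cols: tuple = ()) -> bytes:
    """Hash one row so that the order of its fields does not matter."""
    row = dict(row)  # work on a copy
    if sorted_cols:
        # Fold the chosen columns into one sorted list so (1, 2) and (2, 1) match.
        row["sorted_list"] = sorted(row.pop(col) for col in sorted_cols)
    h = hashlib.sha256()
    for name in sorted(row):  # fixed field order
        h.update(repr((name, row[name])).encode("utf-8"))
    return h.digest()


def table_digest(rows: list, sorted_cols: tuple = ()) -> bytes:
    """Hash all rows so that the order of the rows does not matter."""
    h = hashlib.sha256()
    for digest in sorted(row_digest(r, sorted_cols) for r in rows):
        h.update(digest)
    return h.digest()


a = [{"left_id": 1, "right_id": 2, "p": 90}, {"left_id": 3, "right_id": 4, "p": 80}]
b = [{"p": 80, "right_id": 3, "left_id": 4}, {"right_id": 1, "p": 90, "left_id": 2}]
pair = ("left_id", "right_id")
same = table_digest(a, pair) == table_digest(b, pair)
different = table_digest(a) != table_digest(b)
```

Here `b` holds the same pairs as `a` with rows, fields, and pair direction all shuffled. The digests match only when the id columns are folded into a sorted list, which is the situation the as_sorted_list parameter is described as handling.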