Hashing
matchbox.common.hash
Utilities for hashing data and creating unique identifiers.
Classes:
- HashMethod – Supported hash methods for row hashing.
- IntMap – A data structure to map integers without collisions within a dedicated space.
Functions:
- hash_to_base64 – Converts a hash to a base64 string.
- base64_to_hash – Converts a base64 string to a hash, or returns a hash as is.
- prep_for_hash – Encodes strings so they can be hashed; otherwise passes the item through.
- hash_data – Hash the given data using the globally defined hash function.
- hash_values – Returns a single hash of a tuple of items ordered by its values.
- process_column_for_hashing – Process a column for hashing based on its type.
- hash_rows – Hash all rows in a dataframe.
- hash_arrow_table – Computes a content hash of an Arrow table invariant to row and field order.
- hash_model_results – Fingerprint model results.
Attributes:
HashableItem
module-attribute
HashMethod
IntMap
A data structure to map integers without collisions within a dedicated space.
A stand-in for hashing integers within pa.int64.
Takes unordered sets of integers and maps them to an ID that 1) is a negative integer; 2) does not collide with IDs generated by other instances of this class, as long as they are initialised with a different salt.
Because IDs are always negative, it's possible to build a hierarchy where IDs are themselves members of other sets, and it's easy to distinguish integers mapped to raw data points (which will be non-negative) from integers that are IDs (which will be negative). The salt supports a parallel execution model in which each worker maintains its own ID space, as long as each worker operates on disjoint subsets of positive integers.
Parameters:
- salt (optional, default: 42) – A positive integer to salt the Cantor pairing function
Methods:
- index – Index a set of integers.
- has_mapping – Check if index for values already exists.
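The behaviour described for IntMap can be illustrated with a minimal sketch. This is not the library's implementation: the class name `SketchIntMap`, its internal counter, and the exact way the salt is paired with that counter are assumptions made for illustration. It only shows how negating a Cantor pairing of (salt, counter) could yield negative IDs that differ between instances with different salts.

```python
class SketchIntMap:
    """Illustrative stand-in for the idea behind IntMap (hypothetical, not the real class)."""

    def __init__(self, salt: int = 42) -> None:
        if salt < 1:
            raise ValueError("salt must be a positive integer")
        self.salt = salt
        # Maps an unordered set of integers to its negative ID.
        self._mapping: dict = {}

    def _salted_id(self, counter: int) -> int:
        # Cantor pairing of (salt, counter), negated so the ID is always negative.
        # The pairing is a bijection on pairs of naturals, so two instances with
        # different salts never produce the same ID.
        s = self.salt + counter
        return -(s * (s + 1) // 2 + counter)

    def has_mapping(self, *values: int) -> bool:
        """Return True if this set of values has already been indexed."""
        return frozenset(values) in self._mapping

    def index(self, *values: int) -> int:
        """Return the ID for this set of values, creating one if needed."""
        key = frozenset(values)
        if key not in self._mapping:
            counter = len(self._mapping) + 1  # 1, 2, 3, ...
            self._mapping[key] = self._salted_id(counter)
        return self._mapping[key]


worker_a = SketchIntMap(salt=1)
worker_b = SketchIntMap(salt=2)
cluster = worker_a.index(1, 2, 3)       # raw data points are non-negative
parent = worker_a.index(cluster, 4)     # an ID can itself be a member of a set
other = worker_b.index(1, 2, 3)         # different salt, different ID
```

Because every ID is negative, `parent` can mix the ID `cluster` with the raw value 4 without ambiguity about which is which.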
Attributes:
index
base64_to_hash
Converts a base64 string to a hash, or returns a hash as is.
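As a rough illustration of the round trip between a hash and its base64 form, the sketch below uses only the standard library. The function names `to_base64` and `from_base64` are hypothetical stand-ins, and the choice of standard (not URL-safe) base64 is an assumption; the module's own functions may differ in both.

```python
import base64
from typing import Union


def to_base64(digest: bytes) -> str:
    # Encode raw hash bytes as a base64 text string.
    return base64.b64encode(digest).decode("utf-8")


def from_base64(value: Union[str, bytes]) -> bytes:
    # Bytes are assumed to already be a hash and are returned unchanged.
    if isinstance(value, bytes):
        return value
    return base64.b64decode(value)


raw = bytes(range(16))
text = to_base64(raw)
restored = from_base64(text)    # decodes the string back to the original bytes
unchanged = from_base64(raw)    # a hash passed in is returned as is
```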
prep_for_hash
prep_for_hash(item: HashableItem) -> bytes
Encodes strings so they can be hashed; otherwise passes the item through.
hash_data
hash_data(data: HashableItem) -> bytes
Hash the given data using the globally defined hash function.
This function ties into the existing hashing utilities.
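The relationship between preparing an item and hashing it might look roughly like the sketch below. The names `prep` and `digest` are hypothetical, SHA-256 is only an assumed stand-in for the globally defined hash function, and the handling of types other than `str` and `bytes` is not covered here.

```python
import hashlib


def prep(item):
    # Strings are encoded to UTF-8 bytes; anything else passes through unchanged.
    if isinstance(item, str):
        return item.encode("utf-8")
    return item


def digest(data) -> bytes:
    # Assumption: SHA-256 plays the role of the module-wide hash function.
    return hashlib.sha256(prep(data)).digest()


same = digest("abc") == digest(b"abc")  # a string and its UTF-8 bytes hash alike
```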
hash_values
Returns a single hash of a tuple of items ordered by its values.
The values must be sorted before hashing, because different orderings of the same values must produce the same hash.
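A minimal sketch of order-independent hashing is shown below. The function name `hash_sorted`, the use of SHA-256, and the length prefix are all assumptions for illustration and are not taken from the module. The length prefix is a design choice of this sketch: without it, the pairs (b"ab", b"c") and (b"a", b"bc") would concatenate to the same bytes.

```python
import hashlib


def hash_sorted(*values: bytes) -> bytes:
    hasher = hashlib.sha256()
    # Sorting first means any ordering of the same values gives the same digest.
    for value in sorted(values):
        # Prefix each value with its length so item boundaries are unambiguous.
        hasher.update(len(value).to_bytes(8, "big"))
        hasher.update(value)
    return hasher.digest()


forward = hash_sorted(b"a", b"b")
backward = hash_sorted(b"b", b"a")  # same digest as `forward`
```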
process_column_for_hashing
process_column_for_hashing(column_name: str, schema_type: DataType) -> Expr
hash_rows
hash_arrow_table
hash_arrow_table(table: Table, method: HashMethod = XXH3_128, as_sorted_list: list[str] | None = None) -> bytes
Computes a content hash of an Arrow table invariant to row and field order.
This is used to content-address an Arrow table for caching.
Parameters:
- table (Table) – The pyarrow Table to hash
- method (HashMethod, default: XXH3_128) – The method to use for hashing rows (XXH3_128 or SHA256)
- as_sorted_list (list[str] | None, default: None) – Optional list of column names to hash as a sorted list. For example, ["left_id", "right_id"] will create a "sorted_list" column and drop the original columns to ensure (1,2) and (2,1) hash to the same value. Works with 2 or more columns.
Note: if list columns are combined with a column that’s nullable, list + null value returns null. See Polars’ concat_list documentation for more details.
Returns:
- bytes – Bytes representing the content hash of the table
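The idea of a content hash that ignores row and field order can be sketched in plain Python, with rows represented as dictionaries. This is a conceptual illustration only: it does not use pyarrow or Polars, the names `row_digest` and `table_digest` are hypothetical, SHA-256 stands in for the configurable hash method, and null handling is not addressed.

```python
import hashlib
from typing import Optional


def row_digest(row: dict, sorted_cols: Optional[list] = None) -> bytes:
    row = dict(row)  # copy so the caller's row is not modified
    if sorted_cols:
        # Replace the named columns with one sorted list, so (1, 2) and (2, 1) match.
        row["sorted_list"] = sorted(row.pop(col) for col in sorted_cols)
    hasher = hashlib.sha256()
    # Visiting fields in name order makes the digest independent of field order.
    for name in sorted(row):
        hasher.update(repr((name, row[name])).encode("utf-8"))
    return hasher.digest()


def table_digest(rows: list, sorted_cols: Optional[list] = None) -> bytes:
    hasher = hashlib.sha256()
    # Sorting the per-row digests makes the result independent of row order.
    for digest in sorted(row_digest(row, sorted_cols) for row in rows):
        hasher.update(digest)
    return hasher.digest()


pairs = [
    {"left_id": 1, "right_id": 2, "probability": 90},
    {"left_id": 3, "right_id": 4, "probability": 80},
]
flipped = [
    {"probability": 80, "right_id": 3, "left_id": 4},
    {"right_id": 1, "probability": 90, "left_id": 2},
]
matches = table_digest(pairs, ["left_id", "right_id"]) == table_digest(
    flipped, ["left_id", "right_id"]
)
```

In this sketch, `pairs` and `flipped` only hash to the same value when the two ID columns are combined into a sorted list; hashed as separate columns they differ.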