Hashing
matchbox.common.hash
Utilities for hashing data and creating unique identifiers.
Classes:
- HashMethod – Supported hash methods for row hashing.
- IntMap – A data structure to map integers without collisions within a dedicated space.
Functions:
- hash_to_base64 – Converts a hash to a base64 string.
- base64_to_hash – Converts a base64 string to a hash.
- prep_for_hash – Encodes strings so they can be hashed; otherwise, passes the item through.
- hash_data – Hash the given data using the globally defined hash function.
- hash_values – Returns a single hash of a tuple of items, ordered by value.
- process_column_for_hashing – Process a column for hashing based on its type.
- hash_rows – Hash all rows in a dataframe.
- hash_arrow_table – Computes a content hash of an Arrow table invariant to row and field order.
Attributes:
HashableItem
module-attribute
HashMethod
IntMap
A data structure to map integers without collisions within a dedicated space.
A stand-in for hashing integers within pa.int64.
Takes unordered sets of integers and maps each to an ID that 1) is a negative integer; 2) does not collide with IDs generated by other instances of this class, as long as they are initialised with a different salt.
Because IDs are always negative, it's possible to build a hierarchy where IDs are themselves members of other sets, and it's easy to distinguish integers mapped to raw data points (which are non-negative) from integers that are IDs (which are negative). The salt supports a parallel execution model in which each worker maintains its own ID space, as long as each worker operates on disjoint subsets of positive integers.
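The properties described above can be illustrated with a minimal sketch. This is not the matchbox implementation: the class name `SaltedIntMap`, the per-instance counter, and the way the salt is fed into the Cantor pairing function are all assumptions made for illustration.

```python
class SaltedIntMap:
    """Illustrative stand-in, not the matchbox IntMap implementation."""

    def __init__(self, salt: int = 42) -> None:
        if salt <= 0:
            raise ValueError("salt must be a positive integer")
        self.salt = salt
        # Maps each unordered set of integers to its negative ID.
        self.mapping: dict[frozenset[int], int] = {}

    @staticmethod
    def _cantor(a: int, b: int) -> int:
        # Cantor pairing: a bijection from pairs of non-negative integers
        # to non-negative integers.
        return (a + b) * (a + b + 1) // 2 + b

    def has_mapping(self, *values: int) -> bool:
        # A frozenset ignores order, so (1, 2) and (2, 1) are the same key.
        return frozenset(values) in self.mapping

    def index(self, *values: int) -> int:
        key = frozenset(values)
        if key in self.mapping:
            return self.mapping[key]
        # Assumption: pair the salt with a 1-based local counter, then
        # negate. Distinct salts give distinct IDs because the pairing
        # function is a bijection.
        new_id = -self._cantor(self.salt, len(self.mapping) + 1)
        self.mapping[key] = new_id
        return new_id


worker_a = SaltedIntMap(salt=10)
worker_b = SaltedIntMap(salt=11)
leaf = worker_a.index(1, 2, 3)    # negative ID for raw data points
parent = worker_a.index(leaf, 4)  # an ID can itself be a member of a set
```

In this sketch, `worker_a` and `worker_b` never produce the same ID, and any value below zero can be read as an ID rather than a raw data point.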
Parameters:
- salt (optional, default: 42) – A positive integer to salt the Cantor pairing function.
Methods:
- index – Index a set of integers.
- has_mapping – Check if index for values already exists.
Attributes:
index
prep_for_hash
prep_for_hash(item: HashableItem) -> bytes
Encodes strings so they can be hashed; otherwise, passes the item through.
hash_data
hash_data(data: HashableItem) -> bytes
Hash the given data using the globally defined hash function.
This function ties into the existing hashing utilities.
hash_values
Returns a single hash of a tuple of items, ordered by value.
The values must be hashed in sorted order, so that the same values supplied in a different order produce the same hash.
process_column_for_hashing
process_column_for_hashing(column_name: str, schema_type: DataType) -> Expr
hash_rows
hash_arrow_table
hash_arrow_table(table: Table, method: HashMethod = XXH3_128, as_sorted_list: list[str] | None = None) -> bytes
Computes a content hash of an Arrow table invariant to row and field order.
This is used to content-address an Arrow table for caching.
Parameters:
- table (Table) – The pyarrow Table to hash.
- method (HashMethod, default: XXH3_128) – The method to use for hashing rows (XXH3_128 or SHA256).
- as_sorted_list (list[str] | None, default: None) – Optional list of column names to hash as a sorted list. For example, ["left_id", "right_id"] will create a "sorted_list" column and drop the original columns, so that (1, 2) and (2, 1) hash to the same value. Works with 2 or more columns.
  Note: if list columns are combined with a nullable column, combining a list with a null value returns null. See Polars' concat_list documentation for more details.
Returns:
- bytes – Bytes representing the content hash of the table.