Hashing
matchbox.common.hash
Utilities for hashing data and creating unique identifiers.
Classes:
-
IntMap
–A data structure to map integers without collisions within a dedicated space.
Functions:
-
hash_to_base64
–Converts a hash to a base64 string.
-
base64_to_hash
–Converts a base64 string to a hash.
-
prep_for_hash
–Encodes strings so they can be hashed; otherwise, passes the item through unchanged.
-
hash_data
–Hash the given data using the globally defined hash function.
-
hash_values
–Returns a single hash of a tuple of items ordered by its values.
-
columns_to_value_ordered_hash
–Returns the rowwise hash ordered by the row’s values, ignoring column order.
Attributes:
HashableItem
module-attribute
IntMap
A data structure to map integers without collisions within a dedicated space.
A stand-in for hashing integers within pa.int64.
Takes unordered sets of integers and maps each one to an ID that 1) is a negative integer; 2) does not collide with IDs generated by other instances of this class, as long as they are initialised with a different salt.
Because IDs are always negative, it is possible to build a hierarchy where IDs are themselves members of other sets, and it is easy to distinguish integers mapped to raw data points (which are non-negative) from integers that are IDs (which are negative). The salt supports a parallel execution model in which each worker maintains its own ID space, as long as each worker operates on disjoint subsets of positive integers.
Parameters:
-
salt
(optional, default: 42) –A positive integer to salt the Cantor pairing function.
Methods:
-
index
–Index a set of integers.
-
has_mapping
–Check if index for values already exists.
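The behaviour described above can be illustrated with a minimal sketch. It is not the matchbox implementation: the class name `SketchIntMap`, the `mapping` attribute, the variadic signatures, and the choice to pair a running counter with the salt are all assumptions made for illustration. It only shows how a salted Cantor pairing can yield negative IDs that do not collide across instances with different salts.

```python
class SketchIntMap:
    """Hypothetical stand-in for IntMap, for illustration only."""

    def __init__(self, salt: int = 42) -> None:
        if salt <= 0:
            raise ValueError("salt must be a positive integer")
        self.salt = salt
        # Maps each unordered set of integers to its (negative) ID.
        self.mapping: dict[frozenset[int], int] = {}

    @staticmethod
    def _cantor(a: int, b: int) -> int:
        # Cantor pairing: injective on pairs of non-negative integers.
        return (a + b) * (a + b + 1) // 2 + b

    def has_mapping(self, *values: int) -> bool:
        """Check whether this set of values already has an ID."""
        return frozenset(values) in self.mapping

    def index(self, *values: int) -> int:
        """Return the ID for a set of integers, creating it if needed."""
        key = frozenset(values)  # order and duplicates do not matter
        if key not in self.mapping:
            # Assumption: pair a per-instance counter with the salt, then
            # negate. Distinct (counter, salt) pairs give distinct IDs.
            counter = len(self.mapping) + 1
            self.mapping[key] = -self._cantor(counter, self.salt)
        return self.mapping[key]


worker_a = SketchIntMap(salt=10)
worker_b = SketchIntMap(salt=11)

leaf_id = worker_a.index(1, 2, 3)        # raw data points are non-negative
parent_id = worker_a.index(leaf_id, 4)   # an ID can itself be a set member
other_id = worker_b.index(1, 2, 3)       # different salt, different ID space
```

Python integers are unbounded, so this sketch ignores the pa.int64 range that a real implementation would need to respect.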
Attributes:
index
prep_for_hash
prep_for_hash(item: HashableItem) -> bytes
Encodes strings so they can be hashed; otherwise, passes the item through unchanged.
hash_data
hash_data(data: HashableItem) -> bytes
Hash the given data using the globally defined hash function.
This function ties into the existing hashing utilities.
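As a rough sketch of how these helpers might fit together, the example below prepares an item, hashes it, and round-trips the digest through base64. The `_sketch` function names, the use of SHA-256 as the global hash function, and UTF-8 as the string encoding are assumptions, not details confirmed by this page.

```python
import base64
import hashlib

# Assumption: the "globally defined hash function" is SHA-256.
HASH_FUNC = hashlib.sha256


def prep_for_hash_sketch(item):
    """Encode strings to bytes; pass anything else through unchanged."""
    if isinstance(item, str):
        return item.encode("utf-8")
    return item


def hash_data_sketch(data) -> bytes:
    """Hash str or bytes input with the module-level hash function."""
    return HASH_FUNC(prep_for_hash_sketch(data)).digest()


def hash_to_base64_sketch(hash_: bytes) -> str:
    """Render a raw digest as a base64 string."""
    return base64.b64encode(hash_).decode("ascii")


def base64_to_hash_sketch(text: str) -> bytes:
    """Recover the raw digest from its base64 string."""
    return base64.b64decode(text)


digest = hash_data_sketch("alpha")
encoded = hash_to_base64_sketch(digest)
decoded = base64_to_hash_sketch(encoded)
```

Because strings are encoded before hashing, `"alpha"` and `b"alpha"` produce the same digest in this sketch.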
hash_values
Returns a single hash of a tuple of items ordered by its values.
The values are hashed in sorted order, because the same values supplied in a different order must produce the same hash.
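One way to get an order-independent hash is to sort the items before combining them. The sketch below assumes bytes inputs and SHA-256; the name `hash_values_sketch` and the step of hashing each item before combining are illustrative choices, not the library's confirmed method.

```python
import hashlib


def hash_values_sketch(*values: bytes) -> bytes:
    """Return one digest for a group of items, regardless of their order."""
    hasher = hashlib.sha256()
    for value in sorted(values):
        # Hash each item first so item boundaries stay distinct:
        # (b"ab", b"c") and (b"a", b"bc") must not combine identically.
        hasher.update(hashlib.sha256(value).digest())
    return hasher.digest()


forward = hash_values_sketch(b"left", b"right")
backward = hash_values_sketch(b"right", b"left")
different = hash_values_sketch(b"left", b"other")
```

Here `forward` and `backward` are equal because both calls sort to the same sequence, while `different` is not.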
columns_to_value_ordered_hash
Returns the rowwise hash ordered by the row’s values, ignoring column order.
This function is used to add a column to a dataframe that represents the hash of each of its rows, where the order of the row values doesn’t change the hash value. Column order is ignored in favour of value order.
This is primarily used to give a consistent hash to a new cluster no matter whether its parent hashes were used in the left or right table.
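The left/right symmetry can be shown with a small sketch that uses a plain dictionary of columns as a stand-in for a dataframe. The function name, the dictionary representation, and the use of SHA-256 are assumptions for illustration; the real function operates on a dataframe.

```python
import hashlib


def rowwise_value_ordered_hash(columns: dict[str, list[bytes]]) -> list[bytes]:
    """Hash each row after sorting its values, so column order is ignored."""
    hashes = []
    # zip(*columns) walks the columns in parallel, yielding one row at a time.
    for row in zip(*columns.values()):
        hasher = hashlib.sha256()
        for value in sorted(row):
            hasher.update(hashlib.sha256(value).digest())
        hashes.append(hasher.digest())
    return hashes


# The same parent hashes, with left and right swapped.
as_left_right = {"left": [b"a", b"c"], "right": [b"b", b"d"]}
as_right_left = {"left": [b"b", b"d"], "right": [b"a", b"c"]}

cluster_hashes = rowwise_value_ordered_hash(as_left_right)
swapped_hashes = rowwise_value_ordered_hash(as_right_left)
```

Both calls return the same list of row hashes, which is the property that lets a new cluster receive a consistent hash whichever table its parents came from.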