Transform
matchbox.common.transform
Functions to transform data between tabular and graph structures.
Classes:

- DisjointSet – Disjoint set forest with “path compression” and “union by rank” heuristics.

Functions:

- to_clusters – Converts probabilities into connected components formed at each threshold.
- graph_results – Convert probability table to graph representation.
- attach_components_to_probabilities – Takes an Arrow table of probabilities and adds a component column.
- component_to_hierarchy – Convert pairwise probabilities into a hierarchical representation.
- to_hierarchical_clusters – Converts a table of pairwise probabilities into a table of hierarchical clusters.
DisjointSet

Bases: Generic[T]

Disjoint set forest with “path compression” and “union by rank” heuristics.

This follows the implementation from Cormen, Thomas H., et al. Introduction to Algorithms. MIT Press, 2022.
Methods:

- add – Add a new element to the disjoint set.
- union – Merge the sets containing elements x and y.
- get_components – Return the connected components of the disjoint set.
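A minimal usage sketch based on the methods above. The no-argument constructor and the exact return value of get_components are assumptions, not part of the documented API:

    from matchbox.common.transform import DisjointSet

    ds: DisjointSet[int] = DisjointSet()  # assumed no-argument constructor
    for node in (1, 2, 3, 4):
        ds.add(node)

    ds.union(1, 2)  # merge the sets containing 1 and 2
    ds.union(2, 3)  # 1, 2 and 3 now share a root

    # Expected components: {1, 2, 3} and {4}; the exact return
    # type of get_components is an assumption
    print(ds.get_components())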
to_clusters

to_clusters(
    results: Table,
    dtype: DataType = large_binary,
    hash_func: Callable[[*tuple[T, ...]], T] = hash_values,
) -> Table
Converts probabilities into connected components formed at each threshold.
Parameters:

- results (Table) – Arrow table with columns ['left_id', 'right_id', 'probability']
- dtype (DataType, default: large_binary) – Arrow data type for the parent and child columns
- hash_func (Callable[[*tuple[T, ...]], T], default: hash_values) – Function to hash the parent and child values

Returns:

- Table – Arrow table of parent, child, threshold, sorted by probability descending.
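A sketch of a call. Only the column names and return shape come from the signature above; the integer IDs and 0–100 probability scale are assumptions:

    import pyarrow as pa
    from matchbox.common.transform import to_clusters

    # Pairwise match probabilities between record IDs
    results = pa.table(
        {
            "left_id": [1, 2],
            "right_id": [2, 3],
            "probability": [95, 80],  # assumed integer scale
        }
    )

    clusters = to_clusters(results)
    # -> Arrow table of parent, child, threshold, sorted by
    #    probability descending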
graph_results

graph_results(
    probabilities: Table,
    all_nodes: Iterable[int] | None = None,
) -> tuple[PyDiGraph, ndarray, ndarray]
Convert probability table to graph representation.
Parameters:

- probabilities (Table) – PyArrow table with 'left_id', 'right_id' columns
- all_nodes (Iterable[int] | None, default: None) – Superset of the node identities appearing in the probabilities table; used to optionally add isolated nodes to the graph.
Returns:

- tuple[PyDiGraph, ndarray, ndarray] – A tuple containing:
    - A rustworkx directed graph
    - An array mapping the 'left_id' probabilities to node indices in the graph
    - An array mapping the 'right_id' probabilities to node indices in the graph
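A sketch of a call, with the same assumed integer IDs as above; the unpacking order follows the return description:

    import pyarrow as pa
    from matchbox.common.transform import graph_results

    probabilities = pa.table(
        {
            "left_id": [1, 2],
            "right_id": [2, 3],
            "probability": [95, 80],
        }
    )

    # Passing a superset of IDs adds 4 as an isolated node
    graph, left_indices, right_indices = graph_results(
        probabilities, all_nodes=[1, 2, 3, 4]
    )
    # graph is a rustworkx PyDiGraph; left_indices[i] is the graph
    # node index for probabilities['left_id'][i], and likewise for
    # right_indices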
attach_components_to_probabilities

Takes an Arrow table of probabilities and adds a component column.

Expects an Arrow table with columns left_id, right_id, probability. Returns a table with an additional column, component.
component_to_hierarchy

component_to_hierarchy(
    table: Table,
    dtype: DataType = large_binary,
    hash_func: Callable[[*tuple[T, ...]], T] = hash_values,
) -> Table
Convert pairwise probabilities into a hierarchical representation.
Assumes data is pre-sorted by probability descending.
Parameters:

- table (Table) – Arrow Table with columns ['left', 'right', 'probability']
- dtype (DataType, default: large_binary) – Arrow data type for the parent and child columns
- hash_func (Callable[[*tuple[T, ...]], T], default: hash_values) – Function to hash the parent and child values

Returns:

- Table – Arrow Table with columns ['parent', 'child', 'probability']
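A sketch for a single pre-sorted component; the IDs and probability scale are assumptions:

    import pyarrow as pa
    from matchbox.common.transform import component_to_hierarchy

    # One connected component, pre-sorted by probability descending
    table = pa.table(
        {
            "left": [1, 2],
            "right": [2, 3],
            "probability": [95, 80],
        }
    )

    hierarchy = component_to_hierarchy(table)
    # -> Arrow Table with columns ['parent', 'child', 'probability'],
    #    where parent identities are produced by hash_func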
to_hierarchical_clusters

to_hierarchical_clusters(
    probabilities: Table,
    proc_func: Callable[[Table, DataType], Table] = component_to_hierarchy,
    hash_func: Callable[[*tuple[T, ...]], T] = hash_values,
    dtype: DataType = large_binary,
    parallel: bool = True,
    timeout: int = 300,
    min_rows_per_worker: int = 1000,
) -> Table
Converts a table of pairwise probabilities into a table of hierarchical clusters.
Parameters:

- probabilities (Table) – Arrow table with columns ['component', 'left', 'right', 'probability']
- proc_func (Callable[[Table, DataType], Table], default: component_to_hierarchy) – Function to process each component
- hash_func (Callable[[*tuple[T, ...]], T], default: hash_values) – Function to hash the parent and child values
- dtype (DataType, default: large_binary) – Arrow data type for the parent and child columns
- parallel (bool, default: True) – Whether to work on separate connected components in parallel
- timeout (int, default: 300) – Maximum seconds to wait for each worker to finish processing. Ignored if parallel == False
- min_rows_per_worker (int, default: 1000) – Minimum number of rows to pass to each worker. Ignored if parallel == False

Returns:

- Table – Arrow table with columns ['parent', 'child', 'probability']
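A sketch of an end-to-end call, assuming integer IDs, an integer probability scale, and precomputed component labels (see attach_components_to_probabilities above):

    import pyarrow as pa
    from matchbox.common.transform import to_hierarchical_clusters

    probabilities = pa.table(
        {
            "component": [0, 0, 1],
            "left": [1, 2, 4],
            "right": [2, 3, 5],
            "probability": [95, 80, 90],
        }
    )

    # parallel=False processes components serially; with the default
    # parallel=True, components are handed to workers, each receiving
    # at least min_rows_per_worker rows
    hierarchy = to_hierarchical_clusters(probabilities, parallel=False)
    # -> Arrow table with columns ['parent', 'child', 'probability']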