Transform
matchbox.common.transform
¶
Functions to transform data between tabular and graph structures.
Classes:
-
DisjointSet
–Disjoint set forest with “path compression” and “union by rank” heuristics.
-
Cluster
–A cluster of connected components.
Functions:
-
to_clusters
–Converts probabilities into connected components formed at each threshold.
-
graph_results
–Convert probability table to graph representation.
DisjointSet
¶
Bases: Generic[T]
Disjoint set forest with “path compression” and “union by rank” heuristics.
This follows implementation from Cormen, Thomas H., et al. Introduction to algorithms. MIT press, 2022
Methods:
-
add
–Add a new element to the disjoint set.
-
union
–Merge the sets containing elements x and y.
-
get_components
–Return the connected components of the disjoint set.
Attributes:
Cluster
¶
Cluster(
intmap: IntMap,
probability: int | None = None,
leaves: tuple[Cluster] | list[Cluster] | None = None,
id: int | None = None,
hash: bytes | None = None,
)
A cluster of connected components.
Can be a source cluster (a single data point) or a model cluster (a cluster of clusters). The hash of a cluster is the hash of its source clusters – its leaves.
We generate negative integers for IDs, allowing us to generate a true ID with the database after we’ve calculated components using these objects.
Parameters:
-
intmap
¶IntMap
) –An IntMap instance for generating unique IDs
-
probability
¶int | None
, default:None
) –probability of the cluster from its resolution, or None if source
-
leaves
¶tuple[Cluster] | list[Cluster] | None
, default:None
) –A list of Cluster objects that are the leaves of this cluster
-
id
¶int | None
, default:None
) –The ID of the cluster (only for leaf nodes)
-
hash
¶bytes | None
, default:None
) –The hash of the cluster (only for leaf nodes)
Methods:
-
combine
–Efficiently combine multiple clusters at once.
Attributes:
combine
classmethod
¶
Efficiently combine multiple clusters at once.
Parameters:
-
clusters
¶Iterable[Cluster]
) –An iterable of Cluster objects to combine
-
probability
¶int | None
, default:None
) –the probability of the cluster from its resolution
Returns:
-
Cluster
–A new Cluster containing all unique leaves from the input clusters
to_clusters
¶
to_clusters(
results: Table,
dtype: DataType = large_binary,
hash_func: Callable[
[*tuple[T, ...]], T
] = hash_values,
) -> Table
Converts probabilities into connected components formed at each threshold.
Parameters:
-
results
¶Table
) –Arrow table with columns [‘left_id’, ‘right_id’, ‘probability’]
-
dtype
¶DataType
, default:large_binary
) –Arrow data type for the parent and child columns
-
hash_func
¶Callable[[*tuple[T, ...]], T]
, default:hash_values
) –Function to hash the parent and child values
Returns:
-
Table
–Arrow table of parent, child, threshold, sorted by probability descending.
graph_results
¶
graph_results(
probabilities: Table,
all_nodes: Iterable[int] | None = None,
) -> tuple[PyDiGraph, ndarray, ndarray]
Convert probability table to graph representation.
Parameters:
-
probabilities
¶Table
) –PyArrow table with ‘left_id’, ‘right_id’ columns
-
all_nodes
¶Iterable[int] | None
, default:None
) –superset of node identities figuring in probabilities table. Used to optionally add isolated nodes to the graph.
Returns:
-
PyDiGraph
–A tuple containing:
-
ndarray
–- Rustwork directed graph
-
ndarray
–- A list mapping the ‘left_id’ probabilities to node indices in the graph
-
tuple[PyDiGraph, ndarray, ndarray]
–- A list mapping the ‘right_id’ probabilities to node indices in the graph