Transform

matchbox.common.transform

Functions to transform data between tabular and graph structures.

Classes:

  • DisjointSet

    Disjoint set forest with “path compression” and “union by rank” heuristics.

Functions:

  • to_clusters

    Converts probabilities into connected components formed at each threshold.

  • graph_results

    Convert probability table to graph representation.

  • attach_components_to_probabilities

    Takes an Arrow table of probabilities and adds a component column.

  • component_to_hierarchy

    Convert pairwise probabilities into a hierarchical representation.

  • to_hierarchical_clusters

    Converts a table of pairwise probabilities into a table of hierarchical clusters.

DisjointSet

DisjointSet()

Bases: Generic[T]

Disjoint set forest with “path compression” and “union by rank” heuristics.

This follows the implementation in Cormen, Thomas H., et al. Introduction to Algorithms. MIT Press, 2022.

Methods:

  • add

    Add a new element to the disjoint set.

  • union

    Merge the sets containing elements x and y.

  • get_components

    Return the connected components of the disjoint set.

Attributes:

parent instance-attribute

parent: dict[T, T] = {}

rank instance-attribute

rank: dict[T, int] = {}

add

add(x: T) -> None

Add a new element to the disjoint set.

union

union(x: T, y: T) -> None

Merge the sets containing elements x and y.

get_components

get_components() -> list[set[T]]

Return the connected components of the disjoint set.
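
A minimal usage sketch based on the methods documented above; the integer elements are purely illustrative:

    from matchbox.common.transform import DisjointSet

    ds: DisjointSet[int] = DisjointSet()
    for node in (1, 2, 3, 4):
        ds.add(node)

    ds.union(1, 2)  # merge the sets containing 1 and 2
    ds.union(3, 4)  # merge the sets containing 3 and 4

    # Two components remain: [{1, 2}, {3, 4}]
    print(ds.get_components())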

to_clusters

to_clusters(
    results: Table,
    dtype: DataType = large_binary,
    hash_func: Callable[
        [*tuple[T, ...]], T
    ] = hash_values,
) -> Table

Converts probabilities into connected components formed at each threshold.

Parameters:

  • results

    (Table) –

Arrow table with columns ['left_id', 'right_id', 'probability']

  • dtype

    (DataType, default: large_binary ) –

    Arrow data type for the parent and child columns

  • hash_func

    (Callable[[*tuple[T, ...]], T], default: hash_values ) –

    Function to hash the parent and child values

Returns:

  • Table

    Arrow table of parent, child, threshold, sorted by probability descending.
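
A hedged sketch of a call with the default dtype and hash function; the IDs, probability values, and their integer scale are assumptions for illustration, not taken from the source:

    import pyarrow as pa
    from matchbox.common.transform import to_clusters

    # Illustrative pairwise probabilities between record IDs
    results = pa.table({
        "left_id": [1, 2],
        "right_id": [2, 3],
        "probability": [95, 80],  # assumed integer scores
    })

    clusters = to_clusters(results)
    # clusters: Arrow table of parent, child, threshold,
    # sorted by probability descending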

graph_results

graph_results(
    probabilities: Table,
    all_nodes: Iterable[int] | None = None,
) -> tuple[PyDiGraph, ndarray, ndarray]

Convert probability table to graph representation.

Parameters:

  • probabilities

    (Table) –

PyArrow table with 'left_id', 'right_id' columns

  • all_nodes

    (Iterable[int] | None, default: None ) –

Superset of the node identities appearing in the probabilities table. Used to optionally add isolated nodes to the graph.

Returns:

  • tuple[PyDiGraph, ndarray, ndarray]

    A tuple containing:

      • a rustworkx directed graph
      • an array mapping the 'left_id' probabilities to node indices in the graph
      • an array mapping the 'right_id' probabilities to node indices in the graph
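
A sketch of a call under the same illustrative assumptions as above; passing all_nodes adds node 4 to the graph as an isolated vertex:

    import pyarrow as pa
    from matchbox.common.transform import graph_results

    probabilities = pa.table({
        "left_id": [1, 2],
        "right_id": [2, 3],
        "probability": [95, 80],  # assumed integer scores
    })

    graph, left_idx, right_idx = graph_results(
        probabilities, all_nodes=[1, 2, 3, 4]
    )
    print(graph.num_nodes())  # 4, including the isolated node 4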

attach_components_to_probabilities

attach_components_to_probabilities(
    probabilities: Table,
) -> Table

Takes an Arrow table of probabilities and adds a component column.

Expects an Arrow table with columns ['left_id', 'right_id', 'probability'].

Returns a table with an additional column, component.
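
A sketch under the same assumptions (integer IDs and scores); here rows (1, 2) and (2, 3) form one connected component and (4, 5) forms another:

    import pyarrow as pa
    from matchbox.common.transform import attach_components_to_probabilities

    probabilities = pa.table({
        "left_id": [1, 2, 4],
        "right_id": [2, 3, 5],
        "probability": [95, 80, 90],
    })

    with_components = attach_components_to_probabilities(probabilities)
    # Same rows, plus a component column identifying each pair's component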

component_to_hierarchy

component_to_hierarchy(
    table: Table,
    dtype: DataType = large_binary,
    hash_func: Callable[
        [*tuple[T, ...]], T
    ] = hash_values,
) -> Table

Convert pairwise probabilities into a hierarchical representation.

Assumes data is pre-sorted by probability descending.

Parameters:

  • table

    (Table) –

Arrow Table with columns ['left', 'right', 'probability']

  • dtype

    (DataType, default: large_binary ) –

    Arrow data type for the parent and child columns

  • hash_func

    (Callable[[*tuple[T, ...]], T], default: hash_values ) –

    Function to hash the parent and child values

Returns:

  • Table

Arrow Table with columns ['parent', 'child', 'probability']
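
A hedged sketch for a single component; note the input must already be sorted by probability descending, and the values here are illustrative:

    import pyarrow as pa
    from matchbox.common.transform import component_to_hierarchy

    # One connected component, pre-sorted by probability descending
    table = pa.table({
        "left": [1, 2],
        "right": [2, 3],
        "probability": [95, 80],  # assumed integer scores
    })

    hierarchy = component_to_hierarchy(table)
    # hierarchy: Arrow table with columns ['parent', 'child', 'probability']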

to_hierarchical_clusters

to_hierarchical_clusters(
    probabilities: Table,
    proc_func: Callable[
        [Table, DataType], Table
    ] = component_to_hierarchy,
    hash_func: Callable[
        [*tuple[T, ...]], T
    ] = hash_values,
    dtype: DataType = large_binary,
    parallel: bool = True,
    timeout: int = 300,
    min_rows_per_worker: int = 1000,
) -> Table

Converts a table of pairwise probabilities into a table of hierarchical clusters.

Parameters:

  • probabilities

    (Table) –

Arrow table with columns ['component', 'left', 'right', 'probability']

  • proc_func

    (Callable[[Table, DataType], Table], default: component_to_hierarchy ) –

    Function to process each component

  • hash_func

    (Callable[[*tuple[T, ...]], T], default: hash_values ) –

    Function to hash the parent and child values

  • dtype

    (DataType, default: large_binary ) –

    Arrow data type for the parent and child columns

  • parallel

    (bool, default: True ) –

    Whether to work on separate connected components in parallel

  • timeout

    (int, default: 300 ) –

    Maximum seconds to wait for each worker to finish processing. Ignored if parallel==False

  • min_rows_per_worker

    (int, default: 1000 ) –

    Minimum number of rows to pass to each worker. Ignored if parallel==False

Returns:

  • Table

Arrow table with columns ['parent', 'child', 'probability']
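
A minimal sketch with the defaults; the component labels, IDs, and scores are assumptions for illustration, and parallel=False keeps processing in-process, which suits a small table:

    import pyarrow as pa
    from matchbox.common.transform import to_hierarchical_clusters

    # Pairwise probabilities already labelled with their connected component
    probabilities = pa.table({
        "component": [1, 1, 2],
        "left": [1, 2, 4],
        "right": [2, 3, 5],
        "probability": [95, 80, 90],  # assumed integer scores
    })

    hierarchy = to_hierarchical_clusters(probabilities, parallel=False)
    # hierarchy: Arrow table with columns ['parent', 'child', 'probability']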