Skip to content

Transform

matchbox.common.transform

Functions to transform data between tabular and graph structures.

Classes:

  • DisjointSet

    Disjoint set forest with “path compression” and “union by rank” heuristics.

  • Cluster

    A cluster of connected components.

Functions:

  • to_clusters

    Converts probabilities into connected components formed at each threshold.

  • graph_results

    Convert probability table to graph representation.

DisjointSet

DisjointSet()

Bases: Generic[T]

Disjoint set forest with “path compression” and “union by rank” heuristics.

This follows implementation from Cormen, Thomas H., et al. Introduction to algorithms. MIT press, 2022

Methods:

  • add

    Add a new element to the disjoint set.

  • union

    Merge the sets containing elements x and y.

  • get_components

    Return the connected components of the disjoint set.

Attributes:

parent instance-attribute

parent: dict[T, T] = {}

rank instance-attribute

rank: dict[T, int] = {}

add

add(x: T) -> None

Add a new element to the disjoint set.

union

union(x: T, y: T) -> None

Merge the sets containing elements x and y.

get_components

get_components() -> list[set[T]]

Return the connected components of the disjoint set.

Cluster

Cluster(
    intmap: IntMap,
    probability: int | None = None,
    leaves: tuple[Cluster] | list[Cluster] | None = None,
    id: int | None = None,
    hash: bytes | None = None,
)

A cluster of connected components.

Can be a source cluster (a single data point) or a model cluster (a cluster of clusters). The hash of a cluster is the hash of its source clusters – its leaves.

We generate negative integers for IDs, allowing us to generate a true ID with the database after we’ve calculated components using these objects.

Parameters:

  • intmap

    (IntMap) –

    An IntMap instance for generating unique IDs

  • probability

    (int | None, default: None ) –

    probability of the cluster from its resolution, or None if source

  • leaves

    (tuple[Cluster] | list[Cluster] | None, default: None ) –

    A list of Cluster objects that are the leaves of this cluster

  • id

    (int | None, default: None ) –

    The ID of the cluster (only for leaf nodes)

  • hash

    (bytes | None, default: None ) –

    The hash of the cluster (only for leaf nodes)

Methods:

  • combine

    Efficiently combine multiple clusters at once.

Attributes:

id instance-attribute

id: int

hash instance-attribute

hash: bytes

leaves instance-attribute

leaves: tuple[Cluster] | None

probability instance-attribute

probability: int | None = probability

combine classmethod

combine(
    clusters: Iterable[Cluster],
    probability: int | None = None,
) -> Cluster

Efficiently combine multiple clusters at once.

Parameters:

  • clusters
    (Iterable[Cluster]) –

    An iterable of Cluster objects to combine

  • probability
    (int | None, default: None ) –

    the probability of the cluster from its resolution

Returns:

  • Cluster

    A new Cluster containing all unique leaves from the input clusters

to_clusters

to_clusters(
    results: Table,
    dtype: DataType = large_binary,
    hash_func: Callable[
        [*tuple[T, ...]], T
    ] = hash_values,
) -> Table

Converts probabilities into connected components formed at each threshold.

Parameters:

  • results

    (Table) –

    Arrow table with columns [‘left_id’, ‘right_id’, ‘probability’]

  • dtype

    (DataType, default: large_binary ) –

    Arrow data type for the parent and child columns

  • hash_func

    (Callable[[*tuple[T, ...]], T], default: hash_values ) –

    Function to hash the parent and child values

Returns:

  • Table

    Arrow table of parent, child, threshold, sorted by probability descending.

graph_results

graph_results(
    probabilities: Table,
    all_nodes: Iterable[int] | None = None,
) -> tuple[PyDiGraph, ndarray, ndarray]

Convert probability table to graph representation.

Parameters:

  • probabilities

    (Table) –

    PyArrow table with ‘left_id’, ‘right_id’ columns

  • all_nodes

    (Iterable[int] | None, default: None ) –

    superset of node identities figuring in probabilities table. Used to optionally add isolated nodes to the graph.

Returns:

  • PyDiGraph

    A tuple containing:

  • ndarray
    • Rustwork directed graph
  • ndarray
    • A list mapping the ‘left_id’ probabilities to node indices in the graph
  • tuple[PyDiGraph, ndarray, ndarray]
    • A list mapping the ‘right_id’ probabilities to node indices in the graph