Skip to content

Transform

matchbox.common.transform

Functions to transform data between tabular and graph structures.

Classes:

  • DisjointSet

    Disjoint set forest with “path compression” and “union by rank” heuristics.

  • Cluster

    A cluster of connected components.

Functions:

DisjointSet

DisjointSet()

Bases: Generic[T]

Disjoint set forest with “path compression” and “union by rank” heuristics.

This follows implementation from Cormen, Thomas H., et al. Introduction to algorithms. MIT press, 2022

Methods:

  • add

    Add a new element to the disjoint set.

  • union

    Merge the sets containing elements x and y.

  • get_components

    Return the connected components of the disjoint set.

Attributes:

parent instance-attribute

parent: dict[T, T] = {}

rank instance-attribute

rank: dict[T, int] = {}

add

add(x: T) -> None

Add a new element to the disjoint set.

union

union(x: T, y: T) -> None

Merge the sets containing elements x and y.

get_components

get_components() -> list[set[T]]

Return the connected components of the disjoint set.

Cluster

Cluster(intmap: IntMap, probability: int | None = None, leaves: tuple[Cluster] | list[Cluster] | None = None, id: int | None = None, hash: bytes | None = None)

A cluster of connected components.

Can be a source cluster (a single data point) or a model cluster (a cluster of clusters). The hash of a cluster is the hash of its source clusters – its leaves.

We generate negative integers for IDs, allowing us to generate a true ID with the database after we’ve calculated components using these objects.

Parameters:

  • intmap

    (IntMap) –

    An IntMap instance for generating unique IDs

  • probability

    (int | None, default: None ) –

    probability of the cluster from its resolution, or None if source

  • leaves

    (tuple[Cluster] | list[Cluster] | None, default: None ) –

    A list of Cluster objects that are the leaves of this cluster

  • id

    (int | None, default: None ) –

    The ID of the cluster (only for leaf nodes)

  • hash

    (bytes | None, default: None ) –

    The hash of the cluster (only for leaf nodes)

Methods:

  • combine

    Efficiently combine multiple clusters at once.

Attributes:

id instance-attribute

id: int

hash instance-attribute

hash: bytes

leaves instance-attribute

leaves: tuple[Cluster] | None

probability instance-attribute

probability: int | None = probability

combine classmethod

combine(clusters: Iterable[Cluster], probability: int | None = None) -> Cluster

Efficiently combine multiple clusters at once.

Parameters:

  • clusters
    (Iterable[Cluster]) –

    An iterable of Cluster objects to combine

  • probability
    (int | None, default: None ) –

    the probability of the cluster from its resolution

Returns:

  • Cluster

    A new Cluster containing all unique leaves from the input clusters

to_clusters

to_clusters(results: DataFrame, dtype: DataType = Int64, hash_func: Callable[[*(tuple[T, ...])], T] = hash_values) -> DataFrame

Converts probabilities into connected components formed at each threshold.

Parameters:

  • results

    (DataFrame) –

    Polars dataframe with columns [‘left_id’, ‘right_id’, ‘probability’]

  • dtype

    (DataType, default: Int64 ) –

    Polars data type for the parent and child columns

  • hash_func

    (Callable[[*(tuple[T, ...])], T], default: hash_values ) –

    Function to hash the parent and child values

Returns:

  • DataFrame

    Polars dataframe of parent, child, threshold, sorted by probability descending.

graph_results

graph_results(probabilities: DataFrame, all_nodes: Iterable[int] | None = None) -> tuple[PyDiGraph, ndarray, ndarray]

Convert probability table to graph representation.

Parameters:

  • probabilities

    (DataFrame) –

    Polars dataframe with ‘left_id’, ‘right_id’ columns

  • all_nodes

    (Iterable[int] | None, default: None ) –

    superset of node identities figuring in probabilities table. Used to optionally add isolated nodes to the graph.

Returns:

  • PyDiGraph

    A tuple containing:

  • ndarray
    • Rustwork directed graph
  • ndarray
    • A list mapping the ‘left_id’ probabilities to node indices in the graph
  • tuple[PyDiGraph, ndarray, ndarray]
    • A list mapping the ‘right_id’ probabilities to node indices in the graph

hash_cluster_leaves

hash_cluster_leaves(leaves: list[bytes]) -> bytes

Canonical method to convert list of cluster IDs to their combined hash.

truth_float_to_int

truth_float_to_int(truth: float) -> int

Convert user input float truth values to int.

truth_int_to_float

truth_int_to_float(truth: int) -> float

Convert backend int truth values to float.