Transform

matchbox.common.transform

Functions to transform data between tabular and graph structures.

Classes:

  • DisjointSet

    Disjoint set forest with “path compression” and “union by rank” heuristics.

Functions:

  • to_clusters

    Converts probabilities into connected components formed at each threshold.

  • graph_results

    Convert probability table to graph representation.

  • attach_components_to_probabilities

    Takes an Arrow table of probabilities and adds a component column.

  • component_to_hierarchy

    Convert pairwise probabilities into a hierarchical representation.

  • to_hierarchical_clusters

    Converts a table of pairwise probabilities into a table of hierarchical clusters.

DisjointSet

DisjointSet()

Bases: Generic[T]

Disjoint set forest with “path compression” and “union by rank” heuristics.

This follows the implementation in Cormen, Thomas H., et al. Introduction to Algorithms. MIT Press, 2022.

Methods:

  • add

    Add a new element to the disjoint set.

  • union

    Merge the sets containing elements x and y.

  • get_components

    Return the connected components of the disjoint set.

Attributes:

parent instance-attribute

parent: dict[T, T] = {}

rank instance-attribute

rank: dict[T, int] = {}

add

add(x: T) -> None

Add a new element to the disjoint set.

union

union(x: T, y: T) -> None

Merge the sets containing elements x and y.

get_components

get_components() -> list[set[T]]

Return the connected components of the disjoint set.
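
A minimal usage sketch based on the methods documented above; the integer elements are purely illustrative:

    from matchbox.common.transform import DisjointSet

    ds: DisjointSet[int] = DisjointSet()
    for node in (1, 2, 3, 4):
        ds.add(node)

    ds.union(1, 2)  # merge the sets containing 1 and 2
    ds.union(3, 4)  # merge the sets containing 3 and 4

    # Two components remain: [{1, 2}, {3, 4}]
    print(ds.get_components())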

to_clusters

to_clusters(
    results: Table,
    dtype: DataType = large_binary,
    hash_func: Callable[
        [*tuple[T, ...]], T
    ] = hash_values,
) -> Table

Converts probabilities into connected components formed at each threshold.

Parameters:

  • results

    (Table) –

Arrow table with columns ['left_id', 'right_id', 'probability']

  • dtype

    (DataType, default: large_binary ) –

    Arrow data type for the parent and child columns

  • hash_func

    (Callable[[*tuple[T, ...]], T], default: hash_values ) –

    Function to hash the parent and child values

Returns:

  • Table

    Arrow table of parent, child, threshold, sorted by probability descending.
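
A hedged sketch of a call with the default dtype and hash function; the IDs, probability values, and their integer scale are assumptions for illustration, not taken from the source:

    import pyarrow as pa
    from matchbox.common.transform import to_clusters

    # Illustrative pairwise probabilities between record IDs
    results = pa.table({
        "left_id": [1, 2],
        "right_id": [2, 3],
        "probability": [95, 80],  # assumed integer scores
    })

    clusters = to_clusters(results)
    # clusters: Arrow table of parent, child, threshold,
    # sorted by probability descending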

graph_results

graph_results(
    probabilities: Table,
    all_nodes: Iterable[int] | None = None,
) -> tuple[PyDiGraph, ndarray, ndarray]

Convert probability table to graph representation.

Parameters:

  • probabilities

    (Table) –

PyArrow table with 'left_id', 'right_id' columns

  • all_nodes

    (Iterable[int] | None, default: None ) –

Superset of the node identities appearing in the probabilities table. Used to optionally add isolated nodes to the graph.

Returns:

  • tuple[PyDiGraph, ndarray, ndarray]

    A tuple containing:

      • a rustworkx directed graph
      • an array mapping the 'left_id' probabilities to node indices in the graph
      • an array mapping the 'right_id' probabilities to node indices in the graph
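
A sketch of a call under the same illustrative assumptions as above; passing all_nodes adds node 4 to the graph as an isolated vertex:

    import pyarrow as pa
    from matchbox.common.transform import graph_results

    probabilities = pa.table({
        "left_id": [1, 2],
        "right_id": [2, 3],
        "probability": [95, 80],  # assumed integer scores
    })

    graph, left_idx, right_idx = graph_results(
        probabilities, all_nodes=[1, 2, 3, 4]
    )
    print(graph.num_nodes())  # 4, including the isolated node 4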

attach_components_to_probabilities

attach_components_to_probabilities(
    probabilities: Table,
) -> Table

Takes an Arrow table of probabilities and adds a component column.

Expects an Arrow table with columns ['left_id', 'right_id', 'probability'].

Returns a table with an additional column, component.
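
A sketch under the same assumptions (integer IDs and scores); here rows (1, 2) and (2, 3) form one connected component and (4, 5) forms another:

    import pyarrow as pa
    from matchbox.common.transform import attach_components_to_probabilities

    probabilities = pa.table({
        "left_id": [1, 2, 4],
        "right_id": [2, 3, 5],
        "probability": [95, 80, 90],
    })

    with_components = attach_components_to_probabilities(probabilities)
    # Same rows, plus a component column identifying each pair's component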

component_to_hierarchy

component_to_hierarchy(
    table: Table,
    dtype: DataType = large_binary,
    hash_func: Callable[
        [*tuple[T, ...]], T
    ] = hash_values,
) -> Table

Convert pairwise probabilities into a hierarchical representation.

Assumes data is pre-sorted by probability descending.

Parameters:

  • table

    (Table) –

Arrow Table with columns ['left', 'right', 'probability']

  • dtype

    (DataType, default: large_binary ) –

    Arrow data type for the parent and child columns

  • hash_func

    (Callable[[*tuple[T, ...]], T], default: hash_values ) –

    Function to hash the parent and child values

Returns:

  • Table

Arrow Table with columns ['parent', 'child', 'probability']
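
A hedged sketch for a single component; note the input must already be sorted by probability descending, and the values here are illustrative:

    import pyarrow as pa
    from matchbox.common.transform import component_to_hierarchy

    # One connected component, pre-sorted by probability descending
    table = pa.table({
        "left": [1, 2],
        "right": [2, 3],
        "probability": [95, 80],  # assumed integer scores
    })

    hierarchy = component_to_hierarchy(table)
    # hierarchy: Arrow table with columns ['parent', 'child', 'probability']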

to_hierarchical_clusters

to_hierarchical_clusters(
    probabilities: Table,
    proc_func: Callable[
        [Table, DataType], Table
    ] = component_to_hierarchy,
    hash_func: Callable[
        [*tuple[T, ...]], T
    ] = hash_values,
    dtype: DataType = large_binary,
    parallel: bool = True,
    timeout: int = 300,
    min_rows_per_worker: int = 1000,
) -> Table

Converts a table of pairwise probabilities into a table of hierarchical clusters.

Parameters:

  • probabilities

    (Table) –

Arrow table with columns ['component', 'left', 'right', 'probability']

  • proc_func

    (Callable[[Table, DataType], Table], default: component_to_hierarchy ) –

    Function to process each component

  • hash_func

    (Callable[[*tuple[T, ...]], T], default: hash_values ) –

    Function to hash the parent and child values

  • dtype

    (DataType, default: large_binary ) –

    Arrow data type for the parent and child columns

  • parallel

    (bool, default: True ) –

    Whether to work on separate connected components in parallel

  • timeout

    (int, default: 300 ) –

    Maximum seconds to wait for each worker to finish processing. Ignored if parallel==False

  • min_rows_per_worker

    (int, default: 1000 ) –

    Minimum number of rows to pass to each worker. Ignored if parallel==False

Returns:

  • Table

Arrow table with columns ['parent', 'child', 'probability']
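
A minimal sketch with the defaults; the component labels, IDs, and scores are assumptions for illustration, and parallel=False keeps processing in-process, which suits a small table:

    import pyarrow as pa
    from matchbox.common.transform import to_hierarchical_clusters

    # Pairwise probabilities already labelled with their connected component
    probabilities = pa.table({
        "component": [1, 1, 2],
        "left": [1, 2, 4],
        "right": [2, 3, 5],
        "probability": [95, 80, 90],  # assumed integer scores
    })

    hierarchy = to_hierarchical_clusters(probabilities, parallel=False)
    # hierarchy: Arrow table with columns ['parent', 'child', 'probability']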