Arrow

matchbox.common.arrow

Common Arrow utilities.

SCHEMA_QUERY module-attribute

SCHEMA_QUERY: Final[Schema] = schema(
    [("id", int64()), ("key", large_string())]
)

Data transfer schema for root cluster IDs keyed to primary keys.

SCHEMA_QUERY_WITH_LEAVES module-attribute

SCHEMA_QUERY_WITH_LEAVES = SCHEMA_QUERY.append(field("leaf_id", int64()))

Data transfer schema for root cluster IDs keyed to primary keys and leaf IDs.

SCHEMA_INDEX module-attribute

SCHEMA_INDEX: Final[Schema] = schema(
    [
        ("hash", large_binary()),
        ("keys", large_list(large_string())),
    ]
)

Data transfer schema for data to be indexed in Matchbox.

SCHEMA_RESULTS module-attribute

SCHEMA_RESULTS: Final[Schema] = schema(
    [
        ("left_id", uint64()),
        ("right_id", uint64()),
        ("probability", uint8()),
    ]
)

Data transfer schema for the results of a deduplication or linking process.

SCHEMA_JUDGEMENTS module-attribute

SCHEMA_JUDGEMENTS: Final[Schema] = schema(
    [
        ("user_id", uint64()),
        ("endorsed", uint64()),
        ("shown", uint64()),
    ]
)

Data transfer schema for retrieved evaluation judgements from users.

SCHEMA_CLUSTER_EXPANSION module-attribute

SCHEMA_CLUSTER_EXPANSION: Final[Schema] = schema(
    [("root", uint64()), ("leaves", list_(uint64()))]
)

Data transfer schema for mapping from a cluster ID to all its source cluster IDs.

SCHEMA_EVAL_SAMPLES module-attribute

SCHEMA_EVAL_SAMPLES: Final[Schema] = schema(
    [
        ("root", uint64()),
        ("leaf", uint64()),
        ("key", large_string()),
        ("source", large_string()),
    ]
)

Data transfer schema for evaluation samples.

JudgementsZipFilenames

Bases: StrEnum

Enumeration of the file names inside the ZIP archive of downloaded judgements.

JUDGEMENTS class-attribute instance-attribute

JUDGEMENTS = 'judgements.parquet'

EXPANSION class-attribute instance-attribute

EXPANSION = 'expansion.parquet'

table_to_buffer

table_to_buffer(table: Table) -> BytesIO

Converts an Arrow table to a BytesIO buffer.

check_schema

check_schema(expected: Schema, actual: Schema) -> None

Validate equality of Arrow schemas.