Arrow
matchbox.common.arrow
Common Arrow utilities.
Classes:
- JudgementsZipFilenames – Enumeration of the file names in the ZIP file of downloaded judgements.
Functions:
- table_to_buffer – Converts an Arrow table to a BytesIO buffer.
- check_schema – Validate equality of Arrow schemas.
Attributes:
- SCHEMA_QUERY (Final[Schema]) – Data transfer schema for root cluster IDs keyed to primary keys.
- SCHEMA_QUERY_WITH_LEAVES – Data transfer schema for root cluster IDs keyed to primary keys and leaf IDs.
- SCHEMA_INDEX (Final[Schema]) – Data transfer schema for data to be indexed in Matchbox.
- SCHEMA_RESULTS (Final[Schema]) – Data transfer schema for the results of a deduplication or linking process.
- SCHEMA_JUDGEMENTS (Final[Schema]) – Data transfer schema for retrieved evaluation judgements from users.
- SCHEMA_CLUSTER_EXPANSION (Final[Schema]) – Data transfer schema for mapping from a cluster ID to all its source cluster IDs.
- SCHEMA_EVAL_SAMPLES (Final[Schema]) – Data transfer schema for evaluation samples.
SCHEMA_QUERY
module-attribute
SCHEMA_QUERY: Final[Schema] = schema(
[("id", int64()), ("key", large_string())]
)
Data transfer schema for root cluster IDs keyed to primary keys.
SCHEMA_QUERY_WITH_LEAVES
module-attribute
Data transfer schema for root cluster IDs keyed to primary keys and leaf IDs.
SCHEMA_INDEX
module-attribute
SCHEMA_INDEX: Final[Schema] = schema(
[
("hash", large_binary()),
("keys", large_list(large_string())),
]
)
Data transfer schema for data to be indexed in Matchbox.
SCHEMA_RESULTS
module-attribute
SCHEMA_RESULTS: Final[Schema] = schema(
[
("left_id", uint64()),
("right_id", uint64()),
("probability", uint8()),
]
)
Data transfer schema for the results of a deduplication or linking process.
SCHEMA_JUDGEMENTS
module-attribute
SCHEMA_JUDGEMENTS: Final[Schema] = schema(
[
("user_id", uint64()),
("endorsed", uint64()),
("shown", uint64()),
]
)
Data transfer schema for retrieved evaluation judgements from users.
SCHEMA_CLUSTER_EXPANSION
module-attribute
SCHEMA_CLUSTER_EXPANSION: Final[Schema] = schema(
[("root", uint64()), ("leaves", list_(uint64()))]
)
Data transfer schema for mapping from a cluster ID to all its source cluster IDs.
SCHEMA_EVAL_SAMPLES
module-attribute
SCHEMA_EVAL_SAMPLES: Final[Schema] = schema(
[
("root", uint64()),
("leaf", uint64()),
("key", large_string()),
("source", large_string()),
]
)
Data transfer schema for evaluation samples.
JudgementsZipFilenames
Bases: StrEnum
Enumeration of the file names in the ZIP file of downloaded judgements.
Attributes:
- JUDGEMENTS
- EXPANSION
table_to_buffer
table_to_buffer(table: Table) -> BytesIO
Converts an Arrow table to a BytesIO buffer.
check_schema
Validate equality of Arrow schemas.