Clean¶

matchbox.client.clean ¶

Library of default cleaning functions.

Modules:

lib –

Implementation of default cleaning functions.
steps –

Low-level components of default cleaning functions.
utils –

Generic utilities for default cleaning functions.

Functions:

company_name –

Standard cleaning function for company names.
company_number –

Remove non-numbers, and then leading zeroes.
drop –

Drops the column from the dataframe.
extract_cdms_number_to_new –

Detects the CDMS nuber in a column and moves it to a new column.
extract_company_number_to_new –

Detects the Companies House CRN in a column and moves it to a new column.
extract_duns_number_to_new –

Detects the Dun & Bradstreet DUNS nuber in a column and moves it to a new column.
postcode –

Removes all punctuation, converts to upper, removes all spaces.
postcode_to_area –

Extracts postcode area from a postcode.
remove_prefix –

Remove a prefix from any column names that contain it.
alias –

Takes a cleaning function and aliases the output to a new column.
cleaning_function –

Takes a list of basic cleaning functions and composes them into a callable.
unnest_renest –

Takes a cleaning function and adds unnesting and renesting either side of it.

company_name ¶

company_name(
    df: DataFrame,
    column: str,
    column_secondary: str = None,
    stopwords: str = STOPWORDS,
) -> DataFrame

Standard cleaning function for company names.

Lower case, remove punctuation & tokenise the company name into an array
Extract tokens into: ‘unusual’ and ‘stopwords’. Dedupe. Sort alphabetically
Untokenise the unusual words back to a string

Parameters:

df ¶
(DataFrame) –

a dataframe
column ¶
(str) –

a column containing the company’s main name
column_secondary ¶
(str, default: None ) –

a column containing an array of the company’s secondary names
stopwords ¶
(str, default: STOPWORDS ) –

a list of stopwords to use for this clean

Returns:

dataframe ( DataFrame ) –

the same as went in, but cleaned

company_number ¶

company_number(df: DataFrame, column: str) -> DataFrame

Remove non-numbers, and then leading zeroes.

Parameters:

df ¶
(DataFrame) –

a dataframe
column ¶
(str) –

a column containing a company number

Returns:

dataframe ( DataFrame ) –

the same as went in, but cleaned

drop ¶

drop(df: DataFrame, column: str) -> DataFrame

Drops the column from the dataframe.

Parameters:

df ¶
(DataFrame) –

a dataframe
column ¶
(str) –

a column

Returns:

dataframe ( DataFrame ) –

the same as went in without the column

extract_cdms_number_to_new ¶

extract_cdms_number_to_new(
    df: DataFrame, column: str, new_column: str
) -> DataFrame

Detects the CDMS nuber in a column and moves it to a new column.

Parameters:

df ¶
(DataFrame) –

a dataframe
column ¶
(str) –

a column containing some CDMS numbers
new_column ¶
(str) –

the name of the column to add

Returns:

dataframe ( DataFrame ) –

the same as went in with a new column for CDMS numbers

extract_company_number_to_new ¶

extract_company_number_to_new(
    df: DataFrame, column: str, new_column: str
) -> DataFrame

Detects the Companies House CRN in a column and moves it to a new column.

Parameters:

df ¶
(DataFrame) –

a dataframe
column ¶
(str) –

a column containing some company numbers
new_column ¶
(str) –

the name of the column to add

Returns:

dataframe ( DataFrame ) –

the same as went in with a new column for CRNs

extract_duns_number_to_new ¶

extract_duns_number_to_new(
    df: DataFrame, column: str, new_column: str
) -> DataFrame

Detects the Dun & Bradstreet DUNS nuber in a column and moves it to a new column.

Parameters:

df ¶
(DataFrame) –

a dataframe
column ¶
(str) –

a column containing some DUNS numbers
new_column ¶
(str) –

the name of the column to add

Returns:

dataframe ( DataFrame ) –

the same as went in with a new column for DUNs numbers

postcode ¶

postcode(df: DataFrame, column: str) -> DataFrame

Removes all punctuation, converts to upper, removes all spaces.

Parameters:

df ¶
(DataFrame) –

a dataframe
column ¶
(str) –

a column containing a postcode

Returns:

dataframe ( DataFrame ) –

the same as went in, but cleaned

postcode_to_area ¶

postcode_to_area(df: DataFrame, column: str) -> DataFrame

Extracts postcode area from a postcode.

Parameters:

df ¶
(DataFrame) –

a dataframe
column ¶
(str) –

a column containing a postcode

Returns:

dataframe ( DataFrame ) –

the same as went in, but cleaned

remove_prefix ¶

remove_prefix(
    df: DataFrame, column: str, prefix: str
) -> DataFrame

Remove a prefix from any column names that contain it.

Parameters:

df ¶
(DataFrame) –

a dataframe
column ¶
(str) –

a dead variable to appease Matchbox while a bug is fixed
prefix ¶
(str) –

the prefix string to remove from column names

Returns:

dataframe ( DataFrame ) –

the same as went in, but with column names cleaned

alias ¶

alias(function: Callable, alias: str) -> Callable

Takes a cleaning function and aliases the output to a new column.

Parameters:

function ¶
(Callable) –

an outut from a cleaning_function function
alias ¶
(str) –

the new column name to use

cleaning_function ¶

cleaning_function(*functions: Callable) -> Callable

Takes a list of basic cleaning functions and composes them into a callable.

Functions must be appropriate for a select statement.

Only for use with cleaning methods that take a single column as their argument. Consider using functools.partial to coerce functions that need arguments into this shape.

Parameters:

functions ¶
(Callable, default: () ) –

a list of functions appropriate for a select statement. See clean_basic for some examples

unnest_renest ¶

unnest_renest(function: Callable) -> Callable

Takes a cleaning function and adds unnesting and renesting either side of it.

Useful for applying the same function to an array where there are sub-functions that also use arrays, blocking list_transform.

Parameters:

function ¶
(Callable) –

an outut from a cleaning_function function

lib ¶

Implementation of default cleaning functions.

Functions:

company_name –

Standard cleaning function for company names.
company_number –

Remove non-numbers, and then leading zeroes.
postcode –

Removes all punctuation, converts to upper, removes all spaces.
postcode_to_area –

Extracts postcode area from a postcode.
extract_company_number_to_new –

Detects the Companies House CRN in a column and moves it to a new column.
extract_duns_number_to_new –

Detects the Dun & Bradstreet DUNS nuber in a column and moves it to a new column.
extract_cdms_number_to_new –

Detects the CDMS nuber in a column and moves it to a new column.
drop –

Drops the column from the dataframe.
remove_prefix –

Remove a prefix from any column names that contain it.

company_name ¶

company_name(
    df: DataFrame,
    column: str,
    column_secondary: str = None,
    stopwords: str = STOPWORDS,
) -> DataFrame

Standard cleaning function for company names.

Lower case, remove punctuation & tokenise the company name into an array
Extract tokens into: ‘unusual’ and ‘stopwords’. Dedupe. Sort alphabetically
Untokenise the unusual words back to a string

Parameters:

df ¶
(DataFrame) –

a dataframe
column ¶
(str) –

a column containing the company’s main name
column_secondary ¶
(str, default: None ) –

a column containing an array of the company’s secondary names
stopwords ¶
(str, default: STOPWORDS ) –

a list of stopwords to use for this clean

Returns:

dataframe ( DataFrame ) –

the same as went in, but cleaned

company_number ¶

company_number(df: DataFrame, column: str) -> DataFrame

Remove non-numbers, and then leading zeroes.

Parameters:

df ¶
(DataFrame) –

a dataframe
column ¶
(str) –

a column containing a company number

Returns:

dataframe ( DataFrame ) –

the same as went in, but cleaned

postcode ¶

postcode(df: DataFrame, column: str) -> DataFrame

Removes all punctuation, converts to upper, removes all spaces.

Parameters:

df ¶
(DataFrame) –

a dataframe
column ¶
(str) –

a column containing a postcode

Returns:

dataframe ( DataFrame ) –

the same as went in, but cleaned

postcode_to_area ¶

postcode_to_area(df: DataFrame, column: str) -> DataFrame

Extracts postcode area from a postcode.

Parameters:

df ¶
(DataFrame) –

a dataframe
column ¶
(str) –

a column containing a postcode

Returns:

dataframe ( DataFrame ) –

the same as went in, but cleaned

extract_company_number_to_new ¶

extract_company_number_to_new(
    df: DataFrame, column: str, new_column: str
) -> DataFrame

Detects the Companies House CRN in a column and moves it to a new column.

Parameters:

df ¶
(DataFrame) –

a dataframe
column ¶
(str) –

a column containing some company numbers
new_column ¶
(str) –

the name of the column to add

Returns:

dataframe ( DataFrame ) –

the same as went in with a new column for CRNs

extract_duns_number_to_new ¶

extract_duns_number_to_new(
    df: DataFrame, column: str, new_column: str
) -> DataFrame

Detects the Dun & Bradstreet DUNS nuber in a column and moves it to a new column.

Parameters:

df ¶
(DataFrame) –

a dataframe
column ¶
(str) –

a column containing some DUNS numbers
new_column ¶
(str) –

the name of the column to add

Returns:

dataframe ( DataFrame ) –

the same as went in with a new column for DUNs numbers

extract_cdms_number_to_new ¶

extract_cdms_number_to_new(
    df: DataFrame, column: str, new_column: str
) -> DataFrame

Detects the CDMS nuber in a column and moves it to a new column.

Parameters:

df ¶
(DataFrame) –

a dataframe
column ¶
(str) –

a column containing some CDMS numbers
new_column ¶
(str) –

the name of the column to add

Returns:

dataframe ( DataFrame ) –

the same as went in with a new column for CDMS numbers

drop ¶

drop(df: DataFrame, column: str) -> DataFrame

Drops the column from the dataframe.

Parameters:

df ¶
(DataFrame) –

a dataframe
column ¶
(str) –

a column

Returns:

dataframe ( DataFrame ) –

the same as went in without the column

remove_prefix ¶

remove_prefix(
    df: DataFrame, column: str, prefix: str
) -> DataFrame

Remove a prefix from any column names that contain it.

Parameters:

df ¶
(DataFrame) –

a dataframe
column ¶
(str) –

a dead variable to appease Matchbox while a bug is fixed
prefix ¶
(str) –

the prefix string to remove from column names

Returns:

dataframe ( DataFrame ) –

the same as went in, but with column names cleaned

steps ¶

Low-level components of default cleaning functions.

Modules:

clean_basic –

Low-level primitives supporting default cleaning functions.
clean_basic_original –

Legacy cleaning rules inherited by the Company Matching Service.

Functions:

array_except –

Remove terms from an array.
array_intersect –

Filter an array to only keep terms in a list.
clean_punctuation –

Removes all punctuation and spaces, trim, lowercase.
clean_punctuation_except_hyphens –

Revove all punctuation and spaces except hyphens, trim.
dedupe_and_sort –

De-duplicate an array of tokens and sort alphabetically.
expand_abbreviations –

Expand abbreviations found in the column.
filter_cdms_number –

Filter out non-CDMS numbers.
filter_company_number –

Filter out non-Companies House numbers.
filter_duns_number –

Filter out non-DUNS numbers.
get_digits_only –

Extract digits only, including nonconsecutive.
get_low_freq_char_sig –

Removes letters with a frequency of 5% or higher, and spaces.
get_postcode_area –

Extract the postcode area from a column.
list_join_to_string –

Join a list of strings into a single string.
periods_to_nothing –

Removes periods and replaces with nothing (U.K. -> UK).
punctuation_to_spaces –

Removes all punctuation and replaces with spaces.
regex_extract_list_of_strings –

Extract a list of strings from a column using regex.
regex_remove_list_of_strings –

Remove a list of strings from a column using regex.
remove_notnumbers_leadingzeroes –

Remove any char that is not a number, then remove all leading zeroes.
remove_stopwords –

A thin optinionated wrapper for array_except to clean the global stopwords list.
remove_whitespace –

Removes all whitespaces.
to_lower –

All characters to lowercase.
to_upper –

All characters to uppercase.
tokenise –

Split the text in column into an array.
trim –

Remove leading and trailing whitespace.
cms_original_clean_cdms_id –

Replicates the original Company Matching Service CDMS ID cleaning.
cms_original_clean_ch_id –

Replicates the original Company Matching Service Companies House ID cleaning.
cms_original_clean_company_name_ch –

Replicates the original Company Matching Service company name cleaning.
cms_original_clean_company_name_general –

Replicates the original Company Matching Service company name cleaning.
cms_original_clean_email –

Replicates the original Company Matching Service email cleaning.
cms_original_clean_postcode –

Replicates the original Company Matching Service postcode cleaning.

array_except ¶

array_except(
    column: str, terms_to_remove: list[str]
) -> str

Remove terms from an array.

Parameters:

column ¶
(str) –

The name of the column to treat
terms_to_remove ¶
(list[str]) –

A list of terms to remove

Returns:

str –

String to insert into SQL query

array_intersect ¶

array_intersect(
    column: str, terms_to_keep: list[str]
) -> str

Filter an array to only keep terms in a list.

Parameters:

column ¶
(str) –

The name of the column to treat
terms_to_keep ¶
(list[str]) –

A list of terms to keep

Returns:

str –

String to insert into SQL query

clean_punctuation ¶

clean_punctuation(column: str) -> str

Removes all punctuation and spaces, trim, lowercase.

Set to lower case
Remove punctuation
Multiple to single space replace.
Trim leading and trailing spaces.

Parameters:

column ¶
(str) –

The name of the column to treat

Returns:

str –

String to insert into SQL query

clean_punctuation_except_hyphens ¶

clean_punctuation_except_hyphens(column: str) -> str

Revove all punctuation and spaces except hyphens, trim.

Useful for cleaning reference numbers.

Parameters:

column ¶
(str) –

The name of the column to treat

Returns:

str –

String to insert into SQL query

dedupe_and_sort ¶

dedupe_and_sort(column: str) -> str

De-duplicate an array of tokens and sort alphabetically.

Parameters:

column ¶
(str) –

The name of the column to treat

Returns:

str –

String to insert into SQL query

expand_abbreviations ¶

expand_abbreviations(
    column: str,
    replacements: dict[str, str] = ABBREVIATIONS,
) -> str

Expand abbreviations found in the column.

Takes a dictionary where the keys are matches and the values are what to replace them with.

Matches only when term is surrounded by regex word boundaries.

Parameters:

column ¶
(str) –

the name of the column to clean
replacements ¶
(dict[str, str], default: ABBREVIATIONS ) –

a dictionary where keys are matches and values are what the replace them with

Returns:

str –

String to insert into SQL query

filter_cdms_number ¶

filter_cdms_number(column: str) -> str

Filter out non-CDMS numbers.

Returns a CASE WHEN filter on the specified column that will match only CDMS numbers. Must be either:

6 or 12 digits long
Start with ‘000’
Start with ‘ORG-‘

Will return false positives on some CRN numbers when they are 8 digits long and begin with ‘000’.

Parameters:

column ¶
(str) –

The name of the column to treat

Returns:

str –

String to insert into SQL query

filter_company_number ¶

filter_company_number(column: str) -> str

Filter out non-Companies House numbers.

Returns a CASE WHEN filter on the specified column that will match only Companies House numbers, CRNs.

Uses regex derived from: https://gist.github.com/rob-murray/01d43581114a6b319034732bcbda29e1

Parameters:

column ¶
(str) –

The name of the column to treat

Returns:

str –

String to insert into SQL query

filter_duns_number ¶

filter_duns_number(column: str) -> str

Filter out non-DUNS numbers.

Returns a CASE WHEN filter on the specified column that will match only a Dun & Bradstreet DUNS number. Must be both:

9 characters
Numeric

Parameters:

column ¶
(str) –

The name of the column to treat

Returns:

str –

String to insert into SQL query

get_digits_only ¶

get_digits_only(column: str) -> str

Extract digits only, including nonconsecutive.

Parameters:

column ¶
(str) –

The name of the column to treat

Returns:

str –

String to insert into SQL query

get_low_freq_char_sig ¶

get_low_freq_char_sig(column: str) -> str

Removes letters with a frequency of 5% or higher, and spaces.

See https://en.wikipedia.org/wiki/Letter_frequency for details.

Parameters:

column ¶
(str) –

The name of the column to treat

Returns:

str –

String to insert into SQL query

get_postcode_area ¶

get_postcode_area(column: str) -> str

Extract the postcode area from a column.

Parameters:

column ¶
(str) –

The name of the column to treat

Returns:

str –

String to insert into SQL query

list_join_to_string ¶

list_join_to_string(
    column: str, seperator: str = " "
) -> str

Join a list of strings into a single string.

Parameters:

column ¶
(str) –

The name of the column to treat
seperator ¶
(str, default: ' ' ) –

The string to use to join the list

Returns:

str –

String to insert into SQL query

periods_to_nothing ¶

periods_to_nothing(column: str) -> str

Removes periods and replaces with nothing (U.K. -> UK).

Parameters:

column ¶
(str) –

The name of the column to treat

Returns:

str –

String to insert into SQL query

punctuation_to_spaces ¶

punctuation_to_spaces(column: str) -> str

Removes all punctuation and replaces with spaces.

Parameters:

column ¶
(str) –

The name of the column to treat

Returns:

str –

String to insert into SQL query

regex_extract_list_of_strings ¶

regex_extract_list_of_strings(
    column: str, list_of_strings: list[str]
) -> str

Extract a list of strings from a column using regex.

Parameters:

column ¶
(str) –

The name of the column to treat
list_of_strings ¶
(list[str]) –

A list of strings to extract

Returns:

str –

String to insert into SQL query

regex_remove_list_of_strings ¶

regex_remove_list_of_strings(
    column: str, list_of_strings: list[str]
) -> str

Remove a list of strings from a column using regex.

Parameters:

column ¶
(str) –

The name of the column to treat
list_of_strings ¶
(list[str]) –

A list of strings to remove

Returns:

str –

String to insert into SQL query

remove_notnumbers_leadingzeroes ¶

remove_notnumbers_leadingzeroes(column: str) -> str

Remove any char that is not a number, then remove all leading zeroes.

Parameters:

column ¶
(str) –

The name of the column to treat

Returns:

str –

String to insert into SQL query

remove_stopwords ¶

remove_stopwords(
    column: str, stopwords: list[str] = STOPWORDS
) -> str

A thin optinionated wrapper for array_except to clean the global stopwords list.

Parameters:

column ¶
(str) –

The name of the column to treat
stopwords ¶
(list[str], default: STOPWORDS ) –

A list of terms to remove

Returns:

str –

String to insert into SQL query

remove_whitespace ¶

remove_whitespace(column: str) -> str

Removes all whitespaces.

Parameters:

column ¶
(str) –

The name of the column to treat

Returns:

str –

String to insert into SQL query

to_lower ¶

to_lower(column: str) -> str

All characters to lowercase.

Parameters:

column ¶
(str) –

The name of the column to treat

Returns:

str –

String to insert into SQL query

to_upper ¶

to_upper(column: str) -> str

All characters to uppercase.

Parameters:

column ¶
(str) –

The name of the column to treat

Returns:

str –

String to insert into SQL query

tokenise ¶

tokenise(column: str) -> str

Split the text in column into an array.

Uses any char that is not alphanumeric as delimeter.

Parameters:

column ¶
(str) –

The name of the column to treat

Returns:

str –

String to insert into SQL query

trim ¶

trim(column: str) -> str

Remove leading and trailing whitespace.

Parameters:

column ¶
(str) –

The name of the column to treat

Returns:

str –

String to insert into SQL query

cms_original_clean_cdms_id ¶

cms_original_clean_cdms_id(column)

Replicates the original Company Matching Service CDMS ID cleaning.

Intended to help replicate the methodology for comparison.

cms_original_clean_ch_id ¶

cms_original_clean_ch_id(column)

Replicates the original Company Matching Service Companies House ID cleaning.

Intended to help replicate the methodology for comparison.

cms_original_clean_company_name_ch ¶

cms_original_clean_company_name_ch(column)

Replicates the original Company Matching Service company name cleaning.

Intended to help replicate the methodology for comparison.

The _ch_name_simplification version from app/algorithm/sql_statements.py#L14.

Use with Companies House only.

cms_original_clean_company_name_general ¶

cms_original_clean_company_name_general(column)

Replicates the original Company Matching Service company name cleaning.

Intended to help replicate the methodology for comparison.

The _general_name_simplification version from app/algorithm/sql_statements.py#L24.

Use with any data except Companies House.

cms_original_clean_email ¶

cms_original_clean_email(column)

Replicates the original Company Matching Service email cleaning.

Intended to help replicate the methodology for comparison.

cms_original_clean_postcode ¶

cms_original_clean_postcode(column)

Replicates the original Company Matching Service postcode cleaning.

Intended to help replicate the methodology for comparison.

clean_basic ¶

Low-level primitives supporting default cleaning functions.

Functions:

remove_whitespace –

Removes all whitespaces.
punctuation_to_spaces –

Removes all punctuation and replaces with spaces.
periods_to_nothing –

Removes periods and replaces with nothing (U.K. -> UK).
clean_punctuation –

Removes all punctuation and spaces, trim, lowercase.
clean_punctuation_except_hyphens –

Revove all punctuation and spaces except hyphens, trim.
expand_abbreviations –

Expand abbreviations found in the column.
tokenise –

Split the text in column into an array.
dedupe_and_sort –

De-duplicate an array of tokens and sort alphabetically.
remove_notnumbers_leadingzeroes –

Remove any char that is not a number, then remove all leading zeroes.
array_except –

Remove terms from an array.
array_intersect –

Filter an array to only keep terms in a list.
remove_stopwords –

A thin optinionated wrapper for array_except to clean the global stopwords list.
regex_remove_list_of_strings –

Remove a list of strings from a column using regex.
regex_extract_list_of_strings –

Extract a list of strings from a column using regex.
list_join_to_string –

Join a list of strings into a single string.
get_postcode_area –

Extract the postcode area from a column.
get_low_freq_char_sig –

Removes letters with a frequency of 5% or higher, and spaces.
filter_cdms_number –

Filter out non-CDMS numbers.
filter_company_number –

Filter out non-Companies House numbers.
filter_duns_number –

Filter out non-DUNS numbers.
to_upper –

All characters to uppercase.
to_lower –

All characters to lowercase.
trim –

Remove leading and trailing whitespace.
get_digits_only –

Extract digits only, including nonconsecutive.

remove_whitespace ¶

remove_whitespace(column: str) -> str

Removes all whitespaces.

Parameters:

column ¶
(str) –

The name of the column to treat

Returns:

str –

String to insert into SQL query

punctuation_to_spaces ¶

punctuation_to_spaces(column: str) -> str

Removes all punctuation and replaces with spaces.

Parameters:

column ¶
(str) –

The name of the column to treat

Returns:

str –

String to insert into SQL query

periods_to_nothing ¶

periods_to_nothing(column: str) -> str

Removes periods and replaces with nothing (U.K. -> UK).

Parameters:

column ¶
(str) –

The name of the column to treat

Returns:

str –

String to insert into SQL query

clean_punctuation ¶

clean_punctuation(column: str) -> str

Removes all punctuation and spaces, trim, lowercase.

Set to lower case
Remove punctuation
Multiple to single space replace.
Trim leading and trailing spaces.

Parameters:

column ¶
(str) –

The name of the column to treat

Returns:

str –

String to insert into SQL query

clean_punctuation_except_hyphens ¶

clean_punctuation_except_hyphens(column: str) -> str

Revove all punctuation and spaces except hyphens, trim.

Useful for cleaning reference numbers.

Parameters:

column ¶
(str) –

The name of the column to treat

Returns:

str –

String to insert into SQL query

expand_abbreviations ¶

expand_abbreviations(
    column: str,
    replacements: dict[str, str] = ABBREVIATIONS,
) -> str

Expand abbreviations found in the column.

Takes a dictionary where the keys are matches and the values are what to replace them with.

Matches only when term is surrounded by regex word boundaries.

Parameters:

column ¶
(str) –

the name of the column to clean
replacements ¶
(dict[str, str], default: ABBREVIATIONS ) –

a dictionary where keys are matches and values are what the replace them with

Returns:

str –

String to insert into SQL query

tokenise ¶

tokenise(column: str) -> str

Split the text in column into an array.

Uses any char that is not alphanumeric as delimeter.

Parameters:

column ¶
(str) –

The name of the column to treat

Returns:

str –

String to insert into SQL query

dedupe_and_sort ¶

dedupe_and_sort(column: str) -> str

De-duplicate an array of tokens and sort alphabetically.

Parameters:

column ¶
(str) –

The name of the column to treat

Returns:

str –

String to insert into SQL query

remove_notnumbers_leadingzeroes ¶

remove_notnumbers_leadingzeroes(column: str) -> str

Remove any char that is not a number, then remove all leading zeroes.

Parameters:

column ¶
(str) –

The name of the column to treat

Returns:

str –

String to insert into SQL query

array_except ¶

array_except(
    column: str, terms_to_remove: list[str]
) -> str

Remove terms from an array.

Parameters:

column ¶
(str) –

The name of the column to treat
terms_to_remove ¶
(list[str]) –

A list of terms to remove

Returns:

str –

String to insert into SQL query

array_intersect ¶

array_intersect(
    column: str, terms_to_keep: list[str]
) -> str

Filter an array to only keep terms in a list.

Parameters:

column ¶
(str) –

The name of the column to treat
terms_to_keep ¶
(list[str]) –

A list of terms to keep

Returns:

str –

String to insert into SQL query

remove_stopwords ¶

remove_stopwords(
    column: str, stopwords: list[str] = STOPWORDS
) -> str

A thin optinionated wrapper for array_except to clean the global stopwords list.

Parameters:

column ¶
(str) –

The name of the column to treat
stopwords ¶
(list[str], default: STOPWORDS ) –

A list of terms to remove

Returns:

str –

String to insert into SQL query

regex_remove_list_of_strings ¶

regex_remove_list_of_strings(
    column: str, list_of_strings: list[str]
) -> str

Remove a list of strings from a column using regex.

Parameters:

column ¶
(str) –

The name of the column to treat
list_of_strings ¶
(list[str]) –

A list of strings to remove

Returns:

str –

String to insert into SQL query

regex_extract_list_of_strings ¶

regex_extract_list_of_strings(
    column: str, list_of_strings: list[str]
) -> str

Extract a list of strings from a column using regex.

Parameters:

column ¶
(str) –

The name of the column to treat
list_of_strings ¶
(list[str]) –

A list of strings to extract

Returns:

str –

String to insert into SQL query

list_join_to_string ¶

list_join_to_string(
    column: str, seperator: str = " "
) -> str

Join a list of strings into a single string.

Parameters:

column ¶
(str) –

The name of the column to treat
seperator ¶
(str, default: ' ' ) –

The string to use to join the list

Returns:

str –

String to insert into SQL query

get_postcode_area ¶

get_postcode_area(column: str) -> str

Extract the postcode area from a column.

Parameters:

column ¶
(str) –

The name of the column to treat

Returns:

str –

String to insert into SQL query

get_low_freq_char_sig ¶

get_low_freq_char_sig(column: str) -> str

Removes letters with a frequency of 5% or higher, and spaces.

See https://en.wikipedia.org/wiki/Letter_frequency for details.

Parameters:

column ¶
(str) –

The name of the column to treat

Returns:

str –

String to insert into SQL query

filter_cdms_number ¶

filter_cdms_number(column: str) -> str

Filter out non-CDMS numbers.

Returns a CASE WHEN filter on the specified column that will match only CDMS numbers. Must be either:

6 or 12 digits long
Start with ‘000’
Start with ‘ORG-‘

Will return false positives on some CRN numbers when they are 8 digits long and begin with ‘000’.

Parameters:

column ¶
(str) –

The name of the column to treat

Returns:

str –

String to insert into SQL query

filter_company_number ¶

filter_company_number(column: str) -> str

Filter out non-Companies House numbers.

Returns a CASE WHEN filter on the specified column that will match only Companies House numbers, CRNs.

Uses regex derived from: https://gist.github.com/rob-murray/01d43581114a6b319034732bcbda29e1

Parameters:

column ¶
(str) –

The name of the column to treat

Returns:

str –

String to insert into SQL query

filter_duns_number ¶

filter_duns_number(column: str) -> str

Filter out non-DUNS numbers.

Returns a CASE WHEN filter on the specified column that will match only a Dun & Bradstreet DUNS number. Must be both:

9 characters
Numeric

Parameters:

column ¶
(str) –

The name of the column to treat

Returns:

str –

String to insert into SQL query

to_upper ¶

to_upper(column: str) -> str

All characters to uppercase.

Parameters:

column ¶
(str) –

The name of the column to treat

Returns:

str –

String to insert into SQL query

to_lower ¶

to_lower(column: str) -> str

All characters to lowercase.

Parameters:

column ¶
(str) –

The name of the column to treat

Returns:

str –

String to insert into SQL query

trim ¶

trim(column: str) -> str

Remove leading and trailing whitespace.

Parameters:

column ¶
(str) –

The name of the column to treat

Returns:

str –

String to insert into SQL query

get_digits_only ¶

get_digits_only(column: str) -> str

Extract digits only, including nonconsecutive.

Parameters:

column ¶
(str) –

The name of the column to treat

Returns:

str –

String to insert into SQL query

clean_basic_original ¶

Legacy cleaning rules inherited by the Company Matching Service.

Functions:

cms_original_clean_company_name_general –

Replicates the original Company Matching Service company name cleaning.
cms_original_clean_company_name_ch –

Replicates the original Company Matching Service company name cleaning.
cms_original_clean_postcode –

Replicates the original Company Matching Service postcode cleaning.
cms_original_clean_email –

Replicates the original Company Matching Service email cleaning.
cms_original_clean_ch_id –

Replicates the original Company Matching Service Companies House ID cleaning.
cms_original_clean_cdms_id –

Replicates the original Company Matching Service CDMS ID cleaning.

cms_original_clean_company_name_general ¶

cms_original_clean_company_name_general(column)

Replicates the original Company Matching Service company name cleaning.

Intended to help replicate the methodology for comparison.

The _general_name_simplification version from app/algorithm/sql_statements.py#L24.

Use with any data except Companies House.

cms_original_clean_company_name_ch ¶

cms_original_clean_company_name_ch(column)

Replicates the original Company Matching Service company name cleaning.

Intended to help replicate the methodology for comparison.

The _ch_name_simplification version from app/algorithm/sql_statements.py#L14.

Use with Companies House only.

cms_original_clean_postcode ¶

cms_original_clean_postcode(column)

Replicates the original Company Matching Service postcode cleaning.

Intended to help replicate the methodology for comparison.

cms_original_clean_email ¶

cms_original_clean_email(column)

Replicates the original Company Matching Service email cleaning.

Intended to help replicate the methodology for comparison.

cms_original_clean_ch_id ¶

cms_original_clean_ch_id(column)

Replicates the original Company Matching Service Companies House ID cleaning.

Intended to help replicate the methodology for comparison.

cms_original_clean_cdms_id ¶

cms_original_clean_cdms_id(column)

Replicates the original Company Matching Service CDMS ID cleaning.

Intended to help replicate the methodology for comparison.

utils ¶

Generic utilities for default cleaning functions.

Functions:

cleaning_function –

Takes a list of basic cleaning functions and composes them into a callable.
alias –

Takes a cleaning function and aliases the output to a new column.
unnest_renest –

Takes a cleaning function and adds unnesting and renesting either side of it.
select_cleaners –

Select cleaners from one or more (cleaner_dict, keys) pairs.

Attributes:

STOPWORDS –
ABBREVIATIONS –

STOPWORDS `module-attribute` ¶

STOPWORDS = [
    "limited",
    "uk",
    "company",
    "international",
    "group",
    "of",
    "the",
    "inc",
    "and",
    "plc",
    "corporation",
    "llp",
    "pvt",
    "gmbh",
    "u k",
    "pte",
    "usa",
    "bank",
    "b v",
    "bv",
]

ABBREVIATIONS `module-attribute` ¶

ABBREVIATIONS = {'co': 'company', 'ltd': 'limited'}

cleaning_function ¶

cleaning_function(*functions: Callable) -> Callable

Takes a list of basic cleaning functions and composes them into a callable.

Functions must be appropriate for a select statement.

Only for use with cleaning methods that take a single column as their argument. Consider using functools.partial to coerce functions that need arguments into this shape.

Parameters:

functions ¶
(Callable, default: () ) –

a list of functions appropriate for a select statement. See clean_basic for some examples

alias ¶

alias(function: Callable, alias: str) -> Callable

Takes a cleaning function and aliases the output to a new column.

Parameters:

function ¶
(Callable) –

an outut from a cleaning_function function
alias ¶
(str) –

the new column name to use

unnest_renest ¶

unnest_renest(function: Callable) -> Callable

Takes a cleaning function and adds unnesting and renesting either side of it.

Useful for applying the same function to an array where there are sub-functions that also use arrays, blocking list_transform.

Parameters:

function ¶
(Callable) –

an outut from a cleaning_function function

select_cleaners ¶

select_cleaners(
    *sources: tuple[dict, list[str]],
) -> dict[str, dict[str, Any]]

Select cleaners from one or more (cleaner_dict, keys) pairs.

Useful for storing a dictionary of cleaning funtions for a particular source, then building them dynamically based on the keys you want to use.

Examples:

from matchbox.client import clean
from matchbox.client.dags import LinkStep, StepInput
from matchbox.client.helpers.cleaner import cleaner, cleaners

foo_cleaners = {
    "company_name": cleaner(
        clean.company_name,
        {"column": foo.qualify_field("company_name")},
    ),
    "company_number": cleaner(
        clean.company_number,
        {"column": foo.qualify_field("company_number")},
    ),
}

bar_cleaners = {
    "company_name": cleaner(
        clean.company_name,
        {"column": bar.qualify_field("company_name")},
    ),
    "postcode": cleaner(
        clean.postcode,
        {"column": bar.qualify_field("postcode")},
    ),
}

my_linker = LinkStep(
    left=StepInput(
        select={
            foo: ["company_name"],
            bar: ["postcode"],
        },
        cleaners=select_cleaners(
            (foo_cleaners, ["company_name"]),
            (bar_cleaners, ["postcode"]),
        ),
        ...
    ),
    ...
)

Returns:

dict[str, dict[str, Any]] –

Dictionary matching the return type of cleaners()

Raises:

ValueError –

If a key already exists in the selected cleaners (duplicate key)
KeyError –

If a requested key doesn’t exist in the cleaner dictionary

Clean¶

matchbox.client.clean ¶

company_name ¶

df ¶

column ¶

column_secondary ¶

stopwords ¶

company_number ¶

df ¶

column ¶

drop ¶

df ¶

column ¶

extract_cdms_number_to_new ¶

df ¶

column ¶

new_column ¶

extract_company_number_to_new ¶

df ¶

column ¶

new_column ¶

extract_duns_number_to_new ¶

df ¶

column ¶

new_column ¶

postcode ¶

df ¶

column ¶

postcode_to_area ¶

df ¶

column ¶

remove_prefix ¶

df ¶

column ¶

prefix ¶

alias ¶

function ¶

alias ¶

cleaning_function ¶

functions ¶

unnest_renest ¶

function ¶

lib ¶

company_name ¶

df ¶

column ¶

column_secondary ¶

stopwords ¶

company_number ¶

df ¶

column ¶

postcode ¶

df ¶

column ¶

postcode_to_area ¶

df ¶

column ¶

extract_company_number_to_new ¶

df ¶

column ¶

new_column ¶

extract_duns_number_to_new ¶

df ¶

column ¶

new_column ¶

extract_cdms_number_to_new ¶

df ¶

column ¶

new_column ¶

drop ¶

df ¶

column ¶

remove_prefix ¶

df ¶

column ¶

prefix ¶

steps ¶

array_except ¶

column ¶

terms_to_remove ¶

`df` ¶

`column` ¶

`column_secondary` ¶

`stopwords` ¶

`df` ¶

`column` ¶

`df` ¶

`column` ¶

`df` ¶

`column` ¶

`new_column` ¶

`df` ¶

`column` ¶

`new_column` ¶

`df` ¶

`column` ¶

`new_column` ¶

`df` ¶

`column` ¶

`df` ¶

`column` ¶

`df` ¶

`column` ¶

`prefix` ¶

`function` ¶

`alias` ¶

`functions` ¶

`function` ¶

`df` ¶

`column` ¶

`column_secondary` ¶

`stopwords` ¶

`df` ¶

`column` ¶

`df` ¶

`column` ¶

`df` ¶

`column` ¶

`df` ¶

`column` ¶

`new_column` ¶

`df` ¶

`column` ¶

`new_column` ¶

`df` ¶

`column` ¶

`new_column` ¶

`df` ¶

`column` ¶

`df` ¶

`column` ¶

`prefix` ¶

`column` ¶

`terms_to_remove` ¶

`column` ¶

`terms_to_keep` ¶

`column` ¶

`column` ¶

`column` ¶

`column` ¶

`replacements` ¶

`column` ¶

`column` ¶

`column` ¶

`column` ¶

`column` ¶

`column` ¶

`column` ¶

`seperator` ¶

`column` ¶

`column` ¶

`column` ¶

`list_of_strings` ¶

`column` ¶

`list_of_strings` ¶

`column` ¶

`column` ¶

`stopwords` ¶

`column` ¶

`column` ¶