Skip to content

Clean

matchbox.client.clean

Library of default cleaning functions.

Modules:

  • lib

    Implementation of default cleaning functions.

  • steps

    Low-level components of default cleaning functions.

  • utils

    Generic utilities for default cleaning functions.

Functions:

  • company_name

    Standard cleaning function for company names.

  • company_number

    Remove non-numbers, and then leading zeroes.

  • drop

    Drops the column from the dataframe.

  • extract_cdms_number_to_new

    Detects the CDMS nuber in a column and moves it to a new column.

  • extract_company_number_to_new

    Detects the Companies House CRN in a column and moves it to a new column.

  • extract_duns_number_to_new

    Detects the Dun & Bradstreet DUNS nuber in a column and moves it to a new column.

  • postcode

    Removes all punctuation, converts to upper, removes all spaces.

  • postcode_to_area

    Extracts postcode area from a postcode.

  • alias

    Takes a cleaning function and aliases the output to a new column.

  • cleaning_function

    Takes a list of basic cleaning functions and composes them into a callable.

  • unnest_renest

    Takes a cleaning function and adds unnesting and renesting either side of it.

company_name

company_name(
    df: DataFrame,
    column: str,
    column_secondary: str = None,
    stopwords: str = STOPWORDS,
) -> DataFrame

Standard cleaning function for company names.

  • Lower case, remove punctuation & tokenise the company name into an array
  • Extract tokens into: ‘unusual’ and ‘stopwords’. Dedupe. Sort alphabetically
  • Untokenise the unusual words back to a string

Parameters:

  • df

    (DataFrame) –

    a dataframe

  • column

    (str) –

    a column containing the company’s main name

  • column_secondary

    (str, default: None ) –

    a column containing an array of the company’s secondary names

  • stopwords

    (str, default: STOPWORDS ) –

    a list of stopwords to use for this clean

Returns:

  • dataframe ( DataFrame ) –

    the same as went in, but cleaned

company_number

company_number(df: DataFrame, column: str) -> DataFrame

Remove non-numbers, and then leading zeroes.

Parameters:

  • df

    (DataFrame) –

    a dataframe

  • column

    (str) –

    a column containing a company number

Returns:

  • dataframe ( DataFrame ) –

    the same as went in, but cleaned

drop

drop(df: DataFrame, column: str) -> DataFrame

Drops the column from the dataframe.

Parameters:

  • df

    (DataFrame) –

    a dataframe

  • column

    (str) –

    a column

Returns:

  • dataframe ( DataFrame ) –

    the same as went in without the column

extract_cdms_number_to_new

extract_cdms_number_to_new(
    df: DataFrame, column: str, new_column: str
) -> DataFrame

Detects the CDMS nuber in a column and moves it to a new column.

Parameters:

  • df

    (DataFrame) –

    a dataframe

  • column

    (str) –

    a column containing some CDMS numbers

  • new_column

    (str) –

    the name of the column to add

Returns:

  • dataframe ( DataFrame ) –

    the same as went in with a new column for CDMS numbers

extract_company_number_to_new

extract_company_number_to_new(
    df: DataFrame, column: str, new_column: str
) -> DataFrame

Detects the Companies House CRN in a column and moves it to a new column.

Parameters:

  • df

    (DataFrame) –

    a dataframe

  • column

    (str) –

    a column containing some company numbers

  • new_column

    (str) –

    the name of the column to add

Returns:

  • dataframe ( DataFrame ) –

    the same as went in with a new column for CRNs

extract_duns_number_to_new

extract_duns_number_to_new(
    df: DataFrame, column: str, new_column: str
) -> DataFrame

Detects the Dun & Bradstreet DUNS nuber in a column and moves it to a new column.

Parameters:

  • df

    (DataFrame) –

    a dataframe

  • column

    (str) –

    a column containing some DUNS numbers

  • new_column

    (str) –

    the name of the column to add

Returns:

  • dataframe ( DataFrame ) –

    the same as went in with a new column for DUNs numbers

postcode

postcode(df: DataFrame, column: str) -> DataFrame

Removes all punctuation, converts to upper, removes all spaces.

Parameters:

  • df

    (DataFrame) –

    a dataframe

  • column

    (str) –

    a column containing a postcode

Returns:

  • dataframe ( DataFrame ) –

    the same as went in, but cleaned

postcode_to_area

postcode_to_area(df: DataFrame, column: str) -> DataFrame

Extracts postcode area from a postcode.

Parameters:

  • df

    (DataFrame) –

    a dataframe

  • column

    (str) –

    a column containing a postcode

Returns:

  • dataframe ( DataFrame ) –

    the same as went in, but cleaned

alias

Takes a cleaning function and aliases the output to a new column.

Parameters:

  • function

    (Callable) –

    an outut from a cleaning_function function

  • alias

    (str) –

    the new column name to use

cleaning_function

cleaning_function(*functions: Callable) -> Callable

Takes a list of basic cleaning functions and composes them into a callable.

Functions must be appropriate for a select statement.

Only for use with cleaning methods that take a single column as their argument. Consider using functools.partial to coerce functions that need arguments into this shape.

Parameters:

  • functions

    (Callable, default: () ) –

    a list of functions appropriate for a select statement. See clean_basic for some examples

unnest_renest

unnest_renest(function: Callable) -> Callable

Takes a cleaning function and adds unnesting and renesting either side of it.

Useful for applying the same function to an array where there are sub-functions that also use arrays, blocking list_transform.

Parameters:

  • function

    (Callable) –

    an outut from a cleaning_function function

lib

Implementation of default cleaning functions.

Functions:

company_name

company_name(
    df: DataFrame,
    column: str,
    column_secondary: str = None,
    stopwords: str = STOPWORDS,
) -> DataFrame

Standard cleaning function for company names.

  • Lower case, remove punctuation & tokenise the company name into an array
  • Extract tokens into: ‘unusual’ and ‘stopwords’. Dedupe. Sort alphabetically
  • Untokenise the unusual words back to a string

Parameters:

  • df
    (DataFrame) –

    a dataframe

  • column
    (str) –

    a column containing the company’s main name

  • column_secondary
    (str, default: None ) –

    a column containing an array of the company’s secondary names

  • stopwords
    (str, default: STOPWORDS ) –

    a list of stopwords to use for this clean

Returns:

  • dataframe ( DataFrame ) –

    the same as went in, but cleaned

company_number

company_number(df: DataFrame, column: str) -> DataFrame

Remove non-numbers, and then leading zeroes.

Parameters:

  • df
    (DataFrame) –

    a dataframe

  • column
    (str) –

    a column containing a company number

Returns:

  • dataframe ( DataFrame ) –

    the same as went in, but cleaned

postcode

postcode(df: DataFrame, column: str) -> DataFrame

Removes all punctuation, converts to upper, removes all spaces.

Parameters:

  • df
    (DataFrame) –

    a dataframe

  • column
    (str) –

    a column containing a postcode

Returns:

  • dataframe ( DataFrame ) –

    the same as went in, but cleaned

postcode_to_area

postcode_to_area(df: DataFrame, column: str) -> DataFrame

Extracts postcode area from a postcode.

Parameters:

  • df
    (DataFrame) –

    a dataframe

  • column
    (str) –

    a column containing a postcode

Returns:

  • dataframe ( DataFrame ) –

    the same as went in, but cleaned

extract_company_number_to_new

extract_company_number_to_new(
    df: DataFrame, column: str, new_column: str
) -> DataFrame

Detects the Companies House CRN in a column and moves it to a new column.

Parameters:

  • df
    (DataFrame) –

    a dataframe

  • column
    (str) –

    a column containing some company numbers

  • new_column
    (str) –

    the name of the column to add

Returns:

  • dataframe ( DataFrame ) –

    the same as went in with a new column for CRNs

extract_duns_number_to_new

extract_duns_number_to_new(
    df: DataFrame, column: str, new_column: str
) -> DataFrame

Detects the Dun & Bradstreet DUNS nuber in a column and moves it to a new column.

Parameters:

  • df
    (DataFrame) –

    a dataframe

  • column
    (str) –

    a column containing some DUNS numbers

  • new_column
    (str) –

    the name of the column to add

Returns:

  • dataframe ( DataFrame ) –

    the same as went in with a new column for DUNs numbers

extract_cdms_number_to_new

extract_cdms_number_to_new(
    df: DataFrame, column: str, new_column: str
) -> DataFrame

Detects the CDMS nuber in a column and moves it to a new column.

Parameters:

  • df
    (DataFrame) –

    a dataframe

  • column
    (str) –

    a column containing some CDMS numbers

  • new_column
    (str) –

    the name of the column to add

Returns:

  • dataframe ( DataFrame ) –

    the same as went in with a new column for CDMS numbers

drop

drop(df: DataFrame, column: str) -> DataFrame

Drops the column from the dataframe.

Parameters:

  • df
    (DataFrame) –

    a dataframe

  • column
    (str) –

    a column

Returns:

  • dataframe ( DataFrame ) –

    the same as went in without the column

steps

Low-level components of default cleaning functions.

Modules:

  • clean_basic

    Low-level primitives supporting default cleaning functions.

  • clean_basic_original

    Legacy cleaning rules inherited by the Company Matching Service.

Functions:

array_except

array_except(
    column: str, terms_to_remove: list[str]
) -> str

Remove terms from an array.

Parameters:

  • column
    (str) –

    The name of the column to treat

  • terms_to_remove
    (list[str]) –

    A list of terms to remove

Returns:

  • str

    String to insert into SQL query

array_intersect

array_intersect(
    column: str, terms_to_keep: list[str]
) -> str

Filter an array to only keep terms in a list.

Parameters:

  • column
    (str) –

    The name of the column to treat

  • terms_to_keep
    (list[str]) –

    A list of terms to keep

Returns:

  • str

    String to insert into SQL query

clean_punctuation

clean_punctuation(column: str) -> str

Removes all punctuation and spaces, trim, lowercase.

  • Set to lower case
  • Remove punctuation
  • Multiple to single space replace.
  • Trim leading and trailing spaces.

Parameters:

  • column
    (str) –

    The name of the column to treat

Returns:

  • str

    String to insert into SQL query

clean_punctuation_except_hyphens

clean_punctuation_except_hyphens(column: str) -> str

Revove all punctuation and spaces except hyphens, trim.

Useful for cleaning reference numbers.

Parameters:

  • column
    (str) –

    The name of the column to treat

Returns:

  • str

    String to insert into SQL query

dedupe_and_sort

dedupe_and_sort(column: str) -> str

De-duplicate an array of tokens and sort alphabetically.

Parameters:

  • column
    (str) –

    The name of the column to treat

Returns:

  • str

    String to insert into SQL query

expand_abbreviations

expand_abbreviations(
    column: str,
    replacements: dict[str, str] = ABBREVIATIONS,
) -> str

Expand abbreviations found in the column.

Takes a dictionary where the keys are matches and the values are what to replace them with.

Matches only when term is surrounded by regex word boundaries.

Parameters:

  • column
    (str) –

    the name of the column to clean

  • replacements
    (dict[str, str], default: ABBREVIATIONS ) –

    a dictionary where keys are matches and values are what the replace them with

Returns:

  • str

    String to insert into SQL query

filter_cdms_number

filter_cdms_number(column: str) -> str

Filter out non-CDMS numbers.

Returns a CASE WHEN filter on the specified column that will match only CDMS numbers. Must be either:

  • 6 or 12 digits long
  • Start with ‘000’
  • Start with ‘ORG-‘

Will return false positives on some CRN numbers when they are 8 digits long and begin with ‘000’.

Parameters:

  • column
    (str) –

    The name of the column to treat

Returns:

  • str

    String to insert into SQL query

filter_company_number

filter_company_number(column: str) -> str

Filter out non-Companies House numbers.

Returns a CASE WHEN filter on the specified column that will match only Companies House numbers, CRNs.

Uses regex derived from: https://gist.github.com/rob-murray/01d43581114a6b319034732bcbda29e1

Parameters:

  • column
    (str) –

    The name of the column to treat

Returns:

  • str

    String to insert into SQL query

filter_duns_number

filter_duns_number(column: str) -> str

Filter out non-DUNS numbers.

Returns a CASE WHEN filter on the specified column that will match only a Dun & Bradstreet DUNS number. Must be both:

  • 9 characters
  • Numeric

Parameters:

  • column
    (str) –

    The name of the column to treat

Returns:

  • str

    String to insert into SQL query

get_digits_only

get_digits_only(column: str) -> str

Extract digits only, including nonconsecutive.

Parameters:

  • column
    (str) –

    The name of the column to treat

Returns:

  • str

    String to insert into SQL query

get_low_freq_char_sig

get_low_freq_char_sig(column: str) -> str

Removes letters with a frequency of 5% or higher, and spaces.

See https://en.wikipedia.org/wiki/Letter_frequency for details.

Parameters:

  • column
    (str) –

    The name of the column to treat

Returns:

  • str

    String to insert into SQL query

get_postcode_area

get_postcode_area(column: str) -> str

Extract the postcode area from a column.

Parameters:

  • column
    (str) –

    The name of the column to treat

Returns:

  • str

    String to insert into SQL query

list_join_to_string

list_join_to_string(
    column: str, seperator: str = " "
) -> str

Join a list of strings into a single string.

Parameters:

  • column
    (str) –

    The name of the column to treat

  • seperator
    (str, default: ' ' ) –

    The string to use to join the list

Returns:

  • str

    String to insert into SQL query

periods_to_nothing

periods_to_nothing(column: str) -> str

Removes periods and replaces with nothing (U.K. -> UK).

Parameters:

  • column
    (str) –

    The name of the column to treat

Returns:

  • str

    String to insert into SQL query

punctuation_to_spaces

punctuation_to_spaces(column: str) -> str

Removes all punctuation and replaces with spaces.

Parameters:

  • column
    (str) –

    The name of the column to treat

Returns:

  • str

    String to insert into SQL query

regex_extract_list_of_strings

regex_extract_list_of_strings(
    column: str, list_of_strings: list[str]
) -> str

Extract a list of strings from a column using regex.

Parameters:

  • column
    (str) –

    The name of the column to treat

  • list_of_strings
    (list[str]) –

    A list of strings to extract

Returns:

  • str

    String to insert into SQL query

regex_remove_list_of_strings

regex_remove_list_of_strings(
    column: str, list_of_strings: list[str]
) -> str

Remove a list of strings from a column using regex.

Parameters:

  • column
    (str) –

    The name of the column to treat

  • list_of_strings
    (list[str]) –

    A list of strings to remove

Returns:

  • str

    String to insert into SQL query

remove_notnumbers_leadingzeroes

remove_notnumbers_leadingzeroes(column: str) -> str

Remove any char that is not a number, then remove all leading zeroes.

Parameters:

  • column
    (str) –

    The name of the column to treat

Returns:

  • str

    String to insert into SQL query

remove_stopwords

remove_stopwords(
    column: str, stopwords: list[str] = STOPWORDS
) -> str

A thin optinionated wrapper for array_except to clean the global stopwords list.

Parameters:

  • column
    (str) –

    The name of the column to treat

  • stopwords
    (list[str], default: STOPWORDS ) –

    A list of terms to remove

Returns:

  • str

    String to insert into SQL query

remove_whitespace

remove_whitespace(column: str) -> str

Removes all whitespaces.

Parameters:

  • column
    (str) –

    The name of the column to treat

Returns:

  • str

    String to insert into SQL query

to_lower

to_lower(column: str) -> str

All characters to lowercase.

Parameters:

  • column
    (str) –

    The name of the column to treat

Returns:

  • str

    String to insert into SQL query

to_upper

to_upper(column: str) -> str

All characters to uppercase.

Parameters:

  • column
    (str) –

    The name of the column to treat

Returns:

  • str

    String to insert into SQL query

tokenise

tokenise(column: str) -> str

Split the text in column into an array.

Uses any char that is not alphanumeric as delimeter.

Parameters:

  • column
    (str) –

    The name of the column to treat

Returns:

  • str

    String to insert into SQL query

trim

trim(column: str) -> str

Remove leading and trailing whitespace.

Parameters:

  • column
    (str) –

    The name of the column to treat

Returns:

  • str

    String to insert into SQL query

cms_original_clean_cdms_id

cms_original_clean_cdms_id(column)

Replicates the original Company Matching Service CDMS ID cleaning.

Intended to help replicate the methodology for comparison.

cms_original_clean_ch_id

cms_original_clean_ch_id(column)

Replicates the original Company Matching Service Companies House ID cleaning.

Intended to help replicate the methodology for comparison.

cms_original_clean_company_name_ch

cms_original_clean_company_name_ch(column)

Replicates the original Company Matching Service company name cleaning.

Intended to help replicate the methodology for comparison.

The _ch_name_simplification version from app/algorithm/sql_statements.py#L14.

Use with Companies House only.

cms_original_clean_company_name_general

cms_original_clean_company_name_general(column)

Replicates the original Company Matching Service company name cleaning.

Intended to help replicate the methodology for comparison.

The _general_name_simplification version from app/algorithm/sql_statements.py#L24.

Use with any dataset except Companies House.

cms_original_clean_email

cms_original_clean_email(column)

Replicates the original Company Matching Service email cleaning.

Intended to help replicate the methodology for comparison.

cms_original_clean_postcode

cms_original_clean_postcode(column)

Replicates the original Company Matching Service postcode cleaning.

Intended to help replicate the methodology for comparison.

clean_basic

Low-level primitives supporting default cleaning functions.

Functions:

remove_whitespace
remove_whitespace(column: str) -> str

Removes all whitespaces.

Parameters:

  • column
    (str) –

    The name of the column to treat

Returns:

  • str

    String to insert into SQL query

punctuation_to_spaces
punctuation_to_spaces(column: str) -> str

Removes all punctuation and replaces with spaces.

Parameters:

  • column
    (str) –

    The name of the column to treat

Returns:

  • str

    String to insert into SQL query

periods_to_nothing
periods_to_nothing(column: str) -> str

Removes periods and replaces with nothing (U.K. -> UK).

Parameters:

  • column
    (str) –

    The name of the column to treat

Returns:

  • str

    String to insert into SQL query

clean_punctuation
clean_punctuation(column: str) -> str

Removes all punctuation and spaces, trim, lowercase.

  • Set to lower case
  • Remove punctuation
  • Multiple to single space replace.
  • Trim leading and trailing spaces.

Parameters:

  • column
    (str) –

    The name of the column to treat

Returns:

  • str

    String to insert into SQL query

clean_punctuation_except_hyphens
clean_punctuation_except_hyphens(column: str) -> str

Revove all punctuation and spaces except hyphens, trim.

Useful for cleaning reference numbers.

Parameters:

  • column
    (str) –

    The name of the column to treat

Returns:

  • str

    String to insert into SQL query

expand_abbreviations
expand_abbreviations(
    column: str,
    replacements: dict[str, str] = ABBREVIATIONS,
) -> str

Expand abbreviations found in the column.

Takes a dictionary where the keys are matches and the values are what to replace them with.

Matches only when term is surrounded by regex word boundaries.

Parameters:

  • column
    (str) –

    the name of the column to clean

  • replacements
    (dict[str, str], default: ABBREVIATIONS ) –

    a dictionary where keys are matches and values are what the replace them with

Returns:

  • str

    String to insert into SQL query

tokenise
tokenise(column: str) -> str

Split the text in column into an array.

Uses any char that is not alphanumeric as delimeter.

Parameters:

  • column
    (str) –

    The name of the column to treat

Returns:

  • str

    String to insert into SQL query

dedupe_and_sort
dedupe_and_sort(column: str) -> str

De-duplicate an array of tokens and sort alphabetically.

Parameters:

  • column
    (str) –

    The name of the column to treat

Returns:

  • str

    String to insert into SQL query

remove_notnumbers_leadingzeroes
remove_notnumbers_leadingzeroes(column: str) -> str

Remove any char that is not a number, then remove all leading zeroes.

Parameters:

  • column
    (str) –

    The name of the column to treat

Returns:

  • str

    String to insert into SQL query

array_except
array_except(
    column: str, terms_to_remove: list[str]
) -> str

Remove terms from an array.

Parameters:

  • column
    (str) –

    The name of the column to treat

  • terms_to_remove
    (list[str]) –

    A list of terms to remove

Returns:

  • str

    String to insert into SQL query

array_intersect
array_intersect(
    column: str, terms_to_keep: list[str]
) -> str

Filter an array to only keep terms in a list.

Parameters:

  • column
    (str) –

    The name of the column to treat

  • terms_to_keep
    (list[str]) –

    A list of terms to keep

Returns:

  • str

    String to insert into SQL query

remove_stopwords
remove_stopwords(
    column: str, stopwords: list[str] = STOPWORDS
) -> str

A thin optinionated wrapper for array_except to clean the global stopwords list.

Parameters:

  • column
    (str) –

    The name of the column to treat

  • stopwords
    (list[str], default: STOPWORDS ) –

    A list of terms to remove

Returns:

  • str

    String to insert into SQL query

regex_remove_list_of_strings
regex_remove_list_of_strings(
    column: str, list_of_strings: list[str]
) -> str

Remove a list of strings from a column using regex.

Parameters:

  • column
    (str) –

    The name of the column to treat

  • list_of_strings
    (list[str]) –

    A list of strings to remove

Returns:

  • str

    String to insert into SQL query

regex_extract_list_of_strings
regex_extract_list_of_strings(
    column: str, list_of_strings: list[str]
) -> str

Extract a list of strings from a column using regex.

Parameters:

  • column
    (str) –

    The name of the column to treat

  • list_of_strings
    (list[str]) –

    A list of strings to extract

Returns:

  • str

    String to insert into SQL query

list_join_to_string
list_join_to_string(
    column: str, seperator: str = " "
) -> str

Join a list of strings into a single string.

Parameters:

  • column
    (str) –

    The name of the column to treat

  • seperator
    (str, default: ' ' ) –

    The string to use to join the list

Returns:

  • str

    String to insert into SQL query

get_postcode_area
get_postcode_area(column: str) -> str

Extract the postcode area from a column.

Parameters:

  • column
    (str) –

    The name of the column to treat

Returns:

  • str

    String to insert into SQL query

get_low_freq_char_sig
get_low_freq_char_sig(column: str) -> str

Removes letters with a frequency of 5% or higher, and spaces.

See https://en.wikipedia.org/wiki/Letter_frequency for details.

Parameters:

  • column
    (str) –

    The name of the column to treat

Returns:

  • str

    String to insert into SQL query

filter_cdms_number
filter_cdms_number(column: str) -> str

Filter out non-CDMS numbers.

Returns a CASE WHEN filter on the specified column that will match only CDMS numbers. Must be either:

  • 6 or 12 digits long
  • Start with ‘000’
  • Start with ‘ORG-‘

Will return false positives on some CRN numbers when they are 8 digits long and begin with ‘000’.

Parameters:

  • column
    (str) –

    The name of the column to treat

Returns:

  • str

    String to insert into SQL query

filter_company_number
filter_company_number(column: str) -> str

Filter out non-Companies House numbers.

Returns a CASE WHEN filter on the specified column that will match only Companies House numbers, CRNs.

Uses regex derived from: https://gist.github.com/rob-murray/01d43581114a6b319034732bcbda29e1

Parameters:

  • column
    (str) –

    The name of the column to treat

Returns:

  • str

    String to insert into SQL query

filter_duns_number
filter_duns_number(column: str) -> str

Filter out non-DUNS numbers.

Returns a CASE WHEN filter on the specified column that will match only a Dun & Bradstreet DUNS number. Must be both:

  • 9 characters
  • Numeric

Parameters:

  • column
    (str) –

    The name of the column to treat

Returns:

  • str

    String to insert into SQL query

to_upper
to_upper(column: str) -> str

All characters to uppercase.

Parameters:

  • column
    (str) –

    The name of the column to treat

Returns:

  • str

    String to insert into SQL query

to_lower
to_lower(column: str) -> str

All characters to lowercase.

Parameters:

  • column
    (str) –

    The name of the column to treat

Returns:

  • str

    String to insert into SQL query

trim
trim(column: str) -> str

Remove leading and trailing whitespace.

Parameters:

  • column
    (str) –

    The name of the column to treat

Returns:

  • str

    String to insert into SQL query

get_digits_only
get_digits_only(column: str) -> str

Extract digits only, including nonconsecutive.

Parameters:

  • column
    (str) –

    The name of the column to treat

Returns:

  • str

    String to insert into SQL query

clean_basic_original

Legacy cleaning rules inherited by the Company Matching Service.

Functions:

cms_original_clean_company_name_general
cms_original_clean_company_name_general(column)

Replicates the original Company Matching Service company name cleaning.

Intended to help replicate the methodology for comparison.

The _general_name_simplification version from app/algorithm/sql_statements.py#L24.

Use with any dataset except Companies House.

cms_original_clean_company_name_ch
cms_original_clean_company_name_ch(column)

Replicates the original Company Matching Service company name cleaning.

Intended to help replicate the methodology for comparison.

The _ch_name_simplification version from app/algorithm/sql_statements.py#L14.

Use with Companies House only.

cms_original_clean_postcode
cms_original_clean_postcode(column)

Replicates the original Company Matching Service postcode cleaning.

Intended to help replicate the methodology for comparison.

cms_original_clean_email
cms_original_clean_email(column)

Replicates the original Company Matching Service email cleaning.

Intended to help replicate the methodology for comparison.

cms_original_clean_ch_id
cms_original_clean_ch_id(column)

Replicates the original Company Matching Service Companies House ID cleaning.

Intended to help replicate the methodology for comparison.

cms_original_clean_cdms_id
cms_original_clean_cdms_id(column)

Replicates the original Company Matching Service CDMS ID cleaning.

Intended to help replicate the methodology for comparison.

utils

Generic utilities for default cleaning functions.

Functions:

  • cleaning_function

    Takes a list of basic cleaning functions and composes them into a callable.

  • alias

    Takes a cleaning function and aliases the output to a new column.

  • unnest_renest

    Takes a cleaning function and adds unnesting and renesting either side of it.

Attributes:

STOPWORDS module-attribute

STOPWORDS = [
    "limited",
    "uk",
    "company",
    "international",
    "group",
    "of",
    "the",
    "inc",
    "and",
    "plc",
    "corporation",
    "llp",
    "pvt",
    "gmbh",
    "u k",
    "pte",
    "usa",
    "bank",
    "b v",
    "bv",
]

ABBREVIATIONS module-attribute

ABBREVIATIONS = {'co': 'company', 'ltd': 'limited'}

cleaning_function

cleaning_function(*functions: Callable) -> Callable

Takes a list of basic cleaning functions and composes them into a callable.

Functions must be appropriate for a select statement.

Only for use with cleaning methods that take a single column as their argument. Consider using functools.partial to coerce functions that need arguments into this shape.

Parameters:

  • functions
    (Callable, default: () ) –

    a list of functions appropriate for a select statement. See clean_basic for some examples

alias

Takes a cleaning function and aliases the output to a new column.

Parameters:

  • function
    (Callable) –

    an outut from a cleaning_function function

  • alias
    (str) –

    the new column name to use

unnest_renest

unnest_renest(function: Callable) -> Callable

Takes a cleaning function and adds unnesting and renesting either side of it.

Useful for applying the same function to an array where there are sub-functions that also use arrays, blocking list_transform.

Parameters:

  • function
    (Callable) –

    an outut from a cleaning_function function