Clean¶
matchbox.client.clean
¶
Library of default cleaning functions.
Modules:
-
lib
–Implementation of default cleaning functions.
-
steps
–Low-level components of default cleaning functions.
-
utils
–Generic utilities for default cleaning functions.
Functions:
-
company_name
–Standard cleaning function for company names.
-
company_number
–Remove non-numbers, and then leading zeroes.
-
drop
–Drops the column from the dataframe.
-
extract_cdms_number_to_new
–Detects the CDMS number in a column and moves it to a new column.
-
extract_company_number_to_new
–Detects the Companies House CRN in a column and moves it to a new column.
-
extract_duns_number_to_new
–Detects the Dun & Bradstreet DUNS number in a column and moves it to a new column.
-
postcode
–Removes all punctuation, converts to upper, removes all spaces.
-
postcode_to_area
–Extracts postcode area from a postcode.
-
alias
–Takes a cleaning function and aliases the output to a new column.
-
cleaning_function
–Takes a list of basic cleaning functions and composes them into a callable.
-
unnest_renest
–Takes a cleaning function and adds unnesting and renesting either side of it.
company_name
¶
company_name(
df: DataFrame,
column: str,
column_secondary: str = None,
stopwords: str = STOPWORDS,
) -> DataFrame
Standard cleaning function for company names.
- Lower-case, remove punctuation and tokenise the company name into an array
- Split the tokens into ‘unusual’ words and ‘stopwords’, dedupe them and sort alphabetically
- Untokenise the unusual words back to a string
Parameters:
-
df
(DataFrame
) –a dataframe
-
column
(str
) –a column containing the company’s main name
-
column_secondary
(str
, default:None
) –a column containing an array of the company’s secondary names
-
stopwords
(str
, default:STOPWORDS
) –a list of stopwords to use for this clean
Returns:
-
dataframe
(DataFrame
) –the same as went in, but cleaned
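The three steps listed for company_name can be sketched in plain Python. This is only an illustration of the described behaviour on a single string, not the library's implementation (which works on a dataframe column); `clean_company_name` is a hypothetical name, and the stopword list here is a subset of the STOPWORDS attribute documented under utils.

```python
import string

# Subset of the documented STOPWORDS attribute, shortened for the example.
STOPWORDS = ["limited", "uk", "company", "group", "of", "the", "plc"]


def clean_company_name(name: str, stopwords: list[str] = STOPWORDS) -> str:
    """Hypothetical single-string version of the three documented steps."""
    # Step 1: lower case, remove ASCII punctuation, tokenise on whitespace.
    no_punct = name.lower().translate(str.maketrans("", "", string.punctuation))
    tokens = no_punct.split()
    # Step 2: keep the 'unusual' tokens (non-stopwords), dedupe, sort alphabetically.
    unusual = sorted(set(tokens) - set(stopwords))
    # Step 3: untokenise the unusual words back into a string.
    return " ".join(unusual)


print(clean_company_name("The Widget Company (UK) Limited"))
print(clean_company_name("Zeta & Alpha Trading PLC, Alpha"))
```

Because tokens are deduplicated and sorted, word order and repeated words in the input do not affect the result.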
company_number
¶
drop
¶
extract_cdms_number_to_new
¶
extract_cdms_number_to_new(
df: DataFrame, column: str, new_column: str
) -> DataFrame
Detects the CDMS number in a column and moves it to a new column.
Parameters:
-
df
(DataFrame
) –a dataframe
-
column
(str
) –a column containing some CDMS numbers
-
new_column
(str
) –the name of the column to add
Returns:
-
dataframe
(DataFrame
) –the same as went in with a new column for CDMS numbers
extract_company_number_to_new
¶
extract_company_number_to_new(
df: DataFrame, column: str, new_column: str
) -> DataFrame
Detects the Companies House CRN in a column and moves it to a new column.
Parameters:
-
df
(DataFrame
) –a dataframe
-
column
(str
) –a column containing some company numbers
-
new_column
(str
) –the name of the column to add
Returns:
-
dataframe
(DataFrame
) –the same as went in with a new column for CRNs
extract_duns_number_to_new
¶
extract_duns_number_to_new(
df: DataFrame, column: str, new_column: str
) -> DataFrame
Detects the Dun & Bradstreet DUNS number in a column and moves it to a new column.
Parameters:
-
df
(DataFrame
) –a dataframe
-
column
(str
) –a column containing some DUNS numbers
-
new_column
(str
) –the name of the column to add
Returns:
-
dataframe
(DataFrame
) –the same as went in with a new column for DUNS numbers
postcode
¶
postcode_to_area
¶
alias
¶
cleaning_function
¶
Takes a list of basic cleaning functions and composes them into a callable.
Functions must be appropriate for a select statement.
Only for use with cleaning methods that take a single column as their argument. Consider using functools.partial to coerce functions that need arguments into this shape.
Parameters:
unnest_renest
¶
lib
¶
Implementation of default cleaning functions.
Functions:
-
company_name
–Standard cleaning function for company names.
-
company_number
–Remove non-numbers, and then leading zeroes.
-
postcode
–Removes all punctuation, converts to upper, removes all spaces.
-
postcode_to_area
–Extracts postcode area from a postcode.
-
extract_company_number_to_new
–Detects the Companies House CRN in a column and moves it to a new column.
-
extract_duns_number_to_new
–Detects the Dun & Bradstreet DUNS number in a column and moves it to a new column.
-
extract_cdms_number_to_new
–Detects the CDMS number in a column and moves it to a new column.
-
drop
–Drops the column from the dataframe.
company_name
¶
company_name(
df: DataFrame,
column: str,
column_secondary: str = None,
stopwords: str = STOPWORDS,
) -> DataFrame
Standard cleaning function for company names.
- Lower-case, remove punctuation and tokenise the company name into an array
- Split the tokens into ‘unusual’ words and ‘stopwords’, dedupe them and sort alphabetically
- Untokenise the unusual words back to a string
Parameters:
-
df
(DataFrame
) –a dataframe
-
column
(str
) –a column containing the company’s main name
-
column_secondary
(str
, default:None
) –a column containing an array of the company’s secondary names
-
stopwords
(str
, default:STOPWORDS
) –a list of stopwords to use for this clean
Returns:
-
dataframe
(DataFrame
) –the same as went in, but cleaned
company_number
¶
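The summary for company_number (remove non-numbers, then leading zeroes) can be illustrated in plain Python. This is a minimal sketch of the described behaviour on a single string, not the library's implementation; `clean_company_number` is a hypothetical name.

```python
import re


def clean_company_number(value: str) -> str:
    """Hypothetical single-string version of the documented behaviour."""
    # Remove every character that is not a digit...
    digits = re.sub(r"\D", "", value)
    # ...then strip leading zeroes from what is left.
    return digits.lstrip("0")


print(clean_company_number("00 123-456"))
print(clean_company_number("SC012345"))
```

Under this reading, a prefixed number such as SC012345 loses both its letter prefix and its leading zero.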
postcode
¶
postcode_to_area
¶
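The summaries for postcode (remove punctuation, upper-case, remove spaces) and postcode_to_area (extract the postcode area) can be sketched together. This assumes UK postcodes, where the area is the leading one or two letters; the function names are hypothetical and the code is an illustration, not the library's implementation.

```python
import re
import string
from typing import Optional


def clean_postcode(value: str) -> str:
    """Remove ASCII punctuation, upper-case, then remove all whitespace."""
    no_punct = value.translate(str.maketrans("", "", string.punctuation))
    return re.sub(r"\s+", "", no_punct.upper())


def postcode_area(value: str) -> Optional[str]:
    """Return the leading one or two letters of a cleaned postcode, if any."""
    match = re.match(r"[A-Z]{1,2}", clean_postcode(value))
    return match.group(0) if match else None


print(clean_postcode(" sw1a-1aa. "))
print(postcode_area("M1 1AE"))
```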
extract_company_number_to_new
¶
extract_company_number_to_new(
df: DataFrame, column: str, new_column: str
) -> DataFrame
Detects the Companies House CRN in a column and moves it to a new column.
Parameters:
-
df
(DataFrame
) –a dataframe
-
column
(str
) –a column containing some company numbers
-
new_column
(str
) –the name of the column to add
Returns:
-
dataframe
(DataFrame
) –the same as went in with a new column for CRNs
extract_duns_number_to_new
¶
extract_duns_number_to_new(
df: DataFrame, column: str, new_column: str
) -> DataFrame
Detects the Dun & Bradstreet DUNS number in a column and moves it to a new column.
Parameters:
-
df
(DataFrame
) –a dataframe
-
column
(str
) –a column containing some DUNS numbers
-
new_column
(str
) –the name of the column to add
Returns:
-
dataframe
(DataFrame
) –the same as went in with a new column for DUNS numbers
extract_cdms_number_to_new
¶
extract_cdms_number_to_new(
df: DataFrame, column: str, new_column: str
) -> DataFrame
Detects the CDMS number in a column and moves it to a new column.
Parameters:
-
df
(DataFrame
) –a dataframe
-
column
(str
) –a column containing some CDMS numbers
-
new_column
(str
) –the name of the column to add
Returns:
-
dataframe
(DataFrame
) –the same as went in with a new column for CDMS numbers
steps
¶
Low-level components of default cleaning functions.
Modules:
-
clean_basic
–Low-level primitives supporting default cleaning functions.
-
clean_basic_original
–Legacy cleaning rules inherited by the Company Matching Service.
Functions:
-
array_except
–Remove terms from an array.
-
array_intersect
–Filter an array to only keep terms in a list.
-
clean_punctuation
–Removes all punctuation and spaces, trim, lowercase.
-
clean_punctuation_except_hyphens
–Remove all punctuation and spaces except hyphens, trim.
-
dedupe_and_sort
–De-duplicate an array of tokens and sort alphabetically.
-
expand_abbreviations
–Expand abbreviations found in the column.
-
filter_cdms_number
–Filter out non-CDMS numbers.
-
filter_company_number
–Filter out non-Companies House numbers.
-
filter_duns_number
–Filter out non-DUNS numbers.
-
get_digits_only
–Extract digits only, including nonconsecutive.
-
get_low_freq_char_sig
–Removes letters with a frequency of 5% or higher, and spaces.
-
get_postcode_area
–Extract the postcode area from a column.
-
list_join_to_string
–Join a list of strings into a single string.
-
periods_to_nothing
–Removes periods and replaces with nothing (U.K. -> UK).
-
punctuation_to_spaces
–Removes all punctuation and replaces with spaces.
-
regex_extract_list_of_strings
–Extract a list of strings from a column using regex.
-
regex_remove_list_of_strings
–Remove a list of strings from a column using regex.
-
remove_notnumbers_leadingzeroes
–Remove any char that is not a number, then remove all leading zeroes.
-
remove_stopwords
–A thin opinionated wrapper for array_except to clean the global stopwords list.
-
remove_whitespace
–Removes all whitespace.
-
to_lower
–All characters to lowercase.
-
to_upper
–All characters to uppercase.
-
tokenise
–Split the text in column into an array.
-
trim
–Remove leading and trailing whitespace.
-
cms_original_clean_cdms_id
–Replicates the original Company Matching Service CDMS ID cleaning.
-
cms_original_clean_ch_id
–Replicates the original Company Matching Service Companies House ID cleaning.
-
cms_original_clean_company_name_ch
–Replicates the original Company Matching Service company name cleaning.
-
cms_original_clean_company_name_general
–Replicates the original Company Matching Service company name cleaning.
-
cms_original_clean_email
–Replicates the original Company Matching Service email cleaning.
-
cms_original_clean_postcode
–Replicates the original Company Matching Service postcode cleaning.
array_except
¶
array_intersect
¶
clean_punctuation
¶
clean_punctuation_except_hyphens
¶
dedupe_and_sort
¶
expand_abbreviations
¶
expand_abbreviations(
column: str,
replacements: dict[str, str] = ABBREVIATIONS,
) -> str
Expand abbreviations found in the column.
Takes a dictionary where the keys are matches and the values are what to replace them with.
Matches only when the term is surrounded by regex word boundaries.
Parameters:
-
column
(str
) –the name of the column to clean
-
replacements
(dict[str, str]
, default:ABBREVIATIONS
) –a dictionary where keys are matches and values are what to replace them with
Returns:
-
str
–String to insert into SQL query
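The word-boundary rule described for expand_abbreviations can be illustrated with Python's `re` module. This is a sketch of the matching behaviour on a plain string, whereas the documented function returns a string for a SQL query; `expand` and the `REPLACEMENTS` dictionary are hypothetical, since the contents of the ABBREVIATIONS default are not shown in this reference.

```python
import re

# Hypothetical replacements, standing in for the library's ABBREVIATIONS default.
REPLACEMENTS = {"co": "company", "ltd": "limited", "intl": "international"}


def expand(text: str, replacements: dict[str, str] = REPLACEMENTS) -> str:
    """Replace each key with its value, but only where it is a whole word."""
    for match, replacement in replacements.items():
        # \b anchors the match at word boundaries, so "co" inside "cocoa" is left alone.
        text = re.sub(rf"\b{re.escape(match)}\b", replacement, text)
    return text


print(expand("cocoa co ltd"))
```

Without the word boundaries, the key "co" would also rewrite the first letters of "cocoa".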
filter_cdms_number
¶
Filter out non-CDMS numbers.
Returns a CASE WHEN filter on the specified column that will match only CDMS numbers. A value must be one of:
- 6 or 12 digits long
- Starting with ‘000’
- Starting with ‘ORG-’
Will return false positives on some CRNs when they are 8 digits long and begin with ‘000’.
Parameters:
Returns:
-
str
–String to insert into SQL query
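The matching rule and its false-positive caveat can be made concrete with a small Python predicate. This is one reading of the rule, in which meeting any single condition is enough; `looks_like_cdms` is a hypothetical name, the sample values are made up, and the documented function returns a SQL CASE WHEN string rather than a boolean.

```python
def looks_like_cdms(value: str) -> bool:
    """Hypothetical predicate: true if any of the three listed conditions holds."""
    six_or_twelve_digits = (
        value.isascii() and value.isdigit() and len(value) in (6, 12)
    )
    return (
        six_or_twelve_digits
        or value.startswith("000")
        or value.startswith("ORG-")
    )


print(looks_like_cdms("123456"))    # 6 digits
print(looks_like_cdms("00012345"))  # 8-digit value beginning with 000
print(looks_like_cdms("12345678"))  # 8 digits, no 000 prefix
```

The second call shows the caveat: an 8-digit CRN that begins with ‘000’ satisfies the prefix condition and so passes the filter.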
filter_company_number
¶
Filter out non-Companies House numbers.
Returns a CASE WHEN filter on the specified column that will match only Companies House numbers, CRNs.
Uses regex derived from: https://gist.github.com/rob-murray/01d43581114a6b319034732bcbda29e1
Parameters:
Returns:
-
str
–String to insert into SQL query
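A CASE WHEN filter of this kind can be sketched as a function that builds a SQL fragment from a column name and a regex. Everything here is an assumption for illustration: `case_when_filter` is a hypothetical helper, `SIMPLE_CRN` is a deliberately simplified pattern rather than the regex derived from the gist, the `else null` branch is a guess at the intended behaviour, and `regexp_matches` assumes a SQL dialect that provides it (DuckDB does).

```python
import re

# Simplified, hypothetical pattern: eight digits, or two letters then six digits.
# The regex the library derives from the linked gist covers more cases.
SIMPLE_CRN = r"^([0-9]{8}|[A-Z]{2}[0-9]{6})$"


def case_when_filter(column: str, pattern: str = SIMPLE_CRN) -> str:
    """Build a SQL fragment that keeps matching values and nulls the rest."""
    return (
        f"case when regexp_matches({column}, '{pattern}') "
        f"then {column} else null end"
    )


print(case_when_filter("crn"))
# The same pattern can be checked locally with Python's re module.
print(bool(re.match(SIMPLE_CRN, "SC123456")))
```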
filter_duns_number
¶
get_digits_only
¶
get_low_freq_char_sig
¶
get_postcode_area
¶
list_join_to_string
¶
periods_to_nothing
¶
punctuation_to_spaces
¶
regex_extract_list_of_strings
¶
regex_remove_list_of_strings
¶
remove_notnumbers_leadingzeroes
¶
remove_stopwords
¶
remove_whitespace
¶
to_lower
¶
to_upper
¶
tokenise
¶
trim
¶
cms_original_clean_cdms_id
¶
Replicates the original Company Matching Service CDMS ID cleaning.
Intended to help replicate the methodology for comparison.
cms_original_clean_ch_id
¶
Replicates the original Company Matching Service Companies House ID cleaning.
Intended to help replicate the methodology for comparison.
cms_original_clean_company_name_ch
¶
Replicates the original Company Matching Service company name cleaning.
Intended to help replicate the methodology for comparison.
The _ch_name_simplification version from app/algorithm/sql_statements.py#L14.
Use with Companies House only.
cms_original_clean_company_name_general
¶
Replicates the original Company Matching Service company name cleaning.
Intended to help replicate the methodology for comparison.
The _general_name_simplification version from app/algorithm/sql_statements.py#L24.
Use with any dataset except Companies House.
cms_original_clean_email
¶
Replicates the original Company Matching Service email cleaning.
Intended to help replicate the methodology for comparison.
cms_original_clean_postcode
¶
Replicates the original Company Matching Service postcode cleaning.
Intended to help replicate the methodology for comparison.
clean_basic
¶
Low-level primitives supporting default cleaning functions.
Functions:
-
remove_whitespace
–Removes all whitespace.
-
punctuation_to_spaces
–Removes all punctuation and replaces with spaces.
-
periods_to_nothing
–Removes periods and replaces with nothing (U.K. -> UK).
-
clean_punctuation
–Removes all punctuation and spaces, trim, lowercase.
-
clean_punctuation_except_hyphens
–Remove all punctuation and spaces except hyphens, trim.
-
expand_abbreviations
–Expand abbreviations found in the column.
-
tokenise
–Split the text in column into an array.
-
dedupe_and_sort
–De-duplicate an array of tokens and sort alphabetically.
-
remove_notnumbers_leadingzeroes
–Remove any char that is not a number, then remove all leading zeroes.
-
array_except
–Remove terms from an array.
-
array_intersect
–Filter an array to only keep terms in a list.
-
remove_stopwords
–A thin opinionated wrapper for array_except to clean the global stopwords list.
-
regex_remove_list_of_strings
–Remove a list of strings from a column using regex.
-
regex_extract_list_of_strings
–Extract a list of strings from a column using regex.
-
list_join_to_string
–Join a list of strings into a single string.
-
get_postcode_area
–Extract the postcode area from a column.
-
get_low_freq_char_sig
–Removes letters with a frequency of 5% or higher, and spaces.
-
filter_cdms_number
–Filter out non-CDMS numbers.
-
filter_company_number
–Filter out non-Companies House numbers.
-
filter_duns_number
–Filter out non-DUNS numbers.
-
to_upper
–All characters to uppercase.
-
to_lower
–All characters to lowercase.
-
trim
–Remove leading and trailing whitespace.
-
get_digits_only
–Extract digits only, including nonconsecutive.
remove_whitespace
¶
punctuation_to_spaces
¶
periods_to_nothing
¶
clean_punctuation
¶
clean_punctuation_except_hyphens
¶
expand_abbreviations
¶
expand_abbreviations(
column: str,
replacements: dict[str, str] = ABBREVIATIONS,
) -> str
Expand abbreviations found in the column.
Takes a dictionary where the keys are matches and the values are what to replace them with.
Matches only when the term is surrounded by regex word boundaries.
Parameters:
-
column
(str
) –the name of the column to clean
-
replacements
(dict[str, str]
, default:ABBREVIATIONS
) –a dictionary where keys are matches and values are what to replace them with
Returns:
-
str
–String to insert into SQL query
tokenise
¶
dedupe_and_sort
¶
remove_notnumbers_leadingzeroes
¶
array_except
¶
array_intersect
¶
remove_stopwords
¶
regex_remove_list_of_strings
¶
regex_extract_list_of_strings
¶
list_join_to_string
¶
get_postcode_area
¶
get_low_freq_char_sig
¶
filter_cdms_number
¶
Filter out non-CDMS numbers.
Returns a CASE WHEN filter on the specified column that will match only CDMS numbers. A value must be one of:
- 6 or 12 digits long
- Starting with ‘000’
- Starting with ‘ORG-’
Will return false positives on some CRNs when they are 8 digits long and begin with ‘000’.
Parameters:
Returns:
-
str
–String to insert into SQL query
filter_company_number
¶
Filter out non-Companies House numbers.
Returns a CASE WHEN filter on the specified column that will match only Companies House numbers, CRNs.
Uses regex derived from: https://gist.github.com/rob-murray/01d43581114a6b319034732bcbda29e1
Parameters:
Returns:
-
str
–String to insert into SQL query
filter_duns_number
¶
to_upper
¶
to_lower
¶
trim
¶
clean_basic_original
¶
Legacy cleaning rules inherited by the Company Matching Service.
Functions:
-
cms_original_clean_company_name_general
–Replicates the original Company Matching Service company name cleaning.
-
cms_original_clean_company_name_ch
–Replicates the original Company Matching Service company name cleaning.
-
cms_original_clean_postcode
–Replicates the original Company Matching Service postcode cleaning.
-
cms_original_clean_email
–Replicates the original Company Matching Service email cleaning.
-
cms_original_clean_ch_id
–Replicates the original Company Matching Service Companies House ID cleaning.
-
cms_original_clean_cdms_id
–Replicates the original Company Matching Service CDMS ID cleaning.
cms_original_clean_company_name_general
¶
Replicates the original Company Matching Service company name cleaning.
Intended to help replicate the methodology for comparison.
The _general_name_simplification version from app/algorithm/sql_statements.py#L24.
Use with any dataset except Companies House.
cms_original_clean_company_name_ch
¶
Replicates the original Company Matching Service company name cleaning.
Intended to help replicate the methodology for comparison.
The _ch_name_simplification version from app/algorithm/sql_statements.py#L14.
Use with Companies House only.
cms_original_clean_postcode
¶
Replicates the original Company Matching Service postcode cleaning.
Intended to help replicate the methodology for comparison.
cms_original_clean_email
¶
Replicates the original Company Matching Service email cleaning.
Intended to help replicate the methodology for comparison.
cms_original_clean_ch_id
¶
Replicates the original Company Matching Service Companies House ID cleaning.
Intended to help replicate the methodology for comparison.
cms_original_clean_cdms_id
¶
Replicates the original Company Matching Service CDMS ID cleaning.
Intended to help replicate the methodology for comparison.
utils
¶
Generic utilities for default cleaning functions.
Functions:
-
cleaning_function
–Takes a list of basic cleaning functions and composes them into a callable.
-
alias
–Takes a cleaning function and aliases the output to a new column.
-
unnest_renest
–Takes a cleaning function and adds unnesting and renesting either side of it.
Attributes:
STOPWORDS
module-attribute
¶
STOPWORDS = [
"limited",
"uk",
"company",
"international",
"group",
"of",
"the",
"inc",
"and",
"plc",
"corporation",
"llp",
"pvt",
"gmbh",
"u k",
"pte",
"usa",
"bank",
"b v",
"bv",
]
cleaning_function
¶
Takes a list of basic cleaning functions and composes them into a callable.
Functions must be appropriate for a select statement.
Only for use with cleaning methods that take a single column as their argument. Consider using functools.partial to coerce functions that need arguments into this shape.
Parameters:
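The composition idea, including the functools.partial suggestion, can be sketched as follows. This is a minimal illustration under stated assumptions, not the library's cleaning_function: `compose`, `sql_lower`, `sql_trim` and `sql_replace` are hypothetical stand-ins that each take a column expression and return a SQL expression string, and the real function's parameters and return type are not shown in this reference.

```python
from functools import partial, reduce
from typing import Callable


def sql_lower(column: str) -> str:
    """Hypothetical single-column step: wrap the expression in lower()."""
    return f"lower({column})"


def sql_trim(column: str) -> str:
    """Hypothetical single-column step: wrap the expression in trim()."""
    return f"trim({column})"


def sql_replace(column: str, old: str, new: str) -> str:
    """Hypothetical step that needs extra arguments besides the column."""
    return f"replace({column}, '{old}', '{new}')"


def compose(*functions: Callable[[str], str]) -> Callable[[str], str]:
    """Chain single-column steps, applying them left to right."""
    def composed(column: str) -> str:
        return reduce(lambda expr, fn: fn(expr), functions, column)
    return composed


# partial() fixes the extra arguments so the step takes only a column.
strip_dots = partial(sql_replace, old=".", new="")
clean = compose(sql_lower, strip_dots, sql_trim)

print(clean("name"))
```

Each step wraps the expression produced by the previous one, so the first function in the list ends up innermost in the resulting SQL.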