In this post we're going to draw on dynamic programming and the Needleman-Wunsch algorithm, traditionally used in bioinformatics to align protein sequences, to clean digitised company names from a machine learning pipeline as part of my work for Professor Dell. Whilst protein sequences and company names in Kanji seem, at first glance, dissimilar, the Needleman-Wunsch algorithm generalises extremely well to the problem at hand and saves hours of manual RA work at a comparable, if not greater, accuracy rate.
Our project's aim is to automate the digitisation of firm-level Japanese archival data from 1940-60. Currently, we've achieved an accuracy of 99.3% at classifying variable names/values in our validation set for the 1954 version of "Personnel Records" (PR1954 henceforth). The data looks something like this, with the book displayed on the left and corresponding column expanded on the right (courtesy of Jie Zhou):
The problem we're currently facing is twofold:
Unfortunately, whilst row classification is approaching near perfection for variable names and values, classification of company names is somewhat poorer. First, it's often difficult to distinguish company names from company addresses; second, optical character recognition (OCR) of company names can be pretty messy. The latter is a big issue when we want to match firms across publications: traditional fuzzy matching performs poorly compared to Western languages, since an average Japanese firm name of around 4 characters gives us far less to go on than a Roman-alphabet name, which typically contains many more characters.
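As a purely illustrative example (these names aren't from our data, and the ratio below is just a stand-in for whatever fuzzy-matching score you prefer), a single bad character costs a short name far more similarity than it costs a long one:

from difflib import SequenceMatcher

# One mis-OCR'ed character in a 4-character name vs one in a 19-character name.
print(SequenceMatcher(None, "日立製作", "日立製造").ratio())                         # 0.75
print(SequenceMatcher(None, "International Paper", "Internatianal Paper").ratio())  # ~0.95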
To overcome this we've also digitised the book index, which lists, in order, each company in the book and its corresponding page number. This gives us another source of information about company names - we just have to match the company names in the book against the names listed in the index. This would be pretty straightforward for 10 firms, say, but we need to perform it for around 9,000 firms per publication, across multiple publications, each of which could contain a litany of machine learning boo-boos: rows completely skipped by our OCR pipeline, characters mis-OCR'ed, characters skipped, multiple rows mis-segmented into one, rows misclassified as company names - you get the picture.
Therefore, our task is to align two noisy sequences of company names, where each sequence can also contain random additions or deletions due to various sources of ML classification noise.
Suppose we wish to match five firms: Amazon, Bayer, Comcast, Deutsche Bank and Ebay. Due to noise in the data cleaning pipeline there are various OCR errors; furthermore, Bayer is completely skipped in the book and Deutsche Bank is skipped in the index:
Ground Truth | Index | Book |
---|---|---|
Amazon | Amazann | mzon |
Bayer | Boyer | - |
Comcast | Camcoost | Coast |
Deutsche Bank | - | Deutsche k |
Ebay | Ebat | dbay |
Therefore we observe two sequences: (Amazann, Boyer, Camcoost, Ebat) and (mzon, Coast, Deutsche k, dbay). We can think of the sequence matching process as optimising a Bellman equation: $$V(book_t, index_t, skip_t^d) = \max_{skip_{t+1}^d} u(book_t, index_t, skip_t^d) + \beta V(\cdot; skip_{t+1}^d)$$
where $skip_{t+1}^d$, $d \in \{book, index\}$ is our choice variable determining if we believe a company has been skipped and we set $\beta = 1$. That is, we can break down the problem of matching all 9000 firms into much smaller subproblems - should we accept the current match or propose that an observation has been skipped in the book/index and gain flow utility $u$ with continuation value $V(\cdot; skip_{t+1}^d)$?
Matching sequences is a natural fit for this recursive formulation since the current value of the problem at time $t$ can be completely described by our state variables, $book_t$ and $index_t$, and the decision we make now, $skip_{t+1}^d$, will influence the value of the problem in the next stage.
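To make the recursion concrete, here's a minimal memoised sketch (not the project code - the names alignment_value and match_utility are my own, and match_utility stands in for the flow utility $u$ we define next):

from functools import lru_cache

def alignment_value(book, index, match_utility, gap_penalty=-0.5):
    """Best achievable total utility from aligning book[i:] against index[j:].
    A minimal memoised sketch of the Bellman recursion, not the pipeline code."""

    @lru_cache(maxsize=None)
    def V(i, j):
        # Base cases: once one sequence is exhausted, everything left in the
        # other sequence has to be treated as skipped.
        if i == len(book):
            return (len(index) - j) * gap_penalty
        if j == len(index):
            return (len(book) - i) * gap_penalty
        return max(
            match_utility(book[i], index[j]) + V(i + 1, j + 1),  # accept the current match
            gap_penalty + V(i + 1, j),  # propose that book[i] was skipped in the index
            gap_penalty + V(i, j + 1),  # propose that index[j] was skipped in the book
        )

    return V(0, 0)

Each call to $V(i, j)$ only depends on the three subproblems "ahead" of it, which is exactly the structure the grid-based solution further down exploits.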
In the traditional Needleman-Wunsch algorithm, $u(book_t, index_t, skip_t^d)$ is often assigned a cost/value using a simple rule: a fixed reward if the two base pairs match, a fixed penalty if they don't, and a gap penalty whenever one sequence is skipped.
We need to generalise this notion of similarity to account for the fact that $book_t$ and $index_t$ aren't base pairs drawn from $\{A, T, C, G\}$ but noisily OCR'ed strings. Fortunately, the concept of distance between strings is a well-explored topic, and we can simply use the Levenshtein distance normalised by the length of the longer string (so that the relative weight of the gap penalty doesn't vary with the length of the strings). This gives us the utility:
$$ u(book_t, index_t, skip_t^d) = \begin{cases} -\dfrac{lev(book_t,\ index_t)}{\max(|book_t|,\ |index_t|)} & \text{if } skip_t^d = \text{False} \\[6pt] -x & \text{if } skip_t^d = \text{True} \end{cases} $$

where $x$ is the gap penalty paid whenever we propose a skip.

We've described the problem using a formulation common to macroeconomics - now how do we solve it?
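Before turning to the solution method, here's what that utility looks like in code. This isn't the pipeline implementation - the levenshtein helper is hand-rolled purely so the snippet is self-contained, and match_utility is a name I've made up for illustration:

def levenshtein(a, b):
    """Plain dynamic-programming edit distance (insertions, deletions, substitutions)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                # delete a character from a
                            curr[j - 1] + 1,            # insert a character into a
                            prev[j - 1] + (ca != cb)))  # substitute (free if the characters agree)
        prev = curr
    return prev[-1]

def match_utility(book_name, index_name):
    """Negative normalised edit distance: 0 for identical strings, -1 for totally dissimilar ones."""
    return -levenshtein(book_name, index_name) / max(len(book_name), len(index_name))

match_utility("Camcoost", "Coast")  # -0.5, i.e. on a par with a gap penalty of -0.5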
Since the choice space is relatively simple (skip index / skip book / no skip) we can brute-force a solution by evaluating the Bellman equation over a scoring grid and tracing the best path through the grid from the bottom-right cell back to the top-left (image from Wikipedia; the exposition here is simplified somewhat, but I recommend checking out the link for more details):
Here a diagonal arrow corresponds to accepting a match, a horizontal arrow indicates skipping an entry in the "left" sequence, and a vertical arrow skipping an entry in the "top" sequence.
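Since I can't share the project code, here's a self-contained sketch of the grid-filling and traceback steps (my own function and variable names, reusing the match_utility helper above). On the toy English example below it should reproduce the two skips and the $-2.571$ alignment score reported later:

import numpy as np

def needleman_wunsch(book, index, match_utility, gap_penalty=-0.5):
    """Minimal Needleman-Wunsch sketch: fill the scoring grid, then trace the
    best path back from the bottom-right cell. Returns the aligned pairs (with
    None marking a skip) and the total alignment score."""
    n, m = len(book), len(index)
    score = np.zeros((n + 1, m + 1))
    score[:, 0] = gap_penalty * np.arange(n + 1)  # skipping leading book entries
    score[0, :] = gap_penalty * np.arange(m + 1)  # skipping leading index entries

    # Fill the grid: each cell is the best of match / skip-in-index / skip-in-book.
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            score[i, j] = max(
                score[i - 1, j - 1] + match_utility(book[i - 1], index[j - 1]),
                score[i - 1, j] + gap_penalty,
                score[i, j - 1] + gap_penalty,
            )

    # Trace back from the bottom-right cell to recover the optimal alignment.
    alignment, i, j = [], n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and np.isclose(
            score[i, j], score[i - 1, j - 1] + match_utility(book[i - 1], index[j - 1])
        ):
            alignment.append((book[i - 1], index[j - 1]))  # diagonal move: accept the match
            i, j = i - 1, j - 1
        elif i > 0 and np.isclose(score[i, j], score[i - 1, j] + gap_penalty):
            alignment.append((book[i - 1], None))  # this book entry has no counterpart in the index
            i -= 1
        else:
            alignment.append((None, index[j - 1]))  # this index entry has no counterpart in the book
            j -= 1
    return alignment[::-1], score[n, m]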
Now time to bring everything together and apply the above to our string sequences:
import pandas as pd
import numpy as np
from NW_Company_Names.NW_company_names import create_nw_df # code I can't really share at the moment
In the example we've created, the algorithm needs to recognise that Bayer and Deutsche Bank have both been skipped, and that Camcoost should therefore match Coast (both representing Comcast).
english_df = pd.DataFrame()
english_df["index"] = ["Amazann", "Boyer", "Camcoost", "Ebat"]
english_df["book"] = ["mzon", "Coast", "Deutsche k", "dbay"]
english_df
Calling the create_nw_df() function automatically prints the scoring matrix - unlike the Wikipedia example I don't display the actual strings in the first column/row since the Japanese company names get fairly large. Notice how the cost is now a float instead of an integer, since we're combining continuous output from the Levenshtein metric:
english_nw_df = create_nw_df(  # python code I can't share
    english_df[0:4],  # rows to use for the index column
    english_df[0:4],  # rows to use for the book column (this makes more sense in the actual implementation than here)
    "index",          # name of the column containing the index strings
    "book",           # name of the column containing the book strings
    gap_penalty = -0.5
)
The corresponding optimised sequence alignment from the algorithm:
english_nw_df[["sequence_1", "sequence_2", "sequence_text_x", "sequence_text_y"]]
It works - the algorithm has inserted a skip at rows 1 and 3 for the book and index respectively! The value/cost of the sequence alignment is given by the score in the bottom right cell of the scoring matrix, $-2.571$ in this case.
N = 10  # number of companies to compare in this example
pd.options.display.max_rows = int(round(N * 1.1))  # make sure all N rows are displayed
index_df = pd.read_csv("summarised_PR1954_index.csv")            # digitised book index
wide_df = pd.read_csv("PR1954_clean_wide_numeric_pipeline.csv")  # digitised book contents from the pipeline
Some data munging from the cleaning pipeline (we want to remove some sections of the book that don't deal with firms):
firm_wide_df = wide_df[wide_df["companyid"].str.contains("firm")]  # keep only firm rows from the book
firm_index_df = index_df[index_df["data_source_type"] == "firm"]   # keep only firm rows from the index
firm_index_df = firm_index_df.reset_index()
firm_wide_df = firm_wide_df.reset_index()
firm_wide_df = firm_wide_df.sort_values(by = ["companyid"])
Comparing the first 10 companies - the eagle-eyed (or those familiar with Japanese) will notice that row 7 appears in the index but not in the book:
side_by_side_comp = pd.DataFrame()
side_by_side_comp["book"] = firm_wide_df.iloc[0:N, 7].reset_index(drop = True)  # company names from the book
side_by_side_comp["index"] = firm_index_df.iloc[0:N, 17]                        # company names from the index
side_by_side_comp
There's a lot of noise we could clean up here before running the algorithm (removing the ㈱ company-type marker found in the book column, for instance) but we'll keep it in for this example as an additional sensitivity check/challenge.
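If we did want that clean-up it would be a one-liner along these lines (a sketch only - it assumes the book names sit in the "company name" column and isn't applied here):

firm_wide_df["company name"].str.replace("㈱", "", regex=False)  # preview only, nothing is assigned back

Now let's run the algorithm on the first 10 companies: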
japanese_nw_df = create_nw_df(
    firm_wide_df[0:N],   # first N rows of the book data
    firm_index_df[0:N],  # first N rows of the index data
    "company name",      # string column to align
    "text"               # string column to align
)
Checking the output again:
japanese_nw_df[["sequence_1", "sequence_2", "sequence_text_x", "sequence_text_y"]]
Row 7 has been correctly skipped in the book column!
We've found a pretty cool way to match two noisy sequences of strings. Furthermore, the coding required - whilst unfortunately I couldn't show it here - was minimal: we only had to rewrite the scoring metric and borrow an existing implementation of the Needleman-Wunsch algorithm in Python from the Wilke Lab. The analogue of skipped companies in our setting is genetic mutations causing "indels" - insertions or deletions in a DNA (I think?) or protein sequence.
Macro isn't my strong point, so if anyone has any recommendations I'd love to hear them. One drawback of my implementation is that the NW algorithm is $O(nm)$, where $n$ and $m$ are the lengths of the two sequences being matched - this effectively means compute time and memory are bounded by $O(9000^2)$, which definitely adds up.
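As a rough back-of-the-envelope illustration (my own arithmetic, not a profiled measurement), the dense scoring grid alone is already sizeable at that scale:

n = m = 9_000
cells = (n + 1) * (m + 1)           # one grid cell per (book prefix, index prefix) pair
print(f"{cells * 8 / 1e9:.2f} GB")  # roughly 0.65 GB of float64 scores before any traceback bookkeeping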