In this post we're going to draw on dynamic programming and the Needleman-Wunsch algorithm, traditionally used in bioinformatics to align sequences of proteins, to clean digitised company names from a machine learning pipeline as part of my work for Professor Dell. Whilst sequences of proteins and company names in Kanji seem dissimilar at first glance, the Needleman-Wunsch algorithm generalises extremely well to the problem at hand and saves hours of manual RA work at a comparable, if not greater, accuracy rate.

The Problem

Our project's aim is to automate the digitisation of firm-level Japanese archival data from 1940-60. Currently, we've achieved an accuracy of 99.3% at classifying variable names/values in our validation set for the 1954 version of "Personnel Records" (PR1954 henceforth). The data looks something like this, with the book displayed on the left and corresponding column expanded on the right (courtesy of Jie Zhou):

The problem we're currently facing is twofold:

  • We want to ensure we're not missing any companies in the book.
  • We want to improve the detection/cleaning of company names so that matching firms across publications is easier.

Unfortunately, whilst row classification is approaching near perfection for variable names and values, classification of company names is somewhat poorer. It's often difficult to distinguish company names from company addresses, and optical character recognition (OCR) of company names can be pretty messy. The latter is a big issue when we want to match firms across publications, and traditional fuzzy matching performs poorly compared to Western languages: if the average Japanese firm name is only 4 characters long, a single mis-read character already changes a quarter of the name, whereas a firm name written in the Roman alphabet typically contains many more characters to match on.

To overcome this we've also digitised the book index, which lists, in order, each company in the book and its corresponding page number. This gives us another source of information on company names - we just have to match the company names in the book against the names listed in the index. This would be pretty straightforward for, say, 10 firms, but we need to perform it for around 9,000 firms per publication, across multiple publications, each of which can contain a litany of machine learning boo-boos: rows completely skipped by our OCR pipeline, characters mis-OCR'ed, characters skipped, multiple rows mis-segmented into one, rows misclassified as company names - you get the picture.

Therefore, our task is to align two noisy sequences of company names, where each sequence can also contain random additions or deletions due to various sources of ML classification noise.

The Solution

Suppose we wish to match five firms: Amazon, Bayer, Comcast, Deutsche Bank and Ebay. Due to noise in the data cleaning pipeline there are various OCR errors; furthermore, Bayer is completely skipped in the book and Deutsche is skipped in the index:

Ground Truth    Index      Book
Amazon          Amazann    mzon
Bayer           Boyer      -
Comcast         Camcoost   Coast
Deutsche Bank   -          Deutsche k
Ebay            Ebat       dbay

Therefore we observe two sequences: Amazann, Boyer, Camcoost, Ebat and mzon, Coast, Deutsche k, dbay. We can think of the sequence matching process as optimising a Bellman equation: $$V(book_t, index_t, skip_t^d) = \max_{skip_{t+1}^d} u(book_t, index_t, skip_t^d) + \beta V(\cdot; skip_{t+1}^d)$$

where $skip_{t+1}^d$, $d \in \{book, index\}$ is our choice variable determining if we believe a company has been skipped and we set $\beta = 1$. That is, we can break down the problem of matching all 9000 firms into much smaller subproblems - should we accept the current match or propose that an observation has been skipped in the book/index and gain flow utility $u$ with continuation value $V(\cdot; skip_{t+1}^d)$?

Matching sequences is a natural fit for this recursive formulation since the current value of the problem at time $t$ can be completely described by our state variables, $book_t, index_t$, and the decision we make now, $skip_{t+1}^d$, will influence the value of the problem in the next stage.
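To make the recursion concrete, here's a minimal sketch of the Bellman equation in Python. This is illustrative rather than the project's actual implementation - the names align_value, similarity and gap_penalty are mine, and the similarity function (which scores how well two strings match) is defined just below.

from functools import lru_cache

def align_value(book, index, similarity, gap_penalty=-0.5):
    """Value of optimally aligning book[b:] against index[i:], with beta = 1."""

    @lru_cache(maxsize=None)
    def V(b, i):
        # If one sequence is exhausted, every remaining entry must be skipped.
        if b == len(book):
            return gap_penalty * (len(index) - i)
        if i == len(index):
            return gap_penalty * (len(book) - b)
        return max(
            similarity(book[b], index[i]) + V(b + 1, i + 1),  # accept the current match
            gap_penalty + V(b + 1, i),  # propose that book[b] was skipped in the index
            gap_penalty + V(b, i + 1),  # propose that index[i] was skipped in the book
        )

    return V(0, 0)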

In the traditional Needleman-Wunsch algorithm $u(book_t, index_t, skip_t^d)$ is often simply assigned a cost/value using the following rule:

  • $+1$ if the base pairs in a protein sequence match.
  • $-1$ if the base pairs do not match.
  • $-1$ for introducing a skip.

We need to generalise this notion of similarity to account for the fact that $book_t$ and $index_t$ aren't base-pairs drawn from $\{A, T, C, G\}$ but noisily OCR'ed strings. Fortunately, the concept of distance between strings is a well-explored topic and we can simply use the Levenshtein distance normalised by the length of the longer string (so that the relative weight of the gap penalty doesn't vary with the length of the string). This gives us a utility:

$$ u(book_t, index_t, skip_t^d) = \begin{cases} -\dfrac{lev(book_t,\, index_t)}{\max(|book_t|,\, |index_t|)} & \text{if } skip_t^d = \text{False} \\[6pt] -x & \text{if } skip_t^d = \text{True} \end{cases} $$
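As a rough sketch, the match branch of this utility could be implemented as below (assuming the python-Levenshtein package; any edit-distance routine would do, and the function name match_utility is my own):

import Levenshtein  # pip install python-Levenshtein

def match_utility(book_name, index_name):
    """Utility of accepting a match: the negated Levenshtein distance,
    normalised by the length of the longer string."""
    longest = max(len(book_name), len(index_name))
    if longest == 0:
        return 0.0
    return -Levenshtein.distance(book_name, index_name) / longest

# For example, match_utility("Amazann", "mzon") = -4/7, roughly the -0.571
# that appears in the English scoring matrix further down.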

We've described the problem using a formulation common to macroeconomics - now how do we solve it?

Solving the Optimisation Problem

Since the choice space is relatively simple (skip index/skip book/no skips) we can brute force a solution by evaluating the Bellman equation on a scoring grid and finding the cheapest path traversing the grid from the bottom right to the top left (image from Wikipedia; the exposition here is simplified somewhat, but I recommend checking out the link for more details):

Here a diagonal arrow corresponds to accepting a match, a horizontal arrow indicates skipping a match in the "left" sequence, and a vertical arrow a skip in the "top" sequence.
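For intuition, here's a stripped-down sketch of the grid fill and traceback. Again, this is illustrative rather than the project code: needleman_wunsch is my own naming, and the similarity argument could be the match_utility function sketched above.

import numpy as np

def needleman_wunsch(book, index, similarity, gap_penalty=-0.5):
    """Fill the scoring grid, then trace the optimal path back from the
    bottom-right cell to recover the alignment (None marks a proposed skip)."""
    n, m = len(book), len(index)
    F = np.zeros((n + 1, m + 1))
    F[:, 0] = gap_penalty * np.arange(n + 1)  # first column: every book entry skipped in the index
    F[0, :] = gap_penalty * np.arange(m + 1)  # first row: every index entry skipped in the book
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            F[i, j] = max(
                F[i - 1, j - 1] + similarity(book[i - 1], index[j - 1]),  # accept the match
                F[i - 1, j] + gap_penalty,  # book[i-1] has no index counterpart
                F[i, j - 1] + gap_penalty,  # index[j-1] has no book counterpart
            )

    # Traceback from the bottom-right corner to the top-left.
    pairs, i, j = [], n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and np.isclose(
            F[i, j], F[i - 1, j - 1] + similarity(book[i - 1], index[j - 1])
        ):
            pairs.append((book[i - 1], index[j - 1]))
            i, j = i - 1, j - 1
        elif i > 0 and np.isclose(F[i, j], F[i - 1, j] + gap_penalty):
            pairs.append((book[i - 1], None))  # skipped in the index
            i -= 1
        else:
            pairs.append((None, index[j - 1]))  # skipped in the book
            j -= 1
    return F, pairs[::-1]

On the English example below, a routine like this should propose skips for Boyer and Deutsche k and pair Camcoost with Coast.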

Example - English

Now time to bring everything together and apply the above to our string sequences:

In [1]:
import pandas as pd
import numpy as np
from NW_Company_Names.NW_company_names import create_nw_df # code I can't really share at the moment

In the example we've created, the algorithm needs to recognise that Bayer and Deutsche Bank have both been skipped, and that Camcoost should therefore match Coast (both representing Comcast).

In [2]:
english_df = pd.DataFrame()
english_df["index"] = ["Amazann", "Boyer", "Camcoost", "Ebat"]
english_df["book"] = ["mzon", "Coast", "Deutsche k", "dbay"]
english_df
Out[2]:
      index        book
0   Amazann        mzon
1     Boyer       Coast
2  Camcoost  Deutsche k
3      Ebat        dbay

Calling the create_nw_df() function automatically prints the scoring matrix - unlike the Wikipedia example I don't display the actual strings in the first column/row since the Japanese company names get fairly long. Notice how the cost is now a float instead of an integer since we're using continuous output from the Levenshtein metric:

In [10]:
english_nw_df = create_nw_df( # python code I can't share
    english_df[0:4], # rows we want for the index sequence
    english_df[0:4], # rows for the book sequence (this makes more sense in the actual implementation than here)
    "index", # name of the column containing the index strings
    "book", # name of the column containing the book strings
    gap_penalty = -0.5 
)
Scoring Matrix:
     0         1         2         3         4
0 -0.0 -0.500000 -1.000000 -1.500000 -2.000000
1 -0.5 -0.571429 -1.071429 -1.571429 -2.071429
2 -1.0 -1.071429 -1.371429 -1.571429 -2.071429
3 -1.5 -1.571429 -1.871429 -2.071429 -2.471429
4 -2.0 -2.071429 -2.371429 -2.571429 -2.571429

The corresponding optimised sequence alignment from the algorithm:

In [4]:
english_nw_df[["sequence_1", "sequence_2", "sequence_text_x", "sequence_text_y"]]
Out[4]:
  sequence_1 sequence_2 sequence_text_x sequence_text_y
0          0          0         Amazann            mzon
1          1     ------           Boyer             NaN
2          2          1        Camcoost           Coast
3     ------          2             NaN      Deutsche k
4          3          3            Ebat            dbay

It works - the algorithm has inserted a skip at rows 1 and 3 for the book and index respectively! The value/cost of the sequence alignment is given by the score in the bottom right cell of the scoring matrix, $-2.571$ in this case.

Example - Japanese

In [11]:
N = 10
pd.options.display.max_rows = int(round(N * 1.1))
index_df = pd.read_csv("summarised_PR1954_index.csv")
wide_df = pd.read_csv("PR1954_clean_wide_numeric_pipeline.csv")

Some data munging from the cleaning pipeline (we want to remove some sections of the book that don't deal with firms):

In [6]:
firm_wide_df = wide_df[wide_df["companyid"].str.contains("firm")]
firm_index_df = index_df[index_df["data_source_type"] == "firm"]
firm_index_df = firm_index_df.reset_index()
firm_wide_df = firm_wide_df.reset_index()
firm_wide_df = firm_wide_df.sort_values(by = ["companyid"])

Comparing the first 10 companies - the eagle-eyed (or those familiar with Japanese) will notice that row 7 appears in the index but not in the book:

In [7]:
side_by_side_comp = pd.DataFrame()
side_by_side_comp["book"] = firm_wide_df.iloc[0:N, 7].reset_index(drop = True)
side_by_side_comp["index"] = firm_index_df.iloc[0:N, 17]
side_by_side_comp
Out[7]:
   book                               index
0  (株)アース商會[匿稱アース光學㈱」   アース商會
1  ㈱アイエ書店                       アイエ書店
2  ㈱アイゼンベルグ商會               アイゼンベルグ商會1
3  アイデアル石鹸                     アイデアル石鹸、
4  株アイデン                         アイデン
5  アコマ醫科工業                     アコマ醫科工業
6  |アサノコンクリート)              アサノコンクリート1
7  ㈱アサヒ藝能新聞社                 アサヒイブニングニュース祉
8  ㈱アサヒ商店                       アサヒ藝能新聞社
9  了廿七纖維工業                     アサヒ商店

There's a lot of noise we could clean up here before running the algorithm (removing ㈱ which is found in the book column for instance) but we'll keep it for this example as an additional sensitivity check/challenge. Let's run the algorithm on the first 10 companies:
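For what it's worth, a hypothetical pre-cleaning pass (not run in this post) might strip the company-type markers and stray OCR punctuation with something like the following. The "clean" column names are my own, and I'm assuming the raw name columns are "company name" and "text", as in the call below.

# Hypothetical pre-cleaning: drop ㈱/(株)/leading 株 plus stray OCR punctuation.
noise_pattern = r"㈱|\(株\)|^株|[|\[\]「」()、。]"

firm_wide_df["company name clean"] = (
    firm_wide_df["company name"].str.replace(noise_pattern, "", regex=True).str.strip()
)
firm_index_df["text clean"] = (
    firm_index_df["text"].str.replace(noise_pattern, "", regex=True).str.strip()
)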

In [8]:
japanese_nw_df = create_nw_df(
    firm_wide_df[0:N],   # rows from the book pipeline
    firm_index_df[0:N],  # rows from the digitised index
    "company name",      # column with the book's company names
    "text"               # column with the index's company names
)
Scoring Matrix:
       0         1         2         3         4         5         6  \
0    0.0 -1.000000 -2.000000 -3.000000 -4.000000 -5.000000 -6.000000   
1   -1.0 -0.722222 -1.722222 -2.700000 -3.700000 -4.700000 -5.700000   
2   -2.0 -1.722222 -0.888889 -1.888889 -2.888889 -3.888889 -4.888889   
3   -3.0 -2.722222 -1.888889 -1.088889 -2.088889 -3.088889 -4.088889   
4   -4.0 -3.722222 -2.888889 -2.088889 -1.213889 -2.213889 -3.213889   
5   -5.0 -4.722222 -3.888889 -3.088889 -2.213889 -1.413889 -2.413889   
6   -6.0 -5.722222 -4.888889 -4.088889 -3.213889 -2.413889 -1.413889   
7   -7.0 -6.722222 -5.888889 -5.088889 -4.213889 -3.413889 -2.413889   
8   -8.0 -7.722222 -6.888889 -6.088889 -5.213889 -4.413889 -3.413889   
9   -9.0 -8.722222 -7.888889 -7.088889 -6.213889 -5.413889 -4.413889   
10 -10.0 -9.722222 -8.888889 -8.088889 -7.213889 -6.413889 -5.413889   

           7         8         9         10  
0  -7.000000 -8.000000 -9.000000 -10.000000  
1  -6.700000 -7.700000 -8.666667  -9.666667  
2  -5.888889 -6.888889 -7.888889  -8.888889  
3  -5.088889 -6.088889 -7.088889  -8.088889  
4  -4.213889 -5.213889 -6.213889  -7.213889  
5  -3.413889 -4.413889 -5.413889  -6.413889  
6  -2.413889 -3.413889 -4.413889  -5.413889  
7  -1.595707 -2.595707 -3.595707  -4.595707  
8  -2.595707 -2.441861 -3.441861  -4.441861  
9  -3.595707 -2.706818 -3.191861  -4.191861  
10 -4.595707 -3.706818 -2.873485  -3.873485  

Checking the output again:

In [9]:
japanese_nw_df[["sequence_1", "sequence_2", "sequence_text_x", "sequence_text_y"]]
Out[9]:
    sequence_1  sequence_2  sequence_text_x                     sequence_text_y
0   0           0           (株)アース商會[匿稱アース光學㈱」    アース商會
1   1           1           ㈱アイエ書店                        アイエ書店
2   2           2           ㈱アイゼンベルグ商會                アイゼンベルグ商會1
3   3           3           アイデアル石鹸                      アイデアル石鹸、
4   4           4           株アイデン                          アイデン
5   5           5           アコマ醫科工業                      アコマ醫科工業
6   6           6           |アサノコンクリート)               アサノコンクリート1
7   ------      7           NaN                                 アサヒイブニングニュース祉
8   7           8           ㈱アサヒ藝能新聞社                  アサヒ藝能新聞社
9   8           9           ㈱アサヒ商店                        アサヒ商店
10  9           ------      了廿七纖維工業                      NaN

Row 7 has been correctly skipped in the book column!

Conclusion

We've found a pretty cool way to match two noisy sequences of strings. Furthermore, the coding required, whilst unfortunately I couldn't show it here, was minimal - we only had to rewrite the scoring metric and borrow an existing Python implementation of the Needleman-Wunsch algorithm from the Wilke Lab. The analogue of skipped companies in our setting is genetic mutations causing "indels" - insertions or deletions in the DNA (I think?) of a protein sequence.

Macro isn't my strong point so if anyone has any recommendations I'd love to hear them. One of the drawbacks of my implementation is that the NW algorithm is $O(nm)$, where $n$ and $m$ are the lengths of the two sequences being matched - for us that means compute time and memory scale like $9000^2 \approx 81$ million grid cells per publication, which definitely adds up.