In this post we're going to draw on dynamic programming and the Needleman-Wunsch algorithm, traditionally used in bioinformatics to align protein sequences, to clean digitised company names produced by a machine learning pipeline as part of my work for Professor Dell. Whilst protein sequences and company names in Kanji seem dissimilar at first glance, the Needleman-Wunsch algorithm generalises extremely well to the problem at hand and saves hours of manual research assistant (RA) work at a comparable, if not greater, accuracy rate.
Our project's aim is to automate the digitisation of firm-level Japanese archival data from 1940 to 1960. So far, we've achieved 99.3% accuracy at classifying variable names and values in our validation set for the 1954 version of "Personnel Records" (PR1954 henceforth). The data looks something like this, with the book displayed on the left and the corresponding column expanded on the right (courtesy of Jie Zhou):