In this post we’re going to look at a cool data cleaning solution I created to solve a problem we encountered whilst digitising Japanese archival data for Professor Dell. Some colleagues have built a machine learning pipeline that outputs text data, and we need to convert that text into structured information, such as a company’s name, address and profit values, whilst being robust to OCR errors and a variety of oopsies in the ML output.

The problem is that we occasionally fail to identify the start of a company in the text data, which means variables get assigned to the wrong company. We recast the issue as a known problem and use some time series/signal analysis tricks to improve on the ML output.

The cleaning pipeline

We have a data pipeline consisting of several steps:

Unfortunately, errors can occur at any one of these stages. Today’s post tackles an issue arising in the row classification.
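To see why a single row classification mistake matters, here is a minimal sketch of how a missed company start pushes variables onto the wrong company. The row labels (`company`, `address`, `profit`), the toy values and the `group_rows_by_company` helper are all made up for illustration; this is not the actual pipeline code.

```python
def group_rows_by_company(rows):
    """Attach each (label, text) row to the most recent 'company' row.

    Hypothetical helper: the real pipeline's labels and logic may differ.
    """
    companies = []
    for label, text in rows:
        if label == "company":
            # A new company starts here.
            companies.append({"name": text, "fields": []})
        elif companies:
            # Everything else belongs to the last company we saw.
            companies[-1]["fields"].append((label, text))
    return companies


# Toy rows with the labels a perfect row classifier would produce.
true_rows = [
    ("company", "Company A"),
    ("address", "Address A"),
    ("profit", "100"),
    ("company", "Company B"),
    ("address", "Address B"),
    ("profit", "200"),
]

# Simulate the classifier missing the start of Company B.
predicted_rows = list(true_rows)
predicted_rows[3] = ("address", "Company B")

correct = group_rows_by_company(true_rows)
broken = group_rows_by_company(predicted_rows)

print(len(correct), "companies with correct labels")
print(len(broken), "companies after one missed company start")
print(broken[0]["fields"])
```

With the one mislabelled row, Company B disappears and its address and profit end up attached to Company A, which is exactly the kind of error the rest of the post tries to catch.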

Each page we’re digitising looks like so: