In this post we’re going to look at a cool data cleaning solution I created to solve a problem we encountered whilst digitising Japanese archival data for Professor Dell. We have text data that has been output from a machine learning pipeline by some colleagues, and we need to convert it into structured information about a company’s name, address, profit values and so on, whilst being robust to OCR errors and a variety of oopsies from the ML output.
The problem is that occasionally we fail to identify the start of a company in the text data, which means variables get assigned to the wrong company. We convert the issue into a known problem and use some time series/signal analysis tricks to improve on the ML output.
We have a data pipeline consisting of several steps:
Unfortunately, errors can occur in any one of these stages - our post today is tackling an issue arising in the row classification.
Each page we’re digitising looks like so:
I don’t speak a word of Japanese which makes everything a little more exciting.
Each page can be split into columns and corresponding rows - Jie Zhou created this breakdown:
The output from the ML row classification looks like so:
The classic case of a missing company is when the top row shown in the above image isn’t recognised as class 2 (company).
We classify rows into 6 categories:
We know that the basic structure of the book must follow the pattern above: CAVP (Company, Address, Variable, Personnel).
There’s a huge amount of information embedded in this structure that the ML model exploits too - Kaixuan Zhang has written a graphical model using this structure.
We’re going to do the same thing but with a more traditional econometrics based approach.
This is an example of a missing company:
We were fixing this by using some simple heuristics that exploit the structure of the book. If an Address row comes straight after a Variable block, it’s reasonable to conclude that it’s actually a misclassified Personnel row. Unfortunately these rules are quite strict/naive and can create more problems than they solve.
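To make that concrete, here’s a minimal sketch of one such rule in R - the data frame `rows` and the column `cls_name` are assumptions about the shape of the data, not the actual pipeline code:

```r
library(dplyr)

# A sketch of one such rule: an Address row that directly follows a Variable
# row gets relabelled as Personnel. `rows` and `cls_name` are assumed names.
fixed <- rows %>%
  mutate(cls_chr = as.character(cls_name)) %>%
  mutate(cls_chr = if_else(
    cls_chr == "address" & lag(cls_chr, default = "") == "variable",
    "personnel",
    cls_chr
  ))
```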
The new method also exploits the repetitive structure of the problem, but in a much more natural way. We need a way to spot a recurrent signal that’s robust to noise.
This is a known problem!
Currently, the classification output is a factor variable that takes values such as Address or Personnel.
We can transform this into 6 dummy/indicator variables - also called one-hot encoding in ML.
Each dummy variable takes a value of 1 if the current row belongs to that category and 0 otherwise.
From this…
row | cls_name |
---|---|
3 | company |
4 | variable |
6 | address |
7 | address |
8 | address |
9 | variable |
…into this:
total_row_id | address | company | page | personnel | variable |
---|---|---|---|---|---|
1 | 0 | 1 | 0 | 0 | 0 |
2 | 0 | 0 | 0 | 0 | 1 |
3 | 1 | 0 | 0 | 0 | 0 |
4 | 1 | 0 | 0 | 0 | 0 |
5 | 1 | 0 | 0 | 0 | 0 |
6 | 0 | 0 | 0 | 0 | 1 |
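As a rough sketch, the same transformation can be done with tidyr’s `pivot_wider()` - assuming a data frame `rows` with a `total_row_id` column and a `cls_name` factor (names are illustrative):

```r
library(dplyr)
library(tidyr)

# One-hot encode the row classification: one indicator column per class,
# filled with 0 where the class doesn't apply to that row
dummies <- rows %>%
  mutate(value = 1L) %>%
  pivot_wider(
    id_cols     = total_row_id,
    names_from  = cls_name,
    values_from = value,
    values_fill = 0L
  )
```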
How does this help us?
Well, now we effectively have a time series of 5 variables, where each observation (think time period) is a row in the book.
We can plot this:
Next we apply a rolling mean to dampen down noise:
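Something along these lines does the job, using a centred rolling mean from zoo - the window of 9 rows is picked purely for illustration, not a tuned value:

```r
library(dplyr)
library(zoo)

# Smooth each indicator series with a centred rolling mean to dampen the noise
smoothed <- dummies %>%
  mutate(across(
    c(company, address, variable, personnel),
    ~ rollmean(.x, k = 9, fill = NA, align = "center")
  ))
```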
We have these pretty curves, but where do we go from there?
Figuring out where a company stops and another starts is something we can sort of eyeball.
If we plot the curves on one axis:
But we can’t eyeball 10,000 plots…
Figuring out where each company’s signal ends is actually the same as figuring out the best way to ‘stack’ the curves.
We want to know by how much we should shift each curve to maximise their summed vertical distance - remember each curve is a different variable (C, A, V or P).
When we estimate/analyse time series we typically use spectral density analysis - this sounds complicated but it’s essentially just a transformation (the Fourier transform) of a curve’s autocorrelation function.
The autocorrelation function is just the correlation between \(y_t\) and \(y_{t-s}\) - the correlation between \(y\) today and \(y\) \(s\) periods ago.
We want the same thing, but with the cross-correlation between variables (the ACF might work, but the CCF is probably more useful to us in this case).
Therefore, for each variable we estimate the cross-correlation function. This is just the ACF but instead of comparing \(y_t\) and \(y_{t-s}\) we compare \(y_t\) and \(x_{t-s}\). We use the Company name as our “anchor variable”: we estimate the CCF of each variable against the Company name variable.
Then we lag each curve by whichever lag maximises the absolute value of the CCF.
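A sketch of that step using base R’s `ccf()` and the smoothed series from above - column names are assumed, and the sign convention of the shift is worth double-checking against a plot:

```r
library(dplyr)

# Lag (relative to the Company anchor) that maximises the absolute CCF
best_lag <- function(x, anchor, max_lag = 30) {
  cc <- ccf(x, anchor, lag.max = max_lag, plot = FALSE, na.action = na.pass)
  cc$lag[which.max(abs(cc$acf))]
}

lags <- sapply(
  smoothed[c("address", "variable", "personnel")],
  best_lag,
  anchor = smoothed$company
)

# stats::ccf(x, y) reports cor(x[t + k], y[t]), so a best lag of k means
# x[t + k] lines up with the anchor at t - worth sanity-checking on a plot
align_to_anchor <- function(x, k) {
  k <- as.integer(round(k))
  if (k > 0) dplyr::lead(x, k) else if (k < 0) dplyr::lag(x, -k) else x
}

aligned <- smoothed
for (v in names(lags)) {
  aligned[[v]] <- align_to_anchor(smoothed[[v]], lags[[v]])
}
```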
We could jointly estimate everything and take the max of some norm, or estimate the spectral density and use some more complex analysis but we don’t really need to.
Shifting the curves by their lag looks like this:
and stacking them creates a much more obvious plot we can check:
if we then overlay the original classification:
we can see our time series method corresponds well with the original ML classification.
Here we’re missing the two companies in the middle:
We can run this for many firms/pages in the book and use multiple firms to estimate the appropriate CCF lags:
We automated the visual analysis using a simple heuristic and ran the model across a range of hyperparameters.
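The exact heuristic isn’t the interesting part, but to give a flavour of the kind of thing that works (this is an illustrative assumption, not the production code): sum the aligned curves into one stacked signal, find its peaks, and flag unusually long gaps between consecutive peaks as likely missing companies.

```r
# Stack the aligned indicator curves into a single signal
stacked <- rowSums(
  aligned[c("company", "address", "variable", "personnel")],
  na.rm = TRUE
)

# A "peak" here is simply a local maximum above a threshold (0.5 is illustrative)
is_peak <- function(x, threshold = 0.5) {
  c(FALSE, diff(sign(diff(x))) < 0, FALSE) & x > threshold
}

peak_rows <- which(is_peak(stacked))
gaps      <- diff(peak_rows)

# The peak preceding an unusually long gap marks a likely missed company start
suspect_rows <- peak_rows[which(gaps > 1.5 * median(gaps))]
```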
This gives us the following plot of likely missing companies:
which is a pretty cool way to solve the problem - or at least I think it’s cool.
We’ve developed a method that’s doubly robust to noise. We use information from multiple variables and multiple firms per window to estimate the appropriate lag and transform the signal, and finally, we’re using a statistical model that can distinguish between trends and noise, unlike the heuristic algorithm.