Smells like clean data

A survey by The Economist found that 45% of business leaders feel they have “too much data and too few resources to manage them…and by far the most difficult process right now is reconciling disparate data sources.”

The challenge with “big data” is that it can be messy. Several of our clients have wanted to integrate multiple databases so that they can better understand their customers or their operations. Great idea, right? The only problem is that the databases often contain incomplete records, duplicate records and records that should be linked to the same person…but aren’t. For some clients, there is also deliberately misleading information buried within their databases that we have to overcome. So for problems like these, we must first clean the data.

At 2River, we break this problem down into several key steps. We use advanced analytics technology where it helps and human judgment where it is needed. We developed our LIFT software to do this kind of “fuzzy pattern matching” between records automatically. If a comparison is too close to call, we will ask you for help, but only for a couple of cases…then our models will do the rest.
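To make the “too close to call” idea concrete, here is a minimal sketch in Python. It is not how LIFT works internally. It uses the standard library's difflib as a stand-in similarity measure, and the two thresholds and the triage function are assumptions chosen purely for illustration.

```python
from difflib import SequenceMatcher

# Hypothetical thresholds, picked only for this illustration.
MATCH_THRESHOLD = 0.90      # at or above this, treat the pair as the same entity
NON_MATCH_THRESHOLD = 0.60  # at or below this, treat the pair as different


def similarity(a: str, b: str) -> float:
    """Return a 0..1 similarity score between two strings, ignoring case."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def triage(a: str, b: str) -> str:
    """Sort a pair into 'match', 'non-match', or 'review' (ask a person)."""
    score = similarity(a, b)
    if score >= MATCH_THRESHOLD:
        return "match"
    if score <= NON_MATCH_THRESHOLD:
        return "non-match"
    return "review"


if __name__ == "__main__":
    for pair in [("Jon Smith", "jon smith"),
                 ("Jon Smith", "John Smyth"),
                 ("Jon Smith", "Mary Jones")]:
        print(pair, "->", triage(*pair))
```

Only the pairs that land in the middle band go to a person. Everything else is decided automatically, which is what keeps the number of questions small.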

How do you clean messy data? Five steps:

  1. First, we standardize your data. We check for inconsistent upper and lower case, extra spaces between words, inconsistent abbreviations, nicknames, punctuation, missing values, and more…
  2. Then we ask you to identify characteristics of the data that you want to use to compare records (perhaps name, gender, date of birth, address, account number, shoe size – you get the point).
  3. Our software develops a comparison pattern for each of these elements. We know that typos and misspellings can occur, so when we compare words our software looks for phonetic matches and overlooks minor differences in spelling.
  4. For very large data sets, we have developed a number of strategies that speed up the search while missing very few “true matches.” For one recent client our approach was 99.8% accurate (we are still working on that 0.2%).
  5. Finally, our software determines the most likely matches by clustering the data into groups of matches. We like to get your feedback to validate and improve the classification. We will generate a small set of comparisons that cover the typical comparison patterns and ask you to validate them. This minimizes demands on your time while still establishing high classification performance.
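The steps above can be sketched in a few dozen lines of Python. This is an illustration under stated assumptions, not our LIFT software: the tiny nickname table, the Soundex-style phonetic code, the use of that code as a “blocking” key to limit comparisons, the 0.85 threshold, and every function name here are hypothetical stand-ins for the ideas in steps 1, 3, 4 and 5.

```python
import re
from collections import defaultdict
from difflib import SequenceMatcher
from itertools import combinations

# Hypothetical nickname table; a real one would be far larger.
NICKNAMES = {"bob": "robert", "bill": "william", "liz": "elizabeth"}


def standardize(name: str) -> str:
    """Step 1: lower-case, drop punctuation, collapse spaces, expand nicknames."""
    cleaned = re.sub(r"[^\w\s]", "", name.lower())
    return " ".join(NICKNAMES.get(word, word) for word in cleaned.split())


# Soundex letter groups: letters in the same group sound alike.
_GROUPS = {"1": "bfpv", "2": "cgjkqsxz", "3": "dt", "4": "l", "5": "mn", "6": "r"}
_CODES = {ch: digit for digit, letters in _GROUPS.items() for ch in letters}


def phonetic_key(word: str) -> str:
    """Step 3: Soundex-style code, so 'Smith' and 'Smyth' share a key."""
    word = re.sub(r"[^a-z]", "", word.lower())
    if not word:
        return ""
    key = word[0].upper()
    prev = _CODES.get(word[0], "")
    for ch in word[1:]:
        code = _CODES.get(ch, "")
        if code and code != prev:
            key += code
        if ch not in "hw":  # h and w do not separate repeated codes
            prev = code
    return (key + "000")[:4]


def cluster_records(names, threshold=0.85):
    """Steps 4-5: compare only within phonetic blocks, then group the matches."""
    std = [standardize(n) for n in names]

    # Step 4 (blocking): only records whose last word sounds alike are compared.
    blocks = defaultdict(list)
    for i, s in enumerate(std):
        last_word = s.split()[-1] if s else ""
        blocks[phonetic_key(last_word)].append(i)

    # Step 5 (clustering): union-find merges every pair judged to match.
    parent = list(range(len(names)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for members in blocks.values():
        for i, j in combinations(members, 2):
            if SequenceMatcher(None, std[i], std[j]).ratio() >= threshold:
                parent[find(i)] = find(j)

    clusters = defaultdict(list)
    for i, name in enumerate(names):
        clusters[find(i)].append(name)
    return sorted(clusters.values())


if __name__ == "__main__":
    records = ["Bob  Smith", "robert smyth", "Mary Jones", "MARY JONES."]
    for group in cluster_records(records):
        print(group)
```

Blocking is the speed-up: records are compared only with others in the same block instead of with every other record. The trade-off is that a true match placed in a different block is never compared, which is one way a few “true matches” can be missed.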

This approach to advanced analytics allows businesses and organizations to correlate disparate information sources with a high degree of confidence and accuracy. It allows you to use data as a strategic asset without becoming overwhelmed by the volume being collected.
