Data quality can be boring. Yes, I said it. And this comes from someone who has worked in the data quality space for more than 20 years. When I’m at a social gathering, I dread the inevitable “so what do you do?” question. My short answer is usually “boring computer stuff.” Heaven forbid they try to dig deeper! Then it becomes this awkward explanation about reducing the amount of junk mail they get or some pseudo-relatable data activity, and then watching their eyes begin to glaze over. At the end of the day, I use my tried-and-tested conversation changer, “enough about me, what about you?” I double majored in Computer Science and Psychology, and while Computer Science ultimately provided the foundation of my career, I’ve found that the psychology background is often more valuable for dealing with people and sticky situations.
So, if data quality is so boring, why spend 20 years of your life focused on it? Good question. I find working in data quality is like being a detective. You are constantly looking for clues to help you solve mysteries. “Our critical real-time e-commerce site crashed last night! What happened?!” Well, upon further analysis, it seems that the website was expecting alphabetic characters in the comment field which unfortunately had an unreadable “TAB” character in it that caused a cascading system failure. This stuff really happens! And even though it’s boring for the average person to think about “TAB” characters being significant enough to crash a website, these are the types of issues that data quality people get a kick out of, and that business leaders and senior managers should be terrified are going to impact their businesses!
At this point, we’ve determined that a minor data quality issue has the capacity to take down a business, at least for a period of time. Now the question is, “how do these data quality issues occur in the first place?” Let’s take a look at some typical data quality problems:
1. Dates – How could anyone screw up a date? Well, they can and they quite often do. If the date is entered manually (like a request for date of birth), it can be input in any number of formats: two-digit months and days, one-digit months and days, two-digit years, four-digit years, and a mixture of one-two-and-four digits, sometimes separated by spaces, or hyphens, or slashes. And what about when someone uses an “O” instead of a zero, or an “I” instead of a one? People may even spell out the date in total, like “January 1st, 2017”, which is ripe for misspellings and non-conformity. Microsoft has 25+ date formats in Excel, and any one of those could be used and abused in a manual scenario. Automation will solve all of these problems, right? Well, maybe it can help, but it doesn’t solve everything. As any world traveler knows, the format of US dates is markedly different than that of EU dates, in that US dates are in the form Month-Day-Year, and EU dates are in the form Day-Month-Year. So even if you’ve got a perfectly formatted date, like 07/06/2017, with a multi-national data set, you’ve got to take the country of origin into consideration to know if the date represents July 6, 2017 or June 7, 2017. Dates are complicated!
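To make the 07/06/2017 problem concrete, here is a minimal Python sketch of parsing a date according to its country of origin. The `DATE_FORMATS` table and the `parse_date` helper are hypothetical names invented for this illustration, and the sketch assumes the country is already known for every record.

```python
from datetime import date, datetime

# Hypothetical lookup: which day/month order each country of origin uses.
DATE_FORMATS = {
    "US": "%m/%d/%Y",  # Month-Day-Year
    "FR": "%d/%m/%Y",  # Day-Month-Year
    "DE": "%d/%m/%Y",  # Day-Month-Year
}


def parse_date(raw: str, country: str) -> date:
    """Parse a numeric date using the convention of its country of origin."""
    # Treat hyphens and single spaces as separators, the same as slashes.
    cleaned = raw.strip().replace("-", "/").replace(" ", "/")
    # An unknown country raises KeyError; a malformed date raises ValueError.
    return datetime.strptime(cleaned, DATE_FORMATS[country]).date()


print(parse_date("07/06/2017", "US"))  # 2017-07-06
print(parse_date("07/06/2017", "FR"))  # 2017-06-07
```

The same string yields two different dates, which is why the country has to travel with the data. Spelled-out dates and look-alike letters would need extra handling that this sketch leaves out.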
2. Numbers – Numbers are not quite as complicated as dates, but still fall into some of the same traps. The most common is letters representing numbers (the aforementioned “I” for “1” and “O” for “0”, and the occasional heavy-metal data entry person using an “E” for a “3”). But you also have spaces being used in numeric fields, and people entering “seven” instead of the digit “7.” Now, not to show my age, but you don’t even want to know about packed and unpacked, signed and unsigned data held on a mainframe! Numbers are not immune to data quality issues.
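A rough Python sketch of cleaning up the numeric problems described above might look like the following. The `clean_numeric` helper and its lookup tables are hypothetical, and the sketch assumes the field holds a non-negative whole number, so an “E” can never be legitimate scientific notation.

```python
# Hypothetical tables for this illustration only.
LOOKALIKES = str.maketrans({"O": "0", "o": "0", "I": "1", "l": "1", "E": "3"})
NUMBER_WORDS = {
    word: value
    for value, word in enumerate(
        ["zero", "one", "two", "three", "four",
         "five", "six", "seven", "eight", "nine"]
    )
}


def clean_numeric(raw: str) -> int:
    """Turn a messy whole-number entry into an int, or raise ValueError."""
    text = raw.strip()
    # Spelled-out single digits are checked first, before letters are swapped.
    if text.lower() in NUMBER_WORDS:
        return NUMBER_WORDS[text.lower()]
    # Drop embedded spaces, then replace look-alike letters with digits.
    candidate = text.replace(" ", "").translate(LOOKALIKES)
    if not (candidate.isascii() and candidate.isdigit()):
        raise ValueError(f"not a recognizable number: {raw!r}")
    return int(candidate)


print(clean_numeric("1O 5"))   # 105
print(clean_numeric("I2E"))    # 123
print(clean_numeric("seven"))  # 7
```

Anything the rules cannot explain is rejected rather than guessed at, so a person can review it.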
3. International character sets and encodings – We live in a global economy, and so does our data. I’ve got challenges in even writing this article to bring up the issues around accent marks and umlauts (for example, “À” and “Ü” – hooray for Microsoft Special Characters!). Getting machines to recognize the myriad of character sets around the world is a challenge at best. Between single-byte, double-byte and multi-byte characters (think Japanese), there are literally tens of thousands of potential characters that can be captured. Some of these characters could crash your system if you are not ready for them.
Unicode (universal coding) standards have helped, but even Unicode standards have holes. Take Turkey, for example. Turkish has four characters representing an “I”: a lowercase “i” with a dot on top, a lowercase “i” without the dot, a capital “I” with a dot on top, and a capital “I” without the dot on top. Even in a Unicode environment, I’ve seen weeks of work in trying to get capitalization to work correctly for these Turkish characters. The default functionality is to take both lowercase “i”s and turn them into uppercase “I”s without the dot. Turning the lowercase “i” with a dot into an uppercase “I” with a dot was a multi-week project! Welcome to the world of international character sets.
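The Turkish capitalization problem is easy to reproduce. Python's built-in `str.upper()` ignores locale, so it behaves like the default described above. The `turkish_upper` and `turkish_lower` helpers below are hypothetical names for one possible workaround, assuming the text is already known to be Turkish.

```python
DOTTED_CAPITAL_I = "\u0130"  # capital I with a dot on top
DOTLESS_SMALL_I = "\u0131"   # lowercase i without the dot


def turkish_upper(text: str) -> str:
    """Uppercase text using the Turkish pairing of dotted and dotless i."""
    # Swap the two lowercase i's by hand, then let upper() do the rest.
    return text.replace("i", DOTTED_CAPITAL_I).replace(DOTLESS_SMALL_I, "I").upper()


def turkish_lower(text: str) -> str:
    """Lowercase text using the Turkish pairing of dotted and dotless i."""
    # Swap the two capital I's by hand, then let lower() do the rest.
    return text.replace(DOTTED_CAPITAL_I, "i").replace("I", DOTLESS_SMALL_I).lower()


# Default behavior: the dot is lost on the way up.
print("istanbul".upper())         # ISTANBUL
# Turkish-aware behavior: the dot is kept.
print(turkish_upper("istanbul"))  # İSTANBUL
print(turkish_lower("ISI"))       # ısı
```

A production system would more likely lean on a locale-aware library than on hand-written replacements. The sketch only shows why the default result is wrong for Turkish data.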
The characters don’t even have to be international to wreak havoc, as in the “TAB” example above, or when other unexpected characters show up like carriage returns, line feeds, and the dreaded “null.” And a quick word on encodings (which are different from character sets). There are multiple encodings, changing the way data is stored, depending on your platform: EBCDIC on mainframes, ASCII on Windows, Linux and UNIX, and a variety of Unicode encodings (UTF-8, UTF-16, etc.). Don’t think for a second that disparate data from disparate systems in disparate encodings isn’t going to turn your data quality governance team’s hair gray!
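Both problems in this paragraph can be sketched in a few lines of Python. The `find_control_chars` helper is a hypothetical name, and `cp037` is just one of several EBCDIC code pages, picked here as an assumption for the demonstration.

```python
import unicodedata


def find_control_chars(text: str) -> list[tuple[int, str]]:
    """Return (position, code point) for every control character in text."""
    return [
        (index, f"U+{ord(char):04X}")
        for index, char in enumerate(text)
        if unicodedata.category(char) == "Cc"  # Unicode "control" category
    ]


# A comment field hiding a TAB, a carriage return, a line feed and a null.
comment = "Great\tproduct\r\n\x00"
print(find_control_chars(comment))
# [(5, 'U+0009'), (13, 'U+000D'), (14, 'U+000A'), (15, 'U+0000')]

# The same text stored as EBCDIC is unreadable if decoded as something else.
ebcdic_bytes = "HELLO".encode("cp037")
print(ebcdic_bytes.decode("cp037"))    # HELLO
print(ebcdic_bytes.decode("latin-1"))  # not HELLO: wrong encoding assumed
```

Profiling a field this way before it reaches a downstream system is one way to catch a stray “TAB” ahead of time.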
4. Multiple languages and units of measure – By the way, in talking about international character sets, don’t confuse this with multiple languages. We could have one, perfectly standardized character set, but that doesn’t help us with the fact that “cheese” and “fromage” mean the same thing in 2 different languages. I’ve seen major hotel chains losing thousands of dollars in inventory because they double-ordered “cheese” and “fromage” leaving half of their order to spoil. Sometimes, it doesn’t even take multiple languages – I’ve seen the same inventory mistake with “ketchup” and “catsup.” Units of measure also come into play when dealing with materials. Are they all measured in pounds, or are some in kilos (or for my British friends, stones)? Weights, lengths, distances and, most impactfully, currencies also need to be taken into consideration when looking at creating data quality standards within your company.
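A small Python sketch shows how the “cheese” and “fromage” double order could be caught. The `SYNONYMS` table, the `normalize` and `total_by_item` helpers, and the sample orders are all invented for illustration. The conversion factors are the standard definitions of the pound and the stone in kilograms.

```python
from collections import defaultdict

# Hypothetical synonym table mapping each variant to one canonical name.
SYNONYMS = {"fromage": "cheese", "catsup": "ketchup"}

# Kilograms per unit: 1 lb = 0.45359237 kg, 1 stone = 14 lb.
KG_PER_UNIT = {"kg": 1.0, "lb": 0.45359237, "st": 14 * 0.45359237}


def normalize(item: str, quantity: float, unit: str) -> tuple[str, float]:
    """Return the canonical item name and the quantity in kilograms."""
    name = item.strip().lower()
    return SYNONYMS.get(name, name), quantity * KG_PER_UNIT[unit]


def total_by_item(orders: list[tuple[str, float, str]]) -> dict[str, float]:
    """Add up order quantities in kilograms per canonical item."""
    totals: dict[str, float] = defaultdict(float)
    for item, quantity, unit in orders:
        name, kilos = normalize(item, quantity, unit)
        totals[name] += kilos
    return dict(totals)


# Made-up orders: the same product under two names and two units.
orders = [("Cheese", 10, "kg"), ("fromage", 20, "lb"), ("catsup", 5, "kg")]
print(total_by_item(orders))
```

Once both names resolve to one item in one unit, the double order shows up as a single, suspiciously large total.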
5. Human error – People. If it wasn’t for people, all our data quality issues would go away, right? Well, I’m not sure if that is 100% true, but I’m sure it would help! As long as there are people entering data into screens, and phones, and tablets, the data quality business will be thriving for decades to come! The thought was that companies could leverage technology to make the data entry screens better, easier and more data compliant, and then train their staff to be stellar “data entrists” (I know that’s not a word, but it should be!). That plan almost worked, though human error is still the main cause of inaccurate data according to Experian Data Quality’s 2017 global data benchmark report.
Even well-trained data entry personnel still make mistakes. Sometimes people will misfield data, putting postal code information into a city field, or placing a name on a street line. That happens all the time. A little more egregious is the “renegade” data entry that happens when information doesn’t quite have a place – sticking nicknames in quotes as part of the name field (John “Big Dog” Smith), putting directions in address fields (“On the corner of Fifth and Main”), or placing an important note in a comment field because there is no field or flag to capture it (“This person is dead”). The downright deceptive problems are the data entry folks that are trying to buck the system; the system being the required fields necessary for them to complete their forms.
One great example was a large bank that had a system in place that required a social security number be entered before the user could continue. It was a smart system – it put up a blocker if the number was not entered, and even if a “fake” social security number was entered (like “999999999” or “123456789”). But what the system didn’t anticipate was one creative teller who got so annoyed with asking people for their social security number, and dealing with the constant questioning of “why do you need that?”, that they started putting in their own social security number just to get through the system! This is a classic case of, “Locks are for honest people”, and another example of why data quality is here to stay.
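The bank's blocker and the teller's workaround can both be sketched in Python. This is not the bank's actual system: the `looks_fake` and `overused_values` helpers, the threshold, and the sample numbers are all assumptions made up for illustration.

```python
import re
from collections import Counter


def looks_fake(ssn: str) -> bool:
    """Flag obviously fake entries: wrong length, one repeated digit, or a run."""
    digits = re.sub(r"\D", "", ssn)  # keep digits only
    return (
        len(digits) != 9
        or len(set(digits)) == 1
        or digits in {"123456789", "987654321"}
    )


def overused_values(values: list[str], threshold: int = 3) -> set[str]:
    """Return numbers that appear on at least `threshold` records."""
    counts = Counter(re.sub(r"\D", "", value) for value in values)
    return {value for value, count in counts.items() if count >= threshold}


print(looks_fake("999-99-9999"))  # True
print(looks_fake("123456789"))    # True
print(looks_fake("555-01-0001"))  # False (made-up number that passes the check)

# The teller's trick: a number that passes the field check but is reused.
entered = ["555-01-0001", "555-01-0001", "555-01-0001", "555-01-0002"]
print(overused_values(entered))   # {'555010001'}
```

A field-level check alone cannot catch the teller, because the number is valid. Only a look across many records reveals that one value is doing too much work.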
Sometimes organizations are completely unaware how many issues are lurking in their data. Without a way of profiling data and gaining a bird's eye view into the information in their systems, organizations are often caught off guard when a data quality issue arises and creates big problems for the business. Between all the dates, other numbers, and mix of languages in your databases, along with the effects of human error, you must proactively address potential data quality issues before they negatively impact your business. That's where we can help.
Experian Data Quality partners with our customers to help solve data quality issues and develop a comprehensive, proactive data strategy.