
How to fix your bad data—a lesson from Flushing Bank

We’ve all heard it before, but for Flushing Bank, the adage that poor data in means poor data out had become a worrying reality. The bank’s CIO recognized that the ability to find reliable and consistent answers to the business’s questions was in jeopardy because of problems with the bank’s data and processes. The big question was what to do about it.

It wasn’t simply that there were errors in the data, although that was one concern as the bank looked toward greater automation in its use of data. While humans can easily recognize that “$US,” “USdollars,” and “USD” all mean the same thing, computers are more literal and will signal an error if a data element isn’t recognized. And what about misplaced decimal points? There’s a big difference between a 4.6% interest rate and a 46.0% rate.
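To make the point concrete, here is a minimal Python sketch of the two checks described above: mapping the different spellings of a currency to one canonical code, and flagging an interest rate that looks like a misplaced decimal point. The alias table, the function names, and the 0 to 25% plausibility band are all illustrative assumptions, not anything Flushing Bank is known to use.

```python
import re

# Hypothetical alias table: each known spelling maps to one canonical code.
CURRENCY_ALIASES = {
    "usd": "USD",
    "$us": "USD",
    "us$": "USD",
    "usdollars": "USD",
}


def normalize_currency(raw):
    """Return the canonical code, or None if the label is not recognized."""
    key = re.sub(r"\s+", "", raw).lower()  # drop whitespace, ignore case
    return CURRENCY_ALIASES.get(key)


def rate_is_plausible(rate_percent, low=0.0, high=25.0):
    """Check a rate against an assumed plausible band (bounds are illustrative)."""
    return low <= rate_percent <= high


for label in ["$US", "USdollars", "USD", "Greenbacks"]:
    print(label, "->", normalize_currency(label))

for rate in [4.6, 46.0]:
    print(rate, "plausible" if rate_is_plausible(rate) else "needs review")
```

An unrecognized label returns `None` rather than a guess, so it can be routed to a person for review instead of being silently "fixed."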

But beyond input errors, what if different parts of the business use different definitions for commonly used terms? Or source data from different systems to get their answers? In their presentation on data governance given at the 2019 MISER Users Group conference, Joan Roche, Flushing’s senior vice president and director of data governance and applications, and Elizabeth LaBarbera, assistant vice president and data warehouse analyst, outlined challenges with commonly used definitions such as “average customer balance” and “customer lifetime value,” which were defined and calculated differently by different business groups within the bank.

In the case of the average customer balance, two different groups came up with radically different answers to present to senior management. Finance used information from the general ledger while a BI analyst used data from the data warehouse. In fact, both were right! But the general ledger average was based on the calendar year while the analyst calculated a rolling average over a longer period. The problem was a lack of definition and process standardization.
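A small Python sketch shows how two correct calculations can disagree. The balances, the 18-month window, and the function names below are invented for illustration; the source does not give the bank's actual figures or the analyst's window length.

```python
from datetime import date
from statistics import mean

# Hypothetical month-end balances for one customer: (date, balance).
# The balance jumps from 1,000 to 4,000 at the start of 2019.
balances = (
    [(date(2018, m, 28), 1000.0) for m in range(1, 13)]
    + [(date(2019, m, 28), 4000.0) for m in range(1, 7)]
)


def calendar_year_average(rows, year):
    """Average of the balances that fall inside one calendar year."""
    return mean(balance for day, balance in rows if day.year == year)


def rolling_average(rows, months):
    """Average of the most recent `months` month-end balances."""
    recent = sorted(rows)[-months:]
    return mean(balance for _, balance in recent)


print(calendar_year_average(balances, 2019))  # calendar-year view
print(rolling_average(balances, 18))          # rolling 18-month view
```

With this made-up data the calendar-year figure is 4,000 while the rolling 18-month figure is 2,000. Neither number is wrong; they simply answer different questions, which is why an agreed definition matters.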

Then there’s the problem of time. Predictive analytics and modeling require accurate historic data to produce good results. For example, when predicting which customers will be good candidates for a mortgage loan, and when, based on age, family size, income, and so on, birth dates entered incorrectly at account opening 10 years ago will likely skew the results. This can mean targeting potential mortgage customers too late or, conversely, promoting rates too early and appearing irrelevant. Making key decisions on poor-quality data and without well-understood processes is not a recipe for success.

Another key hurdle to overcome was the question: who owns the data? The business side of the bank expected to “own” the data but not to have to fix it. They expected the data warehouse managers to fix data issues, but those managers have little knowledge of the data itself. They manage the records, files, folders, cubes, etc.—meaning where and how they are stored and who has access. They also facilitate report creation, but they aren’t data stewards.

Flushing’s solution was to create a data governance program to clearly define terms and build a data dictionary, determine ownership of each area of data, create and enforce guidelines on how data should be collected and entered, and, for compliance reasons, track data lineage so that the bank could identify the origin of each data element used in a calculation or report. Data governance programs touch almost every part of the business and require that the business not only buy into the processes and rules but also actively participate in defining them. Without that commitment, it’s too easy for busy frontline staff, working directly with customers, to take shortcuts on data entry or to put data in the wrong fields.

Flushing struggled somewhat in getting those commitments until CECL (the Current Expected Credit Losses accounting standard) came along. Now there was a compliance issue that had to be addressed. The Flushing CFO, who was responsible for CECL submissions, used the new standard’s requirement for accurate historic data to convince the business of the need for strong data governance. Then, after two years of vendor evaluations, Flushing chose Collibra as its data governance platform and Experian’s Pandora as its data quality solution.

Why do they need both? Because even with agreed data definitions, improved processes, and strong business ownership of data, problems keep arising: data is still entered manually and is therefore prone to error, new data sources are often added, and new analyses put stress on existing data sources. Each time changes are made, Flushing wants to understand the implications and maintain the highest possible level of data quality.

Experian’s Pandora will be used to regularly profile the data, identify errors and issues, create a data quality score (for tracking and improvement) and then leverage its rule-set and other capabilities to automate the fixing of as many issues as possible. New rules, identified by the data governance processes, can be easily created (SQL not required) to correct systematic issues whenever they occur. Pandora will provide notification of issues it finds and report the updated data quality scores to Collibra to provide the necessary record for lineage and compliance purposes.
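The profile-then-score idea can be sketched in a few lines of plain Python. This is not Pandora’s API or scoring method, which the source does not describe. The records, field names, rules, and the simple "percentage of checks passed" score are all assumptions chosen for illustration.

```python
from datetime import date

# Hypothetical account records; field names are illustrative only.
records = [
    {"id": 1, "currency": "USD", "rate": 4.6, "birth_date": date(1980, 5, 1)},
    {"id": 2, "currency": "$US", "rate": 46.0, "birth_date": date(1975, 1, 9)},
    {"id": 3, "currency": "USD", "rate": 3.9, "birth_date": None},
]

# Each rule is a name plus a check that returns True when the record passes.
RULES = {
    "currency_is_canonical": lambda r: r["currency"] in {"USD"},
    "rate_in_band": lambda r: 0.0 <= r["rate"] <= 25.0,
    "birth_date_present": lambda r: r["birth_date"] is not None,
}


def profile(rows, rules):
    """Run every rule on every row; return (score out of 100, failures)."""
    failures = [
        (row["id"], name)
        for row in rows
        for name, check in rules.items()
        if not check(row)
    ]
    total_checks = len(rows) * len(rules)
    if total_checks == 0:
        return 100.0, failures
    score = 100.0 * (total_checks - len(failures)) / total_checks
    return score, failures


score, failures = profile(records, RULES)
print(f"quality score: {score:.1f}")
for record_id, rule_name in failures:
    print(f"record {record_id} failed {rule_name}")
```

Keeping the rules in a table rather than in the scoring logic means a new rule agreed through the governance process is a one-line addition, and the score can be tracked over time to see whether quality is improving.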

Data quality is a key component of any data governance program: it is what keeps the governed data accurate and reliable over time.
