Thursday, December 13, 2007

The Data Quality Problem

1 Data Is a Precious Resource

Data is the fuel we use to make decisions. It records the history of enterprise activities, drives processes of all sorts, and informs important decisions. We maintain and use data as individuals, and we maintain and use it as corporations, governmental organizations, educational institutions, and virtually every other kind of organization.

Many large organizations are nothing but data processing engines. Insurance companies, banks, financial services companies, and the IRS are all organizations that live in a sea of data. Most of what they do is process data.

Think about companies that process credit card transactions. What products do they produce and sell? Just information products. They process tons of data every day. Take their information systems away and there is nothing left.

Other organizations may appear to be less involved with information systems because their products or activities are not information specific. However, looking under the covers you see that most of their activities and decisions are driven or guided by information systems.

Manufacturing organizations produce and ship products. However, data drives the processes of material acquisition, manufacturing work flow, shipping, and billing. Most of these companies would come to a grinding halt if their information systems stopped working. To be a profitable manufacturing company today you need highly tuned information systems for just-in-time parts delivery, effective purchasing systems for adjusting what you produce to ever-changing demand, highly accurate cost accounting systems, applications for the care and feeding of customers, and much more. Those with poor information systems fall behind competitively, and many fall out of business.

The most successful companies are not always those with the best products. Companies must recognize that they must sell what is profitable and drop products that are not. Profitability requires knowledge of the supply chain; knowledge of past, present, and future buying patterns; marketing costs; and sales costs. Consolidation of data from many different systems is required to make the right profit decisions.

Retail operations depend completely on information systems to keep them profitable. They must have the latest technology for highly efficient supply chain management. If not, their competitors will lower prices and force the inefficient information processors out of business. Many are now moving to customer relationship management systems in order to gain an even better competitive position.

Data is becoming more precious all the time. Enterprises are using data more and more to help them make important decisions. These can be daily, routine decisions or long-term strategic decisions. New trends in data warehousing, data mining, decision support, and customer relationship management systems all highlight the ever-expanding role data plays in our organizations.


Note

Data gets more valuable all the time, as additional ways are found to employ it to make our organizations more successful.

Impact of Continuous Evolution of Information Systems

From about 1950 through today, there has been a clear evolution in the use of computer-generated data from simple historical record keeping to ever more active roles. This trend does not show signs of slowing down. Data is generated by more people, is used in the execution of more tasks by more people, and is used in corporate decision making more than ever before.


Note

The more technology we develop, the more users demand from it.

When we look at the past 50 years, the role information systems play has been in a state of constant change. Every IT department has had to devote a significant amount of its resources to implementing new or replacement systems. By the time each new system is deployed, it is often already obsolete, because another replacement technology showed up while it was being built. This drives the department to replace the just-finished system yet again. This process of "continuous evolution" has never stopped, and probably will not for a number of years into the future.

The constant need to remodel systems as fast as they are developed has been driven by enormously fast technology innovation in hardware, communications, and software. No organization has been able to keep up with the rapid pace of technological change. All organizations have been chasing the technology curve in the hope of eventually reaching a stable point, where new systems can survive for a while. They will not reach stability for a long time, as much more technology is being born as this is written.

The need to change is also fueled by the rapid change in the nature of the companies themselves. Mergers and acquisitions drive very rapid and important changes as companies try to merge information systems. Changes in product lines or in markets served drive many hastily implemented changes into information systems. For example, the decision to "go global" can wreak havoc on currency, date, address, and other data elements already in place. Business change impacts are generally the ones implemented the quickest, with the least amount of planning, and they usually produce the worst results.

External changes also cause hastily implemented patches to existing systems: tax law changes, accounting changes such as those experienced in recent years, the Y2K problem, the EURO conversion, and on and on. This rapid evolution has meant that systems have been developed hastily and changed aggressively. This is done with few useful standards for development and control. The software industry has never developed effective standards similar to those the hardware and construction industries enjoy (through blueprints), nor does it have the luxury of time to think through everything it does before committing to systems. The result is that many, if not all, of our systems are very rough edged. These rough edges particularly show through in the quality of the data and the information derived from the data.

A lot of this rapid change happened in order to push information systems into more of the tasks of the enterprise and to involve more people in the process. The Internet promises to bring all people and all tasks within the scope of information systems. At some time in the future, all companies will have an information system backbone through which almost all activity will be conducted. As a result, information systems become bigger, more complex, and, hopefully, more important every time a new technology is thrown in. The data becomes more and more important.

Just about everything in organizations has been "databased." There are personnel databases, production databases, billing and collection databases, sales management databases, customer databases, marketing databases, supply chain databases, accounting databases, financial management databases, and on and on. Whenever anyone wants to know something, they instinctively run to a PC to query a database. It is difficult to imagine that less than 25 years ago there were no PCs and data was not collected on many of the corporate objects and activities of today.

I participated in an audit for a large energy company a few years ago that inventoried over 5,000 databases and tens of thousands of distinct data elements in their corporate information systems. Most corporations do not know how much data they are actually handling on a daily basis.

Not only has most corporate information been put into databases, but it has been replicated into data warehouses, data marts, operational data stores, and business objects. As new ways are discovered to use data, there is a tendency to duplicate the primary data in order to satisfy the new need. The most dramatic example today is the wave of customer relationship management (hereafter, CRM) projects proliferating throughout the IT world.

Replication often includes aggregating data, combining data from multiple sources, putting data into data structures that are different from the original structure, and adding time period information. Often the original data cannot be recognized or found in the aggregations. As a result, errors detected in the aggregations often cannot be traced back to primary instances of data containing the errors.

In addition to replicating, there are attempts to integrate the data of multiple databases inside interactive processes. Some of this integration includes reaching across company boundaries into databases of suppliers, customers, and others.



Adding the demands of replication and integration on top of operational systems adds greatly to the complexity of information systems and places huge burdens on the content of the primary operational systems. Data quality problems get magnified through all of these channels. Figure 1.2 indicates aspects of integration, operation, and replication.


Figure 1.2: Demands on operational databases.

Along with the increasing complexity of systems comes an increase in the impact of inaccurate data. In the primary systems, a wrong value may have little or no impact. It may cause a glitch in the processing of an order, resulting in some small annoyance to fix. However, as this wrong value is propagated to higher-level decision support systems, it may trigger an incorrect reordering of a product or give a decision maker the wrong information on which to base a decision about expanding a manufacturing line. The latter consequences can be much larger than the original.

Although a single wrong value is not likely to cause such drastic results, the cumulative effect of multiple wrong values in that same attribute can collectively deliver very wrong results. Processes that generate wrong values rarely generate only one inaccurate instance.

Acceptance of Inaccurate Data

Databases have risen to the level of being one of the most important corporate assets, if not the most important, and yet corporations tolerate enormous inaccuracies in their databases. Their data quality is not managed as rigorously as most other assets and activities are. Few companies have a data quality assurance program, and many that do have such a program provide too little support to make it effective.

The fact of modern business is that the databases that drive it are of poor to miserable quality, and little is being done about it. Corporations are losing significant amounts of money and missing important opportunities all the time because they operate on information derived from inaccurate data. The cost of poor-quality data is estimated by some data quality experts at 15 to 25% of operating profit. In a recent survey of 599 companies conducted by PricewaterhouseCoopers, poor data management was estimated to be costing global businesses more than $1.4 billion per year in billing, accounting, and inventory snafus alone. Much of that cost is attributable to the accuracy component of data quality.

This situation is not restricted to businesses. Similar costs can be found in governmental or educational organizations as well. Poor data quality is sapping all organizations of money and opportunities. A fair characterization of the state of data quality awareness and responsiveness for the typical large organization is as follows:

  • They are aware of problems with data.

  • They consistently underestimate, by a large amount, the extent of the problem.

  • They have no idea of the cost to the corporation of the problem.

  • They have no idea of the potential value in fixing the problem.

If you can get commitment to a data quality assessment exercise, it almost always raises awareness levels very high. A typical response is "I had no idea the problem was that large." Assessment is the key to awareness, not reading books like this. Most people will believe that the other guy has a larger problem than they do and assume that this book is written for that other guy, not them. Everyone believes that the data quality problem they have is small and much less interesting to address than other initiatives. They are usually very wrong in their thinking. It takes data to change their minds.

The Blame for Poor-Quality Data

Everyone starts out blaming IT. However, data is created by people outside IT, and is used by people outside IT. IT is responsible for the quality of the systems that move the data and store it. However, they cannot be held completely responsible for the content. Much of the problem lies outside IT, through poorly articulated requirements, poor acceptance testing of systems, poor data creation processes, and much more.

Data quality problems are universal in nature. In just about any large organization the state of information and data quality is at the same low levels.

The fact that data quality is universally poor indicates that it is not the fault of individually poorly managed organizations but rather that it is the natural result of the evolution of information system technology. There are two major contributing factors. The first is the rapid system implementations and change that have made it very difficult to control quality. The second is that the methods, standards, techniques, and tools for controlling quality have evolved at a much slower pace than the systems they serve.

Virtually all organizations admit that data quality issues plague their progress. They are all aware of the situation at some level within the enterprise. Quality problems are not restricted to older systems either. Nor are they restricted to particular types of systems. For example, practitioners intuitively assume that systems built on a relational database foundation are of higher data quality than older systems built on less sophisticated data management technology. Under examination, this generally turns out not to be true.

Information technology evolution is at a point where the next most important technology that needs to evolve is methods for controlling the quality of data and the information derived from it. The systems we are building are too important for us to put off addressing this topic any longer.


Awareness Levels

Almost everyone is aware that data from time to time causes a visible problem. However, visibility to the magnitude of the problems and to the impact on the corporation is generally low. There are several reasons for this.

Correction activities, rework, order reprocessing, handling returns, and dealing with customer complaints are all considered a normal part of corporate life. Many of the problems are not attributed to information quality, even when that is the underlying cause. These activities tend to grow in size with little fanfare or visibility. Since the people who carry out these activities are generally not isolated within a single function, the cost and scope of such problems are generally not appreciated.

When decision makers reject IT data because "they just know it can't be right," they generally do not rush into the CEO's office and demand that the data coming from IT be improved. They usually just fall back on their previous methods for making decisions and do not use the information from the databases. Many times data warehouse and decision support systems get built and then go unused for this reason. To make matters worse, decision makers sometimes generate alternative data collection and storage minisystems to use instead of the mainline databases. These often tend to be as bad as or worse in quality than the systems they reject.

IT management often does not want to raise a red flag regarding quality, since they know that they will get blamed for it. Their systems are collecting, storing, and disseminating information efficiently, and they are content with not surfacing the fact that the quality of the data flowing through these systems is bad.

Corporate management wants to believe that their IT departments are top notch and that their systems are first rate. They do not want to expose to their board or to the outside world the facts of inefficiencies or lost opportunities caused by inaccurate data.

If a company included in its annual report a statement that their information quality caused a loss equal to 20% of their operating profit, their stock price would plunge overnight. They do not want this information published, they do not want investors to know, and they do not want their competitors to know. The obvious psychology drives them to not want to know (or believe) it themselves.

Companies tend to hide news about information quality problems. You will never see a company voluntarily agree to a magazine article on how they discovered huge data quality problems and invested millions of dollars to fix them. Even though this is a great story for the corporation, and the results may save them many times the money they spent, the story makes them look like they lost control and were just getting back to where they should have been. It smacks of saying that they have been bad executives and managers and had to spend money to correct their inefficient ways.

I had a conversation with a government agency official in which they indicated that disclosure of data accuracy problems in a particular database would generate a political scandal of considerable proportions, even though the root cause of the quality problems had nothing to do with any of the elected officials. Needless to say, they went about fixing the problem as best they could, with no publicity at all about the project or their findings.

Data quality (and more specifically, data accuracy) problems can have liability consequences. As we move more into the Internet age, in which your company's data is used by other corporations to make decisions about purchasing and selling, costs associated with bad data will eventually be the target of litigation. Corporations surely do not want to trumpet any knowledge they have of quality problems in their databases and give ammunition to the legal staff of others.

The time to brag about spending large budgets to get and maintain highly accurate data and highly accurate information products has not yet arrived. However, the tide is turning on awareness. If you go into almost any IT organization, the data management specialists will all tell you that there are considerable problems with the accuracy of data. The business analysts will tell you that they have problems with data and information quality. As you move up the management chain, the willingness to acknowledge the problems diminishes, usually ending with the executive level denying any quality problems at all. The preceding paragraphs summarize the reasons for this lack of initiative in regard to problems with information quality.

Impact of Poor-Quality Data

We usually cannot scope the extent of data quality problems without an assessment project. This is needed to really nail down the impact on the organization and identify the areas of potential return. The numbers showing the potential savings are not lying around in a convenient account. They have to be dug out through a concerted effort involving several organizational entities. Some areas in which costs are created and opportunities lost through poor data quality are

  • transaction rework costs

  • costs incurred in implementing new systems

  • delays in delivering data to decision makers

  • lost customers through poor service

  • lost production through supply chain problems

Examples of some of these, discussed in the sections that follow, will demonstrate the power of data quality problems to eat away at the financial health of an organization.

Transaction Rework Costs

Many organizations have entire departments that handle customer complaints on mishandled orders and shipments. When the wrong items are shipped and then returned, a specific, measurable cost occurs. There are many data errors that can occur in this area: wrong part numbers, wrong amounts, and incorrect shipping addresses, to name a few. Poorly designed order entry procedures and screens are generally the cause of this problem.

Costs Incurred in Implementing New Systems

One of the major problems in implementing data warehouses, consolidating databases, migrating to new systems, and integrating multiple systems is the presence of data errors and issues that block successful implementation. Issues with the quality of data can, and more than half the time do, increase the time and cost to implement data reuse projects by staggering amounts.

A recent report published by the Standish Group shows that 37% of such projects get cancelled, with another 50% completed but with at least a 20% cost and time overrun and often with incomplete or unsatisfactory results. This means that only 13% of projects are completed reasonably close to their planned time and cost with acceptable outcomes. This is a terrible track record for implementing major projects. Failures are not isolated to a small group of companies or to specific industries. This poor record is found in almost all companies.

Delays in Delivering Data to Decision Makers

Many times you see organizations running reports at the end of time periods and then reworking the results based on their knowledge of wrong or suspicious values. When the data sources are plagued by quality problems, it generally requires manual massaging of information before it can be released for decision-making consumption. The wasted time of people doing this rework can be measured. The poor quality of decisions made cannot be measured. If it takes effort to clean up data before use, you can never be sure if the data is entirely correct after cleanup.

Lost Customers Through Poor Service

This is another category that can easily be spotted. Customers who are lost because they consistently get orders shipped incorrectly, get their invoices wrong, get their payments entered incorrectly, or suffer other aspects of poor service represent a large cost to the corporation.

Lost Production Through Supply Chain Problems

Whenever the supply chain system delivers the wrong parts or the wrong quantity of parts to the production line, there is either a stoppage of work or an oversupply that needs to be stored somewhere. In either case, money is lost to the company.

The general nature of all of these examples is that data quality issues have caused people to spend time and energy dealing with the problems associated with them. The cost in people and time can be considerable. However, over time corrective processes have become routine, and everyone has come to accept this as a normal cost of business. It is generally not visible to higher levels of management and not called out on accounting reports. As a result, an assessment team should be able to identify a great deal of cost in a short period of time.

Requirements for Making Improvements

Too often executives look at quality problems as isolated instances instead of symptoms. This is a natural reaction, considering that they do not want to believe they have problems in the first place. They tend to be reactive instead of proactive. Making large improvements in the accuracy of data and the quality of information from the data can only be accomplished through proactive activities.

Considering the broad scope of quality problems, this is not an area for quick fixes. The attitude that should be adopted is that of installing a new layer of technology over their information systems that will elevate their efficiency and value. It is the same as adding a CRM system to allow marketing to move to a new level of customer care, resulting in higher profits.

The scope of quality problems and the potential for financial gain dictate that a formal program be initiated to address this area. Such a program needs to have a large component dedicated to the topic of data accuracy. Without highly accurate data, information quality cannot be achieved.

To get value from the program, it must be viewed as a long-term and continuous activity. It is like adding security to your buildings. Once you achieve it, you do not stop pursuing it. In spite of the fact that data quality improvement programs are long term, it is important to repeat that significant returns are generally achievable in the short term.

Some of the problems will take a long time to fix. The primary place to fix problems is in the systems that initially gather the data. Rebuilding them to produce more accurate data may take years to accomplish. While long-term improvements are being made, short-term improvements can be made by filtering input data, cleansing data in databases, and creating an awareness among data consumers of the quality they can expect, all of which will significantly improve the use of the data.

A major theme of this book is that you need to train all of your data management team in the concepts of accurate data and to make accurate data a requirement of all projects they work on. This is in addition to having a core group of data quality experts who pursue their own agenda.

There will still be times when overhauling a system solely for the purpose of improving data accuracy is justified. However, most of the time the best way to improve the overall data accuracy of your information systems is to make it a primary requirement of all new projects. That way, you are getting double value for your development dollars.

Expected Value Returned for Quality Program

Experts have estimated the cost of poor information quality at 15 to 25% of operating profits. This assumes that no concerted effort has already been made to improve quality. The amount actually recoverable is less. However, even if you could get only 60% of that back, you would add 9 to 15% to the bottom line. This is a considerable amount. If you are a corporation, this is a lot of profit. If you are an educational institution, this is a lot of money added for improving the campus or faculty. If you are a charitable organization, this is a lot more money going to recipients. If you are a governmental organization, this is more value for the tax dollar.
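
As a rough worked example of that arithmetic (the operating profit figure below is purely hypothetical; the percentages are simply the estimates quoted above):

```python
# Illustration only: hypothetical operating profit, with the estimated ranges above.
operating_profit = 100_000_000        # assumed annual operating profit, in dollars
quality_cost_range = (0.15, 0.25)     # estimated cost of poor data quality
recoverable_share = 0.60              # assume only 60% of that cost can be recovered

for rate in quality_cost_range:
    recovered = operating_profit * rate * recoverable_share
    print(f"If poor quality costs {rate:.0%} of profit, recovering 60% of it "
          f"adds ${recovered:,.0f} ({rate * recoverable_share:.0%} of profit).")
```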

Although these numbers are considerable, they represent the value of concentrating on improving information quality for the organization as it currently exists. However, I suggest that better-quality information systems will reduce the cost of, and accelerate the completion of, steps in evolving the organization to newer business models. There has never been a time in my lifetime when companies were not in the process of implementing newer business or manufacturing systems that promised huge returns when completed. Many of these changes were considered essential for survival. The latest example of this is the move to being Internet based.

Changing a corporation's business and operating systems to a base of high-quality data makes changes occur faster, at lower cost, and with better-quality outcomes. CRM projects are a good example. Moving to a customer-centric model, whereby information about customers drives sales and marketing activities, promises huge returns to corporations. However, we hear that over 60% of CRM implementations either are outright failures or experience long delays. Many of these problems are caused by inaccurate data, making it difficult, if not impossible, to complete the projects.

Data Quality Assurance Technology

Although information quality has remained at low levels or even degraded over the years, there has been progress in the technology for improving it. Although data quality assurance is not yet considered a formal technology, its parts are coming together and will be recognized as such in the near future. The essential elements of the technology are

  • availability of experts and consultants

  • educational materials

  • methodologies

  • software tools

These factors combined allow a corporation to establish a data quality assurance program and realize substantial gain. It is important that these factors become established as standard methods that incorporate the best practices. This will allow the entire IT industry to use the emerging technology effectively and will enable rapid transfer of knowledge between individuals and organizations.

This does not mean that the technology will not evolve, since not everything about this area is yet known. It means that changes to the set of tools should be judged on whether they advance the technology before they are adopted.

Every manufacturing operation has a quality control department. Every accounting department has auditors. There are inspectors for construction sites at every stage of building. There are requirements for formal specification of construction and manufacturing before anything is built. Any serious software development organization has trained quality assurance professionals.

Information systems need the same formality in a group of people and processes to ensure higher levels of quality. Every serious organization with a large IT operation needs a data quality assurance program. They need to require formal documentation of all information assets and sufficient information about them to satisfy all development and user requirements. They need inspectors, auditors, and development consultants. They need an established methodology to continuously monitor and improve the accuracy of data flowing through their information systems.

Availability of Experts and Consultants

Before any technology can take off, it needs the attention of a lot of smart people. When relational technology got its rocket start in the late 1970s and early 1980s, there was research going on in several corporate research organizations (most notably IBM) and in many universities (most notably the University of California at Berkeley). The vast majority of Ph.D. theses in computer science in that era had something to do with relational database technology. An enormous number of start-up companies appeared to exploit the new technology. I did a survey in 1982 and found over 200 companies that had or were building a relational database engine. Today, fewer than five of them have survived. However, those that did survive have been enormously successful.

Data quality has the attention of a few smart people, not the large group that is desirable for a new technology to emerge. However, the number is increasing every year. Many university research efforts are now addressing this topic. The most notable is the M.I.T. TDQM (total data quality management) research program. There are many more university research efforts being aimed at this field every year. In addition, technical conferences devoted to data and information quality are experiencing significant growth in attendance every year.

A number of consultant experts have emerged who are dedicating their careers to the data quality topic. The number increases every year. The quality of these consultants is superb. Corporations should not hesitate to take advantage of their knowledge and experience.

Educational Materials

There is a clear shortage of educational materials in the field of data and information quality. Materials need to be developed and included in standard college courses on computer science. Corporations need to provide education not only to those responsible for data quality assurance but to everyone who is involved in defining, building, executing, or monitoring information systems. There should also be education for consumers of information so that they can more effectively determine how to use information at their disposal and to provide effective requirements and feedback to system developers.

Books and articles are useful tools for education, and plenty of them are available. However, more specific training modules need to be developed and deployed for quality to become an important component of information systems.

Methodologies

There have emerged a number of methodologies for creating and organizing data quality assurance programs, for performing data quality assessments, and for ongoing data stewardship. These can be found in the various books available on data or information quality. This book provides its own methodology, based on data profiling technology, for consideration. More detailed methodologies need to be employed for profiling existing data stores and monitoring data quality in operational settings.

If data quality assurance programs are going to be successful, they must rally around standard methods for doing things that have been proven to work. They then need to employ them professionally over and over again.

Software Tools

There has been a paucity of software tools available to professionals to incorporate into data quality assurance programs. It is ironic that on the topic of data quality the software industry has been the least helpful. Part of the reason for this is that corporations have not been motivated to identify and solve quality problems and thus have not generated sufficient demand to foster the growth of successful software companies focusing on data quality.

More tools are emerging as the industry is waking up to the need for improving quality. You cannot effectively carry out a good program without detailed analysis and monitoring of data. The area of data accuracy specifically requires software to deal with the tons of data that should be looked at.

Metadata Repositories

The primary software tool for managing data quality is the metadata repository. Repositories have been around for a long time but have been poorly employed. Most IT departments have one or more repositories in place and use them with very little effectiveness. Most people would agree that the movement to establish metadata repositories as a standard practice has been a resounding failure. This is unfortunate, as the metadata repository is the one tool that is essential for gaining control over your data.

The failure of repository technology can be traced to a number of factors. The first is that implementations have been poorly defined, with only a vague concept of what they are there for. Often, the real information that people need from them is not included. They tend to dwell on schema definitions and not the more interesting information that people need to do their jobs. There has been a large mismatch between requirements and products.

A second failure is that no one took them seriously. There was never a serious commitment to them. Information system professionals did not use them in their daily jobs. It was not part of their standard tool set. It appeared to be an unnecessary step that stood in the way of getting tasks done.

A third failure is that they were never kept current. They were passive repositories that had no method for verifying that their content actually matched the information systems they were supposed to represent. It is ironic that repositories generally have the most inaccurate data within the information systems organization.

A fourth failure is that the standard repositories were engineered for data architects and not for the wider audience of people who could benefit from the valuable information in an accurate metadata repository. The terminology is too technical, the information maintained is not what that wider audience needs, and access is too restricted.

Since corporations have never accepted the concept of an industry standard repository, most software products on the market deliver a proprietary repository that incorporates only that information needed to install and operate their product. The result is that there are dozens of isolated repositories sitting around that all contain different information, record information in unique ways, and have little, if any, ability to move information to other repositories. Even when this capability is provided, it is rarely used. Repository technology needs to be reenergized based on the requirements for establishing and carrying out an effective data quality assurance program.

Data Profiling

The second important need is analytical tools for data profiling. Data profiling has emerged as a major new technology. It employs analytical methods for looking at data for the purpose of developing a thorough understanding of the content, structure, and quality of the data. A good data profiling product can process very large amounts of data and, with the skills of the analyst, uncover all sorts of issues in the data that need to be addressed.

Data profiling is an indispensable tool for assessing data quality. It is also very useful for periodically checking data to determine whether corrective measures are being effective or to monitor the health of the data over time.

Data profiling uses two different approaches to examining data. One is discovery, whereby processes examine the data and discover characteristics from the data without any prompting from the analyst. In this regard it is performing data mining for metadata. This is extremely important to do because the data takes on a persona of its own, and the analyst may be completely unaware of some of its characteristics. It is also helpful in addressing the problem that the metadata that normally exists for data is usually incorrect, incomplete, or both.
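
To make the discovery approach concrete, here is a minimal sketch, assuming the data has been extracted to a flat file and loaded with pandas; the file name and the set of statistics gathered are illustrative, not a description of any particular profiling product.

```python
import pandas as pd

def discover_profile(df: pd.DataFrame) -> pd.DataFrame:
    """Derive basic metadata from the data itself, without prompting from the analyst."""
    rows = []
    for col in df.columns:
        s = df[col]
        rows.append({
            "column": col,
            "inferred_type": str(s.dtype),
            "null_pct": round(s.isna().mean() * 100, 2),
            "distinct_values": s.nunique(dropna=True),
            "sample_values": s.dropna().unique()[:5].tolist(),
        })
    return pd.DataFrame(rows)

orders = pd.read_csv("orders_extract.csv")   # hypothetical extract of an orders table
print(discover_profile(orders))
```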

The second approach to data profiling is assertive testing. The analyst poses conditions he believes to be true about the data and then executes data rules against the data to check for these conditions and see whether the data conforms. This is also a useful technique for determining how much the data differs from what is expected. Assertive testing is normally done after discovery.
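
A comparable sketch of assertive testing follows; the field names and the rules asserted are hypothetical examples of conditions an analyst might believe to be true.

```python
import pandas as pd

# Hypothetical data rules the analyst asserts should hold for an orders extract.
rules = {
    "quantity is positive":    lambda df: df["quantity"] > 0,
    "ship date follows order": lambda df: df["ship_date"] >= df["order_date"],
    "status is a known value": lambda df: df["status"].isin(["OPEN", "SHIPPED", "CANCELLED"]),
}

def test_assertions(df: pd.DataFrame) -> None:
    """Report how far the data departs from each asserted condition."""
    for name, rule in rules.items():
        violations = int((~rule(df)).sum())
        print(f"{name}: {violations} of {len(df)} rows fail ({violations / len(df):.1%})")

orders = pd.read_csv("orders_extract.csv", parse_dates=["order_date", "ship_date"])
test_assertions(orders)
```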

The output of data profiling will be accurate metadata plus information about data quality problems. One goal of data profiling is to establish the true metadata description of the data. In effect, it can correct the sins of the past.

Data profiling tools exist in the market and are getting better every year. They did not exist five years ago. Data profiling functions are being implemented as part of some older products, and some new products are also emerging that focus on this area. More companies are employing them every year and are consistently amazed at what they can learn from them.

Data Monitoring

The third need is for effective methods of monitoring data quality. A data monitoring tool can be either transaction oriented or database oriented. If transaction oriented, the tool looks at individual transactions before they cause database changes. A database-oriented tool looks at an entire database periodically to find issues.

The goal of a transaction monitor is to screen for potential inaccuracies in the data in the transactions. The monitor must be built into the transaction system. XML transaction systems make this a much more plausible approach. For example, if IBM's MQ is the transaction system being employed, building an MQ node for screening data is very easy to do.
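
The following is a minimal sketch of a transaction screen, independent of any particular messaging product; the checks and field names are hypothetical, and a real monitor built into something like an MQ node would route rejected or flagged messages rather than print them.

```python
from dataclasses import dataclass, field

@dataclass
class ScreenResult:
    ok: bool = True
    errors: list = field(default_factory=list)    # block the transaction
    warnings: list = field(default_factory=list)  # let it through, but raise an alert

def screen_order(txn: dict) -> ScreenResult:
    """Screen a single order transaction before it is allowed to change the database."""
    result = ScreenResult()
    if txn.get("quantity", 0) <= 0:
        result.errors.append("quantity must be positive")
    if not txn.get("ship_to_zip"):
        result.errors.append("missing ship_to_zip")
    if txn.get("unit_price", 0) > 10_000:
        result.warnings.append("unit_price unusually high; flag for review")
    result.ok = not result.errors
    return result

print(screen_order({"quantity": 5, "unit_price": 25_000, "ship_to_zip": "94117"}))
```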

A potential problem with transaction monitors is that they have the potential to slow down processing if too much checking is done. If this is the result, they will tend not to be used very much. Another problem is that they are not effective in generating alerts where something is wrong but not sufficiently wrong to block the transaction from occurring. Transaction monitors need to be carefully designed and judiciously used so as to not impair the effectiveness of the transaction system.

Database monitors are useful for finding a broad range of problems and in performing overall quality assessment. Many issues are not visible in individual transactions but surface when looking at counts, distributions, and aggregations. In addition, many data rules that are not possible to use on individual transactions because of processing time become possible when processing is offline.

Database monitors are also useful in examining collections of data being received at a processing point. For example, data feeds being purchased from an outside group can be fed through a database monitor to assess the quality of the submission.
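
A sketch of a periodic database-level monitor is shown below, using sqlite3 purely as a stand-in for whatever DBMS is in place; the table, checks, and thresholds are hypothetical.

```python
import sqlite3

# Hypothetical aggregate checks that are impractical per transaction but cheap offline.
checks = [
    ("orders with null order dates",
     "SELECT COUNT(*) FROM orders WHERE order_date IS NULL", 0),
    ("duplicate order ids",
     "SELECT COUNT(*) FROM (SELECT order_id FROM orders GROUP BY order_id HAVING COUNT(*) > 1)", 0),
    ("orders with negative totals",
     "SELECT COUNT(*) FROM orders WHERE total < 0", 0),
]

def run_database_monitor(db_path: str) -> None:
    conn = sqlite3.connect(db_path)
    try:
        for name, sql, allowed in checks:
            count = conn.execute(sql).fetchone()[0]
            status = "OK" if count <= allowed else "ALERT"
            print(f"{status}: {name} = {count} (threshold {allowed})")
    finally:
        conn.close()

run_database_monitor("warehouse.db")   # hypothetical database file or staging area for a feed
```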

The most effective data monitoring program uses a combination of transaction and database monitoring. It takes an experienced designer to understand when and where to apply specific rules. The technology of data quality monitors is not very advanced at this point. However, this is an area that will hopefully improve significantly over the next few years.

Data Cleansing Tools

Data cleansing tools are designed to examine existing data to find data errors and fix them. To find an error, you need rules. Once an error is found, either it can cause rejection of the data (usually the entire data object) or it can be fixed. To fix an error, there are only two possibilities: substitution of a synonym or correlation through lookup tables.

Substitution correction involves having a list of value pairs that associates a correct value with each known wrong value. These lists are useful for fixing misspellings or inconsistent representations. The known misspellings are listed with correct spellings. The multiple ways of representing a value are listed with the single preferred representation. These lists can grow over time as new misspellings or new ways of representing a value are discovered in practice.
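
A minimal sketch of substitution correction follows; the value pairs are hypothetical and would in practice be maintained and extended as new wrong values are discovered.

```python
# Hypothetical value-pair list: known wrong value -> preferred value.
substitutions = {
    "Calif.": "CA",
    "Cal": "CA",
    "N. York": "NY",
    "recievable": "receivable",
}

def substitute(value: str) -> str:
    """Return the preferred value for a known wrong value; otherwise pass the input through."""
    return substitutions.get(value, value)

print(substitute("Calif."))   # -> "CA"
print(substitute("TX"))       # unknown values are left unchanged
```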

Correlation requires a group of fields that must be consistent across values. A set of rules or lookup tables establishes the value sets that are acceptable. If a set of values from a database record is not among the acceptable sets, the program looks for a set that matches most of the elements and then fixes the missing or incorrect part. The most common example of this is name and address fields. The correlation set is the government database of values that can go together (e.g., city, state, Zip code, and so on). In fact, there is little applicability of this type of scrubbing for anything other than name and address examination.
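
A sketch of correlation through a lookup table follows, with a toy city/state/ZIP set standing in for the government address data mentioned above; both the lookup rows and the repair policy (take the valid set that agrees on the most fields) are simplified illustrations.

```python
# Hypothetical lookup table of field values known to go together.
valid_sets = {
    ("SAN FRANCISCO", "CA", "94117"),
    ("OAKLAND", "CA", "94601"),
    ("PORTLAND", "OR", "97201"),
}

def correlate(city: str, state: str, zip_code: str) -> tuple:
    """If the trio is not an acceptable set, substitute the valid set that matches best."""
    record = (city.upper(), state.upper(), zip_code)
    if record in valid_sets:
        return record
    return max(valid_sets, key=lambda v: sum(a == b for a, b in zip(v, record)))

# The wrong ZIP is repaired from the city/state portion that does match.
print(correlate("San Francisco", "CA", "94699"))
```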

Database Management Systems

Database management systems (DBMSs) have always touted their abilities to promote correct data. Relational systems have implemented physical data typing, referential constraints, triggers, and procedures to help database designers put transaction screening, database screening, and cleansing into the database structure. The argument is that the DBMS is the right place to look for errors and fix data because it is the single point of entry of data to the database.

Database designers have found this argument useful for some things and not useful for others. The good designers are using the referential constraints. A good database design will employ primary key definitions, data type definitions, null rules, unique rules, and primary/foreign key pair designations to the fullest extent to make sure that data conforms to the expected structure.
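
A brief sketch of that structural support, expressed as DDL run through sqlite3 (the tables are hypothetical, and constraint syntax varies slightly from one DBMS to another):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")   # SQLite enforces foreign keys only when asked

# Data types, null rules, unique rules, and primary/foreign key pairs declared in the
# schema catch structural errors at the single point of entry to the database.
conn.executescript("""
CREATE TABLE customer (
    customer_id INTEGER PRIMARY KEY,
    name        TEXT NOT NULL,
    email       TEXT UNIQUE
);
CREATE TABLE orders (
    order_id    INTEGER PRIMARY KEY,
    customer_id INTEGER NOT NULL REFERENCES customer(customer_id),
    quantity    INTEGER NOT NULL CHECK (quantity > 0),
    order_date  TEXT    NOT NULL
);
""")

# This insert names a customer that does not exist; the DBMS itself rejects it.
try:
    conn.execute("INSERT INTO orders VALUES (1, 999, 5, '2007-12-13')")
except sqlite3.IntegrityError as e:
    print("rejected by the DBMS:", e)
```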

The problems with putting quality screens into the DBMS through procedures and triggers are many. First of all, the rules are buried in obscure code instead of being in a business rule repository. This makes them difficult to review and manage. A second problem is that all processing becomes part of the transaction path, thus slowing down response times. A third problem is that the point of database entry is often "too late" to clean up data, especially in Internet-based transaction systems. The proper way to treat data quality issues is to use a combination of DBMS structural support, transaction monitors, database monitors, and external data cleansing.

Closing Remarks

As information systems become more of the fabric of organizations, they also get more and more complex. The quality of data within them has not improved over the years as other technologies have. The result is that most information systems produce data that is of such poor quality that organizations incur significant losses in operations and decision making. It also severely slows down, and sometimes cripples, attempts to introduce new business models into the organization.

There are many reasons data quality is low and getting lower. This will not change until corporations adopt stringent data quality assurance initiatives. With proper attention, great returns can be realized through improvements in the quality of data.

The primary value to the corporation for getting their information systems into a state of high data quality and maintaining them there is that it gives them the ability to quickly and efficiently respond to new business model changes. This alone will justify data quality assurance initiatives many times over.

Data quality assurance initiatives are becoming more popular as organizations are realizing the impact that improving quality can have on the bottom line. The body of qualified experts, educational information, methodologies, and software tools supporting these initiatives is increasing daily. Corporations are searching for the right mix of tools, organization, and methodologies that will give them the best advantage in such programs.

Data accuracy is the foundation of data quality. You must get the values right first. The remainder of this book focuses on data accuracy: what it means, what is possible, methods for improving the accuracy of data, and the return you can expect for instituting data accuracy assurance programs.
