<?xml version='1.0' encoding='UTF-8'?><?xml-stylesheet href="http://www.blogger.com/styles/atom.css" type="text/css"?><feed xmlns='http://www.w3.org/2005/Atom' xmlns:openSearch='http://a9.com/-/spec/opensearchrss/1.0/' xmlns:georss='http://www.georss.org/georss' xmlns:gd='http://schemas.google.com/g/2005' xmlns:thr='http://purl.org/syndication/thread/1.0'><id>tag:blogger.com,1999:blog-1576936343892993251</id><updated>2011-08-04T09:53:09.927-07:00</updated><title type='text'>Data Quality Accuracy Dimension</title><subtitle type='html'></subtitle><link rel='http://schemas.google.com/g/2005#feed' type='application/atom+xml' href='http://dataqualityaccuracy.blogspot.com/feeds/posts/default'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/1576936343892993251/posts/default?max-results=100'/><link rel='alternate' type='text/html' href='http://dataqualityaccuracy.blogspot.com/'/><link rel='hub' href='http://pubsubhubbub.appspot.com/'/><author><name>vijikumar</name><uri>http://www.blogger.com/profile/18263997193124066136</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><generator version='7.00' uri='http://www.blogger.com'>Blogger</generator><openSearch:totalResults>4</openSearch:totalResults><openSearch:startIndex>1</openSearch:startIndex><openSearch:itemsPerPage>100</openSearch:itemsPerPage><entry><id>tag:blogger.com,1999:blog-1576936343892993251.post-7327193771411794801</id><published>2007-12-13T22:53:00.000-08:00</published><updated>2007-12-13T22:58:57.573-08:00</updated><title type='text'>Implementing a Data Quality Assurance Program</title><content type='html'>&lt;span style="font-size:180%;"&gt;&lt;span style="font-weight: bold;"&gt;Data Quality Assurance&lt;/span&gt;&lt;/span&gt;&lt;br /&gt;&lt;h2 class="first-section-title"&gt;Goals of a Data Quality Assurance Program&lt;/h2&gt; &lt;p class="first-para"&gt;A data quality assurance program is an explicit combination  of organization, methodologies, and activities that exist for the purpose of  reaching and maintaining high levels of data quality. The term &lt;i class="emphasis"&gt;assurance&lt;/i&gt; puts it in the same category as other functions  corporations are used to funding and maintaining. Quality assurance, quality  control, inspection, and audit are terms applied to other activities that exist  for the purpose of maintaining some aspect of the corporation's activities or  products at a high level of excellence. Data quality assurance should take place  alongside these others, with the same expectations.&lt;/p&gt; &lt;p class="para"&gt;Just as we demand high quality in our manufactured products, in  our financial reports, in our information systems infrastructure, and in other  aspects of our business, we should demand it from our data.&lt;/p&gt; &lt;p class="para"&gt;The goal of a data quality assurance program is to reach high  levels of data accuracy within the critical data stores of the corporation and  then keep them there. It must encompass all existing, important databases and,  more importantly, be a part of every project that creates new data stores or  that migrates, replicates, or integrates existing data stores. It must address  not only the accuracy of data when initially collected but accuracy decay,  accurate access and transformation of that data, and accurate interpretation of  the data for users. Its mission is threefold: improve, prevent, monitor.&lt;/p&gt; &lt;p class="para"&gt;Improvement assumes that the current state of data quality is not  where you want it to be. Much of the work is to investigate current databases  and information processes to find and fix existing problems. This effort alone  can take several years for a corporation that has not been investing in data  quality assurance.&lt;/p&gt; &lt;p class="para"&gt;Prevention means that the group should help development and user  departments in building data checkers, better data capture processes, better  screen designs, and better policies to prevent data quality problems from being  introduced into information systems. The data quality assurance team should  engage with projects that build new systems, merge systems, extract data from  new applications, and build integration transaction systems over older systems  to ensure that good data is not turned into bad data and that the best practices  available are used in designing human interfaces.&lt;/p&gt; &lt;p class="last-para"&gt;Monitoring means that changes brought about through data  quality assurance activities need to be monitored to determine if they are  effective. Monitoring also includes periodic auditing of databases to ensure  that new problems are not appearing.&lt;/p&gt; &lt;h2 class="first-section-title"&gt;Structure of a Data Quality Assurance Program&lt;/h2&gt; &lt;p class="first-para"&gt;&lt;a name="160"&gt;&lt;/a&gt;&lt;a name="page69"&gt;&lt;/a&gt;Creating a data quality  assurance program and determining how resources are to be applied needs to be  done with careful thought. The first decision is how to organize the group. The  activities of the group need to be spelled out. Properly skilled staff members  must be assigned. They then need to be equipped with adequate tools and  training.&lt;/p&gt; &lt;div class="section"&gt; &lt;h3 class="sect3-title"&gt;&lt;a name="161"&gt;&lt;/a&gt;&lt;a name="ch04lev2sec1"&gt;&lt;/a&gt;Data Quality  Assurance Department&lt;/h3&gt; &lt;p class="first-para"&gt;There should be a data quality assurance department. This  should be organized so that the members are fully dedicated to the task of  improving and maintaining higher levels of data quality. It should not have  members who are part-time. Staff members assigned to this function need to  become experts in the concepts and tools used to identify and correct quality  problems. This will make them a unique discipline within the corporation. &lt;/p&gt;&lt;div class="figure"&gt;&lt;br /&gt;&lt;span class="figure-title"&gt;&lt;span class="figure-titlelabel"&gt;&lt;br /&gt;&lt;/span&gt;&lt;/span&gt; &lt;/div&gt; &lt;p class="para"&gt;The group needs to have members who are expert data analysts.  Analyzing data is an important function of the group. Schooling in database  architecture and analytical techniques is a must to get the maximum value from  these activities. It should also have staff members who are experienced business  analysts. So much of what we call quality deals with user requirements and  business interpretation of data that this side of the data cannot be  ignored.&lt;/p&gt; &lt;p class="para"&gt;The data quality assurance group needs to work with many other  people in the corporation. It needs to interact with all of the data management  professionals, such as database administrators, data architects, repository  owners, application developers, and system designers. They also need to spend a  great deal of time with key members of the user community, such as business &lt;a name="163"&gt;&lt;/a&gt;&lt;a name="page70"&gt;&lt;/a&gt;analysts, managers of departments, and web  designers. This means that they need to have excellent working relationships  with their customers.&lt;/p&gt; &lt;div class="sidebar"&gt; &lt;table class="BlueLine" border="0" cellpadding="0" cellspacing="0" width="100%"&gt; &lt;tbody&gt; &lt;tr&gt; &lt;td class="bluecell" bg style="color:#000080;"&gt;&lt;span style="font-family:Arial;font-size:85%;color:#010100;"&gt;&lt;b&gt;&lt;img alt="Start Sidebar" src="_.gif" border="0" height="2" width="1" /&gt;&lt;/b&gt;&lt;/span&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;/table&gt; &lt;p class="first-para"&gt;There is a strong parallel between the emergence of data  quality assurance to the improvements made in software development in the 1970s  and 1980s. Software development teams back then consisted mostly of programmers.  They wrote the code, tested the product, and wrote the user manuals. This was  the common practice found in the best of software development groups.&lt;/p&gt; &lt;p class="para"&gt;In my first job at IBM I designed, developed the code, tested,  wrote user documents, and provided customer support of a software product  (Apparel Business Control System). It was a one-person project. Although the  product had high quality and good customer acceptance, I believe it would have  gone better and been a better product if I had access to professional writers  and software quality assurance people.&lt;/p&gt; &lt;p class="para"&gt;In response to the continual problems of poorly tested products  and very poor user manuals, companies started dedicating some of the programmers  to ensuring the quality of code (testing) and began to hire professional  technical writers. There was an immediate improvement in both the code and user  manuals. As time went on, these two areas became established disciplines.  Software development companies specialized in building tools for these  disciplines; colleges offered classes and tracks for these disciplines.&lt;/p&gt; &lt;p class="para"&gt;The programmers that tested were no different from those that  wrote the code in the beginning. They made huge improvements only because they  were dedicated to testing, worked with the programmers throughout the entire  project, and brought another view to the use of the code. In time, they became  even better as they developed very effective methodologies and tools for  testing. Testing became a unique technology in its own right.&lt;/p&gt; &lt;p class="para"&gt;The cost of these programs is clearly zero. Every serious  development group today separates code quality assurance from code development.  Projects finish earlier, with higher-quality results. The projects spend less  money (much less money) and use up less time (much less time) than they would if  programmers were still doing the testing.&lt;/p&gt; &lt;p class="last-para"&gt;Data quality is emerging as a major topic 20 years later. The  same evolution is happening. Making data quality the responsibility of the data  management staff who design, build, and maintain our systems means that they do  not become experts in the methodologies and tools available, do not have the  independence to prioritize their work, and do not focus on the single task of  ensuring high-quality data. Data quality assurance must be the full-time task of  dedicated professionals to be effective.&lt;/p&gt; &lt;table class="BlueLine" border="0" cellpadding="0" cellspacing="0" width="100%"&gt; &lt;tbody&gt; &lt;tr&gt; &lt;td class="bluecell" bg style="color:#000080;"&gt;&lt;span style="font-family:Arial;font-size:85%;color:#010100;"&gt;&lt;b&gt;&lt;img alt="End Sidebar" src="_.gif" border="0" height="2" width="1" /&gt;&lt;/b&gt;&lt;/span&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;/table&gt;&lt;/div&gt; &lt;table class="BlankSpace" border="0" cellpadding="0" cellspacing="0" width="100%"&gt; &lt;tbody&gt; &lt;tr&gt; &lt;td height="16"&gt;&lt;br /&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;/table&gt; &lt;p class="last-para"&gt;&lt;a name="164"&gt;&lt;/a&gt;&lt;a name="page71"&gt;&lt;/a&gt;One way to achieve a high  level of cooperation is to have an advisory group that meets periodically to  help establish priorities, schedules, and interactions with the various groups.  This group should have membership from all of the relevant organizations. It  should build and maintain an inventory of quality assurance projects that are  worth doing, keep this list prioritized, and assign work from it. The advisory  group can be very helpful in assessing the impact of quality problems as well as  the impact of corrective measures that are subsequently implemented.&lt;/p&gt;&lt;/div&gt; &lt;div class="section"&gt; &lt;h3 class="sect3-title"&gt;&lt;a name="165"&gt;&lt;/a&gt;&lt;a name="ch04lev2sec2"&gt;&lt;/a&gt;Data Quality  Assurance Methods&lt;/h3&gt; &lt;p class="first-para"&gt; The first component is the quality  dimensions that need to be addressed. The second is the methodology for  executing activities, and the last is the three ways the group can get involved  in activities.&lt;/p&gt; &lt;div class="figure"&gt;&lt;a name="166"&gt;&lt;/a&gt;&lt;a name="ch04fig02"&gt;&lt;/a&gt;&lt;span class="figuremediaobject"&gt;&lt;a href="images/fig4%2D2%5F0%2Ejpg" name="IMG_13" target="_parent"&gt;&lt;/a&gt;&lt;/span&gt; &lt;br /&gt;&lt;span class="figure-title"&gt;&lt;span class="figure-titlelabel"&gt;&lt;br /&gt;&lt;/span&gt;&lt;/span&gt; &lt;/div&gt; &lt;p class="para"&gt;The figure highlights the top line of each component to show where  a concentration on data accuracy lies. Data accuracy is clearly the most  important dimension of quality. The best way to address accuracy is through an  insideout methodology, discussed later in the book. This methodology depends  heavily on analysis of data through a process called data profiling. The last  part of this book is devoted to explaining data profiling. Improving accuracy  can be done through any of the activities shown. However, the one that will  return the most benefit is generally the one shown: project services.&lt;/p&gt; &lt;p class="para"&gt;Any data quality assurance function needs to address all of the  dimensions of quality. The first two, data accuracy and completeness, focus on  data stored in corporate databases. The other dimensions focus on the user  community and how they interpret and use data.&lt;/p&gt; &lt;p class="para"&gt; Both of  these methodologies have a goal of identifying data quality issues. An &lt;a name="167"&gt;&lt;/a&gt;&lt;a name="page72"&gt;&lt;/a&gt;issue is a problem that has surfaced, that is  clearly defined, and that either is costing the corporation something valuable  (such as money, time, or customers) or has the potential of costing the  corporation something valuable. Issues are actionable items: they result in  activities that change the data quality of one or more databases. Once  identified, issues are managed through an issues management process to determine  value, remedies, resolution, and monitoring of results. The process of issue  management is discussed more fully in the next chapter.&lt;/p&gt;  &lt;div class="section"&gt; &lt;h4 class="sect4-title"&gt;Inside-Out Method&lt;/h4&gt; &lt;p class="first-para"&gt;The inside-out method starts with analyzing the data. A  rigorous examination using data profiling technology is performed over an  existing database. Data inaccuracies are produced from the process that are then  analyzed together to generate a set of data issues for subsequent  resolution.&lt;/p&gt; &lt;p class="para"&gt;The analysis should be done by a highly qualified data analyst who  understands the structure of the data. The methodology starts with a complete  and correct set of rules that define data accuracy for the data. This is  metadata. It consists of descriptions of the data elements, values permitted in  them, how they relate to one another in data structures, and specific data rules  that describe value correlation conditions that should always be true within the  data. All of these categories are discussed at length in later chapters.&lt;/p&gt; &lt;p class="para"&gt;Of course, such a rigorous rule set for any operational database  does not exist. The metadata that is available is generally incomplete and most  likely inaccurate. The data profiling process described in later chapters is a  process that completes and corrects the metadata, along with using it to find  evidence of inaccurate data. This intertwined process has a very valuable  by-product: accurate and complete metadata.&lt;/p&gt; &lt;p class="para"&gt;The process of determining the correct metadata inevitably  involves conferring with business analysts and end users. The data analyst will  detect a behavior in the data and require consultation to determine why it is  so. This often leads to modifications to the metadata. These consultations are  &lt;a name="169"&gt;&lt;/a&gt;&lt;a name="page73"&gt;&lt;/a&gt;always productive because the question is  always backed up by information from the data.&lt;/p&gt; &lt;p class="para"&gt;The data analyst should identify who in the user community will be  the most valuable in consulting on issue identification and form a small,  dynamic working group with them. In the end, they should always agree on what  the final metadata is, and agree on the inaccurate data facts derived from the  comparison with the actual data.&lt;/p&gt; &lt;p class="para"&gt;The inaccurate data evidence produced is a collection of facts. It  may be explicit cases of wrong or missing values, or it may identify rules that  fail without being able to say what values are wrong. For example, one fact may  be that 30% of purchase order records do not have a supplier ID. Another may be  that the employee birth date field has values that are invalid: too long ago or  too recent. Another might be that the percent of the color BLUE in a database is  too large. In this case, the analyst does not know which instances are correct  and which are wrong; only that some of them must be wrong.&lt;/p&gt; &lt;p class="last-para"&gt;The facts are aggregated into issues. Some facts are issues  by themselves. For example, the supplier ID problem may be the basis for a  single issue. Others are aggregated into a larger issue. An example is that  customer demographic fields in a marketing database contain numerous errors in  all fields, possibly indicating a general problem with form design.&lt;/p&gt;&lt;/div&gt; &lt;div class="section"&gt; &lt;h4 class="sect4-title"&gt;Outside-In Method&lt;/h4&gt; &lt;p class="first-para"&gt;This method looks for issues in the business, not the data.  It identifies facts that suggest that data quality problems are having an impact  on the business. It looks for rework, returned merchandise, customer complaints,  lost customers, delays in getting information products completed, high amounts  of work required to get information products produced, and so on. Interviews are  done with users to determine their level of trust in the accuracy of data coming  from the information systems and their level of satisfaction with getting  everything they need. It may also include looking for decisions made by the  corporation that turned out to be wrong decisions.&lt;/p&gt; &lt;p class="para"&gt;These facts are then examined to determine the degree of  culpability attributable to defects in the data. The data is then examined to  determine if it has inaccuracies that contribute to problems, and to determine  the scope of the contribution. This examination is generally pointed at the  specific problem. It is generally not a thorough data profiling exercise,  although it could be expanded to that if the evidence indicates a widespread  quality problem with the data.&lt;/p&gt; &lt;p class="last-para"&gt;This approach is generally the work of the data quality  assurance team member with skills as a business analyst. It involves heavy  participation on the &lt;a name="170"&gt;&lt;/a&gt;&lt;a name="page74"&gt;&lt;/a&gt;part of outside people.  It also requires conference sessions with user community experts. The result is  a collection of data issues that are then tracked on the same path as those from  the inside-out methodology.&lt;/p&gt;&lt;/div&gt; &lt;div class="section"&gt; &lt;h4 class="sect4-title"&gt;Comparison Of Methods&lt;/h4&gt; &lt;p class="first-para"&gt;Neither approach is superior to the other: they both bring  value to the process. However, they do not get to the same end point. Data  quality assurance groups should use both methodologies as applicable.&lt;/p&gt; &lt;p class="para"&gt;Inside-out is generally easier to accomplish and uses less people  time. A single analyst can analyze a great deal of data in a short time. The  data quality assurance group can accomplish a great deal with this approach with  the staff within their own department. The outside-in approach requires spending  a lot of time interviewing people in other departments.&lt;/p&gt; &lt;p class="para"&gt;The inside-out approach is nondisruptive. You just get a copy of  the data you want to analyze and do it offline. The outside-in approach requires  scheduling time for others, thus interrupting their regular activities.&lt;/p&gt; &lt;p class="para"&gt;The inside-out approach will catch many problems the outside-in  approach does not catch. For an outside-in approach to catch a problem, it must  manifest itself in some external behavior, and that behavior must be  recognizable as being not good.&lt;/p&gt; &lt;p class="para"&gt;An example of a hidden problem is a case in which missing supplier  ID numbers on purchase orders causes a company not to get maximum discounts they  were entitled to from suppliers. The purchase order volumes were summarized by  supplier ID and, because the field was missing on 30% of the records, the  amounts were low. The company was losing millions of dollars every year because  of this and was completely unaware that it was happening. The inside-out  approach catches this; the outside-in approach does not.&lt;/p&gt; &lt;p class="para"&gt;Another type of problem are those inaccuracies that have the  potential for a problem but for which the problem has not yet occurred. An  example of this is where an HR database failed to capture government  classification group information on employees accurately. Many minority  employees were not classified as minorities, nor were handicapped employees all  being identified as handicapped. No problem may have surfaced yet. However, the  potential for being denied contracts in the future because of these inaccuracies  is waiting to happen. Inside-out analysis will catch this; outside-in will  not.&lt;/p&gt; &lt;p class="para"&gt;The opposite is also true. The inside-out approach will not catch  problems where the data is inaccurate but valid. The data can pass all metadata  tests and still be wrong. This can happen either because the rule set is  incomplete or because the data hides underneath all of the rules. An example is  getting the part number wrong on orders. The wrong merchandise is shipped. An  analysis of the data will not reveal inaccurate data because all of the part  numbers &lt;a name="171"&gt;&lt;/a&gt;&lt;a name="page75"&gt;&lt;/a&gt;are valid numbers. The outside-in  approach catches these problems better. (The inside-out approach may catch this  if the analysis finds the percentage of orders returned to be higher than an  acceptable threshold. This is possible if a data rule or value test has been  formulated. These topics are covered in &lt;a class="chapterjump" href="LiB0077.html#494" target="_parent"&gt;Chapters 11&lt;/a&gt; and &lt;a class="chapterjump" href="LiB0083.html#520" target="_parent"&gt;12&lt;/a&gt;).&lt;/p&gt; &lt;p class="last-para"&gt;There is another class of problems not detectable by either  approach. The data is valid but wrong and also produces insufficient external  evidence to raise a flag. Although these generally are of little concern to a  corporation, they have the potential to be costly in the future if not detected.  A data quality assurance program built exclusively using only one approach is  generally going to miss some important issues.&lt;/p&gt;&lt;/div&gt;&lt;/div&gt; &lt;div class="section"&gt; &lt;h3 class="sect3-title"&gt;&lt;a name="172"&gt;&lt;/a&gt;&lt;a name="ch04lev2sec3"&gt;&lt;/a&gt;Data Quality  Assurance Activities&lt;/h3&gt; &lt;p class="first-para"&gt;The data quality assurance team must decide how it will  engage the corporation to bring about improvements and return value for their  efforts. The group should set an explicit set of guidelines for what activities  they engage in and the criteria for deciding one over another. This is best done  with the advisory group.&lt;/p&gt; &lt;p class="para"&gt;There are three primary roles the group can adopt. This is shown  as the last column in &lt;a class="internaljump" href="#ch04fig02" role="chapterjump"&gt;Figure 4.2&lt;/a&gt;. One of them, project services, involves  working directly with other departments on projects. Another, stand-alone  assessments, involves performing assessments entirely within the data quality  assurance group. Both of these involve performing extensive analysis of data and  creating and resolving issues. The other activity, teach and preach, involves  educating and encouraging employees in other groups to perform data auditing  functions and to employ best practices in designing and implementing new  systems.&lt;/p&gt; &lt;div class="section"&gt; &lt;h4 class="sect4-title"&gt;Project Services&lt;/h4&gt; &lt;p class="first-para"&gt;The vast majority of projects being pursued by the IT  organization involve repurposing an existing database. It is rare these days to  see a truly new application being developed that does not draw from data that  has already been collected in an existing application. Examples of projects that  involve working with existing data stores are&lt;/p&gt; &lt;ul class="itemizedlist"&gt;&lt;li class="first-listitem"&gt; &lt;p class="first-para"&gt;data migration to new applications (generally packaged  applications)&lt;/p&gt; &lt;/li&gt;&lt;li class="listitem"&gt; &lt;p class="first-para"&gt;consolidation of databases as a result of mergers and  acquisitions&lt;/p&gt; &lt;/li&gt;&lt;li class="listitem"&gt; &lt;p class="first-para"&gt;consolidation of databases to eliminate departmental  versions of applications&lt;a name="173"&gt;&lt;/a&gt;&lt;a name="page76"&gt;&lt;/a&gt; &lt;/p&gt; &lt;/li&gt;&lt;li class="listitem"&gt; &lt;p class="first-para"&gt;replication of data into data warehouses, data marts, or  operational data stores&lt;/p&gt; &lt;/li&gt;&lt;li class="listitem"&gt; &lt;p class="first-para"&gt;building a CRM system&lt;/p&gt; &lt;/li&gt;&lt;li class="listitem"&gt; &lt;p class="first-para"&gt;application integration that connects two or more  applications&lt;/p&gt; &lt;/li&gt;&lt;li class="listitem"&gt; &lt;p class="first-para"&gt;application integration that connects an older database to  the Internet&lt;/p&gt;&lt;/li&gt;&lt;/ul&gt; &lt;p class="para"&gt;There is a real danger in all of these applications of introducing  errors through mistakes made due to a misunderstanding of the data. There is  also a real danger in the data from the original systems not being of sufficient  quality to meet the demands of the new use of the data. Both of these are  classical concerns that if not addressed will certainly cause great difficulty  in completing the projects, as well as unhappiness with the outcome.&lt;/p&gt; &lt;p class="para"&gt;The data quality assurance team can provide an invaluable service  to these projects by profiling the data. By doing this they provide two valuable  outputs: an accurate and complete metadata description of the data and an  inventory of data quality problems uncovered in the process.&lt;/p&gt; &lt;p class="para"&gt;The metadata repository produced should be used to match target  system requirements against the content and structure of the source systems. It  is also the perfect input to developing processes for extraction,  transformation, cleansing, and loading processes.&lt;/p&gt; &lt;p class="para"&gt;The data quality assurance team can use the inaccuracy facts to  determine either whether the data is strong enough to satisfy the intended use  or whether there is a need to establish new projects from the issues to drive  improvements in the source systems. Of course, this applies to cases in which  the source databases continue to live past the project, as is the case for  replication and integration projects.&lt;/p&gt; &lt;p class="para"&gt;The data quality assurance team can also provide advice and  oversight in the design of target database structures, as well as processes for  collecting or updating data. They also have a good opportunity to get data  checking and monitoring functions embedded in the new systems to help prevent  future quality problems.&lt;/p&gt; &lt;p class="para"&gt;Why should the data quality assurance team perform these tasks, as  opposed to the project teams? The answer is that the data quality assurance team  are experts in data quality technologies.They are experienced in data profiling,  investigation of issues, and fabrication of data quality problem remedies.&lt;/p&gt; &lt;p class="para"&gt;One of the most valuable outputs of data profiling at the  beginning of a project is to learn that the project cannot achieve its goals  because of the condition of the source data. When this happens, the project team  can then make decisions about changing target design, changing target  expectations, making improvements to data sources, or scrapping the project  outright. This is the &lt;a name="174"&gt;&lt;/a&gt;&lt;a name="page77"&gt;&lt;/a&gt;perfect place to make  these decisions: before most of the project money has been spent and before most  of the development work has been done.&lt;/p&gt; &lt;p class="last-para"&gt;Projects that do not perform a thorough review of the source  data generally do not discover the match between the data and the project  requirements until after much time and money has been spent. It is generally  very expensive to repair the damage that has already been done and impossible to  recoup the money spent and the valuable time lost.&lt;/p&gt;&lt;/div&gt; &lt;div class="section"&gt; &lt;h4 class="sect4-title"&gt;Stand-Alone Assessments&lt;/h4&gt; &lt;p class="first-para"&gt;A stand-alone assessment is a project organized for the  purpose of determining the health of an existing database. The database is  chosen because of suspicions or evidence about problems coming from the use of  data, or simply because it is an important data source for the corporation.&lt;/p&gt; &lt;p class="para"&gt;The data quality assurance team will generally execute the entire  project. Using the inside-out method, they will profile the data, collect  quality facts, produce issues, and then follow the issues through to  remedies.&lt;/p&gt; &lt;p class="para"&gt;The advantage of assessment projects is that they do not require  as much interaction with other project teams and can be scheduled without  concern for other plans in IT. Of course, it makes no sense to schedule an  assessment of a database that is about to get a facelift as a result of another  project.&lt;/p&gt; &lt;p class="para"&gt;An assessment can be quite disruptive to other departments, even  if no change activity is under way for the data source. Time from them will be  needed to develop perfect understanding of the metadata and to interpret facts  that come out of profiling. If remedies are needed, negotiations with IT and  users will be needed to get them designed and implemented. It may also be quite  disturbing to people to find out that they have been using flawed data for a  long time without knowing it. The data quality assurance team needs to involve  the other departments in the planning phase and keep them involved throughout  the process.&lt;/p&gt; &lt;p class="last-para"&gt;It is important not to appear as an outside hit team trying  to do damage to the reputation of the operational organizations. Involving them  makes them part of the solution.&lt;/p&gt;&lt;/div&gt; &lt;div class="section"&gt; &lt;h4 class="sect4-title"&gt;Teach And Preach&lt;/h4&gt; &lt;p class="first-para"&gt;This function involves training information system staff  members on the technology available for data quality assessment, the techniques  and best practices available for building and maintaining systems, and how to  develop quality requirements and use them to qualify data.&lt;/p&gt; &lt;p class="para"&gt;Few information systems professionals come out of college with  training explicitly targeted to data quality. The principles are not difficult  to understand, &lt;a name="175"&gt;&lt;/a&gt;&lt;a name="page78"&gt;&lt;/a&gt;nor are the disciplines  difficult to use in daily practice. Educating them will improve all of the work  they do.&lt;/p&gt; &lt;p class="para"&gt;The data quality assurance group should function as the experts in  data quality. They should not keep this knowledge exclusively to themselves. The  more they educate others in the corporation, the more likely the information  systems will reach and stay at a high level of quality.&lt;/p&gt; &lt;p class="para"&gt;Preaching means that the data quality assurance department should  encourage and insist that quality checkpoints be put into all projects. They  should encourage upper management to be cognizant of the need for data quality  activities. They should collect and advertise the value to the corporation  realized from these activities.&lt;/p&gt; &lt;p class="last-para"&gt;The data quality assurance group should not depend  exclusively on teaching and preaching. If that is all they do, the company will  never develop the focused expertise needed to analyze the mountains of data and  drive improvements.&lt;/p&gt;&lt;p class="last-para"&gt; &lt;/p&gt;&lt;h2 class="first-section-title"&gt;Closing Remarks&lt;/h2&gt; &lt;p class="first-para"&gt;If you want high data quality you must have highly accurate  data. To get that you need to be proactive. You need a dedicated, focused  group.&lt;/p&gt; &lt;p class="para"&gt;You need to focus on data accuracy. This means you need an  organization that is dedicated to improving data accuracy. You also need trained  staff members who consider the skills required to achieve and maintain data  accuracy as career-building skills.&lt;/p&gt; &lt;p class="para"&gt;You need to use technology heavily. Achieving high levels of data  accuracy requires looking at data and acting on what you see. You need to do a  lot of data profiling. You need to have experienced staff members who can sniff  out data issues.&lt;/p&gt; &lt;p class="para"&gt;You need to treat information about your data as of equal or  greater importance than the data itself. You must install and maintain a  legitimate metadata repository and use it effectively.&lt;/p&gt; &lt;p class="para"&gt;You need to educate other corporate employees in the importance of  data and in what they can do to improve the accuracy. This includes the  following elements.&lt;/p&gt; &lt;ul class="itemizedlist"&gt;&lt;li class="first-listitem"&gt; &lt;p class="first-para"&gt;Business users of data need to be sensitized to quality  issues.&lt;/p&gt; &lt;/li&gt;&lt;li class="listitem"&gt; &lt;p class="first-para"&gt;Business analysts must become experts on data quality  concepts and play an active role in data quality projects.&lt;/p&gt; &lt;/li&gt;&lt;li class="listitem"&gt; &lt;p class="first-para"&gt;Developers need to be taught best practices for database and  application design to ensure improved data accuracy.&lt;/p&gt; &lt;/li&gt;&lt;li class="listitem"&gt; &lt;p class="first-para"&gt;&lt;a name="177"&gt;&lt;/a&gt;&lt;a name="page79"&gt;&lt;/a&gt;Data administrators need  to be taught the importance of accuracy and how they can help improve it.&lt;/p&gt; &lt;/li&gt;&lt;li class="listitem"&gt; &lt;p class="first-para"&gt;All employees who generate data need to be educated on the  importance of data accuracy and be given regular feedback on the quality of data  they generate.&lt;/p&gt; &lt;/li&gt;&lt;li class="listitem"&gt; &lt;p class="first-para"&gt;The executive team needs to understand the value of improved  data accuracy and the impact it has on improved information  quality.&lt;/p&gt;&lt;/li&gt;&lt;/ul&gt; &lt;p class="last-para"&gt;You need to make quality assurance a part of all data  projects. Data quality assurance activities need to be planned along with all of  the other activities of the information systems department. Assisting a new  project in achieving its data quality goals is of equal or higher value than  conducting assessment projects in isolation. The more integrated data quality  assurance is with the entire information system function, the more value is  realized. And finally, everyone needs to work well together to accomplish the  quality goals of the corporation.&lt;/p&gt;&lt;p class="last-para"&gt; &lt;/p&gt;&lt;h1 class="chapter-title"&gt;Data Quality Issues Management&lt;/h1&gt; &lt;div class="section"&gt; &lt;h2 class="sect2-title"&gt;&lt;a name="179"&gt;&lt;/a&gt;Overview&lt;/h2&gt; &lt;p class="first-para"&gt;&lt;a name="180"&gt;&lt;/a&gt;&lt;a name="page80"&gt;&lt;/a&gt;Data quality  investigations are all designed to surface problems with the data. This is true  whether the problems come from stand-alone assessments or through data profiling  services to projects. It also does not matter whether assessments reveal  problems from an inside-out or an outside-in method. The output of all these  efforts is a collection of facts that get consolidated into issues. An issue is  a problem with the database that calls for action. In the context of data  quality assurance, it is derived from a collection of information that defines a  problem that has a single root cause or can be grouped to describe a single  course of action.&lt;/p&gt; &lt;p class="para"&gt;That is clearly not the end of the data quality effort. Just  identifying issues does nothing to improve things. The issues need to drive  changes that will improve the quality of the data for the eventual users.&lt;/p&gt; &lt;p class="para"&gt;It is important to have a formal process for moving issues from  information to action. It is also important to track the progress of issues as  they go through this process. The disposition of issues and the results obtained  from implementing changes as a result of those issues are the true documentation  of the work done and value of the data quality assurance department.&lt;/p&gt; &lt;p class="para"&gt;It does not  matter who performs these phases. The data quality assurance department may own  the entire process. However, much of the work lies outside this department. It  may be a good idea to form a committee to meet regularly and discuss progress of  issue activity. The leader of the committee should probably be from the data  quality assurance department. At any rate, the department has a vested interest  in getting issues turned into actions and in results being measured. They should  not be passive in pursuing issue resolution. This is the fruit of their  work.&lt;/p&gt; &lt;div class="figure"&gt;&lt;a name="181"&gt;&lt;/a&gt;&lt;a name="ch05fig01"&gt;&lt;/a&gt;&lt;span class="figuremediaobject"&gt;&lt;a href="images/fig5%2D1%5F0%2Ejpg" name="IMG_15" target="_parent"&gt;&lt;/a&gt;&lt;/span&gt; &lt;br /&gt;&lt;span class="figure-title"&gt;&lt;span class="figure-titlelabel"&gt;&lt;br /&gt;&lt;/span&gt;&lt;/span&gt; &lt;/div&gt; &lt;p class="para"&gt;&lt;a name="182"&gt;&lt;/a&gt;&lt;a name="page81"&gt;&lt;/a&gt;An issue management system  should be used to formally document and track issue activity. There are a number  of good project management systems available for tracking problems through a  work flow process.&lt;/p&gt; &lt;p class="last-para"&gt;The collection of issues and the management process can  differ if the issues surface from a "services to project" activity. The project  may have an issues management system in place to handle all issues related to  the project. They certainly should. In this case, the data quality issues may be  mixed with other issues, such as extraction, transformation, target database  design, and packaged application modification issues. It is helpful if data  quality issues are kept in a separate tracking database or are separately  identified within a central project management system, so that they can be  tracked as such. If "project services" data profiling surfaces the need to  upgrade the source applications to generate less bad data, this should be broken  out into a separate project or subproject and managed independently.&lt;/p&gt;&lt;p class="last-para"&gt; &lt;/p&gt;&lt;h2 class="first-section-title"&gt;Turning Facts into Issues&lt;/h2&gt; &lt;p class="first-para"&gt;Data quality investigations turn up facts. The primary job  of the investigations is to identify inaccurate data. The data profiling process  will produce inaccuracy facts that in some cases identify specific instances of  wrong values. Other cases identify where wrong values exist but identification  of which value is wrong is not known, and in yet other cases identify facts that  raise suspicions about the presence of wrong values.&lt;/p&gt; &lt;p class="para"&gt;&lt;a name="184"&gt;&lt;/a&gt;&lt;a name="page82"&gt;&lt;/a&gt;Facts are individually  granular. This means that each rule has a list of violations. You can build a  report that lists rules, the number of violations, and the percentage of tests  performed (rows, objects, groups tested) that violated the rule. The violations  can be itemized and aggregated.&lt;/p&gt; &lt;div class="section"&gt; &lt;h3 class="sect3-title"&gt;&lt;a name="185"&gt;&lt;/a&gt;&lt;a name="ch05lev2sec1"&gt;&lt;/a&gt;Metrics&lt;/h3&gt; &lt;p class="first-para"&gt;There is a strong temptation for quality groups to generate  metrics about the facts and to "grade" a data source accordingly. Sometimes this  is useful; sometimes not. Examples of metrics that can be gathered are&lt;/p&gt; &lt;ul class="itemizedlist"&gt;&lt;li class="first-listitem"&gt; &lt;p class="first-para"&gt;number of rows containing at least one wrong value&lt;/p&gt; &lt;/li&gt;&lt;li class="listitem"&gt; &lt;p class="first-para"&gt;graph of errors found by data element&lt;/p&gt; &lt;/li&gt;&lt;li class="listitem"&gt; &lt;p class="first-para"&gt;number of key violations (nonredundant primary keys,  primary/foreign key orphans)&lt;/p&gt; &lt;/li&gt;&lt;li class="listitem"&gt; &lt;p class="first-para"&gt;graph of data rules executed and number of violations  returned&lt;/p&gt; &lt;/li&gt;&lt;li class="listitem"&gt; &lt;p class="first-para"&gt;breakdown of errors based on data entry locations&lt;/p&gt; &lt;/li&gt;&lt;li class="listitem"&gt; &lt;p class="first-para"&gt;breakdown of errors based on data creation  date&lt;/p&gt;&lt;/li&gt;&lt;/ul&gt; &lt;p class="para"&gt;The data profiling process can yield an interesting database of  errors derived from a large variety of rules. A creative analyst can turn this  into volumes of graphs and reports. You can invent an aggregation value that  grades the entire data source. This can be a computed value that weights each  rule based on its importance and the number of violations. You could say, for  example, that this database has a quality rating of 7 on a scale of 10.&lt;/p&gt; &lt;div class="section"&gt; &lt;h4 class="sect4-title"&gt;The Good&lt;/h4&gt; &lt;p class="first-para"&gt;Metrics can be useful. One use is to demonstrate to  management that the process is finding facts. The facts have little to no  significance by themselves but can be circumstantial evidence that something is  wrong with the data. When a data quality assurance department is trying to gain  traction in a corporation, metrics can be a useful way to show progress.&lt;/p&gt; &lt;p class="para"&gt;Metrics can also be useful to show improvements. If data is  profiled before and after corrective actions, the metrics can show whether the  quality has improved or not.&lt;/p&gt; &lt;p class="para"&gt;Another use of metrics is to qualify data. Data purchased from  outside the corporation, such as demographic data, can be subjected to a quick  data profiling process when received. Metrics can then be applied to generate a  qualifying grade for the data source. It can help determine if you want to use  &lt;a name="186"&gt;&lt;/a&gt;&lt;a name="page83"&gt;&lt;/a&gt;the data at all. This can be used to  negotiate with the vendor providing the data. It can be the basis for penalties  or rewards.&lt;/p&gt; &lt;p class="last-para"&gt;Qualification can also be done for internal data sources. For  example, a data warehousing group can qualify data extracts from operational  groups before they are applied to the central data warehouse.&lt;/p&gt;&lt;/div&gt; &lt;div class="section"&gt; &lt;h4 class="sect4-title"&gt;The Bad&lt;/h4&gt; &lt;p class="first-para"&gt;The downside of metrics is that they are not exact and they  do not solve problems. In fact, they do not identify what the problems are; they  only provide an indicator that problems exist.&lt;/p&gt; &lt;p class="para"&gt;Earlier chapters demonstrated that it is not possible to identify  all inaccurate data even if you are armed with every possible rule the data  should conform to. Consequently you cannot accurately estimate the percentage of  inaccuracies that exist. The only thing you know for sure is that you found a  specific number of inaccuracies. The bad news is that there are probably more;  the good news is that you found these. If the number you find is significant,  you know you have a problem.&lt;/p&gt; &lt;p class="para"&gt;Corrective actions have these potential consequences: they can  prevent recurrence of some errors that you can detect, they can prevent  recurrence of errors you cannot detect, and they can continue to pass errors  through. It is also theoretically possible that you would introduce new errors  that may or may not be detectable.&lt;/p&gt; &lt;p class="para"&gt;The conclusion is that data profiling techniques can show the  presence of errors but cannot show the absence of errors nor the number of  errors. Therefore, any metrics derived from the output of profiling are inexact.  This does not make them useless. On the contrary, the errors found are true  errors, and if there are enough of them you have uncovered true problems.&lt;/p&gt; &lt;p class="para"&gt;You might conclude from the previous discussion that the number of  errors reported is understated. This would be great if it were true. However,  poorly defined metrics can actually overstate the error condition. This occurs  when a single inaccurate value triggers multiple rule violations. This is  difficult to detect and impossible to quantify. When you consider that the  majority of rules will find the presence of inaccurate data but will not  pinpoint the offending values, you can see why it is difficult, if not  impossible, to find the true number of inaccurate values.&lt;/p&gt; &lt;p class="para"&gt;Comparing metrics can also be misleading if the yardstick changes  between profiling exercises. As analysts gain more knowledge about a data  source, they will add to the rule set used to dig out inaccuracies. Comparing  two result sets that are derived from different rule sets results in an  apples-to-oranges comparison. All presentations of quality metrics need to  provide disclaimers so that the readers can understand these dynamics.&lt;/p&gt; &lt;div class="sidebar"&gt; &lt;table class="BlueLine" border="0" cellpadding="0" cellspacing="0" width="100%"&gt; &lt;tbody&gt; &lt;tr&gt; &lt;td class="bluecell" bg style="color:#000080;"&gt;&lt;span style="font-family:Arial;font-size:85%;color:#010100;"&gt;&lt;b&gt;&lt;img alt="Start Sidebar" src="_.gif" border="0" height="2" width="1" /&gt;&lt;/b&gt;&lt;/span&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;/table&gt; &lt;p class="first-para"&gt;&lt;a name="187"&gt;&lt;/a&gt;&lt;a name="page84"&gt;&lt;/a&gt;The following is an  example of preventing recurrence of errors you never detected. A medical  clinic's internal system records a code for the medical procedure performed, as  well as the gender of the patient. It is discovered in data profiling that  procedures are being recorded that are not possible for the gender code  recorded. These are inaccuracy facts.&lt;/p&gt; &lt;p class="para"&gt;However, the root cause is that the procedure codes are  handwritten on paper forms and then sent to the data entry office. Many of them  are illegible or missing. The data entry staff has no way of verifying the  correct procedure and are motivated to get the data into the system rather than  fix it. In addition to the procedure codes being invalid in the case of gender  conflicts, there are probably many other procedure codes that are wrong.  However, because they are valid procedure codes, they are not detected.&lt;/p&gt; &lt;p class="para"&gt;The remedy called for having the data entered directly online by  the administrators of the doctors instead of transferring paper documents to a  central data entry function. Because so many errors were noted, the new form  displays a text description of the procedure when it is entered with a  confirmation button. This helps the administrators confirm that they have  entered the correct code.&lt;/p&gt; &lt;p class="last-para"&gt;Checks were put in for gender/procedure code conflicts, as  well as other conflicts, such as invalid patient age/procedure code  combinations. In addition, administrators were educated on the importance of  correct procedure codes. Because of the better data entry procedures, the number  of errors prevented not only included those that were detectable but many others  that were not detectable through analysis.&lt;/p&gt; &lt;table class="BlueLine" border="0" cellpadding="0" cellspacing="0" width="100%"&gt; &lt;tbody&gt; &lt;tr&gt; &lt;td class="bluecell" bg style="color:#000080;"&gt;&lt;span style="font-family:Arial;font-size:85%;color:#010100;"&gt;&lt;b&gt;&lt;img alt="End Sidebar" src="_.gif" border="0" height="2" width="1" /&gt;&lt;/b&gt;&lt;/span&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;/table&gt;&lt;/div&gt; &lt;table class="BlankSpace" border="0" cellpadding="0" cellspacing="0" width="100%"&gt; &lt;tbody&gt; &lt;tr&gt; &lt;td height="16"&gt;&lt;br /&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;/table&gt; &lt;p class="para"&gt;An additional problem with metrics is that data quality assurance  departments often believe that this is the end of their mission. They define  their work product as the metrics. However, metrics do not define the source of  problems nor the solutions. To improve data quality you need to follow through  on getting improvements made. To hand the responsibility for this to other  departments is a guarantee that the work items will sit low on priority lists of  things to do and will not get done expeditiously. The data quality assurance  department needs to track and drive the issues through to solution.&lt;/p&gt; &lt;p class="para"&gt;Metrics are not all bad. They are often a good shock factor for  driving actions. When you give management a presentation that says the HR  database records revealed 700 inaccurate values, this can raise eyebrows and  produce a call for action. Knowing that you have 700 and that the real number is  higher can be motivation enough.&lt;/p&gt; &lt;p class="last-para"&gt;Often a single fact is more shocking than statistical  metrics. For example, telling management that a profiling exercise of the birth  date of employees &lt;a name="188"&gt;&lt;/a&gt;&lt;a name="page85"&gt;&lt;/a&gt;revealed that the youngest  employee in the company has not been born yet and that the oldest was born  before the Civil War is far more effective than a metric at getting across the  point that improvements are needed &lt;i class="emphasis"&gt;now.&lt;/i&gt; (I did not make  this up; it was an actual output of a data profiling exercise.)&lt;/p&gt;&lt;/div&gt;&lt;/div&gt; &lt;div class="section"&gt; &lt;h3 class="sect3-title"&gt;&lt;a name="189"&gt;&lt;/a&gt;&lt;a name="ch05lev2sec2"&gt;&lt;/a&gt;Issues&lt;/h3&gt; &lt;p class="first-para"&gt;The real output of the fact collection phase is a set of  issues that define problems that need to be solved. A single statistic can  result in an issue. For example, 30% of the purchase order fields have no  supplier ID number. Alternatively, several facts can be grouped into one issue.  For example, the customer name and address data is severely flawed: 5% of name  fields have invalid names, 15% of address fields are inaccurate or blank, 12% of  city fields are blank, 5% of city fields are misspelled, and 12% of Zip codes  are invalid or blank. This single issue rolls up several inaccuracy facts into a  single issue that needs to be addressed. Addressing each inaccuracy fact is an  inefficient use of time.&lt;/p&gt; &lt;p class="last-para"&gt;Issues need to be recorded in a database within an issues  tracking system. Each issue needs a narrative description of the findings and  facts that are the basis for the issue. It is important to identify the facts  and the data source so that comparisons can be correctly made during the  monitoring phase. The information needed for the data source is the  identification of the database used, whether samples or the entire database were  used, the date of the extraction, and any other information that will help  others understand what you extracted the facts from. In tracking the issues, all  meetings, presentations, and decisions need to be recorded along with dates and  persons present.&lt;/p&gt;&lt;p class="last-para"&gt; &lt;/p&gt;&lt;div class="chapter"&gt; &lt;div class="section"&gt; &lt;h2 class="first-section-title"&gt;Assessing Impact&lt;/h2&gt; &lt;p class="first-para"&gt;Each issue that has been created needs to be studied to  determine the impact it has already had or potentially may have on the  corporation. Somewhere along the line someone will ask the "so what" question  about an issue. It is important to justify development and disruptive efforts to  deploy corrective actions. It is important to document the value returned to the  corporation for the time and cost spent pursuing issues.&lt;/p&gt; &lt;p class="para"&gt;This needs to be updated from time to time. It is usually  impossible to compute the costs and benefits up front. One approach is to look  at the facts and theorize on possible impacts. A brainstorming session with data  analysts, business analysts, and others may be helpful. This will lead to  activities to prove that the impacts have already occurred. Because impacts have  not occurred does not mean they will not in the future. As the issues are worked  &lt;a name="191"&gt;&lt;/a&gt;&lt;a name="page86"&gt;&lt;/a&gt;through the entire process, additional  information about impacts may become apparent. These need to be added to the  impact section.&lt;/p&gt; &lt;div class="section"&gt; &lt;h3 class="sect3-title"&gt;&lt;a name="192"&gt;&lt;/a&gt;&lt;a name="ch05lev2sec3"&gt;&lt;/a&gt;Impacts Already  Happening&lt;/h3&gt; &lt;p class="first-para"&gt;The impacts may not be obvious to anyone but may be very  real. For example, an issue that states that suppliers exist in the supplier's  database multiple times may lead to speculation that you are not getting large  enough discounts for volumes purchased over a year. Investigation may uncover  that this is true (one department orders under one supplier ID and another  department uses a second supplier ID for the same supplier). You can easily  compute the discount difference, the volume of purchases made, and the value  lost to the corporation. The cost of this type of inaccuracy is totally hidden  until the issue is identified and pursued.&lt;/p&gt; &lt;p class="para"&gt;Sometimes an issue is created from an outside-in investigation and  the cost is already known. Tying the external cost to facts is part of issue  definition. For example, the external manifestation might be that the accounts  receivable department spends &lt;i class="emphasis"&gt;x&lt;/i&gt; amount of people time per  month correcting wrong information on invoices. The facts are the number of  blank or inaccurate values found during data profiling. The facts back up the  assertion that invoices are not being prepared properly.&lt;/p&gt; &lt;p class="para"&gt;Further investigation may reveal that not only is time being  wasted but that payments are being delayed by a certain amount for two reasons:  one is the lag in time in getting invoices out, and the other is that invoices  sent out without corrections get rejected by the purchasing company, causing yet  further delays. In fact, there may be a group of invoices that are never  collected due to data errors on the invoices. This is an example of a single  visible cost leading to facts about inaccuracies, which lead to the discovery of  more hidden costs.&lt;/p&gt; &lt;p class="para"&gt;One point to consider is that a significant accuracy problem on a  data element may indicate a bigger quality problem. In the case of the missing  supplier ID, it is clear that if 30% of the values are missing, there is a real  possibility that the process is flawed and that the supplier ID is not available  at the time the data is entered. It is unlikely that data entry staff are that  bad at their jobs. It is also clear that this field is not involved in making  the purchase or subsequent payments (it appears to cause no harm). The harm is  all done in the secondary uses of the data. It is easy to speculate that if the  data is not available at entry, data entry staff may also be entering wrong but  valid values. The problem may be much larger than it first appears.&lt;/p&gt; &lt;p class="last-para"&gt;This is why you need to match inaccuracy facts to known  manifestations. By seeing the actual data values in error and the data elements  containing errors, you can often speculate about hidden costs that may be  occurring.&lt;/p&gt;&lt;/div&gt; &lt;div class="section"&gt; &lt;h3 class="sect3-title"&gt;&lt;a name="193"&gt;&lt;/a&gt;&lt;a name="ch05lev2sec4"&gt;&lt;/a&gt;Impacts Not Yet  Happening&lt;/h3&gt; &lt;p class="first-para"&gt;&lt;a name="194"&gt;&lt;/a&gt;&lt;a name="page87"&gt;&lt;/a&gt;The most dangerous  impacts are those that have not yet occurred. Seeing the presence of inaccurate  data can sometimes lead to speculation about problems that could occur. These  can have greater impact than those that occur on a regular basis but cost little  to correct.&lt;/p&gt; &lt;p class="para"&gt;A simple example is the inaccurate birth dates of employees. There  may have been no costs that have occurred yet for a new company that hires  mostly young people. However, as this population ages, all sorts of government  regulations about reporting, pension programs, and changing medical benefits  when an employee reaches age 65 are at risk of occurring. These errors can also  make decisions about hiring practices inaccurate and lead to wasteful efforts to  adjust the company's mix of ages.&lt;/p&gt; &lt;p class="para"&gt;A business rule may require that a fast mode of shipment be used  to ship certain materials that have the potential to spoil or decay. They may  require refrigeration or avoidance of temperatures above a certain number. It  may be that errors in the orders have caused a number of shipments to be made  that violate the rule and no dire consequences have occurred. All values are  valid individually, but the shipment mode rule for the product type is violated.  By speculating on the potential for costs, the issues team may speculate about  returned orders, merchandise that cannot be resold, and lost customers. However,  that speculation may lead to the potential for real lawsuits, as the corporation  may be liable for damage done to the purchaser trying to use spoiled  merchandise.&lt;/p&gt; &lt;p class="para"&gt;This example may have been saving the company money (lower  shipping costs) but creating a potential liability (lawsuits) that could  severely damage or even destroy the company. This is why speculation on  potential impacts is so important.&lt;/p&gt; &lt;p class="para"&gt;The process of assessing impacts will crystallize issues. It may  result in issues being broken apart or issues being combined. As participants  gain more experience, they will be better at sniffing out impacts both real and  potential. As new participants join the process, they can benefit from the  documentation of previous issues as a training device.&lt;/p&gt; &lt;p class="last-para"&gt;It should also be apparent that the documentation of the  impacts of issues is highly sensitive information. The issues management process  should provide for a high degree of privacy and safety of the  information.&lt;/p&gt;&lt;/div&gt;&lt;/div&gt;&lt;/div&gt;&lt;h2 class="first-section-title"&gt;Investigating Causes&lt;/h2&gt; &lt;p class="first-para"&gt;The next logical step in the process is to discover the  causes of the inaccuracy facts. Remedies cannot be fabricated until more  information is uncovered. &lt;a name="196"&gt;&lt;/a&gt;&lt;a name="page88"&gt;&lt;/a&gt;You need to perform  a thorough study, in that the causes may not be what you think they are.&lt;/p&gt; &lt;p class="para"&gt;This chapter is not going to cover this topic comprehensively.  This is a very large topic and beyond the scope of this book. However, a  snapshot of some of the approaches is given to show the types of activities  required.&lt;/p&gt; &lt;p class="para"&gt;Investigating causes requires talking to a lot of people in a lot  of organizations. Assignments to investigators must to be done based on the  substance of the issues. Participants from many organizations may be needed. The  data quality assurance department should not try to undergo this step entirely  with their own staff. Neither should they relegate this entirely to others. It  is yet another place where the need for a larger team exists that gets guidance  and leadership from the data quality assurance staff.&lt;/p&gt; &lt;p class="para"&gt;Investigation of the cause is not always possible. For example,  databases purchased from vendors may be found to be defective. It is your  responsibility to notify them of the problem and give them facts. It is their  job to investigate the causes and correct them.&lt;/p&gt; &lt;p class="para"&gt;There are two basic approaches to investigating errors: error  cluster analysis and data events analysis. The first is used to narrow down the  sources of errors. The second is used to study the events that cause data to be  created and maintained in order to help identify the root causes of problems.  They can often be used together to efficiently complete the task.&lt;/p&gt; &lt;div class="section"&gt; &lt;h3 class="sect3-title"&gt;&lt;a name="197"&gt;&lt;/a&gt;&lt;a name="ch05lev2sec5"&gt;&lt;/a&gt;Error Clustering  Analysis&lt;/h3&gt; &lt;p class="first-para"&gt;This type of analysis attempts to use information in the  database to provide clues as to where the inaccuracies may be coming from. It  starts with information about the specific database objects containing  inaccuracies. For example, in an order database, it would start by identifying  those orders that contain inaccurate data or that are suspected of having  inaccurate data. Although many rules about data cannot identify specific data  elements that are wrong, they can identify entire orders that contain the wrong  data. The collection of all orders that have wrong values or rule violations  constitutes the analysis set.&lt;/p&gt; &lt;p class="para"&gt;The analysis set may be defined narrowly (all orders violating a  single rule) or broadly (all orders violating any rule). It depends on the  amount of data in the analysis set and the importance of the individual rule.  There is also the concept of rules having affinity. That is, for example, all  rules that deal with the initial capture of the order information (a process  clustering) or all orders dealing with customer name and address information  (data semantic clustering).&lt;/p&gt; &lt;p class="para"&gt;Once the set of data is isolated that contains offending data, all  of the data elements of the isolated set are used to determine if they vary in  significant ways with the general population of data.&lt;/p&gt; &lt;p class="last-para"&gt;&lt;a name="198"&gt;&lt;/a&gt;&lt;a name="page89"&gt;&lt;/a&gt;Common data elements that  may reveal significant variances are data source location (branch office,  geographic region, specific sales reps), customer information (first-time  customers, Internet customers), dates (specific dates, days of week, range of  dates), product type or characteristics (engine pats, volatile, expensive), or  process steps completed (initial entry, order shipped, invoice created). You are  looking for any factor that may indicate a starting point in examining the  causes of the errors. Performing error clustering analysis can shorten the  search for causes significantly through performing a relatively quick and simple  test of data.&lt;/p&gt;&lt;/div&gt; &lt;div class="section"&gt; &lt;h3 class="sect3-title"&gt;&lt;a name="199"&gt;&lt;/a&gt;&lt;a name="ch05lev2sec6"&gt;&lt;/a&gt;Data Events  Analysis&lt;/h3&gt; &lt;p class="first-para"&gt;This involves a review of all processes that capture data or  change data. Data takes a journey from inception to one or more databases. It  may have a single process event (data entry) or a number of events. The points  of examination can be any or all of the following:&lt;/p&gt; &lt;ul class="itemizedlist"&gt;&lt;li class="first-listitem"&gt; &lt;p class="first-para"&gt;data capture processes&lt;/p&gt; &lt;/li&gt;&lt;li class="listitem"&gt; &lt;p class="first-para"&gt;durations in which data decay can occur&lt;/p&gt; &lt;/li&gt;&lt;li class="listitem"&gt; &lt;p class="first-para"&gt;points at which data is extracted and added to a different  data store&lt;/p&gt; &lt;/li&gt;&lt;li class="listitem"&gt; &lt;p class="first-para"&gt;points at which data is converted to business  information&lt;/p&gt;&lt;/li&gt;&lt;/ul&gt; &lt;div class="section"&gt; &lt;h4 class="sect4-title"&gt;Data Capture Processes&lt;/h4&gt; &lt;p class="first-para"&gt;The process point at which data is captured represents the  single most important place data can be made accurate or inaccurate. All data  capture points need to be identified and examined. Some data is only captured  once. Some is captured and then updated on an exception basis. Some data is  captured and the business object updated or enhanced through a work flow process  that may occur over a long period of time. Some of these points may take on  multiple forms. For example, an order may be entered by the actual customer over  the Internet, entered by a recording clerk from a form received in the mail, or  entered by a company sales representative through a company client server  application. This example shows three very different and distinct ways of  entering the same business object.&lt;/p&gt; &lt;p class="para"&gt;Building a diagram of the data paths of a business object,  identifying the distinct points of data capture, and specifying the  characteristics of each is a time-consuming but extremely important task.&lt;/p&gt;  &lt;ul class="itemizedlist"&gt;&lt;li class="first-listitem"&gt; &lt;p class="first-para"&gt;&lt;a name="201"&gt;&lt;/a&gt;&lt;a name="page90"&gt;&lt;/a&gt;&lt;i class="emphasis"&gt;Time  between event and recording:&lt;/i&gt; In general, the longer the time differences,  the greater the chance for errors. If the time lag is long enough, it also lends  itself to missing or late information. Examples of long durations are cases in  which forms are completed and mailed to a data entry location. The accuracy and  timeliness would be enhanced if the time difference were eliminated through a  more direct entry, such as through the Internet.&lt;/p&gt; &lt;/li&gt;&lt;li class="listitem"&gt; &lt;p class="first-para"&gt;&lt;i class="emphasis"&gt;Distance between event and recording:&lt;/i&gt;  Physical distance can also be a factor. This reduces the opportunity for the  person who is entering the data to verify or challenge information. For example,  if the originator of data is in Chicago but the information is transmitted via  telephone or paper to Kansas City for entry, you have a distance between the  person who knows the right information and the one entering it. If there is  confusion, the entry person has to either enter nulls or enter a best guess.&lt;/p&gt; &lt;/li&gt;&lt;li class="listitem"&gt; &lt;p class="first-para"&gt;&lt;i class="emphasis"&gt;Number of handoffs of information before  recording:&lt;/i&gt; The first person to experience the event is most likely to be the  one with the most accurate description of the facts. Each handoff to another  person introduces the possibility of misreading written information,  misinterpreting some else's comments, or not knowing information that was not  passed on.&lt;/p&gt; &lt;/li&gt;&lt;li class="listitem"&gt; &lt;p class="first-para"&gt;&lt;i class="emphasis"&gt;Availability of all facts at  recording:&lt;/i&gt; If the person entering the information has no access to the  event, to the person who created or observed the event, or to databases  containing important auxiliary information, they cannot fill in missing  information or challenge information they see. For example, it is better for HR  data to be entered with the employee sitting next to the entry person, as  opposed to copying information from a form. Another example is to have a search  function for customer identifiers available for order entry personnel.&lt;a name="202"&gt;&lt;/a&gt;&lt;a name="page91"&gt;&lt;/a&gt; &lt;/p&gt; &lt;/li&gt;&lt;li class="listitem"&gt; &lt;p class="first-para"&gt;&lt;i class="emphasis"&gt;Ability to verify information at  recording:&lt;/i&gt; This is similar to the previous issue, but slightly different.  Can the data entry person get to correct information if they think the  information provided is wrong? An HR data entry person could call or e-mail the  employee if there is confusion. Sometimes the process makes it impossible to  make this connection. Sometimes the process penalizes the data entry person for  taking the time to verify questionable information. All entry points should  allow for information to be either verified immediately or posted to a deferred  process queue for later verification and correction if needed.&lt;/p&gt; &lt;/li&gt;&lt;li class="listitem"&gt; &lt;p class="first-para"&gt;&lt;i class="emphasis"&gt;Motivation of person doing recording:&lt;/i&gt;  This is a complex topic with many sides. Are they motivated to enter correct  information? Are they motivated and empowered to challenge questionable  information? Are they motivated to enter the information at all? Someone  entering their own order is motivated to do it and get it right. Someone  entering piles of form information they do not understand could not care less if  the information is entered correctly or completely. Is feedback provided? Is  their performance measured relative to completeness and accuracy?&lt;/p&gt; &lt;/li&gt;&lt;li class="listitem"&gt; &lt;p class="first-para"&gt;&lt;i class="emphasis"&gt;Skill, training, and experience of person  doing recording:&lt;/i&gt; People who enter the same information for a living get to  learn the application, the typical content, and the data entry processes. They  can be trained to do it right and to look for red flags. People who enter data  on a form only one time in their life are much more likely to get it wrong.  Sometimes there exists a data entry position that has not been trained in the  application. This is an invitation for mistakes. Note that entry people who are  making mistakes tend to make them repetitively, thus increasing the database  inaccuracy level and thereby increasing the likelihood that it will be exposed  through data profiling analysis.&lt;/p&gt; &lt;/li&gt;&lt;li class="listitem"&gt; &lt;p class="first-para"&gt;&lt;i class="emphasis"&gt;Feedback provided to recorder:&lt;/i&gt;  Feedback is always a good thing. And yet, our information systems rarely provide  feedback to the most important people in the data path: those entering the data.  Relevant information, such as errors found in computer checks, should be  collected and provided to help them improve the accuracy of data they enter.&lt;/p&gt; &lt;/li&gt;&lt;li class="listitem"&gt; &lt;p class="first-para"&gt;&lt;i class="emphasis"&gt;Auto-assist in recording process:&lt;/i&gt; Do  the data entry programs and screens help in getting it right? A complex process  can include pull-downs, file checking, suggestions on names, addresses,  questioning of unusual options or entry information, and so on. Remembering  information from the last transaction for that source can be very helpful in  getting information right. Letting each data entry station set its own pull-down  defaults can reduce errors. Providing the current date instead of asking that it  be &lt;a name="203"&gt;&lt;/a&gt;&lt;a name="page92"&gt;&lt;/a&gt;entered can improve accuracy. There are a  lot of technology best practices that can improve the accuracy of  information.&lt;/p&gt; &lt;/li&gt;&lt;li class="listitem"&gt; &lt;p class="first-para"&gt;&lt;i class="emphasis"&gt;Error checking in recording process:&lt;/i&gt;  Evaluate the checking provided by the entry screen programs, the transaction  path, and the database acceptance routines. Data checkers, filters, and database  structural enforcement options can all be used to catch mistakes at the entry  point. These are not always easy to identify because they require someone to dig  around in code and database definitions. Many times these are not documented.  Many times they are thought to be true but have been turned off by a database  administrator to improve performance. Many times they exist but are not applied  to all points of entry.&lt;/p&gt;&lt;/li&gt;&lt;/ul&gt; &lt;p class="last-para"&gt;It is important to study all factors at each entry point,  even though the investigation started by focusing on a single set of inaccuracy  facts. This process may reveal other inaccuracies that were hidden from the  profiling process or uncover the potential for problems that have not yet  occurred. It may also uncover some locally devised practices that are good ideas  and may warrant propagation as a formal methodology throughout the data entry  community.&lt;/p&gt;&lt;/div&gt; &lt;div class="section"&gt; &lt;h4 class="sect4-title"&gt;Data Decay&lt;/h4&gt; &lt;p class="first-para"&gt;The analyst needs to identify data elements that are subject  to decay and check for process steps that exist that will mitigate decay.  Identifying data decay candidates is a business analyst topic best handled as  work sessions with participants from multiple departments.&lt;/p&gt; &lt;p class="para"&gt;If the investigation reveals that no procedures are present to  prevent decay, the analyst needs to determine the extent to which decay has  contributed to currently visible problems or whether it presents the potential  for future problems.&lt;/p&gt; &lt;p class="last-para"&gt;Decay problems are often not observable though data profiling  because the values in the database are valid even though wrong. However, process  analysis may suggest that the data is susceptible to decay problems. Sampling  the data and testing it through object reverification may reveal hidden  problems. These can become the subject of new issues split off from those that  got you there.&lt;/p&gt;&lt;/div&gt; &lt;div class="section"&gt; &lt;h4 class="sect4-title"&gt;Data Movement And Restructuring Processes&lt;/h4&gt; &lt;p class="first-para"&gt;Many errors can be introduced when data is extracted,  reformatted, aggregated, and combined with other data. If the data source that  was used for identifying the inaccurate data is not a primary data source, it  requires examination of the processes that build that database from the primary  sources.&lt;/p&gt; &lt;p class="para"&gt;&lt;a name="204"&gt;&lt;/a&gt;&lt;a name="page93"&gt;&lt;/a&gt;The first question to ask is  whether the problems also exist in the original data source, are part of the  data movement processes, or are the result of an incompatibility with the target  database structure or definition. Errors at this level often cause primary data  sources to be blamed for problems not of their making.&lt;/p&gt; &lt;p class="para"&gt;One of the problems with this type of analysis is that the  extraction, transformation, cleansing, and loading processes are often not well  documented or are documented only in the proprietary repositories of individual  products used for the separate steps. This requires expertise on each of these  repositories and on the functions of the individual products used. This can  lengthen the time required to perform the analysis.&lt;/p&gt; &lt;p class="para"&gt;Often data movement processes are locally developed without the  aid of packaged tool software. The project team merely writes code for each  step. In these cases, finding out what the team does may be difficult because  much of it is probably not documented at all. This stresses the importance of  being disciplined enough to create and maintain metadata repositories on all  data structures: primary, intermediate, and summary. Information should also be  kept on all processes that move data between them.&lt;/p&gt; &lt;p class="last-para"&gt;Review of upstream processes may be indicated by discovering  information about quality problems in primary databases. This means that a  situation discovered in a primary database that produces inaccurate data may  lead to the discovery that upstream uses of this data are also flawed. You are  basically asking the question "What is the data warehouse doing with this wrong  stuff?" This process of examining known data flaws through to their final use  can raise issues that were otherwise hidden.&lt;/p&gt;&lt;/div&gt; &lt;div class="section"&gt; &lt;h4 class="sect4-title"&gt;Conversion To Information Products&lt;/h4&gt; &lt;p class="first-para"&gt;Other places to look are the conversion of data from  databases to reports, movement to OLAP cubes, staging data in corporate portals,  and other business information products.&lt;/p&gt; &lt;p class="para"&gt;This type of review would normally only be done if the issue were  created from concerns raised about these objects. Looking at wrong output does  not always indicate that the data is wrong. The routines to extract the data and  to compute from it, and the timeliness of this activity, can lead to inaccurate  business information products from perfectly accurate data. Problems in the  information products should be traced back through the system because they can  often uncover previously hidden problems with other uses of the same data.&lt;/p&gt; &lt;p class="para"&gt;It should be clear that the process of identifying where errors  creep into databases has many beneficial side effects. It can surface bad  practices that are creating errors that were not detected in the initial  analysis. It can detect bad &lt;a name="205"&gt;&lt;/a&gt;&lt;a name="page94"&gt;&lt;/a&gt;practices that  are not generating errors but have the potential for doing so. It can identify  hidden problems in upstream copies of the data or uses of the data that were not  known. This may lead to expanding the impacts section to include impacts already  occurring and those that have not yet occurred. This process may lead to the  consolidation of issues (discovery that the data entry process caused many of  the issues) or creating new issues (the corporate portal is displaying flawed  renditions of the data).&lt;/p&gt; &lt;p class="last-para"&gt;It may be helpful to document the bad practices independently  for the benefit of future projects. Bad practices used in one application  frequently find their way into other applications. The same team that  implemented them in one case may have implemented them in other applications  they also worked on. Having a list of bad practices can serve as a checklist of  things to look for in subsequent investigations.&lt;/p&gt;&lt;/div&gt;&lt;/div&gt;&lt;/div&gt;&lt;/div&gt;&lt;/div&gt;&lt;/div&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/1576936343892993251-7327193771411794801?l=dataqualityaccuracy.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://dataqualityaccuracy.blogspot.com/feeds/7327193771411794801/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=1576936343892993251&amp;postID=7327193771411794801' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/1576936343892993251/posts/default/7327193771411794801'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/1576936343892993251/posts/default/7327193771411794801'/><link rel='alternate' type='text/html' href='http://dataqualityaccuracy.blogspot.com/2007/12/implementing-data-quality-assurance.html' title='Implementing a Data Quality Assurance Program'/><author><name>vijikumar</name><uri>http://www.blogger.com/profile/18263997193124066136</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-1576936343892993251.post-3495880443643942380</id><published>2007-12-13T22:48:00.000-08:00</published><updated>2007-12-13T22:53:11.168-08:00</updated><title type='text'>Sources of Inaccurate Data</title><content type='html'>&lt;p class="first-para"&gt;&lt;a name="108"&gt;&lt;/a&gt;&lt;a name="page43"&gt;&lt;/a&gt;Before we can assess data  correctness we need to understand the various ways inaccurate values get into  databases. There are many sources of data inaccuracies, and each contributes its  own part to the total data quality problem. Understanding these sources will  demonstrate the need for a comprehensive program of assessment, monitoring, and  improvement. Having highly accurate data requires attention to all sources of  inaccuracies and appropriate responses and tools for each.&lt;/p&gt; &lt;p class="para"&gt; The first three  cause inaccuracies in data within the databases, whereas the fourth area causes  inaccuracies in the information products produced from the data. If you roll up  all potential sources of errors, the interesting conclusion is that the most  important use of the data (corporate decision making) is made on the rendition  of data that has the most inaccuracies.&lt;/p&gt;  &lt;div class="section"&gt; &lt;h2 class="sect2-title"&gt;&lt;a name="110"&gt;&lt;/a&gt;&lt;a name="ch03lev1sec1"&gt;&lt;/a&gt;&lt;span class="section-titlelabel"&gt;3.1 &lt;/span&gt;Initial Data Entry&lt;/h2&gt; &lt;p class="first-para"&gt;&lt;a name="111"&gt;&lt;/a&gt;&lt;a name="page44"&gt;&lt;/a&gt;Most people assume that  data inaccuracies are always the result of entering the wrong data at the  beginning. This is certainly a major source of data inaccuracies but not the  only source. Inaccurate data creation can be the result of mistakes, can result  from flawed data entry processes, can be deliberate, or can be the result of  system errors. By looking at our systems through these topics, you can gain  insight into whether systems are designed to invite inaccurate data or are  designed to promote accurate data.&lt;/p&gt; &lt;div class="section"&gt; &lt;h3 class="sect3-title"&gt;&lt;a name="112"&gt;&lt;/a&gt;&lt;a name="ch03lev2sec1"&gt;&lt;/a&gt;Data Entry  Mistakes&lt;/h3&gt; &lt;p class="first-para"&gt;The most common source of a data inaccuracy is that the  person entering the data just plain makes a mistake. You intend to enter &lt;i class="emphasis"&gt;blue&lt;/i&gt; but enter &lt;i class="emphasis"&gt;bleu&lt;/i&gt; instead; you hit  the wrong entry on a select list; you put a correct value in the wrong field.  Much of operational data originates from a person. People make mistakes; we make  them all the time. It is doubtful that anyone could fill out a hundred-field  form without making at least one mistake.&lt;/p&gt; &lt;p class="last-para"&gt;A real-world example involves an automobile damage claims  database in which the COLOR field was entered as text. Examination of the  content of this field yielded 13 different spellings for the word &lt;i class="emphasis"&gt;beige.&lt;/i&gt; Some of these mistakes were the result of typos.  Others were just that the entry person did not know how to spell the word. In  some of the latter cases, they thought they knew how to spell the word, whereas  in others they were just not able or willing to look it up.&lt;/p&gt;&lt;/div&gt; &lt;div class="section"&gt; &lt;h3 class="sect3-title"&gt;&lt;a name="113"&gt;&lt;/a&gt;&lt;a name="ch03lev2sec2"&gt;&lt;/a&gt;Flawed Data Entry  Processes&lt;/h3&gt; &lt;p class="first-para"&gt;A lot of data entry begins with a form. A person completes a  form either on a piece of paper or on a computer screen. Form design has a lot  to do with the amount of inaccurate data that ends up in the database. Form  design should begin with a basic understanding of quality issues in order to  avoid many of the mistakes commonly seen. For example, having someone select  from a list of valid values instead of typing in a value can eliminate the  misspellings previously cited.&lt;/p&gt; &lt;p class="para"&gt;Another common problem is having fields on the form that are  confusing to the user. This often leads them to enter wrong information. The  field itself may be confusing to the user. If it is a field that is not commonly  understood, or if the database definition is unconventional, the form needs to  provide assistance in guiding the user through entry of values into the field.  Sometimes the confusion is in the way the field is described in its identifying  text or &lt;a name="114"&gt;&lt;/a&gt;&lt;a name="page45"&gt;&lt;/a&gt;in its positioning on the form. Form  design should always be subjected to rigorous quality testing to find the fields  a normal user would have difficulty in knowing what to enter.&lt;/p&gt; &lt;p class="para"&gt;Data entry windows should have instructions available as HELP  functions and should be user friendly in handling errors. Frustration in using a  form can lead to deliberate mistakes that corrupt the database.&lt;/p&gt; &lt;p class="para"&gt;Forms are better completed by a trained entry person than by a  one-time user. This is because the entry person can be taught how things should  be entered, can become proficient in using the form mechanisms, and can be given  feedback to improve the efficiency and accuracy of the data. A one-time user is  always uncertain about what they are supposed to do on the form. Unfortunately,  our society is moving by way of the Internet toward eliminating the middle  person in the process and having end users complete forms directly. This places  a much higher demand on quality form design.&lt;/p&gt; &lt;p class="para"&gt;The data entry process includes more than the forms that are  filled out. It also includes the process that surrounds it. Forms are completed  at a specific point or points in a process. Sometimes we have forms that are  required to be completed when not all information is known or easily obtained at  that point in the process. This will inevitably lead to quality problems.&lt;/p&gt; &lt;p class="para"&gt;An example of a data entry process I helped design a number of  years ago for military repair personnel is very instructive of the types of  problems that can occur in data collection. The U.S. Navy has a database that  collects detailed information on the repair and routine maintenance performed on  all aircraft and on all major components of every ship. This database is  intended to be used for a variety of reasons, from negotiating contracts with  suppliers, to validating warranties, to designing new aircraft and ships.&lt;/p&gt; &lt;p class="para"&gt;When an aircraft carrier is in a combat situation, such as in  Kuwait and Afghanistan, repairs are being made frequently. The repair crews are  working around the clock and under a great deal of pressure to deal with a lot  of situations that come up unexpectedly. Completing forms is the least of their  concerns. They have a tendency to fix things and do the paperwork later. The  amount of undocumented work piles up during the day, to be completed when a  spare moment is available. By then the repair person has forgotten some of the  work done or the details of some of the work and certainly is in a hurry to get  it done and out of the way.&lt;/p&gt; &lt;p class="para"&gt;Another part of this problem comes in when the data is actually  entered from the forms. The forms are coming out of a hectic, very messy  environment. Some of the forms are torn; some have oil or other substances on  them. The writing is often difficult to decipher. The person who created it is  probably not available and probably would not remember much about it if  available.&lt;/p&gt; &lt;p class="last-para"&gt;&lt;a name="115"&gt;&lt;/a&gt;&lt;a name="page46"&gt;&lt;/a&gt;A database built from this  system will have many inaccuracies in it. Many of the inaccuracies will be  missing information or valid but wrong information. An innovative solution that  involves wireless, handheld devices and employs voice recognition technology  would vastly improve the completeness and accuracy of this database. I hope the  U.S. Navy has made considerable improvements in the data collection processes  for this application since I left. I trust they have.&lt;/p&gt;&lt;/div&gt; &lt;div class="section"&gt; &lt;h3 class="sect3-title"&gt;&lt;a name="116"&gt;&lt;/a&gt;&lt;a name="ch03lev2sec3"&gt;&lt;/a&gt;The Null  Problem&lt;/h3&gt; &lt;p class="first-para"&gt;A special problem occurs in data entry when the information  called for is not available. A data element has a value, an indicator that the  value is not known, or an indicator that no value exists (or is applicable) for  this element in this record. Have you ever seen an entry screen that had room  for a value and two indicator boxes you could use for the case where there is no  value? I haven't. Most form designs either mandate that a value be provided or  allow it to be left blank. If left blank, you do not know the difference between  value-not-known and no-value-applies.&lt;/p&gt; &lt;p class="para"&gt;When the form requires that an entry be available and the entry  person does not have the value, there is a strong tendency to "fake it" by  putting a wrong, but acceptable, value into the field. This is even  unintentionally encouraged for selection lists that have a default value in the  field to start with.&lt;/p&gt; &lt;p class="para"&gt;It would be better form design to introduce the notion of NOT  KNOWN or NOT APPLICABLE for data elements that are not crucial to the  transaction being processed. This would at least allow the entry people to enter  accurately what they know and the users of the data to understand what is going  on in the data.&lt;/p&gt; &lt;p class="para"&gt;It would make sense in some cases to allow the initial entry of  data to record NOT KNOWN values and have the system trigger subsequent  activities that would collect and update these fields after the fact. This is  far better than having people enter false information or leaving something blank  and not knowing if a value exists for the field or not.&lt;/p&gt; &lt;p class="last-para"&gt;An example of a data element that may be NOT KNOWN or NOT  APPLICABLE is a driver's license number. If the field is left blank, you cannot  tell if it was not known at the point of entry or whether the person it applies  to does not have a driver's license. Failure to handle the possibility of  information not being available at the time of entry and failure to allow for  options to express what you do know about a value leads to many inaccuracies in  data.&lt;/p&gt;&lt;/div&gt; &lt;div class="section"&gt; &lt;h3 class="sect3-title"&gt;&lt;a name="117"&gt;&lt;/a&gt;&lt;a name="ch03lev2sec4"&gt;&lt;/a&gt;Deliberate  Errors&lt;/h3&gt; &lt;p class="first-para"&gt;&lt;a name="118"&gt;&lt;/a&gt;&lt;a name="page47"&gt;&lt;/a&gt;Deliberate errors are  those that occur when the person enters a wrong value on purpose. There are  three reasons they do this:&lt;/p&gt; &lt;ul class="itemizedlist"&gt;&lt;li class="first-listitem"&gt; &lt;p class="first-para"&gt;They do not know the correct information.&lt;/p&gt; &lt;/li&gt;&lt;li class="listitem"&gt; &lt;p class="first-para"&gt;They do not want you to know the correct information.&lt;/p&gt; &lt;/li&gt;&lt;li class="listitem"&gt; &lt;p class="first-para"&gt;They get a benefit from entering the wrong  information.&lt;/p&gt;&lt;/li&gt;&lt;/ul&gt; &lt;div class="section"&gt; &lt;h4 class="sect4-title"&gt;Do Not Know Correct Information&lt;/h4&gt; &lt;p class="first-para"&gt;Not knowing the correct information occurs when the form  requires a value for a field and the person wants or needs to complete the form  but does not know the value to use. The form will not be complete without a  value. The person does not believe the value is important to the transaction, at  least not relative to what they are trying to do. The result is that they make  up a value, enter the information, and go on.&lt;/p&gt; &lt;p class="last-para"&gt;Usually the information is not important to completing the  transaction but may be important to other database users later on. For example,  asking and requiring a value for the license plate number of your car when  registering for a hotel has no effect on getting registered. However, it may be  important when you leave your lights on and they need to find out whose car it  is.&lt;/p&gt;&lt;/div&gt; &lt;div class="section"&gt; &lt;h4 class="sect4-title"&gt;Do Not Wish To Give The Correct Information&lt;/h4&gt; &lt;p class="first-para"&gt;The second source of deliberate errors is caused by the  person providing the data not wanting to give the correct information. This is  becoming a more and more common occurrence with data coming off the Internet and  the emergence of CRM applications. Every company wants a database on all of  their customers in order to tailor marketing programs. However, they end up with  a lot of incorrect data in their databases because the information they ask  people for is more than people are willing to provide or is perceived to be an  invasion of privacy.&lt;/p&gt; &lt;p class="para"&gt;Examples of fields that people will lie about are age, height,  weight, driver's license number, home phone number, marital status, annual  income, and education level. People even lie about their name if it can get the  result they want from the form without putting in their correct name. A common  name appearing in many marketing databases is Mickey Mouse.&lt;/p&gt; &lt;p class="last-para"&gt;The problem with collecting data that is not directly  required to complete the transaction is that the quality of these data elements  tends to be low but is &lt;a name="119"&gt;&lt;/a&gt;&lt;a name="page48"&gt;&lt;/a&gt;not immediately  detected. It is only later, when you try to employ this data, that the  inaccuracies show up and create problems.&lt;/p&gt;&lt;/div&gt; &lt;div class="section"&gt; &lt;h4 class="sect4-title"&gt;Falsifying To Obtain A Benefit&lt;/h4&gt; &lt;p class="first-para"&gt;The third case in which deliberate mistakes are made is  where the entry person obtains an advantage in entering wrong data. Some  examples from the real world illustrate this.&lt;/p&gt; &lt;p class="para"&gt;An automobile manufacturer receives claim forms for warranty  repairs performed by dealers. Claims for some procedures are paid immediately,  whereas claims for other procedures are paid in 60 days. The dealers figure out  this scheme and deliberately lie about the procedures performed in order to get  their money faster. The database incorrectly identifies the repairs made. Any  attempt to use this database to determine failure rates would be a total  failure. In fact, it was in attempts to use this data for this purpose that led  to the discovery of the practice. It had been going on for years.&lt;/p&gt; &lt;p class="para"&gt;A bank gives branch bank employees a bonus for all new corporate  accounts. A new division of a larger company opens an account with a local  branch. If the bank employee determines that this is a sub-account of a larger,  existing customer (the correct procedure), no bonus is paid upon opening the  account. If, however, the account is opened as a new corporate customer (the  wrong procedure), a bonus is paid.&lt;/p&gt; &lt;p class="para"&gt;An insurance company sells automobile insurance policies through  independent insurance writers. In a metropolitan area, the insurance rate is  determined by the Zip code of the applicant. The agents figure out that if they  falsify the ZIP CODE field on the initial application for high-cost Zip codes,  they can get the client on board at a lower rate. The transaction completes, the  agent gets his commission, and the customer corrects the error when the renewal  forms arrive a year later. The customer's rates subsequently go up as a  result.&lt;/p&gt; &lt;p class="para"&gt;Data entry people are rated based on the number of documents  entered per hour. They are not penalized for entering wrong information. This  leads to a practice of entering data too fast, not attempting to resolve issues  with input documents, and making up missing information. The operators who enter  the poorest-quality data get the highest performance ratings.&lt;/p&gt; &lt;p class="last-para"&gt;All of these examples demonstrate that company policy can  encourage people to deliberately falsify information in order to obtain a  personal benefit.&lt;/p&gt;&lt;/div&gt;&lt;/div&gt; &lt;div class="section"&gt; &lt;h3 class="sect3-title"&gt;&lt;a name="120"&gt;&lt;/a&gt;&lt;a name="ch03lev2sec5"&gt;&lt;/a&gt;System  Problems&lt;/h3&gt; &lt;p class="first-para"&gt;Systems are too often blamed for mistakes when, after  investigation, the mistakes turn out to be the result of a human error. Our  computing systems have &lt;a name="121"&gt;&lt;/a&gt;&lt;a name="page49"&gt;&lt;/a&gt;become enormously  reliable over the years. However, database errors do occur because of system  problems when the transaction systems are not properly designed.&lt;/p&gt; &lt;p class="para"&gt;Database systems have the notion of COMMIT. This means that  changes to a database system resulting from an external transaction either get  completely committed or completely rejected. Specific programming logic ensures  that a partial transaction never occurs. In application designs, the user is  generally made aware that a transaction has committed to the database.&lt;/p&gt; &lt;p class="para"&gt;In older systems, the transaction path from the person entering  data to the database was very short. It usually consisted of a terminal passing  information through a communications controller to a mainframe, where an  application program made the database calls, performed a COMMIT, and sent a  response back to the terminal. Terminals were either locally attached or  accessed through an internal network.&lt;/p&gt; &lt;p class="para"&gt;Today, the transaction path can be very long and very complex. It  is not unusual for an application to occur outside your corporation on a PC,  over the Internet. The transaction flows through ISPs to an application server  in your company. This server then passes messages to a database server, where  the database calls are made. It is not unusual for multiple application servers  to be in the path of the transaction. It is also not unusual for multiple  companies to house application servers in the path. For example, Amazon passes  transactions to other companies for "used book" orders.&lt;/p&gt; &lt;p class="para"&gt;The person entering the data is a nonprofessional, totally  unfamiliar with the system paths. The paths themselves involve many parts,  across many communication paths. If something goes wrong, such as a server going  down, the person entering the information may not have any idea of whether the  transaction occurred or not. If there is no procedure for them to find out, they  often reenter the transaction, thinking it is not there, when in fact it is; or  they do not reenter the transaction, thinking it happened, when in fact it did  not. In one case, you have duplicate data; in the other, you have missing  data.&lt;/p&gt; &lt;p class="para"&gt;More attention must be paid to transaction system design in this  new, complex world we have created. We came pretty close to cleaning up  transaction failures in older "short path" systems but are now returning to this  problem with the newer "long path" systems.&lt;/p&gt; &lt;p class="last-para"&gt;In summary, there are plenty of ways data inaccuracies can  occur when data is initially created. Errors that occur innocently tend to be  random and are difficult to correct. Errors that are deliberate or are the  result of poorly constructed processes tend to leave clues around that can be  detected by analytical techniques.&lt;/p&gt;&lt;p class="last-para"&gt; &lt;/p&gt;&lt;h2 class="first-section-title"&gt;Data Accuracy Decay&lt;/h2&gt; &lt;p class="first-para"&gt;&lt;a name="123"&gt;&lt;/a&gt;&lt;a name="page50"&gt;&lt;/a&gt;Data that is accurate  when initially created can become inaccurate within a database through time. The  data value does not change; its accuracy does. Some data elements are subject to  value accuracy decay and some are not. Some examples will illustrate the concept  of decay.&lt;/p&gt; &lt;p class="para"&gt;Personal information in an employee database easily becomes wrong.  People move, they change their marital status, they complete new education  programs, they change telephone numbers. Most employees do not run into HR and  fill out a form every time something changes in their life. The information in  HR reflects the accuracy at the time they initially joined the company or the  last time an update was done. Inventory-on-hand information can become wrong due  to spoilage, parts taken and used, and new transactions not processed.&lt;/p&gt; &lt;p class="para"&gt;The value carried on the books for capital assets can change due  to market demand changes, changes in the environment in which the asset is used,  unusual wear and tear, or damage done and not reported. A state driver's license  database indicates that a person has acceptable eyesight to drive without  glasses. However, during the time since the license was issued, the person's  eyesight may deteriorate to the point where she cannot safely drive without  glasses. The inaccuracy will not be corrected until a renewal requires a new eye  test.&lt;/p&gt; &lt;p class="para"&gt;All of these examples show that a change occurred in the object  being represented in the database and the database was not updated to reflect  it. Another way of saying this is that a transaction was needed to be processed  and was not. However, these "missing transactions" are commonplace in the real  world.&lt;/p&gt; &lt;p class="para"&gt;Not all data elements are subject to decay. Information defining  an object generally does not decay, whereas information providing other  information about the object generally can be subject to decay. Good database  designers will note the decay characteristic of a data element as part of the  metadata and will design processes to verify or update the information as needed  by the consumers of the data. For example, in an inventory application, the data  elements for PART_NUMBER, PART_DESCRIPTION, and UNIT_OF_MEASURE would not be  considered subject to decay, whereas QUANTITY_ON_HAND, SUPPLIER_ID, and  STORAGE_BIN_NUMBER would be subject to decay.&lt;/p&gt; &lt;p class="para"&gt;Another characteristic of decay-prone data elements is that the  overall accuracy of the element tends to follow a sawtooth graph over time. Data  is entered. Over time, the older data gets progressively more inaccurate (it  decays). The accuracy of the element is determined by the mix of older data  versus newer data, which tends to be more accurate. A corrective action occurs  that pulls the data back into a higher state of accuracy.&lt;/p&gt; &lt;p class="para"&gt;&lt;a name="124"&gt;&lt;/a&gt;&lt;a name="page51"&gt;&lt;/a&gt;In the previously cited  examples, the corrective event for HR information may be an annual data review  conducted with all employees; for inventory, an annual inventory exercise; for  capital equipment, an annual reappraisal. Note that for the driver's license  example there is no event that brings all or a set of records into accuracy at a  single point in time. It is a continuous database with new records and renewals  occurring on an individual timeline. This does not follow a sawtooth pattern  because the decaying and correcting occurs continuously over the database.  However, it does have an inaccuracy component due to decay that remains fairly  constant over time.&lt;/p&gt;&lt;div class="figure"&gt;&lt;br /&gt;&lt;span class="figure-title"&gt;&lt;span class="figure-titlelabel"&gt;&lt;/span&gt;&lt;br /&gt;&lt;/span&gt; &lt;/div&gt; &lt;p class="para"&gt;An example of a good HR system is to have new employees verify all  HR information immediately after initial entry into the database, request that  employees review all personal information each time they make any change, and  then request a specific review by the employee any time no review has been done  in a year (or less if you can get away with it). Another mechanism is to make  the review a part of the annual employee evaluation process. It would not hurt  to monitor marriage and birth announcements as well.&lt;/p&gt; &lt;p class="para"&gt;Unfortunately, most companies are not willing to be that concerned  about data decay accuracy issues. It is not because they have considered the  cost of a continuous updating process to be higher than the cost of not being  more accurate. It is more a case of not thinking about the problem at all. The  cost of a corrective program is not very high once it has been put in place. It  will generally return more value than it costs.&lt;/p&gt; &lt;p class="last-para"&gt;A proper way of handling decay-prone data elements is to  identify them in the database design and indicate this in the metadata  repository. Periodic events should be planned to update information to get the  database to recover &lt;a name="126"&gt;&lt;/a&gt;&lt;a name="page52"&gt;&lt;/a&gt;from decay. The times  that these events occur should be included in the database and/or the metadata  repository so that database business users can know what to expect. The metadata  repository should also include information on the rate of change that occurred  in the periodic auditing event. A good data analyst knows the rate of decay and  the probable accuracy index for any element at any given point in time.&lt;/p&gt;&lt;p class="last-para"&gt; &lt;/p&gt;&lt;h2 class="first-section-title"&gt;Moving and Restructuring Data&lt;/h2&gt; &lt;p class="first-para"&gt;Inaccurate data is often created from perfectly good data  through the processes used to move and restructure data. These processes are  commonly used to extract data from operational databases and put it into data  warehouses, data marts, or operational data stores. It may also restage the data  for access through corporate portals. This is an area often overlooked as a  source of data inaccuracies and yet it contributes significantly to the  inaccuracy problem. When you hear complaints that the data in a data warehouse  or data mart is wrong and unusable, a major contributing factor might be flawed  data movement processes as opposed to the data being inaccurate in the source  databases.&lt;/p&gt; &lt;p class="para"&gt;Moving and restructuring data is commonly done through ETL  (extract, transform, and load) processes. There may also be data scrubbing  processes involved. Sometimes this is performed through packaged tools and  sometimes through in-house-developed scripts or programs. When packaged tools  are used, the errors are not coming from them but rather from faulty  specifications used for them.&lt;/p&gt; &lt;p class="para"&gt;Entire classes of tools and software companies emerged to support  this activity. The two primary tool categories are ETL and data cleansing. What  was missing was tools to help understand the data in enough detail to  effectively use the tools and to design decision support systems correctly. The  recent emergence of data profiling tools is intended to fill this gap and is  showing dramatically positive effects when used.&lt;/p&gt; &lt;p class="para"&gt;The irony of it is that most projects claim to be extracting the  data and cleaning it up before it is moved into the data warehouses, when in  fact they are making it dirtier, not cleaner. Data is generated and stored in a  database based on the initiating application. This is generally a transaction  system that collects the data and performs updates. The database is generally  designed to meet the requirements of the initiating application and nothing  more. The design of the data and the database is generally poorly documented and  almost always not kept up to date.&lt;/p&gt; &lt;p class="para"&gt;There are also a host of other reasons for moving data to a  database of a different structure. These include migration of applications to  newer technology &lt;a name="128"&gt;&lt;/a&gt;&lt;a name="page53"&gt;&lt;/a&gt;or to packaged applications.  For many applications, this may be the first time they have been subject to a  major reengineering effort. Another reason is the need to consolidate databases.  Consolidations occur for two reasons: corporation mergers and acquisitions, and  the need to centralize databases from local or divisional databases.&lt;/p&gt; &lt;p class="para"&gt;An important fact of migrations and consolidations is that you  cannot leave the old data behind. Most databases require a continuity of  information over long periods of time. You cannot move to a new personnel system  and not move the existing personnel information behind. You cannot move to a new  supply chain system and not move the inventory or the inventory history data.  lists  projects for which database restructure and/or data movement should be a serious  consideration.&lt;/p&gt; &lt;div class="figure"&gt;&lt;a name="129"&gt;&lt;/a&gt;&lt;a name="ch03fig03"&gt;&lt;/a&gt;&lt;span class="figuremediaobject"&gt;&lt;a href="images/fig3%2D3%5F0%2Ejpg" name="IMG_10" target="_parent"&gt;&lt;/a&gt;&lt;/span&gt; &lt;br /&gt;&lt;span class="figure-title"&gt;&lt;span class="figure-titlelabel"&gt;&lt;/span&gt;&lt;br /&gt;&lt;/span&gt; &lt;/div&gt; &lt;p class="para"&gt;There are two major problems frequently found in moving and  restructuring data. The first is that to effectively accomplish the movement of  the data to another database you have to fully understand everything about the  source database (and the target, if it is not a new database). The reality is  nothing close to this. The other problem is that the source systems were never  designed with the thought of having to give up their data to another database.  &lt;a name="130"&gt;&lt;/a&gt;&lt;a name="page54"&gt;&lt;/a&gt;This is particularly the case for a database  of a different type, designed very differently and combining it with data from  other sources.&lt;/p&gt; &lt;div class="sidebar"&gt; &lt;table class="BlueLine" border="0" cellpadding="0" cellspacing="0" width="100%"&gt; &lt;tbody&gt; &lt;tr&gt; &lt;td class="bluecell" bg style="color:#000080;"&gt;&lt;span style="font-family:Arial;font-size:85%;color:#010100;"&gt;&lt;b&gt;&lt;img alt="Start Sidebar" src="_.gif" border="0" height="2" width="1" /&gt;&lt;/b&gt;&lt;/span&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;/table&gt; &lt;p class="first-para"&gt;The software industry supported moving and restructuring  data by creating two disjointed types of products: ETL products and data  cleansing products.&lt;/p&gt; &lt;p class="para"&gt;ETL products provide support for extracting data from source  systems, transforming values in the data, aggregating data, and producing  suitable load files for target systems.&lt;/p&gt; &lt;p class="para"&gt;Data cleansing companies provide support for processing selective  data fields to standardize values, find errors, and make corrections through  external correlation. Their target has been primarily name and address field  data, which easily lends itself to this process. It has also been found to be  usable on some other types of data.&lt;/p&gt; &lt;p class="para"&gt;In spite of very good software from these companies, a lot of data  inaccuracies filter through the process and a lot of new inaccuracies are  generated. None of this is the fault of the products from these companies. They  all execute as advertised. The problem is that they will not work correctly  unless accurate specifications of the source data and the work to be done are  entered. The problem is that the tools cannot help if they are fed inaccurate or  incomplete specifications.&lt;/p&gt; &lt;p class="para"&gt;This has led to the emergence of a third product type: data  profiling products. Data profiling is intended to complete or correct the  metadata about source systems. It is also used to map systems together  correctly. The information developed in profiling becomes the specification  information that is needed by ETL and data cleansing products.&lt;br /&gt;&lt;/p&gt; &lt;div class="figure"&gt;&lt;a name="131"&gt;&lt;/a&gt;&lt;a name="ch03fig04"&gt;&lt;/a&gt;&lt;span class="figuremediaobject"&gt;&lt;a href="images/fig3%2D4%5F0%2Ejpg" name="IMG_11" target="_parent"&gt;&lt;/a&gt;&lt;/span&gt; &lt;br /&gt;&lt;span class="figure-title"&gt;&lt;span class="figure-titlelabel"&gt;&lt;/span&gt;&lt;br /&gt;&lt;/span&gt; &lt;/div&gt; &lt;table class="BlueLine" border="0" cellpadding="0" cellspacing="0" width="100%"&gt; &lt;tbody&gt; &lt;tr&gt; &lt;td class="bluecell" bg style="color:#000080;"&gt;&lt;span style="font-family:Arial;font-size:85%;color:#010100;"&gt;&lt;b&gt;&lt;img alt="End Sidebar" src="_.gif" border="0" height="2" width="1" /&gt;&lt;/b&gt;&lt;/span&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;/table&gt;&lt;/div&gt; &lt;table class="BlankSpace" border="0" cellpadding="0" cellspacing="0" width="100%"&gt; &lt;tbody&gt; &lt;tr&gt; &lt;td height="16"&gt;&lt;br /&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;/table&gt; &lt;div class="section"&gt; &lt;h3 class="sect3-title"&gt;&lt;a name="132"&gt;&lt;/a&gt;&lt;a name="ch03lev2sec6"&gt;&lt;/a&gt;Understanding  Source Databases&lt;/h3&gt; &lt;p class="first-para"&gt;Rarely do you have current, complete information available  on transaction systems. Metadata dictionaries and repositories generally have  very low accuracy. Even COBOL copybooks sport a high degree of mismatch between  what they say and how the data is actually stored.&lt;/p&gt; &lt;p class="para"&gt;The poor attention paid to creating and maintaining accurate data  in data dictionaries and metadata repositories is now hurting corporations to  the tune of millions of dollars in unnecessarily complex data movement projects  and/or in having to accept low-quality data (data that is even of lower quality  than the source databases it is extracted from) for use in decision making.&lt;/p&gt; &lt;p class="para"&gt;The reasons for poor data quality in metadata repositories are  many. Repository technology has lagged behind database design evolution.  Repository solutions are generally insufficient in content to collect all the  needed information. They are passive, which means that they can get out of step  with the data without noticeable impacts on normal operations. There has been  little motivation on the part of data administration staff to keep them  current.&lt;/p&gt; &lt;p class="para"&gt;COBOL copybooks also become inaccurate over time. Changes to  fields are often done without updating the copybook. A character field may be  reassigned to be used for an entirely different purpose. If the physical  definition of the field satisfies the new use, the copybook change is not needed  to make the program work. This results in the field name and comment text  referring to the old meaning and not the new meaning. This can, and will,  mislead anyone using it to make changes or to extract data.&lt;/p&gt; &lt;p class="last-para"&gt;In many instances, valuable application information was found  stored in the space defined to COBOL as FILLER. This is a common and quick-fix  way of implementing a change and of completely bypassing an update of the  copybook or any other documentation that may exist.&lt;/p&gt;&lt;/div&gt; &lt;div class="section"&gt; &lt;h3 class="sect3-title"&gt;&lt;a name="133"&gt;&lt;/a&gt;&lt;a name="ch03lev2sec7"&gt;&lt;/a&gt;Overloaded  Fields&lt;/h3&gt; &lt;p class="first-para"&gt;&lt;a name="134"&gt;&lt;/a&gt;&lt;a name="page55"&gt;&lt;/a&gt;Older, legacy  applications are filled with "overloaded" fields. An overloaded field is one  that has information about multiple, different facts stored in the same  field.&lt;/p&gt; &lt;p class="para"&gt;An example of an overloaded field is where a mainframe application  developer needs a new field that has a binary representation such as YES or NO.  They do not want to waste the space for a new field (a common concern for  developers in the 1960 to 1980 time frame). Instead of creating a new field,  they decide to use an unused bit of another, existing field. This could be one  of the spare bits within packed decimal fields for IBM mainframes or the sign  bit of a numeric field that can never be negative. The result is an overloaded  field. Other examples of overloaded fields are to use each bit of a single byte  to represent the value for a different binary-valued field or to use free-form  text fields to encode keywords that mean different things.&lt;/p&gt; &lt;p class="para"&gt;An extreme case discovered in one system was that a person's NAME  field had the characters *DECEASED appended to it within the same NAME field if  the person was dead. So, for example, the value &lt;i class="emphasis"&gt;Jack Olson&lt;/i&gt;  meant who I am and that I am alive, whereas the value &lt;i class="emphasis"&gt;Jack  Olson&lt;/i&gt; *&lt;i class="emphasis"&gt;DECEASED&lt;/i&gt; meant who I am and that I am not  alive.&lt;/p&gt; &lt;p class="para"&gt;The problem with overloaded fields is that they are generally  never documented. The copybook definitions and dictionary definitions usually  document only the underlying field or, in the case of multivalued keyword  fields, document it only as a text field.&lt;/p&gt; &lt;p class="para"&gt;Analysts designing data movement and restructuring processes often  do not become aware of the overloading and either incorrectly transform and move  the data or reject some or all of the values in the field as being inaccurate.  In the process of rejection, they are losing correct information on each of the  separate facts recorded in the field.&lt;/p&gt; &lt;p class="para"&gt;Another real example of overloading fields is where a legacy  system developer used the sign bit on the PAY_RATE field to represent a  different fact because this field should never be negative. Neither this  convention nor the meaning of the bit was ever documented. When they upgraded to  a new packaged application for payroll, they moved the field as it existed,  innocently thinking that they did not need to investigate such an obvious field.  When they ran their first payroll, 30% of the checks had negative amounts. This  is an example of a situation in which you absolutely must verify the content of  every field against your expectations when you lift it out of one system and  place it into another.&lt;/p&gt; &lt;p class="last-para"&gt;Sometimes the information needed about existing databases is  not recorded anywhere. You often need detailed information on missing or &lt;a name="135"&gt;&lt;/a&gt;&lt;a name="page56"&gt;&lt;/a&gt;exception conditions. This is often not  available. You also need information on accuracy. This is never recorded  anywhere.&lt;/p&gt;&lt;/div&gt; &lt;div class="section"&gt; &lt;h3 class="sect3-title"&gt;&lt;a name="136"&gt;&lt;/a&gt;&lt;a name="ch03lev2sec8"&gt;&lt;/a&gt;Matching Source  Databases to Target Databases&lt;/h3&gt; &lt;p class="first-para"&gt;This is tricky business. You first need to have a total  understanding of both systems. You then need to match data elements to determine  whether the result makes sense.&lt;/p&gt; &lt;p class="para"&gt;In matching data elements you are saying that a particular field  in the source system contains the content you need to put into a particular  field in the target system. Many times this is an obvious match, such as a  PAYTYPE field in one system and a PAY_TYPE field in the other system. However,  when you get below the surface, differences can emerge that are highly  significant and can create enormous problems if not addressed. The types of  problems that can occur are&lt;/p&gt; &lt;ul class="itemizedlist"&gt;&lt;li class="first-listitem"&gt; &lt;p class="first-para"&gt;The actual representation of the values within the field may  differ. One system, for example, may encode PAY_TYPE as H and S for hourly and  salaried. The other may encode the same information as 1 and 2. You must look at  the values to determine what is there.&lt;/p&gt; &lt;/li&gt;&lt;li class="listitem"&gt; &lt;p class="first-para"&gt;The scope of values may differ. One system may have two  types of PAY_TYPE, and the other may have three types. Are the differences  reconcilable?&lt;/p&gt; &lt;/li&gt;&lt;li class="listitem"&gt; &lt;p class="first-para"&gt;One system may record information to a lower level of  granularity than the other system. For example, one may encode color as RED,  BLUE, and GREEN, whereas the other system may have five values for shades of  RED, four values for shades of BLUE, and so on. The issue then becomes whether  the target system can deal with the differences.&lt;/p&gt; &lt;/li&gt;&lt;li class="listitem"&gt; &lt;p class="first-para"&gt;The systems may handle NULL or NOT KNOWN differently. One  system may not allow them, whereas the other does. One system may record a NULL  as a blank and the other system as the? character.&lt;/p&gt; &lt;/li&gt;&lt;li class="listitem"&gt; &lt;p class="first-para"&gt;Special meaning may be attached to specific values in one  system. For example, a policy number of 99999999 may mean that the policy number  has not been assigned yet.&lt;/p&gt;&lt;/li&gt;&lt;/ul&gt; &lt;p class="para"&gt;It is not sufficient that the target system have a field of the  correct data type and length as the source system field. It is necessary that  every value in the source system can be translated to an acceptable value in the  target system that has exactly the same semantic meaning. Failure to do this  will turn acceptable (accurate) values in the source system into unacceptable &lt;a name="137"&gt;&lt;/a&gt;&lt;a name="page57"&gt;&lt;/a&gt;(inaccurate) values in the target system.&lt;/p&gt; &lt;p class="para"&gt;Another problem comes into play in matching data elements when  there are a different number of data elements between the source and target  systems. When the target system has a data element that does not exist in the  source system, a determination needs to be made on what to do about it. The  target data element may be eliminated from the design, NULLS inserted, or values  manufactured for the field. In any case, the result is less than optimal. This  problem is compounded when merging data from multiple sources wherein some  sources have values for this field and other sources do not. The result is a  target data element populated with highly questionable data. Any attempt to use  the field to generate aggregations for groups will yield inaccurate results.&lt;/p&gt; &lt;p class="para"&gt;When the source system contains excess data elements, a decision  needs to be made on whether the information is, in fact, needed in the target  system. Generally it is okay to leave it out, provided it does not have  structural importance to other data elements.&lt;/p&gt; &lt;p class="para"&gt;Source and target systems need to be matched at the structural  level as well as at the data element level. This is a step often either not done  or done very poorly. The result can be failures in attempting to move the data  or, worse yet, inaccurate results in the target systems.&lt;/p&gt; &lt;p class="para"&gt;Structural matching involves checking to see if data element  functional dependencies and referential constraints in the source systems match  those in the target systems. This is discussed in detail in &lt;a class="chapterjump" href="LiB0062.html#374" target="_parent"&gt;Chapter 9&lt;/a&gt;.&lt;/p&gt; &lt;p class="para"&gt;For example, if the data elements of one system are not part of a  primary key and they correspond to data elements in the other system that are  part of a primary key, an error can easily occur in trying to move the data.  Another example is where in one system the DISCOUNT is applied to the order, and  in the other system it is applied to individual line items within the order. If  the target system has the discount on the order, a problem exists in trying to  move the data.&lt;/p&gt; &lt;p class="para"&gt;A large number of errors are routinely made when trying to force  data into a target structure in circumstances where the two systems are  structurally different and not reconcilable. The reason so much force fitting is  done is that the information about these differences usually does not surface  until the actual implementation phase of projects. By then, the project team is  trying to make things work at any cost.&lt;/p&gt; &lt;p class="last-para"&gt;Once data is understood and matched, the project must  construct processes to extract the data from the source databases, run them  through any data cleansing needed, apply transforms, and load the data into the  target systems. Shortcomings in analyzing the source data and in matching  elements and structures often lead to serious problems in executing these  procedures or result in corrupt data in the target systems.&lt;/p&gt;&lt;/div&gt; &lt;div class="section"&gt; &lt;h3 class="sect3-title"&gt;&lt;a name="138"&gt;&lt;/a&gt;&lt;a name="ch03lev2sec9"&gt;&lt;/a&gt;Extraction&lt;/h3&gt; &lt;p class="first-para"&gt;&lt;a name="139"&gt;&lt;/a&gt;&lt;a name="page58"&gt;&lt;/a&gt;Extracting data normally  involves reading the source databases and building output files suitable for use  in the subsequent processes. These output files are almost always expected to be  in a normalized form. This means that they are tables of values, hopefully in  "third-normal" form. (Normal forms of data are explained in &lt;a class="chapterjump" href="LiB0063.html#377" target="_parent"&gt;Section 9.1&lt;/a&gt;.)&lt;/p&gt; &lt;p class="para"&gt;Source systems are not always easy to unwind into a set of  normalized flat files. Depending on the complexity of how data is stored in the  source systems, this can be a monumental task that is susceptible to all sorts  of errors.&lt;/p&gt; &lt;p class="para"&gt;For example, a mainframe VSAM application covered by COBOL  copy-books may store information in the same data file for many different  records and have variations embedded within. Liberal use of the REDEFINE and  OCCURS clauses can provide a challenging structural situation that must be  studied thoroughly before developing the extraction routines.&lt;/p&gt; &lt;p class="para"&gt;A program cannot automatically figure this out for you. Many  extraction programs have been constructed that automatically read the copybook  and develop extraction logic from it. There are a number of problems with this  approach. A REDEFINE may be used to recast the same facts or to indicate a  different set of facts depending on some other value in the record. Only a human  can determine which is meant. In one case you need to select the representation  to include in the output; in the other you need to normalize the facts within  the REDEFINE into separate data elements. A similar situation exists for OCCURS  clauses. Do they represent &lt;i class="emphasis"&gt;n&lt;/i&gt; occurrences of the same fact  or &lt;i class="emphasis"&gt;n&lt;/i&gt; different facts? One case calls for normalization  into a separate file, and the other does not.&lt;/p&gt; &lt;p class="para"&gt;A company was trying to move a VSAM application for which over 200  different record types were encoded within the same VSAM file using a record  type field to distinguish them and an equal number of record-level REDEFINEs.  There were also liberal uses of REDEFINE and OCCURS within many of these record  types. In essence, the user had cleverly defined an entire relational database  structure within the VSAM file structure for a single file. Unfortunately, his  cleverness caused great pains for subsequent analysts who needed to extract data  from this file. As you might expect, he was no longer with the company when this  was needed. Unraveling it into the proper relational definition was not  easy.&lt;/p&gt; &lt;p class="para"&gt;Converting nonrelational structures to normalized files may also  require key generation in cases for which the connection between records is done  through positioning or internal pointers. Failure to recognize this and provide  the keys will result in files that lack the ability to JOIN objects later or to  enforce referential integrity constraints.&lt;/p&gt; &lt;p class="para"&gt;&lt;a name="140"&gt;&lt;/a&gt;&lt;a name="page59"&gt;&lt;/a&gt;Extraction routines also need  to handle overloaded fields. Often, they merely look at the source data as being  in error and just reject the values or substitute a NULL or blank/zero value in  its place. This is generating wrong information from correct information.&lt;/p&gt; &lt;p class="para"&gt;Source systems are often denormalized. Denormalization was a  common trick used in earlier systems (especially early relational systems) to  obtain adequate performance. Failure to recognize that the source data is  denormalized results in denormalized flat files, which can lead to denormalized  targets, which can lead to statistical errors when using the data. This is  another area rarely documented in source systems.&lt;/p&gt; &lt;p class="last-para"&gt;The inverse is also true. Extraction routines that "flatten"  data from hierarchical structures often generate denormalized flat files from  structures that were correctly normalized. Most general-purpose extraction  programs have the potential for doing this. For example, in an IMS database  containing multiple segment types, if only one output file is generated for the  entire database, it is denormalized if multiple occurrences of child segments  are allowed for each parent instance. This output file will generate errors in  the target system if the denormalization is not corrected somewhere else in the  process.&lt;/p&gt;&lt;/div&gt; &lt;div class="section"&gt; &lt;h3 class="sect3-title"&gt;&lt;a name="141"&gt;&lt;/a&gt;&lt;a name="ch03lev2sec10"&gt;&lt;/a&gt;Data  Cleansing&lt;/h3&gt; &lt;p class="first-para"&gt;Data cleansing involves identifying incorrect data values  and then either correcting them or rejecting them. They deal with INVALID values  in single data elements or correlation across multiple data elements. Many  products are available to help you construct data cleansing routines. They can  be helpful in improving the accuracy of data or they can result in less accurate  data, depending on how carefully they are used.&lt;/p&gt; &lt;p class="para"&gt;Data cleansing becomes a problem when you lack an understanding of  the source data and reject values that have special meaning. For example, if the  source system had a convention of putting a value of * in a STATE field to  indicate local state, the scrubbing routine may infer that * is invalid and  either reject the row or put a null in its place. Even though the convention is  unusual, it was correct in the source system, whose applications understood the  convention.&lt;/p&gt; &lt;p class="para"&gt;Another problem is when data cleansing routines reject a value  that is clearly correctable. For example, multiple representations of a name in  a text field may include misspellings or alternative representations that are  clearly recognizable for what they should be in the source systems. For example,  the entry &lt;i class="emphasis"&gt;blakc&lt;/i&gt; is clearly a misspelling for &lt;i class="emphasis"&gt;black&lt;/i&gt;, and &lt;i class="emphasis"&gt;Tex&lt;/i&gt; is clearly an  alternative representation of the text &lt;i class="emphasis"&gt;Texas&lt;/i&gt; (or &lt;i class="emphasis"&gt;TX&lt;/i&gt; if two characters are the convention).&lt;/p&gt; &lt;p class="para"&gt;&lt;a name="142"&gt;&lt;/a&gt;&lt;a name="page60"&gt;&lt;/a&gt;A poorly designed data  cleansing routine will reject these values and substitute a NULL or blank. The  result is that there is no way of correcting it in the target system because the  original wrong value is lost.&lt;/p&gt; &lt;p class="para"&gt;Data cleansing routines can identify wrong values but generally  cannot correct them. They can correct values only through synonym lists or  correlation against tables showing valid combinations of values. Most of the  time they identify a wrong value but cannot correct it. The only alternative is  to change the value to unknown (NULL in most systems) or to reject the row in  which the bad value is contained. If it is rejected, they can either leave it  out of the target (creating a bigger problem) or manually investigate the value  and reenter it into the target after correction.&lt;/p&gt; &lt;p class="last-para"&gt;Leaving rows out has multiple problems. First, you are losing  some data. The correct data in other data elements of these rows may be more  important to the target than the result you get by leaving the entire row out.  Another problem is that you may create a structural problem relative to other  rows in the same or other tables. For example, if you reject an order header,  the order detail records in a different file will become orphans. Rejecting one  row may have the effect of causing many other rows to be rejected later, when  referential constraints are enforced upon load.&lt;/p&gt;&lt;/div&gt; &lt;div class="section"&gt; &lt;h3 class="sect3-title"&gt;&lt;a name="143"&gt;&lt;/a&gt;&lt;a name="ch03lev2sec11"&gt;&lt;/a&gt;Transforms&lt;/h3&gt; &lt;p class="first-para"&gt;Transforms are routines that change the representation of a  value while not changing the content. For example, in one system, GENDER may be  represented by 1 and 2, whereas in the target it is represented by M and F. The  transformation routine simply translates 1s to M and 2s to F. Transformation  routines can be simple or complex. Failure to properly specify transformations  can result in correct data in the source becoming wrong data in the target.&lt;/p&gt; &lt;p class="para"&gt;Two areas that commonly introduce problems are code page  crossovers and currency translation. For example, if you are building a central  data warehouse from data coming from divisions around the world, you need to pay  special attention to the effects either of these may have on the outcome. Code  page conversions can lose special country codes that translate the special  characters of another country into the USA code page. This may generate a name  value that looks invalid to the target system.&lt;/p&gt; &lt;p class="last-para"&gt;Another problem sometimes overlooked is where a source system  stores a dollar value as an integer. The application programs "know" that they  are supposed to insert the decimal point after position 2 (effectively dividing  the stored value by 100). A transform needs to occur when this data is moved. If  the analyst fails to recognize the convention, the values are moved over as if  they were full dollar amounts.&lt;/p&gt;&lt;/div&gt; &lt;div class="section"&gt; &lt;h3 class="sect3-title"&gt;&lt;a name="144"&gt;&lt;/a&gt;&lt;a name="ch03lev2sec12"&gt;&lt;/a&gt;Loading&lt;/h3&gt; &lt;p class="first-para"&gt;&lt;a name="145"&gt;&lt;/a&gt;&lt;a name="page61"&gt;&lt;/a&gt;The last step of the data  movement process is loading data. You would think that this step would be a safe  bet and that all of the damage that can be done has already been done to the  data. Assuming you have properly defined the target database, you only need to  deal with a few issues.&lt;/p&gt; &lt;p class="para"&gt;You need to ensure that all of the data gets loaded in order for  the target to be in a consistent state. This sounds simple. However, if data is  coming from dozens of sources, you need to ensure that all of the data from all  sources gets there and gets loaded together to create a complete target system.  One of the important issues of data warehouse and data mart design is ensuring  that this is done and that the data is not released to the user community until  a complete load is accomplished.&lt;/p&gt; &lt;p class="para"&gt;Another issue is what you do with the data that is rejected upon  load due to data type, key definition, referential constraint, or procedure  violations. As in the case of data cleansing, you must decide if you substitute  a value to make the structure work, leave rows out of the target, or investigate  the cause and enter the correct values.&lt;/p&gt; &lt;p class="para"&gt;The quantity of rejections will determine the best way to handle  this issue. A good data movement process will have found all of these values  before they get to the load process.&lt;/p&gt; &lt;p class="para"&gt;This section has shown that there are many opportunities to make  mistakes in the process of extracting data from one system and moving it to  another. The majority of these mistakes are mistakes of omission. The developer  of the processes does not have a correct or complete set of information about  the source systems in order to know everything that needs to be done. The scary  part of this is that we continually plow through projects with very low-quality  metadata for the source systems and try to fix or reconcile the problems after  the fact.&lt;/p&gt; &lt;p class="para"&gt;The other problem is that when we look at the data in the target  system and determine that it is of low quality we instinctively blame the source  systems instead of considering the movement processes as the culprit. Since  source systems are difficult to change, little is generally done about the  situation. Often the target system is just not used because of a lack of  confidence in the data. Much of the time, the target system quality can be  improved immensely through fixing the data movement processes.&lt;/p&gt; &lt;p class="last-para"&gt;&lt;a class="internaljump" href="#ch03fig04" role="chapterjump"&gt;&lt;/a&gt; Clearly, the more effort you make in completing the first two  steps properly the fewer mistakes you make in the other steps. Mistakes result  in inaccurate data in the target systems. Eliminating the mistakes makes the  target systems more accurate and more usable.&lt;/p&gt;&lt;/div&gt; &lt;div class="section"&gt; &lt;h3 class="sect3-title"&gt;&lt;a name="146"&gt;&lt;/a&gt;&lt;a name="ch03lev2sec13"&gt;&lt;/a&gt;Data  Integration&lt;/h3&gt; &lt;p class="first-para"&gt;&lt;a name="147"&gt;&lt;/a&gt;&lt;a name="page62"&gt;&lt;/a&gt;The previous section  covered what can go wrong when data is taken from one system to another. Data  integration projects that fetch data from source systems and use them in new  applications have all the same problems. They may be only fetching tiny amounts  of data for each use. However, the data in the source systems, no matter how  small, must be properly extracted, cleansed, transformed, and put into a form  (usually a message) the target application understands. Virtually all of the  topics covered for batch, bulk data movement apply to integration.&lt;/p&gt; &lt;p class="last-para"&gt;Failure to recognize this and do everything that needs to be  done leads to inaccurate data being presented to new applications from correct  databases. This can lead to incorrect processing of transactions because of  wrong data values being presented to the integration transactions. This has a  large potential for negative impacts on the business conducted through  them.&lt;/p&gt;&lt;p class="last-para"&gt; &lt;/p&gt;&lt;h2 class="first-section-title"&gt;Using Data&lt;/h2&gt; &lt;p class="first-para"&gt;The final area where data goes wrong is the process of  putting it into business objects such as reports, query result screens,  portal-accessible windows, and the subsequent use of this by business  professionals.&lt;/p&gt; &lt;p class="para"&gt;Data may be accurate, but if users do not understand the meaning  of the data or the context within which it is presented, their interpretation  and use of the data may be inaccurate. This again points to the problem of not  having a good metadata repository that is maintained 100% accurately all of the  time. This repository should contain information on what each data element  represents, how the data within it is encoded, how to interpret special values,  the &lt;a name="149"&gt;&lt;/a&gt;&lt;a name="page63"&gt;&lt;/a&gt;source of data, the time periods last  updated, and the quality levels last known for this data.&lt;/p&gt; &lt;p class="para"&gt;Quality levels are important because they can let users judge  whether the data satisfies their needs or not. They should be able to determine  from the repository the probable inaccurate rate within the data. This can be  derived from periodic assessment, data monitoring tools, or both.&lt;/p&gt; &lt;p class="last-para"&gt;The user should have easy access to this information. What is  needed is either a comprehensive help system on the business objects or an  easy-to-use corporate portal on corporate data stores. If users have to call  someone or jump through hoops to find out information about their data, they  will just not do it. They will make their own assumptions about what the data  represents and use it accordingly. Unfortunately, most companies do not have a  good metadata repository and certainly do not have a good mechanism for non-IT  professionals to access the information they do have.&lt;/p&gt;&lt;p class="last-para"&gt; &lt;/p&gt;&lt;h2 class="first-section-title"&gt;Scope of Problems&lt;/h2&gt; &lt;p class="first-para"&gt;By the time a report has been generated from a data  warehouse, there have been a lot of opportunities for the data to go wrong. The  errors just accumulate from initial data creation through decay, data movement,  and use. No wonder so many data decision support systems are judged  failures.&lt;/p&gt; &lt;p class="para"&gt;Some practitioners take comfort in believing that data accuracy  problems are greater in legacy systems than they are in systems built more  recently, systems on a relational base, or systems implemented through a  packaged application. The truth is that there are many opportunities to create  problems in all of these.&lt;/p&gt; &lt;p class="para"&gt;Relational systems are not immune to errors. This is particularly  true of relational systems built in the 1980s. Much of the current protection  capability of relational systems was introduced gradually throughout the 1980s.  This included date/time data types, triggers, procedures, and referential  constraints. Many of the older systems do not use these capabilities and thus  leave themselves open to inaccurate data.&lt;/p&gt; &lt;p class="last-para"&gt;Packaged applications do not protect against bad data. They  provide a framework for collecting and storing data, but only the using company  generates and is responsible for the data. The packaged application cannot  ensure that all data is collected, that all values are accurate, or that the  fields are being used as intended. Using companies often make local  customization decisions in order to force the packaged application to fit their  way of doing business. This can lead to problems in extracting, moving, and  interpreting data.&lt;/p&gt;&lt;p class="last-para"&gt; &lt;/p&gt;&lt;h2 class="first-section-title"&gt;Closing Remarks&lt;/h2&gt; &lt;p class="first-para"&gt;&lt;a name="152"&gt;&lt;/a&gt;&lt;a name="page64"&gt;&lt;/a&gt;Inaccurate data gets into  databases at a number of points and for a variety of reasons. Any program to  improve data accuracy must address the issues across the entire spectrum of  opportunities for error.&lt;/p&gt; &lt;p class="para"&gt;Data can be entered mistakenly, can be deliberately entered  inaccurately, can be the result of system errors, can decay in accuracy, can be  turned into inaccurate data through moving and restructuring, and can be turned  into wrong information when inappropriately reported or used. Understanding all  of these areas will make data quality assurance professionals more expert in  analyzing data and processes for inaccuracies.&lt;/p&gt; &lt;p class="para"&gt;The common theme throughout this chapter is that knowledge about  your data is the key to successful assessment, movement of data, and use of  data. There is no substitute for a sound knowledge base of information about the  data. Most metadata repositories fall short of the need. If a process similar to  that shown in &lt;a class="chapterjump" href="LiB0026.html#131" target="_parent"&gt;&lt;/a&gt; is used vigorously, updating the metadata repository at all stages,  higher data accuracy will result for those making decisions from the data.&lt;/p&gt; &lt;p class="last-para"&gt;The area of metadata repositories is ripe for new  development, meaningful standardization, and widespread deployment. It should  emerge as the next most important technology corporations demand after they have  made significant progress on data quality. The business case for repository  projects will emerge as the cost of not having them (via inhibiting data quality  improvement efforts) becomes more clear. The business value of data quality  initiatives will generally depend on the ability of the corporation to maintain  an accurate metadata repository. To the extent that it does not, it will not  reap the maximum value from data quality efforts.&lt;/p&gt;&lt;/div&gt;&lt;/div&gt;&lt;/div&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/1576936343892993251-3495880443643942380?l=dataqualityaccuracy.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://dataqualityaccuracy.blogspot.com/feeds/3495880443643942380/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=1576936343892993251&amp;postID=3495880443643942380' title='1 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/1576936343892993251/posts/default/3495880443643942380'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/1576936343892993251/posts/default/3495880443643942380'/><link rel='alternate' type='text/html' href='http://dataqualityaccuracy.blogspot.com/2007/12/sources-of-inaccurate-data.html' title='Sources of Inaccurate Data'/><author><name>vijikumar</name><uri>http://www.blogger.com/profile/18263997193124066136</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>1</thr:total></entry><entry><id>tag:blogger.com,1999:blog-1576936343892993251.post-7864681230016369688</id><published>2007-12-13T22:43:00.000-08:00</published><updated>2007-12-13T22:48:09.477-08:00</updated><title type='text'>Definition of Accurate Data</title><content type='html'>&lt;h2 class="sect2-title"&gt;Data Quality Definitions&lt;/h2&gt; &lt;p class="first-para"&gt;Data quality is defined as follows: data has quality if it  satisfies the requirements of its intended use. It lacks quality to the extent  that it does not satisfy the requirement. In other words, data quality depends  as much on the intended use as it does on the data itself. To satisfy the  intended use, the data must be accurate, timely, relevant, complete, understood,  and trusted.&lt;/p&gt; &lt;p class="para"&gt;Some examples will help in understanding the notion of data  quality in the context of intended use. The sections that follow explore  examples of the previously mentioned aspects of data integrity.&lt;/p&gt; &lt;div class="section"&gt; &lt;h3 class="sect3-title"&gt;&lt;a name="62"&gt;&lt;/a&gt;&lt;a name="ch02lev2sec1"&gt;&lt;/a&gt;Case 1:  Accuracy&lt;/h3&gt; &lt;p class="first-para"&gt;Consider a database that contains names, addresses, phone  numbers, and e-mail addresses of physicians in the state of Texas. This database  is known to have a number of errors: some records are wrong, some are missing,  and some are obsolete. If you compare the database to the true population of  physicians, it is expected to be 85% accurate.&lt;/p&gt; &lt;p class="para"&gt;If this database is to be used for the state of Texas to notify  physicians of a new law regarding assisted suicide, it would certainly be  considered poor quality. In fact, it would be dangerous to use it for that  intended purpose.&lt;/p&gt; &lt;p class="last-para"&gt;&lt;a name="63"&gt;&lt;/a&gt;&lt;a name="page25"&gt;&lt;/a&gt;If this database were to be  used by a new surgical device manufacturer to find potential customers, it would  be considered high quality. Any such firm would be delighted to have a potential  customer database that is 85% accurate. From it, they could conduct a  telemarketing campaign to identify real sales leads with a completely acceptable  success rate. The same database: for one use it has poor data quality, and for  another it has high data quality.&lt;/p&gt;&lt;/div&gt; &lt;div class="section"&gt; &lt;h3 class="sect3-title"&gt;&lt;a name="64"&gt;&lt;/a&gt;&lt;a name="ch02lev2sec2"&gt;&lt;/a&gt;Case 2:  Timeliness&lt;/h3&gt; &lt;p class="first-para"&gt;Consider a database containing sales information for a  division of a company. This database contains three years' worth of data.  However, the database is slow to become complete at the end of each month. Some  units submit their information immediately, whereas others take several days to  send in information. There are also a number of corrections and adjustments that  flow in. Thus, for a period of time at the end of the accounting period, the  content is incomplete. However, all of the data is correct when complete.&lt;/p&gt; &lt;p class="para"&gt;If this database is to be used to compute sales bonuses that are  due on the 15th of the following month, it is of poor data quality even though  the data in it is always eventually accurate. The data is not timely enough for  the intended use.&lt;/p&gt; &lt;p class="last-para"&gt;However, if this database is to be used for historical trend  analysis and to make decisions on altering territories, it is of excellent data  quality as long as the user knows when all additions and changes are  incorporated. Waiting for all of the data to get in is not a problem because its  intended use is to make long-term decisions.&lt;/p&gt;&lt;/div&gt; &lt;div class="section"&gt; &lt;h3 class="sect3-title"&gt;&lt;a name="65"&gt;&lt;/a&gt;&lt;a name="ch02lev2sec3"&gt;&lt;/a&gt;Case 3:  Relevance&lt;/h3&gt; &lt;p class="first-para"&gt;Consider an inventory database that contains part numbers,  warehouse locations, quantity on hand, and other information. However, it does  not contain source information (where the parts came from). If a part is  supplied by multiple suppliers, once the parts are received and put on the shelf  there is no indication of which supplier the parts came from. The information in  the database is always accurate and current. For normal inventory transactions  and decision making, the database is certainly of high quality.&lt;/p&gt; &lt;p class="last-para"&gt;If a supplier reports that one of their shipments contained  defective parts, this database is of no help in identifying whether they have  any of those parts or not. The database is of poor quality because it does not  contain a relevant element of information. Without that information, the  database is poor data quality for the intended use.&lt;/p&gt;&lt;/div&gt; &lt;div class="section"&gt; &lt;h3 class="sect3-title"&gt;&lt;a name="66"&gt;&lt;/a&gt;&lt;a name="ch02lev2sec4"&gt;&lt;/a&gt;Case 4:  Completeness&lt;/h3&gt; &lt;p class="first-para"&gt;&lt;a name="67"&gt;&lt;/a&gt;&lt;a name="page26"&gt;&lt;/a&gt;A database contains  information on repairs done to capital equipment. However, it is a known fact  that sometimes the repairs are done and the information about the repair is just  not entered into the database. This is the result of lack of concern on the part  of the repair people and a lack of enforcement on the part of their supervisors.  It is estimated that the amount of missing information is about 5%.&lt;/p&gt; &lt;p class="para"&gt;This database is probably a good-quality database for assessing  the general health of capital equipment. Equipment that required a great deal of  expense to maintain can be identified from the data. Unless the missing data is  disproportionately skewed, the records are usable for all ordinary  decisions.&lt;/p&gt; &lt;p class="last-para"&gt;However, trying to use it as a base for evaluating  information makes it a low-quality database. The missing transactions could  easily tag an important piece of equipment as satisfying a warranty when in fact  it does not.&lt;/p&gt;&lt;/div&gt; &lt;div class="section"&gt; &lt;h3 class="sect3-title"&gt;&lt;a name="68"&gt;&lt;/a&gt;&lt;a name="ch02lev2sec5"&gt;&lt;/a&gt;Case 5:  Understood&lt;/h3&gt; &lt;p class="first-para"&gt;Consider a database containing orders from customers. A  practice for handling complaints and returns is to create an "adjustment" order  for backing out the original order and then writing a new order for the  corrected information if applicable. This procedure assigns new order numbers to  the adjustment and replacement orders.&lt;/p&gt; &lt;p class="last-para"&gt;For the accounting department, this is a high-quality  database. All of the numbers come out in the wash. For a business analyst trying  to determine trends in growth of orders by region, this is a poor-quality  database. If the business analyst assumes that each order number represents a  distinct order, his analysis will be all wrong. Someone needs to explain the  practice and the methods necessary to unravel the data to get to the real  numbers (if that is even possible after the fact).&lt;/p&gt;&lt;/div&gt; &lt;div class="section"&gt; &lt;h3 class="sect3-title"&gt;&lt;a name="69"&gt;&lt;/a&gt;&lt;a name="ch02lev2sec6"&gt;&lt;/a&gt;Case 6:  Trusted&lt;/h3&gt; &lt;p class="first-para"&gt;A new application is deployed that is used to determine the  amount and timing of ordering parts for machinery based on past history and the  time in service since last replacement for the machines they are used in. The  original application had a programming error that incorrectly ordered 10 times  the amount actually required. The error went undisclosed until a large order was  sent. A great deal of publicity ensued over the incident. The programming error  was fixed and the problem does not repeat.&lt;/p&gt; &lt;p class="para"&gt;The database was never wrong; the application was. The large order  was actually placed and the database reflected the order as such.&lt;/p&gt; &lt;p class="para"&gt;&lt;a name="70"&gt;&lt;/a&gt;&lt;a name="page27"&gt;&lt;/a&gt;Because of a fear of a repeat of  the incident, the maintenance chief has chosen not to use the application nor  the information within the database. He orders parts based on a small  spreadsheet application he built to keep much of the same information, even  though he often misses transactions and does not always know when new parts  arrive in inventory.&lt;/p&gt; &lt;p class="last-para"&gt;Unless his confidence in the original application is  restored, the database is of poor quality, even though it is entirely accurate.  It is not serving its intended use due to a lack of believability.&lt;/p&gt;&lt;p class="last-para"&gt; &lt;/p&gt;&lt;h2 class="first-section-title"&gt;Principle of Unintended Uses&lt;/h2&gt; &lt;p class="first-para"&gt;The previously cited examples demonstrate that you cannot  separate data from uses. To assess the quality of data, you must first collect a  thorough specification of the intended uses and then judge the data as to its  suitability for those uses. In a perfect world, database builders would gather  all requirements and then craft a database design and applications to match  them.&lt;/p&gt; &lt;p class="para"&gt;In the real world there is a serious problem when dealing with  "unintended uses." These are uses that were not known or defined at the time the  databases were designed and implemented.&lt;/p&gt; &lt;p class="para"&gt;Unintended uses arise for a large variety of reasons. Some  examples follow:&lt;/p&gt; &lt;ul class="itemizedlist"&gt;&lt;li class="first-listitem"&gt; &lt;p class="first-para"&gt;The company expands to new markets.&lt;/p&gt; &lt;/li&gt;&lt;li class="listitem"&gt; &lt;p class="first-para"&gt;The company purchases another company and consolidates  applications.&lt;/p&gt; &lt;/li&gt;&lt;li class="listitem"&gt; &lt;p class="first-para"&gt;External requirements are received, such as a new tax  law.&lt;/p&gt; &lt;/li&gt;&lt;li class="listitem"&gt; &lt;p class="first-para"&gt;The company grows its usage, particularly in decision  making.&lt;/p&gt; &lt;/li&gt;&lt;li class="listitem"&gt; &lt;p class="first-para"&gt;The company migrates to a new, packaged application that has  different needs.&lt;/p&gt;&lt;/li&gt;&lt;/ul&gt; &lt;p class="para"&gt;This represents the single biggest problem with databases.  Unintended uses proliferate at ever-expanding rates. You cannot anticipate all  uses for a database when initially building it unless it is a database with  unimportant content. In the real world you can expect (and in fact depend on) a  large number of unintended uses appearing with surprising regularity. Each of  these can cause a good database to become a bad database. Two things are needed:  anticipation in database design and flexibility in implementations.&lt;/p&gt; &lt;div class="section"&gt; &lt;h3 class="sect3-title"&gt;&lt;a name="72"&gt;&lt;/a&gt;&lt;a name="ch02lev2sec7"&gt;&lt;/a&gt;Need for  Anticipation in Database Design&lt;/h3&gt; &lt;p class="first-para"&gt;Database designers need to be schooled in the principles of  data quality. By doing so, they will be able to avoid some data quality problems  from occurring &lt;a name="73"&gt;&lt;/a&gt;&lt;a name="page28"&gt;&lt;/a&gt;when unintended uses appear. In  the least, they should be schooled to be diligent in the careful and thorough  documentation of the content. This means that metadata repositories should be  more prevalent, more used, and more valued by information systems groups.&lt;/p&gt; &lt;p class="para"&gt;Anticipation also includes questioning each database design  decision in light of what might appear in the future. For example, name and  address fields should anticipate the company's growth into markets in other  countries where the structure and form of elements may vary. Another example is  anticipating sales amounts in multiple national currencies.&lt;/p&gt; &lt;p class="para"&gt;Database and application designers should be discouraged from  using confusing and complicated data encoding methods. Many bad techniques  proliferated in the past that were the result of the limited capacity and slow  speed of storage devices. These are no longer excuses for making data structures  and encoding schemes overly complicated.&lt;/p&gt; &lt;p class="last-para"&gt;A good database design is one that is resilient in the face  of unintended uses. This principle and the techniques to achieve it must be  taught to the newer generations of information system designers.&lt;/p&gt;&lt;/div&gt; &lt;div class="section"&gt; &lt;h3 class="sect3-title"&gt;&lt;a name="74"&gt;&lt;/a&gt;&lt;a name="ch02lev2sec8"&gt;&lt;/a&gt;Need for  Flexibility in Implementations&lt;/h3&gt; &lt;p class="first-para"&gt;We know that changes will be made to our systems.  Corporations always change. They change so much that keeping up with the changes  is a major headache for any CIO. Unintended uses is one of the reasons for  change. When a new use appears, its requirements need to be collected and  analyzed against the data they intend to use. This needs to be done up front and  thoroughly. If the database is not up to the task of the new uses, either the  new use needs to be changed or discarded, or the database and its  data-generating applications must be upgraded to satisfy the requirements. This  analysis concept needs to be incorporated into all new uses of data.&lt;/p&gt; &lt;p class="para"&gt;It is amazing how many companies do not think this way. Too often  the data is force fit into the new use with poor results. How many times have  you heard about a data warehouse project that completed but yielded nonsense  results from queries? This generally is the result of force fitting data to a  design it just does not match.&lt;/p&gt; &lt;p class="para"&gt;Database systems are better able to accept changes if they are  designed with flexibility in the first place. Relational-based systems tend to  be more flexible than older database technologies. Systems with thorough,  complete, and current metadata will be much easier to change than those lacking  metadata.&lt;/p&gt; &lt;p class="para"&gt;Most information system environment do a very poor job of creating  and maintaining metadata repositories. Part of the blame goes to the repository  vendors who have built insufficient systems. Part goes to practitioners who &lt;a name="75"&gt;&lt;/a&gt;&lt;a name="page29"&gt;&lt;/a&gt;fail to take the time to use the metadata systems  that are there. Part goes to lack of education and awareness of how important  these things really are.&lt;/p&gt; &lt;table class="note" border="0" cellpadding="0" cellspacing="0"&gt; &lt;tbody&gt; &lt;tr&gt; &lt;td class="admon-check" valign="top"&gt;&lt;br /&gt;&lt;/td&gt;  &lt;/tr&gt;&lt;/tbody&gt;&lt;/table&gt; &lt;h2 class="first-section-title"&gt;Data Accuracy Defined&lt;/h2&gt; &lt;p class="first-para"&gt;Data accuracy is one of the components of data quality. It  refers to whether the data values stored for an object are the correct values.  To be correct, a data values must be the right value and must be represented in  a consistent and unambiguous form.&lt;/p&gt; &lt;p class="para"&gt;For example, my birth date is December 13, 1941. If a personnel  database has a BIRTH_DATE data element that expects dates in USA format, a date  of 12/13/1941 would be correct. A date of 12/14/1941 would be inaccurate because  it is the wrong value. A date of 13/12/1941 would be wrong because it is a  European representation instead of a USA representation.&lt;/p&gt; &lt;p class="para"&gt;There are two characteristics of accuracy: form and content. Form  is important because it eliminates ambiguities about the content. The birth date  example is ambiguous because the reviewer would not know whether the date was  invalid or just erroneously represented. In the case of a date such as 5  February, 1944, the USA representation is 02/05/1944, whereas the European  representation is 05/02/1944. You cannot tell the representation from the value  and thus need discipline in creating the date values in order to be accurate. A  value is not accurate if the user of the value cannot tell what it is.&lt;/p&gt; &lt;div class="section"&gt; &lt;h3 class="sect3-title"&gt;&lt;a name="77"&gt;&lt;/a&gt;&lt;a name="ch02lev2sec9"&gt;&lt;/a&gt;Value  Representation Consistency&lt;/h3&gt; &lt;p class="first-para"&gt;Two values can be both correct and unambiguous and still  cause problems. For example, the data values &lt;i class="emphasis"&gt;ST Louis&lt;/i&gt; and  &lt;i class="emphasis"&gt;Saint Louis&lt;/i&gt; may both refer to the same city. However, the  recordings are inconsistent, and thus at least one of them is inaccurate.&lt;/p&gt; &lt;p class="last-para"&gt;Why is consistency a part of accuracy? Although the values  are different, anyone looking at them would have no trouble interpreting what  the values mean. The answer is that inconsistent values cannot be accurately  aggregated and compared. Since much of database usage involves comparisons and  aggregations, inconsistencies create an opportunity for inaccurate usage of the  data. Because all databases have a base level of unintended uses through &lt;a name="78"&gt;&lt;/a&gt;&lt;a name="page30"&gt;&lt;/a&gt;the opportunity for ad hoc queries, data is  inaccurate if it is inconsistent in representation.&lt;/p&gt;&lt;/div&gt; &lt;div class="section"&gt; &lt;h3 class="sect3-title"&gt;&lt;a name="79"&gt;&lt;/a&gt;&lt;a name="ch02lev2sec10"&gt;&lt;/a&gt;Change-Induced  Inconsistencies&lt;/h3&gt; &lt;p class="first-para"&gt;Inconsistencies can also be caused by system changes that  change the way information is recorded or that change the granularity of  recording. Following is an example of how change in granularity might introduce  inconsistencies: A company has a color field that only records red, blue, and  yellow. A new requirement makes them decide to break each of these colors down  to multiple shadings and thus institute a scheme of recording up to 30 different  colors, all of which are variations of red, blue, and yellow. None of the old  records are updated to the new scheme, as only new records use it. This database  will have inconsistency of representation of color that crosses a point in  time.&lt;/p&gt; &lt;p class="last-para"&gt;These types of changes are often not documented. The point in  time of the changeover may also be fuzzy, wherein not everyone switches to the  new scheme at the same time. If data mining and business intelligence software  is used on this database for periods of time that cross the change, results will  be inaccurate.&lt;/p&gt;&lt;/div&gt; &lt;div class="section"&gt; &lt;h3 class="sect3-title"&gt;&lt;a name="80"&gt;&lt;/a&gt;&lt;a name="ch02lev2sec11"&gt;&lt;/a&gt;Valid Values&lt;/h3&gt; &lt;p class="first-para"&gt;The definition of a value being valid means simply that the  value is in the collection of possible accurate values and is represented in an  unambiguous and consistent way. It means that the value has the potential to be  accurate. It does not mean that it is accurate. To be accurate, it must also be  the correct value. Defining all values that are valid for a data element is  useful because it allows invalid values to be easily spotted and rejected from  the database. However, we often mistakenly think values are accurate because  they are valid.&lt;/p&gt; &lt;p class="last-para"&gt;For example, if a data element is used to store the color of  a person's eyes, a value of TRUCK is invalid. A value of BROWN for my eye color  would be valid but inaccurate, in that my real eye color is blue.&lt;/p&gt;&lt;/div&gt; &lt;div class="section"&gt; &lt;h3 class="sect3-title"&gt;&lt;a name="81"&gt;&lt;/a&gt;&lt;a name="ch02lev2sec12"&gt;&lt;/a&gt;Missing  Values&lt;/h3&gt; &lt;p class="first-para"&gt;A data element that has no value in it may be either  accurate or inaccurate. For example, a missing value in the data element  COLLEGE_LAST_ATTENDED would be blank if the person it applied to had never  attended college. On the other hand, a BIRTH_DATE value left blank would not be  accurate because all of us have birth dates.&lt;/p&gt; &lt;p class="para"&gt;&lt;a name="82"&gt;&lt;/a&gt;&lt;a name="page31"&gt;&lt;/a&gt;A blank for  COLLEGE_LAST_ATTENDED may be accurate or inaccurate. If the person it applied to  had attended college, it would be inaccurate. This is another case of valid but  not accurate.&lt;/p&gt; &lt;p class="para"&gt;Missing values are very problematic in a database because of this  characteristic. To be accurate in recording information, an optional data  element should allow encoding a value for NULL. This is properly done by  creating a separate field that indicates whether the value in the first field is  missing or not. In this instance, the value in the primary field would be set to  blank if no college had been attended. If the recorder just did not know the  answer to the question, the NULL field would be set to indicate YES, meaning  that the blank in the primary field did not apply.&lt;/p&gt; &lt;p class="para"&gt;Unfortunately, few systems are built to allow for this  distinction. Even if they were, most data entry people would not get it right  all of the time anyway.&lt;/p&gt; &lt;p class="para"&gt;Sometimes a system is created that supports a keyword that means  empty. In the previous example, it might be "never attended college." This is a  bad practice because it can make queries that do COUNT, GROUPBY, and other  commands difficult or impossible to formulate correctly. Thus, it leads to  inaccurate query results.&lt;/p&gt; &lt;p class="last-para"&gt;Poor design within a system for not making the distinction  between "No value is OK" and "I don't know the correct value" is a common source  of poor data quality.&lt;/p&gt;&lt;/div&gt; &lt;div class="section"&gt; &lt;h3 class="sect3-title"&gt;&lt;a name="83"&gt;&lt;/a&gt;&lt;a name="ch02lev2sec13"&gt;&lt;/a&gt;Object-Level  Accuracy&lt;/h3&gt; &lt;p class="first-para"&gt;The concept of accuracy also applies above the data element  level. Data elements are never recorded in isolation. They are value attributes  of business objects such as personnel records, orders, invoices, payments, and  inventory records. The business objects represent real-world objects or events,  and each consists of one or more rows of one or more tables connected through  keys. Object-level inaccuracies consist of objects that are missing, have  missing parts, or that exist but should not.&lt;/p&gt; &lt;p class="para"&gt;For example, if a repair is done to a piece of capital equipment  and the repair person just failed to complete and submit a record of the action,  there would be a missing row or rows in the database. It is just not there. Any  decisions made from the database can be flawed if the missing information would  affect the outcome. Missing objects are very difficult to detect. This is an  important component of accuracy because a database may have nothing but accurate  data but be an inaccurate database because of missing information.&lt;/p&gt; &lt;p class="para"&gt;An example of missing elements might be found for an entity  consisting of a master record and one or more subrecords. For example, a master  record may consist of a record describing a rental object, and the subrecords  are &lt;a name="84"&gt;&lt;/a&gt;&lt;a name="page32"&gt;&lt;/a&gt;instances of renting. If one of the  rentals is done without the record being added to the database, the database  still looks accurate, but in fact is not.&lt;/p&gt; &lt;p class="para"&gt;Missing subrecords are very difficult to detect, whereas missing  master records are generally easier to detect. For example, if the rental auto  is sold and the master record deleted but not the subrecords, detecting the  problem is easy.&lt;/p&gt; &lt;p class="last-para"&gt;An example of objects being present that should not be is a  personnel database containing a record for an employee who left the company. The  record was just not deleted when it should have been. These are also generally  difficult to detect.&lt;/p&gt;&lt;/div&gt; &lt;div class="section"&gt; &lt;h3 class="sect3-title"&gt;&lt;a name="85"&gt;&lt;/a&gt;&lt;a name="ch02lev2sec14"&gt;&lt;/a&gt;Object-Level  Inconsistencies&lt;/h3&gt; &lt;p class="first-para"&gt;The concept of consistency is also found at the object  level. Database are not static. Changes are being made all of the time. Large  databases generally have data flowing into them from many different sources. If  you have groups creating data with a different criterion for when to add an  object (insert) or to remove an object (delete), you can end up with object  inconsistency. This means that if you want to use the database to get  information about object counts or data aggregations, you have the potential to  get inaccurate results.&lt;/p&gt; &lt;p class="para"&gt;To guard against this, companies need to define birth and death  rules for data objects and try their best to enforce them. They also need to  define a method of determining when all of the data is consistent over some time  period so that the data can be used intelligently.  charts the scenarios  discussed in this section in terms of what constitutes accurate versus  inaccurate data.&lt;/p&gt;&lt;p class="para"&gt; &lt;/p&gt;&lt;h2 class="first-section-title"&gt;Distribution of Inaccurate Data&lt;/h2&gt; &lt;p class="first-para"&gt;The presence of wrong values will generally not be  distributed evenly throughout the database. The reasons for this are as  follows:&lt;/p&gt; &lt;ul class="itemizedlist"&gt;&lt;li class="first-listitem"&gt; &lt;p class="first-para"&gt;Some data is more important than other data.&lt;/p&gt; &lt;/li&gt;&lt;li class="listitem"&gt; &lt;p class="first-para"&gt;Some inaccurate data tends to get recognized and fixed when  used.&lt;/p&gt; &lt;/li&gt;&lt;li class="listitem"&gt; &lt;p class="first-para"&gt;How an element of data is used will affect the chances of  inaccuracies being recognized.&lt;/p&gt; &lt;/li&gt;&lt;li class="listitem"&gt; &lt;p class="first-para"&gt;Flaws in data acquisition processes are not equal for all  elements.&lt;/p&gt;&lt;/li&gt;&lt;/ul&gt; &lt;p class="para"&gt;In every database there are data elements that are more important  for an application than others. For example, in an orders database, the order  number &lt;a name="88"&gt;&lt;/a&gt;&lt;a name="page33"&gt;&lt;/a&gt;and customer number are more important  than the order date. If the customer number is wrong, it will get recognized  very early and get fixed. If the order date is wrong, it may never get  recognized and fixed. In an HR (human resources) database, the employee's Social  Security number is more important than the last education level achieved. A  Social Security number error will get recognized and fixed very early, whereas  the educational level achieved will probably never get recognized nor fixed. If  a large number of errors occur on a frequent basis in important fields, a major  issue erupts and the source of the errors is found and fixed.&lt;/p&gt; &lt;p class="para"&gt;Another factor is how a data element is used. A field that is used  in computing an amount or for updating inventory levels is more important than  one that is merely descriptive and is only printed on reports. Computations and  aggregation fields will generally precipitate visual clues to errors, whereas  fields not used for these purposes will generally not be recognized for their  errors.&lt;/p&gt; &lt;p class="para"&gt;The tendency for data elements that are more important to be more  accurate is why quality problems rarely surface through the initiating  transaction applications. The major problems with fields important to the users  of that application have already been recognized, and corrective action has been  taken to ensure that they are of sufficient accuracy to satisfy their  requirements.&lt;/p&gt; &lt;p class="para"&gt;The data inaccuracy problem surfaces when this data is moved and  used for decision making. Many of the data elements used only to record  secondary information about the transaction now become much more important. For  example, trying to correlate promotions to educational levels requires that the  "education level achieved" field be very accurate. This new use has a higher  demand on this data element than the demands made from the HR application.&lt;/p&gt; &lt;p class="para"&gt;This is a major reason data suddenly appears to be awful even  though the transaction applications have been running for years with no  complaints. The new uses of the data place higher requirements for accuracy on  some of the data elements than the transaction applications did.&lt;/p&gt; &lt;p class="last-para"&gt;&lt;a name="89"&gt;&lt;/a&gt;&lt;a name="page34"&gt;&lt;/a&gt;Unfortunately, another  dynamic comes into play regarding the chances of getting improvements made. The  only way the data will come up to the level needed by the newer uses is for  fundamental changes to occur all the way back to the transaction level. And yet,  the farther away you get from the initiating application, the more difficult it  is to get changes made. The people who own the data are satisfied with the  quality and place low priority on complaints from decision support analysts.  This situation screams out for data stewards and data quality controls.&lt;/p&gt;&lt;p class="last-para"&gt; &lt;/p&gt;&lt;h2 class="first-section-title"&gt;Can Total Accuracy Be Achieved?&lt;/h2&gt; &lt;p class="first-para"&gt;The short answer is no. There will always be some amount of  data in any database that is inaccurate. There may be no data that is invalid.  However, as we have seen, being valid is not the same thing as being  accurate.&lt;/p&gt; &lt;p class="para"&gt;Data accuracy is much like air quality. You can never hope to get  to 100% pure air quality within an area where people live and work. It is just  not possible. However, most people can distinguish between good air quality and  poor air quality, even though both have some level of imperfections. People  value higher-quality air over lower-quality air; and they know the  difference.&lt;/p&gt; &lt;p class="para"&gt;Data accuracy is the same thing. Improvements in the accuracy of  data can change the perception of poor data quality to good data quality in most  databases even though inaccuracies persist. It is a rare application that  demands 100% accurate data to satisfy its requirements.&lt;/p&gt; &lt;p class="para"&gt;A database that has a 5% inaccurate data element rate will  probably be very troublesome to most users. The same database at a 0.5%  inaccurate rate would probably be very useful and considered high quality.&lt;/p&gt; &lt;p class="para"&gt;Another important concept to understand is that data inaccuracies  arise for a variety of reasons. Some of these are&lt;/p&gt; &lt;ul class="itemizedlist"&gt;&lt;li class="first-listitem"&gt; &lt;p class="first-para"&gt;wrong values entered&lt;/p&gt; &lt;/li&gt;&lt;li class="listitem"&gt; &lt;p class="first-para"&gt;data entry people who do not care to do the job right&lt;/p&gt; &lt;/li&gt;&lt;li class="listitem"&gt; &lt;p class="first-para"&gt;confusing and contradictory data entry screens or forms&lt;/p&gt; &lt;/li&gt;&lt;li class="listitem"&gt; &lt;p class="first-para"&gt;procedures that allow for data to not be entered or not be  entered on time&lt;/p&gt; &lt;/li&gt;&lt;li class="listitem"&gt; &lt;p class="first-para"&gt;procedures or policies that promote entering wrong  values&lt;/p&gt; &lt;/li&gt;&lt;li class="listitem"&gt; &lt;p class="first-para"&gt;poorly defined database systems&lt;/p&gt;&lt;/li&gt;&lt;/ul&gt; &lt;p class="para"&gt;If you can identify and correct all of the sources except the  first one, you can get very high levels of data accuracy. You are left with only  the case where I meant "blue" but entered "black". Data entry technology and  best practices exist that can minimize the amount of these types of errors as  well.&lt;/p&gt; &lt;p class="para"&gt;&lt;a name="91"&gt;&lt;/a&gt;&lt;a name="page35"&gt;&lt;/a&gt;In almost all cases where poor  data quality is reported, no effort has been made to identify root causes of  wrong values. Without finding root causes, improvements in the quality are not  going to occur. Whenever effort is spent to identify root causes and correct  them, improvements follow. The improvements are almost always noticeable and  impressive.&lt;/p&gt; &lt;p class="para"&gt;All other reasons tend to cause a clustering of data inaccuracies  around the faulty process. These are easier to find and correct than the random  errors that occur just because people enter data mistakenly. If all we had left  were the random errors of people, the errors would be more evenly distributed  throughout the database, would be small in number, and would have minimal impact  on the uses of data.&lt;/p&gt; &lt;p class="last-para"&gt;So, the long answer is yes. You can get accurate data to a  degree that makes it highly useful for all intended requirements.&lt;/p&gt;&lt;p class="last-para"&gt; &lt;/p&gt;&lt;h2 class="first-section-title"&gt;Finding Inaccurate Values&lt;/h2&gt; &lt;p class="first-para"&gt;Some of the inaccurate values can be found, but probably not  all of them. In most cases a high percentage of inaccurate values can be found  if enough effort is spent searching for them.&lt;/p&gt; &lt;p class="para"&gt;&lt;a class="internaljump" href="#ch02fig02" role="chapterjump"&gt;&lt;/a&gt; the general potential of any collection of data. In any collection  there will be some data that is accurate and some that is not. This is shown by  the first column. Of the data that is not accurate, some of it is valid and some  of it is invalid.&lt;/p&gt; &lt;div class="figure"&gt;&lt;span class="figure-title"&gt;&lt;span class="figure-titlelabel"&gt;&lt;/span&gt;&lt;br /&gt;&lt;/span&gt;  &lt;/div&gt; &lt;p class="para"&gt;There are two ways you can go about finding inaccurate data:  reverification and analysis. If you want to be 100% sure that all values are  accurate, you must use reverification. Even then, you cannot achieve this for  many types of &lt;a name="94"&gt;&lt;/a&gt;&lt;a name="page36"&gt;&lt;/a&gt;data. The reverification process  may not be possible for some data. Reverification is done by people, and they  may make mistakes in the reverification process.&lt;/p&gt; &lt;p class="para"&gt;The maximum potential of analytical techniques also falls short of  perfect even if you employ all analytical techniques available.&lt;/p&gt; &lt;div class="section"&gt; &lt;h3 class="sect3-title"&gt;&lt;a name="95"&gt;&lt;/a&gt;&lt;a name="ch02lev2sec15"&gt;&lt;/a&gt;Reverification&lt;/h3&gt; &lt;p class="first-para"&gt;This means that you manually go back to the original source  of the information and check every value. It is the only sure way to determine  what values are right and what values are wrong. The reason for this is that  analytical techniques cannot tell if a valid value is accurate unless there is a  secondary source of information to correlate the value against.&lt;/p&gt; &lt;p class="para"&gt;Reverification is not always possible. For example, if the data  records temperatures as of a point in time, it is impossible to go back in time  and measure them again.&lt;/p&gt; &lt;p class="para"&gt;Reverification is also susceptible to errors. You would generally  think that verifiers would be more careful than original entry people. However,  this may not be true. In addition, the source of information that provided wrong  information initially may continue to provide wrong information in the  reverification. If someone lies about their age initially, they will probably  lie about it in the reverification.&lt;/p&gt; &lt;p class="para"&gt;You would expect that a reverification would improve the quality  of the data by some amount. It would catch more errors than it would create.  However, you can never be sure that it catches them all.&lt;/p&gt; &lt;p class="para"&gt;Reverification takes a long time. If it were implemented as an  operational techniques to ensure the quality of the data before use, it would  certainly violate the timeliness requirement for almost all uses.&lt;/p&gt; &lt;p class="para"&gt;The primary problem with reverification is that it is impractical  for most databases. It is enormously time consuming and expensive.  Reverification on databases that are continuously changing is extremely  difficult if not downright impossible. For example, a web site that takes 10,000  orders per day is not likely to seek reverification of the information on every  order. Companies are not willing or able to use this technique widely.&lt;/p&gt; &lt;p class="para"&gt;There are isolated cases for which reverification is practical. On  small but important databases, it is not only possible but sometimes  appropriate. For example, a database on capital equipment expenditures may use  this technique. Another case is to use reverification on only part of the  objects in a database. For example, an industrial supplier may verify order  information on all orders over a certain amount of value or a certain amount of  product.&lt;/p&gt; &lt;p class="last-para"&gt;Selective reverification may also be a technique used in  monitoring a database. Random selection of transactions or records for  reverification can be a &lt;a name="96"&gt;&lt;/a&gt;&lt;a name="page37"&gt;&lt;/a&gt;good quality  assessment tool either to spot developing problems or to instill confidence  among users of the data. This improves the trust dimension of the quality of the  data.&lt;/p&gt;&lt;/div&gt; &lt;div class="section"&gt; &lt;h3 class="sect3-title"&gt;&lt;a name="97"&gt;&lt;/a&gt;&lt;a name="ch02lev2sec16"&gt;&lt;/a&gt;Analytical  Techniques&lt;/h3&gt; &lt;p class="first-para"&gt;Analytical techniques involve the use of software in  conjunction with the skills of a data or business analyst to search through  databases to find the presence of inaccurate data. Analytical techniques can be  used against transactions as they are occurring, against databases as they are  changing, or against databases on a periodic basis.&lt;/p&gt; &lt;p class="para"&gt;There are five categories of analysis that can be applied to data:  data element analysis, structural analysis, value correlation, aggregation  correlation, and value inspection. These are individually explored in Chapters 8  through 12. Each of these contributes to finding the presence of inaccurate  data. However, not all of them can pinpoint the actual wrong values. The  sections that follow briefly describe these types of analysis.&lt;/p&gt; &lt;div class="section"&gt; &lt;h4 class="sect4-title"&gt;Element Analysis&lt;/h4&gt; &lt;p class="first-para"&gt;Element analysis involves looking at individual values in  isolation to determine if they are valid. To do this you need a definition of  what is valid and what is not. For example, a Social Security number must  consist of nine numeric digits. If this is your only definition, you will find  that all values that are blank, contain characters other than numeric or contain  less than or more than nine digits. However, you can go further in your  definition. The government employs a scheme of assigning numbers that allows you  to examine the value in more detail to determine if it is valid or not. Using  the larger rule has the potential for finding more inaccurate values.&lt;/p&gt; &lt;p class="para"&gt;Element analysis can include examination based on data type,  length, range of values, list of discrete values allowed, patterns allowed, and  any other information that is helpful. The more definition you provide, the more  potential you have for finding invalid values.&lt;/p&gt; &lt;p class="last-para"&gt;A simplified but often overlooked technique for element  analysis is visual inspection. This method involves looking at values to see if  they belong to the data element or not. For example, you could not write a  programmatic rule to find invalid names for names that contain valid characters.  However, a human could look at the list and find wrong or questionable entries  because of the superior human capability to deal with semantics and context. For  example, names such as &lt;i class="emphasis"&gt;Donald Duck, xwdcgex, Don't know&lt;/i&gt;,  and &lt;i class="emphasis"&gt;Wouldn't tell me&lt;/i&gt; would be spotted by the analyst and  ruled to be invalid. You could not write a comprehensive enough rule to find  these through a program.&lt;/p&gt;&lt;/div&gt; &lt;div class="section"&gt; &lt;h4 class="sect4-title"&gt;Structural Analysis&lt;/h4&gt; &lt;p class="first-para"&gt;&lt;a name="98"&gt;&lt;/a&gt;&lt;a name="page38"&gt;&lt;/a&gt;Structural analysis  techniques involve checking columns of values for uniqueness or consecutiveness,  checking for orphans on collections of records with connecting parts, and  checking for circular relationships. For example, no two employees can have the  same employee number. Each employee record must have a corresponding emergency  contact record, and each emergency contact record must belong to a valid  employee record. An example of circular relationships would be that a part  cannot be used in the construction of itself.&lt;/p&gt; &lt;p class="last-para"&gt;Structural analysis techniques are used on collections of  records. They can identify clear inaccuracies in the data. They generally  isolate the error to a small number of records. They do not identify the  offending values. Reverification generally is required to pinpoint the actual  offender.&lt;/p&gt;&lt;/div&gt; &lt;div class="section"&gt; &lt;h4 class="sect4-title"&gt;Value Correlation&lt;/h4&gt; &lt;p class="first-para"&gt;Value correlation analysis involves checking collections of  values against a rule that must hold true over the data. For example, if an  EMPLOYEE_TYPE field indicates a part-time employee, the PAY_TYPE field must be  hourly. This is a rule that applies to a subset of the records in the database.  If it is violate, an inaccuracy exists in the data. However, you cannot tell  which value is wrong, EMPLOYEE_TYPE or PAY_TYPE. You just know that one of them  is wrong.&lt;/p&gt; &lt;p class="para"&gt;Value correlation can be very simple or very complex. It may  involve only values in a single row, values in multiple rows of the same  relation, or values that cross over rows of multiple tables.&lt;/p&gt; &lt;p class="last-para"&gt;There is generally a large number of rules that can be  written for cross-checking values to ensure that the set of values represents a  valid combination. When a rule is violated, you cannot tell which value is the  offender unless the rule involves only one data element. In addition, rules can  be satisfied with completely inaccurate data. In the previous example, if the  employee record showed EMPLOYEE_TYPE as part-time and PAY_TYPE as hourly, it  would be satisfied even though it is possible the EMPLOYEE_TYPE field is  inaccurate (i.e., it is really a full-time employee). In fact, both  EMPLOYEE_TYPE and PAY_TYPE can satisfy the rule, even though both are  inaccurate.&lt;/p&gt;&lt;/div&gt; &lt;div class="section"&gt; &lt;h4 class="sect4-title"&gt;Aggregation Correlation&lt;/h4&gt; &lt;p class="first-para"&gt;Aggregation correlation analysis is used to identify the  presence of inaccuracies through examining aggregated values over large amounts  of data. For example, a rule may say that the count of orders by region for a  week should be greater than 100 and less than 1,000. Violations would indicate  that either data is missing or orders have the wrong date on them.&lt;/p&gt;&lt;/div&gt; &lt;div class="section"&gt; &lt;h4 class="sect4-title"&gt;Value Inspection&lt;/h4&gt; &lt;p class="first-para"&gt;&lt;a name="99"&gt;&lt;/a&gt;&lt;a name="page39"&gt;&lt;/a&gt;Values can often be  computed that allow the analyst to inspect them and determine if they are  reasonable or not. Visual inspection differs from aggregation correlation in  that it applies to cases where it is not possible to create a clear rule that  defines the boundary between right and wrong. Small amounts of inaccurate data  may not provide a clear indicator. However, clearly unreasonable data will jump  out at the analyst.&lt;/p&gt; &lt;p class="para"&gt;For example, the distribution of values within a column may  indicate that a problem exists. If 60% of the records indicate the same color of  product even though you know that the colors should be approximately evenly  divided over a set of 10 colors, the data is clearly inaccurate.&lt;/p&gt; &lt;p class="para"&gt;Aggregation tests can be applied to distribution of values,  counts, sums, averages, and medians. They can also be used to compare  aggregations between groups such as between sales regions.&lt;/p&gt; &lt;p class="last-para"&gt;In summary, element analysis is used to find only invalid  values. Structural analysis, value correlation, aggregation correlation, and  value inspection are used to find the presence of inaccuracies among valid  values. They cannot identify the offending values but can determine with  certainty that some of the values are wrong.&lt;/p&gt;&lt;/div&gt; &lt;div class="section"&gt; &lt;h4 class="sect4-title"&gt;Looks Like a Duck, Walks Like a Duck&lt;/h4&gt; &lt;p class="first-para"&gt;Although it is clear that you cannot find all inaccurate  data values in a database, there is clearly value in finding out what you can  detect. If a database has inaccurate data within it, it is very likely that  analytical techniques will surface enough of them to get a good picture of the  health of the data. It is highly unlikely that all or most of the inaccurate  values can hide out as valid values that successfully correlate through all  rules. You will find enough inaccurate data, provided you do a thorough enough  job of analysis.&lt;/p&gt; &lt;p class="para"&gt;You also do not need to worry about seeing only the tip of the  iceberg. In most cases, the inaccuracies disclosed through analysis represent a  significant portion of the bad values.&lt;/p&gt; &lt;p class="para"&gt;A quality improvement program will probably improve the accuracy  of data that was not discoverable through analysis along with the inaccuracies  that were discovered. This is a secondary effect. For example, if your analysis  indicates that 8% of the records in a specific database have at least one wrong  value, you implement changes targeting the areas you uncovered and then reassess  to find that the new number is 1%. The actual improvement is probably greater  than the 7% indicated because of the effect of improvements on values not  discoverable through analysis. The result is that users may feel that &lt;a name="100"&gt;&lt;/a&gt;&lt;a name="page40"&gt;&lt;/a&gt;a much bigger improvement has actually taken  place.   indicates cross-correlations in terms of the effects of improvements in the  accuracy of data values.&lt;/p&gt;&lt;p class="para"&gt; &lt;/p&gt;&lt;h2 class="first-section-title"&gt;How Important Is It to Get Close?&lt;/h2&gt; &lt;p class="first-para"&gt;You can achieve very high levels of accuracy of data within  a database if enough energy and resources are applied. Although data accuracy is  only one component of data quality, it is clearly the single most important  component. If the data values are just plain wrong, missing, or inconsistent,  any attempt to use the data will be tainted. Every quality improvement program  must begin by dealing with data accuracy.&lt;/p&gt; &lt;p class="para"&gt;Most decision support applications have a tolerance level for  inaccurate data. Inaccuracies up to the tolerance level allow the application to  provide high-quality decisions. The inaccuracies do not change the outcome from  what it would be if the data were 100% accurate, provided the data inaccuracies  are not unduly biased.&lt;/p&gt; &lt;p class="para"&gt;Above the tolerance level, the data will generate wrong decisions  but will not be noticed because the decisions are not so bad. This is a  dangerous situation because the company is acting in the wrong way to data that  they believe to be good. It leads to inefficiencies that are not noticed. At  some higher level of inaccuracies, the data becomes not believed and has no  effect on decisions because it is not used.  depicts the relationship of  usefulness and accuracy as a step function influence on tolerance levels.&lt;/p&gt; &lt;div class="figure"&gt;&lt;a name="103"&gt;&lt;/a&gt;&lt;a name="ch02fig04"&gt;&lt;/a&gt;&lt;span class="figuremediaobject"&gt;&lt;a href="images/fig2%2D4%5F0%2Ejpg" name="IMG_7" target="_parent"&gt;&lt;/a&gt;&lt;/span&gt; &lt;br /&gt;&lt;span class="figure-title"&gt;&lt;span class="figure-titlelabel"&gt;Figure 2.4: &lt;/span&gt;Step  function influence on tolerance levels.&lt;/span&gt; &lt;/div&gt; &lt;p class="para"&gt;Most business analysts have no idea how to compute the tolerance  levels for decisions they make. Because they have no idea how bad the data is,  they must believe what they see.&lt;/p&gt; &lt;p class="para"&gt;This highlights two important aspects of working to improve the  accuracy of data. First, you have a good chance of bringing the data accuracy  back to the first area (the good place to be). You do not have to get to zero  errors, you just need to get into the good zone. Second, you provide valuable  information &lt;a name="104"&gt;&lt;/a&gt;&lt;a name="page41"&gt;&lt;/a&gt;to the decision makers on the  relative accuracy of the data. If you do not have an assessment program, you  either blindly believe the data or mistrust it enough to either use it with  caution or never use it.&lt;/p&gt; &lt;p class="para"&gt;Because decision-making efficiency is a step function of data  accuracy, it follows that small improvements in the accuracy can lead to very  large payoffs in value. If the quantity of inaccuracies is putting you in the  wrong decision zone, and improvements move you into the zone of right decisions,  the difference in value to the corporation can be enormous.&lt;/p&gt; &lt;p class="last-para"&gt;If you have no data quality program, there is probably a huge  potential value in instituting one. You have no idea how much value is there  because you are blindly using the data you have and have no idea how bad it is.  The cost of lowering the percentage of inaccurate data will almost always pay  off big for early improvements. As you get closer to zero errors, the cost will  prove to be excessive in comparison to gain. However, you can get very close  before the crossover occurs.&lt;/p&gt;&lt;p class="last-para"&gt; &lt;/p&gt;&lt;h2 class="first-section-title"&gt;Closing Remarks&lt;/h2&gt; &lt;p class="first-para"&gt;&lt;a name="106"&gt;&lt;/a&gt;&lt;a name="page42"&gt;&lt;/a&gt;Data accuracy is the most  visible and dramatic dimension of data quality. It is the easiest to expose, the  easiest to make improvements in, often does not require system reengineering to  achieve improvements, and often does not require reorganization of your  corporation to accommodate it. Although you cannot get to perfect accuracy in  your data, you can improve the accuracy to the point where it consistently  provides information that drives correct decisions.&lt;/p&gt; &lt;p class="para"&gt;Data accuracy is a complex subject that needs to be fully  understood. The concepts of valid versus invalid, inconsistencies in  representation, object-level inconsistency, representation of values not known,  and missing information are all part of defining accuracy.&lt;/p&gt; &lt;p class="para"&gt;There are two methods of determining the accuracy of data:  reverification and data analysis. Neither one can guarantee finding all  inaccurate values.&lt;/p&gt; &lt;p class="last-para"&gt;Reverification is generally too expensive and slow to be  effective. Analytical techniques are easier to use. Analytical techniques  require that you understand what the definition of "correct" is for each data  element. Just as accurate data is the foundation component of data quality,  analytical techniques are the foundation component of any effective data quality  assurance program.&lt;/p&gt;&lt;/div&gt;&lt;/div&gt;&lt;/div&gt;&lt;/div&gt;&lt;/div&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/1576936343892993251-7864681230016369688?l=dataqualityaccuracy.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://dataqualityaccuracy.blogspot.com/feeds/7864681230016369688/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=1576936343892993251&amp;postID=7864681230016369688' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/1576936343892993251/posts/default/7864681230016369688'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/1576936343892993251/posts/default/7864681230016369688'/><link rel='alternate' type='text/html' href='http://dataqualityaccuracy.blogspot.com/2007/12/definition-of-accurate-data.html' title='Definition of Accurate Data'/><author><name>vijikumar</name><uri>http://www.blogger.com/profile/18263997193124066136</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-1576936343892993251.post-3954743660672572367</id><published>2007-12-13T22:36:00.000-08:00</published><updated>2007-12-13T22:43:12.249-08:00</updated><title type='text'>The Data Quality Problem</title><content type='html'>&lt;h2 class="first-section-title"&gt;&lt;span class="section-titlelabel"&gt;1 &lt;/span&gt;Data Is a Precious Resource&lt;/h2&gt; &lt;p class="first-para"&gt;Data is the fuel we use to make decisions. It records the  history of enterprise activities. It is used to drive processes of all sorts. It  is used to make important &lt;a name="18"&gt;&lt;/a&gt;&lt;a name="page4"&gt;&lt;/a&gt;decisions. We  maintain and use data as individuals. We maintain and use data as corporations,  governmental organizations, educational institutions, and virtually any other  organization.&lt;/p&gt; &lt;p class="para"&gt;Many large organizations are nothing but data processing engines.  Insurance companies, banks, financial services companies, and the IRS are all  organizations that live in a sea of data. Most of what they do is process  data.&lt;/p&gt; &lt;p class="para"&gt;Think about companies that process credit card transactions. What  products do they produce and sell? Just information products. They process tons  of data every day. Take their information systems away and there is nothing  left.&lt;/p&gt; &lt;p class="para"&gt;Other organizations may appear to be less involved with  information systems because their products or activities are not information  specific. However, looking under the covers you see that most of their  activities and decisions are driven or guided by information systems.&lt;/p&gt; &lt;p class="para"&gt;Manufacturing organizations produce and ship products. However,  data drives the processes of material acquisition, manufacturing work flow,  shipping, and billing. Most of these companies would come to a resounding halt  if their information systems stopped working. To be a profitable manufacturing  company today you need highly tuned information systems for just-in-time parts  delivery, effective purchasing systems for adjusting what you produce to  ever-changing demand, highly accurate cost accounting systems, applications for  the care and feeding of customers, and much more. Those with poor information  systems fall behind competitively, and many fall out of business.&lt;/p&gt; &lt;p class="para"&gt;The most successful companies are not always those with the best  products. Companies must recognize that they must sell what is profitable and  drop products that are not. Profitability requires knowledge of the supply  chain; knowledge of past, present, and future buying patterns; marketing costs;  and sales costs. Consolidation of data from many different systems is required  to make the right profit decisions.&lt;/p&gt; &lt;p class="para"&gt;Retail operations depend completely on information systems to keep  them profitable. They must have the latest technology for highly efficient  supply chain management. If not, their competitors will lower prices and force  the inefficient information processors out of business. Many are now moving to  customer relationship management systems in order to gain an even better  competitive position.&lt;/p&gt; &lt;p class="para"&gt;Data is becoming more precious all the time. Enterprises are using  data more and more to help them make important decisions. These can be daily,  routine decisions or long-term strategic decisions. New trends in data  warehousing, data mining, decision support, and customer relationship management  systems all highlight the ever-expanding role data plays in our  organizations.&lt;/p&gt; &lt;table class="note" border="0" cellpadding="0" cellspacing="0"&gt; &lt;tbody&gt; &lt;tr&gt; &lt;td class="admon-check" valign="top"&gt;&lt;br /&gt;&lt;/td&gt; &lt;td class="admon-title" valign="top"&gt;Note &lt;/td&gt; &lt;td class="admon-body" valign="top"&gt; &lt;p class="first-para"&gt;&lt;a name="19"&gt;&lt;/a&gt;&lt;a name="page5"&gt;&lt;/a&gt;Data gets more valuable all  the time, as additional ways are found to employ it to make our organizations  more successful.&lt;/p&gt;&lt;p class="first-para"&gt; &lt;/p&gt;&lt;h2 class="first-section-title"&gt;&lt;span class="section-titlelabel"&gt; &lt;/span&gt;Impact of Continuous Evolution of Information Systems&lt;/h2&gt; &lt;p class="first-para"&gt;From about 1950 through today, there has been a clear  evolution in the use of computer-generated data from simple historical record  keeping to ever more active roles. This trend does not show signs of slowing  down. Data is generated by more people, is used in the execution of more tasks  by more people, and is used in corporate decision making more than ever  before.&lt;/p&gt; &lt;table class="note" border="0" cellpadding="0" cellspacing="0"&gt; &lt;tbody&gt; &lt;tr&gt; &lt;td class="admon-check" valign="top"&gt;&lt;br /&gt;&lt;/td&gt; &lt;td class="admon-title" valign="top"&gt;Note &lt;/td&gt; &lt;td class="admon-body" valign="top"&gt; &lt;p class="first-para"&gt;The more technology we develop, the more users demand from  it.&lt;/p&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;/table&gt; &lt;p class="para"&gt;When we look at the past 50 years, the degree to which information  systems have played a role has been in a state of constant change. Every IT  department has had a significant amount of their resources devoted to the task  of implementing new or replacement systems. As each new system is deployed, it  is often already obsolete, as another replacement technology shows up while they  were building it. This drives them to again replace the just-finished system.  This process of "continuous evolution" has never stopped, and probably will not  for a number of years into the future.&lt;/p&gt; &lt;p class="para"&gt;The constant need to remodel systems as fast as they are developed  has been driven by enormously fast technology innovation in hardware,  communications, and software. No organization has been able to keep up with the  rapid pace of technological change. All organizations have been chasing the  technology curve in the hope of eventually reaching a stable point, where new  systems can survive for awhile. They will not reach stability for a long time in  the future, as much more technology is being born as this is written.&lt;/p&gt; &lt;p class="para"&gt;The need to change is also fueled by the rapid change in the  nature of the companies themselves. Mergers and acquisitions drive very rapid  and important changes as companies try to merge information systems. Changes in  product lines or changes in markets served drive many hastily implemented  changes into information systems. For example, the decision to "go global" can  wreak havoc on currency, date, address, and other data elements already in  place. Business change impacts are the ones that generally are done the  quickest, with the least amount of planning, and that usually derive the worst  results.&lt;/p&gt; &lt;p class="para"&gt;External changes also cause hastily implemented patches to  existing systems: tax law changes, accounting changes such as those experienced  in recent &lt;a name="21"&gt;&lt;/a&gt;&lt;a name="page6"&gt;&lt;/a&gt;years, the Y2K problem, the EURO  conversion, and on and on. This rapid evolution has meant that systems have been  developed hastily and changed aggressively. This is done with few useful  standards for development and control. The software industry has never developed  effective standards similar to those the hardware and construction industries  enjoy (through blueprints), nor does it have the luxury of time to think through  everything it does before committing to systems. The result is that many, if not  all, of our systems are very rough edged. These rough edges particularly show  through in the quality of the data and the information derived from the  data.&lt;/p&gt; &lt;p class="para"&gt;A lot of this rapid change happened in order to push information  systems into more of the tasks of the enterprise and to involve more people in  the process. The Internet promises to involve all people and all tasks in the  scope of information systems. At some time in the future, all companies will  have an information system backbone through which almost all activity will be  affected. As a result, information systems become bigger, more complex, and,  hopefully, more important every time a new technology is thrown in. The data  becomes more and more important.&lt;/p&gt; &lt;p class="para"&gt;Just about everything in organizations has been "databased." There  are personnel databases, production databases, billing and collection databases,  sales management databases, customer databases, marketing databases, supply  chain databases, accounting databases, financial management databases, and on  and on. Whenever anyone wants to know something, they instinctively run to a PC  to query a database. It is difficult to imagine that less than 25 years ago  there were no PCs and data was not collected on many of the corporate objects  and activities of today.&lt;/p&gt; &lt;p class="para"&gt;I participated in an audit for a large energy company a few years  ago that inventoried over 5,000 databases and tens of thousands of distinct data  elements in their corporate information systems. Most corporations do not know  how much data they are actually handling on a daily basis.&lt;/p&gt; &lt;p class="para"&gt;Not only has most corporation information been put into databases,  but it has been replicated into data warehouses, data marts, operational data  stores, and business objects. As new ways are discovered to use data, there is a  tendency to create duplication of the primary data in order to satisfy the new  need. The most dramatic example today is the wave of customer relationship  management (hereafter, CRM) projects proliferating throughout the IT world.&lt;/p&gt; &lt;p class="para"&gt;Replication often includes aggregating data, combining data from  multiple sources, putting data into data structures that are different from the  original structure, and adding time period information. Often the original data  cannot be recognized or found in the aggregations. As a result, errors detected  &lt;a name="22"&gt;&lt;/a&gt;&lt;a name="page7"&gt;&lt;/a&gt;in the aggregations often cannot be traced back  to primary instances of data containing the errors.&lt;/p&gt; &lt;p class="para"&gt;In addition to replicating, there are attempts to integrate the  data of multiple databases inside interactive processes. Some of this  integration includes reaching across company boundaries into databases of  suppliers, customers, and others.&lt;/p&gt; &lt;div class="figure"&gt;&lt;a name="23"&gt;&lt;/a&gt;&lt;a name="ch01fig01"&gt;&lt;/a&gt;&lt;span class="figuremediaobject"&gt;&lt;a href="images/fig1%2D1%5F0%2Ejpg" name="IMG_1" target="_parent"&gt;&lt;/a&gt;&lt;/span&gt; &lt;br /&gt;&lt;span class="figure-title"&gt;&lt;span class="figure-titlelabel"&gt;&lt;/span&gt;&lt;br /&gt;&lt;/span&gt; &lt;/div&gt; &lt;p class="para"&gt;Adding the demands of replication and integration on top of  operational systems adds greatly to the complexity of information systems and  places huge burdens on the content of the primary operational systems. Data  quality problems get magnified through all of these channels.  indicates  aspects of integration, operation, and replication.&lt;/p&gt; &lt;div class="figure"&gt;&lt;a name="24"&gt;&lt;/a&gt;&lt;a name="ch01fig02"&gt;&lt;/a&gt;&lt;span class="figuremediaobject"&gt;&lt;a href="images/fig1%2D2%5F0%2Ejpg" name="IMG_2" target="_parent"&gt;&lt;/a&gt;&lt;/span&gt; &lt;br /&gt;&lt;span class="figure-title"&gt;&lt;span class="figure-titlelabel"&gt;Figure 1.2: &lt;/span&gt;Demands on  operational databases.&lt;/span&gt; &lt;/div&gt; &lt;div class="sidebar"&gt; &lt;table class="BlueLine" border="0" cellpadding="0" cellspacing="0" width="100%"&gt; &lt;tbody&gt; &lt;tr&gt; &lt;td class="bluecell" bg style="color:#000080;"&gt;&lt;span style="font-family:Arial;font-size:85%;color:#010100;"&gt;&lt;b&gt;&lt;img alt="Start Sidebar" src="_.gif" border="0" height="2" width="1" /&gt;&lt;/b&gt;&lt;/span&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;/table&gt; &lt;p class="first-para"&gt;The claim that systems are in a state of continuous  evolution seems to be belied by the resilience of legacy systems built years ago  that seem to resist all attempts to replace them. In reality, these systems are  the worst offenders of evolution because they change all the time, at a high  cost, and usually extend themselves through replication and integration. In  fact, many new requirements can only be satisfied through replication or  integration extensions.&lt;/p&gt; &lt;p class="para"&gt;Because of the inherent inflexibility of such systems, these  extensions are much more complex and often turn out to be badly implemented.  This is a classic case of pushing the problems downhill and not addressing root  problems.&lt;/p&gt; &lt;p class="last-para"&gt;If corporations want quality systems, they will need to  eventually replace the old technologies with new ones. Retrofitting older  systems to new requirements and new standards of quality is almost impossible to  achieve with quality results.&lt;/p&gt; &lt;table class="BlueLine" border="0" cellpadding="0" cellspacing="0" width="100%"&gt; &lt;tbody&gt; &lt;tr&gt; &lt;td class="bluecell" bg style="color:#000080;"&gt;&lt;span style="font-family:Arial;font-size:85%;color:#010100;"&gt;&lt;b&gt;&lt;img alt="End Sidebar" src="_.gif" border="0" height="2" width="1" /&gt;&lt;/b&gt;&lt;/span&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;/table&gt;&lt;/div&gt; &lt;table class="BlankSpace" border="0" cellpadding="0" cellspacing="0" width="100%"&gt; &lt;tbody&gt; &lt;tr&gt; &lt;td height="16"&gt;&lt;br /&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;/table&gt; &lt;p class="para"&gt;&lt;a name="25"&gt;&lt;/a&gt;&lt;a name="page8"&gt;&lt;/a&gt;Along with the increasing  complexity of systems comes an increase in the impact of inaccurate data. In the  primary systems, a wrong value may have little or no impact. It may cause a  glitch in processing of an order, resulting in some small annoyance to fix.  However, as this wrong value is propagated to higher-level decision support  systems, it may trigger an incorrect reordering of a product or give a decision  maker the wrong information to base expanding a manufacturing line on. The  latter consequences can be much larger than the original.&lt;/p&gt; &lt;p class="last-para"&gt;Although a single wrong value is not likely to cause such  drastic results, the cumulative effect of multiple wrong values in that same  attribute can collectively deliver very wrong results. Processes that generate  wrong values rarely generate only one inaccurate instance.&lt;/p&gt;&lt;p class="last-para"&gt; &lt;/p&gt;&lt;h2 class="first-section-title"&gt;Acceptance of Inaccurate Data&lt;/h2&gt; &lt;p class="first-para"&gt;Databases have risen to the level of being one of the most,  if not the most, important corporate asset, and yet corporations tolerate  enormous inaccuracies in their databases. Their data quality is not managed as  rigorously as are most other assets and activities. Few companies have a data  quality assurance &lt;a name="27"&gt;&lt;/a&gt;&lt;a name="page9"&gt;&lt;/a&gt;program, and many that do  have such a program provide too little support to make it effective.&lt;/p&gt; &lt;p class="para"&gt;The fact of modern business is that the databases that drive them  are of poor to miserable quality, and little is being done about it.  Corporations are losing significant amounts of money and missing important  opportunities all the time because they operate on information derived from  inaccurate data. The cost of poor-quality data is estimated by some data quality  experts as being from 15 to 25% of operating profit. In a recent survey of 599  companies conducted by PricewaterhouseCoopers, an estimate of poor data  management is costing global businesses more than $1.4 billion per year in  billing, accounting, and inventory snafus alone. Much of that cost is  attributable to the accuracy component of data quality.&lt;/p&gt; &lt;p class="para"&gt;This situation is not restricted to businesses. Similar costs can  be found in governmental or educational organizations as well. Poor data quality  is sapping all organizations of money and opportunities. A fair characterization  of the state of data quality awareness and responsiveness for the typical large  organization is as follows:&lt;/p&gt; &lt;ul class="itemizedlist"&gt;&lt;li class="first-listitem"&gt; &lt;p class="first-para"&gt;They are aware of problems with data.&lt;/p&gt; &lt;/li&gt;&lt;li class="listitem"&gt; &lt;p class="first-para"&gt;They consistently underestimate, by a large amount, the  extent of the problem.&lt;/p&gt; &lt;/li&gt;&lt;li class="listitem"&gt; &lt;p class="first-para"&gt;They have no idea of the cost to the corporation of the  problem.&lt;/p&gt; &lt;/li&gt;&lt;li class="listitem"&gt; &lt;p class="first-para"&gt;They have no idea of the potential value in fixing the  problem.&lt;/p&gt;&lt;/li&gt;&lt;/ul&gt; &lt;p class="last-para"&gt;If you can get commitment to a data quality assessment  exercise, it almost always raises awareness levels very high. A typical response  is "I had no idea the problem was that large." Assessment is the key to  awareness, not reading books like this. Most people will believe that the other  guy has a larger problem than they do and assume that this book is written for  that other guy, not them. Everyone believes that the data quality problem they  have is small and much less interesting to address than other initiatives. They  are usually very wrong in their thinking. It takes data to change their  minds.&lt;/p&gt;&lt;p class="last-para"&gt; &lt;/p&gt;&lt;h2 class="first-section-title"&gt;The Blame for Poor-Quality Data&lt;/h2&gt; &lt;p class="first-para"&gt;Everyone starts out blaming IT. However, data is created by  people outside IT, and is used by people outside IT. IT is responsible for the  quality of the systems that move the data and store it. However, they cannot be  held completely responsible for the content. Much of the problem lies outside  IT, through poorly articulated requirements, poor acceptance testing of systems,  poor data creation processes, and much more.&lt;/p&gt; &lt;p class="para"&gt;&lt;a name="29"&gt;&lt;/a&gt;&lt;a name="page10"&gt;&lt;/a&gt;Data quality problems are  universal in nature. In just about any large organization the state of  information and data quality is at the same low levels.&lt;/p&gt; &lt;p class="para"&gt;The fact that data quality is universally poor indicates that it  is not the fault of individually poorly managed organizations but rather that &lt;i class="emphasis"&gt;it is the natural result of the evolution of information system  technology.&lt;/i&gt; There are two major contributing factors. The first is the rapid  system implementations and change that have made it very difficult to control  quality. The second is that the methods, standards, techniques, and tools for  controlling quality have evolved at a much slower pace than the systems they  serve.&lt;/p&gt; &lt;p class="para"&gt;Virtually all organizations admit that data quality issues plague  their progress. They are all aware of the situation at some level within the  enterprise. Quality problems are not restricted to older systems either. Nor are  they restricted to particular types of systems. For example, practitioners  intuitively assume that systems built on a relational database foundation are of  higher data quality than older systems built on less sophisticated data  management technology. Under examination, this generally turns out not to be  true.&lt;/p&gt; &lt;p class="para"&gt;Information technology evolution is at a point where the next most  important technology that needs to evolve is methods for controlling the quality  of data and the information derived from it. The systems we are building are too  important not to address this important topic any later than now.&lt;/p&gt; &lt;div class="sidebar"&gt; &lt;table class="BlueLine" border="0" cellpadding="0" cellspacing="0" width="100%"&gt; &lt;tbody&gt; &lt;tr&gt; &lt;td class="bluecell" bg style="color:#000080;"&gt;&lt;span style="font-family:Arial;font-size:85%;color:#010100;"&gt;&lt;b&gt;&lt;img alt="Start Sidebar" src="_.gif" border="0" height="2" width="1" /&gt;&lt;/b&gt;&lt;/span&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;/table&gt; &lt;p class="first-para"&gt;Although data quality problems are universal, this should  not excuse egregious examples of poor quality or cases in which awareness was  high but no actions taken. In the absence of these, CEOs should not dwell on  fault but instead spend their energies on improvement.&lt;/p&gt; &lt;p class="para"&gt;Ten years from now, poor data quality will be a reason to find  fault.&lt;/p&gt; &lt;p class="last-para"&gt;With the growing availability of knowledge, experts, books,  methodologies, software tools, and corporate resolve, high-quality database  systems will become the norm, and there will be no excuse for not having them.  What we consider excusable today will be inexcusable then.&lt;/p&gt; &lt;table class="BlueLine" border="0" cellpadding="0" cellspacing="0" width="100%"&gt; &lt;tbody&gt; &lt;tr&gt; &lt;td class="bluecell" bg style="color:#000080;"&gt;&lt;span style="font-family:Arial;font-size:85%;color:#010100;"&gt;&lt;b&gt;&lt;img alt="End Sidebar" src="_.gif" border="0" height="2" width="1" /&gt;&lt;/b&gt;&lt;/span&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;/table&gt;&lt;/div&gt; &lt;table class="BlankSpace" border="0" cellpadding="0" cellspacing="0" width="100%"&gt; &lt;tbody&gt; &lt;tr&gt; &lt;td height="16"&gt;&lt;br /&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;/table&gt; &lt;h2 class="first-section-title"&gt;Awareness Levels&lt;/h2&gt; &lt;p class="first-para"&gt;Almost everyone is aware that data from time to time causes  a visible problem. However, visibility to the magnitude of the problems and to  the impact on the corporation is generally low. There are several reasons for  this.&lt;/p&gt; &lt;p class="para"&gt;Correction activities, rework, order reprocessing, handling  returns, and dealing with customer complaints are all considered a normal part  of corporate life. Many of the problems are not associated with information  quality, &lt;a name="31"&gt;&lt;/a&gt;&lt;a name="page11"&gt;&lt;/a&gt;even when that is the problem. The  activities tend to grow in size with little fanfare or visibility. Since the  people who carry out these activities are generally not isolated within a  function, the cost and scope of such problems are generally not appreciated.&lt;/p&gt; &lt;p class="para"&gt;When decision makers reject IT data because "they just know it  can't be right," they generally do not rush into the CEO's office and demand  that the data coming from IT be improved. They usually just depend on their  previous methods for making decisions and do not use the information from the  databases. Many times data warehouse and decision support systems get built and  then become not used for this reason. To make matters worse, decision makers  sometimes generate alternative data collection and storage minisystems to use  instead of the mainline databases. These often tend to be as bad or worse in  quality than the systems they reject.&lt;/p&gt; &lt;p class="para"&gt;IT management often does not want to raise a red flag regarding  quality, since they know that they will get blamed for it. Their systems are  collecting, storing, and disseminating information efficiently, and they are  content with not surfacing the fact that the quality of the data flowing through  these systems is bad.&lt;/p&gt; &lt;p class="para"&gt;Corporate management wants to believe that their IT departments  are top notch and that their systems are first rate. They do not want to expose  to their board or to the outside world the facts of inefficiencies or lost  opportunities caused by inaccurate data.&lt;/p&gt; &lt;p class="para"&gt;If a company included in its annual report a statement that their  information quality caused a loss equal to 20% of their operating profit, their  stock price would plunge overnight. They do not want this information published,  they do not want investors to know, and they do not want their competitors to  know. The obvious psychology drives them to not want to know (or believe) it  themselves.&lt;/p&gt; &lt;p class="para"&gt;Companies tend to hide news about information quality problems.  You will never see a company voluntarily agree to a magazine article on how they  discovered huge data quality problems and invested millions of dollars to fix  them. Even though this is a great story for the corporation, and the results may  save them many times the money they spent, the story makes them look like they  lost control and were just getting back to where they should have been. It  smacks of saying that they have been bad executives and managers and had to  spend money to correct their inefficient ways.&lt;/p&gt; &lt;p class="para"&gt;I had a conversation with a government agency official in which  they indicated that disclosure of data accuracy problems in a particular  database would generate a political scandal of considerable proportions, even  though the root cause of the quality problems had nothing to do with any of the  elected officials. Needless to say, they went about fixing the problem as best  they could, with no publicity at all about the project or their findings.&lt;/p&gt; &lt;p class="para"&gt;&lt;a name="32"&gt;&lt;/a&gt;&lt;a name="page12"&gt;&lt;/a&gt;Data quality (and more  specifically, data accuracy) problems can have liability consequences. As we  move more into the Internet age, in which your company's data is used by other  corporations to make decisions about purchasing and selling, costs associated  with bad data will eventually be the target of litigation. Corporations surely  do not want to trumpet any knowledge they have of quality problems in their  databases and give ammunition to the legal staff of others.&lt;/p&gt; &lt;p class="para"&gt;The time to brag about spending large budgets to get and maintain  highly accurate data and highly accurate information products has not yet  arrived. However, the tide is turning on awareness. If you go into almost any IT  organization, the data management specialists will all tell you that there are  considerable problems with the accuracy of data. The business analysts will tell  you that they have problems with data and information quality. As you move up  the management chain, the willingness to assert the problems diminishes, usually  ending with the executive level denying any quality problems at all.   summarizes reasons for lack of initiative in regard to problems with information  quality.&lt;/p&gt;&lt;p class="para"&gt; &lt;/p&gt;&lt;h2 class="first-section-title"&gt;Impact of Poor-Quality Data&lt;/h2&gt; &lt;p class="first-para"&gt;We usually cannot scope the extent of data quality problems  without an assessment project. This is needed to really nail the impact on the  organization and identify the areas of potential return. The numbers showing the  potential savings are not lying around in a convenient account. They have to be  dug out through a concerted effort involving several organizational entities.  Some areas in which costs are created and opportunities lost through poor data  quality are&lt;/p&gt; &lt;ul class="itemizedlist"&gt;&lt;li class="first-listitem"&gt; &lt;p class="first-para"&gt;transaction rework costs&lt;/p&gt; &lt;/li&gt;&lt;li class="listitem"&gt; &lt;p class="first-para"&gt;costs incurred in implementing new systems&lt;/p&gt; &lt;/li&gt;&lt;li class="listitem"&gt; &lt;p class="first-para"&gt;delays in delivering data to decision makers&lt;/p&gt; &lt;/li&gt;&lt;li class="listitem"&gt; &lt;p class="first-para"&gt;&lt;a name="35"&gt;&lt;/a&gt;&lt;a name="page13"&gt;&lt;/a&gt;lost customers through  poor service&lt;/p&gt; &lt;/li&gt;&lt;li class="listitem"&gt; &lt;p class="first-para"&gt;lost production through supply chain problems&lt;/p&gt;&lt;/li&gt;&lt;/ul&gt; &lt;p class="para"&gt;Examples of some of these, discussed in the sections that follow,  will demonstrate the power of data quality problems to eat away at the financial  health of an organization.&lt;/p&gt; &lt;div class="section"&gt; &lt;h3 class="sect3-title"&gt;&lt;a name="36"&gt;&lt;/a&gt;&lt;a name="ch01lev2sec1"&gt;&lt;/a&gt;Transaction Rework  Costs&lt;/h3&gt; &lt;p class="first-para"&gt;Many organizations have entire departments that handle  customer complaints on mishandled orders and shipments. When the wrong items are  shipped and then returned, a specific, measurable cost occurs. There are many  data errors that can occur in this area: wrong part numbers, wrong amounts, and  incorrect shipping addresses, to name a few. Poorly designed order entry  procedures and screens are generally the cause of this problem.&lt;/p&gt;&lt;/div&gt; &lt;div class="section"&gt; &lt;h3 class="sect3-title"&gt;&lt;a name="37"&gt;&lt;/a&gt;&lt;a name="ch01lev2sec2"&gt;&lt;/a&gt;Costs Incurred in  Implementing New Systems&lt;/h3&gt; &lt;p class="first-para"&gt;One of the major problems in implementing data warehouses,  consolidating databases, migrating to new systems, and integrating multiple  systems is the presence of data errors and issues that block successful  implementation. Issues with the quality of data can, and more than half the time  do, increase the time and cost to implement data reuse projects by staggering  amounts.&lt;/p&gt; &lt;p class="last-para"&gt;A recent report published by the Standish Group shows that  37% of such projects get cancelled, with another 50% completed but with at least  a 20% cost and time overrun and often with incomplete or unsatisfactory results.  This means that only 13% of projects are completed within a reasonable time and  cost of their plans with acceptable outcomes. This is a terrible track record  for implementing major projects. Failures are not isolated to a small group of  companies or to specific industries. This poor record is found in almost all  companies.&lt;/p&gt;&lt;/div&gt; &lt;div class="section"&gt; &lt;h3 class="sect3-title"&gt;&lt;a name="38"&gt;&lt;/a&gt;&lt;a name="ch01lev2sec3"&gt;&lt;/a&gt;Delays in  Delivering Data to Decision Makers&lt;/h3&gt; &lt;p class="first-para"&gt;Many times you see organizations running reports at the end  of time periods and then reworking the results based on their knowledge of wrong  or suspicious values. When the data sources are plagued by quality problems, it  generally requires manual massaging of information before it can be released for  decision-making consumption. The wasted time of people doing this rework can be  measured. The poor quality of decisions made cannot be measured. If it takes  effort to clean up data before use, you can never be sure if the data is  entirely correct after cleanup.&lt;/p&gt;&lt;/div&gt; &lt;div class="section"&gt; &lt;h3 class="sect3-title"&gt;&lt;a name="39"&gt;&lt;/a&gt;&lt;a name="ch01lev2sec4"&gt;&lt;/a&gt;Lost Customers  Through Poor Service&lt;/h3&gt; &lt;p class="first-para"&gt;&lt;a name="40"&gt;&lt;/a&gt;&lt;a name="page14"&gt;&lt;/a&gt;This is another category  that can easily be spotted. Customers that are being lost because they  consistently get orders shipped incorrectly, get their invoices wrong, get their  payments entered incorrectly, or other aspects of poor service represent a large  cost to the corporation.&lt;/p&gt;&lt;/div&gt; &lt;div class="section"&gt; &lt;h3 class="sect3-title"&gt;&lt;a name="41"&gt;&lt;/a&gt;&lt;a name="ch01lev2sec5"&gt;&lt;/a&gt;Lost Production  Through Supply Chain Problems&lt;/h3&gt; &lt;p class="first-para"&gt;Whenever the supply chain system delivers the wrong parts or  the wrong quantity of parts to the production line, there is either a stoppage  of work or an oversupply that needs to be stored somewhere. In either case,  money is lost to the company.&lt;/p&gt; &lt;p class="last-para"&gt;The general nature of all of these examples is that data  quality issues have caused people to spend time and energy dealing with the  problems associated with them. The cost in people and time can be considerable.  However, over time corrective processes have become routine, and everyone has  come to accept this as a normal cost of business. It is generally not visible to  higher levels of management and not called out on accounting reports. As a  result, an assessment team should be able to identify a great deal of cost in a  short period of time.&lt;/p&gt;&lt;p class="last-para"&gt; &lt;/p&gt;&lt;h2 class="first-section-title"&gt;Requirements for Making Improvements&lt;/h2&gt; &lt;p class="first-para"&gt;Too often executives look at quality problems as isolated  instances instead of symptoms. This is a natural reaction, considering that they  do not want to believe they have problems in the first place. They tend to be  reactive instead of proactive. Making large improvements in the accuracy of data  and the quality of information from the data can only be accomplished through  proactive activities.&lt;/p&gt; &lt;p class="para"&gt;Considering the broad scope of quality problems, this is not an  area for quick fixes. The attitude that should be adopted is that of installing  a new layer of technology over their information systems that will elevate their  efficiency and value. It is the same as adding a CRM system to allow marketing  to move to a new level of customer care, resulting in higher profits.&lt;/p&gt; &lt;p class="para"&gt;The scope of quality problems and the potential for financial gain  dictate that a formal program be initiated to address this area. Such a program  needs to have a large component dedicated to the topic of data accuracy. Without  highly accurate data, information quality cannot be achieved.&lt;/p&gt; &lt;p class="para"&gt;&lt;a name="43"&gt;&lt;/a&gt;&lt;a name="page15"&gt;&lt;/a&gt;To get value from the program,  it must be viewed as a long-term and continuous activity. It is like adding  security to your buildings. Once you achieve it, you do not stop pursuing it. In  spite of the fact that data quality improvement programs are long term, it is  important to repeat that significant returns are generally achievable in the  short term.&lt;/p&gt; &lt;p class="para"&gt;Some of the problems will take a long time to fix. The primary  place to fix problems is in the systems that initially gather the data.  Rebuilding them to produce more accurate data may take years to accomplish.  While long-term improvements are being made, short-term improvements can be made  through filtering of input data, cleansing of data in databases, and in creating  an awareness of the quality that consumers of the data can expect will  significantly improve the use of the data.&lt;/p&gt; &lt;p class="para"&gt;A major theme of this book is that you need to train all of your  data management team in the concepts of accurate data and to make accurate data  a requirement of all projects they work on. This is in addition to having a core  group of data quality experts who pursue their own agenda.&lt;/p&gt; &lt;p class="last-para"&gt;There will still be times when overhauling a system solely  for the purpose of improving data accuracy is justified. However, most of the  time the best way to improve the overall data accuracy of your information  systems is to make it a primary requirement of all new projects. That way, you  are getting double value for your development dollars.&lt;/p&gt;&lt;p class="last-para"&gt; &lt;/p&gt;&lt;h2 class="first-section-title"&gt;Expected Value Returned for Quality Program&lt;/h2&gt; &lt;p class="first-para"&gt;Experts have estimated the cost of poor information quality  at from 15 to 25% of operating profits. This assumes that no concerted effort  has already been made to improve quality. The actual achievable number is less.  However, even if you could get only 60% of that back you would add 9 to 15% to  the bottom line. This is a considerable amount. If you are a corporation, this  is a lot of profit. If you are an education institution, this is a lot of money  added for improving the campus or faculty. If you are a charitable organization,  this is a lot more money going to recipients. If you are a governmental  organization, this is more value for the tax dollar.&lt;/p&gt; &lt;p class="para"&gt;Although these numbers are considerable, they represent the value  of concentrating on improving information quality for the organization as it  currently exists. However, I suggest that better-quality information systems  will reduce the cost of, and accelerate the completion of, steps in evolving the  organization to newer business models. There has never been a time in my  lifetime when companies were not in the process of implementing newer business  or manufacturing systems that promised huge returns when completed. &lt;a name="45"&gt;&lt;/a&gt;&lt;a name="page16"&gt;&lt;/a&gt;Many of these changes were considered essential  for survival. The latest example of this is the move to being Internet  based.&lt;/p&gt; &lt;p class="para"&gt;Changing a corporation's business and operating systems to a base  of high-quality data makes changes occur faster, at lower cost, and with  better-quality outcomes. CRM projects are a good example. Moving to a  customer-centric model, whereby information about customers drives sales and  marketing activities, promises huge returns to corporations. However, we hear  that over 60% of CRM implementations either are outright failures or experience  long delays. Many of these problems are caused by inaccurate data, making it  difficult, if not impossible, to complete the projects.&lt;/p&gt;&lt;p class="para"&gt; &lt;/p&gt;&lt;h2 class="first-section-title"&gt;Data Quality Assurance Technology&lt;/h2&gt; &lt;p class="first-para"&gt;Although information quality has remained at low levels or  even degraded over the years, there has been progress in the technology for  improving it. Although information technology is not yet considered a formal  technology, its parts are coming together and will be recognized as such in the  near future. The essential elements of the technology are&lt;/p&gt; &lt;ul class="itemizedlist"&gt;&lt;li class="first-listitem"&gt; &lt;p class="first-para"&gt;availability of experts and consultants&lt;/p&gt; &lt;/li&gt;&lt;li class="listitem"&gt; &lt;p class="first-para"&gt;educational materials&lt;/p&gt; &lt;/li&gt;&lt;li class="listitem"&gt; &lt;p class="first-para"&gt;methodologies&lt;/p&gt; &lt;/li&gt;&lt;li class="listitem"&gt; &lt;p class="first-para"&gt;software tools&lt;/p&gt;&lt;/li&gt;&lt;/ul&gt; &lt;p class="para"&gt;These factors combined allow a corporation to establish a data  quality assurance program and realize substantial gain. It is important that  these factors become established as standard methods that incorporate the best  practices. This will allow the entire IT industry to use the emerging technology  effectively and for rapid transfer of knowledge between individuals and  organizations.&lt;/p&gt; &lt;p class="para"&gt;This does not mean that the technology will not evolve, as  everything is not yet known about this area. It means that changes to the set of  tools should be judged as to whether they advance the technology before they are  adopted.&lt;/p&gt; &lt;p class="para"&gt;&lt;a name="47"&gt;&lt;/a&gt;&lt;a name="page17"&gt;&lt;/a&gt;Every manufacturing operation  has a quality control department. Every accounting department has auditors.  There are inspectors for construction sites at every stage of building. There  are requirements for formal specification of construction and manufacturing  before anything is built. Any serious software development organization has  trained quality assurance professionals.&lt;/p&gt; &lt;p class="para"&gt;Information systems need the same formality in a group of people  and processes to ensure higher levels of quality. Every serious organization  with a large IT operation needs a data quality assurance program. They need to  require formal documentation of all information assets and sufficient  information about them to satisfy all development and user requirements. They  need inspectors, auditors, and development consultants. They need an established  methodology to continuously monitor and improve the accuracy of data flowing  through their information systems.&lt;/p&gt; &lt;div class="section"&gt; &lt;h3 class="sect3-title"&gt;&lt;a name="48"&gt;&lt;/a&gt;&lt;a name="ch01lev2sec6"&gt;&lt;/a&gt;Availability of  Experts and Consultants&lt;/h3&gt; &lt;p class="first-para"&gt;Before any technology can take off, it needs the attention  of a lot of smart people. When relational technology got its rocket start in the  late 1970s and early 1980s, there was research going on in several corporate  research organizations (most notably IBM) and in many universities (most notably  the University of California at Berkeley). The vast majority of Ph.D. these in  computer science in that era had something to do with relational database  technology. An enormous number of start-up companies appeared to exploit the new  technology. I did a survey in 1982 and found over 200 companies that had or were  building a relational database engine. Today, less than five of them have  survived. However, those that did survive have been enormously successful.&lt;/p&gt; &lt;p class="para"&gt;Data quality has the attention of a few smart people, not the  large group that is desirable for a new technology to emerge. However, the  number is increasing every year. Many university research efforts are now  addressing this topic. The most notable is the M.I.T. TDQM (total data quality  management) research program. There are many more university research efforts  being aimed at this field every year. In addition, technical conferences devoted  to data and information quality are experiencing significant growth in  attendance every year.&lt;/p&gt; &lt;p class="last-para"&gt;A number of consultant experts have emerged who are  dedicating their careers to the data quality topic. The number increases every  year. The quality of these consultants is superb. Corporations should not  hesitate to take advantage of their knowledge and experience.&lt;/p&gt;&lt;/div&gt; &lt;div class="section"&gt; &lt;h3 class="sect3-title"&gt;&lt;a name="49"&gt;&lt;/a&gt;&lt;a name="ch01lev2sec7"&gt;&lt;/a&gt;Educational  Materials&lt;/h3&gt; &lt;p class="first-para"&gt;&lt;a name="50"&gt;&lt;/a&gt;&lt;a name="page18"&gt;&lt;/a&gt;There is a clear shortage  of educational materials in the field of data and information quality. Materials  need to be developed and included in standard college courses on computer  science. Corporations need to provide education not only to those responsible  for data quality assurance but to everyone who is involved in defining,  building, executing, or monitoring information systems. There should also be  education for consumers of information so that they can more effectively  determine how to use information at their disposal and to provide effective  requirements and feedback to system developers.&lt;/p&gt; &lt;p class="last-para"&gt;Books and articles are useful tools for education, and plenty  of them are available. However, more specific training modules need to be  developed and deployed for quality to become an important component of  information systems.&lt;/p&gt;&lt;/div&gt; &lt;div class="section"&gt; &lt;h3 class="sect3-title"&gt;&lt;a name="51"&gt;&lt;/a&gt;&lt;a name="ch01lev2sec8"&gt;&lt;/a&gt;Methodologies&lt;/h3&gt; &lt;p class="first-para"&gt;There have emerged a number of methodologies for creating  and organizing data quality assurance programs, for performing data quality  assessments, and for ongoing data stewardship. These can be found in the various  books available on data or information quality. This book provides its own  methodology, based on data profiling technology, for consideration. More  detailed methodologies need to be employed for profiling existing data stores  and monitoring data quality in operational settings.&lt;/p&gt; &lt;p class="last-para"&gt;If data quality assurance programs are going to be  successful, they must rally around standard methods for doing things that have  been proven to work. They then need to employ them professionally over and over  again.&lt;/p&gt;&lt;/div&gt; &lt;div class="section"&gt; &lt;h3 class="sect3-title"&gt;&lt;a name="52"&gt;&lt;/a&gt;&lt;a name="ch01lev2sec9"&gt;&lt;/a&gt;Software  Tools&lt;/h3&gt; &lt;p class="first-para"&gt;There has been a paucity of software tools available to  professionals to incorporate into data quality assurance programs. It is ironic  that on the topic of data quality the software industry has been the least  helpful. Part of the reason for this is that corporations have not been  motivated to identify and solve quality problems and thus have not generated  sufficient demand to foster the growth in successful software companies focusing  on data quality.&lt;/p&gt; &lt;p class="para"&gt;More tools are emerging as the industry is waking up to the need  for improving quality. You cannot effectively carry out a good program without  detailed analysis and monitoring of data. The area of data accuracy specifically  requires software to deal with the tons of data that should be looked at.&lt;/p&gt; &lt;div class="section"&gt; &lt;h4 class="sect4-title"&gt;Metadata Repositories&lt;/h4&gt; &lt;p class="first-para"&gt;&lt;a name="53"&gt;&lt;/a&gt;&lt;a name="page19"&gt;&lt;/a&gt;The primary software tool  for managing data quality is the metadata repository. Repositories have been  around for a long time but have been poorly employed. Most IT departments have  one or more repositories in place and use them with very little effectiveness.  Most people would agree that the movement to establish metadata repositories as  a standard practice has been a resounding failure. This is unfortunate, as the  metadata repository is the one tool that is essential for gaining control over  your data.&lt;/p&gt; &lt;p class="para"&gt;The failure of repository technology can be traced to a number of  factors. The first is that implementations have been poorly defined, with only a  vague concept of what they are there for. Often, the real information that  people need from them is not included. They tend to dwell on schema definitions  and not the more interesting information that people need to do their jobs.  There has been a large mismatch between requirements and products.&lt;/p&gt; &lt;p class="para"&gt;A second failure is that no one took them seriously. There was  never a serious commitment to them. Information system professionals did not use  them in their daily jobs. It was not part of their standard tool set. It  appeared to be an unnecessary step that stood in the way of getting tasks  done.&lt;/p&gt; &lt;p class="para"&gt;A third failure is that they were never kept current. They were  passive repositories that had no method for verifying that their content  actually matched the information systems they were supposed to represent. It is  ironic that repositories generally have the most inaccurate data within the  information systems organization.&lt;/p&gt; &lt;p class="para"&gt;A fourth failure is that the standard repositories were engineered  for data architects and not the wider audience of people who can benefit from  valuable information in an accurate metadata repository. The terminology is too  technical, the information maintained is not what they all need, and the  accessibility is restricted too much.&lt;/p&gt; &lt;p class="last-para"&gt;Since corporations have never accepted the concept of an  industry standard repository, most software products on the market deliver a  proprietary repository that incorporates only that information needed to install  and operate their product. The result is that there are dozens of isolated  repositories sitting around that all contain different information, record  information in unique ways, and have little, if any, ability to move information  to other repositories. Even when this capability is provided, it is rarely used.  Repository technology needs to be reenergized based on the requirements for  establishing and carrying out an effective data quality assurance  program.&lt;/p&gt;&lt;/div&gt; &lt;div class="section"&gt; &lt;h4 class="sect4-title"&gt;Data Profiling&lt;/h4&gt; &lt;p class="first-para"&gt;&lt;a name="54"&gt;&lt;/a&gt;&lt;a name="page20"&gt;&lt;/a&gt;The second important need  is analytical tools for data profiling. Data profiling has emerged as a major  new technology. It employs analytical methods for looking at data for the  purpose of developing a thorough understanding of the content, structure, and  quality of the data. A good data profiling product can process very large  amounts of data and, with the skills of the analyst, uncover all sorts of issues  in the data that need to be addressed.&lt;/p&gt; &lt;p class="para"&gt;Data profiling is an indispensable tool for assessing data  quality. It is also very useful at periodic checking of data to determine if  corrective measures are being effective or to monitor the health of the data  over time.&lt;/p&gt; &lt;p class="para"&gt;Data profiling uses two different approaches to examining data.  One is discovery, whereby processes examine the data and discover  characteristics from the data without the prompting of the analyst. In this  regard it is performing data mining for metadata. This is extremely important to  do because the data will take on a persona of itself and the analyst may be  completely unaware of some of the characteristics. It is also helpful in  addressing the problem that the metadata that normally exists for data is  usually incorrect, incomplete, or both.&lt;/p&gt; &lt;p class="para"&gt;The second approach to data profiling is assertive testing. The  analyst poses conditions he believes to be true about the data and then executes  data rules against the data that check for these conditions to see if it  conforms or not. This is also a useful technique for determining how much the  data differs from the expected. Assertive testing is normally done after  discovery.&lt;/p&gt; &lt;p class="para"&gt;The output of data profiling will be accurate metadata plus  information about data quality problems. One goal of data profiling is to  establish the true metadata description of the data. In effect, it can correct  the sins of the past.&lt;/p&gt; &lt;p class="last-para"&gt;Data profiling tools exist in the market and are getting  better every year. They did not exist five years ago. Data profiling functions  are being implemented as part of some older products, and some new products are  also emerging that focus on this area. More companies are employing them every  year and are consistently amazed at what they can learn from them.&lt;/p&gt;&lt;/div&gt; &lt;div class="section"&gt; &lt;h4 class="sect4-title"&gt;Data Monitoring&lt;/h4&gt; &lt;p class="first-para"&gt;A third tool includes effective methods for monitoring data  quality. A data monitoring tool can be either transaction oriented or database  oriented. If transaction oriented, the tool looks at individual transactions  before they cause database changes. A database orientation looks at an entire  database periodically to find issues.&lt;/p&gt; &lt;p class="para"&gt;The goal of a transaction monitor is to screen for potential  inaccuracies in the data in the transactions. The monitor must be built into the  transaction &lt;a name="55"&gt;&lt;/a&gt;&lt;a name="page21"&gt;&lt;/a&gt;system. XML transaction systems  make this a much more plausible approach. For example, if IBM's MQ is the  transaction system being employed, building an MQ node for screening data is  very easy to do.&lt;/p&gt; &lt;p class="para"&gt;A potential problem with transaction monitors is that they have  the potential to slow down processing if too much checking is done. If this is  the result, they will tend not to be used very much. Another problem is that  they are not effective in generating alerts where something is wrong but not  sufficiently wrong to block the transaction from occurring. Transaction monitors  need to be carefully designed and judiciously used so as to not impair the  effectiveness of the transaction system.&lt;/p&gt; &lt;p class="para"&gt;Database monitors are useful for finding a broad range of problems  and in performing overall quality assessment. Many issues are not visible in  individual transactions but surface when looking at counts, distributions, and  aggregations. In addition, many data rules that are not possible to use on  individual transactions because of processing time become possible when  processing is offline.&lt;/p&gt; &lt;p class="para"&gt;Database monitors are also useful in examining collections of data  being received at a processing point. For example, data feeds being purchased  from an outside group can be fed through a database monitor to assess the  quality of the submission.&lt;/p&gt; &lt;p class="last-para"&gt;The most effective data monitoring program uses a combination  of transaction and database monitoring. It takes an experienced designer to  understand when and where to apply specific rules. The technology of data  quality monitors is not very advanced at this point. However, this is an area  that will hopefully improve significantly over the next few years.&lt;/p&gt;&lt;/div&gt; &lt;div class="section"&gt; &lt;h4 class="sect4-title"&gt;Data Cleansing Tools&lt;/h4&gt; &lt;p class="first-para"&gt;Data cleansing tools are designed to examine data that  exists to find data errors and to fix them. To find an error, you need rules.  Once an error is found, either it can cause rejection of the data (usually the  entire data object) or it can be fixed. To fix an error, there are only two  possibilities: substitution of a synonym or correlation through lookup  tables.&lt;/p&gt; &lt;p class="para"&gt;Substitution correction involves having a list of value pairs that  associate a correct value for each known wrong value. These are useful for  fixing misspellings or inconsistent representations. The known misspellings are  listed with correct spellings. The multiple ways of representing a value are  listed with the single preferred representation. These lists can grow over time  as new misspellings or new ways of representing a value are discovered in  practice.&lt;/p&gt; &lt;p class="last-para"&gt;Correlation requires a group of fields that must be  consistent across values. A set of rules or lookup tables establish the value  sets that are acceptable. &lt;a name="56"&gt;&lt;/a&gt;&lt;a name="page22"&gt;&lt;/a&gt;If a set of values  from a database record is not in the set, the program looks for a set that  matches most of the elements and then fixes the missing or incorrect part. The  most common example of this is name and address fields. The correlation set is  the government database of values that can go together (e.g., city, state, Zip  code, and so on). In fact, there is little applicability of this type of  scrubbing for anything other than name and address examination.&lt;/p&gt;&lt;/div&gt; &lt;div class="section"&gt; &lt;h4 class="sect4-title"&gt;Database Management Systems&lt;/h4&gt; &lt;p class="first-para"&gt;Database management systems (DBMSs) have always touted their  abilities to promote correct data. Relational systems have implemented physical  data typing, referential constraints, triggers, and procedures to help database  designers put transaction screening, database screening, and cleansing into the  database structure. The argument is that the DBMS is the right place to look for  errors and fix data because it is the single point of entry of data to the  database.&lt;/p&gt; &lt;p class="para"&gt;Database designers have found this argument useful for some things  and not useful for others. The good designers are using the referential  constraints. A good database design will employ primary key definitions, data  type definitions, null rules, unique rules, and primary/foreign key pair  designations to the fullest extent to make sure that data conforms to the  expected structure.&lt;/p&gt; &lt;p class="last-para"&gt;The problem with putting quality screens into the DBMS  through procedures and triggers are many. First of all, the rules are buried in  obscure code instead of being in a business rule repository. This makes them  difficult to review and manage. A second problem is that all processing becomes  part of the transaction path, thus slowing down response times. A third problem  is that the point of database entry is often "too late" to clean up data,  especially in Internet-based transaction systems. The proper way to treat data  quality issues is to use a combination of DBMS structural support, transaction  monitors, database monitors, and external data cleansing.&lt;/p&gt;&lt;p class="last-para"&gt; &lt;/p&gt;&lt;h2 class="first-section-title"&gt;Closing Remarks&lt;/h2&gt; &lt;p class="first-para"&gt;As information systems become more of the fabric of  organizations, they also get more and more complex. The quality of data within  them has not improved over the years as has other technologies. The result is  that most information systems produce data that is of such poor quality that  organizations incur significant losses in operations and decision making. It  also severely slows down and sometimes cripples attempts to introduce new  business models into the organization.&lt;/p&gt; &lt;p class="para"&gt;&lt;a name="58"&gt;&lt;/a&gt;&lt;a name="page23"&gt;&lt;/a&gt;There are many reasons data  quality is low and getting lower. This will not change until corporations adopt  stringent data quality assurance initiatives. With proper attention, great  returns can be realized through improvements in the quality of data.&lt;/p&gt; &lt;p class="para"&gt;The primary value to the corporation for getting their information  systems into a state of high data quality and maintaining them there is that it  gives them the ability to quickly and efficiently respond to new business model  changes. This alone will justify data quality assurance initiatives many times  over.&lt;/p&gt; &lt;p class="para"&gt;Data quality assurance initiatives are becoming more popular as  organizations are realizing the impact that improving quality can have on the  bottom line. The body of qualified experts, educational information,  methodologies, and software tools supporting these initiatives is increasing  daily. Corporations are searching for the right mix of tools, organization, and  methodologies that will give them the best advantage in such programs.&lt;/p&gt; &lt;p class="last-para"&gt;Data accuracy is the foundation of data quality. You must get  the values right first. The remainder of this book focuses on data accuracy:  what it means, what is possible, methods for improving the accuracy of data, and  the return you can expect for instituting data accuracy assurance programs.&lt;/p&gt;&lt;/div&gt;&lt;/div&gt;&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;/table&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/1576936343892993251-3954743660672572367?l=dataqualityaccuracy.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://dataqualityaccuracy.blogspot.com/feeds/3954743660672572367/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=1576936343892993251&amp;postID=3954743660672572367' title='1 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/1576936343892993251/posts/default/3954743660672572367'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/1576936343892993251/posts/default/3954743660672572367'/><link rel='alternate' type='text/html' href='http://dataqualityaccuracy.blogspot.com/2007/12/data-quality-problem.html' title='The Data Quality Problem'/><author><name>vijikumar</name><uri>http://www.blogger.com/profile/18263997193124066136</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>1</thr:total></entry></feed>
