Thursday, December 13, 2007

Sources of Inaccurate Data

Before we can assess data correctness we need to understand the various ways inaccurate values get into databases. There are many sources of data inaccuracies, and each contributes its own part to the total data quality problem. Understanding these sources will demonstrate the need for a comprehensive program of assessment, monitoring, and improvement. Having highly accurate data requires attention to all sources of inaccuracies and appropriate responses and tools for each.

The first three cause inaccuracies in data within the databases, whereas the fourth area causes inaccuracies in the information products produced from the data. If you roll up all potential sources of errors, the interesting conclusion is that the most important use of the data (corporate decision making) is made on the rendition of data that has the most inaccuracies.

3.1 Initial Data Entry

Most people assume that data inaccuracies are always the result of entering the wrong data at the beginning. This is certainly a major source of data inaccuracies but not the only source. Inaccurate data creation can be the result of mistakes, can result from flawed data entry processes, can be deliberate, or can be the result of system errors. By looking at our systems through these topics, you can gain insight into whether systems are designed to invite inaccurate data or are designed to promote accurate data.

Data Entry Mistakes

The most common source of a data inaccuracy is that the person entering the data just plain makes a mistake. You intend to enter blue but enter bleu instead; you hit the wrong entry on a select list; you put a correct value in the wrong field. Much of operational data originates from a person. People make mistakes; we make them all the time. It is doubtful that anyone could fill out a hundred-field form without making at least one mistake.

A real-world example involves an automobile damage claims database in which the COLOR field was entered as text. Examination of the content of this field yielded 13 different spellings for the word beige. Some of these mistakes were the result of typos. Others were just that the entry person did not know how to spell the word. In some of the latter cases, they thought they knew how to spell the word, whereas in others they were just not able or willing to look it up.

Flawed Data Entry Processes

A lot of data entry begins with a form. A person completes a form either on a piece of paper or on a computer screen. Form design has a lot to do with the amount of inaccurate data that ends up in the database. Form design should begin with a basic understanding of quality issues in order to avoid many of the mistakes commonly seen. For example, having someone select from a list of valid values instead of typing in a value can eliminate the misspellings previously cited.

Another common problem is having fields on the form that are confusing to the user. This often leads them to enter wrong information. The field itself may be confusing to the user. If it is a field that is not commonly understood, or if the database definition is unconventional, the form needs to provide assistance in guiding the user through entry of values into the field. Sometimes the confusion is in the way the field is described in its identifying text or in its positioning on the form. Form design should always be subjected to rigorous quality testing to find the fields a normal user would have difficulty in knowing what to enter.

Data entry windows should have instructions available as HELP functions and should be user friendly in handling errors. Frustration in using a form can lead to deliberate mistakes that corrupt the database.

Forms are better completed by a trained entry person than by a one-time user. This is because the entry person can be taught how things should be entered, can become proficient in using the form mechanisms, and can be given feedback to improve the efficiency and accuracy of the data. A one-time user is always uncertain about what they are supposed to do on the form. Unfortunately, our society is moving by way of the Internet toward eliminating the middle person in the process and having end users complete forms directly. This places a much higher demand on quality form design.

The data entry process includes more than the forms that are filled out. It also includes the process that surrounds it. Forms are completed at a specific point or points in a process. Sometimes we have forms that are required to be completed when not all information is known or easily obtained at that point in the process. This will inevitably lead to quality problems.

An example of a data entry process I helped design a number of years ago for military repair personnel is very instructive of the types of problems that can occur in data collection. The U.S. Navy has a database that collects detailed information on the repair and routine maintenance performed on all aircraft and on all major components of every ship. This database is intended to be used for a variety of reasons, from negotiating contracts with suppliers, to validating warranties, to designing new aircraft and ships.

When an aircraft carrier is in a combat situation, such as in Kuwait and Afghanistan, repairs are being made frequently. The repair crews are working around the clock and under a great deal of pressure to deal with a lot of situations that come up unexpectedly. Completing forms is the least of their concerns. They have a tendency to fix things and do the paperwork later. The amount of undocumented work piles up during the day, to be completed when a spare moment is available. By then the repair person has forgotten some of the work done or the details of some of the work and certainly is in a hurry to get it done and out of the way.

Another part of this problem comes in when the data is actually entered from the forms. The forms are coming out of a hectic, very messy environment. Some of the forms are torn; some have oil or other substances on them. The writing is often difficult to decipher. The person who created it is probably not available and probably would not remember much about it if available.

A database built from this system will have many inaccuracies in it. Many of the inaccuracies will be missing information or valid but wrong information. An innovative solution that involves wireless, handheld devices and employs voice recognition technology would vastly improve the completeness and accuracy of this database. I hope the U.S. Navy has made considerable improvements in the data collection processes for this application since I left. I trust they have.

The Null Problem

A special problem occurs in data entry when the information called for is not available. A data element has a value, an indicator that the value is not known, or an indicator that no value exists (or is applicable) for this element in this record. Have you ever seen an entry screen that had room for a value and two indicator boxes you could use for the case where there is no value? I haven't. Most form designs either mandate that a value be provided or allow it to be left blank. If left blank, you do not know the difference between value-not-known and no-value-applies.

When the form requires that an entry be available and the entry person does not have the value, there is a strong tendency to "fake it" by putting a wrong, but acceptable, value into the field. This is even unintentionally encouraged for selection lists that have a default value in the field to start with.

It would be better form design to introduce the notion of NOT KNOWN or NOT APPLICABLE for data elements that are not crucial to the transaction being processed. This would at least allow the entry people to enter accurately what they know and the users of the data to understand what is going on in the data.

It would make sense in some cases to allow the initial entry of data to record NOT KNOWN values and have the system trigger subsequent activities that would collect and update these fields after the fact. This is far better than having people enter false information or leaving something blank and not knowing if a value exists for the field or not.

An example of a data element that may be NOT KNOWN or NOT APPLICABLE is a driver's license number. If the field is left blank, you cannot tell if it was not known at the point of entry or whether the person it applies to does not have a driver's license. Failure to handle the possibility of information not being available at the time of entry and failure to allow for options to express what you do know about a value leads to many inaccuracies in data.

Deliberate Errors

Deliberate errors are those that occur when the person enters a wrong value on purpose. There are three reasons they do this:

  • They do not know the correct information.

  • They do not want you to know the correct information.

  • They get a benefit from entering the wrong information.

Do Not Know Correct Information

Not knowing the correct information occurs when the form requires a value for a field and the person wants or needs to complete the form but does not know the value to use. The form will not be complete without a value. The person does not believe the value is important to the transaction, at least not relative to what they are trying to do. The result is that they make up a value, enter the information, and go on.

Usually the information is not important to completing the transaction but may be important to other database users later on. For example, asking and requiring a value for the license plate number of your car when registering for a hotel has no effect on getting registered. However, it may be important when you leave your lights on and they need to find out whose car it is.

Do Not Wish To Give The Correct Information

The second source of deliberate errors is caused by the person providing the data not wanting to give the correct information. This is becoming a more and more common occurrence with data coming off the Internet and the emergence of CRM applications. Every company wants a database on all of their customers in order to tailor marketing programs. However, they end up with a lot of incorrect data in their databases because the information they ask people for is more than people are willing to provide or is perceived to be an invasion of privacy.

Examples of fields that people will lie about are age, height, weight, driver's license number, home phone number, marital status, annual income, and education level. People even lie about their name if it can get the result they want from the form without putting in their correct name. A common name appearing in many marketing databases is Mickey Mouse.

The problem with collecting data that is not directly required to complete the transaction is that the quality of these data elements tends to be low but is not immediately detected. It is only later, when you try to employ this data, that the inaccuracies show up and create problems.

Falsifying To Obtain A Benefit

The third case in which deliberate mistakes are made is where the entry person obtains an advantage in entering wrong data. Some examples from the real world illustrate this.

An automobile manufacturer receives claim forms for warranty repairs performed by dealers. Claims for some procedures are paid immediately, whereas claims for other procedures are paid in 60 days. The dealers figure out this scheme and deliberately lie about the procedures performed in order to get their money faster. The database incorrectly identifies the repairs made. Any attempt to use this database to determine failure rates would be a total failure. In fact, it was in attempts to use this data for this purpose that led to the discovery of the practice. It had been going on for years.

A bank gives branch bank employees a bonus for all new corporate accounts. A new division of a larger company opens an account with a local branch. If the bank employee determines that this is a sub-account of a larger, existing customer (the correct procedure), no bonus is paid upon opening the account. If, however, the account is opened as a new corporate customer (the wrong procedure), a bonus is paid.

An insurance company sells automobile insurance policies through independent insurance writers. In a metropolitan area, the insurance rate is determined by the Zip code of the applicant. The agents figure out that if they falsify the ZIP CODE field on the initial application for high-cost Zip codes, they can get the client on board at a lower rate. The transaction completes, the agent gets his commission, and the customer corrects the error when the renewal forms arrive a year later. The customer's rates subsequently go up as a result.

Data entry people are rated based on the number of documents entered per hour. They are not penalized for entering wrong information. This leads to a practice of entering data too fast, not attempting to resolve issues with input documents, and making up missing information. The operators who enter the poorest-quality data get the highest performance ratings.

All of these examples demonstrate that company policy can encourage people to deliberately falsify information in order to obtain a personal benefit.

System Problems

Systems are too often blamed for mistakes when, after investigation, the mistakes turn out to be the result of a human error. Our computing systems have become enormously reliable over the years. However, database errors do occur because of system problems when the transaction systems are not properly designed.

Database systems have the notion of COMMIT. This means that changes to a database system resulting from an external transaction either get completely committed or completely rejected. Specific programming logic ensures that a partial transaction never occurs. In application designs, the user is generally made aware that a transaction has committed to the database.

In older systems, the transaction path from the person entering data to the database was very short. It usually consisted of a terminal passing information through a communications controller to a mainframe, where an application program made the database calls, performed a COMMIT, and sent a response back to the terminal. Terminals were either locally attached or accessed through an internal network.

Today, the transaction path can be very long and very complex. It is not unusual for an application to occur outside your corporation on a PC, over the Internet. The transaction flows through ISPs to an application server in your company. This server then passes messages to a database server, where the database calls are made. It is not unusual for multiple application servers to be in the path of the transaction. It is also not unusual for multiple companies to house application servers in the path. For example, Amazon passes transactions to other companies for "used book" orders.

The person entering the data is a nonprofessional, totally unfamiliar with the system paths. The paths themselves involve many parts, across many communication paths. If something goes wrong, such as a server going down, the person entering the information may not have any idea of whether the transaction occurred or not. If there is no procedure for them to find out, they often reenter the transaction, thinking it is not there, when in fact it is; or they do not reenter the transaction, thinking it happened, when in fact it did not. In one case, you have duplicate data; in the other, you have missing data.

More attention must be paid to transaction system design in this new, complex world we have created. We came pretty close to cleaning up transaction failures in older "short path" systems but are now returning to this problem with the newer "long path" systems.

In summary, there are plenty of ways data inaccuracies can occur when data is initially created. Errors that occur innocently tend to be random and are difficult to correct. Errors that are deliberate or are the result of poorly constructed processes tend to leave clues around that can be detected by analytical techniques.

Data Accuracy Decay

Data that is accurate when initially created can become inaccurate within a database through time. The data value does not change; its accuracy does. Some data elements are subject to value accuracy decay and some are not. Some examples will illustrate the concept of decay.

Personal information in an employee database easily becomes wrong. People move, they change their marital status, they complete new education programs, they change telephone numbers. Most employees do not run into HR and fill out a form every time something changes in their life. The information in HR reflects the accuracy at the time they initially joined the company or the last time an update was done. Inventory-on-hand information can become wrong due to spoilage, parts taken and used, and new transactions not processed.

The value carried on the books for capital assets can change due to market demand changes, changes in the environment in which the asset is used, unusual wear and tear, or damage done and not reported. A state driver's license database indicates that a person has acceptable eyesight to drive without glasses. However, during the time since the license was issued, the person's eyesight may deteriorate to the point where she cannot safely drive without glasses. The inaccuracy will not be corrected until a renewal requires a new eye test.

All of these examples show that a change occurred in the object being represented in the database and the database was not updated to reflect it. Another way of saying this is that a transaction was needed to be processed and was not. However, these "missing transactions" are commonplace in the real world.

Not all data elements are subject to decay. Information defining an object generally does not decay, whereas information providing other information about the object generally can be subject to decay. Good database designers will note the decay characteristic of a data element as part of the metadata and will design processes to verify or update the information as needed by the consumers of the data. For example, in an inventory application, the data elements for PART_NUMBER, PART_DESCRIPTION, and UNIT_OF_MEASURE would not be considered subject to decay, whereas QUANTITY_ON_HAND, SUPPLIER_ID, and STORAGE_BIN_NUMBER would be subject to decay.

Another characteristic of decay-prone data elements is that the overall accuracy of the element tends to follow a sawtooth graph over time. Data is entered. Over time, the older data gets progressively more inaccurate (it decays). The accuracy of the element is determined by the mix of older data versus newer data, which tends to be more accurate. A corrective action occurs that pulls the data back into a higher state of accuracy.

In the previously cited examples, the corrective event for HR information may be an annual data review conducted with all employees; for inventory, an annual inventory exercise; for capital equipment, an annual reappraisal. Note that for the driver's license example there is no event that brings all or a set of records into accuracy at a single point in time. It is a continuous database with new records and renewals occurring on an individual timeline. This does not follow a sawtooth pattern because the decaying and correcting occurs continuously over the database. However, it does have an inaccuracy component due to decay that remains fairly constant over time.



An example of a good HR system is to have new employees verify all HR information immediately after initial entry into the database, request that employees review all personal information each time they make any change, and then request a specific review by the employee any time no review has been done in a year (or less if you can get away with it). Another mechanism is to make the review a part of the annual employee evaluation process. It would not hurt to monitor marriage and birth announcements as well.

Unfortunately, most companies are not willing to be that concerned about data decay accuracy issues. It is not because they have considered the cost of a continuous updating process to be higher than the cost of not being more accurate. It is more a case of not thinking about the problem at all. The cost of a corrective program is not very high once it has been put in place. It will generally return more value than it costs.

A proper way of handling decay-prone data elements is to identify them in the database design and indicate this in the metadata repository. Periodic events should be planned to update information to get the database to recover from decay. The times that these events occur should be included in the database and/or the metadata repository so that database business users can know what to expect. The metadata repository should also include information on the rate of change that occurred in the periodic auditing event. A good data analyst knows the rate of decay and the probable accuracy index for any element at any given point in time.

Moving and Restructuring Data

Inaccurate data is often created from perfectly good data through the processes used to move and restructure data. These processes are commonly used to extract data from operational databases and put it into data warehouses, data marts, or operational data stores. It may also restage the data for access through corporate portals. This is an area often overlooked as a source of data inaccuracies and yet it contributes significantly to the inaccuracy problem. When you hear complaints that the data in a data warehouse or data mart is wrong and unusable, a major contributing factor might be flawed data movement processes as opposed to the data being inaccurate in the source databases.

Moving and restructuring data is commonly done through ETL (extract, transform, and load) processes. There may also be data scrubbing processes involved. Sometimes this is performed through packaged tools and sometimes through in-house-developed scripts or programs. When packaged tools are used, the errors are not coming from them but rather from faulty specifications used for them.

Entire classes of tools and software companies emerged to support this activity. The two primary tool categories are ETL and data cleansing. What was missing was tools to help understand the data in enough detail to effectively use the tools and to design decision support systems correctly. The recent emergence of data profiling tools is intended to fill this gap and is showing dramatically positive effects when used.

The irony of it is that most projects claim to be extracting the data and cleaning it up before it is moved into the data warehouses, when in fact they are making it dirtier, not cleaner. Data is generated and stored in a database based on the initiating application. This is generally a transaction system that collects the data and performs updates. The database is generally designed to meet the requirements of the initiating application and nothing more. The design of the data and the database is generally poorly documented and almost always not kept up to date.

There are also a host of other reasons for moving data to a database of a different structure. These include migration of applications to newer technology or to packaged applications. For many applications, this may be the first time they have been subject to a major reengineering effort. Another reason is the need to consolidate databases. Consolidations occur for two reasons: corporation mergers and acquisitions, and the need to centralize databases from local or divisional databases.

An important fact of migrations and consolidations is that you cannot leave the old data behind. Most databases require a continuity of information over long periods of time. You cannot move to a new personnel system and not move the existing personnel information behind. You cannot move to a new supply chain system and not move the inventory or the inventory history data. lists projects for which database restructure and/or data movement should be a serious consideration.



There are two major problems frequently found in moving and restructuring data. The first is that to effectively accomplish the movement of the data to another database you have to fully understand everything about the source database (and the target, if it is not a new database). The reality is nothing close to this. The other problem is that the source systems were never designed with the thought of having to give up their data to another database. This is particularly the case for a database of a different type, designed very differently and combining it with data from other sources.


Understanding Source Databases

Rarely do you have current, complete information available on transaction systems. Metadata dictionaries and repositories generally have very low accuracy. Even COBOL copybooks sport a high degree of mismatch between what they say and how the data is actually stored.

The poor attention paid to creating and maintaining accurate data in data dictionaries and metadata repositories is now hurting corporations to the tune of millions of dollars in unnecessarily complex data movement projects and/or in having to accept low-quality data (data that is even of lower quality than the source databases it is extracted from) for use in decision making.

The reasons for poor data quality in metadata repositories are many. Repository technology has lagged behind database design evolution. Repository solutions are generally insufficient in content to collect all the needed information. They are passive, which means that they can get out of step with the data without noticeable impacts on normal operations. There has been little motivation on the part of data administration staff to keep them current.

COBOL copybooks also become inaccurate over time. Changes to fields are often done without updating the copybook. A character field may be reassigned to be used for an entirely different purpose. If the physical definition of the field satisfies the new use, the copybook change is not needed to make the program work. This results in the field name and comment text referring to the old meaning and not the new meaning. This can, and will, mislead anyone using it to make changes or to extract data.

In many instances, valuable application information was found stored in the space defined to COBOL as FILLER. This is a common and quick-fix way of implementing a change and of completely bypassing an update of the copybook or any other documentation that may exist.

Overloaded Fields

Older, legacy applications are filled with "overloaded" fields. An overloaded field is one that has information about multiple, different facts stored in the same field.

An example of an overloaded field is where a mainframe application developer needs a new field that has a binary representation such as YES or NO. They do not want to waste the space for a new field (a common concern for developers in the 1960 to 1980 time frame). Instead of creating a new field, they decide to use an unused bit of another, existing field. This could be one of the spare bits within packed decimal fields for IBM mainframes or the sign bit of a numeric field that can never be negative. The result is an overloaded field. Other examples of overloaded fields are to use each bit of a single byte to represent the value for a different binary-valued field or to use free-form text fields to encode keywords that mean different things.

An extreme case discovered in one system was that a person's NAME field had the characters *DECEASED appended to it within the same NAME field if the person was dead. So, for example, the value Jack Olson meant who I am and that I am alive, whereas the value Jack Olson *DECEASED meant who I am and that I am not alive.

The problem with overloaded fields is that they are generally never documented. The copybook definitions and dictionary definitions usually document only the underlying field or, in the case of multivalued keyword fields, document it only as a text field.

Analysts designing data movement and restructuring processes often do not become aware of the overloading and either incorrectly transform and move the data or reject some or all of the values in the field as being inaccurate. In the process of rejection, they are losing correct information on each of the separate facts recorded in the field.

Another real example of overloading fields is where a legacy system developer used the sign bit on the PAY_RATE field to represent a different fact because this field should never be negative. Neither this convention nor the meaning of the bit was ever documented. When they upgraded to a new packaged application for payroll, they moved the field as it existed, innocently thinking that they did not need to investigate such an obvious field. When they ran their first payroll, 30% of the checks had negative amounts. This is an example of a situation in which you absolutely must verify the content of every field against your expectations when you lift it out of one system and place it into another.

Sometimes the information needed about existing databases is not recorded anywhere. You often need detailed information on missing or exception conditions. This is often not available. You also need information on accuracy. This is never recorded anywhere.

Matching Source Databases to Target Databases

This is tricky business. You first need to have a total understanding of both systems. You then need to match data elements to determine whether the result makes sense.

In matching data elements you are saying that a particular field in the source system contains the content you need to put into a particular field in the target system. Many times this is an obvious match, such as a PAYTYPE field in one system and a PAY_TYPE field in the other system. However, when you get below the surface, differences can emerge that are highly significant and can create enormous problems if not addressed. The types of problems that can occur are

  • The actual representation of the values within the field may differ. One system, for example, may encode PAY_TYPE as H and S for hourly and salaried. The other may encode the same information as 1 and 2. You must look at the values to determine what is there.

  • The scope of values may differ. One system may have two types of PAY_TYPE, and the other may have three types. Are the differences reconcilable?

  • One system may record information to a lower level of granularity than the other system. For example, one may encode color as RED, BLUE, and GREEN, whereas the other system may have five values for shades of RED, four values for shades of BLUE, and so on. The issue then becomes whether the target system can deal with the differences.

  • The systems may handle NULL or NOT KNOWN differently. One system may not allow them, whereas the other does. One system may record a NULL as a blank and the other system as the? character.

  • Special meaning may be attached to specific values in one system. For example, a policy number of 99999999 may mean that the policy number has not been assigned yet.

It is not sufficient that the target system have a field of the correct data type and length as the source system field. It is necessary that every value in the source system can be translated to an acceptable value in the target system that has exactly the same semantic meaning. Failure to do this will turn acceptable (accurate) values in the source system into unacceptable (inaccurate) values in the target system.

Another problem comes into play in matching data elements when there are a different number of data elements between the source and target systems. When the target system has a data element that does not exist in the source system, a determination needs to be made on what to do about it. The target data element may be eliminated from the design, NULLS inserted, or values manufactured for the field. In any case, the result is less than optimal. This problem is compounded when merging data from multiple sources wherein some sources have values for this field and other sources do not. The result is a target data element populated with highly questionable data. Any attempt to use the field to generate aggregations for groups will yield inaccurate results.

When the source system contains excess data elements, a decision needs to be made on whether the information is, in fact, needed in the target system. Generally it is okay to leave it out, provided it does not have structural importance to other data elements.

Source and target systems need to be matched at the structural level as well as at the data element level. This is a step often either not done or done very poorly. The result can be failures in attempting to move the data or, worse yet, inaccurate results in the target systems.

Structural matching involves checking to see if data element functional dependencies and referential constraints in the source systems match those in the target systems. This is discussed in detail in Chapter 9.

For example, if the data elements of one system are not part of a primary key and they correspond to data elements in the other system that are part of a primary key, an error can easily occur in trying to move the data. Another example is where in one system the DISCOUNT is applied to the order, and in the other system it is applied to individual line items within the order. If the target system has the discount on the order, a problem exists in trying to move the data.

A large number of errors are routinely made when trying to force data into a target structure in circumstances where the two systems are structurally different and not reconcilable. The reason so much force fitting is done is that the information about these differences usually does not surface until the actual implementation phase of projects. By then, the project team is trying to make things work at any cost.

Once data is understood and matched, the project must construct processes to extract the data from the source databases, run them through any data cleansing needed, apply transforms, and load the data into the target systems. Shortcomings in analyzing the source data and in matching elements and structures often lead to serious problems in executing these procedures or result in corrupt data in the target systems.

Extraction

Extracting data normally involves reading the source databases and building output files suitable for use in the subsequent processes. These output files are almost always expected to be in a normalized form. This means that they are tables of values, hopefully in "third-normal" form. (Normal forms of data are explained in Section 9.1.)

Source systems are not always easy to unwind into a set of normalized flat files. Depending on the complexity of how data is stored in the source systems, this can be a monumental task that is susceptible to all sorts of errors.

For example, a mainframe VSAM application covered by COBOL copy-books may store information in the same data file for many different records and have variations embedded within. Liberal use of the REDEFINE and OCCURS clauses can provide a challenging structural situation that must be studied thoroughly before developing the extraction routines.

A program cannot automatically figure this out for you. Many extraction programs have been constructed that automatically read the copybook and develop extraction logic from it. There are a number of problems with this approach. A REDEFINE may be used to recast the same facts or to indicate a different set of facts depending on some other value in the record. Only a human can determine which is meant. In one case you need to select the representation to include in the output; in the other you need to normalize the facts within the REDEFINE into separate data elements. A similar situation exists for OCCURS clauses. Do they represent n occurrences of the same fact or n different facts? One case calls for normalization into a separate file, and the other does not.

A company was trying to move a VSAM application for which over 200 different record types were encoded within the same VSAM file using a record type field to distinguish them and an equal number of record-level REDEFINEs. There were also liberal uses of REDEFINE and OCCURS within many of these record types. In essence, the user had cleverly defined an entire relational database structure within the VSAM file structure for a single file. Unfortunately, his cleverness caused great pains for subsequent analysts who needed to extract data from this file. As you might expect, he was no longer with the company when this was needed. Unraveling it into the proper relational definition was not easy.

Converting nonrelational structures to normalized files may also require key generation in cases for which the connection between records is done through positioning or internal pointers. Failure to recognize this and provide the keys will result in files that lack the ability to JOIN objects later or to enforce referential integrity constraints.

Extraction routines also need to handle overloaded fields. Often, they merely look at the source data as being in error and just reject the values or substitute a NULL or blank/zero value in its place. This is generating wrong information from correct information.

Source systems are often denormalized. Denormalization was a common trick used in earlier systems (especially early relational systems) to obtain adequate performance. Failure to recognize that the source data is denormalized results in denormalized flat files, which can lead to denormalized targets, which can lead to statistical errors when using the data. This is another area rarely documented in source systems.

The inverse is also true. Extraction routines that "flatten" data from hierarchical structures often generate denormalized flat files from structures that were correctly normalized. Most general-purpose extraction programs have the potential for doing this. For example, in an IMS database containing multiple segment types, if only one output file is generated for the entire database, it is denormalized if multiple occurrences of child segments are allowed for each parent instance. This output file will generate errors in the target system if the denormalization is not corrected somewhere else in the process.

Data Cleansing

Data cleansing involves identifying incorrect data values and then either correcting them or rejecting them. They deal with INVALID values in single data elements or correlation across multiple data elements. Many products are available to help you construct data cleansing routines. They can be helpful in improving the accuracy of data or they can result in less accurate data, depending on how carefully they are used.

Data cleansing becomes a problem when you lack an understanding of the source data and reject values that have special meaning. For example, if the source system had a convention of putting a value of * in a STATE field to indicate local state, the scrubbing routine may infer that * is invalid and either reject the row or put a null in its place. Even though the convention is unusual, it was correct in the source system, whose applications understood the convention.

Another problem is when data cleansing routines reject a value that is clearly correctable. For example, multiple representations of a name in a text field may include misspellings or alternative representations that are clearly recognizable for what they should be in the source systems. For example, the entry blakc is clearly a misspelling for black, and Tex is clearly an alternative representation of the text Texas (or TX if two characters are the convention).

A poorly designed data cleansing routine will reject these values and substitute a NULL or blank. The result is that there is no way of correcting it in the target system because the original wrong value is lost.

Data cleansing routines can identify wrong values but generally cannot correct them. They can correct values only through synonym lists or correlation against tables showing valid combinations of values. Most of the time they identify a wrong value but cannot correct it. The only alternative is to change the value to unknown (NULL in most systems) or to reject the row in which the bad value is contained. If it is rejected, they can either leave it out of the target (creating a bigger problem) or manually investigate the value and reenter it into the target after correction.

Leaving rows out has multiple problems. First, you are losing some data. The correct data in other data elements of these rows may be more important to the target than the result you get by leaving the entire row out. Another problem is that you may create a structural problem relative to other rows in the same or other tables. For example, if you reject an order header, the order detail records in a different file will become orphans. Rejecting one row may have the effect of causing many other rows to be rejected later, when referential constraints are enforced upon load.

Transforms

Transforms are routines that change the representation of a value while not changing the content. For example, in one system, GENDER may be represented by 1 and 2, whereas in the target it is represented by M and F. The transformation routine simply translates 1s to M and 2s to F. Transformation routines can be simple or complex. Failure to properly specify transformations can result in correct data in the source becoming wrong data in the target.

Two areas that commonly introduce problems are code page crossovers and currency translation. For example, if you are building a central data warehouse from data coming from divisions around the world, you need to pay special attention to the effects either of these may have on the outcome. Code page conversions can lose special country codes that translate the special characters of another country into the USA code page. This may generate a name value that looks invalid to the target system.

Another problem sometimes overlooked is where a source system stores a dollar value as an integer. The application programs "know" that they are supposed to insert the decimal point after position 2 (effectively dividing the stored value by 100). A transform needs to occur when this data is moved. If the analyst fails to recognize the convention, the values are moved over as if they were full dollar amounts.

Loading

The last step of the data movement process is loading data. You would think that this step would be a safe bet and that all of the damage that can be done has already been done to the data. Assuming you have properly defined the target database, you only need to deal with a few issues.

You need to ensure that all of the data gets loaded in order for the target to be in a consistent state. This sounds simple. However, if data is coming from dozens of sources, you need to ensure that all of the data from all sources gets there and gets loaded together to create a complete target system. One of the important issues of data warehouse and data mart design is ensuring that this is done and that the data is not released to the user community until a complete load is accomplished.

Another issue is what you do with the data that is rejected upon load due to data type, key definition, referential constraint, or procedure violations. As in the case of data cleansing, you must decide if you substitute a value to make the structure work, leave rows out of the target, or investigate the cause and enter the correct values.

The quantity of rejections will determine the best way to handle this issue. A good data movement process will have found all of these values before they get to the load process.

This section has shown that there are many opportunities to make mistakes in the process of extracting data from one system and moving it to another. The majority of these mistakes are mistakes of omission. The developer of the processes does not have a correct or complete set of information about the source systems in order to know everything that needs to be done. The scary part of this is that we continually plow through projects with very low-quality metadata for the source systems and try to fix or reconcile the problems after the fact.

The other problem is that when we look at the data in the target system and determine that it is of low quality we instinctively blame the source systems instead of considering the movement processes as the culprit. Since source systems are difficult to change, little is generally done about the situation. Often the target system is just not used because of a lack of confidence in the data. Much of the time, the target system quality can be improved immensely through fixing the data movement processes.

Clearly, the more effort you make in completing the first two steps properly the fewer mistakes you make in the other steps. Mistakes result in inaccurate data in the target systems. Eliminating the mistakes makes the target systems more accurate and more usable.

Data Integration

The previous section covered what can go wrong when data is taken from one system to another. Data integration projects that fetch data from source systems and use them in new applications have all the same problems. They may be only fetching tiny amounts of data for each use. However, the data in the source systems, no matter how small, must be properly extracted, cleansed, transformed, and put into a form (usually a message) the target application understands. Virtually all of the topics covered for batch, bulk data movement apply to integration.

Failure to recognize this and do everything that needs to be done leads to inaccurate data being presented to new applications from correct databases. This can lead to incorrect processing of transactions because of wrong data values being presented to the integration transactions. This has a large potential for negative impacts on the business conducted through them.

Using Data

The final area where data goes wrong is the process of putting it into business objects such as reports, query result screens, portal-accessible windows, and the subsequent use of this by business professionals.

Data may be accurate, but if users do not understand the meaning of the data or the context within which it is presented, their interpretation and use of the data may be inaccurate. This again points to the problem of not having a good metadata repository that is maintained 100% accurately all of the time. This repository should contain information on what each data element represents, how the data within it is encoded, how to interpret special values, the source of data, the time periods last updated, and the quality levels last known for this data.

Quality levels are important because they can let users judge whether the data satisfies their needs or not. They should be able to determine from the repository the probable inaccurate rate within the data. This can be derived from periodic assessment, data monitoring tools, or both.

The user should have easy access to this information. What is needed is either a comprehensive help system on the business objects or an easy-to-use corporate portal on corporate data stores. If users have to call someone or jump through hoops to find out information about their data, they will just not do it. They will make their own assumptions about what the data represents and use it accordingly. Unfortunately, most companies do not have a good metadata repository and certainly do not have a good mechanism for non-IT professionals to access the information they do have.

Scope of Problems

By the time a report has been generated from a data warehouse, there have been a lot of opportunities for the data to go wrong. The errors just accumulate from initial data creation through decay, data movement, and use. No wonder so many data decision support systems are judged failures.

Some practitioners take comfort in believing that data accuracy problems are greater in legacy systems than they are in systems built more recently, systems on a relational base, or systems implemented through a packaged application. The truth is that there are many opportunities to create problems in all of these.

Relational systems are not immune to errors. This is particularly true of relational systems built in the 1980s. Much of the current protection capability of relational systems was introduced gradually throughout the 1980s. This included date/time data types, triggers, procedures, and referential constraints. Many of the older systems do not use these capabilities and thus leave themselves open to inaccurate data.

Packaged applications do not protect against bad data. They provide a framework for collecting and storing data, but only the using company generates and is responsible for the data. The packaged application cannot ensure that all data is collected, that all values are accurate, or that the fields are being used as intended. Using companies often make local customization decisions in order to force the packaged application to fit their way of doing business. This can lead to problems in extracting, moving, and interpreting data.

Closing Remarks

Inaccurate data gets into databases at a number of points and for a variety of reasons. Any program to improve data accuracy must address the issues across the entire spectrum of opportunities for error.

Data can be entered mistakenly, can be deliberately entered inaccurately, can be the result of system errors, can decay in accuracy, can be turned into inaccurate data through moving and restructuring, and can be turned into wrong information when inappropriately reported or used. Understanding all of these areas will make data quality assurance professionals more expert in analyzing data and processes for inaccuracies.

The common theme throughout this chapter is that knowledge about your data is the key to successful assessment, movement of data, and use of data. There is no substitute for a sound knowledge base of information about the data. Most metadata repositories fall short of the need. If a process similar to that shown in is used vigorously, updating the metadata repository at all stages, higher data accuracy will result for those making decisions from the data.

The area of metadata repositories is ripe for new development, meaningful standardization, and widespread deployment. It should emerge as the next most important technology corporations demand after they have made significant progress on data quality. The business case for repository projects will emerge as the cost of not having them (via inhibiting data quality improvement efforts) becomes more clear. The business value of data quality initiatives will generally depend on the ability of the corporation to maintain an accurate metadata repository. To the extent that it does not, it will not reap the maximum value from data quality efforts.

13 comments:

Unknown said...


Read all the related Posts:


Basic of VBScript Language for QTP

Introduction to QTP (QuickTest Professional) Part2

Introduction to QTP (QuickTest Professional) Part3

Introduction to QTP (QuickTest Professional) Part4

Basic of VBScript Language for QTP

ISTQB

ISTQB Practice Test Part 1
ISTQB Practice Test Part 2
ISTQB Practice Test Part 3
ISTQB Practice Test Part 4
ISTQB Practice Te st Part 5
ISTQB Practice Test Part 6
ISTQB Practice Test Part 7
ISTQB Practice Test Part 8
ISTQB Practice Test Part 9

GASCONDI said...

Supposedly you can request that online databases permanently remove or correct the information you have identified as inaccurate. Surprisingly, you have to provide them with more information just to have yours removed and even after all that, your data repeatedly pops back up on these sites...

Unknown said...
This comment has been removed by the author.
Unknown said...

Data quality exists only when business users have a consistently high level of confidence in the accuracy,Adaptive products help organizations maintain high Data Quality and establish a “standard” root- cause identification process.

Unknown said...

I really need to setup myself up on Wellsphere. I haven’t even done an “about me” section yet. Perhaps tonight.
Data Quality Assurance

Unknown said...

I really need to setup myself up on Wellsphere. I haven’t even done an “about me” section yet. Perhaps tonight.
Data Quality Assurance

Unknown said...

Thank you so much for sharing such informative post.
Data Quality

Unknown said...

Data Quality Software
Great Article.

Digital Marketing India said...

Great article it is helpful and improves data entry skills Data. Thanks for sharing this informative article with us.

Outsource Data Entry Services in India

Unknown said...

Why should you choose us as your outsourced partner for Amazon Product Listing Services. Well, to begin there is no match to the quality and accuracy level of our work. Our team will come up with customized solutions as per the client’s needs and budget. Mail - info@123edata.com, Call- 1-8772841032, 9958399732

qualdo said...

Thank you for your valuable info

ml model performance monitoring


ml model monitoring


machine learning monitoring tools

ml monitoring on azure


data quality on azure


data quality on aws


data quality monitoring tools


piperr




idbssoftware said...

Awesome post!!!

payroll software

visitor management software

mobile app development

mobile app development company

android app development company

ios app development company

vehicle tracking software

fleet management software

idbssoftware said...

payroll software