Discovering Data Quality Rules
Fei Chiang, University of Toronto, fchiang@cs.toronto.edu
Renée J. Miller, University of Toronto, miller@cs.toronto.edu
Poor data quality continues to be a mainstream issue for many organizations. Having erroneous, duplicate or incomplete data leads to ineffective marketing, operational inefficiencies, inferior customer relationship management, and poor business decisions. It is estimated that dirty data costs US businesses over $600 billion a year [11]. There is an increased need for effective methods to improve data quality and to restore consistency.
Dirty data often arises due to changes in the use and perception of the data, and to violations of integrity constraints (or the lack of such constraints). Integrity constraints, meant to preserve data consistency and accuracy, are defined according to domain-specific business rules. These rules define relationships among a restricted set of attribute values that are expected to be true under a given context. For example, an organization may have rules such as: (1) all new customers will receive a 15% discount on their first purchase and preferred customers receive a 25% discount on all purchases; and (2) for US customer addresses, the street, city and state functionally determine the zipcode. Deriving a complete set of integrity constraints that accurately reflects an organization's policies and domain semantics is a primary task towards improving data quality.
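To make the connection to dependencies concrete, the two example rules can be sketched in dependency notation (a hedged illustration; the attribute names customer_type, discount, country, street, city, state and zipcode are not taken from any particular schema). Rule (1) binds constant values under a condition, while rule (2) is a functional dependency that holds only in the context of US addresses:

\[ (\textit{customer\_type} = \text{new}) \Rightarrow (\textit{discount} = 15\%), \qquad (\textit{customer\_type} = \text{preferred}) \Rightarrow (\textit{discount} = 25\%) \]
\[ (\textit{country} = \text{US}):\; [\textit{street}, \textit{city}, \textit{state}] \rightarrow [\textit{zipcode}] \]

Rules of this conditional form are the kind of context-dependent constraints considered below.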
To address this task, many organizations employ consultants to develop a data quality management process. This process involves examining the current data instance, identifying existing integrity constraints and dirty records, and developing new constraints. These new constraints are normally developed in consultation with users who have specific knowledge of the business policies that must be enforced. This effort can take a considerable amount of time. Furthermore, there may exist domain-specific rules in the data that users are not aware of, but that can be useful for enforcing semantic data consistency. When such rules are not explicitly enforced, the data may become inconsistent.
Identifying inconsistent values is a fundamental step in the data cleaning process. Records may contain inconsistent values that are clearly erroneous or that are only potentially dirty. Values that are clearly incorrect are normally easy to identify (e.g., a 'husband' who is a 'female'). Data values that are potentially incorrect are not as easy to disambiguate (e.g., a 'child' whose yearly 'salary' is '$100K'). The unlikely co-occurrence of such values makes them candidates for being dirty. Further semantic and domain knowledge may be required to determine the correct values.
For example, Table 1 shows a sample of records from a 1994 US Adult Census database [4] that contains records of citizens and their work class (CLS), education level (ED), marital status (MR), occupation (OCC), family relationship (REL), gender (GEN), and whether their salary (SAL) is above or below $50K.
Dirty data is thus a serious problem for businesses, leading to incorrect decision making, inefficient daily operations, and ultimately wasted time and money. It often arises when domain constraints and business rules, meant to preserve data consistency and accuracy, are enforced incompletely or not at all in application code.
In this work, we propose a new data-driven tool that can be used within an organization's data quality management process to suggest possible rules, and to identify conformant and non-conformant records. Data quality rules are known to be contextual, so we focus on the discovery of context-dependent rules. Specifically, we search for conditional functional dependencies (CFDs), that is, functional dependencies that hold only over a portion of the data. The output of our tool is a set of functional dependencies together with the context in which they hold (for example, a rule stating that, for CS graduate courses, the course number and term functionally determine the room and instructor). Since the input to our tool will likely be a dirty database, we also search for CFDs that almost hold. We return these rules together with the non-conformant records (as these are potentially dirty records).
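As a sketch of the notation (following the standard presentation of a CFD as an embedded functional dependency plus a pattern tableau; the attribute names dept, level, course#, term, room and instructor are illustrative), the course rule above might be written as

\[ \varphi = \big( [\textit{dept}, \textit{level}, \textit{course\#}, \textit{term}] \rightarrow [\textit{room}, \textit{instructor}], \; T_p \big), \qquad T_p = (\text{CS}, \ \text{graduate}, \ \_, \ \_ \ \| \ \_, \ \_) \]

where the constants CS and graduate fix the context in which the embedded dependency must hold, and each '_' is a wildcard matching any value. A CFD that almost holds can then be read as one whose embedded dependency is violated by only a small fraction of the tuples matching its pattern.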
We present effective algorithms for discovering CFDs and dirty values in a data instance. Our discovery algorithm searches for minimal CFDs among the data values and prunes redundant candidates. No universal objective measures of data quality or data quality rules are known. Hence, to avoid returning an unnecessarily large number of CFDs and to return only those that are most interesting, we evaluate a set of interest metrics and present comparative results using real datasets. We also present an experimental study showing the scalability of our techniques.
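The following is a minimal, brute-force sketch in Python of what one step of such a discovery process can look like. It is not the pruning-based search described above: it conditions on a single attribute only, and the function and parameter names (discover_conditional_rules, almost_holds, min_support, min_conf) are illustrative. It simply finds constant contexts in which an embedded functional dependency almost holds and reports the non-conformant tuples for each such context.

from collections import Counter, defaultdict

def almost_holds(tuples, lhs, rhs, min_conf=0.95):
    # Check whether the FD lhs -> rhs "almost holds" on a list of dicts.
    # For each distinct lhs value, keep the most frequent rhs value; tuples
    # that disagree with it are treated as non-conformant.
    groups = defaultdict(list)
    for t in tuples:
        groups[tuple(t[a] for a in lhs)].append(t)

    violations = []
    for group in groups.values():
        counts = Counter(tuple(t[a] for a in rhs) for t in group)
        majority = counts.most_common(1)[0][0]
        violations.extend(t for t in group
                          if tuple(t[a] for a in rhs) != majority)

    conf = 1 - len(violations) / len(tuples) if tuples else 1.0
    return conf >= min_conf, conf, violations

def discover_conditional_rules(tuples, cond_attr, lhs, rhs,
                               min_support=5, min_conf=0.95):
    # Group tuples by the constant value of the conditioning attribute and
    # keep the contexts in which the embedded FD lhs -> rhs almost holds.
    contexts = defaultdict(list)
    for t in tuples:
        contexts[t[cond_attr]].append(t)

    rules = []
    for value, group in contexts.items():
        if len(group) < min_support:
            continue  # too little evidence to suggest a rule
        holds, conf, bad = almost_holds(group, lhs, rhs, min_conf)
        if holds:
            rules.append({"context": {cond_attr: value},
                          "fd": (list(lhs), list(rhs)),
                          "confidence": round(conf, 3),
                          "non_conformant": bad})
    return rules

On a table like the census sample above, one could ask, for instance, under which values of MR (marital status) the dependency [REL] -> [GEN] holds; each surviving context is reported together with its confidence and its violating records, which are the candidates for being dirty.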