Data density assessment using classification techniques

Sergio Pío Alvarez1, Adriana Marotta1, and Libertad Tansini1
1 Universidad de la República, Montevideo, Uruguay
{sergiop,amarotta,libertad}@fing.edu.uy

Abstract. There is general agreement among data quality researchers that completeness is one of the most important data quality dimensions. In particular, data density can be a crucial factor in data processing and decision making tasks. Most techniques for data quality evaluation regarding density are limited to counting null values. However, density is not only about null values: a not-null value where there should be a null one degrades the quality of data too. Moreover, the existence of a null value does not necessarily imply a data quality problem. In this work we present a technique based on the application of data mining techniques to data quality assessment. Our proposal consists of creating a classification model from available data containing null and not-null values, and then using that model to assess whether a particular attribute of a record should or should not hold a null value. This technique allows us to evaluate whether a null value is an error, correct, or uncertain, and likewise whether a not-null value is acceptable, an error (it should be null), or uncertain.

Keywords: Data Quality, Density, Null Values, Data Mining, Classification.

1 Introduction

The importance of Data Quality (DQ) in all kinds of information systems is widely recognized. If data do not have the appropriate quality level, the main business processes can be affected, leading to wrong decisions. DQ is a multifaceted concept, since it is defined in terms of a set of dimensions [1]. There is general agreement among DQ researchers and practitioners that completeness is one of the most important DQ dimensions [2]. Although there are different conceptions of what completeness means [3], it usually involves two factors: coverage and density. If the real world is composed of entities, each of them described by attributes, then coverage is about how many entities are represented in the dataset, while density is about how many attributes are known for each entity. Once the relevant attributes for an entity are selected, density is usually regarded as not having missing values for them. In relational databases a missing value is represented with the special value 'null'. Techniques for density assessment have traditionally been limited to counting not-null values, assuming that missing values imply data quality problems. However, it is important to understand why a value is missing for an attribute of an entity: the attribute may simply not apply to the entity, or the value may be genuinely missing. We claim that, just as a null value can be a density problem, a not-null value where there should be a null one is also a density problem [4].

The purpose of Data Mining (DM) is to discover hidden knowledge within large amounts of data. DM spans many techniques, classification, clustering, and association analysis being the most common ones. Classification is a technique aimed at assigning entities to one of a set of predefined categories called classes. Classification algorithms build a model from a set of entities previously classified, and then use the model to classify new entities for which the class is not known. The goal of this work is to propose a technique for density assessment using DM classification concepts.
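As a toy illustration of this build-then-classify pattern (a minimal sketch assuming Python with scikit-learn; the entities and classes are made up):

# Toy illustration of the build-then-classify pattern described above,
# assuming scikit-learn; entities and classes are made up.
from sklearn.tree import DecisionTreeClassifier

labeled_entities = [[1, 0], [0, 1], [1, 1]]  # entities with a known class
known_classes = ["A", "B", "A"]

model = DecisionTreeClassifier().fit(labeled_entities, known_classes)

new_entity = [[0, 0]]             # entity whose class is not known
print(model.predict(new_entity))  # the model assigns it a class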
The main contribution of this work is to put forward the idea that null values should not always be taken as density problems, and that not-null values can be density problems. The rest of the document is organized as follows: in Section 2 we present related work, in Section 3 we present the proposal for assessing data density, in Section 4 we summarize some experiments, and in Section 5 we draw conclusions.

2 Related Work

Missing values are usually classified into three categories [5]: Missing Completely At Random (MCAR), which means that there is no pattern that explains why values are missing; Missing At Random (MAR), which means that a pattern relating missing values of an attribute to some other attributes can be found; and Missing Not At Random (MNAR), which means that missing values of an attribute are related to the attribute itself but not to other attributes. At first glance MNAR looks like MCAR because, looking only at the data, it cannot be determined which case applies; moreover, missing data are almost never MCAR [6]. There are many techniques that tackle the problem of density assessment, most of them focused on the MAR and MCAR scenarios [7]. Most of them try to solve the "null-value problem", assuming that if there is a null value then there is a problem that must be resolved, usually by means of value imputation or data deletion ([4][6][8]).

Data Quality Mining (DQM) is defined as the application of DM techniques to measure and improve DQ. The underlying idea is that modelling different behaviours within data can be used not only to understand the data but also to detect anomalies, hence pointing out possible quality problems [9,10,11,12,13,14,15]. Roughly, DQM consists of two phases: in the first phase a model capturing the characteristics of the data is induced from a training dataset, and in the second phase the model is used to assess the quality of another dataset by detecting deviations. Data that deviate from the model are candidates for exhibiting some kind of data quality problem.

3 Data Density Assessment through Classification Techniques

As we pointed out in the previous section, there are many techniques for solving the problem of null values. We propose to take a step back and first evaluate whether a null value really is a density problem, as well as whether a not-null value could also be one. Our method takes a dataset and estimates, for each value (whether null or not), the probability that it is correct. For this task we use a classification technique that marks each value as 'probably null', 'probably not null' or 'uncertain' based on the other values in the dataset.

Let D be a dataset and A an attribute (not a key). For each record r of D, the value of A is either null or not-null. The algorithm is as follows:

1. Drop all keys from the dataset. This step is important because most classification algorithms would otherwise generate one classification rule of the form 'IF key=X THEN A=[null|not-null]' for each value X of the key, and those rules are trivial and possibly wrong.
2. Among the remaining attributes, identify those that should be used to assess the attribute A; this task can be challenging, as it constitutes a whole research area named feature learning [16].
3. Preprocess non-discrete values as follows: replace text values with the constant 'SOMEVALUE', since it does not matter which value the attribute holds, only whether it is null or not. All other non-discrete values can be used as-is, although most classification algorithms work better when all attributes are discrete; if the selected classification algorithm requires discrete attributes, some discretization technique should be applied [17].
4. Replace the value of the attribute A in each record as follows: if the value of A is not-null, replace it with the text 'NOTNULL'; otherwise replace the null value with the text 'NULL'. This defines two classes, NULL and NOTNULL, and each record is assigned one of them.
5. Apply some classification algorithm to build a classification model M using the attribute A as the class. Any classification technique can be used, but decision-tree based algorithms are easier to interpret. Usually a model is built from a clean dataset and then applied to another dataset to verify how well it predicts the class. In our case both the training dataset and the test dataset are the whole input dataset, because we assume that density problems are exceptions and are therefore averaged out while the model is being built.
6. Apply the classification model built in the previous step to the dataset. For each record the model outputs a prediction (either 'NULL' or 'NOTNULL') and a decimal value in the range [0,1], which is the confidence of the prediction.
7. Evaluate each record of the dataset again, as follows:
   - If the classification confidence for the record is above a predefined threshold, the assigned class is accepted as correct, leading to two scenarios:
     - If the assigned class matches the value of the attribute in the record, the record does not present a density problem.
     - Otherwise the record can be taken as wrong: there is a density problem.
   - Otherwise the scenario is uncertain, because it cannot be determined whether the attribute should or should not be null for that record, and the record is a candidate for manual revision.

Although the threshold is defined beforehand, it can be adjusted based on the results of evaluating the whole dataset. We usually set the threshold to 0.66; if too many records fall above it, it can be increased, while if too few do, it can be decreased.
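The following is a minimal sketch of steps 1-7 in Python, assuming pandas and scikit-learn; the experiments in Section 4 use Weka's Random Tree, so the DecisionTreeClassifier below is only a stand-in, and the parameter and encoding choices are illustrative.

# Minimal sketch of steps 1-7, assuming pandas and scikit-learn.
# The experiments used Weka's Random Tree; DecisionTreeClassifier is
# a stand-in here, and all parameter choices are illustrative.
import numpy as np
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

def assess_density(df, attr, keys, threshold=0.66):
    # Step 1: drop key attributes so the tree cannot memorise one
    # trivial rule per key value.
    feats = df.drop(columns=keys + [attr]).copy()

    # Step 3: for text (and other non-numeric) attributes only presence
    # matters ('SOMEVALUE' vs 'NULL'), encoded here as 1/0; numeric
    # attributes keep their values, with a sentinel standing in for null.
    for col in feats.columns:
        if pd.api.types.is_numeric_dtype(feats[col]):
            feats[col] = feats[col].fillna(-1.0)  # crude sentinel, sketch only
        else:
            feats[col] = feats[col].notna().astype(int)

    # Step 4: the class is whether the assessed attribute is null.
    y = np.where(df[attr].isna(), "NULL", "NOTNULL")

    # Step 5: train on the whole dataset, assuming density problems are
    # rare exceptions. min_samples_leaf keeps the tree from memorising
    # every record, since training and test data coincide here.
    model = DecisionTreeClassifier(min_samples_leaf=5).fit(feats, y)

    # Step 6: predicted class and the confidence of each prediction.
    proba = model.predict_proba(feats)
    pred = model.classes_[proba.argmax(axis=1)]
    conf = proba.max(axis=1)

    # Step 7: confident mismatches are density problems; records below
    # the threshold remain uncertain and go to manual revision.
    verdict = np.where(conf < threshold, "uncertain",
                       np.where(pred == y, "ok", "density problem"))
    return pd.DataFrame({"class": y, "predicted": pred,
                         "confidence": conf, "verdict": verdict},
                        index=df.index)

For instance, assess_density(people, attr='date_of_decease', keys=['person_id']) (hypothetical table and column names) would flag living persons carrying a not-null date of decease, provided the remaining attributes exhibit a recognizable pattern. The min_samples_leaf setting matters precisely because training and test data coincide: a fully grown tree would memorise every record, predict every class with full confidence, and flag nothing.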
4 Experiments

We applied the proposed method to a laboratory case with an extensive combination of attributes: attributes functionally dependent on others, attributes not functionally dependent on but related to others, and attributes not related in any form to other attributes. Some attributes had null values; some of these were real density problems, but others were not, since the null value was the correct one (for example, non-deceased people should have a null value for the date-of-decease attribute). Conversely, some attributes had not-null values where there should have been nulls (there were living people with not-null date-of-decease values), which were also data density problems. We used the Weka software [18,19] and chose the Random Tree algorithm because, with its default configuration, it produces good decision trees.

In some cases we found that, with a threshold of 0.66, 2/3 of the records were classified with high confidence, and 1/4 of those were assigned a class different from the actual value of the assessed attribute. This means that at least 2/3 × 1/4 = 1/6 of the whole dataset presented a density problem regarding the assessed attribute (null values where there should not be any, or conversely). On the other hand, 3/4 of the records classified with confidence above the threshold were assigned the class matching the assessed attribute, so it is almost certain that those records (half of the dataset) presented no density problem regarding the attribute. For the remaining 1/3 of the records the algorithm could not determine whether the assessed attribute should have a null value or not; these records are candidates for further inspection.

5 Conclusions

We believe that density assessment should not be limited to fighting null values, since the presence of a not-null value where there should be a null one is also a data quality problem. Conversely, the presence of a null value should not be considered a density problem when a null value belongs in that place. In this sense we propose a simple approach to data density assessment using classification techniques, oriented to evaluating when null values and not-null values may imply data quality problems. The presented method helps achieve a high-density database in two complementary ways: it can detect when a null or not-null value may be wrong, and the model built can be used to prevent degradation of dataset quality by checking data before inserting it into the dataset.
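A minimal sketch of this preventive use, building on the assess_density sketch of Section 3 and inheriting all of its assumptions:

# Sketch of a pre-insert guard reusing assess_density from Section 3:
# append the candidate record, re-run the assessment, and reject the
# insert if its verdict is a density problem. Retraining on every
# insert is simplistic; in practice the model would be cached.
def check_before_insert(df, record, attr, keys, threshold=0.66):
    candidate = pd.concat([df, pd.DataFrame([record])], ignore_index=True)
    report = assess_density(candidate, attr, keys, threshold)
    return report["verdict"].iloc[-1] != "density problem"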
6 References

1. Batini, C., Scannapieco, M.: Data and Information Quality: Dimensions, Principles and Techniques. Springer International Publishing (2016).
2. Mendes Sampaio, S. de F., Dong, C., Sampaio, P.: DQ2S - A framework for data quality-aware information management. Expert Systems with Applications, Vol. 42, No. 21 (2015).
3. Batini, C., Cappiello, C., Francalanci, C., Maurino, A.: Methodologies for data quality assessment and improvement. ACM Computing Surveys, Vol. 41, No. 3, Article 16 (2009).
4. Horton, N.J., Kleinman, K.P.: Much Ado About Nothing: A Comparison of Missing Data Methods and Software to Fit Incomplete Data Regression Models. The American Statistician, Vol. 61, No. 1 (2007).
5. Little, R., Rubin, D.: Statistical Analysis with Missing Data. John Wiley & Sons, New York (1987).
6. Newman, D.A.: Missing Data: Five Practical Guidelines. Organizational Research Methods, Vol. 17, No. 4 (2014).
7. Chiarini Tremblay, M., Dutta, K., VanderMeer, D.: Using Data Mining Techniques to Discover Bias Patterns in Missing Data. ACM Journal of Data and Information Quality, Vol. 2, No. 1 (2010).
8. Sessions, V., Gieves, J., Perrine, S.: A Technique for Incorporating Data Missing Not at Random (MNAR) into Bayesian Networks. International Conference on Information Quality (2016).
9. Hipp, J., Güntzer, U., Grimmer, U.: Data quality mining - making a virtue of necessity. Proceedings of the 6th ACM SIGMOD Workshop on Research Issues in Data Mining and Knowledge Discovery (2001).
10. Grüning, F.: Data quality mining: employing classifiers for assuring consistent datasets. Proceedings of the 3rd International ICSC Symposium (2007).
11. Grimmer, U., Hinrichs, H.: A methodological approach to data quality management supported by data mining. Proceedings of the International Conference on Information Quality (2001).
12. Farzi, S., Baraani Dastjerdi, A.: Data quality measurement using data mining. International Journal of Computer Theory and Engineering, Vol. 2, No. 1 (2010).
13. Luebbers, D., Grimmer, U., Jarke, M.: Systematic development of data mining-based data quality tools. Proceedings of the 29th VLDB Conference (2003).
14. Vázquez Soler, S., Yankelevich, D.: Quality mining: a data mining based method for data quality evaluation. International Conference on Information Quality (2003).
15. Dasu, T., Johnson, T.: Hunting of the snark: finding data glitches using data mining methods. Proceedings of the International Conference on Information Quality (1999).
16. Bengio, Y., Courville, A., Vincent, P.: Unsupervised Feature Learning and Deep Learning: A Review and New Perspectives. arXiv:1206.5538v3 (2012).
17. Dougherty, J., Kohavi, R., Sahami, M.: Supervised and Unsupervised Discretization of Continuous Features. Machine Learning: Proceedings of the 12th International Conference (1995).
18. Hall, M., Frank, E., Holmes, G., Pfahringer, B., Reutemann, P., Witten, I.H.: The WEKA Data Mining Software: An Update. SIGKDD Explorations, Vol. 11, No. 1 (2009).
19. Machine Learning Group at the University of Waikato: Weka 3 - Data Mining Software in Java. http://www.cs.waikato.ac.nz/~ml/weka/ (2015) (last access: 2017/03/24).