<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Data density assessment using classification techniques</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Sergio Pío Alvarez</string-name>
          <email>sergiop@fing.edu.uy</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Adriana Marotta</string-name>
          <email>amarotta@fing.edu.uy</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Libertad Tansini</string-name>
          <email>libertad@fing.edu.uy</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Universidad de la República</institution>
          ,
          <addr-line>Montevideo</addr-line>
          ,
          <country country="UY">Uruguay</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>There is general agreement among data quality researchers that completeness is one of the most important data quality dimensions. In particular, data density can be a crucial factor in data processing and decision-making tasks. Most techniques for evaluating data quality with respect to density are limited to counting null values. However, density is not only about null values: a not-null value where there should be a null one degrades the quality of data too. Moreover, the existence of a null value does not necessarily imply a data quality problem. In this work we present a technique based on data mining for data quality assessment. Our proposal consists of creating a classification model from available data containing null and not-null values, and then using that model to assess whether a particular attribute of a record should or should not have a null value. This technique allows us to evaluate whether a null value is an error, is correct, or is uncertain, and likewise whether a not-null value is acceptable, is an error (it should be null), or is uncertain.</p>
      </abstract>
      <kwd-group>
        <kwd>Data Quality</kwd>
        <kwd>Density</kwd>
        <kwd>Null Values</kwd>
        <kwd>Data Mining</kwd>
        <kwd>Classification</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>
        The importance of Data Quality (DQ) in all kinds of information systems is widely
recognized. If data do not have an appropriate quality level, the main business
processes may be affected and lead to wrong decisions. DQ is a multifaceted
concept, since it is defined in terms of a set of dimensions [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. There is general
agreement among DQ researchers and practitioners that completeness is one of the
most important DQ dimensions [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. Although there are different conceptions of
what completeness means [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ], it usually involves two factors: coverage and density. If
the real world is composed of entities, each of them described by attributes, then
coverage refers to how many entities are represented in the dataset, while density
refers to how many attributes are known for each entity.
      </p>
      <p>
        Once the relevant attributes for an entity are selected, density is usually understood as
not having missing values for them. In relational databases a missing value is
represented by the special value 'null'. Techniques for density assessment have
traditionally been limited to counting not-null values, assuming that missing values imply
data quality problems. However, it is important to understand why a value is missing
for an attribute of an entity: it could be that no value applies to that entity, or the value
could be genuinely missing. We claim that, just as a null value can be a density
problem, a not-null value where there should be a null one is also a density problem [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ].
      </p>
      <p>The purpose of Data Mining (DM) is to discover hidden knowledge within large
amounts of data. DM spans many techniques, classification, clustering and
association analysis being the most common ones. Classification is a technique aimed at
assigning entities to one of a set of predefined categories called classes.
Classification algorithms build a model from a set of previously classified entities,
and then use the model to classify new entities whose class is not known.</p>
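      <p>The build-then-classify cycle can be illustrated with a minimal sketch (our own toy example in Python, using an invented nearest-neighbour rule and made-up data, not the technique proposed later in this paper):</p>

```python
# Minimal illustration of classification: learn from labelled
# entities, then assign a class to an unseen one (toy data).

def train(labelled):
    # The "model" of a 1-nearest-neighbour classifier is simply
    # the stored training set.
    return list(labelled)

def classify(model, entity):
    # Assign the class of the closest known entity.
    def dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    nearest = min(model, key=lambda rec: dist(rec[0], entity))
    return nearest[1]

# Entities described by two numeric attributes, with known classes.
training = [((1.0, 1.0), "A"), ((1.2, 0.8), "A"), ((5.0, 5.1), "B")]
model = train(training)
print(classify(model, (4.8, 5.0)))  # a new, unclassified entity -> "B"
```

      <p>Any classifier follows the same contract: a model induced from labelled entities, then applied to entities whose class is unknown.</p>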
      <p>The goal of this work is to propose a technique for density assessment using DM
classification concepts.</p>
      <p>The main contribution of this work is to bring into discussion the idea that null
values should not always be taken as density problems, and that not-null values could
be density problems.</p>
      <p>The rest of the document is organized as follows: in Section 2 we present related
work, in Section 3 we present the proposal for assessing data density, in Section 4 we
summarize some experiments, and in Section 5 we present the conclusions.</p>
    </sec>
    <sec id="sec-2">
      <title>Related Work</title>
      <p>
        Missing values are usually classified into three categories [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]: Missing Completely
At Random (MCAR), which means that there is no pattern that explains why values
are missing; Missing At Random (MAR), which means that a pattern relating missing
values for an attribute to some other attributes can be found; and Missing Not At
Random (MNAR), which means that missing values for an attribute are related to the
attribute itself but not to other attributes. At first glance MNAR looks like MCAR,
because by looking only at the data it cannot be told which case applies; moreover, missing data
are almost never MCAR [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ]. Many techniques tackle the problem of
density assessment, most of them focused on the MAR and MCAR scenarios [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ], and
most of them try to solve the “null-value problem”, assuming that if there is a null
value then there is a problem that must be resolved, usually by means of value
imputation or data deletion ([
        <xref ref-type="bibr" rid="ref4">4</xref>
        ][
        <xref ref-type="bibr" rid="ref6">6</xref>
        ][
        <xref ref-type="bibr" rid="ref8">8</xref>
        ]).
      </p>
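      <p>The difference between MCAR and MAR can be made concrete with a small simulation (illustrative Python with synthetic data; the attribute names are our own):</p>

```python
import random

random.seed(7)

# Synthetic records: (age, income); income may go missing.
records = [(random.randint(18, 80), random.gauss(50_000, 15_000))
           for _ in range(10_000)]

def mcar(rec):
    # MCAR: income is dropped with a fixed probability,
    # independent of any attribute.
    age, income = rec
    return (age, None) if random.random() < 0.2 else rec

def mar(rec):
    # MAR: income is dropped more often for younger people,
    # i.e. missingness depends on the *observed* age attribute.
    age, income = rec
    p = 0.4 if age < 30 else 0.05
    return (age, None) if random.random() < p else rec

def missing_rate(data, pred):
    subset = [r for r in data if pred(r)]
    return sum(r[1] is None for r in subset) / len(subset)

mcar_data = [mcar(r) for r in records]
mar_data = [mar(r) for r in records]

# Under MCAR the rate is (roughly) the same for young and old;
# under MAR it differs sharply between the two groups.
print(missing_rate(mcar_data, lambda r: r[0] < 30))
print(missing_rate(mar_data, lambda r: r[0] < 30))
print(missing_rate(mar_data, lambda r: r[0] >= 30))
```

      <p>MNAR cannot be detected this way, because there the missingness depends on the unobserved value itself rather than on any other attribute.</p>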
      <p>
        Data Quality Mining (DQM) is defined as the application of DM techniques to
measure and improve DQ. The underlying idea is that models of different
behaviours within the data can be used not only to understand the data but also to detect
anomalies, hence pointing out possible quality problems [
        <xref ref-type="bibr" rid="ref10 ref11 ref12 ref13 ref14 ref15 ref9">9,10,11,12,13,14,15</xref>
        ].
Roughly, DQM consists of two phases: in the first phase a model capturing the
characteristics of the data is induced from a training dataset, and in the second phase the
model is used to assess the quality of another dataset by detecting deviations. Data that
deviate from the model are candidates for having some kind of data quality problem.
      </p>
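      <p>The two DQM phases can be sketched as follows (a deliberately simple illustration in Python: a per-column frequency profile stands in for whatever model a real DQM tool would induce):</p>

```python
from collections import Counter

# Phase 1: induce a simple model (per-column value frequencies)
# from a training dataset assumed to be of good quality.
def induce_model(rows):
    cols = len(rows[0])
    return [Counter(row[c] for row in rows) for c in range(cols)]

# Phase 2: flag records that deviate from the model, i.e. records
# containing values rarely (or never) seen during training.
def deviations(model, rows, min_support=2):
    flagged = []
    for row in rows:
        if any(model[c][v] < min_support for c, v in enumerate(row)):
            flagged.append(row)
    return flagged

training = [("UY", "ES"), ("UY", "ES"), ("UY", "EN"), ("AR", "ES")]
to_assess = [("UY", "ES"), ("XX", "ES")]  # "XX" never seen: suspicious
print(deviations(induce_model(training), to_assess))  # -> [('XX', 'ES')]
```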
    </sec>
    <sec id="sec-3">
      <title>Data Density Assessment through Classification Techniques</title>
      <p>As we pointed out in the previous section, there are many techniques for solving the
problem of null values. We propose to take a step back and first evaluate whether a
null value is really a density problem, as well as whether a not-null value could also be a density
problem. Our method takes a dataset and estimates, for each value, the probability that it is
correct (whether null or not). For this task we use a classification technique that marks
each value as 'probably null', 'probably not null' or 'uncertain', based on the other values in
the dataset.</p>
      <p>
        Let D be a dataset and A an attribute (not a key). For each record r of D, the value
of A is either null or not-null. The algorithm is as follows:
1. Drop all keys from the dataset. This step is important because most
classification algorithms would generate singleton classification rules of the form
'IF key=X THEN A=[null|not-null]' for each value X of the key, and those
rules are trivial and possibly wrong.
2. Among the remaining attributes, identify those that should be used to assess the
attribute A; this task can be challenging, as it constitutes
a whole research area named feature learning [
        <xref ref-type="bibr" rid="ref16">16</xref>
        ].
3. Preprocess non-discrete values as follows: replace text
values with 'SOMEVALUE', since it does not matter which value the attribute has,
only whether it is null or not; all other non-discrete
values can be used as-is, but most classification algorithms work better when
all attributes are discrete. If the selected classification algorithm requires
attributes to be discrete, then some discretization technique should be applied
[
        <xref ref-type="bibr" rid="ref17">17</xref>
        ].
4. Replace the value of the attribute A in each record as follows: if the value of
the attribute A is not-null, then replace it by the text 'NOTNULL';
otherwise replace the null value by the text 'NULL'. This defines two classes,
null and not-null, and each record is assigned one of them.
5. Apply some classification algorithm to build a classification model M using
the attribute A as the class. Any classification technique can be used, but
decision-tree based algorithms are easier to interpret. Usually the model is
built from a clean training dataset and then applied to another dataset to verify
how well it predicts the class. In our case both the training dataset
and the test dataset are the whole input dataset, because we assume that
density problems are exceptions and are therefore diluted in the mass of data while the
model is being built.
6. Apply the classification model built in the previous step to the dataset. For
each record the classification model outputs a prediction (either 'NULL'
or 'NOTNULL') and a decimal value in the range [0,1] which is the
confidence of the prediction.
7. Evaluate each record of the dataset again, as follows:
• If the classification confidence for the record is above a predefined
threshold, then the assigned class is accepted as correct, leading to two
scenarios:
◦ If the assigned class matches the value of the attribute for the record,
then the record does not present a density problem.
◦ Otherwise the record can be taken as wrong, and there
is a density problem.
• Otherwise the scenario is uncertain, because it cannot be told whether there
should or should not be a null value for the attribute in that
record, and it is a candidate for manual revision.
      </p>
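      <p>The steps above can be sketched end-to-end as follows (an illustrative Python toy: a frequency table over feature tuples stands in for a real classifier such as a decision tree, and the records are invented):</p>

```python
# Sketch of the assessment pipeline (steps 1-7) on toy records.
# A class-frequency table per feature tuple plays the role of the
# classification model; the data are invented.
from collections import Counter, defaultdict

records = [
    {"id": 1, "status": "alive", "death_date": None},
    {"id": 2, "status": "alive", "death_date": None},
    {"id": 3, "status": "alive", "death_date": None},
    {"id": 4, "status": "dead", "death_date": "2001-05-01"},
    {"id": 5, "status": "dead", "death_date": "2003-11-12"},
    {"id": 6, "status": "alive", "death_date": "1999-01-01"},  # suspicious
]

def preprocess(rec, key, target, text_cols=()):
    # Steps 1, 3 and 4: drop the key, map free-text values to
    # 'SOMEVALUE' (only their null/not-null status matters),
    # keep discrete values as-is, and map the target to its class.
    feats = []
    for k, v in rec.items():
        if k in (key, target):
            continue
        feats.append(("NULL" if v is None else "SOMEVALUE")
                     if k in text_cols else v)
    cls = "NOTNULL" if rec[target] is not None else "NULL"
    return tuple(feats), cls

def build_model(rows):
    # Step 5: class frequencies per feature tuple (toy "model").
    model = defaultdict(Counter)
    for feats, cls in rows:
        model[feats][cls] += 1
    return model

def assess(model, feats, cls, threshold=0.66):
    # Steps 6 and 7: predict with a confidence in [0,1],
    # then compare the prediction against the actual class.
    counts = model[feats]
    pred, n = counts.most_common(1)[0]
    confidence = n / sum(counts.values())
    if confidence < threshold:
        return "uncertain"
    return "ok" if pred == cls else "density problem"

rows = [preprocess(r, "id", "death_date") for r in records]
model = build_model(rows)
for rec, (feats, cls) in zip(records, rows):
    print(rec["id"], assess(model, feats, cls))
```

      <p>Record 6, an 'alive' person with a not-null date of decease, is flagged because the model predicts the class NULL for that feature tuple with confidence 0.75, above the 0.66 threshold.</p>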
      <p>Although the threshold is defined beforehand, it can be adjusted based on the
results of evaluating the whole dataset. We usually set the threshold to
0.66; if too many records fall above the threshold, it
can be increased, while if too few do, it can be decreased.</p>
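      <p>Inspecting how a candidate threshold splits the dataset is straightforward (a small illustrative helper; the confidence values are invented):</p>

```python
# Fraction of records whose classification confidence clears a
# given threshold; useful when tuning it (toy confidence values).
def above(confidences, threshold):
    return sum(c >= threshold for c in confidences) / len(confidences)

confidences = [0.95, 0.80, 0.70, 0.68, 0.60, 0.55, 0.90, 0.40]
for t in (0.5, 0.66, 0.8):
    print(t, above(confidences, t))
```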
    </sec>
    <sec id="sec-4">
      <title>Experiments</title>
      <p>We applied the proposed method to a laboratory case with an extensive combination
of attributes: attributes functionally dependent on others, attributes not functionally
dependent but related to others, and attributes not related in any way to other
attributes. Some attributes had null values, some of which were real density problems
while others were not, as the null value was the correct one (for example, non-deceased
people should have a null value for the date-of-decease attribute); conversely,
some attributes had not-null values where there should have been null ones (there were alive
people with not-null date-of-decease values), which were also data density problems.</p>
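      <p>Such a laboratory dataset can be seeded with known density problems along these lines (our own toy generator, not the authors' actual data):</p>

```python
import random

random.seed(42)

# Generate toy person records, deliberately seeding both kinds of
# density problems: nulls that should be values, and values that
# should be nulls (e.g. alive people with a date of decease).
def make_record(i):
    deceased = random.random() < 0.3
    rec = {"id": i,
           "deceased": deceased,
           "date_of_decease": "1990-01-01" if deceased else None}
    glitch = random.random()
    if glitch < 0.05 and deceased:
        rec["date_of_decease"] = None          # missing where required
    elif glitch < 0.10 and not deceased:
        rec["date_of_decease"] = "1990-01-01"  # value where null expected
    return rec

people = [make_record(i) for i in range(1000)]
# A record is a density problem when deceased-status and the
# nullness of date_of_decease disagree.
problems = [p for p in people
            if p["deceased"] == (p["date_of_decease"] is None)]
print(len(problems))
```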
      <p>
        We used the Weka software [
        <xref ref-type="bibr" rid="ref18 ref19">18,19</xref>
        ], and chose the Random Tree algorithm because
with the default configuration it produces good decision trees.
      </p>
      <p>In some cases we found that, with a threshold of 0.66, 2/3 of the records were
classified with a high confidence, 1/4 of which were assigned a class different from
the real value of the assessed attribute. This means that at least 1/6 of the whole
dataset presented a density problem regarding the assessed attribute (null
values where there should not be any, or conversely). On the other hand, 3/4 of the records
classified with confidence above the threshold were assigned the class matching the
assessed attribute, so it is almost certain that those records did not present any kind of
density problem regarding the attribute. For the remaining 1/3 of the records the
algorithm could not determine whether the assessed attribute should have a null value or not;
these records are candidates for further inspection.</p>
    </sec>
    <sec id="sec-5">
      <title>Conclusions</title>
      <p>We believe that data density assessment should not be only about fighting null values, since the
presence of not-null values where there should be null values is also a data quality
problem. Moreover, the presence of a null value should not be considered a density
problem when there should indeed be a null value in that place. In this sense we propose a
simple approach for data density assessment using classification techniques, oriented
to evaluating when null values and not-null values may imply data quality
problems. The presented method helps in two complementary ways to achieve a high-density
database: it can detect when a null value or a not-null value may be wrong, and
the model built can be used to prevent degradation of the dataset's quality by
checking data before they are inserted into the dataset.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Batini</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Scannapieco</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          :
          <source>Data and Information Quality: Dimensions, Principles and Techniques</source>
          . Springer International Publishing (
          <year>2016</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Mendes Sampaio</surname>
            ,
            <given-names>S. de F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Dong</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Sampaio</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          :
          <article-title>DQ2S, A framework for data quality-aware information management</article-title>
          .
          <source>Exp. Systems with Applications</source>
          , Vol.
          <volume>42</volume>
          , No.
          <issue>21</issue>
          , (
          <year>2015</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Batini</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Cappiello</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Francalanci</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Maurino</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          :
          <article-title>Methodologies for data quality assessment and improvement</article-title>
          .
          <source>ACM Computing Surveys</source>
          , Vol.
          <volume>41</volume>
          , No.
          <issue>3</issue>
          , Article 16
          (
          <year>2009</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>Horton</surname>
            ,
            <given-names>N.J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kleinman</surname>
            ,
            <given-names>K.P.</given-names>
          </string-name>
          :
          <article-title>Much Ado About Nothing: A Comparison of Missing Data Methods and Software to Fit Incomplete Data Regression Models</article-title>
          .
          <source>The American Statistician</source>
          , Vol.
          <volume>61</volume>
          , No.
          <issue>1</issue>
          (
          <year>2007</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <surname>Little</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Rubin</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          :
          <article-title>Statistical Analysis with Missing Data</article-title>
          . John Wiley &amp; Sons, New York (
          <year>1987</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <surname>Newman</surname>
            ,
            <given-names>D.A.</given-names>
          </string-name>
          :
          <article-title>Missing Data: Five Practical Guidelines</article-title>
          .
          <source>Organizational Research Methods</source>
          , Vol.
          <volume>17</volume>
          No.
          <issue>4</issue>
          (
          <year>2014</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <surname>Chiarini Tremblay</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Dutta</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>VanderMeer</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          :
          <article-title>Using Data Mining Techniques to Discover Bias Patterns in Missing Data</article-title>
          .
          <source>ACM Journal of Data and Information Quality</source>
          , Vol.
          <volume>2</volume>
          , No.
          <issue>1</issue>
          (
          <year>2010</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <surname>Sessions</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gieves</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Perrine</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          :
          <article-title>A Technique for Incorporating Data Missing Not at Random (MNAR) into Bayesian Networks</article-title>
          .
          <source>Int. Conf. on Information Quality</source>
          (
          <year>2016</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
          <string-name>
            <surname>Hipp</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Güntzer</surname>
            ,
            <given-names>U.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Grimmer</surname>
            ,
            <given-names>U.</given-names>
          </string-name>
          :
          <article-title>Data quality mining, making a virtue of necessity</article-title>
          .
          <source>Proceedings of the 6th ACM SIGMOD workshop on research issues in data mining and knowledge discovery</source>
          (
          <year>2001</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          10.
          <string-name>
            <surname>Grüning</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          :
          <article-title>Data quality mining: employing classifiers for assuring consistent datasets</article-title>
          .
          <source>Proceedings of the 3rd International ICSC Symposium</source>
          (
          <year>2007</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          11.
          <string-name>
            <surname>Grimmer</surname>
            ,
            <given-names>U.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hinrichs</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          :
          <article-title>A methodological approach to data quality management supported by data mining</article-title>
          .
          <source>Proc. of Int. Conference on Information Quality</source>
          (
          <year>2001</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          12.
          <string-name>
            <surname>Farzi</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Baraani-Dastjerdi</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          :
          <article-title>Data quality measurement using data mining</article-title>
          .
          <source>International Journal of Computer Theory and Engineering</source>
          , Vol.
          <volume>2</volume>
          , No.
          <issue>1</issue>
          (
          <year>2010</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          13.
          <string-name>
            <surname>Luebbers</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Grimmer</surname>
            ,
            <given-names>U.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Jarke</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          :
          <article-title>Systematic development of data mining-based data quality tools</article-title>
          .
          <source>Proceedings of the 29th VLDB Conference</source>
          (
          <year>2003</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          14.
          <string-name>
            <surname>Vázquez Soler</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Yankelevich</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          :
          <article-title>Quality mining: a data mining based method for data quality evaluation</article-title>
          .
          <source>International Conference on Information Quality</source>
          (
          <year>2003</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          15.
          <string-name>
            <surname>Dasu</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Johnson</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          :
          <article-title>Hunting of the snark: finding data glitches using data mining methods</article-title>
          .
          <source>Proceedings of Int. Conference on Information Quality</source>
          (
          <year>1999</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          16.
          <string-name>
            <surname>Bengio</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Courville</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Vincent</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          :
          <article-title>Unsupervised Feature Learning and Deep Learning: A Review and New Perspectives</article-title>
          . arXiv:1206.5538v3
          (
          <year>2012</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          17.
          <string-name>
            <surname>Dougherty</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kohavi</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Sahami</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          :
          <article-title>Supervised and Unsupervised Discretization of Continuous Features</article-title>
          .
          <source>Machine Learning: Proc. of the 12th. International Conference</source>
          (
          <year>1995</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          18.
          <string-name>
            <surname>Hall</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Frank</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Holmes</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Pfahringer</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Reutemann</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Witten</surname>
            ,
            <given-names>I.H.</given-names>
          </string-name>
          :
          <article-title>The WEKA Data Mining Software: An Update</article-title>
          .
          <source>SIGKDD Explorations</source>
          , Vol.
          <volume>11</volume>
          , No.
          <issue>1</issue>
          (
          <year>2009</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          19. Machine Learning Group at the University of Waikato.
          <article-title>Weka 3: Data Mining Software in Java</article-title>
          . http://www.cs.waikato.ac.nz/~ml/weka/ (
          <year>2015</year>
          ) (last access: 2017/03/24)
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>