<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Assessing the Quality of Unstructured Data: An Initial Overview</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Cornelia Kiefer</string-name>
          <email>cornelia.kiefer@gsame.uni-stuttgart.de</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Graduate School of Excellence Advanced Manufacturing Engineering</institution>
          ,
          <addr-line>Nobelstr. 12, 70569 Stuttgart</addr-line>
          ,
          <country country="DE">Germany</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>In contrast to structured data, unstructured data such as texts, speech, videos and pictures do not come with a data model that enables a computer to use them directly. Nowadays, computers can interpret the knowledge encoded in unstructured data using methods from text analytics, image recognition and speech recognition. Therefore, unstructured data are used increasingly in decision-making processes. But although decisions are commonly based on unstructured data, data quality assessment methods for unstructured data are lacking. We consider data analysis pipelines built upon two types of data consumers: human consumers, who usually come at the end of the pipeline, and non-human / machine consumers (e.g., natural language processing modules such as part-of-speech taggers and named entity recognizers), which mainly operate at intermediate stages. We define data quality of unstructured data via (1) the similarity of the input data to the data expected by these consumers of unstructured data and via (2) the similarity of the input data to the data representing the real world. We deduce data quality dimensions from the elements in analytic pipelines for unstructured data and characterize them. Finally, we propose automatically measurable indicators for assessing the quality of unstructured text data and give hints towards an implementation.</p>
      </abstract>
      <kwd-group>
        <kwd>quality of unstructured data</kwd>
        <kwd>quality of text data</kwd>
        <kwd>data quality dimensions</kwd>
        <kwd>data quality assessment</kwd>
        <kwd>data quality metrics</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>
        In recent years the methods for knowledge extraction from unstructured data
have improved and unstructured data sources such as texts, speech, videos and
pictures have gained importance. Nowadays, sentiment analysis of social media
data leads to decisions in marketing campaign design, images are classified
automatically and unstructured information can be retrieved easily using search
engines [
        <xref ref-type="bibr" rid="ref19 ref6">6, 19</xref>
        ]. But methods to determine the quality of these data are
lacking. To make good decisions, the quality of the underlying data must
be determined. Similar to the concepts, frameworks and systems developed for
structured data, we need means to ensure high quality of unstructured data. We
focus on data consumers of unstructured data and define them as humans or
non-humans / machines (e.g., algorithms) that use or process the data. The
quality of the data consumed by the final consumer, such as a human who needs
to derive a decision from the data, depends on the quality assessed for earlier
consumers. This is especially true for unstructured data, which is analyzed in a
pipeline.
      </p>
      <p>The remainder of this paper is organized as follows: First, we motivate
research in assessing the quality of unstructured data in section 2. In section 3
we define data quality of unstructured data. Furthermore, we describe the data
quality dimensions interpretability, relevancy and accuracy. Based on this, in
section 4 we present data quality indicators for unstructured text data. In section 5
we discuss related work; finally, we conclude and highlight future work
in section 6.</p>
    </sec>
    <sec id="sec-2">
      <title>Motivation</title>
      <p>
        Low data quality is dangerous because it can lead to wrong or missing decisions,
strategies and operations. It can slow down innovation processes, and the losses that
low data quality causes organizations are estimated at billions of
dollars per year [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ]. Bad data is a huge problem: 60% of enterprises suffer from
data quality issues, 10-30% of data in organizational databases is inaccurate, and
individual reports of incomplete, inaccurate and ambiguous organizational data
are numerous [
        <xref ref-type="bibr" rid="ref13 ref18">13, 18</xref>
        ].
      </p>
      <p>
        The most important information sources in organizations, such as
workers, managers and customers, produce unstructured data. About 90% of all data
outside of organizations and still more than 50% inside are estimated to be
unstructured [
        <xref ref-type="bibr" rid="ref20">20</xref>
        ]. In the era of Big Data the amount of data is increasing immensely,
and filtering relevant and high-quality data becomes more and more important.
Organizations need to leverage the information hidden in unstructured data to stay
competitive [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ]. Therefore, the quality of texts, pictures, videos and speech data
needs to be ensured. But while the need for data quality assessment and
improvement strategies for unstructured data has been recognized (e.g., [
        <xref ref-type="bibr" rid="ref2 ref23">2, 23</xref>
        ]), no concrete
approach to assessing the quality of unstructured data has been suggested yet. We
fill this gap and provide data quality dimensions and executable indicators for
unstructured data. By focusing on automatically calculable indicators of data
quality, we aim to support real-time analytics of stream data (such as social
media data) with real-time data quality assessment techniques, both running
concurrently.
      </p>
    </sec>
    <sec id="sec-3">
      <title>Definition of Data Quality and of Data Quality Dimensions for Unstructured Data</title>
      <p>
        The definitions of data quality in [
        <xref ref-type="bibr" rid="ref24 ref30">24, 30</xref>
        ] focus on structured data which is
consumed by humans. They define data quality via the similarity of the data
D to the data set D’ which is expected by the data consumer [
        <xref ref-type="bibr" rid="ref24">24</xref>
        ] and via
the fitness for use by the data consumer [
        <xref ref-type="bibr" rid="ref30">30</xref>
        ]. We extend the meaning of these
existing definitions by pointing out that, in the case of unstructured data, machine
consumers and many different consumers in a pipeline need to be considered as
well as human end consumers. Furthermore, data quality needs to be defined in
terms of accuracy. Accuracy describes the similarity between the input data and
the data which would represent the real world. This definition of accuracy
is equal to existing ones, e.g. [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ].
      </p>
      <p>The quality of data has a multi-faceted nature, and many lists of data quality
dimensions and indicators for structured data exist (see section 5). All of the dimensions
that were found to be relevant in the literature, such as completeness, timeliness
and accuracy, are relevant to structured as well as unstructured data. From these
dimensions we selected three which are relevant to mining processes
on unstructured data.</p>
      <p>We deduce the dimensions from the elements involved in mining processes
on unstructured data: The input data, the real world, data consumers, a task
and the knowledge extracted. Based on these elements, the quality of data D
can be determined by comparing it to three classes of ideal data sets: the data as
expected by the current data consumer DC (we will call this the Interpretability
dimension), the data as it would be optimal for the task DT (Relevancy) and
the data set which is representing the real world DW (Accuracy). The deduced
dimensions are also in line with the data quality definitions stated above. In
Fig. 1, we illustrate the three data sets in the context of an ideal mining process
on unstructured data. Ideally, D would match the real world DW and would
be exactly the same as the data expected by the first data consumer. Since
unstructured data is analyzed in a pipeline, the output of the first data consumer is
input to the second and should therefore match the data expected by the second
data consumer and so on (as indicated in Fig. 1 with the analysis pipeline). An
ideal result of the mining process can be DT (which is still bound to D, DW
and DC and is usually equal to the data expected by the final consumer). By
basing the data quality dimensions on the elements involved in a mining process
on unstructured data, we focus on the quality of unstructured data which is
analyzed automatically in analytics pipelines.</p>
      <p>In the following, we describe the deduced data quality dimensions in more
detail:
Interpretability can be assessed as the degree of similarity between D and
DC . For example, consider a statistical preprocessor which is used to segment
a text into sentences. If it was trained on Chinese texts and is used to segment
English texts, D and DC are not similar and data quality is low. Since often
many different data consumers are involved in interpreting unstructured data,
this dimension is crucial for unstructured data.</p>
      <p>Relevancy can be assessed as the similarity between D and DT . Usually
DT will be very similar to the DC of the end consumer (which we will call DCE )
who wants to use the data to accomplish the task. While differences between
DT and the data expected by the end consumer DCE indicate problems, these
are not related to data quality and we will therefore assume DT and DCE to be
equivalent. As an example for relevancy, consider a worker on the shop floor who
is searching for a solution for an urgent problem with a machine in a knowledge
base. If he only finds information on the price of the machine, the data quality
of the result is low because it does not help him with his task of solving the
problem.</p>
      <p>We assess the Interpretability and Relevancy of a data set D by its
similarity to the data set DC and DCE which is expected by the data consumers.
Expectations differ between human and machine consumers. What a human data
consumer expects, depends on factors such as his knowledge, experiences and goals.
Expectations of machine consumers are very precise and depend on the
algorithm, training data, statistical models, rules and knowledge resources available.
This holds for all types of unstructured data. As illustrated in Fig. 2,
unstructured data such as textual documents may be consumed by machines or humans
and the data set DC or DCE depends on factors such as the native language
of the human and the statistical language models available to the machine. For
example, a human data consumer expects a manual for a machine to be in his
native language or in a language he knows. He also expects the manual to explain
the machine in a way he understands with his technical expertise. When a
machine consumes unstructured data, similar factors influence the interpretability
and more precisely the similarity of the input data and the data expected. The
knowledge of a machine consumer can be represented by machine-readable
domain knowledge encoded in semantic resources (such as taxonomies), by training
data, statistical models or by rules. As an example, imagine a machine consumer
that uses a simple rule-based approach to the extraction of proper names from
German text data, where all uppercased words are extracted. This machine
consumer expects a data set DC with correct upper and lowercased words. If D is
all lower-cased, DC and D are not similar and the data is not fit for use by that
data consumer.</p>
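      <p>A minimal sketch of this casing example in Python (a hypothetical, naive rule-based extractor written for illustration, not taken from any cited work; the example sentences are invented):</p>
      <preformat>
```python
import re

def extract_proper_names(text):
    # Naive illustrative rule: treat capitalized, non-sentence-initial
    # words as proper names. (For real German text this over-extracts,
    # since all German nouns are capitalized.)
    tokens = text.split()
    return [t for t in tokens[1:] if re.match(r"^[A-ZÄÖÜ][a-zäöüß]+$", t)]

# Input with correct casing: the rule finds the uppercased words.
print(extract_proper_names("gestern traf Anna Herrn Müller in Stuttgart"))
# All-lowercased input: D and DC are not similar, nothing is extracted.
print(extract_proper_names("gestern traf anna herrn müller in stuttgart"))
```
      </preformat>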
      <p>Unstructured data is usually consumed by many different data consumers
with many different data sets DC expected. In an analytics pipeline, the raw
data is consumed and processed by several consumers in a row and the output
of the previous consumer is the input to the next consumer and so on. Data
quality problems at intermediate consumers may be automatically propagated
to following consumers. By considering all intermediate (machine and/or human)
consumers, the exact points for data quality improvement can be determined. In
Fig. 3 we illustrate an analytics pipeline involving three machine consumers and
one human end consumer of the data. Machine consumers are in this illustration
represented by three high level machine consumers which are present in many
analytic pipelines of unstructured data: preprocessors, classifiers and visualizers.
For example, as depicted in Fig. 3, the output of the preprocessor is input to
automatic classification and the results are then visualized. The visualizations
are finally the input to a human consumer of the data, who e.g., derives decisions
from it.</p>
      <p>
        As for structured data, the Accuracy of data and information is a very
important data quality dimension. It is hard to measure, because the data set
DW , which represents the real world, is often not known, and creating it involves
the work of human experts, is time-consuming, costly or even impossible. The
solution is usually to abstract away from details, e.g., by using rules to check
general conformance of data points with expected patterns (e.g., e-mail addresses
containing an @ sign), or to build DW manually for a part of the data set only
(see [
        <xref ref-type="bibr" rid="ref28 ref29">28, 29</xref>
        ]). DW may be represented by a so-called gold standard data set
with the accurate values annotated manually by human experts. For example,
statistical classifiers are evaluated by comparing the predictions of the statistical
classifier with those in a gold standard with manually annotated classes. Since
DW is not known for all data sets D, many statistical classifiers cannot be
evaluated, and the number of problems with accuracy in big databases can only
be approximated.
      </p>
    </sec>
    <sec id="sec-5">
      <title>Data Quality Indicators for Unstructured Text Data</title>
      <p>
        A data quality dimension can be measured by exploitation of data quality
indicators. Data quality indicators must be transferable to a number in the interval
[0,1] where 0 indicates low data quality and 1 indicates high data quality (this is
similar to the standard characterizations of data quality metrics, such as in [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]).
Therefore, indicators can e.g., be represented by yes/no-questions, proportions
of data items which have a certain characteristic or by evaluation metrics. The
standard approaches to more concrete indicators for the quality of structured
data involve counting the number of missing values, wrong values or the number
of outliers. For the case of unstructured data, different indicators are needed. We
compiled an extensive list of indicators for all three dimensions. The definition
of indicators is based on the dimensions discussed in the previous section and on
related work in natural language processing, information retrieval, automated
assessment and machine learning (see section 5.2). Here, we limit the indicators
presented to those which are (1) automatically measurable and (2) applicable
to unstructured text data. Furthermore, we selected indicators which we have
already implemented or which are straightforward to implement (since libraries
with good documentation are available), so that the indicators can be verified
in experiments in near-future work. In table 1, we describe each dimension with
these more concrete indicators of data quality.
      </p>
      <p>While the concepts behind the indicators confidence, precision, accuracy and
quality of gold annotations are applicable to all types of unstructured data which
are processed by statistical machine learning components, the remaining
indicators are text specific. With a different definition of noisy data and fit of training
data, the concepts may be transferred to other data types as well, e.g. measuring
the similarity between input pictures and training data pictures or measuring
the percentage of noisy data, defined as the percentage of background noise, in
speech.</p>
      <p>In the following we describe the indicators in more detail and give hints
towards possible implementations:</p>
      <p>
        The first indicator fit of training data directly follows from the definition for
Interpretability we gave in section 3, when considering statistical classifiers as
data consumers. The quality of text data with respect to a machine consumer
can be measured by calculating the similarity of the input text data and the
data expected by the data consumer. In the case of statistical classifiers such as a
part of speech tagger (which automatically assigns parts of speech to each token
such as a word in a text) or sentiment classifier (which automatically detects
opinions in texts and assigns e.g., the classes positive, negative and neutral to
texts), DC may be represented by the training data. For the case of unstructured
text data the similarity can be measured using text similarity measures. For
example, consider the situation where Twitter data is consumed by a statistical
classifier such as a part of speech tagger that was trained on newspaper texts.
By the definition of interpretability used in this work, data quality is lower
than for another tagger that was trained on text data from Twitter as well.
Examples for measures for this indicator are text similarity measures such as
Cosine Similarity and Greedy String Tiling which are e.g. implemented in the
DKPro Similarity package (see [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ]). Using the DKPro Similarity library in Java,
two lists of tokens can be easily compared and a similarity score in the interval
[0,1] can be calculated, following the instructions on the web site (see footnote 1).
      </p>
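      <p>A minimal sketch of this indicator in plain Python (using a hand-rolled bag-of-words cosine similarity instead of the DKPro Similarity library, with invented example corpora):</p>
      <preformat>
```python
import math
from collections import Counter

def cosine_similarity(tokens_a, tokens_b):
    # Cosine similarity of two bag-of-words vectors; the result lies in [0,1].
    a, b = Counter(tokens_a), Counter(tokens_b)
    dot = sum(a[t] * b[t] for t in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

# Fit of training data: compare the input tokens D to the training tokens DC.
train = "the stock market rose sharply today".split()
tweet = "omg the market is so lit today lol".split()
news = "the bond market fell sharply today".split()
# Newspaper-like input is more similar to newspaper training data than a tweet.
print(cosine_similarity(train, news) > cosine_similarity(train, tweet))  # True
```
      </preformat>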
      <p>
        The second indicator, confidence, also focuses on data quality of text data as
perceived from the point of view of a statistical classifier. A statistical classifier
estimates the probabilities for each class from a fixed list of classes, given the
data. These probabilities are also called confidence values (for more details, see
[
        <xref ref-type="bibr" rid="ref12">12</xref>
        ]). If the probability of a classification decision is very high, confidence of
the statistical classifier is said to be high. Confidence is expressed as a number
in the interval [0,1] and may be used for measuring data quality. For example,
confidence measures are available and can be retrieved for the natural language
processing tools in OpenNLP (such as the tokenizer and part of speech tagger; see footnote 2),
a Java library for natural language processing which is heavily used in industry
applications because it has an Apache license. To get these confidence values,
follow the documentation of the OpenNLP library (see footnote 2, e.g., for the
      </p>
      <sec id="sec-5-1">
        <title>Footnotes: (1) https://dkpro.github.io/dkpro-similarity/ (2) https://opennlp.apache.org/</title>
        <p>part of speech tagger, just call the probs method which will return an array of
the probabilities for all tagging decisions).</p>
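        <p>A sketch of how such per-decision confidence values could be aggregated into a single quality score in [0,1] (the probability values below are invented; a real pipeline would obtain them from the tagger's probs method):</p>
        <preformat>
```python
def confidence_indicator(tag_probs):
    # Mean per-decision confidence of a statistical classifier, in [0,1].
    # tag_probs holds one probability per tagging decision, analogous to
    # the array returned by an OpenNLP tagger's probs method.
    if not tag_probs:
        return 0.0
    return sum(tag_probs) / len(tag_probs)

clean_sentence_probs = [0.99, 0.97, 0.98, 0.96]  # tagger is sure of itself
noisy_sentence_probs = [0.55, 0.41, 0.93, 0.38]  # many uncertain decisions
print(confidence_indicator(clean_sentence_probs))  # close to 1: high quality
print(confidence_indicator(noisy_sentence_probs))  # lower: D deviates from DC
```
        </preformat>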
        <p>
          The third indicator in the interpretability dimension is the percentage of
noisy data. This is a relevant indicator for human and machine consumers, since
reading a text is more difficult for a human if it is full of misspelled words,
ungrammatical sentences and abbreviations. Since most machine consumers of text
data expect clean text data such as newspaper texts, the degree of noisy data
also measures data quality from the viewpoint of such standard machine
consumers. The percentage of noisy data may be measured as the percentage of
sentences which cannot be parsed by an automatic syntax parser, unknown words,
punctuation, very long/short sentences, incorrect casing, special signs, urls, mail
addresses, emoticons, abbreviations, pause filling words, rare words or by the
percentage of spelling mistakes (the latter as already suggested by [
          <xref ref-type="bibr" rid="ref26">26</xref>
          ]).
Non-parsable sentences can be identified using an automatic syntax parser, such as the
parsers implemented in natural language processing libraries like OpenNLP
(see footnote 2) or the Natural Language Processing Tool Kit NLTK (see footnote 3). The
number of punctuation marks and of unknown words (e.g., defined as words unknown to
a standard part of speech tagger) may, e.g., be calculated using the standard
part of speech tagger implemented in NLTK (which has individual classes for
punctuation and unknown words). Very long/short sentences can be identified
using a tokenizer and a sentence segmenter from a natural language processing
library and by counting the automatically determined tokens and sentences.
Incorrect casing may be detected using supervised machine learning methods, such
as suggested in [
          <xref ref-type="bibr" rid="ref17">17</xref>
          ]. Regular expressions can be used to automatically identify
the percentage of special signs, urls, mail addresses, emoticons, abbreviations and
pause filling words in texts. Rare words can be identified internally by counting
all words that occur less than a specified number of times in the text corpus, by
counting words that are not found in a standard dictionary or a generated
dictionary (such as a dictionary generated from a very encompassing text corpus from
the domain). The number of spelling mistakes in a text corpus may be calculated
using the Python implementation PyEnchant (see footnote 4) or any other spelling correction
module. Most of the measures suggested for the indicator noisy data can be
implemented using the NLTK library which comes with very good documentation
and an active community (see footnote 3).
        </p>
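        <p>Some of these noise measures can be sketched with regular expressions alone (a simplified token classifier for illustration; the patterns below are rough assumptions, and a real implementation would follow the NLTK-based suggestions above):</p>
        <preformat>
```python
import re

NOISE_PATTERNS = [
    re.compile(r"^https?://\S+$"),             # urls
    re.compile(r"^\S+@\S+\.\S+$"),             # mail addresses
    re.compile(r"^[:;=8][-o*']?[)(\]dDpP]$"),  # a few common emoticons
    re.compile(r"^\W+$"),                      # punctuation / special signs
]

def noise_percentage(tokens):
    # Share of tokens matching any noise pattern, in [0,1] (1 = all noise).
    if not tokens:
        return 0.0
    noisy = sum(1 for t in tokens if any(p.match(t) for p in NOISE_PATTERNS))
    return noisy / len(tokens)

tweet = "check http://t.co/xyz :) !!! great machine".split()
print(noise_percentage(tweet))  # 3 of 6 tokens are noise: 0.5
```
        </preformat>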
        <p>
          But it is not sufficient if data is merely interpretable. Interpretable data which
is not relevant to the end data consumer and his goal is of low quality. Therefore,
its Relevancy needs to be calculated. For text data this can be done following
approaches already developed for information retrieval systems. The relevance
metric used in information retrieval systems determines the relevance of search
results with respect to the information need of the searcher. The information
need is captured via keywords or documents first and can then be compared
e.g., to the frequent keywords in the input texts (see [
          <xref ref-type="bibr" rid="ref16">16</xref>
          ] for the relevance
metric in information retrieval). Again, textual similarity measures such as cosine
        </p>
      </sec>
      <sec id="sec-5-2">
        <title>Footnotes: (3) http://www.nltk.org/ (4) http://pythonhosted.org/pyenchant/</title>
        <p>
          similarity are used to determine the similarity of the information need and a
text (as implemented in [
          <xref ref-type="bibr" rid="ref7">7</xref>
          ] and accessible via the well-documented DKPro
Similarity library, see footnote 1). Besides the frequent keywords, specificity can also
indicate the relevance of unstructured text data for the task a certain end
consumer wants to accomplish. The specificity of language in texts and speech can
be determined via the coverage of a domain-specific semantic resource which
contains all relevant technical terms. In the simplest version this would be a text
file with all domain words listed which is used to determine the percentage of
domain words in a corpus. Coverage of domain specific taxonomies may be e.g.,
calculated with a concept matcher such as the one presented in [
          <xref ref-type="bibr" rid="ref22">22</xref>
          ].
        </p>
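        <p>The simplest text-file version of the specificity measure described above might look as follows (the domain word list and example sentence are invented; in practice the list would be read from a file of technical terms):</p>
        <preformat>
```python
def domain_word_percentage(tokens, domain_words):
    # Percentage of corpus tokens found in the domain word list, in [0,1].
    if not tokens:
        return 0.0
    vocab = {w.lower() for w in domain_words}
    hits = sum(1 for t in tokens if t.lower() in vocab)
    return hits / len(tokens)

domain_words = ["spindle", "torque", "coolant", "axis"]
report = "the spindle overheated because the coolant pump failed".split()
print(domain_word_percentage(report, domain_words))  # 2 of 8 tokens: 0.25
```
        </preformat>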
        <p>
          If the data is interpretable and relevant, the remaining question is whether
it reflects the real world or not, that is whether it is accurate. The Accuracy of
unstructured text data may be indicated by evaluation metrics such as precision
and accuracy. These metrics compare the automatically annotated data to parts
of the data which represent the real world, such as manually annotated gold
standard corpora. Statistical classifiers are evaluated by comparing them to gold
standards and by determining how many of the classified entities really belong to
a class (precision) and the percentage of classification decisions that were correct
(accuracy), see [
          <xref ref-type="bibr" rid="ref16">16</xref>
          ]. The metrics precision and accuracy were already suggested
as indicators for text data quality by [
          <xref ref-type="bibr" rid="ref26">26</xref>
          ] and [
          <xref ref-type="bibr" rid="ref23">23</xref>
          ]. Furthermore, the quality
of gold annotations of training and test data is an indicator in the accuracy
dimension. These can be calculated according to [
          <xref ref-type="bibr" rid="ref10">10</xref>
          ] by measuring the
inter-rater agreement, which measures the number of times two or more annotators
agree. Evaluation metrics and inter-rater metrics are e.g. implemented in NLTK
(see footnote 3).
        </p>
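        <p>The evaluation metrics named above are straightforward to compute once gold labels exist; a sketch for one class of interest, with invented label sequences (the simple observed-agreement measure here stands in for the inter-rater metrics of the cited work):</p>
        <preformat>
```python
def precision(gold, predicted, cls):
    # Of all items predicted as cls, how many really belong to cls.
    pred_cls = [g for g, p in zip(gold, predicted) if p == cls]
    if not pred_cls:
        return 0.0
    return sum(1 for g in pred_cls if g == cls) / len(pred_cls)

def accuracy(gold, predicted):
    # Percentage of classification decisions that were correct.
    return sum(1 for g, p in zip(gold, predicted) if g == p) / len(gold)

def observed_agreement(annotator_a, annotator_b):
    # Simplest inter-rater measure: share of items labeled alike.
    return sum(1 for a, b in zip(annotator_a, annotator_b) if a == b) / len(annotator_a)

gold = ["NAME", "O", "NAME", "O", "O"]
predicted = ["NAME", "NAME", "NAME", "O", "O"]
print(precision(gold, predicted, "NAME"))  # 2 of 3 predicted names correct
print(accuracy(gold, predicted))           # 4 of 5 decisions correct: 0.8
print(observed_agreement(gold, predicted))
```
        </preformat>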
        <p>In this section we presented automatically measurable, executable indicators for text
data. Not all indicators presented here are relevant and
applicable in all cases: only few of the many statistical tools give access
to the confidence metric, and precision and accuracy can only be calculated
with access to gold test data.</p>
      </sec>
    </sec>
    <sec id="sec-6">
      <title>Related Work</title>
      <p>While research on the quality of structured data is numerous, the quality of
unstructured data has hardly been considered yet. We present related work in
the field of data quality in section 5.1 and list isolated methods useful in assessing
unstructured text data quality in section 5.2.
      </p>
      <p><bold>5.1 Related Work in Data Quality</bold></p>
      <p>
        Many frameworks and data quality dimensions dedicated to the quality of
structured data have been suggested (e.g. [
        <xref ref-type="bibr" rid="ref24 ref30">24, 30</xref>
        ]) and also special frameworks and
dimensions for social media data and big data were developed [
        <xref ref-type="bibr" rid="ref21 ref5">5, 21</xref>
        ]. In these
works, data quality dimensions are defined from a human end consumer’s point
of view and no automatic measures for the assessment of unstructured data are
given. Several sources [
        <xref ref-type="bibr" rid="ref2 ref23 ref26">2, 23, 26</xref>
        ] address the need for data quality measures on
unstructured data but none of them gives executable dimensions and indicators.
In these works, interesting starting points for quality dimensions and indicators
are defined, such as:
– The quality of technologies used to interpret unstructured data and the
author’s expertise [
        <xref ref-type="bibr" rid="ref23">23</xref>
        ]
– Accuracy, readability, consistency and accessibility [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]
– Precision and spelling quality [
        <xref ref-type="bibr" rid="ref26">26</xref>
        ]
No hints towards possible implementations of these dimensions and indicators
are suggested, though. As demanded in [
        <xref ref-type="bibr" rid="ref26">26</xref>
        ], we also support the view that
textual data quality needs to be measured for both, human consumers and machine
consumers. We have furthermore motivated the need to measure data quality
at every stage. This is also demanded in [
        <xref ref-type="bibr" rid="ref15 ref27">15, 27</xref>
        ]. A closely related idea is also
expressed in the concept of data provenance which aims at collecting the
information on all data sources and transformation or merging steps of data (see [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]).
      </p>
      <p><bold>5.2 Isolated Methods for Data Quality Assessment of Unstructured Text Data</bold></p>
      <p>
        In the definition of the quality indicators in this article we focused on
unstructured text data. Therefore, we limit the list of isolated methods to those
relevant for the assessment of textual data. For example, quite some work in the
field of natural language processing focuses on the interaction between textual
data characteristics and the performance of Natural Language Processing (NLP)
tools. In [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] the authors consider factors that affect the accuracy of automatic
text-based language identification (such as the size of the text fragment and the
amount of training data). Furthermore, work on correcting upper and
lowercasing of words in texts (re-casing), spelling correction, abbreviation expansion and
text simplification is related to our work (e.g., [
        <xref ref-type="bibr" rid="ref17">17</xref>
        ]). In the context of search
engines, the quality of the search results and of the data basis is discussed as
well [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ]. In automated assessment, methods to automatically assess the quality of
hand-written essays and short answers (e.g., student essays and answers to free
text questions) are developed (for a good overview, see [
        <xref ref-type="bibr" rid="ref31">31</xref>
        ]). Work on training
data selection in machine learning, which concerns choosing subsets of training data
that best fit the domain of the test set (e.g. [
        <xref ref-type="bibr" rid="ref25">25</xref>
        ]), is also related to our work.
The idea expressed in these works is similar to the idea behind the indicator
fit of training data, which we added to our list of indicators for unstructured
text data quality. However, we are the first to suggest the fit of training data
as a data quality indicator. Furthermore, we do not suggest using it for parts
of training data, as suggested in these works, but for choosing among different text
corpora.
      </p>
    </sec>
    <sec id="sec-7">
      <title>Conclusion and Future Work</title>
      <p>
        We listed dimensions and indicators for determining the quality of unstructured
data based on the basic elements of mining processes on unstructured data.
The indicators proposed are executable and easily transfer into a data quality
metric in the interval [0,1]. In future work we will determine the most suitable
implementations for the indicators and validate them in experiments. We will
furthermore explore how indicators may be combined to measure the overall
data quality of unstructured data, and how the improvement of data quality as
perceived by intermediate consumers influences data quality from the end
consumer's viewpoint.
      </p>
      <p>Acknowledgments. The authors would like to thank the German Research
Foundation (DFG) for financial support of this project as part of the
Graduate School of Excellence advanced Manufacturing Engineering (GSaME) at the
University of Stuttgart. Moreover, we thank B. Mitschang and L. Kassner for
important feedback.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <given-names>C.</given-names>
            <surname>Batini</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Barone</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Cabitza</surname>
          </string-name>
          , and
          <string-name>
            <given-names>S.</given-names>
            <surname>Grega</surname>
          </string-name>
          .
          <article-title>A data quality methodology for heterogeneous data</article-title>
          .
          <source>International Journal of Database Management Systems (IJDMS)</source>
          ,
          <volume>3</volume>
          (
          <issue>1</issue>
          ):
          <fpage>60</fpage>
          -
          <lpage>79</lpage>
          ,
          <year>2011</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <given-names>C.</given-names>
            <surname>Batini</surname>
          </string-name>
          and
          <string-name>
            <given-names>M.</given-names>
            <surname>Scannapieco</surname>
          </string-name>
          .
          <source>Data and Information Quality</source>
          . Springer International Publishing, Cham,
          <year>2016</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <given-names>G. R.</given-names>
            <surname>Botha</surname>
          </string-name>
          and
          <string-name>
            <given-names>E.</given-names>
            <surname>Barnard</surname>
          </string-name>
          .
          <article-title>Factors that affect the accuracy of text-based language identification</article-title>
          .
          <source>Computer Speech &amp; Language</source>
          ,
          <volume>26</volume>
          (
          <issue>5</issue>
          ):
          <fpage>307</fpage>
          -
          <lpage>320</lpage>
          ,
          <year>2012</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <given-names>P.</given-names>
            <surname>Buneman</surname>
          </string-name>
          and
          <string-name>
            <given-names>S. B.</given-names>
            <surname>Davidson</surname>
          </string-name>
          .
          <article-title>Data provenance - the foundation of data quality</article-title>
          .
          <year>2010</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <given-names>L.</given-names>
            <surname>Cai</surname>
          </string-name>
          and
          <string-name>
            <given-names>Y.</given-names>
            <surname>Zhu</surname>
          </string-name>
          .
          <article-title>The challenges of data quality and data quality assessment in the big data era</article-title>
          .
          <source>Data Science Journal</source>
          ,
          <volume>14</volume>
          (
          <issue>0</issue>
          ):
          <fpage>2</fpage>
          ,
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <given-names>F.</given-names>
            <surname>Camastra</surname>
          </string-name>
          and
          <string-name>
            <given-names>A.</given-names>
            <surname>Vinciarelli</surname>
          </string-name>
          .
          <article-title>Machine learning for audio, image and video analysis: Theory and applications</article-title>
          .
          <source>Advanced Information and Knowledge Processing</source>
          . Springer, London,
          <source>second edition</source>
          ,
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <given-names>D.</given-names>
            <surname>Bär</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Zesch</surname>
          </string-name>
          , and
          <string-name>
            <given-names>I.</given-names>
            <surname>Gurevych</surname>
          </string-name>
          .
          <article-title>Dkpro similarity: An open source framework for text similarity</article-title>
          .
          <source>In Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (System Demonstrations) (ACL</source>
          <year>2013</year>
          ), pages
          <fpage>121</fpage>
          -
          <lpage>126</lpage>
          , Stroudsburg, PA, USA,
          <year>2013</year>
          .
          Association for Computational Linguistics
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <given-names>D.</given-names>
            <surname>Dey</surname>
          </string-name>
          and
          <string-name>
            <given-names>S.</given-names>
            <surname>Kumar</surname>
          </string-name>
          .
          <article-title>Reassessing data quality for information products</article-title>
          .
          <source>Management Science</source>
          ,
          <volume>56</volume>
          (
          <issue>12</issue>
          ):
          <fpage>2316</fpage>
          -
          <lpage>2322</lpage>
          ,
          <year>2010</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
          <string-name>
            <given-names>C.</given-names>
            <surname>Feilmayr</surname>
          </string-name>
          .
          <article-title>Decision guidance for optimizing web data quality - a recommendation model for completing information extraction results</article-title>
          .
          <source>24th International Workshop on Database and Expert Systems Applications</source>
          , pages
          <fpage>113</fpage>
          -
          <lpage>117</lpage>
          ,
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          10.
          <string-name>
            <given-names>J. L.</given-names>
            <surname>Fleiss</surname>
          </string-name>
          and
          <string-name>
            <given-names>B.</given-names>
            <surname>Levin</surname>
          </string-name>
          .
          <article-title>The measurement of interrater agreement</article-title>
          . In J. L. Fleiss, B. Levin, and M. C. Paik, editors,
          <source>Statistical methods for rates and proportions</source>
          , Wiley series in probability and statistics, pages
          <fpage>598</fpage>
          -
          <lpage>626</lpage>
          . J. Wiley, Hoboken, N.J.,
          <year>2003</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          11.
          <string-name>
            <given-names>C.</given-names>
            <surname>Fox</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Levitin</surname>
          </string-name>
          , and
          <string-name>
            <given-names>T.</given-names>
            <surname>Redman</surname>
          </string-name>
          .
          <article-title>The notion of data and its quality dimensions</article-title>
          .
          <source>Inf. Process. Manage.</source>
          ,
          <volume>30</volume>
          (
          <issue>1</issue>
          ):
          <fpage>9</fpage>
          -
          <lpage>19</lpage>
          ,
          <year>1994</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          12.
          <string-name>
            <given-names>S.</given-names>
            <surname>Gandrabur</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Foster</surname>
          </string-name>
          , and
          <string-name>
            <given-names>G.</given-names>
            <surname>Lapalme</surname>
          </string-name>
          .
          <article-title>Confidence estimation for nlp applications</article-title>
          .
          <source>ACM Transactions on Speech and Language Processing (TSLP)</source>
          ,
          <volume>3</volume>
          (
          <issue>3</issue>
          ):
          <fpage>1</fpage>
          -
          <lpage>29</lpage>
          ,
          <year>2006</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          13.
          <string-name>
            <given-names>J.</given-names>
            <surname>Han</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Chen</surname>
          </string-name>
          , and
          <string-name>
            <given-names>J.</given-names>
            <surname>Wang</surname>
          </string-name>
          .
          <article-title>Web article quality ranking based on web community knowledge</article-title>
          .
          <source>Computing</source>
          ,
          <volume>97</volume>
          (
          <issue>5</issue>
          ):
          <fpage>509</fpage>
          -
          <lpage>537</lpage>
          ,
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          14.
          <string-name>
            <given-names>K.</given-names>
            <surname>Hartl</surname>
          </string-name>
          and
          <string-name>
            <given-names>O.</given-names>
            <surname>Jacob</surname>
          </string-name>
          .
          <article-title>Determining the business value of business intelligence with data mining methods</article-title>
          .
          <source>The Fourth International Conference on Data Analytics</source>
          , pages
          <fpage>87</fpage>
          -
          <lpage>91</lpage>
          ,
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          15.
          <string-name>
            <given-names>A.</given-names>
            <surname>Immonen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Paakkonen</surname>
          </string-name>
          , and
          <string-name>
            <given-names>E.</given-names>
            <surname>Ovaska</surname>
          </string-name>
          .
          <article-title>Evaluating the quality of social media data in big data architecture</article-title>
          .
          <source>IEEE Access</source>
          , (3):1,
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          16.
          <string-name>
            <given-names>C. D.</given-names>
            <surname>Manning</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Raghavan</surname>
          </string-name>
          , and
          <string-name>
            <given-names>H.</given-names>
            <surname>Schütze</surname>
          </string-name>
          .
          <source>Introduction to information retrieval</source>
          . Cambridge University Press, New York,
          <year>2008</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          17.
          <string-name>
            <given-names>C.</given-names>
            <surname>Niu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Ding</surname>
          </string-name>
          , and
          <string-name>
            <given-names>R. K.</given-names>
            <surname>Srihari</surname>
          </string-name>
          .
          <article-title>Orthographic case restoration using supervised learning without manual annotation</article-title>
          .
          <source>International Journal on Artificial Intelligence Tools</source>
          , (
          <volume>13</volume>
          ),
          <year>2003</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          18.
          <string-name>
            <given-names>J. R.</given-names>
            <surname>Nurse</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. S.</given-names>
            <surname>Rahman</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Creese</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Goldsmith</surname>
          </string-name>
          , and
          <string-name>
            <given-names>K.</given-names>
            <surname>Lamberts</surname>
          </string-name>
          .
          <article-title>Information quality and trustworthiness: A topical state-of-the-art review</article-title>
          .
          <source>International Conference on Computer Applications and Network Security (ICCANS 2011)</source>
          ,
          <year>2011</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          19.
          <string-name>
            <given-names>B.</given-names>
            <surname>Pang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Lee</surname>
          </string-name>
          , and
          <string-name>
            <given-names>S.</given-names>
            <surname>Vaithyanathan</surname>
          </string-name>
          .
          <article-title>Thumbs up? sentiment classification using machine learning techniques</article-title>
          .
          <source>In Proceedings of EMNLP</source>
          , pages
          <fpage>79</fpage>
          -
          <lpage>86</lpage>
          ,
          <year>2002</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          20.
          <string-name>
            <given-names>P.</given-names>
            <surname>Russom</surname>
          </string-name>
          .
          <article-title>BI search and text analytics: New additions to the BI technology stack</article-title>
          .
          <year>2007</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          21.
          <string-name>
            <given-names>M.</given-names>
            <surname>Schaal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Smyth</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R. M.</given-names>
            <surname>Mueller</surname>
          </string-name>
          , and
          <string-name>
            <given-names>R.</given-names>
            <surname>MacLean</surname>
          </string-name>
          .
          <article-title>Information quality dimensions for the social web</article-title>
          .
          <source>In Proceedings of the International Conference on Management of Emergent Digital EcoSystems</source>
          , pages
          <fpage>53</fpage>
          -
          <lpage>58</lpage>
          . ACM,
          <year>2012</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          22.
          <string-name>
            <given-names>M.</given-names>
            <surname>Schierle</surname>
          </string-name>
          and
          <string-name>
            <given-names>D.</given-names>
            <surname>Trabold</surname>
          </string-name>
          .
          <article-title>Multilingual knowledge-based concept recognition in textual data</article-title>
          . In A. Fink, B. Lausen, W. Seidel, and A. Ultsch, editors,
          <source>Advances in Data Analysis, Data Handling and Business Intelligence</source>
          , Studies in Classification,
          <source>Data Analysis, and Knowledge Organization</source>
          , pages
          <fpage>327</fpage>
          -
          <lpage>336</lpage>
          . Springer Berlin Heidelberg, Berlin, Heidelberg,
          <year>2010</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>
          23.
          <string-name>
            <given-names>A.</given-names>
            <surname>Schmidt</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Ireland</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Gonzales</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Del Pilar Angeles</surname>
          </string-name>
          , and
          <string-name>
            <given-names>D. D.</given-names>
            <surname>Burdescu</surname>
          </string-name>
          .
          <article-title>On the quality of non-structured data</article-title>
          ,
          <year>2012</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref24">
        <mixed-citation>
          24.
          <string-name>
            <given-names>L.</given-names>
            <surname>Sebastian-Coleman</surname>
          </string-name>
          .
          <source>Measuring data quality for ongoing improvement: A data quality assessment framework</source>
          . Elsevier Science, Burlington,
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref25">
        <mixed-citation>
          25.
          <string-name>
            <given-names>Y.</given-names>
            <surname>Song</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Klassen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Xia</surname>
          </string-name>
          , and
          <string-name>
            <given-names>C.</given-names>
            <surname>Kit</surname>
          </string-name>
          .
          <article-title>Entropy-based training data selection for domain adaptation</article-title>
          .
          <source>Proceedings of COLING</source>
          <year>2012</year>
          ,
          <year>2012</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref26">
        <mixed-citation>
          26.
          <string-name>
            <given-names>D.</given-names>
            <surname>Sonntag</surname>
          </string-name>
          .
          <article-title>Assessing the quality of natural language text data</article-title>
          .
          <source>In GI Jahrestagung</source>
          , pages
          <fpage>259</fpage>
          -
          <lpage>263</lpage>
          ,
          <year>2004</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref27">
        <mixed-citation>
          27.
          <string-name>
            <given-names>I.-G.</given-names>
            <surname>Todoran</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Lecornu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Khenchaf</surname>
          </string-name>
          , and
          <string-name>
            <given-names>J.-M.</given-names>
            <surname>Le Caillec</surname>
          </string-name>
          .
          <article-title>A methodology to evaluate important dimensions of information quality in systems</article-title>
          .
          <source>Journal of Data and Information Quality</source>
          ,
          <volume>6</volume>
          (
          <issue>2</issue>
          -3):
          <fpage>1</fpage>
          -
          <lpage>23</lpage>
          ,
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref28">
        <mixed-citation>
          28.
          <string-name>
            <given-names>T.</given-names>
            <surname>Vogel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Heise</surname>
          </string-name>
          ,
          <string-name>
            <given-names>U.</given-names>
            <surname>Draisbach</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Lange</surname>
          </string-name>
          , and
          <string-name>
            <given-names>F.</given-names>
            <surname>Naumann</surname>
          </string-name>
          .
          <article-title>Reach for gold</article-title>
          .
          <source>Journal of Data and Information Quality</source>
          ,
          <volume>5</volume>
          (
          <issue>1</issue>
          -2):
          <fpage>1</fpage>
          -
          <lpage>25</lpage>
          ,
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref29">
        <mixed-citation>
          29.
          <string-name>
            <given-names>H.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Bu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Gao</surname>
          </string-name>
          , and
          <string-name>
            <given-names>J.</given-names>
            <surname>Zhang</surname>
          </string-name>
          .
          <article-title>Cleanix</article-title>
          .
          <source>ACM SIGMOD Record</source>
          ,
          <volume>44</volume>
          (
          <issue>4</issue>
          ):
          <fpage>35</fpage>
          -
          <lpage>40</lpage>
          ,
          <year>2016</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref30">
        <mixed-citation>
          30.
          <string-name>
            <given-names>R. Y.</given-names>
            <surname>Wang</surname>
          </string-name>
          and
          <string-name>
            <given-names>D. M.</given-names>
            <surname>Strong</surname>
          </string-name>
          .
          <article-title>Beyond accuracy: what data quality means to data consumers</article-title>
          .
          <source>J. Manage. Inf. Syst.</source>
          ,
          <volume>12</volume>
          (
          <issue>4</issue>
          ):
          <fpage>5</fpage>
          -
          <lpage>33</lpage>
          ,
          <year>1996</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref31">
        <mixed-citation>
          31.
          <string-name>
            <given-names>R.</given-names>
            <surname>Ziai</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Ott</surname>
          </string-name>
          , and
          <string-name>
            <given-names>D.</given-names>
            <surname>Meurers</surname>
          </string-name>
          .
          <article-title>Short answer assessment: Establishing links between research strands</article-title>
          .
          <source>In Proceedings of the 7th Workshop on Innovative Use of NLP for Building Educational Applications (BEA7)</source>
          , Montreal, Canada,
          <year>2012</year>
          .
          Association for Computational Linguistics
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>