<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Data Profiling as a Process</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Fabian Schomm</string-name>
          <email>fabian.schomm@wi.uni-muenster.de</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>European Research Center for Information Systems (ERCIS), Leonardo-Campus 3, Münster</institution>,
          <country country="DE">Germany</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2016</year>
      </pub-date>
      <fpage>98</fpage>
      <lpage>102</lpage>
      <abstract>
        <p>Dealing with data necessitates understanding it, which can be facilitated by analyzing its metadata. This metadata is often not readily available and thus needs to be extracted and collected. This activity is called data profiling and is well established and researched in the literature. However, what is missing is a structured description of a holistic data profiling process that puts together all the individual pieces and guides a user from the problem to the solution. This paper describes such a process and the results and insights that have been achieved so far.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. INTRODUCTION</title>
      <p>
        When working with data, one often faces situations in
which new or unknown datasets appear, which need to be
processed to achieve a specific goal. Due to the unknown
characteristics of the data and its schema, it is unclear how
the processing should be carried out, which makes it
necessary to inspect and analyze the data first. In the literature,
this inspection is usually referred to as data profiling, which
has been defined as “the set of activities and processes to
determine the metadata about a given dataset” [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]. Examples
of metadata about data include aggregated counts,
correlations, value distributions, or functional dependencies. The
extraction of these metadata has been the focus of various
research activities in the past, and numerous efficient
algorithms for their discovery have been developed [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ].
      </p>
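      <p>As a minimal illustration of such intrinsic metadata, the following sketch (our own, using only the Python standard library and a hypothetical column) derives row counts, null counts, distinct counts, and a value distribution:</p>

```python
from collections import Counter

def profile_column(values):
    """Derive simple intrinsic metadata for one column."""
    non_null = [v for v in values if v is not None]
    return {
        "rows": len(values),
        "nulls": len(values) - len(non_null),
        "distinct": len(set(non_null)),
        "distribution": Counter(non_null),
    }

# Hypothetical example column.
city = ["Berlin", "Berlin", None, "Münster"]
p = profile_column(city)
print(p["rows"], p["nulls"], p["distinct"])  # 4 1 2
```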
      <p>
        Still, it has been our observation that comprehensive
data profiling is only rarely performed by practitioners in
real-world scenarios. Rather, it seems to be common
practice to inspect an unknown dataset in a manual fashion, by
opening it, e.g., in a spreadsheet tool, and simply skimming
through it. This ad hoc approach, which has derogatorily
been called “data eyeballing” [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] or “data gazing” [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ], is not
only time- and cost-intensive, but also highly dependent on
individual skill, prone to errors or inconsistencies, and it
does not scale at all to larger datasets that contain more
than a handful of values.
      </p>
      <p>There are a number of reasons for the limited adoption
of elaborate data profiling techniques. The most important
factor is time: When deadlines have to be met, there is often
little room for assessing the data carefully. Instead, starting
as soon as possible and producing results immediately is
perceived as more important. However, this can quickly
lead to costly backtracking when certain assumptions about
the data (e.g., its quality) turn out to be wrong.</p>
      <p>Another factor for the negligence of data profiling is
missing knowledge. Many people do not know enough about
it or how it is done, even if they are trained professionals.
Additionally, there seems to be little understanding of the
potential benefits, which leads to a reluctance to learn.</p>
      <p>
        This paper describes an ongoing research effort that has
the goal to bridge this gap between sophisticated data
profiling techniques and algorithms described in the literature on
one side, and the often simplistic and crude data inspection
approaches encountered in practice on the other side. In
order to reach this goal, the development of a process model is
proposed, which acts as a practitioner’s guide to the
application of data profiling and highlights its benefits. This process
model should list necessary steps for data profiling, describe
the involved components (i.e., algorithms, tools, people) and
their interactions, and demonstrate the overall usefulness of
a structured approach to data profiling. Further research is
structured along a research agenda that loosely follows the
suggestions of the design science approach [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ]. It consists
of the following steps:
• Define the objectives and features of the
to-be-developed solution
• Perform a literature review to research and collect data
profiling methods, techniques, and tools, as well as
further work in related fields
• Design the data profiling process and complementary
artifacts to guide practitioners in their data profiling
efforts
• Demonstrate how the solution can be applied in
real-world scenarios
• Evaluate and validate the applicability and performance
of the result
      </p>
      <p>Throughout this work, the term data object will commonly
be used. A data object is the most general abstraction of
a dataset. Such an object can be any kind of data
collection, set, database, flat file, or even multiples of these. This
generality is needed in order to remove any preconceptions
about the syntax or semantics of the data in question. In
particular, a data object does not have to adhere to the
relational data model, but could also be semistructured (e.g.,
XML, JSON) or even unstructured.</p>
      <p>The remainder of this work is structured as follows:
Section 2 lists the most important use cases in which data
profiling should be one of the first steps. In Section 3, the
distinction between intrinsic and extrinsic information is
introduced, which is important to get a complete picture of what
a data profile is about and what it entails. After that,
Section 4 describes selected tools and techniques to explore the
work that has already been done in this field. A first draft
of the process model is shown and explained in Section 5.
Finally, Section 6 describes how an evaluation and
validation of the produced result could look.</p>
    </sec>
    <sec id="sec-2">
      <title>2. USE CASES FOR DATA PROFILING</title>
      <p>
        The extraction of metadata from a data object is helpful
whenever that data object needs to be processed in some
way. As such, there are many different use cases in which
data profiling can be applied. Abedjan et al. have
provided an initial overview of use cases and the ways in which
data profiling tasks apply to them [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]. A concise summary
of these use cases is given in this section.
      </p>
    </sec>
    <sec id="sec-3">
      <title>2.1 Data Quality Assessment</title>
      <p>
        Assessing the quality of a given data object is usually
done by defining quality metrics, calculating their values and
interpreting the results [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ]. The definition of these metrics
often involves the same results that also occur in a data
profiling run, e.g., completeness or accuracy. This overlap
makes it a natural fit to perform data quality assessment
with the help of a data profiling tool. One example is Profiler
[
        <xref ref-type="bibr" rid="ref5">5</xref>
        ], which utilizes visualization techniques and data mining
methods to allow fast assessment of tabular data.
      </p>
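      <p>A completeness metric of the kind mentioned above falls out of a profiling run almost for free; the following sketch (our own, with hypothetical columns) illustrates the idea:</p>

```python
def completeness(values):
    """Fraction of non-null values in a column; one of the simple
    quality metrics that a data profiling run delivers as a by-product."""
    if not values:
        return 1.0
    return sum(v is not None for v in values) / len(values)

# Hypothetical columns with missing values.
scores = {
    "name": completeness(["Schmidt", "Meyer", None, "Weber"]),
    "email": completeness([None, None, "a@b.de", "c@d.de"]),
}
print(scores)  # {'name': 0.75, 'email': 0.5}
```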
    </sec>
    <sec id="sec-4">
      <title>2.2 Data Cleansing</title>
      <p>To clean a data object, it is necessary to first figure out
where it is dirty. To do so, data profiling can help by
supplying information about instances of dirty data, such as
outliers, missing values, or skewed distributions. After the
data has been cleaned, another profiling run can be executed
to verify that the cleaning efforts were successful.
      </p>
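      <p>One such supply of dirty-data candidates is outlier detection; a small sketch (our own, with a hypothetical column) flags outliers using Tukey's interquartile-range rule:</p>

```python
import statistics

def iqr_outliers(values, k=1.5):
    """Flag values outside [Q1 - k*IQR, Q3 + k*IQR] (Tukey's rule)."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    lo, hi = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < lo or v > hi]

ages = [23, 25, 27, 24, 26, 199]  # 199 looks like a data entry error
print(iqr_outliers(ages))
```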
    </sec>
    <sec id="sec-5">
      <title>2.3 Data Integration</title>
      <p>One core challenge in data integration is to combine
multiple heterogeneous data sources into one unified mediated
schema. The heterogeneity of these sources necessitates an
individual handling of every source, e.g., in the form of
customized ETL processes. This necessitates knowledge about
the structure and content of each source. Gathering this
knowledge can be sped up significantly through the usage
of data profiling. Additionally, profiling multiple sources at
once allows the detection of overlaps or duplicates, which
can facilitate the integration task.
      </p>
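      <p>Overlap detection between sources can be sketched with a simple set-based similarity; the example below (our own, with hypothetical columns) uses the Jaccard coefficient on column value sets, where a high score hints that two columns may describe the same attribute:</p>

```python
def jaccard(a, b):
    """Jaccard similarity of two column value sets."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

# Hypothetical columns from two sources to be integrated.
src1_country = ["DE", "FR", "NL"]
src2_nation = ["DE", "FR", "BE", "NL"]
print(round(jaccard(src1_country, src2_nation), 2))  # 0.75
```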
    </sec>
    <sec id="sec-6">
      <title>2.4 Data Migration</title>
      <p>In order to migrate a data object, it is very helpful to have
certain key characteristics available. For example, the
physical size of the data is important for making sure that the
target destination has enough free space available.
Furthermore, when transferring a data object from one database to
another, it should be ascertained that the target database
offers support for the required schema, e.g., regarding data
types and column sizes. These kinds of information can be
gathered through data profiling.
      </p>
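      <p>Such a schema check can be sketched as follows (our own helper; a real migration would cover many more types): derive the data type and maximum column width the target database must support:</p>

```python
def column_ddl_hints(values):
    """Derive the type and maximum width a target schema must support
    (hypothetical helper covering only integers and strings)."""
    non_null = [v for v in values if v is not None]
    if non_null and all(isinstance(v, int) for v in non_null):
        return {"type": "INTEGER", "max_width": None}
    width = max((len(str(v)) for v in non_null), default=0)
    return {"type": "VARCHAR", "max_width": width}

h = column_ddl_hints(["Münster", "Berlin", None])
print(h)  # {'type': 'VARCHAR', 'max_width': 7}
```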
    </sec>
    <sec id="sec-7">
      <title>2.5 Query Optimization</title>
      <p>
        A query to a database usually consists of multiple
operations that are ordered in a hierarchical access plan. The
order of these operations heavily influences the time it takes
to answer the query. With the knowledge of the size of
involved data objects, a query optimizer can re-arrange the
operations in such a way that the result stays the same while
the execution time improves. Research on the usage of
profiling techniques for query optimization goes back as far as
1988 [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ]. Nowadays, end-users rarely need to worry about the
optimization of their queries, because every modern DBMS
comes with finely tuned optimization techniques. These
optimizations are based on data profiles, which are usually
not revealed. It would be interesting to investigate whether
these internal profiles could be exposed and used in different
contexts.
      </p>
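      <p>One concrete case where such internal profiles are in fact exposed is SQLite: after running ANALYZE, the optimizer's statistics are materialized in the sqlite_stat1 table and can be queried like ordinary data (a small sketch using an in-memory database):</p>

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE t (id INTEGER PRIMARY KEY, name TEXT)")
con.execute("CREATE INDEX idx_name ON t(name)")
con.executemany("INSERT INTO t (name) VALUES (?)",
                [("a",), ("b",), ("a",), ("c",)])
con.execute("ANALYZE")  # collects the statistics the query planner uses

# The optimizer's profile of table/index cardinalities is now queryable:
for tbl, idx, stat in con.execute("SELECT tbl, idx, stat FROM sqlite_stat1"):
    print(tbl, idx, stat)
```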
    </sec>
    <sec id="sec-8">
      <title>3. INTRINSIC VS. EXTRINSIC INFORMATION</title>
      <p>Before stepping into the development of a process, a proper
definition of what exactly a data profile consists of is needed.
Previous work in this area mainly addresses the quantifiable
and algorithmically determinable properties of data, such
as various counts, value distributions or dependencies. This
point of view however does not include everything that can
be learned about a data object, and hence, does not fully
describe a data profile.</p>
      <p>
        This leads us to the distinction between what we call
intrinsic and extrinsic information. Intrinsic information is
about characteristics of the data that are inherent to the raw
data values themselves, such as value distributions or
correlations. We call these kinds of information intrinsic, because
they can be extracted and derived directly from the data,
without any outside knowledge. These extraction activities
are what Naumann denotes as “data profiling tasks” [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ], and
they have been the focus of many research activities in the
past.
      </p>
      <p>After gathering all intrinsic information, there is still more
metadata that cannot be discovered from the data values
alone, such as the data provenance, or the interpretation
mode of special values (empty strings, NULL values). These
types of metadata are what we call extrinsic to differentiate
them from their intrinsic counterparts, and to emphasize the
fact that they can only be learned from sources outside of
the data values themselves.</p>
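      <p>The distinction can be mirrored directly in a profile data structure; the following sketch (our own reading, not an artifact of this work) computes the intrinsic fields from the values, while the extrinsic fields must be filled in from outside sources:</p>

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class DataProfile:
    # Intrinsic: derivable from the raw data values alone.
    row_count: int
    null_count: int
    # Extrinsic: only obtainable from outside the data, e.g. its owner.
    provenance: Optional[str] = None
    null_semantics: Optional[str] = None  # how NULLs/empty strings are meant

def intrinsic_profile(values):
    return DataProfile(row_count=len(values),
                       null_count=sum(v is None for v in values))

profile = intrinsic_profile(["x", None, "y"])
profile.provenance = "export from CRM"  # supplied by the data owner
print(profile.row_count, profile.null_count, profile.provenance)
```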
      <p>Specifying and classifying the various kinds of extrinsic
information is an ongoing research effort at our group. It
is especially challenging to gather extrinsic information,
because, by definition, it cannot be derived from only the
data. As such, it is necessary to search for other sources
that are able to provide the needed information. One
approach is to contact the data owner or the data creator, if
they are known. It could be assumed that they possess the
information in question and are willing to disclose it.
Further possible sources are documentation or transformation
logs about the original data object.</p>
      <p>Intrinsic and extrinsic information complement each other
and collectively form the profile of a data object.
However, gathering and inspecting such a data profile are not
the only approaches that can be taken by a data profiler. A
more direct way is to visualize the data object in a graphical
way to immediately identify its overall structure and
possible anomalies. Of course, such a visualization should still
assume no prior knowledge about the data object, because
getting that knowledge is the goal in the first place. We
call these kinds of visualization raw, because they address
the raw data values, and not the processed and calculated
result of a data analysis procedure. The major requirement
for raw visualization techniques is therefore the ability to be
executed with little to no configuration required. Although
raw visualizations can be considered intrinsic, we
intentionally choose to treat them separately, because the results and
purposes are different. An example of such a technique will
be described in the next section.</p>
    </sec>
    <sec id="sec-10">
      <title>4. SELECTED DATA PROFILING TOOLS AND TECHNIQUES</title>
      <p>
        There are many different data profiling tools, both
standalone programs and techniques integrated into bigger
software suites. Here we present two examples of such tools,
one from academia and the other from the software
market. A more detailed survey is given in [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]. Additionally,
this section describes CityPlot as an example of a
visualization technique that could prove useful in a profiling context.
Further visualization techniques are studied in the area of
visual data exploration [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ], and data profiling could benefit
greatly from incorporating more ideas from this field.
      </p>
    </sec>
    <sec id="sec-12">
      <title>4.1 Metanome</title>
      <p>
        Metanome is a platform that can be used to perform data
profiling and automatically discover metadata [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ]. It has
been developed by the chair of Prof. Felix Naumann at the
Hasso-Plattner-Institut in Potsdam, Germany. Metanome
offers a range of state-of-the-art algorithms to perform
traditional data profiling tasks and display their results. It is
conceived as a modular platform that allows an easy
integration of self-developed or third-party algorithms, and thus,
allows comparisons and benchmarks. Metanome is available
for free under the Apache license at the project website [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ].
      </p>
    </sec>
    <sec id="sec-13">
      <title>4.2 Talend Open Studio</title>
      <p>
        Talend Open Studio (TOS) is a software suite that
covers many data-related activities. One component is TOS
Data Quality, which offers many data profiling features in a
convenient user interface [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. TOS Data Quality is a prime
example of a free and easy-to-use data profiling tool that is
able to cover a lot of different use cases. All components of
TOS are distributed under the Apache license.
      </p>
    </sec>
    <sec id="sec-14">
      <title>4.3 CityPlot</title>
      <p>
        CityPlot is an algorithm that provides a combined view
of database structure and contents by replacing data values
with colored rectangles [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]. The color of a rectangle
represents the data type. Thus, it is very easy for the human eye
to assess a data object at a quick glance. For instance, in the
example shown in Figure 1, it is immediately visible that a
small number of columns contains numerical values (the red
ones), while the majority contains string values (the blue
ones). This can be used by a data profiler as a first
indication about the contents of the data object. Additionally,
CityPlot allows the user to assess the overall data completeness by
looking at the percentage of white space, which represents
empty cells, and also to identify the columns or rows where the
most data is missing.
      </p>
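      <p>The core idea can be approximated in a few lines; the following text-based stand-in (our simplification; the original uses colored rectangles) maps each cell to a symbol by type:</p>

```python
def cityplot_row(row):
    """Render one row: '#' numeric cell, '*' string cell, '.' empty cell."""
    def cell(v):
        if v is None or v == "":
            return "."
        return "#" if isinstance(v, (int, float)) else "*"
    return "".join(cell(v) for v in row)

# Hypothetical table: numeric id, name, city.
table = [[1, "Müller", ""], [2, None, "Berlin"], [3, "Meyer", "Köln"]]
for row in table:
    print(cityplot_row(row))
```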
    </sec>
    <sec id="sec-15">
      <title>5. PROCESS DRAFT</title>
      <p>A first version of the process model has been developed,
which can be seen in Figure 2.</p>
      <p>The process starts with three input objects: The data
object is the central piece about which more information is
required. It can be of any form or shape, e.g., a spreadsheet,
a database, or a flat file. It is assumed that this data object
is in some way new or unknown, because otherwise (if the
object were known) there would be little need to profile it
in the first place. Unknown data objects are encountered
frequently, for example when starting work on a new project,
or adding additional data from new sources.</p>
      <p>The second input is the task description, which is assumed
to be informal and not machine-readable. It contains
information about what needs to be done with the data and what
an end result should look like. Task descriptions directly
relate to the use case, and examples include “integrate this
data object into our database” or “clean up any quality
deficits”.</p>
      <p>The last input object is initial information about the data
object in question. This captures everything that is already
known, like previous profiling results, schema information,
or documentation. The initial information can be empty if
there is no such previous knowledge. Other researchers have
a rather pessimistic view on previously known information:
Olson writes that “any available metadata [...] is either
wrong or incomplete” [<xref ref-type="bibr" rid="ref10">10</xref>, p. 122]. However, he further
argues that available metadata is still useful, as it can provide
a basic structure and a good starting point for a profiling
activity.</p>
      <p>All input objects are used to construct a so-called Data
Comprehension Case (DCC). Thus, a DCC is characterized
by the need to comprehend a given data object in order to
solve a given task. The DCC is the focus of the process,
and solving it is its ultimate goal. Based on the distinction
introduced in Section 3, a DCC can be approached from
three main directions:</p>
      <p>Intrinsic information can be extracted from the data
object using traditional data profiling tools and
algorithms, and ranges from simple counts (e.g., nulls,
duplicates) over statistical variables (e.g., histograms,
patterns) to multi-column constructs (e.g., correlations,
dependencies).</p>
      <p>Extrinsic information is metadata that cannot be derived
from the data object itself. How exactly these kinds of
metadata can be gathered is a current research
question at our group. First ideas to tackle this challenge
are surveys or interviews as a structured approach for
knowledge extraction from relevant people, i.e., the
data owner or the data creator.</p>
      <p>Raw visualizations allow a user to directly assess a data
object in a graphical way. The key difference that
sets this approach apart from data eyeballing is the
fact that visualizations leverage colors, shapes, and the
human cognitive ability to process images faster than
text, to directly reveal the inner structure of the data
object.</p>
      <p>Each of these steps can be performed individually. After
the information has been gathered, it needs to be interpreted
and transformed into insights by a human, which is not an
easy task for untrained users. The process should provide
assistance for this step, e.g., in the form of an
interpretation guide. Such a guide should explain the meaning and
definition of the various metadata, how they are related,
and what value ranges are usually expected. The reason
this guide would be useful is that we have experienced that
people who are new to data profiling get lost easily in the
overflow of information that a profiling tool provides. A
little guidance and pointers on what to look out for can go a
long way here.</p>
      <p>In the last step of the process, it is checked whether the
gained insights are sufficient for the execution of the task.
This is the case when most uncertainties have been removed
and the user feels confident enough. In case that check fails,
further information can be extracted by looping back to the
beginning of the DCC. This loop can be repeated as often
as necessary in order to gather all necessary information, so
that the DCC is solved, the data object is comprehended
by the user sufficiently, and the initially set task can be
executed.</p>
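      <p>The control flow of this loop can be summarized as follows (a sketch of the process as described above; all function names and callables are our own placeholders, not artifacts of this work):</p>

```python
def solve_dcc(data_object, task, initial_info,
              gather_intrinsic, gather_extrinsic, visualize_raw,
              interpret, sufficient):
    """Data Comprehension Case loop: gather information from the three
    directions, interpret it into insights, repeat until sufficient."""
    insights = interpret(initial_info)
    while not sufficient(insights, task):
        info = []
        info += gather_intrinsic(data_object)  # profiling algorithms
        info += gather_extrinsic(data_object)  # owner interviews, docs
        info += visualize_raw(data_object)     # raw visualizations
        insights = interpret(insights + info)
    return insights  # the data object is sufficiently comprehended

# Toy run with placeholder callables.
insights = solve_dcc(
    "dataset.csv", "integrate", [],
    gather_intrinsic=lambda d: ["row counts"],
    gather_extrinsic=lambda d: ["provenance"],
    visualize_raw=lambda d: [],
    interpret=lambda xs: xs,
    sufficient=lambda ins, task: len(ins) >= 2,
)
print(insights)  # ['row counts', 'provenance']
```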
      <p>The absence of an explicit failure state as an end point
is intentional. It is assumed that any DCC can be solved
if it is explored thoroughly enough. Should this
assumption turn out to be inappropriate, the process needs to be
adapted. Another key feature is that it is not required to
gather every bit of information possible before proceeding.
Instead, the process ends as soon as the minimum amount of
required information has been extracted. This ensures that
no unnecessary work is done, and that the profiling process
is efficient and result-oriented.
      </p>
    </sec>
    <sec id="sec-16">
      <title>6. EVALUATION AND VALIDATION</title>
      <p>In order to evaluate the process model and verify its
usefulness, it needs to be tested. An experiment will be set
up that tests the process model in the following way: First,
a concrete use case is needed that would benefit from the
application of data profiling. For example, the
implementation of an ETL process for the integration of two
unknown datasets could be used. This task requires that a
mediated schema is designed into which the sources can be
integrated. Second, test subjects are needed that execute
the task. These test subjects should have basic skills
required for the completion of the task. For example, Master
students from our Information Systems programme could
be recruited. The test subjects are then divided into two
groups. One of the groups is provided with the data
profiling process and corresponding tools, while the other group
acts as a control group and receives no additional material.</p>
      <p>Both groups get the task description and the data. It is then
measured how long it takes everybody to complete the task,
and how good the individual results are. If the data profiling
process is good, the first group should perform much better
than the control group.</p>
      <p>Depending on the number of test subjects available, this
experiment could be even more diversified. For example, the
size of the to-be-integrated datasets could vary from only a
few rows to thousands of rows. It would then be possible to
hypothesize that an extensive data profiling approach only
shows its benefits if the dataset exceeds a certain size, while
small datasets can be processed just fine with manual ad
hoc methods.</p>
      <p>After the experiment, a short survey should be issued that
interrogates the participants about the perceived complexity
of the task and whether or not data profiling was helpful in
solving it. The gathered feedback could then be used to
further refine and improve upon the model.</p>
    </sec>
    <sec id="sec-17">
      <title>7. SUMMARY</title>
      <p>This paper described our current state of research
regarding the application of data profiling as described in
academia to real-world use cases of practitioners. The
approach mainly consists of the development of a process model
which guides a user through the various steps of data
profiling and how they should be applied. This process model
is complemented by an interpretation guide that facilitates
the handling of results by the end user.</p>
      <p>Additionally, we established the notion of extrinsic
information as a complement to classical data profiling tasks, i.e.,
intrinsic information. This leads to a more complete picture
of what the profile of a data object is and how it can be
created.</p>
      <p>There is of course much work that still needs to be done as
indicated by the research agenda in Section 1. The process
model has been described from a very broad perspective
and needs to be refined to a more granular level. Many
details about how individual steps should be executed are
still missing and need to be filled in.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1] Data Profiling - Talend, <year>2016</year>. URL: https://www.talend.com/resource/data-profiling.html [cited 31 March 2016].
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2] Metanome - Data Profiling - Hasso-Plattner-Institut, <year>2016</year>. URL: http://hpi.de/naumann/projects/data-profiling-and-analytics/metanome-data-profiling.html [cited 31 March 2016].
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>Z.</given-names>
            <surname>Abedjan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Golab</surname>
          </string-name>
          , and
          <string-name>
            <given-names>F.</given-names>
            <surname>Naumann</surname>
          </string-name>
          .
          <article-title>Profiling relational data: a survey</article-title>
          .
          <source>The VLDB Journal - The International Journal on Very Large Data Bases</source>
          ,
          <volume>24</volume>
          (
          <issue>4</issue>
          ):
          <fpage>557</fpage>
          -
          <lpage>581</lpage>
          ,
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name><given-names>M.</given-names> <surname>Dugas</surname></string-name>
          and
          <string-name><given-names>G.</given-names> <surname>Vossen</surname></string-name>
          .
          <article-title>CityPlot: Colored ER Diagrams to Visualize Structure and Contents of Databases</article-title>
          .
          <source>Datenbank-Spektrum</source>
          ,
          <volume>12</volume>(<issue>3</issue>):<fpage>215</fpage>-<lpage>218</lpage>,
          <year>2012</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>S.</given-names>
            <surname>Kandel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Parikh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Paepcke</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. M.</given-names>
            <surname>Hellerstein</surname>
          </string-name>
          , and
          <string-name>
            <given-names>J.</given-names>
            <surname>Heer</surname>
          </string-name>
          .
          <article-title>Profiler: integrated statistical analysis and visualization for data quality assessment</article-title>
          .
          <source>In Proceedings of the International Working Conference on Advanced Visual Interfaces</source>
          , pages
          <fpage>547</fpage>
          -
          <lpage>554</lpage>
          . ACM,
          <year>2012</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>D. A.</given-names>
            <surname>Keim</surname>
          </string-name>
          .
          <article-title>Visual exploration of large data sets</article-title>
          .
          <source>Communications of the ACM</source>
          ,
          <volume>44</volume>
          (
          <issue>8</issue>
          ):
          <fpage>38</fpage>
          -
          <lpage>44</lpage>
          ,
          <year>2001</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name><given-names>M. V.</given-names> <surname>Mannino</surname></string-name>
          ,
          <string-name><given-names>P.</given-names> <surname>Chu</surname></string-name>
          , and
          <string-name><given-names>T.</given-names> <surname>Sager</surname></string-name>
          .
          <article-title>Statistical Profile Estimation in Database Systems</article-title>
          .
          <source>ACM Comput. Surv.</source>
          ,
          <volume>20</volume>(<issue>3</issue>):<fpage>191</fpage>-<lpage>221</lpage>, Sept.
          <year>1988</year>
          . URL: http://doi.acm.org/10.1145/62061.62063, doi:10.1145/62061.62063.
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>A.</given-names>
            <surname>Maydanchik</surname>
          </string-name>
          .
          <source>Data quality assessment</source>
          . Technics Publications,
          <year>2007</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name><given-names>F.</given-names> <surname>Naumann</surname></string-name>
          .
          <article-title>Data Profiling Revisited</article-title>
          .
          <source>SIGMOD Rec.</source>
          ,
          <volume>42</volume>(<issue>4</issue>):<fpage>40</fpage>-<lpage>49</lpage>, Feb.
          <year>2014</year>
          . URL: http://doi.acm.org/10.1145/2590989.2590995, doi:10.1145/2590989.2590995.
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>J. E.</given-names>
            <surname>Olson</surname>
          </string-name>
          .
          <source>Data quality: the accuracy dimension</source>
          . Morgan Kaufmann,
          <year>2003</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>T.</given-names>
            <surname>Papenbrock</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Bergmann</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Finke</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Zwiener</surname>
          </string-name>
          , and
          <string-name>
            <given-names>F.</given-names>
            <surname>Naumann</surname>
          </string-name>
          .
          <article-title>Data Profiling with Metanome</article-title>
          .
          <source>Proc. VLDB Endow</source>
          .,
          <volume>8</volume>
          (
          <issue>12</issue>
          ):
          <fpage>1860</fpage>
          -
          <lpage>1863</lpage>
          , Aug.
          <year>2015</year>
          . URL: http://dx.doi.org/10.14778/2824032.2824086, doi:10.14778/2824032.2824086.
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>L. L.</given-names>
            <surname>Pipino</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y. W.</given-names>
            <surname>Lee</surname>
          </string-name>
          , and
          <string-name>
            <given-names>R. Y.</given-names>
            <surname>Wang</surname>
          </string-name>
          .
          <article-title>Data quality assessment</article-title>
          .
          <source>Communications of the ACM</source>
          ,
          <volume>45</volume>
          (
          <issue>4</issue>
          ):
          <fpage>211</fpage>
          -
          <lpage>218</lpage>
          ,
          <year>2002</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name><given-names>A. R.</given-names> <surname>Hevner</surname></string-name>
          ,
          <string-name><given-names>S. T.</given-names> <surname>March</surname></string-name>
          ,
          <string-name><given-names>J.</given-names> <surname>Park</surname></string-name>
          , and
          <string-name><given-names>S.</given-names> <surname>Ram</surname></string-name>
          .
          <article-title>Design science in information systems research</article-title>
          .
          <source>MIS quarterly</source>
          ,
          <volume>28</volume>(<issue>1</issue>):<fpage>75</fpage>-<lpage>105</lpage>,
          <year>2004</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>