<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Data Profiling in the Relational World</article-title>
      </title-group>
      <contrib-group>
        <aff id="aff0">
          <label>0</label>
          <institution>Hasso Plattner Institute, University of Potsdam</institution>
          ,
          <country country="DE">Germany</country>
        </aff>
      </contrib-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>-</title>
      <p>
        We can be confident that most computer or data scientists have engaged in
the activity of data profiling, at least by "eye-balling" spreadsheets, database
tables, XML files, etc., aptly called data gazing [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ]. More advanced techniques
to extract metadata may have been used, such as keyword-searching in datasets,
writing structured queries, or even using dedicated data profiling tools. Data
profiling is the set of activities and processes to determine metadata about a
given dataset. Among the simpler results are per-column statistics, such as the
number of null values and distinct values in a column, its data type, or the most
frequent patterns of its data values. Metadata that are more difficult to discover
involve multiple columns, such as inclusion, functional, and order dependencies,
or denial constraints [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ].
      </p>
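      <p>
        As a minimal illustration of the per-column statistics mentioned above, the
following Python sketch counts null values and distinct values, lists the observed
value types, and reports the most frequent value patterns of one column. The
function names profile_column and value_pattern, the pattern alphabet (9 for any
digit, A for any letter), and the sample column are assumptions made for this
illustration; they are not taken from any particular profiling tool.
      </p>
      <preformat>
```python
from collections import Counter
import re


def value_pattern(value):
    """Generalize a string: every digit becomes 9, every letter becomes A."""
    pattern = re.sub(r"[0-9]", "9", value)
    return re.sub(r"[A-Za-z]", "A", pattern)


def profile_column(values):
    """Compute simple single-column metadata; None marks a null value."""
    non_null = [v for v in values if v is not None]
    patterns = Counter(value_pattern(str(v)) for v in non_null)
    return {
        "nulls": len(values) - len(non_null),
        "distinct": len(set(non_null)),
        "types": sorted({type(v).__name__ for v in non_null}),
        "top_patterns": patterns.most_common(2),
    }


# Hypothetical sample column of postal codes with one null and one outlier.
zip_codes = ["14482", "10115", None, "D-14482", "10115"]
stats = profile_column(zip_codes)
print(stats)
```
      </preformat>
      <p>
        In this sample, the dominant pattern covers three of the four non-null values,
so the remaining value stands out as a candidate for closer inspection.
      </p>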
      <p>
        With the emergence and collection of ever more structured datasets from
diverse sources, as manifested for instance in data lakes, the ability to manage,
understand and analyze such data is increasingly difficult but equally important:
"If we just have a bunch of data sets in a repository, it is unlikely anyone will
ever be able to find, let alone reuse, any of this data. With adequate metadata,
there is some hope, but even so, challenges will remain..." [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ].
      </p>
      <p>
        Traditional uses for metadata discovered by data profiling algorithms include
data exploration, data cleansing, and data integration. For instance, a discovered
(approximate) dependency can be elevated to a business rule with the aim of
ridding the data of all its violations [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]. Statistics about data are commonly used
for database query optimization. Yet, a significant obstacle to data profiling,
especially to the discovery of dependencies, is the inherent complexity of the
problems. For instance, the number of potential key candidates, i.e., subsets of
table columns that contain only unique value combinations, is exponential in the
number of columns. Moreover, validating each candidate requires a scan of the
entire dataset. As a consequence, a plethora of algorithms has been developed,
tackling the many individual data profiling problems [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ].
      </p>
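      <p>
        To make the cost argument concrete, the following Python sketch enumerates key
candidates level by level and validates each one with a full scan over the rows.
It is a deliberately naive illustration, not one of the published discovery
algorithms; the function names is_unique and minimal_unique_column_combinations
and the three-row sample table are assumptions made for this example. A table
with n columns has 2^n - 1 non-empty column subsets, which is why such a search
does not scale without further pruning.
      </p>
      <preformat>
```python
from itertools import combinations


def is_unique(rows, columns):
    """Validate one candidate with a full scan over all rows."""
    seen = set()
    for row in rows:
        key = tuple(row[c] for c in columns)
        if key in seen:
            return False
        seen.add(key)
    return True


def minimal_unique_column_combinations(rows, n_columns):
    """Naive level-wise search over all non-empty column subsets."""
    found = []
    for size in range(1, n_columns + 1):
        for candidate in combinations(range(n_columns), size):
            # A superset of a known unique combination is unique too: skip it.
            if any(set(u).issubset(candidate) for u in found):
                continue
            if is_unique(rows, candidate):
                found.append(candidate)
    return found


# Hypothetical sample table with three columns (name, city, group).
rows = [
    ("Ada", "Berlin", 1),
    ("Ada", "Potsdam", 2),
    ("Bob", "Berlin", 2),
]
keys = minimal_unique_column_combinations(rows, 3)
candidates = 2 ** 3 - 1  # number of non-empty column subsets
print(keys, candidates)
```
      </preformat>
      <p>
        No single column of the sample table is unique, but every pair of columns is,
so the search reports three minimal combinations and skips the full column set.
      </p>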
      <p>
        Data profiling remains an exciting field of research, with many open
challenges extending well beyond the analysis of a static, relational table. Among
the open problems are efficient profiling of dynamic data, trading off efficiency
and accuracy of profiling algorithms, discovery of more complex types of
(semantic) constraints, and of course combining research ideas and directions from
the field of relational data profiling with those geared towards data of other data
models, such as graph data [
        <xref ref-type="bibr" rid="ref4 ref6">4, 6</xref>
        ].
      </p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Abedjan</surname>
            ,
            <given-names>Z.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Golab</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Naumann</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          :
          <article-title>Profiling relational data: a survey</article-title>
          .
          <source>VLDB Journal</source>
          <volume>24</volume>
          (
          <issue>4</issue>
          ),
          <fpage>557</fpage>
          –
          <lpage>581</lpage>
          (
          <year>2015</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Abedjan</surname>
            ,
            <given-names>Z.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Golab</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Naumann</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Papenbrock</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          :
          <article-title>Data Profiling</article-title>
          ,
          <source>Synthesis Lectures on Data Management</source>
          , vol.
          <volume>10</volume>
          . Morgan &amp; Claypool Publishers (Nov
          <year>2018</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Agrawal</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          , et al.:
          <article-title>Challenges and opportunities with Big Data</article-title>
          .
          <source>Tech. rep., Computing Community Consortium</source>
          , http://cra.org/ccc/docs/init/bigdatawhitepaper.pdf (
          <year>2012</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>Ellefi</surname>
            ,
            <given-names>M.B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bellahsene</surname>
            ,
            <given-names>Z.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Breslin</surname>
            ,
            <given-names>J.G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Demidova</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Dietze</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Szymanski</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Todorov</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          :
          <article-title>RDF dataset profiling - a survey of features, methods, vocabularies and applications</article-title>
          .
          <source>Semantic Web</source>
          <volume>9</volume>
          (
          <issue>5</issue>
          ),
          <fpage>677</fpage>
          –
          <lpage>705</lpage>
          (
          <year>2018</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <surname>Ilyas</surname>
            ,
            <given-names>I.F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Chu</surname>
            ,
            <given-names>X.</given-names>
          </string-name>
          :
          <article-title>Data Cleaning</article-title>
          . Association for Computing Machinery, New York, NY, United States
          (
          <year>2019</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <surname>Kruse</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Jentzsch</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Papenbrock</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kaoudi</surname>
            ,
            <given-names>Z.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Quiané-Ruiz</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Naumann</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          :
          <article-title>RDFind: Scalable conditional inclusion dependency discovery in RDF datasets</article-title>
          .
          In:
          <source>Proceedings of the International Conference on Management of Data (SIGMOD)</source>
          . pp.
          <fpage>953</fpage>
          –
          <lpage>967</lpage>
          (
          <year>2016</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <surname>Maydanchik</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          :
          <article-title>Data Quality Assessment</article-title>
          . Technics Publications, New Jersey (
          <year>2007</year>
          )
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>