Data Profiling in the Relational World

                                 Felix Naumann

                             Hasso Plattner Institute
                         University of Potsdam, Germany
                             felix.naumann@hpi.de

    We can be confident that most computer or data scientists have engaged in
the activity of data profiling, at least by “eye-balling” spreadsheets, database
tables, XML files, etc., aptly called data gazing [7]. More advanced techniques
to extract metadata may have been used, such as keyword-searching in datasets,
writing structured queries, or even using dedicated data profiling tools. Data
profiling is the set of activities and processes to determine metadata about a
given dataset. Among the simpler results are per-column statistics, such as the
number of null values and distinct values in a column, its data type, or the most
frequent patterns of its data values. Metadata that are more difficult to discover
involve multiple columns, such as inclusion, functional and order dependencies
or denial constraints [2].
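    As a rough illustration of the simpler, per-column results mentioned above,
the following sketch computes a null count, a distinct count, the dominant
data type, and the most frequent value patterns for a single column. The
function names, the pattern alphabet (digits become 9, letters become A), and
the sample values are assumptions made for this example only; they are not
taken from any particular profiling tool.

```python
import re
from collections import Counter


def value_pattern(value):
    """Generalize a value into a pattern: digits -> 9, ASCII letters -> A."""
    text = re.sub(r"[0-9]", "9", str(value))
    return re.sub(r"[A-Za-z]", "A", text)


def profile_column(values, top_k=3):
    """Compute simple single-column statistics; None stands for a null value."""
    non_null = [v for v in values if v is not None]
    types = Counter(type(v).__name__ for v in non_null)
    patterns = Counter(value_pattern(v) for v in non_null)
    return {
        "null_count": len(values) - len(non_null),
        "distinct_count": len(set(non_null)),
        # Most common Python type name among the non-null values.
        "data_type": types.most_common(1)[0][0] if types else None,
        "top_patterns": patterns.most_common(top_k),
    }


# Hypothetical sample column of postal codes.
zip_codes = ["14482", "10115", None, "D-14482", "14482"]
profile = profile_column(zip_codes)
print(profile)
```

    Each statistic here needs only one pass over the column, which is why such
results count among the simpler ones; the pattern counts already hint at the
one value that deviates from the dominant format.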
    With the emergence and collection of ever more structured datasets from
diverse sources, as manifested for instance in data lakes, the ability to manage,
understand and analyze such data is increasingly difficult but equally important:
“If we just have a bunch of data sets in a repository, it is unlikely anyone will
ever be able to find, let alone reuse, any of this data. With adequate metadata,
there is some hope, but even so, challenges will remain...” [3].
    Traditional uses for metadata discovered by data profiling algorithms include
data exploration, data cleansing, and data integration. For instance, a discovered
(approximate) dependency can be elevated to a business rule with the aim of
ridding the data of all its violations [5]. Statistics about data are commonly used
for database query optimization. Yet a significant obstacle to data profiling,
especially to the discovery of dependencies, is the inherent complexity of the
problems. For
instance, the number of potential key candidates, i.e., subsets of table columns
that contain only unique value combinations, is exponential in the number of
columns. Moreover, validating each candidate requires a scan of the entire
dataset. As a consequence, a plethora of algorithms have been developed to
tackle the many individual data profiling problems [1].
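    To make the complexity argument concrete, the sketch below enumerates key
candidates by brute force: it walks through the non-empty column subsets in
order of increasing size and validates each one with a full scan of the rows.
It is a naive baseline under simplifying assumptions (rows as dictionaries,
nulls treated like ordinary values, made-up sample data), not one of the
algorithms surveyed in [1]; the only pruning it applies is to skip supersets
of keys that were already found.

```python
from itertools import combinations


def is_unique(rows, columns):
    """Validate one candidate with a single scan over all rows."""
    seen = set()
    for row in rows:
        key = tuple(row[c] for c in columns)
        if key in seen:
            return False  # duplicate value combination: not a key
        seen.add(key)
    return True


def minimal_unique_column_combinations(rows, column_names):
    """Brute-force search over up to 2^n - 1 non-empty column subsets."""
    found = []
    for size in range(1, len(column_names) + 1):
        for candidate in combinations(column_names, size):
            # Any superset of a known key is unique but not minimal.
            if any(set(key) <= set(candidate) for key in found):
                continue
            if is_unique(rows, candidate):
                found.append(candidate)
    return found


# Hypothetical sample table.
rows = [
    {"first": "Ada", "last": "Lovelace", "city": "London"},
    {"first": "Ada", "last": "Byron", "city": "London"},
    {"first": "Alan", "last": "Turing", "city": "London"},
    {"first": "Alan", "last": "Byron", "city": "Potsdam"},
]
keys = minimal_unique_column_combinations(rows, ["first", "last", "city"])
print(keys)
```

    With n columns the outer loops may visit up to 2^n - 1 candidates, and each
call to is_unique reads the whole table, so the cost grows quickly with both
the number of columns and the number of rows.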
    Data profiling remains an exciting field of research, with many open
challenges extending well beyond the analysis of a static, relational table.
Among the open problems are the efficient profiling of dynamic data, trading
off the efficiency and accuracy of profiling algorithms, the discovery of more
complex types of (semantic) constraints, and of course combining research ideas
and directions from the field of relational data profiling with those geared
towards other data models, such as graph data [4, 6].

  Copyright © 2020 for this paper by its authors. Use permitted under Creative
  Commons License Attribution 4.0 International (CC BY 4.0).

References
1. Abedjan, Z., Golab, L., Naumann, F.: Profiling relational data: a survey. VLDB
   Journal 24(4), 557–581 (2015)
2. Abedjan, Z., Golab, L., Naumann, F., Papenbrock, T.: Data Profiling, Synthesis
   Lectures on Data Management, vol. 10. Morgan & Claypool Publishers (Nov 2018)
3. Agrawal, D., et al.: Challenges and opportunities with Big Data. Tech.
   rep., Computing Community Consortium, http://cra.org/ccc/docs/init/
   bigdatawhitepaper.pdf (2012)
4. Ellefi, M.B., Bellahsene, Z., Breslin, J.G., Demidova, E., Dietze, S., Szymanski, J.,
   Todorov, K.: RDF dataset profiling - a survey of features, methods, vocabularies
   and applications. Semantic Web 9(5), 677–705 (2018)
5. Ilyas, I.F., Chu, X.: Data Cleaning. Association for Computing Machinery, New
   York, NY, United States (2019)
6. Kruse, S., Jentzsch, A., Papenbrock, T., Kaoudi, Z., Quiané-Ruiz, J., Naumann, F.:
   RDFind: Scalable conditional inclusion dependency discovery in RDF datasets. In:
   Proceedings of the International Conference on Management of Data (SIGMOD).
   pp. 953–967 (2016)
7. Maydanchik, A.: Data Quality Assessment. Technics Publications, New Jersey
   (2007)