=Paper= {{Paper |id=Vol-1558/paper37 |storemode=property |title=Big Data Provenance: State-Of-The-Art Analysis and Emerging Research Challenges |pdfUrl=https://ceur-ws.org/Vol-1558/paper37.pdf |volume=Vol-1558 |authors=Alfredo Cuzzocrea |dblpUrl=https://dblp.org/rec/conf/edbt/Cuzzocrea16 }} ==Big Data Provenance: State-Of-The-Art Analysis and Emerging Research Challenges== https://ceur-ws.org/Vol-1558/paper37.pdf
Big Data Provenance: State-Of-The-Art Analysis
and Emerging Research Challenges

Alfredo Cuzzocrea
DIA Department
University of Trieste and ICAR-CNR
Italy
alfredo.cuzzocrea@dia.units.it

ABSTRACT
This paper focuses on big data provenance issues and provides a comprehensive
survey of the state of the art and of emerging research challenges in this
scientific field. Big data provenance is one of the most relevant problems in
big data research, as confirmed by the great deal of attention devoted to this
topic by ever-larger database and data mining research communities. This
contribution aims to represent a milestone on the exciting big data
provenance research road.

CCS Concepts
•Theory of computation → Data provenance;

Keywords
Big Data Provenance, Privacy of Big Data, Big Data Lineage, Big Data
Derivation

© 2016, Copyright is with the authors. Published in the Workshop Proceedings
of the EDBT/ICDT 2016 Joint Conference (March 15, 2016, Bordeaux, France) on
CEUR-WS.org (ISSN 1613-0073). Distribution of this paper is permitted under
the terms of the Creative Commons license CC-by-nc-nd 4.0

1. INTRODUCTION
In big data research, privacy and security of big data (e.g., [13, 14, 12])
play a major role. Along with these topics, provenance of big data (e.g.,
[16, 17, 10, 18, 4]) is relevant as well. Data provenance concerns the
problem of detecting the origin, the creation and the propagation process of
data within a data-intensive system. In other words, data provenance consists
in the lineage (e.g., [27]) and derivation (e.g., [22]) of data and data
objects, and it has its conceptual roots in extensive studies performed in
the past in the contexts of arts, literary works, manuscripts, sculptures,
and so forth. Another concept that is close to data provenance is the
so-called ownership of data (e.g., [21]), which refers to the issue of
defining and providing information about the rightful owner of data assets,
and to the acquisition, use and distribution policy implemented by the data
owner. In this way, data ownership primarily takes the shape of a data
governance process that details an organization's legal ownership of
enterprise-wide data.

When applied to big data, provenance problems become prohibitive (e.g.,
[10]), mostly due to the enormous size of big data. For instance, one of the
most successful families of data provenance techniques consists of the
so-called annotation-based approaches (e.g., [22]), which propose modifying
the input database queries in order to support data provenance tasks, while
requiring access to the whole target data set. Obviously, the latter
requirement becomes very hard to satisfy when applied to big data
repositories. Many other research challenges and open issues still arise in
big data provenance research. For instance, advanced concepts like
confidentiality of the data provenance process, secure and privacy-preserving
big data provenance, flexible big data provenance query tools, and so forth,
still need to be deeply investigated.

Inspired by these considerations, in this paper we provide an overview of
relevant research issues and challenges of the above-introduced big data
provenance problems, also highlighting possible future efforts within these
research directions.

The remaining part of this paper is organized as follows. Section 2 contains
a comprehensive analysis of state-of-the-art proposals that focus on big data
provenance issues. In Section 3, we recognize and report on emerging
challenges in big data provenance research, highlighting possible promising
directions. Finally, Section 4 draws the conclusions of our research.

2. STATE-OF-THE-ART ANALYSIS
Data provenance is relevant for a wide spectrum of typical enterprise data
tasks, such as: (i) data validation (e.g., [7]); (ii) data debugging (e.g.,
[20]); (iii) data auditing (e.g., [26]); (iv) data quality (e.g., [24]);
(v) data reliability (e.g., [3]). Application-wise, the provenance problem
has typically been addressed in database management systems (e.g., [9]), but
several efforts have also arisen in the contexts of workflow management
systems (e.g., [15]) and distributed systems (e.g., [25]).

On the research side, several initiatives compose the state of the art.
Here, we review some of them.

[11] describes a framework, called Kepler, for modeling and capturing
provenance in MapReduce jobs and deriving MapReduce tasks. The approach is
distributed in nature, and it exploits the MySQL Cluster distributed database
system [2].
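
As a rough illustration of what capturing provenance in a MapReduce job can
involve, the following minimal Python sketch runs a word count entirely in
memory and tags every intermediate pair with the identifier of the input
record it came from. It is not the implementation of [11] or of any other
surveyed system: the function names (map_with_provenance,
reduce_with_provenance, run_job) and the record identifiers are hypothetical
and chosen only for this example.

```python
from collections import defaultdict


def map_with_provenance(record_id, line):
    """Word-count mapper that tags each emitted pair with its source record id."""
    for word in line.split():
        yield word.lower(), (1, record_id)


def reduce_with_provenance(word, tagged_values):
    """Sum the counts and keep the set of input records that contributed."""
    total = 0
    sources = set()
    for count, record_id in tagged_values:
        total += count
        sources.add(record_id)
    return word, total, sorted(sources)


def run_job(records):
    """Run map, shuffle and reduce in memory over a {record_id: line} dict."""
    groups = defaultdict(list)
    for record_id, line in records.items():
        for key, tagged in map_with_provenance(record_id, line):
            groups[key].append(tagged)  # shuffle: group tagged values by key
    output = {}
    for key, tagged_values in groups.items():
        word, total, sources = reduce_with_provenance(key, tagged_values)
        output[word] = {"count": total, "provenance": sources}
    return output


# Hypothetical input: three tiny "records" identified by r1, r2, r3.
records = {"r1": "big data provenance", "r2": "data lineage", "r3": "big data"}
result = run_job(records)
print(result["data"])  # {'count': 3, 'provenance': ['r1', 'r2', 'r3']}
print(result["big"])   # {'count': 2, 'provenance': ['r1', 'r3']}
```

Even in this toy setting every output value carries a list of input
identifiers, which hints at why the storage and computation overhead of
provenance becomes a concern at big data scale.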

[19, 23] proposed an extension of Hadoop [1] called Reduce and Map Provenance
(RAMP). It introduces a wrapper-based method that can be easily deployed on
top of Hadoop while remaining transparent to it. Tracing of data-intensive
processes is supported as well.

[5] describes an extension of Hadoop for implementing provenance detection in
MapReduce jobs, called HadoopProv. The goal of HadoopProv is to minimize the
overhead introduced by computing provenance, which is usually a
resource-consuming task. The proposed system provides flexible tools for
querying the resulting big data provenance graph.

Pig Lipstick [6] is a hybrid big data provenance system that combines the
management of fine-grained dependencies, which are typical of
database-oriented provenance systems, with the management of workflow-grained
dependencies, which are typical of workflow-oriented provenance systems. The
internal model for reasoning on big data provenance is graph-like in nature.

[4] proposes the anatomy and functionalities of a layer-based architecture
for supporting big data provenance. In particular, the architecture is
composite in nature and it focuses on the collection, querying and
visualization of provenance in the specialized context of scientific
applications.

[17] considers the problem of managing fine-grained provenance in Data Stream
Management Systems (DSMS). This problem is recognized as particularly hard
because of the need to support flexible analysis tools over the computed
provenance, such as revision processing or query debugging. With this goal in
mind, the paper proposes a novel big data provenance framework based on the
concept of operator instrumentation, which consists in modifying the behavior
of operators in order to generate and propagate fine-grained provenance
through the several operators of a query.

CloudProv, a framework for integrating, modeling and monitoring data
provenance in Cloud environments, is presented in [18]. The proposed
framework is based on a method that allows us to model collected provenance
information so as to continuously acquire and monitor such information for
real-time applications, according to a service-oriented paradigm.

Finally, Oruta, an innovative privacy-preserving public auditing mechanism
for supporting data sharing in untrusted Cloud environments, is proposed in
[26]. The proposed mechanism makes use of homomorphic authenticators [8] that
allow a third-party auditor to check the integrity of the data shared by a
given user group without needing to access all the data.

3. EMERGING RESEARCH CHALLENGES
A relevant number of issues and challenges arise in big data provenance
research. In the following, we introduce and discuss some noticeable ones.

Accessing Big Data. Big data are prominently enormous in size, hence
accessing the entire big data set becomes problematic. Accessing data is a
strict requirement for data provenance techniques, hence this makes classical
methods unsuitable for the particular context of big data provenance.

Analyzing Big Data. In order to apply data provenance methods,
state-of-the-art techniques need to analyze the target (big) data set. Here,
a major problem is scalability, since the size of big data can be really
explosive.

Scalability Issues. When dealing with big data, one of the most problematic
drawbacks is scalability, as highlighted before. This occurs again with
provenance of big data, as provenance techniques are multi-step in nature and
need to access and process the target data repeatedly. This poses relevant
issues, as big data are typically large-scale and growing in size.

Information Sharing. Data provenance methods very often need to share
information among the actors that perform the same data provenance task. This
is not easy when dealing with big data, as such data are typically
distributed over large-scale network environments, hence information sharing
introduces relevant research challenges as well as technological drawbacks.

Minimum Computational Overhead Requirement. Data provenance techniques may be
data-intensive and resource-consuming. This imposes the need for devising and
implementing techniques that introduce a minimum computational overhead, in
order to avoid impacting the performance of the target system, e.g., a
workflow management system.

Query Optimization Issues. Data provenance techniques need to access and
query data in order to determine their provenance, even in an interactive
manner. This application requirement introduces severe drawbacks when these
techniques run over big data, as querying big data is currently a crucial
open problem.

Transformation Issues. During data provenance tasks, data sources need to be
transformed among different data formats. Provenance tracing must be
introduced accordingly, in order to track all the transformations that
occurred. This topic is a first-class one in the family of big data
provenance research issues, and it also has several points in common with the
data exchange research area.

When to Compute Provenance? There exist two alternatives for computing
provenance. One proposes computing provenance only when it is required (this
is called the lazy provenance model). The other one argues for computing
provenance every time data are transformed (this is called the eager
provenance model). Both models have pros and cons. They also imply different
computational overheads. Which one to adopt is still an open problem to be
considered in future efforts.

Data Modeling Support for Provenance. When data sources are processed to
detect their provenance, several transformations must be applied, as
mentioned above. This also implies the need to devise ad-hoc data models for
supporting provenance, as data sources may be significantly different. In
this case, semantic techniques seem promising in this direction.
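
To make the lazy and eager provenance models discussed above more concrete,
the following minimal Python sketch contrasts the two on a tiny in-memory
store. It is only an illustration under stated assumptions: the class names
(EagerStore, LazyStore) and their methods are hypothetical, and the lazy
variant is assumed to keep at least the direct inputs of each derivation so
that lineage can be reconstructed on request.

```python
class EagerStore:
    """Eager model: the full lineage is computed at every transformation."""

    def __init__(self):
        self.values = {}
        self.lineage = {}

    def put(self, key, value):
        self.values[key] = value
        self.lineage[key] = {key}  # a source item is its own origin

    def derive(self, key, func, *inputs):
        self.values[key] = func(*(self.values[i] for i in inputs))
        # The provenance cost is paid here, whether or not it is ever queried.
        self.lineage[key] = set().union(*(self.lineage[i] for i in inputs))

    def provenance(self, key):
        return self.lineage[key]  # simple lookup


class LazyStore:
    """Lazy model: only direct inputs are logged; lineage is traced on demand."""

    def __init__(self):
        self.values = {}
        self.parents = {}

    def put(self, key, value):
        self.values[key] = value
        self.parents[key] = ()

    def derive(self, key, func, *inputs):
        self.values[key] = func(*(self.values[i] for i in inputs))
        self.parents[key] = inputs  # cheap bookkeeping only

    def provenance(self, key):
        # The provenance cost is paid here: walk back to the source items.
        if not self.parents[key]:
            return {key}
        origins = set()
        for parent in self.parents[key]:
            origins |= self.provenance(parent)
        return origins


def build(store):
    """Apply the same two transformations to either kind of store."""
    store.put("a", 2)
    store.put("b", 3)
    store.put("c", 4)
    store.derive("ab", lambda x, y: x + y, "a", "b")
    store.derive("abc", lambda x, y: x * y, "ab", "c")
    return store


eager = build(EagerStore())
lazy = build(LazyStore())
print(eager.values["abc"], sorted(eager.provenance("abc")))  # 20 ['a', 'b', 'c']
print(lazy.provenance("abc") == eager.provenance("abc"))     # True
```

Both stores return the same answer; they differ in when the work is done.
The eager store pays on every derive call, while the lazy store pays on each
provenance query, which is one way to read the different computational
overheads of the two models.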

Heterogeneity of Data Source Models. Data provenance techniques usually run
over heterogeneous data sources, hence they need to cope with heterogeneous
data models as well. Therefore, heterogeneity of data sources is a big
challenge for such techniques, as data sources expose different formats,
(data) types, and schemas.

User Annotation Support for Provenance. The data provenance process is
usually enriched by user annotation, whereby domain experts annotate data in
order to enhance the effectiveness of this process. As a consequence, data
provenance tools need to introduce ad-hoc software modules capable of
supporting user annotation over big data.

Secure and Privacy-Preserving Provenance. Provenance can represent a security
and privacy breach for target data sources. Therefore, a relevant issue for
future efforts is the need for secure and privacy-preserving big data
provenance techniques. Possible solutions are those based on accepting a
compromise between security and privacy of data sources on the one side, and
provenance of data sources on the other side.

Flexible Provenance Query Tools. Provenance needs to be used not only to
detect the lineage and the derivation of data and data objects, but also as
an enabling methodology for flexible query tools that support next-generation
cybersecurity systems, where users may be interested in tracking records
generated by a particular person in a specific research lab, or in assessing
the confidentiality of tracked records, i.e., understanding who may have
looked at these tracked records beyond authorized people.

Provenance Visualization Tools. Visualization tools are extremely important
for big data provenance techniques, as provenance detection is an interactive
process that typically requires intelligent tools for visualizing actual
results and supporting next-step decisions. This will be a relevant research
challenge in future years.

4. CONCLUSIONS
This paper has provided a comprehensive survey of the state of the art and of
emerging research challenges in the context of big data provenance research.
We have highlighted benefits and limitations of the most relevant proposals,
and we have described possible research directions on the exciting big data
provenance research road.

5. REFERENCES
 [1] Apache Hadoop. http://wiki.apache.org/hadoop. Accessed: 2015-01-15.
 [2] MySQL Cluster CGE. https://www.mysql.com/products/cluster/. Accessed:
     2015-01-15.
 [3] N. Agmon and N. Ahituv. Assessing data reliability in an information
     system. J. of Management Information Systems, 4(2):34–44, 1987.
 [4] R. Agrawal, A. Imran, C. Seay, and J. Walker. A layer based architecture
     for provenance in big data. In 2014 IEEE International Conference on Big
     Data, Big Data 2014, Washington, DC, USA, October 27-30, 2014,
     pages 1–7, 2014.
 [5] S. Akoush, R. Sohan, and A. Hopper. HadoopProv: Towards provenance as a
     first class citizen in MapReduce. In 5th Workshop on the Theory and
     Practice of Provenance, TaPP'13, Lombard, IL, USA, April 2-3, 2013, 2013.
 [6] Y. Amsterdamer, S. B. Davidson, D. Deutch, T. Milo, J. Stoyanovich, and
     V. Tannen. Putting lipstick on pig: Enabling database-style workflow
     provenance. PVLDB, 5(4):346–357, 2011.
 [7] A. Assaf, A. Senart, and R. Troncy. Roomba: Automatic validation,
     correction and generation of dataset metadata. In Proceedings of the 24th
     International Conference on World Wide Web Companion, WWW 2015, Florence,
     Italy, May 18-22, 2015 - Companion Volume, pages 159–162, 2015.
 [8] G. Ateniese, R. C. Burns, R. Curtmola, J. Herring, L. Kissner,
     Z. N. J. Peterson, and D. X. Song. Provable data possession at untrusted
     stores. In Proceedings of the 2007 ACM Conference on Computer and
     Communications Security, CCS 2007, Alexandria, Virginia, USA, October
     28-31, 2007, pages 598–609, 2007.
 [9] P. Buneman, A. Chapman, and J. Cheney. Provenance management in curated
     databases. In Proceedings of the ACM SIGMOD International Conference on
     Management of Data, Chicago, Illinois, USA, June 27-29, 2006,
     pages 539–550, 2006.
[10] Y. Cheah, S. R. Canon, B. Plale, and L. Ramakrishnan. Milieu: Lightweight
     and configurable big data provenance for science. In IEEE International
     Congress on Big Data, BigData Congress 2013, June 27 2013-July 2, 2013,
     pages 46–53, 2013.
[11] D. Crawl, J. Wang, and I. Altintas. Provenance for MapReduce-based
     data-intensive workflows. In WORKS'11, Proceedings of the 6th Workshop on
     Workflows in Support of Large-Scale Science, co-located with SC11,
     Seattle, WA, USA, November 14, 2011, pages 21–30, 2011.
[12] A. Cuzzocrea. Privacy and security of big data: Current challenges and
     future research perspectives. In Proceedings of the First International
     Workshop on Privacy and Security of Big Data, PSBD@CIKM 2014, Shanghai,
     China, November 7, 2014, pages 45–47, 2014.
[13] A. Cuzzocrea, V. Russo, and D. Saccà. A robust sampling-based framework
     for privacy preserving OLAP. In Data Warehousing and Knowledge Discovery,
     10th International Conference, DaWaK 2008, Turin, Italy, September 2-5,
     2008, Proceedings, pages 97–114, 2008.
[14] A. Cuzzocrea and D. Saccà. Balancing accuracy and privacy of OLAP
     aggregations on data cubes. In DOLAP 2010, ACM 13th International
     Workshop on Data Warehousing and OLAP, Toronto, Ontario, Canada, October
     30, 2010, Proceedings, pages 93–98, 2010.
[15] S. B. Davidson and J. Freire. Provenance and scientific workflows:
     challenges and opportunities. In Proceedings of the ACM SIGMOD
     International Conference on Management of Data, SIGMOD 2008,
     Vancouver, BC, Canada, June 10-12, 2008, pages 1345–1350, 2008.
[16] B. Glavic, K. S. Esmaili, P. M. Fischer, and N. Tatbul. Ariadne: managing
     fine-grained provenance on data streams. In The 7th ACM International
     Conference on Distributed Event-Based Systems, DEBS '13, Arlington, TX,
     USA - June 29 - July 03, 2013, pages 39–50, 2013.
[17] B. Glavic, K. S. Esmaili, P. M. Fischer, and N. Tatbul. Efficient stream
     provenance via operator instrumentation. ACM Trans. Internet Techn.,
     14(1):7:1–7:26, 2014.
[18] R. Hammad and C. Wu. Provenance as a service: A data-centric approach for
     real-time monitoring. In 2014 IEEE International Congress on Big Data,
     Anchorage, AK, USA, June 27 - July 2, 2014, pages 258–265, 2014.
[19] R. Ikeda, H. Park, and J. Widom. Provenance for generalized map and
     reduce workflows. In CIDR 2011, Fifth Biennial Conference on Innovative
     Data Systems Research, Asilomar, CA, USA, January 9-12, 2011, Online
     Proceedings, pages 273–283, 2011.
[20] D. Kontokostas, M. Brümmer, S. Hellmann, J. Lehmann, and L. Ioannidis.
     NLP data cleansing based on linguistic ontology constraints. In The
     Semantic Web: Trends and Challenges - 11th International Conference,
     ESWC 2014, Anissaras, Crete, Greece, May 25-29, 2014. Proceedings,
     pages 224–239, 2014.
[21] M. Mizan, M. L. Rahman, R. Khan, M. M. Haque, and R. Hasan. Accountable
     proof of ownership for data using timing element in cloud services. In
     International Conference on High Performance Computing & Simulation,
     HPCS 2013, Helsinki, Finland, July 1-5, 2013, pages 57–64, 2013.
[22] I. Nunes, Y. Chen, S. Miles, M. Luck, and C. J. P. de Lucena. Transparent
     provenance derivation for user decisions. In Provenance and Annotation of
     Data and Processes - 4th International Provenance and Annotation
     Workshop, IPAW 2012, Santa Barbara, CA, USA, June 19-21, 2012, Revised
     Selected Papers, pages 111–125, 2012.
[23] H. Park, R. Ikeda, and J. Widom. RAMP: A system for capturing and tracing
     provenance in MapReduce workflows. PVLDB, 4(12):1351–1354, 2011.
[24] L. Pipino. Information quality assessment. In Encyclopedia of Database
     Systems, pages 1512–1515. 2009.
[25] Y. S. Tan, R. K. L. Ko, and G. Holmes. Security and data accountability
     in distributed systems: A provenance survey. In 10th IEEE International
     Conference on High Performance Computing and Communications & 2013 IEEE
     International Conference on Embedded and Ubiquitous Computing, HPCC/EUC
     2013, Zhangjiajie, China, November 13-15, 2013, pages 1571–1578, 2013.
[26] B. Wang, B. Li, and H. Li. Oruta: Privacy-preserving public auditing for
     shared data in the cloud. IEEE T. Cloud Computing, 2(1):43–56, 2014.
[27] E. Wu, S. Madden, and M. Stonebraker. Subzero: A fine-grained lineage
     system for scientific databases. In 29th IEEE International Conference on
     Data Engineering, ICDE 2013, Brisbane, Australia, April 8-12, 2013,
     pages 865–876, 2013.