=Paper=
{{Paper
|id=Vol-1558/paper37
|storemode=property
|title=Big Data Provenance: State-Of-The-Art Analysis and Emerging Research Challenges
|pdfUrl=https://ceur-ws.org/Vol-1558/paper37.pdf
|volume=Vol-1558
|authors=Alfredo Cuzzocrea
|dblpUrl=https://dblp.org/rec/conf/edbt/Cuzzocrea16
}}
==Big Data Provenance: State-Of-The-Art Analysis and Emerging Research Challenges==
Alfredo Cuzzocrea
DIA Department
University of Trieste and ICAR-CNR
Italy
alfredo.cuzzocrea@dia.units.it
ABSTRACT
This paper focuses on big data provenance issues and provides a comprehensive survey of state-of-the-art analysis and emerging research challenges in this scientific field. Big data provenance is currently one of the most relevant problems in big data research, as confirmed by the great deal of attention devoted to this topic by ever larger database and data mining research communities. This contribution aims to represent a milestone on the exciting big data provenance research road.

CCS Concepts
•Theory of computation → Data provenance;

Keywords
Big Data Provenance, Privacy of Big Data, Big Data Lineage, Big Data Derivation

© 2016, Copyright is with the authors. Published in the Workshop Proceedings of the EDBT/ICDT 2016 Joint Conference (March 15, 2016, Bordeaux, France) on CEUR-WS.org (ISSN 1613-0073). Distribution of this paper is permitted under the terms of the Creative Commons license CC BY-NC-ND 4.0.

1. INTRODUCTION
In big data research, privacy and security of big data (e.g., [13, 14, 12]) play a major role. Along with these topics, provenance of big data (e.g., [16, 17, 10, 18, 4]) is relevant as well. Data provenance concerns the problem of detecting the origin, the creation and the propagation process of data within a data-intensive system. In other words, data provenance consists of the lineage (e.g., [27]) and derivation (e.g., [22]) of data and data objects, and it has its conceptual roots in extensive studies performed in the past in the contexts of arts, literary works, manuscripts, sculptures, and so forth. Another concept close to data provenance is the so-called ownership of data (e.g., [21]), which refers to the issue of defining and providing information about the rightful owner of data assets, and to the acquisition, use and distribution policy implemented by the data owner. In this way, data ownership primarily takes the shape of a data governance process that details an organization’s legal ownership of enterprise-wide data.

When applied to big data, provenance problems become prohibitive (e.g., [10]), mostly due to the enormous size of big data. For instance, among the most successful data provenance techniques are the so-called annotation-based approaches (e.g., [22]), which propose modifying the input database queries in order to support data provenance tasks, while being able to access the entire target data set. Obviously, the latter requirement becomes very hard to meet when applied to big data repositories. Many other research challenges and open issues still arise in big data provenance research. For instance, advanced concepts like confidentiality of the data provenance process, secure and privacy-preserving big data provenance, flexible big data provenance query tools, and so forth, still need to be investigated in depth.

Inspired by these considerations, in this paper we provide an overview of relevant research issues and challenges of the above-introduced big data provenance problems, also highlighting possible future efforts within these research directions.

The remainder of this paper is organized as follows. Section 2 contains a comprehensive analysis of state-of-the-art proposals that focus on big data provenance issues. In Section 3, we recognize and report on emerging challenges in big data provenance research, highlighting possible promising directions. Finally, Section 4 draws the conclusions of our research.

2. STATE-OF-THE-ART ANALYSIS
Data provenance is relevant for a wide spectrum of typical enterprise data tasks, such as: (i) data validation (e.g., [7]); (ii) data debugging (e.g., [20]); (iii) data auditing (e.g., [26]); (iv) data quality (e.g., [24]); (v) data reliability (e.g., [3]). Application-wise, the provenance problem has typically been addressed in database management systems (e.g., [9]), but several efforts have also arisen in the contexts of workflow management systems (e.g., [15]) and distributed systems (e.g., [25]).

On the research side proper, several research initiatives compose the state of the art. Here, we review some of them.

[11] describes a framework for modeling and capturing provenance in MapReduce jobs and deriving MapReduce tasks, called Kepler. The approach is distributed in nature, and it exploits the MySQL Cluster distributed database system [2].
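
To make the idea of capturing provenance in a MapReduce-style job more concrete, the following is a minimal in-memory sketch in Python. It is an illustration only: it is neither the Kepler framework of [11] nor any Hadoop API, and the function name `run_with_provenance`, the record ids and the word-count job are all hypothetical. The sketch simply records, for each output key, which input records contributed to it.

```python
from collections import defaultdict


def run_with_provenance(records, map_fn, reduce_fn):
    """Run a toy map/reduce job and track, per output key, the ids of the
    input records that contributed to it (hypothetical helper)."""
    groups = defaultdict(list)   # key -> intermediate values
    lineage = defaultdict(set)   # key -> ids of contributing input records
    for record_id, value in records:
        for key, mapped in map_fn(value):
            groups[key].append(mapped)
            lineage[key].add(record_id)
    results = {key: reduce_fn(key, values) for key, values in groups.items()}
    return results, {key: sorted(ids) for key, ids in lineage.items()}


def word_map(line):
    # Emit (word, 1) for every word in the input line.
    return [(word, 1) for word in line.split()]


def count_reduce(key, values):
    # Sum the counts collected for one word.
    return sum(values)


records = [
    ("r1", "big data"),
    ("r2", "data provenance"),
    ("r3", "big provenance graph"),
]
results, lineage = run_with_provenance(records, word_map, count_reduce)
print(results["data"], lineage["data"])
```

Keeping the lineage map separate from the job's results means the map and reduce functions themselves stay unchanged, at the cost of extra memory proportional to the number of (key, record) pairs.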
[19, 23] propose an extension of Hadoop [1] called Reduce and Map Provenance (RAMP). It introduces a wrapper-based method that can be easily deployed on top of Hadoop while remaining transparent to it. Tracing of data-intensive processes is supported as well.

[5] describes an extension of Hadoop for implementing provenance detection in MapReduce jobs, called HadoopProv. The goal of HadoopProv is to minimize the overhead introduced by computing provenance, which is usually a resource-consuming task. The proposed system provides flexible tools for querying the resulting big data provenance graph.

Pig Lipstick [6] is a hybrid big data provenance system that combines the management of fine-grained dependencies, which are typical of database-oriented provenance systems, with the management of workflow-grained dependencies, which are typical of workflow-oriented provenance systems. The internal model for reasoning on big data provenance is graph-like in nature.

[4] proposes the anatomy and functionalities of a layer-based architecture for supporting big data provenance. In particular, the architecture is composite in nature and it focuses on the collection, querying and visualization of provenance in the specialized context of scientific applications.

[17] considers the problem of managing fine-grained provenance in Data Stream Management Systems (DSMS). This problem is recognized as particularly hard because of the need to support flexible analysis tools over the computed provenance, such as revision processing or query debugging. With this goal in mind, the paper proposes a novel big data provenance framework based on the concept of operator instrumentation, which consists of modifying the behavior of operators in order to generate and propagate fine-grained provenance through several operators of a query.

CloudProv, a framework for integrating, modeling and monitoring data provenance in Cloud environments, is presented in [18]. The proposed framework is based on a method that models collected provenance information so as to continuously acquire and monitor such information for real-time applications, according to a service-oriented paradigm.

Finally, Oruta, an innovative privacy-preserving public auditing mechanism for supporting data sharing in untrusted Cloud environments, is proposed in [26]. The proposed mechanism makes use of homomorphic authenticators [8] that allow a third-party auditor to check the integrity of data shared by a given user group without needing to access all the data.

3. EMERGING RESEARCH CHALLENGES
A relevant number of issues and challenges arise in big data provenance research. In the following, we introduce and discuss some notable ones.

Accessing Big Data. Big data are predominantly enormous in size, hence accessing the entire big data set becomes problematic. Accessing data is a strict requirement for data provenance techniques, which makes classical methods unsuitable for the particular context of big data provenance.

Analyzing Big Data. In order to apply data provenance methods, state-of-the-art techniques need to analyze the target (big) data set. A major problem here is scalability, since the size of big data can grow explosively.

Scalability Issues. When dealing with big data, one of the most problematic drawbacks is scalability, as highlighted above. This also holds for provenance of big data, as provenance techniques are multi-step in nature and need to access and process target data repeatedly. This poses relevant issues, as big data are typically large-scale and growing in size.

Information Sharing. Data provenance methods very often require sharing information among the actors that perform the same data provenance task. This is not easy when dealing with big data, as such data are typically distributed over large-scale network environments, hence information sharing introduces relevant research challenges as well as technological drawbacks.

Minimum Computational Overhead Requirement. Data provenance techniques may be data-intensive and resource-consuming. This imposes the need for devising and implementing techniques that introduce minimal computational overhead, in order to avoid impacting the performance of the target system, e.g., a workflow management system.

Query Optimization Issues. Data provenance techniques need to access and query data in order to determine their provenance, even in an interactive manner. This application requirement introduces severe drawbacks when these techniques run over big data, as querying big data is currently a crucial open problem.

Transformation Issues. During data provenance tasks, data sources need to be transformed among different data formats. Provenance tracing must be introduced accordingly, in order to track all the different transformations that occurred. This topic is a first-class one in the family of big data provenance research issues, and it has several points in common with the data exchange research area.

When Computing Provenance? There exist two alternatives for computing provenance. One advocates computing provenance only when it is required (the lazy provenance model). The other advocates computing provenance every time data are transformed (the eager provenance model). Both models have pros and cons, and they imply different computational overheads. This is still an open problem to be considered in future efforts.

Data Modeling Support for Provenance. When data sources are processed to detect their provenance, several transformations must be applied, as mentioned above. This also implies the need to devise ad-hoc data models for supporting provenance, as data sources may differ significantly. Here, semantic techniques seem promising.
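
The difference between the lazy and eager provenance models discussed above can be illustrated with a small Python sketch. This is a toy under stated assumptions, not a method taken from any of the surveyed systems: the classes `EagerStore` and `LazyStore`, their methods and the sample data items are all hypothetical. The eager store updates the set of source items at every transformation, while the lazy store only logs direct inputs and traces the sources when provenance is requested.

```python
class EagerStore:
    """Eager model: lineage is updated every time a value is derived."""

    def __init__(self):
        self.values = {}
        self.lineage = {}   # name -> set of source names, always up to date

    def put(self, name, value):
        self.values[name] = value
        self.lineage[name] = {name}

    def derive(self, name, fn, *inputs):
        self.values[name] = fn(*(self.values[i] for i in inputs))
        # Pay the provenance cost now, at transformation time.
        self.lineage[name] = set().union(*(self.lineage[i] for i in inputs))

    def provenance(self, name):
        return self.lineage[name]   # already computed


class LazyStore:
    """Lazy model: only direct inputs are logged; lineage is traced on request."""

    def __init__(self):
        self.values = {}
        self.parents = {}   # name -> tuple of direct inputs (empty for sources)

    def put(self, name, value):
        self.values[name] = value
        self.parents[name] = ()

    def derive(self, name, fn, *inputs):
        self.values[name] = fn(*(self.values[i] for i in inputs))
        self.parents[name] = inputs

    def provenance(self, name):
        # Pay the provenance cost now, at query time, by walking the log.
        if not self.parents[name]:
            return {name}
        return set().union(*(self.provenance(p) for p in self.parents[name]))


def build(store):
    store.put("a", 1)
    store.put("b", 2)
    store.put("c", 10)
    store.derive("ab", lambda x, y: x + y, "a", "b")
    store.derive("abc", lambda x, y: x * y, "ab", "c")
    return store


eager, lazy = build(EagerStore()), build(LazyStore())
print(eager.provenance("abc") == lazy.provenance("abc"))
```

Both stores return the same answer; they differ only in when the work is done. The eager variant adds overhead to every transformation but answers provenance queries immediately, whereas the lazy variant keeps transformations cheap and defers the traversal cost to query time.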
Heterogeneity of Data Source Models. Data provenance techniques usually run over heterogeneous data sources, hence they need to cope with heterogeneous data models as well. Therefore, heterogeneity of data sources is a big challenge for such techniques, as data sources expose different formats, (data) types, and schemas.

User Annotation Support for Provenance. The data provenance process is usually enriched by user annotation, whereby domain experts annotate data in order to enhance the effectiveness of this process. As a consequence, data provenance tools need to introduce ad-hoc software modules capable of supporting user annotation over big data.

Secure and Privacy-Preserving Provenance. Provenance can represent a security and privacy breach for target data sources. Therefore, a relevant issue for future efforts is the need for secure and privacy-preserving big data provenance techniques. Possible solutions are those that accept a compromise between the security and privacy of data sources on one side, and the provenance of data sources on the other.

Flexible Provenance Query Tools. Provenance needs to be used not only to detect the lineage and the derivation of data and data objects, but also as an enabling methodology for flexible query tools that support next-generation cybersecurity systems, where users may be interested in tracking records generated by a particular person in a specific research lab, or in assessing the confidentiality of tracked records, i.e., understanding who, beyond authorized people, may have looked at these tracked records.

Provenance Visualization Tools. Visualization tools are extremely important for big data provenance techniques, as provenance analysis is an interactive process that typically requires intelligent tools for visualizing current results and supporting next-step decisions. This will be a relevant research challenge in future years.

4. CONCLUSIONS
This paper has provided a comprehensive survey of state-of-the-art analysis and emerging research challenges in the context of big data provenance research. We have highlighted benefits and limitations of the most relevant proposals, and we have described possible research directions on the exciting big data provenance research road.

5. REFERENCES
[1] Apache Hadoop. http://wiki.apache.org/hadoop. Accessed: 2015-01-15.
[2] MySQL Cluster CGE. https://www.mysql.com/products/cluster/. Accessed: 2015-01-15.
[3] N. Agmon and N. Ahituv. Assessing data reliability in an information system. J. of Management Information Systems, 4(2):34–44, 1987.
[4] R. Agrawal, A. Imran, C. Seay, and J. Walker. A layer based architecture for provenance in big data. In 2014 IEEE International Conference on Big Data, Big Data 2014, Washington, DC, USA, October 27-30, 2014, pages 1–7, 2014.
[5] S. Akoush, R. Sohan, and A. Hopper. Hadoopprov: Towards provenance as a first class citizen in mapreduce. In 5th Workshop on the Theory and Practice of Provenance, TaPP’13, Lombard, IL, USA, April 2-3, 2013, 2013.
[6] Y. Amsterdamer, S. B. Davidson, D. Deutch, T. Milo, J. Stoyanovich, and V. Tannen. Putting lipstick on pig: Enabling database-style workflow provenance. PVLDB, 5(4):346–357, 2011.
[7] A. Assaf, A. Senart, and R. Troncy. Roomba: Automatic validation, correction and generation of dataset metadata. In Proceedings of the 24th International Conference on World Wide Web Companion, WWW 2015, Florence, Italy, May 18-22, 2015 - Companion Volume, pages 159–162, 2015.
[8] G. Ateniese, R. C. Burns, R. Curtmola, J. Herring, L. Kissner, Z. N. J. Peterson, and D. X. Song. Provable data possession at untrusted stores. In Proceedings of the 2007 ACM Conference on Computer and Communications Security, CCS 2007, Alexandria, Virginia, USA, October 28-31, 2007, pages 598–609, 2007.
[9] P. Buneman, A. Chapman, and J. Cheney. Provenance management in curated databases. In Proceedings of the ACM SIGMOD International Conference on Management of Data, Chicago, Illinois, USA, June 27-29, 2006, pages 539–550, 2006.
[10] Y. Cheah, S. R. Canon, B. Plale, and L. Ramakrishnan. Milieu: Lightweight and configurable big data provenance for science. In IEEE International Congress on Big Data, BigData Congress 2013, June 27 2013-July 2, 2013, pages 46–53, 2013.
[11] D. Crawl, J. Wang, and I. Altintas. Provenance for mapreduce-based data-intensive workflows. In WORKS’11, Proceedings of the 6th Workshop on Workflows in Support of Large-Scale Science, co-located with SC11, Seattle, WA, USA, November 14, 2011, pages 21–30, 2011.
[12] A. Cuzzocrea. Privacy and security of big data: Current challenges and future research perspectives. In Proceedings of the First International Workshop on Privacy and Secuirty of Big Data, PSBD@CIKM 2014, Shanghai, China, November 7, 2014, pages 45–47, 2014.
[13] A. Cuzzocrea, V. Russo, and D. Saccà. A robust sampling-based framework for privacy preserving OLAP. In Data Warehousing and Knowledge Discovery, 10th International Conference, DaWaK 2008, Turin, Italy, September 2-5, 2008, Proceedings, pages 97–114, 2008.
[14] A. Cuzzocrea and D. Saccà. Balancing accuracy and privacy of OLAP aggregations on data cubes. In DOLAP 2010, ACM 13th International Workshop on Data Warehousing and OLAP, Toronto, Ontario, Canada, October 30, 2010, Proceedings, pages 93–98, 2010.
[15] S. B. Davidson and J. Freire. Provenance and scientific workflows: challenges and opportunities. In Proceedings of the ACM SIGMOD International Conference on Management of Data, SIGMOD 2008, Vancouver, BC, Canada, June 10-12, 2008, pages 1345–1350, 2008.
[16] B. Glavic, K. S. Esmaili, P. M. Fischer, and N. Tatbul. Ariadne: managing fine-grained provenance on data streams. In The 7th ACM International Conference on Distributed Event-Based Systems, DEBS ’13, Arlington, TX, USA - June 29 - July 03, 2013, pages 39–50, 2013.
[17] B. Glavic, K. S. Esmaili, P. M. Fischer, and N. Tatbul. Efficient stream provenance via operator instrumentation. ACM Trans. Internet Techn., 14(1):7:1–7:26, 2014.
[18] R. Hammad and C. Wu. Provenance as a service: A data-centric approach for real-time monitoring. In 2014 IEEE International Congress on Big Data, Anchorage, AK, USA, June 27 - July 2, 2014, pages 258–265, 2014.
[19] R. Ikeda, H. Park, and J. Widom. Provenance for generalized map and reduce workflows. In CIDR 2011, Fifth Biennial Conference on Innovative Data Systems Research, Asilomar, CA, USA, January 9-12, 2011, Online Proceedings, pages 273–283, 2011.
[20] D. Kontokostas, M. Brümmer, S. Hellmann, J. Lehmann, and L. Ioannidis. NLP data cleansing based on linguistic ontology constraints. In The Semantic Web: Trends and Challenges - 11th International Conference, ESWC 2014, Anissaras, Crete, Greece, May 25-29, 2014. Proceedings, pages 224–239, 2014.
[21] M. Mizan, M. L. Rahman, R. Khan, M. M. Haque, and R. Hasan. Accountable proof of ownership for data using timing element in cloud services. In International Conference on High Performance Computing & Simulation, HPCS 2013, Helsinki, Finland, July 1-5, 2013, pages 57–64, 2013.
[22] I. Nunes, Y. Chen, S. Miles, M. Luck, and C. J. P. de Lucena. Transparent provenance derivation for user decisions. In Provenance and Annotation of Data and Processes - 4th International Provenance and Annotation Workshop, IPAW 2012, Santa Barbara, CA, USA, June 19-21, 2012, Revised Selected Papers, pages 111–125, 2012.
[23] H. Park, R. Ikeda, and J. Widom. RAMP: A system for capturing and tracing provenance in mapreduce workflows. PVLDB, 4(12):1351–1354, 2011.
[24] L. Pipino. Information quality assessment. In Encyclopedia of Database Systems, pages 1512–1515. 2009.
[25] Y. S. Tan, R. K. L. Ko, and G. Holmes. Security and data accountability in distributed systems: A provenance survey. In 10th IEEE International Conference on High Performance Computing and Communications & 2013 IEEE International Conference on Embedded and Ubiquitous Computing, HPCC/EUC 2013, Zhangjiajie, China, November 13-15, 2013, pages 1571–1578, 2013.
[26] B. Wang, B. Li, and H. Li. Oruta: Privacy-preserving public auditing for shared data in the cloud. IEEE T. Cloud Computing, 2(1):43–56, 2014.
[27] E. Wu, S. Madden, and M. Stonebraker. Subzero: A fine-grained lineage system for scientific databases. In 29th IEEE International Conference on Data Engineering, ICDE 2013, Brisbane, Australia, April 8-12, 2013, pages 865–876, 2013.