ProvBook: Provenance-based Semantic Enrichment of Interactive Notebooks for Reproducibility

                      Sheeba Samuel, Birgitta König-Ries

             Heinz-Nixdorf Chair for Distributed Information Systems
                   Friedrich-Schiller University, Jena, Germany
        sheeba.samuel@uni-jena.de, birgitta.koenig-ries@uni-jena.de


      Abstract. With the rapid growth of data science and machine learning,
      interactive notebooks have gained widespread adoption among scientists
      across all disciplines to publish their computational experiments
      containing code, text, and results. Because it is easy to modify and
      re-run the computations in a notebook, it is important to know how the
      provenance of results changed across executions over time, which enables
      trust and reproducibility. In this paper, we present ProvBook, an
      extension of Jupyter Notebook that captures and displays the provenance
      of cell executions over time. It also allows the user to share a
      notebook along with its provenance in RDF and to convert the RDF back to
      a notebook. We use the REPRODUCE-ME ontology, extended from PROV-O and
      P-Plan, to describe the provenance of a notebook. This helps scientists
      compare their previous results with the current ones, check whether the
      experiments produce the expected results, and query the sequence of
      executions using SPARQL. The notebook data in RDF can be used in
      combination with the experiments that used them, which helps to track
      the complete path of the scientific experiments.


Keywords: Notebooks, Provenance, Reproducibility, RDF, Ontology


1   Introduction

The Jupyter Notebook [3] is an open-source web application for creating
documents with interactive output. It supports over 40 programming languages
and has millions of users. Notebook documents contain blocks of text and code
organized as cells. Code cells contain code snippets that can be modified and
executed individually, with the output displayed directly below the cell.
Markdown cells contain documentation of the computational processes. The
cells are arranged linearly but can be moved or executed in any order. A
notebook can be shared in different formats, including HTML, PDF, and LaTeX.

Rule et al. [7] present a study in which they analyzed over 1 million publicly
available notebooks and interviewed 15 data scientists from different disciplines.
One of their results highlights the need for tracking provenance, especially
when cells are overwritten and re-run. Provenance tracking is particularly
helpful in trial-and-error experiments, where it is essential to understand
exactly how a final result was achieved. It is also necessary to keep track of
the experiments that have been attempted, because they may benefit other
scientists even if the results are not as expected. Pimentel et al. [6] present
a mechanism to capture and analyze the provenance of Python scripts inside
IPython Notebooks by integrating with noWorkflow [5]. PROV-O-Matic¹ is another
provenance-tracking extension for older versions of IPython Notebooks, which
saves the provenance traces to a Linked Data file using PROV-O. Another recent
approach is to convert notebooks into workflows, where notebook developers need
to follow a set of guidelines in writing code [1]. These approaches have the
limitation that they require the user to change scripts and are limited to
Python scripts. In our approach, provenance tracking is integrated within the
notebook, so there is no need to change the scripts or learn a new tool. It is
also easy to share the notebook along with the provenance traces of its
executions described as Linked Data.


            Fig. 1: A code cell with the provenance data of its executions.


2     Development

We developed ProvBook², an extension of Jupyter Notebook that captures and
tracks the provenance of cells over time. Every time a code cell is executed,
the provenance of the run is stored in the metadata of the cell in the
notebook. The provenance data of a code cell includes the start and end time
of the execution, the total time it took to run the cell, and the source and
output of the cell. ProvBook also allows the user to see the changes and the
modification time of the text of a markdown cell whenever a notebook is saved.
Figure 1 shows the provenance of a code cell along with its input and output.
When the extension is enabled, the visualization of the provenance data is
displayed below the input of every cell. The user can view the history of
executions by moving the slider and thereby compare previous results with the
current ones and see what changed. The user can view the provenance of
selected cells or all cells, and clear it if needed. ProvBook also lets the
user convert notebooks to RDF. In our previous work [9], we showed how we
combined P-Plan [2] and the REPRODUCE-ME ontology [8], extended from PROV-O
[4], to describe interactive notebooks along with the experiments that used
them in a multi-user environment provided by JupyterHub. The metadata of the
notebook fetched from the JupyterHub API was stored along with the
experimental data in a relational database and displayed in the project
dashboard of our prototype.

¹ https://github.com/Data2Semantics/prov-o-matic
² https://github.com/Sheeba-Samuel/ProvBook

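The per-cell provenance record described above (start and end time, total run time, source, and output, stored in the cell metadata) can be illustrated with a minimal Python sketch. It is not ProvBook's implementation: the metadata key `provenance`, the field names, and the helper functions `record_execution` and `source_diff` are hypothetical, and the cell is a plain dictionary shaped like a code cell in the notebook JSON format.

```python
import copy
import difflib
from datetime import datetime, timedelta

PROVENANCE_KEY = "provenance"  # hypothetical metadata key, not ProvBook's actual one


def record_execution(cell, start_time, end_time, outputs):
    """Append one execution record to the metadata of a code cell."""
    if cell.get("cell_type") != "code":
        raise ValueError("only code cells have execution provenance")
    record = {
        "start_time": start_time.isoformat(),
        "end_time": end_time.isoformat(),
        "execution_time_s": (end_time - start_time).total_seconds(),
        "source": cell["source"],            # source as it was when the cell ran
        "outputs": copy.deepcopy(outputs),   # snapshot, so later runs cannot alter it
    }
    cell.setdefault("metadata", {}).setdefault(PROVENANCE_KEY, []).append(record)
    return record


def source_diff(cell, older, newer):
    """Unified diff between the sources of two recorded executions (by index)."""
    history = cell["metadata"][PROVENANCE_KEY]
    return list(difflib.unified_diff(
        history[older]["source"].splitlines(),
        history[newer]["source"].splitlines(),
        fromfile=f"execution {older}",
        tofile=f"execution {newer}",
        lineterm="",
    ))


# Usage: one cell, executed twice with a modified source in between.
cell = {"cell_type": "code", "source": "x = 1\nprint(x)", "metadata": {}, "outputs": []}
t0 = datetime(2018, 6, 1, 12, 0, 0)
record_execution(cell, t0, t0 + timedelta(seconds=2),
                 [{"output_type": "stream", "text": "1\n"}])
cell["source"] = "x = 2\nprint(x)"
record_execution(cell, t0 + timedelta(minutes=1), t0 + timedelta(minutes=1, seconds=3),
                 [{"output_type": "stream", "text": "2\n"}])

history = cell["metadata"][PROVENANCE_KEY]
diff = source_diff(cell, 0, 1)
print("\n".join(diff))
```

Keeping the history as a list inside the cell metadata means the provenance travels with the notebook file itself, and comparing any two entries corresponds to moving the slider between two executions.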
We extend our work by providing an extension to download the notebook as a
Turtle document so that it is easy to share with collaborators. The RDF file
can also be converted back to a notebook for easy reading and execution.
Figure 2 shows how the provenance of the notebook is represented using the
REPRODUCE-ME ontology. The notebook is described as a p-plan:Plan with the
cells as p-plan:Step of the plan. Every run of a cell is represented as a
CellExecution, which is a subclass of p-plan:Activity. A CellExecution uses
input and generates output, and has a start and end time. The notebook is
attributed to its authors using the object property prov:wasAttributedTo. The
provenance information, including the kernel, the programming language, and
its version, is available in the downloaded RDF. ProvBook was evaluated
with around 50 publicly available notebooks.

Fig. 2: The provenance of the notebook represented using the REPRODUCE-ME
ontology.
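To make the model above concrete, the following Python sketch writes one cell execution as Turtle using only the standard library. It is an illustration under stated assumptions, not ProvBook's exporter: the REPRODUCE-ME namespace IRI bound to `repr`, the `ex` instance namespace, the resource naming scheme, the use of rdf:value to hold source and output text, and the function names are all assumptions. The class and property names (p-plan:Plan, p-plan:Step, CellExecution, prov:used, prov:generated, and so on) follow the description in the text and Fig. 2.

```python
PREFIXES = {
    "rdf": "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
    "xsd": "http://www.w3.org/2001/XMLSchema#",
    "prov": "http://www.w3.org/ns/prov#",
    "p-plan": "http://purl.org/net/p-plan#",
    "repr": "https://w3id.org/reproduceme#",   # assumed REPRODUCE-ME namespace
    "ex": "http://example.org/notebook/",      # hypothetical instance namespace
}


def literal(value, datatype="xsd:string"):
    """Render a typed Turtle literal with basic string escaping."""
    escaped = (str(value).replace("\\", "\\\\").replace('"', '\\"')
               .replace("\n", "\\n").replace("\r", "\\r"))
    return f'"{escaped}"^^{datatype}'


def execution_to_turtle(notebook, cell_index, run_index, record, author):
    """Describe one run of one cell as Turtle, one triple per line."""
    nb = f"ex:{notebook}"
    cell = f"{nb}_cell{cell_index}"
    run = f"{cell}_run{run_index}"
    source, output = f"{run}_source", f"{run}_output"
    triples = [
        # The notebook is a plan, attributed to its author.
        (nb, "a", "p-plan:Plan"),
        (nb, "prov:wasAttributedTo", f"ex:{author}"),
        # The cell is a step of that plan with input and output variables.
        (cell, "a", "p-plan:Step"),
        (cell, "p-plan:isStepOfPlan", nb),
        (cell, "p-plan:hasInputVar", source),
        (cell, "p-plan:hasOutputVar", output),
        (source, "a", "p-plan:Variable"),
        (source, "rdf:value", literal(record["source"])),   # rdf:value is an assumption
        (output, "a", "p-plan:Variable"),
        (output, "rdf:value", literal(record["output"])),
        # The run is a CellExecution that uses the source and generates the output.
        (run, "a", "repr:CellExecution"),
        (run, "p-plan:correspondsToStep", cell),
        (run, "prov:used", source),
        (run, "prov:generated", output),
        (run, "prov:startedAtTime", literal(record["start_time"], "xsd:dateTime")),
        (run, "prov:endedAtTime", literal(record["end_time"], "xsd:dateTime")),
        (run, "repr:executionTime", literal(record["execution_time"])),
    ]
    header = [f"@prefix {prefix}: <{iri}> ." for prefix, iri in PREFIXES.items()]
    body = [f"{s} {p} {o} ." for s, p, o in triples]
    return "\n".join(header + [""] + body) + "\n"


record = {
    "start_time": "2018-06-01T12:00:00",
    "end_time": "2018-06-01T12:00:02",
    "execution_time": "2.0",
    "source": 'print("hello")',
    "output": "hello\n",
}
turtle = execution_to_turtle("nb1", 0, 1, record, "author1")
print(turtle)
```

Once loaded into a triple store, a document of this shape could be queried with SPARQL, for example by selecting all resources of type CellExecution that correspond to a given step and ordering them by prov:startedAtTime to recover the sequence of executions.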

3   Demonstration Overview
We will demonstrate ProvBook in action: participants will be able to load or
author notebooks, execute them, and check the provenance of the results. They
will also be able to convert notebooks to RDF and import the RDF back into
notebooks. We also invite participants to use their own notebooks with
ProvBook. A video showing the installation and use of ProvBook with an example
is available at https://doi.org/10.6084/m9.figshare.6401096.

Acknowledgements
This research is supported by the Deutsche Forschungsgemeinschaft (DFG) in
Project Z2 of the CRC/TRR 166 High-end light microscopy elucidates membrane
receptor function - ReceptorLight.

References
1. Carvalho, L.A.M.C., Wang, R., Gil, Y., Garijo, D.: NiW: Converting notebooks into
   workflows to capture dataflow and provenance (2017)
2. Garijo, D., Gil, Y.: Augmenting PROV with plans in P-Plan: scientific processes as
   linked data. CEUR Workshop Proceedings (2012)
3. Kluyver, T., Ragan-Kelley, B., et al.: Jupyter Notebooks - a publishing format for
   reproducible computational workflows. In: ELPUB. pp. 87–90 (2016)
4. Lebo, T., Sahoo, S., McGuinness, D., Belhajjame, K., et al.: PROV-O: The PROV
   Ontology. W3C Recommendation 30 (2013)
5. Pimentel, J.F., Murta, L., Braganholo, V., Freire, J.: noWorkflow: A tool for
   collecting, analyzing, and managing provenance from Python scripts. Proc. VLDB
   Endow. 10(12), 1841–1844 (Aug 2017)
6. Pimentel, J.F.N., Braganholo, V., Murta, L., Freire, J.: Collecting and analyzing
   provenance on interactive notebooks: When IPython meets noWorkflow. In: 7th
   USENIX Workshop on the Theory and Practice of Provenance (TaPP 15). USENIX
   Association, Edinburgh, Scotland (2015)
7. Rule, A., Tabard, A., Hollan, J.D.: Exploration and explanation in computational
   notebooks. In: Proceedings of the 2018 CHI Conference on Human Factors in Com-
   puting Systems. pp. 32:1–32:12. CHI ’18, ACM, New York, NY, USA (2018)
8. Samuel, S., König-Ries, B.: REPRODUCE-ME: ontology-based data access for re-
   producibility of microscopy experiments. In: The Semantic Web: ESWC 2017 Satel-
   lite Events, Portorož, Slovenia. pp. 17–20 (2017)
9. Samuel, S., König-Ries, B.: Combining P-Plan and the REPRODUCE-ME ontology to
   achieve semantic enrichment of scientific experiments using interactive notebooks.
   In: The Semantic Web: ESWC 2018 Satellite Events. pp. 126–130 (2018)