Generating personalized data narrations from EDA notebooks

Alexandre Chanson, Faten El Outa, Nicolas Labroche, Patrick Marcel, Verónika Peralta, Willeme Verdeaux
University of Tours, Blois, France
firstName.lastName@univ-tours.fr

Lucile Jacquemart
University of Tours, Blois, France
Lucile.Jacquemart@etu.univ-tours.fr

© Copyright 2022 for this paper by its author(s). Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0)

ABSTRACT

In this short paper, we present our preliminary results for generating personalized data narrations by extracting messages from a collection of Exploratory Data Analysis (EDA) notebooks over a given dataset. The approach consists of extracting features from notebooks to learn what interesting messages they expose. Based on those interesting messages, we formalize the problem of producing a user-tailored data narration, i.e., a coherent sequence of messages matching a given user profile. We developed a proof of concept and experimented with Kaggle.com notebooks.

Figure 1: Overview of the approach

1 INTRODUCTION

Exploratory Data Analysis (EDA) is the notoriously tedious task of interactively analyzing datasets to gain insights [10]. EDA notebooks are shared, curated, illustrative EDA sessions prepared by data scientists [6, 17]. They are essentially sequences of programmatic operations and their commented results, shared on code sharing platforms such as Kaggle (https://www.kaggle.com). Supporting EDA can be done by pre-analyzing datasets to compute insights [20] or by automatically generating EDA notebooks using deep learning [6].

Data narration is the activity of producing narratives supported by facts extracted from data exploration and analysis, using interactive visualizations [1]. In an effort to clarify the concepts of data narratives, we recently defined a data narrative as a structured composition of messages that (a) convey findings over the data, and (b) are typically delivered via visual means in order to facilitate their reception by an intended audience, and we proposed a conceptual model describing and structuring the key concepts around data narratives [15]. While several works informally describe the process of data narration crafting [3, 11], automated data narration is only starting to gain attention [8, 19].

Our present work contributes to the field of automated data narration, and aims at connecting EDA notebooks to data narrations. More precisely, our objective is to construct data narrations from EDA notebooks. This requires to (i) identify messages that convey findings in the data, (ii) ensure they are relevant for a given user profile, (iii) arrange them in a coherent composition, and (iv) present them visually.

Problem (i) is addressed by formally defining a message as a component of an EDA notebook, extracting messages, and learning a model of message interestingness. Problem (ii) is addressed by representing messages and user profiles in a vector space, using a classical TF-IDF representation, and using cosine similarity to select the messages closest to the user profile. Problem (iii) is formalized as an instance of the Traveling Analyst Problem (TAP) [2]. Finally, visual presentation (iv) is ensured by reusing existing visualizations extracted from notebooks.

The general pipeline of our approach is shown in Figure 1. Three offline computation modules deal, respectively, with message extraction, computation of message interestingness, and computation of the cognitive distance between messages. These computations are useful for ensuring that messages in the data narrative are interesting and are structured in a cognitively coherent way. Then, online, for a given user need, the message selection module preselects messages that match the user's profile. The user also specifies a budget, representing the maximum number of messages to be included in the data narrative. The TAP module takes as input the preselected messages, their interestingness and distance scores, and the budget, and produces an ordered list of messages (taken among the preselected ones) that maximizes the overall interestingness and minimizes the cognitive distance, while satisfying the given budget. Finally, the narration module generates the data narrative.

Our contributions include:
• a formal framework,
• learning the interestingness of messages,
• an algorithm to generate a user-tailored data narrative from a set of notebooks,
• a proof of concept with Kaggle notebooks, producing various data narratives for a given dataset.

The paper is organized as follows. The next section reviews related work. Section 3 provides the formal background and describes the features we consider to learn message interestingness. Section 4 formalizes the problem and presents our solution. Section 5 discusses the implementation and tests. Finally, Section 6 concludes and draws perspectives.

2 RELATED WORK

In this section we review related work pertaining to the generation of data narratives or the automation of part of the process.

2.1 Automating data narration

Firstly, several works propose solutions for automating data narration starting from a user query [8], a spreadsheet [19], or a topic [18].

The precursor work of Gkesoulis et al. [8] introduced CineCubes, a system that allows the automatic generation of a data story over an OLAP database, with a simple user query as starting point. Each data story has three acts: the first provides contextualization for the characters as well as the incident that sets the story on the move, the second is where the protagonists and the rest of the roles build up their actions and reactions, and the third is where the resolution takes place. The first act refers to the execution of the original query provided by the user. The second act exploits the selection conditions of the original query and automatically generates comparative drill-up queries to provide contextualization. Finally, the third act drills down in the grouping levels of the original result to see the breakdown of its (aggregate) measures and understand its internal structure, providing further analysis of the results. Their tests revealed the ability of CineCubes to quickly generate a report of good quality. However, its fixed structure in three acts can only produce simple data stories with limited insights and visualizations.

Shi et al. [19] proposed Calliope, a system that automatically generates visual data stories from an input spreadsheet. The system incorporates a new logic-oriented Monte Carlo tree search algorithm that explores the data space given by the input spreadsheet to progressively generate story pieces (i.e., data facts) and organize them in a logical sequence. The importance of data facts is measured based on information theory. Each data fact is visualized in a chart and captioned by an automatically generated description. A user study highlighted that the logical order is consistent to humans, the generated data stories express useful data insights, and the visualization modes are satisfactory. Nevertheless, Calliope cannot understand data semantics to better generate the story contents and logic. Also, the generated captions are too rigid and contain grammar errors, and the generated visual encodings are notably simple.

2.2 Automatic data exploration

Some works [6, 12] propose solutions for automating data exploration, the first step of data narration.

McAuley et al. [12] propose ExploroBOT, a novel system developed to support rapid exploration using a combination of automatic chart generation and intuitive navigation supported by a novel visual guidance framework. The criteria to quantify the interestingness of a chart are: (i) data correlation: highly correlated data in scatter plots and trend charts hints towards an interesting relationship between the two variables; (ii) peaks: spikes and large differences in a numerical attribute instantly attract attention; (iii) outliers: a chart with more outliers is deemed more interesting.

Bar El et al. [6] proposed ATENA, a system that takes an input dataset and auto-generates a compelling exploratory session, presented in an EDA notebook. They shaped EDA into a control problem, and devised a novel Deep Reinforcement Learning architecture to effectively optimize the notebook generation.

Personnaz et al. [16] introduce DORA the explorer, which provides guidance to data explorers relying on Deep Reinforcement Learning that combines intrinsic (curiosity) and extrinsic (familiarity) rewards.

Finally, Deutch et al. [5] deal with the generation of explanations for highlighting exploration results. They proposed ExplainED, a system for automatically explaining views in EDA notebooks. The explanations are presented in natural language and describe the particular elements of the view that are the most interesting (the ones having the highest Shapley values).

To the best of our knowledge, our work is the first aiming at automating the production of personalized data narratives by leveraging existing EDA notebooks. One prominent aspect of our approach is to qualify the interestingness of messages contained in existing notebooks. This is important as messages are the cornerstone of data narratives [15] and since the quality of notebooks is known to be very diverse [21].
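As a minimal sketch of the TF-IDF-based relevance filtering mentioned in the introduction (and detailed later in Section 4.3), messages and user profiles can share one TF-IDF space, with a message kept when its cosine similarity to the profile is above a threshold. The profile, message texts and threshold below are illustrative, not taken from the paper:

```python
# Sketch of profile/message relevance filtering with TF-IDF + cosine
# similarity (illustrative texts; the paper's corpus is all profiles
# plus the text cells of all messages).
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

profile = "airline delay weather"
messages = [
    "Flight delays correlate with bad weather at the departure airport.",
    "The dataset contains 10,000 rows describing customer churn.",
]

# Fit one vectorizer on profiles and message texts together.
vectorizer = TfidfVectorizer()
vectors = vectorizer.fit_transform([profile] + messages)

# Similarity of the profile (row 0) to every message (rows 1..n).
sims = cosine_similarity(vectors[0], vectors[1:])[0]
relevant = [m for m, s in zip(messages, sims) if s > 0]
```

Only the first message shares vocabulary with the profile, so it is the only one retained with a threshold of 0.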
Shi et al. [18] proposed AutoClips, an automatic approach to generate data videos from a given topic. It is based on four phases: (i) collecting a series of data facts around a certain topic, (ii) constructing a storyline as an assembly of these data facts into a sequence, (iii) choosing data visualizations for the data facts and deciding how to animate them by drawing a storyboard, and finally (iv) realizing the storyboard via a design software in which the narrator edits and combines the animated visualizations until a coherent data video is accomplished. Their evaluation revealed that AutoClips can generate comprehensible and engaging data videos of comparable quality with human-made videos. However, the system only supports tabular data and favors datasets with diverse column types.

Wang et al. [22] conducted a qualitative analysis on 245 infographics, studying the design space in terms of structures, sheet layouts, fact types, and visualization styles. Based on those, the authors propose a system for the auto-generation of fact sheets. It consists of three phases: (i) fact extraction, (ii) fact composition, and (iii) presentation synthesis. Their validation of the system highlighted the efficiency of data exploration and the ease of understanding of the visualizations. As limitations, we point out that data semantics is not considered during exploration, and that visualizations are taken from a small-sized predefined library.

3 FORMAL BACKGROUND

This section introduces the representation of EDA notebooks and messages and presents the set of properties used for learning interestingness.

3.1 Preliminary definitions

EDA notebooks are essentially sequences of programmatic operations and their commented results. They are linearly structured as a sequence of cells of two types: text and code. Text cells contain explanatory text, typically including titles, definitions, explanations and comments. Code cells contain a sequence of commands and their output, typically including numeric results and graphics.

We consider that a code cell together with a text cell delivers a commented result on a logical part of the notebook. We will call it a message in what follows. We represent a message as a pair of code and text cells, together with a set of numerical properties describing their contents (e.g., the number of words in the text cell, or the complexity of the code). The whole set of properties is described in the next subsection.

Definition 3.1 (Message). Let T be an infinite set of text cells and C an infinite set of code cells. A message is a tuple m = ⟨c, t, p_1^m, ..., p_o^m⟩ where c ∈ C is a code cell, t ∈ T is a text cell, and p_i^m, 1 ≤ i ≤ o, are properties of c or t.

Finally, we represent a notebook as a sequence of messages and a set of notebook properties (e.g., number of users' likes).

Definition 3.2 (Notebook). Let M be an infinite set of messages. A notebook is a tuple n = ⟨m_1, ..., m_v, p_1^n, ..., p_w^n⟩ where m_i ∈ M, 1 ≤ i ≤ v, are messages, and p_j^n, 1 ≤ j ≤ w, are properties of n.

3.2 Properties

The properties of cells, messages and notebooks correspond to features extracted from notebooks, detailed in Table 1. We consider the following feature dimensions:

• notebook popularity: these features indicate the global popularity of the notebooks among Kaggle users. They are the main drivers to compute messages' interestingness as they express the opinion of the community of users.
• notebook structure: these features describe the size of the notebook in terms of cells and lines of code and comments.
• code cell: these features characterize code cells in terms of their complexity and the presence of a visualization. In addition to the number of lines of code, two classical software engineering metrics are used: cyclomatic complexity [13] and the Halstead metric [9].
• text cell: these features characterize the content of text cells, especially in terms of readability, i.e., indexes related to the level of studies a person needs to understand the text at first reading, computed from the number of words, sentences, syllables or characters.
• message characteristics: this feature indicates where the message is located in the notebook. Often the first messages of a notebook are simple data profiling, while messages at the end tend to be more elaborate.

In the following we restrict to notebooks and messages over a given dataset. Implementation details about message and property extraction are given in Section 5.

Dimension           | Name                        | #
--------------------|-----------------------------|---
Notebook popularity | Number of likes             | 0
                    | Number of views             | 1
                    | Number of forks             | 2
                    | Author's expertise          | 3
Notebook structure  | Number of cells             | 4
                    | Number of lines of code     | 5
                    | Number of lines of text     | 6
Code cell           | Number of characters        | 7
                    | Halstead score              | 8
                    | Cyclomatic complexity       | 9
                    | Generates a visualization   | 10
Text cell           | Number of characters        | 11
                    | Number of words             | 12
                    | Flesch reading ease index   | 13
                    | Gunning-Fog index           | 14
                    | Automated Readability Index | 15
                    | Coleman-Liau index          | 16
Message in notebook | Position in notebook        | 17

Table 1: Features considered

4 EXTRACTING NARRATIONS FROM NOTEBOOKS

In this section, we describe how we process notebook messages to extract narrations.

4.1 Problem definition

Let M_D be the set of messages over a dataset D. We are interested in producing a sequence of ε_t messages from M_D such that their total interestingness is maximal and the overall cognitive distance between them is minimal. This problem is defined formally in [2] as follows:

Definition 4.1 (Traveling Analyst Problem (TAP)). Let Q be a set of N queries, each associated with a positive time cost cost(q_i) and a positive interestingness score interest(q_i). Each pair of queries is associated with a metric dist(q_i, q_j) representing the cognitive distance of browsing from one query result to the next. Given a time budget ε_t, the optimization problem consists in finding a sequence ⟨q_1, ..., q_M⟩ of queries, q_i ∈ Q, without repetition, with M ≤ N, such that:
(1) max Σ_{i=1..M} interest(q_i)
(2) Σ_{i=1..M} cost(q_i) ≤ ε_t
(3) min Σ_{i=1..M−1} dist(q_i, q_{i+1})

Lemma 4.2 (Complexity of TAP [2]). TAP is strongly NP-hard.

It can easily be seen that our problem is an instance of TAP, where queries are notebook messages and all their costs are the same. We next define the interestingness and distance functions.

4.2 Characterizing interestingness

To characterize the interestingness of messages, instead of proposing our own definition, we choose to learn a model of it, using the features in Table 1. Our strategy is to compute a score for messages based on the dimensions Notebook popularity, Notebook structure and Message in notebook, and then to learn this score using the features specific to messages, i.e., those in the dimensions Code cell and Text cell. We choose to focus on regression models as they give good results on similar problems [14]. We use auto-machine learning [7] to learn the model, since we aim to achieve good accuracy by testing a large spectrum of models and hyperparameters.

4.3 Ensuring relevance

Note that the interestingness of messages is learned independently of any user requirement. In order to build a coherent data narrative in accordance with user interests, we introduce the notion of user profile and we propose to pre-filter the set of messages that are relevant to such a profile.

We model a user profile as a set of keywords representing the user's interests. The relevance of a message for a user profile is computed based on the similarity of the text contained in the profile and in the text cell of the message. We use an off-the-shelf cosine similarity between the TF-IDF vectors of the user profile and the message. We use as document corpus the overall set of users' profiles U and the text cells of all messages in M.

Formally, let m ∈ M be a message and u ∈ U be a user profile, with t being the text cell of m. Let V_1 and V_2 be respectively the TF-IDF vectors of u and t. The similarity between u and m is computed as:

sim(u, m) = cosine(V_1, V_2)    (1)

4.4 Characterizing distance

The distance between two messages is also computed based on the similarity of the text contained in their text cells. We use the same TF-IDF vectors for messages as computed for characterizing relevance.

Formally, let m_1, m_2 ∈ M be two messages, with t_1 and t_2 being respectively their text cells. Let V_1 and V_2 be respectively the TF-IDF vectors of t_1 and t_2. The distance between the two messages is computed as:

dist(m_1, m_2) = 1 − cosine(V_1, V_2)    (2)

4.5 Main algorithms

This subsection presents two algorithms that implement the approach.

5 IMPLEMENTATION AND TESTS

Our prototype is implemented in Python, using the libraries Radon for code metrics and py-readability-metrics for readability metrics. We used the Kaggle API to access the datasets and notebooks. To match a code cell with the visualization it produces, we used the HTML page of the notebook, because the Kaggle API does not provide the visualization. We used Beautiful Soup to parse the HTML and mapped the visualization to the code cell using a join on the code text. We used sklearn for the TF-IDF vectorization. Solving the TAP problem (see Section 4.1) exactly is done with a mathematical model on CPLEX 20.10 and is implemented in C++ (https://github.com/AlexChanson/Cplex-TAP). For large sets of messages (more than 500 messages) finding exact solutions is intractable. We use a fast and memory-efficient heuristic inspired by the classic "sort by item efficiency" heuristic for solving the knapsack problem [4].
Algorithm 1 describes the extraction of messages and the computation of their interestingness and distance. This algorithm can be executed offline. Algorithm 2 describes the generation of a data narrative for a specific user profile. It pre-selects relevant messages, calls the TAP for selecting and structuring messages, and finally writes the narrative.

Algorithm 1 Message extraction and computations
Require: a set of notebooks N_D, a set of user profiles U
Ensure: a set of messages M_D, an interestingness vector interest, a distance matrix distance
1: Let M_D = extractMessages(N_D)
2: Let interest() = learnInterestingness(M_D, N_D)
3: index(M_D ∪ U)
4: Let distance() = computeDistance(M_D ∪ U)
5: return M_D, interest(), distance()

Algorithm 2 User-tailored narrative from notebook messages
Require: a set of messages M_D, a set of notebooks N_D, a user profile u, a similarity threshold ε_s, a number of expected messages ε_t
Ensure: a data narrative for the user
1: Let M = ∅
2: for m ∈ M_D do
3:   if sim(m, u) > ε_s then
4:     M = M ∪ {m}
5:   end if
6: end for
7: Let T = TAP(M, interest, distance, ε_t)
8: return narrate(T, N_D)

The extractMessages function extracts messages from a set of notebooks. Its implementation is described in Section 5.

The code of the approach is available on Github (https://github.com/Blobfish-LIFAT/NotebookCrowdsourcing). We tested our code on 377 Kaggle notebooks from the first 18 datasets of Kaggle.com having more than 20 notebooks, sorted by votes. We extracted messages from these notebooks by considering only the code cells immediately followed by a text cell (a markdown cell in Kaggle terminology). This resulted in 10166 messages. The correlation matrix of the features of Table 1 is displayed in Figure 2, computed with Pearson's correlation coefficient. The order of features in the figure is the same as the order in the table. Globally it can be seen that the features are correlated when they are in the same feature dimension. In more detail:

• in the dimension notebook popularity, it can be seen that, unsurprisingly, likes, views and forks are quite correlated, while expertise is correlated to none of the others;
• the number of lines of code and the number of lines of text are only weakly correlated;
• while code metrics are heavily correlated, they are not correlated to the length of the code, and the generation of a visualization is correlated to none of the other features in this dimension;
• interestingly, the position of messages is quite correlated to the total number of cells in the notebook and to the total number of lines of text, while it is less correlated to the number of lines of code. This reflects the correlations found in the notebook structure dimension and the fact that the more messages, the more cells in the notebook. On the other hand, the position of the message is not correlated to its own cell's code length or text length.

To learn interestingness, we use Auto-sklearn (https://automl.github.io/auto-sklearn/master/index.html) with the principle presented in the previous section. Auto-sklearn produces an ensemble model: not a single model, but several models collaborating to achieve the best possible regression. The best ensemble model we obtained, in terms of R², is indicated in Table 2.
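The learning step can be sketched as follows. The paper uses Auto-sklearn to search models and hyperparameters; to keep this example light and runnable, a single scikit-learn regressor is substituted, and the data is entirely synthetic (real inputs would be the Code cell / Text cell features predicting the popularity-based target score):

```python
# Sketch of the interestingness-learning step (illustrative stand-in:
# one scikit-learn regressor instead of Auto-sklearn's model search).
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 500
# Synthetic message-specific features (standing in for the Code cell
# and Text cell dimensions of Table 1).
X = rng.normal(size=(n, 7))
# Synthetic target standing in for the score built from the notebook
# popularity, notebook structure and message-in-notebook features.
y = X[:, 0] * 2.0 + np.sin(X[:, 1]) + rng.normal(scale=0.1, size=n)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
model = GradientBoostingRegressor(random_state=0).fit(X_tr, y_tr)
r2 = model.score(X_te, y_te)  # R^2 on held-out messages
```

Evaluating R² on a held-out split, as above, is what separates the training-phase and testing-phase scores reported for the learned model.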
The best ensemble model, reported in Table 2, achieves an R² score of 0.85 in the training phase and 0.59 in the testing phase. The target score we use was constructed by multiplying all the features in the dimensions notebook popularity, notebook structure and message in notebook.

The learnInterestingness function computes message interestingness as described in Subsection 4.2. The index function indexes the corpus of messages and profiles, computing TF-IDF vectors, as described in Subsection 4.3. Such vectors are used for computing the distance among messages (the computeDistance function) and the similarity between a message and a profile (the sim function, which is 1 − distance()). The TAP function implements the optimization problem described in Subsection 4.1. Finally, the narrate function generates the narration by writing messages in the order indicated by the TAP, reusing the original visualizations of the messages.

We created 20 user profiles by retrieving the owners of the datasets of Kaggle.com with the most votes and then retrieving all the datasets owned by these users. The words in the descriptions of those datasets are used to form the profiles. For the 20 users, profiles ranged between 5 and 20 words, with an average of 14.5 (stdev is 4.97). The descriptions of those datasets, together with the text of all text cells identified when extracting messages, formed the vocabulary from which TF-IDF vectors for users and messages were computed.

For each user, we filtered the set of messages using their profile, using the cosine similarity between both TF-IDF vectors with a threshold of 0. The number of messages relevant for each profile ranges between 191 (minimum) and 2551 (maximum), with an average of 798.

We generated one narration for each profile, asking for ε_t = 10 messages in it. On average, the generated narrations have 8.3 messages (minimum 2, maximum 10, stdev 1.75). To measure the degree of personalization of the narration, we use the Szymkiewicz–Simpson overlap coefficient (the size of the intersection divided by the smaller of the sizes of the two sets; a form of Jaccard coefficient adapted to sets with different cardinality) between the profile and the text of the messages. On average it is 0.15 (minimum 0.07, maximum 0.44, stdev 0.12). These low scores are expected since a threshold of 0 was used to select messages for each profile.

To measure the coherence and diversity of the messages in the generated narrations, we measured (i) the number of different notebooks the messages come from and (ii) the Szymkiewicz–Simpson overlap coefficient between the different message texts in the narration. Regarding (i), on average the messages come from 4.2 notebooks (minimum 1, maximum 9, stdev 0.3). As to (ii), the overlap is 0.68 on average (minimum 0.59, maximum 0.77, stdev 0.04). The generated narrations, under the form of Jupyter notebooks, are available on Github (https://github.com/Blobfish-LIFAT/NotebookCrowdsourcing/tree/master/output/notebooks).

6 CONCLUSION

This short paper introduces a novel approach for generating personalized data narratives from EDA notebooks. The approach consists of extracting messages from existing notebooks, learning their interestingness, filtering this set of messages for some user profile, and generating a coherent sequence of messages adapted to this profile. We detailed the implementation of our proof of concept, and presented a preliminary experiment with Kaggle.com notebooks.

We are currently working at improving our approach by providing more robust message detection, better accounting for the visualizations related to the messages, and generating narratives that are more coherent, less redundant and more personalized. We will evaluate the approach with user tests, comparing it with competitor approaches to generate notebooks [6, 16] and assessing its scalability.

Figure 2: Correlation of all the features in Table 1

rank | ensemble weight | type
-----|-----------------|--------------------
1    | 0.76            | gaussian process
2    | 0.02            | gradient boosting
3    | 0.04            | gradient boosting
4    | 0.08            | k nearest neighbors
5    | 0.10            | gradient boosting

Table 2: Model of interestingness

REFERENCES

[1] Sheelagh Carpendale, Nicholas Diakopoulos, Nathalie Henry Riche, and Christophe Hurter. Data-driven storytelling (Dagstuhl seminar 16061). Dagstuhl Reports, 2016.
[2] Alexandre Chanson, Ben Crulis, Nicolas Labroche, Patrick Marcel, Verónika Peralta, Stefano Rizzi, and Panos Vassiliadis. The traveling analyst problem: Definition and preliminary study. In DOLAP@EDBT/ICDT, 2020.
[3] S. Chen, J. Li, G. Andrienko, N. Andrienko, Y. Wang, P. H. Nguyen, and C. Turkay. Supporting story synthesis: Bridging the gap between visual analytics and storytelling. TVCG, 2018.
[4] George B. Dantzig. Discrete-variable extremum problems. Operations Research, 5(2):266–288, 1957.
[5] Daniel Deutch, Amir Gilad, Tova Milo, and Amit Somech. ExplainED: Explanations for EDA notebooks. Proc. VLDB Endow., 13(12):2917–2920, 2020.
[6] Ori Bar El, Tova Milo, and Amit Somech. Automatically generating data exploration sessions using deep reinforcement learning. In SIGMOD, 2020.
[7] Matthias Feurer, Aaron Klein, Katharina Eggensperger, Jost Tobias Springenberg, Manuel Blum, and Frank Hutter. Efficient and robust automated machine learning. In Advances in Neural Information Processing Systems, Canada, 2015.
[8] Dimitrios Gkesoulis, Panos Vassiliadis, and Petros Manousis. CineCubes: Aiding data workers gain insights from OLAP queries. Inf. Syst., 53:60–86, 2015.
[9] Maurice H. Halstead. Elements of Software Science (Operating and Programming Systems Series). Elsevier Science Inc., USA, 1977.
[10] Stratos Idreos, Olga Papaemmanouil, and Surajit Chaudhuri. Overview of data exploration techniques. In SIGMOD, 2015.
[11] Robert Kosara and Jock Mackinlay. Storytelling: The next step for visualization. IEEE Computer, 46, 2013.
[12] John McAuley, Rohan Goel, and Tamara Matthews. ExploroBOT: Rapid exploration with chart automation. In VISIGRAPP, 2019.
[13] Thomas J. McCabe. A complexity measure. IEEE Trans. Software Eng., 2(4):308–320, 1976.
[14] Martina Megasari, Pandu Wicaksono, Chiao Yun Li, Clément Chaussade, Shibo Cheng, Nicolas Labroche, Patrick Marcel, and Verónika Peralta. Can models learned from a dataset reflect acquisition of procedural knowledge? An experiment with automatic measurement of online review quality. In Il-Yeol Song, Alberto Abelló, and Robert Wrembel, editors, Proceedings of DOLAP, volume 2062 of CEUR Workshop Proceedings. CEUR-WS.org, 2018.
[15] Faten El Outa, Matteo Francia, Patrick Marcel, Verónika Peralta, and Panos Vassiliadis. Towards a conceptual model for data narratives. In ER, 2020.
[16] Aurélien Personnaz, Sihem Amer-Yahia, Laure Berti-Équille, Maximilian Fabricius, and Srividya Subramanian. DORA THE EXPLORER: Exploring very large data with interactive deep reinforcement learning. In CIKM, 2021.
[17] Adam Rule, Aurélien Tabard, and James D. Hollan. Exploration and explanation in computational notebooks. In CHI, 2018.
[18] D. Shi, F. Sun, X. Xu, Xingyu Lan, David Gotz, and Nan Cao. AutoClips: An automatic approach to video generation from data facts. Comput. Graph. Forum, 40(3):495–505, 2021.
[19] Danqing Shi, Xinyue Xu, Fuling Sun, Yang Shi, and Nan Cao. Calliope: Automatic visual data story generation from a spreadsheet. IEEE Trans. Vis. Comput. Graph., 27(2):453–463, 2021.
[20] Bo Tang, Shi Han, Man Lung Yiu, Rui Ding, and Dongmei Zhang. Extracting top-k insights from multi-dimensional data. In SIGMOD, 2017.
[21] Jiawei Wang, Li Li, and Andreas Zeller. Better code, better sharing: On the need of analyzing Jupyter notebooks. In ICSE-NIER, 2020.
[22] Yun Wang, Zhida Sun, Haidong Zhang, Weiwei Cui, Ke Xu, Xiaojuan Ma, and Dongmei Zhang. DataShot: Automatic generation of fact sheets from tabular data. IEEE Trans. Vis. Comput. Graph., 2020.