<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Generating personalized data narrations from EDA notebooks</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Alexandre Chanson, Faten El Outa, Nicolas Labroche, Patrick Marcel, Verónika Peralta, Willeme Verdeaux</string-name>
          <email>firstName.lastName@univ-tours.fr</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Lucile Jacquemart</string-name>
          <email>Lucile.Jacquemart@etu.univ-tours.fr</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>University of Tours</institution>
          ,
          <addr-line>Blois</addr-line>
          ,
          <country country="FR">France</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>University of Tours</institution>
          ,
          <addr-line>Blois</addr-line>
          ,
          <country country="FR">France</country>
        </aff>
      </contrib-group>
      <fpage>2</fpage>
      <lpage>6</lpage>
      <abstract>
        <p>In this short paper, we present our preliminary results for generating personalized data narrations by extracting messages from a collection of Exploratory Data Analysis (EDA) notebooks over a given dataset. The approach consists of extracting features from notebooks to learn what interesting messages they expose. Based on those interesting messages, we formalize the problem of producing a user-tailored data narration, i.e., a coherent sequence of messages matching a given user profile. We developed a proof of concept and experimented with Kaggle.com notebooks.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>INTRODUCTION</title>
      <p>
        Exploratory Data Analysis (EDA) is the notoriously tedious task
of interactively analyzing datasets to gain insights [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ]. EDA
notebooks are shared, curated, illustrative EDA sessions prepared by
data scientists [
        <xref ref-type="bibr" rid="ref17 ref6">6, 17</xref>
        ]. EDA notebooks are essentially sequences
of programmatic operations and their commented results, shared
on code sharing platforms such as Kaggle. Supporting EDA can
be done by pre-analyzing datasets for computing insights [
        <xref ref-type="bibr" rid="ref20">20</xref>
        ] or
by automatically generating EDA notebooks using deep learning
[
        <xref ref-type="bibr" rid="ref6">6</xref>
        ].
      </p>
      <p>
        Data narration is the activity of producing narratives
supported by facts extracted from data exploration and analysis,
using interactive visualizations [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. In an effort to clarify the
concepts of data narratives, we recently defined a data narrative
as a structured composition of messages that (a) convey findings
over the data, and, (b) are typically delivered via visual means
in order to facilitate their reception by an intended audience, and
we proposed a conceptual model describing and structuring the
key concepts around data narratives [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ]. While several works
informally describe the process of data narration crafting [
        <xref ref-type="bibr" rid="ref11 ref3">3, 11</xref>
        ],
automated data narration has only started to gain attention [
        <xref ref-type="bibr" rid="ref19 ref8">8, 19</xref>
        ].
      </p>
      <p>Our present work contributes to the field of automated data
narration, and aims at connecting EDA notebooks to data
narrations. More precisely, our objective is to construct data narrations
from EDA notebooks. This requires (i) identifying messages that
convey findings in the data, (ii) ensuring they are relevant for a
given user profile, (iii) arranging them in a coherent composition,
and (iv) presenting them visually.</p>
      <p>
        Problem (i) is addressed by formally defining a message as a
component of an EDA notebook, extracting messages and learning a
model of message interestingness. Problem (ii) is addressed by
representing messages and user profiles in a vector space, using
a classical TF-IDF representation, and using cosine similarity
to select the messages closest to the user profile. Problem (iii) is
formalized as an instance of the Traveling Analyst Problem (TAP)
[
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. Finally, visual presentation (iv) is ensured by reusing existing
visualizations extracted from notebooks.
      </p>
      <p>The general pipeline of our approach is shown in Figure 1.</p>
      <p>There are three offline computation modules that deal,
respectively, with message extraction, computation of message
interestingness, and computation of the cognitive distance
between messages. These computations ensure that
messages in the data narrative are interesting and are structured
in a cognitively coherent way. Then, online, for a given user need,
the message selection module preselects messages that match
the user’s profile. The user also specifies a budget, representing
the maximum number of messages to be included in the data
narrative. The TAP module takes as input the preselected messages,
their interestingness and distance scores, and the budget, and
produces an ordered list of messages (taken among the preselected
ones) that maximizes the overall interestingness and minimizes
the cognitive distance between them, while satisfying the given budget. Finally,
the narration module generates the data narrative.</p>
      <p>Our contributions include:
• a formal framework,
• learning the interestingness of messages,
• an algorithm to generate a user-tailored data narrative
from a set of notebooks,
• a proof of concept with Kaggle notebooks, producing
various data narratives for a given dataset.</p>
      <p>The paper is organized as follows. The next section reviews
related work. Section 3 provides the formal background and
describes the features we consider to learn messages’
interestingness. Section 4 formalizes the problem and presents our solution.
Section 5 discusses the implementation and tests. Finally, Section
6 concludes and outlines perspectives.</p>
    </sec>
    <sec id="sec-2">
      <title>RELATED WORK</title>
      <p>In this section we review related work pertaining to the
generation of data narratives or the automation of part of the process.</p>
    </sec>
    <sec id="sec-3">
      <title>Automating data narration</title>
      <p>
        Firstly, several works propose solutions for automating data
narration starting from a user query [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ], a spreadsheet [
        <xref ref-type="bibr" rid="ref19">19</xref>
        ], or a
topic [
        <xref ref-type="bibr" rid="ref18">18</xref>
        ].
      </p>
      <p>
        The precursor work of Gkesoulis et al. [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ] introduced
CineCubes, a system that automatically generates a data
story over an OLAP database, with a simple user query as the starting
point. Each data story follows the three-act structure of film
narration: the first act provides contextualization for the characters
as well as the incident that sets the story in motion, the second is
where the protagonists and the rest of the roles build up their
actions and reactions, and the third is where the resolution of the
film takes place. In CineCubes, the first act
refers to the execution of the original query provided by the user.
The second act exploits the selection conditions of the original
query and automatically generates comparative drill-up queries
to provide contextualization. Finally, the third act drills down
in the grouping levels of the original result to see the breakdown
of its (aggregate) measures and understand its internal structure,
providing further analysis of the results. Their tests revealed the
ability of CineCubes to quickly generate a report of better quality.
However, its fixed three-act structure can only produce simple
data stories with limited insights and visualizations.
      </p>
      <p>
        Shi et al. [
        <xref ref-type="bibr" rid="ref19">19</xref>
        ] proposed Calliope, a system that automatically
generates visual data stories from an input spreadsheet. The
system incorporates a new logic-oriented Monte Carlo tree search
algorithm that explores the data space given by the input
spreadsheet to progressively generate story pieces (i.e., data facts) and
organize them in a logical sequence. The importance of data facts
is measured based on information theory. Each data fact is
visualized in a chart and captioned by an automatically generated
description. A user study highlighted that the logical order is
consistent for human readers, the generated data stories express useful
data insights, and the visualization modes are satisfactory.
Nevertheless, Calliope cannot understand data semantics to better
generate the story contents and logic. Also, the generated
captions are too rigid and contain grammatical errors, and the generated
visual encodings are notably simple.
      </p>
      <p>
        Shi et al. [
        <xref ref-type="bibr" rid="ref18">18</xref>
        ] proposed AutoClips, an automatic approach to
generate data videos from a given topic. It is based on 4 phases: (i)
collecting a series of data facts around a certain topic, (ii)
constructing a storyline as an assembly of these data facts into a sequence,
(iii) choosing data visualizations for the data facts and deciding
how to animate them by drawing a storyboard, and finally, (iv)
realizing the storyboard via a design software in which the narrator
edits and combines the animated visualizations until a coherent
data video is accomplished. Their evaluation revealed that
AutoClips can generate comprehensible and engaging data videos
of comparable quality to human-made videos.
However, the system only supports tabular data and favors datasets
with diverse column types.
      </p>
      <p>
        Wang et al. [
        <xref ref-type="bibr" rid="ref22">22</xref>
        ] conducted a qualitative analysis of 245
infographics, studying the design space in terms of structures, sheet
layouts, fact types, and visualization styles. Based on this analysis, the
authors propose a system for the automatic generation of fact
sheets. It consists of three phases: (i) fact extraction, (ii) fact
composition, and (iii) presentation synthesis. Their validation
of the system highlighted the efficiency of data exploration and
the ease of understanding of the visualizations. As limitations,
we point out that data semantics is not considered during
exploration, and that visualizations are taken from a small
predefined library.
      </p>
    </sec>
    <sec id="sec-4">
      <title>Automatic data exploration</title>
      <p>
        Some works [
        <xref ref-type="bibr" rid="ref12 ref6">6, 12</xref>
        ] propose solutions for automating data
exploration, the first step of data narration.
      </p>
      <p>
        McAuley et al. [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ] propose ExploroBOT, a system
developed to support rapid exploration using a combination of
automatic chart generation and intuitive navigation supported
by a novel visual guidance framework. The criteria used to quantify
the interestingness of a chart are: (i) data correlation: highly
correlated data in scatter plots and trend charts hint at an
interesting relationship between the two variables; (ii) peaks:
spikes and large differences in a numerical attribute instantly
attract attention; (iii) outliers: a chart with more outliers is deemed
more interesting.
      </p>
      <p>
        Bar El et al. [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ] proposed ATENA, a system that takes an input
dataset and auto-generates a compelling exploratory session,
presented in an EDA notebook. They cast EDA as a
control problem, and devised a novel Deep Reinforcement Learning
architecture to effectively optimize the notebook generation.
      </p>
      <p>
        Personnaz et al. [
        <xref ref-type="bibr" rid="ref16">16</xref>
        ] introduce DORA the explorer, which
provides guidance to data explorers, relying on Deep
Reinforcement Learning that combines intrinsic (curiosity) and extrinsic
(familiarity) rewards.
      </p>
      <p>
        Finally, Deutch et al. [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] deal with the generation of
explanations for highlighting exploration results. They proposed
ExplainED, a system for automatically explaining views in EDA
notebooks. The explanations are presented in Natural Language
and describe the particular elements of the view that are the most
interesting (the ones having the highest Shapley values).
      </p>
      <p>
        To the best of our knowledge, our work is the first that aims
at automating the production of personalized data narratives by
leveraging existing EDA notebooks. One prominent aspect of our
approach is to qualify the interestingness of messages contained
in existing notebooks. This is important since messages are the
cornerstone of data narratives [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ] and since the quality of
notebooks is known to be very diverse [
        <xref ref-type="bibr" rid="ref21">21</xref>
        ].
      </p>
    </sec>
    <sec id="sec-5">
      <title>FORMAL BACKGROUND</title>
      <p>This section introduces the representation of EDA notebooks
and messages and presents the set of properties used for learning
interestingness.</p>
    </sec>
    <sec id="sec-6">
      <title>Preliminary definitions</title>
      <p>EDA notebooks are essentially sequences of programmatic
operations and their commented results. They are linearly structured
as a sequence of cells, of two types: text and code. Text cells
contain explanatory text, typically including titles, definitions,
explanations and comments. Code cells contain a sequence of
commands and their output, typically including numeric results
and graphics.</p>
      <p>We consider that a code cell together with a text cell delivers
a commented result on a logical part of the notebook. We
call it a message in what follows. We represent a message as a pair
of code and text cells, together with a set of numerical properties
describing their contents (e.g., the number of words in the text
cell, or the complexity of the code). The whole set of properties
is described in the next subsection.</p>
      <p>Definition 3.1 (Message). Let $\mathcal{T}$ be an infinite set of text cells
and $\mathcal{C}$ an infinite set of code cells. A message is a tuple
$m = \langle c, t, p_1, \ldots, p_n \rangle$ where $c \in \mathcal{C}$ is a code cell, $t \in \mathcal{T}$ is a text cell
and $p_i$, $1 \le i \le n$, are properties of $c$ or $t$.</p>
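      <table-wrap>
        <label>Table 1</label>
        <table>
          <thead>
            <tr><th>Dimension</th><th>Name</th></tr>
          </thead>
          <tbody>
            <tr><td>Notebook popularity</td><td>Number of likes</td></tr>
            <tr><td>Notebook popularity</td><td>Number of views</td></tr>
            <tr><td>Notebook popularity</td><td>Number of forks</td></tr>
            <tr><td>Notebook popularity</td><td>Author’s expertise</td></tr>
            <tr><td>Notebook structure</td><td>Number of cells</td></tr>
            <tr><td>Notebook structure</td><td>Number of lines of code</td></tr>
            <tr><td>Notebook structure</td><td>Number of lines of text</td></tr>
            <tr><td>Code cell</td><td>Number of characters</td></tr>
            <tr><td>Code cell</td><td>Halstead score</td></tr>
            <tr><td>Code cell</td><td>Cyclomatic complexity</td></tr>
            <tr><td>Code cell</td><td>Generates a visualization</td></tr>
            <tr><td>Text cell</td><td>Number of characters</td></tr>
            <tr><td>Text cell</td><td>Number of words</td></tr>
            <tr><td>Text cell</td><td>Flesch reading ease index</td></tr>
            <tr><td>Text cell</td><td>Gunning-Fog index</td></tr>
            <tr><td>Text cell</td><td>Automated Readability Index</td></tr>
            <tr><td>Text cell</td><td>Coleman-Liau index</td></tr>
            <tr><td>Message in notebook</td><td>Position in notebook</td></tr>
          </tbody>
        </table>
      </table-wrap>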
      <p>Finally, we represent a notebook as a sequence of messages
and a set of notebook properties (e.g. number of user’s likes).</p>
      <p>Definition 3.2 (Notebook). Let $\mathcal{M}$ be an infinite set of messages.
A notebook is a tuple $N = \langle m_1, \ldots, m_j, p_1, \ldots, p_l \rangle$ where $m_i \in \mathcal{M}$,
$1 \le i \le j$, are messages, and $p_i$, $1 \le i \le l$, are properties of
$N$.</p>
    </sec>
    <sec id="sec-7">
      <title>Properties</title>
      <p>The properties of cells, messages and notebooks correspond to
features extracted from notebooks, detailed in Table 1.</p>
      <p>
        We consider the following feature dimensions:
• notebook popularity: these features indicate the global
popularity of the notebooks among Kaggle users. They are
the main drivers to compute messages’ interestingness as
they express the opinion of the community of users.
• notebook structure: these features describe the size of the
notebook in terms of cells and lines of code and comments.
• code cell: these features characterize code cells in terms
of their complexity and the presence of a visualization. In
addition to the number of lines of code, two classical
software engineering metrics are used: cyclomatic complexity
[
        <xref ref-type="bibr" rid="ref13">13</xref>
        ] and the Halstead metrics [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ].
• text cell: these features characterize the content of text
cells, especially in terms of readability, i.e., indexes that estimate
the level of education a person needs to understand the
text on first reading. They are computed from the number
of words, number of sentences, number of syllables or
number of characters.
• message characteristics: this feature indicates where the
message is located in the notebook. Often the first
messages of a notebook are simple data profiling while
messages at the end tend to be more elaborate.
      </p>
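      <p>As an illustration of how such readability indexes combine these counts, the following minimal Python sketch computes the Automated Readability Index with its standard formula, 4.71 × (characters / words) + 0.5 × (words / sentences) − 21.43. The function name and the naive regular-expression tokenization are assumptions made for this example only; the prototype relies on a dedicated readability library instead (see Section 5).</p>
      <preformat><![CDATA[
```python
import re


def automated_readability_index(text: str) -> float:
    """ARI = 4.71 * (chars / words) + 0.5 * (words / sentences) - 21.43."""
    # Words are runs of letters or digits (a simplification for this sketch).
    words = re.findall(r"[A-Za-z0-9]+", text)
    # Naive sentence split on '.', '!' or '?'; real tokenizers are more careful.
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    if not words or not sentences:
        raise ValueError("text must contain at least one word and one sentence")
    chars = sum(len(w) for w in words)
    return 4.71 * chars / len(words) + 0.5 * len(words) / len(sentences) - 21.43


# Hypothetical text cell: 6 words, 18 letters, 2 sentences.
score = automated_readability_index("The cat sat. The dog ran.")
```
]]></preformat>
      <p>On this very simple text the index is negative (about -5.8), whereas longer words and longer sentences push it up.</p>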
      <p>In the following we restrict ourselves to notebooks and messages over a
given dataset. Implementation details about message and
property extraction are given in Section 5.</p>
    </sec>
    <sec id="sec-9">
      <title>EXTRACTING NARRATIONS FROM NOTEBOOKS</title>
      <p>In this section, we describe how we process notebook messages
to extract narrations.</p>
    </sec>
    <sec id="sec-10">
      <title>Problem definition</title>
      <p>Let $M$ be the set of messages over a dataset $D$. We are
interested in producing a sequence of $k$ messages from $M$ such that
their total interestingness is maximal, and the overall cognitive
distance between them is minimal.</p>
      <p>
        This problem is defined formally in [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] as follows:
      </p>
      <p>Definition 4.1 (Traveling Analyst Problem (TAP)). Let $Q$ be a set
of $n$ queries, each associated with a positive time cost $\mathit{cost}(q_i)$
and a positive interestingness score $\mathit{int}(q_i)$. Each pair of
queries is associated with a metric $\mathit{dist}(q_i, q_j)$ representing the
cognitive distance of browsing from one query result to the next.
Given a time budget $t$, the optimization problem consists in
finding a sequence $\langle q_1, \ldots, q_\ell \rangle$ of queries, $q_i \in Q$, without repetition,
with $\ell \le n$, such that:
(1) $\max \sum_{i=1}^{\ell} \mathit{int}(q_i)$
(2) $\sum_{i=1}^{\ell} \mathit{cost}(q_i) \le t$
(3) $\min \sum_{i=1}^{\ell-1} \mathit{dist}(q_i, q_{i+1})$.</p>
      <p>
        Lemma 4.2 (Complexity of TAP [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]). TAP is strongly NP-hard.
      </p>
      <p>It can easily be seen that our problem is an instance of TAP,
where queries are notebook messages and all their costs are the
same, so that the budget amounts to a maximum number of messages.
We next define the interestingness and distance functions.</p>
    </sec>
    <sec id="sec-11">
      <title>Characterizing interestingness</title>
      <p>To characterize the interestingness of messages, instead of
proposing our own definition, we choose to learn a model of it, using
the features in Table 1. Our strategy is to first compute a score for
messages based on the dimensions Notebook popularity, Notebook
structure and Message in notebook, and then to learn this score
using the features specific to messages, i.e., those in the dimensions
Code cell and Text cell.</p>
      <p>
        We choose to focus on regression models as they give good
results on similar problems [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ]. We use automated machine learning (AutoML)
[
        <xref ref-type="bibr" rid="ref7">7</xref>
        ] to learn the model, since we aim to achieve good accuracy
by testing a large spectrum of models and
hyperparameters.
      </p>
    </sec>
    <sec id="sec-12">
      <title>Ensuring relevance</title>
      <p>Note that the interestingness of messages is learned independently of
any user requirement. In order to build a coherent data narrative
in accordance with user interests, we introduce the notion of
user profile and we propose to pre-filter the set of messages,
keeping those that are relevant to such a profile.</p>
      <p>We model a user profile as a set of keywords representing
the user’s interests. The relevance of a message for a user profile is
computed based on the similarity of the text contained in the
profile and in the text cell of the message. We use an off-the-shelf
cosine similarity between the TF-IDF vectors of the user profile
and the message. We use as document corpus the overall set of
users’ profiles $\mathcal{U}$ and text cells of all messages in $\mathcal{M}$.</p>
      <p>Formally, let $m \in \mathcal{M}$ be a message, and $u \in \mathcal{U}$ be a user profile,
with $t$ being the text cell of $m$. Let $v_1$ and $v_2$ be respectively the
TF-IDF vectors of $t$ and $u$. The similarity between $m$ and $u$ is
computed as:
$\mathit{sim}(m, u) = \cos(v_1, v_2)$ (1)
The distance between two messages is also computed based on
the similarity of the text contained in the text cells. We use the
same TF-IDF vectors for messages, computed for characterizing
relevance.</p>
      <p>Formally, let $m_1, m_2 \in \mathcal{M}$ be two messages, with $t_1$ and $t_2$
being respectively their text cells. Let $v_1$ and $v_2$ be respectively
the TF-IDF vectors of $t_1$ and $t_2$. The distance between the two
messages is computed as:
$\mathit{dist}(m_1, m_2) = 1 - \cos(v_1, v_2)$ (2)</p>
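      <p>The similarity and distance computations of this section can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions, not the prototype's code: the whitespace tokenization, the smoothed IDF variant, and the names tfidf_vectors, cosine, sim and dist are choices made for this example (the prototype uses a library vectorizer, see Section 5, whose exact weighting may differ).</p>
      <preformat><![CDATA[
```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return one sparse TF-IDF vector (dict: term -> weight) per document."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term occurs.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    # Smoothed inverse document frequency (one common variant, assumed here).
    idf = {t: math.log((1 + n_docs) / (1 + df)) + 1 for t, df in doc_freq.items()}
    return [{t: tf * idf[t] for t, tf in Counter(tokens).items()}
            for tokens in tokenized]


def cosine(v1, v2):
    """Cosine of the angle between two sparse vectors (0.0 if one is empty)."""
    dot = sum(w * v2.get(t, 0.0) for t, w in v1.items())
    norm1 = math.sqrt(sum(w * w for w in v1.values()))
    norm2 = math.sqrt(sum(w * w for w in v2.values()))
    return dot / (norm1 * norm2) if norm1 and norm2 else 0.0


def sim(message_vec, profile_vec):
    """Relevance of a message for a profile, as in Equation (1)."""
    return cosine(message_vec, profile_vec)


def dist(vec_a, vec_b):
    """Distance between two messages, as in Equation (2)."""
    return 1.0 - cosine(vec_a, vec_b)


# Hypothetical text cells and profile, indexed together as one corpus.
text_cells = ["sales grow every year", "missing values in the age column"]
profile = "sales revenue year"
m1, m2, u = tfidf_vectors(text_cells + [profile])
```
]]></preformat>
      <p>In this toy corpus the first text cell shares two terms with the profile and is therefore more relevant than the second, which shares none; the two text cells have no term in common, so their distance is maximal.</p>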
    </sec>
    <sec id="sec-13">
      <title>Main algorithms</title>
      <p>This subsection presents two algorithms that implement the
approach. Algorithm 1 describes the extraction of messages, and the
computation of their interestingness and distance. This algorithm
can be executed offline. Algorithm 2 describes the generation of
a data narrative for a specific user profile. It pre-selects relevant
messages, calls the TAP solver for selecting and structuring messages
and finally writes the narrative.</p>
      <p>Algorithm 1 Message extraction and computations
Require: a set of notebooks $\mathit{NB}$, a set of user profiles $U$
Ensure: a set of messages $M$, an interestingness vector
$\mathit{int}(M)$, a distance matrix $\mathit{dist}(M)$
1: Let $M$ = extractMessages($\mathit{NB}$)
2: Let $\mathit{int}(M)$ = learnInterestingness($M$, $\mathit{NB}$)
3: index($M \cup U$)
4: Let $\mathit{dist}(M)$ = computeDistance($M \cup U$)
5: return $M$, $\mathit{int}(M)$, $\mathit{dist}(M)$</p>
      <p>Algorithm 2 User tailored narrative from notebook messages
Require: a set of messages $M$, a set of notebooks $\mathit{NB}$, a user
profile $u$, a similarity threshold $\tau$, a number of expected
messages $k$
Ensure: a data narrative for the user
1: Let $S = \emptyset$
2: for $m \in M$ do
3: if $\mathit{sim}(m, u)$ &gt; $\tau$ then
4: $S = S \cup \{m\}$
5: end if
6: end for
7: Let $T$ = TAP($S$, $\mathit{int}(M)$, $\mathit{dist}(M)$, $k$)
8: return narrate($T$, $\mathit{NB}$)</p>
      <p>The extractMessages function extracts messages from a set
of notebooks. Its implementation is described in Section 5. The
learnInterestingness function computes message interestingness
as described in Subsection 4.2. The index function indexes the
corpus of messages and profiles, computing TF-IDF vectors, as
described in Subsection 4.3. Such vectors are used for computing
the distance among messages (computeDistance function) and the
similarity between a message and a profile (the sim function,
which is $1 - \mathit{dist}$). The TAP function solves the
optimization problem described in Subsection 4.1. Finally, the narrate
function generates the narration by writing messages in the
order indicated by the TAP, reusing the original visualizations of
messages.</p>
    </sec>
    <sec id="sec-14">
      <title>IMPLEMENTATION AND TESTS</title>
      <p>
        Our prototype is implemented in Python, using the libraries Radon
for code metrics and py-readability-metrics for readability
metrics. We used the Kaggle API to access the datasets and notebooks.
To match a code cell with the visualization it produces, we used
the HTML page of the notebook because the Kaggle API does
not provide the visualization. We used Beautiful Soup to parse
the HTML and mapped the visualization to the code cell using
a join on the code text. We used sklearn for the TF-IDF
vectorization. Solving the TAP problem (see Section 4.1) exactly is done
with a mathematical model on CPLEX 20.10 and is implemented
in C++<sup>2</sup>. For large sets of messages (more than 500 messages)
finding exact solutions is intractable. We therefore use a fast and
memory-efficient heuristic inspired by the classic “sort by item efficiency”
heuristic for solving the Knapsack problem [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]. The code of the
approach is available on GitHub<sup>3</sup>.
      </p>
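      <p>To give an idea of what a “sort by item efficiency” heuristic can look like for the TAP, here is a minimal Python sketch. It is an illustration under our own assumptions and not the heuristic of the prototype: the function name greedy_tap, the two-phase design (greedy selection by interestingness-to-cost ratio, then nearest-neighbour ordering to limit cognitive distance) and the toy data are all hypothetical.</p>
      <preformat><![CDATA[
```python
def greedy_tap(items, interest, cost, dist, budget):
    """Greedy TAP sketch: select by efficiency, then order by proximity.

    items    : list of hashable item identifiers
    interest : dict, item -> interestingness score
    cost     : dict, item -> positive cost
    dist     : function (item, item) -> cognitive distance
    budget   : maximum total cost of the returned sequence
    """
    # Phase 1: knapsack-style selection by decreasing interest / cost ratio.
    ranked = sorted(items, key=lambda q: interest[q] / cost[q], reverse=True)
    chosen, spent = [], 0.0
    for q in ranked:
        if spent + cost[q] <= budget:
            chosen.append(q)
            spent += cost[q]
    if not chosen:
        return []
    # Phase 2: nearest-neighbour ordering, starting from the most efficient
    # item and repeatedly appending the closest remaining one (not optimal).
    sequence, remaining = [chosen[0]], chosen[1:]
    while remaining:
        nxt = min(remaining, key=lambda q: dist(sequence[-1], q))
        sequence.append(nxt)
        remaining.remove(nxt)
    return sequence


# Toy instance: unit costs, so the budget is a maximum number of messages.
interest = {"a": 0.9, "b": 0.8, "c": 0.5, "d": 0.1}
cost = {q: 1.0 for q in interest}
position = {"a": 0, "b": 10, "c": 1, "d": 5}  # hypothetical 1-D embedding
narrative = greedy_tap(list(interest), interest, cost,
                       lambda x, y: abs(position[x] - position[y]), budget=3)
```
]]></preformat>
      <p>With unit costs, as in our setting, the first phase simply keeps the most interesting messages that fit the budget; here it keeps a, b and c, and the second phase orders them as a, c, b because c is closer to a than b is.</p>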
      <p>We tested our code on 377 Kaggle notebooks from the first 18
datasets of Kaggle.com having more than 20 notebooks, sorted
by votes. We extracted messages from these notebooks by
considering only the code cells immediately followed by a text cell
(a markdown cell in Kaggle terminology). This resulted in 10166
messages. The correlation matrix of the features of Table 1 is
displayed in Figure 2, computed with Pearson’s correlation
coefficient. The order of features in the figure is the same as the
order in the table. Overall, it can be seen that the features are
correlated when they are in the same feature dimension. In more
detail:
• in dimension notebook popularity, it can be seen that,
unsurprisingly, likes, views and forks are quite correlated,
while expertise is correlated to none of the others;
• number of lines of code and number of lines of text are
only weakly correlated;
• while code metrics are heavily correlated with each other, they are not
correlated to the length of the code either, and the
generation of a visualization is correlated to none of the other
features in this dimension;
• interestingly, the position of messages is quite correlated
to the total number of cells in the notebook and to the total
number of lines of text, while it is less correlated to
the number of lines of code. This reflects the correlations
found in the notebook structure dimension and the fact
that the more messages, the more cells in the notebook.
On the other hand, the position of the message is not
correlated to its own cells’ code length or text length.</p>
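      <p>The extraction rule used above (a code cell immediately followed by a markdown cell) can be sketched in Python over the JSON structure of a Jupyter notebook, in which the "cells" list holds dictionaries with a "cell_type" and a "source". The function names and the choice of the code cell index as the message position are assumptions made for this illustration, not necessarily those of the prototype.</p>
      <preformat><![CDATA[
```python
def _source(cell):
    """Return the cell content as one string (source may be a list of lines)."""
    src = cell.get("source", "")
    return "".join(src) if isinstance(src, list) else src


def extract_messages(notebook):
    """Return (code, text, position) for each code cell directly followed
    by a markdown cell; position is the index of the code cell."""
    cells = notebook.get("cells", [])
    messages = []
    for position, (current, following) in enumerate(zip(cells, cells[1:])):
        if (current.get("cell_type") == "code"
                and following.get("cell_type") == "markdown"):
            messages.append((_source(current), _source(following), position))
    return messages


# Hypothetical notebook content, following the Jupyter JSON layout.
notebook = {"cells": [
    {"cell_type": "markdown", "source": ["# Title"]},
    {"cell_type": "code", "source": ["df.describe()"]},
    {"cell_type": "markdown", "source": ["Ages range from 1 to 80."]},
    {"cell_type": "code", "source": ["df.plot()"]},
    {"cell_type": "code", "source": ["df.head()"]},
]}
messages = extract_messages(notebook)
```
]]></preformat>
      <p>Only the second cell of this toy notebook yields a message: the two trailing code cells are not followed by any text cell and are therefore ignored.</p>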
      <p>To learn interestingness, we use Auto-sklearn<sup>4</sup> with the
principle presented in the previous section. Auto-sklearn produces
an ensemble model, i.e., not a single model but several models
collaborating to achieve the best possible regression. The best
ensemble model we obtained, in terms of $R^2$, is indicated in Table
2. Its $R^2$ score is 0.85 in the training phase and 0.59 in the testing
phase. The target score we use was constructed by multiplying
all the features in the dimensions notebook popularity, notebook
structure and message in notebook.</p>
      <p>We created 20 user profiles by retrieving the owners of the
datasets of Kaggle.com with the most votes and then retrieving all
the datasets owned by these users. The words in the descriptions
of those datasets are used to form the profiles. For the 20 users,
profiles ranged between 5 and 20 words, with an average of 14.5
(stdev is 4.97). The descriptions of those datasets, together with the
text of all text cells identified when extracting messages, formed
the vocabulary from which TF-IDF vectors for users and messages
were computed. For each user, we filtered the set of messages
with their profile, based on the cosine similarity between both
TF-IDF vectors and a threshold of 0. The number of messages
relevant for each profile ranges between 191 (minimum) and
2551 (maximum), with an average of 798.</p>
      <table-wrap>
        <label>Table 2</label>
        <table>
          <thead>
            <tr><th>ensemble weight</th><th>type</th></tr>
          </thead>
          <tbody>
            <tr><td>0.76</td><td>gaussian process</td></tr>
            <tr><td>0.02</td><td>gradient boosting</td></tr>
            <tr><td>0.04</td><td>gradient boosting</td></tr>
            <tr><td>0.08</td><td>k nearest neighbors</td></tr>
            <tr><td>0.10</td><td>gradient boosting</td></tr>
          </tbody>
        </table>
      </table-wrap>
      <p><sup>2</sup> https://github.com/AlexChanson/Cplex-TAP</p>
      <p><sup>3</sup> https://github.com/Blobfish-LIFAT/NotebookCrowdsourcing</p>
      <p><sup>4</sup> https://automl.github.io/auto-sklearn/master/index.html</p>
      <p>We generated one narration for each profile, asking for $k = 10$
messages in it. On average, the generated narrations have 8.3
messages (minimum 2, maximum 10, stdev 1.75). To measure
the degree of personalization of the narration, we use the
Szymkiewicz–Simpson overlap coefficient<sup>5</sup> between the profile and
the text of the messages. On average it is 0.15 (minimum 0.07,
maximum 0.44, stdev 0.12). These low scores are expected since
a threshold of 0 was used to select messages for each profile.
To measure the coherence and diversity of the messages in the
generated narrations, we measured (i) the number of different
notebooks where the messages come from and (ii) the
Szymkiewicz–Simpson overlap coefficient between the different
message texts in the narration. Regarding (i), on average the messages
come from 4.2 notebooks (minimum 1, maximum 9, stdev 0.3). As
to (ii), the overlap is 0.68 on average (minimum 0.59, maximum
0.77, stdev 0.04). The generated narrations, in the form of
Jupyter notebooks, are available on GitHub<sup>6</sup>.</p>
      <p><sup>5</sup> The overlap coefficient is defined as the size of the intersection divided by the
smaller of the sizes of the two sets. It is a form of Jaccard coefficient adapted to sets
with different cardinalities.</p>
      <p><sup>6</sup> https://github.com/Blobfish-LIFAT/NotebookCrowdsourcing/tree/master/output/notebooks</p>
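      <p>The Szymkiewicz–Simpson overlap coefficient used in these measurements can be written in a few lines of Python. The sketch below is only illustrative: the function name, the convention of returning 0 for an empty set, and the toy word sets are assumptions for this example.</p>
      <preformat><![CDATA[
```python
def overlap_coefficient(a, b):
    """Size of the intersection divided by the size of the smaller set."""
    set_a, set_b = set(a), set(b)
    if not set_a or not set_b:
        # Convention assumed here: no overlap when one of the sets is empty.
        return 0.0
    return len(set_a & set_b) / min(len(set_a), len(set_b))


# Hypothetical profile keywords and message words.
profile_words = {"sales", "revenue", "year"}
message_words = "sales grow every year".split()
personalization = overlap_coefficient(profile_words, message_words)
```
]]></preformat>
      <p>Here two of the three profile keywords occur in the message, so the coefficient is 2/3; unlike the Jaccard coefficient, it is not penalized by the extra words of the larger set.</p>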
    </sec>
    <sec id="sec-15">
      <title>CONCLUSION</title>
      <p>This short paper introduces a novel approach for generating
personalized data narratives from EDA notebooks. The approach
consists of extracting messages from existing notebooks,
learning their interestingness, filtering this set of messages for some
user profile and generating a coherent sequence of messages
adapted to this profile. We detailed the implementation of our
proof of concept, and presented a preliminary experiment with
Kaggle.com notebooks.</p>
      <p>
        We are currently working on improving our approach by
providing more robust message detection, better accounting for the
visualizations related to the message, and generating narratives that
are more coherent, less redundant and more personalized. We
will evaluate the approach with user tests, comparing it with
competitor approaches to generate notebooks [
        <xref ref-type="bibr" rid="ref16 ref6">6, 16</xref>
        ] and assessing
its scalability.
      </p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>Sheelagh</given-names>
            <surname>Carpendale</surname>
          </string-name>
          , Nicholas Diakopoulos, Nathalie Henry Riche, and
          <string-name>
            <given-names>Christophe</given-names>
            <surname>Hurter</surname>
          </string-name>
          .
          <article-title>Data-driven storytelling (Dagstuhl seminar 16061)</article-title>
          .
          <source>Dagstuhl Reports</source>
          ,
          <year>2016</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>Alexandre</given-names>
            <surname>Chanson</surname>
          </string-name>
          , Ben Crulis, Nicolas Labroche, Patrick Marcel, Verónika Peralta, Stefano Rizzi, and
          <string-name>
            <given-names>Panos</given-names>
            <surname>Vassiliadis</surname>
          </string-name>
          .
          <article-title>The traveling analyst problem: Definition and preliminary study</article-title>
          .
          In
          <source>DOLAP@EDBT/ICDT</source>
          ,
          <year>2020</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>S.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Andrienko</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Andrienko</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P. H.</given-names>
            <surname>Nguyen</surname>
          </string-name>
          , and
          <string-name>
            <given-names>C.</given-names>
            <surname>Turkay</surname>
          </string-name>
          .
          <article-title>Supporting story synthesis: Bridging the gap between visual analytics and storytelling</article-title>
          .
          <source>TVCG</source>
          ,
          <year>2018</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>George B.</given-names>
            <surname>Dantzig</surname>
          </string-name>
          .
          <article-title>Discrete-variable extremum problems</article-title>
          .
          <source>Operations Research</source>
          ,
          <volume>5</volume>
          (
          <issue>2</issue>
          ):
          <fpage>266</fpage>
          -
          <lpage>288</lpage>
          ,
          <year>1957</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>Daniel</given-names>
            <surname>Deutch</surname>
          </string-name>
          , Amir Gilad, Tova Milo, and
          <string-name>
            <given-names>Amit</given-names>
            <surname>Somech</surname>
          </string-name>
          .
          <article-title>ExplainED: Explanations for EDA notebooks</article-title>
          .
          <source>Proc. VLDB Endow.</source>
          ,
          <volume>13</volume>
          (
          <issue>12</issue>
          ):
          <fpage>2917</fpage>
          -
          <lpage>2920</lpage>
          ,
          <year>2020</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>Ori</given-names>
            <surname>Bar El</surname>
          </string-name>
          , Tova Milo, and
          <string-name>
            <given-names>Amit</given-names>
            <surname>Somech</surname>
          </string-name>
          .
          <article-title>Automatically generating data exploration sessions using deep reinforcement learning</article-title>
          .
          <source>In SIGMOD</source>
          ,
          <year>2020</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>Matthias</given-names>
            <surname>Feurer</surname>
          </string-name>
          , Aaron Klein, Katharina Eggensperger, Jost Tobias Springenberg, Manuel Blum, and
          <string-name>
            <given-names>Frank</given-names>
            <surname>Hutter</surname>
          </string-name>
          .
          <article-title>Efficient and robust automated machine learning</article-title>
          .
          <source>In Advances in Neural Information Processing Systems</source>
          , Canada,
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>Dimitrios</given-names>
            <surname>Gkesoulis</surname>
          </string-name>
          , Panos Vassiliadis, and
          <string-name>
            <given-names>Petros</given-names>
            <surname>Manousis</surname>
          </string-name>
          .
          <article-title>Cinecubes: Aiding data workers gain insights from OLAP queries</article-title>
          .
          <source>Inf. Syst.</source>
          ,
          <volume>53</volume>
          :
          <fpage>60</fpage>
          -
          <lpage>86</lpage>
          ,
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>Maurice H.</given-names>
            <surname>Halstead</surname>
          </string-name>
          .
          <source>Elements of Software Science (Operating and Programming Systems Series)</source>
          .
          <publisher-name>Elsevier Science Inc.</publisher-name>
          ,
          <publisher-loc>USA</publisher-loc>
          ,
          <year>1977</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>Stratos</given-names>
            <surname>Idreos</surname>
          </string-name>
          , Olga Papaemmanouil, and
          <string-name>
            <given-names>Surajit</given-names>
            <surname>Chaudhuri</surname>
          </string-name>
          .
          <article-title>Overview of data exploration techniques</article-title>
          .
          <source>In SIGMOD</source>
          ,
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>Robert</given-names>
            <surname>Kosara</surname>
          </string-name>
          and
          <string-name>
            <given-names>Jock</given-names>
            <surname>Mackinlay</surname>
          </string-name>
          .
          <article-title>Storytelling: The next step for visualization</article-title>
          .
          <source>IEEE Computer</source>
          ,
          <volume>46</volume>
          ,
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>John</given-names>
            <surname>McAuley</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Rohan</given-names>
            <surname>Goel</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Tamara</given-names>
            <surname>Matthews</surname>
          </string-name>
          .
          <article-title>Explorobot: Rapid exploration with chart automation</article-title>
          .
          <source>In VISIGRAPP</source>
          ,
          <year>2019</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>Thomas J.</given-names>
            <surname>McCabe</surname>
          </string-name>
          .
          <article-title>A complexity measure</article-title>
          .
          <source>IEEE Trans. Software Eng.</source>
          ,
          <volume>2</volume>
          (
          <issue>4</issue>
          ):
          <fpage>308</fpage>
          -
          <lpage>320</lpage>
          ,
          <year>1976</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>Martina</given-names>
            <surname>Megasari</surname>
          </string-name>
          , Pandu Wicaksono, Chiao Yun Li,
          <string-name>
            <given-names>Clément</given-names>
            <surname>Chaussade</surname>
          </string-name>
          , Shibo Cheng, Nicolas Labroche, Patrick Marcel, and
          <string-name>
            <given-names>Verónika</given-names>
            <surname>Peralta</surname>
          </string-name>
          .
          <article-title>Can models learned from a dataset reflect acquisition of procedural knowledge? An experiment with automatic measurement of online review quality</article-title>
          . In
          <string-name>
            <given-names>Il-Yeol</given-names>
            <surname>Song</surname>
          </string-name>
          , Alberto Abelló, and Robert Wrembel, editors,
          <source>Proceedings of DOLAP</source>
          , volume
          <volume>2062</volume>
          of
          <series>CEUR Workshop Proceedings</series>
          .
          <publisher-name>CEUR-WS.org</publisher-name>
          ,
          <year>2018</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>Faten</given-names>
            <surname>El Outa</surname>
          </string-name>
          , Matteo Francia, Patrick Marcel, Verónika Peralta, and
          <string-name>
            <given-names>Panos</given-names>
            <surname>Vassiliadis</surname>
          </string-name>
          .
          <article-title>Towards a conceptual model for data narratives</article-title>
          .
          <source>In ER</source>
          ,
          <year>2020</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <given-names>Aurélien</given-names>
            <surname>Personnaz</surname>
          </string-name>
          , Sihem Amer-Yahia, Laure Berti-Équille, Maximilian Fabricius, and
          <string-name>
            <given-names>Srividya</given-names>
            <surname>Subramanian</surname>
          </string-name>
          .
          <article-title>DORA THE EXPLORER: exploring very large data with interactive deep reinforcement learning</article-title>
          .
          <source>In CIKM</source>
          ,
          <year>2021</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17]
          <string-name>
            <given-names>Adam</given-names>
            <surname>Rule</surname>
          </string-name>
          , Aurélien Tabard, and
          <string-name>
            <given-names>James D.</given-names>
            <surname>Hollan</surname>
          </string-name>
          .
          <article-title>Exploration and explanation in computational notebooks</article-title>
          .
          <source>In CHI</source>
          ,
          <year>2018</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [18]
          <string-name>
            <given-names>D.</given-names>
            <surname>Shi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Sun</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Xu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Xingyu</given-names>
            <surname>Lan</surname>
          </string-name>
          , David Gotz, and
          <string-name>
            <given-names>Nan</given-names>
            <surname>Cao</surname>
          </string-name>
          .
          <article-title>Autoclips: An automatic approach to video generation from data facts</article-title>
          .
          <source>Comput. Graph. Forum</source>
          ,
          <volume>40</volume>
          (
          <issue>3</issue>
          ):
          <fpage>495</fpage>
          -
          <lpage>505</lpage>
          ,
          <year>2021</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          [19]
          <string-name>
            <given-names>Danqing</given-names>
            <surname>Shi</surname>
          </string-name>
          , Xinyue Xu,
          <string-name>
            <given-names>Fuling</given-names>
            <surname>Sun</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Yang</given-names>
            <surname>Shi</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Nan</given-names>
            <surname>Cao</surname>
          </string-name>
          .
          <article-title>Calliope: Automatic visual data story generation from a spreadsheet</article-title>
          .
          <source>IEEE Trans. Vis. Comput. Graph.</source>
          ,
          <volume>27</volume>
          (
          <issue>2</issue>
          ):
          <fpage>453</fpage>
          -
          <lpage>463</lpage>
          ,
          <year>2021</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          [20]
          <string-name>
            <given-names>Bo</given-names>
            <surname>Tang</surname>
          </string-name>
          , Shi Han, Man Lung Yiu, Rui Ding, and Dongmei Zhang.
          <article-title>Extracting top-k insights from multi-dimensional data</article-title>
          .
          <source>In SIGMOD</source>
          ,
          <year>2017</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          [21]
          <string-name>
            <given-names>Jiawei</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Li</given-names>
            <surname>Li</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Andreas</given-names>
            <surname>Zeller</surname>
          </string-name>
          .
          <article-title>Better code, better sharing: on the need of analyzing Jupyter notebooks</article-title>
          .
          <source>In ICSE-NIER</source>
          ,
          <year>2020</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          [22]
          <string-name>
            <given-names>Yun</given-names>
            <surname>Wang</surname>
          </string-name>
          , Zhida Sun, Haidong Zhang, Weiwei Cui, Ke Xu, Xiaojuan Ma, and Dongmei Zhang.
          <article-title>Datashot: Automatic generation of fact sheets from tabular data</article-title>
          .
          <source>IEEE Trans. Vis. Comput. Graph.</source>
          ,
          <year>2020</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>