Generating personalized data narrations from EDA notebooks

Alexandre Chanson, Faten El Outa, Nicolas Labroche, Patrick Marcel, Verónika Peralta, Willeme Verdeaux
University of Tours, Blois, France
firstName.lastName@univ-tours.fr

Lucile Jacquemart
University of Tours, Blois, France
Lucile.Jacquemart@etu.univ-tours.fr

© Copyright 2022 for this paper by its author(s). Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0)

ABSTRACT

In this short paper, we present our preliminary results for generating personalized data narrations by extracting messages from a collection of Exploratory Data Analysis (EDA) notebooks over a given dataset. The approach consists of extracting features from notebooks to learn what interesting messages they expose. Based on those interesting messages, we formalize the problem of producing a user-tailored data narration, i.e., a coherent sequence of messages matching a given user profile. We developed a proof of concept and experimented with Kaggle.com notebooks.

Figure 1: Overview of the approach

1 INTRODUCTION

Exploratory Data Analysis (EDA) is the notoriously tedious task of interactively analyzing datasets to gain insights [10]. EDA notebooks are shared, curated, illustrative EDA sessions prepared by data scientists [6, 17]. They are essentially sequences of programmatic operations and their commented results, shared on code sharing platforms such as Kaggle (https://www.kaggle.com). Supporting EDA can be done by pre-analyzing datasets to compute insights [20] or by automatically generating EDA notebooks using deep learning [6].

Data narration is the activity of producing narratives supported by facts extracted from data exploration and analysis, using interactive visualizations [1]. In an effort to clarify the concepts of data narratives, we recently defined a data narrative as a structured composition of messages that (a) convey findings over the data, and (b) are typically delivered via visual means in order to facilitate their reception by an intended audience, and we proposed a conceptual model describing and structuring the key concepts around data narratives [15]. While several works informally describe the process of data narration crafting [3, 11], automated data narration is only starting to gain attention [8, 19].

Our present work contributes to the field of automated data narration, and aims at connecting EDA notebooks to data narrations. More precisely, our objective is to construct data narrations from EDA notebooks. This requires to (i) identify messages that convey findings in the data, (ii) ensure they are relevant for a given user profile, (iii) arrange them in a coherent composition, and (iv) present them visually.

Problem (i) is addressed by formally defining a message as a component of an EDA notebook, extracting messages, and learning a model of message interestingness. Problem (ii) is addressed by representing messages and user profiles in a vector space, using a classical TF-IDF representation, and using cosine similarity to select the messages closest to the user profile. Problem (iii) is formalized as an instance of the Traveling Analyst Problem (TAP) [2]. Finally, visual presentation (iv) is ensured by reusing existing visualizations extracted from notebooks.

The general pipeline of our approach is shown in Figure 1. Three offline computation modules deal, respectively, with message extraction, computation of message interestingness, and computation of the cognitive distance between messages. These computations are useful for ensuring that messages in the data narrative are interesting and are structured in a cognitively coherent way. Then, online, for a given user need, the message selection module preselects messages that match the user's profile. The user also specifies a budget, representing the maximum number of messages to be included in the data narrative. The TAP module takes as input the preselected messages, their interestingness and distance scores, and the budget, and produces an ordered list of messages (taken among the preselected ones) that maximizes the overall interestingness and minimizes the cognitive distance, while satisfying the given budget. Finally, the narration module generates the data narrative.

Our contributions include:
• a formal framework,
• learning the interestingness of messages,
• an algorithm to generate a user-tailored data narrative from a set of notebooks,
• a proof of concept with Kaggle notebooks, producing various data narratives for a given dataset.

The paper is organized as follows. The next section reviews related work. Section 3 provides the formal background and describes the features we consider to learn message interestingness. Section 4 formalizes the problem and presents our solution. Section 5 discusses the implementation and tests. Finally, Section 6 concludes and draws perspectives.

2 RELATED WORK

In this section we review related work pertaining to the generation of data narratives or the automation of part of the process.

2.1 Automating data narration

Firstly, several works propose solutions for automating data narration starting from a user query [8], a spreadsheet [19], or a topic [18].

The precursor work of Gkesoulis et al. [8] introduced CineCubes, a system that allows the automatic generation of a data story over an OLAP database, with a simple user query as starting point. Each data story has three acts: the first provides contextualization for the characters as well as the incident that sets the story on the move, the second is where the protagonists and the rest of the roles build up their actions and reactions, and the third is where the resolution takes place. The first act refers to the execution of the original query provided by the user. The second act exploits the selection conditions of the original query and automatically generates comparative drill-up queries to provide contextualization. Finally, the third act drills down in the grouping levels of the original result to see the breakdown of its (aggregate) measures and understand its internal structure, providing further analysis of the results. Their tests revealed the ability of CineCubes to quickly generate a report of good quality. However, its fixed structure in three acts can only produce simple data stories with limited insights and visualizations.

Shi et al. [19] proposed Calliope, a system that automatically generates visual data stories from an input spreadsheet. The system incorporates a new logic-oriented Monte Carlo tree search algorithm that explores the data space given by the input spreadsheet to progressively generate story pieces (i.e., data facts) and organize them in a logical sequence. The importance of data facts is measured based on information theory. Each data fact is visualized in a chart and captioned by an automatically generated description. A user study highlighted that the logical order is consistent to humans, the generated data stories express useful data insights, and the visualization modes are satisfactory. Nevertheless, Calliope cannot understand data semantics to better generate the story contents and logic. Also, the generated captions are too rigid and contain grammar errors, and the generated visual encodings are notably simple.

2.2 Automatic data exploration

Some works [6, 12] propose solutions for automating data exploration, the first step of data narration.

McAuley et al. [12] propose ExploroBOT, a novel system developed to support rapid exploration using a combination of automatic chart generation and intuitive navigation supported by a novel visual guidance framework. The criteria to quantify the interestingness of a chart are: (i) data correlation: highly correlated data in scatter plots and trend charts hints towards an interesting relationship between the two variables; (ii) peaks: spikes and large differences in a numerical attribute instantly attract attention; (iii) outliers: a chart with more outliers is deemed more interesting.

Bar El et al. [6] proposed ATENA, a system that takes an input dataset and auto-generates a compelling exploratory session, presented in an EDA notebook. They shaped EDA into a control problem, and devised a novel Deep Reinforcement Learning architecture to effectively optimize the notebook generation.

Personnaz et al. [16] introduce DORA the explorer, which provides guidance to data explorers relying on Deep Reinforcement Learning that combines intrinsic (curiosity) and extrinsic (familiarity) rewards.

Finally, Deutch et al. [5] deal with the generation of explanations for highlighting exploration results. They proposed ExplainED, a system for automatically explaining views in EDA notebooks. The explanations are presented in natural language and describe the particular elements of the view that are the most interesting (the ones having the highest Shapley values).

To the best of our knowledge, our work is the first aiming at automating the production of personalized data narratives by leveraging existing EDA notebooks. One prominent aspect of our approach is to qualify the interestingness of messages contained in existing notebooks. This is important as messages are the cornerstone of data narratives [15] and since the quality of notebooks is known to be very diverse [21].
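As a minimal sketch of the TF-IDF-based relevance filtering mentioned in the introduction (and detailed later in Section 4.3), messages and user profiles can share one TF-IDF space, with a message kept when its cosine similarity to the profile is above a threshold. The profile, message texts and threshold below are illustrative, not taken from the paper:

```python
# Sketch of profile/message relevance filtering with TF-IDF + cosine
# similarity (illustrative texts; the paper's corpus is all profiles
# plus the text cells of all messages).
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

profile = "airline delay weather"
messages = [
    "Flight delays correlate with bad weather at the departure airport.",
    "The dataset contains 10,000 rows describing customer churn.",
]

# Fit one vectorizer on profiles and message texts together.
vectorizer = TfidfVectorizer()
vectors = vectorizer.fit_transform([profile] + messages)

# Similarity of the profile (row 0) to every message (rows 1..n).
sims = cosine_similarity(vectors[0], vectors[1:])[0]
relevant = [m for m, s in zip(messages, sims) if s > 0]
```

Only the first message shares vocabulary with the profile, so it is the only one retained with a threshold of 0.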
Shi et al. [18] proposed AutoClips, an automatic approach to generate data videos from a given topic. It is based on four phases: (i) collecting a series of data facts around a certain topic, (ii) constructing a storyline as an assembly of these data facts into a sequence, (iii) choosing data visualizations for the data facts and deciding how to animate them by drawing a storyboard, and finally (iv) realizing the storyboard via a design software in which the narrator edits and combines the animated visualizations until a coherent data video is accomplished. Their evaluation revealed that AutoClips can generate comprehensible and engaging data videos of comparable quality with human-made videos. However, the system only supports tabular data and favors datasets with diverse column types.

Wang et al. [22] conducted a qualitative analysis on 245 infographics, studying the design space in terms of structures, sheet layouts, fact types, and visualization styles. Based on those, the authors propose a system for the auto-generation of fact sheets. It consists of three phases: (i) fact extraction, (ii) fact composition, and (iii) presentation synthesis. Their validation of the system highlighted the efficiency of data exploration and the ease of understanding of the visualizations. As limitations, we point out that data semantics is not considered during exploration, and that visualizations are taken from a small-sized predefined library.

3 FORMAL BACKGROUND

This section introduces the representation of EDA notebooks and messages and presents the set of properties used for learning interestingness.

3.1 Preliminary definitions

EDA notebooks are essentially sequences of programmatic operations and their commented results. They are linearly structured as a sequence of cells of two types: text and code. Text cells contain explanatory text, typically including titles, definitions, explanations and comments. Code cells contain a sequence of commands and their output, typically including numeric results and graphics.

We consider that a code cell together with a text cell delivers a commented result on a logical part of the notebook. We will call it a message in what follows. We represent a message as a pair of code and text cells, together with a set of numerical properties describing their contents (e.g., the number of words in the text cell, or the complexity of the code). The whole set of properties is described in the next subsection.

Definition 3.1 (Message). Let T be an infinite set of text cells and C an infinite set of code cells. A message is a tuple m = ⟨c, t, p_1^m, ..., p_o^m⟩ where c ∈ C is a code cell, t ∈ T is a text cell, and p_i^m, 1 ≤ i ≤ o, are properties of c or t.

Finally, we represent a notebook as a sequence of messages and a set of notebook properties (e.g., number of users' likes).

Definition 3.2 (Notebook). Let M be an infinite set of messages. A notebook is a tuple n = ⟨m_1, ..., m_v, p_1^n, ..., p_w^n⟩ where m_i ∈ M, 1 ≤ i ≤ v, are messages, and p_j^n, 1 ≤ j ≤ w, are properties of n.

3.2 Properties

The properties of cells, messages and notebooks correspond to features extracted from notebooks, detailed in Table 1. We consider the following feature dimensions:

• notebook popularity: these features indicate the global popularity of the notebooks among Kaggle users. They are the main drivers to compute messages' interestingness as they express the opinion of the community of users.
• notebook structure: these features describe the size of the notebook in terms of cells and lines of code and comments.
• code cell: these features characterize code cells in terms of their complexity and the presence of a visualization. In addition to the number of lines of code, two classical software engineering metrics are used: cyclomatic complexity [13] and the Halstead metric [9].
• text cell: these features characterize the content of text cells, especially in terms of readability, i.e., indexes related to the level of studies a person needs to understand the text at first reading, computed from the number of words, sentences, syllables or characters.
• message characteristics: this feature indicates where the message is located in the notebook. Often the first messages of a notebook are simple data profiling, while messages at the end tend to be more elaborate.

In the following we restrict to notebooks and messages over a given dataset. Implementation details about message and property extraction are given in Section 5.

Dimension           | Name                        | #
--------------------|-----------------------------|---
Notebook popularity | Number of likes             | 0
                    | Number of views             | 1
                    | Number of forks             | 2
                    | Author's expertise          | 3
Notebook structure  | Number of cells             | 4
                    | Number of lines of code     | 5
                    | Number of lines of text     | 6
Code cell           | Number of characters        | 7
                    | Halstead score              | 8
                    | Cyclomatic complexity       | 9
                    | Generates a visualization   | 10
Text cell           | Number of characters        | 11
                    | Number of words             | 12
                    | Flesch reading ease index   | 13
                    | Gunning-Fog index           | 14
                    | Automated Readability Index | 15
                    | Coleman-Liau index          | 16
Message in notebook | Position in notebook        | 17

Table 1: Features considered

4 EXTRACTING NARRATIONS FROM NOTEBOOKS

In this section, we describe how we process notebook messages to extract narrations.

4.1 Problem definition

Let M_D be the set of messages over a dataset D. We are interested in producing a sequence of ε_t messages from M_D such that their total interestingness is maximal and the overall cognitive distance between them is minimal. This problem is defined formally in [2] as follows:

Definition 4.1 (Traveling Analyst Problem (TAP)). Let Q be a set of N queries, each associated with a positive time cost cost(q_i) and a positive interestingness score interest(q_i). Each pair of queries is associated with a metric dist(q_i, q_j) representing the cognitive distance of browsing from one query result to the next. Given a time budget ε_t, the optimization problem consists in finding a sequence ⟨q_1, ..., q_M⟩ of queries, q_i ∈ Q, without repetition, with M ≤ N, such that:
(1) max Σ_{i=1..M} interest(q_i)
(2) Σ_{i=1..M} cost(q_i) ≤ ε_t
(3) min Σ_{i=1..M−1} dist(q_i, q_{i+1})

Lemma 4.2 (Complexity of TAP [2]). TAP is strongly NP-hard.

It can easily be seen that our problem is an instance of TAP, where queries are notebook messages and all their costs are the same. We next define the interestingness and distance functions.

4.2 Characterizing interestingness

To characterize the interestingness of messages, instead of proposing our own definition, we choose to learn a model of it, using the features in Table 1. Our strategy is to compute a score for messages based on the dimensions Notebook popularity, Notebook structure and Message in notebook, and then to learn this score using the features specific to messages, i.e., those in the dimensions Code cell and Text cell. We choose to focus on regression models as they give good results on similar problems [14]. We use auto-machine learning [7] to learn the model, since we aim to achieve good accuracy by testing a large spectrum of models and hyperparameters.

4.3 Ensuring relevance

Note that the interestingness of messages is learned independently of any user requirement. In order to build a coherent data narrative in accordance with user interests, we introduce the notion of user profile and we propose to pre-filter the set of messages that are relevant to such a profile.

We model a user profile as a set of keywords representing the user's interests. The relevance of a message for a user profile is computed based on the similarity of the text contained in the profile and in the text cell of the message. We use an off-the-shelf cosine similarity between the TF-IDF vectors of the user profile and the message. We use as document corpus the overall set of users' profiles U and the text cells of all messages in M.

Formally, let m ∈ M be a message and u ∈ U be a user profile, with t being the text cell of m. Let V_1 and V_2 be respectively the TF-IDF vectors of u and t. The similarity between u and m is computed as:

sim(u, m) = cosine(V_1, V_2)    (1)

4.4 Characterizing distance

The distance between two messages is also computed based on the similarity of the text contained in their text cells. We use the same TF-IDF vectors for messages as computed for characterizing relevance.

Formally, let m_1, m_2 ∈ M be two messages, with t_1 and t_2 being respectively their text cells. Let V_1 and V_2 be respectively the TF-IDF vectors of t_1 and t_2. The distance between the two messages is computed as:

dist(m_1, m_2) = 1 − cosine(V_1, V_2)    (2)

4.5 Main algorithms

This subsection presents two algorithms that implement the approach.

5 IMPLEMENTATION AND TESTS

Our prototype is implemented in Python, using the libraries Radon for code metrics and py-readability-metrics for readability metrics. We used the Kaggle API to access the datasets and notebooks. To match a code cell with the visualization it produces, we used the HTML page of the notebook, because the Kaggle API does not provide the visualization. We used Beautiful Soup to parse the HTML and mapped the visualization to the code cell using a join on the code text. We used sklearn for the TF-IDF vectorization. Solving the TAP problem (see Section 4.1) exactly is done with a mathematical model on CPLEX 20.10 and is implemented in C++ (https://github.com/AlexChanson/Cplex-TAP). For large sets of messages (more than 500 messages) finding exact solutions is intractable. We use a fast and memory-efficient heuristic inspired by the classic "sort by item efficiency" heuristic for solving the knapsack problem [4].
Algorithm 1 describes the extraction of messages and the computation of their interestingness and distance. This algorithm can be executed offline. Algorithm 2 describes the generation of a data narrative for a specific user profile. It pre-selects relevant messages, calls the TAP for selecting and structuring messages, and finally writes the narrative.

Algorithm 1 Message extraction and computations
Require: a set of notebooks N_D, a set of user profiles U
Ensure: a set of messages M_D, an interestingness vector interest, a distance matrix distance
1: Let M_D = extractMessages(N_D)
2: Let interest() = learnInterestingness(M_D, N_D)
3: index(M_D ∪ U)
4: Let distance() = computeDistance(M_D ∪ U)
5: return M_D, interest(), distance()

Algorithm 2 User-tailored narrative from notebook messages
Require: a set of messages M_D, a set of notebooks N_D, a user profile u, a similarity threshold ε_s, a number of expected messages ε_t
Ensure: a data narrative for the user
1: Let M = ∅
2: for m ∈ M_D do
3:   if sim(m, u) > ε_s then
4:     M = M ∪ {m}
5:   end if
6: end for
7: Let T = TAP(M, interest, distance, ε_t)
8: return narrate(T, N_D)

The extractMessages function extracts messages from a set of notebooks. Its implementation is described in Section 5.

The code of the approach is available on Github (https://github.com/Blobfish-LIFAT/NotebookCrowdsourcing). We tested our code on 377 Kaggle notebooks from the first 18 datasets of Kaggle.com having more than 20 notebooks, sorted by votes. We extracted messages from these notebooks by considering only the code cells immediately followed by a text cell (a markdown cell in Kaggle terminology). This resulted in 10166 messages. The correlation matrix of the features of Table 1 is displayed in Figure 2, computed with Pearson's correlation coefficient. The order of features in the figure is the same as the order in the table. Globally it can be seen that the features are correlated when they are in the same feature dimension. In more detail:

• in the dimension notebook popularity, it can be seen that, unsurprisingly, likes, views and forks are quite correlated, while expertise is correlated to none of the others;
• the number of lines of code and the number of lines of text are only weakly correlated;
• while code metrics are heavily correlated, they are not correlated to the length of the code, and the generation of a visualization is correlated to none of the other features in this dimension;
• interestingly, the position of messages is quite correlated to the total number of cells in the notebook and to the total number of lines of text, while it is less correlated to the number of lines of code. This reflects the correlations found in the notebook structure dimension and the fact that the more messages, the more cells in the notebook. On the other hand, the position of the message is not correlated to its own cell's code length or text length.

To learn interestingness, we use Auto-sklearn (https://automl.github.io/auto-sklearn/master/index.html) with the principle presented in the previous section. Auto-sklearn produces an ensemble model: not a single model, but several models collaborating to achieve the best possible regression. The best ensemble model we obtained, in terms of R², is indicated in Table 2.
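The learning step can be sketched as follows. The paper uses Auto-sklearn to search models and hyperparameters; to keep this example light and runnable, a single scikit-learn regressor is substituted, and the data is entirely synthetic (real inputs would be the Code cell / Text cell features predicting the popularity-based target score):

```python
# Sketch of the interestingness-learning step (illustrative stand-in:
# one scikit-learn regressor instead of Auto-sklearn's model search).
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 500
# Synthetic message-specific features (standing in for the Code cell
# and Text cell dimensions of Table 1).
X = rng.normal(size=(n, 7))
# Synthetic target standing in for the score built from the notebook
# popularity, notebook structure and message-in-notebook features.
y = X[:, 0] * 2.0 + np.sin(X[:, 1]) + rng.normal(scale=0.1, size=n)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
model = GradientBoostingRegressor(random_state=0).fit(X_tr, y_tr)
r2 = model.score(X_te, y_te)  # R^2 on held-out messages
```

Evaluating R² on a held-out split, as above, is what separates the training-phase and testing-phase scores reported for the learned model.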
The best ensemble model, reported in Table 2, achieves an R² score of 0.85 in the training phase and 0.59 in the testing phase. The target score we use was constructed by multiplying all the features in the dimensions notebook popularity, notebook structure and message in notebook.

The learnInterestingness function computes message interestingness as described in Subsection 4.2. The index function indexes the corpus of messages and profiles, computing TF-IDF vectors, as described in Subsection 4.3. Such vectors are used for computing the distance among messages (the computeDistance function) and the similarity between a message and a profile (the sim function, which is 1 − distance()). The TAP function implements the optimization problem described in Subsection 4.1. Finally, the narrate function generates the narration by writing messages in the order indicated by the TAP, reusing the original visualizations of the messages.

We created 20 user profiles by retrieving the owners of the datasets of Kaggle.com with the most votes and then retrieving all the datasets owned by these users. The words in the descriptions of those datasets are used to form the profiles. For the 20 users, profiles ranged between 5 and 20 words, with an average of 14.5 (stdev is 4.97). The descriptions of those datasets, together with the text of all text cells identified when extracting messages, formed the vocabulary from which TF-IDF vectors for users and messages were computed.

For each user, we filtered the set of messages using their profile, using the cosine similarity between both TF-IDF vectors with a threshold of 0. The number of messages relevant for each profile ranges between 191 (minimum) and 2551 (maximum), with an average of 798.

We generated one narration for each profile, asking for ε_t = 10 messages in it. On average, the generated narrations have 8.3 messages (minimum 2, maximum 10, stdev 1.75). To measure the degree of personalization of the narration, we use the Szymkiewicz–Simpson overlap coefficient (the size of the intersection divided by the smaller of the sizes of the two sets; a form of Jaccard coefficient adapted to sets with different cardinality) between the profile and the text of the messages. On average it is 0.15 (minimum 0.07, maximum 0.44, stdev 0.12). These low scores are expected since a threshold of 0 was used to select messages for each profile.

To measure the coherence and diversity of the messages in the generated narrations, we measured (i) the number of different notebooks the messages come from and (ii) the Szymkiewicz–Simpson overlap coefficient between the different message texts in the narration. Regarding (i), on average the messages come from 4.2 notebooks (minimum 1, maximum 9, stdev 0.3). As to (ii), the overlap is 0.68 on average (minimum 0.59, maximum 0.77, stdev 0.04). The generated narrations, under the form of Jupyter notebooks, are available on Github (https://github.com/Blobfish-LIFAT/NotebookCrowdsourcing/tree/master/output/notebooks).

6 CONCLUSION

This short paper introduces a novel approach for generating personalized data narratives from EDA notebooks. The approach consists of extracting messages from existing notebooks, learning their interestingness, filtering this set of messages for some user profile, and generating a coherent sequence of messages adapted to this profile. We detailed the implementation of our proof of concept, and presented a preliminary experiment with Kaggle.com notebooks.

We are currently working at improving our approach by providing more robust message detection, better accounting for the visualizations related to the messages, and generating narratives that are more coherent, less redundant and more personalized. We will evaluate the approach with user tests, comparing it with competitor approaches to generate notebooks [6, 16] and assessing its scalability.

Figure 2: Correlation of all the features in Table 1

rank | ensemble weight | type
-----|-----------------|--------------------
1    | 0.76            | gaussian process
2    | 0.02            | gradient boosting
3    | 0.04            | gradient boosting
4    | 0.08            | k nearest neighbors
5    | 0.10            | gradient boosting

Table 2: Model of interestingness

REFERENCES

[1] Sheelagh Carpendale, Nicholas Diakopoulos, Nathalie Henry Riche, and Christophe Hurter. Data-driven storytelling (Dagstuhl seminar 16061). Dagstuhl Reports, 2016.
[2] Alexandre Chanson, Ben Crulis, Nicolas Labroche, Patrick Marcel, Verónika Peralta, Stefano Rizzi, and Panos Vassiliadis. The traveling analyst problem: Definition and preliminary study. In DOLAP@EDBT/ICDT, 2020.
[3] S. Chen, J. Li, G. Andrienko, N. Andrienko, Y. Wang, P. H. Nguyen, and C. Turkay. Supporting story synthesis: Bridging the gap between visual analytics and storytelling. TVCG, 2018.
[4] George B. Dantzig. Discrete-variable extremum problems. Operations Research, 5(2):266–288, 1957.
[5] Daniel Deutch, Amir Gilad, Tova Milo, and Amit Somech. ExplainED: Explanations for EDA notebooks. Proc. VLDB Endow., 13(12):2917–2920, 2020.
[6] Ori Bar El, Tova Milo, and Amit Somech. Automatically generating data exploration sessions using deep reinforcement learning. In SIGMOD, 2020.
[7] Matthias Feurer, Aaron Klein, Katharina Eggensperger, Jost Tobias Springenberg, Manuel Blum, and Frank Hutter. Efficient and robust automated machine learning. In Advances in Neural Information Processing Systems, Canada, 2015.
[8] Dimitrios Gkesoulis, Panos Vassiliadis, and Petros Manousis. CineCubes: Aiding data workers gain insights from OLAP queries. Inf. Syst., 53:60–86, 2015.
[9] Maurice H. Halstead. Elements of Software Science (Operating and Programming Systems Series). Elsevier Science Inc., USA, 1977.
[10] Stratos Idreos, Olga Papaemmanouil, and Surajit Chaudhuri. Overview of data exploration techniques. In SIGMOD, 2015.
[11] Robert Kosara and Jock Mackinlay. Storytelling: The next step for visualization. IEEE Computer, 46, 2013.
[12] John McAuley, Rohan Goel, and Tamara Matthews. ExploroBOT: Rapid exploration with chart automation. In VISIGRAPP, 2019.
[13] Thomas J. McCabe. A complexity measure. IEEE Trans. Software Eng., 2(4):308–320, 1976.
[14] Martina Megasari, Pandu Wicaksono, Chiao Yun Li, Clément Chaussade, Shibo Cheng, Nicolas Labroche, Patrick Marcel, and Verónika Peralta. Can models learned from a dataset reflect acquisition of procedural knowledge? An experiment with automatic measurement of online review quality. In Il-Yeol Song, Alberto Abelló, and Robert Wrembel, editors, Proceedings of DOLAP, volume 2062 of CEUR Workshop Proceedings. CEUR-WS.org, 2018.
[15] Faten El Outa, Matteo Francia, Patrick Marcel, Verónika Peralta, and Panos Vassiliadis. Towards a conceptual model for data narratives. In ER, 2020.
[16] Aurélien Personnaz, Sihem Amer-Yahia, Laure Berti-Équille, Maximilian Fabricius, and Srividya Subramanian. DORA THE EXPLORER: Exploring very large data with interactive deep reinforcement learning. In CIKM, 2021.
[17] Adam Rule, Aurélien Tabard, and James D. Hollan. Exploration and explanation in computational notebooks. In CHI, 2018.
[18] D. Shi, F. Sun, X. Xu, Xingyu Lan, David Gotz, and Nan Cao. AutoClips: An automatic approach to video generation from data facts. Comput. Graph. Forum, 40(3):495–505, 2021.
[19] Danqing Shi, Xinyue Xu, Fuling Sun, Yang Shi, and Nan Cao. Calliope: Automatic visual data story generation from a spreadsheet. IEEE Trans. Vis. Comput. Graph., 27(2):453–463, 2021.
[20] Bo Tang, Shi Han, Man Lung Yiu, Rui Ding, and Dongmei Zhang. Extracting top-k insights from multi-dimensional data. In SIGMOD, 2017.
[21] Jiawei Wang, Li Li, and Andreas Zeller. Better code, better sharing: On the need of analyzing Jupyter notebooks. In ICSE-NIER, 2020.
[22] Yun Wang, Zhida Sun, Haidong Zhang, Weiwei Cui, Ke Xu, Xiaojuan Ma, and Dongmei Zhang. DataShot: Automatic generation of fact sheets from tabular data. IEEE Trans. Vis. Comput. Graph., 2020.