=Paper= {{Paper |id=Vol-2322/BigVis_8 |storemode=property |title=PySnippet: Accelerating Exploratory Data Analysis in Jupyter Notebook through Facilitated Access to Example Code |pdfUrl=https://ceur-ws.org/Vol-2322/BigVis_8.pdf |volume=Vol-2322 |authors=Alex Watson,Scott Bateman,Suprio Ray |dblpUrl=https://dblp.org/rec/conf/edbt/WatsonBR19 }} ==PySnippet: Accelerating Exploratory Data Analysis in Jupyter Notebook through Facilitated Access to Example Code== https://ceur-ws.org/Vol-2322/BigVis_8.pdf
                             PySnippet: Accelerating Exploratory Data
                              Analysis in Jupyter Notebook through
                                Facilitated Access to Example Code
                                              Alex Watson, Scott Bateman and Suprio Ray
                                                          University of New Brunswick
                                                   awatson@unb.ca,scottb@unb.ca,sray@unb.ca

ABSTRACT
Interactive environments like Jupyter Notebook, Mathematica, RStudio, and MATLAB are used to ease development in the growing fields of data science and data analytics. These systems allow users access to many open-source technologies, packages, and libraries, which include functionality such as big data analytics, machine learning, statistical analysis, data wrangling, and large-scale scientific calculations and visualization. The variety and complexity of these libraries mean that data analysts are not as productive as they might be, because they must spend substantial time learning the libraries and their programming interfaces. This learning typically requires a significant amount of time to find and review documentation, and/or find example code. To address these inefficiencies, we propose an automatic code snippet feature that is built directly into the Jupyter Notebook environment. To illustrate the effectiveness of our proposal, we developed a prototype called PySnippet. In an initial user study, participants were able to complete several exploratory data analysis tasks with both familiar and unfamiliar libraries significantly faster with PySnippet.

1 INTRODUCTION
More and more data scientists and data analysts employ interactive environments as the primary tool in their analysis activities. Popular interactive data analysis environments include Jupyter Notebook¹, Mathematica², RStudio³, and MATLAB⁴. These interactive environments ease development effort [5, 6] for both new and experienced data analysts and scientists, by (among other things) providing easy access to a multitude of open source packages and libraries, which include functionality such as big data analytics, machine learning, statistical analysis, data wrangling, and large-scale information visualization.

While interactive data analysis environments facilitate access to many powerful libraries and APIs (application programming interfaces) to support analysis and visualization, this power comes with an increase in complexity. The variety and complexity of libraries mean that data analysts are not as productive as they could be, because they must spend substantial time learning the libraries. As a result, if an analyst wants to perform a common task with an unfamiliar API, she will need to spend time to research, find and review documentation, and/or to find example code (a code snippet) that demonstrates how she can complete her task. In this paper, we refer to a code snippet as a small piece of reusable source code that completes a desired analytic task.

Finding snippets to complete tasks may be faster than reading through the documentation, but it can still be time-consuming. Searching for a desired snippet likely requires time browsing through online documentation or searching online repositories like StackOverflow⁵. Furthermore, after finding the desired code snippet, an analyst must integrate it into her own solution. Integrating example code can also be time-consuming: code snippets need to be adapted to the current context of existing code (e.g., variables must be renamed, dependencies must be included). Further, beginners and intermediate developers may lack the expertise to successfully interpret and transfer code snippets into their solutions.

To address these common situations we have developed PySnippet. PySnippet aims to reduce the overall development time for users by providing an automatic, easy-to-access code snippet feature directly in the Jupyter Notebook environment. PySnippet is implemented in Jupyter Notebook, an open-source, web-based, interactive data analysis environment/tool, which allows users to create and share documents that contain live code, equations, visualizations, and narrative text. While Jupyter Notebook already supports auto-completion, it only assists with finding known functions and objects and does not provide assistance with many common needs, such as how methods can work together or how common tasks can be accomplished with a library. PySnippet addresses these shortcomings by allowing rapid access to code snippets that illustrate how common tasks can be completed, integrating them directly into an analyst’s current notebook.

To demonstrate the baseline utility and advantages of using PySnippet, we report the findings of an initial experiment, where we asked 8 participants to complete representative tasks using either normal Jupyter or Jupyter with PySnippet. Our results show that PySnippet makes common analytics and visualization tasks using Jupyter faster, reduces the need for using search engines and is preferred by analysts.

The rest of this paper is organized as follows. Section 2 examines relevant research. Section 3 describes PySnippet and its implementation in detail. Section 4 presents our experimental evaluation. Section 5 summarizes our findings, describes the strengths and weaknesses of our current implementation, and highlights directions for future work.

¹ Jupyter Notebook: https://jupyter.org/
² Wolfram Mathematica: http://www.wolfram.com/mathematica/
³ RStudio: https://www.rstudio.com/
⁴ MATLAB: https://www.mathworks.com/products/matlab.html
⁵ StackOverflow: https://stackoverflow.com/
⁶ IntelliJ: https://www.jetbrains.com/idea/
⁷ Eclipse: https://www.eclipse.org/

© 2019 Copyright held by the author(s). Published in the Workshop Proceedings of the EDBT/ICDT 2019 Joint Conference (March 26, 2019, Lisbon, Portugal) on CEUR-WS.org. Distribution of this paper is permitted under the terms of the Creative Commons license CC-by-nc-nd 4.0.

2 RELATED WORK
Many modern IDEs (Integrated Development Environments) provide code completion features; e.g., IntelliJ⁶ and Eclipse⁷. Code
completion features speed up development time by reducing common mistakes that arise due to input errors (e.g., typos) and the difficulty of remembering function names [5, 6]. Research has also recognized the importance of code snippets in the day-to-day practices of programmers, both to accelerate development [6] and to improve learning [3, 4]. As such, research has focused on several aspects, including how novices use snippets [4], novel interfaces for accessing and incorporating snippets [5, 6], improved discovery of snippets [1, 2, 5], and the improvement of snippet authoring [3, 6].

CodeMend [5] is a system that also targets Jupyter as a basis for providing access to code snippets without the added overhead of searching through online search engines and documentation. CodeMend provides the ability to use natural language to try and find the desired snippets. However, it is a fundamental redesign of the Jupyter environment that incorporates a new workflow for a limited set of activities. In other words, it redesigns the interaction with Jupyter with an additional user interface and dashboard, and it is implemented to work only with the matplotlib library. In contrast, PySnippet is a lightweight tool that was designed to fit easily within the current practices of developers and that has little or no overhead to learn. The work that is most closely related to ours is SnipMatch [6], which works similarly to PySnippet but is intended for Java snippets in Eclipse. While some work exists that is similar to PySnippet, unlike previous work, our goal is to provide a tool that can easily be incorporated into current data analysis and visualization practices.

3 OUR SYSTEM
Jupyter Notebook provides an auto-complete feature that can be used by hitting the TAB key beside an incomplete identifier. This action pops up a list of potentially matching methods, variables or parameters to finish the incomplete identifier. The list is based on the current identifier, as well as the current context of the code (e.g., which object the identifier is attached to). Our implementation of PySnippet extends the existing auto-complete feature to work based on keywords. For example, if a user presses TAB, PySnippet parses the current line as a set of keywords; if snippets matching the keywords are found, they are added as options in the code completion list. The list can be navigated using the arrow keys. When a snippet description is highlighted in the list, a pop-up to the right of the list shows the corresponding code snippet, as shown in Figure 1(a). The rest of this section will provide more detail on how and why we implemented PySnippet.

(a) After an analyst has pressed the TAB key after typing the keywords ‘plot scatter’, the matching code snippet descriptions are shown in the list, and the code snippet of the currently selected description in the list is displayed on the right.
(b) The result after the analyst pressed ‘Enter’ on the code snippet description in (a) above. The user’s keywords ‘plot scatter’ are replaced with the selected code snippet. The user runs the corresponding notebook to get the resulting scatter plot.
Figure 1: Example of PySnippet used to easily access a code snippet that employs matplotlib to create a scatter plot.

PySnippet is built directly into Jupyter Notebook 5.4.0⁸. Like the existing code completion system, PySnippet is activated using the TAB key. PySnippet co-exists with the current code completion functionality, allowing users to easily access either the conventional functionality or PySnippet’s.

PySnippet was implemented with a data analysis or data science workflow in mind. This is why our current implementation focuses on four common Python libraries typically used in data analytics and visualization. Our current version of PySnippet provides snippets for matplotlib⁹, NumPy¹⁰, Pandas¹¹ and timeit¹². Pandas is a package that provides data structures and tools for data analysis. NumPy is a package for scientific computing and it has a powerful N-dimensional array object. Matplotlib is a plotting and data visualization library, which produces many figures and graphs. Lastly, timeit measures the execution time of code snippets.

⁸ GitHub: https://github.com/jupyter/notebook
⁹ matplotlib: https://matplotlib.org/
¹⁰ NumPy: http://www.numpy.org/
¹¹ Pandas: https://pandas.pydata.org/
¹² timeit: https://docs.python.org/2/library/timeit.html

Figure 1 displays an example of an analyst using PySnippet to quickly create a scatter plot using the matplotlib library. The analyst simply types a few keywords: ‘plot’ and ‘scatter’. These keywords are similar to what might be googled in an attempt to find a code snippet. Then the analyst presses the TAB key, and PySnippet uses the keywords ‘plot’ and ‘scatter’ to search through a dictionary of code snippets. It then returns all code snippets matching these keywords, as well as a title and description for each snippet. Short descriptions of the snippets are shown in a list, which the user can navigate to explore and select a snippet. This is shown in Figure 1(a), in which the smaller pop-up on the left shows a list of text descriptions. When a description is highlighted, its corresponding code snippet is shown in the pop-up on the right of Figure 1(a). For example, ‘plot.scatter’ is the highlighted description on the list, so its corresponding code snippet is shown in the pop-up to the right. Once the analyst has chosen the snippet she would like to use, she can press ‘Enter’ to insert it into her notebook. Running the Jupyter Notebook with the newly incorporated snippet would result in the scatter plot in Figure 1(b).
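The keyword lookup described in this section can be illustrated with a minimal sketch. The data layout, the function names (`parse_keywords`, `find_snippets`, `complete`), the example snippets, and the rule that a snippet matches only when it carries every typed keyword are all assumptions made for illustration; the paper does not specify PySnippet's internal data model or matching rule.

```python
# Hypothetical snippet store. Field names, entries, and the matching rule
# are assumptions for illustration, not PySnippet's actual implementation.
SNIPPETS = [
    {
        "title": "plot.scatter",
        "description": "Scatter plot of two sequences with matplotlib",
        "keywords": {"plot", "scatter", "matplotlib"},
        "code": "plt.scatter(x, y)\nplt.xlabel('x')\nplt.ylabel('y')\nplt.show()",
    },
    {
        "title": "plot.line",
        "description": "Line plot with matplotlib",
        "keywords": {"plot", "line", "matplotlib"},
        "code": "plt.plot(x, y)\nplt.show()",
    },
    {
        "title": "timeit.statement",
        "description": "Time a statement with timeit",
        "keywords": {"timeit", "time", "execution"},
        "code": "timeit.timeit('sum(range(100))', number=1000)",
    },
]


def parse_keywords(line):
    """Split the current input line into a set of lowercase keywords."""
    return {word.lower() for word in line.split()}


def find_snippets(line, snippets=SNIPPETS):
    """Return the snippets whose keyword set contains every typed keyword."""
    query = parse_keywords(line)
    if not query:
        return []
    return [s for s in snippets if query <= s["keywords"]]


def complete(line, choice=0, snippets=SNIPPETS):
    """Replace the typed keywords with the chosen snippet's code.

    If nothing matches, the line is returned unchanged.
    """
    matches = find_snippets(line, snippets)
    return matches[choice]["code"] if matches else line


# Titles and descriptions would populate the completion list;
# the code of the highlighted entry would appear in the pop-up.
for match in find_snippets("plot scatter"):
    print(match["title"], "-", match["description"])

# Pressing 'Enter' replaces the keywords with the selected snippet.
print(complete("plot scatter"))
```

In this sketch, typing ‘plot scatter’ matches only the `plot.scatter` entry, while typing ‘plot’ alone would list both plotting entries. A subset test over keyword sets is only one simple choice; ranking partial matches would be an alternative as the number of snippets grows.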
All the code snippets that were added to PySnippet were directly copied from websites like StackOverflow or from the online documentation of the particular library, package or technology. Only minor modifications were made to these code snippets, to make them more general. Thus, PySnippet is not creating any novel custom code snippets for users; it is simply taking code snippets that already exist online and providing users with a quicker and more direct way to access them. In our future work, we plan to extend this functionality to allow users to author their own snippets [6] and to access snippets from online sources [1].

4 EVALUATION
The goal of our initial evaluation of PySnippet was to understand how it fared in the typical uses of Jupyter. In particular, we wanted to know whether analysts would find any important problems with PySnippet, and, if not, whether PySnippet would allow participants to complete common tasks faster than without it. To this end, we designed a formal experiment that compared two versions of Jupyter: normal Jupyter (referred to as normal) and Jupyter with PySnippet (referred to as PySnippet). The two versions differed only in the availability of PySnippet. Regardless of the version used, participants had access to the Internet, so they could search online for anything they needed.

4.1 Participants
Eight participants were recruited, who were graduate students or recent graduates of a local university. Of the eight participants, seven had programming experience at an undergraduate level, and five had programmed in a professional environment. Only one participant had little to no programming experience. Four of the eight participants had no or minimal experience programming in Python and had never worked with Jupyter, while the other four had varying degrees of experience with Python and had previously used Jupyter or were familiar with it. Three of the participants were familiar with the Python libraries used in the experiment.

4.2 Experimental Task
Throughout the experiment, participants completed a series of eight common data analytics tasks (four with each version). For each version, one of the tasks was considered a practice task (it was not used in the analysis, and alternated between participants). The practice task was used to help participants get used to each version without the pressure of being timed, as well as to provide them with an opportunity to ask questions about functionality. We made it clear in the practice task that the participants should work on their own to complete the experimental tasks, and that the experimenter would not provide any help. The presentation of task and version was balanced using a Latin square to minimize any bias in our results due to presentation ordering.

The tasks were created to be fairly straightforward and involved introductory and common data wrangling, analytics or visualization tasks. Tasks included "create a scatter plot of the provided data with matplotlib", "merge two Pandas dataframes, then perform a groupby", "create and multiply two NumPy arrays", "filter Pandas dataframe, then perform simple statistical analysis" and "determine the execution time of the code below using timeit". Of the eight tasks, four tasks used Pandas, two used matplotlib, one used NumPy, and one used timeit. The tasks were grouped so that each version would be used in two tasks with Pandas, one task with matplotlib and one of either timeit or NumPy. For each of the tasks, the import statements involving the mentioned libraries were provided. Also, clear directions about which packages were to be used were provided in the task description. To achieve a correct answer the participant needed to provide a code snippet (either through using PySnippet, searching the Web or reading documentation) that met all of the requirements of the particular task. It is important to note that not all snippets available in PySnippet were relevant to tasks in the experiment; we provided approximately 20 additional snippets.

4.3 Procedure
At the beginning of the experiment, all participants were given a 15-minute introductory tutorial about Jupyter Notebook, Python, and the libraries involved in the experiments. This gave participants who were new to Jupyter, Python or any of the libraries an introduction, so that they would have an idea about how to use Jupyter, as well as where to find online documentation for each of the libraries. Participants were evenly assigned to start with either the normal or PySnippet version of Jupyter. Participants were then asked to complete the four task trials, after which they completed four tasks with the other version. A time limit of ten minutes was given for each task trial, and when the time limit was reached, a participant’s number of incorrect submissions was increased by one and the completion time was set to ten minutes for that particular task. At the end of the tasks involving each version a questionnaire, which solicited opinions on both versions, was provided.

4.4 Analysis
Our experiment was designed with one independent variable with two levels (System version: normal Jupyter, and Jupyter with PySnippet). There were three dependent variables: completion time, number of incorrect submissions and number of Google searches. The completion time was collected by the system as the amount of time it took for a participant to complete a given task. The number of incorrect submissions was a count of how many times a participant submitted an incorrect or incomplete answer for a given task. The number of Google searches was the number of times a participant performed a search on Google (used to represent the need for finding information outside of Jupyter). The data was analyzed using a repeated-measures ANOVA to determine whether there were significant differences in the dependent variables as a result of differences in the two conditions.

4.5 Results
Task Completion Time. The mean task completion time for PySnippet was 222.7 s. This was approximately 30.5% less than the mean completion time of 327.8 s observed for normal Jupyter. The difference was statistically significant ($F_{1,7} = 8.314$, $p < .05$). Figure 2(a) displays the mean task completion time of each participant. All but one participant showed an improvement in completion time using PySnippet. The mean completion time for every trial was faster using PySnippet.

Google Searches. The mean number of Google searches for each task using PySnippet was 0.198. This was 92.8% less than the mean of 2.75 observed for normal Jupyter. The difference was statistically significant ($F_{1,7} = 159.2$, $p < .05$). Figure 2(c), which displays the mean number of Google searches for each version, shows that six out of the eight participants did not use Google at all while using PySnippet. Thus, PySnippet provided sufficient information
about the desired code snippets in most cases, and no further explanation was needed through a Google search.

(a) Mean completion time (s) of each participant.
(b) Mean number of incorrect submissions for each participant.
(c) Mean number of Google searches for each participant.
Figure 2: Results of the user study. Each panel compares the Normal and PySnippet versions for participants P1 to P8.

Incorrect Submissions. The mean number of incorrect submissions was 1.687 for PySnippet and 1.875 for normal Jupyter. As there was substantial variation across participants, as shown in Figure 2(b), the difference was not statistically significant ($F_{1,8} = 0.229$). The number of incorrect submissions dropped slightly in the PySnippet version. This suggests that the participants were able to comprehend the code snippets provided by PySnippet, at least in a similar way as if they had found the snippets elsewhere online.

At the end of the experiment, participants were given a final questionnaire to gauge their preferences between the two versions. Seven out of the eight participants preferred PySnippet over normal Jupyter; one of the participants was indifferent and none preferred normal Jupyter. Additionally, two out of the four participants that completed the PySnippet tasks first stated that it was frustrating to have to go back and search online for snippets instead of just using PySnippet from within Jupyter, which they found much easier.

Participants also provided feedback on how PySnippet could be improved. Two participants wanted to use the mouse to click on descriptions rather than using the keyboard to explore snippets. Further, clicking on a description would automatically insert the snippet (even if it was not the one currently selected/highlighted). Another participant mentioned that they would prefer if a snippet would not replace the whole line of text (only back to the equal sign), so they could assign the snippet to a variable when inserting it. These comments highlighted minor usability issues that could easily be improved in future iterations of PySnippet.

5 CONCLUSION AND FUTURE WORK
Our initial evaluation of PySnippet suggests that it works as it was intended to; it allowed participants in our study to spend less time searching online for code snippets and rapidly find the solutions they were looking for. It was able to reduce overall development time and reduce the number of times a participant needed to search for a code snippet or find other documentation online. While a few participants encountered minor usability issues, the results showed that participants had little difficulty making use of the system to accomplish the realistic data analysis and visualization tasks they were given. PySnippet was also preferred over normal Jupyter by all but one participant.

The current implementation of PySnippet performed extremely well on the limited set of tasks we evaluated. Extending our solution further would entail new challenges with new complexities. For example, future work will have to deal with issues associated with finding snippets in a repository of potentially thousands of entries. Based on the results of our study, we believe that this effort would be worthwhile and that PySnippet would be beneficial in many common analytics and visualization scenarios. Our future plan is to build PySnippet into an actual extension that would be available to download with the Jupyter Notebook kernel. We have several ideas about how we can improve PySnippet and extend its features. For example, we are developing a feature that allows users to highlight code and right click to "add a new snippet", so they can easily create and save useful snippets for future use. We also plan to address the minor usability issues raised by participants. For the longer term, we imagine an online repository, where users could actively share and curate snippets for the community.

When considering the implications of our work more broadly, we believe that there is a huge potential for simple, well-designed tools, like PySnippet, to improve data analysis and visualization activities. Working with large data sets is a complex task, with many concerns to attend to. By facilitating the data analysis process through better and easier-to-use tools, we reduce the cognitive load and time requirements that data analysts and scientists incur while attending to mundane and unsophisticated tasks. Simple tasks, such as learning how to create simple plots, currently require more attention than they should. Improved tools lead to improved processes, and enhance the ability to focus on bigger and more important challenges. In our future work, we will continue to look for opportunities to provide new and improved tools that empower data scientists to make new discoveries and provide new insights.

REFERENCES
[1] Balachandran, V. Query by example in large-scale code repositories. In 2015 IEEE International Conference on Software Maintenance and Evolution (ICSME) (Sept. 2015), pp. 467–476.
[2] Diamantopoulos, T., Karagiannopoulos, G., and Symeonidis, A. CodeCatch: Extracting Source Code Snippets from Online Sources. In 2018 IEEE/ACM 6th International Workshop on Realizing Artificial Intelligence Synergies in Software Engineering (RAISE) (May 2018), pp. 21–27.
[3] Ginosar, S., De Pombo, L. F., Agrawala, M., and Hartmann, B. Authoring Multi-stage Code Examples with Editable Code Histories. In Proceedings of the 26th Annual ACM Symposium on User Interface Software and Technology (New York, NY, USA, 2013), UIST ’13, ACM, pp. 485–494.
[4] Ichinco, M., and Kelleher, C. Exploring novice programmer example use. In 2015 IEEE Symposium on Visual Languages and Human-Centric Computing (VL/HCC) (Oct. 2015), pp. 63–71.
[5] Rong, X., Yan, S., Oney, S., Dontcheva, M., and Adar, E. CodeMend: Assisting interactive programming with bimodal embedding. In Proceedings of the 29th Annual Symposium on User Interface Software and Technology (New York, NY, USA, 2016), UIST ’16, ACM, pp. 247–258.
[6] Wightman, D., Ye, Z., Brandt, J., and Vertegaal, R. SnipMatch: Using source code context to enhance snippet retrieval and parameterization. pp. 219–228.