=Paper= {{Paper |id=Vol-2975/paper10 |storemode=property |title=nanoHUB User Behavior: Moving from Retrospective Statistics to Actionable Behavior Analysis |pdfUrl=https://ceur-ws.org/Vol-2975/paper10.pdf |volume=Vol-2975 |authors=Gerhard Klimeck,Gustavo Valencia-Zapata,Nathan Denny,Lynn Zentner,Michael Zentner |dblpUrl=https://dblp.org/rec/conf/iwsg/KlimeckVDZZ19 }} ==nanoHUB User Behavior: Moving from Retrospective Statistics to Actionable Behavior Analysis== https://ceur-ws.org/Vol-2975/paper10.pdf
                              11th International Workshop on Science Gateways (IWSG 2019), 12-14 June 2019



                              nanoHUB user behavior:
                        moving from retrospective statistics to
                            actionable behavior analysis
        Gerhard Klimeck1, Gustavo A. Valencia-Zapata1, Nathan Denny2, Lynn K. Zentner1, Michael G. Zentner1,2
                                    1
                                      Network for Computational Nanotechnology
                                       2
                                         Rosen Center for Advanced Computing
                                 Purdue University, West Lafayette, IN 47907, USA

                            ABSTRACT
    nanoHUB annually serves 17,000+ registered users with over 1 million simulations. In the past, we have used data analytics to demonstrate that nanoHUB can be a powerful scientific knowledge sharing platform. We used retrospective data analytics to show how simulation tools were used in structured education and how simulation tools were used in novel research. With such retrospective analytics, we have made strategic decisions on tool and content development and justified continued nanoHUB investments by the US National Science Foundation (NSF). As we migrate towards a sustainable nanoHUB we must embrace processes similar to those pursued by platforms such as Uber or AirBnB: we need to create actionable data analytics that can rapidly support user experience and help grow the supply in the two-sided market platform – we need to improve the experience of providers as well as end-users. This paper describes some aspects of how we pursue user behavior analysis inside the virtual worlds of nanotechnology simulation tools. From such user behavior we plan to derive actionable analytics that influence user behaviors as they interact with nanoHUB.

   Keywords— nanoHUB; HUBzero; science gateways; user behavior; analytics; cluster; meander; education

               I INTRODUCTION AND BACKGROUND
    nanoHUB is a scientific knowledge platform that has enabled over 3,500 researchers and educators to share 500+ research simulation tools and models as well as 6,000+ lectures and tutorials globally through a novel cyberinfrastructure. nanoHUB annually serves 17,000+ registered users with over 1 million simulations in an end-to-end user-oriented scientific computing cloud. Over 1.5 million visitors access the openly available web content items annually. These might be considered impressive summative numbers, but they do not address whether the site has any impact or what these users are doing.

    Understanding these numbers requires some background on the original intentions and cyberinfrastructure developments around nanoHUB. Fundamental issues raised by peer reviewers were the perceived ability of a university project to provide a stable, national-level infrastructure, provide support for the offered services, and provide compute cycles for an ever-growing user base.

    From the very beginning in 1996 [1], the predecessor to nanoHUB, called the Purdue Network Computing Hub (PUNCH), was created to enable researchers to share their code without re-writes through novel web interfaces with end-users in education and research. PUNCH was so novel that even the web server had to be created within the team. By 2004 the standard web-form interfaces were antiquated and did not inspire the interactive exploration of simulation results with the rapid "What if?" questions that users might have. Users had to download their simulation data to manipulate them into a form where they could be truly used. nanoHUB was not an end-to-end usage platform. It became clear that the system had to be revamped to enable the hosting of user-friendly, engineering-use inspired interactive applications. Such interactive sessions had to be hosted in a reliable, scalable middleware running in production mode, not as a research paper demonstration. 3D dataset exploration had to be supported on remote, dedicated GPUs that deliver the results to end users.

    RAPPTURE, the Rapid APPlication infrastrucTURE toolkit [2], enabled researchers who typically did not have any graphical user interfaces to their codes to describe the inputs and outputs of their codes in XML and to generate a GUI. New middleware [3] enabled 1,000+ users to be hosted simultaneously on a moderate cluster of about 20 compute nodes. A novel remote GPU-based visualization system [4] supported hundreds of simultaneous sessions. nanoHUB established the first community accounts on TeraGrid and OSG, which would execute heavy-lifting nanoHUB simulation jobs completely transparently on behalf of users who had no accounts on these grid platforms [5]. We developed processes [6] to continually test the reliability of these remote grid services to ensure smooth user services. For application support we developed policies and operational infrastructure that enabled tool contributors to support and improve their tools through question & answer forums and through wishlists. As this novel infrastructure emerged in 2005 we observed rapid growth in the simulation user base from the historical numbers of 500 annual users to over 10,000 in a few years.
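The RAPPTURE idea of declaring a tool's inputs and outputs in XML, from which a GUI is generated, can be illustrated with a small sketch. The element names below only approximate that idea and are our simplifying assumptions, not the exact RAPPTURE schema; a GUI generator would simply walk the declared elements:

```python
# Illustrative sketch only: the element names approximate the RAPPTURE idea of
# declaring tool inputs/outputs in XML; they are NOT the exact RAPPTURE schema.
import xml.etree.ElementTree as ET

tool_xml = """<?xml version="1.0"?>
<run>
  <input>
    <number id="Na">
      <about><label>Acceptor doping</label></about>
      <units>/cm3</units>
      <default>1e15</default>
    </number>
    <choice id="material">
      <about><label>Material</label></about>
      <default>Si</default>
    </choice>
  </input>
  <output>
    <curve id="band_diagram">
      <about><label>Energy band diagram</label></about>
    </curve>
  </output>
</run>"""

root = ET.fromstring(tool_xml)
# A GUI generator only needs to walk the declared inputs/outputs.
inputs = [(el.tag, el.get("id")) for el in root.find("input")]
outputs = [(el.tag, el.get("id")) for el in root.find("output")]
print(inputs)   # [('number', 'Na'), ('choice', 'material')]
print(outputs)  # [('curve', 'band_diagram')]
```

Because every tool shares one description language, the same walker works across all 500+ tools, which is what makes cross-tool behavior logging feasible in the first place.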
Copyright © 2021 for this paper by its authors. Use permitted under Creative
Commons License Attribution 4.0 International (CC BY 4.0).
As questions of technical feasibility were addressed, new questions as to actual and potential impact emerged. Early on, our peer reviewers raised fundamental questions whether such research-based simulation tools could be used by other researchers at all and whether these tools could be used in education without specific customizations. The nanoHUB team developed analytics that documented nanoHUB use in research through reference and citation searches in the scientific literature. Today we can document over 2,200 papers that cite nanoHUB, and we keep track of the used resources and tools to provide attribution to the published tools. When we showed the first 200 formal citations our peers remained unconvinced that this could be good research. We then began to track secondary citations, which today sum to over 30,000, resulting in an h-index of 82.

Our peers had a similarly strong opinion that research tools could not be used in education. We therefore developed novel clustering algorithms [7] that documented systematic use of simulation tools in formal education settings. Today we can show that over 35,000 students in over 1,800 classes at over 180 institutions have used nanoHUB in formalized education settings. We could also measure the time-to-adoption between tool publication and first-time systematic use in a classroom. The median time was determined to be less than 6 months.

From the analysis of research use and education use we can begin to qualify the attributes of the underlying simulation tools. We found significant use in education and in research for many of the nanoHUB tools. These research and education impact studies are documented in detail in Nature Nanotechnology [8].

We used retrospective data analytics to show how simulation tools were used in structured education and how simulation tools were used in novel research. We showed that the transition from research tool publication to adoption in the classroom happens rapidly, typically in less than six months, and demonstrated through longitudinal data how research tools migrate into education. With these retrospective analytics, we have made strategic decisions on tool and content development and justified continued investments by NSF into nanoHUB.

As we migrate towards a sustainable nanoHUB we must embrace processes similar to those pursued by platforms such as Uber or AirBnB: we need to create actionable data analytics that can rapidly support user experience and help grow the supply in the two-sided market platform – we need to improve the experience of providers as well as end-users.

                 II RESEARCH QUESTIONS
   Beyond raw numbers of users and simulations, we have over the years continued to ask ourselves: How do users behave in their virtual world of a simulation tool? More specifically:

 - How do they "travel" through the design/exploration world?
 - How many individual simulations do they run within one session?
 - How many parameters do users change?
 - How differently do researchers, classroom users, and self-study users behave?
 - How differently do different classes behave?
 - Does different class instruction material / scaffolding make a difference?
 - Can we provide feedback to instructors on their classrooms?
 - Given certain usage patterns inside the tool, can we improve the tools and provide feedback to the developers?

   There are a variety of requirements that need to be met to address some of these questions in a scalable infrastructure, such as:

 - Storage/availability of individual simulation runs within user sessions
 - A data description language that is shared across different tools
 - A large set of simulation runs and participants
 - Other user data such as classroom participation, researcher identification, geolocation, etc.

In the next sections we describe some of our first results that begin to address some of these questions. For our initial study presented here we focus on the user behavior for PN Junction Lab [9], which is consistently one of the top 10 nanoHUB tools [10] within any year. Despite its codename pntoy, the tool is powered by an industrial-strength semiconductor device modeling tool called PADRE [11]. Instead of learning the complex PADRE input language that involves gridding, geometry, material, and environmental specifications, users can easily ask "What if?" questions in a toy-like fashion.

                 III SEARCHERS AND WILDCATTERS
    RAPPTURE provides a rather generic description of simulation tool inputs and outputs. Over 90% of the 500+ nanoHUB simulation tools utilize RAPPTURE as their data description language. With existing simulation logs we can now begin to study the user behavior inside simulation tools. Each simulation tool typically exposes 10 to 50 parameters to the users. Most of these parameters are freeform numbers such as length, doping, effective mass, dielectric constant, temperature, etc., with their specific units, while there is also a significant set of discrete options such as model or geometry choices. Assuming that each parameter might have just 10 reasonable choices, each tool spans a configurational design space of at least 10^10 to 10^50 points. The dimensionality of these tools is clearly too large to be intuitively understood.

    We developed a visualization methodology [12] to flatten an N-dimensional space into 2 dimensions. Figure 1 shows the conceptual mapping and two significantly different user behaviors: a searcher, who moves through the design space in subsequent steps that appear to indicate a method or a goal, and a wildcatter, who modifies, apparently wildly, the same set of parameters and appears to jump throughout the design space.
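The actual flattening and the Searchiness index are defined in [12]; purely as a hypothetical sketch of the idea (none of the function names, the projection, or the scoring formula below are the published definitions), one could project each session's parameter vectors to 2D with a fixed random projection and score the directional persistence of the resulting path:

```python
# Hypothetical sketch: NOT the published Searchiness definition from [12].
# Each simulation run is a normalized parameter vector; a session is a path.
import math
import random

def project_2d(vec, seed=0):
    """Flatten an N-dim parameter vector to 2D with a fixed random projection."""
    rng = random.Random(seed)
    basis = [[rng.gauss(0, 1) for _ in vec] for _ in range(2)]
    return tuple(sum(b * v for b, v in zip(row, vec)) for row in basis)

def searchiness_like(session):
    """Directional persistence of the session path, mapped to [0, 1].

    1.0 ~ 'searcher' (consecutive steps point the same way),
    0.0 ~ 'wildcatter' (steps reverse/jump erratically).
    Needs at least 4 runs, i.e. 3 steps, to compare consecutive steps.
    """
    if len(session) < 4:
        return None
    pts = [project_2d(run) for run in session]
    steps = [(b[0] - a[0], b[1] - a[1]) for a, b in zip(pts, pts[1:])]
    cosines = []
    for (ux, uy), (vx, vy) in zip(steps, steps[1:]):
        nu, nv = math.hypot(ux, uy), math.hypot(vx, vy)
        if nu == 0 or nv == 0:
            continue  # skip repeated identical runs
        cosines.append((ux * vx + uy * vy) / (nu * nv))
    if not cosines:
        return None
    # Map the mean cosine from [-1, 1] to [0, 1].
    return (sum(cosines) / len(cosines) + 1) / 2

# A methodical sweep of one parameter gives collinear steps (score 1.0);
# erratic jumps through the space score lower.
searcher = [[0.1 * i, 0.5, 0.5] for i in range(6)]
wildcatter = [[random.Random(i).random() for _ in range(3)] for i in range(6)]
print(searchiness_like(searcher))
print(searchiness_like(wildcatter))
```

The minimum of 4 runs per session mirrors the constraint mentioned later for the published index, which also requires 4 queries before Searchiness is defined.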
Within the same publication we also documented the development of a "Searchiness" index that assigns a single value to the degree a user behaves like a prototypical wildcatter (Searchiness=0) or a prototypical searcher (Searchiness=1).

Figure 1: a) Visual representation of a multidimensional space in two dimensions. b) A prototypical searcher. c) A prototypical wildcatter.

   In this paper we show the analysis of a whole user population using a specific tool and fuse that data set with specific classroom users.

                 IV CLASSROOM CLUSTERS
    To demonstrate our ability to fuse different data sets from our datastore we pick two different class clusters with significantly different characteristics, as depicted in Figure 2. Class C12 is a class that recurred 15 times between 2008 and 2018. We have pntoy simulation data from 7 classes within 2014 to 2018 for 109 users who ran 180 sessions. Historically, we do not have the simulation data from all users in that time frame. Going forward, we have developed a simulation caching system where all RAPPTURE simulations are stored and users will receive stored solutions if they exist. The cluster view in Figure 2 shows a subset, the individual class held in the fall of 2015 with 40 students who ran 80 sessions. C12 only uses pntoy. Class C16 uses 6 different tools throughout a semester. pntoy is one of these 6 tools, used by 20 users in 29 sessions. The visual cluster representation in Figure 2 clearly shows the temporal behavior of 7 users who have used all 6 tools in the class. In the next section we will compare the behavior in these classes against all available data and against all self-study users within the same region (Texas).

Figure 2: Visual representation of two temporal usage patterns in two different classes. The horizontal axis represents time in units of days. The vertical axis stacks different users within a cluster. 40 users in C12 of fall 2015 use pntoy. C12 only uses pntoy. Class C16 uses 6 different tools throughout a semester. pntoy is one of these tools, used by 20 users. 7 users utilize all 6 tools as depicted.

                 V USER BEHAVIOR DISTRIBUTIONS
Figure 3 shows the Searchiness distribution of all 2,747 geo-located simulation sessions of pntoy by 1,865 users in the time frame of 2014-2018. The complete distribution of all runs shows clear peaks around 0 (wildcatters) and 1 (searchers). Class cluster C12 is a subset of all the available data, consisting of 40 users with 80 sessions. This class uses only a single tool, and wildcatter behavior appears to dominate it. In contrast, the smaller class that uses 6 tools in total, including pntoy with 20 users and 28 sessions, shows a distribution that seems to indicate more searchers.
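Comparing the populations in this way amounts to computing a normalized histogram of Searchiness per group. A minimal sketch of that step (the group labels, scores, and bin count below are illustrative assumptions, not our actual pipeline):

```python
# Minimal sketch of the per-population comparison: normalized histograms of a
# score in [0, 1]. Group labels, scores, and bin count are illustrative only.
from collections import defaultdict

def normalized_histogram(scores, bins=10):
    """Fraction of sessions falling into each of `bins` equal-width bins on [0, 1]."""
    counts = [0] * bins
    for s in scores:
        idx = min(int(s * bins), bins - 1)  # clamp s == 1.0 into the last bin
        counts[idx] += 1
    total = len(scores) or 1
    return [c / total for c in counts]

# sessions: (population_label, searchiness) pairs, e.g. drawn from session logs
sessions = [("all", 0.05), ("all", 0.95), ("all", 0.98), ("C12", 0.10),
            ("C12", 0.05), ("C16", 0.80), ("self-study", 0.50)]

by_group = defaultdict(list)
for label, score in sessions:
    by_group[label].append(score)

for label, scores in by_group.items():
    print(label, normalized_histogram(scores, bins=5))
```

Normalizing per group rather than plotting raw counts is what makes a 2,747-session population comparable to a 28-session class on the same axes.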
   Finally, we look at a third population: all the geo-located users in Texas (the location of C12 and C16) who have not been identified as participants
in any classes. We title this group of 20 users, who ran 36 simulation sessions in pntoy, "self-study" users. These users show yet a different distribution of Searchiness compared to the other populations.

   Next to Searchiness, which is a computed model metric, we can also look at a simple raw number: the number of queries each individual has performed within a single tool session. Within each tool session a user can execute the tool multiple times and compare results, as visualized in Figure 1. Figure 4 shows the normalized distribution of queries executed by the 4 different populations we examined in Figure 3. The number of queries does not reveal much information except that the overall population runs more queries than the 2 Texas classes and the Texas self-study users. The classes and self-study users show a rather strong drop-off beyond 4 queries, which is the minimal number of queries needed to define Searchiness. Initial analysis does not seem to indicate a strong correlation between Searchiness and number of queries.
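Checking for such a correlation can be as simple as a Pearson coefficient between per-session query counts and Searchiness values, restricted to the sessions that have the 4 or more queries needed for Searchiness to be defined. A minimal sketch, where the session records are invented for illustration:

```python
# Illustrative sketch: Pearson correlation between queries per session and
# Searchiness, for sessions with >= 4 queries (the minimum to define the index).
import math

def pearson(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Hypothetical session log: (queries_in_session, searchiness or None if undefined)
sessions = [(4, 0.9), (12, 0.1), (5, 0.8), (30, 0.2), (3, None), (7, 0.5)]
usable = [(q, s) for q, s in sessions if q >= 4 and s is not None]
queries = [q for q, _ in usable]
scores = [s for _, s in usable]
r = pearson(queries, scores)
print(f"Pearson r = {r:.2f} over {len(usable)} sessions")
```

An |r| near 0 over the real session logs would support the observation above that query count and Searchiness capture largely independent aspects of behavior.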

Figure 3: Normalized Searchiness density for four nanoHUB populations. All pntoy runs contain 1,865 users with 2,745 sessions. C12 Texas contains simulation data from 109 users running pntoy in 180 sessions in 7 classes from 2014-2018 (we do not have the simulation data of all users in those classes). C16 is a class that occurred once in the spring of 2015 and uses 6 tools; 20 users utilized 28 sessions. The self-study user population consists of all 20 geo-located users in Texas who ran 36 sessions and were not associated with a formal class in the time frame of 2014-2018.

Figure 4: Normalized number of queries for four nanoHUB user populations described in Figure 3.

                          VI CONCLUSION
    We report the development of a nanoHUB infrastructure that begins to enable the study of user behavior in virtual worlds of simulation tools. We use the previously published model index Searchiness and compute it for a complete data set of simulation sessions within a specific tool. We fuse data sets of class cluster identification with the model index Searchiness and the number of queries. No surprising results are seen or critical insight gained at this stage. We observe in the data that different user populations appear to behave differently in terms of Searchiness, and classes seem to appear similar in terms of number of queries. At this stage the data opens new vectors for questions such as:

 - Do all single-tool classes have similar behavior?
 - Do classes with more diverse tool use or better scaffolding foster more search-like behavior?
 - Can similar behavior differences be seen with the other tools that are used in classes?
 - What does a peak in Searchiness value of 0.5 mean? Do we need to refine the Searchiness index?
 - Do we need to identify other behavioral metrics in addition to Searchiness?
 - Do the users who use other nanoHUB material outside the tools behave differently than the ones that use tools only?

   We conclude that this work is a first demonstrator indicating that we can assess the simulation behavior of different user populations inside nanoHUB. We plan to refine these metrics and classifiers to gain more insights on user behavior, and ultimately to influence their behavior during use.

                       ACKNOWLEDGEMENTS
Funding by the US National Science Foundation under Grant Nos. EEC-0228390, EEC-0634750, OCI-0438246, OCI-0721680, and EEC-1227110 as well as Purdue University is gratefully acknowledged.

                          REFERENCES
[1]  N.H. Kapadia, J.A.B. Fortes, M.S. Lundstrom, "The Semiconductor Simulation Hub: A network-based microelectronics simulation laboratory," Proceedings of the Twelfth Biennial Conference:
     University/Government/Industry Microelectronics Symposium, 1997, IEEE Xplore, DOI: 10.1109/UGIM.1997.616686
[2]  Michael McLennan (2005), "Add Rappture to Your Software Development - Learning Module," https://nanohub.org/resources/240.
[3]  Michael McLennan, Rick Kennell, "HUBzero: A Platform for Dissemination and Collaboration in Computational Science and Engineering," IEEE Computing in Science and Engineering 12(2):48-53, 2010, DOI: 10.1109/MCSE.2010.41
[4]  Wei Qiao, Michael McLennan, Rick Kennell, David Ebert, Gerhard Klimeck, "Hub-based Simulation and Graphics Hardware Accelerated Visualization for Nanotechnology Applications," IEEE Transactions on Visualization and Computer Graphics, Vol. 12, Issue 5, pp. 1061-1068, Sept.-Oct. 2006, DOI: 10.1109/TVCG.2006.150
[5]  Gerhard Klimeck, Michael McLennan, Sean Brophy, George Adams III, Mark Lundstrom, "nanoHUB.org: Advancing Education and Research in Nanotechnology," IEEE Computing in Science and Engineering (CiSE), Vol. 10, Issue 5, pp. 17-23, Sept.-Oct. 2008, DOI: 10.1109/MCSE.2008.120
[6]  Lynn Zentner, Steven Clark, Krishna Madhavan, Swaroop Shivarajapura, Victoria Farnsworth, Gerhard Klimeck, "Automated Grid-Probe System to Improve End-To-End Grid Reliability for a Science Gateway," Proceedings of the TeraGrid 2011 Conference, July 18-21, 2011, Salt Lake City, ACM, ISBN: 978-1-4503-0888-5, DOI: 10.1145/2016741.2016789
[7]  Michael Zentner, Nathan Denny, Krishna Madhavan, Swaroop Samek, George Adams III, Gerhard Klimeck, "Using Automatic Detection and Characterization to Measure Educational Impact of nanoHUB," Proceedings of the 13th Gateway Computing Environments Conference, September 25-27, 2018, Austin, TX
[8]  Krishna Madhavan, Michael Zentner, Gerhard Klimeck, "Learning and research in the cloud," Nature Nanotechnology 8, 786-789 (2013), DOI: 10.1038/nnano.2013.231
[9]  Dragica Vasileska, Matteo Mannino, Michael McLennan, Xufeng Wang, Gerhard Klimeck, Saumitra Raj Mehrotra, Benjamin P. Haley (2014), "PN Junction Lab," https://nanohub.org/resources/pntoy, DOI: 10.21981/D3GH9B95N.
[10] https://nanohub.org/usage/tools provides nanoHUB tool listings ranked by various criteria, such as number of users, number of simulations, wall clock time, etc.
[11] Mark R. Pinto, Kent Smith, Muhammad Alam, Steven Clark, Xufeng Wang, Gerhard Klimeck, Dragica Vasileska (2014), "Padre," https://nanohub.org/resources/padre, DOI: 10.21981/D30C4SK7Z.
[12] Nathan Denny, Gerhard Klimeck, Michael Zentner, "Visualizing User Interactions with Simulation Tools," Proceedings of the 13th Gateway Computing Environments Conference, September 25-27, 2018, Austin, TX