=Paper=
{{Paper
|id=Vol-2485/paper17
|storemode=property
|title=Visualizing a Supercomputer: a Case of Objects Regrouping
|pdfUrl=https://ceur-ws.org/Vol-2485/paper17.pdf
|volume=Vol-2485
|authors=Vladimir Averbukh,Alexander Bersenev,Majid Forgani,Alexander Igumnov,Dmitry Manakov,Andrey Popel,Sergey Sharf,Pavel Vasev
}}
==Visualizing a Supercomputer: a Case of Objects Regrouping==
V. L. Averbukh1,2, A. S. Bersenev1, M. A. Forghani1,2, A. S. Igumnov1, D. V. Manakov1, A. A. Popel1, S. V. Sharf1, P. A. Vasev1 (vasev@imm.uran.ru)

1 N.N. Krasovskii Institute of Mathematics and Mechanics of the Russian Academy of Sciences, Ekaterinburg, Russia; 2 Ural Federal University named after the first President of Russia B.N. Yeltsin, Ekaterinburg, Russia.

In this paper we present a situation that required the visualization of a large number of non-trivial objects: a supercomputer's tasks. A method for visualizing these objects directly was hard to find. We therefore used additional knowledge about an extra structure on those objects, which led us to the idea of grouping them into new, generalized objects. These new artificial objects were easy to visualize because there were few of them, and they turned out to be sufficient for understanding the original problem. It was a successful change of point of view. As a whole, our work belongs to the area of high-performance computing performance visualization, which receives considerable attention from scientists worldwide, for example [1-2].

Keywords: high-performance computing, software visualization, projection, multi-dimensional visualization.

Copyright © 2019 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

1. Introduction. The environment

In high-performance computing (HPC), a typical program consists of many processes running simultaneously on a set of computing nodes. Typically, these processes communicate with each other, for example to exchange partial results. In addition, processes read and write data placed on external storage. This data may be a program's input, output, or intermediate values.

At the Krasovskii Institute, a supercomputer named Uran uses a dedicated file server as external storage. The server contains an array of disks combined using RAID. It is connected to the computing nodes by a dedicated network and is accessible via the Network File System (NFS) protocol. HPC programs access user files through a computing node's local file system, while physically the files reside on the external storage.

During supercomputer operation, we observed a recurring problem over a long time span: file operations on the storage occasionally became laggy. For example, creating a zero-byte file could take 10 seconds or more. Such a situation lasted for a few minutes, about 1-2 times a month.

To investigate the problem, we started to collect statistics, probing each computing node of the supercomputer every 30 seconds for the following data:
– the count of NFS read operations performed by the node since the last probe,
– the count of NFS write operations,
– the node's NFS traffic volume,
– other information, including CPU load and memory usage.

We then analyzed the collected data and came to a conclusion: file storage lagging correlates with the total count of NFS operations per second (IOPS) performed on that storage, especially when IOPS exceeds certain limits. This held regardless of the read or write direction, and regardless of the traffic volume. The details of this conclusion have been reported in [3].

Accordingly, we made a hypothesis: spikes in NFS IOPS are generated by a stable set of concrete supercomputer programs. Thus, if we could determine these programs, we might refactor them and so eliminate the spikes, and with them the file server lags. To check this hypothesis, we decided to visualize the collected data. The final target of the visualization was defined as: allow the viewer to visually find the programs that cause NFS spikes.

2. Our way and the problem

First of all, we were able to visualize the existing data on per-node NFS activity. Here the term "NFS activity" means the count of NFS operations (both read and write) accumulated during a probing period. The result is shown in figure 1. The view is the following:
1. Time goes from the bottom to the top (covering a range of one month).
2. Each blue column corresponds to one computing node of the supercomputer.
3. The width of a node's column is determined by the function nfs_activity(node, time)→int: the more activity a node had at a given time moment, the wider its column.

Fig. 1. NFS activity of all nodes of the Uran supercomputer during one month. The "fields" are node types.

The figure contains three interesting regions that are empty: A, B and C. Region A is empty due to an error in data collection: it was suspended on some nodes. Region B is empty because those nodes were under maintenance for a few days. Region C is a set of nodes under long-term maintenance.

Although the obtained view seemed interesting, it does not answer the original question of finding the programs with NFS spikes. So we continued our visualization efforts.

When an HPC program is launched, a set of parallel processes is started on computing nodes. We will call such a launch a task. On the Uran supercomputer, the Slurm scheduling system is used to create a task and assign its processes to hardware computing nodes. Slurm provides historical information on such assignments, i.e. which tasks were running on which nodes at which times. Each task record includes:
– the user id,
– the program name,
– the list of nodes where the task's processes were assigned,
– the start time and the finish time of the task.

This information allowed us to recalculate NFS activity from a per-node to a per-task basis. Thus we obtained a function nfs_activity_t(task, time)→int which returns the NFS activity of the given task at the given time.

However, we felt it was infeasible to visualize the tasks "as is" because of their sheer number: the Uran supercomputer executes about 100 000 tasks each month. In fact, we did not even find any ideas on how to do that.
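The recalculation from per-node to per-task activity can be sketched as follows. This is a minimal illustration, not the paper's actual implementation: the record layouts, function names, and the assumption that a node serves one task at a time are ours.

```python
# Hypothetical sketch: recomputing per-node NFS activity into per-task
# activity using scheduler assignment records. Data layouts are illustrative.
from dataclasses import dataclass

@dataclass
class Task:
    task_id: int
    user_id: str
    program: str
    nodes: list   # nodes the task's processes were assigned to
    start: int    # probe timestamps (e.g. unix time)
    finish: int

def nfs_activity(probes, node, time):
    """Per-node activity: NFS ops counted by the probe at (node, time)."""
    return probes.get((node, time), 0)

def nfs_activity_t(probes, tasks, task_id, time):
    """Per-task activity: sum the activity of the task's nodes while it runs.

    Assumes one task per node at a time; if nodes were shared, a node's
    activity would have to be split between the tasks running on it."""
    task = next(t for t in tasks if t.task_id == task_id)
    if not (task.start <= time <= task.finish):
        return 0
    return sum(nfs_activity(probes, n, time) for n in task.nodes)

# Tiny usage example with made-up probe data
probes = {("node1", 100): 40, ("node2", 100): 25}
tasks = [Task(1, "userA", "wrap-414", ["node1", "node2"], 90, 200)]
print(nfs_activity_t(probes, tasks, 1, 100))  # 65
```

The per-user function nfs_activity_u used later in the paper would follow the same pattern, summing nfs_activity_t over all running tasks of a user.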
Alternatively, there was an idea to group tasks into programs and try to visualize those. Unfortunately, this turned out to be technically impossible for the following reason. Slurm provides a program name for each task, but those names are not stable. For example, suppose there are three tasks. Slurm might report their program names as «wrap-414», «wrap-5020», «wrap-0103». The first two names might correspond to some real program P1 and the last to a real program P2; the real program name is hidden. The names look so strange because users often ask the supercomputer to run a bash script as the program, and those scripts in turn run the real program images. Slurm encodes such script names as «wrap_NN».

Accordingly, the following question arose: how might a task's visual representation help us find the programs with NFS spikes? Or at least, which representation method could we use for visualizing tasks at all?

3. Solution using grouping

The stated problem was solved as follows. We used a background observation of ours: for almost all of the time, each user runs only a few programs, which are often unique to that user. For example, a user N1 usually spawns only program P1, a user N2 spawns only programs P1 and P2, and a user N3 spawns only P3.

This observation allowed us to change our visualization idea: instead of focusing on programs, we may focus on users. In contrast to per-program activity, per-user activity is possible to compute. It allows determining the users who cause NFS spikes, whose programs can then be analyzed manually. Compared with the attempt to visualize tasks, this approach is much simpler, since the number of users (about 200 in our case) is far more convenient to work with than the number of tasks.

We constructed the following view (figure 2):
1. Time goes from the bottom to the top.
2. Each column corresponds to one user of the supercomputer.
3. The width of a column at each specific time is determined by the value of the function nfs_activity_u(user, time)→int, i.e. the total count of NFS operations generated by all running tasks of that user at that time.

Fig. 2. NFS activity of supercomputer users during a 2-week period.

Additionally, the users' columns were sorted by their total NFS usage. The motivation was to visually emphasize the most active users and group them together. This "group" was under suspicion as the source of the storage problems and thus was subject to manual investigation. Now we could see this group visually.

We found this view interesting and informative. It allowed us to determine the most NFS-active users and their programs. But this was not the end of the story.

4. Enhancing the view

During the visualization research, we took more steps than stated above.

We added 3D graphics to the view, along with camera manipulation tools such as zooming, movement, and rotation. This gave us extra angles of view on the data. It also made it possible to use another visual method, a view of function graphs (see below), simultaneously with the columns view. Figure 3 illustrates this effect.

We added a visualization of summarized supercomputer activity over time:
– the total number of running tasks on the supercomputer,
– the total count of NFS operations on the supercomputer,
– the number of used CPUs of the supercomputer,
– the duration of empty file creation.

The meaning of these values is obvious, except for the last one. We constructed a special test for our NFS storage: create an empty file, delete it, and measure the time taken by that operation. We ran it every few seconds and collected the timings. This measurement turned out to be very expressive for registering storage lagging: sometimes the operation took more than tens of seconds!

An example of summarized activity is presented on the left side of figure 3 as graphs. We decided to show these values as function graphs to visually emphasize the difference between this data and the users' activity (which is shown as columns).

Fig. 3. Summarized supercomputer activity and per-user NFS activity in a 3D view.

We also added the ability for an investigator to highlight time moments in the view when the summarized activity meets some criterion. For example, in figure 3 the red lines depict time moments when empty file creation took more than 10 seconds for 2 minutes. This example is crucial: it means the NFS storage effectively hanged. In general, the criterion used for highlighting is the following: whether the selected column of summarized activity stays above or below a specified constant for a specified number of sequential probe values. This was enough for our aims.

Another option we added is the ability to perform multiple highlights, one in red and others in green or blue. For example, we could highlight moments of slow empty file creation in red and moments of huge NFS operation counts in green. It was nice to visually see the correlation between these highlights displayed simultaneously.

Working with these highlights led to unexpected observations and conclusions about our problem, described in the following section.

5. Unexpected observations

One may look at figure 2 or 3 and decide that the most active users (those with fat columns) are "guilty" of the NFS storage hangs. But this turned out not to be true. Figure 4 illustrates this. Only user activities are shown; time goes from right to left. The time moments when the NFS storage hangs are depicted by grey lines. The corresponding per-user NFS activities during the storage hangs are highlighted in purple. Naturally, all the users whose columns contain purple color contributed to the storage hangs.

Fig. 4. NFS activities of users contributing to a storage hang are highlighted in purple.

The figure visually explains that not only the most NFS-active users contribute to a storage hang: many users do, even less active ones (located farthest away in the figure). We concluded: the storage problems are the result of a combination of contributions from various users, and this combination differs from time to time.

Considering the formula "one user = a few unique programs", this means that there is no problematic subset of programs, as we had assumed at the beginning of our research. The root of the problem is that all programs perform NFS operations simultaneously, and their total activity sometimes exceeds the limits, leading to a storage hang. The set of programs causing the storage problems seems to be different each time; at least it is not tied to the most active NFS users. Probably there are other patterns that we have not determined yet.

One interesting part of this section is the following. Using the purple highlighting of users, we achieved a visual representation of the problematic time moments and their participants, i.e. the reasons for the problems. This means we achieved a visual method of solving our task (finding the programs causing NFS lags). Of course, we performed extra computations that gave us the same conclusions numerically. But we also achieved a visualization that "computed" the result for us.

We find such cases, when visualization makes an answer obvious to a human, very remarkable. A great example of such a case is the graph of a function y=f(x): where it crosses the OX axis corresponds to the roots of the function, and these roots are visually obvious to an observer.

An important detail in our case was the coloring of users' columns (in contrast to just highlighting time moments with grey lines). If a user has substantial NFS activity at a problematic moment, their column is visually emphasized. Note that in figure 4, if a user has small NFS activity at problematic moments, their column is not visually emphasized due to its correspondingly small shape, even though it is still colored.

We see here that additional changes to the visual representation were required before a human became able to detect what they needed. And in turn, after that we did not need any further steps: the visualization began to work. If the illustrated approach proves successful, an interesting theoretical question arises: which visual adjustments should we make to visual representations so that they provide "obvious" answers to questions valuable to a human?

6. Technical background

The visualization was created using web technologies, including WebGL. The 3D visualization framework Viewlang.ru was used for graphics programming. It allows the programmer to specify objects like Lines, Points, Spheres, Triangles, etc., and then Viewlang translates them into WebGL calls. Thus all the graphics were programmed in Javascript in a reactive fashion.

The data is divided into month slices; a one-month view was considered enough for our research. On the visualization index page, a user selects which month to visualize, and the visualization is launched, parametrized by the selected month.

For experimentation purposes, we also ran the visualizations in Virtual Reality mode using WebVR, with an Oculus Rift DK2 headset. It was nice to see the stated views in VR mode.

7. Conclusion

The presented problem and its solution demonstrate the transformation of an initial visualization idea into a simpler and more descriptive presentation.

We had a final goal: to visually find the programs that cause NFS spikes. We had obtained information about tasks from the supercomputer's scheduler. Our initial idea was to visualize these tasks and their NFS activity. But we had no way to associate tasks with programs, due to limitations in the scheduler software, so this effort looked useless in terms of the final goal.

Then we found out that there was no need to visualize tasks. Instead, we synthesized new objects: the "NFS activity of users". We redirected our view of tasks from a per-program to a per-user basis; thus the tasks were regrouped. This was reasonable due to our preliminary knowledge that each user runs a small subset of programs, different from other users'. Many users run just one self-written program. Furthermore, the number of users of the supercomputer is much smaller than the number of tasks executed each month. So the visualization became easier, both in implementation and in understanding.

The focus of the investigator's mind was also transformed. Instead of thinking in terms of the original objects, it moved to thinking about the new ones: instead of thinking about "bad programs", we started thinking about "bad users".

The end of our story is tricky. We found out that there were no "bad users" either. The supercomputer's storage was overloaded at time moments when some subset of users together generated an NFS spike. This subset was not stable: each spike has its own specific subset of users who contributed NFS load and thus generated that spike.

We consider the presented case remarkable enough to share with the community, as part of our common efforts in the evolution of visualization theory and practice.

8. References

[1] Showerman M. Real Time Visualization of Monitoring Data for Large Scale HPC Systems // 2015 IEEE International Conference on Cluster Computing. IEEE, 2015. Pp. 706-709. doi.org/10.1109/CLUSTER.2015.122
[2] Nikitenko D., Zhumatiy S., Shvets P. Making large-scale systems observable – another inescapable step towards exascale // Supercomput. Front. Innov. J. 3(2), 72-79 (2016). doi.org/10.14529/jsfi160205
[3] Igumnov A. S., Bersenev A. Y., Popel A. A., Vasev P. A. Studying the Latency of Concurrent Access to Network Storage // In proceedings of the 13th international conference on Parallel computational technologies (PCT'2019), Kaliningrad, Russia, 2019. Pp. 66-77. elibrary.ru/item.asp?id=37327250
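As a closing illustration, the highlight criterion described in the paper (a summarized-activity value staying above, or below, a threshold for a number of sequential probe values) can be sketched as follows. This is our own minimal sketch; the function and parameter names are hypothetical and not taken from the paper's implementation.

```python
# Illustrative sketch of the highlight criterion: mark the probe indices
# belonging to runs where a summarized-activity series stays on one side
# of a threshold for at least `span` sequential probe values.

def highlight_moments(series, threshold, span, above=True):
    """Return indices of probes that belong to qualifying runs."""
    marked = []
    run = []
    for i, value in enumerate(series):
        if (value > threshold) == above:
            run.append(i)
        else:
            if len(run) >= span:
                marked.extend(run)   # the run is long enough: keep it
            run = []                 # the run is broken: start over
    if len(run) >= span:             # handle a run ending at the series' end
        marked.extend(run)
    return marked

# e.g. empty-file-creation durations in seconds, probed every 30 s;
# highlight runs of at least 4 probes (2 minutes) above 10 s
durations = [0.1, 12, 14, 11, 13, 0.2, 15, 0.1]
print(highlight_moments(durations, threshold=10, span=4))  # [1, 2, 3, 4]
```

Note that the isolated spike at index 6 is not marked: requiring a run of sequential probes filters out single-probe noise, which matches the paper's intent of catching sustained storage lags rather than momentary blips.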