   Improving Networking Performance of a Linux Cluster
    A. Bogdanov1,a, V. Gaiduchok1,2, N. Ahmed2, P. Ivanov2, I. Gankevich1
            1 Saint Petersburg State University, 7/9, Universitetskaya emb., Saint Petersburg, 199034, Russia
            2 Saint Petersburg Electrotechnical University, 5, Professora Popova st., Saint Petersburg, 197376, Russia
                                                E-mail: a bogdanov@csa.ru


      Networking is known to be a "bottleneck" in scientific computations on HPC clusters: it can limit the scalability of systems with a cluster architecture, and the problem is widespread since clusters are used almost everywhere. High-end clusters usually rely on custom interconnects, which imply expensive and powerful hardware, custom protocols and proprietary operating systems. The vast majority of current systems, however, use conventional hardware, protocols and operating systems, for example an Ethernet network with Linux on the cluster nodes. This article is devoted to the problems of the small and medium clusters that are often used in universities; we focus on Ethernet clusters running Linux and discuss the topic using the implementation of a custom protocol as an example. The TCP/IP stack is used very widely, including on clusters, yet it was originally designed for the global network and can impose unnecessary overheads when used on a small cluster with a reliable network. We discuss different aspects of the Linux networking stack (e.g. NAPI) and modern hardware (e.g. GSO and GRO); compare the performance of TCP, UDP and a custom protocol implemented both with raw sockets and as a kernel module; and discuss possible optimizations. As a result, several recommendations on improving the networking performance of Linux clusters are given. Our main goal is to point out possible optimizations of the software, since software can be changed with ease and such changes can lead to performance improvements.

    Keywords: computational clusters, Linux, networking, networking protocols, kernel, sockets, NAPI, GSO,
GRO.

This work was supported by the Russian Foundation for Basic Research (projects N 16-07-01111, 16-07-00886, 16-07-01113)
and by St. Petersburg State University (project N 0.37.155.2014).


                                        © 2016 Alexander Bogdanov, Vladimir Gaiduchok, Nabil Ahmed, Pavel Ivanov, Ivan Gankevich




Possible optimizations
      First of all, since the subject of this article is the network of a small computational cluster, let us highlight the basic features of such a network in comparison with a generic (possibly global) one. As opposed to a generic network, a computational cluster network has the following features: it is very reliable; all nodes of a cluster are usually identical (they use the same protocols, work at the same speed, etc.); congestion control can be simplified; the number of nodes is known and usually does not change; there is no need for a sophisticated routing subsystem (moreover, cluster nodes usually belong to the same VLAN); quality-of-service questions are simpler (clusters are usually dedicated to several well-known applications with predictable requirements, and a small cluster usually has only a small number of users); there are fewer security issues, since the administrator of the cluster manages all the nodes of the network (moreover, cluster nodes are usually not accessible from the outside world); the overall overheads can be kept small. So a computational cluster network (even a small cluster with a not very powerful network) is a special case, and this case is simpler than a global network. That is why we can drop many assumptions required for a generic network (the network is well known and fixed), and why the algorithms and protocol implementations in this case can be much simpler and more efficient. All in all, when designing a protocol stack for the described case one can assume the following possible optimizations: simple addressing, simple routing, simple flow control, simple checksumming, explicit congestion notifications, and simple headers (processing time is more important than header size for a computational cluster with a powerful, reliable network; in other cases one could use, e.g., ROHC). All these assumptions lead to a simpler and faster implementation. Actually, estimating the possible improvement in performance is the main purpose of this article.
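      As a purely illustrative sketch of the last point, a header for such a cluster-only protocol can consist of a few fixed-width fields. The field names and sizes below are our assumptions, not the exact format benchmarked later in the paper:

#include <linux/types.h>

/* Hypothetical minimal header for a cluster-only protocol; field names and
 * sizes are illustrative assumptions, not the exact format used in the tests. */
struct cluster_hdr {
        __u8    src_node;   /* source node id inside the cluster            */
        __u8    dst_node;   /* destination node id                          */
        __be16  len;        /* payload length in bytes                      */
        __be32  seq;        /* sequence number for simple flow/loss control */
} __attribute__((packed));

      An 8-byte header of this kind needs no option parsing and can be validated with a couple of comparisons, which keeps the per-packet processing time low.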


Overview of Linux networking stack
       In our case, when a user application wants to start communication via a conventional network, it creates a socket. When the application wants to send some data over the network, it eventually issues a ‘send’ or similar system call. The ‘sendmsg’ function of the L4 protocol implementation (pointed to by the ‘sendmsg’ field of the ‘ops’ field of ‘struct socket’) is then invoked. This function does the necessary checks, finds an appropriate device to send from and creates a new buffer (‘struct sk_buff’) which represents the packet to send. It copies the data passed by the user application into that buffer, fills in the protocol header as necessary (or headers, e.g. when the same implementation covers both the transport and network layers) and tries to send the data. It usually calls ‘dev_queue_xmit’, which in turn calls other functions that, for example, pass the packet to ‘raw sockets’ (if any are present) and queue the buffer (qdisc [Almesberger, Salim, …, 1999], if used). Finally, the buffer is passed to the NIC driver, which sends it as is (at this point it should already contain all the necessary headers at layers 2-7). The ‘sk_buff’ structure allows one (in the general case) to copy the data only once, from the user buffer to a kernel buffer, and then work with headers at different layers inside the kernel without additional data copying.
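       A heavily simplified sketch of such an L4 ‘sendmsg’ handler is given below; it only illustrates the single-copy path just described. The function name ‘cluster_sendmsg’, the interface name, the destination MAC and the experimental EtherType are assumptions of the sketch, locking is omitted, and the exact kernel APIs differ between versions.

#include <linux/netdevice.h>
#include <linux/skbuff.h>
#include <linux/if_ether.h>
#include <net/sock.h>

/* Placeholder destination MAC; a real implementation would resolve it from dst_node. */
static const unsigned char cluster_dst_mac[ETH_ALEN] = { 0x02, 0x00, 0x00, 0x00, 0x00, 0x01 };

static int cluster_sendmsg(struct socket *sock, struct msghdr *msg, size_t len)
{
        struct sock *sk = sock->sk;
        struct net_device *dev = dev_get_by_name(sock_net(sk), "eth0"); /* assumption */
        struct sk_buff *skb;
        struct cluster_hdr *hdr;
        int err;

        if (!dev)
                return -ENODEV;

        /* One buffer for headers and payload: user data is copied only once. */
        skb = sock_alloc_send_skb(sk, LL_RESERVED_SPACE(dev) + sizeof(*hdr) + len,
                                  msg->msg_flags & MSG_DONTWAIT, &err);
        if (!skb) {
                dev_put(dev);
                return err;
        }

        skb_reserve(skb, LL_RESERVED_SPACE(dev));             /* room for the L2 header */
        hdr = (struct cluster_hdr *)skb_put(skb, sizeof(*hdr));
        hdr->len = htons(len);                                 /* other fields omitted  */

        if (memcpy_from_msg(skb_put(skb, len), msg, len)) {    /* copy from user space  */
                kfree_skb(skb);
                dev_put(dev);
                return -EFAULT;
        }

        skb->dev = dev;
        skb->protocol = htons(ETH_P_802_EX1);                  /* local experimental EtherType */
        dev_hard_header(skb, dev, ETH_P_802_EX1,
                        cluster_dst_mac, NULL, skb->len);      /* fill the Ethernet header */

        err = dev_queue_xmit(skb);                             /* hand the buffer to qdisc/NIC */
        dev_put(dev);
        return err == NET_XMIT_SUCCESS ? len : -ENOBUFS;
}

       A handler like this would be installed in the ‘sendmsg’ field of the protocol's ‘struct proto_ops’.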
       When the user application wants to receive some data, it eventually issues a ‘recv’ or similar system call. The ‘recvmsg’ function of the L4 protocol implementation (pointed to by the ‘recvmsg’ field of the ‘ops’ field of ‘struct socket’) is called. This function checks the receive queue of the socket to which the user application refers. If the queue is empty, it sleeps (for a blocking socket). If there is some data for this socket, the function handles it and eventually copies the data from the kernel buffer to the user buffer.
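       The matching ‘recvmsg’ handler can follow the same pattern; again, the name ‘cluster_recvmsg’ is an assumption, error handling is reduced to a minimum, and the signatures of the datagram helpers vary between kernel versions.

#include <linux/kernel.h>
#include <linux/skbuff.h>
#include <net/sock.h>

static int cluster_recvmsg(struct socket *sock, struct msghdr *msg, size_t len,
                           int flags)
{
        struct sock *sk = sock->sk;
        struct sk_buff *skb;
        size_t copied;
        int err;

        /* Take one packet from the socket receive queue; sleeps on an empty
         * queue unless the socket is non-blocking or MSG_DONTWAIT is set. */
        skb = skb_recv_datagram(sk, flags, flags & MSG_DONTWAIT, &err);
        if (!skb)
                return err;

        copied = min_t(size_t, len, skb->len);
        err = skb_copy_datagram_msg(skb, 0, msg, copied);   /* kernel -> user copy */

        skb_free_datagram(sk, skb);
        return err ? err : copied;
}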
       When a packet arrives on the wire, the NIC raises an interrupt, so the appropriate NIC driver function is called. That function builds the ‘sk_buff’ buffer and calls ‘netif_rx’ or a similar function (‘netif_receive_skb’ in the case of a modern distribution with NAPI), which checks for special handling (and performs it if necessary), passes the packet to ‘raw sockets’ (if any) and finally calls the L3 protocol implementation function registered via ‘dev_add_pack’. That function searches for a matching socket (taking into account the appropriate headers of the received packet) and adds the packet to the receive queue of the socket (if one is found). It may also wake up a thread waiting for data on this socket in the ‘recvmsg’ function (using internal fields related to this socket). The described approach to receiving data has been substantially modified over time: looking at the evolution of networking I/O, we see polled I/O, interrupt-driven I/O and I/O with DMA; then developers tried various interrupt mitigation techniques; and finally NICs started to support offloading (moving some networking-related work from the CPU to the NIC).
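       To make the registration step concrete, hooking such an L3 receive handler into the stack from a kernel module could look roughly as follows. The ‘cluster_’ prefix and the experimental EtherType are assumptions of this sketch, and a real handler would queue the packet to a matching socket instead of dropping it.

#include <linux/module.h>
#include <linux/netdevice.h>
#include <linux/skbuff.h>
#include <linux/if_ether.h>

/* Called for every incoming frame whose EtherType matches 'type' below. */
static int cluster_rcv(struct sk_buff *skb, struct net_device *dev,
                       struct packet_type *pt, struct net_device *orig_dev)
{
        /* A real handler would look up the matching socket, append skb to its
         * receive queue and wake up a sleeping reader; here the frame is dropped. */
        kfree_skb(skb);
        return NET_RX_SUCCESS;
}

static struct packet_type cluster_packet_type __read_mostly = {
        .type = cpu_to_be16(ETH_P_802_EX1),   /* local experimental EtherType */
        .func = cluster_rcv,
};

static int __init cluster_init(void)
{
        dev_add_pack(&cluster_packet_type);   /* hook into the L3 demultiplexer */
        return 0;
}

static void __exit cluster_exit(void)
{
        dev_remove_pack(&cluster_packet_type);
}

module_init(cluster_init);
module_exit(cluster_exit);
MODULE_LICENSE("GPL");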
       So one should take NAPI into account. NAPI (‘New API’) is an interface that uses interrupt coalescing. With a modern driver that uses NAPI, the steps described above are modified in the following way: on packet reception the NIC driver disables interrupts from the NIC and schedules NAPI polling; NAPI polling is invoked after some time (during which the NIC may receive new packets) and the accumulated packets are handled; if not all packets can be handled within the NAPI budget, another poll is scheduled, otherwise interrupts from the NIC are re-enabled and the driver returns to the initial state. Such a scheme can improve performance, since it decreases the number of interrupts and handles several packets at once, thereby reducing the overheads (e.g. the time for context switching). More details about NAPI can be found in [Leitao, 2009].
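       A typical NAPI-style receive skeleton is sketched below. Everything prefixed with ‘mynic_’ is a placeholder for hardware-specific code, and the exact NAPI API (e.g. the signature of ‘netif_napi_add’) differs between kernel versions.

#include <linux/netdevice.h>
#include <linux/interrupt.h>

struct mynic_priv {                    /* minimal placeholder private data */
        struct napi_struct napi;
        /* registers, DMA rings, etc. would live here */
};

/* Hardware-specific placeholders: mask/unmask NIC interrupts, pull one packet. */
static void mynic_disable_irq(struct mynic_priv *priv) { }
static void mynic_enable_irq(struct mynic_priv *priv) { }
static bool mynic_rx_one(struct mynic_priv *priv) { return false; }

static irqreturn_t mynic_interrupt(int irq, void *data)
{
        struct mynic_priv *priv = data;

        mynic_disable_irq(priv);        /* stop further interrupts from the NIC */
        napi_schedule(&priv->napi);     /* defer the work to the poll callback  */
        return IRQ_HANDLED;
}

static int mynic_poll(struct napi_struct *napi, int budget)
{
        struct mynic_priv *priv = container_of(napi, struct mynic_priv, napi);
        int done = 0;

        while (done < budget && mynic_rx_one(priv))   /* handle up to 'budget' packets */
                done++;

        if (done < budget) {
                /* Everything pending was handled: leave polling mode, re-enable IRQs. */
                napi_complete_done(napi, done);
                mynic_enable_irq(priv);
        }
        /* Otherwise the kernel keeps polling and calls mynic_poll() again. */
        return done;
}

/* During device setup (e.g. in the probe function), on older kernels:
 *     netif_napi_add(netdev, &priv->napi, mynic_poll, NAPI_POLL_WEIGHT);
 */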
       Another possible option is GSO/GRO, techniques that can improve outbound/inbound throughput by offloading some amount of work to the NIC. With GSO/GRO the NIC takes over work that is usually done by the OS on the CPU: with GSO the CPU prepares initial headers (at layers 2-4) and passes them together with a pointer to a large chunk of data (which should be sent in several packets), while the NIC splits the data into several packets and fills in the headers (layers 2-4) based on the initial template headers. In most cases such NICs support only TCP/IP and UDP/IP. Of course, the OS must support this feature of the NIC, and the Linux kernel does. GSO/GRO can improve networking performance, since a dedicated hardware device, the NIC, can take over a large amount of work that was previously done by the CPU. Actually, NAPI and GSO/GRO are not new techniques, but one should always keep these options in mind, since proper configuration is still required.
      One could usually turn on or off GSO/GRO (e.g. via ‘ethtool -K