=Paper=
{{Paper
|id=Vol-1787/140-144-paper-23
|storemode=property
|title=Improving Networking Performance of a Linux Cluster
|pdfUrl=https://ceur-ws.org/Vol-1787/140-144-paper-23.pdf
|volume=Vol-1787
|authors=Alexander Bogdanov,Vladimir Gaiduchok,Nabil Ahmed,Pavel Ivanov,Ivan Gankevich
}}
==Improving Networking Performance of a Linux Cluster==
A. Bogdanov¹,ᵃ, V. Gaiduchok¹,², N. Ahmed², P. Ivanov², I. Gankevich¹

¹ Saint Petersburg State University, 7/9, Universitetskaya emb., Saint Petersburg, 199034, Russia

² Saint Petersburg Electrotechnical University, 5, Professora Popova st., Saint Petersburg, 197376, Russia

E-mail: ᵃ bogdanov@csa.ru

Networking is known to be a "bottleneck" in scientific computations on HPC clusters: it can become the factor that limits the scalability of systems with a cluster architecture. And this is a worldwide problem, since clusters are used almost everywhere. Expensive clusters usually have custom networks; such systems imply expensive and powerful hardware, custom protocols and proprietary operating systems. But the vast majority of up-to-date systems use conventional hardware, protocols and operating systems – for example, an Ethernet network with OS Linux on the cluster nodes. This article is devoted to the problems of small and medium clusters that are often used in universities. We will focus on Ethernet clusters with OS Linux. The topic will be discussed by the example of implementing a custom protocol. The TCP/IP stack is used very often, and it is used on clusters too, although it was originally developed for the global network and can impose unnecessary overheads when used on a small cluster with a reliable network. We will discuss different aspects of the Linux networking stack (e.g. NAPI) and of modern hardware (e.g. GSO and GRO); compare the performance of TCP, UDP and a custom protocol implemented with raw sockets and as a kernel module; and discuss possible optimizations. As a result, several recommendations on improving the networking performance of Linux clusters will be given. Our main goal is to point out possible optimizations of the software, since one can change the software with ease, and that can lead to performance improvements.

Keywords: computational clusters, Linux, networking, networking protocols, kernel, sockets, NAPI, GSO, GRO.

This work was supported by the Russian Foundation for Basic Research (projects N 16-07-01111, 16-07-00886, 16-07-01113) and St. Petersburg State University (project N 0.37.155.2014).

© 2016 Alexander Bogdanov, Vladimir Gaiduchok, Nabil Ahmed, Pavel Ivanov, Ivan Gankevich

===Possible optimizations===

First of all, since the subject of this article is the network of a small computational cluster, let us highlight the basic features of such a network in comparison with a generic (possibly global) one. As opposed to a generic network, a computational cluster network has the following features:

* it is very reliable;
* all nodes of a cluster are usually the same (they use the same protocols, work at the same speed, etc.);
* congestion control can be simplified;
* the number of nodes is known and usually does not change;
* there is no need for a sophisticated routing subsystem (moreover, cluster nodes usually belong to the same VLAN);
* quality-of-service questions can be simpler (clusters are usually dedicated to several well-known applications with predictable requirements; in the case of a small cluster there are usually only a few users);
* there are fewer security issues, since the administrator of a cluster manages all the nodes of the network (moreover, cluster nodes are usually not accessible from the outside world);
* overall overheads can be small.

So a computational cluster network (even in the case of a small cluster and a not very powerful network) is a special case, and this case is simpler than a global network.
That is why we can omit many assumptions required for a generic network (since the network is well known and determined), and that is why the algorithms and the implementation of the protocols in this case can be much simpler and more efficient. All in all, when designing a protocol stack for the described case one can assume the following possible optimizations: simple addressing, simple routing, simple flow control, simple checksumming, explicit congestion notifications, simple headers (processing time is more important than header size in the case of a computational cluster with a powerful reliable network; in other cases one could, e.g., use ROHC). All these assumptions lead to a simpler and faster implementation. Actually, estimating the possible improvements (in terms of performance) is the main purpose of this article.

===Overview of Linux networking stack===

In our case, when a user application wants to start communicating via a conventional network, it creates a socket. When that application wants to send some data via the network it eventually issues a ‘send’ or similar system call. The L4 protocol implementation's ‘sendmsg’ function (pointed to by the ‘sendmsg’ field of the ‘ops’ field of ‘struct socket’) will be invoked. This function does the necessary checks, finds an appropriate device to send from and then creates a new buffer (‘struct sk_buff’) which represents the packet to send. It copies the data passed by the user application into the buffer. Then it fills in the protocol header as necessary (or headers, e.g. when the implementation covers both the transport and the network layers) and tries to send the data. It usually calls ‘dev_queue_xmit’, which, in turn, calls other functions – these functions will, e.g., pass the packet to ‘raw sockets’ (if any are present) and queue the buffer (via qdisc [Almesberger, Salim, …, 1999], if used). Finally our buffer is passed to the NIC driver, which sends the passed buffer as is (at this point we should have all the necessary headers at layers 2-7). The ‘sk_buff’ structure allows one (in the general case) to copy the data only once, from the user buffer to a kernel buffer, and then (inside the kernel) work conventionally with headers at different layers without additional data copying.

When the user application wants to receive some data it eventually issues a ‘recv’ or similar system call. The L4 protocol implementation's ‘recvmsg’ function (pointed to by the ‘recvmsg’ field of the ‘ops’ field of ‘struct socket’) is called. This function checks the receive queue of the socket (to which the user application refers). If the queue is empty, it sleeps (if the socket is blocking). If there is some data for this socket, this function handles it – it eventually copies the data from the kernel buffer to the user buffer. When a packet arrives on the wire the NIC raises an interrupt, so the appropriate NIC driver function is called. That function forms the ‘sk_buff’ buffer and calls ‘netif_rx’ or a similar function (‘netif_receive_skb’ in the case of a modern driver with NAPI), which checks for special handling (and does it if necessary), passes the packet to ‘raw sockets’ (if any) and finally calls the L3 protocol implementation function registered via ‘dev_add_pack’. That function searches for a matching socket (taking into account the appropriate headers of the received packet) and adds the packet to the socket receive queue (if a socket is found). It can also wake up a thread waiting for data on this socket in the ‘recvmsg’ function (using internal fields related to this socket).
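To make the receive path above concrete, the following is a minimal sketch of how a custom L3 protocol handler could be hooked into the kernel as a module via ‘dev_add_pack’, in the spirit of the kernel-module variant of the custom protocol mentioned earlier. The EtherType value, the function names and the module layout are illustrative assumptions, not the protocol actually evaluated in this paper.

<pre>
#include <linux/module.h>
#include <linux/netdevice.h>
#include <linux/skbuff.h>
#include <linux/if_ether.h>

/* Hypothetical EtherType for the custom cluster protocol
 * (0x88B5 is reserved by IEEE for local experimental use). */
#define CLUSTER_PROTO 0x88B5

/* Receive handler: called from the softirq path
 * (netif_receive_skb -> registered packet_type handlers). */
static int cluster_rcv(struct sk_buff *skb, struct net_device *dev,
		       struct packet_type *pt, struct net_device *orig_dev)
{
	/* A real implementation would parse the (simple) header here,
	 * look up the destination socket and queue the skb on it. */
	pr_debug("cluster proto: got %u bytes on %s\n", skb->len, dev->name);

	kfree_skb(skb);		/* sketch: just drop the packet */
	return NET_RX_SUCCESS;
}

static struct packet_type cluster_pt = {
	.type = cpu_to_be16(CLUSTER_PROTO),
	.func = cluster_rcv,	/* .dev == NULL: receive from any interface */
};

static int __init cluster_init(void)
{
	dev_add_pack(&cluster_pt);
	return 0;
}

static void __exit cluster_exit(void)
{
	dev_remove_pack(&cluster_pt);
}

module_init(cluster_init);
module_exit(cluster_exit);
MODULE_LICENSE("GPL");
</pre>

With such a handler registered, frames carrying this EtherType never enter the IP/TCP code at all, which is what makes the simplified addressing, header and checksum handling discussed above possible; the sending side would build the frame in an ‘sk_buff’ and hand it to ‘dev_queue_xmit’ directly.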
The described approach to receiving the data has been substantially modified over time. If we look at the evolution of networking I/O we will see polled I/O, I/O with interrupts, I/O with DMA. Then developers tried to use interrupt mitigation techniques. And finally, NICs started to support offloading (offloading some networking-related work from the CPU to the NIC).

So, one should take NAPI into account. NAPI (‘New API’) is an interface that uses interrupt coalescing. In the case of a modern driver that uses NAPI the steps described above are modified in the following way: on packet reception the NIC driver disables the interrupts from the NIC and schedules NAPI polling; NAPI polling is invoked after some time (we hope that during that time the NIC could receive new packets) and we handle the packet(s); if we cannot handle all packets (taking into account the NAPI budget), we schedule another polling, otherwise we enable interrupts from the NIC and so return to the initial state. Such a scheme could improve the performance since in the described approach we decrease the number of interrupts and handle several packets at once – and so decrease the overheads (e.g. time for context switching). More details about NAPI can be found in [Leitao, 2009].

Another possible option is GSO/GRO – techniques that could improve outbound/inbound throughput by offloading some amount of work to the NIC. In the case of GSO/GRO the NIC is responsible for some of the work which is usually done by the OS on the CPU: in the case of GSO the CPU prepares initial headers (at layers 2-4) and passes them with a pointer to a big chunk of data (which should be sent in several packets), while the NIC splits the data into several packets and fills in the headers (layers 2-4) taking the initial template headers into account. In most cases such a NIC supports only TCP/IP and UDP/IP. Of course, the OS should support such a feature of the NIC, and the Linux kernel supports it. GSO/GRO could improve the networking performance since a dedicated hardware device – the NIC – can take over a big amount of work that was previously done by the CPU. Actually, NAPI and GSO/GRO are not new techniques, but one should always remember about these options since proper configuration is still required. One could usually turn GSO/GRO on or off (e.g. via ‘ethtool -K
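The interrupt/poll interplay described above corresponds to a standard pattern in NAPI-aware drivers. The sketch below shows only the skeleton of that pattern for an imaginary NIC; the per-device structure and the helpers that touch the hardware (‘mydev_irq_disable’, ‘mydev_irq_enable’, ‘mydev_fetch_skb’) are hypothetical placeholders for driver-specific code, not part of any real driver.

<pre>
#include <linux/interrupt.h>
#include <linux/netdevice.h>
#include <linux/skbuff.h>

/* Hypothetical per-device state of an imaginary NIC driver. */
struct mydev {
	struct net_device *netdev;
	struct napi_struct napi;
};

/* Placeholders for device-specific register accesses and RX ring handling. */
static void mydev_irq_disable(struct mydev *priv) { /* device-specific */ }
static void mydev_irq_enable(struct mydev *priv)  { /* device-specific */ }
static struct sk_buff *mydev_fetch_skb(struct mydev *priv) { return NULL; }

/* Hardware interrupt: do almost nothing, defer the work to NAPI polling. */
static irqreturn_t mydev_interrupt(int irq, void *data)
{
	struct mydev *priv = data;

	mydev_irq_disable(priv);	/* stop further RX interrupts */
	napi_schedule(&priv->napi);	/* ask the kernel to call our poll() */
	return IRQ_HANDLED;
}

/* NAPI poll: handle up to 'budget' packets in softirq context. */
static int mydev_poll(struct napi_struct *napi, int budget)
{
	struct mydev *priv = container_of(napi, struct mydev, napi);
	struct sk_buff *skb;
	int work_done = 0;

	while (work_done < budget && (skb = mydev_fetch_skb(priv)) != NULL) {
		napi_gro_receive(napi, skb);	/* hand the packet to the stack (GRO-aware) */
		work_done++;
	}

	if (work_done < budget) {
		/* All pending packets handled: leave polling mode
		 * and re-enable interrupts. */
		napi_complete_done(napi, work_done);
		mydev_irq_enable(priv);
	}
	/* Otherwise we stay on the poll list and will be polled again. */
	return work_done;
}
</pre>

In a real driver the poll function is registered with ‘netif_napi_add’ at probe time (its exact signature has changed across kernel versions, so the registration is omitted here). Whether GRO is actually applied inside ‘napi_gro_receive’ depends on the offload settings, which can be inspected with ‘ethtool -k’ and changed with ‘ethtool -K’.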