Network traffic analysis for the computing cluster at IHEP A. A. Kotliara, V.V. Kotliarb National Research Center “Kurchatov Institute” State Research Center of Russian Federation Institute for High Energy Physics, Protvino, Russia E-mail: a Anna.Kotliar@ihep.ru, b Viktor.Kotliar@ihep.ru A task for analysis of network traffic flows on the high performance computing network for the computer cluster is very important and allows to understand the way of complicated computing and storage resources us- age by different software applications running on the cluster. As soon as all these applications are not managed by the cluster administrators they need a tool to understand usage patterns to make then an appropriate tuning for the core cluster software to achieve more effective usage for the cluster resources. The paper presents the devel- opment of such system for the IHEP cluster. Keywords: sflow, netflow, flox, pmacct, R programming language, network traffic analysis, packet cap- ture, traffic classification, GRID-computing © 2016 Anna A. Kotliar 541 Introduction A task for analysis of netw work traffic flows on the high performance compuuting network for the computer cluster is very importaant and allows to understand the way of compliicated computing and storage resources usage by differrent software applications running on the cluster.. As soon as all these applications are not managed by y the cluster administrators they need a tool to understand usage pat- terns to make then an appropriatte tuning for the core cluster software to achieve more effective usage for the cluster resources. Open source software for performing such analysis outt of the box does not exist. Also the problem is that very complex and specific computing cluster infrrastructure need to be taken into account. The more geeneral way is to use different software componennts clued together in a system witch fulfill requested prroperties. In this paper such system based on floow collector software, relational data base software, weeb services and a programming language for staatistic analysis is de- scribed. System for flow data analysis The system architecture for network traffic analysis is presented on the figurre 1. It consists of the following software components:  many network devices w which generate sFlow data [Claise, Trammell, …, 2013];  pmacct system for colleccting this data, filtering them, splitting to the dataa blocks per date and time;  MySQL relation DB to sstore prepared by pmacct data block;  FLOX web-server softwaare for simple analysis of flow data tables;  R interface to the MySQL L DB for sophisticated and statistical analysis. Figure 1. System architecture for collecting and analyzing flow data For the data source sFlow (it is a short for "sampled flow") is used. It is an industry standard for packet export at Layer 2 of the OSI model. sFlow uses sampling to achieve scalability and is applic applica- ble to high speed networks such as the network of the computing cluster. By monitorin monitoring traffic flows on all ports continuously, sFlow can be used to instantly highlight congested links, identify the source of the traffic, and the associated application level conversations [Hofstede, [Hofstede Celeda Celeda, …, 2014]. 542 All data comes to the pmacct system. It is a small set of multi-purpose purpose passive network monito monitor- ing tools. It can account, classify, aggregate, replicate and export forwarding-plane forwarding plane data, ie. IPv4 and IPv6 traffic. It collects data in memory tables and then store it persistently to MySQL DB. ppmacct is able to perform data aggregation, offering a rich set of primitives to choose from; it can also filter, sample, re-normalize, normalize, tag and classify at L7. FloX (Flow eXplorer) is a simple PHP tool to examine large tables of flow data in a SQL dat data- base. It is easy extendable and a allow to use SQL like requests to the netflow database. It is used as a first tool for real-time time data analysis. For more complex analysis R language can be used. It is a language and environment for statist statisti- cal computing and graphics. aphics. R provides a wide variety of statistical (linear and nonlinear modeling, classical statistical tests, time-series series analysis, classification, clustering, …) and graphical techniques, and is highly extensible. R can easy produce well-designed well publication-quality quality plots. Integration to IHEP in nfrastructure Computing cluster at HEP has two kind of networks: external network atttached to the campus LAN with an access to the reseaarch network in Internet and internal high througghput network which serves for the inter-cluster comm munications. Described system for traffic analysiss is applied to the in- ternal network. The network coree of the cluster is built on top of the HP ProCurvee switches 5406zl. All switches bound together by pairss into logical switches with the distributed trunkiing technology. Con- nection bandwidth between corees switches is 2x10Gb/s when all computer harrdware connected by 2x1Gb/s or 2x4Gb/s links with h using bonding technique over link aggregatioon protocol. Figure 2 shows the implemented schema for sFlow analysis. Fig Figure 2. Implemented schema for sFlow analysis 543 Data analyze For data analysis stateless packet inspection technologies were used [Getman A.I., 2015]. Flow data analysis consist of two part. First of all it is an interactive real-time analysis of traffic collected from network devices in MySQL database. FloX web-interface is used for it. This interface allows easily navigate through simple flow table where stored only fields like: source and destination IP addresses, traffic type, tcp/udp ports, and number of packets transmitted. In the web interface it is also possible to use complex SQL queries if needed. Many helper tables were created inside MySQL to allow to be made Union queries where some types of network devices need to be grouped. For instance there were created lustreOSSNodes , seAlicePoolNodes, seAtlasPoolNodes, seCMSPoolNodes tables with ip_int, ip_ext fields. These tables allow create queries like “how much traffic was sent to Lustre storage file system” or “how much network packets were transmitted from Atlas storage system to nodes”. All helper tables are specific for IHEP computing cluster but the idea is general and could be used any- where. As the result of such analysis it was discovered a misconfiguration on the IHEP cluster w here incorrectly set storage nodes sent traffic through external interfaces to the internal cluster nodes. So additional setup was performed to mitigate such problem. It shows that it is the only way to understand traffic flows on the complex cluster system with dual network setup where we even do not know ex- actly how each program works as soon as these programs still under constantly development in the Grid community. Second and more complicated analysis is done by using R programming language for statistical computing and graphics. In this analysis we try to use statistical methods to analyze unstructured data to understand how flows of the networks traffic could be used for describing usage patterns of the computing cluster for later modification in the setup environment to minimize additional traffic be- tween different working zones means maximize computation usage. Increase effectivity of the usage of computing resources is a primary goal. Maybe using some machine leaning techniques which a l- lows R language we even will be able to predict abnormal behavior of the cluster which depends on flow connections and trigger alarms or we will be able to self heal our system. The simplicity of using R for analysis is shown on figure 3. acc<-dbGetQuery(conn = con, statement = "select sum(bytes),(UNIX_TIMESTAMP(stamp_updated)- UNIX_TIMESTAMP(DATE('20160330'))) from acct_20160330 where ip_src in (select ip_int from lustreOSSNodes) group by stamp_updated;") plot(acc[,2]/60/60,acc[,1]/1024/1024,main="Cluster from Lustre ",xlab="Time",ylab="MiB",col='blue') Figure 3. Generate traffic plot with R Here we can see two lines for generating graphical plot for traffic from the Lustre storage within co m- puting cluster (figure 4). Figure 4. R plot for network traffic 544 Conclusion The way of creation and usage flow tools for analysis network traffic for high-throughput net- work for high-performance and Grid computing cluster was presented in the described work. As soon as all these tools based on open source it is possible to easy extend them for any particular usage. Main achievement is a creation of the very simple architecture for performing almost any kind of network data analysis. By combining statistical computing language for programming and simple query lan- guage for DB it is possible to get helpful results in a short period of time. The system is implemented on the production high-performance computing cluster at IHEP and allowed to find anomalies and misconfigurations in the cluster environment. As future works it is planned to use R for detecting anomalies in traffic patterns with statistical and machine learning techniques which could help to implement some principles of autonomic com- puting [White paper…, 2006] such as self-optimization, self-healing, self-protection for the IHEP computer cluster. References Claise B., Trammell B., Aitken P. Specification of the IP Flow Information Export (IPFIX) protocol for the exchange of flow information. // RFC 7011 (Internet Standard), Internet Engineering Task Force. — 2013. [Electronic resource]. URL: http://www.ietf.org/rfc/rfc7011.txt. Hofstede R., Celeda P., Trammell B., Drago I., Sadre R., Sperotto A., Pras A. Flow Monitoring Ex- plained: From Packet Capture to Data Analysis With NetFlow and IPFIX // IEEE communica- tions surveys & tutorials. — 2014. — Vol. 16, No. 4. — P. 2037. White Paper. An architectural blueprint for autonomic computing // IBM. — 2006. Getman A.I., Evstropov E.F., Markin Y.V. Wirespeed network traffic analysis: survey of applied prob- lems, approaches and solutions. // Preprint ISP RAS. — 2015. [Electronic resource]. URL: http://www.ispras.ru/preprints/docs/prep_28_2015.pdf http://www.sflow.org/about/sampling_history.php 545