Applying CEP to problems of real-time data analysis in distributed IDS

Tigran Tsaturyan, Bauman Moscow State Technical University
a915200@yandex.ru

Abstract

Developers and researchers now apply a range of approaches in distributed intrusion detection systems (DIDS), from traditional rule-based solutions to data mining and pattern search. When these ideas are put into practice, performance remains an open question, because a DIDS is designed to operate in real time on millions of packets per second. In this paper we look closely at the data systems in use and offer a draft of a data-processing system for such implementations that combines the best of old and new approaches and is designed especially for data-mining applications.

Academic supervisor: Yuri Gapanyuk, gapyu@bmstu.ru

1 Introduction

A distributed intrusion detection system (DIDS) is software or a device designed to detect malicious activity across several inspected objects. Every DIDS requires data to perform its analysis. The faster the data delivery system is, the sooner a decision is made and the attacker blocked. Such systems are usually built on a client-server architecture, where the server consists of the main engine, logic, rules, logs, etc., and a client (also called a sensor) collects data or performs basic or extended analysis [3].

The more functions the client performs, the higher its hardware requirements. On the other hand, limiting the sensor to data collection results in extreme network load, because a copy of every incoming and outgoing packet is sent to the server. Since a combined approach is used here, the designed system should be able to work under both options and act not only as a communication platform but as a processing solution intertwined with the IDS.

2 Related Work

The question of data-processing architecture in DIDS is relatively new, probably because data-mining IDS are still in development and modern security products are either poorly documented or limited. There are attempts [2] to use a MySQL database as the data-storage core, for example, but the solution offered does not apply any intelligent techniques; it is rather a good practical guide to setting up a DIDS easily. Researchers [4] used CEP (Complex Event Processing) as the basis for a network scan detection system, with positive results. Complex event processing is a method of tracking and analyzing (processing) streams of information, combined from several sources, about things that happen, and deriving a conclusion from them. Due to a CEP limitation (described later in this paper), these researchers had to employ an extra data warehouse. Some researchers [10] simply use sensors as sniffers that redirect incoming and outgoing traffic, keeping only native or custom lists, arrays and other in-memory datasets. The distributed Prelude-IDS [11] works with both MySQL and PostgreSQL, but stores only results and performs "analysis on the fly". Experiments based on intelligent techniques such as genetic algorithms, neural networks and data mining were carried out on relatively small amounts of traffic with some possible delays; however, the desire to apply them in real time has turned such software toward real-time intelligence.

3 Requirements

To build a framework on which a DIDS can operate successfully, we must meet the main requirements for such systems: being fast, reliable and fit for the possible tasks.

3.1 Understanding Data Processing in DIDS

To understand what kind of information and queries we are facing, we need to understand the attacks themselves and the identification approaches used. Attacks can basically be classified into three groups:
• Network scanning
• Denial of service attack
• Different human attacks (exploits, errors in code, vulnerabilities, etc.)

Network scanning is a way to learn the structure of a network, its hosts and its open ports. This technique can answer questions such as what the network is, how many hosts it has, what OS they are running, and what is open or closed. Take a TCP SYN scan as an example (please refer to the appropriate documentation of this attack): the scanner sends a SYN packet to open a connection to the target and waits for a response. If a SYN-ACK packet is received, the scanner assumes the port is open and the host is up. If an RST-ACK packet is received, the scanner concludes that the port is closed. If nothing is received, the port may be filtered or the host may be down. The scanner can run on one host or on several hosts. To identify a scanning attack, an IDS (DIDS) can, for example, create a list of requests to different ports from one source, sum them and compare the sum with some threshold [4].

A denial of service (DoS) attack is an attack that blocks or overloads the target machine and makes it inaccessible to users. DoS is easily identified but not so easily prevented. The approach to recognizing the attack is similar to that for network scanning, with the difference that we now have to consider packet size or go deeper into the application layers and understand the requests. Very often the attack is performed by a group of machines (a botnet), and distinguishing a real user from a malicious computer requires some analysis. Moreover, DoS significantly increases system traffic, which means that the data-processing system itself should be able to withstand an attack and not become the bottleneck.

Different human attacks target errors in code (vulnerabilities) or wrongly configured software. A basic rule approach is used here, where each packet is searched for in an existing signature database, or some form of anomaly detection is applied. Anomaly detection requires a vast database of good packets and examples of bad ones to function properly [5]. For example, the K-means clustering algorithm gives satisfactory results despite its simplicity [5].

Realizing these approaches requires a solution for fast data transfer between sensors and server, and a storage system on the server to keep suspicious packets and relevant data. The process of evaluating information as it arrives is called real-time intelligence.

3.2 Performance

Designing a solution requires some real figures. For example, ordinary communication speeds are 1-40 Gbit/s, depending on the application. In a distributed system this figure is, roughly, multiplied by s, where s is the number of sensors. This approximation shows that there is no opportunity to store all data. However, statistics, IP addresses, ports, packet types and other information crucial to the application do need to be stored. Data can be hashed to store the minimum required information. Real numbers are hard to predict because they are highly application specific, but in any case the system should be able to handle millions of packets per second, which is the average load of a medium-sized network.

3.3 Transferring data

Data transferred from client to server consists of informational messages, messages for statistics (IPs, ports, etc.) and the packets themselves. Ideally, a new protocol should be designed here based on existing efforts; however, we see Binary XML as a possible solution, since it allows storage of binary data.

4 Existing and Our Architectures

Researchers in [6] presented an architecture (see Figure 1) of a data-mining IDS that is not distributed. The performance of that system was not given.

Figure 1. Design of Data Mining base IDS [6].

Assuming that millions of packets move through such a system built on a relational database, we expect that it will probably:
1) Be too slow (at least 2 x millions of INSERTs per second and 2 x millions of SELECTs per second).
2) Not be very scalable.
3) Have problems with parallel processing, as simultaneous INSERTs (which are very likely to occur) may find the table locked, causing delays.

Based on current IDS design, the requirements stated above, and the different approaches and demands of the researchers in [1-7], we decided to consider the following structure:

Figure 2. Our DIDS architecture based on CEP

The sensor is responsible for capturing data. It can be installed on the gateway, or on a dedicated server linked to the gateway via a hub, so that it collects every incoming and outgoing packet. For example, detecting network scans or DoS attacks requires IP and port information. The sensor collects this data from packet headers, groups it and sends it back to the server. The search for dependencies or patterns is performed on the server.

5 CEP

Due to the high volume of information, it should be processed in real time or with minimum delay, with instant decisions and limited retention for future investigation. Some papers [9], however, suggest that so-called off-line processing (processing after the communication has taken place) gives more accurate recognition and is usually employed along with real-time analysis.

We suppose that complex event processing fits all the required conditions. CEP is designed like a database turned upside down: we load rules into it and pass all information through it to get results. This approach is expected to be much faster than traditional SQL, at least because CEP (for example, Esper) extracts data only once, while in traditional SQL every query searches the same data over and over again. Researchers [4] succeeded in using Esper for DIDS to detect scans, so we may conclude that it is also possible to extend the task and build a universal solution.

CEP fits the search for network scans or DoS attacks perfectly; however, its use for data mining is not fully clear to us, and further research is needed. We can assume that combining an SQL database with brief results from CEP can significantly narrow the search area and improve results. On the other hand, existing rules (for example from Snort ®) can easily be rewritten for a CEP application, as the Snort rule-based detection approach looks similar to real-time event processing.

The authors of [9] give a list of data-mining approaches that they derive from different research groups:

A) Feature selection.
Feature selection means selecting the important parts of the data and rejecting the unimportant ones. It can significantly limit the available options, which benefits machine learning. Features can be, for example, the percentage of the same service to the same host, the percentage on the same host to the same service, or the average duration over all services. Such markers can be calculated in real time, and researchers [4] showed basic examples that can be extended. The problem with CEP is that it does not offer data storage. Researchers [4] used a mechanism of time windows (roughly, the time during which CEP stores all information, after which it performs the search and results appear). Their test platform used a value of 10 seconds [4]. An average-case calculation (10 Gbit/s, 5 networks) gives 10 Gbit/s x 5 x 10 s = 500 Gbit, so at least 62.5 GB of RAM is needed; to solve the data storage problem they used global lists in which they stored IP addresses that require future inspection. To store our values, we recommend using a traditional in-memory SQL database because of the vast number of parameters that need to be inspected (at least 30) and the quick data changes.

B) Machine learning.
Machine learning mainly consists of classification into good or malicious traffic, or of clustering techniques. Most of the methods need the full chain of events. Network communication is a process in time, so packet storage should be implemented. The database from Figure 3 acts as packet storage; the role of CEP in this process is primary filtering (so as not to overload the database) and searching for chains, with additional filtering performed at the core.

C) Statistical techniques.
Hidden Markov models, for example, are used here. Such IDS are off-line systems that work on an existing database of packets and are quite complex [9]. Since their quality is the same as that of other approaches, we see no reason to assume that they might be used in a scalable, high-performance DIDS, but our designed system allows such deep and complex analysis to be performed on the database of packets.

6 Packet operations

All operations are performed on sequences of independent chains of packets (events). We can identify the following basic operations:
• Sub data extraction
• Filtering
• Sequences searching
• Text comparison
• Full text search

Everything else (data mining, pattern search, etc.) is based on these packet operations, since at the very least the "sub data extraction" operation can return the packet itself.

6.1 Analysis in real-time

The previous sections described the CEP function of performing basic packet analysis and limiting the packets stored in the database. For example, a vast part of the traffic may consist of file transactions (download, upload) that can be regarded as safe with probability ξ. The threshold value of ξ is determined by many factors, such as the purpose of the network, the application tasks and the DIDS responsibilities, and can be either fixed or dynamically changing. A function F (likewise determined by these factors) is applied to each packet, so that ξ_packet = F(mixed packet data), where the packet data is our packet information, such as port, protocol, flags, direction and many others. A dynamic rules engine creates rules that detect safe packets and remove them from the flow before the database insert.

6.2 Scalability

"Esper exceeds over 500 000 event/s on a dual CPU 2GHz Intel based hardware, with engine latency below 3 microseconds average (below 10us with more than 99% predictability) on a VWAP benchmark with 1000 statements registered in the system - this tops at 70 Mbit/s at 85% CPU usage. Esper also demonstrates linear scalability from 100 000 to 500 000 event/s on this hardware, with consistent results across different statements." [8].

Figure 3. CEP application scalability.

CEP-based applications scale with extreme flexibility and mutual independence, see Figure 3. Compared to databases, where database files have to be shared, this approach is a convenient and elegant way to apply parallel filtering, which further reduces transmission latency.

7 Conclusions and Future Work

This paper presents a CEP-based methodology, combined with a relational SQL database, as a platform for designing different types and prototypes of distributed intrusion detection systems. Although no implementation is presented here, we considered and examined different approaches (both real, working systems and research or planned ones) to the internal structure of an IDS and showed how it can benefit from an event-oriented system. DIDS processing relies mainly on event aggregation, and including an event-processing component benefits the system and facilitates development. The warehouse and data-processing system is a crucial factor for DIDS performance and success.

In the future, we plan to implement a platform designed for transferring crucial data and events. We will use several networks with Esper and Storm to transfer traffic, parallelize it and feed it to several IDS for further inspection.

References

[1] Steven R. Snapp, James Brentano. DIDS (Distributed Intrusion Detection System) - Motivation, Architecture, and An Early Prototype. In Proceedings of the 14th National Computer Security Conference.
[2] Michael P. Brennan. Using Snort For a Distributed Intrusion Detection System. SANS Institute.
[3] Rogier Spoor. A Distributed Intrusion Detection System based on passive sensors. SURFnet, 2005.
[4] Leonardo Aniello, Giorgia Lodi and Roberto Baldoni. Inter-Domain Stealthy Port Scan Detection through Complex Event Processing. Proceedings of the 13th European Workshop on Dependable Computing.
[5] Stefano Zanero, (pdf document), 2014. http://www.blackhat.com/presentations/bh-dc-07/Zanero/Presentation/bh-dc-07-Zanero.pdf
[6] Wenke Lee, Salvatore J. Stolfo. Real Time Data Mining-based Intrusion Detection. In DARPA Information Survivability Conference and Exposition II, 2003.
[7] Huy Anh Nguyen and Deokjai Choi. Application of Data Mining to Network Intrusion Detection: Classifier Selection Model. 11th Asia-Pacific Network Operations and Management Symposium, 2010.
[8] Esper website: http://esper.codehaus.org/tutorials/faq_esper/faq.html#how-does-it-work-overview
[9] Theodoros Lappas and Konstantinos Pelechrinis. "Data Mining Techniques for (Network) Intrusion Detection Systems", (pdf document), May 2010, online: http://www.slideshare.net/Tommy96/data-mining-techniques-for-network-intrusion-detection-systems
[10] Mohamad Eid. A New Mobile Agent-Based Intrusion Detection System Using Distributed Sensors, (pdf document). http://www.academia.edu/2884731/A_new_mobile_agent-based_intrusion_detection_system_using_distributed_sensors
[11] Prelude-IDS web site: https://www.prelude-ids.org/
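
As a supplementary illustration, the scan-detection rule from Section 3.1 (collect the requests that one source sends to different ports, count them and compare the count with a threshold) and the time-window memory estimate from Section 5 can be sketched in a few lines of Python. This is a minimal sketch and not the Esper-based system of [4]: the class name ScanDetector, the helper window_memory_gigabytes and the default threshold of 20 ports are hypothetical, only the 10-second window is taken from [4], and events are assumed to arrive in non-decreasing timestamp order.

```python
from collections import defaultdict, deque


class ScanDetector:
    """Flags a source that probes too many distinct ports within a time window.

    Hypothetical illustration: names and the default threshold are assumptions.
    """

    def __init__(self, window_seconds=10.0, port_threshold=20):
        self.window_seconds = window_seconds
        self.port_threshold = port_threshold
        # src_ip -> deque of (timestamp, dst_port), oldest entry first
        self.events = defaultdict(deque)

    def observe(self, timestamp, src_ip, dst_port):
        """Record one connection attempt; return True if src_ip looks like a scanner."""
        window = self.events[src_ip]
        window.append((timestamp, dst_port))
        # Evict attempts that have fallen out of the time window.
        while window and window[0][0] < timestamp - self.window_seconds:
            window.popleft()
        distinct_ports = {port for _, port in window}
        return len(distinct_ports) > self.port_threshold


def window_memory_gigabytes(link_gbit_per_s, networks, window_seconds):
    """RAM needed to buffer all traffic seen during one time window, in gigabytes."""
    total_gbit = link_gbit_per_s * networks * window_seconds
    return total_gbit / 8  # 8 bits per byte
```

For instance, window_memory_gigabytes(10, 5, 10) reproduces the 62.5 GB figure from Section 5, and a ScanDetector built with port_threshold=3 reports a source once it has touched a fourth distinct port inside the window. A real CEP engine would express the same rule declaratively over a sliding window instead of keeping per-source queues by hand.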