=Paper= {{Paper |id=Vol-1308/paper1 |storemode=property |title=Applying CEP to Problems of Real-Time Data Analysis in Distributed IDS. |pdfUrl=https://ceur-ws.org/Vol-1308/paper1.pdf |volume=Vol-1308 |dblpUrl=https://dblp.org/rec/conf/syrcodis/Tsaturyan14 }} ==Applying CEP to Problems of Real-Time Data Analysis in Distributed IDS.== https://ceur-ws.org/Vol-1308/paper1.pdf
Applying CEP to problems of real-time data analysis in distributed IDS.

Tigran Tsaturyan,
Bauman Moscow State Technical University
a915200@yandex.ru



Abstract

Developers and researchers currently apply a range of approaches in distributed intrusion detection systems (DIDS), from traditional rule-based solutions to data mining and pattern search. When these ideas are put into practice, performance remains an open question, because a DIDS is designed to operate in real time on millions of packets per second. In this paper we pay close attention to the data systems in use and offer a draft of a data processing system for such implementations, one that combines the best of the old and new approaches and is designed especially for data mining applications.

Academic supervisor: Yuri Gapanyuk, gapyu@bmstu.ru

1 Introduction

A distributed intrusion detection system (DIDS) is software or a device designed to detect malicious activity across several inspected objects. Every DIDS requires data to perform its analysis. The faster the data delivery system is, the sooner a decision is made and the attacker blocked. Such systems are usually built on a client-server architecture, where the server holds the main engine, logic, rules, logs, etc., and the client (also called a sensor) collects data or performs basic or extended analysis [3].

The more functions the client performs, the higher its hardware requirements. On the other hand, restricting the sensor to data collection results in extreme network load, because a copy of every incoming and outgoing packet is sent to the server. Since a combined approach is used in practice, the designed system should be able to work with both options and act not only as a communication platform but also as a processing solution intertwined with the IDS.

2 Related Work

The question of data-processing architecture in DIDS is relatively new, probably because data-mining IDS are still in development and modern security products either have closed documentation or are limited. There have been attempts [2] to use a MySQL database as the data-storage core, for example, but the solution offered does not use any intelligent techniques and is rather a good practical guide to setting up a DIDS easily. The researchers in [4] used complex event processing (CEP) as the basis for a network scan detection system, with positive results. CEP is a method of tracking and analyzing (processing) streams of information about things that happen, combined from several sources, and deriving a conclusion from them. Because of a CEP limitation (described later in this paper), these researchers had to employ an extra data warehouse. Some researchers [10] simply use sensors as sniffers to redirect incoming and outgoing traffic, with only native or custom-implemented lists, arrays, and other in-memory datasets. The distributed Prelude-IDS [11] works with both MySQL and PostgreSQL, but stores only results and performs “analysis on the fly”. Experiments based on intelligent techniques such as genetic algorithms, neural networks, and data mining were carried out on relatively small amounts of traffic with some possible delays; however, the desire to apply them in real time has turned such software toward real-time intelligence.

3 Requirements

For the DIDS framework to operate successfully, it must meet the main requirements for such systems: it must be fast, reliable, and fit for the tasks it may face.

3.1 Understanding Data Processing in DIDS

To understand what kind of information and queries we are facing, we need to understand the attacks themselves and the identification approaches used. Broadly, attacks can be classified into three groups:
• Network scanning
• Denial of service attack
• Different human attacks (exploits, errors in code, vulnerabilities, etc.)

Network scanning is a way to learn a network's structure, its hosts, and its open ports. This technique answers questions such as what the network is, how many hosts it has, what operating systems they run, and what is open or blocked. Take the TCP SYN scan as an example (please refer to the appropriate documentation of this attack): the scanner sends a SYN packet to open a connection to the target and waits for a response. If a SYN-ACK packet is received, the scanner assumes that the port is open and the host is up. If an RST-ACK packet is received, the scanner concludes that the port is closed. If nothing is
received, the port may be filtered or the host may be down. The scanner can run on one host or be spread over several hosts. To identify a scanning attack, an IDS (DIDS) can, for example, build a list of requests to different ports from one source, count them, and compare the count with some threshold [4].

A denial of service (DoS) attack is an attack that blocks or overloads the target machine and makes it inaccessible to users. DoS is easy to identify but not so easy to prevent. The approach to recognizing the attack is similar to that for network scanning, with the difference that we now have to consider packet size or go deeper into the application layers and understand the requests. Very often the attack is performed by a group of machines (a botnet), and distinguishing a real user from a malicious computer requires some analysis. Moreover, DoS significantly increases traffic in the system, which means that the data-processing system itself must withstand the attack and not become the bottleneck.

Different human attacks target errors in code (vulnerabilities) or wrongly configured software. A basic rule approach is used here, where each packet is looked up in an existing signature database, or some form of anomaly detection is applied. Anomaly detection requires a vast database of good packets and examples of bad ones to function properly [5]. For example, the K-means clustering algorithm gives satisfactory results despite its simplicity [5].

Realizing these approaches requires a solution for fast data transfer between sensors and server, and a storage system on the server that keeps suspicious packets and relevant data. The process of evaluating information as it arrives is called real-time intelligence.

3.2 Performance

When it comes to a solution, some real figures are needed. For example, ordinary communication speeds are 1-40 Gbit/s, depending on the application. In a distributed system this figure is multiplied roughly by s, where s is the number of sensors. This approximation shows that it is not feasible to store all data. However, statistics, IP addresses, ports, types, and other information crucial to the application do need to be stored. Data can be hashed to store the minimum required information. Real numbers are hard to predict because they are highly application specific, but in any case the system should be able to handle millions of packets per second, which is the average load of a medium-sized network.

3.3 Transferring data

Data transferred from client to server consists of informational messages, messages for statistics (IPs, ports, etc.), and the packets themselves. Ideally, a new protocol should be designed here on the basis of existing efforts; however, we see Binary XML as a possible solution, since it allows storage of binary data.

4 Existing and Our Architectures

The researchers in [6] presented an architecture (see Figure 1) of a data mining IDS that is not distributed. The performance of that system was also not reported.

Figure 1. Design of Data Mining based IDS [6].

Assuming that millions of packets move through such a system built on a relational database, it will probably:
1) Be too slow (at least 2 x millions of INSERTs per second and 2 x millions of SELECTs per second).
2) Not be very scalable.
3) Have problems with parallel processing, because simultaneous INSERTs (which are very likely to occur) may find the table locked, which introduces delays.

Based on current IDS design, the requirements stated above, and the different approaches and demands of the researchers in [1-7], we decided to consider the following structure:

Figure 2. Our DIDS architecture based on CEP

The sensor is responsible for capturing data. It can be installed on the gateway or on a dedicated server linked to the gateway via a hub, so that it collects every incoming and outgoing packet. For example, detecting network scans or DoS attacks requires IP and port information. The sensor collects this data from packet headers, groups it, and sends it to the server. The search for dependencies or patterns is performed on the server.
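
The threshold check for scan detection described in Section 3.1, applied to the IP and port records that a sensor extracts from packet headers, can be sketched as follows. This is only an illustration under stated assumptions, not the method of [4]: the function name `find_scanners`, the tuple layout of an event, and the threshold value are all hypothetical, and a real system would evaluate the count over a sliding time window rather than over a fixed list.

```python
from collections import defaultdict

# Hypothetical threshold: number of distinct targets one source may contact
# before it is flagged. The paper does not give a concrete value.
PORT_THRESHOLD = 100


def find_scanners(events, threshold=PORT_THRESHOLD):
    """Return the set of source IPs that contacted more than `threshold`
    distinct (destination IP, destination port) pairs.

    `events` is an iterable of (src_ip, dst_ip, dst_port) tuples, i.e. the
    header fields a sensor would extract and forward to the server.
    """
    targets_by_source = defaultdict(set)
    for src_ip, dst_ip, dst_port in events:
        # A set keeps each target once, so repeated requests to the same
        # port do not inflate the count.
        targets_by_source[src_ip].add((dst_ip, dst_port))
    return {src for src, targets in targets_by_source.items()
            if len(targets) > threshold}


if __name__ == "__main__":
    # One source probing 150 ports of a single host, plus one ordinary
    # client that repeatedly talks to port 443.
    events = [("10.0.0.5", "192.168.1.1", port) for port in range(1, 151)]
    events += [("10.0.0.7", "192.168.1.1", 443)] * 20
    print(find_scanners(events))
```

Counting distinct targets rather than raw packets keeps a busy but legitimate client below the threshold, while a source that touches many ports is flagged regardless of how few packets it sends to each.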
5 CEP role

Because of the high volume of information, it should be processed in real time or with minimum delay, with instant decisions and limited storage for future investigation. Some papers [9], however, suggest that so-called off-line processing (processing after the communication has taken place) gives the benefit of more accurate recognition and is usually employed along with real-time analysis.

We suppose that complex event processing fits all the required conditions. CEP is designed like a database turned upside down: we load rules into it and pass all information through it to get results. This approach is expected to be much faster than traditional SQL, if only because CEP (for example, Esper) reads the data once, while in traditional SQL every query searches the same data over and over again. The researchers in [4] succeeded in using Esper for a DIDS to detect scans, so we may conclude that it is also possible to extend the task and build a universal solution.

CEP fits the search for network scans or DoS attacks perfectly; however, its use for data mining is not fully clear to us, and further research is needed. We can assume that combining an SQL database with brief results from CEP can significantly narrow the search area and improve results. On the other hand, existing rules (for example, from Snort ®) can easily be rewritten for a CEP application, because Snort's rule-based detection approach looks similar to real-time event processing.

The authors of [9] give a list of data mining approaches that they infer from different research groups:

A) Feature selection.
Feature selection is the selection of the important parts of the data and the rejection of the unimportant ones. It can significantly limit the available options, which benefits machine learning. For example, features can be the percentage of connections with the same service to the same host, the percentage on the same host to the same service, or the average duration over all services. Such markers can be calculated in real time, and the researchers in [4] showed basic examples that can be extended. The problem with CEP is that it does not offer data storage. The researchers in [4] used a mechanism of time windows (roughly, the time during which CEP stores all information, after which it performs the search and results appear). Their test platform used a value of 10 seconds [4]. A calculation for an average case (10 Gbit/s, 5 networks) shows that at least 62.5 GB of RAM is needed. To solve the problem of data storage they used global lists, in which they stored IP addresses that require future inspection. To store our values, we recommend a traditional in-memory SQL database because of the vast number of parameters that need to be inspected (at least 30) and the quick data changes.

B) Machine learning.
Machine learning mainly consists of classification into good or malicious traffic, or of clustering techniques. Most methods need the full chain of events to be applied. Network communication is a process in time, so packet storage should be implemented. The database from Figure 3 acts as packet storage, where the task of CEP in this process is primary filtering, in order not to overload the database, and searching for chains, with additional filtering performed at the core.

C) Statistical techniques
Here hidden Markov models are used, for example. Such IDS are off-line systems that work on an existing database of packets and are quite complex [9]. Since their quality is the same as that of other approaches, we see no reason to assume that they might be used in a scalable, high-performance DIDS, but our designed system allows such deep and complex analysis to be performed on the database of packets.

6 Packet operations

All operations are performed on sequences of independent chains of packets (events). We can infer the following basic operations:
• Sub data extraction
• Filtering
• Sequences searching
• Text comparison
• Full text search

Everything else (data mining, pattern search, etc.) is based on these packet operations, since at the very least the “sub data extraction” operation can return the packet itself.

6.1 Analysis in real-time

The previous sections described the function of CEP: performing basic packet analysis and limiting the packets stored in the database. For example, a large part of the traffic may consist of file transactions (download, upload) that can be regarded as safe with probability ξ. The threshold value of ξ is determined by many factors, such as the purpose of the network, the application tasks, and the responsibilities of the DIDS, and it can be either fixed or dynamically changing. A function F (also to be determined) is applied to each packet, so that ξ_packet = F(mixed packet data), where the packet data is our packet information, such as port, protocol, flags, direction, and many others. A dynamic rules engine creates rules that detect safe packets and remove them from the flow before database insertion.

6.2 Scalability

“Esper exceeds over 500 000 event/s on a dual CPU 2GHz Intel based hardware, with engine latency below 3 microseconds average (below 10us with more than 99% predictability) on a VWAP benchmark with 1000 statements registered in the system - this tops at 70 Mbit/s at 85% CPU usage. Esper also demonstrates linear scalability from 100 000 to 500 000 event/s on this hardware, with consistent results across different statements.” [8].
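
The sizing figures used in this paper, namely the 62.5-gigabyte time-window estimate from Section 5 and the Esper throughput quoted in Section 6.2, can be reproduced with a short calculation. The sketch below rests on assumptions that are not stated in the sources: the window is taken to buffer all raw traffic, one packet is counted as one event, engines are assumed to scale out linearly and independently, and the load of 2 million packets per second is a hypothetical value chosen only to illustrate "millions of packets per second". The function names are likewise illustrative.

```python
import math


def window_memory_gb(link_gbit_per_s, window_s, networks):
    """Gigabytes needed to buffer all traffic of `networks` links,
    each running at `link_gbit_per_s`, for one window of `window_s` seconds."""
    total_gbit = link_gbit_per_s * window_s * networks
    return total_gbit / 8  # 8 bits per byte


def engines_needed(events_per_s, per_engine_events_per_s=500_000):
    """Number of CEP engines for a given event rate, assuming linear
    scale-out and the 500 000 event/s per-engine figure quoted from [8]."""
    return math.ceil(events_per_s / per_engine_events_per_s)


if __name__ == "__main__":
    # Values from the text: 10 Gbit/s links, 10 s window, 5 networks.
    print(window_memory_gb(10, 10, 5))   # 62.5
    # Hypothetical load: 2 million packets/s, one event per packet.
    print(engines_needed(2_000_000))     # 4
```

Under these assumptions a single engine falls short of the target load, which is why independent parallel filtering matters for the architecture.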
Figure 3. CEP application scalability.

The scalability of CEP-based applications shows extreme flexibility and inter-independence (see Figure 3). Compared with databases, where database files have to be shared, this approach is a convenient and elegant way to apply parallel filtering, which further reduces transmission latency.

7 Conclusions and Future Work

This paper presents a CEP-based methodology, united with a relational SQL database, as a platform for designing different types and prototypes of distributed intrusion detection systems. Although no implementation was presented here, we considered and examined different approaches (real, working systems as well as research and planned ones) to the internal structure of an IDS and showed how it can benefit from an event-oriented system. We can clearly see that the DIDS processing idea relies mainly on event aggregation, and that including an event processing component benefits the system and facilitates development. The warehouse and data-processing system is a crucial factor in DIDS performance and success. In the future, we plan to implement a platform designed for transferring crucial data and events. We will use several networks with Esper and Storm to transfer traffic, parallelize it, and feed it to several IDS for further inspection.


References
[1] Steven R. Snapp, James Brentano. DIDS (Distributed Intrusion Detection System) - Motivation, Architecture, and An Early Prototype. In Proceedings of the 14th National Computer Security Conference
[2] Michael P. Brennan. Using Snort For a Distributed Intrusion Detection System, SANS Institute
[3] Rogier Spoor. A Distributed Intrusion Detection System based on passive sensors, SURFnet, 2005
[4] Leonardo Aniello, Giorgia Lodi and Roberto Baldoni. Inter-Domain Stealthy Port Scan Detection through Complex Event Processing, Proceedings of the 13th European Workshop on Dependable Computing
[5] Stefano Zanero, (pdf document), 2014 http://www.blackhat.com/presentations/bh-dc-07/Zanero/Presentation/bh-dc-07-Zanero.pdf
[6] Wenke Lee, Salvatore J. Stolfo. Real Time Data Mining-based Intrusion Detection, In DARPA Information Survivability Conference and Exposition II, 2003
[7] Huy Anh Nguyen and Deokjai Choi. Application of Data Mining to Network Intrusion Detection: Classifier Selection Model, 11th Asia-Pacific Network Operations and Management Symposium, 2010
[8] Esper website: http://esper.codehaus.org/tutorials/faq_esper/faq.html#how-does-it-work-overview
[9] Theodoros Lappas and Konstantinos Pelechrinis, “Data Mining Techniques for (Network) Intrusion Detection Systems”, (pdf document), May 2010, online http://www.slideshare.net/Tommy96/data-mining-techniques-for-network-intrusion-detection-systems
[10] Mohamad Eid. A New Mobile Agent-Based Intrusion Detection System Using Distributed Sensors, (pdf document) http://www.academia.edu/2884731/A_new_mobile_agent-based_intrusion_detection_system_using_distributed_sensors
[11] Prelude-IDS web site: https://www.prelude-ids.org/