=Paper=
{{Paper
|id=Vol-2081/paper22
|storemode=property
|title=Big Data Technologies for Cybersecurity
|pdfUrl=https://ceur-ws.org/Vol-2081/paper22.pdf
|volume=Vol-2081
|authors=Sergei A. Petrenko,Krystina A. Makoveichuk
}}
==Big Data Technologies for Cybersecurity==
Sergei A. Petrenko, Information Security Department, Saint Petersburg Electrotechnical University "LETI", St. Petersburg, Russia, s.petrenko@rambler.ru

Krystina A. Makoveichuk, Department of Informatics and Information Technologies, Vernadsky Crimean Federal University, Yalta, Russia, christin2003@yandex.ru

Abstract: The article presents variants for building a cognitive early warning system about a computer attack on the information resources of the Russian Federation on the basis of Big Data technologies. The essence of Big Data technologies is considered in the context of their application in information security. Approaches to stream data processing based on the CEP model, a modification of MapReduce, the actor model, and a combination of the actor model and the MapReduce modification are analyzed. The architecture of the prototype of the software and hardware complex "Warning-2016", a typical scheme of the hardware implementation of the stand and the technical specification of its equipment are presented.

Keywords: Big Data, cybersecurity, streaming data processing, actor model, modification of MapReduce, software and hardware complex (SHC).

===I. Introduction===
At present, technologies for processing, storing and analyzing large data (hereinafter, Big Data technologies) are becoming increasingly important for monitoring the security state of the critical infrastructure of the Russian Federation (electrical networks, pipelines, communication systems, and so forth) and for monitoring the corresponding criteria and indicators of information security and of the sustainability of functioning in general [3].

Big Data technologies are already being used in a number of cybersecurity applications. For example, Security Information and Event Management (SIEM) systems use non-relational (NoSQL) databases to store logs, messages and security events. In the near future, a qualitative leap in the development of SIEM is expected on the basis of models and methods of predictive analytics. In the known solutions of Red Lambda, Palantir, etc., Big Data technologies are used to build profiles of users and social groups in order to detect abnormal behavior [1, 4, 9, 16]. The information sources in this case are corporate mail, CRM and personnel systems, the access control system (ACS), as well as various pullers in connected data networks (Internet / Intranet and IIoT / IoT), external news feeds, and collectors and aggregators in social networks [24].

The relevance of Big Data technologies is confirmed by the fundamental possibility of conducting "online analysis" of packet and streaming data, of isolating and processing significant simple and complex cybersecurity events in real (or quasi-real) time, and of generating new useful knowledge for the detection and prevention of security incidents. It is significant that Big Data is able to provide proactive security and monitor impending information security incidents even before they can adversely affect the sustainability of the critical infrastructure [3, 23].

Thus, by Big Data technologies in information security we will understand technologies for the efficient processing of dynamically growing data volumes (structured and unstructured) in heterogeneous Internet / Intranet and IIoT / IoT systems for solving urgent security tasks. The practical significance of Big Data technologies lies in the ability to detect primary and secondary signs of the preparation and conduct of computer attacks, to detect abnormal behavior of controlled objects and subjects, to classify previously unknown mass and group cyber attacks (including new DDoS and APT attacks), to detect traces of computer crimes, etc., that is, in all cases where traditional means of information protection (SIEM, IDS / IPS, systems of protection from unauthorized access to information, cryptographic information protection facilities, antiviruses, etc.) are not very effective [4, 5, 7, 9, 18, 22, 23].

===II. Comparative Analysis of Big Data===
Currently, the following approaches to streaming data processing are known [2, 6, 10-15, 19-21], based on:
* the classical CEP model, for example, StreamBase;
* a modification of MapReduce, for example, D-Streams;
* the actor model, for example, Storm, S4 and Zont;
* a combination of the actor model and the modification of MapReduce, for example, Zont + RTI.

In the first approach, the classic Complex Event Processing (CEP) model is used. CEP makes it possible to search for "significant" cybersecurity events in a data stream over a certain time interval, to perform a correlation analysis of events, and to single out the event patterns that require an immediate response.

To automate the development of data processing systems based on CEP, a number of tools have been proposed (for example, StreamBase with its own declarative programming languages StreamSQL and EventFlow).

The practice of using CEP has shown that it is optimal for the collection and processing of simple cybersecurity events: for example, extracting events from several data streams, aggregating them into complex events, reverse decomposition, etc. However, implementing complex logic for handling cybersecurity events is difficult. To solve this problem, an approach based on the generalization of MapReduce to the processing of streaming data was proposed.
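The CEP idea described in this section (looking for "significant" events within a time interval and correlating simple events into a complex one) can be illustrated with a minimal sketch. It is not StreamBase, StreamSQL or EventFlow code; the `Event` schema, the `WindowCorrelator` class, the `detect` helper and the "repeated failed login" pattern are all hypothetical names chosen for illustration, and events are assumed to arrive in time order.

```python
from collections import defaultdict, deque
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    """A simple security event (hypothetical schema for illustration)."""
    timestamp: float  # seconds on some common clock
    source: str       # e.g. an IP address or a sensor id
    kind: str         # e.g. "login_failed"


class WindowCorrelator:
    """Emit a complex event when `threshold` simple events of one kind
    arrive from the same source within `window_seconds`."""

    def __init__(self, kind, threshold, window_seconds):
        self.kind = kind
        self.threshold = threshold
        self.window_seconds = window_seconds
        self._recent = defaultdict(deque)  # source -> timestamps in window

    def feed(self, event):
        """Process one event; return a complex-event dict or None."""
        if event.kind != self.kind:
            return None
        window = self._recent[event.source]
        window.append(event.timestamp)
        # Drop timestamps that have slid out of the time window.
        while window and event.timestamp - window[0] > self.window_seconds:
            window.popleft()
        if len(window) >= self.threshold:
            complex_event = {
                "pattern": f"repeated_{self.kind}",
                "source": event.source,
                "count": len(window),
                "first_seen": window[0],
                "last_seen": event.timestamp,
            }
            window.clear()  # avoid re-reporting the same burst
            return complex_event
        return None


def detect(events, correlator):
    """Run a correlator over an iterable of time-ordered events."""
    return [hit for hit in map(correlator.feed, events) if hit is not None]
```

A call such as `detect(events, WindowCorrelator("login_failed", 3, 60))` returns one complex event per burst of three failed logins from one source within 60 seconds. The sketch also hints at the trade-off noted in the paper: only one window of events is kept in memory, but every incoming event triggers work on that window.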
In the second approach, the D-Streams model of discretized streams is used, in which streaming computations are represented as series of stateless, deterministic batch computations on small time intervals. It is significant that such a representation made it possible not only to implement complex logic for processing cybersecurity events, but also to offer better recovery methods than traditional replication and backup. The fact is that in practice, in computer networks with a large number of nodes (hundreds or more), failures and "hangs" (or "slow" nodes) inevitably occur, and prompt data recovery after a fault or failure becomes important, since even a minimal delay of 10-30 seconds can be critical for making the right decision.

It should be noted that for the known streaming data systems (Storm, MapReduce Online, etc.) resiliency has apparently reached its limits. These systems are based on the model of "long-lived" stateful operators, which, upon receiving a message, update their internal state and send a new message further. In this case, the system is restored either by replication to a pre-prepared copy of the node or by upstream backup in the data stream, meaning a "replay" of messages on each new copy of the "fallen" node.

As a result, replication requires costly two- or three-fold node redundancy, while upstream backup is characterized by significant time delays due to the need to wait for the nodes to "catch up" while the data is re-run through the operators. In addition, neither of these approaches can cope with "hangs". Replication systems use synchronization protocols such as Flux to coordinate replicas, so a hang slows down both replicas. With upstream backup, any "hang" is treated as a failure followed by a costly recovery.

The D-Streams model offers better recovery methods: for example, recovery based on Resilient Distributed Datasets (RDD), which restores data directly from memory without replication and with sub-second latency, or parallel recovery of the state of a "lost node", in which, when a node fails, the working nodes of the cluster are enlisted to recompute the lost RDD structures. Note that in traditional systems of continuous data processing such recovery is impossible because of complex synchronization protocols.

Note that using the D-Streams model requires splitting the input data into streams, which inevitably results in the loss of certain events. In addition, with large flows the data processing system is no longer flexible and scalable: the response time to events grows and the system moves further away from real-time operation. To solve these problems, the third approach, based on the actor model, was proposed.

In the third approach, streaming data processing systems are based on the actor model. Here, actors are understood as primitives of parallel computation. The main advantage of actors is the ability to store state, including state derived from historical data, which can be used to highlight significant cybersecurity events. Among the known streaming data solutions based on the actor model are Storm (Twitter), S4 (Yahoo!) and Zont (Moscow Institute of Physics and Technology, MIPT).

The first two solutions, Storm and S4 [8, 17, 21], make it possible to implement so-called pipeline data processing based on a relatively small number of actors.

The third system, Zont, can work with a large number of actors. This matters when working with a cloud of sensors, where each sensor is assigned its own actor. To develop distributed resilient systems, it is possible to use the Erlang and Riak Core development environment. Here, the functional language Erlang (Ericsson) makes it possible to create programs that work in a distributed computing environment on several nodes (processors, cores of one processor, a cluster of machines), and the open library Riak Core (Basho Technologies) makes it possible to create distributed applications following Amazon's Dynamo architecture.

Thus, all three systems, Storm, S4 and Zont (MIPT), can be used to process a large data stream from several sources. Of these, Zont is best suited for working with a cloud of sensors.

The fourth approach combines the advantages of the second and third approaches, which makes it possible to build real-time streaming data systems based on a combination of the MapReduce modification and the actor model.

===III. Example of a Solution Based on Big Data===
Consider the possible options for building a cognitive early warning system about a computer attack on the information resources of the Russian Federation (software and hardware complex, SHC "Warning-2016") on the basis of Big Data technologies.

Variant 1. Implementation of the experimental model of the SHC "Warning-2016" based on HBase.

Here the basis of the proposed solution is the non-relational distributed database HBase, working on top of the Hadoop Distributed File System (HDFS). This database makes it possible to perform analytical and predictive operations on terabytes of data in order to assess threats to cybersecurity and to the stability of the critical infrastructure as a whole. It is also possible to prepare, in automated mode, appropriate scenarios for detection, neutralization and warning.

The second module, which analyzes hypotheses, is designed to handle large amounts of data and therefore requires high performance. The module interacts with servers of standard configuration and is implemented in C (via PECL, the PHP extensions repository). Special interactive tools based on JavaScript / CSS / DHTML and libraries such as jQuery have been developed to work with cybersecurity-related content.

MySQL, in the form of Percona Server (version 5.6) with the XtraDB engine, is used as the data store. The database servers are combined into a multi-master cluster using Galera Cluster. HAProxy is used to balance the database servers. Redis (version 2.8) is used to implement task queues, as well as for data caching.

nginx is used as the web server, together with PHP-FPM with APC enabled. HTTP requests are balanced using DNS (multiple A records). Client applications for Apple iOS are developed in Objective-C and C++ using the Apple iOS SDK based on Cocoa Touch, CoreData and UIKit. These applications are compatible with devices running Apple iOS version 9 and above.

Applications for Android OS are developed with the native Google SDK and are compatible with devices running Android OS version 4.1 and higher. Software for the web platform is developed using PHP and JavaScript. The experimental sample of the SHC "Warning-2016" is deployed at Saint Petersburg Electrotechnical University "LETI" on the DigitalOcean platform and comprises:
* 3 servers for the production stage;
* 1 server for the testing stage.

In addition, CloudFlare is used to increase the speed of the service (through the use of a CDN) and to protect against DoS attacks. Load testing of the experimental model of the SHC "Warning-2016" indicates the viability of the proposed technical solution [17].

Variant 2. Implementation of the prototype of the SHC "Warning-2016" on the basis of the telematics platform Zont (MIPT).

A possible architecture of the experimental layout of the SHC "Warning-2016" based on Zont is presented in Fig. 1. Here the basis of the proposed solution is the telematics platform Zont, which makes it possible to create resilient, scalable cloud systems for streaming large data. Table I describes the modules of the experimental layout of the SHC "Warning-2016" based on Zont.

Fig. 1. Possible architecture of the SHC "Warning-2016" based on Zont

{| class="wikitable"
|+ Table I. Description of the data processing modules
! Name of the element (name in the figure) !! Functionality of the element
|-
| TCP-Balancer || Network load balancing between cluster members.
|-
| TCP session manager (Socket server process)
|
* Separate control of the network connection for each individual sensor;
* Preliminary integrity check of the incoming data.
|-
| Transactional buffer (TX buffer)
|
* Buffering input data to optimize performance at peak loads;
* Calling the procedure for parsing data from the sensors;
* Redirecting data to the corresponding FSM.
|-
| Sensor state machine (Sensor FSM)
|
* Dedicated FSM process for each sensor;
* Logical processing of data from the sensor;
* Tracking events within a single detector;
* Saving the processed data to the database.
|-
| NoSQL Distributed KV Storage
|
* Saving the processed data;
* Increasing the reliability of writing and reading data by storing the data redundantly on different nodes.
|-
| Analytical data processing module
|
* Analytical data processing, tracking complex events;
* Generating reports based on the content of the database.
|-
| Module for interaction with client applications
|
* Interaction with external clients using REST and WebService.
|}

It is significant that Zont has its own specialized storage, built on Riak Core technology, for storing and retrieving archived data that represent time series. The backend for the storage is LevelDB, developed by Google. LevelDB is an embedded KV database specifically designed for use as a backend in the construction of specialized databases; it provides operations for writing, searching and sequential scanning of data.

The key advantages of the database are a high write speed, a predictable speed of search by key and a high speed of sequential reading. The following characteristics of LevelDB are also important for the storage of time series:
* a) use of the LSM tree model, which provides high resistance to faults and failures;
* b) storage of data in ordered form.

In addition, geo-index support based on geohash technology was added to store the input data of mobile sensors. This index is specifically designed to store a large array of information about the spatial position of point objects, in particular moving sensors.

As a result, better performance indicators were obtained compared to such well-known universal solutions as Postgres GiST and MongoDB 2dsphere.

The hardware implementation of the prototype SHC "Warning-2016" is a cluster of general-purpose servers connected by a network (Table II).

A typical scheme of the hardware implementation of the stand is presented in Fig. 2.

The stand includes the following components:
* Erlang Virtual Server (Erlang Node Server), which implements the distributed cloud platform module;
* Erlang Virtual Server (Erlang Node test Server), which implements the test module, including the sensor network emulator;
* Auxiliary software server of the Platform, which contains the JBoss application server and the PostgreSQL DBMS.

{| class="wikitable"
|+ Table II. Technical specification of stand equipment
! Equipment !! Standard server cluster !! High-performance cluster "IScalare"
|-
| Number of servers allocated for the layout of the Platform || 6 || 200
|-
| CPU || 2 x Intel Xeon E5-2620 (6-core), 2.0 GHz || 2 x Intel Xeon E5-2690 (8-core), 2.9 GHz
|-
| Memory || 32 GB || 64 GB
|-
| Hard disk drives (SATA) || 3 TB || 600 GB
|-
| Network || Ethernet 1 Gb || InfiniBand QDR (40 Gbit/s)
|}

The system and application software of the stand included:
* Linux OS CentOS 6.4;
* Erlang R16B02 as the execution environment of the distributed Erlang machine;
* HAProxy as a proxy server;
* JBoss as an application server;
* PostgreSQL for storing metadata, etc.

Fig. 2. Software and hardware implementation of the stand

Preliminary tests of the experimental design of the SHC "Warning-2016" showed that it is operational [17].

===IV. Conclusions===
The positive experience of using Big Data for solving information security problems testifies to the expediency of choosing solutions based on the actor model and MapReduce technology over a distributed cloud KV data store. At the same time, solutions based on CEP turned out to be less demanding of memory: they store data in a single "window" of events, but they required considerable computing resources when analyzing such "windows". Solutions based on the actor model were less demanding of computing resources, but more demanding of memory because of the need to duplicate data for each event / object. Accordingly, solutions based on the modification of MapReduce took an intermediate position.

In our opinion, Big Data technology can radically change the situation in the following areas of information security:
* Proactive management of cybersecurity incidents;
* Early detection, prevention and elimination of the consequences of computer attacks;
* Predictive network monitoring of cybersecurity;
* Authentication, user authorization and identity management;
* Preventing computer crime and fraud;
* Information security risk management;
* Compliance with regulatory requirements, etc.

The first results should be expected precisely in the proactive management of cybersecurity incidents and in early warning of computer attacks. Note that the known results of foreign developers of information protection tools confirm this assumption. For example, in 2015-2016 RSA and IBM announced plans to create a new generation of security management centers (Security Operations Center, SOC) [17], the so-called Intelligence-Driven Security Operations Center, iSOC.

===References===
[1] Armstrong T. G., Ponnekanti V., Borthakur D., Callaghan M. Linkbench: A database benchmark based on the facebook social graph. [Proc. In: 2013 ACM SIGMOD International Conference on Management of Data, SIGMOD ’13], New York, 2013. ACM, pp. 1185–1196. DOI: 10.1145/2463676.2465296.

[2] Babcock B., Babu S., Datar M., Motwani R., Widom J. Models and issues in data stream systems, in: 21st ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems, PODS ’02, ACM, New York, USA, 2002, pp. 1–16. DOI: 10.1145/543613.543615.

[3] Barabanov A., Markov A., Tsirlov V. Procedure for Substantiated Development of Measures to Design Secure Software for Automated Process Control Systems. In Proceedings of the 12th International Siberian Conference on Control and Communications (Moscow, Russia, May 12-14, 2016). SIBCON 2016. IEEE, 7491660, 1-4. DOI: 10.1109/SIBCON.2016.7491660.

[4] Biryukov D. N., Lomako A. G., Rostovtsev Yu. G. The appearance of anti-cyber systems to prevent the risks of cyber-threat [Proc. SPIIRAN]. 2015, V. 39, pp. 5-25. DOI: http://dx.doi.org/10.15622/sp.39.1

[5] Borovsky A.S., Ryapolova E.I. Building a model of the protection system in cloud technologies based on a multi-agent approach with the use of automatic model. Voprosy kiberbezopasnosti [Cybersecurity issues]. 2017. No. 4 (22), pp. 10-20. DOI: 10.21681/2311-3456-2017-4-10-20.

[6] Brian F. Cooper, Adam Silberstein, Erwin Tam, Raghu Ramakrishnan, and Russell Sears. Benchmarking cloud serving systems with ycsb. [Proc. 1st ACM Symposium on Cloud Computing, SoCC ’10], New York, 2010, ACM, pp. 143–154. DOI: 10.1145/1807128.1807152.

[7] Gai K., Qiu M., Zhao H. Cost-aware multimedia data allocation for heterogeneous memory using genetic algorithm in cloud computing, IEEE Transactions on Cloud Computing PP (99) (2016) 1–1. DOI: 10.1109/TCC.2016.2594172.

[8] Gedik B., Ozsema H., Oztürk O. Pipelined fission for stream programs with dynamic selectivity and partitioned state, Journal of Parallel and Distributed Computing 96 (2016) 106–120. DOI: http://dx.doi.org/10.1016/j.jpdc.2016.05.003.

[9] Ghazal A., Rabl T., Hu M., Raab F., Poess M., Crolotte A., Jacobsen H.-A. Bigbench: Towards an industry standard benchmark for big data analytics. [Proc. ACM SIGMOD International Conference on Management of Data, SIGMOD ’13], New York, 2013. ACM, pp. 1197–1208. DOI: 10.1145/2463676.2463712.

[10] Golab L., Ozsu M. T. Issues in data stream management, SIGMOD Record 32 (2) (2003) 5–14. DOI: 10.1145/776985.776986.

[11] Jamshidi P., Casale G. An Uncertainty-Aware Approach to Optimal Configuration of Stream Processing Systems [Proc. In: MASCOTS 2016]. DOI: 10.5281/zenodo.56238.

[12] Jayashree M., Zahoor S. U. H. Beyond Batch Process: A BigData processing Platform based on Memory Computing and Streaming Data // International Journal of Innovative Research in Science, Engineering and Technology (An ISO 3297: 2007 Certified Organization). 2016. V. 5, I. 10, pp. 1783-1789. DOI: 10.15680/IJIRSET.2016.0510013.

[13] Kuhlenkamp J., Klems M., Ross O. Benchmarking scalability and elasticity of distributed database systems. [Proc. VLDB Endow.], 2014, v. 7(12), pp. 1219–1230. DOI: 10.14778/2732977.2732995.

[14] Lachhab F., Bakhouya M., Ouladsine R., Essaaidi M. Performance evaluation of CEP engines for stream data processing. [Proc. 2nd International Conference on Cloud Computing Technologies and Applications (CloudTech)], Marrakech, 2016. DOI: 10.1109/CloudTech.2016.7847726.

[15] Malewicz G., Austern M. H., Bik A. J. C., James C. Dehnert, Horn I., Leiser N., Czajkowski G. Pregel: A system for large-scale graph processing - "abstract". 2009, pp. 6–6. DOI: 10.1145/1582716.1582723.

[16] Petrenko S.A., Makoveichuk K.A., Chetyrbok P.V., Petrenko A.S. About Readiness for Digital Economy. In Proceedings of the 2017 IEEE II International Conference on Control in Technical Systems, IEEE, CTS, 2017, pp. 96–99. DOI: 10.1109/CTSYS.2017.8109498.

[17] Petrenko S. A., Stupin D. D. Natsional'naya sistema rannego preduprezhdeniya o komp'yuternom napadenii [National system of advance computer attacks alerting]. Innopolis, Afina Publ., 2017. 440 p.

[18] Skatkov A., Shevchenko V. Expansion of reference model for the cloud computing environment in the concept of large-scale scientific researches. Trudy ISP RAN/Proc. ISP RAS, vol. 27, issue 6, 2015, pp. 285-306 (in Russian). DOI: 10.15514/ISPRAS-2015-27(6)-18.

[19] Surekha D., Swamy G., Venkatramaphanikumar S. Real time streaming data storage and processing using storm and analytics with Hive. [Proc. International Conference on Advanced Communication Control and Computing Technologies (ICACCCT)], Ramanathapuram, 2016. DOI: 10.1109/ICACCCT.2016.7831712.

[20] Tang Y., Gedik B. Autopipelining for data stream processing, IEEE Transactions on Parallel and Distributed Systems 24 (12) (2013) 2344–2354. DOI: 10.1109/TPDS.2012.333.

[21] Tasharofi S., Dinges P., Johnson R. E. Why Do Scala Developers Mix the Actor Model with other Concurrency Models? Proc. In: Castagna G. (eds) Conference proceedings of the 27th European Conference on Object-Oriented Programming, ECOOP 2013, Montpellier, France, July 1-5. Lecture Notes in Computer Science, Berlin, Heidelberg, Springer, vol 7920. DOI: 10.1007/978-3-642-39038-8_13.

[22] Vorobiev E.G., Petrenko S.A., Kovaleva I.V., Abrosimov I.K. Organization of the entrusted calculations in crucial objects of informatization under uncertainty. In Proceedings of the 20th IEEE International Conference on Soft Computing and Measurements (24-26 May 2017, St. Petersburg, Russia). SCM 2017, 2017, pp. 299-300. DOI: 10.1109/SCM.2017.7970566.

[23] Vorobiev E.G., Petrenko S.A., Kovaleva I.V., Abrosimov I.K. Analysis of computer security incidents using fuzzy logic. In Proceedings of the 20th IEEE International Conference on Soft Computing and Measurements (24-26 May 2017, St. Petersburg, Russia). SCM 2017, 2017, pp. 369-371. DOI: 10.1109/SCM.2017.7970587.

[24] Petrenko A.S., Petrenko S.A., Makoveichuk K.A., Chetyrbok P.V. The IIoT/IoT device control model based on narrow-band IoT (NB-IoT). In Proceedings of the 2018 IEEE Conference of Russian Young Researchers in Electrical and Electronic Engineering (29 Jan.-1 Feb. 2018, Moscow and St. Petersburg, Russia) EIConRus, IEEE, 2018, pp. 950-953. DOI: 10.1109/EIConRus.2018.8317246.