Gathering Malware Data through High-Interaction Honeypots (DISCUSSION PAPER) Angelo Furfaro1[0000−0003−2537−8918] , Francesco Lupia2[0000−0003−0775−6890] , and Domenico Saccà1[0000−0003−3584−5372] 1 University of Calabria, Rende(CS), Italy {angelo.furfaro, domenico.sacca}@unical.it 2 OKT srl, Rende(CS), Italy francesco.lupia@okt-srl.com Abstract. The widespread and ever increasing number of services and devices which expose their interfaces to the Internet make the cyberspace a fertile ground for malware activities. Hence there is a strong demand for cybersecurity solutions ensuring their safe operation. Honeypots are networked computer systems purposely designed and crafted to mimic regular services, operating systems and devices with the goal of capturing and storing information about the interactions with attacking entities and we repute them a crucial technology in the study of cyber threats and attacks. We presents the main features of EMPHAsis, a data streaming analytics system based on high-interaction honeypots, which enables the collection and analysis of relevant data about intercepted malware. Keywords: Honeypots · Cybersecurity · Data collection · data stream- ing analytics. 1 Introduction The era of Internet of Things with billions of connected devices has created an ever larger space for cyber attackers to exploit. In particular, there is a widespread amount of automated bots scanning and probing the Internet to search for vulnerable hosts. This has resulted in the need for fast and accurate detection of possible system vulnerabilities and attackers by means of the pro- cessing of the high-velocity, high-volume data from various sources to discover anomalies and/or attack patterns as fast as possible to limit the vulnerability of the systems and increase their resilience. Such data, generated in a real-time data stream, need a big data streaming analysis [13, 9]: the output must be gen- erated with low-latency and any incoming data must be reflected in the newly generated output within seconds. In other words, big data streaming analytics Copyright © 2020 for this paper by its authors. Use permitted under Creative Com- mons License Attribution 4.0 International (CC BY 4.0). This volume is published and copyrighted by its editors. SEBD 2020, June 21-24, 2020, Villasimius, Italy. tools must be able to identify new information, incrementally build models and access whether the new incoming data deviate from model predictions. Even though many big data streaming analytics tools have been developed in the past few years, their usage in the field of cybersecurity is not immediate and calls for new approaches as pointed out in [5]. The main issue is how to access in real time to valuable data on possible attackers, typically log files and security alerts generated by operating systems and applications across various hosts and systems. A crucial point is to provide computer systems with software tools to identify suspicious event log activity, such as repeated failed login attempts, ex- cessive CPU usage, large data transfers and immediate alert IT security analysts when a possible Indicator of Compromise (IoC) is detected. It should be noted that the techniques and attack methods employed by malwares are typically very simple, In fact, the most used attack method is often brute-forcing login credentials of servers and, consequently, the most heavily attacked services are Telnet, FTP and SSH services. There are many tools used in computer security to catch such malicious actions and one effective solution involves the usage of honeypots. Honeypots [12] are decoy systems which aim at emulating real services on the net in order to detect and attract malicious agents. These emulated services are publicly displayed and their access is made voluntarily simple to facilitate attackers’ intrusions. This is done for example by configuring accounts and services with weak credentials. After accessing the service, all the activities of the malicious agents are subject to monitoring and logging and become the object of study by specialists in the sector, in order to be able to reconstruct with extreme precision the behavior of the attacker and thus be able to prepare prevention measures to reduce the risk of future attacks. Furthermore, it is possible to obtain valuable information concerning the behavior of the attacker, the actions carried out on the target system and the types of vulnerabilities exploited to complete the malevolent activity, which can be subsequently shared with the academia and industry researchers. Here we present the main features of EMPHAsis, a distributed system con- ceived to support malware detection by acting as a high-interaction honey- pot [11]. The system is able to collect and disseminate information about new threats that proliferate every day on the Internet by providing researchers with fresh data that could help them to devise possible countermeasures against ma- licious traffic. In addition to its detection, logging and monitoring capabilities, EMPHAsis is capable to capture different malware binaries and to exploit ex- ternal services for analyze them. 2 System Architecture While malware’s actions may vary depending on the targeted system, there are only few points of interest that need to be observed and analyzed: network information, the commands executed, processes created, kernel drivers loaded and the files created or modified by the malicious agent. Each attacker’s action is processed by EMPHAsis according to the functional architecture depicted in Figure 1 discussed next. The system operates by constructing dynamically Agent1 Agentn Proxy Scheduler Sandbox Proxy Scheduler Sandbox Storage Management and gui tools Core Fig. 1. EMPHAsis architecture virtual environments (sandboxes) to which all traffic generated by the attacker is redirected out. Each environment is equipped with several specialized probes, whose goal is to monitor in a completely transparent way to the malicious entity, specific critical events that occur in the attack scenario at hand. All the data collected by the probes is eventually stored and indexed for subsequent use (e.g., in order to devise appropriate countermeasures) and possible disclosure. In more details, the EMPHAsis architecture consists of several modules each of which is in charge of performing a specific task. The core is the most im- portant module which plays the role of coordinating the operations between all the other modules of the system. Furthermore, it provides a graphical environ- ment for managing the system, to help real-time monitoring and to visualize the collected data. When a connection attempt reaches the system, the agent module checks whether there is a configured service to handle the request. If a suitable service exists, a virtual trap environment is built and executed on the fly and all the attacker’s traffic is then redirected out to it. The upper part of Figure 2 shows the main steps performed by the agent module in response to an attack. In more detail, as a preliminary step the proxy component notifies a new connection arrival to the core module while collecting also information about the geographical origin of the attack. Then, it continues by performing payload detection and if a suitable service is found, the request is handled and the scheduler component is allerted. The scheduler is the main responsible for the virtual trap environment setup. After receiving an alert from the proxy, it prepares and starts the sandbox and then injects one or more probes directly into the isolated environment. These probes will monitor all the actions performed by the attacker (or malware) inside the sandbox. In particular, in the lower part of Figure 2 is shown an example of a specific case where four probes are in place and used for monitoring networking events, commands executed in the system Attacker Proxy 4. Attacker's traffic redirection 1. Notifies connection Scheduler 3. Probe injection 2. Virtual trap enviroment setup Sandbox Agent Network Shell Process FileSystem Probe Probe Probe Probe Network traffic Shell commands Process creation Filesystem changes Interception Interception Interception Interception Exposed service (e.g., SSH) Logged events are sent to the database Sandbox Fig. 2. Agent module shell, process creations, and changes to the filesystem. It should be noted that the concrete implementation of these probes poses technical challenges. Indeed, one needs to use specific tools and instruments according to the type of operating system used by the emulated guest machine. Our probes support a custom 32- bit Linux kernel and we rely on QEMU-KVM to emulate this platform. Further implementation details can be found in [6]. It is interesting to notice that our implementation of the agent exploits a fake interactive session where attacker’s traffic is man-in-the-middle proxied. This is achieved by using a custom shell implementation. The storage module comprehends one or more nodes in a redundant config- uration whose role is to offer all standard operations for storage and subsequent consultation of the collected data. An attack can produce a significant amount of information, which therefore needs to be properly cataloged and organized for subsequent consultation. To this end, our system exploits appropriate com- ponents for the storage and processing of such information, making it suitable for the production of detailed reports and in characterizing the threats detected. Additionally, the above processing can be customized by the user for the spe- cific resources she is most interested in. In particular, an individual storage node uses both a MongoDB NoSQL database [1] and an Elasticsearch database [3]. The MongoDB holds the users’ credentials required to access the control and the data analytic dashboard, stores the configuration (including custom kernel images and the probes) which will be downloaded by the agent module during the initialization phase of the sandboxes. and stores the captured malware sam- Fig. 3. Main panel of the EMPHAsis dashboard ples. The system will then send over all the security events described above to the Elasticsearch database. Such events can be visualized by the built-in dash- board (see section 3) which exploits a Kibana instance for advanced filtering mechanisms. A key aspect in order to achieve effective results (in terms of number of malicious actions logged and malware instances intercepted) concerns the correct positioning of the honeypots in the infrastructure to be protected. Thus, there is significant motivation for studying the best possible locations where agent modules (and honeypots in general) can be deployed in a network. Interestingly, EMPHAsis agents can be deployed both behind or outside the perimeter of the network, therefore allowing to detect also potential insider threats. In any case, deploying honeypots in a production environment can be problematic since it could expose the network to far greater risks than the threats from which it is intended to be protected. Towards this end, it is essential to study and analyze the way honeypots interact with the rest of the system before the actual deployment. This could be done by using virtual simulation environments, such as the one described in [7] which allow to reproduce multiple operating systems as well as networks in a realistic and controlled way. 3 Collected Data All of the security events collected through the probes are sent to the storage module and stored in the Elasticsearch database allowing for fast real-time access through the dashboard. More precisely, we construct a single index which is shared by every probe of any agents running in the system. Figure 3 shows the main panel of the dashboard which displays the information about the active sandboxes, under the control of the core module, and the number of intercepted attacks. The lower part of the panel reports some statistics about the last five days of operation: the percentage of sandboxes launched, grouped by their type, and the top-10 IP addresses from where the majority of connections to the sandboxes have been established. Fig. 4. Shell log tab It is possible to inspect the details specific to each sandbox and to access the data collected by the active probes. Figure 4 shows a panel, for a given sand- box, which is organized in more tabs each of which reports specific information according to the following categories: shell, network, filesystem and processes. For those sandboxes related to services based on interactive sessions, e.g. ssh or ftp, the flow of messages exchanged (i.e. commands executed and responses) between the attacking entity and the sandbox, is available in the shell logs tab (the one active in the panel of Figure 4). Figure 4 depicts a scenario where the attacker, once logged as the admin user, successfully attempted to download a shell script from a remote host and then executed it inside the sandbox shell environment. Figure 5 shows the network log tab, which allows to visualize the information about the network events relevant to the service exposed by the sandbox. In the specific case reported in the figure, some ssh login attempts, part of a brute-force attack directed to gain access as the root user, are visible. When a malware successfully gain access to a sandbox, from where it is possible to modify the content of the local filesystem, e.g. from a ssh session, such modifications are tracked and the relevant events (i.e. filesystem related system calls invocations) displayed in the filesystem log tab as shown in Figure 6. Such events are grouped on the basis of the specific path and ordered by the timestamp of execution. In a similar way, the processes log tab displays the time intervals when pro- cesses launched by a malware from inside the sandbox are executed. The EMPHAsis probes are able to capture and store in the system database all the files dropped by an attacker (e.g. shell scripts, binary files, source code) during a session. Before being stored, such files undergo a basic analysis by re- Fig. 5. Network log tab Fig. 6. Filesystem log tab sorting to the malware identification and classification services offered by Virus- total [14] and those developed by Cythereal [2]. In particular the Cythereal MAGIC API [2] exploits a state-of-art machine-learning based malware analy- sis system which is able to identify the belonging family of a malware even in presence of sophisticated code obfuscation techniques. 4 Conclusions and Future works We presented the main functionalities of EMPHAsis and it practical applica- bility for malware capturing, analysis and prevention. EMPHAsis exploits high- interaction honeypots and has a modular and extensible architecture making it effective and suitable for various practical deployment scenarios. Currently EM- PHAsis features the emulation of Linux based systems on top of which the great majority of Internet exposed services are based on. Moreover, it can be easily extended to support other unix-based systems. As a future work, we plan to sup- port also computer systems based on Microsoft Windows or on Apple macOS. Another important direction of improvement is the integration of EMPHAsis with more advanced data analysis tools specifically tailored to cybersecurity events, i.e. the so called security information and event management systems (SIEM), like IBM QRadar [8] and ElasticSIEM [10], which has been recently developed on top of Elasticsearch. We also plan to extend the interoperability of EMPHAsis with other malware analysis tools, like for example cuckoo [4] san- box, and to integrate in it virtual environment technologies for building more realistic and complex decoy environments [7], e.g. honeynets, and thus increasing the effectiveness in capturing more sophisticated malware. Acknowledgments This research has been supported by the ISCOM - Italian Ministry of Economic Development under agreement EMPHAsis (CUP B51G17000300006). References 1. Cuckoo. https://cuckoosandbox.org/. 2. Cythereal, changing the rules of cyber engagement. http://www.cythereal.com. 3. Elasticsearch. https://www.elastic.co/products/elasticsearch. 4. Mongodb. https://www.mongodb.com. 5. Pelin Angin, Bharat K. Bhargava, and Rohit Ranchal. Big data analytics for cyber security. Security and Communication Networks, 2019:4109836:1–4109836:2, 2019. 6. Michele Bombardieri, Salvatore Castanò, Fabrizio Curcio, Angelo Furfaro, and He- len D. Karatza. Honeypot-powered malware reverse engineering. In 2016 IEEE In- ternational Conference on Cloud Engineering Workshop, IC2E Workshops, Berlin, Germany, April 4-8, 2016, pages 65–69, 2016. 7. Angelo Furfaro, Luciano Argento, Andrea Parise, and Antonio Piccolo. Using virtual environments for the assessment of cybersecurity issues in iot scenarios. Simulation Modelling Practice and Theory, 73:43–54, 2017. 8. IBM. QRadar. https://www.ibm.com/security/security-intelligence/qradar. 9. Taiwo Kolajo, Olawande Daramola, and Ayodele Adebiyi. Big data stream anal- ysis: a systematic literature review. J. Big Data, 6:47, 2019. 10. Mike Paquette. Introducing Elastic SIEM. https://www.elastic.co/blog/introducing-elastic-siem, 2019. 11. Lance Spitzner. The honeynet project: Trapping the hackers. IEEE Security & Privacy, 1(2):15–23, 2003. 12. Lance Spitzner. Honeypots: Catching the insider threat. In 19th Annual Computer Security Applications Conference (ACSAC 2003), 8-12 December 2003, Las Vegas, NV, USA, pages 170–179, 2003. 13. Nicoleta Tantalaki, Stavros Souravlas, Manos Roumeliotis, and Stefanos Katsavou- nis. Linear scheduling of big data streams on multiprocessor sets in the cloud. In Payam M. Barnaghi, Georg Gottlob, Yannis Manolopoulos, Theodoros Tzourama- nis, and Athena Vakali, editors, 2019 IEEE/WIC/ACM International Conference on Web Intelligence, WI 2019, Thessaloniki, Greece, October 14-17, 2019, pages 107–115. ACM, 2019. 14. Virustotal. How it works. https://support.virustotal.com/hc/en- us/articles/115002126889-How-it-works.