Distributed System of Intelligent Content Monitoring Agents

Artem Soboliev 1, Dmytro Lande 1

1 National Technical University of Ukraine “Igor Sikorsky Kyiv Polytechnic Institute”, Peremohy Avenue, 37, Kyiv, 03056, Ukraine

Abstract
The coverage and generalization of the large dynamic information flows constantly generated in the Internet space requires qualitatively new methods and approaches to ensuring the completeness, accessibility and reliability of the target information. In content monitoring of the Internet, it is important to use tools for distributed monitoring of global network content and to create multiple interfaces between agents that control the collection of information from different Internet segments. This paper proposes methods and tools for monitoring the distributed content of social networks, taking into account constant changes in the availability of certain segments of the Internet. The proposed solutions are used to populate the content monitoring databases of the InfoStream and CyberAggregator systems for web resources and social networks. A system of intelligent content monitoring agents based on several servers located in different data centers is offered. The intelligent agent system interacts with a management and control system, which ensures an appropriate level of fault tolerance, completeness and reliability of the received information.

Keywords
Information resources, social networks, intelligent agent information gathering, distributed content monitoring, intelligent agent system

1. Introduction

Currently, the level of tasks solved by Internet content monitoring systems is constantly growing: from traditional information retrieval tasks to the management, design, modeling and forecasting of various processes and events. The amount of accumulated information is becoming gigantic (Big Data). Consequently, when creating content monitoring systems, it is advisable to take into account the peculiarities of access to certain segments of the Internet. There is a need for unconventional approaches to the use of information technology and mathematical methods for collecting, processing and analyzing information [1, 3].

Internet resources and social networks have become a convenient and effective means of communication. They provide great freedom of action in the information space, which is mostly open and accessible. When used effectively, Internet resources become a powerful source of information for analytical work and open-source intelligence (OSINT) [3, 4], and at the same time provide an opportunity to obtain strategically important, expert and publicly available information, in particular on security issues, which makes it possible to assess the mood of society in a particular information field. Information from open sources is of great importance for determining the directions of economic, scientific and technical development, as well as for solving problems in the spheres of security and defense [1]. In the global information and technological environment, rapid response and early warning systems for challenges and threats based on monitoring information on the Internet, that is, OSINT systems, are being actively improved.
Analysis of the technological and information problems and functional needs of such systems shows that the use of distributed content monitoring tools (in particular, the creation of networks of information proxies) and of a set of interfaces between intelligent information gathering (scanning) agents is important in the process of analyzing information obtained via the Internet. This is due to the following factors:
1. The rapidly growing need for reliable information for management decision-making;
2. The need to take into account the level of accessibility of Internet segments and their features in the formation of content;
3. Different approaches to providing information in different segments of global networks and the constant expansion of software interfaces;
4. The need to reduce the load on computing resources when using software agent systems operating in separate segments of global networks;
5. The need for automated management, design, modeling and forecasting, taking into account the distribution of content across the segments of global networks.

Popular social networks contain a huge amount of data about people's daily lives and social interactions, so they carefully check every request for information and do not always allow third-party services to use this information. When these information services notice unusual behavior of a client (in particular, a software client, a data collection agent) that requests their data, they immediately block its access and additionally check requests coming from the IP addresses of this client. This requires that the clients, the intelligent software agents that collect information for content monitoring systems, exhibit the standard behavior of an average user toward information services. In this way, the availability of both individual services and entire segments of the Internet is maintained. The number of such agents can be quite large, and information interaction should be provided between them. This reduces the load on the target information services in global networks, which also increases their availability.

The purpose of this article is to describe the features of building a system for distributed monitoring of the content of global information networks with the help of intelligent information gathering agents, taking into account that resources belong to different network segments.

2. Presentation of the basic material of the research

Coverage and generalization of the large dynamic information flows continuously generated in the Internet space requires qualitatively new methods and approaches to ensuring the monitoring of their content. Specialized content monitoring systems are used for prompt coverage of the information space. Such systems provide:
1. Responsiveness, which cannot be obtained from traditional search engines, where the time to index online content can vary from a day to several weeks;
2. Completeness, both in terms of covering sources and in providing materials from sources that are not covered by news aggregators;
3. Application of analytical tools for automated design, modeling and forecasting of processes/events.

The large number of multilingual information resources complicates their use in information and analytical work. When analyzing or collecting such data, there are problems of processing extremely large amounts of data and of searching and navigating in dynamic information flows. To solve these problems, technological concepts such as Big Data, Complex Networks, Cloud Computing and Data/Text Mining are used [1]. Problems in the dynamics and dimensionality of the multilingual information resources of global networks require fundamental research in the fields of pattern recognition, discrete mathematics, linguistics, digital signal processing, and wavelet and fractal analysis. Although the modern state of technology in some cases allows the necessary information to be found in networks, there are still unresolved problems of further analytical processing of this information: extraction of the necessary factual data, determination of trends in the development of certain subject areas, identification of relationships between objects and events, recognition of significant anomalies, forecasting, etc. Many of these problems are topical issues of the semantic processing of ultra-large dynamic text arrays. Today, attempts to solve these problems in practice determine the success of such projects as the search engines Baidu and Yandex, the social network monitoring systems Google Keyhole, Brandwatch and CyberAlert, and the analytical systems Palantir and Centrifuge.

One of the proposed approaches to solving such problems is based on the InfoStream system for content monitoring of web resources [5] and the CyberAggregator system for social networks [13]. Modernization and scaling of these systems (formation of multilingual full-text databases, modeling of information flows in huge computer networks) takes into account the software and hardware placement of the "essence" of the segments of the information space, that is, the deployment of a network of information proxy servers built on the basis of distributed content monitoring and the use of a system of intelligent agents for collecting information.

Let us consider the need for distributed content monitoring tools. It would seem that at the primary level one could use the data available through traditional network search engines hosted on the servers of well-known news integrators. But in this case there are a number of problems that prevent further serious use of network resources for analytical work [7]:
• Not all resources are available in the national segment; in particular, there is no access to some foreign sites and some social networks.
• Traditional search engines do not always index news posted on the deep levels of websites, news is not always indexed by them in time, and social networks and special databases posted on the Internet are poorly covered; there is the problem of the Deep Web.
• In some cases, when access is not anonymous, websites or social networks involved in information wars can provide distorted information and fakes to ordinary users.
• In some cases, access to information may be denied even if the information has the status of open to all.
• In addition, requests that satisfy the information needs of analysts, if transmitted in an unprotected form, can disclose these needs to an interested party, an information adversary.
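One simple way to detect the vantage-dependent distortion mentioned above is to request the same resource from agents located in different network segments and compare the responses. Below is a minimal sketch of such a check, assuming the Python requests library and hypothetical proxy endpoints; the names are illustrative and are not components of the systems described in this paper.

```python
import hashlib
import requests

# Hypothetical egress points in different network segments; in practice these
# would be the distributed collection agents or information proxy servers.
PROXIES = {
    "segment-eu": "http://proxy-eu.example.org:3128",
    "segment-us": "http://proxy-us.example.org:3128",
}

def fingerprints(url):
    """Fetch the same resource through different segments and fingerprint the responses."""
    result = {}
    for segment, proxy in PROXIES.items():
        resp = requests.get(url, proxies={"http": proxy, "https": proxy}, timeout=10)
        result[segment] = hashlib.sha256(resp.content).hexdigest()
    return result

# Differing fingerprints suggest that the served content depends on where the request comes from.
print(fingerprints("https://example.org/news"))
```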
To solve these OSINT tasks, it is necessary to use modern integrated systems that are characterized by the following features:
• In order to ensure the simultaneous process of obtaining information from social networks without the use of third-party paid services, and to control and manage such a system from a single place, it is proposed to introduce teams of agents that can download and exchange such information with each other, ensure the integrity of the received data and distribute the load among themselves. This overcomes the complexity of distributed content monitoring of information resources on the Internet [8].
• Distributed collection of information from websites and social networks is performed using ensembles of intelligent collection agents distributed in a cloud environment that geographically spans different countries. These agents must interact, exchange information and pass this information to the analytical part of the OSINT system.
• Information retrieval agents should execute pre-programmed and customized information gathering scenarios and interact with websites, social networks, deep web databases and news aggregators, preferably (if possible) in an anonymous mode.
• The use of information retrieval agents as the basis of a system of information proxy servers should ensure the completeness of information in case individual agents are blocked, and should prevent distortion and duplication of the information transmitted to the OSINT system databases.
• To prevent information leakage, OSINT analysts should use anonymization, masking, VPN, etc. during data extraction and processing.

According to [4], OSINT principles are based on the continuous collection of information from publicly available sources, which is then analyzed, prepared and delivered to the customer in a timely manner. Timely intelligence tasks are solved on the basis of the systematic collection and processing of publicly available information obtained through OSINT. The basis of cybersecurity using the OSINT principle is determined by a number of aspects, including the speed and cost of obtaining information, its volume, quality, reliability, convenience of further use, etc. The process of planning and preparation for OSINT management depends on the following factors [3]:
1. Efficiency in information support is achieved by collecting information from the Internet: user-generated content, hashtags, geo-tags, etc.;
2. The relevance, depth, availability and volume of publicly available information make it possible to find the information necessary for intelligence without involving other specialized intelligence tools;
3. Simplification of data collection processes. OSINT provides the necessary information, eliminating the need to attract unnecessary technical and human resources;
4. Depth of data analysis. As part of the intelligence process, OSINT enables in-depth analysis of publicly available information in order to make appropriate decisions;
5. Efficiency. Dramatic reduction of the time needed to access information on the Internet and fast receipt of valuable operational information. A situation that changes rapidly during crises is most fully reflected in the current news;
6. Volumes. The possibility of mass monitoring of certain information sources in order to find targeted content, people and events;
7. Quality. Compared to the reports of special forces, information from open sources is devoid of subjectivity;
8. Reliability;
9. Ease of use.
OSINT data can easily be transferred to any interested authorities, since they are open;
10. Cost. The cost of obtaining data in OSINT is minimal.

There is a problem associated with the large amount of information received from social networks and with the analysis of these data, which are dynamic and subject to constant change. This problem, and the ways to overcome it, is today referred to as Big Data. In this case, it is problematic to implement the functions of collecting, cleaning, storing, searching, accessing, transmitting, analyzing and visualizing such data sets as a complete whole rather than as local fragments. The defining characteristics of big data are the "three V's": volume (the physical volume of data), velocity (the rate of data growth and the need for high-speed processing and retrieval of results) and variety (the ability to process different types of structured and weakly structured data at the same time) [13].

2.1. System Functionality

There are problems with processing the large volumes of circulating information needed to search and navigate in dynamic data streams during the collection and analysis of open data from the Internet. The large number of multilingual dynamic information resources, as well as the dominance of information noise, complicates the search for the necessary information in operational analysis, and hence the use of open sources in information and analytical work in general. Most of the above problems are topical issues of the semantic processing of large dynamic text arrays. Nowadays, technological concepts such as Big Data, Complex Networks, Cloud Computing and Data/Text Mining are used to solve these problems. In cybersecurity, the ontology approach is increasingly used to build models of subject areas [13].

A practical solution to the problem of creating such a corporate system is the simultaneous use of methods and tools for searching, analyzing and aggregating data from information flows. A system for monitoring and analyzing social media has been created that automatically processes full texts from social networks for a certain period on the topic of cybersecurity. Information is scraped from social networks (blogs, various social networks, websites, messengers, etc.) in search mode. Queries (search key phrases for the relevant social network; otherwise an account is required) are read by the software from special configuration tables. Next comes the search for and display of records that match the corresponding queries. After that, unique records are written to the server database.

Analysis of existing approaches to the aggregation of thematic news has shown the need for, and the possibility of, creating a set of tools for monitoring the content of social networks on specific issues, in particular cybersecurity. The described system includes personalization tools that provide online access to the databases, including from mobile devices, for which the possibilities of RSS formats are widely used. The choice of off-the-shelf software modules is substantiated, the development tools (scrapers for social networks, tools for generating dynamic RSS feeds) are described, and the results of their integration into the system are presented [13]. The intelligent agent system interacts with the management and control system, which ensures an appropriate level of fault tolerance, completeness and reliability of the information received.
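To make this workflow concrete, the following is a minimal sketch of the collection step described above: search phrases are read from a configuration table, executed against a source, and only previously unseen records are kept. The file name, the search_fn callable and the in-memory store are hypothetical placeholders for the actual scrapers and the server database.

```python
import hashlib
import json

def load_queries(path="queries.json"):
    # Hypothetical configuration table: search phrases grouped by social network.
    with open(path, encoding="utf-8") as f:
        return json.load(f)

def record_key(rec):
    # Uniqueness key built from the source network and the document URL.
    return hashlib.sha256(f"{rec['network']}|{rec['url']}".encode("utf-8")).hexdigest()

def collect(search_fn, queries, store):
    """Run every configured query and keep only records not seen before."""
    new_records = 0
    for network, phrases in queries.items():
        for phrase in phrases:
            for rec in search_fn(network, phrase):  # placeholder scraper call
                key = record_key(rec)
                if key not in store:                # uniqueness check
                    store[key] = rec                # stand-in for writing to the server database
                    new_records += 1
    return new_records
```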
As a basis for the control, management, load synchronization and interaction of the intelligent system of data collection agents developed by the authors, the document-oriented database management system MongoDB is used, which stores information about the agents, the lists of their corresponding network sources, and the algorithms of agent behavior. The MongoDB NoSQL database can be easily deployed on most popular operating systems, is undemanding to resources and can withstand heavy loads.

To provide interaction between a group of agents obtaining information, it is necessary to use a database that allows agents to be synchronized and launched according to the settings of the intelligent system. This database is used for continuous writing and reading of data. Since data volumes in dynamically growing systems tend to increase rapidly, a situation may arise in which the current resources of a single machine are not enough for normal operation. To solve this problem, scaling is used. Scaling is of two types: horizontal and vertical. Vertical scaling means increasing the power of one machine by adding CPU, RAM and HDD. Horizontal scaling means adding new machines to the existing ones and distributing data between them. The first case is the simplest because it does not require additional program settings or any additional database configuration, but its disadvantage is that it is not suitable for distributed intelligent agents, where each server must run a replica of the database and it must be easy to connect additional servers when necessary. Therefore, in our case horizontal scaling is used, which has the following advantages: almost unlimited scaling (as many machines as needed can be included) and better data security (only if replication is used), since machines can be located in different data centers (if one of them fails, the others remain).

In this scheme, in addition to sharding, there is segment replication. Let us say a few words about it. All write, delete and update operations go to the primary and are then written to a special collection, the oplog, from where they are asynchronously transferred to the replicas repl.1 and repl.2 (secondaries). Thus, data duplication occurs. Why is it necessary?
1. Redundancy provides data security: if the primary fails, a vote is taken among the replicas and one of them becomes the primary.
2. The primary and replicas can be located in different data centers, which is useful if a server is physically damaged (for example, a fire in a data center).
3. Replicas can be used to read data more efficiently. For example, an application may have clients in Europe and the USA; one replica can be placed in the United States and configured so that American clients read data from it. It should be noted that documents arrive at replicas with a delay, and it is not always possible to immediately find a document recently written to a replica. Therefore, this point is an advantage only if the program logic allows reading from replicas.
4. The replica set scheme is often used in serious production applications where data security is important or there is a large number of reads and the application logic allows reading from replicas.
We will not dwell on this scheme in detail, because it can be the subject of a separate paper.
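As an illustration of the scheme described above, the following is a minimal sketch of connecting to such a replica set with the pymongo driver; the host names and the database and collection names are hypothetical, while replicaSet, readPreference and the write concern are standard MongoDB connection options.

```python
from pymongo import MongoClient

# Hypothetical replica-set members, one per data center; writes go to the
# primary with majority acknowledgement, reads may be served by a nearby secondary.
client = MongoClient(
    "mongodb://mon-nl.example.org,mon-ua.example.org,mon-us.example.org"
    "/?replicaSet=rs0&readPreference=secondaryPreferred&w=majority"
)
db = client["monitoring"]

# Registry of agents, their assigned sources and behavior settings.
db.agents.update_one(
    {"agent_id": "agent-ua-01"},
    {"$set": {"sources": ["twitter", "telegram"], "status": "active"}},
    upsert=True,
)
print(db.agents.count_documents({"status": "active"}))
```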
In order to ensure the simultaneous process of obtaining information from social networks without the use of third-party paid services, as well as to control and manage such a system from a single place, it is proposed to introduce agent teams that can download and exchange data with each other, ensure the integrity of the data received and distribute the load among themselves.

Access control systems in popular social networks are focused on detecting unusual user behavior. In order to avoid such "non-standard behavior", special algorithms for the agents' access to such networks were created, which take into account the amount of time spent in these networks, the amount of information collected per request, and the amount of information provided to the agent by the social network. Certain attention in such services is also paid to the behavior of clients at night (in the region from which the request is made): during this period the number of requests from a client should be minimal, otherwise such clients become an object of investigation.

In the pilot model of the system of intelligent agents for collecting information with distributed interaction (the "Nabla" system, ∇), the authors used three servers geographically located in different data centers, at a great distance from each other (Fig. 1):
1. Netherlands;
2. Ukraine;
3. United States of America.

Figure 1: Scheme of distributed information retrieval based on 3 servers

The MongoDB database cluster, the intelligent management system and the agent teams for information retrieval are deployed on these servers. For management and interaction between agents, the HTTPS protocol is used, as it is the most popular protocol on the global Internet, allows commands for network agents to be optimized quickly and provides an appropriate level of security. A RESTful service architecture is also used as the basis for these agent teams, which makes it possible to configure and monitor their operation efficiently. Heartbeat-style system messages form the basis of their interaction, which makes it possible to track the life cycle of the agents and, if any of them fails, to identify the problem quickly without losing the data already received.

The proposed ∇ teams of agents represent a high-availability cluster in which, if one agent fails, its functions are taken over by another available agent. Thus, the process of obtaining information from social networks continues without interruption thanks to the intelligent management and control of these agents. In order to build a fault-tolerant structure, at least two physical servers with storage systems are required, so an auxiliary third server is used to provide fault tolerance for the two servers, which allows efficient use of resources and even load balancing between them. The intelligent system for managing and controlling network agents also operates on the following principle: when one agent fails, another is automatically included in the work, with a message about the agent failure.

A separate agent in the system normally operates according to the following scenario (Fig. 2). The agent has a local registry of collected documents, which is constantly synchronized with the general registry stored in the MongoDB DBMS. According to the time schedule, the agent selects the task of checking an information resource or part of it and consults the registries to avoid repeated scanning. Then, if necessary, the information is collected and loaded into the information proxy.
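A minimal sketch of one pass of this cycle, including the heartbeat reporting mentioned above, is given below; the manager endpoint and the next_task, already_collected, fetch and store callables are hypothetical placeholders rather than the actual interfaces of the ∇ system.

```python
import time
import requests

MANAGER = "https://manager.example.org/api"  # hypothetical control/management endpoint

def heartbeat(agent_id):
    # Periodic status report; the manager marks an agent as failed when its
    # heartbeats stop arriving and reassigns its tasks to another agent.
    requests.post(f"{MANAGER}/heartbeat",
                  json={"agent": agent_id, "ts": time.time()}, timeout=5)

def run_cycle(agent_id, next_task, already_collected, fetch, store):
    """One pass of the scan cycle: registry check, collection, loading to the proxy."""
    heartbeat(agent_id)
    task = next_task(agent_id)          # task chosen according to the time schedule
    for url in task["urls"]:
        if already_collected(url):      # lookup in the local/general registry
            continue                    # avoid repeated scanning
        store(url, fetch(url))          # load the new document into the information proxy
```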
Meta-information is written to the local and general registries. The agent then selects a new task, and so on. The process can end only at the command of the system administrator.

Figure 2: Scheme of the information collection agent functioning

The general logic of the agent cluster is created at the level of software protocols and allows users to:
1. Manage all network agents with a single intelligent module;
2. Add and update software and hardware resources without system shutdown or major architecture changes;
3. Ensure uninterrupted system operation in case of failure of one or two agents;
4. Synchronize data between clusters of agents;
5. Efficiently distribute requests to agent clusters;
6. Use a common database of agents.

One of the servers in the cluster is the central server of the cluster. The central server, in addition to serving client connections, manages the entire cluster and maintains a cluster registry for this purpose. When a connection is established, the agent communicates with the cluster’s central server. The central server, based on an analysis of the agent's workload statistics, directs it to the specific workflow it should perform. Thus, the main task of the agent cluster is to eliminate system downtime and to report how much the agents collected themselves and how much they took from other agents. Ideally, any incident related to external interference or the failure of an internal resource should allow the system to continue working.

In the distributed system under consideration, shared memory through which different agents could exchange data is almost never used, so traditional synchronization and communication methods can be considered excluded. The system is a set of autonomous agents that are logically combined into communication networks to perform the task of collecting and exchanging data and information about them. With the help of individual agents and MongoDB-based control and management tools, distributed actions are coordinated and information is exchanged.

The considered system of intelligent agents for collecting information, which performs distributed content monitoring, is currently used in the InfoStream and CyberAggregator systems, which distribute monitoring by language (Ukrainian, English, German, Italian, Spanish, French, etc.) and by segments of the Internet (web resources; social networks: YouTube, Twitter, Telegram, Facebook, etc.). In this case, information proxies, covering the information collected by their respective teams of intelligent agents, process individual segments of the Internet. This approach reduces the load on computing resources by distributing the power of processor modules and global network segments.

Models built using data mining algorithms can be used to make decisions about connecting an additional source. For example, when outliers appear in the time series of the main data stream, a model that identifies such outliers can be used to determine whether it is an error. The model can be built from previously collected data and from emergent situations resolved by experts. In addition, if there is feedback from an analyst while data are being received, the model can learn, thus adapting to current conditions. To make a decision on connecting a new source, it is proposed to use methods of intelligent analysis.

3. Intelligent collection of information for analysis

Raw data analysis can be an alternative to collecting all the data before performing the analysis. This approach includes the following key steps:
1. Selecting the main data sources;
2. The primary analysis is performed based on information from these data sources;
3. The query for each data source is prioritized; this can be done based on different criteria;
4. Other data sources are requested, depending on requirements, if information from the basic data sources is not sufficient for the preliminary analysis and/or the probability that the results are correct is low (e.g. outliers and/or jumps are detected in the data);
5. Data processing and evaluation of the results are performed separately from the main task, if possible close to the data source (preferably at the node where the data are located or in its local network).

Internet-based solution systems use cloud computing technologies to analyze data and to address problems with computing resources [9]. The cloud provides scalable computing resources and other tools for creating analytical services. However, this approach retains the listed disadvantages. To address them, Cisco proposed the concept of fog computing [10], which extends cloud computing closer to the sources. Fog computing completely solves or reduces the impact of a number of common problems of distributed systems:
1. High network latency;
2. Scalability of information sources;
3. Difficulties associated with endpoint mobility;
4. High cost of bandwidth;
5. Large geographical distribution of systems.

Despite the advantages and popularity of the fog computing concept, there are no ready-made solutions for its implementation. This is explained both by the youth of the concept and by its high level of abstraction. One of the solutions corresponding to the concept of fog computing is distributed data analysis based on intelligent agents (actors) [11]. It can be used for both cloud and fog computing. This approach allows data mining algorithms to be split into "pure" functions and executed on distributed sources. The data mining algorithm is represented as a sequence of function calls; for their parallel execution, a function is added that allows the algorithms to be parallelized. For execution in a distributed environment, the data mining algorithm decomposed into functions is mapped onto a model of intelligent agents (actors). Thus, the distributed data analysis algorithm is represented as a set of agents that exchange messages with the main agent. Intelligent agents (actors) transfer part of the computation to the sources, which improves analysis performance and reduces the network traffic between the sources and the cloud. However, this approach has some limitations: it does not allow data sources to be prioritized or data to be queried according to their priorities. In addition, the cost of queries to data sources is not taken into account [12].

Consider a simpler system of two agents (Agent 1 and Agent 2), shown in Figure 3, which collect information from an information resource on behalf of two users (User 1 and User 2). The collected information is placed on proxy servers (Proxy 1 and Proxy 2). The information needs of the users are represented by query packages (Query 1 and Query 2). If the agents submit these requests to an information resource (for example, the social networks YouTube or Twitter), they may receive sets of documents R1 and R2, which may overlap: the same documents may satisfy the conditions of different requests.

Figure 3: A simpler two-agent information collection system

The rational application of the dual-agent system is to avoid double collection of the same documents due to the interaction of the agents.
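One way to organize this interaction is for an agent to consult a shared registry of collected documents before querying the external resource, so that documents already held by the other agent are taken from its information proxy instead. The sketch below assumes hypothetical registry, proxy_store and fetch_external interfaces rather than the actual components of the described system.

```python
def fetch_for_query(query_ids, registry, proxy_store, fetch_external):
    """Collect documents for one query package, reusing what the other agent already holds."""
    results = []
    for doc_id in query_ids:                        # document ids matching the query package
        if doc_id in registry:                      # already collected by the other agent
            results.append(proxy_store.get(doc_id))   # read from the information proxy
        else:
            doc = fetch_external(doc_id)            # only new documents hit the external resource
            proxy_store.put(doc_id, doc)
            registry.add(doc_id)
            results.append(doc)
    return results
```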
For both request packages, documents already collected by one agent are collected by the second agent not from the external information resource but from the corresponding information proxy. In this case, the benefit K of applying this scheme of information collection can be calculated as the coefficient

K = |R1 ∩ R2| / |R1 ∪ R2|,    (1)

where R1 and R2 are the sets of documents retrieved for the first and second query packages, respectively.

4. Conclusions

This paper presents an algorithm for distributed content search agents based on OSINT fundamentals and demonstrates the main possibilities of its use. At the same time, it becomes obvious how diverse and powerful information retrieval processes can be optimized through the interaction of agents. In addition, a mechanism for building a cluster of distributed agents has been presented, which ensures the correct extraction of information. It is also evident that the OSINT algorithm is constantly changing and improving, as the owners of open information try to provide other systems with only a minimal level of access to their resources. Based on the testing and evaluation of these network agents, which extract content from the web and social networks and are deployed on three servers, it can be concluded that the proposed interaction is effective. The proposed architecture provided the pilot system with an appropriate level of fault tolerance and reliability of the received information and ensured uniform loading of the agent clusters. In addition, the presented system performs the security tasks of the monitoring service, which allows it to ensure the integrity of the received data, bypass limitations in the collection of information, and guarantee the availability of data for monitoring and the completeness of the received information. If the information for any country is changed or distorted, the agents, interacting with each other, will reflect these changes and save all copies of the received data.

References

[1] D. V. Lande, Analysis of information flows in global computer networks, Bulletin of the National Academy of Sciences of Ukraine, No. 3 (2017), pp. 46-54. doi:10.15407/visn2017.03.045.
[2] H. Liu, A. Gegov, M. Cocea, Rule Based Systems for Big Data: A Machine Learning Approach. Heidelberg, Germany: Springer, 2016. doi:10.1007/978-3-319-23696-4.
[3] M. Glassman, M. J. Kang, Intelligence in the internet age: The emergence and evolution of Open Source Intelligence (OSINT), Computers in Human Behavior, 2012, volume 28, No. 2, pp. 673-682. doi:10.1016/j.chb.2011.11.014.
[4] Army Techniques Publication No. 2-22.9 (FMI 2-22.9), Headquarters, Department of the Army, ATP 2-22.9 (Washington, DC, 10 July 2012). URL: https://fas.org/irp/doddir/army/atp2-22-9.pdf.
[5] A. N. Grigoryev, D. V. Lande, S. A. Borodenkov, R. V. Mazurkevich, V. N. Poter, InfoStream. Monitoring News from the Internet: Technology, System, and Service. Kyiv, Ukraine: Start-98, 2007.
[6] S. Choi, B. Bae, The real-time monitoring system of social big data for disaster management, Computer Science and its Applications. Springer, Berlin, Heidelberg, 2015, pp. 809-815. doi:10.1007/978-3-662-45402-2_115.
[7] A. Hannemann, K. Liiva, R. Klamma, Navigation Support in Evolving Open-Source Communities by a Web-Based Dashboard, IFIP International Conference on Open Source Systems. Springer, Berlin, Heidelberg, 2014, pp. 11-20. doi:10.1007/978-3-642-55128-4_2.
[8] A. M. Sobolev, D. V. Lande, Distributed Intelligent Content Extraction Agents from Social Networks, Proceedings of the Scientific and Practical Conference "Information and Telecommunication Systems and Technologies and Cybersecurity: New Challenges, New Tasks". Kyiv: Igor Sikorsky Institute of Cybernetics, 2021, pp. 274-275.
[9] J. Gubbi, R. Buyya, S. Marusic, M. Palaniswami, Internet of Things (IoT): A vision, architectural elements, and future directions, Future Generation Computer Systems, 2013, No. 29, pp. 1645-1660. doi:10.1016/j.future.2013.01.010.
[10] F. Bonomi, R. Milito, J. Zhu, S. Addepalli, Fog computing and its role in the internet of things, Proc. MCC, Helsinki, Finland, 2012, pp. 13-15. URL: https://conferences.sigcomm.org/sigcomm/2012/paper/mcc/p13.pdf.
[11] I. Kholod, I. Petuhov, N. Kapustin, Creation of data mining cloud service on the actor model, Internet of Things, Smart Spaces, and Next Generation Networks and Systems. Springer, 2015, pp. 585-598. doi:10.1007/978-3-319-23126-6_52.
[12] M. S. Efimova, Smart data collection from distributed data sources, Software & Systems, 2019, vol. 32, no. 4, pp. 565-572. doi:10.15827/0236-235X.128.565-572.
[13] D. Lande, I. Subach, A. Puchkov, System of Analysis of Big Data from Social Media, Information & Security: An International Journal 47, no. 1 (2020): 44-61. doi:10.11610/isij.4703.
[14] R. Layton, P. Watters, Automating Open Source Intelligence: Algorithms for OSINT (Rockland, MA: Syngress Media, 2016). URL: www.bookdepository.com/Automating-Open-Source-Intelligence-Robert-Layton/9780128029169.
[15] B. Akhgar, P. Saskia Bayerl, F. Sampson, Open Source Intelligence Investigation: From Strategy to Implementation (Springer International Publishing AG, 2016). doi:10.1007/978-3-319-47671-1.
[16] U. K. Wiil, Counterterrorism and Open Source Intelligence (Wien: Springer-Verlag, 2011). doi:10.1007/978-3-7091-0388-3.
[17] B. J. Jansen, D. L. Booth, A. Spink, Determining the informational, navigational, and transactional intent of Web queries, Information Processing & Management, 2008, volume 44, No. 3, pp. 1251-1266. doi:10.1016/j.ipm.2007.07.015.
[18] G. Gutin, T. Mansour, S. Severini, A characterization of horizontal visibility graphs and combinatorics on words, Physica A: Statistical Mechanics and its Applications, 2011, volume 390, No. 12, pp. 2421-2428. doi:10.1016/j.physa.2011.02.031.