eXtream: a System for Real-time Monitoring of Dynamic Web Sources

Marcos Fernández-Pichel (marcosfernandez.pichel@usc.es)
Rodrigo Martínez-Castaño (rodrigo.martinez@usc.es)
David E. Losada (david.losada@usc.es)
Juan C. Pichel (juancarlos.pichel@usc.es)
Centro Singular de Investigación en Tecnoloxías Intelixentes (CiTIUS), Universidade de Santiago de Compostela, 15782 Santiago de Compostela, Spain

Copyright © 2020 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

ABSTRACT
In this work, we introduce eXtream, a Big Data platform whose main goal is to deploy modular and customisable processing topologies for massive analysis of web data in real time. The system offers a reduced group of pre-installed modules that can be easily combined in a visual way. Additionally, an advanced user can upload new modules and extend an existing topology. This tool facilitates the development of many Information Retrieval and Big Data applications, such as query-based real-time filtering or topic analysis services on Social Media data. To demonstrate it, we have also developed an initial web-based demonstrator.

CCS CONCEPTS
• Information Retrieval → Real-time; • Text Mining → Social Media Analytics.

KEYWORDS
Big Data, Real Time, Web Streams, Datasets

[Figure 1 diagram. Modules shown: Reddit crawler, Twitter crawler, Filter, Tag cloud, Topic analysis, Stats, Batch. Pipeline components shown: Kafka, consumer / HTTP API, MongoDB, React web interface.]
Figure 1: Example of a possible eXtream topology using the pre-installed modules (top) and visualization and persistence pipeline (bottom).

1 INTRODUCTION
Processing Social Media data is a challenge and doing it in real time is critical for many added-value applications. For example, according to Twitter¹, the number of daily posted tweets is higher than 500 million (around 5,787 tweets per second).

eXtream is a Big Data platform for building topologies oriented to real-time processing of web or stream data. It has many potential use cases, such as doing a reputation analysis about a company, its products or its competitors from social media data. It can also be used by Information Retrieval (IR) experts to collect social media texts and create their own datasets.

This tool is built with the Python framework Catenae [1], which has several advantages over other existing technologies (e.g., inherent horizontal scalability and inter-module communication through RPC). However, eXtream goes a step further by offering a set of text mining modules which can be easily interconnected using the GUI provided by our web-based demonstrator. As a consequence, users are able to construct data processing topologies while avoiding low-level details such as manually deploying Docker containers. Moreover, eXtream has several important differences with respect to other existing frameworks (see Table 1 for details). For instance, our system can combine real-time processing with the capability of doing batch tasks. It also supports Python natively and allows defining cycles among the topology modules.

As said before, eXtream offers a reduced group of pre-installed modules. Among them, we can highlight a query-based filter and a topic analysis module. Note that, in addition, advanced users can easily extend the processing pipelines by adding new user-defined modules.

2 SYSTEM OVERVIEW AND IMPLEMENTATION
A topology example that interconnects all the currently available pre-installed modules of eXtream is displayed in Figure 1 (top). This is just a simple example of the countless possible configurations. All the topology modules and sources are interconnected using Apache Kafka² topics, and the system state is kept in a MongoDB database. It is worth noting that all the implementation and deployment details
are completely hidden from the users of our platform.

Technology    | Streaming | Python-native | Execution cycles | Easy deployment management | Graphical definition of topologies | Docker oriented | Resource assignment
--------------+-----------+---------------+------------------+----------------------------+------------------------------------+-----------------+--------------------
Hadoop MR     | No        | No            | No               | No                         | No                                 | No              | Yes
Spark         | Yes       | No            | No               | Yes                        | No                                 | No              | Yes
Storm         | Yes       | No            | Yes              | No                         | No                                 | No              | Yes
Stream Parse  | Yes       | No            | Yes              | Yes                        | No                                 | No              | Yes
Kafka         | Yes       | Yes           | Yes              | No                         | No                                 | No              | No
Kafka Streams | Yes       | Yes           | Yes              | No                         | No                                 | No              | No
eXtream       | Yes       | Yes           | Yes              | Yes                        | Yes                                | Yes             | Yes
Table 1: Main features of eXtream and other related technologies and frameworks.

Figure 2: Main view of the eXtream GUI.

¹ https://business.twitter.com/en.html
² https://kafka.apache.org/

The available pre-installed modules are:
• A Reddit crawler [2] and a Twitter crawler [3] that inject text streams into the topologies built with eXtream.
• A real-time filtering module, which is essential since it acts as a first distilling step for the data that eXtream receives in real time. It is a query-based filter whose current implementation supports exact and inexact³ matching. This module can be easily customised to implement any IR filter and it can help IR experts create their own collections.
• A dynamic tag cloud generator, which represents a primitive form of summary. It removes stopwords and normalises the words to build a tag cloud from the resulting bag of words.
• A topic analysis module, which attempts to discover the hidden topics in the texts. We have used Gensim [4] and LDA [5], which perform unsupervised learning over a corpus.
• A stats module that returns the number of distinct users and texts that are currently being analyzed by the platform.
• A batch module. eXtream also supports batch tasks; as a first example, we provide a module that counts the number of recovered texts over a certain period.

Furthermore, a module placed at the end of every topology receives the output data (see Figure 1 at the bottom). It also implements a RESTful API in order to visualize results.
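To make the filtering and tag cloud steps more concrete, the following is a minimal sketch in plain Python. It is not the eXtream implementation. The function names, the tiny stopword list, and lowercasing as the normalisation step are all assumptions made for illustration. Exact matching is read here as a contiguous, in-order phrase match, which is one plausible interpretation. Inexact matching follows the paper's description: stopwords are removed and the remaining query words may appear in any order.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny stopword list (an assumption for this sketch).
STOPWORDS = frozenset({"a", "an", "and", "in", "is", "of", "on", "the", "to"})

_TOKEN = re.compile(r"\w+")


def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return _TOKEN.findall(text.lower())


def content_words(text):
    """Return the tokens of the text that are not stopwords."""
    return [tok for tok in tokenize(text) if tok not in STOPWORDS]


def matches_exact(query, text):
    """True if the query tokens occur in the text contiguously and in order."""
    q, t = tokenize(query), tokenize(text)
    if not q:
        return False
    return any(t[i:i + len(q)] == q for i in range(len(t) - len(q) + 1))


def matches_inexact(query, text):
    """True if every non-stopword query token occurs in the text, in any order."""
    q = set(content_words(query))
    return bool(q) and q <= set(content_words(text))


def tag_counts(texts):
    """Count non-stopword tokens over a collection of texts (a bag of words)."""
    counts = Counter()
    for text in texts:
        counts.update(content_words(text))
    return counts


if __name__ == "__main__":
    posts = ["Amazon raised the price", "The price of the Amazon device"]
    print([p for p in posts if matches_inexact("price of Amazon", p)])
    print(tag_counts(posts).most_common(3))
```

In a topology, functions like these would sit inside a module that reads texts from one Kafka topic and writes matches or counts to another. That wiring is omitted here because it depends on the Catenae API.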
Figure 2 displays the GUI main view and a possible topology example. It should be noted that each kind of module has its own dashboard or view. For instance, Figure 3 shows the result of searching the exact query Amazon in Reddit. Finally, we provide a demonstration video showing eXtream in operation⁴.

Figure 3: Filter view of the eXtream GUI.

3 CONCLUSIONS
Flexibility is a core strength of this system, which can be customised for other IR or related purposes with little effort. This tool is available for the research community to expand it and employ it for numerous Information Access tasks⁵.

ACKNOWLEDGMENTS
This work was funded by FEDER/Ministerio de Ciencia, Innovación y Universidades – Agencia Estatal de Investigación/ Project (RTI2018-093336-B-C21). This work has received financial support from the Consellería de Educación, Universidade e Formación Profesional (accreditation 2019-2022 ED431G-2019/04, ED431C 2018/29, ED431C 2018/19) and the European Regional Development Fund (ERDF), which acknowledges the CiTIUS-Research Center in Intelligent Technologies of the University of Santiago de Compostela as a Research Center of the Galician University System.

REFERENCES
[1] Rodrigo Martínez-Castaño, Juan C. Pichel, and David E. Losada. Building python-based topologies for massive processing of social media data in real time. In Proceedings of the 5th Spanish Conference on Information Retrieval, pages 1–8. ACM, 2018.
[2] Rodrigo Martínez-Castaño, Juan C. Pichel, David E. Losada, and Fabio Crestani. A micromodule approach for building real-time systems with python-based models: Application to early risk detection of depression on social media. In Advances in Information Retrieval - 40th European Conference on IR Research, ECIR, volume 10772 of Lecture Notes in Computer Science, pages 801–805. Springer, 2018.
[3] Rodrigo Martínez-Castaño, Juan C. Pichel, and Pablo Gamallo. Polypus: a big data self-deployable architecture for microblogging text extraction and real-time sentiment analysis. CoRR, abs/1801.03710, 2018.
[4] Radim Řehůřek and Petr Sojka. Software Framework for Topic Modelling with Large Corpora. In Proceedings of the LREC 2010 Workshop on New Challenges for NLP Frameworks, pages 45–50. ELRA, May 2010.
[5] David M. Blei, Andrew Y. Ng, and Michael I. Jordan. Latent dirichlet allocation. The Journal of Machine Learning Research, 3:993–1022, 2003.

³ Stopwords are removed and the remaining words can appear in any order.
⁴ https://youtu.be/5Aw4mAc9lTc
⁵ https://github.com/MarcosFP97/eXtream