=Paper= {{Paper |id=Vol-2621/CIRCLE20_34 |storemode=property |title=eXtream: a System for Real-time Monitoring of Dynamic Web Sources |pdfUrl=https://ceur-ws.org/Vol-2621/CIRCLE20_34.pdf |volume=Vol-2621 |authors=Marcos Fernández-Pichel,Rodrigo Martínez-Castaño,David E. Losada Carril,Juan C. Pichel Campos |dblpUrl=https://dblp.org/rec/conf/circle/Fernandez-Pichel20 }} ==eXtream: a System for Real-time Monitoring of Dynamic Web Sources== https://ceur-ws.org/Vol-2621/CIRCLE20_34.pdf
    eXtream: a System for Real-time Monitoring of Dynamic Web
                              Sources
                                                               Marcos Fernández-Pichel
                                                               Rodrigo Martínez-Castaño
                                                                    David E. Losada
                                                                     Juan C. Pichel
                                               marcosfernandez.pichel@usc.es
                                                   rodrigo.martinez@usc.es
                                                     david.losada@usc.es
                                                   juancarlos.pichel@usc.es
        Centro Singular de Investigación en Tecnoloxías Intelixentes (CiTIUS), Universidade de Santiago de Compostela
                                             15782 Santiago de Compostela, Spain
ABSTRACT
In this work, we introduce eXtream, a Big Data platform whose main goal is
to deploy modular and customisable processing topologies for massive
analysis of web data in real time. The system offers a reduced group of
pre-installed modules that can be easily combined in a visual way.
Additionally, an advanced user can upload new modules and extend an
existing topology. This tool facilitates the development of many
Information Retrieval and Big Data applications, such as query-based
real-time filtering or topic analysis services on Social Media data. To
demonstrate it, we have also developed an initial web-based demonstrator.

[Figure 1 diagram. Modules: REDDIT CRAWLER, TWITTER CRAWLER, FILTER, BATCH,
STATS, TOPIC ANALYSIS, TAG CLOUD, CONSUMER / HTTP API, ALL OUTPUTS, STATE,
WEB INTERFACE. Topic and flow labels: reddit_texts, twitter_texts,
filtered_texts, detected_topics, tags, no_texts, stats, commands, results,
results buffer. Legend: Twitter, Reddit, Kafka, MongoDB, React webapp.]

Figure 1: Example of a possible eXtream topology using the pre-installed
modules (top) and visualization and persistence pipeline (bottom).

CCS CONCEPTS
• Information Retrieval → Real-time; • Text Mining → Social Media Analytics.

KEYWORDS
Big Data, Real Time, Web Streams, Datasets

1    INTRODUCTION
Processing Social Media data is a challenge and doing it in real time is
critical for many added-value applications. For example, according to
Twitter¹, more than 500 million tweets are posted daily (around 5,787
tweets per second).
   eXtream is a Big Data platform for building topologies oriented to
real-time processing of web or stream data. It has many potential use
cases, such as doing a reputation analysis about a company, its products or
its competitors from social media data. It can also be used by Information
Retrieval (IR) experts to collect social media texts and create their own
datasets.
   This tool is built with the Python framework Catenae [1], which has
several advantages over other existing technologies (e.g., inherent
horizontal scalability and inter-module communication through RPC).
However, eXtream goes a step further, offering a set of text mining modules
which can be easily interconnected using the GUI provided by our web-based
demonstrator. As a consequence, users are able to construct data processing
topologies while avoiding low-level details such as manually deploying
Docker containers. Moreover, eXtream has several important differences with
respect to other existing frameworks (see Table 1 for details). For
instance, our system can combine real-time processing with the capability
of doing batch tasks. It also supports Python natively and allows defining
cycles among the topology modules.
   As mentioned above, eXtream offers a reduced group of pre-installed
modules. Among them, we can highlight a query-based filter and a topic
analysis module. Note that, in addition, advanced users can easily extend
the processing pipelines by adding new user-defined modules.

Technology     Streaming  Python-native  Execution cycles  Easy deployment management  Graphical definition of topologies  Docker oriented  Resource assignment
Hadoop MR      No         No             No                No                          No                                  No               Yes
Spark          Yes        No             No                Yes                         No                                  No               Yes
Storm          Yes        No             Yes               No                          No                                  No               Yes
Stream Parse   Yes        No             Yes               Yes                         No                                  No               Yes
Kafka          Yes        Yes            Yes               No                          No                                  No               No
Kafka Streams  Yes        Yes            Yes               No                          No                                  No               No
eXtream        Yes        Yes            Yes               Yes                         Yes                                 Yes              Yes
Table 1: Main features of eXtream and other related technologies and frameworks.

"Copyright © 2020 for this paper by its authors. Use permitted under Creative
Commons License Attribution 4.0 International (CC BY 4.0)."
1 https://business.twitter.com/en.html
2 https://kafka.apache.org/

2    SYSTEM OVERVIEW AND IMPLEMENTATION
A topology example that interconnects all the currently available
pre-installed modules of eXtream is displayed in Figure 1 (top). This is
just a simple example of the countless possible configurations. All the
topology modules and sources are interconnected using Apache Kafka² topics,
and the system state is kept in a MongoDB database. It is worth noting that
all the implementation and deployment details are completely hidden from
the users of our platform. In particular, the available modules are:
     • A Reddit crawler [2] and a Twitter crawler [3] that inject
       text streams into the topologies built with eXtream.
     • A real-time filtering module is essential since it acts as a
       first distilling step of the data that eXtream receives in real
       time. It is a query-based filter whose current implementation
       supports exact and inexact³ matching. This module can be easily
       customised to implement any IR filter and it can help IR experts
       create their own collections.
     • A dynamic tag cloud generator represents a primitive
       form of summary. It removes stopwords and normalises the
       words to build a tag cloud from the resulting bag of words.
     • A topic analysis module attempts to discover the hidden
       topics in the texts. We have used Gensim [4] and LDA [5],
       which perform unsupervised learning over a corpus.
     • A stats module that returns the number of distinct users
       and texts that are currently being analyzed by the platform.
     • eXtream also supports batch tasks. As a first example, we
       provide a module that counts the number of recovered texts
       over a certain period.

Figure 2: Main view of the eXtream GUI.
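
The matching behaviour of the filtering module and the bag-of-words step of
the tag cloud generator can be sketched as follows. This is a minimal
illustration under stated assumptions, not eXtream's actual code: the
function names, the tiny stopword list, and the reading of "exact" matching
as a verbatim, case-insensitive phrase match are all hypothetical. Inexact
matching follows the definition given in footnote 3 (stopwords removed,
remaining words in any order).

```python
import re
from collections import Counter

# Tiny illustrative stopword list (an assumption; the paper does not
# say which stopword list eXtream uses).
STOPWORDS = {"a", "an", "and", "in", "is", "of", "on", "the", "to"}


def normalise(text):
    """Lowercase the text, split it into word tokens and drop stopwords."""
    tokens = re.findall(r"\w+", text.lower())
    return [t for t in tokens if t not in STOPWORDS]


def exact_match(query, text):
    """True if the query occurs verbatim (case-insensitively) as a phrase.

    Assumes the query starts and ends with a word character, so that the
    word-boundary anchors behave as expected.
    """
    pattern = r"\b" + re.escape(query.lower()) + r"\b"
    return re.search(pattern, text.lower()) is not None


def inexact_match(query, text):
    """True if every non-stopword query term occurs in the text, in any order."""
    terms = set(normalise(query))
    return bool(terms) and terms <= set(normalise(text))


def tag_cloud(texts, top_n=10):
    """Return the top_n most frequent normalised words as (word, count) pairs."""
    counts = Counter()
    for text in texts:
        counts.update(normalise(text))
    return counts.most_common(top_n)
```

For a post such as "The delivery from Amazon is late", the query "late
delivery" fails the exact test, because the words are not adjacent, but
passes the inexact one. Sharing normalise between the filter and the tag
cloud keeps both modules consistent about what counts as a word.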
   Furthermore, a module placed at the end of every topology receives the
output data (see Figure 1 at the bottom). It also implements a RESTful API
in order to visualize results. Figure 2 displays the GUI main view and a
possible topology example. Note that each kind of module has its own
dashboard or view. For instance, Figure 3 shows the result of searching the
exact query Amazon in Reddit. Finally, we provide a demonstration video
showing eXtream in operation⁴.

Figure 3: Filter view of the eXtream GUI.

3    CONCLUSIONS
Flexibility is a core strength of this system, which can be customised for
other IR or related purposes with little effort. This tool is available for
the research community to expand it and employ it for numerous Information
Access tasks⁵.

ACKNOWLEDGMENTS
This work was funded by FEDER/Ministerio de Ciencia, Innovación y
Universidades – Agencia Estatal de Investigación/Project
(RTI2018-093336-B-C21). This work has received financial support from the
Consellería de Educación, Universidade e Formación Profesional
(accreditation 2019-2022 ED431G-2019/04, ED431C 2018/29, ED431C 2018/19)
and the European Regional Development Fund (ERDF), which acknowledges the
CiTIUS-Research Center in Intelligent Technologies of the University of
Santiago de Compostela as a Research Center of the Galician University
System.

3 Stopwords are removed and the remaining words can appear in any order.
4 https://youtu.be/5Aw4mAc9lTc
5 https://github.com/MarcosFP97/eXtream

REFERENCES
[1] Rodrigo Martínez-Castaño, Juan C. Pichel, and David E. Losada. Building
    Python-based topologies for massive processing of social media data in
    real time. In Proceedings of the 5th Spanish Conference on Information
    Retrieval, pages 1–8. ACM, 2018.
[2] Rodrigo Martínez-Castaño, Juan C. Pichel, David E. Losada, and Fabio
    Crestani. A micromodule approach for building real-time systems with
    Python-based models: Application to early risk detection of depression
    on social media. In Advances in Information Retrieval - 40th European
    Conference on IR Research, ECIR, volume 10772 of Lecture Notes in
    Computer Science, pages 801–805. Springer, 2018.
[3] Rodrigo Martínez-Castaño, Juan C. Pichel, and Pablo Gamallo. Polypus: a
    big data self-deployable architecture for microblogging text extraction
    and real-time sentiment analysis. CoRR, abs/1801.03710, 2018.
[4] Radim Řehůřek and Petr Sojka. Software Framework for Topic Modelling
    with Large Corpora. In Proceedings of the LREC 2010 Workshop on New
    Challenges for NLP Frameworks, pages 45–50. ELRA, May 2010.
[5] David M. Blei, Andrew Y. Ng, and Michael I. Jordan. Latent Dirichlet
    allocation. The Journal of Machine Learning Research, 3:993–1022, 2003.