=Paper=
{{Paper
|id=Vol-2621/CIRCLE20_34
|storemode=property
|title=eXtream: a System for Real-time Monitoring of Dynamic Web Sources
|pdfUrl=https://ceur-ws.org/Vol-2621/CIRCLE20_34.pdf
|volume=Vol-2621
|authors=Marcos Fernández-Pichel,Rodrigo Martínez-Castaño,David E. Losada Carril,Juan C. Pichel Campos
|dblpUrl=https://dblp.org/rec/conf/circle/Fernandez-Pichel20
}}
==eXtream: a System for Real-time Monitoring of Dynamic Web Sources==
Marcos Fernández-Pichel
Rodrigo Martínez-Castaño
David E. Losada
Juan C. Pichel
marcosfernandez.pichel@usc.es
rodrigo.martinez@usc.es
david.losada@usc.es
juancarlos.pichel@usc.es
Centro Singular de Investigación en Tecnoloxías Intelixentes (CiTIUS), Universidade de Santiago de Compostela
15782 Santiago de Compostela, Spain
ABSTRACT
In this work, we introduce eXtream, a Big Data platform whose main goal is to deploy modular and customisable processing topologies for massive analysis of web data in real time. The system offers a reduced group of pre-installed modules that can be easily combined in a visual way. Additionally, an advanced user can upload new modules and extend an existing topology. This tool facilitates the development of many Information Retrieval and Big Data applications, such as query-based real-time filtering or topic analysis services on Social Media data. To demonstrate it, we have also developed an initial web-based demonstrator.

[Figure 1 diagram. Module labels: REDDIT CRAWLER, TWITTER CRAWLER, FILTER, TAG CLOUD, TOPIC ANALYSIS, STATS, BATCH. Other labels: reddit_texts, twitter_texts, filtered_texts, tags, detected_topics, stats, no_texts, commands, results, results buffer, ALL OUTPUTS, CONSUMER / HTTP API, WEB INTERFACE, React webapp, STATE, MongoDB, Kafka, Twitter, Reddit.]
Figure 1: Example of a possible eXtream topology using the pre-installed modules (top) and visualization and persistence pipeline (bottom).

CCS CONCEPTS
• Information Retrieval → Real-time; • Text Mining → Social Media Analytics.

KEYWORDS
Big Data, Real Time, Web Streams, Datasets

1 INTRODUCTION
Processing Social Media data is a challenge, and doing it in real time is critical for many added-value applications. For example, according to Twitter¹, more than 500 million tweets are posted daily (around 5,787 tweets per second).

eXtream is a Big Data platform for building topologies oriented to real-time processing of web or stream data. It has many potential use cases, such as analysing the reputation of a company, its products or its competitors from social media data. It can also be used by Information Retrieval (IR) experts to collect social media texts and create their own datasets.

This tool is built with the Python framework Catenae [1], which has several advantages over other existing technologies (e.g., inherent horizontal scalability and inter-module communication through RPC). However, eXtream goes a step further by offering a set of text mining modules that can be easily interconnected using the GUI provided by our web-based demonstrator. As a consequence, users can construct data processing topologies while avoiding low-level details such as manually deploying Docker containers. Moreover, eXtream differs from other existing frameworks in several important respects (see Table 1 for details). For instance, our system can combine real-time processing with batch tasks. It also supports Python natively and allows defining cycles among the topology modules.

As noted above, eXtream offers a reduced group of pre-installed modules, among them a query-based filter and a topic analysis module. In addition, advanced users can easily extend the processing pipelines by adding new user-defined modules.

{| class="wikitable"
|+ Table 1: Main features of eXtream and other related technologies and frameworks.
! Technology !! Streaming !! Python-native !! Execution cycles !! Easy deployment management !! Graphical definition of topologies !! Docker oriented !! Resource assignment
|-
| Hadoop MR || No || No || No || No || No || No || Yes
|-
| Spark || Yes || No || No || Yes || No || No || Yes
|-
| Storm || Yes || No || Yes || No || No || No || Yes
|-
| Stream Parse || Yes || No || Yes || Yes || No || No || Yes
|-
| Kafka || Yes || Yes || Yes || No || No || No || No
|-
| Kafka Streams || Yes || Yes || Yes || No || No || No || No
|-
| eXtream || Yes || Yes || Yes || Yes || Yes || Yes || Yes
|}

Copyright © 2020 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
1 https://business.twitter.com/en.html
2 https://kafka.apache.org/

2 SYSTEM OVERVIEW AND IMPLEMENTATION
A topology example that interconnects all the currently available pre-installed modules of eXtream is displayed in Figure 1 (top). This is just a simple example of the countless possible configurations. All the topology modules and sources are interconnected using Apache Kafka² topics, and the system state is kept in a MongoDB database. It is worth noting that all the implementation and deployment details are completely hidden from the users of our platform. In particular, the available modules are:
• A Reddit crawler [2] and a Twitter crawler [3] that inject text streams into the topologies built with eXtream.
• A real-time filtering module is essential since it acts as a first distilling step of the data that eXtream receives in real time. It is a query-based filter where the current implementation supports exact and inexact³ matching. This module can be easily customised to implement any IR filter and it can help IR experts create their own collections.
• A dynamic tag cloud generator represents a primitive form of summary. It removes stopwords and normalises the words to build a tag cloud from the resulting bag of words.
• A topic analysis module attempts to discover the hidden topics in the texts. We have used Gensim [4] and LDA [5], which perform unsupervised learning over a corpus.
• A stats module that returns the number of distinct users and texts that are currently being analyzed by the platform.
• eXtream also supports batch tasks. As a first example, we provide a module that counts the number of recovered texts over a certain period.

Figure 2: Main view of the eXtream GUI.
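The filtering and tag cloud modules listed above can be illustrated with a short Python sketch. It is not the eXtream or Catenae code: the function names, the tiny stopword list and the tokenisation rule are all assumptions made for illustration. Inexact matching follows the description in footnote 3 (stopwords removed, remaining words in any order). The paper does not define exact matching, so it is assumed here to mean a contiguous, in-order phrase match.

```python
import re
from collections import Counter
from typing import Iterable, Iterator, List, Tuple

# Hypothetical stopword list; the paper does not state which list eXtream uses.
STOPWORDS = frozenset({"a", "an", "and", "in", "of", "on", "the", "to"})


def tokenize(text: str) -> List[str]:
    """Normalise by lowercasing and split into alphanumeric tokens (assumed rule)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def exact_match(query: str, text: str) -> bool:
    """Assumed semantics: the query tokens occur contiguously and in order."""
    q, t = tokenize(query), tokenize(text)
    if not q:
        return False
    return any(t[i:i + len(q)] == q for i in range(len(t) - len(q) + 1))


def inexact_match(query: str, text: str) -> bool:
    """Footnote 3: drop stopwords; the remaining query words may appear in any order."""
    q = {tok for tok in tokenize(query) if tok not in STOPWORDS}
    return bool(q) and q <= set(tokenize(text))


def filter_stream(texts: Iterable[str], query: str, exact: bool = True) -> Iterator[str]:
    """Yield only the texts that match the query, one at a time."""
    match = exact_match if exact else inexact_match
    for text in texts:
        if match(query, text):
            yield text


def tag_cloud(texts: Iterable[str], top_n: int = 10) -> List[Tuple[str, int]]:
    """Count non-stopword tokens over the texts and return the most frequent ones."""
    counts = Counter(
        tok for text in texts for tok in tokenize(text) if tok not in STOPWORDS
    )
    return counts.most_common(top_n)


if __name__ == "__main__":
    posts = [
        "Amazon raises the price of Prime",
        "Prime raises the price of Amazon",
        "New phone released today",
    ]
    # Chain the two steps: filter first, then summarise what passed the filter.
    kept = list(filter_stream(posts, "price of Prime", exact=False))
    print(kept)
    print(tag_cloud(kept, top_n=3))
```

Writing the filter as a generator keeps it usable on an unbounded stream, which is how the crawlers are described as feeding the topology. In the real system the modules exchange messages through Kafka topics rather than direct function calls.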
Furthermore, a module placed at the end of every topology receives the output data (see Figure 1, bottom). It also implements a RESTful API so that the results can be visualized. Figure 2 displays the GUI main view and a possible topology example. Note that each kind of module has its own dashboard or view.
For instance, Figure 3 shows the result of searching the exact query Amazon in Reddit. Finally, we provide a demonstration video showing eXtream in operation⁴.

Figure 3: Filter view of the eXtream GUI.

3 CONCLUSIONS
Flexibility is a core strength of this system, which can be customised for other IR or related purposes with little effort. This tool is available for the research community to expand it and employ it for numerous Information Access tasks⁵.

ACKNOWLEDGMENTS
This work was funded by FEDER/Ministerio de Ciencia, Innovación y Universidades – Agencia Estatal de Investigación/ Project (RTI2018-093336-B-C21). This work has received financial support from the Consellería de Educación, Universidade e Formación Profesional (accreditation 2019-2022 ED431G-2019/04, ED431C 2018/29, ED431C 2018/19) and the European Regional Development Fund (ERDF), which acknowledges the CiTIUS-Research Center in Intelligent Technologies of the University of Santiago de Compostela as a Research Center of the Galician University System.

3 Stopwords are removed and the remaining words can appear in any order.
4 https://youtu.be/5Aw4mAc9lTc
5 https://github.com/MarcosFP97/eXtream

REFERENCES
[1] Rodrigo Martínez-Castaño, Juan C. Pichel, and David E. Losada. Building python-based topologies for massive processing of social media data in real time. In Proceedings of the 5th Spanish Conference on Information Retrieval, pages 1–8. ACM, 2018.
[2] Rodrigo Martínez-Castaño, Juan C. Pichel, David E. Losada, and Fabio Crestani. A micromodule approach for building real-time systems with python-based models: Application to early risk detection of depression on social media. In Advances in Information Retrieval - 40th European Conference on IR Research, ECIR, volume 10772 of Lecture Notes in Computer Science, pages 801–805. Springer, 2018.
[3] Rodrigo Martínez-Castaño, Juan C. Pichel, and Pablo Gamallo. Polypus: a big data self-deployable architecture for microblogging text extraction and real-time sentiment analysis. CoRR, abs/1801.03710, 2018.
[4] Radim Řehůřek and Petr Sojka. Software Framework for Topic Modelling with Large Corpora. In Proceedings of the LREC 2010 Workshop on New Challenges for NLP Frameworks, pages 45–50. ELRA, May 2010.
[5] David M. Blei, Andrew Y. Ng, and Michael I. Jordan. Latent dirichlet allocation. The Journal of Machine Learning Research, 3:993–1022, 2003.