<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>A Comparative Evaluation of Big Data Frameworks for Log Processing</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Attila Péter Boros</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Péter Lehotay-Kéry</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Attila Kiss</string-name>
          <email>kissae@ujs.sk</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
<institution>Department of Information Systems, Faculty of Informatics, ELTE Eötvös Loránd University, Budapest, Hungary</institution>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Ericsson Hungary</institution>
          ,
          <addr-line>Budapest</addr-line>
          ,
          <country country="HU">Hungary</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2020</year>
      </pub-date>
      <fpage>29</fpage>
      <lpage>31</lpage>
      <abstract>
        <p>Nowadays a huge part of collected data comes from the behaviour of logging systems. Examples are the complex monitored systems of different institutions, where computations require powerful distributed environments to run. Our work targets the specific area of log data obtained from telecommunication operator systems, with the goal of identifying non-trivially detectable problems, such as the frequency of node restarts in a given time period or the reasons for these events. In order to extract significant new information from these system logs, it is important to use proper frameworks for analyzing them. This being a comprehensive problem, various frameworks have been proposed. In this paper we evaluate and compare Apache Spark and Elasticsearch (with Logstash) as two prominent frameworks for processing log data. Throughout our work we perform experiments on different problem solutions of varying complexity in order to measure how non-functional features, such as processing time and resource consumption, vary between them. Additionally, our experimental data shows how the choice between different frameworks can influence the performance of these computations.</p>
      </abstract>
      <kwd-group>
        <kwd>telecommunication</kwd>
        <kwd>network</kwd>
        <kwd>data analysis</kwd>
        <kwd>Big Data</kwd>
        <kwd>Elasticsearch</kwd>
        <kwd>Spark</kwd>
        <kwd>Log analysis</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>A system has been built that aims at continuous, automatic software deployment
on customer telecommunication network server nodes. Therefore, continuous
automatic testing of new software versions and continuous automatic data collection have also
been introduced for the network nodes.</p>
      <p>At first the new software can be released on a small part of the live customer
network, and the testing can then be done in the field. If the new software works fine,
the deployment can be extended to further nodes. It is also mandatory to see what
the configurations and states on the nodes are before and after the upgrade, and
what events happened on the nodes after the upgrade. This part is done by
the continuous data collection and processing on a daily basis. Figure 1 shows the
process.</p>
      <p>Thus we aim to collect node configuration files, state descriptor files and also log
files generated by these nodes. This paper focuses on the processing of these log
files: searching for restart events and the reasons behind these restarts, calculating
logging intensity and boot time, and comparing the results before and after the
upgrades. This gives us more insight into the impact of the upgrades; we are also
able to react faster if needed, or in worst-case scenarios even to roll back the
software upgrade.</p>
      <p>Each customer network contains thousands of nodes, and every node generates
several kinds of logs. Thus, for processing the large number of files it is mandatory to
use effective, distributed algorithms and technologies that support these kinds of
calculations.</p>
      <p>First, we started with the processing of the configuration and state descriptor
files. For these we used Spark, which fulfilled our needs. However, when it came
to the processing of log files, we found that we should investigate other
technologies too. In this paper we present our findings using Apache Spark and
Elasticsearch.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Related works</title>
      <p>In this section, we first look at what has been done in comparing technologies
for log parsing.</p>
      <p>
        In Tools and benchmarks for automated log parsing [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ], the authors evaluated 13 log
parsers on a total of 16 log datasets spanning different architectures. They reported
the benchmarking results in terms of accuracy, robustness and efficiency. They also
shared success stories and lessons learned in an industrial application.
      </p>
      <p>
        The authors of An evaluation study on log parsing and its use in log mining [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]
studied four log parsers and packaged them into a toolkit to allow their reuse.
By evaluating the performance of the log parsers on over ten million raw log
messages in five datasets, they obtained insightful findings, and their effectiveness
on a real-world log mining task was thoroughly examined.
      </p>
      <p>Now, let us look at what has been done using one of our chosen
frameworks: Spark.</p>
      <p>
        In Log-based abnormal task detection and root cause analysis for spark [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ],
the authors proposed an approach to detect abnormalities and analyze root causes using
Spark on log files. Their proposed method has been tested on real-world Spark
benchmarks.
      </p>
      <p>
        Authors of LADRA: Log-based abnormal task detection and root-cause analysis
in big data processing with Spark [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] proposed a tool named LADRA for log-based
abnormal task detection and root-cause analysis using Spark logs. In LADRA,
a log parser first converts raw log files into structured data and extracts features.
Then, a detection method is proposed to detect where and when abnormal tasks
happen. Finally, a General Regression Neural Network (GRNN) is leveraged to identify
the root causes of abnormal tasks.
      </p>
      <p>Finally, let us look at what has been done using our other chosen
framework: Elasticsearch.</p>
      <p>
        The authors of Monitoring of IaaS and scientific applications on the Cloud using
the Elasticsearch ecosystem [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] used the Elasticsearch, Logstash and Kibana stack to
set up a monitoring system to inspect the site activities. They fed heterogeneous
accounting information into different MySQL databases and sent it to Elasticsearch
via a custom Logstash plugin. They then started to consider dismissing the
intermediate level provided by the SQL database and evaluating a NoSQL option
as a unique central database for all the monitoring information.
      </p>
      <p>
        In Elasticsearch and Carrot2-Based Log Analytics and Management [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ], the authors
described how Elasticsearch, along with Carrot2, is used with their algorithm to
manage and analyze logs of any format. They set up log analytics and management
on an Amazon web server.
      </p>
    </sec>
    <sec id="sec-3">
      <title>3. Used technologies</title>
      <p>Throughout our work, Apache Spark and Elasticsearch combined with Logstash have
been used.</p>
      <p>
        Apache Spark[
        <xref ref-type="bibr" rid="ref7">7</xref>
        ] is a platform for large-scale data processing,
providing a simple interface for efficient distributed computations on significant
amounts of data. It provides useful basic functionalities such as
scheduling, distribution and data monitoring, while offering flexibility in integration with
other frameworks. It exposes an interface of several functional-style methods with
which we can operate on its resilient distributed datasets (RDDs), which can therefore be
distributed in an easier way. Spark also supports efficient graph processing
and machine learning libraries, providing them an efficient platform for distributing
calculations, e.g. when optimizing different hyper-parameters while training
neural networks.
      </p>
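      <p>As a minimal sketch of this functional style, assuming a hypothetical log line layout of date, node and event fields, the following pure-Python analogue mirrors the filter/map/reduce-by-key chain that Spark would run on a distributed RDD (e.g. via sc.textFile(path).filter(...).map(...).reduceByKey(...)):</p>

```python
from functools import reduce

# Hypothetical log lines: "<date> <node> <event>"
lines = [
    "2020-01-01 node1 RESTART",
    "2020-01-01 node1 HEARTBEAT",
    "2020-01-02 node2 RESTART",
    "2020-01-02 node1 RESTART",
]

# filter -> map -> reduce-by-key, mirroring Spark's RDD API
restart_lines = filter(lambda l: l.endswith("RESTART"), lines)
pairs = map(lambda l: ((l.split()[1], l.split()[0]), 1), restart_lines)

def reduce_by_key(acc, kv):
    key, value = kv
    acc[key] = acc.get(key, 0) + value
    return acc

counts = reduce(reduce_by_key, pairs, {})
print(counts[("node1", "2020-01-01")])  # 1
```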
      <p>
        Elasticsearch[
        <xref ref-type="bibr" rid="ref8">8</xref>
        ] is an open-source search and analytics engine which provides a
distributed framework for handling all types of data. Although it is a standalone search engine,
most of the time it serves as the core part of a layered monitoring system in real-world
applications, where other plugins have to be attached to provide additional services, e.g.
for presenting and persisting data. Logstash[
        <xref ref-type="bibr" rid="ref9">9</xref>
        ] serves as an extension plugin which
strengthens the core API with the ability to preprocess and deliver data read
from different sources such as the local file system, databases, streaming applications,
etc. In the following we will refer to this set-up of Elasticsearch with the Logstash
plugin as the ESL-stack. Being open-source software, it has developed a large and
responsive community over time, which favors all developers from beginners to
experts.
      </p>
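      <p>A minimal Logstash pipeline of the kind described above, reading from the local file system and delivering to Elasticsearch, might look as follows; the file path, grok pattern and index name are illustrative assumptions, not our actual configuration:</p>

```
input {
  file {
    path => "/data/node-logs/*.log"   # hypothetical local log path
    start_position => "beginning"
  }
}
filter {
  grok {
    # assumed line shape: "<timestamp> <node> <free-text event>"
    match => { "message" => "%{TIMESTAMP_ISO8601:ts} %{WORD:node} %{GREEDYDATA:event}" }
  }
}
output {
  elasticsearch {
    hosts => ["localhost:9200"]
    index => "node-logs"
  }
}
```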
      <p>In the course of our test cases, we used Apache Spark 2.4.3 and Elasticsearch
7.5.2 with the same version of the Logstash plugin. To provide equal scenarios for
testing, we seized the opportunity of ingesting data from local storage, at the same
time eliminating the overhead of dealing with other frameworks related to the
persistence of data. This focus on the core systems only was also
made possible by Apache Spark being able to load inputs
on its executors from local storage since version 2.0.0.</p>
      <p>Under these circumstances we were able to provide test cases which realize the same
logic, expressed in each framework's corresponding environment.</p>
    </sec>
    <sec id="sec-4">
      <title>4. System description</title>
      <p>For evaluating our implementations we used two environments. One was a single-node
cluster: a machine equipped with an Intel Core i7-6700HQ processor, 8 GB of
DDR4-1866 MHz RAM and a 1 TB 5400 rpm local disk; the other was a
set of 20 connected machines, each equipped with an Intel Core series
processor running at 2.5 GHz, 16 GB of RAM and 250 GB of disk storage, where the nodes
were strongly connected to each other. In our former, single-node Apache Spark
set-up, the framework was deployed as a single-node cluster, and in our single-node
ESL-stack set-up, Elasticsearch and Logstash were deployed on the single available
node. In the latter environment we had two different set-ups, one for Apache
Spark (1 driver node and 10 executor nodes, each allocated 10 GB of RAM), and
one for the Elasticsearch cluster (1 coordinator node with 4 GB of RAM, 5 master + data
nodes with 8 GB of RAM and 5 Logstash nodes, each with 8 GB of RAM). In our
work we first evaluated the scenarios in the single-node environment.
Future work includes evaluation and comparison on the multi-node
cluster.</p>
    </sec>
    <sec id="sec-5">
      <title>5. Results</title>
      <p>We evaluated four test cases: restart count from error log
files (a), actual restarts with module, trigger entry and trigger action (b), system
boot-up times (c), and local log intensity (d), all grouped by node and date.</p>
      <p>Test case (a) parses special log files gathered from the nodes, which contain
information about various system events. It collects the count of restart events from all
the files; each line in the result shows how many restart events happened on a specific
date and a specific node. This information is acquired by parsing
each log line and deciding by pattern matching whether the actual event is a restart
event, and if so, when and on which node it happened.</p>
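      <p>The per-line pattern matching of test case (a) can be sketched in a few lines of Python; the line layout and restart keyword below are hypothetical stand-ins for the real, node-specific event log format:</p>

```python
import re
from collections import Counter

# Assumed line shape: "<date> <time> <node> <free-text event>"
RESTART_RE = re.compile(
    r"^(?P<date>\d{4}-\d{2}-\d{2}) \S+ (?P<node>\S+) .*restart", re.IGNORECASE
)

def count_restarts(lines):
    """Count restart events per (node, date) by pattern matching each line."""
    counts = Counter()
    for line in lines:
        m = RESTART_RE.match(line)
        if m:
            counts[(m.group("node"), m.group("date"))] += 1
    return counts

sample = [
    "2020-01-01 12:00:01 nodeA system restart initiated",
    "2020-01-01 12:05:09 nodeA heartbeat ok",
    "2020-01-02 03:14:15 nodeB unexpected Restart",
]
print(count_restarts(sample)[("nodeA", "2020-01-01")])  # 1
```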
      <p>Test case (b) parses the same specific files as test case (a) and also aims at collecting
only restart events, but instead of the event count, specific attributes of the restart
events are gathered. These attributes are the module (which module initiated
the restart), the trigger entry (the reason for the restart event, in most cases a
system error code) and the trigger action (what action was triggered by the error).</p>
      <p>Test case (c) collects information about how much time it took a node to boot
up. It is a much lighter test case than the previous two, because there is no need
to parse all of the log lines: this information is stated in the ending part of every
log file. However, each framework only offers the possibility to parse
whole files, so the specific line has to be filtered out and the information parsed from
it. Thus each line of the result contains the boot time value in seconds for
every node and every day.</p>
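      <p>Since the boot time is stated near the end of each file, it suffices to scan backwards once the whole file has been read; the wording of the boot-time line below is an assumption for illustration:</p>

```python
import re

# Assumed wording of the boot-time line near the end of the log file.
BOOT_RE = re.compile(r"boot completed in (?P<seconds>\d+(?:\.\d+)?)\s*s")

def boot_time_seconds(lines):
    """Scan from the end, since the boot-time line is near the tail of the file."""
    for line in reversed(lines):
        m = BOOT_RE.search(line)
        if m:
            return float(m.group("seconds"))
    return None

log = ["... many earlier lines ...", "boot completed in 42.5 s", "scheduler started"]
print(boot_time_seconds(log))  # 42.5
```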
      <p>Test case (d) gathers local log intensity. It is done similarly to test case (a),
but it differs in the files which are parsed. Extra complexity is added in this
case by the fact that these log files are also enriched with multi-line scripts. So
extra filtering is needed on the log files to obtain only the scheme-based lines. From
these lines, the count of events classified by node and date is then gathered.</p>
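      <p>The extra filtering only needs to distinguish scheme-based log lines from embedded script content; here we assume, purely for illustration, that scheme-based lines start with an ISO date:</p>

```python
import re

# Assumption for illustration: scheme-based lines start with an ISO date.
SCHEME_RE = re.compile(r"^\d{4}-\d{2}-\d{2} ")

def scheme_lines(lines):
    """Keep only scheme-based log lines, dropping embedded multi-line scripts."""
    return [line for line in lines if SCHEME_RE.match(line)]

log = [
    "2020-01-01 nodeA event one",
    "#!/bin/sh",            # first line of an embedded script
    "echo not a log line",  # script body
    "2020-01-01 nodeA event two",
]
print(len(scheme_lines(log)))  # 2
```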
      <p>We have run each test case with both of the frameworks.</p>
      <p>Our single-node test cases were run in each case on 13 files, each containing
between 50 and 5000 log lines. The test results are shown in Figure 2. We conclude
that Elasticsearch performed better in each of our test cases.</p>
      <p>This could be explained by the fact that, after pre-processing the log file lines, the
main search engine creates indices on the documents received from Logstash and is
thus able to perform faster searching later on. Meanwhile, in Spark there is no
searching; the output is the result of several transformations and aggregations on
resilient distributed dataset (RDD) objects. Therefore, another future work might
be to investigate a search-oriented, optimized Spark implementation,
including the utilization of the built-in DataFrames and Datasets, of which the latter also
includes a built-in query optimisation tool.</p>
      <p>Furthermore, we set up and tested our Spark
implementations in the multi-node environment too. These test cases were run with the aforementioned set-up, and on 1500
times more log files, which were generated from the basic sample. This means that
in every test case 18000 log files were processed. The results are shown in
Table 1.</p>
      <p>Therefore, as future work we plan to compare the two frameworks in a real
distributed environment with multiple nodes.</p>
      <table-wrap id="tab1">
        <label>Table 1</label>
        <caption>
          <p>Run times of the Spark implementations in the multi-node environment</p>
        </caption>
        <table>
          <thead>
            <tr>
              <th>Test case</th>
              <th>Run time (s)</th>
            </tr>
          </thead>
          <tbody>
            <tr><td>actual restarts</td><td>1571.4</td></tr>
            <tr><td>elog intensity</td><td>1534.26</td></tr>
            <tr><td>local log intensity</td><td>1260.463</td></tr>
            <tr><td>boot time</td><td>1440.710</td></tr>
          </tbody>
        </table>
      </table-wrap>
    </sec>
    <sec id="sec-6">
      <title>6. Conclusion</title>
      <p>In our work we targeted the comparative evaluation of the Apache Spark and
ESL-stack frameworks. As a result of our single-node cluster tests, we came to the
conclusion that in our cases the two computation environments perform differently
on the test cases. Our evaluation found that Elasticsearch performed
significantly better than Apache Spark, although each of these test
scenarios could still be improved.</p>
      <p>This work also serves as a proper foundation for future investigations into
other similar comparative evaluations, such as testing our implementations in
multi-node environments. Our plans are elaborated in more detail in the Future
Work section.</p>
    </sec>
    <sec id="sec-7">
      <title>7. Future work</title>
      <p>The first task of our future work will be the testing of Elasticsearch on the defined
test cases in a multi-node environment. We will then be able to compare the
run-time performance of the two frameworks in a real multi-node cluster environment.</p>
      <p>Another future investigation will be on the side of security. The collected files
can contain sensitive customer data, so we are planning to investigate further security
solutions and encryption techniques, first checking the built-in solutions of the
compared technologies.</p>
      <p>It would also be useful to be able to predict the errors, so we are planning
to apply machine learning techniques to learn from the logs what kinds of
events happen before the errors and restarts.</p>
      <p>Furthermore, we are planning to investigate the capabilities of Splunk and Flink
frameworks too.</p>
      <p>Acknowledgements. The project was supported by the European Union,
co-financed by the European Social Fund (EFOP-3.6.3-VEKOP-16-2017-00002).</p>
      <p>This publication is the partial result of the Research &amp; Development
Operational Programme for the project “Modernisation and Improvement of Technical
Infrastructure for Research and Development of J. Selye University in the Fields
of Nanotechnology and Intelligent Space”, ITMS 26210120042, co-funded by the
European Regional Development Fund.</p>
      <p>The project was also supported by the Ericsson-ELTE Software Technology Lab.
Furthermore thanks to Ericsson coworkers who worked on the project.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <surname>Zhu</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>He</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Liu</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>He</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Xie</surname>
            ,
            <given-names>Q.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zheng</surname>
            ,
            <given-names>Z.</given-names>
          </string-name>
          , &amp;
          <string-name>
            <surname>Lyu</surname>
            ,
            <given-names>M. R.</given-names>
          </string-name>
          <article-title>Tools and benchmarks for automated log parsing</article-title>
          ,
          <source>2019 IEEE/ACM 41st International Conference on Software Engineering: Software Engineering in Practice (ICSE-SEIP)</source>
          (
          <year>2019</year>
          ),
          <fpage>121</fpage>
          -
          <lpage>130</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <surname>He</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zhu</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>He</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Li</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          , &amp;
          <string-name>
            <surname>Lyu</surname>
            ,
            <given-names>M. R.</given-names>
          </string-name>
          <article-title>An evaluation study on log parsing and its use in log mining</article-title>
          ,
          <source>2016 46th Annual IEEE/IFIP International Conference on Dependable Systems and Networks (DSN)</source>
          (
          <year>2016</year>
          ),
          <fpage>654</fpage>
          -
          <lpage>661</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <surname>Lu</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Rao</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Wei</surname>
            ,
            <given-names>X.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Tak</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Wang</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          , &amp;
          <string-name>
            <surname>Wang</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          <article-title>Log-based abnormal task detection and root cause analysis for spark</article-title>
          ,
          <source>2017 IEEE International Conference on Web Services (ICWS)</source>
          (
          <year>2017</year>
          ),
          <fpage>389</fpage>
          -
          <lpage>396</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <surname>Lu</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Wei</surname>
            ,
            <given-names>X.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Rao</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Tak</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Wang</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          , &amp;
          <string-name>
            <surname>Wang</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          <article-title>LADRA: Log-based abnormal task detection and root-cause analysis in big data processing with Spark</article-title>
          ,
          <source>Future Generation Computer Systems</source>
          , Vol.
          <volume>95</volume>
          (
          <year>2019</year>
          ),
          <fpage>392</fpage>
          -
          <lpage>403</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <surname>Bagnasco</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Berzano</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Guarise</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lusso</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Masera</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          , &amp;
          <string-name>
            <surname>Vallero</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          <article-title>Monitoring of IaaS and scientific applications on the Cloud using the Elasticsearch ecosystem</article-title>
          ,
          <source>Journal of physics: Conference series</source>
          , Vol.
          <volume>608</volume>
          (
          <year>2015</year>
          ), p.
          <fpage>012016</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <surname>Singh</surname>
            ,
            <given-names>P. K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Suryawanshi</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gupta</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,&amp;
          <string-name>
            <surname>Saindane</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          <article-title>Elasticsearch and Carrot2-Based Log Analytics and Management</article-title>
          ,
          <source>Innovations in Computer Science and Engineering</source>
          (
          <year>2016</year>
          ),
          <fpage>71</fpage>
          -
          <lpage>78</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <surname>Frampton</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          <source>Mastering Apache Spark</source>
          , Packt Publishing Ltd, Birmingham, UK (
          <year>2015</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <surname>Kuc</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          , &amp;
          <string-name>
            <surname>Rogozinski</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          <source>Mastering Elasticsearch</source>
          , Packt Publishing Ltd, Birmingham, UK (
          <year>2013</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <surname>Turnbull</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          <source>The Logstash Book</source>
          , James Turnbull (
          <year>2013</year>
          ).
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>