<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Analysis of Internet Service Log Data to Assess the Level of Cyber-threats in the Corporate Network*</article-title>
      </title-group>
      <contrib-group>
        <aff id="aff0">
          <label>0</label>
          <institution>Institute of Computational Modelling of the Siberian Branch of the Russian Academy of Sciences</institution>
          ,
          <addr-line>50/44 Akademgorodok, Krasnoyarsk, 660036</addr-line>
          ,
          <country country="RU">Russia</country>
        </aff>
      </contrib-group>
      <fpage>0000</fpage>
      <lpage>0002</lpage>
      <abstract>
        <p>This article describes an analysis of the Internet service logs of the Krasnoyarsk Science Center (Russia). It shows the importance of log analysis as a method for improving the effectiveness of network security and describes the data sources. The study examines the following systems: Netflow IP traffic, an intrusion prevention system, a corporate mail server, and a web server. The log data was used to determine the frequency of events and to identify malicious behavior. The article describes the security threats identified during the analysis of the logs. The results of the analysis allow protection systems against network attacks to be optimized. The measures taken to improve network security are presented.</p>
      </abstract>
      <kwd-group>
        <kwd>Cyber-Threats</kwd>
        <kwd>Security</kwd>
        <kwd>Data Analysis</kwd>
        <kwd>Log</kwd>
        <kwd>Internet</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>
        The development of modern information technologies leads to a growing level of digitalization
and to the active use of various Internet services in scientific and business processes.
The corporate network and the services it provides have become everyday working tools without
which an organization cannot function fully. As a result, assessing cybersecurity risks
and the level of cyber-threats in order to provide adequate protection is becoming
increasingly relevant. An important aspect of
cybersecurity is the study of security logs [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. Modern researchers use dynamic
methods of analysis, since traditional approaches with static metrics may miss
intelligent low-frequency attacks [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. An important parameter of a secure system is
the response time to information security incidents. Minimizing this parameter to
the point of full attack prevention is described in [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]. Methods and algorithms for resisting mail
spam are being actively developed [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]. Various software is used to provide information security:
security scanners [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ], complex security analysis systems [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ], etc.
Thus, identifying new signs of cyber-threats and new analysis methods is an urgent problem.
      </p>
      <p>* Copyright © 2020 for this paper by its authors. Use permitted under Creative Commons
License Attribution 4.0 International (CC BY 4.0).</p>
      <p>
        For many years, the Krasnoyarsk Science Center has been studying the problems of
cybersecurity analysis and network protection [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ]. The purpose of this
work is to analyze the log data of Internet services, identify potential risks, and optimize
security protection mechanisms.
      </p>
    </sec>
    <sec id="sec-2">
      <title>Data Sources</title>
      <p>The corporate network of the Krasnoyarsk Science Center has a four-level
architecture: 1) the network core, which provides routing and connectivity to external
networks; 2) the server network, which hosts Internet services; 3) the aggregation level, which
connects multiple organizations together; 4) the local network for end users.
Network traffic information is collected at the Internet connection points and in the
server network. In addition, there are logs from the main Internet services: the corporate
mail server, the web server, and the proxy server for access to external resources. The
following data sources were analyzed:</p>
      <list list-type="order">
        <list-item><p>Netflow IP traffic: more than 400 GB, more than 2 billion records.</p></list-item>
        <list-item><p>Mail server log: 1 year, more than 1.5 million records.</p></list-item>
        <list-item><p>Intrusion Prevention System (IPS) log: 1 year, about 200 thousand records.</p></list-item>
        <list-item><p>WWW log: 1 year, about 12 GB, more than 44 million records.</p></list-item>
      </list>
    </sec>
    <sec id="sec-3">
      <title>IP Traffic Data Analysis</title>
      <p>To assess the permanent threat level, IP traffic to unused network addresses was
analyzed (Fig. 1).</p>
      <p>[Fig. 1: access attempts to unused network addresses per day, June to June; vertical axis 0 to 60000.]</p>
      <p>The analysis shows a persistent stream of access attempts. Daily aggregation
shows that the number of access attempts ranges from 5000 to 12000 per day. The
peak of 60000 detected on 14.01.2020 is explained by the duration of the attack rather than
its intensity, as the hourly aggregation shows (Fig. 2).</p>
      <p>[Fig. 2: access attempts per hour, Feb'20 to Jun'20.]</p>
      <p>The rate of access attempts to a single address is about 500 per hour,
regardless of the time of day and the day of the week, with peaks of up to 4500 per hour.
Thus, the time interval for confident detection of network attacks should be no more
than one hour. The number of unique threat sources per day (Fig. 3) varies from
1000 to 2000, which indicates the presence of a large and constantly operating
network used for scanning Internet services.</p>
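      <p>The hourly and daily aggregation described above can be sketched as follows. This is a minimal illustration only: the record layout, the sample addresses, and the names FLOWS, UNUSED and aggregate are assumptions made for the example, not the format or tooling used in the study.</p>
      <preformat><![CDATA[
```python
from collections import Counter, defaultdict
from datetime import datetime

# Hypothetical, simplified flow records: (ISO timestamp, source IP, destination IP).
# Real Netflow exports carry many more fields; this layout is an assumption.
FLOWS = [
    ("2020-01-14T00:05:00", "198.51.100.7", "192.0.2.10"),
    ("2020-01-14T00:40:00", "198.51.100.7", "192.0.2.11"),
    ("2020-01-14T01:10:00", "203.0.113.9", "192.0.2.10"),
    ("2020-01-15T03:00:00", "203.0.113.9", "192.0.2.12"),
    ("2020-01-15T03:30:00", "198.51.100.7", "10.0.0.5"),  # not an unused address
]

# Assumed set of unused addresses; any traffic sent to them is unsolicited.
UNUSED = {"192.0.2.10", "192.0.2.11", "192.0.2.12"}


def aggregate(flows, unused):
    """Count access attempts per hour and unique sources per day."""
    per_hour = Counter()
    sources_per_day = defaultdict(set)
    for ts, src, dst in flows:
        if dst not in unused:
            continue  # only traffic to unused addresses is of interest
        t = datetime.fromisoformat(ts)
        per_hour[t.strftime("%Y-%m-%d %H:00")] += 1
        sources_per_day[t.date().isoformat()].add(src)
    unique_per_day = {day: len(srcs) for day, srcs in sources_per_day.items()}
    return per_hour, unique_per_day


per_hour, unique_per_day = aggregate(FLOWS, UNUSED)
```
]]></preformat>
      <p>Keeping a set of source addresses per day makes the count of unique sources independent of how many attempts each source makes, which is the distinction the text draws between attack duration and attack intensity.</p>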
      <p>[Fig. 3: number of unique threat sources per day; vertical axis 500 to 2500.]</p>
      <p>The distribution of access attempts by hour of day (Fig. 4) has an average
deviation of about 1.5%, with a maximum deviation of about 5% around 07:00 KRAT.
This maximum corresponds to 00:00 GMT, which indicates the
prevalence of systems whose scheduled scans and attacks are launched at midnight
GMT.</p>
      <p>The analysis of the scanning frequency of individual services (Fig. 5) identifies
the most attacked services, which are also the ones under which threats are masked:
Telnet/23, MS SQL/1433, HTTP/80, Personal Agent/5555, SSH/22, HTTP
Alternate/8080, RDP/3389. Thus, Telnet and MS SQL can be added to the existing
set of blocked network ports (SSH, RDP, SMTP) to increase the protection efficiency.</p>
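      <p>The ranking of scanned services and the selection of additional ports to block can be sketched as follows. The port sample and the names SCANNED_PORTS, BLOCKED and blocking_candidates are hypothetical; only the set of already blocked services (SSH, RDP, SMTP) is taken from the text.</p>
      <preformat><![CDATA[
```python
from collections import Counter

# Hypothetical destination ports observed in scans (not the measured data).
SCANNED_PORTS = [23, 23, 23, 1433, 1433, 80, 5555, 22, 22, 8080, 3389, 23, 1433, 25]

# Ports already blocked according to the text: SSH (22), RDP (3389), SMTP (25).
BLOCKED = {22, 3389, 25}


def blocking_candidates(ports, blocked, top_n):
    """Return the top_n most frequently scanned ports that are not blocked yet."""
    ranked = [port for port, _count in Counter(ports).most_common()]
    return [port for port in ranked if port not in blocked][:top_n]


candidates = blocking_candidates(SCANNED_PORTS, BLOCKED, top_n=2)
```
]]></preformat>
      <p>A frequency ranking alone does not decide what to block: ports that carry legitimate public services, such as HTTP/80, would still have to be excluded by hand.</p>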
      <p>[Fig. 5: scanning frequency by service; labels: Telnet, MS SQL, HTTP, Personal Agent, SSH, HTTP Alternate, MS RDP, Web services, HTTPS.]</p>
      <p>The analysis of mail traffic reveals periodicity both within the day and across the days of the
week (Fig. 6).</p>
      <sec id="sec-3-1">
        <title>Email</title>
        <p>[Fig. 6: email and virus counts by month, 04'19 to 04'20.]</p>
        <p>The hourly number of viruses in mail, taken in the recipient's time zone (Fig. 7),
correlates well with the hourly number of spam messages (correlation coefficient 0.63).
The same distributions taken by sender time show zero correlation, which may indicate
that mail spam and mail viruses come from different sources.</p>
        <p>[Fig. 7: viruses and spam by hour of day, 0 to 23; vertical axis 0 to 0.8.]</p>
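        <p>The comparison of two hourly distributions can be sketched with the Pearson correlation coefficient. The text does not state which coefficient was used, so Pearson is an assumption here, and the hourly counts below are made up for illustration.</p>
        <preformat><![CDATA[
```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences.

    Raises ZeroDivisionError if either sequence is constant.
    """
    if len(xs) != len(ys) or len(xs) == 0:
        raise ValueError("sequences must be non-empty and of equal length")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Made-up counts for eight consecutive hours; not the measured data.
spam_by_hour = [5, 4, 6, 9, 14, 18, 17, 11]
virus_by_hour = [1, 1, 2, 2, 4, 5, 4, 3]

r = pearson(spam_by_hour, virus_by_hour)
```
]]></preformat>
        <p>Computing the coefficient twice, once with counts binned by recipient time and once by sender time, gives the two values that the text compares.</p>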
      </sec>
      <sec id="sec-3-3">
        <title>Spam</title>
        <p>Using geographical databases and aggregating the data by territory makes it possible to build a
distribution of spam sources by country (Fig. 8). The most active sources of mail spam
are Russia, the USA, Germany, France, and China.</p>
        <p>In terms of the number of virus sources, the United States, France, Russia, and
Vietnam are the leaders. Activity and the threat level on weekends are 2-3 times lower than
on workdays, while Tuesday and Wednesday have the highest threat level. In
terms of the time of day, the threat level at night is 2 times lower than during working
hours.</p>
        <sec id="sec-3-3-1">
          <title>Intrusion Prevention System Data Analysis</title>
          <p>Analysis of the Intrusion Prevention System log shows no periodicity with
regard to either the day of the week or the time of day (Fig. 9). The number of blocked network
addresses is approximately 3 times lower than the number of unique scanning sources
per day (on average, 500 and 1500, respectively), which indicates that only every
third source takes actions that lead to its blocking.</p>
          <p>[Fig. 9: Intrusion Prevention System log activity; axis ticks 0 to 1400 and 0 to 10000.]</p>
          <p>In the hourly frequency distribution of blocking events for the SSH and RDP
protocols (Fig. 10), there is a peak (about 150% of the average) at around midnight
GMT, which is an additional indicator of a large scanning network that launches its scans at
00:00 GMT.</p>
          <p>[Fig. 10: blocking events by hour of day, 0 to 23; series: SSH, RDP and mail services.]</p>
          <p>The largest number of responses (165 thousand) relates to the SSH service, followed by
SMTP (about 30 thousand) and RDP (about 3 thousand). The significant
predominance of SSH may be due to the specifics of the blocking system: all
connections from a threat source are blocked, and SSH has the lowest port
number (22) among the popular services and is therefore checked first. The analysis of the
geographical location of threat sources shows that China (39%) and the
United States (12%) lead. Since China does not hold a leading position in the other
data, this suggests purposeful intrusion attempts from China.</p>
        </sec>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>WWW Data Analysis</title>
      <p>The analysis of web service logs shows periodicity in terms of both the day of the week
and the time of day. In addition, the number of requests tends to increase as the
WWW services expand, and a growing number of visitors is
also observed. The analysis of requests by country shows the following
results: Russia accounts for about 80% of all visits, the USA for 7%, and Germany for 2%. The analysis of
error logs shows the following: Russia 54%, the USA 27%, and China 3%. The most
popular browsers are Chrome (47%), Firefox (16%), and Internet Explorer (6%). Web
spiders and bots account for about 9% of the total number of requests and 32% of the
total number of errors.</p>
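      <p>One way to compare the two sets of country shares is the ratio of the error share to the request share; a value well above 1 means that a country produces disproportionately many errors. The sketch below uses only the two countries for which the text reports both figures. The ratio itself and the name error_to_request_ratio are illustrative assumptions, not a metric defined in the study.</p>
      <preformat><![CDATA[
```python
# Shares in percent as reported in the text, for the countries that
# appear in both the request statistics and the error statistics.
REQUEST_SHARE = {"Russia": 80.0, "USA": 7.0}
ERROR_SHARE = {"Russia": 54.0, "USA": 27.0}


def error_to_request_ratio(requests, errors):
    """Ratio of error share to request share for countries present in both."""
    return {c: errors[c] / requests[c] for c in requests if c in errors}


ratios = error_to_request_ratio(REQUEST_SHARE, ERROR_SHARE)
```
]]></preformat>
      <p>With these figures the ratio is below 1 for Russia and close to 4 for the USA.</p>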
      <p>When processing web service logs, requests were divided into two non-intersecting
groups: legitimate and erroneous requests. Legitimate requests are those that are
processed by a web application or web service in normal mode and whose HTTP
response code is one of 1XX, 2XX, 3XX. Erroneous requests (or errors) are those that
are processed incorrectly either on the client side (HTTP response code 4XX) or on
the server side (5XX). Error analysis is important because it helps reveal
malicious activity. Figures 11 and 12 show the number of requests and errors per day.
As the graphs show, the number of requests and errors depends on
holidays, since most of the services are used during business hours.</p>
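      <p>The split into legitimate and erroneous requests follows directly from the HTTP status code ranges given above and can be sketched as follows. The sample entries and the names classify, ENTRIES and per_day are hypothetical; parsing of the actual web server log format is left out.</p>
      <preformat><![CDATA[
```python
from collections import Counter


def classify(status):
    """Label an HTTP status code as 'legitimate' (1XX-3XX) or 'error' (4XX-5XX)."""
    if 100 <= status <= 399:
        return "legitimate"
    if 400 <= status <= 599:
        return "error"
    raise ValueError(f"not an HTTP status code: {status}")


# Hypothetical (date, status) pairs already parsed from an access log.
ENTRIES = [
    ("2019-01-09", 200), ("2019-01-09", 404), ("2019-01-09", 301),
    ("2019-01-10", 500), ("2019-01-10", 200),
]

# Number of requests per (day, class), the quantity plotted per day.
per_day = Counter((day, classify(status)) for day, status in ENTRIES)
```
]]></preformat>
      <p>Because every valid status code falls into exactly one of the two ranges, the two groups do not intersect, as the definition requires.</p>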
      <sec id="sec-4-2">
        <title>Trend</title>
        <p>[Fig. 11: requests per day with trend, 01'19 to 01'20.]</p>
        <p>The correlation coefficient between the distributions of requests by server time and by client time is 0.984 (Fig. 13). This
indicates that most of the requests come from the Krasnoyarsk time zone (KRAT) and
nearby time zones.</p>
        <p>[Fig. 13: distribution by hour of day, server time and client time.]</p>
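        <p>The two hourly distributions can be sketched as follows. The text does not say how client time was determined; the example assumes that each request carries a whole-hour UTC offset for the client (for instance derived from geolocation), which is a simplification. Server time is KRAT, which is UTC+7. All names and sample values are hypothetical.</p>
        <preformat><![CDATA[
```python
SERVER_UTC_OFFSET = 7  # KRAT is UTC+7


def client_hour(server_hour, client_utc_offset):
    """Convert an hour of day in server time to the client's local hour."""
    return (server_hour - SERVER_UTC_OFFSET + client_utc_offset) % 24


def hourly_histograms(requests):
    """Build 24-bin request counts by server time and by client time."""
    by_server = [0] * 24
    by_client = [0] * 24
    for server_hour, client_utc_offset in requests:
        by_server[server_hour] += 1
        by_client[client_hour(server_hour, client_utc_offset)] += 1
    return by_server, by_client


# Hypothetical requests: (hour in server time, client UTC offset in hours).
REQUESTS = [(10, 7), (10, 7), (11, 7), (10, 3), (2, -5)]

by_server, by_client = hourly_histograms(REQUESTS)
```
]]></preformat>
        <p>When most clients share the server's time zone the two histograms nearly coincide, which is consistent with a correlation coefficient close to 1.</p>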
      </sec>
      <sec id="sec-4-6">
        <title>Client</title>
        <p>Requests and errors were normalized and their correlation was calculated for both server
time and client time (Fig. 14). The high value of the correlation coefficient for errors was found
to be caused by the incorrect operation of websites and web services. However, the rest of
the errors indicate scans and attacks performed by web spiders.</p>
      </sec>
      <sec id="sec-4-9">
        <title>Hourly requests Hourly errors</title>
        <p>[Fig. 14: hourly requests and hourly errors, distribution by server time and by client time; absolute counts and values normalized to the range 0 to 1.]</p>
        <p>Since errors indicate attempts to access non-existent or non-public resources, there
is a high probability that the number of threat sources from the United States
and China is increasing, since their share of errors is several times greater than their share of
requests. The majority of errors are caused by attempts to detect vulnerabilities in popular
Content Management Systems (CMS). In addition, the analysis of browsers in terms of
requests and errors shows an increased percentage of errors from web spiders, which
may indicate a high level of risk.</p>
        <p>As a result of the research, several measures were taken to increase the security of the
Internet services of the Krasnoyarsk Science Center: 1) the threshold time interval
was increased for more confident detection of network
scanning; 2) new TCP ports were added to the monitoring system to track
malicious activity; 3) firewall settings were optimized to block
unwanted hosts more effectively; 4) web server settings were updated to prevent attacks on
known CMS vulnerabilities; 5) the network settings of internal switches
were optimized to block unwanted traffic between different divisions.</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>Conclusion</title>
      <p>In this work, we analyzed log data from the corporate Internet services of the
Krasnoyarsk Science Center. The main sources of cybersecurity threats were
identified. New signs of threat sources were determined, which can be used to improve
corporate network security systems. In general, the security tools in use
detect and block threats at early stages. The results of the study make it possible to
optimize protection systems against network attacks by taking into account the
identified threat sources that standard
security tools did not previously cover. The measures taken increase the responsiveness to emerging threats
and the cybersecurity of the organization as a whole.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Khan</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Parkinson</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          :
          <article-title>Discovering and utilising expert knowledge from security event logs</article-title>
          .
          <source>Journal of Information Security and Applications</source>
          <volume>48</volume>
          ,
          <issue>102375</issue>
          (
          <year>2019</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Landauer</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Wurzenberger</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Skopik</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          :
          <article-title>Dynamic log file analysis: An unsupervised cluster evolution approach for anomaly detection</article-title>
          .
          <source>Computers &amp; Security</source>
          <volume>79</volume>
          ,
          <fpage>94</fpage>
          -
          <lpage>116</lpage>
          (
          <year>2018</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Kim</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kim</surname>
            ,
            <given-names>Y.-H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Shin</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Shin</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          :
          <article-title>Fast attack detection system using log analysis and attack tree generation</article-title>
          .
          <source>Cluster Computing</source>
          <volume>22</volume>
          (
          <issue>2</issue>
          ),
          <fpage>1827</fpage>
          -
          <lpage>1835</lpage>
          (
          <year>2019</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>Mujtaba</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Shuib</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Raj</surname>
            ,
            <given-names>R. G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Majeed</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Al-Garadi</surname>
            ,
            <given-names>M. A.</given-names>
          </string-name>
          :
          <article-title>Email Classification Research Trends: Review and Open Issues</article-title>
          .
          <source>IEEE Access</source>
          <volume>5</volume>
          ,
          <fpage>9044</fpage>
          -
          <lpage>9064</lpage>
          (
          <year>2017</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <surname>Zhang</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zhao</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Luo</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Xin</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zhu</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          :
          <article-title>An Intrusion Action-Based IDS Alert Correlation Analysis and Prediction Framework</article-title>
          .
          <source>IEEE Access</source>
          <volume>7</volume>
          ,
          <fpage>150540</fpage>
          -
          <lpage>150551</lpage>
          (
          <year>2019</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <surname>Sapegin</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
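          <string-name>
            <surname>Jaeger</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Cheng</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          ,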
          <string-name>
            <surname>Meinel</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          :
          <article-title>Towards a system for complex analysis of security events in large-scale networks</article-title>
          .
          <source>Computers &amp; Security</source>
          <volume>67</volume>
          ,
          <fpage>16</fpage>
          -
          <lpage>34</lpage>
          (
          <year>2017</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <surname>Kulyasov</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Isaev</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          :
          <article-title>Research of network anomalies in the corporate network of the Krasnoyarsk Scientific Center</article-title>
          .
          <source>Siberian Journal of Science and Technology</source>
          <volume>19</volume>
          (
          <issue>3</issue>
          ),
          <fpage>412</fpage>
          -
          <lpage>422</lpage>
          (
          <year>2018</year>
          )
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>