

    Analysis of Internet Service Log Data to Assess the Level
          of Cyber-threats in the Corporate Network*

              Sergey Isaev[0000-0002-6678-0084], Dmitry Kononov[0000-0002-8757-5274]
                          and Andrey Malyshev[0000-0001-5669-1574]

                   Institute of Computational Modelling of the Siberian Branch
    of the Russian Academy of Sciences, 50/44 Akademgorodok, Krasnoyarsk, 660036, Russia
                                     ddk@icm.krasn.ru


         Abstract. The article describes the analysis of Internet service logs at the
         Krasnoyarsk Science Center (Russia) and shows the importance of log analysis
         as a method for improving network security. The data sources are described.
         The study examines the following systems: Netflow IP traffic, an intrusion
         prevention system, a corporate mail server, and a web server. The log data
         was used to determine the frequency of events and to identify malicious
         behavior. The article describes the security threats identified during the
         analysis and the measures taken to improve network security. The results
         make it possible to optimize protection systems against network attacks.

         Keywords: Cyber-Threats, Security, Data Analysis, Log, Internet.


1        Introduction

The development of modern information technologies leads to a growing level of
digitalization and the active use of various Internet services in scientific and
business processes. The corporate network and the services it provides become
everyday working tools, without which the organization cannot function fully. In this
regard, assessing cybersecurity risks and the level of cyber-threats in order to provide
adequate protection is becoming more and more relevant. An important aspect of
cybersecurity is the study of security logs [1]. Modern researchers use dynamic
methods of analysis, since traditional approaches with static metrics may miss
intelligent low-frequency attacks [2]. An important parameter of a secure system is
the response time to information security incidents. Minimizing this parameter to
the point of full attack prevention is described in [3]. Methods and algorithms for
countering email spam are being actively developed [4]. Various software is used to
provide information security: security scanners [5], complex security analysis
systems [6], etc. Thus, revealing new signs of cyber-threats and new analysis
methods is an urgent problem.


*   Copyright © 2020 for this paper by its authors. Use permitted under Creative Commons
    License Attribution 4.0 International (CC BY 4.0).


   For many years, the Krasnoyarsk Science Center has been studying the problems of
cybersecurity analysis and network protection [7]. The purpose of this work is to
analyze log data from Internet services, identify potential risks, and optimize
security mechanisms.


2          Data Sources

The corporate network of the Krasnoyarsk Science Center has a four-level
architecture: 1) network core which provides routing and connectivity to external
networks; 2) server network which hosts Internet services; 3) aggregation level which
connects multiple organizations together; 4) local network for end users. All the
information about network traffic is collected at the Internet connection points and
server network. In addition, there are logs from the main Internet services: corporate
mail server, web server, and proxy server for access to external resources. The
following data sources were analyzed:

1. Netflow IP traffic: more than 400 GB, more than 2 billion records.
2. Mail server log: 1 year, more than 1.5 million records.
3. Intrusion Prevention System (IPS) log: 1 year, about 200 thousand records.
4. WWW log: 1 year, about 12 GB, more than 44 million records.


3          IP Traffic Data Analysis

To assess the persistent background threat level, IP traffic to unused network
addresses was analyzed (Fig. 1).

    Fig. 1. Incoming connections to unused network addresses (daily, June 2019 – June 2020).

The analysis shows a persistent stream of access attempts. In the daily aggregation,
the number of access attempts trends between 5000 and 12000 per day. The detected
peak of 60000 on 14 January 2020 is explained by the duration of the attack rather
than its intensity, as seen in the hourly aggregation (Fig. 2).


             Fig. 2. Incoming connections to unused network addresses (hourly).

The rate of access attempts to a single address is about 500 per hour, regardless of
the time of day and the day of the week, with peaks of up to 4500 per hour. Thus, the
time interval for confident detection of network attacks should be no more than one
hour. The number of unique threat sources per day (Fig. 3) varies between 1000 and
2000, which indicates the presence of a large and constantly operating network used
for scanning Internet services.

               Fig. 3. Number of unique scan sources per day for April 2020.
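
The aggregations behind Figs. 1–3 can be illustrated with a minimal sketch. It assumes that flow records have already been reduced to (UNIX timestamp, source address, destination address) tuples. The record layout, the `UNUSED` address set, and the sample flows are hypothetical and use documentation address ranges; this is not the tooling used in the study.

```python
from collections import Counter, defaultdict
from datetime import datetime, timezone

# Hypothetical set of unused addresses (documentation range, not real data).
UNUSED = {"192.0.2.10", "192.0.2.11"}

def aggregate(flows, unused=UNUSED):
    """Count hits on unused addresses per day and per hour,
    and the number of unique sources per day."""
    per_day = Counter()
    per_hour = Counter()
    sources = defaultdict(set)
    for ts, src, dst in flows:
        if dst not in unused:
            continue  # traffic to addresses in use is out of scope here
        t = datetime.fromtimestamp(ts, tz=timezone.utc)
        day = t.strftime("%Y-%m-%d")
        per_day[day] += 1
        per_hour[t.strftime("%Y-%m-%d %H")] += 1
        sources[day].add(src)
    unique_per_day = {d: len(s) for d, s in sources.items()}
    return per_day, per_hour, unique_per_day

# Illustrative flows only; 1586736000 is 2020-04-13 00:00:00 UTC.
flows = [
    (1586736000, "198.51.100.1", "192.0.2.10"),
    (1586736060, "198.51.100.1", "192.0.2.11"),
    (1586739600, "198.51.100.2", "192.0.2.10"),
    (1586739700, "198.51.100.3", "203.0.113.5"),  # destination in use, ignored
]
per_day, per_hour, unique_per_day = aggregate(flows)
```

Keeping the daily and hourly counters side by side makes it easy to tell a long attack (many moderately busy hours) from an intense one (a single very busy hour), which is the distinction drawn for the peak in Fig. 1.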

The distribution of scans by hour (Fig. 4) has an average deviation of about 1.5%,
with a maximum deviation of about 5% around 07:00 Krasnoyarsk time (KRAT, UTC+7).
This maximum corresponds to 00:00 GMT, which indicates the prevalence of systems
with scheduled scans and attacks launched at midnight GMT.

                           Fig. 4. Distribution of scans by hours.
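
The hourly comparison can be sketched as follows, assuming that "deviation" means the relative deviation of each hourly count from the mean over all 24 hours (the text does not define it precisely). The function names and the sample counts are illustrative assumptions; only the KRAT offset of UTC+7 is a fixed fact.

```python
from datetime import datetime, timedelta, timezone

KRAT = timezone(timedelta(hours=7), "KRAT")  # Krasnoyarsk time, UTC+7

def hourly_deviation(counts):
    """Relative deviation of each hourly count from the mean, in percent."""
    mean = sum(counts) / len(counts)
    return [100.0 * (c - mean) / mean for c in counts]

def krat_hour_to_utc(hour):
    """Convert an hour of day in KRAT to the matching hour of day in UTC."""
    local = datetime(2020, 4, 1, hour, tzinfo=KRAT)
    return local.astimezone(timezone.utc).hour

# Illustrative counts only (not the paper's data): flat profile, one peak.
counts = [100] * 24
counts[7] = 124
dev = hourly_deviation(counts)
peak_hour = max(range(24), key=lambda h: dev[h])
peak_hour_utc = krat_hour_to_utc(peak_hour)
```

Converting the peak hour to UTC is what links a local maximum at 07:00 KRAT to activity scheduled at midnight GMT.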


The analysis of the scanning frequency of individual services (Fig. 5) identifies the
services that are attacked most often and behind which threats are masked:
Telnet/23, MS SQL/1433, HTTP/80, Personal Agent/5555, SSH/22, HTTP
Alternate/8080, and RDP/3389. Thus, Telnet and MS SQL can be added to the existing
list of blocked network ports (SSH, RDP, SMTP) to increase protection efficiency.

                                 Fig. 5. Scan rates by service.
   [Bar chart; services shown include Telnet, MS SQL, HTTP, Personal Agent, SSH,
   HTTP Alternate, MS RDP, and HTTPS.]
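
A ranking of this kind can be sketched in a few lines. The port numbers and the blocked set (SSH, RDP, SMTP) follow the text; the function `blocking_candidates` and the sample of destination ports are hypothetical and serve only as an illustration.

```python
from collections import Counter

# Port-to-service names for the services named in the text.
SERVICES = {23: "Telnet", 1433: "MS SQL", 80: "HTTP", 5555: "Personal Agent",
            22: "SSH", 8080: "HTTP Alternate", 3389: "RDP"}
# Ports the text describes as already blocked: SSH, RDP, SMTP.
BLOCKED = {22, 3389, 25}

def blocking_candidates(dst_ports, top_n=2, blocked=BLOCKED):
    """Return the top_n most scanned ports that are not yet blocked."""
    ranked = Counter(dst_ports).most_common()
    return [port for port, _ in ranked if port not in blocked][:top_n]

# Illustrative sample of scanned destination ports (not the paper's data).
sample = [23] * 5 + [1433] * 4 + [22] * 3 + [80] * 2 + [3389]
candidates = blocking_candidates(sample)
names = [SERVICES.get(p, str(p)) for p in candidates]
```

In practice, ports that carry public services (such as HTTP) would be excluded from the candidates by hand, since blocking them would take the service offline.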


4      Mail Server Data Analysis

The analysis of the mail traffic reveals periodicity with respect to both the time of
day and the day of the week (Fig. 6).

       Fig. 6. Normalized daily number of emails and viruses detected during the year.

The hourly number of viruses in mail, in the recipient's time zone (Fig. 7), correlates
well with the amount of mail spam (correlation coefficient 0.63), whereas the time
distributions in the sender's time zone show zero correlation. This may indicate that
mail spam and mail viruses come from different sources.


            Fig. 7. Distribution of spam and viruses by the recipient's time zone.
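
The text does not state which correlation measure was used. A minimal sketch, assuming the Pearson coefficient computed over two hourly series of equal length, is given below; the sample series are illustrative and are not the paper's data.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mx = sum(xs) / len(xs)
    my = sum(ys) / len(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / sqrt(var_x * var_y)

# Illustrative hourly counts only.
spam = [1, 2, 3, 4]
viruses = [2, 4, 6, 8]
r_same = pearson(spam, viruses)            # identical shape
r_opposite = pearson(spam, viruses[::-1])  # mirrored shape
```

Applied to two 24-element hourly profiles, a value near 1 means the two series rise and fall together, and a value near 0 means their daily shapes are unrelated.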

Geographical databases and data aggregation by territory make it possible to build
the distribution of spam sources by country (Fig. 8). The most active sources of
mail spam are Russia, the USA, Germany, France, and China.


                          Fig. 8. Leading countries of mail spam.
   [Pie chart. Legend: Russia, United States, Germany, France, China, Netherlands,
   Other. Slice labels: 28%, 24%, 13%, 12%, 10%, 9%, 4%.]

In terms of the number of virus sources, the United States, France, Russia, and
Vietnam are the leaders. Activity and the threat level on weekends are 2–3 times
lower than on workdays, while Tuesday and Wednesday have the highest threat level.
The threat level at night is 2 times lower than during working hours.


5       Intrusion Prevention System Data Analysis

The analysis of the Intrusion Prevention System log shows no periodicity with respect
to either the day of the week or the time of day (Fig. 9). The number of blocked
network addresses is approximately 3 times lower than the number of unique scanning
sources per day (on average, 500 and 1500, respectively), which indicates that only
every third source takes actions that lead to its blocking.


                 Fig. 9. Histogram of the number of blocks during the year.

In the hourly distribution of IPS blocks for the SSH and RDP protocols (Fig. 10),
there is a peak (about 150% of the average) at around midnight GMT, which is a
further indication of a large scanning network that launches its activity at
00:00 GMT.

                       Fig. 10. Hourly distribution of IPS responses.
   [Series: SSH and RDP; mail services.]

The largest number of responses (165 thousand) falls on the SSH service, followed by
SMTP (about 30 thousand) and RDP (about 3 thousand). The significant predominance
of SSH may be due to a specific feature of the blocking system: all connections
from a threat source are blocked, and SSH has the lowest port number (22) among
the popular services and is therefore checked first. The analysis of the geographical
location of threat sources shows that China (39%) and the United States (12%) lead.
Since China does not hold a leading position in the other data, this suggests
purposeful intrusion attempts from China.


6          WWW Data Analysis

The analysis of web service logs shows periodicity with respect to both the day of
the week and the time of day. In addition, the number of requests tends to grow as
the WWW services expand, and the number of visitors is also increasing. By country,
requests are distributed as follows: Russia – about 80% of all visits, USA – 7%,
Germany – 2%. Errors are distributed as follows: Russia – 54%, USA – 27%,
China – 3%. The most popular browsers are Chrome (47%), Firefox (16%), and Internet
Explorer (6%). Web spiders and bots account for about 9% of the total number of
requests and 32% of the total number of errors.
   When processing web service logs, requests were divided into two non-intersecting
groups: legitimate and erroneous requests. Legitimate requests are those that are
processed by a web application or web service in normal mode and whose HTTP
response code is one of 1XX, 2XX, 3XX. Erroneous requests (or errors) are those that
fail either on the client side (HTTP response code 4XX) or on the server side (5XX).
Error analysis is important because it helps reveal malicious activity. Figures 11
and 12 show the number of requests and errors per day. As the graphs show, the
number of requests and errors varies with holidays, since most of the services are
used during business hours.

                    Fig. 11. Number of requests per day during the year.
   [Series: requests and trend.]

                     Fig. 12. Number of errors per day during the year.
   [Series: errors and trend.]
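
The split into legitimate and erroneous requests follows directly from the HTTP response code classes described above. A minimal sketch is given below; it assumes log entries have already been parsed into (date, status code) pairs, and the function names and sample entries are hypothetical.

```python
from collections import Counter

def classify(status):
    """Label an HTTP status code as 'legitimate' (1XX-3XX) or 'error' (4XX-5XX)."""
    if 100 <= status <= 399:
        return "legitimate"
    if 400 <= status <= 599:
        return "error"
    raise ValueError(f"unexpected HTTP status code: {status}")

def daily_counts(entries):
    """Count legitimate and erroneous requests per day from (date, status) pairs."""
    counts = Counter()
    for day, status in entries:
        counts[(day, classify(status))] += 1
    return counts

# Illustrative entries only (not the paper's data).
entries = [
    ("2019-05-01", 200),
    ("2019-05-01", 304),
    ("2019-05-01", 404),
    ("2019-05-02", 500),
]
counts = daily_counts(entries)
```

Because every valid status code falls into exactly one of the two classes, the groups are non-intersecting, as required for the per-day series in Figs. 11 and 12.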

The correlation coefficient between the hourly request counts by server time and by
client time is 0.984 (Fig. 13). This indicates that most of the requests originate
from Krasnoyarsk (KRAT) and nearby time zones.


                         Fig. 13. Number of requests and errors per hour.
   [Panels: Hourly requests, Hourly errors. Series: Server, Client.]

Requests and errors were normalized by server and client time, and their correlation
was calculated (Fig. 14). The high correlation coefficient for errors was found to be
caused by the incorrect operation of websites and web services. The remaining
errors, however, indicate scans and attacks performed by web spiders.

             Fig. 14. Distribution of requests and errors by server and client time.
   [Panels: Distribution by server time, Distribution by client time. Series:
   Requests, Errors.]
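
How client time was obtained is not stated in the text. The sketch below assumes that each request carries the hour of day on the server clock (taken here to run on KRAT, UTC+7) and a whole-hour UTC offset for the client, for example derived from geolocation. It builds hourly profiles scaled so that the busiest hour equals 1, as in Fig. 14. All names and sample values are illustrative.

```python
SERVER_OFFSET = 7  # assumed: the server clock runs on KRAT (UTC+7)

def hourly_profile(requests, use_client_time=False):
    """Histogram of requests by hour of day, scaled so the busiest hour is 1.

    requests: iterable of (server_hour, client_utc_offset_hours) pairs.
    """
    bins = [0] * 24
    for server_hour, client_offset in requests:
        hour = server_hour
        if use_client_time:
            # Shift from server local time to the client's local time.
            hour = (server_hour - SERVER_OFFSET + client_offset) % 24
        bins[hour] += 1
    peak = max(bins)
    return [b / peak for b in bins] if peak else bins

# Illustrative requests only: two local clients, one at UTC+3, one at UTC-5.
requests = [(10, 7), (10, 7), (10, 3), (15, -5)]
by_server = hourly_profile(requests)
by_client = hourly_profile(requests, use_client_time=True)
```

When most clients share the server's time zone, the two profiles nearly coincide, which is consistent with the high correlation reported for Fig. 13.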

Since errors indicate attempts to access non-existent or non-public resources, it is
highly probable that the United States and China host an increased number of threat
sources: their share of errors is several times greater than their share of
requests. The majority of errors are caused by attempts to detect vulnerabilities in
popular Content Management Systems (CMS). In addition, the analysis of browsers in
terms of requests and errors shows an increased percentage of errors from web
spiders, which may indicate a high threat risk.
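
The comparison of error share to request share can be made explicit with a small worked example. The percentages for Russia and the USA are those reported in this section; China is omitted because its share of requests is not given. The ratio itself is an illustrative indicator, not a metric defined in the study.

```python
# Shares by country as reported in this section, in percent.
REQUEST_SHARE = {"Russia": 80, "USA": 7}
ERROR_SHARE = {"Russia": 54, "USA": 27}

def error_ratio(country):
    """Ratio of a country's share of errors to its share of requests."""
    return ERROR_SHARE[country] / REQUEST_SHARE[country]

ratios = {country: error_ratio(country) for country in REQUEST_SHARE}
# Countries whose share of errors exceeds their share of requests.
suspicious = sorted(c for c, r in ratios.items() if r > 1.0)
```

With these figures the ratio is below 1 for Russia (roughly 0.7) and close to 4 for the USA, which matches the statement that the US share of errors is several times greater than its share of requests.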


7      Measures Taken

As a result of the research, several measures were taken to increase the security of
the Internet services of the Krasnoyarsk Science Center. In particular: 1) the
threshold time interval was increased for more confident detection of network
scanning; 2) new TCP ports were added to the monitoring system to track malicious
activity; 3) firewall settings were optimized to block unwanted hosts more
effectively; 4) web server settings were updated to prevent attacks on known CMS
vulnerabilities; 5) network settings of internal switches were optimized to block
unwanted traffic between different divisions.


8      Conclusion

In this work, we analyzed log data from the corporate Internet services of the
Krasnoyarsk Science Center. The main sources of cybersecurity threats were
identified. New signs of threat sources were determined that can be used to improve
corporate network security systems. In general, the security tools applied make it
possible to detect and block threats at early stages. The results of the study make
it possible to optimize protection systems against network attacks by taking into
account the identified threat sources that standard security tools previously did
not consider. The measures taken improve the responsiveness to emerging threats
and the cybersecurity of the organization as a whole.


References
 1. Khan, S., Parkinson, S.: Discovering and utilising expert knowledge from security event
    logs. Journal of Information Security and Applications 48, 102375 (2019)
 2. Landauer, M., Wurzenberger, M., Skopik, F.: Dynamic log file analysis: An unsupervised
    cluster evolution approach for anomaly detection. Computers & Security 79, 94–116
    (2018)
 3. Kim, D., Kim, Y.-H., Shin, D., Shin, D.: Fast attack detection system using log analysis
    and attack tree generation. Cluster Computing 22(2), 1827–1835 (2019)
 4. Mujtaba, G., Shuib, L., Raj, R. G., Majeed, N., Al-Garadi, M. A.: Email Classification
    Research Trends: Review and Open Issues. IEEE Access 5, 9044–9064 (2017)
 5. Zhang, K., Zhao, F., Luo, S., Xin, Y., Zhu, H.: An Intrusion Action-Based IDS Alert
    Correlation Analysis and Prediction Framework. IEEE Access 7, 150540–150551 (2019)
 6. Sapegin, A., Jaeger, D., Cheng, F., Meinel, C.: Towards a system for complex analysis of
    security events in large-scale networks. Computers & Security 67, 16–34 (2017)
 7. Kulyasov, N., Isaev, S.: Research of network anomalies in the corporate network of the
    Krasnoyarsk Scientific Center. Siberian Journal of Science and Technology 19(3),
    412–422 (2018)