Non-Linear Analytic Prediction of IP Addresses for Supporting Cyber Attack Detection and Analysis

Alfredo Cuzzocrea¹, Enzo Mumolo², Edoardo Fadda³, Selim Soufargi⁴ and Carson K. Leung⁡

¹ iDEA Lab, University of Calabria, Rende, Italy
² University of Trieste, Trieste, Italy
³ Politecnico di Torino & ISIRES, Torino, Italy
⁴ iDEA Lab, University of Calabria, Rende, Italy
⁡ University of Manitoba, Winnipeg, MB, Canada

Abstract
Computer network systems are often subject to several types of attacks. For example, the Distributed Denial of Service (DDoS) attack introduces an excessive traffic load on a web server to make it unusable. A popular method for detecting attacks is to use the sequence of source IP addresses to detect possible anomalies. With the aim of predicting the next IP address, the probability density function of the IP address sequence is estimated. Predicting the source IP addresses of future accesses to the server makes it possible to detect anomalous requests. In this paper we treat the sequence of IP addresses as a numerical sequence and develop a nonlinear analysis of that sequence, based on Volterra kernels and Hammerstein models. Experiments carried out on datasets of source IP address sequences show that the prediction errors obtained with Hammerstein models are smaller than those obtained both with Volterra kernels and with sequence clustering by means of the K-Means algorithm.

Keywords
Cyber Attack, Distributed Denial of Service, Hammerstein Models

1. Introduction

User modeling is an important task for web applications dealing with large traffic flows. User models can be used for a variety of purposes, such as predicting future situations or classifying current states. Furthermore, user modeling can improve the detection and mitigation of Distributed Denial of Service (DDoS) attacks [1, 2, 3], improve the quality of service (QoS) [4], support click fraud detection, and optimize traffic management.
SEBD 2021: The 29th Italian Symposium on Advanced Database Systems, September 5-9, 2021, Pizzo Calabro (VV), Italy
alfredo.cuzzocrea@unical.it (A. Cuzzocrea); mumolo@units.it (E. Mumolo); edoardo.fadda@polito.it (E. Fadda); selim.soufargi@unical.it (S. Soufargi); kleung@cs.umanitoba.ca (C. K. Leung)
Β© 2021 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR Workshop Proceedings (CEUR-WS.org), ISSN 1613-0073.

In peer-to-peer (P2P) overlay networks, IP models can also be used for optimizing request routing [5]. These techniques are used by servers for deciding how to manage the current traffic. In this context, outlier detection methods are also often used when only one class is known. If, for example, an Intrusion Prevention System wants to mitigate DDoS attacks, it has usually only seen the normal traffic class before, and it has to detect the outlier class by its different behavior. In this paper we deal with the management of DDoS attacks because they have become a major threat on the Internet. These attacks are carried out using large-scale networks of infected PCs (bots or zombies) that combine their bandwidth and computational power in order to overload a publicly available service and deny it to legitimate users. Due to the open structure of the Internet, all public servers are vulnerable to DDoS attacks. The bots are usually acquired automatically by hackers who use software tools to scan the network, detecting vulnerabilities and exploiting the target machines. Furthermore, there is a strong need to mitigate DDoS attacks near the target, which seems to be the only viable solution in the current Internet infrastructure. The aim of such a protection system is to limit the destabilizing effect of attacks on the server by identifying malicious requests. There are multiple strategies for dealing with DDoS attacks.
The most effective ones are near-target filtering solutions. They estimate normal user behavior based on IP packet header information; then, during an attack, access by outliers is denied. One parameter that all methods have in common is the source IP address of the users: it is the main discriminant for DDoS traffic classification. However, the methods for storing IP addresses and estimating their density in the huge IP address space differ. In this paper, we present a novel approach based on system identification techniques and, in particular, on Hammerstein models.

2. Non-Linear Analytic Prediction of IP Addresses

Data-driven identification of mathematical models of physical (i.e., nonlinear) systems starts with representing the system as a black box. In other terms, while we may have access to the inputs and outputs, the internal mechanisms are totally unknown to us. Once a model type is chosen to represent the system, its parameters are estimated through an optimization algorithm so that eventually the model mimics, at a certain level of fidelity, the inner mechanism of the nonlinear system or process using its inputs and outputs. This approach is, for instance, widely used in the related big data analytics area (e.g., [6, 7, 8, 9, 10, 11, 12, 13, 14]).

In this work, we consider a particular sub-class of nonlinear predictors: the linear-in-the-parameters (LIP) predictors. LIP predictors are characterized by a linear dependence of the predictor output on the predictor coefficients. Such predictors are inherently stable, and they converge to a globally minimum solution (in contrast to other types of nonlinear filters, whose cost function may exhibit many local minima), avoiding the undesired possibility of getting stuck in a local minimum. Let us consider a causal, time-invariant, finite-memory, continuous nonlinear predictor as described in (1):

\hat{s}(n) = f[s(n-1), \ldots, s(n-N)]   (1)

where f[Β·] is a continuous function, s(n) is the input signal and \hat{s}(n) is the predicted sample. We can expand f[Β·] with a series of basis functions f_i(Β·), as shown in (2):

\hat{s}(n) = \sum_{i=1}^{\infty} h(i) f_i[s(n-i)]   (2)

where the h(i) are proper coefficients. To make (2) realizable, we truncate the series to the first N terms, thus obtaining

\hat{s}(n) = \sum_{i=1}^{N} h(i) f_i[s(n-i)]   (3)

In the general case, a linear-in-the-parameters nonlinear predictor is described by the input-output relationship reported in (4):

\hat{s}(n) = \vec{H}^T \vec{X}(n)   (4)

where \vec{H}^T is a row vector containing the predictor coefficients and \vec{X}(n) is the corresponding column vector whose elements are nonlinear combinations and/or expansions of the input samples.

2.1. Linear Predictor

Linear prediction is a well-known technique with a long history [15]. Given a time series, linear prediction is the optimum approximation of sample s(n) with a linear combination of the N most recent samples. The linear predictor is thus described by (5):

\hat{s}(n) = \sum_{i=1}^{N} h_1(i) s(n-i)   (5)

or, in matrix form,

\hat{s}(n) = \vec{H}^T \vec{X}(n)   (6)

where the coefficient and input vectors are reported in (7) and (8):

\vec{H}^T = [h_1(1)\; h_1(2)\; \ldots\; h_1(N)]   (7)

\vec{X}^T = [s(n-1)\; s(n-2)\; \ldots\; s(n-N)]   (8)

2.2. Non-Linear Predictor based on Volterra Series

As with linear prediction, nonlinear prediction is the optimum approximation of sample s(n) with a nonlinear combination of the N most recent samples. Popular nonlinear predictors are based on Volterra series [16]. A Volterra predictor based on a Volterra series truncated to the second term is reported in (9):

\hat{s}(n) = \sum_{i=1}^{N_1} h_1(i) s(n-i) + \sum_{i=1}^{N_2} \sum_{j=i}^{N_2} h_2(i,j) s(n-i) s(n-j)   (9)

where the symmetry of the Volterra kernel (the h_2 coefficients) is exploited. In matrix terms, the Volterra predictor is represented in (10).
\hat{s}(n) = \vec{H}^T \vec{X}(n)   (10)

where the coefficient and input vectors are reported in (11) and (12):

\vec{H}^T = [h_1(1)\; h_1(2)\; \ldots\; h_1(N_1)\; h_2(1,1)\; h_2(1,2)\; \ldots\; h_2(N_2,N_2)]   (11)

\vec{X}^T = [s(n-1)\; s(n-2)\; \ldots\; s(n-N_1)\; s^2(n-1)\; s(n-1)s(n-2)\; \ldots\; s^2(n-N_2)]   (12)

2.3. Non-Linear Predictor based on Functional Link Artificial Neural Networks (FLANN)

A FLANN is a single-layer neural network without a hidden layer. The nonlinear relationships between input and output are captured through a functional expansion of the input signal exploiting suitable orthogonal polynomials. Many authors have used, for example, trigonometric, Legendre and Chebyshev polynomials. However, the basis functions most frequently used in FLANN for the functional expansion are trigonometric polynomials [17]. The FLANN predictor can be represented by (13):

\hat{s}(n) = \sum_{i=1}^{N} h_1(i) s(n-i) + \sum_{i=1}^{N} \sum_{j=1}^{N} h_2(i,j) \cos[i \pi s(n-j)] + \sum_{i=1}^{N} \sum_{j=1}^{N} h_3(i,j) \sin[i \pi s(n-j)]   (13)

Also in this case, the FLANN predictor can be represented using the matrix form of (10):

\hat{s}(n) = \vec{H}^T \vec{X}(n)   (14)

where the coefficient and input vectors of the FLANN predictor are reported in (15) and (16):

\vec{H}^T = [h_1(1)\; \ldots\; h_1(N)\; h_2(1,1)\; \ldots\; h_2(N,N)\; h_3(1,1)\; \ldots\; h_3(N,N)]   (15)

\vec{X}^T = [s(n-1)\; \ldots\; s(n-N)\; \cos[\pi s(n-1)]\; \ldots\; \cos[N \pi s(n-N)]\; \sin[\pi s(n-1)]\; \ldots\; \sin[N \pi s(n-N)]]   (16)

2.4. Non-Linear Predictors based on Hammerstein Models

Previous research [18] has shown that many real nonlinear systems, spanning from electromechanical systems to audio systems, can be modeled using a static nonlinearity.
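To make the model concrete, the following minimal Python sketch (our illustration, not code from the paper) implements the forward pass of a Hammerstein model: a static polynomial nonlinearity applied to the input, followed by a FIR filter. The coefficients `p` and `h0` are arbitrary placeholders chosen only for demonstration.

```python
import numpy as np

def hammerstein_output(x, p, h0):
    """Forward pass of a Hammerstein model:
    a static polynomial nonlinearity z(n) = sum_i p[i] * x(n)**i,
    followed by a FIR filter y(n) = sum_{j=1..N} h0[j] * z(n-j).
    `p` maps exponent -> coefficient; `h0` is the FIR impulse response."""
    z = sum(c * x ** e for e, c in p.items())  # hidden nonlinear signal z(n)
    N = len(h0)
    y = np.zeros_like(x)
    for n in range(N, len(x)):
        # FIR filter over the hidden signal z, with a one-sample delay
        y[n] = sum(h0[j] * z[n - 1 - j] for j in range(N))
    return y

# Toy usage with arbitrary (illustrative) coefficients.
x = np.linspace(0.0, 1.0, 10)
p = {2: 0.5, 3: -0.1}          # p(2), p(3)
h0 = np.array([1.0, 0.4])      # h0(1), h0(2)
y = hammerstein_output(x, p, h0)
```

In the identification setting described next, these coefficients are of course unknown and are estimated from input-output data.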
The static term captures the system nonlinearities; it is placed in series with a linear function, which captures the system dynamics, as shown in Figure 1.

[Figure 1: Representation of the Hammerstein model: the input x(n) feeds the polynomial branches p_2 x^2(n), p_3 x^3(n), ..., p_m x^m(n), whose sum enters a FIR filter producing the output y(n).]

Indeed, the front end of the so-called Hammerstein model is formed by a nonlinear function whose input is the system input. Of course, the type of nonlinearity depends on the actual physical system to be modeled. The output of the nonlinear function is hidden and is fed as input to the linear function. In the following, we assume that the nonlinearity is a finite polynomial expansion and that the linear dynamics are realized with a Finite Impulse Response (FIR) filter. Furthermore, in contrast with [18], we assume a mean error analysis and postpone the analysis in the robust framework to future work. In other words,

z(n) = p(2) x^2(n) + p(3) x^3(n) + \ldots + p(M) x^M(n) = \sum_{i=2}^{M} p(i) x^i(n)   (17)

On the other hand, the output of the FIR filter is

y(n) = h_0(1) z(n-1) + h_0(2) z(n-2) + \ldots + h_0(N) z(n-N) = \sum_{j=1}^{N} h_0(j) z(n-j)   (18)

Substituting (17) into (18) we have

y(n) = \sum_{j=1}^{N} h_0(j) \sum_{i=2}^{M} p(i) x^i(n-j) = \sum_{i=2}^{M} \sum_{j=1}^{N} h_0(j) p(i) x^i(n-j)   (19)

Setting c(i,j) = h_0(j) p(i), we write

y(n) = \sum_{i=2}^{M} \sum_{j=1}^{N} c(i,j) x^i(n-j)   (20)

This equation can be written in matrix form as

\hat{s}(n) = \vec{H}^T \vec{X}(n)   (21)

where

\vec{H}^T = [c(2,1)\; c(2,2)\; \ldots\; c(2,N)\; c(3,1)\; \ldots\; c(3,N)\; \ldots\; c(M,1)\; \ldots\; c(M,N)]   (22)

\vec{X}^T = [s^2(n-1)\; s^2(n-2)\; \ldots\; s^2(n-N)\; s^3(n-1)\; \ldots\; s^3(n-N)\; \ldots\; s^M(n-1)\; \ldots\; s^M(n-N)]   (23)

3. Predictor Parameters Estimation

So far we have seen that all the predictors can be expressed, at time instant n, as

\hat{s}(n) = \vec{H}^T \vec{X}(n)   (24)

with different definitions of the input vector \vec{X}(n) and parameter vector \vec{H}^T. There are two well-known approaches for estimating the optimal parameter vector.

3.1. Block-based Approach

Minimum Mean Square estimation is based on the minimization of the mathematical expectation of the squared prediction error e(n) = s(n) - \hat{s}(n):

E[e^2] = E[(s(n) - \hat{s}(n))^2] = E[(s(n) - \vec{H}^T \vec{X}(n))^2]   (25)

The minimization of (25) is obtained by setting to zero the gradient of the mathematical expectation of the squared prediction error:

\nabla_H E[e^2] = E[\nabla_H e^2] = E[2 e(n) \nabla_H e] = 0   (26)

which leads to the well-known unique solution

\vec{H}_{opt} = \vec{R}_{xx}^{-1} \vec{R}_{sx}   (27)

where

\vec{R}_{xx}(n) = E[\vec{X}(n) \vec{X}^T(n)]   (28)

is the statistical auto-correlation matrix of the input vector \vec{X}(n) and

\vec{R}_{sx}(n) = E[s(n) \vec{X}(n)]   (29)

is the statistical cross-correlation vector between the signal s(n) and the input vector \vec{X}(n). The mathematical expectations of the auto- and cross-correlation are estimated as

\vec{R}_{xx}(n) = \frac{1}{n} \sum_{k=1}^{n} \vec{X}(k) \vec{X}^T(k)   (30)

and

\vec{R}_{sx}(n) = \frac{1}{n} \sum_{k=1}^{n} s(k) \vec{X}(k)   (31)

3.2. Adaptive Approach

Let us consider the general second-order term of a Volterra predictor:

y(n) = \sum_{k=0}^{N-1} \sum_{r=0}^{N-1} h_2(k,r) x(n-k) x(n-r)   (32)

It can be generalized to higher-order terms as

\sum_{k_1=1}^{N} \cdots \sum_{k_p=1}^{N} c_{k_1} \cdots c_{k_p} H_p\{ x_{k_1}(n), \cdots, x_{k_p}(n) \}   (33)

where each factor stems from a first-order expansion of the form

\sum_{k=1}^{N} c_k x_k(n)   (34)

For the sake of simplicity and without loss of generality, we consider a Volterra predictor based on a Volterra series truncated to the second term:

\hat{r}(n) = \sum_{i=1}^{N_1} h_1(i) r(n-i) + \sum_{i=1}^{N_2} \sum_{j=i}^{N_2} h_2(i,j) r(n-i) r(n-j)   (35)

By defining

H^T(n) = [h_1(1), \cdots, h_1(N_1), h_2(1,1), \cdots, h_2(N_2,N_2)]   (36)

and

X^T(n) = [r(n-1), \cdots, r(n-N_1), r^2(n-1), r(n-1) r(n-2), \cdots, r^2(n-N_2)]   (37)

Eq. (35) can be rewritten as

\hat{r}(n) = H^T(n) X(n)   (38)

In order to estimate the best parameters H, we consider the following loss function:

J_n(H) = \sum_{k=0}^{n} \lambda^{n-k} [r(k) - H^T(n) X(k)]^2   (39)

where \lambda^{n-k} weights the relative importance of each squared error. In order to find the H that minimizes the convex function (39), it is enough to set its gradient to zero, i.e.,

\nabla_H J_n(H) = 0   (40)

which is equivalent to

R_{XX}(n) H(n) = R_{rX}(n)   (41)

where

R_{XX}(n) = \sum_{k=0}^{n} \lambda^{n-k} X(k) X^T(k),  R_{rX}(n) = \sum_{k=0}^{n} \lambda^{n-k} r(k) X(k)   (42)

It follows that the best H can be computed as

H(n) = R_{XX}^{-1}(n) R_{rX}(n)   (43)

Since

R_{XX}(n) = \lambda R_{XX}(n-1) + X(n) X^T(n)   (44)

it follows that

R_{XX}^{-1}(n) = \frac{1}{\lambda} [ R_{XX}^{-1}(n-1) - k(n) X^T(n) R_{XX}^{-1}(n-1) ]   (45)

where k(n) is equal to

k(n) = \frac{ R_{XX}^{-1}(n-1) X(n) }{ \lambda + X^T(n) R_{XX}^{-1}(n-1) X(n) }   (46)

The vector R_{rX}(n) in (43) can instead be written as

R_{rX}(n) = \lambda R_{rX}(n-1) + r(n) X(n)   (47)

Thus, inserting Eq. (47) and Eq. (45) into Eq. (43) and rearranging the terms, we obtain

H(n) = H(n-1) + R_{XX}^{-1}(n) X(n) \epsilon(n)   (48)

where

\epsilon(n) = r(n) - H^T(n-1) X(n)   (49)

By recalling Eq. (46), we can write Eq. (48) as

H(n) = H(n-1) + \epsilon(n) k(n)   (50)

By introducing the new notation

F(n) = S^T(n-1) X(n)   (51)

the previous equations can be summarized by the following system:

L(n) = S(n-1) F(n)
\beta(n) = \lambda + F^T(n) F(n)
\alpha(n) = 1 / ( \beta(n) + \sqrt{\lambda \beta(n)} )
S(n) = \frac{1}{\sqrt{\lambda}} [ S(n-1) - \alpha(n) L(n) F^T(n) ]
\epsilon(n) = r(n) - H^T(n-1) X(n)
H(n) = H(n-1) + L(n) \epsilon(n) / \beta(n)   (52)

It should be noted that, by using Eq. (52), the estimate adapts at each step in order to decrease the error; the structure of the system is thus somewhat similar to a Kalman filter. Finally, we define the estimation error as

e(n) = r(n) - H^T(n) X(n)   (53)

It is worth noting that the computation of the predicted value from Eq. (38) requires 6 N_{tot} + 2 N_{tot}^2 operations, where N_{tot} = N_1 + N_2 (N_2 + 1)/2.

4. Experiments

In order to prove the effectiveness of the proposed approach, in this section we present our experimental results on a real dataset. Specifically, we consider the requests made to the 1998 World Cup Web site between April 30, 1998 and July 26, 1998ΒΉ. During this period the site received 1,352,804,107 requests. The fields of the request structure contain the following information: (i) timestamp: the time of the request, stored as the number of seconds since the Epoch; the timestamp has been converted to GMT to allow for portability; during the World Cup the local time was 2 hours ahead of GMT (+0200).
In order to determine the local time, each timestamp must be adjusted by this amount; (ii) clientID: a unique integer identifier for the client that issued the request; due to privacy concerns these mappings cannot be released; note that each clientID maps to exactly one IP address, and the mappings are preserved across the entire dataset, that is, if IP address 0.0.0.0 mapped to clientID X on day Y, then any request in any of the datasets containing clientID X also came from IP address 0.0.0.0; (iii) objectID: a unique integer identifier for the requested URL; these mappings are also 1-to-1 and are preserved across the entire dataset; (iv) size: the number of bytes in the response; (v) method: the method contained in the client's request (e.g., GET); (vi) status: this field contains two pieces of information; the 2 highest-order bits contain the HTTP version indicated in the client's request (e.g., HTTP/1.0), while the remaining 6 bits indicate the response status code (e.g., 200 OK); (vii) type: the type of file requested, generally based on the file extension (.html) or the presence of a parameter list; (viii) server: indicates which server handled the request; the upper 3 bits indicate the region the server was at, while the remaining bits indicate which server at the site handled the request.

In the dataset, 87 days are reported. We use the first one to initialize the estimator in (35), and we use the others as a test set with a rolling-horizon method (as in [18]). In particular, for each day d we compute the estimation by using all the IP observations in the previous days [0, d-1]. The results are reported in Figure 2. The increase of errors in June is due to the sudden increment of distinct IPs accessing the website at the start of the competition (see [19]).

[Figure 2: Estimation error e(n) for each day of activity of the website, April 30 to July 23, 1998.]
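The day-by-day estimation sketched above can be reproduced in miniature with the adaptive recursion of Section 3.2. The following Python sketch is our illustration, not the paper's code: a synthetic autoregressive sequence stands in for the numeric IP stream, and a plain linear regressor is used instead of the full Hammerstein expansion. It applies the RLS updates of eqs. (44)-(50):

```python
import numpy as np

def rls_predict(s, N=2, lam=0.99, delta=100.0):
    """Recursive least-squares linear predictor (eqs. (44)-(50)):
    at each step, predict s(n) from X(n) = [s(n-1), ..., s(n-N)],
    then update H via the gain k(n) and the inverse correlation matrix P."""
    P = delta * np.eye(N)          # P = R_XX^{-1}, large initial value
    H = np.zeros(N)
    errors = []
    for n in range(N, len(s)):
        X = s[n - N:n][::-1]                 # [s(n-1), ..., s(n-N)]
        eps = s[n] - H @ X                   # a-priori error, eq. (49)
        k = (P @ X) / (lam + X @ P @ X)      # gain vector, eq. (46)
        H = H + eps * k                      # coefficient update, eq. (50)
        P = (P - np.outer(k, X @ P)) / lam   # inverse-matrix update, eq. (45)
        errors.append(abs(eps))
    return H, errors

# Synthetic stand-in for the numeric IP sequence: a stable AR(2) process.
rng = np.random.default_rng(1)
s = np.zeros(2000)
for n in range(2, 2000):
    s[n] = 0.5 * s[n - 1] + 0.3 * s[n - 2] + rng.normal(scale=0.01)

H, errors = rls_predict(s)
# H approaches the true AR coefficients [0.5, 0.3] as observations accumulate,
# mirroring the decreasing error behavior seen in Figure 2.
```

Replacing X with the polynomial regressor of eq. (23) turns the same recursion into the Hammerstein predictor used in the experiments.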
It should be noted that the estimation error decreases exponentially despite dealing with several millions of IPs.

ΒΉ ftp://ita.ee.lbl.gov/html/contrib/WorldCup.html

Since the computation of the optimal coefficients H(n) may require some time, we measure the percentage of available data that our approach needs in order to provide good results. In particular, in this experiment we consider the average estimation error made by the model at time t when considering only a subset of the IPs observed in the interval [0, t-1]. The experimental results are depicted in Figure 3.

[Figure 3: Estimation error e(n) vs. percentage size of the training set, for Hammerstein models, Volterra kernels and sequence clustering.]

It should be noted that the Hammerstein model outperforms both the Volterra kernels and the clustering techniques. In more detail, the clustering techniques are the worst performing. This is due to the nature of clustering techniques, which exploit the geometric information of the data more than their time dependency. We highlight that, although the calculation of H(n) is computationally intensive, this does not affect the real-time applicability of the method. In fact, the access decision is taken by considering the estimator \hat{r}(n), which is computed once per day; thus the computation of H(n) does not need to be fast.

5. Conclusions and Future Work

In this paper, we presented a new way to deal with cyber attacks by using Hammerstein models. Experimental results clearly confirm the effectiveness of the proposed techniques on a real dataset, outperforming other well-known techniques. Future work will have two objectives. First, we want to consider the problem in a stochastic optimization setting, as for example in [20]. Second, we want to test the approach on other case studies, also exploiting knowledge management methodologies (e.g., [21, 22]).
Acknowledgements

This work is partially supported by NSERC (Canada) and the University of Manitoba.

References

[1] M. Goldstein, C. Lampert, M. Reif, A. Stahl, T. Breuel, Bayes optimal DDoS mitigation by adaptive history-based IP filtering, in: Seventh International Conference on Networking (ICN 2008), 2008, pp. 174–179.
[2] H.-X. Tan, W. Seah, Framework for statistical filtering against DDoS attacks in MANETs, 2006, 8 pp. doi:10.1109/ICESS.2005.57.
[3] G. Pack, J. Yoon, E. Collins, C. Estan, On filtering of DDoS attacks based on source address prefixes, 2006, pp. 1–12. doi:10.1109/SECCOMW.2006.359537.
[4] Y. Yang, C.-H. Lung, The role of traffic forecasting in QoS routing - a case study of time-dependent routing, 2005, pp. 224–228, Vol. 1. doi:10.1109/ICC.2005.1494351.
[5] A. Agrawal, H. Casanova, Clustering hosts in P2P and global computing platforms, 2003, pp. 367–373. doi:10.1109/CCGRID.2003.1199389.
[6] A. Cuzzocrea, R. Moussa, G. Xu, OLAP*: Effectively and efficiently supporting parallel OLAP over big data, in: Model and Data Engineering - Third International Conference, MEDI 2013, Amantea, Italy, September 25-27, 2013, Proceedings, 2013, pp. 38–49.
[7] G. Chatzimilioudis, A. Cuzzocrea, D. Gunopulos, N. Mamoulis, A novel distributed framework for optimizing query routing trees in wireless sensor networks via optimal operator placement, J. Comput. Syst. Sci. 79 (2013) 349–368.
[8] A. Cuzzocrea, E. Bertino, Privacy preserving OLAP over distributed XML data: A theoretically-sound secure-multiparty-computation approach, J. Comput. Syst. Sci. 77 (2011) 965–987.
[9] A. Cuzzocrea, V. Russo, Privacy preserving OLAP and OLAP security, in: Encyclopedia of Data Warehousing and Mining, Second Edition (4 Volumes), 2009, pp. 1575–1581.
[10] A. Bonifati, A. Cuzzocrea, Storing and retrieving XPath fragments in structured P2P networks, Data Knowl. Eng. 59 (2006) 247–269.
[11] R. C. Camara, A. Cuzzocrea, G. M. Grasso, C. K. Leung, S. B. Powell, J. Souza, B. Tang, Fuzzy logic-based data analytics on predicting the effect of hurricanes on the stock market, in: 2018 IEEE International Conference on Fuzzy Systems, FUZZ-IEEE 2018, Rio de Janeiro, Brazil, July 8-13, 2018, IEEE, 2018, pp. 1–8.
[12] P. Braun, A. Cuzzocrea, C. K. Leung, A. G. M. Pazdor, S. K. Tanbeer, G. M. Grasso, An innovative framework for supporting frequent pattern mining problems in IoT environments, in: O. Gervasi, B. Murgante, S. Misra, E. N. Stankova, C. M. Torre, A. M. A. C. Rocha, D. Taniar, B. O. Apduhan, E. Tarantino, Y. Ryu (Eds.), Computational Science and Its Applications - ICCSA 2018 - 18th International Conference, Melbourne, VIC, Australia, July 2-5, 2018, Proceedings, Part V, volume 10964 of Lecture Notes in Computer Science, Springer, 2018, pp. 642–657.
[13] A. Cuzzocrea, C. Mastroianni, G. M. Grasso, Private databases on the cloud: Models, issues and research perspectives, in: J. Joshi, G. Karypis, L. Liu, X. Hu, R. Ak, Y. Xia, W. Xu, A. Sato, S. Rachuri, L. H. Ungar, P. S. Yu, R. Govindaraju, T. Suzumura (Eds.), 2016 IEEE International Conference on Big Data, BigData 2016, Washington DC, USA, December 5-8, 2016, IEEE Computer Society, 2016, pp. 3656–3661.
[14] A. Cuzzocrea, Accuracy control in compressed multidimensional data cubes for quality of answer-based OLAP tools, in: 18th International Conference on Scientific and Statistical Database Management, SSDBM 2006, 3-5 July 2006, Vienna, Austria, Proceedings, IEEE Computer Society, 2006, pp. 301–310.
[15] J. Makhoul, Linear prediction: A tutorial review, Proceedings of the IEEE 63 (1975) 561–580.
[16] Z. Peng, C. Changming, Volterra series theory: A state-of-the-art review, Chinese Science Bulletin (Chinese Version) 60 (2015) 1874. doi:10.1360/N972014-01056.
[17] H. Zhao, J. Zhang, Adaptively combined FIR and functional link artificial neural network equalizer for nonlinear communication channel, IEEE Transactions on Neural Networks 20 (2009) 665–674.
[18] V. Cerone, E. Fadda, D. Regruto, A robust optimization approach to kernel-based nonparametric error-in-variables identification in the presence of bounded noise, in: 2017 American Control Conference (ACC), IEEE, 2017. doi:10.23919/acc.2017.7963056.
[19] M. Arlitt, T. Jin, Hewlett-Packard Laboratories, A workload characterization study of the 1998 World Cup Web site.
[20] E. Fadda, G. Perboli, R. Tadei, Customized multi-period stochastic assignment problem for social engagement and opportunistic IoT, Computers & Operations Research 93 (2018) 41–50.
[21] A. Cuzzocrea, Combining multidimensional user models and knowledge representation and management techniques for making web services knowledge-aware, Web Intelligence and Agent Systems 4 (2006) 289–312.
[22] M. Cannataro, A. Cuzzocrea, C. Mastroianni, R. Ortale, A. Pugliese, Modeling adaptive hypermedia with an object-oriented approach and XML, in: M. Levene, A. Poulovassilis (Eds.), Proceedings of the Second International Workshop on Web Dynamics, WebDyn@WWW 2002, Honolulu, HW, USA, May 7, 2002, volume 702 of CEUR Workshop Proceedings, CEUR-WS.org, 2002, pp. 35–44.