=Paper=
{{Paper
|id=Vol-2940/paper14
|storemode=property
|title=A Time-series Classification Approach to Shallow Web Traffic De-anonymization
|pdfUrl=https://ceur-ws.org/Vol-2940/paper14.pdf
|volume=Vol-2940
|authors=Axel De Nardin,Marino Miculan,Claudio Piciarelli,Gian Luca Foresti
|dblpUrl=https://dblp.org/rec/conf/itasec/NardinMPF21
}}
==A Time-series Classification Approach to Shallow Web Traffic De-anonymization==
<pdf width="1500px">https://ceur-ws.org/Vol-2940/paper14.pdf</pdf>
<pre>
A time-series classification approach
to shallow web traffic de-anonymization
Axel De Nardin, Marino Miculan, Claudio Piciarelli and Gian Luca Foresti
Department of Mathematics, Computer Science and Physics, University of Udine


Abstract
Web traffic analysis and classification have been extensively studied, both with classical and deep learning
techniques. Many of these systems analyse the entire packet to perform the classification task. Due to
the increase of encrypted traffic in recent years, this approach has become problematic. Moreover, few
works focus on the classification of the users themselves, also called web traffic de-anonymization. In
the present paper we address this problem by proposing an approach focused on a shallow, temporal
analysis of web traffic data packets. We show that it is possible to identify the users of a network just
by analyzing their navigation patterns and without accessing the content of the TCP packets. Finally,
we compare the performance of our approach with that of a more classical feed forward
neural network architecture to showcase the informational power of temporal data in this context.

Keywords
Temporal Analysis, User De-Anonymization, Shallow packet inspection, Network traffic analysis


1. Introduction
The importance of being able to identify the users accessing the Internet cannot be overstated.
Being able to categorize users at different levels and from different points of view (e.g.
application used, OS, browser, demographic) is important both for commercial uses, where it can
be used to profile the identified users and offer them more personalized services, and for
forensic purposes, where it can be applied to identify individuals performing criminal actions.
   A common approach to this problem, called web traffic de-anonymization, is to look for useful
information (e.g. usernames, email addresses) inside the payloads of IP packets. This well known
technique, called deep packet inspection (DPI), can be very effective [1, 2], but it can be applied
only if the traffic is not encrypted. Nowadays almost all web traffic (especially that carrying
identification data) is encrypted at the transport level by means of SSL and TLS protocols,
and therefore DPI is rarely applicable. Moreover, DPI raises important privacy issues, because
it allows the inspector to access the whole traffic content, not only the data needed for user
identification [3].
   Therefore, the ability to identify the users generating web traffic on a network without looking
at the actual payload, performing only a shallow packet inspection, is gaining importance.
However, while many works have already been proposed regarding the task of classifying
(encrypted) web traffic, there is little research on web traffic de-anonymization through shallow
packet analysis; see e.g. [4]. In particular, to our knowledge there is no previous work focusing
on temporal analysis applied to this specific area.

ITASEC’21: Italian Conference on Cybersecurity, April 07–09, 2021, Online
denardin.axel@spes.uniud.it (A. De Nardin); marino.miculan@uniud.it (M. Miculan);
claudio.piciarelli@uniud.it (C. Piciarelli); gianluca.foresti@uniud.it (G. L. Foresti)
© 2021 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
CEUR Workshop Proceedings (CEUR-WS.org), http://ceur-ws.org, ISSN 1613-0073
Axel De Nardin et al. CEUR Workshop Proceedings 1–10

   In this paper we focus specifically on this type of analysis. To this end, we introduce the use of
recurrent neural network models, specifically LSTMs, in order to gain better insight into
the impact of the temporal component in this scenario. To better understand the actual
performance gain obtained when using time series, instead of single web traffic logs,
as instances of the dataset, we also compare the temporal models with a classic feed forward
neural network of similar size.

Synopsis. The rest of the paper is organized as follows. An overview of related work about
network traffic classification is in section 2. Then, in section 3 we introduce the problem we
focus on. In section 4 we outline the proposed approach, by describing the models used and the
way data collection has been performed. The experimental results are reported in section 5 and,
finally, in section 6 we summarize our work and propose some ideas for future work.


2. Related work
When tackling a classification problem it is important to determine which aspect we aim to
classify. Many different approaches have been investigated for web traffic classification focusing
on different categorization goals: identifying the protocol used [5], application classification
[6, 7, 5], traffic type (e.g. browsing vs. video chat) [8, 9], OS classification [7] and intrusion
detection [10].
   Various network architectures and techniques have also been explored. One widely adopted
idea in recent works is to map the traffic data to a bi-dimensional image and then use
a CNN to extract the spatial information from it, as proposed in [11]. However, it was noted
that when trying to extract features from time series, temporal information loss occurs in the
convolutional and pooling layers. The use of architectures such as C-LSTM [12], combining
both convolutional and recurrent layers, specifically LSTMs, has also been investigated and was
able to achieve state-of-the-art performance for web traffic anomaly detection on the Webscope
S5 dataset by Yahoo. One thing to notice, however, is that many of the presented works do not
rely solely on a shallow inspection of the transmitted packets but also consider their
payload to perform the classification task, following the well-known deep packet inspection approach.
Moreover, only a few of these techniques take into account the temporal aspect of the data alone;
for the most part they rely on datasets which have single logs as their instances. Finally, while
previous works have explored the use of classification techniques for different purposes related
to web traffic analysis, the area of user identification is still under-investigated; see e.g. [4].


3. Problem description
In order to recognize users by means of shallow packet inspection, we adopt the architecture
depicted in Figure 1. The whole network traffic is logged by a sniffer and subsequently filtered
and pre-processed to collect only the data relevant for the system. In order to preserve user
privacy, pre-processing also replaces source IP addresses with unique identifiers ($u_1, u_2, \ldots$).
Hence, although the system internally stores the address/identifier associations (which are needed
to guarantee a coherent labeling through time), the final data are pseudonymised.

Figure 1: Architecture of the web session classifier (from [4]). Diagram labels: U?, border router,
ISP router, web session classifier, packet analyzer, web session log, clustering, features stream,
user profiles, classifier, user classification.

   Following the formalization given in [4], the problem can be defined as:

      given a training set $TS$ for users $u_1, \ldots, u_n$ and a web session log $L$ generated by
      one of these users, is it possible to determine which user has generated $L$?

where the training set is defined as:

      $TS = \{\langle L_1, u_{i_1}\rangle, \ldots, \langle L_k, u_{i_k}\rangle\}$.        (1)

  Here, $i_k$ is the index of the user generating log $L_k$, and each log $L$ consists of a subset of data
extracted from the TCP/IP packet header. In this work, since our main goal is investigating the
role of the temporal aspect in the user classification task, we redefine the training set as:

      $TS = \{\langle S_1, u_{i_1}\rangle, \ldots, \langle S_k, u_{i_k}\rangle\}$         (2)

where $S_i$ now identifies a sequence of web traffic logs instead of a single one. Thus we obtain
the following reformulation of the previous definition:

      given a training set $TS$ for users $u_1, \ldots, u_n$ and a temporal sequence of web session
      logs $S$ generated by one of these users, is it possible to determine which user has
      generated $S$?
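
As an illustration of the redefined training set in Eq. (2), the following minimal sketch groups a time-ordered stream of logs by user and cuts each user's stream into fixed-length sequences. The function name `build_sequences`, the toy data, and the choice of non-overlapping windows are assumptions made for this example; the paper does not state how the sequences were extracted.

```python
from collections import defaultdict


def build_sequences(logs, seq_len):
    """Group logs by user and cut each user's stream into fixed-length sequences.

    `logs` is an iterable of (user_id, features) pairs, assumed sorted by time.
    Returns a list of (sequence, user_id) pairs, i.e. the <S, u> pairs of Eq. (2).
    """
    per_user = defaultdict(list)
    for user_id, features in logs:
        per_user[user_id].append(features)

    training_set = []
    for user_id, stream in per_user.items():
        # Assumption: non-overlapping windows; a trailing remainder shorter
        # than seq_len is dropped.
        for start in range(0, len(stream) - seq_len + 1, seq_len):
            training_set.append((stream[start:start + seq_len], user_id))
    return training_set


# Toy stream: each log carries a single hypothetical feature value.
logs = [("u1", [0.1]), ("u2", [0.5]), ("u1", [0.2]),
        ("u1", [0.3]), ("u2", [0.6]), ("u1", [0.4])]
pairs = build_sequences(logs, seq_len=2)
```

With `seq_len=1` every sequence contains a single log, which corresponds to the original training set of Eq. (1).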




4. Proposed approach
In this section we introduce an approach to the aforementioned problem. We start by
describing the way data is collected and pre-processed, then we give an outline of the models
and techniques used for the identification task.

4.1. Data collection
All the data used for training and testing the adopted models have been retrieved by
shallow-sniffing a large WiFi network during a time window of a couple of hours. This yielded
a dataset of around 6.1 million instances, each represented by a tuple consisting of 6
elements:

    • timestamp
    • source MAC address
    • destination IP address
    • packet length
    • TCP source port
    • TCP destination port

The privacy of the users has been preserved by running a pseudonymization process on the
dataset, which mapped each MAC address to a progressive id which has been kept consistent
across the different instances. No data regarding the content of the packets has been sniffed.
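
The pseudonymization step can be sketched as a lookup table that assigns a progressive integer to each distinct address the first time it is seen. This is only an illustrative sketch: the helper name `make_pseudonymizer` and the example MAC addresses are hypothetical, and the actual tooling used for the paper is not described.

```python
def make_pseudonymizer():
    """Return a function mapping each distinct value to a progressive integer id."""
    table = {}  # value -> id; kept internally so ids stay consistent over time

    def pseudonymize(value):
        if value not in table:
            table[value] = len(table) + 1  # ids start at 1: u1, u2, ...
        return table[value]

    return pseudonymize


mac_to_id = make_pseudonymizer()
macs = ["aa:bb:cc:00:00:01", "aa:bb:cc:00:00:02", "aa:bb:cc:00:00:01"]
ids = [mac_to_id(mac) for mac in macs]  # the repeated address gets the same id
```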

4.2. Data preprocessing
Three major pre-processing operations have been performed on the collected data. The first one,
as already mentioned, is the mapping of the MAC address of each detected user to
a progressive and consistent numerical ID, which is then used as the label of the corresponding
instance. The same process has been applied to destination IP addresses. The second one
is performed on the timestamps of the connections. Each timestamp has
been replaced with a value ∆𝑡, defined as the difference between the time at which
the current log was recorded and the time of the previous log of the same user. The
idea justifying this transformation is that, when analyzing the behavior of a user, we are more
interested in the interval between their actions than in the absolute time at which
they have been performed. Finally, the last pre-processing step consisted in the standardization
of the feature values in order to speed up the convergence of the models during training.
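
The timestamp transformation and the standardization step can be sketched as follows. The function names are hypothetical, the convention that a user's first log gets ∆𝑡 = 0 is an assumption (the paper does not say how the first log is handled), and z-score scaling is assumed as the meaning of "standardization".

```python
from statistics import mean, pstdev


def add_delta_t(records):
    """Replace absolute timestamps with the gap from the same user's previous log.

    `records` is a list of (timestamp, user_id) pairs sorted by timestamp.
    Assumption: the first log of each user gets delta_t = 0.0.
    """
    last_seen = {}
    deltas = []
    for timestamp, user_id in records:
        deltas.append(timestamp - last_seen.get(user_id, timestamp))
        last_seen[user_id] = timestamp
    return deltas


def standardize(values):
    """Rescale values to zero mean and unit variance (z-score)."""
    mu, sigma = mean(values), pstdev(values)
    if sigma == 0:
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]


records = [(10.0, 1), (10.5, 2), (12.0, 1), (12.5, 2), (13.0, 1)]
deltas = add_delta_t(records)   # gaps are computed per user, not globally
scaled = standardize(deltas)
```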

4.3. Data selection
Through a preliminary analysis of the data distribution we noticed a significant imbalance in
the number of instances belonging to each user, with a single class completely dominating the
dataset, as shown in Fig. 2. For this reason we decided to use two different dataset settings
for all the experiments. In the first setting we used the whole dataset, consisting of
6.1M instances distributed over 151 classes, as already presented. In the second one we
filtered out the classes that contributed the most to the imbalance in the distribution: in
particular, we removed the classes consisting of more than 150k instances and those represented
by fewer than 5k instances (Fig. 3). The resulting dataset consisted of 2M instances distributed
over 90 classes. This allowed us to analyze the performance of the proposed methods on problems
of different scales and also to understand how class imbalance affected them.

Figure 2: Instances distribution over the classes of the full dataset

   Another issue concerns the feature describing the destination IP address. While this piece of data
seems to be very valuable in terms of information provided to the classifier, its categorical
nature makes it more challenging to represent in a meaningful and consistent way, and
to deal with situations, likely to occur in a real world scenario, in which the training
and test sets do not necessarily contain instances with the same subset of IP addresses. For this
reason, in the present work we decided to analyze the performance of the selected models in
two different settings for both the previously presented datasets: one in which the IP address is
used as an input feature (by simply mapping it to a consistent numerical ID) and one in which it
is discarded. This choice allowed us to understand how much impact this particular feature has
on the classification capabilities of the models and whether it is possible to obtain good performance
even without including it.
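
The class filtering used to build the reduced dataset can be sketched as a count-based selection. The function name and the toy labels are hypothetical, and the thresholds are scaled down for the example; the paper's thresholds are 5k and 150k instances per class.

```python
from collections import Counter


def filter_classes(labels, min_count, max_count):
    """Return the set of classes whose instance count lies in [min_count, max_count]."""
    counts = Counter(labels)
    return {cls for cls, n in counts.items() if min_count <= n <= max_count}


# Toy labels: class 1 dominates, class 3 is under-represented.
labels = [1] * 50 + [2] * 8 + [3] * 2 + [4] * 10
kept = filter_classes(labels, min_count=5, max_count=20)
reduced = [y for y in labels if y in kept]
```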

4.4. Models
All the collected data, organized in the settings described in the previous section, has been
used to train a set of Deep Learning classifiers. Two different architectures have been used. The
first one is a 3-layer feed forward network, which was used as a baseline for the
performance that can be obtained without taking into consideration the temporal aspect of the data
collected. The second architecture, which is the main focus of the present work, is a recurrent
neural network composed of 2 stacked LSTM [13] layers, with a 50% dropout layer in between
them, and ending with a fully connected layer used to perform the actual classification task. A
diagram of this network is shown in Fig. 4.

Figure 3: Instances distribution over the classes of the filtered dataset

Figure 4: Diagram of the Recurrent Neural network used in the experiments
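
A minimal PyTorch sketch of the recurrent architecture described in this section is given below. The paper does not name the framework, the hidden size, or how the LSTM output is passed to the final layer, so `hidden_size=128`, the class name `UserLSTM`, and the use of the last time step are assumptions made only for illustration.

```python
import torch
from torch import nn


class UserLSTM(nn.Module):
    """Two stacked LSTM layers with dropout in between, then a linear classifier."""

    def __init__(self, n_features, n_classes, hidden_size=128):
        super().__init__()
        # num_layers=2 stacks two LSTM layers; `dropout` is applied to the
        # output of every layer except the last, i.e. between the two layers.
        self.lstm = nn.LSTM(n_features, hidden_size, num_layers=2,
                            dropout=0.5, batch_first=True)
        self.fc = nn.Linear(hidden_size, n_classes)

    def forward(self, x):
        # x has shape (batch, sequence_length, n_features)
        output, _ = self.lstm(x)
        # Assumption: classify from the hidden state at the last time step.
        return self.fc(output[:, -1, :])


# 5 input features (delta t, destination IP id, packet length, two TCP ports),
# 90 classes as in the reduced dataset, sequences of length 10.
model = UserLSTM(n_features=5, n_classes=90)
logits = model(torch.randn(4, 10, 5))
```

In the settings without the destination IP address the input would have 4 features instead of 5.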




Table 1
Dataset settings used in the different training scenarios
Setting           # instances   # classes   Destination IP used
full_nodest          6.1M          151             no
full                 6.1M          151             yes
reduced_nodest        2M            90             no
reduced               2M            90             yes

Table 2
Model hyperparameters
Model      Learning rate   Batch size   # epochs   Sequence length
Feed_FW        1e-4           1024         50             1
LSTM_5         1e-4           1024         50             5
LSTM_10        1e-4           1024         50            10
LSTM_20        1e-4           1024         50            20


5. Experimental results
All the models have been trained on the 4 different dataset settings which have been described
in the previous section and are summarized in Table 1.
   A summary of the hyper-parameters used during the training process is reported in Table 2.
While the main focus of the present work was not to find the optimal values for these parameters,
we investigated different choices to make sure that they led to reasonable results while at the
same time taking into consideration the time needed to train the models.
   We tried to keep the hyper-parameters consistent across the different models
analyzed in order to make a more meaningful comparison between them. The only
exception is the sequence length, which represents the temporal aspect we wanted to
investigate with this work and is therefore changed across the different instances of the model
to gain a better understanding of how it affects their performance. In all the scenarios a
60/40 split of the datasets was performed, using 60% of the data to train the model and the
remaining 40% for testing. Four metrics were used to evaluate the different models:

    • Accuracy
    • Recall
    • Precision
    • F-Score

The results shown for these metrics (Table 3), with the exception of the accuracy, have been
calculated by macro-averaging the values obtained for the single classes.
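
Macro-averaging computes each metric per class and then takes the unweighted mean over the classes, so every class counts equally regardless of its size. The sketch below is a plain-Python illustration; the function name `macro_scores` and the convention of scoring a class as 0 when a denominator is zero are assumptions, not details taken from the paper.

```python
def macro_scores(y_true, y_pred):
    """Return (accuracy, macro precision, macro recall, macro F-score)."""
    classes = sorted(set(y_true))
    pairs = list(zip(y_true, y_pred))
    precisions, recalls, f_scores = [], [], []
    for c in classes:
        tp = sum(t == c and p == c for t, p in pairs)
        fp = sum(t != c and p == c for t, p in pairs)
        fn = sum(t == c and p != c for t, p in pairs)
        # Assumption: a metric with a zero denominator is counted as 0.
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f_score = (2 * precision * recall / (precision + recall)
                   if precision + recall else 0.0)
        precisions.append(precision)
        recalls.append(recall)
        f_scores.append(f_score)
    n = len(classes)
    accuracy = sum(t == p for t, p in pairs) / len(pairs)
    return accuracy, sum(precisions) / n, sum(recalls) / n, sum(f_scores) / n


# A degenerate classifier that always predicts the dominant class 0.
y_true = [0] * 8 + [1, 2]
y_pred = [0] * 10
acc, prec, rec, f1 = macro_scores(y_true, y_pred)
```

In this toy case the accuracy is 0.8 while the macro-averaged recall is only 1/3, which illustrates how accuracy and macro-averaged metrics can diverge on an imbalanced dataset.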

Table 3
Classification results.
Classifier   Dataset          Accuracy (%)   Precision (%)   Recall (%)   F-Score (%)
Feed_FW      Reduced_nodest      49.15           30.64          27.66        26.00
Feed_FW      Reduced             63.54           49.20          52.25        46.60
Feed_FW      Full_nodest         77.18           14.82          16.95        12.80
Feed_FW      Full                84.26           29.92          31.93        32.90
LSTM_5       Reduced_nodest      78.82           71.91          73.74        71.80
LSTM_5       Reduced             90.77           87.04          88.01        87.20
LSTM_5       Full_nodest         86.63           38.55          38.55        38.00
LSTM_5       Full                95.35           67.00          75.55        68.20
LSTM_10      Reduced_nodest      86.51           82.00          82.96        82.10
LSTM_10      Reduced             95.21           93.29          93.65        93.40
LSTM_10      Full_nodest         90.39           49.52          49.53        49.80
LSTM_10      Full                97.98           82.04          88.51        83.90
LSTM_20      Reduced_nodest      89.00           84.88          85.68        85.00
LSTM_20      Reduced             97.00           95.84          95.98        95.90
LSTM_20      Full_nodest         94.98           66.75          72.40        67.00
LSTM_20      Full                98.85           87.95          92.34        89.10

   As can be seen, the classic feed forward network, trained on the single instances of
the datasets, achieves a much worse performance than every recurrent neural network, with
differences of up to 50% between the values of some of the metrics when using the smaller, more
balanced, dataset. This seems to imply that the temporal aspect of the data actually has an
impact on the ability of the models to recognize the different users. This idea is also supported by
the fact that the models' performance is positively correlated with the length of the sequences in
which the data is organized, reaching its peak when using the largest value for this parameter
(Fig. 5).
   Another interesting thing to notice is that while the accuracy of all the models increases
when using the full dataset, compared to when the reduced one is used, the values of the other
three metrics seem to greatly deteriorate instead (for example, LSTM_20 without the destination IP
reaches 94.98% accuracy but only a 67.00% F-Score on the full dataset, against 89.00% and 85.00% on
the reduced one). This suggests that the models are heavily
influenced by the imbalances between the number of instances belonging to the different
classes of the full dataset and tend to specialize towards the more populated ones,
neglecting the others. This phenomenon seems to be partially mitigated by the introduction of
the destination IP address as one of the features during training, which leads to an improvement
of the performance over all the metrics adopted and also reduces the gap between the accuracy
and the other metrics' values, especially when the full dataset is used.


6. Conclusions
In this paper we investigated the impact of introducing a temporal component in the context of
web traffic classification. More specifically, we tried to determine whether analyzing sequences instead
of isolated web logs makes it possible to obtain more accurate predictions when performing a shallow
packet inspection with the goal of de-anonymizing users. The use of a shallow inspection is
motivated by an ever increasing adoption of encrypted connections on the technological side and
by privacy concerns on the human and legal one.

Figure 5: Results obtained for the different metrics when using different values of sequence length and
training the model on the full dataset

   The obtained results clearly point out that the temporal aspect has a great impact on
the performance of web traffic classification models, also mitigating, to a certain extent,
the negative effects of a highly unbalanced distribution of the instances of the dataset. Further
investigation is nevertheless possible, since only a narrow category of network architectures
has been considered, leaving out solutions that have been proven to be effective for other
applications related to web traffic analysis, such as the combination of LSTMs and CNNs.
Another aspect we plan to address in future work is the heterogeneity
of the data. While the dataset we used for the experiments was reasonably large, we suspect
it may not be fully representative of a real world scenario, as all the logs were collected during
a single session spanning just a few hours; there is therefore a high possibility that the
retrieved data is highly redundant. Finally, we plan to improve the way categorical features,
such as the IP addresses, are represented, since we believe that providing a more semantically
meaningful embedding could further improve the performance of the models.


References
 [1] S. Kumar, J. Turner, J. Williams, Advanced algorithms for fast and scalable deep packet
     inspection, in: Architecture for Networking and Communications systems, 2006. ANCS
     2006. ACM/IEEE Symposium on, IEEE, 2006, pp. 81–92.
 [2] C. Parsons, Deep Packet Inspection in Perspective: Tracing its lineage and surveillance
     potentials, Surveillance Studies Centre, Queen’s University, 2008.
 [3] A. Daly, The legality of deep packet inspection, International Journal of Communications
     Law & Policy (2011). doi:10.2139/ssrn.1628024.




 [4] M. Miculan, G. L. Foresti, C. Piciarelli, Towards user recognition by shallow web traffic
     inspection, in: P. Degano, R. Zunino (Eds.), Proceedings of the Third Italian Conference
     on Cyber Security, Pisa, Italy, February 13-15, 2019, volume 2315 of CEUR Workshop
     Proceedings, CEUR-WS.org, 2019.
 [5] Z. Chen, K. He, J. Li, Y. Geng, Seq2img: A sequence-to-image based approach towards
     ip traffic classification using convolutional neural networks, in: 2017 IEEE International
     Conference on Big Data (Big Data), 2017, pp. 1271–1276. doi:10.1109/BigData.2017.8258054.
 [6] G. Aceto, D. Ciuonzo, A. Montieri, A. Pescapé, Mobile encrypted traffic classification
     using deep learning: Experimental evaluation, lessons learned, and challenges, IEEE
     Transactions on Network and Service Management 16 (2019) 445–458.
     doi:10.1109/TNSM.2019.2899085.
 [7] S. Rezaei, X. Liu, How to achieve high classification accuracy with just a few labels: A
     semi-supervised approach using sampled packets, 2020. arXiv:1812.09761.
 [8] W. Wang, M. Zhu, J. Wang, X. Zeng, Z. Yang, End-to-end encrypted traffic classification
     with one-dimensional convolution neural networks, 2017, pp. 43–48.
     doi:10.1109/ISI.2017.8004872.
 [9] M. Lotfollahi, R. Shirali hossein zade, M. Jafari Siavoshani, M. Saberian, Deep packet: A
     novel approach for encrypted traffic classification using deep learning, Soft Computing 24
     (2020). doi:10.1007/s00500-019-04030-2.
[10] W. Wang, Y. Sheng, J. Wang, X. Zeng, X. Ye, Y. Huang, M. Zhu, Hast-ids: Learning
     hierarchical spatial-temporal features using deep neural networks to improve intrusion
     detection, IEEE Access 6 (2018) 1792–1806. doi:10.1109/ACCESS.2017.2780250.
[11] W. Wang, M. Zhu, J. Wang, X. Zeng, Z. Yang, End-to-end encrypted traffic classification
     with one-dimensional convolution neural networks, in: 2017 IEEE International Conference
     on Intelligence and Security Informatics (ISI), 2017, pp. 43–48. doi:10.1109/ISI.2017.8004872.
[12] T.-Y. Kim, S. Cho, Web traffic anomaly detection using c-lstm neural networks, Expert
     Systems with Applications 106 (2018). doi:10.1016/j.eswa.2018.04.004.
[13] S. Hochreiter, J. Schmidhuber, Long short-term memory, Neural Comput. 9 (1997)
     1735–1780. doi:10.1162/neco.1997.9.8.1735.



</pre>