<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
<article-title>Towards User Recognition by Shallow Web Traffic Inspection</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Marino Miculan</string-name>
          <email>marino.miculan@uniud.it</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Gian Luca Foresti</string-name>
          <email>gianluca.foresti@uniud.it</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Claudio Piciarelli</string-name>
          <email>claudio.piciarelli@uniud.it</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Department of Mathematics</institution>
          ,
          <addr-line>Computer Science and Physics</addr-line>
          ,
          <institution>University of Udine</institution>
        </aff>
      </contrib-group>
      <abstract>
<p>We consider the problem of web user recognition, or web traffic de-anonymization: given a set of users, is it possible to identify which user has generated a given web traffic trace with only a shallow packet inspection (that is, without looking inside the packet payloads)? We propose to address this problem by means of machine learning (ML) techniques, in particular clustering and supervised classification. The basic idea is that each user can be identified by their browsing habits, and these habits can be described by a suitable set of features: click frequency, permanence time on web pages, amount of downloaded data, etc. In this paper we introduce these features, and show how they can be derived from the data obtained only from packet headers and their arrival times. Finally, we show the effectiveness of this approach with some preliminary tests and experiments.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
<p>Nowadays, it is important to be able to identify the users accessing the
Internet, both for commercial and forensic purposes. On one side, Internet Service
Providers and Content Providers can take advantage of this analysis in order to
personalize and improve their services; on the other, it is important to identify
users in order to ascertain the responsibility for harmful or even criminal actions.</p>
<p>In fact, ISPs are already required to keep access log files, whose acquisition
by the authorities is regulated by directives such as 2006/24/EC, both for
cybercrimes against persons (such as online fraud and identity theft) and for
attacks against a server or a network. However, these access logs record only
limited information, such as connection times, transfer amounts, and visited web
site addresses, which in general are not sufficient to trace the specific author of an
offense, but only the holder of the network connection contract. This is the case
of networks which access the Internet via one or a few public IP addresses by means
of the well-known network address translation technique, as in the case of
home networks, small enterprises, commercial premises offering WiFi connection
to their customers, etc. Another scenario is that of workstations accessible by
many users in public or semi-public environments, like college laboratories, hotel
lobbies, internet cafes, etc. In these cases, we cannot associate the traffic from a
given IP address to a specific user, because a single public IP address is shared
among several users, possibly at the same time: family members and their visiting
friends, employees and their collaborators, students, hotel guests, customers, etc.
(This work was partially supported by UniUd PRID 2017 ENCASE and by the Italy-Singapore
bilateral technology cooperation project PRESNET.)</p>
      <p>
        One could argue that more useful information can be obtained by looking for
relevant data (e.g. usernames, email addresses) inside the payloads of IP
packets. This technique, called deep packet inspection (DPI), is well-known and can
be very effective [
        <xref ref-type="bibr" rid="ref5 ref8">5, 8</xref>
        ], but it can be applied only if the traffic is not encrypted.
Nowadays most web traffic (especially that carrying identification data) is
encrypted at the transport level by means of the SSL and TLS protocols; indeed, an
increasing number of websites is adopting encryption protocols, and therefore
DPI will be less and less applicable. Moreover, DPI raises important privacy
issues, because it allows the inspector to access the whole traffic content, not
only the data needed for user identification [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ].
      </p>
      <p>
        Therefore, the problem is: given a set of users, is it possible to identify which
user has generated a given web traffic trace by means of shallow packet inspection?
By "shallow" we mean that the only web traffic data we are allowed to consider
are those an ISP can normally access while providing its service: the network (IP)
header, the transport (TCP/UDP) header, the sizes and frequency of packets,
etc., but not the content of the TCP payload. Shallow packet inspection is a
technique that the research community has investigated only very recently,
not for de-anonymization but for traffic classification [
        <xref ref-type="bibr" rid="ref11 ref12 ref6">6, 11, 12</xref>
        ].
      </p>
      <p>In this paper, we propose to address this problem by means of machine
learning (ML) techniques, in particular clustering and supervised classification. The
basic idea is that each user can be identified by their browsing habits, and that
these habits can be described by a suitable set of features, such as traffic size,
click frequency, permanence time on web pages, hour of the day of the activity,
etc. In this paper we introduce some of these features, and show how they can
be derived from the data obtained by shallow packet inspection. To this end, we
first introduce the notion of click as the basic event to consider in the analysis.
A "click" represents the voluntary action performed by the user when clicking
on a hyperlink/button on a web page. Therefore, a click subsumes the whole
traffic (that is, all the HTTP(S) requests and replies) caused by this action.
Click data are obtained by partitioning the packet flow to/from the observed IP
address, using clustering techniques. Then, on the flow of clicks we identify
suitable features for the classification. Some features are intuitive (e.g. counterpart
IP address, traffic size), but others are less obvious, such as the time of the day
when the click is performed, or the "dwell time" on a web page. These data are
related to the browsing habits of each user, and hence can be used as the
basis for the supervised classification algorithms. We consider different algorithms;
each of them is trained and tested with the same sets of click flows. The results
are encouraging: some algorithms yield high classification precision.</p>
      <p>The rest of the paper is organized as follows. In Section 2 we present the
problem under consideration in detail. The solution we propose is described in Section
3, and experimental results are reported in Section 4. Finally, in Section 5 we
draw some conclusions and outline future work.</p>
      <p>[Figure 1: The reference scenario. Users U1-U4 in a private network reach the Internet through a border router and the ISP router, where a packet analyzer gives an observer access to the traffic.]</p>
    </sec>
    <sec id="sec-2">
      <title>Problem description and analysis</title>
      <p>
        Let us consider the scenario shown in Figure 1. In this scenario, there is a private
network where several users u1, ..., un can browse the Web using the same client
computer. During a session, the client computer is used by only one user. The
browsing activity of the user generates a sequence of HTTP(S) requests towards
various WWW servers on the Internet, and corresponding replies. In turn, these
requests yield a flow of IP packets from the client to the web servers and back.
These flows go through the border router and the ISP router(s), where they can be
examined by an observer. Very often, the private network uses a non-routable
address space (such as 10.0.0.0/8 or 192.168.0.0/16). In this case the border
router performs a NAT/PAT translation [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ]: in each outgoing packet the source
address and port are replaced with the public IP address assigned to the border
router by the ISP, and dually for the incoming packets. The net effect is that,
from the public network viewpoint, the whole private network appears as a single
host with the public IP address.
      </p>
      <p>Using a packet analyzer, the observer can gather the following data for each
packet:
- Arrival time at the router;
- Total length of the packet;
- Source IP address and port number: for packets coming from the border
router, the source IP address is the public IP assigned to the local router by
the ISP, and the port is a dynamic port on that router. For packets going to
the border router, the source IP address and port number are those of the
contacted web server;
- Destination IP address and port number: for packets coming from the local
router, the destination IP address and port number are those of the contacted
web server; for packets going to the local router, the destination IP address
is the public IP assigned to the local router by the ISP, and the port is a
dynamic port on that router.</p>
      <p>Other data in the headers can be ignored, because they are either not relevant (such as
TOS, or fragmentation details) or not informative (the version of IP is almost
always 4, the transport-level protocol is always TCP, etc.). It is important to
notice that these data cannot be encrypted, otherwise the routers would not
be able to route the packet. Following the shallow packet inspection policy, we
do not analyze the content of TCP segments; very likely these segments carry
SSL/TLS-encrypted payloads, which cannot be analyzed further.</p>
      <p>Therefore, for each web session we can obtain a file, called a web session log,
of tuples of the following form:
⟨arrival time, packet length, source IP, source port, destination IP, destination port⟩.</p>
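      <p>As an illustration, such a tuple can be held in a small record type; this is only a sketch, with field and function names of our own choosing (the paper does not prescribe a storage format):</p>

```python
from collections import namedtuple

# Hypothetical record for one sniffed packet, mirroring the tuple
# <arrival time, packet length, source IP, source port, destination IP, destination port>.
Packet = namedtuple(
    "Packet",
    ["arrival_time", "length", "src_ip", "src_port", "dst_ip", "dst_port"],
)

def parse_log_line(line):
    """Parse one comma-separated log line into a Packet (illustrative format)."""
    t, length, sip, sport, dip, dport = line.strip().split(",")
    return Packet(float(t), int(length), sip, int(sport), dip, int(dport))

pkt = parse_log_line("0.125,1500,203.0.113.7,49152,93.184.216.34,443")
print(pkt.dst_port)  # 443
```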
      <p>Since we intend to apply supervised classification algorithms, we assume
to be able to obtain a suitable training set in order to "learn" the browsing
habits of each user. A training set is a set TS = {⟨L_1, u_{i_1}⟩, ..., ⟨L_k, u_{i_k}⟩} of web
session logs associated to the corresponding users. Here, i_k is the index of the
user generating log L_k. This set can be built by observing the traffic when we
know the actual identity of the user browsing at that moment.</p>
      <p>Then, the classification problem can be formulated as follows:
given a training set TS for users u_1, ..., u_n and a web session log L
generated by one of these users, is it possible to determine which user
has generated L?
The criteria for evaluating a classifier are the usual ones:
- Accuracy: the percentage of web session logs correctly classified;
- Recall: the percentage of web session logs correctly assigned to a user, with
respect to all logs generated by that user;
- Precision: the percentage of web session logs correctly assigned to a user, with
respect to all logs assigned to that user;
- F-measure: the harmonic average of recall and precision.</p>
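      <p>These per-user criteria follow directly from the counts of correctly and incorrectly assigned logs; a minimal sketch in pure Python (helper names are our own):</p>

```python
def per_user_metrics(true_labels, predicted_labels, user):
    """Precision, recall and F-measure for one user, as defined above."""
    tp = sum(1 for t, p in zip(true_labels, predicted_labels) if t == user and p == user)
    assigned = sum(1 for p in predicted_labels if p == user)   # logs assigned to the user
    generated = sum(1 for t in true_labels if t == user)       # logs generated by the user
    precision = tp / assigned if assigned else 0.0
    recall = tp / generated if generated else 0.0
    f = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f

true_y = ["u1", "u1", "u2", "u2", "u2"]
pred_y = ["u1", "u2", "u2", "u2", "u1"]
print(per_user_metrics(true_y, pred_y, "u2"))  # (0.666..., 0.666..., 0.666...)
```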
      <p>In the next section we propose a machine learning-based solution to this problem.</p>
    </sec>
    <sec id="sec-3">
      <title>Solution proposal</title>
      <p>In order to recognize users by means of shallow packet inspection, we propose
the architecture depicted in Figure 2. When building the training set, we
assume that users can be identified by the IP address of their workstation: thus
no user can access more than one workstation, and no workstation can be
used by more than one user. Moreover, we assume that no NAT policy is
implemented. These requirements are only needed to support the training phase of a
supervised classifier, where each data sample must be labeled with the correct
classification; user identification is not needed in the classification phase, except for
performance measurement. The whole network traffic is logged by a sniffer and
subsequently filtered and pre-processed to collect only the data relevant for the
system. Pre-processing also includes a clustering step, in which data associated to
the same user action are grouped to identify meaningful high-level data that are
fed to the classifier for training or evaluation. In order to preserve user privacy,
pre-processing also replaces source IP addresses with unique identifiers (User1,
User2, etc.). Hence, although the system internally stores the address/identifier
associations (which are needed to guarantee a coherent labeling through time),
the final data are pseudonymized and can be safely shared.</p>
      <p>[Figure 2: The web session classifier architecture. A packet analyzer produces the web session log; clustering turns it into a features stream, and the classifier, together with the user profiles, outputs the user classification.]</p>
      <sec id="sec-3-1">
        <title>Feature extraction and filtering</title>
        <p>The sniffer acquires and logs all the network traffic coming to/from the network;
it is thus placed either in the local network itself (where it can access the data by
enabling the promiscuous mode of the network card) or in a bottleneck device
such as a network router. In the case of small networks, it can be implemented using
publicly available software, such as WireShark. However, in most cases a full
log of network traffic quickly leads to extremely large log files, which are both
impractical to store and pose privacy issues, since they may contain data outside
the scope of the proposed system. We thus adopt filtering rules on the sniffer, in
order to log only relevant traffic. In particular, we assume that most user actions
generate TCP/IP traffic, thus any other packet type is silently ignored. This
includes both user-generated data, such as UDP/IP packets, and network
management data, such as ARP, SNMP, RIP packets, etc. Moreover, in this work
we choose to focus only on a specific type of data, namely web traffic, as the Web
is a popular service that may contain several user-distinctive features to ease the
classification task. The sniffer is thus configured to acquire only TCP traffic to or
from port 80 (HTTP) or 443 (HTTPS). The encryption of HTTPS connections
is not an issue, since the proposed system performs a shallow inspection, without
analyzing the packet payload. Finally, to further reduce the amount of data to
be stored, we extract only the relevant features from each packet. As mentioned in
Section 2, the final collected raw data are:
- packet arrival time;
- client IP address and TCP port;
- server IP address and TCP port;
- packet length.</p>
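        <p>The filtering rule above (TCP to or from port 80/443 only) can be sketched over already-captured records; this is an illustration with our own record layout, not the sniffer's actual configuration. On a real capture tool the equivalent BPF filter would be "tcp port 80 or tcp port 443".</p>

```python
WEB_PORTS = {80, 443}  # HTTP, HTTPS

def keep_packet(proto, src_port, dst_port):
    """Filtering rule from the text: keep only TCP traffic to/from port 80 or 443."""
    return proto == "TCP" and (src_port in WEB_PORTS or dst_port in WEB_PORTS)

captured = [
    ("TCP", 49152, 443),  # HTTPS request: kept
    ("UDP", 5353, 5353),  # e.g. mDNS, network management: silently ignored
    ("TCP", 49153, 22),   # SSH: TCP but outside the web ports, ignored
]
web_only = [p for p in captured if keep_packet(*p)]
print(len(web_only))  # 1
```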
        <p>The client IP address is then pseudonymized as described at the beginning of
this section.</p>
      </sec>
      <sec id="sec-3-2">
        <title>Feature pre-processing</title>
        <p>Since the goal of the proposed system is to identify users by their web navigation
behaviors, it is important that data can be associated to voluntary user actions.
This task is not trivial, since the network traffic generated by modern web navigation
can be loosely connected to user actions. Let us consider for example the basic
action of clicking on a hyperlink. The corresponding network traffic delivers the
required web page, but new, parallel connections also deliver other page contents
(e.g. images, cookies, etc.). Moreover, new connections can be established to
other web servers, such as advertisement-delivery services, profiling services, etc.</p>
        <p>Thus, in order to work at a higher abstraction level, we introduce the notion
of click, defined as the set of all network traffic generated by a user action, such
as clicking on a hyperlink. Clicks can be extracted by temporally clustering the
data packets: groups of data packets with close arrival times are considered part
of the same click, even if originating from different servers.</p>
        <p>
          Since the number of expected clicks is unknown a priori, no algorithm
requiring initial knowledge of the number of clusters is suitable, thus
excluding popular clustering techniques such as k-means or Gaussian Mixture Models.
Moreover, we require hard clustering (cluster membership is a binary choice)
and explicit outlier modeling, since not all the network traffic may belong to
a click. These considerations motivate the choice of DBSCAN [
          <xref ref-type="bibr" rid="ref3">3</xref>
          ] as the clustering
algorithm. DBSCAN uses a density-based approach, where clusters are defined
as groups of high-density samples. Formally, given a set of samples P, we give
the following definitions:
- p ∈ P is a core point if at least m points q_1, ..., q_m ∈ P exist such that
‖p − q_i‖ ≤ ε for all i ∈ [1...m], where m and ε are the algorithm parameters;
- q ∈ P is directly reachable from p ∈ P if ‖p − q‖ ≤ ε and p is a core point;
- q ∈ P is density-reachable from p ∈ P if there exists a path p_1, ..., p_n such
that p_1 = p, p_n = q and p_{i+1} is directly reachable from p_i for all i ∈ [1...n−1].
Given a core point p, its cluster is then defined as the set of all the points that
are density-reachable from p. Points that are not density-reachable from any other
point are marked as outliers. Figure 3 shows the effect of DBSCAN applied to
the arrival times of a set of packets in order to identify clicks and outliers.
        </p>
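        <p>Since arrival times are one-dimensional, the definitions above can be illustrated with a minimal DBSCAN over scalar timestamps. This is our own simplified sketch with toy parameters, not the implementation used in the experiments; a real system would use an off-the-shelf library.</p>

```python
def dbscan_1d(times, eps, min_pts):
    """Minimal DBSCAN over scalar arrival times.
    Returns one label per point: a cluster id (0, 1, ...) or -1 for outliers."""
    n = len(times)
    labels = [None] * n

    def neighbors(i):
        # eps-neighborhood (includes the point itself), i.e. ||p - q|| <= eps
        return [j for j in range(n) if abs(times[j] - times[i]) <= eps]

    cluster = 0
    for i in range(n):
        if labels[i] is not None:
            continue
        nb = neighbors(i)
        if len(nb) < min_pts:       # not a core point: tentative outlier
            labels[i] = -1
            continue
        labels[i] = cluster         # new cluster seeded at core point i
        seeds = list(nb)
        while seeds:                # expand via density-reachability
            j = seeds.pop()
            if labels[j] == -1:
                labels[j] = cluster # former outlier becomes a border point
            if labels[j] is not None:
                continue
            labels[j] = cluster
            nb_j = neighbors(j)
            if len(nb_j) >= min_pts:  # j is itself a core point: keep expanding
                seeds.extend(nb_j)
        cluster += 1
    return labels

# Two bursts of packets (two "clicks") and one isolated packet (outlier).
times = [0.1, 0.2, 0.3, 5.0, 5.1, 5.2, 12.0]
print(dbscan_1d(times, eps=0.5, min_pts=2))  # [0, 0, 0, 1, 1, 1, -1]
```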
        <p>For each click, we compute the following features:</p>
        <p>[Figure 3: DBSCAN applied to the arrival times of a set of packets (time axis 0-20 s), identifying clicks and outliers.]</p>
        <p>Furthermore, by analyzing all the clicks originated from the same user, we
define a session log as a set of statistics about the acquired clicks. A session log
is thus a data sample containing:
- user ID;
- main site;
- session start time, defined as the timestamp of the first click;
- session end time, defined as the timestamp of the last click;
- session duration, defined as the difference between session end time and
start time;
- total number of clicks in the session;
- average inter-click time, defined as session duration / total number of clicks;
- average click data length, defined as total amount of data / total number of
clicks;
- average number of secondary sites.</p>
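        <p>The statistics above can be computed from a list of clicks in a few lines; this is a sketch with a hypothetical click representation (timestamp, data bytes, number of secondary sites) and field names of our own:</p>

```python
def session_log(user_id, main_site, clicks):
    """Build the session-log statistics listed above.
    Each click is (timestamp_s, data_bytes, n_secondary_sites); illustrative layout."""
    times = [c[0] for c in clicks]
    start, end = min(times), max(times)
    duration = end - start
    n = len(clicks)
    return {
        "user_id": user_id,
        "main_site": main_site,
        "start": start,
        "end": end,
        "duration": duration,                                  # end - start
        "n_clicks": n,
        "avg_inter_click_time": duration / n,                  # duration / #clicks
        "avg_click_data_length": sum(c[1] for c in clicks) / n,  # total data / #clicks
        "avg_secondary_sites": sum(c[2] for c in clicks) / n,
    }

s = session_log("User1", "example.org", [(0.0, 4000, 2), (3.0, 6000, 4), (9.0, 2000, 0)])
print(s["duration"], s["avg_inter_click_time"])  # 9.0 3.0
```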
        <p>As a final pre-processing step, all the numerical data are standardized, since
this is required by many machine learning algorithms. Standardization is achieved
by mean removal and variance scaling: given a set of feature values {f_1, ..., f_n},
each feature value f_i is replaced by its standardized version f̂_i defined as
f̂_i = (f_i − f̄) / σ,
where f̄ = (1/n) Σ_{j=1}^{n} f_j and σ is the standard deviation of the feature values.
Feature means and standard deviations are computed on the training set only.
Test sets are standardized using the same values.</p>
        <p>The acquired session data are used to train a machine learning classifier. As
features, we consider all the numerical data of a session. The main site, despite
being a strong hint for user identification, is currently discarded, since its
categorical, rather than numerical, nature poses extra processing difficulties that
will be addressed in future work. The user IDs of each session are used as
sample labels.</p>
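        <p>The standardization step, with statistics fitted on the training set only and reused on the test set, can be sketched as follows (function names are our own):</p>

```python
import math

def fit_standardizer(train_values):
    """Mean and (population) standard deviation, computed on the training set only."""
    n = len(train_values)
    mean = sum(train_values) / n
    std = math.sqrt(sum((v - mean) ** 2 for v in train_values) / n)
    return mean, std

def standardize(values, mean, std):
    """f_hat = (f - mean) / std, applied with the training-set statistics."""
    return [(v - mean) / std for v in values]

train = [2.0, 4.0, 6.0]
mean, std = fit_standardizer(train)
print(standardize(train, mean, std))  # zero-mean, unit-variance training features
print(standardize([5.0], mean, std))  # the test set reuses the same mean and std
```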
        <p>In order to classify the data, we considered the following algorithms.</p>
        <p>
          Naive Bayes. Naive Bayes classifiers [
          <xref ref-type="bibr" rid="ref13">13</xref>
          ] define P(y|x_1, ..., x_n) as the probability
of class y given the features x_1, ..., x_n. Under the naive (hence the name)
assumption of conditional independence between every pair of features given the value
of the class variable, it can be proven that
P(y|x_1, ..., x_n) ∝ P(y) ∏_{i=1}^{n} P(x_i|y),
and the class estimate ŷ is thus defined as
ŷ = arg max_y P(y) ∏_{i=1}^{n} P(x_i|y),
where P(y) and P(x_i|y) can be estimated from data using maximum a posteriori
(MAP) estimation. In this work we adopted a Gaussian Naive Bayes model,
where P(x_i|y) is assumed to be a Gaussian function.</p>
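        <p>A minimal Gaussian Naive Bayes sketch of the equations above, with per-class Gaussian parameters estimated from the data (our own toy implementation, not the one used in the experiments):</p>

```python
import math
from collections import defaultdict

def gaussian_pdf(x, mean, var):
    """P(x_i | y) under the Gaussian assumption."""
    return math.exp(-(x - mean) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

def train_gnb(samples):
    """samples: list of (feature_vector, label). Estimates P(y) and,
    for each class and feature, the Gaussian mean and variance."""
    by_class = defaultdict(list)
    for x, y in samples:
        by_class[y].append(x)
    model = {}
    for y, xs in by_class.items():
        prior = len(xs) / len(samples)        # P(y)
        stats = []
        for d in zip(*xs):                    # one tuple of values per feature
            m = sum(d) / len(d)
            v = sum((t - m) ** 2 for t in d) / len(d) or 1e-9  # avoid zero variance
            stats.append((m, v))
        model[y] = (prior, stats)
    return model

def predict(model, x):
    """arg max_y P(y) * prod_i P(x_i | y)."""
    best, best_p = None, -1.0
    for y, (prior, stats) in model.items():
        p = prior
        for xi, (m, v) in zip(x, stats):
            p *= gaussian_pdf(xi, m, v)
        if p > best_p:
            best, best_p = y, p
    return best

model = train_gnb([((1.0,), "u1"), ((1.2,), "u1"), ((5.0,), "u2"), ((5.3,), "u2")])
print(predict(model, (1.1,)))  # u1
```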
        <p>
          K-Star. The K-Star classifier [
          <xref ref-type="bibr" rid="ref7">7</xref>
          ], or K*, is an instance-based classifier, meaning
that the class of an instance is based upon the class of those training instances
similar to it, as determined by some similarity function. Specifically, K-Star
adopts an entropy-based distance function, calculated by means of the complexity
of transforming one instance into another.
        </p>
        <p>
          Support Vector Machines. Linear Support Vector Machines [
          <xref ref-type="bibr" rid="ref1">1</xref>
          ] are based on the
idea that an optimal linear classifier maximizes the margin, that is, the width of
the strip parallel to the classification hyperplane that separates the two classes.
The solution is found by solving a constrained optimization problem, leading to the
following classification function:
f(x) = sgn( Σ_{i=1}^{n} y_i α_i (x · x_i) + b ),
where x_i is a sample from the training set and y_i is the corresponding class label,
while the α_i and b are found by solving the optimization problem. The solution is
actually sparse, since α_i = 0 for most of the data, except the few samples lying on
the margin (the support vectors). SVMs became popular because they can be easily
extended to the non-linear case by means of kernel methods.
        </p>
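        <p>Evaluating the classification function above only requires the support vectors, since all other α_i vanish. A sketch with hand-chosen (hypothetical) α_i and b, purely for illustration:</p>

```python
def svm_decision(x, support, b):
    """f(x) = sgn( sum_i y_i * alpha_i * <x, x_i> + b ), where `support`
    holds only the training samples with non-zero alpha (the support vectors)."""
    s = b
    for x_i, y_i, alpha_i in support:
        dot = sum(a * c for a, c in zip(x, x_i))
        s += y_i * alpha_i * dot
    return 1 if s >= 0 else -1

# Toy 1-D model: two support vectors as (x_i, y_i, alpha_i), with alpha and b
# chosen by hand (not learned). The resulting f(x) = sgn(2x - 2).
support = [((1.0,), -1, 1.0), ((3.0,), +1, 1.0)]
b = -2.0
print(svm_decision((0.5,), support, b), svm_decision((4.0,), support, b))  # -1 1
```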
        <p>
          C4.5. The C4.5 algorithm [
          <xref ref-type="bibr" rid="ref9">9</xref>
          ] is a decision tree building algorithm based on
its precursor ID3. The decision tree is built using the concept of information
entropy: at each node, C4.5 chooses the feature that most effectively splits the
data associated to the branch. The effectiveness of the split is measured in terms
of the information gain (difference in entropy) of the feature. Once built, the decision
tree can be easily used to classify new data.
        </p>
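        <p>The split criterion (information gain as a difference in entropy) can be sketched for a numeric feature and a candidate threshold; feature values below are invented for illustration:</p>

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy of a label multiset, in bits."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(labels, feature_values, threshold):
    """Entropy reduction obtained by splitting a numeric feature at `threshold`."""
    left = [l for l, v in zip(labels, feature_values) if v <= threshold]
    right = [l for l, v in zip(labels, feature_values) if v > threshold]
    n = len(labels)
    split_entropy = (len(left) / n) * entropy(left) + (len(right) / n) * entropy(right)
    return entropy(labels) - split_entropy

labels = ["u1", "u1", "u2", "u2"]
feature = [0.2, 0.4, 3.1, 3.5]  # e.g. a hypothetical average inter-click time
print(information_gain(labels, feature, 1.0))  # 1.0: a perfect split
```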
      </sec>
    </sec>
    <sec id="sec-4">
      <title>Tests and evaluation</title>
      <p>In order to evaluate the system performance, we tested the system on a Local
Area Network where 10 users (both male and female, in the age range of 20-45
years) were asked to visit web pages as in their normal daily routine. We did not
consider the case of a malicious user deliberately trying to hide their network
traffic pattern. Users were informed about privacy issues, to clarify that only web
traffic metadata was logged and no deep inspection was performed. This way,
users felt free to use the web as in their normal activity. Data were acquired
along 10 sessions, each one 10 minutes long. On average, we collected 300
clicks per user. The relatively small amount of data motivates the choice of the
algorithms presented in Section 3, since more sophisticated techniques such as
deep neural networks would require larger datasets.</p>
      <p>After pre-processing, the dataset was split into a training and a test set with
different ratios, to evaluate the performance on low amounts of training data.
Tests were performed using respectively 20%, 50% and 80% of the original data
as training set. Results obtained with the four classifiers are shown in Table 1.</p>
      <p>As can be seen, all the classifiers achieved good performance except
Support Vector Machines. This could be explained by a poor choice of training
parameters, since SVM results heavily depend on the choice of the kernel and
kernel parameters. As future work, we plan to further investigate this aspect.
Among the remaining classifiers, C4.5 performed best, reaching high accuracy
levels even with a small training set (20% of the total data). The preliminary results
are thus encouraging, and in the near future we aim to test the system with a
larger user base.</p>
      <p>The ability to identify web users by analyzing their network traffic can have
multiple applications, from user profiling to digital forensics. In this paper we
investigated the possibility of identifying users only by means of shallow
inspection of HTTP(S) network traffic. Shallow inspection, in which the content of
the packet payload is not analyzed, is motivated both by privacy issues and by
technological factors: nowadays, the increasing adoption of encrypted
connections is making deep inspection mostly useless. Despite the small amount of data
gathered by shallow inspection, we proposed a data pre-processing method to
extract high-level features that could be relevant for user identification, such as
inter-click time intervals, time spent on a single web page, etc. We tested four
different classifiers on a small dataset, obtaining encouraging preliminary results.</p>
      <p>
        As future work, we plan to acquire a larger dataset in order to test more
complex classifiers such as deep neural networks. Moreover, we intend to
investigate the reasons for the relatively low performance of the SVM classifier, in
particular concerning the choice of the kernel and kernel parameters. We also
plan to evaluate whether automatic deep feature extraction techniques can actually
outperform our manually-defined high-level feature set [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]. Finally, we will also
focus on proper representation and processing of categorical data, in order to
handle non-numerical features such as server IP addresses.
      </p>
      <p>Acknowledgments. We thank Clelia Bincoletto for preliminary work and
experiments on the subject of this paper.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Cristianini</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Shawe-Taylor</surname>
          </string-name>
          , J., et al.:
          <article-title>An introduction to support vector machines and other kernel-based learning methods</article-title>
          . Cambridge university press (
          <year>2000</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Daly</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          :
          <article-title>The legality of deep packet inspection</article-title>
          .
          <source>International Journal of Communications Law &amp; Policy</source>
          (
          <year>2011</year>
          ). https://doi.org/10.2139/ssrn.1628024
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Ester</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kriegel</surname>
            ,
            <given-names>H.P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Sander</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Xu</surname>
            ,
            <given-names>X.</given-names>
          </string-name>
          , et al.:
          <article-title>A density-based algorithm for discovering clusters in large spatial databases with noise</article-title>
          .
          <source>In: Proceedings of the Second International Conference on Knowledge Discovery and Data Mining (KDD-96)</source>
          . vol.
          <volume>96</volume>
          , pp.
          <volume>226</volume>
          -
          <issue>231</issue>
          (
          <year>1996</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>Goodfellow</surname>
            ,
            <given-names>I.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bengio</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Courville</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bach</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          :
          <article-title>Deep learning</article-title>
          . MIT press (
          <year>2016</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <surname>Kumar</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Turner</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Williams</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          :
          <article-title>Advanced algorithms for fast and scalable deep packet inspection</article-title>
          .
          <source>In: Architecture for Networking and Communications systems</source>
          ,
          <year>2006</year>
          .
          <article-title>ANCS 2006</article-title>
          . ACM/IEEE Symposium on. pp.
          <volume>81</volume>
          -
          <fpage>92</fpage>
          .
          <string-name>
            <surname>IEEE</surname>
          </string-name>
          (
          <year>2006</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <surname>Lotfollahi</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Shirali</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Siavoshani</surname>
            ,
            <given-names>M.J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Saberian</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          :
          <article-title>Deep packet: A novel approach for encrypted traffic classification using deep learning</article-title>
          .
          <source>arXiv preprint arXiv:1709.02656</source>
          (
          <year>2017</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <surname>Painuli</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Elangovan</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Sugumaran</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          :
          <article-title>Tool condition monitoring using k-star algorithm</article-title>
          .
          <source>Expert Systems with Applications</source>
          <volume>41</volume>
          (
          <issue>6</issue>
          ),
          <fpage>2638</fpage>
          -
          <lpage>2643</lpage>
          (
          <year>2014</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <surname>Parsons</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          :
          <article-title>Deep Packet Inspection in Perspective: Tracing its lineage and surveillance potentials</article-title>
          .
          <source>Surveillance Studies Centre</source>
          , Queen's University (
          <year>2008</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
          <string-name>
            <surname>Quinlan</surname>
            ,
            <given-names>J.R.</given-names>
          </string-name>
          :
          <source>C4.5: programs for machine learning</source>
          .
          <source>Elsevier</source>
          (
          <year>2014</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          10.
          <string-name>
            <surname>Srisuresh</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Holdrege</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          :
          <article-title>IP Network Address Translator (NAT) Terminology and Considerations</article-title>
          .
          <source>The Internet Society</source>
          (
          <year>1999</year>
          ), RFC
          <fpage>2663</fpage>
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          11.
          <string-name>
            <surname>Velea</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ciobanu</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gurzau</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Patriciu</surname>
            ,
            <given-names>V.V.</given-names>
          </string-name>
          :
          <article-title>Feature extraction and visualization for network pcapng traces</article-title>
          .
          In:
          <source>Control Systems and Computer Science (CSCS)</source>
          ,
          <year>2017</year>
          21st International Conference on. pp.
          <fpage>311</fpage>
          -
          <lpage>316</lpage>
          .
          <publisher-name>IEEE</publisher-name>
          (
          <year>2017</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          12.
          <string-name>
            <surname>Velea</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ciobanu</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Margarit</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bica</surname>
            ,
            <given-names>I.</given-names>
          </string-name>
          :
          <article-title>Network traffic anomaly detection using shallow packet inspection and parallel k-means data clustering</article-title>
          .
          <source>Studies in Informatics and Control</source>
          <volume>26</volume>
          (
          <issue>4</issue>
          ),
          <fpage>387</fpage>
          -
          <lpage>396</lpage>
          (
          <year>2017</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          13.
          <string-name>
            <surname>Zhang</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          :
          <article-title>The optimality of naive Bayes</article-title>
          .
          <source>Proc. Seventeenth International Florida Artificial Intelligence Research Society Conference, FLAIRS</source>
          <year>2004</year>
          (
          <year>2004</year>
          )
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>