A Tool for Classification of Sequential Data

Giacomo Kahn (giacomo.kahn@isima.fr), Université Clermont Auvergne, Université Blaise Pascal

BP 10448 F-63000 CLERMONT-FERRAND FRANCE

Yannick Loiseau, Université Clermont Auvergne, Université Blaise Pascal

BP 10448 F-63000 CLERMONT-FERRAND FRANCE

UMR 6158 CNRS LIMOS

F-63178 AUBIERE FRANCE

Olivier Raynaud, Université Clermont Auvergne, Université Blaise Pascal

BP 10448 F-63000 CLERMONT-FERRAND FRANCE

UMR 6158 CNRS LIMOS

F-63178 AUBIERE FRANCE

Keywords: Machine Learning, Classification, Web Usage Mining

Classification of sequential data (data obtained from series of actions in chronological order) has many applications in security, marketing and ergonomics. In this paper, we present a tool for classification of sequential data. We introduce a new clean dataset of web-browsing logs, and study the case of implicit authentication from web browsing. We then detail how the tool works and describe some of its parameters.

Introduction and related work

Event-related data can take the form of a succession of actions or events in chronological order. Data mining of such data has many applications in fields such as security (intrusion detection [1]), marketing (e.g. navigation in an e-commerce hierarchy) or ergonomics (study of successions of actions in work-related applications). Those applications require the search for meaningful patterns in the data. A pattern is a structure that appears with regularity in the data. It can be an itemset, a sequence, a sub-word or an association rule. In this context, meaningful means maximizing some metric, such as the support or the lift. Different algorithms exist to mine each of those. An interesting property for patterns is closure. A pattern $p$ is closed if there is no pattern $p'$ that is a proper superset of $p$ with $support(p') = support(p)$. Formal Concept Analysis (FCA) is a mathematical framework that deals with closed sets. Many algorithms from FCA enumerate closed sets (in the form of concepts), and there exist a number of interesting metrics based on concept lattices, such as the stability or robustness of a concept.
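The closure property defined above can be checked directly on a small set of transactions. The following Python sketch is only an illustration for itemsets; the function names `support` and `is_closed` and the toy transactions are assumptions made for this example and are not taken from the tool.

```python
def support(pattern, transactions):
    """Number of transactions that contain every item of `pattern`."""
    return sum(1 for t in transactions if pattern <= t)


def is_closed(pattern, transactions):
    """True if no proper superset of `pattern` has the same support.

    Testing one-item extensions is enough: if a larger superset kept the
    support, every one-item extension towards it would keep it as well.
    """
    items = set().union(*transactions) if transactions else set()
    base = support(pattern, transactions)
    return all(
        support(pattern | {x}, transactions) < base
        for x in items - pattern
    )


# Toy data (hypothetical): four transactions over the items a, b, c.
transactions = [frozenset("abc"), frozenset("ab"), frozenset("ac"), frozenset("a")]

print(is_closed(frozenset("a"), transactions))   # {a} appears in all four transactions
print(is_closed(frozenset("b"), transactions))   # {b} always occurs together with a
print(is_closed(frozenset("ab"), transactions))
```

In this toy example, {b} is not closed because {a, b} has the same support, whereas {a} and {a, b} are closed.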

The enumeration of these patterns alone is not sufficient in many cases and is only one step of a decision-making process. For example, in a context of security, one might want to find meaningful patterns as the first step of classification or prediction. In marketing, one might use patterns to construct groups of consumers or to find interesting association rules.

In [2,3], the authors introduce a tool for classification in the binary case, based on positive and negative examples in concept lattices. However, when this binary classifier is applied to the 1 − n case (n being the number of classes), all anonymous behaviours are classified as contradictory. Other works on mining in FCA include the mining of sequences in [4] and of graphs in [5]. In [6], the authors defined emerging patterns as patterns appearing frequently in one class but rarely in the other classes. See [7,8] for surveys on emerging patterns. An emerging closed-pattern classifier can be described as an extension to the 1 − n case of the binary concept lattice classifier, and can be used to predict the class of previously unseen objects. In [9], the authors present another generalisation to n classes of the closed-set based classifier. In particular, they introduce the use of tf × idf for the selection of the closed patterns.

In this paper, we present a tool for classification of sequential data, based on closed-patterns. This tool implements the classifier presented in [9]. We show some results of our tool on a dataset of web navigation logs from more than 3000 users over a six-month period.

This paper is organised as follows: in section 2, we explain how the classifier works and give more details about the tool and its parameters; in section 3, we present a case study and propose a clean dataset for experimentation; finally, we conclude and give some perspectives for our work.

Implementation

General parameters

In this section, we describe the classifier implemented by our tool. The tool covers a whole experimental process, from the building of transactions from raw data to detailed classification results. We mention some of the parameters accepted by each step.

Building transactions

Our tool allows us to group the data into transactions. The transactions can be of fixed size, or created with respect to a time stamp present in the original data. In our case study, the size is fixed and is equal to 10. The data file from which the transactions are built can be of arbitrary size.
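The two ways of grouping events into transactions can be sketched as follows. This is a minimal Python illustration, not the tool's code: the function names are hypothetical, dropping an incomplete trailing chunk is an assumption, and splitting on a maximum time gap is only one possible reading of "with respect to a time stamp".

```python
def fixed_size_transactions(events, size=10):
    """Cut a chronological list of items into consecutive chunks of `size`.

    Assumption: an incomplete trailing chunk is discarded.
    """
    return [events[i:i + size] for i in range(0, len(events) - size + 1, size)]


def time_gap_transactions(events, max_gap):
    """Group (timestamp, item) pairs, sorted by timestamp, into sessions.

    Assumption: a new transaction starts whenever two consecutive events
    are separated by more than `max_gap` time units.
    """
    transactions, current, last = [], [], None
    for timestamp, item in events:
        if last is not None and timestamp - last > max_gap:
            transactions.append(current)
            current = []
        current.append(item)
        last = timestamp
    if current:
        transactions.append(current)
    return transactions


# Hypothetical log of (timestamp in seconds, domain name) pairs.
log = [(0, "a.org"), (5, "b.org"), (100, "c.org"), (103, "d.org")]
print(time_gap_transactions(log, max_gap=30))
print(fixed_size_transactions([item for _, item in log], size=2))
```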

Extraction of own patterns

We call own patterns the patterns we believe to be representative of each class. For each class, we compute the patterns that satisfy some property or threshold for a given metric (e.g. support or tf × idf). With some metrics, the space of those patterns can be pruned. The number of patterns to keep and their maximum size are parameters. The nature of the pattern is also a parameter: as of today, one can choose between closed itemsets and sequences. For a given class $c$, we denote the set of own patterns by $P_c$.
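As an illustration of selecting own patterns with a metric, the sketch below ranks candidate itemsets for one class by a tf × idf score and keeps the best k. The exact formula used by the tool is not given here, so the definitions are assumptions: tf is taken as the relative support of the pattern in the class, and idf as the logarithm of the number of classes divided by the number of classes in which the pattern occurs. All names are hypothetical.

```python
import math


def relative_support(pattern, transactions):
    """Fraction of transactions that contain every item of `pattern`."""
    if not transactions:
        return 0.0
    return sum(1 for t in transactions if pattern <= t) / len(transactions)


def own_patterns(candidates, class_transactions, target, k):
    """Top-k candidates for class `target`, ranked by an assumed tf x idf."""
    n_classes = len(class_transactions)
    scores = {}
    for p in candidates:
        tf = relative_support(p, class_transactions[target])
        if tf == 0:
            continue  # the pattern never occurs in the target class
        # Number of classes in which the pattern occurs at least once.
        df = sum(
            1 for ts in class_transactions.values()
            if relative_support(p, ts) > 0
        )
        scores[p] = tf * math.log(n_classes / df)
    # Sort by decreasing score; ties are broken by the sorted item list.
    ranked = sorted(scores, key=lambda p: (-scores[p], sorted(p)))
    return ranked[:k]


# Hypothetical data: two classes with two transactions each.
data = {
    "alpha": [frozenset("ab"), frozenset("ac")],
    "beta": [frozenset("cd"), frozenset("c")],
}
candidates = [frozenset("a"), frozenset("c"), frozenset("d")]
print(own_patterns(candidates, data, "alpha", k=1))
```

With these definitions, a pattern that occurs in every class receives a score of zero, which matches the intuition that it does not discriminate between classes.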

Profile of a class

There exist different ways to compute the profile of a class. In our tool, we chose to define a common vector profile $V = \bigcup_{c \in C} P_c$ that is the union of all own patterns for all classes. We then compute its numerical components for each class from either the support, the lift or the tf × idf. This vector allows us to embed all classes in a common space. Each numerical value can be seen as the distance from the origin of the space along the corresponding dimension of the vector. For example, let α and β be two classes with $P_\alpha = \{A, B, C\}$ and $P_\beta = \{C, D, E\}$. The vector $V = P_\alpha \cup P_\beta$ then has 5 components $(A, B, C, D, E)$. For each class $c$, we compute a numerical value $k^c_i$ for each component $i$, giving

$V_\alpha = (k^\alpha_A, k^\alpha_B, k^\alpha_C, 0, 0)$ and $V_\beta = (0, 0, k^\beta_C, k^\beta_D, k^\beta_E)$.
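The example with the classes α and β can be reproduced with a few lines of Python. This is a sketch under stated assumptions: the helper names are hypothetical, patterns are represented by plain strings, and the scoring function is a placeholder standing in for the support, the lift or the tf × idf.

```python
def common_space(own):
    """Ordered union of the own patterns of all classes."""
    return sorted(set().union(*own.values()))


def class_vector(space, own_c, score):
    """Component is score(p) for an own pattern of the class, 0 otherwise."""
    return [score(p) if p in own_c else 0.0 for p in space]


own = {"alpha": {"A", "B", "C"}, "beta": {"C", "D", "E"}}
space = common_space(own)

# Placeholder score: every own pattern gets the value 1.0.
v_alpha = class_vector(space, own["alpha"], lambda p: 1.0)
v_beta = class_vector(space, own["beta"], lambda p: 1.0)

print(space)
print(v_alpha)
print(v_beta)
```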

Profile of an anonymous transaction

This step accepts the same parameters as the construction of the profile of a class. We can also choose the number of anonymous transactions that will be submitted to the classifier in the next step. For example, in Fig. 3, the number of anonymous transactions received by the classifier goes from 1 to 30.

Identification step

The goal is to guess the class corresponding to an anonymous set of transactions. After computing a profile for this anonymous set, we search for its nearest neighbor in the common space defined previously. The tool implements different similarity functions: Euclidean distance, cosine similarity, Kulczynski measure, and Dice similarity. The heuristics become more effective when they are provided with a higher number of anonymous transactions, which allows them to construct a finer profile for the anonymous user.
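The nearest-neighbor step can be sketched as follows. The formulas are common vector formulations of the four measures named above; in particular, the Kulczynski and Dice versions shown here are assumptions, since the exact definitions used by the tool are not stated. The function `identify` and the toy profiles are hypothetical.

```python
import math


def euclidean(u, v):
    """Euclidean distance (smaller means closer)."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def cosine(u, v):
    """Cosine of the angle between u and v."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def dice(u, v):
    """Vector form of the Dice coefficient (assumed formulation)."""
    dot = sum(a * b for a, b in zip(u, v))
    denom = sum(a * a for a in u) + sum(b * b for b in v)
    return 2 * dot / denom if denom else 0.0


def kulczynski(u, v):
    """Mean of the two overlap ratios (assumed formulation, weights >= 0)."""
    common = sum(min(a, b) for a, b in zip(u, v))
    su, sv = sum(u), sum(v)
    if not su or not sv:
        return 0.0
    return 0.5 * (common / su + common / sv)


def identify(anonymous, profiles, similarity=cosine):
    """Class whose profile vector is the most similar to `anonymous`."""
    return max(profiles, key=lambda c: similarity(anonymous, profiles[c]))


profiles = {"alpha": [1, 1, 1, 0, 0], "beta": [0, 0, 1, 1, 1]}
anonymous = [0.5, 0.5, 0.2, 0, 0]
print(identify(anonymous, profiles, kulczynski))
# A distance is turned into a similarity by negating it.
print(identify(anonymous, profiles, lambda u, v: -euclidean(u, v)))
```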

Global parameters

Other parameters for experiments include the number of runs, the verbosity level, the format of the data, the possibility to only compute statistics on the data, the use of a fuzzy approach and some parameters for binary classification.

Bayesian classifier

Our tool implements two smoothed Bayesian classifiers: a traditional Bayes classifier and a pattern-based Bayes classifier. Those classifiers provide a baseline against which the results of the experiments can be compared.

Fuzzy approach

The inclusion of a pattern in a transaction is a binary measure. When working with own patterns of significant size, this strict inclusion will often be false. We therefore consider a fuzzy approach for the support during the computation of an anonymous profile, using an inclusion level instead of a binary measure. The fuzzy support may then be computed as the average of the inclusion levels over the set of transactions. The fuzzy inclusion level $inc(P, T)$ can be computed as the proportion of the own pattern $P$ included in the transaction $T$:

$$inc(P, T) = \frac{\|P \cap T\|}{\|P\|} \qquad (1)$$
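Eq. 1 and the fuzzy support built from it translate directly into code. The sketch below is an illustration for itemsets; the function names and the sample data are assumptions made for this example.

```python
def inclusion(pattern, transaction):
    """inc(P, T): proportion of the items of P that appear in T (eq. 1)."""
    return len(pattern & transaction) / len(pattern)


def fuzzy_support(pattern, transactions):
    """Average inclusion level of `pattern` over a set of transactions."""
    return sum(inclusion(pattern, t) for t in transactions) / len(transactions)


pattern = frozenset("abcd")
transactions = [frozenset("abx"), frozenset("abcd"), frozenset("z")]

print(inclusion(pattern, transactions[0]))   # two of the four items are present
print(fuzzy_support(pattern, transactions))  # mean of 0.5, 1.0 and 0.0
```

With strict inclusion, only the second transaction would count and the support would be 1/3; the fuzzy version also credits the partial match of the first transaction.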

To adjust to different cases and be able to represent a wide range of inclusion, from intersection to strict inclusion, we use a transfer function to transform the simple level of inclusion of eq. 1. In the tool, those functions are defined by specifying points in a 2-dimensional space. Two points are fixed, (0, 0) and (1, 1). Some transfer functions are illustrated in Fig. 1. The coordinates of the two remaining points that define the transfer function are configuration parameters. In Fig. 1, the parameters for the inclusion are [(1, 0); (1, 0)]. This is equivalent to the binary measure of inclusion. For the intersection, the parameters are [(0, 1); (0, 1)]. Those parameters mean the measure is equal to one as soon as the intersection is not empty. For the simple ratio and for a more sigmoidal function, the parameters are respectively [(0, 0); (1, 1)] and [(0.25, 0); (0.75, 1)].
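One way to read these parameters is as a piecewise-linear function through (0, 0), the two configurable points and (1, 1). The Python sketch below follows that reading; it is an assumption about how the points are interpolated, and the function name `transfer` is hypothetical. Vertical segments, which occur for the inclusion and intersection settings, are treated as jumps.

```python
def transfer(x, p1, p2):
    """Piecewise-linear map through (0, 0), p1, p2 and (1, 1).

    Assumes the x-coordinates of the points are non-decreasing.
    """
    if x <= 0.0:
        return 0.0
    if x >= 1.0:
        return 1.0
    points = [(0.0, 0.0), p1, p2, (1.0, 1.0)]
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        # Zero-width (vertical) segments never match and act as jumps.
        if x0 <= x < x1:
            return y0 + (y1 - y0) * (x - x0) / (x1 - x0)
    raise ValueError("control points must have non-decreasing x-coordinates")


level = 0.5  # raw inclusion level from eq. 1
print(transfer(level, (1, 0), (1, 0)))        # strict inclusion
print(transfer(level, (0, 1), (0, 1)))        # intersection
print(transfer(level, (0, 0), (1, 1)))        # simple ratio
print(transfer(level, (0.25, 0), (0.75, 1)))  # sigmoid-like
```

Under this reading, the strict-inclusion setting maps every level below 1 to 0, and the intersection setting maps every level above 0 to 1, which agrees with the description of the two extreme cases.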

Configuration file

The parameters are given to the tool by a configuration file in .yml format. For the results of our case study, presented in Table 2, the file is presented in Fig. 2.

With this file as argument, the tool will receive from 1 to 30 anonymous transactions, and run 10 executions. The random seed can be fixed to reproduce experiments. The data comes from the directory Data/150users, and is in CSV format. The transactions are built with a fixed size of 10. The identification method is $H_1$ (closed itemsets and tf × idf metric), with at most 40 own closed patterns of maximum size 5. The similarity measure used is Kulczynski. The profiler is the metric used to compute the numerical coordinates of the common vector. When not specified, the method used for inclusion of the pattern is the strict inclusion.

Case study

Our case study is about implicit identification in web-browsing. Implicit identification is studied in [10] and in a web-browsing context in [11][12][13][14]. The challenge is to recognise a user amongst n. The classifier has to guess the corresponding user from an anonymous behaviour. If it fails to recognise the declared user, then the identity is not confirmed. In a security context, this situation can lead to restrictions in the system, or to the request of some explicit means of identification. The parameters used in this study are detailed in the configuration file of Fig. 2.

Data description

Our data comes from the proxy servers of Blaise Pascal University. It consists of $17 \times 10^6$ lines of connection logs from more than 3,000 users; each line contains the user ID, the time stamp and a domain name. We applied two types of filters on the domain names: blacklist filters and HTTP-request based filters. We used some lists (see footnote 3) of domain names to remove all domains regarded as advertising. We also filtered the data by the status code obtained after a simple HTTP request on the domain name. After those steps, $4 \times 10^6$ lines remain. We divide the file between the 3K users to obtain the class files. This dataset is available at http://fc.isima.fr/~kahngi/cez13.zip. The studies were conducted on the 150 users with the highest number of requests.

Some information about the data is available in Table 1. The table shows statistics from before preprocessing and after the filters were applied. #Users represents the number of users, #Sites represents the cardinality of the whole set of websites for all users and Avg #lines/user is the average number of lines per user. We can see that the number of users decreases because some users did not have a single line left after the filters. Roughly 40% of the websites were deleted by the filters, and the average number of lines per user was divided by roughly 5.

The experiments in [9] show that tf × idf produces better results than the lift or the support.

Figure 3 shows the kind of results that can be obtained with our tool. The abscissa is the number of anonymous transactions given to the classifier and the ordinate is the accuracy of the different heuristics. The dataset is divided as follows: 2/3 form the learning base for the learning step and 1/3 is used for the identification step. The division is random. Each test session consists of multiple runs of both those steps, which allows us to smooth the results by using the average accuracy. The table generated by our tool contains 15 columns. Those include the number of classes NC, the number of anonymous transactions received A, the method used, the average accuracy, the average number of transactions successfully and not successfully classified, the ratio of classified and not classified tests, and the run time in seconds. TC represents the ratio of classified tests. All this information allows the analysis of various aspects of the result. Some are presented in Table 2.
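The evaluation protocol described above (random 2/3 and 1/3 split, several runs, average accuracy) can be sketched as follows. This is a simplified illustration and not the tool's code: the split is done per class, each held-out transaction is classified on its own rather than in groups of 1 to 30, and `train` and `predict` are hypothetical callables supplied by the caller.

```python
import random


def split_two_thirds(transactions, rng):
    """Random split: 2/3 for learning, 1/3 for identification."""
    shuffled = transactions[:]
    rng.shuffle(shuffled)
    cut = (2 * len(shuffled)) // 3
    return shuffled[:cut], shuffled[cut:]


def average_accuracy(data, train, predict, runs=10, seed=0):
    """Mean accuracy over `runs` random splits of every class in `data`."""
    rng = random.Random(seed)  # fixed seed makes the experiment reproducible
    accuracies = []
    for _ in range(runs):
        learn, test = {}, []
        for label, transactions in data.items():
            learn[label], held_out = split_two_thirds(transactions, rng)
            test.extend((t, label) for t in held_out)
        model = train(learn)
        hits = sum(1 for t, label in test if predict(model, t) == label)
        accuracies.append(hits / len(test))
    return sum(accuracies) / runs


# Hypothetical toy classifier: a class is described by the set of its items,
# and a transaction goes to the class with which it shares the most items.
def train(learn):
    return {c: set().union(*ts) for c, ts in learn.items()}


def predict(model, transaction):
    return max(model, key=lambda c: len(model[c] & transaction))


data = {"u1": [frozenset("a")] * 6, "u2": [frozenset("b")] * 6}
print(average_accuracy(data, train, predict, runs=3, seed=1))
```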

Conclusion and perspectives

We presented a tool for classification of sequential data. It offers many options for the construction of the transactions, and different parameters and heuristics for classification. The tool is flexible and adaptable to many classification contexts and types of data. The perspectives of our work are to add other means of classification based on other types of patterns (such as closed sequences, pattern structures, or class association rules), and other types of metrics (for example structural metrics such as stability). We are also considering the use of aggregation functions other than the average for the fuzzy support, such as ordered weighted averaging (OWA) operators [15,16], or some power means. Moreover, we are considering the integration of different paradigms of user profiles. Another way to construct the profile of a class is to use association rules. Class association rules are association rules of the form A → C where C is a class and A a subset of items. They are studied in [17]. By attributing scores to the rules and searching for the premises of the rules in an anonymous transaction, we could classify the anonymous transaction in a given class.

Fig. 1. Different transfer functions

Fig. 2. Configuration file for the case study

Fig. 3. Average accuracy of the different heuristics as a function of the number of anonymous transactions received by the classifier.

Table 1. Descriptive statistics of the dataset

                      #Users   #Sites   Avg #lines/user
Raw data              3388     96184    5082
After preprocessing   3370     57654    1145

Experimental parameters

Our tool implements several heuristics. $H_0$ considers frequent 1-patterns with the best support, $H_0^{Lift}$ considers frequent 1-patterns with the best lift, $H_1$ considers closed k-patterns with the best tf × idf, and B is a smoothed Bayes classifier. The tf × idf is a metric that comes from information retrieval and text mining. It is the product of term frequency and inverse document frequency. It reflects how discriminating a pattern is for a given class.

Table 2. Output of the tool

NC    A    Method   Avg accuracy   Min accuracy   Max accuracy   TC
150   1    H1       0.31266        0.30213        0.32118        75.796%
150   2    H1       0.34378        0.32827        0.35666        93.335%
150   5    H1       0.46909        0.4505         0.48996        99.676%
150   10   H1       0.67352        0.63714        0.69905        100%
150   20   H1       0.87778        0.85111        0.92           100%
150   30   H1       0.94833        0.92333        0.96333        100%

3. http://winhelp2002.mvps.org/hosts.htm and https://pgl.yoyo.org/as.

Acknowledgement

This research was partially supported by the European Union's "Fonds Européen de Développement Régional (FEDER)" program.

References

[1] H. Azkia, N. Cuppens-Boulahia, F. Cuppens, G. Coatrieux: Log content extraction engine based on ontology for the purpose of a posteriori access control. IJKL 9(1/), 2014.

[2] S. O. Kuznetsov: Complexity of learning in concept lattices from positive and negative examples. Discrete Applied Mathematics 142(1-3), 2004.

[3] S. O. Kuznetsov: Machine learning and formal concept analysis. In: Concept Lattices, Second International Conference on Formal Concept Analysis, ICFCA 2004, Sydney, Australia, February 23-26, 2004, Proceedings. 2004.

[4] A. Buzmakov, E. Egho, N. Jay, S. O. Kuznetsov, A. Napoli, C. Raïssi: On mining complex sequential data by means of FCA and pattern structures. Int. J. General Systems 45(2), 2016.

[5] S. O. Kuznetsov: Learning of simple conceptual graphs from positive and negative examples. In: Principles of Data Mining and Knowledge Discovery, Third European Conference, PKDD '99, Prague, Czech Republic, September 15-18, 1999, Proceedings. 1999.

[6] K. Ramamohanarao, H. Fan: Patterns based classifiers. World Wide Web 1, 2007.

[7] P. K. Novak, N. Lavrac, G. I. Webb: Supervised descriptive rule discovery: A unifying survey of contrast set, emerging pattern and subgroup mining. Journal of Machine Learning Research 10, 2009.

[8] M. García-Borroto, J. F. M. Trinidad, J. A. Carrasco-Ochoa: A survey of emerging patterns for supervised classification. Artif. Intell. Rev. 42(4), 2014.

[9] O. Coupelon, D. Dia, F. Labernia, Y. Loiseau, O. Raynaud: Using closed itemsets for implicit user authentication in web browsing. In: Proceedings of the Eleventh International Conference on Concept Lattices and Their Applications, Košice, Slovakia, October 7-10, 2014. 2014.

[10] E. Shi, Y. Niu, M. Jakobsson, R. Chow: Implicit authentication through learning user behavior. In: Information Security - 13th International Conference, ISC 2010, Boca Raton, FL, USA, October 25-28, 2010, Revised Selected Papers. 2010.

[11] R. Kumar, A. Tomkins: A characterization of online browsing behavior. In: Proceedings of the 19th International Conference on World Wide Web, WWW 2010, Raleigh, North Carolina, USA, April 26-30, 2010. 2010.

[12] Y. C. Yang: Web user behavioral profiling for user identification. Decision Support Systems 49(3), 2010.

[13] S. Goel, J. M. Hofman, M. I. Sirer: Who does what on the web: A large-scale study of browsing behavior. In: Proceedings of the Sixth International Conference on Weblogs and Social Media, Dublin, Ireland, June 4-7, 2012. 2012.

[14] M. Abramson, D. W. Aha: User authentication from web browsing behavior. In: Proceedings of the Twenty-Sixth International Florida Artificial Intelligence Research Society Conference, FLAIRS 2013, St. Pete Beach, Florida, May 22-24, 2013. 2013.

[15] R. R. Yager: Families of OWA operators. 59, 1993.

[16] R. R. Yager: Constraint satisfaction using soft quantifiers. International Journal of Intelligent Systems in Accounting and Finance Management 12(3), 2004.

[17] B. Liu, W. Hsu, Y. Ma: Integrating classification and association rule mining. In: Proceedings of the Fourth International Conference on Knowledge Discovery and Data Mining (KDD-98), New York City, New York, USA, August 27-31, 1998. 1998.