<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Lazy Learning of Classification Rules for Complex Structure Data</article-title>
      </title-group>
      <contrib-group>
        <aff id="aff0">
          <institution>National Research University Higher School of Economics</institution>
          ,
          <addr-line>Moscow</addr-line>
          ,
          <country country="RU">Russia</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>In this paper, we address the machine learning classification problem and classify each test instance with a set of interpretable and accurate rules. We resort to the idea of lazy classification and the mathematical apparatus of Formal Concept Analysis to develop an abstract framework for this task. In a set of benchmarking experiments, we compare the proposed strategy with decision tree learning. We discuss a generalization of the proposed framework to the case of complex structure data, such as molecular graphs, in tasks such as predicting the biological activity of chemical compounds.</p>
      </abstract>
      <kwd-group>
        <kwd>formal concept analysis</kwd>
        <kwd>lazy classification</kwd>
        <kwd>complex structure</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>
        The classification task in machine learning aims to use historical data (a training set) to predict the unknown values of a discrete target variable for new data (a test set). While there are dozens of popular methods for solving the classification problem, there is usually an accuracy-interpretability trade-off when choosing a method for a particular task. Neural networks, random forests, and ensemble techniques (boosting, bagging, stacking, etc.) are known to outperform simpler methods on difficult tasks. Kaggle competitions also bear testimony to that: the winners usually resort to ensemble techniques, mainly to gradient boosting [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. The algorithms mentioned are widespread in those application scenarios where classification performance is the main objective. In optical character recognition, speech recognition, information retrieval, and many other tasks, we are typically satisfied with a trained model if it has a low generalization error.
      </p>
      <p>
        However, in many applications we need a model to be interpretable as well as accurate. Classification rules built from data and examined by experts may be justified or proved. In medical diagnostics, when making high-stakes decisions (e.g., predicting whether a patient has cancer), experts prefer to extract readable rules from a machine learning model in order to “understand” it and justify the decision. In credit scoring, for instance, applying ensemble techniques can be very effective, but the model is often required to have “sound business logic”, that is, to be interpretable [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ].
      </p>
      <p>Another point of interest in this paper is dealing with complex structure data in classification tasks. While there are various popular techniques for handling time series, sequences, and graph data, we discuss how pattern structures (as a formal data representation) and lazy associative classification (as a learning paradigm) can help to learn succinct classification rules in tasks with complex structure data.</p>
    </sec>
    <sec id="sec-2">
      <title>Definitions</title>
      <p>
        Here we introduce some notions from Formal Concept Analysis [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] which help
us to organize the search space for classification hypotheses.
      </p>
      <p>Definition 1. A formal context in FCA is a triple K = (G, M, I) where G is a set of objects, M is a set of attributes, and the binary relation I ⊆ G × M specifies which object possesses which attribute. gIm denotes that object g has attribute m. For subsets of objects and attributes A ⊆ G and B ⊆ M, the Galois operators are defined as follows:</p>
      <p>A′ = {m ∈ M | gIm ∀g ∈ A},</p>
      <p>B′ = {g ∈ G | gIm ∀m ∈ B}.</p>
      <p>A pair (A, B) such that A ⊆ G, B ⊆ M, A′ = B, and B′ = A is called a formal concept of the context K. The sets A and B are closed and are called the extent and the intent of the formal concept (A, B), respectively.</p>
      <p>
        Example 1. Let us consider a “classical” toy example of a classification task from
[
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]. The training set is represented in Table 1. All categorical attributes are binarized into “dummy” attributes. The table shows a formal context K = (G, M, I) with G = {1, …, 10}, M = {or, oo, os, tc, tm, th, hn, w} (let us omit the class attribute “play”) and I, a binary relation on G × M whose elements are represented by crosses (×) in the corresponding cells of the table.
      </p>
      <p>A concept lattice for this formal context is depicted in Fig. 1. It should be read as follows: for a given element (formal concept) of the lattice, its intent (a closed set of attributes) is given by all attributes whose labels can be reached in an ascending lattice traversal. Similarly, the extent (a closed set of objects) of a certain lattice element (formal concept) can be traced in a downward lattice traversal from the given point. For instance, the big blue-and-black circle depicts the formal concept ({1, 2, 5}, {or, tc, hn}).</p>
      <p>
        Such a concept lattice is a concise way of representing all closed itemsets (formal concepts’ intents) of a formal context. Closed itemsets, in turn, can serve as a condensed representation of classification rules [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]. In what follows, we develop the idea of a hypothesis search space represented by a concept lattice.
      </p>
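      <p>To make the link between a context and its closed itemsets concrete, here is a brute-force sketch (ours; exponential in |M| and thus suitable for toy contexts only) that enumerates all formal concepts by closing every attribute subset.</p>
      <preformat><![CDATA[
from itertools import combinations

# Brute-force enumeration of all formal concepts of a small (hypothetical)
# context: close every attribute subset B via B -> (B', B'') and keep the
# distinct results. Exponential in |M|; for toy examples only.
G = {1, 2, 3}
M = ("a", "b", "c")
I = {(1, "a"), (1, "b"), (2, "a"), (2, "c"), (3, "a"), (3, "b")}

def down(B):  # B': objects having all attributes of B
    return frozenset(g for g in G if all((g, m) in I for m in B))

def up(A):    # A': attributes shared by all objects of A
    return frozenset(m for m in M if all((g, m) in I for g in A))

concepts = set()
for r in range(len(M) + 1):
    for B in combinations(M, r):
        extent = down(B)                     # B'
        concepts.add((extent, up(extent)))   # (B', B'') is a formal concept

for extent, intent in sorted(concepts, key=lambda c: -len(c[0])):
    print(set(extent) or "{}", set(intent) or "{}")
]]></preformat>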
      <sec id="sec-2-1">
        <title>Pattern Structures</title>
        <p>
          Pattern structures are a natural extension of Formal Concept Analysis to objects with arbitrary partially ordered descriptions [
          <xref ref-type="bibr" rid="ref6">6</xref>
          ]. The order on a set of descriptions D allows one to define a semilattice (D, ⊓), i.e., for any di, dj, dk ∈ D: di ⊓ di = di, di ⊓ dj = dj ⊓ di, di ⊓ (dj ⊓ dk) = (di ⊓ dj) ⊓ dk. Please refer to [
          <xref ref-type="bibr" rid="ref7">7</xref>
          ]
for details.
        </p>
        <p>
          Definition 2. Let G be a set (of objects), let (D, ⊓) be a meet-semilattice (of all possible object descriptions) and let δ : G → D be a mapping between objects and descriptions. The set δ(G) := {δ(g) | g ∈ G} generates a complete subsemilattice (Dδ, ⊓) of (D, ⊓) if every subset X of δ(G) has an infimum ⊓X in (D, ⊓). A pattern structure is a triple (G, D, δ), where D = (D, ⊓), provided that the set δ(G) := {δ(g) | g ∈ G} generates a complete subsemilattice (Dδ, ⊓) [
          <xref ref-type="bibr" rid="ref6 ref8">6,8</xref>
          ].
Definition 3. Patterns are elements of D. Patterns are naturally ordered by the subsumption relation ⊑: given c, d ∈ D, one has c ⊑ d ⇔ c ⊓ d = c. The operation ⊓ is also called a similarity operation. A pattern structure (G, D, δ) gives rise to the following derivation operators (·)□:
        </p>
        <p>A□ = ⨅g∈A δ(g) for A ⊆ G,</p>
        <p>d□ = {g ∈ G | d ⊑ δ(g)} for d ∈ (D, ⊓).</p>
        <p>Pairs (A, d) satisfying A ⊆ G, d ∈ D, A□ = d, and d□ = A are called pattern concepts of (G, D, δ).</p>
        <p>Example 2. Closed sets of graphs can be represented with a pattern structure. Let {1, 2, 3} be a set of objects and {G1, G2, G3} the set of their molecular graphs. [Figure: three molecular graphs G1, G2, G3, each built on a C–C core with substituents such as CH3, NH2, OH, and Cl; see the original figure.]</p>
        <p>
          A set of objects {1, 2, 3}, their molecular graphs D = {G1, G2, G3} (δ(i) = Gi, i = 1, …, 3), and a similarity operator ⊓ defined in [
          <xref ref-type="bibr" rid="ref9">9</xref>
          ] comprise a pattern structure ({1, 2, 3}, (D, ⊓), δ).
        </p>
        <p>Here is the set of all pattern concepts of this pattern structure: the top concept ({1, 2, 3}, G1 ⊓ G2 ⊓ G3); the pairwise similarities ({1, 2}, G1 ⊓ G2), ({1, 3}, G1 ⊓ G3), ({2, 3}, G2 ⊓ G3); the object concepts ({1}, {G1}), ({2}, {G2}), ({3}, {G3}); and the bottom concept (∅, {G1, G2, G3}). [The graph patterns themselves are shown in the original figure.]</p>
        <p>
Please refer to [
          <xref ref-type="bibr" rid="ref9">9</xref>
          ] for clarification of this example.
        </p>
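        <p>Implementing the graph similarity operator of [
          <xref ref-type="bibr" rid="ref9">9</xref>
          ] is beyond a short illustration, but the pattern-structure machinery itself is compact. The sketch below (ours, with hypothetical data) replaces graph descriptions by sets of labels, so that the similarity operation becomes set intersection and subsumption reduces to set inclusion.</p>
        <preformat><![CDATA[
from functools import reduce

# A minimal pattern structure (G, (D, u), delta) sketch in which descriptions
# are sets, the similarity operation u is set intersection, and c [= d holds
# iff c u d = c. Hypothetical data; the paper's example uses labeled graphs.
delta = {
    1: frozenset({"C", "CH3", "NH2"}),
    2: frozenset({"C", "NH2", "OH"}),
    3: frozenset({"C", "Cl"}),
}
G = set(delta)

def meet(c, d):      # the similarity operation u
    return c & d

def box_ext(A):      # A[] = meet of delta(g) over all g in A
    return reduce(meet, (delta[g] for g in A))

def box_int(d):      # d[] = {g in G | d [= delta(g)}
    return {g for g in G if meet(d, delta[g]) == d}

A = {1, 2}
d = box_ext(A)        # frozenset({'C', 'NH2'})
print(d, box_int(d))  # (A, d) is a pattern concept since box_int(d) == A
]]></preformat>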
        <p>Further, we show how pattern concept lattices help to organize the search space for classification hypotheses.</p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>Related work</title>
      <p>Eager (non-lazy) algorithms construct classifiers that contain an explicit
hypothesis mapping unlabelled test instances to their predicted labels. A decision
tree classifier, for example, uses a stored model to classify instances by tracing
the instance through the tests at the interior nodes until a leaf containing the
label is reached. In eager algorithms, the main work is done during the phase of building the classifier.</p>
      <p>
        In lazy classification paradigm [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ], however, no explicit model is constructed; the inductive process is performed by a classifier that maps each test instance to a label using the training set directly.
      </p>
      <p>
        The authors of [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ] point out the following problem with decision tree learning: while the entropy measures used in C4.5 and ID3 are guaranteed to decrease on average, the entropy of a specific child may stay the same or even increase. In other words, a single decision tree may find only a locally optimal hypothesis in terms of an impurity measure such as Gini impurity or pairwise mutual information. Moreover, using a single tree may lead to many splits that are irrelevant for a given test instance. A decision tree built for each test instance individually can avoid splits on attributes that are irrelevant for that specific instance. Thus, such “customized” decision trees (actually, classification paths) built for a specific test instance may be much shorter and hence may provide a short explanation for the classification.
      </p>
      <p>
        Associative classifiers build a classifier using association rules mined from training data; such rules have the class attribute in their conclusion. This approach was shown to yield improved accuracy over decision trees, as associative classifiers perform a global search for rules satisfying certain quality constraints [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ]. Decision trees, on the contrary, perform a greedy search for rules by selecting the most promising attributes.
      </p>
      <p>
        Unfortunately, associative classifiers tend to output too many rules, many of which might never be used to classify a test instance. The lazy associative classification algorithm overcomes these problems by generating only the rules whose premises are subsets of the test instance’s attributes [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ]. Thus, in the lazy associative classification paradigm, only those rules are generated that might be used in classifying the given test instance. This leads to a reduced set of classification rules for each test instance.
      </p>
      <p>
        In [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ] and [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ] the authors generalize the lazy associative classification
framework to operate with complex data descriptions such as intervals, sequences,
processes and graphs.
      </p>
      <p>
        In [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ] the authors use concept lattices to represent each concept intent (a closed set of attributes) as a decision tree node, and the concept lattice itself as a set of overlapping decision trees. The construction of a decision tree is thus reduced to selecting one of the downward paths in the concept lattice via some information criterion.
      </p>
    </sec>
    <sec id="sec-3a">
      <title>The search for classification hypotheses in a concept lattice</title>
      <sec id="sec-3-1">
        <title>Binary-attribute case</title>
        <p>For training and test data represented as binary tables, we propose Algorithm
1.</p>
        <p>For each test instance, we keep only its attributes in the training set (steps 1-2 of Algorithm 1). We clarify what this means in the case of real-valued attributes in the subsection on the numeric-attribute case below.</p>
        <p>
          Then we utilize a modification of the In-Close algorithm [
          <xref ref-type="bibr" rid="ref14 ref15">14,15</xref>
          ] to find all formal concepts of a formal context with the attributes of a test instance (step 3 in Algorithm 1). We build formal concepts in a top-down manner (increasing the number of attributes) and backtrack when the cardinality of a formal concept intent exceeds k. The parameter k bounds the length of any possible hypothesis mined to classify the test instance and is therefore analogous to the depth of a decision tree. We speed up the computation of closed attribute sets (formal concept intents) by storing them in a separate data structure (the set S in the pseudocode).
        </p>
        <p>While generating formal concepts, we retain the values of the class attribute for all training instances that have all the corresponding attributes (i.e., for all objects in the formal concept extent). We calculate the value of some information criterion (such as Gini impurity, Gini gain, or pairwise mutual information) for each formal concept intent (step 4 in Algorithm 1). Retaining the top n concepts with the maximal values of the chosen criterion, we obtain a set of rules to classify the current test instance. For each concept we define a classification rule with the concept intent as a premise and the most common value of the class attribute among the instances of the concept extent as a conclusion.</p>
        <p>Finally, we predict the value of the class attribute for the current test instance simply via a majority rule among the n “best” classification rules (step 5 in Algorithm 1). Then the calculated formal concept intents are stored (step 6), and the cycle is repeated for the next test instance.</p>
      </sec>
      <sec id="sec-3-2">
        <title>Numeric-attribute case</title>
        <p>
          In our approach, we deal with numeric attributes similarly to what is done in the C4.5 algorithm [
          <xref ref-type="bibr" rid="ref16">16</xref>
          ]. We compute percentiles x1, …, xt for each numeric attribute x and introduce 2t new binary attributes of the form “x ≥ x1”, “x &lt; x1”, …, “x ≥ xt”, “x &lt; xt”. Let us demonstrate steps 1 and 2 of Algorithm 1 in the case of binary and numeric attributes with a sample from the Kaggle “Titanic: Machine Learning from Disaster” competition dataset (https://www.kaggle.com/c/titanic).
        </p>
        <p>Example 3. Fig. 2 shows a sample from the Titanic dataset. Let us form a formal context to classify a passenger with attributes “Pclass = 3, SibSp = 0, Age = 34.5”. We use the 25th and 50th percentiles of the Age attribute to binarize it. The corresponding binary table is shown in Table 2.</p>
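        <p>A possible sketch of this binarization step (ours; the column naming is hypothetical) is given below.</p>
        <preformat><![CDATA[
import numpy as np

# Percentile-based binarization of a numeric attribute, as described above:
# each numeric attribute x yields pairs of binary attributes "x >= x_i" and
# "x < x_i" for the chosen percentiles x_1, ..., x_t. (Our own sketch;
# the column naming is hypothetical.)
def binarize_numeric(values, percentiles=(25, 50)):
    values = np.asarray(values, dtype=float)
    thresholds = np.percentile(values, percentiles)
    columns = {}
    for p, thr in zip(percentiles, thresholds):
        columns["x >= p%d (%g)" % (p, thr)] = values >= thr
        columns["x < p%d (%g)" % (p, thr)] = values < thr
    return columns

ages = [22.0, 38.0, 26.0, 35.0, 34.5]   # hypothetical Age sample
for name, column in binarize_numeric(ages).items():
    print(name, column.astype(int))
]]></preformat>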
      </sec>
    </sec>
    <sec id="sec-4">
      <title>Example</title>
      <p>Let us illustrate the proposed algorithm with the toy example from Table 1. To classify object no. 10, we perform the steps below (listed after the pseudocode of Algorithm 1).</p>
      <sec id="sec-4-1">
        <title>1 https://www.kaggle.com/c/titanic</title>
      <preformat>
Algorithm 1. The proposed algorithm (binary-attribute case).

Input:
  K_train = (G_train, M ∪ {c_train}, I_train), a formal context (the training set);
  K_test = (G_test, M, I_test), a formal context (the test set);
  CbO(K, k), the algorithm used to find all formal concepts of a formal context K
    with intent cardinality not exceeding k;
  inf, an information criterion used to rate classification rules
    (such as Gini impurity, Gini gain, or pairwise mutual information);
  k, the maximal cardinality of each classification rule’s premise (a parameter);
  n, the number of rules used to predict each test instance’s class attribute
    (a parameter).
Output:
  c_test, the predicted values of the class attribute for the test instances in K_test.

S = ∅; c_test = [].  Initialize the set of formal concept intents (a.k.a. closed
  itemsets) that will be used to form classification rules for each test instance
  from G_test, and the list of predicted labels for the test instances.

for each test instance g_t ∈ G_test do
  1. Let M_t be the set of attributes of the test instance g_t together with the
     negations of the attributes not in g_t′;
  2. Build a formal context K_t = (G_train, M_t, I_t) where I_t = I ∩ (G × M_t).
     Informally, keep only the part of the context K_train with attributes from M_t;
  3. With the CbO algorithm and the set S of already computed formal concept intents,
     find all formal concepts of K_t with intent cardinality not exceeding the
     parameter k;
  4. Meanwhile, calculate the value of the criterion inf for each concept intent and
     keep the n intents with the highest values. For each “top-ranked” concept intent
     B_i determine c_i, the most common class among the objects from B_i′. Thus, form
     {B_i → c_i}, i = 1…n, a set of classification rules for g_t;
  5. Predict the value of the class attribute for g_t via a majority rule among
     {B_i → c_i}, i = 1…n. Append it to c_test;
  6. Add the calculated intents to S.
end for
      </preformat>
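      <p>For concreteness, the following is a compact executable sketch of Algorithm 1 (our sketch, not the authors’ implementation): it enumerates closed intents by brute force instead of CbO, marks negated attributes with a “~” prefix, and scores intents by plain Gini gain; all of these are simplifications of ours.</p>
      <preformat><![CDATA[
from collections import Counter
from itertools import combinations

def gini(labels):
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def classify(train, labels, test_attrs, k=2, n_rules=5):
    """train: list of frozensets of attributes (negations included, step 1);
    labels: class value per training object; test_attrs: attributes of the
    test instance. Returns the majority vote among the n best rules."""
    base = gini(labels)
    rules, seen = [], set()
    for r in range(1, k + 1):                 # steps 2-3: intents of size <= k
        for premise in combinations(sorted(test_attrs), r):
            premise = frozenset(premise)
            extent = [i for i, a in enumerate(train) if premise <= a]
            if not extent:
                continue
            # closure of the premise within the restricted context K_t
            intent = frozenset.intersection(*(train[i] for i in extent)) & test_attrs
            if len(intent) > k or intent in seen:
                continue                      # analogue of CbO backtracking
            seen.add(intent)
            ext_labels = [labels[i] for i in extent]
            score = base - gini(ext_labels)   # step 4: criterion inf (Gini gain)
            pred = Counter(ext_labels).most_common(1)[0][0]
            rules.append((score, intent, pred))
    top = sorted(rules, key=lambda t: -t[0])[:n_rules]
    # step 5: majority rule among the n best classification rules
    return Counter(pred for _, _, pred in top).most_common(1)[0][0]

# Hypothetical toy data; "~" marks a negated attribute.
train = [frozenset({"os", "tc", "~w"}), frozenset({"os", "~tc", "w"}),
         frozenset({"~os", "tc", "~w"}), frozenset({"os", "tc", "w"})]
labels = ["yes", "no", "yes", "yes"]
print(classify(train, labels, frozenset({"os", "tc", "~w"})))
]]></preformat>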
      <p>1. Let us fix Gini impurity as the information criterion of interest and the parameters k = 2 and n = 5. Thus, we are going to classify the test instance with 5 rules, each with at most 2 attributes in its premise, having the highest gain in Gini impurity.
2. The case “Outlook=sunny, Temperature=cool, Humidity=high, Windy=false” corresponds to the set of attributes {os, tc} describing the test instance. Or, if we also consider the negations of the attributes the instance does not have, the case is described by the set {¬or, ¬oo, os, tc, ¬tm, ¬th, ¬hn, ¬w}.
3. We build a formal context with objects being the training set instances and attributes being those of the test instance: {¬or, ¬oo, os, tc, ¬tm, ¬th, ¬hn, ¬w}. The corresponding binary table is shown in Table 3.
4. The line diagram of the concept lattice for the formal context given by Table 3 is shown in Fig. 3. The horizontal line separates the concepts with intents having at most 2 attributes.
5. The 13 formal concepts with intents having at most 2 attributes give rise to 13 classification rules. The top 5 rules with the highest gain in Gini impurity are given in Table 4.
6. The “best” rules mined in the previous step unanimously classify the test instance “Outlook=sunny, Temperature=cool, Humidity=high, Windy=false” as appropriate for playing tennis.</p>
    </sec>
    <sec id="sec-5">
      <title>Experiments</title>
      <p>
        We compare the proposed classification algorithm (“PCL” for Pattern
Concept Lattice based classification) with the results from [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ] on several datasets from the UCI machine learning repository (https://archive.ics.uci.edu/ml).
      </p>
      <p>We used pairwise mutual information as the criterion for rule selection. The parameters k ∈ {3, …, 7} and n ∈ {1, …, 5} were chosen via 5-fold cross-validation. The described algorithms were implemented in Python 2.7.3 and run on a dual-core CPU (Core i3-370M, 2.4 GHz) with 3.87 GB RAM.</p>
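      <p>As an illustration of this tuning step, the grid search over k and n could look like the sketch below (ours; it reuses classify(), train, and labels from the Algorithm 1 sketch above, and the fold-splitting scheme is an assumption).</p>
      <preformat><![CDATA[
import random
from statistics import mean

# 5-fold cross-validated grid search over k and n, as described above.
# Reuses classify(), train and labels from the Algorithm 1 sketch; with
# realistically sized data every fold is non-empty.
def cv_accuracy(train, labels, k, n_rules, folds=5, seed=0):
    idx = list(range(len(train)))
    random.Random(seed).shuffle(idx)
    scores = []
    for f in range(folds):
        held_out = idx[f::folds]
        if not held_out:
            continue
        rest = [i for i in idx if i not in held_out]
        tr_X, tr_y = [train[i] for i in rest], [labels[i] for i in rest]
        hits = [classify(tr_X, tr_y, train[i], k, n_rules) == labels[i]
                for i in held_out]
        scores.append(mean(hits))
    return mean(scores)

best_k, best_n = max(((k, n) for k in range(3, 8) for n in range(1, 6)),
                     key=lambda kn: cv_accuracy(train, labels, *kn))
print("chosen parameters:", best_k, best_n)
]]></preformat>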
      <p>
        The algorithm was also tested on the 2001 Predictive Toxicology Challenge (PTC) dataset (http://www.predictive-toxicology.org/ptc/). Please refer to [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ] and [
        <xref ref-type="bibr" rid="ref17">17</xref>
        ] for the description of the problem and
some notions on pattern structures with descriptions given by labeled graphs.
Here we compare the results of the proposed algorithm (Pattern Concept Lattice based classification) and the previously developed graphlet-based lazy associative classification on the PTC dataset. The results are shown in Table 6.
      </p>
      <p>To clarify, in both algorithms k-graphlet (the parameter “K nodes” in Table 6) graph intersections were built. In “GLAC”, each test instance is classified via voting among all classification hypotheses. In “PCL”, only the n best (according to some
      <sec id="sec-5-1">
        <title>3 http://www.predictive-toxicology.org/ptc/</title>
information criterion) closed hypotheses are chosen (here we used n = 5). As we can see, “PCL” works slightly better on this dataset, suggesting that choosing the “best” hypotheses for classification may lead to more accurate results.</p>
    </sec>
    <sec id="sec-6">
      <title>Conclusion and further work</title>
      <p>In this paper, we have shown how searching for classification hypotheses in
a formal concept lattice for each test instance individually may yield accurate
results while providing succinct classification rules. The proposed strategy is
computationally demanding but may be used for “small data” problems where
prediction delay is not as important as classification accuracy and
interpretability.</p>
      <p>Further, we plan to interpret random forests as a search for an optimal hypothesis in a concept lattice and to try to compete with this popular classification technique.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <given-names>Grigorios</given-names>
            <surname>Tsoumakas</surname>
          </string-name>
          , Apostolos Papadopoulos, Weining Qian, Stavros Vologiannidis, Alexander D'yakonov, Antti Puurula, Jesse Read, Jan Svec, and Stanislav Semenov, “
          <article-title>Wise 2014 challenge: Multi-label classification of print media articles to topics,”</article-title>
          <source>in 15th International Conference on Web Information Systems Engineering (WISE</source>
          <year>2014</year>
          ).
          <source>Proceedings Part II. October 12-14</source>
          <year>2014</year>
          , vol.
          <volume>8787</volume>
          of Lecture Notes in Computer Science, pp.
          <fpage>541</fpage>
          -
          <lpage>548</lpage>
          , Springer.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <given-names>X.</given-names>
            <surname>Li</surname>
          </string-name>
          and
          <string-name>
            <given-names>Y.</given-names>
            <surname>Zhong</surname>
          </string-name>
          , “
          <article-title>An overview of personal credit scoring: Techniques and future work</article-title>
          ,”
          <source>International Journal of Intelligence Science</source>
          , vol.
          <volume>2</volume>
          , no.
          <issue>4A</issue>
          , pp.
          <fpage>181</fpage>
          -
          <lpage>189</lpage>
          ,
          <year>2012</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <given-names>Bernhard</given-names>
            <surname>Ganter</surname>
          </string-name>
          and
          <string-name>
            <given-names>Rudolf</given-names>
            <surname>Wille</surname>
          </string-name>
          ,
          <source>Formal Concept Analysis: Mathematical Foundations</source>
          , Springer-Verlag New York, Inc., Secaucus, NJ, USA, 1st edition,
          <year>1997</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <given-names>Thomas M.</given-names>
            <surname>Mitchell</surname>
          </string-name>
          ,
          <source>Machine Learning</source>
          , McGraw-Hill, Inc., New York, NY, USA, 1st edition,
          <year>1997</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <given-names>Itamar</given-names>
            <surname>Hata</surname>
          </string-name>
          , Adriano Veloso, and Nivio Ziviani, “
          <article-title>Learning accurate and interpretable classifiers using optimal multi-criteria rules</article-title>
          .,
          <source>” JIDM</source>
          , vol.
          <volume>4</volume>
          , no.
          <issue>3</issue>
          , pp.
          <fpage>204</fpage>
          -
          <lpage>219</lpage>
          ,
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <given-names>Bernhard</given-names>
            <surname>Ganter</surname>
          </string-name>
          and Sergei Kuznetsov, “
          <article-title>Pattern Structures and Their Projections,” in Conceptual Structures: Broadening the Base, Harry Delugach</article-title>
          and Gerd Stumme, Eds., vol.
          <volume>2120</volume>
          of Lecture Notes in Computer Science, pp.
          <fpage>129</fpage>
          -
          <lpage>142</lpage>
          . Springer, Berlin/Heidelberg,
          <year>2001</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <given-names>Sergei O.</given-names>
            <surname>Kuznetsov</surname>
          </string-name>
          , “
          <article-title>Fitting pattern structures to knowledge discovery in big data,” in Formal Concept Analysis: 11th International Conference</article-title>
          , ICFCA 2013, Dresden, Germany, May 21-24,
          <year>2013</year>
          . Proceedings, Peggy Cellier, Felix Distel, and Bernhard Ganter, Eds., Berlin, Heidelberg,
          <year>2013</year>
          , pp.
          <fpage>254</fpage>
          -
          <lpage>266</lpage>
          , Springer Berlin Heidelberg.
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <given-names>Sergei O.</given-names>
            <surname>Kuznetsov</surname>
          </string-name>
          , “
          <article-title>Scalable Knowledge Discovery in Complex Data with Pattern Structures,” in PReMI</article-title>
          , Pradipta Maji, Ashish Ghosh,
          <string-name>
            <given-names>M. Narasimha</given-names>
            <surname>Murty</surname>
          </string-name>
          , Kuntal Ghosh, and Sankar K. Pal, Eds.
          <year>2013</year>
          , vol.
          <volume>8251</volume>
          of Lecture Notes in Computer Science, pp.
          <fpage>30</fpage>
          -
          <lpage>39</lpage>
          , Springer.
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
          <string-name>
            <given-names>Yury</given-names>
            <surname>Kashnitsky</surname>
          </string-name>
          and
          <string-name>
            <surname>Sergei O. Kuznetsov</surname>
          </string-name>
          , “
          <article-title>Lazy associative graph classification</article-title>
          ,”
          <source>in Proceedings of the 4th International Workshop “What can FCA do for Artificial Intelligence?” (FCA4AI 2015), co-located with the International Joint Conference on Artificial Intelligence (IJCAI 2015), Buenos Aires, Argentina, July 25</source>
          ,
          <year>2015</year>
          , pp.
          <fpage>63</fpage>
          -
          <lpage>74</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          10. David W. Aha, Ed.,
          <source>Lazy Learning</source>
          , Kluwer Academic Publishers, Norwell, MA, USA,
          <year>1997</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          11.
          <string-name>
            <given-names>Jerome H.</given-names>
            <surname>Friedman</surname>
          </string-name>
          , “
          <article-title>Lazy decision trees</article-title>
          ,
          <source>” in Proceedings of the Thirteenth National Conference on Artificial Intelligence -</source>
          Volume
          <volume>1</volume>
          .
          <year>1996</year>
          , AAAI'
          <fpage>96</fpage>
          , pp.
          <fpage>717</fpage>
          -
          <lpage>724</lpage>
          , AAAI Press.
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          12.
          <string-name>
            <given-names>Adriano</given-names>
            <surname>Veloso</surname>
          </string-name>
          , Wagner Meira Jr., and
          <string-name>
            <given-names>Mohammed J.</given-names>
            <surname>Zaki</surname>
          </string-name>
          , “Lazy Associative Classification,”
          <source>in Proceedings of the Sixth International Conference on Data Mining</source>
          , Washington, DC, USA,
          <year>2006</year>
          , ICDM '06, pp.
          <fpage>645</fpage>
          -
          <lpage>654</lpage>
          , IEEE Computer Society.
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          13.
          <string-name>
            <given-names>Radim</given-names>
            <surname>Belohlavek</surname>
          </string-name>
          , Bernard De Baets, Jan Outrata, and Vilem Vychodil, “
          <article-title>Inducing decision trees via concept lattices</article-title>
          ,”
          <source>International Journal of General Systems</source>
          , vol.
          <volume>38</volume>
          , no.
          <issue>4</issue>
          , pp.
          <fpage>455</fpage>
          -
          <lpage>467</lpage>
          ,
          <year>2009</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          14.
          <string-name>
            <given-names>Sergei O.</given-names>
            <surname>Kuznetsov</surname>
          </string-name>
          , “
          <article-title>A fast algorithm for computing all intersections of objects from an arbitrary semilattice</article-title>
          ,”
          <source>Nauchno-Tekhnicheskaya Informatsiya, Seriya 2: Informatsionnye Protsessy i Sistemy</source>
          , no.
          <issue>1</issue>
          , pp.
          <fpage>17</fpage>
          -
          <lpage>20</lpage>
          ,
          <year>1993</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          15. S. Andrews, “
          <article-title>In-close, a fast algorithm for computing formal concepts</article-title>
          ,
          <source>” in CEUR Workshop Proceedings</source>
          ,
          <year>2009</year>
          , vol.
          <volume>483</volume>
          .
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          16.
          <string-name>
            <given-names>J. Ross</given-names>
            <surname>Quinlan</surname>
          </string-name>
          ,
          <source>C4.5: Programs for Machine Learning</source>
          , Morgan Kaufmann Publishers Inc., San Francisco, CA, USA,
          <year>1993</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          17.
          <string-name>
            <given-names>Sergei O.</given-names>
            <surname>Kuznetsov</surname>
          </string-name>
          and
          <string-name>
            <given-names>Mikhail V.</given-names>
            <surname>Samokhin</surname>
          </string-name>
          , “
          <article-title>Learning Closed Sets of Labeled Graphs for Chemical Applications,” in ILP, Stefan Kramer</article-title>
          and Bernhard Pfahringer, Eds.
          <year>2005</year>
          , vol.
          <volume>3625</volume>
          of Lecture Notes in Computer Science, pp.
          <fpage>190</fpage>
          -
          <lpage>208</lpage>
          , Springer.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>