<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Learning Preferences for Collaboration</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Eva Armengol</string-name>
          <email>eva@iiia.csic.es</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
<institution>Artificial Intelligence Research Institute (IIIA-CSIC)</institution>
          ,
          <addr-line>Campus de la UAB, 08193 Bellaterra, Catalonia</addr-line>
          ,
          <country country="ES">Spain</country>
        </aff>
      </contrib-group>
      <abstract>
<p>In this paper we propose the acquisition of a set of collaboration preferences between classifiers based on decision trees. A classifier uses a well-known algorithm (k-NN with leave-one-out) on its own knowledge base to generate a set of tuples with information about the object to be classified, the number of similar precedents, the maximum similarity, and whether or not it is a situation of collaboration. We consider that a classifier does not collaborate when it is able to reach the correct classification for an object by itself; otherwise it has to collaborate. This set of tuples is given as input to generate a decision tree from which a set of collaboration preferences is obtained.</p>
      </abstract>
      <kwd-group>
        <kwd>Machine learning</kwd>
<kwd>Classification</kwd>
        <kwd>Learning preferences</kwd>
        <kwd>Collaboration</kwd>
        <kwd>Decision trees</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>
        In Machine Learning the idea of cooperation between entities appears with the
formation of ensembles. An ensemble is composed of several classifiers (using
inductive learning methods), each of them capable of completely solving
a problem. Since classifiers can provide different solutions for the same problem,
the key issue of ensembles is how to aggregate the solutions proposed by the
different classifiers. Perrone and Cooper [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ] proved that aggregating the solutions
obtained by independent classifiers improves the accuracy of each classifier on its
own. In that approach the cooperation among entities consists of both sharing
the results for the same problem and reaching an aggregated solution.
      </p>
      <p>
        Plaza and Ontañón [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ] take the idea of ensemble learning and apply it to
multi-agent systems. These authors define a committee as an ensemble of agents
where each agent has its own experience and is capable of completely solving
new problems. Each agent in a committee can solve problems on its own, but it can also
collaborate with other agents in order to improve its accuracy. The difference
between this approach and the most common approaches to multi-agent learning
systems (MALS) is that in a committee each agent is able to completely solve a
problem, whereas in MALS approaches each agent solves a part of a problem.
      </p>
      <p>
        Related to the idea of ensembles there is also the idea of meta-learning, whose
aim is to construct a classifier from distributed knowledge bases. The idea is
to combine the predictions of an ensemble of classifiers in order to obtain a
global classifier. This global classifier establishes what could be seen as a set
of preferences (since it is not a simple aggregation procedure) to give the final
classification. Prodromidis et al. [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ] analyzed meta-learning and gave a simplified
meta-learning scenario composed of the following phases:
      </p>
      <sec id="sec-1-1">
        <title>1. the base classi ers are trained from the data,</title>
        <p>2. each classi er generates independently a prediction for the data on a separate
test set,
3. a meta-level training set is constructed from the test set and the predictions
generated by each classi er on the test set,
4. the meta-classi er is trained from the meta-level training set.</p>
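      <p>As an illustration only, the following sketch instantiates these four phases; the estimator choices and the scikit-learn usage are assumptions of ours, not part of the cited work, and the class labels are assumed to be numerically encoded:</p>
      <preformat>
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression

def meta_learn(X, y):
    # Phase 1: train the base classifiers from the data.
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3)
    base = [DecisionTreeClassifier().fit(X_tr, y_tr),
            KNeighborsClassifier(n_neighbors=3).fit(X_tr, y_tr)]
    # Phase 2: each classifier independently predicts on the separate test set.
    preds = [clf.predict(X_te) for clf in base]
    # Phase 3: the predictions on the test set form the meta-level training set.
    meta_X = np.column_stack(preds)
    # Phase 4: the meta-classifier is trained from the meta-level training set.
    meta = LogisticRegression().fit(meta_X, y_te)
    return base, meta
      </preformat>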
        <p>
          What we propose in this paper is similar to both ensembles and
meta-learning. As in ensembles, our goal is to solve a new problem, and we take the
approach proposed by Plaza and Ontañón [
          <xref ref-type="bibr" rid="ref10">10</xref>
          ], that is to say, each classifier does
not solve a part of a problem (as in the most common ensemble approaches)
but can completely solve the problem. The metaphor of our approach is the
following: let us suppose that a physician has to diagnose a patient but does
not have enough experience to do so. The most usual behavior would be for this
physician to ask other colleagues for advice in diagnosing that patient. As the
physician interacts with others to solve problems that initially were outside
his experience, he in turn acquires experience with this kind of problem and,
consequently, the interaction with other experts is reduced.
        </p>
        <p>
          In a previous work [
          <xref ref-type="bibr" rid="ref1">1</xref>
          ] we implemented this scenario and proposed that
agents can benefit from the collaboration with other agents by learning
domain knowledge. Our point was that if agents are able to justify the solutions
they provide, then agents receiving these justifications can use them as new
domain knowledge (like domain rules). The idea of benefiting from
cooperation between learners was pointed out by Provost and Hennessy [
          <xref ref-type="bibr" rid="ref12">12</xref>
          ]. These
authors showed how individual learners can share domain rules in order to build
a common domain theory from distributed data and how this improves the
performance of the whole system.
        </p>
        <p>
          In the current paper we are interested in showing how an individual agent can
improve its own domain knowledge through collaboration with other agents. To
do that we analyze a question that has to be answered
before starting the machinery described in [
          <xref ref-type="bibr" rid="ref1">1</xref>
          ]: when does an agent prefer to ask
for collaboration instead of giving the classification it has reached using its own
experience? In the previous work we assumed that this collaboration takes place
when the classification does not have enough support. However, now we take a closer
look at the agent's own competence and learn the situations in which the agent prefers
the classification it has obtained and those in which it prefers to ask other agents.
        </p>
        <p>The paper is organized as follows. In Section 2 we present the scenario and
introduce the elements that will be used as input for learning. In Section 3 we
describe the procedure to construct the preference rules for collaboration.</p>
    </sec>
    <sec id="sec-2">
      <title>Scenario</title>
      <p>
        Let us suppose a classifier capable of solving problems of a given domain. Domain
objects are described by sets of attribute-value pairs and each object has
an associated class label belonging to a set C = {C1, ..., Cn}. We assume that all the
domain objects are described using the same set of attributes. The experience
of the classifier is a knowledge base containing domain objects together with
their class labels, i.e., pairs ⟨Oi, Cj⟩. Given a problem p to classify, the classifier uses
the k-Nearest Neighbor (k-NN) algorithm [
        <xref ref-type="bibr" rid="ref4 ref5">5, 4</xref>
        ] on the knowledge base to obtain
the class label for p. The k-NN algorithm uses a similarity measure to assess
the similarity between the object p and each one of the domain objects in the
knowledge base. The outcome of k-NN is the set of the k objects most similar
to p.
      </p>
      <p>
        All that the classifier knows about its own knowledge are the problems in its
knowledge base. Therefore the knowledge base is the only source from which it
can learn about its own competence. The procedure we propose now is similar to
the one described in [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ], but here the classifier tries to solve its own problems.
In other words, for each ⟨Oi, Ci⟩ in its knowledge base, the classifier takes Oi and
uses k-NN to classify it. Let us suppose that k-NN proposes the class Cj as the
classification for Oi. In such a situation there are three possible scenarios:
1. Cj = Ci, i.e., Oi has been classified correctly.
2. Cj ≠ Ci, i.e., Oi has been classified incorrectly.
3. k-NN proposes more than one class for the object.
      </p>
      <p>This procedure can be applied either to all the objects of the knowledge base
(using leave-one-out) or to a selected subset of objects. Each object Oi is a
domain object described by a set of attributes A = {a1, ..., an}, each one with
a value that may be either numeric or symbolic. For each object Oi in the
knowledge base the classifier generates a tuple as follows:</p>
      <p>⟨Oi.a1, ..., Oi.an, Cj, α, sim, action⟩
where the notation Oi.al stands for the value that the object Oi takes in the
attribute al; Cj is the classification of Oi using the majority rule; α is the number
of examples used by k-NN to reach the solution; sim is the maximum
similarity between Oi and those examples; and action is either collaboration or
no-collaboration. When Oi is solved correctly, the action is no-collaboration
(meaning that the classifier is able to solve the problem Oi correctly with its own
knowledge); otherwise the action is collaboration (meaning that the classifier is
not able to solve the problem Oi correctly with its own knowledge).</p>
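      <p>A minimal sketch of how such a tuple can be assembled (the function and argument names are ours, for illustration):</p>
      <preformat>
def make_tuple(obj_values, correct_class, predicted_class, alpha, max_sim):
    """Assemble one training tuple: (a1, ..., an, Cj, alpha, sim, action).
    The classifier collaborates exactly when it could not reach the
    correct class on its own."""
    action = ("no-collaboration" if predicted_class == correct_class
              else "collaboration")
    return (*obj_values, predicted_class, alpha, max_sim, action)

# e.g. make_tuple(("spots", "round", "big"), "eatable", "poisonous", 5, 0.9)
# returns ("spots", "round", "big", "poisonous", 5, 0.9, "collaboration")
      </preformat>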
      <p>
        From the set of all these tuples, the classifier constructs a decision tree that
allows it to learn preferences about two situations: a) when to collaborate with
other classifiers, and b) when the classifier prefers its own solution. This approach
is similar to the one proposed in [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ], where the authors use a decision tree to
compute the confidence degree of the classification given for an object. In [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ]
the examples used to construct the decision tree are tuples of three elements,
therefore the tree has at most three levels, meaning that the leaves may contain
elements belonging to both classes. In our approach, the tuples we use to
construct the decision tree have n + 3 components (the n attributes plus α, sim
and action). Also, the preferences we obtain are (or may be) in terms of some
of the attributes describing the objects.
      </p>
      <sec id="sec-2-1">
        <title>Similarity between objects</title>
        <p>
          The k-NN algorithm uses a similarity measure to retrieve the k objects that are
most similar to a given object Oi. The most common similarity measure
used in k-NN when objects are represented as sets of attribute-value pairs and
the values are numerical is the Euclidean distance (although other measures are
also used, see for instance [
          <xref ref-type="bibr" rid="ref7">7</xref>
          ]). When the values of the attributes are symbolic,
the most common similarity measure is the following one:
$$sim(O_i.a_l, O_j.a_l) = \begin{cases} 1 &amp; \text{if } O_i.a_l = O_j.a_l \\ 0 &amp; \text{otherwise} \end{cases}$$
Therefore the similarity between objects Oi and Oj is computed as follows:
$$sim(O_i, O_j) = \frac{1}{n} \sum_{l=1}^{n} sim(O_i.a_l, O_j.a_l)$$
In other words, the similarity of both objects is the number of attributes taking
the same value in both objects, normalized by the number of attributes
describing the objects.
        </p>
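        <p>A direct rendition of this measure (a sketch; objects are assumed to be represented as equal-length tuples of symbolic values):</p>
        <preformat>
def attr_sim(v1, v2):
    # Overlap measure for symbolic values: 1 if equal, 0 otherwise.
    return 1.0 if v1 == v2 else 0.0

def sim(obj_i, obj_j):
    """Fraction of attributes on which the two objects agree."""
    assert len(obj_i) == len(obj_j)
    return sum(attr_sim(a, b) for a, b in zip(obj_i, obj_j)) / len(obj_i)

# Two mushroom descriptions agreeing on 2 of 3 attributes:
# sim(("spots", "planar", "big"), ("spots", "round", "big")) == 2/3
        </preformat>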
      </sec>
      <sec id="sec-2-2">
        <title>Retrieving a subset of similar objects</title>
        <p>Commonly, the k-NN algorithm retrieves the k objects most similar to a given one.
However, what we propose is to retrieve all the objects with similarity greater than or equal
to a given similarity threshold h. When a problem pi is given to the
classifier, h is the threshold under which objects are not considered similar enough
to pi and are rejected. This means that the number of retrieved objects may
be different for each input problem. The closer h is to 1, the more confident the
classification. For instance, let us suppose a knowledge base containing the
objects O1, O2, O3 and O4, and the problems p1 and p2 to be classified. Table 1
shows the similarity between p1 and p2 and the objects of the knowledge base.
If we take h = 0.80, when solving p1 the classifier retrieves α = 2 similar objects
(only O2 and O4 have similarity greater than or equal to h); however, when solving
p2 the classifier retrieves α = 3 similar objects (only O4 has similarity lower than
h).</p>
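        <p>Reusing the sim function sketched in Section 2.1, threshold-based retrieval can be written as follows (a sketch; the knowledge base is assumed to be a list of (object, class) pairs):</p>
        <preformat>
def retrieve(p, knowledge_base, h):
    """All knowledge-base entries whose similarity to problem p reaches
    the threshold h; the number retrieved varies per problem."""
    scored = [(sim(p, o), o, c) for o, c in knowledge_base]
    return [(s, o, c) for s, o, c in scored if s >= h]

# alpha is then len(retrieve(p, kb, h)), and the maximum similarity is
# max(s for s, _, _ in retrieve(p, kb, h)) when the result is non-empty.
        </preformat>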
        <p>Let P be the set of objects in the knowledge base whose similarity to pi
is greater than or equal to h. There are four possible situations concerning the
elements of P:
1. For every object oi ∈ P the solution class is the same. If the class is the correct
one, then there is a situation of no collaboration; otherwise the algorithm has
retrieved similar objects but the classification proposed is not the correct
one. These cases are especially useful since they belong to a region where
the classes are similar. Therefore, when the problem to be solved belongs to
one of these regions the classifier has to prefer to collaborate with other classifiers,
since its own classification may be incorrect.
2. The majority of the objects in P belong to the same class. In this situation
the classifier has to collaborate when the majority class is not the correct
one. Both sim and α give an idea of how strong this classification is.
3. There is a tie between two solution classes, i.e., there are two classes with the
same number of elements in P. This is a situation of collaboration because
the classifier does not have enough information to classify pi.
4. The algorithm does not retrieve any object, meaning that there are no
objects in the knowledge base similar enough to the new problem to be
solved. In this situation the classifier also prefers to collaborate with others
rather than give its own classification based on objects that are not similar enough.</p>
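        <p>The four situations collapse into a single labelling rule, sketched below (the function name is ours; retrieved holds (similarity, class) pairs for the elements of P):</p>
        <preformat>
from collections import Counter

def choose_action(retrieved, correct_class):
    """Label a situation as collaboration / no-collaboration following
    the four cases above."""
    if not retrieved:                  # case 4: nothing similar enough
        return "collaboration"
    counts = Counter(c for _, c in retrieved).most_common()
    if len(counts) > 1 and counts[0][1] == counts[1][1]:
        return "collaboration"         # case 3: tie between two classes
    majority = counts[0][0]
    # cases 1 and 2: keep the majority class only when it is correct
    return "no-collaboration" if majority == correct_class else "collaboration"
        </preformat>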
      </sec>
    </sec>
    <sec id="sec-3">
      <title>Learning Preferences for Collaboration</title>
      <p>From the procedure described in the previous section, the classifier acquires a set
of tuples describing situations of collaboration/no collaboration. We assume that
only when a problem has been solved correctly does it give a situation of no
collaboration; otherwise the classifier should collaborate. The tuples can be used to
construct a decision tree with the goal of inducing a general model of collaboration.</p>
      <p>As we have already explained in Section 2, the tuples have the form
⟨Oi.a1, ..., Oi.an, Cj, α, sim, action⟩.
Notice that, because we assumed that all the objects are described using the
same set of attributes A, all the tuples have the same length. For convenience,
we suppose that there are no attributes with unknown values. However, when
an object has an unknown value in an attribute (say Oi.aj) we can take the option
that the corresponding position of the tuple (i.e., position j) holds the
value unknown. In other words, unknown plays the same role as any other
value. Let us analyze these elements in more detail. The first n elements of the
tuple, Oi.a1, ..., Oi.an, are the values that the object Oi takes in each one of its
attributes.</p>
      <p>The element Cj of the tuple is the class to which the majority of the
elements of P belong. Therefore the tuple contains the class to which Oi is assigned
by the k-NN algorithm with the majority rule. When there is a tie between
two classes we consider that Cj = ∅.</p>
      <p>The element α of the tuple is the cardinality of P, i.e., the number of elements
of the knowledge base that have a similarity greater than or equal to the given
threshold h. The number α is related to the threshold h and gives information
about the knowledge base. For instance, if h has to be low in order to obtain
α ≠ 0, this means that the object Oi is not very similar to any of the elements
in the knowledge base.</p>
      <p>The element sim is the maximum similarity between Oi and the objects of the
knowledge base. Although we give a threshold, it is possible that the object Oi
has a very high similarity with some of the objects in the knowledge base. We
want to take this fact into account, especially when the classification has been
incorrect. Notice that these cases (high similarity and incorrect classification)
mean that the knowledge base does not have enough objects to clearly distinguish the
classes involved. In the example shown in Table 1, taking h = 0.85, the maximum
similarity of the objects retrieved when solving p1 is 0.90 and when solving p2 it is
0.91.</p>
      <p>The element action plays the role of the class label. It can take two values:
collaboration or no-collaboration. As we have already mentioned, the classifier will
prefer to collaborate when the classification reached for an object is either
incorrect or a tie. It also prefers to collaborate when there are no objects in the
knowledge base similar enough to the problem Oi.</p>
      <p>From the set T of tuples obtained with the procedure described above, we
propose to construct a decision tree to induce rules describing preferences of
collaboration/no collaboration. In the next section we describe the
process of constructing a decision tree in some detail.</p>
      <sec id="sec-3-1">
        <title>Construction of Decision Trees</title>
        <p>A Decision Tree (DT) is a kind of directed acyclic graph in the form of a tree.
The root of the tree has no incoming edges and the remaining nodes have exactly
one incoming edge. Nodes without outgoing edges are called leaf nodes and the
remaining ones are internal nodes. A DT is commonly used to create a domain
model predictive enough to classify future unseen domain objects.</p>
        <p>
          The construction of a decision tree is performed by splitting the source set of
examples. This process is repeated on each derived subset in a recursive manner
called recursive partitioning. Figure 1 shows the algorithm (for more details see
[
          <xref ref-type="bibr" rid="ref14 ref15">14, 15</xref>
          ]) commonly used to construct decision trees. It is assumed that domain
objects are represented by means of a set of attribute-value pairs. For instance,
a mushroom can be described using the attributes texture, form, and size, and each
one of these attributes can take a value. Therefore, the following is a description
of a particular mushroom using a set of attribute-value pairs:
((texture = spots)(form = planar)(size = big))
        </p>
        <p>ID3(examples, attributes)
  create a node
  if all examples belong to the same class,
    return the class as the label for the node
  otherwise
    A ← the best attribute in attributes
    for each possible value vi of A
      add a new tree branch below the node
      examples_vi ← the subset of examples such that A = vi
      ID3(examples_vi, attributes − {A})
    return the node</p>
        <p>The values of the attributes may be continuous or categorical. The
description of the mushroom above is categorical (i.e., the values of the attributes are
labels). Examples of continuous-valued attributes are the height and the weight of
a person. Each tree node represents an attribute ai selected by some criterion and
each arc is followed according to the value of ai. For instance, Fig. 2 shows an
example classifying mushrooms as eatable or poisonous. The attributes describing a
mushroom are texture, form, and size. The most relevant attribute for classifying
a mushroom is texture, since if it is smooth the mushroom can be classified as
eatable. Otherwise the node has to be expanded. The next relevant attribute is form,
with two possible values: planar, corresponding only to poisonous mushrooms,
and round, which is a characteristic shared by both classes of mushrooms. Finally,
the attribute size allows a perfect classification of all the known mushrooms.</p>
        <p>Each node of the tree has an associated set of examples, namely those satisfying
the path from the root to that node. For instance, the node size of the tree shown
in Fig. 2 has associated all the examples having texture = spots and form = round.</p>
        <p>From a decision tree we can extract rules giving descriptions of classes. For
instance, some eatable mushrooms are described by means of the rule:
if texture=spots and form=round and size=small then eatable.</p>
        <p>A key issue in the construction of decision trees is the selection of the most
relevant attribute for splitting a node. This selection is made by means of a distance
measure. Each measure uses a different criterion, therefore the selected attribute
may differ depending on it, and thus the whole tree may also differ.
The most common measures are based on the degree of impurity of a node. That
is to say, they compute the proportion of examples of each class contained in
a node. The goal is to obtain nodes (the leaves of the tree) having examples of
only one class, that is to say, with impurity zero. Intermediate nodes become
purer the closer they are to the leaves, meaning that they are better able to
differentiate the classes.</p>
        <p>Impurity measures compare the impurity of a node, say t, with the impurity
of the children nodes t1, ..., tk generated by an attribute ai. This comparison is
done for each one of the attributes used to represent the domain objects.</p>
        <p>[Fig. 2: decision tree for mushrooms, equivalent to the rules: texture = smooth : eatable; texture = spots, form = planar : poisonous; texture = spots, form = round, size = small : eatable; texture = spots, form = round, size = big : poisonous.]</p>
        <p>The general expression to calculate the gain associated to an attribute ai is the
following:
$$\Delta(a_i) = I(t) - \sum_{j=1}^{k} \frac{N(t_j)}{N} I(t_j)$$
where I(·) is an impurity measure, N is the total number of examples associated
to the parent node t, k is the number of different values taken by ai, and N(tj)
is the number of examples associated with the child node tj.</p>
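        <p>Using entropy as the impurity measure I(t) (one common choice; the expression accepts any impurity measure), the gain can be computed as in the following sketch:</p>
        <preformat>
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy of a non-empty list of class labels: I(t)."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def gain(parent_labels, children_labels):
    """Delta(a_i) = I(t) - sum_j (N(t_j)/N) * I(t_j), where the children
    are the example subsets induced by the values of attribute a_i."""
    n = len(parent_labels)
    return entropy(parent_labels) - sum(
        (len(child) / n) * entropy(child) for child in children_labels)
        </preformat>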
      </sec>
      <sec id="sec-3-2">
        <title>Example</title>
        <p>
          We have performed some preliminary experiments using the procedure described
in the previous sections on the Soybean data set from the UCI Machine Learning
Repository [
          <xref ref-type="bibr" rid="ref2">2</xref>
          ]. The Soybean dataset contains around 300 domain objects
distributed over 18 solution classes and described by means of 35 categorical
attributes without unknown values.
        </p>
        <p>The first step of our approach is to generate a model of the classifier's
capabilities using the leave-one-out method. Leave-one-out is an evaluation technique,
commonly used in machine learning, that applies the following procedure to each
object ⟨Oi, Ci⟩ in the knowledge base:</p>
        <sec id="sec-3-2-1">
          <title>1. Take only the description Oi.</title>
          <p>2. Use the classi er to achieve a classi cation for Oi using the remaining objects
of the knowledge base.
3. Let Cj be the classi cation proposed for Oi. If Cj = Ci then the classi cation
for the object Oi is correct; otherwise the classi cation is incorrect.</p>
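        <p>Combined with the sketches of the previous sections (sim, retrieve, choose_action and make_tuple), the whole model-building step can be written as follows:</p>
        <preformat>
from collections import Counter

def leave_one_out_tuples(knowledge_base, h):
    """Classify every object against the rest of the knowledge base and
    record one training tuple per object; a sketch reusing the helper
    functions defined earlier."""
    tuples = []
    for obj, label in knowledge_base:
        rest = [(o, c) for o, c in knowledge_base if o is not obj]
        retrieved = retrieve(obj, rest, h)
        alpha = len(retrieved)
        max_sim = max((s for s, _, _ in retrieved), default=0.0)
        counts = Counter(c for _, _, c in retrieved).most_common()
        tie = len(counts) > 1 and counts[0][1] == counts[1][1]
        predicted = None if (not counts or tie) else counts[0][0]
        tuples.append(make_tuple(obj, label, predicted, alpha, max_sim))
    return tuples
        </preformat>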
          <p>In our approach, when an object Oi is classified correctly it is labelled as
belonging to the class no-collaboration; otherwise it is labelled as collaboration.
For instance, a tuple generated in this process is the following:
⟨April, LT-normal, GT-normal, no-hail, . . . , normal, 11, 0.88, collaboration⟩ (1)
where the first part is composed of 35 values corresponding to the 35 attributes
of the description of the domain objects, and the three last values indicate that:
1) the classifier has based its classification on 11 objects of the knowledge base, 2)
the maximum similarity between the new object and the most similar retrieved
object is 0.88, and 3) the classification has been incorrect, therefore the classifier
has labelled the object as collaboration.</p>
          <p>[Fig. 3: decision tree of collaboration preferences. Max-sim &lt;= 0.828 leads to collaboration; for Max-sim &gt; 0.828 the tree branches on Stem (normal/abnormal), Canker_lesion (tan/DNA-lesion) and Temp (LT-normal/GT-normal/normal) before deciding between collaboration and no-collaboration.]</p>
          <p>
            Since the knowledge base has 300 objects, at the end of the leave-one-out process
we have obtained 300 tuples like the one shown in (1). These 300 tuples
have been given as input to construct a decision tree. We used the J48 algorithm [
            <xref ref-type="bibr" rid="ref3">3</xref>
            ]
implemented in Weka [
            <xref ref-type="bibr" rid="ref6">6</xref>
            ] to generate the decision tree. The J48 algorithm is, in
fact, the ID3 algorithm proposed by Quinlan [
            <xref ref-type="bibr" rid="ref15">15</xref>
            ] evolved to be able to deal with
both categorical and continuous attributes without previous discretization.
          </p>
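          <p>A rough equivalent of this step outside Weka, sketched here with scikit-learn (which offers CART rather than J48/C4.5, so the exact tree may differ): the symbolic fields are one-hot encoded by hand while α and sim stay numeric, so the tree can still learn thresholds such as Max-sim &lt;= 0.828.</p>
          <preformat>
from sklearn.tree import DecisionTreeClassifier, export_text

def preference_tree(tuples, n_symbolic):
    """Fit a decision tree over the collaboration tuples; the first
    n_symbolic fields (the attribute values and Cj) are categorical,
    the remaining fields before the action label are numeric."""
    vocab = [sorted({t[i] for t in tuples}, key=str)
             for i in range(n_symbolic)]
    def encode(t):
        row = [1.0 if t[i] == v else 0.0
               for i in range(n_symbolic) for v in vocab[i]]
        return row + [float(x) for x in t[n_symbolic:-1]]  # alpha, sim
    X = [encode(t) for t in tuples]
    y = [t[-1] for t in tuples]
    tree = DecisionTreeClassifier().fit(X, y)
    print(export_text(tree))  # textual form of the induced preference rules
    return tree
          </preformat>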
          <p>We have experimented with different similarity thresholds. The decision tree
shown in Fig. 3 is the one corresponding to the similarity threshold h =
0.80. This tree shows that the classifier prefers to collaborate when: 1) the similarity of the most
similar object is under 0.828 (in fact between 0.80 and 0.828, since h is the lower
threshold given as input) or, 2) when the similarity is higher than 0.828 and the
object to be classified has Stem = normal, Canker_lesion = DNA-lesion and either
Temp = LT-normal or Temp = normal. That is, in addition to the similarity, the
description of the object is also taken into account to decide when to collaborate.</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>Future Work</title>
      <p>
        The work introduced in the current paper opens several interesting lines of future
research. First of all, we plan to integrate a classifier like the one described in this
paper into a system formed by n other classifiers. Each classifier forming the
system is capable of completely solving a problem on a given domain and uses
objects described by means of a common representation (i.e., with the same
set of attributes). Moreover, each classifier has a model of its own capability in
solving a problem. The general idea is that when one of the classifiers, say Clk,
has to solve a problem p, the first step is to use the tree of preferences in order to
detect whether Clk is capable of solving p using its own knowledge. If the model labels p
as collaboration, Clk will ask all the other classifiers for collaboration. The easy case
is to assume that all the classifiers propose a class and that the classification
for p is the class proposed by the majority of the classifiers. This approach
would be similar to the one used in ensemble learning [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ]. More complicated
cases could occur when only a few (or none) of the remaining classifiers are able to
propose a classification for p. These cases should be analyzed in detail. We also
plan to evaluate a system like the one described above in depth in order to check its
accuracy. In particular, we expect that solving problems in collaboration between
classifiers produces higher accuracy than relying on a single one.
      </p>
      <p>Another interesting issue is that each classifier could have a model of the capabilities
of each of the other classifiers of the system. This model could be constructed in the
same way described in this paper and its utility would be twofold: 1) a classifier
would know in advance which classifier will most probably give a correct
classification and, consequently, 2) it would not be necessary to ask all the classifiers
in the system. The second issue is especially interesting when the system is
formed by a high number of classifiers, since such filtering reduces the
communication load between the classifiers.</p>
      <p>A third line of research is that each classifier gives, in addition to the
classification for p, its confidence in that classification. Such confidence could be
obtained from the parameters α and sim of the tuple. High values of both α
and sim mean that p has been classified taking into account many known
examples, all of them with high similarity to p, therefore the classification is highly
confident. In such a situation, the final classification for p could be obtained by
means of a weighted aggregation of the classifications proposed by the system's
classifiers. This same confidence could also be used by a classifier to assess its
own capability in classifying p.</p>
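      <p>A minimal sketch of such a weighted aggregation (the confidence values and their derivation from α and sim are assumptions of ours, not something fixed by the paper):</p>
      <preformat>
from collections import defaultdict

def weighted_vote(proposals):
    """Aggregate (class, confidence) proposals from several classifiers
    and return the class with the largest total confidence."""
    totals = defaultdict(float)
    for cls, confidence in proposals:
        totals[cls] += confidence
    return max(totals, key=totals.get)

# weighted_vote([("C1", 0.9), ("C2", 0.4), ("C1", 0.3)])  ->  "C1"
      </preformat>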
    </sec>
    <sec id="sec-5">
      <title>Acknowledgements</title>
      <p>The author thanks Àngel García-Cerdaña, Pilar Dellunde and the anonymous
reviewers for their helpful comments and suggestions. The author also acknowledges
support from the Spanish MICINN projects EdeTRI (TIN2012-39348-C02-01) and
COGNITIO (TIN2012-38450-C03-03) and the grant 2014SGR-118 from the
Generalitat de Catalunya.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <given-names>E.</given-names>
            <surname>Armengol</surname>
          </string-name>
          and
          <string-name>
            <given-names>E.</given-names>
            <surname>Puertas</surname>
          </string-name>
          .
          <article-title>Learning from cooperation using justi cations</article-title>
          . In M. Polit,
          <string-name>
            <given-names>T.</given-names>
            <surname>Talbert</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Lopez</surname>
          </string-name>
          , and J. Melendez, editors,
          <source>Arti cial Intelligence Research and Development</source>
          , pages
          <volume>47</volume>
          {
          <fpage>54</fpage>
          . IOS Press,
          <year>2006</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <given-names>A.</given-names>
            <surname>Asuncion</surname>
          </string-name>
          and
          <string-name>
            <surname>D. Newman.</surname>
          </string-name>
          <article-title>UCI machine learning repository</article-title>
          ,
          <year>2007</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <given-names>N.</given-names>
            <surname>Bhargava</surname>
          </string-name>
          , G. Sharma,
          <string-name>
            <given-names>R.</given-names>
            <surname>Bhargava</surname>
          </string-name>
          , and
          <string-name>
            <given-names>M.</given-names>
            <surname>Mathuria</surname>
          </string-name>
          .
          <article-title>Decision tree analysis on J48 algorithm for data mining</article-title>
          .
          <source>International Journal of Advanced Research in Computer Science and Software Engineering</source>
          ,
          <volume>3</volume>
          (
          <issue>6</issue>
          ):
          <volume>1114</volume>
          {
          <fpage>1119</fpage>
          ,
          <year>June 2013</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <given-names>B. V.</given-names>
            <surname>Dasarathy</surname>
          </string-name>
          .
          <article-title>Handbook of Data Mining and Knowledge Discovery, chapter Data Mining Tasks and Methods: Classi cation: Nearest-neighbor Approaches</article-title>
          , pages
          <volume>288</volume>
          {
          <fpage>298</fpage>
          . Oxford University Press, Inc., New York, NY, USA,
          <year>2002</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <given-names>D.</given-names>
            <surname>Dasarathy</surname>
          </string-name>
          .
          <article-title>Nearest Neighbor Norms: NN Pattern Classi cation Techniques</article-title>
          . IEEE Press,
          <year>1991</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <given-names>M.</given-names>
            <surname>Hall</surname>
          </string-name>
          , E. Frank,
          <string-name>
            <given-names>G.</given-names>
            <surname>Holmes</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Pfahringer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Reutemann</surname>
          </string-name>
          ,
          <string-name>
            <given-names>and I. H.</given-names>
            <surname>Witten</surname>
          </string-name>
          .
          <article-title>The Weka data mining software: An update</article-title>
          .
          <source>SIGKDD Explorations Newsletter</source>
          ,
          <volume>11</volume>
          (
          <issue>1</issue>
          ):
          <volume>10</volume>
          {
          <fpage>18</fpage>
          ,
          <year>2009</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <given-names>T. W.</given-names>
            <surname>Liao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Zhang</surname>
          </string-name>
          , and
          <string-name>
            <given-names>C.</given-names>
            <surname>Mount</surname>
          </string-name>
          .
          <article-title>Similarity measures for retrieval in case-based reasoning systems</article-title>
          .
          <source>Applied Arti cial Intelligence</source>
          ,
          <volume>12</volume>
          (
          <issue>4</issue>
          ):
          <volume>267</volume>
          {
          <fpage>288</fpage>
          ,
          <year>1998</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8. S. Ontan~on and
          <string-name>
            <given-names>E.</given-names>
            <surname>Plaza</surname>
          </string-name>
          .
          <article-title>Learning when to collaborate among learning agents</article-title>
          . In L. D. Raedt and P. A. Flach, editors,
          <source>ECML</source>
          , volume
          <volume>2167</volume>
          of Lecture Notes in Computer Science, pages
          <volume>394</volume>
          {
          <fpage>405</fpage>
          . Springer,
          <year>2001</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
          <string-name>
            <given-names>M. P.</given-names>
            <surname>Perrone</surname>
          </string-name>
          and
          <string-name>
            <given-names>L. N.</given-names>
            <surname>Cooper</surname>
          </string-name>
          .
          <article-title>When networks disagree: Ensemble methods for hybrid neural networks</article-title>
          . In R. J. Mammone, editor,
          <source>Neural Networks for Speech and Image Processing</source>
          , pages
          <volume>126</volume>
          {
          <fpage>142</fpage>
          .
          <string-name>
            <surname>Chapman-Hall</surname>
          </string-name>
          ,
          <year>1993</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          10. E. Plaza and
          <string-name>
            <surname>S.</surname>
          </string-name>
          <article-title>Ontan~on. Ensemble case-based reasoning: Colaboration policies for multiagent cooperative CBR. In I. Watson and Q</article-title>
          . Yang, editors,
          <source>CBR Research and Development: ICCBR-2001</source>
          , volume
          <year>2080</year>
          , pages
          <fpage>437</fpage>
          {
          <fpage>451</fpage>
          ,
          <year>2001</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          11.
          <string-name>
            <given-names>A.</given-names>
            <surname>Prodromidis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Chan</surname>
          </string-name>
          , and
          <string-name>
            <given-names>S.</given-names>
            <surname>Stolfo</surname>
          </string-name>
          .
          <article-title>Meta-learning in distributed data mining systems: Issues and approaches</article-title>
          . In H. Kargupta and P. Chan, editors,
          <source>Book on Advances of Distributed Data Mining</source>
          . AAAI press,
          <year>2000</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          12.
          <string-name>
            <given-names>F. J.</given-names>
            <surname>Provost</surname>
          </string-name>
          and
          <string-name>
            <given-names>D. N.</given-names>
            <surname>Hennessy</surname>
          </string-name>
          .
          <article-title>Scaling up: Distributed machine learning with cooperation</article-title>
          .
          <source>In Proceedings of the 13th AAAI/IAAI</source>
          , Volume
          <volume>1</volume>
          , pages
          <fpage>74</fpage>
          {
          <fpage>79</fpage>
          ,
          <year>1996</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          13.
          <string-name>
            <given-names>E.</given-names>
            <surname>Puertas</surname>
          </string-name>
          and
          <string-name>
            <given-names>E.</given-names>
            <surname>Armengol</surname>
          </string-name>
          .
          <article-title>Inducing domain theory from problem solving in a multi-agent system</article-title>
          . In J. Vitria,
          <string-name>
            <given-names>P.</given-names>
            <surname>Radeva</surname>
          </string-name>
          , and I. Aguilo, editors,
          <source>Recent Advances in Arti cial Intelligence Research and Development</source>
          , pages
          <volume>325</volume>
          {
          <fpage>332</fpage>
          . IOS Press,
          <year>2004</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          14.
          <string-name>
            <surname>J. Quinlan</surname>
          </string-name>
          .
          <article-title>Discovering rules by induction from large collection of examples</article-title>
          .
          <source>In Expert Systems in the Microelectronic Age. D. Michie (Ed.)</source>
          , pages
          <fpage>168</fpage>
          {
          <fpage>201</fpage>
          . Edimburg University Press,
          <year>1979</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          15.
          <string-name>
            <given-names>J. R.</given-names>
            <surname>Quinlan</surname>
          </string-name>
          .
          <article-title>Induction of decision trees</article-title>
          .
          <source>Machine Learning</source>
          ,
          <volume>1</volume>
          (
          <issue>1</issue>
          ):
          <volume>81</volume>
          {
          <fpage>106</fpage>
          ,
          <year>1986</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>