Decision Aiding Software Using FCA Florent Domenach and Ali Tayari Computer Science Department, University of Nicosia, 46 Makedonitissas Av., P.O.Box 24005, 1700 Nicosia, Cyprus, domenach.f@unic.ac.cy Abstract. The consensus problem arises from social choice theory and systematic biology where we are looking for the common information shared by a series of trees. In this paper we present a decision aiding software to help systematic biologist to choose the consensus function the most appropriate for their need. This software is based on a previous study between consensus functions and axiomatic properties, and their underlined concept lattice. 1 Introduction The consensus problem, which [11] deemed a ”problem for the future”, consists of summarizing a series of structures, usually trees, into one representative struc- ture. Axiomatic studies of consensus functions is often [26] described as an ”ideal situation [in which] the researcher formulates a list of desirable axioms that a consensus function should satisfy, and search for the best method that satisfies these axioms” [33]. We present here a software following this approach almost to the letter. Unfortunatly, it is still missing critical GUI features and is not available yet. The motivation for the software is originating from the separation existing between theorizers and practitioners of consensus theory, what [7] denotes as abstract consensus theory and concrete consensus theory. On one hand, math- ematicians are developing sophisticated mathematical tools. The modern devel- opment of the consensus problem originates from Arrow’s work [3] (followed by [25]) who considered the problem of aggregating votes and showed that any voting system is either inconsistent, arbitrary or unstable. Since then, a lot of functions, together with a set of equivalent axioms, were developed (see [15, 22] for a comprehensive survey). On the other hand, practitioners like systematic biologists are rarely using more than a handful of consensus functions. If you consider the most popular software available like PAUP∗1 [37] (majority), PHYLIP2 [20] (majority, strict), or COMPONENT 2.03 [30] (strict, majority-rule, loose, Nelson and Adams con- sensus trees), only a handful are available for use. It was pointed out [38] that this gap between the two communities was detrimental to both. 1 http://paup.csit.fsu.edu/ 2 http://evolution.genetics.washington.edu/phylip.html 3 http://taxonomy.zoology.gla.ac.uk/rod/cpw.html 70 F. Domenach et al. The goal of this paper is to present an approach – based on FCA – to the con- sensus problem that would fill the gap between both communities. We created a software that asks the user to think of desirable properties that a consensus method should possess, and then we advise on which consensus function satis- fying these properties he/she should use. Each step is described in detail in the paper. This paper is organized in four sections, the first one being this introduction. In Section 2, we give a precise definition of the consensus problem, as well as the definitions of the consensus functions (Section 2.1) and of the axiomatic proper- ties (Section 2.2) that we implemented. We present in Section 3 the structure of our program, and explain for every step why and how we are doing it. Finally a brief conclusion is given in Section 4. 2 The Consensus Problem Consider a finite set S, |S| = n. In phylogeny, the elements of S are called operational taxonomic units, or taxa. A hierarchy H on S, also called n-tree, is a family of subsets of S (called the classes or clusters of H) such that S ∈ H, ∅ 6∈ H, {s} ∈ H for all s ∈ S, and A ∩ B ∈ {∅, A, B} for all A, B ∈ H. We will indifferently use the terms trees or hierarchies in the paper. We denote the set of all hierarchies on S by H. Fig. 1 shows the graphical representation of different trees; usually the internal nodes are simply denoted by the leaves underneath. Consensus trees are summarizations of the information shared by two or more classification trees of the same set of taxa. Given a profile H ∗ of trees on S, i.e. a series of trees, we want to know what they have in common - we want to aggregate H ∗ in a unique tree H. We consider in this paper the case where all the trees of the profile are defined on the same set of taxa, as the generalization to super-trees [34] (where the trees can have different sets of taxa) can create computational problems. 2.1 Consensus Functions Let H ∗ = (H1 , H2 , ..., Hk ) be a profile of hierarchies on S, and K will denote the set of indices of the hierarchies of H ∗ , K = {1, ..., k}. Formally, a consensus function on H is a map c : Hk → H with k ≥ 2 and Hk the k cartesian product, which associate to any profile H ∗ a unique hierarchy consensus, c(H ∗ ). We do not aim to have an exhaustive list of consensus functions, a classification based on refinement is available in [13]. Consensus functions can be divided in three main categories: Quota-based consensus functions. Consider a grouping and the associated index defined as: NH ∗ (A) = {i ∈ K : A ∈ Hi } and nH ∗ (A) = |NH ∗ (A)| Decision Aiding Software Using FCA 71 a c b d e a c b d e a c b d e (H1 ) (H2 ) (H3 ) a c b d e a c b d e a c b d e (HStrict ) (HMaj ) (HLoose ) Fig. 1. Different trees defined on the set of taxa S = {a, b, c, d, e}. For the profile H ∗ = (H1 , H2 , H3 ), the strict consensus tree is given by (HStrict ), the majority by (HM aj ) and the loose by (HLoose ) We associate the consensus function c(p) : Hk → H to the index nH ∗ for any p ∈ K. A subset A is called p-frequent if nH ∗ (A) ≥ p, and the p-frequent consensus of H ∗ , denoted as c(p) (H ∗ ), is the family of all p-frequent subsets. Quota-based consensus functions are particular cases of federation consensus functions [23]. Recall that a federation (simple game) is a family F of subsets of K such that A ∈ F, B ⊇ A imply B ∈ F. A federation consensus function cF is then defined as cF (H ∗ ) = ∨S∈F (∩i∈S Hi ). If we take the simple case where, for some j ∈ K, F = {S ⊆ K : j ∈ S}, we have cF (H ∗ ) = Hj , a single hierarchy dictating the result of the consensus, the so called projection consensus function. Projection: ∃j ∈ K : P rj(H ∗ ) = Hj When we extend this to a subset J of K, we haveTthe oligarchic consensus function using F = {S ⊆ K : J ⊆ S}, and cF (H ∗ ) = j∈J Hj . \ Oligarchy: ∃J ⊆ K : Ol(H ∗ ) = Hj j∈J In the family of quota-based consensus functions, one can notice c(k) (H ∗ ) = T i∈K Hi the set of classes present in all trees of the profile, i.e. the strict con- sensus function [36]. In Fig. 1, (HStrict ) is the strict consensus of the profile (H1 , H2 , H3 ). \ Strict: Str(H ∗ ) = Hi i∈K If we take p = d k+1 k 2 e, the smallest natural number greater than 2 , we have the majority consensus function [24] which considers clusters appearing in at least 72 F. Domenach et al. half of the trees. An example of the majority consensus function is given in Fig. 1, where (HM aj ) = M aj(H1 , H2 , H3 ). k Majority: M aj(H ∗ ) = {X ⊆ S : nH ∗ (X) > } 2 Unfortunately, if p is less than d k+1 2 e, it cannot be guaranteed that the resulting family will be a tree. In order to keep the structure of a tree, different strategies can be used. Frequency-based consensus functions A first approach considers the idea of com- patibility, i.e. two sets A and B are compatible if A ∩ B ∈ {∅, A, B}, denoted as AkB, and a set A is compatible with a hierarchy H if it is compatible with every cluster of H (or, equivalently, if A ∪ H ∈ H). We then can define a consensus function called loose consensus [6] (originally called combinable component [12], also called semi-strict) which considers subsets as long as they are compatible with every tree of the profile. Fig. 1 shows (HLoose ), the loose consensus tree obtained from (H1 , H2 , H3 ). [ Loose: L(H ∗ ) = {X ⊆ S : ∃j ∈ K, X ∈ Hj and ∀i ∈ K, X ∪ Hi ∈ H} The loose consensus function was extended by [18] to two different consensus functions. The first one is combining the classes obtained by the majority con- sensus function with those of the loose consensus function: Loose and Majority Function Property: LM (H ∗ ) = M aj(H ∗ ) ∪ L(H ∗ ) The second extension is to add classes that are more often compatible than not. Define NH ∗ (X) = {i ∈ K : X ∪ Hi 6∈ H} as the set of trees not compatible with a subset X, then the majority (+) consensus function will take subsets that are more often compatible than incompatible. It obviously contains all the classes obtained by the majority function and by the loose function. Majority-rule (+) : M aj + (H ∗ ) = {X ⊆ S : |NH ∗ (X)| > |NH ∗ (X)|} Consider the weight function w(X) = nH ∗ (X) − 1 on classes. The Nelson-Page consensus tree is the tree constructed from the clique G containing the com- ponents most frequently replicated in the profile. If two or more cliques have the same, maximal number of replications of components, then the consensus tree is constructed from those components common to all those cliques. In the literature, the Nelson-Page tree [27, 29] has often been confused with the strict consensus tree. The frequency difference consensus function consider the subsets of S that are more frequent than any other subsets non-compatible. Freq. Diff.: F D(H ∗ ) = {X : nH ∗ (X) > maxY not compatible with X {nH ∗ (Y )}} Previous consensus functions may miss some structural features of the trees, particularly if the data is noisy. For example, a desirable feature would be that Decision Aiding Software Using FCA 73 if two taxa are closer than a third one, we want these two taxa to be separated from the third one in the consensus hierarchy - which is what Adams’ function [1] achieves. Historically the first one, an Adams consensus tree contains the nestings common to all trees in a profile. X nests in Y in H, denoted as X L : we remove the top and the bot- tom concepts of the lattice to compute the topological distance because such concepts don’t bear any information (even if 1L can have attributes associ- ated, it still doesn’t have any meaning). It is particularly important when we consider the co-atoms of the lattice (such as Pareto Optimal or co-Pareto Optimal, see Fig. 3), as the shortest path could go through 1L and short- circuit the ”real” distance. – Meet Distance: It is the topological distance between C and Ci passing through their meet, i.e. the distance from C to C ∧ Ci plus the distance from Ci to C ∧ Ci . – Join Distance: Dual to the meet distance, it is the topological distance be- tween C and Ci passing through their join. Since each previous distance has its own advantages and disadvantages, we also implemented a weighted average distance for which the user can freely assign the weights. It is a weighted average of all the above distances based on user’s preference. Fig. 4 shows an example of user’s choice and the advised consensus function. Decision Aiding Software Using FCA 79 Fig. 4. Screen shot of the second layer of the software, with an example of user’s choices. 3.4 Decision Aiding In the third layer, the D.A. Software recognizes input trees which are given in Newick format. The Newick tree format is a well-known representation of graph-theoretical trees which denotes trees using parentheses and commas. The simplicity and standard nature of Newick makes it a suitable method for scien- tists to provide the software with their input. There are several ways through which trees can be represented, however the representation that contains only the information about the leaves are recognized as the valid ones. For example, in Fig. 1, (H1 ) has the Newick format (((a, c), b), d, e), while (H2 ) is ((a, c), (b, d, e)). Upon selection of consensus functions by the user, the D.A. software generates the unique (or set of all possible consensus trees) for the selected functions, so that the user can compare them with each other. Using this feature, the user is able to find out which model would be more suitable for the nature of their work, for which he/she will be provided with respective consensus tree(s). This allows the user to have a narrowed list of candidates for the representative consensus trees as well as having a hands-on experience to find out the most suitable functional property and consensus tree. 4 Conclusion and Future Work In this paper, we presented a decision aiding software which explore via Formal Concept Analysis the space of consensus functions and their axioms. It provides the user with means to generate consensus tree(s) representative(s) depending on their choices. It initially imports the raw context obtained via pre-processing, constructs the associated lattice and, depending on the user’s preferences, advise 80 F. Domenach et al. based on distances in the lattice on which function to use. Upon selection of functions, the program generates the consensus trees of the collection of user’s input tree using selected functional properties. In continuation of this project, we are planning to expand the capabilities of this software. Firstly, besides the (rooted) trees that are currently supported as input and output, the program will be able to support super-trees as well as unrooted trees as its input and output. Another possibility would be the addition of other types of structures of sets such as pyramids [9], weak-trees [4], and, more generally, lattices. In addition, the concept of independence and neutrality as axiomatic properties are planned to be incorporated. Moreover, other commonly used consensus functions are going to be added to the result, therefore with a further refined and exhaustive approach, the program’s precision and usefulness would be improved. References [1] Adams III, E.N.: Consensus Techniques and the Comparison of Taxonomic Trees. Systematic Zoology 21 (1972) 390–397 [2] Adams III, E.N.: N-trees as nestings: Complexity, similarity, and consensus. J. Classif 3 (1986) 299-317 [3] Arrow, K.J.: Social Choice and Individual Values. Wiley, New York (1951) [4] Bandelt, H.-J., Dress, A.: Weak hierarchies associated with similarity measures: an additive clustering technique Bull. Math. Biology 51 (1989) 133-166 [5] Barbut, M., Monjardet, B.: Ordres et classification: Algèbre et combinatoire (tome II). Hachette, Paris (1970) [6] Barthélemy, J.-P., McMorris, F.R., Powers, R.C: Dictatorial consensus functions on n-trees. Math. Soc. Sci. 25 (1992) 59-64 [7] Barthélemy, J.-P., Brucker, F.: Average Consensus in Numerical Taxonomy. In Data analysis, Eds. W. Gaul, O. Opitz, and M. Schader, Springer (2000) 95-104 [8] Bertrand, P.: Set Systems and Dissimilarities. Europ. J. Combinatorics 21 (2000) 727-743 [9] Bertrand, P., Diday, E.: A visual representation of compatibility between an order and a dissimilarity index: the pyramids. Comput. Stat. Quart. 2 (1985) 31-44 [10] Birkhoff, G.: Lattice Theory, 3rd ed. Amer. Math. Soc., Providence (1967) [11] Bock, H.H.: Classification and clustering: Problems for the future. In Diday, E., Lechevallier, Y., Schader, M., Bertrand, P., Burtschy, B.: New approaches in clas- sification and data analysis. Springer-Verlag, Berlin (1994) 3–24 [12] Bremer, K.: Combinable component consensus. Cladistics 6 (1990) 369-372 [13] Bryant, D.: A Classification of Consensus Methods for Phylogenetics. In: Janowitz, M., Lapointe, F.J., McMorris, F., Mirkin, B., Roberts, F. (eds.) Bioconsensus, DIMACS (2003) 163-184 [14] Colonius, H., Schulze, H.-H.: Tree structure for proximity Data. Brit. J. of Math. Stat. Psych. 34 (1981) 167-180 [15] Day, W.H.E., McMorris, F.R.: Axiomatic Consensus Theory in Group Choice and Biomathematics. Siam, Philadelphia (2003) [16] Davey, B.A., Priestley, H. A.: Introduction to Lattices and Order, 2nd ed. Cam- bridge University Press (2002) Decision Aiding Software Using FCA 81 [17] Domenach, F., Tayari, A.: Implications of Axiomatic Consensus Properties. To appear (2012) [18] Dong, J., Fernández-Baca, D., McMorris, F.R., Powers. R.C.: An Axiomatic Study of Majority-rule (+) and associated Consensus Functions on Hierarchies. Disc. App. Math. 159 (2011) 2038-2044 [19] Felsenstein, J.: The Number of Evolutionary Trees. Syst. Zool. 27 (1978) 27-33 [20] Felsenstein, J.: PHYLIP - Phylogeny Inference Package (Version 3.2). Cladistics 5 (1989) 164-166 [21] Ganter, B., Wille, R.: Formal Concept Analysis : Mathematical Foundations. Springer (1996) [22] Hudry, O., Monjardet, B.: Consensus Theories. An Oriented Survey. Math. Sci. hum 190 (2010) 139-167 [23] Leclerc, B. Monjardet, B.: Latticial theory of consensus. In: Barnett, V., Moulin, H., Salles M., Schofield N. (eds.) Social choice, Welfare and Ethics. Cambridge University Press, Cambridge (1995) 145-159 [24] Margush, T., McMorris, F.R.: Consensus n-trees. Bull. Math. Biol. 43 (1981) 239-244 [25] May, K.O.: A Set of Independent Necessary and Sufficient Conditions for Simple Majority Decision. Econometrica 20 (1952) 680-684 [26] McMorris, F.R.: : Axioms for consensus functions defined on undirected phyloge- netic trees. Math. Biosciences 74 (1985) 77-80 [27] Nelson, G.: Cladistic analysis and synthesis: Principles and definitions, with a historical note on Adanson’s Famille des Plantes (1763-1764). Syst. Zool. 28 (1979) 1-21 [28] Neumann, D.A.: Faithful consensus methods for n-trees. Math. Biosci 63 (1983) 271-287 [29] Page, R.D.M.: Tracks and Trees in the Antipodes: A Reply to Humphries and Seberg. Syst. Zool. 39 (1990) 288-299 [30] Page, R.D.M.: User’s manual for COMPONENT, Version 2.0. Natural History Museum, London (1993) [31] Pareto, V.: Cours d’économie politique. F. Rouge, Lausanne (1896) [32] Phillips, C., Warnow, T.J.: The aymmetric median tree - A new model for building consensus trees. Disc. App. Math. 71 (1996) 311-335 [33] Powers, R.C., White, J.M.: Wilson’s theorem for consensus functions on hierar- chies. Disc. Appl. Math. 156 (2008) 1321–1329 [34] Semple, M., Steel, C.: A supertree method for rooted trees. Disc. App. Math. 105 (2000) 147-158 [35] Sholander M.: Medians, Lattices, and Trees. Proceedings of the American Math- ematical Society 5 (1954) 808-812 [36] Sokal, R.R., Rohlf, F.J., Taxonomic congruence in the Leptopodomorpha re- examined. Syst. Zool. 30 (1981) 309-325 [37] Swofford, D.L.: PAUP: Phylogenetic Analysis Using Parsimony, version 3.0. Illinois Natural History Survery, Champaign (1990) [38] Wilkinson, M., Thorley, J.L., Pisani, D.E., Lapointe, F.-J., McInerney, James O.: Some Desiderata for Liberal Supertrees. In: Phylogenetic Supertrees: Combining Information to Reveal the Tree of Life, Kluwer Academic Publishers (2004) 564-582 [39] Yevtushenko, S.A.: System of data analysis ”Concept Explorer”. Proceedings of the 7th national conference on Artificial Intelligence KII-2000, Russia (2000) 127- 134