Introduction

An Algorithm for the Multi-Relational Boolean Factor Analysis based on Essential Elements

Martin Trnecka

martin.trnecka@gmail.com 0

Marketa Trneckova

marketa.trneckova@gmail.com 0 0 Data Analysis and Modeling Lab (DAMOL) Department of Computer Science, Palacky University , Olomouc

The Multi-Relational Boolean factor analysis is a method from the family of matrix decomposition methods which enables us analyze binary multi-relational data, i.e. binary data which are composed from many binary data tables interconnected via relation. In this paper we present a new Boolean matrix factorization algorithm for this kind of data, which use the new knowledge from the theory of the Boolean factor analysis, so-called essential elements. We show on real dataset that utilizing essential elements in the algorithm leads to better results in terms of quality and the number of obtained multi-relational factors.

Introduction

The Boolean matrix factorization (or decomposition), also known as the Boolean factor analysis, has gained interest in the data mining community. Methods for decomposition of multi-relational data, i.e. complex data composed from many data tables interconnected via relations between objects or attributes of this data tables, were intensively studied, especially in the past few years. Multi-relational data is a more truthful and therefore often also more powerful representation of reality. An example of this kind of data can be an arbitrary relational database. In this paper we are focused on the subset of multi-relational data, more precisely on the multi-relational Boolean data. In this case data tables and relations between them contain only 0s and 1s.

It is important to say that many real-word data sets are more complex than one simple data table. Relations between this tables are crucial, because they carry additional information about the relationship between data and this information is important for understanding data as a whole. For this reason methods which can analyze multi-relational data usually takes into account relations between data tables unlike classical Boolean matrix factorization methods which can handle only one data table.

The Multi-Relational Boolean matrix factorization (MBMF) is used for many data mining purposes. The basic task is to nd new variables hidden in data, called multi-relational factors, which explain or describe the original input data. There exist several ways how to represent multi-relational factors. In this work we adopt settings from [ 7 ], where is the multi-relational factor represented as an ordered set of classic factors from data tables, always one factor from each data table. The fact, that classic factors are connected into multi-relational factor is matter of semantic of relation between data tables.

The main problem is how to connect classic factors into one multi-relational. The main aim of this work is to propose a new algorithm which utilize so-called essential elements from the theory of Boolean matrices. The essential elements provide information about factors which cover a particular part of data tables. This information can be used for a better connection of classic factors into one multi-relational factor.

Another thing is the number of obtained factors. In classical settings we want the number of obtained factors as small as a possible. In the literature can be found two main views on this requirement. In the rst case we want to obtain the particular number of factors. In the second case we want to obtain factors that explain prescribed portion of data. In both cases we want to obtain the most important factors. For more details see [ 1 ]. We emphasize this fact and we re ect it in designing of our algorithm. Both views can be transferred to multirelational case. The rst one is straightforward, the second one is a little bit problematic because multi-relational factors may not be able explain the whole data. This is correct, because multi-relational factors carry di erent information than classical factors. We discuss this issue later in the paper. 2

Preliminaries and basic notions

We assume familiarity with the basic notions of the Formal concept analysis [ 4 ], which provides a basic framework for dealing with factors and the Boolean matrix factorization (BMF) [ 2 ]. The main goal of classical BMF is to nd a decomposition C = A B, where C is input data table, A represent object-factor data table (or matrix) and B represent factor-attribute data table (or matrix). The product is the Boolean matrix product, de ned by (A

B)ij = Wk l=1 Ail Blj ; (1) where W denotes maximum (truth function of logical disjunction) and is the usual product (truth function of logical conjunction). Decomposition C into A B corresponds to discovery factors which explain the data. Factors in classical BMF can be seen as formal concepts [ 2 ], i.e. entity with the extent part and the intent part. This leads to clear interpretation of factors. Another bene t of using FCA as a basic framework is that matrices A and B can be constructed from the subset of all formal concepts. Let

F = fhA1; B1i ; : : : ; hAk; Bkig

B(X; Y; C); where B(X; Y; C) represents a set of all formal concepts of data table, which can be seen as a formal context hX; Y; Ci, where X is a set of objects, Y is a set of attributes and C is a binary relation between X and Y . Matrices A and B are constructed in the following way: (A)il = 1 if i 2 Al (B)lj = 0 if i 2= Al for l = 1; : : : ; k. In other words, A is composed from characteristic vectors Al. Similarly for B.

In a multi-relation environment we have a set of input data tables C1, C2; : : : Cn and a set of relations Rij , where i; j 2 f1; : : : ; ng, between Ci and Cj . The multi-relation factor on data tables C1, C2; : : : Cn is an ordered n-tuple F1i1 ; F2i2 ; : : : Fnin , where Fjij 2 Fj , j 2 f1; : : : ; ng (Fj denotes a set of classic factors of data table Cj ) and satisfying relations RClCl+1 or RCl+1Cl for l 2 f1; : : : ; n 1g.

Example 1. Let us have two data tables C1 (Table 1) and C2 (Table 2). Moreover, we consider relation RC1C2 (Table 3) between objects of the rst data table and attributes of the second one.

Classic factors of data table C1 are for example: F1C1 = hf1, 4g; fb; c; dgi, F2C1 = hf2, 4g; fa; cgi, F3C1 = hf1, 3, 4g; fb; dgi and factors of the second table C2 are: F1C2 = hf6; 7g; ff; ggi, F2C2 = hf5g; fe; hgi, F3C2 = hf5; 7g; fegi, F4C2 = hf8g; fg; hgi. These factors can be connected with using a relation RC1C2 into multi-relational factors in several ways. In [ 7 ] were introduced three approaches how to manage this connections. We use the narrow approach from [ 7 ], which seems to be the most natural, and we obtain two multi-relational factors hF1C1 ; F1C2 i and hF3C1 ; F1C2 i. The idea of the narrow approach is very simple. We connect two factors FiC1 and FjC2 if the non-empty set of attributes (if such exist), which are common (in the relation RC1C2 ) to all objects from the rst factor FiC1 , is the subset of attributes of the second factor FjC2 .

The previous example also demonstrate the most problematic part of MBMF. Usually is problematic to connect all factors from each data table. The result of this is a small number of connections between them. This leads to problematic selection of quality multi-relational factors. The reason for a small number of connections between factors is that classic factors are selected without taking relation into account.

Another very important notion for our work are so-called essential elements presented in [ 1 ]. Essential elements in the Boolean data table are entries in this data table which are su cient for covering the whole data table by factors (concepts), i.e. if we take factors which cover all these entries, we automatically cover all entries of the input data table. Formally, essential elements in the data table hX; Y; Ci are de ned via minimal intervals in the concept lattice. The entry Cij is essential i interval bounded by formal concepts hi"#; i"i and hj#; j#"i is non-empty and minimal w.r.t. (if it is not contained in any other interval). We denote this interval by Iij . If the table entry Cij is essential, then interval Iij represents the set of all formal concepts (factors) which cover this entry. Very interesting property of essential elements, which is used in our algorithm, is that is su cient take only one arbitrary concept from each interval to create exact Boolean decomposition of hX; Y; Ci. For more details about essential elements we refer to [ 1 ]. 3

Related work

There are several papers about classical BMF [ 1, 2, 5, 8, 10, 12 ], but this methods can handle only one data table. In the literature, we can found a wide range of theoretical and application papers about the multi-relation data analysis (see overview [ 3 ]), but many times were shown that these approaches are suitable only for ordinal data. The multi-relational Boolean factor analysis is more speci c. The most relevant paper for our work is [ 7 ], where was introduced the basic idea that multi-relational factors are composed from classical factors which are interconnected via relation between data tables. There were also introduced three approaches how to create multi-relational factors, but an e ective algorithm is missing.

The Boolean multi-relational patterns and its extraction are subject of a paper [ 11 ]. Di erently from our approach data are represented via k-partite graphs. There are considered only relations between attributes and data tables contain only one single attribute. Patterns in [ 11 ] are di erent from our multirelational factors (are represented as k-clique in data) and also carry di erent information. In [ 11 ] there is also considered other kind of measure of quality of obtained patterns which is based on entropy.

Another relevant work is [ 6 ] where were introduced the Relational Formal Concept Analysis as a tool for analyzing multi-relational data. Unlike from [ 6 ] our approach extracts a di erent kind of patterns. For more details see [ 7 ]. MBMF is mentioned indirectly in a very speci c and limited form in [ 9 ] as the Joint Subspace Matrix Factorization.

Generally the idea of connection patterns from various data tables is not new. It can be found in the social network analysis or in the eld of recommendation systems. The main advantage of our approach is that patterns are Boolean factors that carry signi cant information and the second important advantage is that we deliver the most important factors (factors which describe the biggest portion of input data) before others, i.e. the rst obtained factor is the most important.

Algorithm for MBMF

Before we present the algorithm for the MBMF we show on a simple example basic ideas that are behind the algorithm. For this purpose we take the example from the previous part. As we mentioned above if we take tables C1; C2 and relation RC1C2 , we obtain with the narrow approach two connections between factors, i.e. two multi-relational factors. These factors explain only 60 percent of data. There usually exist more factorizations of Boolean data table. Factors in our example were obtained with using GreConD algorithm from [ 2 ]. GreConD algorithm select in each iteration a factor which covers the biggest part of still uncovered data. Now we are in the situation, where we want to obtain a di erent set of factors, with more connections between them. For this purpose we can use essential elements. Firstly we compute essential parts of C1 (denoted Ess(C1)) and C2 (denoted Ess(C1)). With the essential part of data table we mean all essential elements (tables 1 and 2).

Each essential element in Ess(C1) is de ned via interval in concept lattice of C1 (Fig. 1a) and similarly for essential elements in Ess(C2) (Fig 1b). In Fig. 1a is highlighted interval I1c corresponding to essential element (C1)1c. In Fig. 1b is highlighted interval corresponding to essential element (C2)8g. Let us note that concept lattices here are only for illustration purpose. For computing Ess(C1) and Ess(C2) is not necessary to construct concept lattices at all. Now, if we use the fact that we can take an arbitrary concept (factor) from each interval to obtain a complete factorization of data table, we have several options which concepts can be connect into one. More precisely we can take two intervals and try to connect each concept from the rst interval with concepts from the second one. Again, we obtain full factorization of input data tables, but now we can select factors with regard to a relation between them.

For example, if we take highlighted intervals, we obtain possibly four connections. First highlighted interval contains two concepts c1 = hf1; 2; 4g; fcgi and c2 = hf1; 4g; fb; c; dgi. Second consist of concepts d1 = hf6; 7; 8g; fggi and d2 = hf8g; fg; hgi. Only two connections (c1 with d1 and c1 with d2) satisfy relation RC1C2 , i.e. can be connected.

For two intervals it is not necessary to try all combination of factors. If we are not able to connect concept hA; Bi from the rst interval with concept hC; Di from the second interval, we are not able connect hA; Bi with any concept hE; F i from the second interval, where hC; Di hE; F i. Also if we are not 2 c a 4 (a) 3 b; d 1

e 5 8 (b) g able to connect concept hA; Bi from the rst interval with concept hE; F i from the second interval, we are not able connect any concept hC; Di from the rst interval, where hC; Di hA; Bi, with concept hE; F i. Let us note that is classical subconcept-superconcept ordering.

Even if we take this search space reduction into account, search in this intervals is still time consuming. We propose an heuristic approach which takes attribute concepts in intervals of the second data table, i.e. the bottom elements in each interval. In intervals of the rst data table we take greatest concepts which can be connected via relation, i.e. set of common attributes in relation is non-empty. The idea behind this heuristic is that a bigger set of objects possibly have a smaller set of common attributes in a relation and this leads to bigger probability to connect this factor with some factor from the second data table, moreover, if we take factor which contains the biggest set of attributes in intervals of the second data table.

Because we do not want to construct the whole concept lattice and search in it, we compute candidates for greatest element directly from relation RC1C2 . We take all objects belonging to the top element of interval Iij from the rst data table and compute how many of them belong to each attribute in the relation. We take into account only attributes belonging to object i. We take as candidate the greatest set of objects belonging to some attribute in a relation, which satis es that if we compute a closure of this set in the rst data table, resulting set of objects do not have empty set of common attributes in a relation.

Applying this heuristic on data from the example, we obtain three factors hf1; 2; 4gr;sftcgdiataandtafobuler,faFcCto1rs=FhCf22;=4gh;ff5ag;;cfgei;,hFgiC, 1F =C2 h=f1h;f36;;47gg;;ffcf;;dgggii,,FFCC12 == in the 1 2 3 1 2 3 hf7g; fe; f; ggi, F4C2 = hf8g; fg; hgi from the second one. Between this factors, there are six connections satisfying the relation. These connections are shown in table 6.

We form multi-relational factors in a greedy manner. In each step we connect factors, which cover the biggest part of still uncovered part of data tables C1 and C2. Firstly, we obtain multi-relational factor hF2C1 ; F2C2 i which covers 50 percent of the data. Then we obtain factor hF3C1 ; F4C2 i which covers together with rst factor 75 percent of the data and last we obtain factor hF1C1 ; F3C2 i. All these factors cover 90 percent of the data. By adding other factors we do not obtain better coverage of input data. These three factors cover the same part of input data as six connections from table 6.

Remark 1. As we mentioned above and what we can see in the example, multirelational factors are not always able to explain the whole data. This is due to nature of data. Simply there is no information how to connect some classic factors, e.g. in the example no set of objects from C1 has in RC1C2 a set of common attributes equal to fe; hg (or only feg or only fhg). From this reason we are not able to connect any factor from C1 with factor F1C2 .

Remark 2. In previous part we explain the idea of the algorithm on a objectattribute relation between data tables. It is also possible consider di erent kind of relation, e.g. object-object, attribute-object or attribute-attribute relation. Without loss of generality we present the algorithm only for the object-attribute relation. Modi cation to a di erent kind of relation is very simple.

Now we are going to describe the pseudo-code (Algorithm 1) of our algorithm for MBMF. Input to this algorithm are two Boolean data tables C1 and C2, binary relation RC1C2 between them and a number p 2 [0; 1] which represent how large part of C1 and C2 we want to cover by multi-relational factors, e.g. value 0.9 mean that we want to cover 90 percent of entries in input data tables. Output of this algorithm is a set M of multi-relational factors that covers the prescribed portion of input data (if it is possible to obtain prescribed coverage). The rst computed factor covers the biggest part of data.

First, in lines 1-2 we compute essential part of C1 and C2. In lines 2-4 we initialize variables UC1 and UC2 . These variables are used for storing information about still uncovered part of input data. We repeat the main loop (lines 5-18) until we obtain a required coverage or until it is possible to add new multirelational factors which cover still uncovered part (lines 12-14).

In the main loop for each essential element we select the best candidate from interval Iij from the rst data table in the greedy manner described in the algorithm idea, i.e. we take the greatest concept which can be connected via relation. Than we try to connect this candidate with factors from the second data table. We compute cover function and we add to M the multi-relational factor maximizing this coverage.

In lines 16-17 we remove from UC1 and UC2 entries which are covered by actually added multi-relational factor.

Algorithm 1: Algorithm for the multi-relational Boolean factors analysis Input: Boolean matrices C1; C2 and relation RC1C2 between them and p 2 [0; 1] Output: set M of multi-relational factors select one from set of candidates which maximize cover of C1 (C2)i#"C2 and which 10 11

Our implementation of the algorithm follows the pseudo-code conceptually, but not in details. For example we speed up the algorithm by precomputing candidates or instead computing candidates for each essential elements, we compute candidates for essential areas, i.e. essential elements which are covered by one formal concept.

Remark 3. The input of our algorithm are two Boolean data tables and one relation between them. In general we can have more data tables and relations. Generalization of our algorithm for such input is possible. Due to lack of space we mentioned only an idea of this generalization. For the input data tables C1; C2; : : : ; Cn and relations RCiCi+1 ; i 2 f1; 2; : : : ; n 1g we rstly compute multi-relational factors for Cn 1 and Cn. Then iteratively compute multirelational factors for Cn 2 and Cn 1. From this pairs we construct n-tuple multirelational factor.

We do not make a detail analysis of the time complexity of the algorithm. Even our slow implementation in MATLAB is fast enough for factorization usually large datasets in a few minutes. 5

Experimental evaluation

For experimental evaluation of our algorithm we use in a data minig community well known real dataset MovieLens1. This dataset is composed of two data tables that represent a set of users and their attributes, e.g. gender, age, sex, occupation and a set of movies again with their attributes, e.g. the year of production or genre. Last part of this dataset is a relation between this data sets. This relation contains 1000209 anonymous ratings of approximately 3900 movies (3952) made by 6040 MovieLens users who joined to MovieLens in 2000. Each user has at least 20 ratings. Ratings are made on a 5-star scale (values 1-5, 1 means, that user does not like a movie and 5 means that he likes a movie).

Originally data tables Users and Movies are categorical. Age is grouped into 7 categories such as \Under 18", \18-24", \25-34", \35-44", \45-49", \50-55" and \56+". Sex is from set fMale, Femaleg. Occupation is chosen from the following choices: \other" or not speci ed, \academic/educator", \artist", \clerical/admin", \college/grad student", \customer service", \doctor/health care", \executive/managerial", \farmer", \homemaker", \K-12 student", \lawyer", \programmer", \retired", \sales/marketing", \scientist", \self-employed", \technician/engineer", \tradesman/craftsman", \unemployed" and \writer". Film genres are following: \Action", \Adventure", \Animation", \Children's", \Comedy", \Crime", \Documentary", \Drama", \Fantasy", \Film-Noir", \Horror", \Musical", \Mystery", \Romance", \Sci-Fi", \Thriller", \War" and \Western". Year of production is from 1919 to 2000. We grouped years into 8 categories \1919-1930", \1931-1940", \1941-1950", \1951-1960", \1961-1970", \1971-1980", \1981-1990" and \1991-2000".

We convert the ordinal relation in to binary one. We use three di erent scaling. The rst is that user rates a movie. The second is that a user does not like a movie (he rates movie with 1-2 stars). The last one is that user likes a movie (rates 4-5). This does not mean, that users do like (respective do not like) some genre, it means, that movies from this genre are or are not worth to see. We took the middle size version of the MovieLens dataset and we made a restriction to 3000 users and movies that were rated by that users. We take users, who rate movies the most, and we obtain dimension of the rst data table 3000 30 and dimension of the second data table is 3671 26. Let us just note that for obtaining object-attribute relation we need to transpose Movies data table.

Relation \user rates a movie" make sense, because user rates a movie if he has seen it. We can understand this relation as user has seen movie. We get 29 multi-relational factors, that cover almost 100% of data (99.97%). Values of coverage, i.e. how large part of input data is covered can be seen in gure 2. 1 http://grouplens.org/datasets/movielens/ Graphs in gure 3 show coverage of Users data table and Movies data table separately.

We can also see that for explaining more than 90 percent of data are su cient 17 factors. This is signi cant reduction of input data. { Males rate new movies (movies from 1991 to 2000). { Young adult users (ages 25-34) rate drama movies. { Females rate comedy movies.

{ Youth users (18-24) rate action movies.

Another interesting factors are: { Old users (from category 56+) rate movies from their childhood (movies from 1941 to 1950). { Users in age range 50-55 rate children's movies. Users in this age usually have grand children. { K-12 students rate animation movies.

Due to lack of space, we skip details about factors in relation \user does not like a movie" and relation \user does like a movie". In the rst relation we get 30 factors, that covers 99.99% of data. In the second one, we get 29 factors, covering 99.96% of data. Compute all multi-relational factors on this datasets take approximately 5 minutes.

Remark 4. In case of MovieLens we are able to reconstruct input data tables almost wholly for each three relations. Interesting question is what about relation, i.e. can we reconstruct relation between data tables? Answer is yes, we can. Multi-relational factor carry also information about the relation between data tables. So we can reconstruct it, but with some error. This error is a result of choosing the narrow approach.

Reconstruction error of relation is interesting information and can be minimize if we take this error into account in phase of computing coverage. In other words we want maximal coverage with minimal relation reconstruction error. This leads to more complicated algorithm because we need weights to compute a value of utility function. We implement also this variant of algorithm. Requirement of minimal reconstruction error and maximal coverage seems to be contradictory, but this claim need more detailed study. Also it is necessary to determine correct weight settings. We left this issue for the extended version of this paper. 6

Conclusion and Future Research

In this paper, we present new algorithm for multi-relational Boolean matrix factorization, that uses essential elements from binary matrices for constructing better multi-relational factors, with regard to relations between each data table. We test the algorithm on, in data mining well known, dataset MovieLens. We obtain from these experiments interesting and easy interpretable results, moreover, the number of obtained multi-relational factors needed for explaining almost whole data is reasonable small.

A future research shall include the following topics: generalization of the algorithm for MBMF for ordinal data, especially data over residuated lattices. Construction of algorithm which takes into account reconstruction error of the relation between data tables. Test the potential of this method in recommendation systems. And last but not least create not crisp operator for connecting classic factors into multi-relational factors.

Acknowledgment We acknowledge support by IGA of Palacky University, No. PrF 2014 034.

1. Belohlavek

, Trnecka

: From-Below Approximations in Boolean Matrix Factorization: Geometry and New Algorithm . http://arxiv.org/abs/1306.4905, 2013 .

2. Belohlavek

, Vychodil

: Discovery of optimal factors in binary data via a novel method of matrix decomposition . J. Comput. Syst. Sci . 76 ( 1 ), 3 { 20 , 2010 .

3. Dzeroski S. : Multi-Relational Data Mining: An Introduction . ACM SIGKDD Explorations Newsletter , 1 ( 5 ), 1 { 16 , 2003 .

4. Ganter

, Wille

: Formal Concept Analysis: Mathematical Foundations . Springer, Berlin, 1999 .

5. Geerts

, Goethals

, Mielikainen T.: Tiling databases , Proceedings of Discovery Science 2004 , pp. 278 { 289 , 2004 .

6. Hacene

M. R.

, Huchard

, Napoli

, Valtechev

: Relational concept analysis: mining concept lattices from multi-relational data . Ann. Math. Artif. Intell . 67 ( 1 ), 81 { 108 , 2013 .

7. Krmelova

, Trnecka

: Boolean Factor Analysis of Multi-Relational Data . In: M. Ojeda-Aciego , J. Outrata (Eds.): CLA 2013: Proceedings of the 10th International Conference on Concept Lattices and Their Applications , pp. 187 { 198 , 2013 .

8. Lucchese

, Orlando

, Perego

: Mining top-K patterns from binary datasets in presence of noise . SIAM DM 2010 , pp. 165 { 176 , 2010 .

9. Miettinen

: On Finding Joint Subspace Boolean Matrix Factorizations . Proc. SIAM International Conference on Data Mining (SDM2012) , pp. 954 - 965 , 2012 .

10. Miettinen

, Mielikainen T., Gionis

, Das

, Mannila

: The discrete basis problem , IEEE Trans. Knowledge and Data Eng . 20 ( 10 ), 1348 { 1362 , 2008 .

11. Spyropoulou

, De Bie T.: Interesting Multi-relational Patterns . In Proceedings of the 2011 IEEE 11th International Conference on Data Mining, ICDM '11 , pp. 675 { 684 , 2011 .

12. Xiang

, Jin

, Fuhry

, Dragan

F. F.

: Summarizing transactional databases with overlapped hyperrectangles , Data Mining and Knowledge Discovery 23 ( 2 ), 215 { 251 , 2011 .