<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Vector model improvement by FCA and Topic Evolution</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Jan Martinovič</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Petr Gajdoš</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Department of Computer Science, VSB - Technical University of Ostrava</institution>
          ,
          <addr-line>17. listopadu 15, 708 33 Ostrava-Poruba, Czech Republic</addr-line>
        </aff>
      </contrib-group>
      <fpage>46</fpage>
      <lpage>57</lpage>
      <abstract>
        <p>The presented research is based on standard methods of information retrieval using the vector model for the representation of documents (objects). The vector model is often expanded to obtain better precision and recall. In this article we describe two approaches to vector model expansion. The first is based on hierarchical clustering; its goal is to find the list of all documents whose topic is most similar to that of the requested document. The second is document classification based on formal concept analysis. We evaluate all concepts and compute the importances of documents. Finally, we compare the results of our approach based on formal concept analysis with the results of the classical vector model.</p>
      </abstract>
      <kwd-group>
        <kwd>Vector</kwd>
        <kwd>FCA</kwd>
        <kwd>Moebius</kwd>
        <kwd>Topic Evolution</kwd>
        <kwd>Clustering</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
    </sec>
    <sec id="sec-2">
      <title>Background</title>
      <sec id="sec-2-1">
        <title>Vector model</title>
        <p>The vector model [12] of documents dates back to the 1970s. In the vector model, documents and user queries are represented by vectors.</p>
        <p>We use m different terms t1, ..., tm for indexing n documents. Then each document di is represented by a vector
di = (wi1, wi2, ..., wim),
where wij is the weight of the term tj in the document di.</p>
        <p>An index file of the vector model is represented by a matrix whose i-th row matches the i-th document and whose j-th column matches the j-th term. In the vector model, a query is represented by an m-dimensional vector
q = (q1, q2, ..., qm),
where qj ∈ ⟨0, 1⟩. On the basis of the query q we can compute a coefficient of similarity for each document di. This coefficient can be understood as the "distance" between the document's vector and the vector of the query. We use the cosine measure to compute this similarity:
sim(q, di) = (Σ_{k=1..m} qk wik) / (√(Σ_{k=1..m} qk²) √(Σ_{k=1..m} wik²)).</p>
        <p>The similarity of two documents is given by the following formula:
sim(di, dj) = (Σ_{k=1..m} wik wjk) / (√(Σ_{k=1..m} wik²) √(Σ_{k=1..m} wjk²)).
For more information see [12].</p>
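        <p>As a small illustration, the cosine measure can be computed directly from the weight vectors. The following Python sketch uses made-up weights and is only a minimal example, not the authors' implementation:</p>

```python
from math import sqrt

def cosine_sim(q, d):
    """Cosine similarity of a query vector q and a document vector d."""
    num = sum(qk * wk for qk, wk in zip(q, d))
    den = sqrt(sum(qk * qk for qk in q)) * sqrt(sum(wk * wk for wk in d))
    return num / den if den else 0.0

# Index file: one row per document, one column per term (weights w_ij).
docs = [
    [1.0, 1.0, 0.0],   # d1
    [0.0, 1.0, 1.0],   # d2
]
q = [1.0, 0.0, 0.0]    # query asking only for the first term

# Rank documents by similarity to the query, best first.
ranked = sorted(range(len(docs)), key=lambda i: cosine_sim(q, docs[i]), reverse=True)
```

        <p>Sorting by the coefficient of similarity reproduces the ordering of the vector model response.</p>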
      </sec>
      <sec id="sec-2-2">
        <title>Cluster analysis</title>
        <p>The main goal of cluster analysis is to find out whether there are any groups of similar objects. These groups are called clusters. We focus on object clustering, which can be divided into two steps: first we create the clusters, and then we look for relevant clusters [7]. The motivation for cluster analysis is contained in the cluster hypothesis [12].</p>
        <p>The process of searching for an ideal fragmentation of objects is also called clustering. We use agglomerative hierarchical clustering based on the similarity matrix. At the beginning, each object is considered a separate cluster. Clusters are then joined together in sequence; the algorithm ends when all objects form a single cluster.</p>
        <p>The similarity matrix SimC for a collection C may be written as
        | sim11 sim12 ... sim1n |
SimC =  | sim21 sim22 ... sim2n |
        |  ...   ...  ...  ...  |
        | simn1 simn2 ... simnn |
where the i-th row matches the i-th document and the j-th column the j-th document.</p>
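        <p>The agglomerative procedure above can be sketched as follows. Average linkage over the similarity matrix is an assumed choice here, since the text does not fix a linkage criterion:</p>

```python
def agglomerative(sim):
    """Agglomerative hierarchical clustering over a similarity matrix SimC.

    Each object starts as its own cluster; the pair of clusters with the
    highest average pairwise similarity is merged until one cluster remains.
    Returns the merge history as (cluster_a, cluster_b) pairs.
    """
    clusters = [[i] for i in range(len(sim))]
    history = []
    while len(clusters) > 1:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                # average pairwise similarity between clusters a and b
                s = sum(sim[i][j] for i in clusters[a] for j in clusters[b])
                s /= len(clusters[a]) * len(clusters[b])
                if best is None or s > best[0]:
                    best = (s, a, b)
        _, a, b = best
        history.append((clusters[a], clusters[b]))
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return history

# Toy symmetric similarity matrix for three documents.
sim = [
    [1.0, 0.9, 0.1],
    [0.9, 1.0, 0.2],
    [0.1, 0.2, 1.0],
]
history = agglomerative(sim)   # documents 0 and 1 are merged first
```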
      </sec>
      <sec id="sec-2-3">
        <title>Formal Concept Analysis</title>
        <p>FCA was defined by R. Wille and can be used for the hierarchical ordering of objects based on the objects' features. The basic terms are formal context and formal concept. This section contains all the important definitions one needs to understand the topic.</p>
        <p>Definition 1. A formal context C = (G, M, I) consists of two sets G and M and a relation I between G and M. Elements of G are called objects and elements of M are called attributes of the context. In order to express that an object g is in a relation I with an attribute m, we write gIm and read it as "the object g has the attribute m". The relation I is also called the incidence relation of the context.</p>
        <p>Definition 2. For a set A ⊆ G of objects we define
A↑ = { m ∈ M | gIm for all g ∈ A },
the set of attributes common to the objects in A. Correspondingly, for a set B ⊆ M of attributes we define
B↓ = { g ∈ G | gIm for all m ∈ B },
the set of objects which have all attributes in B.</p>
        <p>Definition 3. A formal concept of the context (G,M,I) is a pair (A, B) with
A ⊆ G, B ⊆ M , A↑ = B and B↓ = A. We call A the extent and B the intent of
the concept (A, B).</p>
        <p>Definition 4. Let M be the totality of all features deemed relevant in the specific context, and let I ⊆ G × M be the incidence relation that describes the features possessed by objects, i.e. (g, m) ∈ I whenever object g ∈ G possesses a feature m ∈ M. For each relevant feature m ∈ M, let λ(m) ≥ 0 quantify the importance or weight of feature m. The diversity value of a set S of objects is defined as
v(S) = Σ λ(m) over all features m ∈ M possessed by at least one object in S.   (1)</p>
        <p>Our approach is also based on the Conjugate Moebius Function and on some properties coming from the Theory of Diversity and Formal Concept Analysis.</p>
        <p>Theorem 1. For any function v : 2^M → R with v(∅) = 0 there exists a unique function λ : 2^M → R, the Conjugate Moebius Inverse function, such that λ(∅) = 0 and, for all S,
v(S) = Σ_{A : A ∩ S ≠ ∅} λ(A).   (2)</p>
        <p>The diversity of an object (document) g is the sum of the weights of all features related to the object according to the incidence matrix. It conveys information about the partial importance of an object but does not clearly display other dependences:
do(g) = Σ_{m ∈ M : gIm} λ(m).   (4)</p>
        <p>The next characteristic is called the sum of diversities of all objects of a concept C; together, the objects of one concept can “cover” all of its features:
sdo(C) = Σ_{g ∈ C} do(g).   (5)</p>
        <p>The importance of the object (document) g is the main point of our method. The value represents the importance from these aspects:
• Uniqueness - is there any other similar object?
• Range of description - which dimensions does the object describe?
• Weight of description - what is the weight of the object in each dimension?
impo(g) = Σ_{C : C ∋ g} do(g) / sdo(C).   (8)
For more information see [3].</p>
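        <p>A minimal sketch of the characteristics of this section, assuming a naive enumeration of formal concepts and the importance formula (8) as reconstructed here; the toy context and the weights λ(m) are made up:</p>

```python
from itertools import combinations

def up(A, I, M):
    """A↑ : the set of attributes common to all objects in A."""
    return frozenset(m for m in M if all((g, m) in I for g in A))

def down(B, I, G):
    """B↓ : the set of objects having all attributes in B."""
    return frozenset(g for g in G if all((g, m) in I for m in B))

def concepts(G, M, I):
    """All formal concepts (A, B) with A↑ = B and B↓ = A (naive enumeration)."""
    found = set()
    for r in range(len(M) + 1):
        for B in combinations(sorted(M), r):
            A = down(frozenset(B), I, G)
            found.add((A, up(A, I, M)))
    return found

# Toy context: objects are documents, attributes are terms.
G = {"d1", "d2"}
M = {"t1", "t2"}
I = {("d1", "t1"), ("d1", "t2"), ("d2", "t2")}
lam = {"t1": 1.0, "t2": 2.0}               # assumed feature weights λ(m)

def do(g):
    """Diversity of an object, formula (4): sum of weights of its features."""
    return sum(lam[m] for m in M if (g, m) in I)

def sdo(A):
    """Sum of diversities of all objects in a concept extent, formula (5)."""
    return sum(do(g) for g in A)

def impo(g):
    """Importance of an object as reconstructed in formula (8): the object's
    share of diversity, summed over all concepts whose extent contains it."""
    return sum(do(g) / sdo(A) for A, _ in concepts(G, M, I) if g in A)
```

        <p>On this toy context impo("d1") exceeds impo("d2"): d1 possesses the unique feature t1 and therefore dominates every concept it appears in.</p>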
      </sec>
    </sec>
    <sec id="sec-3">
      <title>Vector Model Improvement</title>
      <sec id="sec-3-1">
        <title>Using FCA to obtain the importance of documents</title>
        <p>This method is based (a) on the partial ordering of concepts in the concept lattice and (b) on the inverse calculation of weights of objects using the Moebius function and the defined characteristics. The particular steps are illustrated by fig. 1 and briefly described in this chapter.</p>
        <p>First we obtain the input data (documents and words) as a table or matrix. In the second step, a scaling method is used to create an input incidence matrix: every dimension can be scaled to a finite number of parts to obtain binary values, or we can simply replace every non-zero value with one and every other value with zero. The output of this transformation is the incidence matrix that we need as input for the concept computation. Next, the set of concepts is computed using FCA algorithms. We could create the concept lattice and draw the Hasse diagram, but that is not important in our method; it can be useful to show dependences between concepts if needed. We use only the list of concepts. After that, we compute the basic characteristics for each concept according to formulas (4)-(7). Finally, we compute the importance of objects according to formula (8). The obtained values provide the criterion for sorting the set of objects.</p>
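        <p>The scaling step can be illustrated by simple thresholding of the weight matrix; the threshold parameter is an assumption of this sketch:</p>

```python
def to_incidence(weights, threshold=0.0):
    """Scale a term-weight matrix to a binary incidence matrix:
    weights above the threshold become 1, all others become 0."""
    return [[1 if w > threshold else 0 for w in row] for row in weights]

weights = [[0.0, 0.4], [0.7, 0.0]]
incidence = to_incidence(weights)   # [[0, 1], [1, 0]]
```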
      </sec>
      <sec id="sec-3-2">
        <title>Evolution of topic</title>
        <p>Our research concerns topics that undergo an evolution. Let us assume a document from a collection that describes some topic. It is clear that there are other documents in the collection describing the same topic, but they use different words to characterize it. The difference can have many reasons: the first document focused on the topic uses some set of words, while later documents may use synonyms or reflect, for example, the exploration of new circumstances, new facts, a new political situation, etc. [4].</p>
        <p>The result of searching for an evolution of topic is, for a given query, a list of documents thematically related to that query. The query may be specified by terms or by a document marked as relevant.</p>
        <p>We define one algorithm based on formal concept analysis and another based on clustering. Our research answers the question "What is the better way to improve the results of the vector model?" This is our algorithm using FCA and the Moebius function:
Algorithm TOPIC-FCA:
1. We transform the query, i.e. we create a weighted vector of terms.
2. We compute the importances of documents (objects) by formula (8) and make a list of the documents and their importances.
3. We find the relevant document reld in the ordered list.
4. In a finite number of steps, we look for the "nearest" documents. The "nearest" document is the one with the smallest difference between its weight and the weight of reld. A found document is excluded before the next step.</p>
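        <p>The steps above can be sketched in Python; the importance values below are made up, and `topic_fca` only mirrors step 4's nearest-by-importance search:</p>

```python
def topic_fca(importances, reld, level):
    """TOPIC-FCA nearest-importance search (sketch).

    importances: dict document -> importance value (formula (8)).
    reld:        the selected relevant document.
    level:       how many "nearest" documents to collect.
    """
    target = importances[reld]
    remaining = {d: w for d, w in importances.items() if d != reld}
    topic = []
    for _ in range(min(level, len(remaining))):
        # nearest = smallest difference between its weight and reld's weight
        nearest = min(remaining, key=lambda d: abs(remaining[d] - target))
        topic.append(nearest)
        del remaining[nearest]           # exclude before the next step
    return topic

# Made-up importances; with reld = doc2 the evolution is doc3, doc4, doc1.
imps = {"doc1": 0.9, "doc2": 0.5, "doc3": 0.45, "doc4": 0.15}
topic = topic_fca(imps, "doc2", 3)       # ['doc3', 'doc4', 'doc1']
```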
        <p>Then we use this algorithm for clustering:</p>
      </sec>
      <sec id="sec-3-3">
        <title>Algorithm TOPIC-CA :</title>
        <p>1. We choose the total number of documents we want (level).
2. We find the leaf cluster which contains the selected relevant document.
3. We move one level up in the hierarchy.
4. We explore the neighbouring clusters, selecting first the cluster created on the highest sub-level. Each document we find is added to the result list. When the number of documents in the result list equals level, we stop.
5. We repeat from step 3.</p>
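        <p>The steps above can be sketched as follows, assuming the cluster hierarchy is given as a simple parent/children tree; the traversal order of neighbouring clusters is simplified to stored order:</p>

```python
def topic_ca(tree, leaves, reld, level):
    """TOPIC-CA (sketch): climb the cluster hierarchy from the leaf cluster
    containing reld, collecting documents from neighbouring clusters until
    `level` documents are gathered.

    tree:   dict inner node -> list of child nodes.
    leaves: dict leaf node -> list of documents it contains.
    """
    parent = {c: p for p, cs in tree.items() for c in cs}

    def docs_under(node):
        if not tree.get(node):                 # leaf cluster
            return list(leaves.get(node, []))
        out = []
        for c in tree[node]:
            out.extend(docs_under(c))
        return out

    # step 2: find the leaf cluster containing the relevant document
    node = next(n for n, ds in leaves.items() if reld in ds)
    result = []
    while node in parent and len(result) < level:
        child, node = node, parent[node]       # step 3: get up in the hierarchy
        for c in tree[node]:                   # step 4: explore the neighbours
            if c == child:
                continue                       # this subtree was already handled
            for d in docs_under(c):
                if d != reld and len(result) < level:
                    result.append(d)
    return result

# Made-up hierarchy: doc2 sits with doc3 under c1; doc1 and doc4 under c2.
tree = {"root": ["c1", "c2"], "c1": ["l1", "l2"], "c2": ["l3"]}
leaves = {"l1": ["doc2"], "l2": ["doc3"], "l3": ["doc1", "doc4"]}
topic = topic_ca(tree, leaves, "doc2", 3)      # doc3 first, then doc1 and doc4
```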
      </sec>
      <sec id="sec-3-4">
        <title>Sort Response in Vector Model</title>
        <p>The vector model's response to a query is a collection of documents ordered by the coefficient of similarity between the query and each document. In this part we present a method that changes this response according to the evolution of topic obtained from clusters or concepts. Our approach is based on removing non-relevant documents from the response and adding other relevant documents to it. We have developed the following algorithm for this change. Algorithm SORT-EACH reorders the documents in the result of a vector model query so that documents belonging to the same evolution of topic are closer to each other:
1. The collection of documents from the vector query is marked CV.
2. The new sorted collection is marked CS; the number of its documents is the value of the variable count.
3. We choose the total number of documents in an evolution of topic and mark it level.
4. We perform the following sorting:
foreach document DV in CV do
    if CS is empty then
        add DV to collection CS
        goto Continue
    end
    find, by algorithm TOPIC-FCA (or TOPIC-CA), the collection CT of the evolution of topic of document DV; the number of documents in the topic is level + 1 (including document DV)
    foreach document DT in CT other than DV do
        if document DT is in CS then
            add document DV behind DT in CS
            goto Continue
        end
    end
    if DV was not added then
        add DV to the end of collection CS
    end
    label: Continue
end
5. We return collection CS to the user.</p>
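        <p>The algorithm can be sketched in Python; `topic_of` is a stand-in (an assumption of this sketch) for TOPIC-FCA or TOPIC-CA, and the example topics are made up:</p>

```python
def sort_each(cv, topic_of, level):
    """SORT-EACH (sketch): reorder the vector-query result cv so that
    documents from the same evolution of topic sit next to each other.

    topic_of(d, level) stands in for TOPIC-FCA / TOPIC-CA and returns the
    evolution of topic of document d (up to `level` documents).
    """
    cs = []
    for dv in cv:
        if not cs:
            cs.append(dv)                      # first document starts CS
            continue
        placed = False
        for dt in topic_of(dv, level):
            if dt in cs:
                cs.insert(cs.index(dt) + 1, dv)   # add dv behind dt
                placed = True
                break
        if not placed:
            cs.append(dv)                      # add dv to the end of CS
    return cs

# Hypothetical topics: doc1 and doc3 share a topic, doc2 and doc4 share one.
topics = {"doc1": ["doc3"], "doc3": ["doc1"], "doc2": ["doc4"], "doc4": ["doc2"]}
order = sort_each(["doc1", "doc2", "doc3", "doc4"],
                  lambda d, lvl: topics[d][:lvl], 1)
# order groups doc3 behind doc1 and doc4 behind doc2
```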
      </sec>
    </sec>
    <sec id="sec-4">
      <title>Illustrative sample of vector model improvement</title>
      <p>The following tables show experimental results on generated data. The documents' importances were computed according to formula (8). The document selected by the user is highlighted; it is the input document of the TOPIC algorithms above. Each query is transformed to a vector of term weights. We use a simplified matrix of documents and terms: the number "1" means that the document in the given row contains the term in the given column.</p>
      <p>In the tables we can see that vector queries give worse results in some cases, because they return zero values for documents that have no terms in common with the query. But these documents can be about the same theme described by different terms (words). Therefore we use the SORT-EACH algorithm to improve the vector query by TOPIC-FCA or TOPIC-CA; in these samples we use the new TOPIC-FCA algorithm. See [4] for other experiments.</p>
      <p>In the first sample (table 1), each of the returned documents contains the three requested terms, so the computed TOPIC-FCA (importances of objects) brings zero improvement.</p>
      <p>Next, we describe table 2. The values of the documents' importances show the relative importances according to the inserted query. There are only small differences between the importances of objects and the vector query. The distance between document number 1 and selected document number 2 is larger than the distance between documents number 2 and 3 (see the difference between the documents' importances). The distance of the vector query from each document, plus the distances between documents, are the main reason for this effect. The effect is better described by the following table 3.</p>
      <p>The vector query is "000111000000". We selected document number 2 again. Although the first document contains the same terms as the second document, the distance between them is very large because of the great number of terms the second document does not contain. The evolution of topic of the second document is then doc3, doc4 and at last doc1, so we get a different ordering than the ordering produced by the vector query.</p>
      <p>Table 4 shows the main deficiency of the vector query. When we insert the query "000111000000" we cannot obtain the fourth document. But our method includes this document because of its similarity to selected document number 2. So we can find new dependences between documents that can be about the same theme.</p>
      <p>The last table 5 best shows the hidden dependences between documents. Documents number 1 and 4 are not included in the vector query result, but we can say there can be some relation between them because of the common term number 9. The evolution of topic of the selected document is doc3, doc1 and doc4.</p>
      <p>We have tried to show the importance of our method on simple examples. If we use TOPIC-FCA or TOPIC-CA for vector query improvement, we can find other dependences between documents and obtain a better ordering of the requested documents.</p>
      <sec id="sec-4-1">
        <title>Sample graphs</title>
        <p>The following graphs show the documents' distances from selected document number two. The graphs on the left show the distances of documents after using the TOPIC-FCA algorithm, and the graphs on the right correspond to the results of the vector query. All distances were computed from selected document number 2.
Graph description:
– A node represents a document, or a cluster of documents if the documents' distance is zero. Node numbers correspond to document numbers.
– An edge connects comparable documents (nodes); its value is the distance between the corresponding documents.</p>
        <sec id="sec-4-1-1">
          <title>Documents’ distances computed from table 1.</title>
        </sec>
        <sec id="sec-4-1-2">
          <title>Documents’ distances computed from table 2.</title>
        </sec>
        <sec id="sec-4-1-3">
          <title>Documents’ distances computed from table 3.</title>
        </sec>
        <sec id="sec-4-1-4">
          <title>Documents’ distances computed from table 4.</title>
        </sec>
        <sec id="sec-4-1-5">
          <title>Documents’ distances computed from table 5.</title>
        </sec>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>Conclusion and future work</title>
      <p>We have described a new method for vector query improvement based on formal concept analysis and the Moebius inverse function. The known deficiencies of the vector model have been suppressed using the TOPIC and SORT-EACH algorithms. In future work we would like to test our methods on real data. Our presented methods can be applied to small data sets as well as to large collections of documents.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Berry</surname>
          </string-name>
          , M. W. (Ed.):
          <article-title>Survey of Text Mining: Clustering, Classification, and Retrieval</article-title>
          . <source>Springer Verlag</source>,
          <year>2003</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Baeza-Yates</surname>
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ribeiro-Neto</surname>
            <given-names>B.</given-names>
          </string-name>
          :
          <article-title>Modern Information Retrieval</article-title>
          .
          <source>Addison Wesley</source>
          , New York,
          <year>1999</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3. Ďuráková,
          <string-name>
            <surname>D.</surname>
          </string-name>
          , Gajdoš, P.:
          <article-title>Indicators Valuation using FCA and Moebius Inversion Function</article-title>
          . DATAKON, Brno,
          <year>2004</year>
          , ISBN 80-2103-516-1
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>Dvorský</surname>
          </string-name>
          J., Martinovič J., Snášel V.:
          <article-title>Query Expansion and Evolution of Topic in Information Retrieval Systems</article-title>
          ,
          <string-name>
            <surname>DATESO</surname>
          </string-name>
          <year>2004</year>
          , ISBN:
          <fpage>80</fpage>
          -
          <lpage>248</lpage>
          -0457-3.
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <surname>Dvorský</surname>
          </string-name>
          J., Martinovič J., Pokorný J., Snášel V.:
          <article-title>Search of Topics in a Collection of Documents (in Czech)</article-title>
          ,
          <source>Znalosti</source>
          <year>2004</year>
          , ISBN:
          <fpage>80</fpage>
          -
          <lpage>248</lpage>
          -0456-5.
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <surname>Ganter</surname>
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Wille</surname>
            <given-names>R.</given-names>
          </string-name>
          :
          <article-title>Formal Concept Analysis</article-title>
          .
          <source>Springer-Verlag, Berlin</source>
          , Heidelberg,
          <year>1999</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <given-names>Christos</given-names>
            <surname>Faloutsos</surname>
          </string-name>
          , Douglas Oard:
          <article-title>A Survey of Information Retrieval and Filtering Methods</article-title>
          ,
          <source>Univ. of Maryland Institute for Advanced Computer Studies Report</source>
          , College Park,
          <year>1995</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8. Keith van Rijsbergen:
          <source>The Geometry of Information Retrieval</source>
          , Cambridge University Press,
          <year>2004</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
          <string-name>
            <surname>Kummamuru</surname>
            <given-names>K</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lotlikar</surname>
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Roy</surname>
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Singal</surname>
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Krishnapuram</surname>
            <given-names>R.</given-names>
          </string-name>
          :
          <article-title>A Hierarchical Monothetic Document Clustering Algorithm for Summarization and Browsing Search Results</article-title>
          , WWW2004, New York, USA.
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          10.
          <string-name>
            <surname>Nehring</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          and
          <string-name>
            <surname>Puppe</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          :
          <article-title>Modelling phylogenetic diversity</article-title>
          .
          <source>Resource and Energy Economics</source>
          (
          <year>2002</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          11.
          <string-name>
            <surname>Nehring</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          :
          <source>A Theory of Diversity. Econometrica</source>
          <volume>70</volume>
          (
          <year>2002</year>
          ) 1155-1198.
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          12.
          <string-name>
            <surname>Pokorný</surname>
          </string-name>
          J., Snášel V., Húsek D.: Dokumentografické informační systémy (in Czech). Karolinum,
          <source>Skriptum MFF UK Praha</source>
          ,
          <year>1998</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          13.
          <string-name>
            <surname>C.J. van Rijsbergen</surname>
          </string-name>
          : Information Retrieval (second ed.). London, Butterworths,
          <year>1979</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          14.
          <string-name>
            <given-names>Tsunenori</given-names>
            <surname>Ishioka</surname>
          </string-name>
          :
          <article-title>Evaluation of Criteria for Information Retrieval</article-title>
          , International Conference on Web Intelligence, IEEE Computer Society,
          <year>2003</year>
          , ISBN 0-7695-1932-6.
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          15.
          <string-name>
            <given-names>S.</given-names>
            <surname>Vempala</surname>
          </string-name>
          : The Random Projection Method
          ,
          <source>Dimacs Series in Discrete Mathematics and Theoretical Computer Science</source>
          ,
          <year>2004</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>