<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Vector model improvement by FCA and Topic Evolution</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Jan Martinovič</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Petr Gajdoš</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Department of Computer Science, VSB - Technical University of Ostrava</institution>
          ,
          <addr-line>17. listopadu 15, 708 33 Ostrava-Poruba, Czech Republic</addr-line>
        </aff>
      </contrib-group>
      <fpage>46</fpage>
      <lpage>57</lpage>
      <abstract>
        <p>The presented research is based on standard methods of information retrieval using the vector model for the representation of documents (objects). The vector model is often expanded to obtain better precision and recall. In this article we describe two approaches to vector model expansion. The first is based on hierarchical clustering; its goal is to find the list of all documents whose topic is most similar to that of the requested document. The second is document classification based on formal concept analysis. We evaluate all concepts and compute the importances of documents. Finally, we compare the results of our approach based on formal concept analysis with the results of the classical vector model.</p>
      </abstract>
      <kwd-group>
        <kwd>Vector</kwd>
        <kwd>FCA</kwd>
        <kwd>Moebius</kwd>
        <kwd>Topic Evolution</kwd>
        <kwd>Clustering</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
    </sec>
    <sec id="sec-2">
      <title>Background</title>
      <sec id="sec-2-1">
        <title>Vector model</title>
        <p>The vector model [12] of documents dates back to the 1970s. In the vector model, documents and user queries are represented by vectors.</p>
        <p>We use m different terms t1, ..., tm for indexing n documents. Then each document di is represented by a vector
di = (wi1, wi2, ..., wim),
where wij is the weight of the term tj in the document di.</p>
        <p>An index file of the vector model is represented by a matrix whose i-th row matches the i-th document and whose j-th column matches the j-th term. In the vector model, a query is represented by an m-dimensional vector
q = (q1, q2, ..., qm),
where qj ∈ ⟨0, 1⟩. On the basis of the query q we can compute a coefficient of similarity for each document di. This coefficient can be understood as the "distance" between the document's vector and the vector of the query. We use the cosine measure to compute this similarity:
sim(q, di) = (Σ_{k=1..m} qk wik) / (√(Σ_{k=1..m} qk²) √(Σ_{k=1..m} wik²)).</p>
        <p>The similarity of two documents is given by the following formula:
sim(di, dj) = (Σ_{k=1..m} wik wjk) / (√(Σ_{k=1..m} wik²) √(Σ_{k=1..m} wjk²)).
For more information see [12].</p>
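        <p>As a small illustration, the cosine measure can be computed directly from the weight vectors. The following Python sketch uses made-up weights and is only a minimal example, not the authors' implementation:</p>

```python
from math import sqrt

def cosine_sim(q, d):
    """Cosine similarity of a query vector q and a document vector d."""
    num = sum(qk * wk for qk, wk in zip(q, d))
    den = sqrt(sum(qk * qk for qk in q)) * sqrt(sum(wk * wk for wk in d))
    return num / den if den else 0.0

# Index file: one row per document, one column per term (weights w_ij).
docs = [
    [1.0, 1.0, 0.0],   # d1
    [0.0, 1.0, 1.0],   # d2
]
q = [1.0, 0.0, 0.0]    # query asking only for the first term

# Rank documents by similarity to the query, best first.
ranked = sorted(range(len(docs)), key=lambda i: cosine_sim(q, docs[i]), reverse=True)
```

        <p>Sorting by the coefficient of similarity reproduces the ordering of the vector model response.</p>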
      </sec>
      <sec id="sec-2-2">
        <title>Cluster analysis</title>
        <p>The main goal of cluster analysis is to find out whether there are any groups of similar objects. These groups are called clusters. We focus on object clustering, which can be divided into two steps: first we create the clusters, and then we look for relevant clusters [7]. The motivation for cluster analysis is contained in the cluster hypothesis [12].</p>
        <p>The process of searching for an ideal fragmentation of objects is also called clustering. We use agglomerative hierarchical clustering based on the similarity matrix. At the beginning, each object is considered a separate cluster. Clusters are then joined together in sequence; the algorithm ends when all objects form a single cluster.</p>
        <p>The similarity matrix SimC for a collection C may be written as
        | sim11 sim12 ... sim1n |
SimC =  | sim21 sim22 ... sim2n |
        |  ...   ...  ...  ...  |
        | simn1 simn2 ... simnn |
where the i-th row matches the i-th document and the j-th column the j-th document.</p>
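        <p>The agglomerative procedure above can be sketched as follows. Average linkage over the similarity matrix is an assumed choice here, since the text does not fix a linkage criterion:</p>

```python
def agglomerative(sim):
    """Agglomerative hierarchical clustering over a similarity matrix SimC.

    Each object starts as its own cluster; the pair of clusters with the
    highest average pairwise similarity is merged until one cluster remains.
    Returns the merge history as (cluster_a, cluster_b) pairs.
    """
    clusters = [[i] for i in range(len(sim))]
    history = []
    while len(clusters) > 1:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                # average pairwise similarity between clusters a and b
                s = sum(sim[i][j] for i in clusters[a] for j in clusters[b])
                s /= len(clusters[a]) * len(clusters[b])
                if best is None or s > best[0]:
                    best = (s, a, b)
        _, a, b = best
        history.append((clusters[a], clusters[b]))
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return history

# Toy symmetric similarity matrix for three documents.
sim = [
    [1.0, 0.9, 0.1],
    [0.9, 1.0, 0.2],
    [0.1, 0.2, 1.0],
]
history = agglomerative(sim)   # documents 0 and 1 are merged first
```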
      </sec>
      <sec id="sec-2-3">
        <title>Formal Concept Analysis</title>
        <p>FCA was defined by R. Wille and can be used for the hierarchical ordering of objects based on the objects' features. The basic terms are formal context and formal concept. This section contains all the important definitions one needs to understand the topic.</p>
        <p>Definition 1. A formal context C = (G, M, I) consists of two sets G and M and a relation I between G and M. Elements of G are called objects and elements of M are called attributes of the context. In order to express that an object g is in a relation I with an attribute m, we write gIm and read it as "the object g has the attribute m". The relation I is also called the incidence relation of the context.</p>
        <p>Definition 2. For a set A ⊆ G of objects we define
A↑ = { m ∈ M | gIm for all g ∈ A },
the set of attributes common to the objects in A. Correspondingly, for a set B ⊆ M of attributes we define
B↓ = { g ∈ G | gIm for all m ∈ B },
the set of objects which have all attributes in B.</p>
        <p>Definition 3. A formal concept of the context (G,M,I) is a pair (A, B) with
A ⊆ G, B ⊆ M , A↑ = B and B↓ = A. We call A the extent and B the intent of
the concept (A, B).</p>
        <p>Definition 4. Let M be the totality of all features deemed relevant in the specific context, and let I ⊆ G × M be the incidence relation that describes the features possessed by objects, i.e. (g, m) ∈ I whenever object g ∈ G possesses a feature m ∈ M. For each relevant feature m ∈ M, let λ(m) ≥ 0 quantify the importance or weight of feature m. The diversity value of a set S of objects is defined as
v(S) = Σ λ(m) over all features m ∈ M possessed by at least one object in S.   (1)</p>
        <p>Our approach is also based on the Conjugate Moebius Function and on some properties coming from the Theory of Diversity and Formal Concept Analysis.</p>
        <p>Theorem 1. For any function v : 2^M → R with v(∅) = 0 there exists a unique function λ : 2^M → R, the Conjugate Moebius Inverse function, such that λ(∅) = 0 and, for all S,
v(S) = Σ_{A : A ∩ S ≠ ∅} λ(A).   (2)</p>
        <p>The diversity of an object (document) g is the sum of the weights of all features related to the object according to the incidence matrix. It conveys information about the partial importance of an object but does not clearly display other dependences:
do(g) = Σ_{m ∈ M : gIm} λ(m).   (4)</p>
        <p>The next characteristic is called the sum of diversities of all objects of a concept C; together, the objects of one concept can “cover” all of its features:
sdo(C) = Σ_{g ∈ C} do(g).   (5)</p>
        <p>The importance of the object (document) g is the main point of our method. The value represents the importance from these aspects:
• Uniqueness - is there any other similar object?
• Range of description - which dimensions does the object describe?
• Weight of description - what is the weight of the object in each dimension?
impo(g) = Σ_{C : C ∋ g} do(g) / sdo(C).   (8)
For more information see [3].</p>
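        <p>A minimal sketch of the characteristics of this section, assuming a naive enumeration of formal concepts and the importance formula (8) as reconstructed here; the toy context and the weights λ(m) are made up:</p>

```python
from itertools import combinations

def up(A, I, M):
    """A↑ : the set of attributes common to all objects in A."""
    return frozenset(m for m in M if all((g, m) in I for g in A))

def down(B, I, G):
    """B↓ : the set of objects having all attributes in B."""
    return frozenset(g for g in G if all((g, m) in I for m in B))

def concepts(G, M, I):
    """All formal concepts (A, B) with A↑ = B and B↓ = A (naive enumeration)."""
    found = set()
    for r in range(len(M) + 1):
        for B in combinations(sorted(M), r):
            A = down(frozenset(B), I, G)
            found.add((A, up(A, I, M)))
    return found

# Toy context: objects are documents, attributes are terms.
G = {"d1", "d2"}
M = {"t1", "t2"}
I = {("d1", "t1"), ("d1", "t2"), ("d2", "t2")}
lam = {"t1": 1.0, "t2": 2.0}               # assumed feature weights λ(m)

def do(g):
    """Diversity of an object, formula (4): sum of weights of its features."""
    return sum(lam[m] for m in M if (g, m) in I)

def sdo(A):
    """Sum of diversities of all objects in a concept extent, formula (5)."""
    return sum(do(g) for g in A)

def impo(g):
    """Importance of an object as reconstructed in formula (8): the object's
    share of diversity, summed over all concepts whose extent contains it."""
    return sum(do(g) / sdo(A) for A, _ in concepts(G, M, I) if g in A)
```

        <p>On this toy context impo("d1") exceeds impo("d2"): d1 possesses the unique feature t1 and therefore dominates every concept it appears in.</p>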
      </sec>
    </sec>
    <sec id="sec-3">
      <title>Vector Model Improvement</title>
      <sec id="sec-3-1">
        <title>Using FCA to obtain the importance of documents</title>
        <p>This method is based (a) on the partial ordering of concepts in the concept lattice and (b) on the inverse calculation of weights of objects using the Moebius function and the defined characteristics. The particular steps are illustrated by fig. 1 and briefly described in this chapter.</p>
        <p>First we obtain the input data (documents and words) as a table or matrix. In the second step, a scaling method is used to create an input incidence matrix: every dimension can be scaled to a finite number of parts to obtain binary values, or we can simply replace every non-zero value with one and every other value with zero. The output of this transformation is the incidence matrix that we need as input for the concept computation. Next, the set of concepts is computed using FCA algorithms. We could create the concept lattice and draw the Hasse diagram, but that is not important in our method; it can be useful to show dependences between concepts if needed. We use only the list of concepts. After that, we compute the basic characteristics for each concept according to formulas (4)-(7). Finally, we compute the importance of objects according to formula (8). The obtained values provide the criterion for sorting the set of objects.</p>
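        <p>The scaling step can be illustrated by simple thresholding of the weight matrix; the threshold parameter is an assumption of this sketch:</p>

```python
def to_incidence(weights, threshold=0.0):
    """Scale a term-weight matrix to a binary incidence matrix:
    weights above the threshold become 1, all others become 0."""
    return [[1 if w > threshold else 0 for w in row] for row in weights]

weights = [[0.0, 0.4], [0.7, 0.0]]
incidence = to_incidence(weights)   # [[0, 1], [1, 0]]
```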
      </sec>
      <sec id="sec-3-2">
        <title>Evolution of topic</title>
        <p>Our research concerns topics that undergo an evolution. Let us assume a document from a collection that describes some topic. It is clear that there are other documents in the collection describing the same topic, but they use different words to characterize it. The difference can have many reasons: the first document focused on the topic uses some set of words, while later documents may use synonyms or reflect, for example, the exploration of new circumstances, new facts, a new political situation, etc. [4].</p>
        <p>The result of searching for an evolution of topic is, for a given query, a list of documents thematically related to that query. The query may be specified by terms or by a document marked as relevant.</p>
        <p>We define one algorithm based on formal concept analysis and another based on clustering. Our research answers the question "What is the better way to improve the results of the vector model?" This is our algorithm using FCA and the Moebius function:
Algorithm TOPIC-FCA:
1. We transform the query, i.e. we create a weighted vector of terms.
2. We compute the importances of documents (objects) by formula (8) and make a list of the documents and their importances.
3. We find the relevant document reld in the ordered list.
4. In a finite number of steps, we look for the "nearest" documents. The "nearest" document is the one with the smallest difference between its weight and the weight of reld. A found document is excluded before the next step.</p>
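        <p>The steps above can be sketched in Python; the importance values below are made up, and `topic_fca` only mirrors step 4's nearest-by-importance search:</p>

```python
def topic_fca(importances, reld, level):
    """TOPIC-FCA nearest-importance search (sketch).

    importances: dict document -> importance value (formula (8)).
    reld:        the selected relevant document.
    level:       how many "nearest" documents to collect.
    """
    target = importances[reld]
    remaining = {d: w for d, w in importances.items() if d != reld}
    topic = []
    for _ in range(min(level, len(remaining))):
        # nearest = smallest difference between its weight and reld's weight
        nearest = min(remaining, key=lambda d: abs(remaining[d] - target))
        topic.append(nearest)
        del remaining[nearest]           # exclude before the next step
    return topic

# Made-up importances; with reld = doc2 the evolution is doc3, doc4, doc1.
imps = {"doc1": 0.9, "doc2": 0.5, "doc3": 0.45, "doc4": 0.15}
topic = topic_fca(imps, "doc2", 3)       # ['doc3', 'doc4', 'doc1']
```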
        <p>Then we use this algorithm for clustering:</p>
      </sec>
      <sec id="sec-3-3">
        <title>Algorithm TOPIC-CA :</title>
        <p>1. We choose the total number of documents we want (level).
2. We find the leaf cluster which contains the selected relevant document.
3. We move one level up in the hierarchy.
4. We explore the neighbouring clusters, selecting first the cluster created on the highest sub-level. Each document we find is added to the result list. When the number of documents in the result list equals level, we stop.
5. We repeat from step 3.</p>
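        <p>The steps above can be sketched as follows, assuming the cluster hierarchy is given as a simple parent/children tree; the traversal order of neighbouring clusters is simplified to stored order:</p>

```python
def topic_ca(tree, leaves, reld, level):
    """TOPIC-CA (sketch): climb the cluster hierarchy from the leaf cluster
    containing reld, collecting documents from neighbouring clusters until
    `level` documents are gathered.

    tree:   dict inner node -> list of child nodes.
    leaves: dict leaf node -> list of documents it contains.
    """
    parent = {c: p for p, cs in tree.items() for c in cs}

    def docs_under(node):
        if not tree.get(node):                 # leaf cluster
            return list(leaves.get(node, []))
        out = []
        for c in tree[node]:
            out.extend(docs_under(c))
        return out

    # step 2: find the leaf cluster containing the relevant document
    node = next(n for n, ds in leaves.items() if reld in ds)
    result = []
    while node in parent and len(result) < level:
        child, node = node, parent[node]       # step 3: get up in the hierarchy
        for c in tree[node]:                   # step 4: explore the neighbours
            if c == child:
                continue                       # this subtree was already handled
            for d in docs_under(c):
                if d != reld and len(result) < level:
                    result.append(d)
    return result

# Made-up hierarchy: doc2 sits with doc3 under c1; doc1 and doc4 under c2.
tree = {"root": ["c1", "c2"], "c1": ["l1", "l2"], "c2": ["l3"]}
leaves = {"l1": ["doc2"], "l2": ["doc3"], "l3": ["doc1", "doc4"]}
topic = topic_ca(tree, leaves, "doc2", 3)      # doc3 first, then doc1 and doc4
```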
      </sec>
      <sec id="sec-3-4">
        <title>Sort Response in Vector Model</title>
        <p>The vector model's response to a query is a collection of documents ordered by the coefficient of similarity between the query and each document. In this part we present a method that changes this response according to the evolution of topic obtained from clusters or concepts. Our approach is based on removing non-relevant documents from the response and adding other relevant documents to it. We have developed the following algorithm for this change. Algorithm SORT-EACH reorders the documents in the result of a vector model query so that documents belonging to the same evolution of topic are closer to each other:
1. The collection of documents from the vector query is marked CV.
2. The new sorted collection is marked CS; the number of its documents is the value of the variable count.
3. We choose the total number of documents in an evolution of topic and mark it level.
4. We perform the following sorting:
foreach document DV in CV do
    if CS is empty then
        add DV to collection CS
        goto Continue
    end
    find, by algorithm TOPIC-FCA (or TOPIC-CA), the collection CT of the evolution of topic of document DV; the number of documents in the topic is level + 1 (including document DV)
    foreach document DT in CT other than DV do
        if document DT is in CS then
            add document DV behind DT in CS
            goto Continue
        end
    end
    if DV was not added then
        add DV to the end of collection CS
    end
    label: Continue
end
5. We return collection CS to the user.</p>
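        <p>The algorithm can be sketched in Python; `topic_of` is a stand-in (an assumption of this sketch) for TOPIC-FCA or TOPIC-CA, and the example topics are made up:</p>

```python
def sort_each(cv, topic_of, level):
    """SORT-EACH (sketch): reorder the vector-query result cv so that
    documents from the same evolution of topic sit next to each other.

    topic_of(d, level) stands in for TOPIC-FCA / TOPIC-CA and returns the
    evolution of topic of document d (up to `level` documents).
    """
    cs = []
    for dv in cv:
        if not cs:
            cs.append(dv)                      # first document starts CS
            continue
        placed = False
        for dt in topic_of(dv, level):
            if dt in cs:
                cs.insert(cs.index(dt) + 1, dv)   # add dv behind dt
                placed = True
                break
        if not placed:
            cs.append(dv)                      # add dv to the end of CS
    return cs

# Hypothetical topics: doc1 and doc3 share a topic, doc2 and doc4 share one.
topics = {"doc1": ["doc3"], "doc3": ["doc1"], "doc2": ["doc4"], "doc4": ["doc2"]}
order = sort_each(["doc1", "doc2", "doc3", "doc4"],
                  lambda d, lvl: topics[d][:lvl], 1)
# order groups doc3 behind doc1 and doc4 behind doc2
```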
      </sec>
    </sec>
    <sec id="sec-4">
      <title>Illustrative sample of vector model improvement</title>
      <p>The following tables show experimental results on generated data. The documents' importances were computed according to formula (8). The document selected by the user is highlighted; it is the input document of the TOPIC algorithms above. Each query is transformed to a vector of term weights. We use a simplified matrix of documents and terms: the number "1" means that the document in the given row contains the term in the given column.</p>
      <p>In the tables we can see that vector queries give worse results in some cases, because they return zero values for documents that have no terms in common with the query. But these documents can be about the same theme described by different terms (words). Therefore we use the SORT-EACH algorithm to improve the vector query by TOPIC-FCA or TOPIC-CA; in these samples we use the new TOPIC-FCA algorithm. See [4] for other experiments.</p>
      <p>In the first sample (table 1), each of the returned documents contains the three requested terms, so the computed TOPIC-FCA (importances of objects) brings zero improvement.</p>
      <p>Next, we describe table 2. The values of the documents' importances show the relative importances according to the inserted query. There are only small differences between the importances of objects and the vector query. The distance between document number 1 and selected document number 2 is larger than the distance between documents number 2 and 3 (see the difference between the documents' importances). The distance of the vector query from each document, plus the distances between documents, are the main reason for this effect. The effect is better described by the following table 3.</p>
      <p>The vector query is "000111000000". We selected document number 2 again. Although the first document contains the same terms as the second document, the distance between them is very large because of the great number of terms the second document does not contain. The evolution of topic of the second document is then doc3, doc4 and at last doc1, so we get a different ordering than the ordering produced by the vector query.</p>
      <p>Table 4 shows the main deficiency of the vector query. When we insert the query "000111000000" we cannot obtain the fourth document. But our method includes this document because of its similarity to selected document number 2. So we can find new dependences between documents that can be about the same theme.</p>
      <p>The last table 5 best shows the hidden dependences between documents. Documents number 1 and 4 are not included in the vector query result, but we can say there can be some relation between them because of the common term number 9. The evolution of topic of the selected document is doc3, doc1 and doc4.</p>
      <p>We have tried to show the importance of our method on simple examples. If we use TOPIC-FCA or TOPIC-CA for vector query improvement, we can find other dependences between documents and obtain a better ordering of the requested documents.</p>
      <sec id="sec-4-1">
        <title>Sample graphs</title>
        <p>The following graphs show the documents' distances from selected document number two. The graphs on the left show the distances of documents after using the TOPIC-FCA algorithm, and the graphs on the right correspond to the results of the vector query. All distances were computed from selected document number 2.
Graph description:
– A node represents a document, or a cluster of documents if the documents' distance is zero. Node numbers correspond to document numbers.
– An edge connects comparable documents (nodes); its value is the distance between the corresponding documents.</p>
        <sec id="sec-4-1-1">
          <title>Documents’ distances computed from table 1.</title>
        </sec>
        <sec id="sec-4-1-2">
          <title>Documents’ distances computed from table 2.</title>
        </sec>
        <sec id="sec-4-1-3">
          <title>Documents’ distances computed from table 3.</title>
        </sec>
        <sec id="sec-4-1-4">
          <title>Documents’ distances computed from table 4.</title>
        </sec>
        <sec id="sec-4-1-5">
          <title>Documents’ distances computed from table 5.</title>
        </sec>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>Conclusion and future work</title>
      <p>We have described a new method for vector query improvement based on formal concept analysis and the Moebius inverse function. The known deficiencies of the vector model have been suppressed using the TOPIC and SORT-EACH algorithms. In future work we would like to test our methods on real data. Our presented methods can be applied to small data sets as well as to large collections of documents.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Berry</surname>
          </string-name>
          , M. W. (Ed.):
          <article-title>Survey of Text Mining: Clustering, Classification, and Retrieval</article-title>
          . <source>Springer Verlag</source>,
          <year>2003</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Baeza-Yates</surname>
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ribeiro-Neto</surname>
            <given-names>B.</given-names>
          </string-name>
          :
          <article-title>Modern Information Retrieval</article-title>
          .
          <source>Addison Wesley</source>
          , New York,
          <year>1999</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3. Ďuráková,
          <string-name>
            <surname>D.</surname>
          </string-name>
          , Gajdoš, P.:
          <article-title>Indicators Valuation using FCA and Moebius Inversion Function</article-title>
          . DATAKON, Brno,
          <year>2004</year>
          , ISBN 80-2103-516-1
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>Dvorský</surname>
          </string-name>
          J., Martinovič J., Snášel V.:
          <article-title>Query Expansion and Evolution of Topic in Information Retrieval Systems</article-title>
          ,
          <string-name>
            <surname>DATESO</surname>
          </string-name>
          <year>2004</year>
          , ISBN:
          <fpage>80</fpage>
          -
          <lpage>248</lpage>
          -0457-3.
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <surname>Dvorský</surname>
          </string-name>
          J., Martinovič J., Pokorný J., Snášel V.:
          <article-title>Search of Topics in a Collection of Documents (in Czech)</article-title>
          ,
          <source>Znalosti</source>
          <year>2004</year>
          , ISBN:
          <fpage>80</fpage>
          -
          <lpage>248</lpage>
          -0456-5.
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <surname>Ganter</surname>
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Wille</surname>
            <given-names>R.</given-names>
          </string-name>
          :
          <article-title>Formal Concept Analysis</article-title>
          .
          <source>Springer-Verlag, Berlin</source>
          , Heidelberg,
          <year>1999</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <given-names>Christos</given-names>
            <surname>Faloutsos</surname>
          </string-name>
          , Douglas Oard:
          <article-title>A Survey of Information Retrieval and Filtering Methods</article-title>
          ,
          <source>Univ. of Maryland Institute for Advanced Computer Studies Report</source>
          , College Park,
          <year>1995</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8. Keith van Rijsbergen:
          <source>The Geometry of Information Retrieval</source>
          , Cambridge University Press,
          <year>2004</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
          <string-name>
            <surname>Kummamuru</surname>
            <given-names>K</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lotlikar</surname>
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Roy</surname>
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Singal</surname>
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Krishnapuram</surname>
            <given-names>R.</given-names>
          </string-name>
          :
          <article-title>A Hierarchical Monothetic Document Clustering Algorithm for Summarization and Browsing Search Results</article-title>
          , WWW2004, New York, USA.
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          10.
          <string-name>
            <surname>Nehring</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          and
          <string-name>
            <surname>Puppe</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          :
          <article-title>Modelling phylogenetic diversity</article-title>
          .
          <source>Resource and Energy Economics</source>
          (
          <year>2002</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          11.
          <string-name>
            <surname>Nehring</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          :
          <source>A Theory of Diversity. Econometrica</source>
          <volume>70</volume>
          (
          <year>2002</year>
          ) 1155-1198.
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          12.
          <string-name>
            <surname>Pokorný</surname>
          </string-name>
          J., Snášel V., Húsek D.: Dokumentografické informační systémy (in Czech). Karolinum,
          <source>Skriptum MFF UK Praha</source>
          ,
          <year>1998</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          13.
          <string-name>
            <surname>C.J. van Rijsbergen</surname>
          </string-name>
          : Information Retrieval (second ed.). London, Butterworths,
          <year>1979</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          14.
          <string-name>
            <given-names>Tsunenori</given-names>
            <surname>Ishioka</surname>
          </string-name>
          :
          <article-title>Evaluation of Criteria for Information Retrieval</article-title>
          , International Conference on Web Intelligence, IEEE Computer Society,
          <year>2003</year>
          , ISBN 0-7695-1932-6.
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          15.
          <string-name>
            <given-names>S.</given-names>
            <surname>Vempala</surname>
          </string-name>
          : The Random Projection Method
          ,
          <source>Dimacs Series in Discrete Mathematics and Theoretical Computer Science</source>
          ,
          <year>2004</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>