<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Identifying Latent Semantics in High-Dimensional Web Data</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Ajit Kumar</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Sanjeev Maskara</string-name>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Jau-Min Wong</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>I-Jen Chiang</string-name>
          <email>ijchiang@tmu.edu.tw</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Graduate Institute of Biomedical Informatics, Taipei Medical University</institution>
          ,
          <country country="TW">Taiwan</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Institute of Biomedical Engineering, National Taiwan University</institution>
          ,
          <addr-line>Taipei</addr-line>
          ,
          <country country="TW">Taiwan</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>Ovens and King Community Health Services</institution>
          ,
          <addr-line>Wangaratta, Victoria</addr-line>
          ,
          <country country="AU">Australia</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>Search engines have become an indispensable tool for obtaining relevant information on the Web. A search engine often returns a large number of results, including many irrelevant items that make the results hard to comprehend. Search engines therefore need to be enhanced to discover the latent semantics in high-dimensional web data. This paper presents a novel framework for that purpose, named Latent Semantic Manifold (LSM), together with its implementation and evaluation. LSM is a mixture model based on the concepts of topology and probability. The framework finds the latent semantics in web data and represents them as homogeneous groups. In our experiments, the LSM framework outperformed the other frameworks it was compared with. In addition, we used the framework to develop a tool, which was deployed for two years at two places in Taiwan: a library and a biomedical engineering laboratory. The tool assisted researchers in performing semantic searches of the PubMed database. The evaluation and deployment of the LSM framework suggest that it could be used to enhance the functionality of currently available search engines by discovering latent semantics in high-dimensional web data.</p>
      </abstract>
      <kwd-group>
        <kwd>latent semantic manifold</kwd>
        <kwd>semantic cluster</kwd>
        <kwd>conditional random field</kwd>
        <kwd>hidden Markov models</kwd>
        <kwd>graph-based tree-width decomposition</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>
        Gigantic repositories of data, text, and media have grown rapidly [
        <xref ref-type="bibr" rid="ref1 ref2 ref3 ref4 ref5">1-5</xref>
        ].
These repositories are available on the World Wide Web for public use. Search
engines help users quickly find content relevant to them [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]. However,
search engines often return inconsistent, uninteresting, and disorganized results
for several reasons [
        <xref ref-type="bibr" rid="ref5 ref6">5, 6</xref>
        ]. First, web pages are heterogeneous and of
varying quality [
        <xref ref-type="bibr" rid="ref6 ref7">6, 7</xref>
        ]. Second, the relationships among words (polysemy,
synonymy, and homophony) and among sentences (paraphrase, entailment, and contradiction),
together with lexical and structural ambiguities, limit search technologies and
diminish the power of search engines [
        <xref ref-type="bibr" rid="ref8 ref9">8, 9</xref>
        ]. Users have to devote substantial time to
separating meaningful items from the rest of the generated results [
        <xref ref-type="bibr" rid="ref10 ref11 ref5">5, 10, 11</xref>
        ]. Thus,
search engines need to be enhanced to filter and organize
the meaningful items among the irrelevant results generated by search queries [
        <xref ref-type="bibr" rid="ref12 ref13">12, 13</xref>
        ]. An effective search approach fits search results to the users’ intent by
discovering the latent semantics in the generated documents and then classifying the documents
into ‘homogeneous semantic clusters’ [
        <xref ref-type="bibr" rid="ref14 ref15">14, 15</xref>
        ]. In this approach, each semantic cluster
is seen as a ‘topic’ that summarizes a group of the generated documents. The
users can then explore the topics that are relevant to their intent. For example, the query term
APC (Adenomatous Polyposis Coli) can be used to retrieve article abstracts from
PubMed. However, the results would include not only articles about
Adenomatous Polyposis Coli but also articles about Antigen Presenting Cells (APC),
Anaphase Promoting Complex (APC), and Activated Protein C (APC). The users
must go through the returned abstracts to find the articles relevant to their intent
(here, Adenomatous Polyposis Coli). Similarly, the query term
‘network’ might generate different topics depending on whether it occurs near a term such as
‘computer’, ‘traffic’, ‘artificial neural’, or ‘biological neural’ in the searched documents.
The generated results should be relevant, not merely outbound links pertaining to
the query terms. To facilitate and enhance access to relevant information for
web users, search engines must deal with the ambiguity, elusiveness, and
imprecision of the users’ requests [
        <xref ref-type="bibr" rid="ref16">16</xref>
        ].
      </p>
      <p>
        Several researchers have made efforts towards semantic search of giant repositories.
For example, deterministic search provides a metadata-enhanced search facility,
wherein a user preselects different facets to generate more relevant search results [
        <xref ref-type="bibr" rid="ref17">17</xref>
        ].
However, scaling the metadata-enhanced search facility to the web is difficult and
requires many experts to define a controlled vocabulary that creates unique labels for
concepts sharing the same terminology [
        <xref ref-type="bibr" rid="ref18 ref19">18, 19</xref>
        ]. A revolutionary change in information
retrieval was realized by the introduction of the tf-idf scheme [
        <xref ref-type="bibr" rid="ref20 ref21 ref22">20-22</xref>
        ]. In this
scheme, the document collection is represented as a document-by-term matrix, which is
usually very high dimensional and sparse. Often, the matrix holds
thousands of terms for a single document, and most of the entries are zero. The
tf-idf scheme can eliminate some terms; however, it provides only a relatively small
reduction, which is not enough to reveal the statistical structure within or
between documents. In the last decades, other dimension reduction techniques
such as Latent Semantic Indexing, Probabilistic Latent Semantic Indexing, and Latent
Dirichlet Allocation were proposed to overcome some of these shortcomings.
However, all of these are bag-of-words models. Bag-of-words models follow the
Aldous and de Finetti theorems of exchangeability, wherein the ‘order of terms in a
document’ or the ‘order of documents in a corpus’ can be neglected [
        <xref ref-type="bibr" rid="ref23 ref24 ref25">23-25</xref>
        ]. Because the spatial
information conveyed by the ‘terms in the document’ or ‘documents in a corpus’ is
largely neglected, a statistical issue is attached to these bag-of-words
models [
        <xref ref-type="bibr" rid="ref24 ref25 ref26 ref27">24-27</xref>
        ]. In probability theory, the random variables (here, terms)
<inline-formula><tex-math>t_1, t_2, \ldots, t_N</tex-math></inline-formula> are said to be exchangeable if the joint distribution
<inline-formula><tex-math>F(t_1, t_2, \ldots, t_N)</tex-math></inline-formula> is invariant under permutation of its arguments, so that
<inline-formula><tex-math>F(z_1, z_2, \ldots, z_N) = F(t_1, t_2, \ldots, t_N)</tex-math></inline-formula>, where
<inline-formula><tex-math>(z_1, z_2, \ldots, z_N)</tex-math></inline-formula> is a permutation of
<inline-formula><tex-math>(t_1, t_2, \ldots, t_N)</tex-math></inline-formula>. Semantics, however, arise from terms that co-occur
within relationships and within a limited number of terms. The criterion that ‘the order of terms in
a document can be neglected’ should therefore be modified to ‘the order of terms in a
relationship of a document can be neglected’. Similarly, ‘the order of documents in a corpus
can be neglected’ should be modified to ‘the order of documents in a relationship of a
corpus can be neglected’. For example, the query term ‘network’ would yield different
‘topics’ if it occurs near a term such as ‘computer’, ‘traffic’, ‘artificial neural’, or
‘biological neural’, while the ‘order of terms in a relationship’ can be neglected.
      </p>
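      <p>To illustrate the distinction drawn above, the following Python sketch (an illustration only, not part of the LSM implementation) contrasts an order-free bag-of-words representation with one that keeps neighbouring-term relationships. The function names, the example terms, and the use of adjacent term pairs as a stand-in for ‘relationships’ are assumptions made for this example.</p>
      <preformat><![CDATA[
```python
from collections import Counter
from itertools import permutations


def bag_of_words(terms):
    """Order-free representation: a multiset of terms."""
    return Counter(terms)


def adjacent_pairs(terms):
    """Relationship-aware representation: unordered pairs of neighbouring terms.

    Within a pair the order is ignored, but which terms are neighbours is kept.
    """
    return {frozenset(pair) for pair in zip(terms, terms[1:])}


# Hypothetical fragment and a reordering of the same terms.
fragment = ["artificial", "neural", "network", "traffic"]
shuffled = ["traffic", "neural", "artificial", "network"]

# Exchangeability: every permutation has the same bag of words.
all_permutations_same_bag = all(
    bag_of_words(p) == bag_of_words(fragment) for p in permutations(fragment)
)

same_bag = bag_of_words(fragment) == bag_of_words(shuffled)
same_pairs = adjacent_pairs(fragment) == adjacent_pairs(shuffled)

print(same_bag, same_pairs)
```
]]></preformat>
      <p>Both orderings yield the same bag of words, so a bag-of-words model cannot tell them apart, whereas the pair-based view separates them because ‘network’ has different neighbours in each.</p>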
      <p>As the literature and arguments above show, there is a
need to enhance search engines to reveal the latent semantics in high-dimensional web
data while preserving the relationship and order of terms or documents.
Therefore, we propose a Latent Semantic Manifold (LSM) framework that identifies
homogeneous groups in web data while preserving the spatial information of terms in a
document or of documents in a corpus. This paper explains the Latent
Semantic Manifold framework (hereinafter, LSM framework), including its implementation
and evaluation.</p>
    </sec>
    <sec id="sec-2">
      <title>Materials and Methods</title>
      <p>This study consists of three key components: the proposal of a novel theoretical
framework, its implementation, and its evaluation. They are explained in the following
subsections.</p>
      <sec id="sec-2-1">
        <title>Theoretical framework</title>
        <p>The proposed Latent Semantic Manifold (LSM) framework is a mixture model based
on the concepts of probability and topology that identifies the latent semantics in
data. The concepts used in the LSM framework are explained in the following four
steps. Figure 1 shows a high-level view of the framework.</p>
      </sec>
      <sec id="sec-2-2">
        <title>Step 1: A ‘query’ entry for searching the high-dimensional web data</title>
        <p>The user enters a ‘query’ into a search engine, which generates a set of
documents. The generated documents are processed to extract their semantics; the unit of
processing can be a sentence, a paragraph, a section, or even a whole document. These units
are referred to as ‘fragments’ in Steps 2 and 3. For example, the sentence
‘Jaguar is an animal living in Jungle’ can be considered a ‘fragment’.
At times, ‘fragment’ also means context; we state this
explicitly where it applies.</p>
      </sec>
      <sec id="sec-2-3">
        <title>Step 2: Named-entity recognition and heterogeneous manifold construction</title>
        <p>
          The significant noun terms are identified from the ‘fragments’. For example, if the
sentence ‘Jaguar is an animal living in Jungle’ is taken as a fragment,
‘Jaguar,’ ‘animal,’ and ‘Jungle’ are its significant ‘noun terms’. Natural language
processing methods known as named-entity recognition are used to select each named-entity
(noun term) and its ‘type’ [
          <xref ref-type="bibr" rid="ref28">28</xref>
          ]. Named-entity recognition and classification
algorithms extract the named-entities (noun terms) from fragments and then classify
those entities by ‘type’, such as person, organization, or location. For example,
‘jaguar’ is considered a named-entity, and it is assigned to the animal or vehicle
‘type’ depending on the fragment (context). The named-entities are annotated with
their marginal probabilities, and the correlations among the named-entities are
expressed by their conditional probabilities. As shown in Figure 2, Jaguar is a
named-entity with three possible types: animal, vehicle, and instrument. It has marginal
probabilities such as <inline-formula><tex-math>P_{\mathrm{animal}}(\mathrm{Jaguar})</tex-math></inline-formula>,
<inline-formula><tex-math>P_{\mathrm{vehicle}}(\mathrm{Jaguar})</tex-math></inline-formula>, and
<inline-formula><tex-math>P_{\mathrm{instrument}}(\mathrm{Jaguar})</tex-math></inline-formula>.
Similarly, it has conditional probabilities such as
<inline-formula><tex-math>P(\mathrm{Jaguar}, \mathrm{Car} \mid \mathrm{Vehicle})</tex-math></inline-formula> and
<inline-formula><tex-math>P(\mathrm{Jaguar}, \mathrm{Motorcycle} \mid \mathrm{Vehicle})</tex-math></inline-formula>.
        </p>
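        <p>As a rough illustration of the marginal and conditional probabilities described above, the following Python sketch estimates them by relative frequency from a handful of tagged observations. The observations, the function names, and the relative-frequency estimator itself are assumptions made for this example; the text does not specify how the probabilities are estimated in LSM.</p>
        <preformat><![CDATA[
```python
# Hypothetical tagged observations: (entity, type, co-occurring entity).
observations = [
    ("Jaguar", "animal", "Jungle"),
    ("Jaguar", "animal", "Jungle"),
    ("Jaguar", "vehicle", "Car"),
    ("Jaguar", "vehicle", "Motorcycle"),
    ("Jaguar", "instrument", "Guitar"),
]


def marginal(entity, etype, obs):
    """Estimate P_type(entity): share of the entity's observations with this type."""
    of_entity = [o for o in obs if o[0] == entity]
    return sum(1 for o in of_entity if o[1] == etype) / len(of_entity)


def conditional(entity, other, etype, obs):
    """Estimate P(entity, other | type): share of this type's observations
    in which the two entities occur together."""
    of_type = [o for o in obs if o[1] == etype]
    hits = sum(1 for o in of_type if o[0] == entity and o[2] == other)
    return hits / len(of_type)


p_animal = marginal("Jaguar", "animal", observations)
p_car_given_vehicle = conditional("Jaguar", "Car", "vehicle", observations)
print(p_animal, p_car_given_vehicle)
```
]]></preformat>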
        <p>
          Although we could enumerate all possible types of terms, including their marginal
and conditional probabilities, using a large number of training documents, doing so
is computationally expensive. Therefore, only nouns (words or phrases) are retained
instead of identifying all types of terms and their probabilities [
          <xref ref-type="bibr" rid="ref29 ref30 ref31 ref32">29-32</xref>
          ]. Hidden
Markov Models (HMMs) have often been used to extract ‘terms’ and their ‘relationships’
[
          <xref ref-type="bibr" rid="ref33 ref34">33, 34</xref>
          ]. In the last decade, the discriminative linear-chain Conditional Random Field
(CRF) has also been used to extract ‘terms’ from a corpus [
          <xref ref-type="bibr" rid="ref35 ref36 ref37 ref38 ref39">35-39</xref>
          ]. In this study, we used
trained Markov Natural Language Processing (NLP) Part-of-Speech (POS) tagging
models to extract all named-entities (noun terms and their types) by inference over the
‘fragments’ [
          <xref ref-type="bibr" rid="ref31 ref32">31, 32</xref>
          ]. The relationships among those named-entities form a
complex manifold structure. Because this manifold is heterogeneous,
we call it the ‘heterogeneous manifold’ hereinafter.
        </p>
      </sec>
      <sec id="sec-2-4">
        <title>Step 3: Decomposing a heterogeneous manifold into homogeneous manifolds</title>
        <p>
          As mentioned in Step 2, the heterogeneous manifold consists of the complex structure
of named-entities, including estimates of their marginal and conditional probabilities. A
collection of fragment vectors lies on the heterogeneous manifold, which contains
local spaces resembling Euclidean spaces of a fixed number of dimensions. Every
point of the n-dimensional heterogeneous manifold has a neighborhood
homeomorphic to the n-dimensional Euclidean space
<inline-formula><tex-math>\mathbb{R}^n</tex-math></inline-formula>. In addition, all points in a ‘local
space’ are strongly connected. Because the heterogeneous manifold is overly complex and
the semantics are latent in the ‘local spaces’, instead of retaining a single
heterogeneous manifold we break it into a collection of ‘homogeneous manifolds’.
Topological and geometrical concepts are used to represent the latent semantics of
a heterogeneous manifold as a collection of homogeneous manifolds. A graph-based
treewidth decomposition algorithm is used to decompose the heterogeneous
manifold into the collection of homogeneous manifolds [
          <xref ref-type="bibr" rid="ref40">40</xref>
          ]. As shown in Figure 3,
taking ‘Jaguar’ as the heterogeneous manifold, we can decompose it into three
‘homogeneous manifolds’ bounded by dotted lines of three different colors.
Later, a local manifold is decomposed into two local manifolds that are not
adjacent. The decomposition recurses until no further decomposition is possible.
        </p>
        <p>We can express the above concept formally. Let the heterogeneous manifold
<inline-formula><tex-math>M_i</tex-math></inline-formula> for fragment
<inline-formula><tex-math>i</tex-math></inline-formula> be the set of homogeneous manifolds such that
<inline-formula><tex-math>M_i = \{\, M_{ij} \mid \text{no } M_{ij} \text{ is a subset of } M_{ik},\ j \neq k \,\}</tex-math></inline-formula>.
The semantics generated from local
homogeneous manifolds, which are equipped with fragments, are independent. In addition,
a semantic topic set <inline-formula><tex-math>C = \{z_1, z_2, \ldots, z_m\}</tex-math></inline-formula> of the returned documents is associated with a
semantic mapping <inline-formula><tex-math>f : M_{ij} \to C</tex-math></inline-formula> with a probability
<inline-formula><tex-math>P(M_{ij}, z_k) \in [0, 1]</tex-math></inline-formula> and quantity
<inline-formula><tex-math>f(M_{ij}) = z_k</tex-math></inline-formula>. The probabilities indicate how many
documents pertaining to a homogeneous manifold are relevant and match the user’s
intent. To induce homogeneous manifolds, it is crucial to extract significant ‘terms’
from fragments. In addition, the relevance of each fragment to
its homogeneous manifold should be demonstrated. The users can then consult only the
fragments associated with the homogeneous manifolds they are interested in.</p>
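        <p>The set condition in the formal definition above (no homogeneous manifold is a subset of another) can be sketched in Python as a filter that keeps only maximal term sets. This is a minimal sketch of that one condition, not of the treewidth decomposition itself; the function name and the candidate term sets are hypothetical.</p>
        <preformat><![CDATA[
```python
def maximal_manifolds(candidates):
    """Keep only the candidate term sets that are not properly contained
    in any other candidate, mirroring the 'no M_ij is a subset of M_ik' rule."""
    unique = {frozenset(c) for c in candidates}
    return [m for m in unique if not any(m < other for other in unique)]


# Hypothetical candidate term sets around the entity 'Jaguar'.
candidates = [
    {"Jaguar", "animal", "Jungle"},
    {"Jaguar", "animal"},            # contained in the first set, so dropped
    {"Jaguar", "vehicle", "Car"},
    {"Jaguar", "instrument"},
]

homogeneous = maximal_manifolds(candidates)
print(len(homogeneous))
```
]]></preformat>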
      </sec>
      <sec id="sec-2-5">
        <title>Step 4: Exploring homogeneous manifold</title>
        <p>
          The search-generated documents (referred to as fragments in Steps 2 and 3) are clustered
into their related homogeneous manifolds. For example, when the user queries the term
APC, the returned documents are aggregated into a collection
of homogeneous manifolds, as shown in Figure 5. Each document is assigned to a
particular homogeneous manifold. The occurrence of a particular document in the
whole set of documents denotes its significance in the homogeneous manifold.
        </p>
        <p>
          Implementation. The LSM framework was implemented using the Eclipse Software Development Kit.
A team of three researchers, experts in the Java programming language,
developed the entire system. The development took almost 11 months. We provided a
straightforward search interface, shown in Figure 6 in the Results section.
The output for a user query term (for example, APC) is shown in Figure 7 in the
Results section. The system was deployed for two years at two places in Taiwan: a library and a
biomedical engineering laboratory. It assisted researchers in
performing semantic searches of the PubMed database. For example, a researcher can
search for APC with Adenomatous Polyposis Coli as the intended meaning.
However, APC can also mean Antigen-Presenting Cells, Anaphase
Promoting Complex, or Activated Protein C, among others. If documents related to
APC, Colorectal Cancer, and Gene are assembled in a homogeneous manifold,
that manifold points to APC as the Adenomatous Polyposis Coli gene.
Similarly, if documents related to APC, Major Histocompatibility Complex, and T-cells
are assembled, the manifold indicates APC as Antigen Presenting Cells. In the Results
section, Figure 8 shows that documents returned for the query term APC are
automatically associated with homogeneous manifolds (semantic topics). In addition,
researchers can obtain a different ‘vantage point’ based on the underlying data. For
example, a PubMed search retrieved 300 randomly selected abstracts of published or in-press
articles for the medical term NOD2. Figure 9 shows the latent semantic topics obtained as
the clustering result. According to the result, inflammatory bowel disease and its types
(Crohn’s disease and ulcerative colitis) are associated with the gene NOD2. The term
NOD2 was found to be evenly spread over these three topics. Some emerging topics,
such as ‘bacterial component’, were also
discovered. However, when we searched for NOD2 in the Genia corpus†, the result was
different, as shown in Figure 10 [
          <xref ref-type="bibr" rid="ref42">42</xref>
          ].
        </p>
      </sec>
      <sec id="sec-2-6">
        <title>Experiment</title>
        <p>Data Sets. Two data sets, Reuters-21578-Distribution-1 and OHSUMED, were used
to evaluate the performance of the LSM framework and its implementation. The
Reuters-21578-Distribution-1 collection consists of newswire articles. The articles are
classified into 135 topics, which were used to verify the clustering results. In the test,
documents with multiple topics (category labels) were separated from those with a single topic.
Topics with fewer than five documents were removed. Table 1 summarizes
the Reuters-21578-Distribution-1 collection.</p>
        <p>
          OHSUMED is a clinically oriented Medline collection, consisting of 348,566
references. It covers all the references from 270 medical journals of 23 disease
categories over a five-year period (1987-1991) [
          <xref ref-type="bibr" rid="ref43">43</xref>
          ].
        </p>
        <p>Evaluation criteria. The experimental evaluation of the LSM framework
measured both effectiveness and efficiency. Effectiveness is defined as the ability to
identify the right cluster (collection of documents). To measure effectiveness,
the generated clusters were verified by human experts, as shown in Table 2.</p>
        <p>† The Genia corpus contains 1,999 Medline abstracts, selected using a PubMed query for the three MeSH terms
‘human,’ ‘blood cells,’ and ‘transcription factors’.</p>
        <p>‡ TP – True Positive; FP – False Positive; FN – False Negative; TN – True Negative</p>
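        <sec id="sec-2-6-1">
          <title>Precision, recall, and F-measure</title>
          <p>The three measures of the effectiveness of clustering methods (Precision, Recall,
and <inline-formula><tex-math>F_\beta</tex-math></inline-formula>) were calculated using the contingency table (Table 2). The Precision and Recall
of cluster <inline-formula><tex-math>i</tex-math></inline-formula> are defined respectively as follows.</p>
          <p>
            <disp-formula><label>(1)</label><tex-math>\mathrm{Precision}_i = \frac{TP_i}{TP_i + FP_i}</tex-math></disp-formula>
          </p>
          <p>
            <disp-formula><label>(2)</label><tex-math>\mathrm{Recall}_i = \frac{TP_i}{TP_i + FN_i}</tex-math></disp-formula>
          </p>
          <p>The F measure, which combines Precision and Recall, is defined as follows.</p>
          <p>
            <disp-formula><label>(3)</label><tex-math>F_\beta = \frac{(\beta^2 + 1)\,\mathrm{Precision}_i \cdot \mathrm{Recall}_i}{\beta^2 \cdot \mathrm{Precision}_i + \mathrm{Recall}_i}</tex-math></disp-formula>
          </p>
          <p>The F1 measure is used in this paper. It is obtained by setting <inline-formula><tex-math>\beta</tex-math></inline-formula> to 1, which
means that precision and recall have equal weight in evaluating the performance. When
many categories are generated and compared, the overall precision and recall are
calculated as the averages of the precisions and recalls of the individual categories.
F1 is calculated as the mean of all results, which is a macro-average over the categories.</p>
        </sec>
        <sec id="sec-2-6-2">
          <title>Normalized mutual information and overall F-measure</title>
          <p>
            In addition, two other evaluation metrics, normalized mutual information and
overall F-measure, were also used [
            <xref ref-type="bibr" rid="ref44 ref45 ref46">44-46</xref>
            ]. Given two sets of topics C and C′, let
C denote the topic set defined by experts and C′ denote the topic set generated
by a clustering method, both derived from the same corpus X. Let N(X)
denote the total number of documents, N(z, X) the number of documents
in topic z, and N(z, z′, X) the number of documents in both topic z and
topic z′, for any topic z in C and z′ in C′. The mutual information between C and C′,
on which the normalized mutual information (NMI) metric is based, is defined as follows.
          </p>
          <p>
            <disp-formula><label>(4)</label><tex-math>MI(C, C') = \sum_{z \in C,\, z' \in C'} P(z, z') \log_2 \frac{P(z, z')}{P(z)\,P(z')}</tex-math></disp-formula>
          </p>
          <p>Here <inline-formula><tex-math>P(z) = N(z, X) / N(X)</tex-math></inline-formula>,
<inline-formula><tex-math>P(z') = N(z', X) / N(X)</tex-math></inline-formula>, and
<inline-formula><tex-math>P(z, z') = N(z, z', X) / N(X)</tex-math></inline-formula>. The
metric <inline-formula><tex-math>MI(C, C')</tex-math></inline-formula> returns a value between zero and
<inline-formula><tex-math>\max(H(C), H(C'))</tex-math></inline-formula>, where
<inline-formula><tex-math>H(C)</tex-math></inline-formula> and <inline-formula><tex-math>H(C')</tex-math></inline-formula> are the entropies of C and C′
respectively. A higher value means that the two topic sets are closer to identical; a lower value
means that they are more independent. The metric is, therefore,
normalized to</p>
          <p>
            <disp-formula><tex-math>\widehat{MI}(C, C') = \frac{MI(C, C')}{\max(H(C), H(C'))}</tex-math></disp-formula>
          </p>
          <p>Let <inline-formula><tex-math>F_i</tex-math></inline-formula> be the F-measure for each cluster
<inline-formula><tex-math>z_i</tex-math></inline-formula> defined above. The overall F-measure
can be defined as:</p>
          <p>
            <disp-formula><label>(5)</label><tex-math>F^{*} = \sum_{z' \in C'} P(z') \cdot \max_{z \in C} F(z, z')</tex-math></disp-formula>
          </p>
          <p>where <inline-formula><tex-math>F(z, z')</tex-math></inline-formula> is the F-measure between z and z′.</p>
        </sec>
        <sec id="sec-2-6-3">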
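          <p>The effectiveness measures defined above can be sketched in Python as follows. This is an illustrative sketch, not the evaluation code used in the study: the function names and the toy label lists are assumptions, the clusterings are represented as per-document label lists, and the normalization assumes that at least one of the two topic sets has more than one topic (otherwise both entropies are zero).</p>
          <preformat><![CDATA[
```python
import math
from collections import Counter


def precision_recall_f(tp, fp, fn, beta=1.0):
    """Precision, recall, and F_beta from contingency counts (tp must be > 0)."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    b2 = beta * beta
    f = (b2 + 1) * precision * recall / (b2 * precision + recall)
    return precision, recall, f


def entropy(labels):
    """Shannon entropy (base 2) of a list of topic labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def normalized_mutual_information(expert, predicted):
    """MI(C, C') divided by max(H(C), H(C'))."""
    n = len(expert)
    pz, pz2 = Counter(expert), Counter(predicted)
    joint = Counter(zip(expert, predicted))
    mi = sum(
        (c / n) * math.log2((c / n) / ((pz[z] / n) * (pz2[z2] / n)))
        for (z, z2), c in joint.items()
    )
    return mi / max(entropy(expert), entropy(predicted))


def overall_f(expert, predicted):
    """Sum over generated topics z' of P(z') times the best F1 against any expert topic z."""
    n = len(expert)
    expert_sizes = Counter(expert)
    joint = Counter(zip(expert, predicted))
    total = 0.0
    for z2, size in Counter(predicted).items():
        best = 0.0
        for z, z_size in expert_sizes.items():
            tp = joint[(z, z2)]
            if tp == 0:
                continue  # no overlap, F is taken as 0
            best = max(best, precision_recall_f(tp, size - tp, z_size - tp)[2])
        total += (size / n) * best
    return total


# Toy example: the generated topics match the expert topics up to renaming.
expert = ["a", "a", "b", "b"]
predicted = ["x", "x", "y", "y"]
nmi = normalized_mutual_information(expert, predicted)
f_star = overall_f(expert, predicted)
p, r, f1 = precision_recall_f(tp=8, fp=2, fn=4)
print(nmi, f_star, f1)
```
]]></preformat>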
          <p>
            The experiments were conducted on the Reuters-21578-Distribution-1 and OHSUMED
datasets. Between two and ten clusters were selected randomly to evaluate LSM and
the other clustering methods. Fifty test runs were conducted on the randomly chosen
clusters from the corpus, and the final performance scores were obtained by averaging
the scores from the 50 test runs [
            <xref ref-type="bibr" rid="ref44">44</xref>
            ]. A t-test assessed whether the homogeneous
clusters generated by two methods (LSM vs. each other method) were statistically
different from each other, as shown in Table 3 and Figure 11 in the Results section. We
also calculated the overall F-measure for combinations of an arbitrary number k of clusters that
were uniquely assigned to topics from the Reuters-21578-Distribution-1 dataset,
where k was 3, 15, 30, and 60 [
            <xref ref-type="bibr" rid="ref47">47</xref>
            ]. Fifty test runs were also performed to compare these
LSM results with ‘Frequent Itemset-based Hierarchical Clustering (FIHC)’
and ‘bisecting k-means’, as shown in Table 4 and Figure 12 in the Results section [
            <xref ref-type="bibr" rid="ref47 ref48">47, 48</xref>
            ]. The average precision, recall, overall F-measure, and normalized mutual
information of LSM, LST, PLSI, PLSI + AdaBoost, LDA, and CCF were evaluated on the
Reuters-21578-Distribution-1 dataset, and those of LSM, LST, and CCF were evaluated on the
OHSUMED dataset, as shown in Table 5 in the Results section [
            <xref ref-type="bibr" rid="ref26">26, 49-52</xref>
            ]. Besides
effectiveness, efficiency was tested for LSM, LST, and CCF, as shown
in Figure 13 in the Results section.
          </p>
        </sec>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>Results</title>
      <sec id="sec-3-1">
        <title>LSM implementation results</title>
        <p>
          The normalized mutual information of the LSM framework is compared with that of the
sixteen other methods on the Reuters-21578-Distribution-1 dataset in Table 3 and
Figure 11 [
          <xref ref-type="bibr" rid="ref44">44, 52-54</xref>
          ].
§ LSM – Latent semantic manifold; CCF – k-clique community finding algorithm; GMM – Gaussian
mixture model; NB – Naive Bayes clustering; GMM + DFM – Gaussian mixture model followed by the
iterative cluster refinement method; KM – Traditional k-means; KM-NC – Traditional k-means and Spectral
clustering algorithm based on normalized cut criterion; SKM – Spherical k-means; SKM-NCW –
Normalized-cut weighted form; BP-NCW – Spectral clustering based bipartite normalized cut; AA – Average
association criterion; NC – Normalized cut criterion; RC – Spectral clustering based on ratio cut criterion;
NMF – Non-negative matrix factorization; NMF-NCW – Nonnegative Matrix
Factorization-based clustering; CF – Concept factorization; CF-NCW – Clustering by concept factorization
        </p>
        <p>The four metrics (precision, recall, overall F-measure, normalized mutual
information) of LSM on Reuters-21578-Distribution-1 dataset for different k are listed in
Table 4. In addition, overall F-measure is compared with FIHC and bisecting k-means
as shown in Figure 12.
** LSM – Latent semantic manifold; GMM – Gaussian mixture model; NB – Naive Bayes clustering; GMM
+ DFM – Gaussian mixture model followed by the iterative cluster refinement method; KM – Traditional
k-means; KM-NC – Traditional k-means and spectral clustering algorithm based on normalized cut criterion;
SKM – Spherical k-means; SKM-NCW – Normalized-cut weighted form; BP-NCW – Spectral clustering
based bipartite normalized cut; AA – Average association criterion; NC – Normalized cut criterion; RC –
Spectral clustering based on ratio cut criterion; NMF – Non-negative matrix factorization; NMF-NCW –
Nonnegative Matrix Factorization-based clustering; CF – Concept factorization; CF-NCW – Clustering by
concept factorization ; CCF – k-clique community finding algorithm</p>
        <p>The average precision, recall, overall F-measure, and normalized mutual
information of LSM, LST, PLSI, LDA, and CCF on Reuters-21578-Distribution-1 dataset;
and LSM, LST and CCF on OHSUMED are shown in Table 5. The efficiency testing
results of the three methods LSM, LST, and CCF are shown in Figure 13.
†† LSM – Latent semantic manifold; FIHC – Frequent itemset-based hierarchical clustering
‡‡ LSM – Latent semantic manifold; LST – Latent semantic topology; PLSI – Probabilistic latent semantic
indexing; PLSI + AdaBoost – Probabilistic latent semantic indexing + additive boosting methods; LDA –
Latent Dirichlet allocation; CCF – k-clique community finding algorithm
§§ LSM – Latent semantic manifold; LST – Latent semantic topology; CCF – k-clique community
finding algorithm</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>Discussion</title>
      <sec id="sec-4-1">
        <title>Primary findings</title>
        <p>Our findings suggest that the LSM framework might play an instrumental role in
enhancing search engine functionality by discovering the latent semantics in
high-dimensional web data (Figures 6-10).</p>
        <p>LSM performed much better than the other sixteen clustering methods,
especially as the number of clusters grew, on the Reuters-21578-Distribution-1 and
OHSUMED datasets (Tables 3-4 and Figures 11-12). In general, LSM produced more
accurate results than the others. A paired t-test assessed the clustering results of the
same topics for pairs of methods among LSM, LST, and CCF. With a p-value less than
0.05, the results of LSM were significantly better than those of LST in the experiments,
in which we used 63 clusters. Similarly, with a p-value less than 0.05, the
results of LSM were significantly better than those of CCF in 48 randomly
selected clusters out of 72 (Table 5). The efficiency of the three
methods LSM, LST, and CCF as the number of features varied also showed that LSM
is better than the others: the time needed to build latent semantics did not increase
significantly as the data became larger (Figure 13).</p>
        <p>This study had a few limitations that open up scope for future studies. First, to
identify and discriminate the correct topics in a collection of documents, the
combinations of features and their co-occurring relationships serve as clues, and the
probabilities express their significance. All features in the documents compose a ‘topological
probabilistic manifold’ associated with ‘probabilistic measures’ that denote the
underlying structure. This complex structure is decomposed into inseparable components at
various levels (at various levels of skeletons) so that each component corresponds to
a topic in the collection of documents. However, this is a time-consuming process that
depends strongly on the features and their identification (named-entities). Second,
some terms with similar meanings, such as ‘anticipate,’ ‘believe,’ ‘estimate,’ ‘expect,’
‘intend,’ and ‘project’, could be separated into several independent topics even though
those topics have the same meaning. Likewise, some data of a ‘single topic’ might be
spread over several clusters. These issues will be considered in further research by
utilizing thesauri and other adaptive methods [55]. Third, in this study, the
evaluation was carried out mainly by comparison with other latent semantic indexing (LSI)
algorithms. However, many alternative approaches for searching, clustering, and
categorization exist. Further study is needed to compare this approach with those alternatives.
Fourth, some tools, such as GOPUBMED, ARGO, and Vivisimo, also perform latent
semantic search of high-dimensional web data. Further study is needed to
compare the LSM-based tool proposed in this study with these existing tools and to find
room for synergy. Fifth, there are existing knowledge bases and
resources in the biomedical domain, such as MeSH (Medical Subject Headings). We have already
performed some studies using the Genia corpus, which contains 1,999 Medline abstracts
selected using a PubMed query for the three MeSH terms ‘human,’ ‘blood cells,’ and
‘transcription factors’ (Figure 10). More studies are needed to verify whether
this approach can be easily adapted to such knowledge bases or resources.</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>Conclusion</title>
      <p>We found that the LSM framework could discover the latent semantics in
high-dimensional web data and organize them into several semantic topics. This
framework could be used to enhance the functionalities of currently available search
engines.</p>
    </sec>
    <sec id="sec-6">
      <title>Acknowledgement</title>
      <p>The National Science Foundation (NSC 98-2221-E-038-012) supported this work.</p>
      <p>49. Cai, L., Hofmann, T.: Text categorization by boosting automatically extracted concepts. In: Proceedings of the 26th annual international ACM SIGIR conference on Research and development in information retrieval, pp. 182-189. ACM, (Year)</p>
      <p>50. Chiang, I.-J.: Discover the semantic topology in high-dimensional data. Expert Systems with Applications 33, 256-262 (2007)</p>
      <p>51. Hofmann, T.: Probabilistic latent semantic indexing. In: Proceedings of the 22nd annual international ACM SIGIR conference on Research and development in information retrieval, pp. 50-57. ACM, (Year)</p>
      <p>52. Palla, G., Derényi, I., Farkas, I., Vicsek, T.: Uncovering the overlapping community structure of complex networks in nature and society. Nature 435, 814-818 (2005)</p>
      <p>53. Dhillon, I.S., Modha, D.S.: Concept decompositions for large sparse text data using clustering. Machine learning 42, 143-175 (2001)</p>
      <p>54. Shi, J., Malik, J.: Normalized cuts and image segmentation. Pattern Analysis and Machine Intelligence, IEEE Transactions on 22, 888-905 (2000)</p>
      <p>55. Cohen, W.W., Richman, J.: Learning to match and cluster large high-dimensional data sets for data integration. In: Proceedings of the eighth ACM SIGKDD international conference on Knowledge discovery and data mining, pp. 475-480. ACM, (Year)</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Ranganathan</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          :
          <article-title>The data explosion</article-title>
          . (
          <year>2011</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Howe</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Costanzo</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Fey</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gojobori</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hannick</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hide</surname>
            ,
            <given-names>W.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hill</surname>
            ,
            <given-names>D.P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kania</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Schaeffer</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>St Pierre</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          :
          <article-title>Big data: The future of biocuration</article-title>
          .
          <source>Nature</source>
          <volume>455</volume>
          ,
          <fpage>47</fpage>
          -
          <lpage>50</lpage>
          (
          <year>2008</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Gracia</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Montiel-Ponsoda</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Cimiano</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gómez-Pérez</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Buitelaar</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>McCrae</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          :
          <article-title>Challenges for the multilingual Web of Data</article-title>
          .
          <source>Web Semantics: Science, Services and Agents on the World Wide Web</source>
          <volume>11</volume>
          ,
          <fpage>63</fpage>
          -
          <lpage>71</lpage>
          (
          <year>2012</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>Croft</surname>
            ,
            <given-names>W.B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Metzler</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Strohman</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          :
          <article-title>Search engines: Information retrieval in practice</article-title>
          . Addison-Wesley, Reading
          (
          <year>2010</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <surname>Thomas</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Starlinger</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Vowinkel</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Arzt</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Leser</surname>
            ,
            <given-names>U.</given-names>
          </string-name>
          :
          <article-title>GeneView: a comprehensive semantic search engine for PubMed</article-title>
          .
          <source>Nucleic acids research</source>
          <volume>40</volume>
          ,
          <fpage>W585</fpage>
          -
          <lpage>W591</lpage>
          (
          <year>2012</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <surname>Lingwal</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gupta</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          :
          <article-title>A Comparative Study Of Different Approaches For Improving Search Engine Performance</article-title>
          .
          <source>International Journal</source>
          (
          <year>2012</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <surname>Freitas</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Curry</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Oliveira</surname>
            ,
            <given-names>J.G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>O'Riain</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          :
          <article-title>Querying heterogeneous datasets on the linked data web: Challenges, approaches, and trends</article-title>
          .
          <source>Internet Computing, IEEE</source>
          <volume>16</volume>
          ,
          <fpage>24</fpage>
          -
          <lpage>33</lpage>
          (
          <year>2012</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <surname>Dalal</surname>
            ,
            <given-names>M.K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zaveri</surname>
            ,
            <given-names>M.A.</given-names>
          </string-name>
          :
          <article-title>Automatic Classification of Unstructured Blog Text</article-title>
          .
          <source>Journal of Intelligent Learning Systems and Applications</source>
          <volume>5</volume>
          ,
          <fpage>108</fpage>
          -
          <lpage>114</lpage>
          (
          <year>2013</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
          <string-name>
            <surname>Vercruysse</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kuiper</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          :
          <article-title>Jointly creating digital abstracts: dealing with synonymy and polysemy</article-title>
          .
          <source>BMC research notes</source>
          <volume>5</volume>
          ,
          <issue>601</issue>
          (
          <year>2012</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          10.
          <string-name>
            <surname>Singer</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Norbisrath</surname>
            ,
            <given-names>U.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lewandowski</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          :
          <article-title>Ordinary search engine users carrying out complex search tasks</article-title>
          .
          <source>Journal of Information Science</source>
          (
          <year>2012</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          11.
          <string-name>
            <surname>Brossard</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Scheufele</surname>
            ,
            <given-names>D.A.</given-names>
          </string-name>
          :
          <article-title>Science, New Media, and the Public</article-title>
          .
          <source>Science</source>
          <volume>339</volume>
          ,
          <fpage>40</fpage>
          -
          <lpage>41</lpage>
          (
          <year>2013</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          12.
          <string-name>
            <surname>Stumme</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hotho</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Berendt</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          :
          <article-title>Semantic Web Mining: State of the art and future directions</article-title>
          .
          <source>Web Semantics: Science, Services and Agents on the World Wide Web</source>
          <volume>4</volume>
          ,
          <fpage>124</fpage>
          -
          <lpage>143</lpage>
          (
          <year>2006</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          13.
          <string-name>
            <surname>Blanco</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Halpin</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Herzig</surname>
            ,
            <given-names>D.M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Mika</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Pound</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Thompson</surname>
            ,
            <given-names>H.S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Tran</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          :
          <article-title>Repeatable and Reliable Semantic Search Evaluation</article-title>
          .
          <source>Web Semantics: Science, Services and Agents on the World Wide Web</source>
          (
          <year>2013</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          14.
          <string-name>
            <surname>Hogan</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Harth</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Umbrich</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kinsella</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Polleres</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Decker</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          :
          <article-title>Searching and browsing Linked Data with SWSE: The Semantic Web Search Engine</article-title>
          .
          <source>Web Semantics: Science, Services and Agents on the World Wide Web</source>
          <volume>9</volume>
          ,
          <fpage>365</fpage>
          -
          <lpage>401</lpage>
          (
          <year>2011</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          15.
          <string-name>
            <surname>Fazzinga</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gianforme</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gottlob</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lukasiewicz</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          :
          <article-title>Semantic Web search based on ontological conjunctive queries</article-title>
          .
          <source>Web Semantics: Science, Services and Agents on the World Wide Web</source>
          <volume>9</volume>
          ,
          <fpage>453</fpage>
          -
          <lpage>473</lpage>
          (
          <year>2011</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          16.
          <string-name>
            <surname>Liu</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Feng</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          :
          <article-title>The Notion of “Meaning System” and its use for “Semantic Search”</article-title>
          .
          <source>Journal of Computations &amp; Modelling</source>
          <volume>1</volume>
          ,
          <fpage>97</fpage>
          -
          <lpage>126</lpage>
          (
          <year>2011</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          17.
          <string-name>
            <surname>Beall</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          :
          <article-title>The weaknesses of full-text searching</article-title>
          .
          <source>The Journal of Academic Librarianship</source>
          <volume>34</volume>
          ,
          <fpage>438</fpage>
          -
          <lpage>444</lpage>
          (
          <year>2008</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          18.
          <string-name>
            <surname>Şah</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Wade</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          :
          <article-title>Automatic metadata mining from multilingual enterprise content</article-title>
          .
          <source>Web Semantics: Science, Services and Agents on the World Wide Web</source>
          <volume>11</volume>
          ,
          <fpage>41</fpage>
          -
          <lpage>62</lpage>
          (
          <year>2012</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          19.
          <string-name>
            <surname>Bergamaschi</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Domnori</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Guerra</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Trillo Lado</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Velegrakis</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          :
          <article-title>Keyword search over relational databases: a metadata approach</article-title>
          .
          <source>In: Proceedings of the 2011 international conference on Management of data</source>
          , pp.
          <fpage>565</fpage>
          -
          <lpage>576</lpage>
          . ACM, (
          <year>Year</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          20.
          <string-name>
            <surname>Luhn</surname>
            ,
            <given-names>H.P.</given-names>
          </string-name>
          :
          <article-title>The automatic creation of literature abstracts</article-title>
          .
          <source>IBM Journal of research and development</source>
          <volume>2</volume>
          ,
          <fpage>159</fpage>
          -
          <lpage>165</lpage>
          (
          <year>1958</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          21.
          <string-name>
            <surname>Salton</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>McGill</surname>
            ,
            <given-names>M.J.</given-names>
          </string-name>
          :
          <article-title>Introduction to modern information retrieval</article-title>
          . (
          <year>1986</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          22.
          <string-name>
            <surname>Zipf</surname>
            ,
            <given-names>G.K.</given-names>
          </string-name>
          :
          <article-title>Human Behaviour and the Principle of Least-Effort</article-title>
          . (
          <year>1949</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>
          23.
          <string-name>
            <surname>Aldous</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          :
          <article-title>Exchangeability and related topics</article-title>
          .
          <source>École d'Été de Probabilités de Saint-Flour XIII-1983</source>
          ,
          <fpage>1</fpage>
          -
          <lpage>198</lpage>
          (
          <year>1985</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref24">
        <mixed-citation>
          24.
          <string-name>
            <surname>De Finetti</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          :
          <article-title>Theory of Probability: A critical introductory treatment</article-title>
          . Vol.
          <volume>1</volume>
          .
          Wiley
          (
          <year>1974</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref25">
        <mixed-citation>
          25.
          <string-name>
            <surname>De Finetti</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          :
          <article-title>Theory of probability: a critical introductory treatment</article-title>
          . (
          <year>1993</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref26">
        <mixed-citation>
          26.
          <string-name>
            <surname>Blei</surname>
            ,
            <given-names>D.M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ng</surname>
            ,
            <given-names>A.Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Jordan</surname>
            ,
            <given-names>M.I.</given-names>
          </string-name>
          :
          <article-title>Latent dirichlet allocation</article-title>
          .
          <source>Journal of Machine Learning Research</source>
          <volume>3</volume>
          ,
          <fpage>993</fpage>
          -
          <lpage>1022</lpage>
          (
          <year>2003</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref27">
        <mixed-citation>
          27.
          <string-name>
            <surname>Flores</surname>
            ,
            <given-names>J.G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gillard</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ferret</surname>
            ,
            <given-names>O.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>de Chandelar</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          :
          <article-title>Bag of senses versus bag of words: comparing semantic and lexical approaches on sentence extraction</article-title>
          .
          <source>In: TAC 2008 Workshop-Notebook papers and results</source>
          , pp.
          <fpage>158</fpage>
          -
          <lpage>167</lpage>
          . (
          <year>Year</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref28">
        <mixed-citation>
          28.
          <string-name>
            <surname>Grishman</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Sundheim</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          :
          <article-title>Design of the MUC-6 evaluation</article-title>
          . In:
          <source>Proceedings of a workshop on held at Vienna, Virginia: May 6-8, 1996</source>
          , pp.
          <fpage>413</fpage>
          -
          <lpage>422</lpage>
          . Association for Computational Linguistics, (
          <year>Year</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref29">
        <mixed-citation>
          29.
          <string-name>
            <surname>Juang</surname>
            ,
            <given-names>B.-H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Rabiner</surname>
            ,
            <given-names>L.R.</given-names>
          </string-name>
          :
          <article-title>Hidden Markov models for speech recognition</article-title>
          .
          <source>Technometrics</source>
          <volume>33</volume>
          ,
          <fpage>251</fpage>
          -
          <lpage>272</lpage>
          (
          <year>1991</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref30">
        <mixed-citation>
          30.
          <string-name>
            <surname>Mooij</surname>
            ,
            <given-names>J.M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kappen</surname>
            ,
            <given-names>H.J.</given-names>
          </string-name>
          :
          <article-title>Sufficient conditions for convergence of the sum-product algorithm</article-title>
          .
          <source>Information Theory, IEEE Transactions on</source>
          <volume>53</volume>
          ,
          <fpage>4422</fpage>
          -
          <lpage>4437</lpage>
          (
          <year>2007</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref31">
        <mixed-citation>
          31.
          <string-name>
            <surname>Yedidia</surname>
            ,
            <given-names>J.S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Freeman</surname>
            ,
            <given-names>W.T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Weiss</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          :
          <article-title>Understanding belief propagation and its generalizations</article-title>
          .
          <source>Exploring artificial intelligence in the new millennium</source>
          <volume>8</volume>
          ,
          <fpage>236</fpage>
          -
          <lpage>239</lpage>
          (
          <year>2003</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref32">
        <mixed-citation>
          32.
          <string-name>
            <surname>Yedidia</surname>
            ,
            <given-names>J.S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Freeman</surname>
            ,
            <given-names>W.T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Weiss</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          :
          <article-title>Constructing free-energy approximations and generalized belief propagation algorithms</article-title>
          .
          <source>Information Theory, IEEE Transactions on</source>
          <volume>51</volume>
          ,
          <fpage>2282</fpage>
          -
          <lpage>2312</lpage>
          (
          <year>2005</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref33">
        <mixed-citation>
          33.
          <string-name>
            <surname>Borkar</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Deshmukh</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Sarawagi</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          :
          <article-title>Automatic segmentation of text into structured records</article-title>
          .
          <source>In: ACM SIGMOD Record</source>
          , pp.
          <fpage>175</fpage>
          -
          <lpage>186</lpage>
          . ACM, (
          <year>Year</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref34">
        <mixed-citation>
          34.
          <string-name>
            <surname>Bunescu</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Mooney</surname>
            ,
            <given-names>R.J.</given-names>
          </string-name>
          :
          <article-title>Relational markov networks for collective information extraction</article-title>
          .
          <source>In: ICML-2004 Workshop on Statistical Relational Learning. (Year)</source>
        </mixed-citation>
      </ref>
      <ref id="ref35">
        <mixed-citation>
          35.
          <string-name>
            <surname>Grimmett</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Welsh</surname>
            ,
            <given-names>D.J.</given-names>
          </string-name>
          :
          <article-title>Disorder in physical systems: a volume in honour of John M. Hammersley on the occasion of his 70th birthday</article-title>
          . Oxford University Press, USA (
          <year>1990</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref36">
        <mixed-citation>
          36.
          <string-name>
            <surname>Lafferty</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>McCallum</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Pereira</surname>
            ,
            <given-names>F.C.</given-names>
          </string-name>
          :
          <article-title>Conditional random fields: Probabilistic models for segmenting and labeling sequence data</article-title>
          . (
          <year>2001</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref37">
        <mixed-citation>
          37.
          <string-name>
            <surname>Peng</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Feng</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>McCallum</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          :
          <article-title>Chinese segmentation and new word detection using conditional random fields</article-title>
          .
          <source>In: Proceedings of the 20th international conference on Computational Linguistics</source>
          , pp.
          <fpage>562</fpage>
          . Association for Computational Linguistics, (
          <year>2004</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref38">
        <mixed-citation>
          38.
          <string-name>
            <surname>Settles</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          :
          <article-title>ABNER: an open source tool for automatically tagging genes, proteins and other entity names in text</article-title>
          .
          <source>Bioinformatics</source>
          <volume>21</volume>
          ,
          <fpage>3191</fpage>
          -
          <lpage>3192</lpage>
          (
          <year>2005</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref39">
        <mixed-citation>
          39.
          <string-name>
            <surname>Taskar</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Abbeel</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Koller</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          :
          <article-title>Discriminative probabilistic models for relational data</article-title>
          .
          <source>In: Proceedings of the Eighteenth conference on Uncertainty in artificial intelligence</source>
          , pp.
          <fpage>485</fpage>
          -
          <lpage>492</lpage>
          . Morgan Kaufmann Publishers Inc., (
          <year>2002</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref40">
        <mixed-citation>
          40.
          <string-name>
            <surname>Srebro</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Jaakkola</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          :
          <article-title>Weighted low-rank approximations</article-title>
          .
          <source>In: Machine Learning International Workshop and Conference</source>
          , pp.
          <fpage>720</fpage>
          . (
          <year>2003</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref41">
        <mixed-citation>
          41.
          <string-name>
            <surname>Diestel</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          :
          <source>Graph theory</source>
          . Springer-Verlag (
          <year>2005</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref42">
        <mixed-citation>
          42.
          <string-name>
            <surname>Kim</surname>
            ,
            <given-names>J.-D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ohta</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Tateisi</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Tsujii</surname>
            ,
            <given-names>J.i.</given-names>
          </string-name>
          :
          <article-title>GENIA corpus-a semantically annotated corpus for bio-textmining</article-title>
          .
          <source>Bioinformatics</source>
          <volume>19</volume>
          ,
          <fpage>i180</fpage>
          -
          <lpage>i182</lpage>
          (
          <year>2003</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref43">
        <mixed-citation>
          43.
          <string-name>
            <surname>Hersh</surname>
            ,
            <given-names>W.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Buckley</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Leone</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hickam</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          :
          <article-title>OHSUMED: an interactive retrieval evaluation and new large test collection for research</article-title>
          .
          <source>In: SIGIR'94</source>
          , pp.
          <fpage>192</fpage>
          -
          <lpage>201</lpage>
          . Springer, (
          <year>1994</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref44">
        <mixed-citation>
          44.
          <string-name>
            <surname>Xu</surname>
            ,
            <given-names>W.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gong</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          :
          <article-title>Document clustering by concept factorization</article-title>
          .
          <source>In: Proceedings of the 27th annual international ACM SIGIR conference on Research and development in information retrieval</source>
          , pp.
          <fpage>202</fpage>
          -
          <lpage>209</lpage>
          . ACM, (
          <year>2004</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref45">
        <mixed-citation>
          45.
          <string-name>
            <surname>Dalli</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          :
          <article-title>Adaptation of the F-measure to cluster based lexicon quality evaluation</article-title>
          .
          <source>In: Proceedings of the EACL 2003 Workshop on Evaluation Initiatives in Natural Language Processing: are evaluation methods, metrics and resources reusable?</source>
          , pp.
          <fpage>51</fpage>
          -
          <lpage>56</lpage>
          . Association for Computational Linguistics, (
          <year>2003</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref46">
        <mixed-citation>
          46.
          <string-name>
            <surname>Kummamuru</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lotlikar</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Roy</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Singal</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Krishnapuram</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          :
          <article-title>A hierarchical monothetic document clustering algorithm for summarization and browsing search results</article-title>
          .
          <source>In: Proceedings of the 13th international conference on World Wide Web</source>
          , pp.
          <fpage>658</fpage>
          -
          <lpage>665</lpage>
          . ACM, (
          <year>2004</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref47">
        <mixed-citation>
          47.
          <string-name>
            <surname>Fung</surname>
            ,
            <given-names>B.C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Wang</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ester</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          :
          <article-title>Hierarchical document clustering using frequent itemsets</article-title>
          .
          <source>In: Proceedings of the SIAM international conference on data mining</source>
          , pp.
          <fpage>59</fpage>
          -
          <lpage>70</lpage>
          . (
          <year>2003</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref48">
        <mixed-citation>
          48.
          <string-name>
            <surname>Steinbach</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Karypis</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kumar</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          :
          <article-title>A comparison of document clustering techniques</article-title>
          .
          <source>In: KDD workshop on text mining</source>
          , pp.
          <fpage>525</fpage>
          -
          <lpage>526</lpage>
          . Boston, (
          <year>2000</year>
          )
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>