Identifying Latent Semantics in High-Dimensional Web Data

Ajit Kumar(1), Sanjeev Maskara(2), Jau-Min Wong(3), I-Jen Chiang(1,3,*)

(1) Graduate Institute of Biomedical Informatics, Taipei Medical University, Taiwan
(2) Ovens and King Community Health Services, Wangaratta, Victoria, Australia
(3) Institute of Biomedical Engineering, National Taiwan University, Taipei, Taiwan
{d110099005,ijchiang}@tmu.edu.tw

Abstract. Search engines have become an indispensable tool for obtaining relevant information on the Web. A search engine often generates a large number of results, including many irrelevant items that obscure comprehension of the results. Search engines therefore need to be enhanced to discover the latent semantics in high-dimensional web data. This paper presents a novel framework, including its implementation and evaluation. To discover the latent semantics in high-dimensional web data, we propose a framework named Latent Semantic Manifold (LSM), a mixture model based on concepts from topology and probability. The framework finds the latent semantics in web data and represents them as homogeneous groups. In our experiments, the LSM framework outperformed other frameworks. In addition, we deployed the framework in a tool that was used for two years at two sites - a library and a biomedical engineering laboratory in Taiwan - where it assisted researchers in performing semantic searches of the PubMed database. The evaluation and deployment suggest that the framework could be used to enhance the functionality of currently available search engines by discovering latent semantics in high-dimensional web data.

Keywords: latent semantic manifold; semantic cluster; conditional random field; hidden Markov models; graph-based tree-width decomposition

1 Introduction

Gigantic repositories of data, texts, and media have grown rapidly [1-5].
These are made available on the World Wide Web for public use. Search engine tools help users quickly find content relevant to them [4]. However, search engines often return inconsistent, uninteresting, and disorganized results for several reasons [5, 6]. First, web pages are heterogeneous and vary in quality [6, 7]. Second, relationships among words (polysemy, synonymy, and homophony), among sentences (paraphrase, entailment, and contradiction), and ambiguities (lexical and structural) limit search technologies and diminish the power of search engines [8, 9]. Users must devote substantial time to separating meaningful items from the generated results [5, 10, 11]. Thus, users need search engines that filter and organize meaningful items out of the irrelevant results generated by their queries [12, 13]. An effective search approach advocates fitting search results to the user's intent by discovering the latent semantics in the returned documents and then classifying the documents into 'homogeneous semantic clusters' [14, 15]. In this approach, each semantic cluster is seen as a 'topic' that summarizes the returned documents, and users can then explore the topics relevant to their intent. For example, the query term APC (Adenomatous Polyposis Coli) can be used to retrieve article abstracts from PubMed. However, the results would include not only articles about Adenomatous Polyposis Coli but also others about Antigen Presenting Cells (APC), Anaphase Promoting Complex (APC), and Activated Protein C (APC). Users have to find the articles relevant to their intent (here, Adenomatous Polyposis Coli) by going through the retrieved abstracts.

(* Corresponding author address: 250, Wuxing Street, Taipei 11031, Taiwan. Phone: +886-2-27361661 ext. 3343. Fax: +886-2-27392914.)
Similarly, a query term 'network' might yield different topics depending on whether it occurs near a term such as computer, traffic, artificial neural, or biological neural in the context of the searched documents. The returned results should be relevant, not just outbound links matching the query terms. To facilitate and enhance relevant information access for web users, it is essential that search engines deal with the ambiguity, elusiveness, and imprecision of users' requests [16].

Several researchers have worked toward semantic search of giant repositories. For example, a deterministic search provided a metadata-enhanced search facility in which a user preselects facets to generate more relevant results [17]. However, scaling metadata-enhanced search to the web is difficult and requires many experts to define controlled vocabularies that create unique labels for concepts sharing the same terminology [18, 19]. A revolutionary change in information retrieval came with the introduction of the tf-idf scheme [20-22]. In this scheme, a document collection is represented as a document-by-term matrix, which is usually extremely high dimensional and sparse: a single document may contribute thousands of terms to the matrix, and most entries are zero. The tf-idf scheme can prune some terms, but the reduction it provides is relatively small and not enough to reveal statistical structure within or between documents. In recent decades, other dimension reduction techniques such as Latent Semantic Indexing, Probabilistic Latent Semantic Indexing, and the Latent Dirichlet Allocation model were proposed to overcome some of these shortcomings. However, all of these are bag-of-words models. Bag-of-words models follow the exchangeability theorem of Aldous and de Finetti, under which the 'order of terms in a document' or the 'order of documents in a corpus' can be neglected [23-25].
Because the spatial information conveyed by the 'terms in a document' or the 'documents in a corpus' is neglected, a statistical issue attaches to these bag-of-words models [24-27]. In probability theory, random variables (here, terms) t1, t2, ..., tN are said to be exchangeable if the joint distribution F(t1, t2, ..., tN) is invariant under permutation of its arguments, so that F(z1, z2, ..., zN) = F(t1, t2, ..., tN), where z1, z2, ..., zN is a permutation of t1, t2, ..., tN. A semantic, however, is generated by terms that co-occur in relationships and in limited numbers. The criterion that 'the order of terms in a document can be neglected' should therefore be modified to 'the order of terms in a relationship of a document can be neglected'. Similarly, 'the order of documents in a corpus can be neglected' should be modified to 'the order of documents in relationships of a corpus can be neglected'. For example, a query term 'network' would yield different 'topics' if it occurs near a term such as 'computer', 'traffic', 'artificial neural', or 'biological neural', while the 'order of terms within a relationship' might be neglected.

As the literature and arguments above show, search engines need to be enhanced to reveal latent semantics in high-dimensional web data while preserving the relationships and order of terms and documents. We therefore propose a Latent Semantic Manifold (LSM) framework that identifies homogeneous groups in web data while preserving the spatial information of terms in a document and of documents in a corpus. This paper aims to explain the Latent Semantic Manifold framework (hereinafter, the LSM framework), including its implementation and evaluation.

2 Materials and Methods

This study consists of three key components - the proposal of a novel theoretical framework, its implementation, and its evaluation.
They are explained in the following subsections.

2.1 Theoretical framework

The proposed Latent Semantic Manifold (LSM) framework is a mixture model based on concepts from probability and topology that identifies the latent semantics in data. The concepts deployed in the LSM framework are explained in the following four steps. Figure 1 shows a high-level view of the framework.

Fig. 1. Theoretical framework

Step 1: A 'query' entry for searching the high-dimensional web data. The user enters a 'query' into a search engine, which generates a set of documents. The generated documents are processed to obtain semantics, which can reside in a sentence, paragraph, section, or even a whole document. The generated documents are referred to as 'fragments' in Steps 2 and 3. For example, the sentence 'Jaguar is an animal living in Jungle' can be considered a 'fragment'. At times, 'fragment' has another meaning - context - which we mention explicitly where appropriate.

Step 2: Named-entity recognition and heterogeneous manifold construction. Significant noun terms are identified in the 'fragments'. For example, if the sentence 'Jaguar is an animal living in Jungle' is considered a fragment, then 'Jaguar', 'animal', and 'Jungle' are significant 'noun terms'. Natural language processing methods known as named-entity recognition are used to select named-entities (noun terms) and their 'type' [28]. Named-entity recognition and classification algorithms extract the named-entities from fragments and then classify those entities by 'type', such as person, organization, or location. For example, 'Jaguar' is a named-entity assigned to the animal or vehicle 'type' depending on the fragment (context). The named-entities are characterized by their marginal probabilities, and the correlations among the named-entities by their conditional probabilities.
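These marginal and conditional probabilities can be estimated from simple co-occurrence counts. The following minimal Python sketch assumes a hypothetical list of (entity, type) observations drawn from tagged fragments; the counts are illustrative, not taken from the paper's data:

```python
from collections import Counter

# Hypothetical (entity, type) observations extracted from tagged fragments.
observations = [
    ("Jaguar", "animal"), ("Jaguar", "animal"), ("Jaguar", "vehicle"),
    ("Car", "vehicle"), ("Motorcycle", "vehicle"), ("Jaguar", "instrument"),
]

pair_counts = Counter(observations)                # N(entity, type)
type_counts = Counter(t for _, t in observations)  # N(type)
total = len(observations)                          # N

def marginal(entity, etype):
    """P_type(entity): fraction of all observations pairing entity with type."""
    return pair_counts[(entity, etype)] / total

def conditional(entity, etype):
    """P(entity | type): probability of the entity among entities of that type."""
    return pair_counts[(entity, etype)] / type_counts[etype]

print(marginal("Jaguar", "animal"))      # 2/6
print(conditional("Jaguar", "vehicle"))  # 1/3
```

In a real system the observations would come from the POS-tagged fragments, but the estimation principle - relative frequencies over entity/type counts - is the same.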
As shown in Figure 2, Jaguar is a named-entity with three possible types - animal, vehicle, and instrument. It has marginal probabilities such as P_animal(Jaguar), P_vehicle(Jaguar), and P_instrument(Jaguar), and conditional probabilities such as P(Jaguar, Car | Vehicle) and P(Jaguar, Motorcycle | Vehicle).

Fig. 2. An example to demonstrate named-entities, their types, and the associated marginal and conditional probabilities

Although we could enumerate all possible types of terms, including their marginal and conditional probabilities, using a large number of training documents, doing so is computationally expensive. Therefore, only nouns (words or phrases) are retained instead of identifying all types of terms and their probabilities [29-32]. Hidden Markov Models (HMMs) have often been used to extract 'terms' and their 'relationships' [33, 34]. In the last decade, discriminative linear-chain Conditional Random Fields (CRFs) have also been used to extract 'terms' from corpora [35-39]. In this study, we used trained Markov part-of-speech (POS) tagging models to extract all named-entities (noun terms and their types) by inference over the 'fragments' [31, 32]. The relationships among those named-entities form a complex-structured manifold. Because this manifold is heterogeneous, we call it the 'heterogeneous manifold' hereinafter.

Step 3: Decomposing a heterogeneous manifold into homogeneous manifolds. As described in Step 2, the heterogeneous manifold consists of a complex structure of named-entities together with estimates of their marginal and conditional probabilities. A collection of fragment vectors lies on the heterogeneous manifold, which contains local spaces resembling Euclidean spaces of a fixed number of dimensions. Every point of the n-dimensional heterogeneous manifold has a neighborhood homeomorphic to the n-dimensional Euclidean space R^n.
In addition, all points in the 'local spaces' are strongly connected. Because the heterogeneous manifold is overly complex and the semantics are latent in the 'local spaces', instead of retaining one heterogeneous manifold we break it into a collection of 'homogeneous manifolds'. Topological and geometrical concepts can be used to represent the latent semantics of a heterogeneous manifold as a collection of homogeneous manifolds. A graph-based treewidth decomposition algorithm decomposes the heterogeneous manifold into a collection of homogeneous manifolds [40]. As shown in Figure 3, taking 'Jaguar' as a heterogeneous manifold, we can decompose it into three 'homogeneous manifolds' bounded by dotted lines of three different colors.

Fig. 3. An example to demonstrate graph-based treewidth decomposition

In the graph-based treewidth decomposition algorithm, we start by selecting a random 'fixed-dimension local manifold' to act as a separator, as shown in Figure 4 [41]. The local manifold is then decomposed into two non-adjacent local manifolds. This decomposition recurses until no further decomposition is possible.

Fig. 4. An example to demonstrate the concept of a separator under graph-based treewidth decomposition

We can express the above concept formally. Let the heterogeneous manifold M_i for fragment i be the set of homogeneous manifolds such that M_i = { M_ij | no M_ij is a subset of M_ik, j ≠ k }. The semantics generated from the local homogeneous manifolds, which are equipped with fragments, are independent. In addition, a semantic topic set C = {z1, z2, ..., zm} of the returned documents is associated with a semantic mapping f: M_ij → C with a probability P(M_ij, z_k) ∈ [0, 1], where f(M_ij) = z_k. The probabilities indicate how many documents belonging to a homogeneous manifold are relevant and match the user's intent.
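The recursive separator idea can be illustrated with a small sketch. This is a deliberate simplification of graph-based treewidth decomposition, not the paper's implementation: it tries single-vertex separators only, splits the entity graph at each one, and returns the resulting inseparable pieces. The example graph around 'Jaguar' is hypothetical:

```python
def components(adj, nodes):
    """Connected components of the subgraph induced by `nodes`."""
    nodes, comps = set(nodes), []
    while nodes:
        stack, comp = [next(iter(nodes))], set()
        while stack:
            v = stack.pop()
            if v not in comp:
                comp.add(v)
                stack.extend(adj[v] & (nodes - comp))
        nodes -= comp
        comps.append(comp)
    return comps

def decompose(adj, nodes=None):
    """Split recursively on single-vertex separators; return the leaf pieces."""
    nodes = set(adj) if nodes is None else nodes
    for sep in sorted(nodes):                 # try each vertex as a separator
        parts = components(adj, nodes - {sep})
        if len(parts) > 1:                    # removing `sep` disconnects the rest
            pieces = []
            for part in parts:                # keep the separator in every piece
                pieces.extend(decompose(adj, part | {sep}))
            return pieces
    return [nodes]                            # inseparable: a homogeneous piece

# Hypothetical entity graph around the ambiguous term 'Jaguar'.
adj = {
    "Jaguar": {"Car", "Tiger"}, "Car": {"Jaguar", "Motorcycle"},
    "Motorcycle": {"Car"}, "Tiger": {"Jaguar", "Lion"}, "Lion": {"Tiger"},
}
print(sorted(tuple(sorted(p)) for p in decompose(adj)))
```

Here 'Jaguar' acts as a separator between its vehicle-related and animal-related neighborhoods, so the pieces that remain each carry a single coherent sense.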
To induce homogeneous manifolds, it is crucial to extract significant 'terms' from the fragments. In addition, the relevance of each fragment to its homogeneous manifold should be shown, so that users can consult only the fragments associated with the homogeneous manifolds they want.

Step 4: Exploring homogeneous manifolds. The search-generated documents (referred to as fragments in Steps 2 and 3) are clustered to their related homogeneous manifolds. For example, when a user queries the term APC, the returned documents are aggregated into a collection of homogeneous manifolds, as shown in Figure 5. Each document is assigned to a particular homogeneous manifold, and the occurrence of a document within the whole set of documents denotes its significance in the homogeneous manifold.

Fig. 5. An example to demonstrate a heterogeneous manifold, homogeneous manifolds, and the documents associated with homogeneous manifolds

2.2 Implementation

The LSM framework was implemented using the Eclipse Software Development Kit. A team of three researchers, all experts in the Java programming language, developed the entire system over almost 11 months. We provided a straightforward search interface, shown in Figure 6 in the Results section; the output for a queried term (for example, APC) is shown in Figure 7. The system was deployed for two years at two sites - a library and a biomedical engineering laboratory in Taiwan - where it assisted researchers in performing semantic searches of the PubMed database. For example, a researcher can search APC intending Adenomatous Polyposis Coli; however, APC can also mean Antigen-Presenting Cells, Anaphase Promoting Complex, or Activated Protein C, among others.
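The document-to-manifold assignment of Step 4 can be sketched as a simple overlap rule: each returned document is clustered to the homogeneous manifold whose entity set it shares most. This is an illustrative simplification of the system's probabilistic scoring; the manifolds and document below are hypothetical term sets:

```python
# Hypothetical homogeneous manifolds, each summarized by its named entities.
manifolds = {
    "Adenomatous Polyposis Coli": {"APC", "colorectal cancer", "gene"},
    "Antigen Presenting Cells": {"APC", "MHC", "T-cell"},
}

def assign(doc_entities, manifolds):
    """Cluster a document to the manifold with the largest entity overlap."""
    return max(manifolds, key=lambda name: len(manifolds[name] & doc_entities))

doc = {"APC", "T-cell", "MHC", "activation"}
print(assign(doc, manifolds))  # Antigen Presenting Cells
```

The shared term APC alone cannot disambiguate the document; the co-occurring entities (MHC, T-cell) tip the assignment toward the immunology sense.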
For instance, if APC-, Colorectal Cancer-, and Gene-related documents are assembled in a homogeneous manifold, that manifold points to APC as the Adenomatous Polyposis Coli gene. Similarly, if APC-, Major Histocompatibility Complex-, and T-cell-related documents are assembled, the manifold indicates APC as Antigen Presenting Cells. In the Results section, Figure 8 shows that the documents returned for the query term APC are automatically associated with homogeneous manifolds (semantic topics). In addition, researchers can obtain a different 'vantage point' based on the underlying data. For example, a PubMed search retrieved 300 randomly selected published or in-press article abstracts for the medical term NOD2. Figure 9 shows the latent semantic topics as a clustering result: inflammatory bowel disease and its types (Crohn's disease and ulcerative colitis) are associated with the gene NOD2, and the term NOD2 was found to be evenly spread over these three topics. Some evolving topics, such as 'bacterial component', were also discovered. However, when we searched NOD2 in the Genia corpus†, the result was different, as shown in Figure 10 [42].

2.3 Experiment

Data Sets. Two data sets, Reuters-21578-Distribution-1 and OHSUMED, were used to evaluate the performance of the LSM framework and its implementation. The Reuters-21578-Distribution-1 collection consists of newswire articles classified into 135 topics, which were used to verify the clustering results. In the test, documents with multiple topics (category labels) and those with a single topic were separated, and topics with fewer than five documents were removed. Table 1 summarizes the Reuters-21578-Distribution-1 collection.

Table 1.
Statistics of the Reuters-21578 corpora

  Subset                          Topics   Documents   Documents per topic
  Origin                          135      21578       0-3945
  Single topic                    65       8649        1-3945
  Single topic (>=5 documents)    51       9494        5-3945

OHSUMED is a clinically oriented Medline collection of 348,566 references. It covers references from 270 medical journals across 23 disease categories over a five-year period (1987-1991) [43].

Evaluation criteria. The experimental evaluation of the LSM framework measured both effectiveness and efficiency. Effectiveness is defined as the ability to identify the right cluster (collection of documents). To measure effectiveness, the generated clusters were verified by human experts, as laid out in Table 2.

Table 2. Contingency table for category c_i (i a natural number)‡

                           Clustering results
                           Yes     No
  Expert judgment   Yes    TP_i    FN_i
                    No     FP_i    TN_i

† The Genia corpus contains 1,999 Medline abstracts, selected using a PubMed query for the three MeSH terms 'human', 'blood cells', and 'transcription factors'.
‡ TP - true positive; FP - false positive; FN - false negative; TN - true negative

Three measures of clustering effectiveness (Precision, Recall, and F) were calculated from the contingency counts in Table 2. Precision and Recall are defined as follows:

  Precision_i = TP_i / (TP_i + FP_i),  Recall_i = TP_i / (TP_i + FN_i)    (1)

The F measure, which combines Precision and Recall, is defined as:

  F_beta = ((beta^2 + 1) * Precision_i * Recall_i) / (beta^2 * Precision_i + Recall_i)    (2)

This paper uses the F1 measure, obtained by setting beta to 1, which gives precision and recall equal weight in evaluating performance. When many categories are generated and compared, the overall precision and recall are calculated as the averages of the precisions and recalls of the individual categories, and F1 is calculated as the mean of the per-category results, i.e., a macro-average over the categories.
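The per-category measures and their macro-average follow directly from the contingency counts. A minimal sketch (the counts below are made up for illustration):

```python
def precision_recall_f(tp, fp, fn, beta=1.0):
    """Per-category Precision, Recall, and F_beta from contingency counts (Eqs. 1-2)."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    b2 = beta * beta
    f = (b2 + 1) * precision * recall / (b2 * precision + recall)
    return precision, recall, f

def macro_f1(per_category_counts):
    """Macro-averaged F1: the mean of the per-category F1 scores."""
    scores = [precision_recall_f(tp, fp, fn)[2] for tp, fp, fn in per_category_counts]
    return sum(scores) / len(scores)

print(precision_recall_f(tp=8, fp=2, fn=2))   # ≈ (0.8, 0.8, 0.8)
print(macro_f1([(8, 2, 2), (9, 1, 3)]))
```

With beta = 1 the formula reduces to the familiar harmonic mean of precision and recall, which is the F1 used throughout the evaluation.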
In addition, two other evaluation metrics, normalized mutual information and the overall F-measure, were used [44-46]. Given two topic sets C and C', let C denote the topic set defined by experts and C' the topic set generated by a clustering method, both derived from the same corpus X. Let N(X) denote the total number of documents, N(z, X) the number of documents in topic z, and N(z, z', X) the number of documents in both topic z and topic z'. The mutual information metric MI(C, C') is defined as:

  MI(C, C') = Σ_{z ∈ C, z' ∈ C'} P(z, z') · log2( P(z, z') / (P(z) · P(z')) )    (3)

where P(z) = N(z, X) / N(X), P(z') = N(z', X) / N(X), and P(z, z') = N(z, z', X) / N(X). MI(C, C') takes values between zero and max(H(C), H(C')), where H(C) and H(C') are the entropies of C and C', respectively. A higher MI(C, C') value means the two topic sets are more nearly identical; a lower value means they are more nearly independent. The metric is therefore normalized as:

  NMI(C, C') = MI(C, C') / max(H(C), H(C'))    (4)

Let F_i be the F-measure for each cluster z_i as defined above. The overall F-measure is defined as:

  F* = Σ_{z' ∈ C'} P(z') · max_{z ∈ C} F(z, z')    (5)

where F(z, z') is the F-measure between z and z'.

2.4 Evaluation

The experiments were conducted on the Reuters-21578-Distribution-1 and OHSUMED datasets. Between two and ten clusters were selected randomly to evaluate LSM and the other clustering methods. Fifty test runs were conducted for the randomly chosen clusters from the corpus, and the final performance scores were obtained by averaging the scores of the 50 runs [44]. A t-test assessed whether the homogeneous clusters generated by two methods (LSM vs.
other methods) were statistically different from each other, as shown in Table 3 and Figure 11 in the Results section. We also calculated the overall F-measure for combinations of arbitrary numbers k of clusters uniquely assigned to topics from the Reuters-21578-Distribution-1 dataset, where k was 3, 15, 30, and 60 [47]. Fifty test runs were also performed, and these LSM results were compared with 'Frequent Itemset-based Hierarchical Clustering (FIHC)' and 'bisecting k-means', as shown in Table 4 and Figure 12 in the Results section [47, 48]. The average precision, recall, overall F-measure, and normalized mutual information of LSM, LST, PLSI, PLSI + AdaBoost, LDA, and CCF were evaluated on the Reuters-21578-Distribution-1 dataset, and those of LSM, LST, and CCF on the OHSUMED dataset, as shown in Table 5 in the Results section [26, 49-52]. Besides effectiveness, efficiency testing was performed on LSM, LST, and CCF, as shown in Figure 13 in the Results section.

3 Results

3.1 LSM implementation results

Figures 6 and 7 show the input and output interfaces of the system. Figure 8 shows the results for the query term 'APC' and illustrates the functionality that the LSM framework can add to search engines. Figures 9 and 10 show different views of the query term NOD2 arising from different underlying data sets.

Fig. 6. Users search 'APC' in the above input interface
Fig. 7. The clustering result of the query term 'APC' retrieved from PubMed as output
Fig. 8. Result of the 'APC' term query
Fig. 9. The clustering result of the query term 'NOD2' retrieved from PubMed
Fig. 10. The clustering result of the query term 'NOD2' retrieved from the Genia corpus parsed with biological Conditional Random Fields (CRFs)

3.2 Experiment results

A normalized mutual information comparison of the LSM framework with sixteen other methods on the Reuters-21578-Distribution-1 dataset is shown in Table 3 and Figure 11 [44, 52-54].

Table 3.
Normalized mutual information comparison of the LSM framework with sixteen other methods on the Reuters-21578-Distribution-1 dataset§

  k         2      3      4      5      6      7      8      9      10     Average
  LSM       0.461  0.505  0.622  0.686  0.714  0.792  0.893  0.884  0.900  0.717
  CCF       0.569  0.563  0.607  0.620  0.605  0.624  0.633  0.647  0.676  0.616
  GMM       0.475  0.468  0.462  0.516  0.551  0.522  0.551  0.557  0.548  0.517
  NB        0.466  0.348  0.401  0.405  0.409  0.404  0.435  0.411  0.418  0.411
  GMM+DFM   0.470  0.466  0.450  0.513  0.531  0.506  0.535  0.535  0.536  0.505
  KM        0.404  0.402  0.461  0.525  0.561  0.548  0.583  0.597  0.618  0.522
  KM-NC     0.438  0.462  0.525  0.554  0.592  0.577  0.594  0.607  0.618  0.552
  SKM       0.458  0.407  0.499  0.561  0.567  0.558  0.591  0.598  0.619  0.540
  SKM-NCW   0.434  0.423  0.515  0.556  0.577  0.563  0.593  0.602  0.612  0.542
  BP-NCW    0.391  0.377  0.431  0.478  0.493  0.500  0.519  0.529  0.532  0.472
  AA        0.443  0.415  0.488  0.531  0.571  0.542  0.587  0.594  0.611  0.531
  NC        0.484  0.461  0.555  0.592  0.617  0.594  0.640  0.634  0.643  0.580
  RC        0.417  0.381  0.505  0.460  0.485  0.456  0.548  0.484  0.495  0.470
  NMF       0.480  0.426  0.498  0.559  0.591  0.552  0.603  0.601  0.623  0.548
  NMF-NCW   0.494  0.500  0.586  0.615  0.637  0.613  0.654  0.659  0.658  0.602
  CF        0.480  0.429  0.503  0.563  0.592  0.556  0.613  0.609  0.629  0.553
  CF-NCW    0.496  0.505  0.595  0.616  0.644  0.615  0.660  0.660  0.665  0.606

§ LSM - latent semantic manifold; CCF - k-clique community finding algorithm; GMM - Gaussian mixture model; NB - naive Bayes clustering; GMM+DFM - Gaussian mixture model followed by the iterative cluster refinement method; KM - traditional k-means; KM-NC - traditional k-means with spectral clustering based on the normalized cut criterion; SKM - spherical k-means; SKM-NCW - normalized-cut weighted form; BP-NCW - spectral clustering based on the bipartite normalized cut; AA - average association criterion; NC - normalized cut criterion; RC - spectral clustering based on the ratio cut criterion; NMF - non-negative matrix factorization; NMF-NCW - non-negative matrix factorization-based clustering; CF - concept factorization; CF-NCW - clustering by concept factorization

Fig. 11. The mutual information values for 2 to 10 clusters built by the LSM framework and the sixteen other methods on the Reuters-21578-Distribution-1 dataset (abbreviations as in Table 3)

The four metrics (precision, recall, overall F-measure, and normalized mutual information) of LSM on the Reuters-21578-Distribution-1 dataset for different k are listed in Table 4. In addition, the overall F-measure is compared with FIHC and bisecting k-means in Figure 12.

Table 4. Precision, recall, overall F-measure, and normalized mutual information (NMI) of the Latent Semantic Manifold on the Reuters-21578-Distribution-1 dataset

  k                   2       3       4       5       6       7       8       9       10
  Precision           0.9845  0.9579  0.9385  0.9352  0.8909  0.9013  0.9148  0.8913  0.8859
  Recall              0.7085  0.6384  0.6453  0.6056  0.5916  0.6543  0.6822  0.6688  0.6805
  Overall F-measure   0.7988  0.7297  0.7399  0.6986  0.6822  0.7329  0.7562  0.7343  0.7472
  NMI                 0.4617  0.5051  0.6221  0.6866  0.7148  0.7925  0.8936  0.8848  0.9006

Fig. 12.
The overall F-measure of three methods (LSM, FIHC, and bisecting k-means) on the Reuters-21578-Distribution-1 dataset, where FIHC is frequent itemset-based hierarchical clustering

The average precision, recall, overall F-measure, and normalized mutual information of LSM, LST, PLSI, PLSI + AdaBoost, LDA, and CCF on the Reuters-21578-Distribution-1 dataset, and of LSM, LST, and CCF on OHSUMED, are shown in Table 5. The efficiency testing results of the three methods LSM, LST, and CCF are shown in Figure 13.

Table 5. The average precision, recall, overall F-measure, and normalized mutual information (NMI) of LSM, LST, PLSI, PLSI + AdaBoost, LDA, and CCF on the Reuters-21578-Distribution-1 dataset, and of LSM, LST, and CCF on OHSUMED‡‡

  Dataset   Method           Precision  Recall  Overall F-measure  NMI
  Reuters   LSM              0.810      0.773   0.786              0.717
            LST              0.779      0.745   0.754              0.633
            PLSI             0.649      0.627   0.636              0.540
            PLSI + AdaBoost  0.772      0.812   0.697              N/A
            LDA              0.660      0.714   0.686              0.610
            CCF              0.727      0.730   0.723              0.616
  OHSUMED   LSM              0.590      0.479   0.522              0.315
            LST              0.586      0.388   0.456              0.257
            CCF              0.514      0.540   0.513              0.214

‡‡ LSM - latent semantic manifold; LST - latent semantic topology; PLSI - probabilistic latent semantic indexing; PLSI + AdaBoost - probabilistic latent semantic indexing with additive boosting; LDA - latent Dirichlet allocation; CCF - k-clique community finding algorithm

Fig. 13. The efficiency of the three clustering methods (LSM, LST, and CCF), where the x-axis shows the number of features and the y-axis the run time (milliseconds)

4 Discussion

4.1 Primary findings

Our findings suggest that the LSM framework can play an instrumental role in enhancing search engine functionality by discovering the latent semantics in high-dimensional web data (Figures 6-10).
4.2 Secondary findings

LSM performed much better than the other sixteen clustering methods, especially as the number of clusters grew, on the Reuters-21578-Distribution-1 and OHSUMED datasets (Tables 3-4 and Figures 11-12). In general, LSM produced more accurate results than the others. A paired t-test assessed the clustering results on the same topics for each pair of methods among LSM, LST, and CCF. With a p-value below 0.05, the results of LSM were significantly better than those of LST in the 63 clusters used in the experiments. Similarly, with a p-value below 0.05, the results of LSM were significantly better than those of CCF in 48 of 72 randomly selected clusters (Table 5). The efficiency of the three methods LSM, LST, and CCF as a function of the number of features also showed that LSM is better than the others: the time needed to build the latent semantics does not increase significantly as the data grow larger (Figure 13).

4.3 Limitations and future studies

This study had a few limitations that open up avenues for future work. First, to identify and discriminate the correct topics in a collection of documents, combinations of features and their co-occurrence relationships serve as clues, and the probabilities indicate their significance. All features in the documents compose a 'topological probabilistic manifold' whose 'probabilistic measures' denote the underlying structure. This complex structure is decomposed into inseparable components at various levels (in various levels of skeletons) so that each component corresponds to a topic in the collection. However, this is a time-consuming process and strongly dependent on the features and their identification (named-entities).
Second, terms with similar meanings, such as 'anticipate', 'believe', 'estimate', 'expect', 'intend', and 'project', can be separated into several independent topics even though those topics may have the same meaning; data belonging to a 'single topic' might thus be spread over several clusters. Future research will address these issues using thesauri and other adaptive methods [55]. Third, the evaluation in this study was carried out mainly by comparison with other latent semantic indexing (LSI) algorithms, but many alternative approaches for searching, clustering, and categorization exist; further study is needed to compare our approach with them. Fourth, some existing tools, such as GoPubMed, ARGO, and Vivisimo, also perform latent semantic search of high-dimensional web data. Further study is needed to compare the LSM-based tool proposed here with these tools and to find room for synergy. Fifth, there are existing knowledge bases and resources in the biomedical domain, such as MeSH (Medical Subject Headings). We have already performed some studies using the Genia corpus, which contains 1,999 Medline abstracts selected using a PubMed query for the three MeSH terms 'human', 'blood cells', and 'transcription factors' (Figure 10). More studies are needed to verify whether this approach can be easily adapted to such knowledge bases and resources.

5 Conclusion

We found that the LSM framework can discover the latent semantics in high-dimensional web data and organize them into several semantic topics. This framework could be used to enhance the functionality of currently available search engines.

6 Acknowledgement

The National Science Foundation (NSC 98-2221-E-038-012) supported this work.

References

1. Ranganathan, P.: The data explosion. (2011)
2.
Howe, D., Costanzo, M., Fey, P., Gojobori, T., Hannick, L., Hide, W., Hill, D.P., Kania, R., Schaeffer, M., St Pierre, S.: Big data: The future of biocuration. Nature 455, 47-50 (2008)
3. Gracia, J., Montiel-Ponsoda, E., Cimiano, P., Gómez-Pérez, A., Buitelaar, P., McCrae, J.: Challenges for the multilingual Web of Data. Web Semantics: Science, Services and Agents on the World Wide Web 11, 63-71 (2012)
4. Croft, W.B., Metzler, D., Strohman, T.: Search engines: Information retrieval in practice. Addison-Wesley, Reading (2010)
5. Thomas, P., Starlinger, J., Vowinkel, A., Arzt, S., Leser, U.: GeneView: a comprehensive semantic search engine for PubMed. Nucleic Acids Research 40, W585-W591 (2012)
6. Lingwal, S., Gupta, B.: A comparative study of different approaches for improving search engine performance. International Journal (2012)
7. Freitas, A., Curry, E., Oliveira, J.G., O'Riain, S.: Querying heterogeneous datasets on the linked data web: Challenges, approaches, and trends. IEEE Internet Computing 16, 24-33 (2012)
8. Dalal, M.K., Zaveri, M.A.: Automatic classification of unstructured blog text. Journal of Intelligent Learning Systems and Applications 5, 108-114 (2013)
9. Vercruysse, S., Kuiper, M.: Jointly creating digital abstracts: dealing with synonymy and polysemy. BMC Research Notes 5, 601 (2012)
10. Singer, G., Norbisrath, U., Lewandowski, D.: Ordinary search engine users carrying out complex search tasks. Journal of Information Science (2012)
11. Brossard, D., Scheufele, D.A.: Science, new media, and the public. Science 339, 40-41 (2013)
12. Stumme, G., Hotho, A., Berendt, B.: Semantic Web mining: State of the art and future directions. Web Semantics: Science, Services and Agents on the World Wide Web 4, 124-143 (2006)
13. Blanco, R., Halpin, H., Herzig, D.M., Mika, P., Pound, J., Thompson, H.S., Tran, T.: Repeatable and reliable semantic search evaluation. Web Semantics: Science, Services and Agents on the World Wide Web (2013)
14.
Hogan, A., Harth, A., Umbrich, J., Kinsella, S., Polleres, A., Decker, S.: Searching and browsing Linked Data with SWSE: The Semantic Web Search Engine. Web Semantics: Science, Services and Agents on the World Wide Web 9, 365-401 (2011)
15. Fazzinga, B., Gianforme, G., Gottlob, G., Lukasiewicz, T.: Semantic Web search based on ontological conjunctive queries. Web Semantics: Science, Services and Agents on the World Wide Web 9, 453-473 (2011)
16. Liu, L., Feng, J.: The notion of "meaning system" and its use for "semantic search". Journal of Computations & Modelling 1, 97-126 (2011)
17. Beall, J.: The weaknesses of full-text searching. The Journal of Academic Librarianship 34, 438-444 (2008)
18. Şah, M., Wade, V.: Automatic metadata mining from multilingual enterprise content. Web Semantics: Science, Services and Agents on the World Wide Web 11, 41-62 (2012)
19. Bergamaschi, S., Domnori, E., Guerra, F., Trillo Lado, R., Velegrakis, Y.: Keyword search over relational databases: a metadata approach. In: Proceedings of the 2011 International Conference on Management of Data, pp. 565-576. ACM (Year)
20. Luhn, H.P.: The automatic creation of literature abstracts. IBM Journal of Research and Development 2, 159-165 (1958)
21. Salton, G., McGill, M.J.: Introduction to modern information retrieval. (1986)
22. Zipf, G.K.: Human Behaviour and the Principle of Least Effort. (1949)
23. Aldous, D.: Exchangeability and related topics. École d'Été de Probabilités de Saint-Flour XIII—1983, 1-198 (1985)
24. De Finetti, B.: Theory of Probability: A critical introductory treatment. Vol. 1. Wiley (1974)
25. De Finetti, B.: Theory of probability: a critical introductory treatment (1993)
26. Blei, D.M., Ng, A.Y., Jordan, M.I.: Latent Dirichlet allocation. Journal of Machine Learning Research 3, 993-1022 (2003)
27. Flores, J.G., Gillard, L., Ferret, O., de Chandelar, G.: Bag of senses versus bag of words: comparing semantic and lexical approaches on sentence extraction.
In: TAC 2008 Workshop - Notebook papers and results, pp. 158-167 (Year)
28. Grishman, R., Sundheim, B.: Design of the MUC-6 evaluation. In: Proceedings of a workshop held at Vienna, Virginia, May 6-8, 1996, pp. 413-422. Association for Computational Linguistics (Year)
29. Juang, B.-H., Rabiner, L.R.: Hidden Markov models for speech recognition. Technometrics 33, 251-272 (1991)
30. Mooij, J.M., Kappen, H.J.: Sufficient conditions for convergence of the sum-product algorithm. IEEE Transactions on Information Theory 53, 4422-4437 (2007)
31. Yedidia, J.S., Freeman, W.T., Weiss, Y.: Understanding belief propagation and its generalizations. Exploring Artificial Intelligence in the New Millennium 8, 236-239 (2003)
32. Yedidia, J.S., Freeman, W.T., Weiss, Y.: Constructing free-energy approximations and generalized belief propagation algorithms. IEEE Transactions on Information Theory 51, 2282-2312 (2005)
33. Borkar, V., Deshmukh, K., Sarawagi, S.: Automatic segmentation of text into structured records. In: ACM SIGMOD Record, pp. 175-186. ACM (Year)
34. Bunescu, R., Mooney, R.J.: Relational Markov networks for collective information extraction. In: ICML-2004 Workshop on Statistical Relational Learning (Year)
35. Grimmett, G., Welsh, D.J.: Disorder in physical systems: a volume in honour of John M. Hammersley on the occasion of his 70th birthday. Oxford University Press, USA (1990)
36. Lafferty, J., McCallum, A., Pereira, F.C.: Conditional random fields: Probabilistic models for segmenting and labeling sequence data. (2001)
37. Peng, F., Feng, F., McCallum, A.: Chinese segmentation and new word detection using conditional random fields. In: Proceedings of the 20th International Conference on Computational Linguistics, p. 562. Association for Computational Linguistics (Year)
38. Settles, B.: ABNER: an open source tool for automatically tagging genes, proteins and other entity names in text. Bioinformatics 21, 3191-3192 (2005)
39.
Taskar, B., Abbeel, P., Koller, D.: Discriminative probabilistic models for relational data. In: Proceedings of the Eighteenth Conference on Uncertainty in Artificial Intelligence, pp. 485-492. Morgan Kaufmann Publishers Inc. (Year)
40. Srebro, N., Jaakkola, T.: Weighted low-rank approximations. In: Machine Learning International Workshop and Conference, p. 720 (Year)
41. Diestel, R.: Graph theory. Springer-Verlag (2005)
42. Kim, J.-D., Ohta, T., Tateisi, Y., Tsujii, J.i.: GENIA corpus—a semantically annotated corpus for bio-textmining. Bioinformatics 19, i180-i182 (2003)
43. Hersh, W., Buckley, C., Leone, T., Hickam, D.: OHSUMED: an interactive retrieval evaluation and new large test collection for research. In: SIGIR'94, pp. 192-201. Springer (Year)
44. Xu, W., Gong, Y.: Document clustering by concept factorization. In: Proceedings of the 27th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 202-209. ACM (Year)
45. Dalli, A.: Adaptation of the F-measure to cluster based lexicon quality evaluation. In: Proceedings of the EACL 2003 Workshop on Evaluation Initiatives in Natural Language Processing: are evaluation methods, metrics and resources reusable?, pp. 51-56. Association for Computational Linguistics (Year)
46. Kummamuru, K., Lotlikar, R., Roy, S., Singal, K., Krishnapuram, R.: A hierarchical monothetic document clustering algorithm for summarization and browsing search results. In: Proceedings of the 13th International Conference on World Wide Web, pp. 658-665. ACM (Year)
47. Fung, B.C., Wang, K., Ester, M.: Hierarchical document clustering using frequent itemsets. In: Proceedings of the SIAM International Conference on Data Mining, pp. 59-70 (Year)
48. Steinbach, M., Karypis, G., Kumar, V.: A comparison of document clustering techniques. In: KDD Workshop on Text Mining, pp. 525-526. Boston (Year)
49. Cai, L., Hofmann, T.: Text categorization by boosting automatically extracted concepts.
In: Proceedings of the 26th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 182-189. ACM (Year)
50. Chiang, I.-J.: Discover the semantic topology in high-dimensional data. Expert Systems with Applications 33, 256-262 (2007)
51. Hofmann, T.: Probabilistic latent semantic indexing. In: Proceedings of the 22nd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 50-57. ACM (Year)
52. Palla, G., Derényi, I., Farkas, I., Vicsek, T.: Uncovering the overlapping community structure of complex networks in nature and society. Nature 435, 814-818 (2005)
53. Dhillon, I.S., Modha, D.S.: Concept decompositions for large sparse text data using clustering. Machine Learning 42, 143-175 (2001)
54. Shi, J., Malik, J.: Normalized cuts and image segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence 22, 888-905 (2000)
55. Cohen, W.W., Richman, J.: Learning to match and cluster large high-dimensional data sets for data integration. In: Proceedings of the Eighth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 475-480. ACM (Year)