
Enhancing Topical Word Semantic for Relevance Feature Selection

1 School of Electrical Engineering and Computer Science

2 Department of Computer Science

Queensland University of Technology, Brisbane, Australia; Umm Al-Qura University, Makkah, Saudi Arabia

Unsupervised topic models, such as Latent Dirichlet Allocation (LDA), are widely used as automated feature engineering tools for textual data. They model word semantics through latent topics, on the basis that semantically related words occur in similar documents. However, the word weights assigned by these topic models do not represent the semantic meaning of these words with respect to user information needs. In this paper, we present an innovative and effective extended random sets (ERS) model to enhance the semantics of topical words. The proposed model is used as a word weighting scheme for relevance feature selection (FS). It weights words based on their appearance in the LDA latent topics and the relevant documents. The experimental results, based on 50 collections of the standard RCV1 dataset and TREC topics for information filtering, show that the proposed model significantly outperforms eight state-of-the-art baseline models on five standard performance measures.

In: Proceedings of IJCAI Workshop on Semantic Machine Learning (SML 2017), Aug 19-25 2017, Melbourne, Australia, published at http://ceur-ws.org

Introduction

LDA [BNJ03] is currently the most common probabilistic topic model compared to similar models, such as probabilistic Latent Semantic Analysis (pLSA) [Hof01], with a wide range of applications [Ble12]. LDA statistically discovers hidden topics from documents as features to be used for different tasks in information retrieval (IR) [WC06, WMW07], information filtering (IF) [GXL15] and many other text mining and machine learning applications. LDA represents documents by a set of topics, and each topic is a set of semantically related terms¹. Thus, it is capable of clustering related words in a document collection, which can reduce the impact of common problems like polysemy, synonymy and information overload [AZ12].

The core and critical part of any text FS method is the weighting function. It assigns a numerical value (usually a real number) to each feature, which specifies how informative the feature is to the user's information needs [ALA13]. In the context of probabilistic topic modelling in general and LDA specifically, a term weight is calculated locally, at the document level, based on two components: the term's local document-topics distributions and the global term-topics assignment. Therefore, in a set of similar documents, a specific term might receive a different weight in each document even though the term is semantically identical across all these documents. Such an approach does not accurately reflect the semantic meaning and usefulness of this term to the user's information needs as a whole. It badly influences the performance of LDA for FS, as it is uncertain and difficult to know which weight is more representative and should be assigned to the intended term. Would it be the average weight? The highest? The lowest? The aggregated? Several experiments in various studies confirm that the local-global weighting approach of LDA is ineffective for relevant FS [GXL15].

¹ In this paper, terms, words, keywords or unigrams are used interchangeably.

Given a document set that describes user information needs, global statistics, such as document frequency (df), reveal the discriminatory power of terms [LTSL09]. However, in IR, selecting terms based on global weighting schemes did not show better retrieval performance [MO10], because global statistics cannot describe the local importance of terms [MC13]. From LDA's perspective, it remains unclear how to use LDA's local-global term weighting function in a global context, due to the complex relationships between terms and the many entities that represent the entire collection. A term, for example, might appear in multiple documents and LDA topics, and each topic may also cover many documents or paragraphs that contain the same term. Therefore, the hard question this research tries to answer is: how can the local topic weight (at the document level) be generalised and combined with global topical statistics, such as the term frequency in both topics and relevant documents, to obtain a more discriminative and semantically representative global term weighting scheme?

The aim of this research is to develop an effective topic-based FS model for relevance discovery. The model uses a hierarchical framework based on ERS theory to assign a more representative weight to terms based on their appearance in LDA topics and all relevant documents. Two major contributions are made in this paper to the fields of text FS and IF: (a) a new theoretical model based on multiple ERSs [Mol06] to represent and interpret the complex relationships between long documents, their paragraphs, LDA topics and all terms in the collection, where a function describes each relationship; (b) a new and effective term weighting formula that assigns a more discriminative weight to topical terms, representing their relevance to the user information needs. The formula generalises LDA's local topic weight to a global one using the proposed ERS theory and then combines it with the frequency ratio of words in both documents and topics to answer the research question posed above. To test the effectiveness of our model, we conducted extensive experiments on the RCV1 dataset and the assessors' relevance judgements of the TREC filtering track. The results show that our model significantly outperforms all baseline FS models used for IF, regardless of the type of text features they use (terms, phrases, patterns, topics or even a combination of them).

Related Works

In the literature, there is a significant amount of work that extends and improves LDA to suit different needs, including text FS [ZPH08, TG09]. However, our model is intended for IF and, to the best of our knowledge, it is the first attempt to extend random sets [Mol06] to functionally describe and interpret complex relationships that involve topical terms and other entities in a document collection, in order to enhance the semantics of topical words for relevance FS. Relevance is a fundamental concept in both IR and IF. IR is mainly concerned with a document's relevance to a query on a specific subject, whereas IF is concerned with a document's relevance to user information needs [LAZ10]. In relevance discovery, FS is a method that selects a subset of features that are relevant to the user's needs and removes those that are irrelevant, redundant or noisy. Existing methods adopt different types of text features, such as terms [LTSL09], phrases (n-grams) [ALA13], patterns (a pattern is a set of associated terms) [LAA+15], topics [DDF+90, Hof01, BNJ03] or a combination of them for better performance [WMW07, LAZ10, GXL15].

The most efficient FS methods for relevance are those developed around a weighting function, which is the core and critical part of the selection algorithm [LAA+15]. The use of the LDA word weighting function for relevance is still limited and does not show encouraging results [GXL15]; the same holds for similar topic-based models such as pLSA [Hof01]. For better performance, Gao et al. (2015) [GXL15] integrate pattern mining techniques into topic models to discover discriminative features. Such work is expensive, susceptible to the feature-loss problem and might also be affected by the uncertainty of the probabilistic topic model. ERS has proven effective in describing complex relations between different entities and interpreting them as a function (weighting function) [Li03]. Thus, ERS-based models can be used to weight closed sequential patterns more accurately and thus facilitate the discovery of specific ones, as in [ALX14]. However, selecting the most useful patterns is challenging due to the large number of patterns generated from relevant documents using various minimum supports (min_sup), and may also lead to feature loss.

Background Overview

For a given corpus $C$, the set of relevant long documents $D \subseteq C$ represents the user's information needs, which might cover multiple subjects. The proposed model uses $D$ for training, where each document $d_x \in D$ has a set of paragraphs $PS$ and each paragraph has a set of terms $T$. $\Theta$ is the set of all paragraphs in $D$, so $PS \subseteq \Theta$, and $\Omega$ is the set of all unique words in $D$. The proposed model uses LDA to reduce the dimensionality of $D$ to a manageable set of topics $Z$, where $V$ is the number of topics. LDA assumes that each document has multiple latent topics [GXL15] and defines each topic $z_j \in Z$ as a multinomial probability distribution over all words in $\Omega$, $p(w_i \mid z_j)$, in which $w_i \in \Omega$ and $1 \le j \le V$, such that $\sum_{i=1}^{|\Omega|} p(w_i \mid z_j) = 1$. LDA also represents a document $d$ as a probabilistic mixture of topics, $p(z_j \mid d)$. As a result, and based on the number of latent topics, the probability (local weight) of word $w_i$ in document $d$ can be calculated as

$$p(w_i \mid d) = \sum_{j=1}^{V} p(w_i \mid z_j) \times p(z_j \mid d).$$

Finally, all hidden variables, $p(w_i \mid z_j)$ and $p(z_j \mid d)$, are statistically estimated by the Gibbs sampling algorithm [SG07].
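To make the local weight concrete, the following minimal Python sketch evaluates the topic mixture for one word in two documents. The function name `local_word_weight` and all probability values are illustrative assumptions rather than values from the experiments; the sketch only shows that the same word can receive a different weight in each document because the topic mixtures differ.

```python
# Toy illustration of p(w|d) = sum_j p(w|z_j) * p(z_j|d).
# All numbers are made up for illustration; none come from the paper.

def local_word_weight(p_w_given_z, p_z_given_d):
    """Return p(w|d) for one word.

    p_w_given_z: list with p(w|z_j) for each of the V topics.
    p_z_given_d: list with p(z_j|d) for the same V topics.
    """
    if len(p_w_given_z) != len(p_z_given_d):
        raise ValueError("both lists must have one entry per topic")
    return sum(pw * pz for pw, pz in zip(p_w_given_z, p_z_given_d))


# Hypothetical word under V = 3 topics.
p_w_given_z = [0.20, 0.05, 0.01]

# Two hypothetical documents with different topic mixtures p(z_j|d).
doc_a = [0.7, 0.2, 0.1]
doc_b = [0.1, 0.2, 0.7]

weight_a = local_word_weight(p_w_given_z, doc_a)
weight_b = local_word_weight(p_w_given_z, doc_b)
print(round(weight_a, 4), round(weight_b, 4))
```

The two printed weights differ even though the word and its topic distributions are identical, which is the local weighting behaviour discussed in the Introduction.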

Random Set

A random set is a random object whose values are subsets taken from some space [Mol06]. It works as an effective measure of uncertainty in imprecise data for decision analysis [Ngu08]. For example, let $Z$ and $\Omega$ be finite sets that represent topics and words respectively. $\Gamma$ is a set-valued mapping from $Z$ (the evidence space) onto $\Omega$ that can be written as $\Gamma : Z \to 2^{\Omega}$, and $P$ is a probability function defined on $Z$; the pair $(P, \Gamma)$ is then called a random set [KSH12]. $\Gamma$ can be extended as $\xi :: Z \to 2^{\Omega \times [0,1]}$ (also called an extended set-valued mapping), which satisfies $\sum_{(w,p) \in \xi(z)} p = 1$ for each $z \in Z$. Let $P$ be a probability function on $Z$, such that $\sum_{z \in Z} P(z) = 1$. We call $(\xi, P)$ an extended random set.
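As a rough illustration of these definitions, the sketch below stores an extended set-valued mapping as a nested dictionary and checks the two normalisation conditions. The dictionary layout, the helper names `is_extended_random_set` and `plain_mapping`, and the toy topics and words are assumptions made for this example only.

```python
import math

# Extended set-valued mapping: topic -> {word: probability}.
# Hypothetical toy values; each inner dictionary plays the role of
# the set of (w, p) pairs attached to one topic.
xi = {
    "z1": {"market": 0.5, "stock": 0.3, "bank": 0.2},
    "z2": {"bank": 0.6, "river": 0.4},
}

# Probability function on the evidence space Z (hypothetical values).
P = {"z1": 0.25, "z2": 0.75}


def is_extended_random_set(xi, P, tol=1e-9):
    """Check that P sums to 1 over Z and that the probabilities
    attached to every topic also sum to 1."""
    if set(xi) != set(P):
        return False
    if not math.isclose(sum(P.values()), 1.0, abs_tol=tol):
        return False
    return all(
        math.isclose(sum(pairs.values()), 1.0, abs_tol=tol)
        for pairs in xi.values()
    )


def plain_mapping(xi):
    """Drop the probabilities to recover the ordinary set-valued
    mapping from topics to subsets of words."""
    return {z: set(pairs) for z, pairs in xi.items()}


print(is_extended_random_set(xi, P))
print(plain_mapping(xi)["z2"])
```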

The Proposed Model

The proposed model (Figure 1) deals with the local weight problem of terms, whose weights are assigned by the LDA probability function (described in section 3.1), by exploring all possible relationships between the different entities that influence the weighting process. The target entities in our model are documents, paragraphs, topics and terms. The possible relationships between these entities are complex (a set of one-to-many relationships). For example, a document can have many paragraphs and terms; a paragraph can have multiple topics; a topic can have many terms. Inversely, a topic can cover many paragraphs, and a term can appear in many documents and topics.

In this model, we propose three ERSs to describe such complex relationships, where each ERS can be interpreted as a function by which we can determine the importance of the main entity in the relationship. The proposed ERS theory is then used to develop a new weighting scheme that weights topical words by generalising the topic's local weight and then combining it with the frequency ratio of words in both documents and topics. Assume we have a set of topics $Z = \{z_1, z_2, z_3, \ldots, z_V\}$ in $\Theta$, and let $D = \{d_1, d_2, d_3, \ldots, d_N\}$ be a set of $N$ relevant long documents. Each document $d_x$ consists of $M$ paragraphs, such that $d_x = \{p_1, p_2, p_3, \ldots, p_M\}$. A paragraph $p_y$ consists of a set of $L$ words, for example, $p_y = \{w_1, w_2, w_3, \ldots, w_L\}$. A word $w$ is a keyword or unigram, and the function $words(p)$ returns the set of words that appear in paragraph $p$. A topic $z$ can be defined as a probability distribution over the set of words $\Omega$, where $words(p) \subseteq \Omega$ for every paragraph $p \in \Theta$.
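The entities introduced above can be pictured with a small data-structure sketch. It is only an illustration under stated assumptions: whitespace tokenisation and lower-casing in `words`, the dictionary layout of `D`, and the toy paragraphs are not taken from the paper.

```python
def words(paragraph):
    """Return the set of words that appear in a paragraph.
    Lower-casing and whitespace splitting are simplifying assumptions."""
    return set(paragraph.lower().split())


# Hypothetical relevant documents, each given as a list of paragraphs.
D = {
    "d1": ["banks raised interest rates", "interest rates rose again"],
    "d2": ["the river bank flooded"],
}

# Theta: all paragraphs in D, kept with their document id and position
# so that a paragraph can be addressed by its document and index.
Theta = [
    (doc_id, y, paragraph)
    for doc_id, paragraphs in D.items()
    for y, paragraph in enumerate(paragraphs, start=1)
]

# Omega: all unique words in D.
Omega = set().union(*(words(p) for _, _, p in Theta))

print(len(Theta), len(Omega))
```

By construction, `words(p)` is a subset of `Omega` for every paragraph in `Theta`, mirroring the condition stated in the text.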

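For each $z_i \in Z$, let $f_i(w, z_i)$ be a frequency function on $\Omega$, such that $\Gamma(z_i) = \{w \mid w \in \Omega, f_i(w, z_i) > 0\}$, while the inverse mapping of $\Gamma$ is defined as $\Gamma^{-1} : \Omega \to 2^{Z}$; $\Gamma^{-1}(w) = \{z \in Z \mid w \in \Gamma(z)\}$. Also, for each $d_j \in D$, let $f_j(w, d_j)$ be a frequency function on $\Omega$, such that $\Gamma(d_j) = \{w \mid w \in \Omega, f_j(w, d_j) > 0\}$, while the inverse mapping of $\Gamma$ is defined as $\Gamma^{-1} : \Omega \to 2^{D}$; $\Gamma^{-1}(w) = \{d \in D \mid w \in \Gamma(d)\}$. These extended set-valued mappings can decide a weighting function on $\Omega$, which satisfies $sr :: \Omega \to [0, +\infty)$ such that [...]

[...] $\Gamma_1$ is proposed to describe the relationship between paragraphs and topics using the conditional probability function $P_{xy}(z \mid d_x p_y)$ as $\Gamma_1 : \Theta \to 2^{Z \times [0,1]}$; $\Gamma_1(d_x p_y) = \{(z_1, P_{xy}(z_1 \mid d_x p_y)), \ldots\}$.

Similarly, $\Gamma_2$ is proposed to describe the relationship between topics and terms using the defined frequency function $f_i(w, z_i)$ as $\Gamma_2 : Z \to 2^{\Omega \times [0,+\infty)}$; $\Gamma_2(z_i) = \{(w_1, P_i(w_1 \mid z_i)), \ldots\}$.

Lastly, $\Gamma_3$ is proposed to describe the relationship between documents and terms using the defined frequency function $f_j(w, d_j)$ as $\Gamma_3 : D \to 2^{\Omega \times [0,+\infty)}$; $\Gamma_3(d_j) = \{(w_1, f_j(w_1, d_j)), \ldots\}$.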
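The document-term mapping and its inverse can be sketched in a few lines of Python. The helper names `forward_mapping` and `inverse_mapping` and the toy frequencies are assumptions for illustration; the same two helpers would apply unchanged to a topic-term frequency table.

```python
def forward_mapping(freq):
    """Map each entity (document or topic) to the set of words whose
    frequency in that entity is greater than zero."""
    return {
        entity: {w for w, f in counts.items() if f > 0}
        for entity, counts in freq.items()
    }


def inverse_mapping(mapping):
    """Map each word back to the set of entities that contain it."""
    inverse = {}
    for entity, word_set in mapping.items():
        for w in word_set:
            inverse.setdefault(w, set()).add(entity)
    return inverse


# Hypothetical term frequencies f_j(w, d_j) for two documents.
doc_freq = {
    "d1": {"bank": 3, "loan": 1, "river": 0},
    "d2": {"bank": 1, "river": 2},
}

doc_to_words = forward_mapping(doc_freq)
word_to_docs = inverse_mapping(doc_to_words)
print(sorted(word_to_docs["bank"]))
```

A word with zero frequency in a document (here `river` in `d1`) is excluded from that document's word set, so it does not link back to the document in the inverse mapping.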

Based on the inverse mapping described above, we have $\Gamma_1^{-1}$, $\Gamma_2^{-1}$ and $\Gamma_3^{-1}$. $\Gamma_1^{-1}$ describes the inverse relationships between topics and paragraphs using the probability function $P_z(z_i)$, such that $\Gamma_1^{-1}(z) = \{d_x p_y \mid z \in \Gamma_1(d_x p_y)\}$, while $\Gamma_2^{-1}$ describes the inverse relationships between terms and topics using the function $f_i(w, z_i)$, such that $\Gamma_2^{-1}(w) = \{z \mid w \in \Gamma_2(z)\}$. $\Gamma_3^{-1}$ describes the inverse relationships between terms and documents using the function $f_j(w, d_j)$, such that $\Gamma_3^{-1}(w) = \{d \mid w \in \Gamma_3(d)\}$.

To estimate the generalised topic weight in $D$, we need to calculate the probability of each topic $P_z(z_i)$ in each paragraph of document $d$, and similarly for all documents in $D$, based on $\Gamma_1^{-1}$, in which we assume $P_{\Theta}(d_x p_y) = \frac{1}{N}$, where $N$ is the total number of paragraphs, as follows:

Pz(zi) =

To verify the proposed model, we designed two hypotheses. First, our ERS model can effectively generalise the topic's local weight that is estimated from all document paragraphs. The generalisation leads to a more accurate term weighting scheme, especially when it is combined with the term frequency ratio in both documents and topics. Second, our model is, overall, more effective in selecting relevant features than most state-of-the-art term-based, pattern-based, topic-based or even mix-based FS models. To support these two hypotheses, we conducted experiments and evaluated their performance.

Dataset

The first 50 collections of the standard Reuters Corpus Volume 1 (RCV1) dataset are used in this research because they were assessed by domain experts at NIST [SR03] for the TREC² filtering track. This number of collections is sufficient and stable for reliable experiments [BV00]. RCV1 consists of collections of documents, where each document is a news story in English published by Reuters.

Baseline models

We compared the performance of our model to eight different baseline models. These models are categorised into five groups based on the type of feature they use. The proposed model is trained only on relevant documents and does not consider irrelevant ones. Therefore, for a fair comparison, we only select baseline models that are either unsupervised or do not require irrelevant documents.

We selected Okapi BM25 [RZ09], which is one of the best term-based ranking algorithms. For the phrase-based category, we selected the n-Grams model. It represents the user's information needs as a set of phrases, with n = 3, as this is the best value reported by Gao et al. (2015) [GXL15]. Pattern Deploying based on Support (PDS) [ZLW12] is one of the pattern-based models. It can overcome the limitations of pattern frequency and usage. We selected Latent Dirichlet Allocation (LDA) [BNJ03] as the most widely used topic modelling algorithm. From the same group, we also selected Probabilistic Latent Semantic Analysis (pLSA) [Hof01]; it is similar to LDA and can deal with the problem of polysemy. Three models were selected from the mix-based category. First, we selected the Pattern-Based Topic Model (PBTM-FP) [GXL15], which incorporates topics and frequent patterns (FP) to obtain a semantically rich and discriminative representation for IF. Second, we selected PBTM-FCP [GXL15], which is similar to PBTM-FP except that it uses frequent closed patterns (FCP) instead. Lastly, we selected Topical N-Grams (TNG) [WMW07], which integrates the topic model with phrases (n-grams) to discover topical phrases that are more discriminative and interpretable.

² http://trec.nist.gov/

The effectiveness of our model is measured based on relevance judgements by five metrics that are well-established and commonly used in the IR and IF communities. These metrics are the average precision of the top-20 ranked documents (top-20), break-even point (b/p), mean average precision (MAP), F-score (F1) measure, and 11-points interpolated average precision (IAP). For more details about these measures, the reader can refer to Manning et al. (2008) [MRS08]. For further analysis of the experimental results, the Wilcoxon signed-rank test (Wilcoxon T-test) [Wil45] was used. The Wilcoxon T-test is a statistical non-parametric hypothesis test used to assess whether the ranked means of two related samples differ. It is a better alternative to Student's t-test, especially when no normal distribution is assumed.

For each collection, we train our model on all paragraphs of the relevant documents D in the training part of the collection. We use LDA to extract ten topics because this is the best number for each collection, as reported in [GXL13, GXL14, GXL15]. Then, the proposed model scores the documents' terms, ranks them and uses the top-k features as a query to an IF system. The IF system takes unknown documents (from the testing part of the same collection) and decides their relevance to the user's information needs (relevant or irrelevant). The value of k, however, is determined experimentally. The same process is also applied separately to all baseline models. If the results of the IF system on the five metrics are better than the baseline results, then we can claim that our model is significant and outperforms the baseline model.
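The ranking-based measures named in this section can be sketched as follows. This is a minimal illustration that follows the standard textbook definitions, not the evaluation code used in the experiments; the function names and the toy relevance list are assumptions, and the sketch assumes the ranking contains every relevant document.

```python
def precision_at_k(ranked_relevance, k):
    """Fraction of relevant documents among the top-k ranked ones.
    ranked_relevance is a list of 0/1 flags in ranked order."""
    return sum(ranked_relevance[:k]) / k


def break_even_point(ranked_relevance):
    """Precision at rank R, where R is the number of relevant
    documents; at that rank precision equals recall."""
    total_relevant = sum(ranked_relevance)
    if total_relevant == 0:
        return 0.0
    return precision_at_k(ranked_relevance, total_relevant)


def average_precision(ranked_relevance):
    """Mean of the precision values taken at each relevant document."""
    total_relevant = sum(ranked_relevance)
    if total_relevant == 0:
        return 0.0
    hits, precision_sum = 0, 0.0
    for rank, rel in enumerate(ranked_relevance, start=1):
        if rel:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / total_relevant


def eleven_point_interpolated(ranked_relevance):
    """Interpolated precision at recall levels 0.0, 0.1, ..., 1.0."""
    total_relevant = sum(ranked_relevance)
    if total_relevant == 0:
        return [0.0] * 11
    points, hits = [], 0  # (recall, precision) at each relevant document
    for rank, rel in enumerate(ranked_relevance, start=1):
        if rel:
            hits += 1
            points.append((hits / total_relevant, hits / rank))
    result = []
    for i in range(11):
        level = i / 10
        candidates = [p for r, p in points if r >= level - 1e-12]
        result.append(max(candidates) if candidates else 0.0)
    return result


# Hypothetical ranking of ten test documents (1 = relevant).
ranking = [1, 0, 1, 0, 0, 1, 0, 0, 0, 0]
print(precision_at_k(ranking, 5), break_even_point(ranking))
print(average_precision(ranking))
print(eleven_point_interpolated(ranking))
```

MAP is then the mean of `average_precision` over all collections, and IAP averages the eleven interpolated values.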

The IF testing system uses the following equation to rank the testing documents set:

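$$\mathrm{weight}(d) = \sum_{t \in Q} x, \qquad x = \begin{cases} \mathrm{weight}(t), & \text{if } t \in d \\ 0, & \text{if } t \notin d \end{cases} \qquad (3)$$

where $\mathrm{weight}(d)$ is the weight of document $d$.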
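The ranking rule above sums the weights of the query terms that occur in a document. A minimal sketch, assuming the query is held as a dictionary from term to weight and each test document as a list of tokens (both layouts, the function names and the toy values are illustrative assumptions):

```python
def document_weight(doc_terms, query_weights):
    """Sum weight(t) over the query terms t that occur in the document;
    query terms absent from the document contribute 0."""
    present = set(doc_terms)
    return sum(w for t, w in query_weights.items() if t in present)


def rank_documents(docs, query_weights):
    """Return document ids ordered by decreasing document weight."""
    scored = {
        doc_id: document_weight(terms, query_weights)
        for doc_id, terms in docs.items()
    }
    return sorted(scored, key=lambda doc_id: scored[doc_id], reverse=True)


# Hypothetical top-k features with their term weights.
query = {"bank": 0.5, "loan": 0.3, "rate": 0.2}

# Hypothetical test documents, already tokenised.
test_docs = {
    "d1": ["bank", "rate", "rise"],
    "d2": ["river", "bank"],
    "d3": ["football", "match"],
}

ranking = rank_documents(test_docs, query)
print(ranking)
```

Each query term is counted once per document regardless of how often it occurs, which matches the indicator-style definition of $x$.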

Experimental Settings

In our experiment, we use the MALLET toolkit [McC02] to implement all LDA-based models, except for the pLSA model, for which we used the Lemur toolkit³ instead. All topic-based models require some parameters to be set. For the LDA-based models, we set the number of iterations for Gibbs sampling to 1000 and the hyper-parameters to $\beta = 0.01$ and $\alpha = 50/V$, as justified in [SG07]. We configured the number of iterations for pLSA to be 1000 (the default setting). For the experimental parameters of BM25, we set $b = 0.75$ and $k_1 = 1.2$, as recommended by Manning et al. (2008) [MRS08].

³ https://www.lemurproject.org/
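To show where the two BM25 parameters enter, the sketch below scores tokenised documents with a simplified Okapi BM25 formula in the form given by Manning et al. [MRS08], using $\log(N / df_t)$ as the inverse document frequency. The exact BM25 variant used in the experiments is not stated, so this form, the function name `bm25_scores` and the toy documents are assumptions.

```python
import math
from collections import Counter

K1 = 1.2   # term-frequency saturation parameter
B = 0.75   # document-length normalisation parameter


def bm25_scores(query_terms, documents, k1=K1, b=B):
    """Score each tokenised document against the query terms.
    Assumes a non-empty list of non-empty documents."""
    n_docs = len(documents)
    avg_len = sum(len(doc) for doc in documents) / n_docs

    # Document frequency of every term in the collection.
    df = Counter()
    for doc in documents:
        df.update(set(doc))

    scores = []
    for doc in documents:
        tf = Counter(doc)
        length_norm = k1 * ((1 - b) + b * len(doc) / avg_len)
        score = 0.0
        for term in query_terms:
            if tf[term] == 0:
                continue  # absent terms contribute nothing
            idf = math.log(n_docs / df[term])
            score += idf * (k1 + 1) * tf[term] / (length_norm + tf[term])
        scores.append(score)
    return scores


# Hypothetical tokenised documents.
docs = [
    ["bank", "loan", "bank"],
    ["river", "bank"],
    ["football", "match"],
]
scores = bm25_scores(["loan", "bank"], docs)
print([round(s, 3) for s in scores])
```

Larger values of `b` penalise long documents more strongly, while `k1` controls how quickly repeated occurrences of a term stop adding to the score.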

Experimental Results

Table 1 and Figure 2 show the evaluation results of our model and the baselines. These results are averaged over the 50 collections of RCV1. The results in Table 1 are categorised by the type of feature used by the baseline model, and improvement% represents the percentage change in our model's performance compared to the best result of the baseline models in that category (marked in bold if there is more than one baseline model in the category). We consider any improvement greater than 5% to be significant.
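The improvement% figure described above is a plain percentage change. A short sketch with hypothetical scores (the function names and the two example values are assumptions, not results from Table 1):

```python
SIGNIFICANCE_THRESHOLD = 5.0  # percent, as stated in the text


def improvement_percent(ours, best_baseline):
    """Percentage change of our score relative to the best baseline."""
    return (ours - best_baseline) / best_baseline * 100.0


def is_significant(ours, best_baseline):
    """An improvement greater than 5% is treated as significant."""
    return improvement_percent(ours, best_baseline) > SIGNIFICANCE_THRESHOLD


# Hypothetical scores on one measure.
our_score, baseline_score = 0.60, 0.50
change = improvement_percent(our_score, baseline_score)
print(round(change, 1), is_significant(our_score, baseline_score))
```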

Table 1 shows that our model outperformed all baseline models for information filtering in all five measures. Regardless of the type of feature used by the baseline model, our model is significantly better on average, with a minimum improvement of 8.0% and a maximum of 39.7%. Moreover, the 11-points result in Figure 2 illustrates the superiority of the proposed model and confirms the significant improvements shown in Table 1. The Wilcoxon T-test results (Table 2) present the p-values of the results of our model compared to all baseline models on all performance measures. A model's result is considered significantly different from another model's if the p-value is less than 0.05 [Wil45]. Clearly, the p-value for all metrics is well below 0.05, confirming that our model's performance is significantly different from all baselines. This shows that our model gains a substantial improvement over the baseline models used.

Based on the results presented earlier, we are confident in claiming that our extended random sets model can effectively generalise the local topic weight at the document level in the LDA term scoring function and, thus, provide a more globally representative term weight when it is combined with the term frequency in documents and topics. Also, our model is more effective in selecting relevant features to acquire the user's information needs that are represented by a set of long documents.

Conclusion

This paper presents an innovative and effective topic-based feature ranking model to enhance the semantics of topical words in order to acquire user needs. The model extends random sets to generalise the LDA topic weight at the document level. Then, a term weighting scheme is developed to rank topical terms based on their frequent appearance in the LDA topic distributions and all relevant documents. The newly calculated weight effectively reflects the relevance of a term to the user's information needs and maintains the same semantic meaning of terms across all relevant documents. The proposed model is tested for IF on the standard RCV1 dataset and TREC topics, with five different performance metrics and eight state-of-the-art baseline models. The experimental results show that our model achieved significantly better performance than all baseline models.

References

[ALA13] Mubarak Albathan, Yuefeng Li, and Abdulmohsen Algarni. Enhanced N-Gram Extraction Using Relevance Feature Discovery, pages 453–465. Springer International Publishing, Cham, 2013.

[ALX14] Mubarak Albathan, Yuefeng Li, and Yue Xu. Using extended random set to find specific patterns. In WI'14, volume 2, pages 30–37. IEEE, 2014.

[AZ12] Charu C Aggarwal and ChengXiang Zhai. A survey of text clustering algorithms. In Mining text data, pages 77–128. Springer, 2012.

[Ble12] David M Blei. Probabilistic topic models. Communications of the ACM, 55(4):77–84, 2012.

[BNJ03] David M Blei, Andrew Y Ng, and Michael I Jordan. Latent dirichlet allocation. The Journal of Machine Learning Research, 3:993–1022, 2003.

[BV00] Chris Buckley and Ellen M Voorhees. Evaluating evaluation measure stability. In SIGIR'00, pages 33–40. ACM, 2000.

[DDF+90] Scott Deerwester, Susan T Dumais, George W Furnas, Thomas K Landauer, and Richard Harshman. Indexing by latent semantic analysis. Journal of the American Society for Information Science, 41(6):391, 1990.

[GXL13] Yang Gao, Yue Xu, and Yuefeng Li. Pattern-based topic models for information filtering. In ICDM'13, pages 921–928. IEEE, 2013.

[GXL14] Yang Gao, Yue Xu, and Yuefeng Li. Topical pattern based document modelling and relevance ranking. In WISE'14, pages 186–201. Springer, 2014.

[GXL15] Yang Gao, Yue Xu, and Yuefeng Li. Pattern-based topics for document modelling in information filtering. IEEE TKDE, 27(6):1629–1642, 2015.

[Hof01] Thomas Hofmann. Unsupervised learning by probabilistic latent semantic analysis. Machine Learning, 42(1-2):177–196, 2001.

[KSH12] Rudolf Kruse, Erhard Schwecke, and Jochen Heinsohn. Uncertainty and vagueness in knowledge based systems: numerical methods. Springer Science & Business Media, 2012.

[LAA+15] Yuefeng Li, Abdulmohsen Algarni, Mubarak Albathan, Yan Shen, and Moch Arif Bijaksana. Relevance feature discovery for text mining. IEEE TKDE, 27(6):1656–1669, 2015.

[LAZ10] Yuefeng Li, Abdulmohsen Algarni, and Ning Zhong. Mining positive and negative patterns for relevance feature discovery. In KDD'10, pages 753–762. ACM, 2010.

[Li03] Yuefeng Li. Extended random sets for knowledge discovery in information systems. In RSFDGrC'03, pages 524–532. Springer, 2003.

[LTSL09] Man Lan, Chew Lim Tan, Jian Su, and Yue Lu. Supervised and traditional term weighting methods for automatic text categorization. IEEE TPAMI, 31(4):721–735, 2009.

[MC13] K Tamsin Maxwell and W Bruce Croft. Compact query term selection using topically related text. In SIGIR'13, pages 583–592. ACM, 2013.

[McC02] Andrew Kachites McCallum. MALLET: A machine learning for language toolkit. 2002.

[MO10] Craig Macdonald and Iadh Ounis. Global statistics in proximity weighting models. In Web N-gram Workshop, page 30. Citeseer, 2010.

[Mol06] Ilya Molchanov. Theory of random sets. Springer Science & Business Media, 2006.

[MRS08] Christopher D Manning, Prabhakar Raghavan, and Hinrich Schütze. Introduction to information retrieval. Cambridge University Press, 2008.

[Ngu08] Hung T Nguyen. Random sets. Scholarpedia, 3(7):3383, 2008.

[RZ09] Stephen Robertson and Hugo Zaragoza. The probabilistic relevance framework: BM25 and beyond. Now Publishers Inc, 2009.

[SG07] Mark Steyvers and Tom Griffiths. Probabilistic topic models. Handbook of latent semantic analysis, 427(7):424–440, 2007.

[SR03] Ian Soboroff and Stephen Robertson. Building a filtering test collection for TREC 2002. In SIGIR'03, pages 243–250. ACM, 2003.

[TG09] Serafettin Tasci and Tunga Gungor. LDA-based keyword selection in text categorization. In ISCIS'09, pages 230–235. IEEE, 2009.

[WC06] Xing Wei and W Bruce Croft. LDA-based document models for ad-hoc retrieval. In SIGIR'06, pages 178–185. ACM, 2006.

[Wil45] Frank Wilcoxon. Individual comparisons by ranking methods. Biometrics Bulletin, 1(6):80–83, 1945.

[WMW07] Xuerui Wang, Andrew McCallum, and Xing Wei. Topical n-grams: Phrase and topic discovery, with an application to information retrieval. In ICDM'07, pages 697–702. IEEE, 2007.

[ZLW12] Ning Zhong, Yuefeng Li, and Sheng-Tang Wu. Effective pattern discovery for text mining. IEEE TKDE, 24(1):30–44, 2012.

[ZPH08] Zhiwei Zhang, Xuan-Hieu Phan, and Susumu Horiguchi. An efficient feature selection using hidden topic in text categorization. In AINAW'08, pages 1223–1228. IEEE, 2008.