=Paper= {{Paper |id=Vol-1986/SML17_paper_4 |storemode=property |title=Enhancing Topical Word Semantic for Relevance Feature Selection |pdfUrl=https://ceur-ws.org/Vol-1986/SML17_paper_4.pdf |volume=Vol-1986 |authors=Abdullah Semran Alharbi,Yuefeng Li,Yue Xu |dblpUrl=https://dblp.org/rec/conf/ijcai/AlharbiL017 }} ==Enhancing Topical Word Semantic for Relevance Feature Selection== https://ceur-ws.org/Vol-1986/SML17_paper_4.pdf
         Enhancing Topical Word Semantic for Relevance
                       Feature Selection

             Abdullah Semran Alharbi1,2                        Yuefeng Li1                     Yue Xu1
               asaharbi@uqu.edu.sa                           y2.li@qut.edu.au             yue.xu@qut.edu.au
                                1
                                    School of Electrical Engineering and Computer Science
                                            Queensland University of Technology
                                                      Brisbane, Australia

                                             2
                                                 Department of Computer Science
                                                    Umm Al-Qura University
                                                     Makkah, Saudi Arabia



                        Abstract

    Unsupervised topic models, such as Latent Dirichlet Allocation (LDA),
    are widely used as automated feature engineering tools for textual data.
    They model word semantics through latent topics, on the basis that
    semantically related words occur in similar documents. However, the word
    weights assigned by these topic models do not represent the semantic
    meaning of the words with respect to user information needs. In this
    paper, we present an innovative and effective extended random sets (ERS)
    model to enhance the semantics of topical words. The proposed model is
    used as a word weighting scheme for relevance feature selection (FS). It
    accurately weights words based on their appearance in the LDA latent
    topics and the relevant documents. The experimental results, based on 50
    collections of the standard RCV1 dataset and TREC topics for information
    filtering, show that the proposed model significantly outperforms eight
    state-of-the-art baseline models on five standard performance measures.

Copyright © by the paper's authors. Copying permitted for private and academic
purposes.
In: Proceedings of the IJCAI Workshop on Semantic Machine Learning (SML 2017),
Aug 19-25 2017, Melbourne, Australia, published at http://ceur-ws.org

1    Introduction

LDA [BNJ03] is currently the most common probabilistic topic model compared
to similar models, such as probabilistic Latent Semantic Analysis
(pLSA) [Hof01], with a wide range of applications [Ble12]. LDA statistically
discovers hidden topics from documents as features to be used for different
tasks in information retrieval (IR) [WC06, WMW07], information filtering
(IF) [GXL15] and for many other text mining and machine learning
applications. LDA represents documents by a set of topics, and each topic is
a set of semantically related terms¹. Thus, it is capable of clustering
related words in a document collection, which can reduce the impact of common
problems like polysemy, synonymy and information overload [AZ12].

¹ In this paper, terms, words, keywords or unigrams are used interchangeably.

    The core and critical part of any text FS method is the weighting
function. It assigns a numerical value (usually a real number) to each
feature, which specifies how informative the feature is to the user's
information needs [ALA13]. In the context of probabilistic topic modelling in
general, and LDA specifically, calculating a term weight is done locally at
the document level based on two components: the term's local document-topics
distribution and the global term-topics assignment. Therefore, in a set of
similar documents, a specific term might receive a different weight in each
single document, even though this term is semantically identical across all
these documents. Such an approach does not accurately reflect the semantic
meaning and usefulness of this term to the entire user's information needs.
It badly influences the performance of LDA
for FS, as it is uncertain and difficult to know which weight is more
representative and should be assigned to the intended term. Would it be the
average weight? The highest? The lowest? The aggregated? Several experiments
in various studies confirm that the local-global weighting approach of LDA is
ineffective for relevance FS [GXL15].

    Given a document set that describes user information needs, global
statistics, such as document frequency (df), reveal the discriminatory power
of terms [LTSL09]. However, in IR, selecting terms based on global weighting
schemes did not show better retrieval performance [MO10], because global
statistics cannot describe the local importance of terms [MC13]. From the
LDA perspective, it is challenging and still uncertain how to use LDA's
local-global term weighting function in a global context, due to the complex
relationships between terms and the many entities that represent the entire
collection. A term, for example, might appear in multiple documents and LDA
topics, and each topic may also cover many documents or paragraphs that
contain the same term. Therefore, the hard question this research tries to
answer is: how can the local topic weight (at the document level) be
generalised and combined with global topical statistics, such as the term
frequency in both topics and relevant documents, into a more discriminative
and semantically representative global term weighting scheme?

    The aim of this research is to develop an effective topic-based FS model
for relevance discovery. The model uses a hierarchical framework based on ERS
theory to assign a more representative weight to terms based on their
appearance in LDA topics and all relevant documents. Two major contributions
are therefore made in this paper to the fields of text FS and IF: (a) a new
theoretical model based on multiple ERSs [Mol06] to represent and interpret
the complex relationships between long documents, their paragraphs, LDA
topics and all terms in the collection, where a function describes each
relationship; (b) a new and effective term weighting formula that assigns a
more discriminative and accurate weight to topical terms, representing their
relevance to the user's information needs. The formula generalises LDA's
local topic weight to a global one using the proposed ERS theory and then
combines it with the frequency ratio of words in both documents and topics,
answering the question asked above. To test the effectiveness of our model,
we conducted extensive experiments on the RCV1 dataset and the assessors'
relevance judgements of the TREC filtering track. The results show that our
model significantly outperforms all the baseline FS models used for IF,
regardless of the type of text features they use (terms, phrases, patterns,
topics or even a different combination of them).

2    Related Works

In the literature, there is a significant amount of work that extends and
improves LDA to suit different needs, including text FS [ZPH08, TG09].
However, our model is intended for IF and, to the best of our knowledge, it
is the first attempt to extend random sets [Mol06] to functionally describe
and interpret complex relationships that involve topical terms and other
entities in a document collection, in order to enhance the semantics of
topical words for relevance FS. Relevance is a fundamental concept in both IR
and IF. IR is mainly concerned with a document's relevance to a query on a
specific subject, whereas IF concerns a document's relevance to user
information needs [LAZ10]. In relevance discovery, FS is a method that
selects a subset of features that are relevant to the user's needs, thus
removing those that are irrelevant, redundant or noisy. Existing methods
adopt different types of text features, such as terms [LTSL09], phrases
(n-grams) [ALA13], patterns (a pattern is a set of associated
terms) [LAA+15], topics [DDF+90, Hof01, BNJ03] or a combination of them for
better performance [WMW07, LAZ10, GXL15].

    The most efficient FS methods for relevance are the ones developed around
a weighting function, which is the core and critical part of the selection
algorithm [LAA+15]. Using the LDA word weighting function for relevance is
still limited and does not show encouraging results [GXL15], and the same
holds for similar topic-based models such as pLSA [Hof01]. For better
performance, Gao et al. (2015) [GXL15] integrate pattern mining techniques
into topic models to discover discriminative features. Such work is expensive
and susceptible to the feature-loss problem, and it might also be impacted by
the uncertainty of the probabilistic topic model. ERS is proven to be
effective in describing complex relations between different entities and
interpreting them as a function (a weighting function) [Li03]. Thus,
ERS-based models can be used to weight closed sequential patterns more
accurately and thereby facilitate the discovery of specific ones, as shown
in [ALX14]. However, selecting the most useful patterns is challenging due to
the large number of patterns generated from relevant documents under various
minimum supports (min_sup), which may also lead to feature loss.

3    Background Overview

For a given corpus C, the relevant long documents set D ⊆ C represents the
user's information needs, which might cover multiple subjects. The proposed
model uses D for training, where each document d_x ∈ D has a set of
paragraphs PS and each paragraph has a set of terms T. Θ is the set of all
paragraphs in D, and PS ⊆ Θ. The set of terms Ω is the set of all unique
words in D.
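The notation above maps directly onto simple data structures. A minimal
Python sketch (the two-document collection and the blank-line paragraph split
are illustrative assumptions, not from the paper):

```python
# Sketch of the Section 3 notation: a relevant document set D, where each
# document d_x is a set of paragraphs and each paragraph a set of terms.
# The documents and the whitespace tokeniser are invented placeholders.
D = [
    "topic models discover hidden topics\n\nhidden topics act as features",
    "relevant features support information filtering",
]

# Each document becomes its list of paragraphs PS (split on blank lines),
# and each paragraph its list of terms.
docs = [[p.split() for p in d.split("\n\n")] for d in D]

# Theta: all paragraphs in D; Omega: all unique words in D.
theta = [p for d in docs for p in d]
omega = {w for p in theta for w in p}

print(len(theta))      # → 3
print(sorted(omega))
```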
3.1   Latent Dirichlet Allocation

The proposed model uses LDA to reduce the dimensionality of D to a manageable
set of topics Z, where V is the number of topics. LDA assumes that each
document has multiple latent topics [GXL15], and defines each topic z_j ∈ Z
as a multinomial probability distribution p(w_i|z_j) over all words in Ω, in
which w_i ∈ Ω and 1 ≤ j ≤ V, such that \sum_{i=1}^{|Ω|} p(w_i|z_j) = 1. LDA
also represents a document d as a probabilistic mixture of topics, p(z_j|d).
As a result, and based on the number of latent topics, the probability (local
weight) of word w_i in document d can be calculated as

    p(w_i|d) = \sum_{j=1}^{V} p(w_i|z_j) × p(z_j|d).

Finally, all hidden variables, p(w_i|z_j) and p(z_j|d), are statistically
estimated by the Gibbs sampling algorithm [SG07].
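This local weight is a small mixture computation. A sketch with toy numbers
(both distributions below are invented stand-ins for the Gibbs-sampler
estimates, with V = 2 topics and a three-word vocabulary):

```python
# p(w_i | z_j): each topic is a distribution over the vocabulary (sums to 1).
p_w_given_z = {
    "z1": {"film": 0.7, "actor": 0.2, "market": 0.1},
    "z2": {"film": 0.1, "actor": 0.1, "market": 0.8},
}
# p(z_j | d): the topic mixture of one document.
p_z_given_d = {"z1": 0.6, "z2": 0.4}

def local_weight(w, p_w_given_z, p_z_given_d):
    """LDA local word weight: p(w|d) = sum_j p(w|z_j) * p(z_j|d)."""
    return sum(p_w_given_z[z][w] * p_z_given_d[z] for z in p_z_given_d)

print(local_weight("film", p_w_given_z, p_z_given_d))  # 0.7*0.6 + 0.1*0.4 ≈ 0.46
```

Note that the local weights of all vocabulary words in one document again sum
to 1, which is why the same term can receive a different weight in every
document that contains it.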

3.2   Random Set

A random set is a random object whose values are subsets taken from some
space [Mol06]. It works as an effective measure of uncertainty in imprecise
data for decision analysis [Ngu08]. For example, let Z and Ω be finite sets
that represent topics and words, respectively. Let Γ be a set-valued mapping
from Z (the evidence space) onto Ω, written as Γ : Z → 2^Ω, and let P be a
probability function defined on Z; the pair (P, Γ) is called a random
set [KSH12]. Γ can be extended as ξ : Z → 2^{Ω×[0,1]} (also called an
extended set-valued mapping), which satisfies \sum_{(w,p) ∈ ξ(z)} p = 1 for
each z ∈ Z. Let P be a probability function on Z such that
\sum_{z ∈ Z} P(z) = 1. We call (ξ, P) an extended random set.

Figure 1: Our proposed model

4     The Proposed Model

The proposed model (Figure 1) deals with the local weight problem of terms
assigned by the LDA probability function (described in Section 3.1) by
exploring all possible relationships between the different entities that
influence the weighting process. The target entities in our model are
documents, paragraphs, topics and terms. The possible relationships between
these entities are complex (a set of one-to-many relationships). For example,
a document can have many paragraphs and terms; a paragraph can have multiple
topics; a topic can have many terms. Inversely, a topic can cover many
paragraphs, and a term can appear in many documents and topics.

    In this model, we propose three ERSs to describe such complex
relationships, where each ERS can be interpreted as a function by which we
can determine the importance of the main entity in the relationship. The
proposed ERS theory is then used to develop a new weighting scheme that
accurately weights topical words by generalising the topic's local weight and
then combining it with the frequency ratio of words in both documents and
topics.

4.1   Extended Random Sets

Let us assume we have a set of topics Z = {z_1, z_2, z_3, ..., z_V} in Θ, and
let D = {d_1, d_2, d_3, ..., d_N} be a set of N relevant long documents. Each
document d_x consists of M paragraphs, d_x = {p_1, p_2, p_3, ..., p_M}. A
paragraph p_y consists of a set of L words, for example
p_y = {w_1, w_2, w_3, ..., w_L}. A word w is a keyword or unigram, and the
function words(p) returns the set of words that appear in paragraph p. A
topic z can be defined as a probability distribution over the set of words Ω,
where words(p) ⊆ Ω for every paragraph p ∈ Θ.

    For each z_i ∈ Z, let f_i(w, z_i) be a frequency function on Ω, such that
Γ(z_i) = {w | w ∈ Ω, f_i(w, z_i) > 0}, while the inverse mapping of Γ is
defined as Γ^{-1} : Ω → 2^Z; Γ^{-1}(w) = {z ∈ Z | w ∈ Γ(z)}. Also, for each
d_j ∈ D, let f_j(w, d_j) be a frequency function on Ω, such that
Λ(d_j) = {w | w ∈ Ω, f_j(w, d_j) > 0}, while the inverse mapping of Λ is
defined as Λ^{-1} : Ω → 2^D; Λ^{-1}(w) = {d ∈ D | w ∈ Λ(d)}. These extended
set-valued mappings can decide a weighting function sr : Ω → [0, +∞) on Ω,
such that

    sr(w) = \sum_{d_j ∈ Λ^{-1}(w)} (1 / f_j(w, d_j)) ·
            ( \sum_{z_i ∈ Γ^{-1}(w)} P_z(z_i) × f_i(w, z_i) )            (1)

where sr(w) is the combined weight of topical word w at the collection level.

    The extended random set ξ_1 is proposed to describe the relationships
between paragraphs and topics using the conditional probability function
P_xy(z | d_x p_y), as ξ_1 : Θ → 2^{Z×[0,1]};
ξ_1(d_x p_y) = {(z_1, P_xy(z_1 | d_x p_y)), ...}.

    Similarly, ξ_2 is proposed to describe the relationship between topics
and terms using the defined frequency function f_i(w, z_i), as
ξ_2 : Z → 2^{Ω×[0,+∞)}; ξ_2(z_i) = {(w_1, P_i(w_1 | z_i)), ...}.

    Lastly, ξ_3 is proposed to describe the relationship between documents
and terms using the defined frequency function f_j(w, d_j), as
ξ_3 : D → 2^{Ω×[0,+∞)}; ξ_3(d_j) = {(w_1, f_j(w_1, d_j)), ...}.

    Based on the inverse mappings described above, we have ξ_1^{-1}, ξ_2^{-1}
and ξ_3^{-1}. ξ_1^{-1} describes the inverse relationship between topics and
paragraphs using the probability function P_z(z_i), such that
ξ_1^{-1}(z) = {d_x p_y | z ∈ ξ_1(d_x p_y)}; ξ_2^{-1}, on the other hand,
describes the inverse relationship between terms and topics using
f_i(w, z_i), such that ξ_2^{-1}(w) = {z | w ∈ ξ_2(z)}; and ξ_3^{-1} describes
the inverse relationship between terms and documents using f_j(w, d_j), such
that ξ_3^{-1}(w) = {d | w ∈ ξ_3(d)}.

4.2   Generalised Topic Weight

To estimate the generalised topic weight in D, we need to calculate the
probability P_z(z_i) of each topic in each paragraph of document d, and
similarly for all documents in D, based on ξ_1^{-1}, in which we assume
P_Θ(d_x p_y) = 1/N, where N here is the total number of paragraphs, as
follows:

    P_z(z_i) = \sum_{d_x p_y ∈ ξ_1^{-1}(z_i)} P_Θ(d_x p_y) × P_xy(z_i | d_x p_y)
             = (1/N) \sum_{d_x p_y ∈ ξ_1^{-1}(z_i)} P_xy(z_i | d_x p_y)        (2)

where P_xy(z_i | d_x p_y) is estimated by LDA, d_x p_y refers to paragraph y
in document x, and ξ_1^{-1} is the mapping function defined previously.

4.3   Topical Word Weighting Scheme

To calculate the topical word weight at the collection level, we simply
substitute P_z(z_i) in Equation 1 by its value from Equation 2. Equation 3
shows the substitution.

5     Evaluation

To verify the proposed model, we designed two hypotheses. First, our ERS
model can effectively generalise the topic's local weight that is estimated
from all documents' paragraphs; the generalisation leads to a more accurate
term weighting scheme, especially when it is combined with the term frequency
ratio in both documents and topics. Second, our model, overall, is more
effective in selecting relevant features than most state-of-the-art
term-based, pattern-based, topic-based or even mix-based FS models. To
support these two hypotheses, we conducted experiments and evaluated their
performance.

5.1   Dataset

The first 50 collections of the standard Reuters Corpus Volume 1 (RCV1)
dataset are used in this research because they were assessed by domain
experts at NIST [SR03] for TREC² in its filtering track. This number of
collections is sufficient and stable for better and more reliable
experiments [BV00]. RCV1 is a collection of documents in which each document
is a news story in English published by Reuters.

² http://trec.nist.gov/

5.2   Baseline models

We compared the performance of our model to eight different baseline models.
These models are categorised into five groups based on the type of feature
they use. The proposed model is trained only on relevant documents and does
not consider irrelevant ones. Therefore, for a fair comparison and
judgement, we can only select baseline models that are either unsupervised or
do not require the use of irrelevant documents.

    We selected Okapi BM25 [RZ09], which is one of the best term-based
ranking algorithms. The phrase-based model n-Grams is selected; it represents
the user's information needs as a set of phrases, where n = 3 as it is the
best value reported by Gao et al. (2015) [GXL15]. Pattern Deploying based on
Support (PDS) [ZLW12] is one of the pattern-based models; it can overcome the
limitations of pattern frequency and usage. We selected Latent Dirichlet
Allocation (LDA) [BNJ03] as the most widely used topic modelling algorithm.
From the same group we also selected probabilistic Latent Semantic Analysis
(pLSA) [Hof01]; it is similar to LDA and can deal with the problem of
polysemy. Three models were selected from the mix-based category. First, we
selected the Pattern-Based Topic Model (PBTM-FP) [GXL15], which incorporates
topics and frequent patterns (FP) to obtain a semantically rich and
discriminative representation for IF. Second, the PBTM-FCP [GXL15], which is
similar to the PBTM-FP except that it uses frequent closed patterns (FCP)
instead. Lastly, we selected Topical N-Grams (TNG) [WMW07], which integrates
the topic model with phrases (n-grams) to discover topical phrases that are
more discriminative and interpretable.

    sr(w) = (1/N) \sum_{d_j ∈ ξ_3^{-1}(w)} [ (1 / f_j(w, d_j)) ×
            \sum_{z_i ∈ ξ_2^{-1}(w)} ( f_i(w, z_i) ×
            \sum_{d_x p_y ∈ ξ_1^{-1}(z_i)} P_xy(z_i | d_x p_y) ) ]            (3)
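Equation 3 can be read directly as three nested sums over the inverse
mappings. A sketch with toy inputs (all mappings, frequencies and
probabilities below are invented for illustration; in the model they come
from LDA and the term/topic frequency functions):

```python
# Toy instantiation of Equation 3. Inverse mappings are plain dicts.
inv_xi2 = {"film": ["z1"], "market": ["z1", "z2"]}        # term -> topics containing it
inv_xi3 = {"film": ["d1"], "market": ["d1", "d2"]}        # term -> documents containing it
inv_xi1 = {"z1": ["d1p1", "d1p2"], "z2": ["d2p1"]}        # topic -> paragraphs covering it
f_topic = {("film", "z1"): 3, ("market", "z1"): 2, ("market", "z2"): 4}  # f_i(w, z_i)
f_doc = {("film", "d1"): 2, ("market", "d1"): 1, ("market", "d2"): 2}    # f_j(w, d_j)
p_xy = {("z1", "d1p1"): 0.5, ("z1", "d1p2"): 0.3, ("z2", "d2p1"): 0.9}   # P_xy(z_i|d_x p_y)
N = 3  # total number of paragraphs in the collection

def sr(w):
    """Equation 3: collection-level weight of topical word w."""
    total = 0.0
    for d in inv_xi3[w]:                      # documents containing w
        inner = sum(
            f_topic[(w, z)] * sum(p_xy[(z, par)] for par in inv_xi1[z])
            for z in inv_xi2[w]               # topics containing w
        )
        total += inner / f_doc[(w, d)]        # divide by w's frequency in d
    return total / N

print(round(sr("film"), 4))    # → 0.4
print(round(sr("market"), 4))  # → 2.6
```

The inner topic sum depends only on w, so the per-document loop effectively
rescales one topical score by w's frequency in each relevant document, which
is how the scheme assigns a single collection-level weight per term.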

5.3   Evaluation Measures

The effectiveness of our model is measured, based on relevance judgements, by
five metrics that are well established and commonly used in the IR and IF
communities. These metrics are the average precision of the top-20 ranked
documents (top-20), break-even point (b/p), mean average precision (MAP), the
F-score (F_1) measure, and 11-points interpolated average precision (IAP).
For more details about these measures, the reader can refer to Manning et al.
(2008) [MRS08]. For better analysis of the experimental results, the Wilcoxon
signed-rank test [Wil45] was used. The Wilcoxon test is a non-parametric
statistical hypothesis test used to assess whether the ranked means of two
related samples differ. It is a better alternative to Student's t-test,
especially when no normal distribution can be assumed.

5.4   Experimental Design

For each collection, we train our model on all paragraphs of the relevant
documents D in the training part of the collection. We use LDA to extract ten
topics, because it is the best number for each collection as reported
in [GXL13, GXL14, GXL15]. Then, the proposed model scores documents' terms,
ranks them and uses the top-k features as a query to an IF system. The IF
system uses unknown documents (from the testing part of the same collection)
to decide their relevance to the user's information needs (relevant or
irrelevant). However, specifying the value of k is experimental. The same
process is also applied separately to all baseline models. If the results of
the IF system returned by the five metrics are better than the baseline
results, then we can claim that our model is significant and outperforms a
baseline model.

    The IF testing system uses the following equation to rank the testing
documents set:

    weight(d) = \sum_{t ∈ Q} x,  where x = weight(t) if t ∈ d,
                and x = 0 if t ∉ d                                        (4)

where weight(d) is the weight of document d.

    ... parameters to be set. For the LDA-based models, we set the number of
iterations for Gibbs sampling to 1000 and the hyper-parameters to β = 0.01
and α = 50/V, as justified in [SG07]. We configured the number of iterations
for the pLSA to be 1000 (the default setting). For the experimental
parameters of BM25, we set b = 0.75 and k_1 = 1.2, as recommended by Manning
et al. (2008) [MRS08].

5.6   Experimental Results

Table 1 and Figure 2 show the evaluation results of our model and the
baselines. These results are averages over the 50 collections of RCV1. The
results in Table 1 are categorised based on the type of feature used by the
baseline model, and improvement% represents the percentage change in our
model's performance compared to the best result of the baseline models
(marked in bold if there is more than one baseline model in the category). We
consider any improvement greater than 5% to be significant.

    Table 1 shows that our model outperformed all baseline models for
information filtering on all five measures. Regardless of the type of feature
used by the baseline model, our model is significantly better on average,
with a minimum improvement of 8.0% and a maximum of 39.7%. Moreover, the
11-points result in Figure 2 illustrates the superiority of the proposed
model and confirms the significant improvements shown in Table 1.

Table 1: Evaluation results of our model in comparison with the baselines
(grouped based on the type of feature used by the model) for all measures,
averaged over the first 50 document collections of the RCV1 dataset.

    Model           Top-20    b/p       MAP       F_1       IAP
    our model       0.560     0.471     0.502     0.475     0.526
    LDA             0.492     0.414     0.442     0.437     0.468
    pLSA            0.423     0.386     0.379     0.392     0.404
    improvement%    +13.9%    +13.8%    +13.7%    +8.5%     +12.3%
    PDS             0.496     0.430     0.444     0.439     0.464
    improvement%    +12.9%    +9.5%     +13.2%    +8.0%     +13.4%
    n-Gram          0.401     0.342     0.361     0.386     0.384
    improvement%    +39.7%    +37.8%    +39.1%    +22.9%    +37.1%
    BM25            0.445     0.407     0.407     0.414     0.428
                                                                                                    improvement%    +25.8%             +15.6%       +23.5%      +14.6%   +22.9%
5.5   Experimental Settings                                                                         PBTM-FCP        0.489              0.420        0.423       0.422    0.447

In our experiment, we use the MALLET                                                                PBTM-FP         0.470              0.402        0.427       0.423    0.449
                                                                                                    TNG             0.447              0.360        0.372       0.386    0.394
toolkit [McC02] to implement all LDA-based models
                                                                                                    improvement%    +14.5%             +12.1%       +17.7%      +12.2%   +17.1%
except for the pLSA model where we used the Lemur
toolkit 3 instead. All topic-based models require some
                                                                                                 Wilcoxon T-test results (Table 2) present the p-
  3 https://www.lemurproject.org/                                                             values of the results of our model compared to all base-
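As a concrete illustration of the ranking step in Equation (3), the following Python sketch scores testing documents by summing the weights of the query terms they contain. This is a minimal sketch, not the authors' implementation; the term weights and documents are hypothetical stand-ins for the output of the proposed term-scoring step:

```python
def weight_document(doc_terms, query_weights):
    # Equation (3): a query term t contributes weight(t) if t appears
    # in the document and 0 otherwise; weight(d) is the sum over Q.
    return sum(w for t, w in query_weights.items() if t in doc_terms)

# Hypothetical top-k features (the query Q) with their term weights.
query_weights = {"topic": 0.9, "model": 0.7, "filter": 0.4}

# Testing documents represented as sets of terms.
docs = {
    "d1": {"topic", "model", "data"},
    "d2": {"filter", "data"},
}

# Rank testing documents by weight(d), highest first.
ranking = sorted(docs, key=lambda d: weight_document(docs[d], query_weights),
                 reverse=True)
```

The IF system would then judge the top-ranked documents against the user's information needs, which is what the five performance measures evaluate.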
   Wilcoxon T-test results (Table 2) present the p-values of our model's results compared with all baseline models on all performance measures. A model's result is considered significantly different from another model's if the p-value is less than 0.05 [Wil45]. Clearly, the p-values for all metrics are far below 0.05, confirming that our model's performance is significantly different from that of all baselines. This shows that our model gains substantial improvement over the baseline models used.

Table 2: Wilcoxon T-test p-values of the baseline models in comparison with our model's.

  Model      Top-20       b/p          MAP          F_{β=1}      IAP
  LDA        0.004165     0.000179     7.00×10⁻⁶    8.96×10⁻⁶    6.71×10⁻⁶
  pLSA       1.48×10⁻⁴    1.49×10⁻⁴    6.65×10⁻⁷    5.86×10⁻⁷    1.72×10⁻⁷
  PDS        0.008575     0.003034     0.000194     0.000140     4.53×10⁻⁵
  n-Gram     7.46×10⁻⁸    1.05×10⁻⁷    1.71×10⁻⁹    1.86×10⁻⁹    1.23×10⁻⁹
  BM25       0.000353     0.008264     0.000279     0.000117     5.68×10⁻⁵
  TNG        0.010360     0.000607     0.000180     0.000137     3.76×10⁻⁵
  PBTM-FP    0.003442     7.19×10⁻⁴    0.000382     0.000235     5.81×10⁻⁵
  PBTM-FCP   0.048010     0.033410     0.000306     0.000289     0.000180

Figure 2: 11-points result of our model in comparison with the baselines, averaged over the first 50 document collections of the RCV1 dataset.

   Based on the results presented earlier, we are confident in claiming that our extended random sets model can effectively generalise the local topic weight at the document level in the LDA term scoring function and thus provide a more globally representative term weight when it is combined with the term frequency in documents and topics. Also, our model is more effective in selecting relevant features to acquire the user's information needs represented by a set of long documents.

6   Conclusion

This paper presents an innovative and effective topic-based feature ranking model that enhances the semantics of topical words to acquire user needs. The model extends random sets to generalise the LDA topic weight at the document level. A term weighting scheme is then developed to accurately rank topical terms based on their frequent appearance in the LDA topic distributions and in all relevant documents. The newly calculated weight effectively reflects the relevance of a term to the user's information needs and maintains the same semantic meaning of terms across all relevant documents. The proposed model was tested for IF on the standard RCV1 dataset with TREC topics, using five different performance measures and eight state-of-the-art baseline models. The experimental results show that our model achieved significant performance improvements over all baseline models.

References

[ALA13]   Mubarak Albathan, Yuefeng Li, and Abdulmohsen Algarni. Enhanced N-Gram Extraction Using Relevance Feature Discovery, pages 453–465. Springer International Publishing, Cham, 2013.

[ALX14]   Mubarak Albathan, Yuefeng Li, and Yue Xu. Using extended random set to find specific patterns. In WI'14, volume 2, pages 30–37. IEEE, 2014.

[AZ12]    Charu C Aggarwal and ChengXiang Zhai. A survey of text clustering algorithms. In Mining Text Data, pages 77–128. Springer, 2012.

[Ble12]   David M Blei. Probabilistic topic models. Communications of the ACM, 55(4):77–84, 2012.

[BNJ03]   David M Blei, Andrew Y Ng, and Michael I Jordan. Latent Dirichlet allocation. The Journal of Machine Learning Research, 3:993–1022, 2003.

[BV00]    Chris Buckley and Ellen M Voorhees. Evaluating evaluation measure stability. In SIGIR'00, pages 33–40. ACM, 2000.

[DDF+90]  Scott Deerwester, Susan T Dumais, George W Furnas, Thomas K Landauer, and Richard Harshman. Indexing by latent semantic analysis. Journal of the American Society for Information Science, 41(6):391, 1990.
[GXL13]   Yang Gao, Yue Xu, and Yuefeng Li. Pattern-based topic models for information filtering. In ICDM'13, pages 921–928. IEEE, 2013.

[GXL14]   Yang Gao, Yue Xu, and Yuefeng Li. Topical pattern based document modelling and relevance ranking. In WISE'14, pages 186–201. Springer, 2014.

[GXL15]   Yang Gao, Yue Xu, and Yuefeng Li. Pattern-based topics for document modelling in information filtering. IEEE TKDE, 27(6):1629–1642, 2015.

[Hof01]   Thomas Hofmann. Unsupervised learning by probabilistic latent semantic analysis. Machine Learning, 42(1-2):177–196, 2001.

[KSH12]   Rudolf Kruse, Erhard Schwecke, and Jochen Heinsohn. Uncertainty and Vagueness in Knowledge Based Systems: Numerical Methods. Springer Science & Business Media, 2012.

[LAA+15]  Yuefeng Li, Abdulmohsen Algarni, Mubarak Albathan, Yan Shen, and Moch Arif Bijaksana. Relevance feature discovery for text mining. IEEE TKDE, 27(6):1656–1669, 2015.

[LAZ10]   Yuefeng Li, Abdulmohsen Algarni, and Ning Zhong. Mining positive and negative patterns for relevance feature discovery. In KDD'10, pages 753–762. ACM, 2010.

[Li03]    Yuefeng Li. Extended random sets for knowledge discovery in information systems. In RSFDGrC'03, pages 524–532. Springer, 2003.

[LTSL09]  Man Lan, Chew Lim Tan, Jian Su, and Yue Lu. Supervised and traditional term weighting methods for automatic text categorization. IEEE TPAMI, 31(4):721–735, 2009.

[MC13]    K Tamsin Maxwell and W Bruce Croft. Compact query term selection using topically related text. In SIGIR'13, pages 583–592. ACM, 2013.

[McC02]   Andrew Kachites McCallum. MALLET: A machine learning for language toolkit. 2002.

[MO10]    Craig Macdonald and Iadh Ounis. Global statistics in proximity weighting models. In Web N-gram Workshop, page 30. Citeseer, 2010.

[Mol06]   Ilya Molchanov. Theory of Random Sets. Springer Science & Business Media, 2006.

[MRS08]   Christopher D Manning, Prabhakar Raghavan, and Hinrich Schütze. Introduction to Information Retrieval. Cambridge University Press, 2008.

[Ngu08]   Hung T Nguyen. Random sets. Scholarpedia, 3(7):3383, 2008.

[RZ09]    Stephen Robertson and Hugo Zaragoza. The Probabilistic Relevance Framework: BM25 and Beyond. Now Publishers Inc, 2009.

[SG07]    Mark Steyvers and Tom Griffiths. Probabilistic topic models. Handbook of Latent Semantic Analysis, 427(7):424–440, 2007.

[SR03]    Ian Soboroff and Stephen Robertson. Building a filtering test collection for TREC 2002. In SIGIR'03, pages 243–250. ACM, 2003.

[TG09]    Serafettin Tasci and Tunga Gungor. LDA-based keyword selection in text categorization. In ISCIS'09, pages 230–235. IEEE, 2009.

[WC06]    Xing Wei and W Bruce Croft. LDA-based document models for ad-hoc retrieval. In SIGIR'06, pages 178–185. ACM, 2006.

[Wil45]   Frank Wilcoxon. Individual comparisons by ranking methods. Biometrics Bulletin, 1(6):80–83, 1945.

[WMW07]   Xuerui Wang, Andrew McCallum, and Xing Wei. Topical n-grams: Phrase and topic discovery, with an application to information retrieval. In ICDM'07, pages 697–702. IEEE, 2007.

[ZLW12]   Ning Zhong, Yuefeng Li, and Sheng-Tang Wu. Effective pattern discovery for text mining. IEEE TKDE, 24(1):30–44, 2012.

[ZPH08]   Zhiwei Zhang, Xuan-Hieu Phan, and Susumu Horiguchi. An efficient feature selection using hidden topic in text categorization. In AINAW'08, pages 1223–1228. IEEE, 2008.