<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Principal Component Analysis in Topic Modelling of Short Text Document Collections</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Hennadii Dobrovolskyi</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Nataliya Keberle</string-name>
          <email>nkeberle@gmail.com</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Department of Computer Science, Zaporizhzhya National University</institution>
          ,
          <addr-line>Zhukovskogo st. 66, 69600, Zaporizhzhya</addr-line>
          ,
          <country country="UA">Ukraine</country>
        </aff>
      </contrib-group>
      <fpage>48</fpage>
      <lpage>54</lpage>
      <abstract>
        <p>This paper presents the motivation for and the preliminary theoretical investigations of the PhD project by the first author. The objective of the research is to propose and to experimentally verify an approach applying the eigendecomposition of principal component analysis to topic modelling of short text document collections. The main hypothesis examined in this project is that principal component analysis applied to word co-occurrence statistics turns topic modelling into a well-defined problem with a unique solution and natural fitting parameters. The project is performed at the Dept. of Computer Science of Zaporizhzhya National University.</p>
      </abstract>
      <kwd-group>
        <kwd>text mining</kwd>
        <kwd>short text document</kwd>
        <kwd>topic modelling</kwd>
        <kwd>principal component analysis</kwd>
        <kwd>eigendecomposition</kwd>
        <kwd>clusterization</kwd>
        <kwd>KeyTerms</kwd>
        <kwd>MathematicalModel</kwd>
        <kwd>MachineIntelligence</kwd>
        <kwd>DescriptiveModel</kwd>
        <kwd>KnowledgeRepresentation</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>This paper presents a PhD project aimed at developing a method for
probabilistic topic modelling of collections of short text documents. It is assumed
that the documents are literate texts or have another known structure that allows
all the links between terms to be discovered in a natural way. For instance, a
collection of scientific paper abstracts and titles contains well-formed sentences that
can be parsed with NLP tools. The project concentrates on the analysis of
scientific abstracts covering one vague domain of knowledge. That means the documents
have one principal topic and some extra ones associated with related domains
of knowledge.</p>
      <p>It is well known that the number of scientific publications grows faster than
an average scientist can analyze. Thus it is important to have a tool that can maintain
the current state of a document collection, ensuring that it completely covers a domain
of interest. Therefore, a well-grounded method to determine the topics an
unknown document belongs to is useful.</p>
      <p>
        The main hypothesis examined in this project is that principal component
analysis [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] applied to word co-occurrence statistics turns topic modelling into a
well-defined problem with a unique solution and natural fitting parameters.
      </p>
      <p>The discovered topics can be used to represent short documents as vectors
of real numbers appropriate for retrieval and clustering. Moreover, the terms
associated with topics can be used to search for new documents and to extend
the collection.</p>
      <p>
        Common document topic modelling is an ill-posed problem that
does not have a unique solution. Therefore, different additional conditions are
added and combined to get comprehensible topic models. Often these restrictions
are poorly grounded heuristics that require diverse tricks to combine them [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ].
Applying common document topic modelling to short texts leads to lower
quality of the discovered topics compared to ones derived from long texts.
      </p>
      <p>The objective of the presented project is to develop and evaluate a method
to determine the set of topics of a short text document collection and assign topic
weights to each document in the collection.</p>
      <p>
        As a theoretical background the project uses natural language processing
(NLP) methods such as part-of-speech tagging [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ], stemming [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ], sentence splitting
and parsing [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]. Information retrieval approaches [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] are used to exclude
insignificant information during preprocessing. Following the mainstream of
probabilistic topic modelling, Principal Component Analysis [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] is applied to derive the
most significant collection topics from word co-occurrence frequencies. The HEP
collection [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ] provides sample data to verify the suggested method.
      </p>
      <p>The rest of the paper is organized as follows. Section 2 contains a short
review of the related work. This is followed by a short description of the suggested
method in Section 3. Experiment goals and workflow are illustrated and explained
in Section 4. Finally, concluding remarks are presented briefly in Section 5
and several possible future directions are also pointed out.</p>
    </sec>
    <sec id="sec-2">
      <title>Related Works and Motivation</title>
      <p>
        Probabilistic topic models [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] are a set of algorithms providing a statistical
solution to the problem of handling large collections of documents. The basic idea
behind topic modelling is to construct a low-dimensional document
representation using a few groups of tightly connected significant terms instead of separate
words. The best known method of topic modelling is Latent Dirichlet Allocation
(LDA) [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ], which overcomes deficiencies of earlier approaches and is successful
and simple enough. A general introduction and survey of topic modelling
can be found in [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] along with a novel approach, called Additive Regularization
of Topic Models. However, the primary direction of topic model enhancement
is still regularization, i.e. incorporating different restrictions into the basic
algorithm. The origins of the restrictions are not limited, and sometimes (as in LDA) an
additional condition is applied simply because it is manageable and it works.
      </p>
      <p>
        Another drawback of the common topic model is that the shorter the documents in a
collection, the less accurate the result. This is overcome by approaches utilizing
word co-occurrence statistics [
        <xref ref-type="bibr" rid="ref8 ref9">8, 9</xref>
        ] instead of counting document-word pairs.
      </p>
      <p>
        Similar results can be reached in quite a different way, with a combination of
NLP and clustering algorithms [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ]. However, clustering in a high-dimensional
discrete space is a time-demanding problem, so mapping documents to a
low-dimensional space R^n can accelerate the clustering and subsequent analysis.
      </p>
      <p>Therefore, a method of topic modelling that replaces the magic restrictions
with comprehensible ones will be valuable. The project presented in this paper
aims at the development, evaluation and application of a method based on the
PCA approach for word co-occurrence probabilities.</p>
    </sec>
    <sec id="sec-3">
      <title>Method Description</title>
      <p>Let D be a collection of documents and W a dictionary containing all terms used in
D. Each document d ∈ D is a sequence of n_d terms (w_1, …, w_{n_d}). A term can
occur many times in a document. A "term" may be a word or a group of words.</p>
      <p>
        The suggested method shares with the common document topic model [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]
the following assumptions:
      </p>
      <p>Assumption 1: Each term w in the document d is related to a topic t from a
set of topics T. The collection of documents is formed as a set of triples (d, w, t),
independently drawn at random from a discrete probability p(d, w, t) defined
over the set D × W × T. The document d and the term w are observable; the
topic t is a hidden parameter.</p>
      <p>Assumption 2: The order of terms in a document does not matter.</p>
      <p>Assumption 3: The order of documents in the collection does not matter.</p>
      <p>Assumption 4: The conditional probability p(w|d, t) does not depend on the
document d, i.e. p(w|d, t) = p(w|t).</p>
      <p>
        As well as Word Network Topic Model [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ] and Biterm Topic Model [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ] the
suggested method utilizes the probability p(w_i, w_k) that both word w_i and word w_k
occur in the same document or document fragment:
p(w_i, w_k) = ∑_{t=1}^{T} p(w_i|t) p(t) p(w_k|t)   (1)
      </p>
      <p>where p(w_i, w_k) is a joint probability and t is a topic identifier. In the presented
project the probability is estimated as the relative number of pairs (w_i, w_k).</p>
      <p>Term pairs (w_i, w_k) are collected in two steps. First, each document d_k in the
collection is mapped to a set of short term sequences S(d_k) = (s_k1, s_k2, …), where
s_kq = (w_kq1, …, w_kqr). Second, each sequence s_kq is mapped to the pairs (w_i, w_k),
w_i ∈ s_kq, w_k ∈ s_kq, w_i ≠ w_k.</p>
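      <p>The two-step pair collection above can be sketched as follows; this is an illustrative sketch, not the project's implementation, and the function name sentence_pairs and the decision to count each unordered pair once per sequence are assumptions:</p>

```python
from itertools import combinations

def sentence_pairs(sequences):
    """Map short term sequences s_kq to unordered term pairs (w_i, w_k),
    w_i != w_k, as described above; each pair is counted once per sequence."""
    pairs = []
    for terms in sequences:
        for w_i, w_k in combinations(sorted(set(terms)), 2):
            pairs.append((w_i, w_k))
    return pairs

# a toy document split into two term sequences
doc = [["topic", "model", "short", "text"], ["topic", "text"]]
print(len(sentence_pairs(doc)))  # 7 pairs in total
```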
      <p>Topic model creation is the estimation of the probabilities p(w_i|t) and p(t|d_k). It
is assumed that the number of significant topics is far smaller than the number of
words and the number of documents, which simplifies further manipulations
such as search, comparison, clustering, etc.</p>
      <p>In our document generation model, the document d_k is represented by the
set of conditional probabilities p(t|d_k). d_k is a bag of terms, and we apply
Gibbs sampling to create such a bag of terms. First, the document covariance
matrix p(w_i, w_k) is calculated using Eq. (1), where the topic probabilities p(t) are
replaced with p(t|d_k). Second, a random topic t is selected according to the
conditional probability p(t|d_k), and the initial set of N terms for the document is
randomly selected based on the term probabilities p(w_i|t). Third, repetitive
sampling is used to replace each term in the document. One iteration of the
sampling is a three-step process:
1. choose the term position j that will be updated;
2. calculate the naive Bayes probability (2) for each term w in the dictionary W,
where Z is a normalizing denominator and w_i is the term placed at the i-th
position in the document;
3. get a new random term w according to the probability (2) and place it at
position j.</p>
      <p>In the presented work, the dimensionality of the covariance matrix is decreased
through stemming and by omitting words which are not nouns or adjectives,
stop-words, and rare words.</p>
      <p>
        Words which are not nouns or adjectives are readily detected with a
part-of-speech tagger [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]. They have been shown to make a small contribution to document topic
assignment [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ].
      </p>
      <p>Stop-words are terms that do not affect topic detection. There are two
groups of stop-words: collection-specific and common stop-words. Various lists
of common stop-words are available online1, but the collection-specific ones have
to be constructed.</p>
      <p>To extract the set of collection-specific stop-words the covariance p(w_i, w_j)
is employed. The hypothesis is that a stop-word has a large value of the Shannon
information entropy
H(w_i) = −∑_{j=1}^{|W|} p(w_i, w_j) log [p(w_i, w_j)]   (3)
A value of H(w_i) exceeding some threshold Hmax signals that w_i can accompany
any other word. Therefore it is not effective for detecting topics and should be
dropped out. Hmax may be considered an additional parameter for adjusting the
algorithm.</p>
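      <p>The entropy-based stop-word filter can be sketched with NumPy; the function name and the toy matrix are illustrative assumptions, with rows of the joint probability matrix as input:</p>

```python
import numpy as np

def stopword_mask(P, h_max):
    """Flag terms w_i whose co-occurrence entropy
    H(w_i) = -sum_j P[i, j] * log P[i, j] exceeds h_max (cf. Eq. (3))."""
    logs = np.where(P > 0, np.log(np.where(P > 0, P, 1.0)), 0.0)
    H = -(P * logs).sum(axis=1)
    return H > h_max

# row 0 co-occurs uniformly (stop-word-like), row 1 is concentrated
P = np.array([[0.25, 0.25, 0.25, 0.25],
              [0.97, 0.01, 0.01, 0.01]])
print(stopword_mask(P, h_max=1.0))  # [ True False]
```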
      <p>Rare words are detected by comparing the single-word probability p(w_i)
with a threshold value Pmax, where
p(w_i) = ∑_{j=1}^{|W|} p(w_i, w_j)   (4)
1 For instance, a list of English stop-words is available at the Snowball stemmer site
http://snowball.tartarus.org/algorithms/english/stop.txt</p>
      <p>One of the ways to define Pmax is to require that the cumulative distribution
function equals some parameter α:
α = ∑_{p(w_i) ≥ Pmax} p(w_i)   (5)
That means the kept terms cover a predefined percentage of occurrences.</p>
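      <p>Choosing the rare-word threshold through a coverage parameter amounts to keeping the most probable terms until their cumulative probability reaches the target fraction; a sketch, where the function name and toy probabilities are assumptions:</p>

```python
import numpy as np

def keep_terms(p_w, alpha):
    """Keep the most probable terms whose cumulative probability
    covers a fraction alpha of occurrences (cf. Eq. (5))."""
    order = np.argsort(p_w)[::-1]           # most probable first
    covered = np.cumsum(p_w[order])
    n_keep = int(np.searchsorted(covered, alpha)) + 1
    keep = np.zeros(len(p_w), dtype=bool)
    keep[order[:n_keep]] = True
    return keep

p_w = np.array([0.5, 0.3, 0.15, 0.05])      # word probabilities, cf. Eq. (4)
print(keep_terms(p_w, alpha=0.9))  # [ True  True  True False]
```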
      <p>After all the excessive words are dropped, the joint probability matrix
becomes much smaller and can be decomposed into a product of three
matrices according to Eq. (1). The main point of the presented method is setting the
number of topics T to the dimensionality of the square covariance matrix P_ij. Then
Eq. (1) becomes an eigendecomposition problem whose solution produces the
conditional word probabilities p(w_j|t) as eigenvectors and the topic probabilities p(t)
as eigenvalues.</p>
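      <p>The eigendecomposition step can be sketched with numpy.linalg.eigh for a symmetric matrix. Note this is only the algebraic step: eigenvectors are orthonormal rather than proper probability distributions, and the toy matrix construction is an assumption:</p>

```python
import numpy as np

# build a toy symmetric matrix standing in for the joint probabilities of Eq. (1)
rng = np.random.default_rng(0)
A = rng.random((4, 4))
P = (A + A.T) / 2                       # symmetrize, as a co-occurrence matrix is

p_t, p_w_t = np.linalg.eigh(P)          # eigenvalues ~ p(t), eigenvectors ~ p(w|t)
p_t, p_w_t = p_t[::-1], p_w_t[:, ::-1]  # largest "topics" first

# the full decomposition reconstructs P exactly; truncation is the PCA step
assert np.allclose((p_w_t * p_t) @ p_w_t.T, P)
```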
      <p>
        The next step is to reduce the number of topics. The method of Principal Component
Analysis [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] states that the matrix P_ij can be approximated by setting the
smallest values of the topic probabilities p(t) to zero. PCA also suggests a way to select
the most significant topics relying on the calculated topic probabilities. After the values
of p(w_j|t) are calculated, the topic detection of a document is performed using
the expression
p(t|d) = ∑_{i=1}^{|W|} p(t|w_i) p(w_i|d)   (6)
where p(t|w_i) is found from the Bayes equation
p(t|w_i) = p(w_i|t) p(t) / p(w_i)   (7)
p(w_i) (see Eq. (4)) is the probability that word w_i occurs in the collection, and
p(w_i|d) is the relative frequency of word w_i in document d.
      </p>
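      <p>The two Bayes steps above combine into a few lines of linear algebra; the helper name topic_weights and the toy numbers are assumptions:</p>

```python
import numpy as np

def topic_weights(p_w_given_t, p_t, p_w_given_d):
    """p(t|w_i) = p(w_i|t) p(t) / p(w_i)   (Eq. (7)), then
    p(t|d) = sum_i p(t|w_i) p(w_i|d)       (Eq. (6))."""
    p_w = p_w_given_t @ p_t                          # marginal p(w_i), cf. Eq. (4)
    p_t_given_w = (p_w_given_t * p_t) / p_w[:, None]
    return p_t_given_w.T @ p_w_given_d

p_w_given_t = np.array([[0.7, 0.1],   # |W| x T matrix of p(w|t)
                        [0.2, 0.2],
                        [0.1, 0.7]])
p_t = np.array([0.5, 0.5])
p_w_given_d = np.array([1.0, 0.0, 0.0])  # document consisting of word 0 only
print(topic_weights(p_w_given_t, p_t, p_w_given_d))  # [0.875 0.125]
```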
      <p>
        A comprehensive and automated evaluation measure of topic quality is
Topic Coherence [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ]
C(z, M) = ∑_{t1=2}^{T} ∑_{t2=1}^{t1−1} log [ (D(m_t1, m_t2) + ϵ) / (D(m_t1) D(m_t2)) ]   (8)
where M = (m_1, …, m_T) is the list of the T most probable terms in a topic
z, D(m) counts the number of documents containing the term m, D(m_1, m_2)
counts the number of documents containing both m_1 and m_2, and ϵ is used
to avoid log(0). The evaluation metric of the entire topic model is the average
coherence score over all topics. Topic coherence is directly related to the probability
that the top topic terms can be found in the same document, so higher topic
coherence indicates better topic quality.</p>
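      <p>A sketch of the pairwise coherence score for a single topic, assuming every top term occurs in at least one document; the function name and toy data are illustrative:</p>

```python
import math

def topic_coherence(top_terms, docs, eps=1e-12):
    """Sum over pairs of top terms of log[(D(m1, m2) + eps) / (D(m1) D(m2))],
    where D counts documents containing the given term(s) (cf. Eq. (8))."""
    def D(*terms):
        return sum(all(t in doc for t in terms) for doc in docs)
    score = 0.0
    for a in range(1, len(top_terms)):
        for b in range(a):
            m1, m2 = top_terms[a], top_terms[b]
            score += math.log((D(m1, m2) + eps) / (D(m1) * D(m2)))
    return score

docs = [{"pca", "topic"}, {"pca", "text"}, {"topic", "text"}]
print(round(topic_coherence(["pca", "topic"], docs), 3))  # -1.386
```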
    </sec>
    <sec id="sec-4">
      <title>Experiment planning</title>
      <p>
        The experiments aim to check whether the method presented above provides valid
and high-quality topic definitions. To answer the main question the experiments
should explore the impact of the factors listed in Section 3 on the quality of
topic definitions, namely:
1. How does the topic model quality depend on the threshold value of
information entropy?
2. How does the topic model quality depend on the threshold value of rare-word
frequency?
3. Which of the available PCA decompositions fits the explored method?
4. How does the lower limit of topic probability influence the quality of results?
5. Which method of word pair extraction is better:
(a) combination of consecutive words (as in [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ]);
(b) combination of adjacent terms in a grammar tree;
(c) combination of all possible words in a separate sentence;
(d) all possible word pairs in a sliding window of size r [
        <xref ref-type="bibr" rid="ref8 ref9">8, 9</xref>
        ].
      </p>
      <p>The general experiment workflow contains the following steps:
1. Extract the title and abstract from each document of the collection;
2. Extract word pairs from titles and abstracts using one of the word pair
extraction methods;
3. Apply stemming and omit words which are not nouns or adjectives, stop-words,
and rare words, setting Hmax and α;
4. Apply one of the PCA alternatives to extract word probabilities in topics, setting
the minimal value of topic probability;
5. Calculate the average topic coherence to measure the quality of the topic set.</p>
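      <p>Steps 2 and 4 of the workflow above can be sketched end to end on toy data; the preprocessing of step 3 is omitted and all names are illustrative assumptions:</p>

```python
import numpy as np
from itertools import combinations

def pair_matrix(docs):
    """Count within-document term pairs and normalize to a joint
    probability matrix p(w_i, w_k) (step 2 plus the Eq. (1) estimate)."""
    vocab = sorted({w for d in docs for w in d})
    idx = {w: i for i, w in enumerate(vocab)}
    C = np.zeros((len(vocab), len(vocab)))
    for d in docs:
        for a, b in combinations(sorted(set(d)), 2):
            C[idx[a], idx[b]] += 1
            C[idx[b], idx[a]] += 1
    return vocab, C / C.sum()

docs = [["pca", "topic", "model"], ["topic", "model"], ["pca", "model"]]
vocab, P = pair_matrix(docs)
p_t, p_w_t = np.linalg.eigh(P)        # step 4: eigendecomposition of Eq. (1)
print(vocab, round(P.sum(), 6))       # ['model', 'pca', 'topic'] 1.0
```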
      <p>
        The experiment will use the HEP data collection [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ], which is oriented to the
study of multi-label text classifiers. It consists of scientific papers in the field
of High Energy Physics (HEP) obtained from the document server of the European
Organization for Nuclear Research (CERN).
      </p>
      <p>The experiments should show the dependencies of the average topic coherence
on Hmax, α, the minimal value of topic probability, the type of PCA and the
extraction method.</p>
    </sec>
    <sec id="sec-5">
      <title>Concluding remarks and future work</title>
      <p>The project was started in December 2016 and is at the stage of detailed
planning of experiments and exploration of background technologies and theories.
Future plans include (a) implementation of all the necessary software
components; (b) evaluating the quality of the proposed basic algorithms; (c) component
integration and running the whole workflow; (d) application of the developed method
to practical tasks.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Jolliffe</surname>
            ,
            <given-names>I.</given-names>
          </string-name>
          : Principal Component Analysis (2nd ed.). Springer,
          <year>2002</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Vorontsov</surname>
            ,
            <given-names>K.V.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Potapenko</surname>
            ,
            <given-names>A.A.:</given-names>
          </string-name>
          <article-title>Tutorial on Probabilistic Topic Modeling: Additive Regularization for Stochastic Matrix Factorization</article-title>
          . In: Ignatov,
          <string-name>
            <surname>D.I.</surname>
          </string-name>
          et al. (
          <source>Eds.) Proc. AIST2014</source>
          , CCIS 436, pp.
          <fpage>29</fpage>
          -
          <lpage>46</lpage>
          (
          <year>2014</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Toutanova</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Klein</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Manning</surname>
            ,
            <given-names>C.D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Singer</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          :
          <article-title>Feature-Rich Part-of-Speech Tagging with a Cyclic Dependency Network</article-title>
          . In: Hearst,
          <string-name>
            <given-names>M.</given-names>
            and
            <surname>Ostendorf</surname>
          </string-name>
          , M. (Eds.)
          <source>Proc. HLT-NAACL2003</source>
          , pp.
          <fpage>252</fpage>
          -
          <lpage>259</lpage>
          (
          <year>2003</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>Manning</surname>
            ,
            <given-names>C.D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Raghavan</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Schütze</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          : Introduction to Information Retrieval. Cambridge University Press,
          <year>2008</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <surname>Chen</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Manning</surname>
            ,
            <given-names>C.D.</given-names>
          </string-name>
          :
          <article-title>A Fast and Accurate Dependency Parser using Neural Networks</article-title>
          . In: Moschitti,
          <string-name>
            <surname>A.</surname>
          </string-name>
          et al. (
          <source>Eds.) Proc. EMNLP</source>
          <year>2014</year>
          , pp.
          <fpage>740</fpage>
          -
          <lpage>750</lpage>
          (
          <year>2014</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <surname>Montejo-Ráez</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Steinberger</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ureña-López</surname>
            ,
            <given-names>L.A.</given-names>
          </string-name>
          :
          <article-title>Adaptive Selection of Base Classifiers in One-Against-All Learning for Large Multi-Labeled Collections</article-title>
          . In:
          <string-name>
            <surname>Vicedo J. L</surname>
          </string-name>
          . et al. (
          <source>Eds.) Proc. 4th Intl Conf. Advances in Natural Language Processing (EsTAL2004)</source>
          ,
          <source>LNAI 3230</source>
          , pp.
          <fpage>1</fpage>
          -
          <lpage>12</lpage>
          (
          <year>2004</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <surname>Blei</surname>
            ,
            <given-names>D.M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ng</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Jordan</surname>
            ,
            <given-names>M.I.</given-names>
          </string-name>
          :
          <article-title>Latent Dirichlet Allocation</article-title>
          .
          <source>In: J. Machine Learning Research</source>
          , Vol.
          <volume>3</volume>
          , pp.
          <fpage>993</fpage>
          -
          <lpage>1022</lpage>
          (
          <year>2003</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <surname>Zuo</surname>
            ,
            <given-names>Yu.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zhao</surname>
            ,
            <given-names>Ji.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Xu</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          :
          <article-title>Word Network Topic Model: A Simple but General Solution for Short and Imbalanced Texts</article-title>
          . The Computing Research Repository (CoRR), http://arxiv.org/abs/1412.5404,
          <year>December 2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
          <string-name>
            <surname>Yan</surname>
            ,
            <given-names>X.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Guo</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lan</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Cheng</surname>
            ,
            <given-names>X.</given-names>
          </string-name>
          :
          <article-title>A Biterm Topic Model For Short Texts</article-title>
          . In: Schwabe,
          <string-name>
            <surname>D.</surname>
          </string-name>
          et al.(
          <source>Eds.) Proc. 22nd Intl Conf. World Wide Web, ACM</source>
          , pp.
          <fpage>1445</fpage>
          -
          <lpage>1456</lpage>
          (
          <year>2013</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          10.
          <string-name>
            <surname>Popova</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Khodyrev</surname>
            ,
            <given-names>I.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Egorov</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Logvin</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gulayev</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Karpova</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Mouromtsev</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          :
          <article-title>Sci-Search: Academic Search and Analysis System Based on Keyphrases</article-title>
          . In: Klinov,
          <string-name>
            <given-names>P.</given-names>
            ,
            <surname>Mouromtsev</surname>
          </string-name>
          ,
          <string-name>
            <surname>D</surname>
          </string-name>
          . (Eds.)
          <source>Proc. 4th Intl Conf. Knowledge Engineering and the Semantic Web, CCIS 394</source>
          , pp.
          <fpage>281</fpage>
          -
          <lpage>288</lpage>
          (
          <year>2013</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          11.
          <string-name>
            <surname>Aletras</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Stevenson</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          :
          <article-title>Evaluating Topic Coherence Using Distributional Semantics</article-title>
          .
          <source>In: Proc. 10th Intl Workshop on Computational Semantics (IWCS2013)</source>
          , pp.
          <fpage>13</fpage>
          -
          <lpage>22</lpage>
          (
          <year>2013</year>
          )
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>