=Paper=
{{Paper
|id=Vol-2086/AICS2017_paper_27
|storemode=property
|title=A Comparison on the Classification of Short-text Documents Using Latent Dirichlet Allocation and Formal Concept Analysis
|pdfUrl=https://ceur-ws.org/Vol-2086/AICS2017_paper_27.pdf
|volume=Vol-2086
|authors=Noel Rogers,Luca Longo
|dblpUrl=https://dblp.org/rec/conf/aics/RogersL17
}}
==A Comparison on the Classification of Short-text Documents Using Latent Dirichlet Allocation and Formal Concept Analysis==
A comparison on the classification of short-text documents using Latent Dirichlet Allocation and Formal Concept Analysis

Noel Rogers, Luca Longo
School of Computing, Dublin Institute of Technology, Ireland
luca.longo@dit.ie

Abstract. With the increasing amounts of textual data being collected online, automated text classification techniques are becoming increasingly important. However, much of this data is in the form of short text with just a handful of terms per document (e.g. text messages, tweets or Facebook posts). This data is generally too sparse and noisy to obtain satisfactory classification. Two techniques which aim to alleviate this problem are Latent Dirichlet Allocation (LDA) and Formal Concept Analysis (FCA). Both techniques have been shown to improve the performance of short-text classification by reducing the sparsity of the input data. However, the relative performance of classifiers enhanced using each technique has not been directly compared. To address this issue, this work presents an experiment comparing them using supervised models. It shows that FCA leads to a much higher degree of correlation among terms than LDA and initially gives lower classification accuracy. However, once a subset of features is selected for training, the FCA models can outperform those trained on LDA-expanded data.

1 Introduction

In recent years the amount of short-text data available online has exploded. A big part of this is down to the rise of social media, with much of this data taking the form of tweets, Facebook posts or comments on media sites such as YouTube. However, the sparse, noisy nature of short text makes automatic classification a difficult task. Typically a classifier takes tf-idf values as inputs in the form of a Term-Document Matrix (TDM), where entry t_ij relates the frequency with which term j appears in document i to the overall occurrences of the term across the document corpus [19]. For short text, however, the amount of information contained in a TDM is too sparse to facilitate accurate prediction. As a result, this level of sparsity needs to be reduced by adding weights in the TDM for words which do not already appear in the document. This could be done by incorporating external knowledge bases [17] or by using metadata to add extra features that compensate for the sparsity within the actual text [21]. Both rely on data external to the textual content, so as an alternative the co-occurrence of words within the document corpus can be used to perform the necessary expansion. Two techniques which adopt this approach are Latent Dirichlet Allocation (LDA) and Formal Concept Analysis (FCA). An investigation into the application of these two techniques to text classification is the primary focus of this work.

The rest of this document is organised as follows. Firstly, a brief review of related literature is provided, with particular emphasis on the applications of LDA and FCA to the problem of short-text classification. Section 3 then outlines the design of an experiment aimed at comparing the improvements in classification accuracy due to each technique. An analysis of the results of this experiment is provided before the paper finishes with conclusions drawn from these results and suggestions for future work.
2 Related Work

2.1 Latent Dirichlet Allocation

Latent semantic analysis was developed to find the latent topics in a set of documents by computing eigenvectors and using these as a means of dimensionality reduction [20]. This was extended to instead use conditional probabilities as a means of modelling the underlying topics, first introduced in [12]. The key idea is that a document can be considered as a mixed distribution over a number of topics. So, supposing there are k possible topics, the probability that a given word w will instantiate some term t is given by

p(w = t) = Σ_k p(w = t | z = k) p(z = k)    (1)

By convention we denote φ_k = p(w | z = k) as the word distribution for a topic k, and θ_d = p(z) as the distribution over topics for a given document d. Combining the distributions for all values of k and d respectively yields two matrices, denoted φ and θ. Generalising θ to new documents not in the original corpus is non-trivial, so an additional assumption was taken in the seminal work of Blei, Ng and Jordan, which introduced a Dirichlet prior, leading to LDA [3]. A Dirichlet distribution is simply a family of distributions parameterised by a vector, α, of real values. In the case of LDA, this family of distributions corresponds to the per-document topic distributions θ_d, and the values of α can be thought of as a prior count on the number of times a topic k is observed in a document. The same Dirichlet assumption can be extended to the distributions of words within topics, parameterised by a vector β.

LDA is a generative model: a document can be generated word by word by first randomly sampling from the topic distribution and then selecting a word conditioned on the selected topic [23], [4]. To fit an LDA model, Markov-chain Monte Carlo methods such as Gibbs sampling can be employed; for a detailed example see [23]. For the model parameters, the number of topics that should be generated may be known in advance, but typically there needs to be a way to find an optimum value. There is no hard and fast rule for this, though there are heuristics based on information theory, such as measuring the perplexity on a held-out test sample and finding the topic number that minimises it. Perplexity gives a measure of how well the model predicts the distribution on the test documents and is computed as per equation 2, where M is the number of documents in the test set, w_d represents document d and N_d is the number of words in document d [25].

perplexity = exp( − ( Σ_{d=1}^{M} log p(w_d) ) / ( Σ_{d=1}^{M} N_d ) )    (2)

There have been a large number of examples applying LDA to text classification problems, with Twitter proving a popular data source for short-text problems [13]. In other applications LDA is simply one step in a more complex workflow that helps achieve high classification accuracies [16, 6].
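The paper does not name an LDA implementation; as a non-authoritative illustration of the perplexity heuristic, the sketch below assumes the gensim library and sweeps topic numbers, keeping the model that minimises held-out perplexity, with the Dirichlet priors from [11] that section 3.2 later adopts.

```python
# A minimal sketch of perplexity-based topic-number selection (equation 2).
# gensim is an assumption; the paper does not state which library was used.
from gensim.corpora import Dictionary
from gensim.models import LdaModel
import numpy as np

train_docs = [["oil", "price", "rise"], ["rate", "cut", "price"]]  # toy corpus
test_docs = [["oil", "rate", "cut"]]

dictionary = Dictionary(train_docs)
train_corpus = [dictionary.doc2bow(d) for d in train_docs]
test_corpus = [dictionary.doc2bow(d) for d in test_docs]

best = None
for num_topics in range(1, 247, 5):              # range used in section 3.2
    lda = LdaModel(train_corpus, id2word=dictionary, num_topics=num_topics,
                   alpha=50.0 / num_topics,      # alpha = 50/NT [11]
                   eta=200.0 / len(dictionary))  # beta = 200/W  [11]
    # gensim reports a per-word log2-likelihood bound; perplexity = 2**(-bound)
    perplexity = np.exp2(-lda.log_perplexity(test_corpus))
    if best is None or perplexity < best[1]:
        best = (num_topics, perplexity)

print("chosen number of topics:", best[0])
```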
2.2 Formal Concept Analysis

We provide here a very brief overview of the subject of FCA; for a more detailed introduction to the topic see [10]. FCA was born out of a mathematical attempt to add formal definitions and structure to the notion of a concept. Intuitively, a concept is a unit of thought consisting of a set of objects belonging to it (called the extent) and the properties or attributes that they share (the intent). To formally define these ideas, start with a set of objects, X, and a set of attributes, Y, pertaining to elements of X. A binary relation, I, encodes which elements of X have particular attributes of Y. The notation ⟨x, y⟩ ∈ I means that the object x has the attribute y. The collection ⟨X, Y, I⟩ is called a formal context [2]. A formal concept, then, is a pair ⟨A, B⟩ where A ⊆ X, B ⊆ Y with

A = {x ∈ X | ∀y ∈ B, ⟨x, y⟩ ∈ I} and B = {y ∈ Y | ∀x ∈ A, ⟨x, y⟩ ∈ I}

The sets A and B are the extent and intent of the concept respectively. The collection of all such concepts for a given context ⟨X, Y, I⟩ is denoted by B(X, Y, I). By ordering concepts using the subset/superset relations on their extents, a partial ordering can be added to the set of concepts. The key theorem, taken from the seminal paper of Wille which initially produced this framework, is that B(X, Y, I) forms a lattice when equipped with this partial ordering [24]. When applied to short-text classification, the typical approach is to treat documents as the objects and the words appearing within them as the attributes. In this way a corpus of documents can be mapped to a concept lattice to determine the relationships between words [18].

The most relevant work for this paper is that of Boutari, Carpineto and Nicolussi [5]. Here, FCA is used as a text expansion technique to improve both supervised and unsupervised classification of short texts. Their main focus is on identifying proximity measures between concepts in the lattice that can be used to expand a TDM with weights from closely related concepts. To formalise this, the authors developed five different metrics to generate these weights, with the resulting matrices used as the input to K-Nearest Neighbour and K-Means classifiers for comparison.
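As a small self-contained illustration of these definitions (independent of the tooling used later in the experiment), the sketch below encodes a toy document-word context and checks that a pair ⟨A, B⟩ satisfies the two derivation conditions above.

```python
# A minimal sketch of the FCA derivation operators on a toy context;
# the document-word encoding mirrors section 2.2, not the InClose tool.
X = {"d1", "d2", "d3"}                 # objects: documents
Y = {"oil", "price", "rate"}           # attributes: words
I = {("d1", "oil"), ("d1", "price"),   # <x, y> in I: word y occurs in doc x
     ("d2", "oil"), ("d2", "price"), ("d2", "rate"),
     ("d3", "rate")}

def extent(B):
    """All objects possessing every attribute in B."""
    return {x for x in X if all((x, y) in I for y in B)}

def intent(A):
    """All attributes shared by every object in A."""
    return {y for y in Y if all((x, y) in I for x in A)}

# <A, B> is a formal concept exactly when A = extent(B) and B = intent(A)
B = {"oil", "price"}
A = extent(B)                          # {"d1", "d2"}
print(A, intent(A) == B)               # True: <A, B> is a formal concept
```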
3 Experiment Design

The key focus of this study is on comparing LDA and FCA as sparsity reduction techniques. In order to determine their comparative performance, classifiers are trained on inputs derived from each technique and their accuracies compared; for this, both neural networks and SVMs have been chosen. A baseline model is trained on the unprocessed input TDM. The key steps are shown in figure 1.

[Fig. 1: Experiment design]

3.1 Data and Preprocessing

To reduce the possibility of patterns specific to one dataset's distribution impacting the results of the study, the experiment is replicated on two distinct datasets. The first is the Google Snippets corpus (jwebpro.sourceforge.net/data-web-snippets.tar.gz), first employed in [17]. It consists of snippets of search terms, typically between ten and forty words long, which comprise the documents. Each document is also assigned one of eight class labels. The dataset is already split into training and test subsets.

The second dataset chosen is the Reuters-21578 collection (archive.ics.uci.edu/ml/datasets/Reuters-21578+Text+Categorization+Collection). This is one of the most widely utilised datasets within the text classification domain, employed for example in [5] and [3]. The corpus consists of 21,578 different news articles along with additional metadata such as the author, date and title. For this study the articles themselves are too long, so just the titles are extracted, with each considered a distinct document as per the approach taken in [5]. A number of recommended subsets and splits are included with the dataset; for the purposes of this experiment a subset is taken consisting of 78 classes, a training set of 7,733 documents and a test set of 3,561. A summary of this is provided in table 1b.

Table 1: Dataset characteristics

(a) Google Snippets dataset
Subset     Size    Min Len  Max Len  Avg. Len
Training   10,060  1        38       17.87
Test       2,280   2        38       17.96

(b) Reuters dataset
Subset     Size    Min Len  Max Len  Avg. Len
Training   7,733   2        36       6.48
Test       3,561   2        43       6.47

As pre-processing steps we first convert to lowercase and remove any punctuation or non-alphanumeric characters. All stop words (for example "the" or "and") are then removed; these do not contribute anything to the topics or concepts contained in a document, so they would represent noise in the data. Note that the core focus of the experiment is on the expansion of short text by enriching it with topics or concepts derived from the entire corpus. Words which only appear in a single document are of little use in this regard, since they cannot form relationships with words from other documents, so any such words are also removed. Once this processing has been completed, a sparse TDM T is generated for each dataset, where each element t_ij holds the tf-idf weight relating word j to document i. LDA and FCA will be employed to address the sparsity of T.

3.2 LDA

The first point of note is that the LDA step can be tuned by a number of hyperparameters: the Dirichlet priors α and β and the number of topics. Ideally a range of values would be tried for each so that the optimal value could be found, but this is not feasible for this study, so a good approximation for each needs to be taken upfront. The approach follows that employed in [11], whose recommendation is to take α = 50/NT and β = 200/W, where NT is the number of topics and W is the number of words. To derive the number of topics to use, perplexity values are calculated from the test data as per equation 2, for topic numbers ranging from 1 to 246 in increments of 5. Based on the outcome of this test a single model is chosen to proceed with. The key outputs from this model are two probability matrices: one giving the distribution of words within each topic and the other giving the distribution of topics over documents. These correspond to φ and θ respectively, as defined in section 2.1. It follows from equation 1 that the probabilities for each word appearing in each document are given by θ × φ. This new matrix has the same dimensions as T and replaces T as the input for the training step.

3.3 FCA

The starting point is to note that T can be considered a formal context: if t_ij ≠ 0 then word j appears in document i. As such, a concept lattice can be formed from the documents and words, and related concepts from this lattice are used to add non-zero terms to T. Once the concept lattice is formed, a proximity measure can be derived, encoding how closely related two concepts are. For this we choose

Proximity = 1 − SD / max SD    (3)

where SD is the shortest distance between two points in the graph [5]. Given any pair of words, equation 3 allows the similarity between them to be computed, yielding a symmetric matrix S where each term s_ij is the proximity between words i and j. Now let d be a vector representing one of the documents, i.e. d corresponds to a row of T. The aim is to obtain a new representation, d′, that takes advantage of the word proximities to reduce the sparsity of the original representation. The value for the i-th word should take into account both the proximity between word i and each other word and the frequency with which those other words appear in d, yielding the following equation:

d′_i = Σ_{k=1}^{W} d_k S_ki    (4)

Extending this over the whole document gives d′ = d × S. It follows that the expanded term-document matrix, T′, is simply T × S.
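To make equations 3 and 4 concrete, here is a hedged numpy sketch of the expansion step, assuming the word-word shortest distances have already been extracted from the lattice; the toy matrices are illustrative, not taken from the experiment.

```python
# A minimal sketch of the FCA expansion step (equations 3 and 4).
# D is assumed to hold shortest distances between words in the lattice;
# all values here are illustrative.
import numpy as np

T = np.array([[0.5, 0.0, 0.3, 0.0],    # toy tf-idf TDM: 3 documents x 4 words
              [0.0, 0.4, 0.0, 0.0],
              [0.2, 0.0, 0.0, 0.6]])

D = np.array([[0, 2, 1, 3],            # toy word-word shortest distances
              [2, 0, 3, 1],
              [1, 3, 0, 2],
              [3, 1, 2, 0]])

S = 1.0 - D / D.max()                  # equation 3: proximity = 1 - SD / max SD
T_expanded = T @ S                     # equation 4 row-wise: T' = T x S

print(T_expanded.round(2))             # formerly zero entries now carry weight
```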
For the execution of this we adopt a tool implementing the InClose algorithm (sourceforge.net/projects/inclose/) [1] to obtain the concept set. To construct the lattice from these concepts, a simple algorithm was employed to add edges where a given concept was a lower neighbour of another. The matrix S was generated using a breadth-first search over the lattice. Note that the source and sink nodes (corresponding to the empty and universal sets in the context) need to be removed first so that concepts cannot be linked via these nodes; otherwise two unrelated terms could end up with an unnaturally high proximity value on account of both being directly connected to either the source or the sink.

3.4 Modelling

We briefly highlight the choices made for the neural network parameters. As we have a classification problem, the activation function selected for the output layer is softmax. For the hidden layer, rectified linear units (ReLU) will be used [14]. In order to help the model generalise well and avoid overfitting, dropout layers are added between each layer of the network [22]. The final consideration is the architecture of the model: the number and width of hidden layers and the connections between them. Since the purpose of this study is not to investigate neural network architectures, the simplest setup is chosen, namely a single fully connected hidden layer of width W. One additional pre-processing step required before training the neural network is to normalise the input data: the normalised features, or z-scores, are computed by subtracting the mean and dividing by the standard deviation of each feature. For the SVM, a simple linear kernel has been chosen. The only other consideration required is how to deal with the multiple class labels; in order to handle this the 'one-versus-rest' approach is taken [15].
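The paper does not state which libraries were used for training. The sketch below reproduces the stated choices (z-score normalisation, a single ReLU hidden layer of width W with dropout and a softmax output, plus a linear one-versus-rest SVM) assuming Keras and scikit-learn; the dropout rate and optimiser are also assumptions, since neither is specified.

```python
# A minimal sketch of the section 3.4 model choices; Keras / scikit-learn,
# the dropout rate and the optimiser are assumptions, not the paper's setup.
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC
from tensorflow import keras

rng = np.random.default_rng(0)
X_train = rng.random((100, 50))              # stand-in for the expanded TDM
y_train = rng.integers(0, 8, 100)            # 8 classes, as in Snippets

X_norm = StandardScaler().fit_transform(X_train)   # z-scores per feature

W = X_train.shape[1]
nn = keras.Sequential([
    keras.layers.Input(shape=(W,)),
    keras.layers.Dense(W, activation="relu"),      # single hidden layer, width W
    keras.layers.Dropout(0.5),                     # dropout against overfitting [22]
    keras.layers.Dense(8, activation="softmax"),   # softmax output layer
])
nn.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
nn.fit(X_norm, y_train, epochs=5, verbose=0)

svm = LinearSVC().fit(X_train, y_train)      # linear kernel, one-vs-rest by default
```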
4 Results and analysis

4.1 LDA

The topic values taken for each LDA model were determined by computing the perplexity associated with each topic number and selecting the minimum. From figures 2a and 2b these topic numbers are 181 and 161.

[Fig. 2: Perplexity values per topic number; (a) Reuters, (b) Snippets]

4.2 Correlation Analysis

The initial experiments yielded very poor accuracy on the FCA-enhanced Snippets dataset for neural networks, and it was found that there was a high degree of correlation between the input features. As a result, two further runs of the experiment were performed, the first removing correlated features above a threshold of 0.8. This conservative value gave little improvement and prompted a further run in which the top 10% of features were selected based on the outcome of an ANOVA test. To help understand the cause of the high correlations, the distributions of weights in the FCA- and LDA-enhanced TDMs for the Snippets dataset are shown in figures 3a and 3b. For the LDA weights, the majority of terms are close to zero, so it is still just a small subset of terms that contributes most to the classification. Contrast this with FCA: here the weights form a near-normal distribution around a mean of 0.5. The impact is that even totally unrelated terms still contribute significant weight increases. Across both datasets, the greatest distance between any pair of concepts was 12, leading to a small range of values that the proximities could take. We will revisit this issue in section 5 with suggestions for how future work can combat this problem.

[Fig. 3: Weight distributions; (a) FCA TDM weights, (b) LDA TDM weights]
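A hedged scikit-learn sketch of the two additional runs described above follows: correlation filtering at the 0.8 threshold and ANOVA-based selection of the top 10% of features. The exact filtering procedure used in the experiment is not spelled out, so the correlation step here is one plausible reading.

```python
# A minimal sketch of the feature-engineering runs in section 4.2; the
# correlation-filtering details are an assumption, while the ANOVA step
# uses scikit-learn's standard F-test selector.
import numpy as np
from sklearn.feature_selection import SelectPercentile, f_classif

rng = np.random.default_rng(0)
X = rng.random((200, 40))                    # stand-in expanded TDM
y = rng.integers(0, 8, 200)

# Run 2: drop one feature from each pair correlated above 0.8
corr = np.abs(np.corrcoef(X, rowvar=False))
upper = np.triu(corr, k=1)                   # upper triangle, diagonal excluded
keep = [j for j in range(X.shape[1]) if not np.any(upper[:, j] > 0.8)]
X_decorrelated = X[:, keep]

# Run 3: keep the top 10% of features ranked by ANOVA F-score
X_selected = SelectPercentile(f_classif, percentile=10).fit_transform(X, y)
```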
4.3 Overall model evaluation

Neural Network models. The performance of each classifier was determined by comparing precision, recall and F-measure values (denoted P, R and F1). A full breakdown of the results of each run of the experiment is given in tables 2a to 2c. Graphs of the F1 scores can also be seen in figures 4a and 4b for the Reuters and Snippets experiments respectively. As already highlighted, the initial FCA results are quite poor on the Snippets dataset. For Reuters, however, FCA already outperforms both the baseline (BL) and LDA. Removing correlated features does not lead to a significant change in the results (table 2b), but in the final run of the experiment, following the selection of just 10% of features using ANOVA, it can be seen that FCA has outperformed across the board.

SVM models. The high correlations which impeded the FCA-trained neural networks did not have the same negative effect on the SVM models. Across both datasets the highest accuracies are achieved on the first run, before features are removed. Comparing FCA and LDA for the SVM models, the highest overall F1 values are again achieved by FCA (0.69 versus 0.62 on the Snippets data and 0.78 versus 0.56 on Reuters). Comparing the best F1 scores for each dataset across all three runs shows FCA achieving a 3-5% increase on the baseline and a 5-15% increase over LDA. The highest scoring combination across both datasets is FCA + SVM with no need for additional feature engineering steps.

Table 2: Experimental results

(a) No feature engineering
             Reuters               Snippets
         BL      FCA     LDA     BL      FCA     LDA
NN   P   0.715   0.78    0.59    0.645   0.017   0.6
     R   0.718   0.73    0.59    0.61    0.13    0.58
     F1  0.706   0.74    0.57    0.609   0.03    0.58
SVM  P   0.68    0.78    0.52    0.61    0.72    0.58
     R   0.68    0.78    0.58    0.57    0.69    0.56
     F1  0.67    0.78    0.54    0.57    0.69    0.56

(b) Correlation removal
             Reuters               Snippets
         BL      FCA     LDA     BL      FCA     LDA
NN   P   0.736   0.78    0.62    0.656   0.021   0.63
     R   0.736   0.76    0.59    0.622   0.14    0.59
     F1  0.724   0.76    0.59    0.621   0.04    0.59
SVM  P   0.68    0.78    0.55    0.61    0.71    0.6
     R   0.68    0.77    0.59    0.57    0.68    0.58
     F1  0.68    0.77    0.55    0.56    0.68    0.58

(c) ANOVA feature selection
             Reuters               Snippets
         BL      FCA     LDA     BL      FCA     LDA
NN   P   0.698   0.77    0.61    0.631   0.72    0.62
     R   0.693   0.78    0.62    0.551   0.66    0.61
     F1  0.681   0.76    0.61    0.555   0.66    0.61
SVM  P   0.66    0.74    0.56    0.57    0.65    0.64
     R   0.6     0.75    0.61    0.52    0.63    0.62
     F1  0.57    0.74    0.56    0.52    0.63    0.62

[Fig. 4: Model evaluation using F1 scores; (a) Reuters, (b) Snippets]

4.4 Statistical significance

As a final point, the statistical significance of the obtained results has been evaluated using McNemar's test statistic [8], given by

χ² = (|n₀₁ − n₁₀| − 1)² / (n₀₁ + n₁₀)    (5)

where, for two models a and b, n₀₁ counts the cases misclassified by a but not by b, and n₁₀ counts those misclassified by b but not by a. From table 3, it can be seen that the results comparing the LDA and FCA models are statistically significant with a p-value < 0.01.

Table 3: Statistical significance of experimental results

              Snippets          Reuters
              NN      SVM       NN      SVM
McNemar χ²    13.62   91.72     10.32   9.11
P-value       <0.01   <0.01     <0.01   <0.01
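For completeness, a small sketch of the test as computed from two models' predictions on a shared test set; the vectors below are illustrative, not the experiment's outputs.

```python
# A minimal sketch of McNemar's test (equation 5) on illustrative predictions.
import numpy as np
from scipy.stats import chi2

y_true = np.array([0, 1, 1, 0, 2, 2, 1, 0, 2, 1])
pred_a = np.array([0, 1, 0, 0, 2, 1, 1, 0, 2, 1])   # e.g. an LDA-based model
pred_b = np.array([0, 1, 1, 0, 2, 1, 0, 0, 2, 0])   # e.g. an FCA-based model

n01 = np.sum((pred_a != y_true) & (pred_b == y_true))  # a wrong, b right
n10 = np.sum((pred_a == y_true) & (pred_b != y_true))  # a right, b wrong

stat = (abs(n01 - n10) - 1) ** 2 / (n01 + n10)         # equation 5
p_value = chi2.sf(stat, df=1)                          # chi-squared, 1 d.o.f.
print(f"chi2 = {stat:.2f}, p = {p_value:.3f}")
```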
4.5 Summary of results

Three different iterations were run: without feature engineering, with the removal of correlated features, and incorporating ANOVA for feature selection. Two classifiers were trained on the resulting feature sets, neural networks and SVM, with FCA showing a 5% increase on the Snippets dataset and a 15% improvement on the Reuters data. The LDA models remained consistent throughout but failed to outperform even the baseline models on either dataset. Analysis was performed to understand the initial poor scores, and a high degree of feature correlation was discovered.

As the focus was on term expansion techniques, no parameter engineering was performed on any of the neural network models and only a simple linear kernel was employed for the SVM. Varying the network architectures, dropout weights or learning rates, or employing more sophisticated kernels, could have improved the results for individual models, but these steps were not performed. To strengthen the results the experiment was repeated on two datasets; the correlation problem that FCA initially introduced may not have been picked up had only the Reuters dataset been used.

LDA is widely used in text analysis, but we have shown that for this particular task FCA is more suitable. We have also shown, though, that FCA adds a high degree of correlation between terms. One of the drawbacks of FCA is the computational resources needed to build the concept lattice and term similarity matrix. The density of the lattice was highlighted as a reason for the high correlations, so trading off the computation of the full lattice or of all term similarities would both mitigate the resources required and reduce the correlations.

5 Conclusions

This experiment has compared the relative benefits of LDA and FCA for addressing the sparsity of short text. We list now some potential future avenues of work arising from this experiment. This project only focused on "standard" FCA; however, fuzzy FCA, as described in [7], could be examined. In fuzzy FCA, rather than attributes simply being absent or present for a given object, a weight between 0 and 1 is applied to each one, precisely the form that a tf-idf TDM takes.

The outcome of the FCA model is the term-term similarity matrix, and this is the key component in this step. One measure was utilised in this work; however, there are alternative methods of deriving concept similarity from a lattice, not just the geometric distance of shortest paths. Further work could look at alternatives such as set-based approaches (measuring the size of the intent/extent intersections) or combinations of these with geometric distance [5].

One issue identified with FCA was the high degree of correlation that was observed. We looked at evaluating term similarities between a concept and all the others, but a more restrictive approach, looking just at a small neighbourhood around each concept, might fare better. Within this neighbourhood the proximities could be computed as before, with all concepts outside the neighbourhood having their proximities set to 0 [9]. An alternative approach that could yield the same outcome is to instead use iceberg lattices [1]. This is simply a concept lattice which has been pruned by introducing a required minimal support for concept inclusion. The removal of edges from the lattice would lead to a wider spread in proximities between concepts.

References

1. Andrews, S., Hirsch, L.: A Tool for Creating and Visualising Formal Concept Trees. In: Proceedings of the Fifth Conceptual Structures Tools & Interoperability Workshop. pp. 1–9. Annecy, France (2016)
2. Belohlavek, R.: Introduction to Formal Concept Analysis. Palacky University, Department of Computer Science, Olomouc, p. 47 (2008)
3. Blei, D., Ng, A.Y., Jordan, M.: Latent Dirichlet Allocation. Journal of Machine Learning Research 3, 993–1020 (2003)
4. Blei, D., Carin, L., Dunson, D.: Probabilistic topic models. IEEE Signal Processing Magazine 27(6), 55–65 (2010)
5. Boutari, A.M., Carpineto, C., Nicolussi, R.: Evaluating term concept association measures for short text expansion: Two case studies of classification and clustering. CEUR Workshop Proceedings 672, 163–174 (2010)
6. Chen, M., Jin, X., Shen, D.: Short text classification improved by learning multi-granularity topics. In: IJCAI. pp. 1776–1781 (2011)
7. De Maio, C., Fenza, G., Loia, V., Senatore, S.: Hierarchical web resources retrieval by exploiting fuzzy formal concept analysis. Information Processing and Management 48(3), 399–418 (2012)
8. Dietterich, T.G.: Approximate statistical tests for comparing supervised classification learning algorithms. Neural Computation 10(7), 1895–1923 (1998)
9. Eklund, P., Ducrou, J., Dau, F.: Concept similarity and related categories in information retrieval using formal concept analysis. International Journal of General Systems 41(8), 826–846 (2012)
10. Ganter, B.: Contexts, Concepts, and Concept Lattices (Ch. 1–2). In: Formal Concept Analysis: Methods and Applications in Computer Science (2003)
11. Griffiths, T.L., Steyvers, M.: Finding scientific topics. Proceedings of the National Academy of Sciences of the United States of America 101(Suppl. 1), 5228–5235 (2004)
12. Hofmann, T.: Probabilistic latent semantic indexing. In: Proceedings of the 22nd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval. pp. 50–57 (1999)
13. Hong, L., Davison, B.: Empirical study of topic modeling in Twitter. In: Proceedings of the First Workshop on Social Media Analytics. pp. 80–88 (2010)
14. Kelleher, J.D., MacNamee, B., D'Arcy, A.: Fundamentals of Machine Learning for Predictive Data Analytics: Algorithms, Worked Examples, and Case Studies. The MIT Press (2015)
15. Knerr, S., Personnaz, L., Dreyfus, G.: Single-layer learning revisited: a stepwise procedure for building and training a neural network. In: Neurocomputing: Algorithms, Architectures and Applications. pp. 41–50. Springer (1990)
16. Lu, Z., Li, H.: A deep architecture for matching short texts. In: Advances in Neural Information Processing Systems. pp. 1–9 (2013)
17. Phan, X.H., Nguyen, L.M., Horiguchi, S.: Learning to classify short and sparse text & web with hidden topics from large-scale data collections. In: Proceedings of the 17th International Conference on World Wide Web (WWW '08). pp. 91–100 (2008)
18. Poelmans, J., Ignatov, D.I., Kuznetsov, S.O., Dedene, G.: Formal concept analysis in knowledge processing: A survey on applications. Expert Systems with Applications 40(16), 6538–6560 (2013)
19. Sebastiani, F.: Machine learning in automated text categorization. ACM Computing Surveys (CSUR) 34(1), 1–47 (2002)
20. Song, W., Park, S.C.: Genetic algorithm for text clustering based on latent semantic indexing. Computers and Mathematics with Applications 57(11–12), 1901–1907 (2009)
21. Sriram, B., Fuhry, D., Demir, E., Ferhatosmanoglu, H., Demirbas, M.: Short text classification in Twitter to improve information filtering. In: Proceedings of the 33rd International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR '10). pp. 841–842 (2010)
22. Srivastava, N., Hinton, G., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15, 1929–1958 (2014)
23. Steyvers, M., Griffiths, T.: Probabilistic topic models. In: Handbook of Latent Semantic Analysis: A Road to Meaning. Lawrence Erlbaum (2007)
24. Wille, R.: Restructuring lattice theory: an approach based on hierarchies of concepts. In: Ordered Sets, pp. 445–470. Springer (1982)
25. Zhao, W., Chen, J.J., Perkins, R., Liu, Z., Ge, W., Ding, Y., Zou, W.: A heuristic approach to determine an appropriate number of topics in topic modeling. BMC Bioinformatics 16(Suppl. 13), S8 (2015)