Finding Niche Topics using Semi-Supervised Topic Modeling via Word Embeddings

Gerald Conheady, Derek Greene
School of Computer Science, University College Dublin, Ireland
gerry.conheady@ucdconnect.ie, derek.greene@ucd.ie

Abstract. Topic modeling techniques generally focus on the discovery of the predominant thematic structures in text corpora. In contrast, a niche topic is made up of a small number of documents related to a common theme. Such a topic may have so few documents relative to the overall corpus size that it fails to be identified when using standard techniques. This paper proposes a new process, called Niche+, for finding these kinds of niche topics. It assumes interactions with a user who can provide a strictly limited level of supervision, which is subsequently employed in semi-supervised matrix factorization. Furthermore, word embeddings are used to provide additional weakly-labeled data. Experimental results show that documents in niche topics can be successfully identified using Niche+. These results are further supported via a use case that explores a real-world company email database.

1 Introduction

In certain text corpus exploration tasks, users will be primarily interested in the predominant topics that naturally appear in the data. At other times, users will aim to discover documents related to a selection of topics of particular interest. We define a niche topic as a small set of documents from a corpus that the user considers to be linked together by a highly-coherent theme. It is expected that a user can provide example documents and typical words for a niche topic, i.e. a limited level of supervision. An example of this might be a ‘data breach’ topic, where a user is interested in the discovery of unacceptable leaks of patient data through a health organization’s email database.
There might be millions of emails in the database and the number of data breaches would be expected to be low; we could therefore naturally regard this as a niche topic within the overall dataset. A user investigating this data could provide a small sample of emails and terms related to data breaches, so as to help discover other related content. A second related example might be a ‘Product Functionality Query’ topic, where a user is interested in the discovery of content from customers who have been querying the functionality of a product.

Unsupervised algorithms, such as Non-negative Matrix Factorization (NMF) [7], have been used to uncover the underlying topical structure in unlabeled text corpora [1]. NMF might potentially identify a topic associated with a small number of documents, when a very large number of topics is specified. However, this has computational implications and also leads to challenges in interpreting the resulting topic model. Specifically, it is often impractical to ask a user to scan through hundreds or thousands of topics in order to find one or two niche topics. Semi-supervised NMF (SS-NMF) algorithms have been proposed which use background information, in the form of word and document constraints, to guide the factorization process in order to produce more useful topic models [8]. The Niche+ process described later in this paper uses SS-NMF techniques.

Word embeddings have been applied in a range of natural language processing tasks, where words are represented by vectors in a vector space [9]. Words with related meanings will tend to be close together in this space. In Section 3 we apply the Weak+ approach to supervision for topic modeling [2], which uses word embeddings to generate additional “weakly-labeled” data. This supervision takes the form of a list of candidate words that are semantically related to a small number of “strong” words supplied by an expert to describe a topic.
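The candidate-word generation step can be sketched as follows. This is an illustrative reconstruction based on the description above (the function name and parameters are our own; the exact Weak+ procedure is given in [2] and Section 3.2): for each expert-supplied “strong” word, neighbours from a corpus-specific embedding model are kept only if a generic model, such as one trained on Wikipedia, also ranks them among that word’s top 500 most similar words. Any object exposing a gensim-style `most_similar` method and membership test can be passed in.

```python
def expand_seed_words(strong_words, corpus_kv, generic_kv,
                      n_candidates=10, top_generic=500):
    """Return weakly-labeled candidate words for a niche topic.

    corpus_kv  -- word-vector model trained on the target corpus
    generic_kv -- generic word-vector model (e.g. trained on Wikipedia)
    Both are assumed to support `word in model` and
    `model.most_similar(word, topn=...)`, as gensim KeyedVectors do.
    """
    weak = []
    for word in strong_words:
        if word not in corpus_kv or word not in generic_kv:
            continue
        # Candidate neighbours from the domain-specific model.
        corpus_neighbours = [w for w, _ in
                             corpus_kv.most_similar(word, topn=n_candidates)]
        # Keep only candidates the generic model also considers related.
        generic_neighbours = {w for w, _ in
                              generic_kv.most_similar(word, topn=top_generic)}
        weak.extend(w for w in corpus_neighbours if w in generic_neighbours)
    return weak
```

The intersection with the generic model acts as a filter against corpus-specific noise, while the domain model supplies the candidates, matching the motivation given in Section 3.2.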
Our experiments on annotated corpora in Section 4 show that, when this weak supervision is fed to Niche+, other example documents from the niche topic can be found. In Section 5 we describe a use case involving a real-world email corpus provided by an enterprise software manufacturer. We show that, by using highly-limited supervision, the Niche+ process can successfully identify specific topics of interest from among a larger set of more general topics.

2 Related Work

2.1 Topic Modeling

Topic modeling allows for the discovery of themes in a collection of documents in an unsupervised manner. It differs from keyword search, which tries to match documents directly to a particular subset of words or phrases; in topic modeling, themes are instead based on grouping documents that have a similar usage of words. While probabilistic approaches have often been used for topic modeling, approaches based on NMF [7] have also been successful.

Semi-supervised learning often involves using limited labeled data to improve the performance of algorithms which are normally unsupervised. For instance, methods have been proposed for incorporating constraints into matrix factorization [8]. For text data, this typically involves providing supervision in the form of constraints imposed on documents and terms, suggested by a human expert who is often referred to as the “oracle”. The Utopian system, which implements this approach, has demonstrated improved topic modeling results [6].

Labeled data for semi-supervised learning is best provided through continuous human interaction with the algorithm [5]. In topic modeling, a key practical challenge is to provide a user with an easy way to explore a large collection of text. The user needs to be free to select from the output of the topic model to highlight areas for improvement or further analysis. The ability to manipulate both documents and terms in a topic is needed.
This approach is used in our experiments, where the oracle is asked to provide topic documents and topic words for supervision, in order to guide the topic modeling process. The oracle is also asked to provide feedback on the documents found in the first run, to determine whether they belong to the topic or not. This information is used to provide negative supervision during the second run. In other words, the feedback is used to exclude the documents and words from the niche topic.

2.2 Word Embeddings

NMF does not directly take into account semantic associations between words. Related meanings of words, such as between ‘car’ and ‘bus’, do not explicitly influence the factorization process. Techniques based on word embeddings attempt to take into account the semantic relatedness between pairs of words, as derived from a large corpus of text. Many applications of word embeddings are based on the use of a neural network, as in the original word2vec model [9]. The input and output layers have one entry for each of the n words in the vocabulary. The hidden layer is the dimension layer and has d entries, which allows the output from the hidden layer to be represented by an n × d matrix. This representation measures the semantic associations between words in a corpus.

3 Methods

3.1 Characterizing Niche Topics

The characteristics of a niche topic typically include both its distinctiveness compared to the overall corpus and the heterogeneity of the niche. The distinctiveness influences how easy it is to find documents in the niche topic and can be measured using the cosine ratio. Given a corpus of documents assigned to k topics, where each document is assigned to one topic, we quantify the cosine ratio as follows.
Firstly, we compute a topic-topic similarity matrix S, where an off-diagonal entry Sij indicates the mean cosine similarity between all pairs of documents in topic i and topic j, and a diagonal entry Sii indicates the mean cosine similarity between all pairs of documents in the same topic i. We refer to Sii as the within-topic similarity for topic i. The between-topic similarity for topic i is the average of the values Sij where j ≠ i. The cosine ratio for topic i is its within-topic similarity divided by its between-topic similarity. A higher value for this ratio indicates a niche topic which is more coherent and well-separated relative to the rest of the topics present in the corpus.

The heterogeneity of a niche topic can be established by a manual review of the sub-themes of documents within the niche. For instance, sub-themes of a topic such as “sport” could relate to soccer, rugby and tennis. Although clearly part of the “sports” topic, the vocabulary of the documents would be specific to each sub-theme. In a small group of niche documents, it is expected that the higher the number of sub-themes, the more heterogeneous the niche and the more difficult its documents are to find.

3.2 Weak+

The Weak+ approach has been proposed to provide a form of limited supervision for topic modeling, where word embeddings are used to generate additional “weakly-labeled” data. The Wikipedia word2vec [10] model provides an excellent source of generic semantic relationships between words. However, it cannot fully reflect the idiosyncratic semantic relationships between words within individual subject domains. In order to overcome this limitation, supervision words are first chosen from a word2vec model generated from the corpus. These words are only selected if they also appear in the top 500 similar words coming from the Wikipedia word2vec model.

3.3 Niche+

We now discuss the Niche+ approach for identifying niche topics in a corpus.
It uses a semi-supervised strategy based on a simplified version of the Utopian algorithm [6]. The oracle-provided documents and words and the Weak+ “weakly-labeled” words are input to Niche+ to provide supervision. The relevant notation used for this discussion is summarized in the table below.

Notation         Description
m                Number of documents in the corpus
n                Number of words in the corpus
k                User-specified number of topics
A ∈ R^{m×n}      Document-term matrix
W ∈ R^{m×k}      Document-topic matrix
H ∈ R^{k×n}      Topic-word matrix
W_r ∈ R^{m×k}    Supervision matrix with topic weights for documents in W
H_r ∈ R^{k×n}    Supervision matrix with topic weights for words in H
M_W ∈ R^{k×k}    Masking matrix for W, with cells set to 1 for supervised topics
M_H ∈ R^{n×n}    Masking matrix for H, with cells set to 1 for supervised topics
D_H ∈ R^{n×n}    Diagonal matrix used for automatic scaling

The Utopian matrix factorization algorithm minimizes the objective in (1):

  ‖A − WH‖²_F + ‖(W − W_r)M_W‖²_F + ‖(H − H_r D_H)M_H‖²_F    (1)

This requires the H matrix to be recalculated column by column in every iteration until the stopping criterion is reached, which demands large resources for a corpus with a large vocabulary. As a result, a simpler form of the objective was adopted, as in (2), which only requires the H matrix to be updated once per iteration. The diagonal matrix D_H is no longer required and is eliminated:

  ‖A − WH‖²_F + ‖(W − W_r)M_W‖²_F + ‖(H − H_r)M_H‖²_F    (2)

The non-negativity constrained least squares solver with active-set method and column grouping [4], nnlsm_activeset, is used. Writing [X; Y] for the vertical stacking of matrices X and Y, the W matrix continues to be updated as in (3), per the Utopian algorithm:

  W ← argmin_{W≥0} ‖ [Hᵀ; M_W] Wᵀ − [Aᵀ; M_W W_rᵀ] ‖²_F    (3)

The update process for the H matrix is changed as in (4):

  H ← argmin_{H≥0} ‖ [W; M_W] H − [A; H_r M_H] ‖²_F    (4)

The nnlsm_activeset algorithm solves min_{X≥0} ‖AX − B‖²_F for X, and is used to solve for both W and H.
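To make the alternating updates concrete, the following is a minimal NumPy/SciPy sketch of one iteration of updates (3) and (4), together with the simplified objective (2). It uses `scipy.optimize.nnls` column by column as a slow but simple stand-in for the nnlsm_activeset solver of [4]; the function names and structure are our own, not the authors’ implementation.

```python
import numpy as np
from scipy.optimize import nnls


def ssnmf_iteration(A, W, H, Wr, Hr, MW, MH):
    """One alternating iteration of the simplified updates (3) and (4).
    Each stacked non-negative least squares problem is solved column by
    column with scipy.optimize.nnls."""
    # Update H:  [W; MW] H  ≈  [A; Hr MH]   -- Eq. (4)
    lhs = np.vstack([W, MW])                    # (m + k) x k
    rhs = np.vstack([A, Hr @ MH])               # (m + k) x n
    H = np.column_stack([nnls(lhs, rhs[:, j])[0]
                         for j in range(rhs.shape[1])])
    # Update W:  [H^T; MW] W^T  ≈  [A^T; MW Wr^T]   -- Eq. (3)
    lhs = np.vstack([H.T, MW])                  # (n + k) x k
    rhs = np.vstack([A.T, MW @ Wr.T])           # (n + k) x m
    W = np.column_stack([nnls(lhs, rhs[:, j])[0]
                         for j in range(rhs.shape[1])]).T
    return W, H


def objective(A, W, H, Wr, Hr, MW, MH):
    """The simplified objective in Eq. (2)."""
    return (np.linalg.norm(A - W @ H, 'fro') ** 2
            + np.linalg.norm((W - Wr) @ MW, 'fro') ** 2
            + np.linalg.norm((H - Hr) @ MH, 'fro') ** 2)
```

In practice the column-grouped active-set solver is substantially faster for wide matrices, which is why the paper relies on nnlsm_activeset rather than a per-column solver.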
Positive supervision is implemented by setting the supervision weights for documents and words to positive values, and negative supervision by setting the weights to 0. Niche+, building on the original SS-NMF process, is carried out using the following steps:

1. Normalize the document-term matrix using TF-IDF.
2. Reconstitute the documents from the document-term matrix.
3. Generate an extended list of supervision words using the Weak+ process.
4. Apply the SS-NMF algorithm on the document-term matrix A:
   (a) Initialize the W and H matrices with random values. Initialize the W_r, H_r, M_W and M_H matrices with zeros.
   (b) Set the W_r and W matrix weights for each document for the topic to be supervised.
   (c) Set the H_r and H matrix weights for each word to be supervised.
   (d) Set the M_W and M_H weights for the topic to be supervised.
   (e) Select another topic and set its weights to be the reverse of the supervised topic.
   (f) Repeat until the objective converges or the maximum number of iterations is reached:
       i. Using nnlsm_activeset, solve for H and then for W.
       ii. Recalculate the objective.

The Niche+ process guides the discovery of niche topic documents using the words and documents provided by the user, along with the extended list of semantically linked words provided by Weak+.

4 Evaluation

4.1 Experimental Setup

The aim of our experiments is to investigate whether the Niche+ process can be used successfully to find niche topics. The corpora listed in Table 1 are used for the evaluation. They were chosen as they come with a ground truth and provide niche topics that meet our definition of being a ‘small set of documents from the corpus that the user considers to be linked together’.

Table 1. Summary of datasets used in the evaluation. For each topic, we report the within-topic similarity, between-topic similarity, and the ratio of the two.
Dataset      m       k    Topic          Entries   Within   Between   Cosine Ratio
EU-PR        9,677   12   Antitrust      30        0.14     0.03      4.48
20-NG        18,828  20   Electronics    30        0.05     0.01      4.84
                          Politics       30        0.07     0.01      5.75
                          Religion       30        0.07     0.01      5.94
                          Med            30        0.07     0.01      6.75
Complaints   66,804  90   Bankruptcy     50        0.12     0.03      4.05
                          Data Privacy   54        0.09     0.04      2.18
                          Adding Money   65        0.12     0.04      3.38

The EU-PR dataset consists of press releases describing activities relating to the European Parliament across 12 different policy areas [3], where some policy areas are naturally covered more frequently than others. Our second corpus is the widely-used 20 Newsgroups (20-NG) collection of approximately 18K posts from 20 Usenet newsgroups. The Complaints corpus is a collection of over 66K records from the US Consumer Financial Complaints dataset provided by Kaggle, categorized into 90 different types, such as ‘Data Privacy’, ‘Bankruptcy’ and ‘Foreign Currency Exchange’.

For each corpus, we use the topic(s) with the lowest cosine ratio, as these are the most difficult to find. Niche topics are created for the EU-PR and 20-NG topics by deleting all the documents for the topics, except the first 30. The Complaints dataset is much larger and its topics have sizes ranging from a single document to over 6,000 documents. The ‘Adding Money’, ‘Data Privacy’ and ‘Bankruptcy’ topics are selected as naturally occurring niche topics.

It is important to minimize the user burden of providing labeled documents. To simulate this, only the first five unique documents in each topic are used as oracle documents. We calculate centroid vectors for each annotated ground truth topic, and then rank the corresponding words based on their centroid weights. In this way we can select the top five words as the oracle-given words for the niche. Word2vec embeddings are constructed for each corpus using a skip-gram model, with vectors of 100 dimensions and the document frequency threshold set to a minimum of 5.
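The centroid-based selection of oracle words described above can be sketched in a few lines of NumPy. This is an illustrative version under our own assumptions: a dense TF-IDF document-term matrix `A`, an array of ground-truth topic labels, and a vocabulary list aligned with the columns of `A`.

```python
import numpy as np


def top_centroid_words(A, labels, topic, vocab, n_words=5):
    """Return the n_words highest-weighted words in the centroid of the
    TF-IDF vectors of the documents labeled with `topic`.

    A      -- dense (documents x words) TF-IDF matrix
    labels -- array of ground-truth topic labels, one per document
    vocab  -- list of words aligned with the columns of A
    """
    # Mean TF-IDF vector of the topic's documents (the topic centroid).
    centroid = A[labels == topic].mean(axis=0)
    # Indices of the highest-weighted words, in descending order.
    top = np.argsort(centroid)[::-1][:n_words]
    return [vocab[i] for i in top]
```

The same routine, applied only to the five oracle documents rather than the full ground-truth topic, yields the oracle word list used for the word frequency search baseline in Section 4.2.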
We next run the Niche+ process, with step 4 repeated 50 times. We fix the number of topics k to be the number of ground truth topics in each corpus. Weights are set to 10 for the oracle-given documents and words. Weights are set to 1 for Weak+-generated words, as these should not be as influential as the oracle-given words. The mask weights are set to 1 for the supervised topics. All other weights are set to 0.

In order to simulate further feedback from the oracle, the initial runs are followed by an ‘exclusion’ run. Documents found in the original run that are not part of the niche topic are subject to negative supervision by setting their weights to 0. The weights of 5 prominent words from these documents are also set to 0, to provide further negative supervision. The percentage of documents found using Niche+ is compared to a word frequency search based on the oracle words.

4.2 Results

Firstly, we use Normalized Mutual Information (NMI) [11] to measure the accuracy of the document assignments arising from the topic models, relative to the ground truth assignments, counting the number of correct documents found for each topic using the ground truth labels. The results show little difference in NMI scores between the runs with and without niche topic supervision. This is explained by the fact that the Niche+ process concentrates on improving accuracy for a single topic only. The NMI scores for the EU-PR and 20-NG corpora range from 0.65 to 0.82, and for the Complaints corpus from 0.35 to 0.36.

Next we use the percentage of documents found for each niche topic as a measure of success. Niche+ outputs a weighting per topic for each document, and these weightings are used to rank the top 100 documents for each topic. Figure 1 shows the percentage of documents that are labeled as being part of the niche topic. An oracle-based word frequency search is run using the words generated by centroid calculation for the oracle documents.
This is all that can be done when there is no ground truth, as in our later use case. The results show that 70% of documents are found in the EU-PR ‘Antitrust’ topic, from 3% to 23% of documents for the 20 Newsgroups topics, and from 6% to 10% for the Complaints topics. A ‘best case’ niche topic word frequency search is run using words generated by a separate centroid calculation over all the documents in the niche topic. This can be done as we have a ground truth. The improved results show that 77% of documents are found in the EU-PR ‘Antitrust’ topic, from 27% to 40% of documents for the 20 Newsgroups topics, and from 6% to 17% for the Complaints topics. Niche+ oracle supervision results are as high as 87% for the ‘Antitrust’ topic, 40% for the 20 Newsgroups topics and 22% for the Complaints topics. Niche+ oracle supervision with exclusions showed further improvement, reaching 87% for the ‘Antitrust’ topic, 48% for the 20 Newsgroups topics and 30% for the Complaints topics.

Fig. 1. Percentage of documents found for niche topics, using different models (oracle word frequency search, niche topic word frequency search, oracle supervision, and oracle supervision with exclusions).

The precision interval analysis in Figure 2 presents the number of niche documents for different levels of precision, considering the top 10 to 100 documents found. We see that typically over 80% of documents are found within the first 30 documents.

Fig. 2. Precision interval analysis for different niche topics (EU-PR ‘Antitrust’, ‘Electronics’, ‘Med’, ‘Politics’, ‘Religion’, ‘Adding Money’, ‘Bankruptcy’, ‘Data Privacy’).

It is expected that the results will be most successful where the oracle documents are similar to the niche topic and the niche topic is distinct from the corpus, i.e.
a low oracle-to-niche cosine ratio and a high niche-to-corpus cosine ratio should find more niche documents. These ratios are shown in Figure 3.

Fig. 3. Comparison of cosine ratios vs. precision and recall for niche topics (precision, recall, niche vs. corpus cosine ratio, and oracle vs. niche cosine ratio).

The “antitrust” topic in the EU-PR dataset has the lowest oracle-to-niche documents cosine ratio, at 3.3. This is reflected in its high precision and recall scores. The “electronics” topic has the lowest precision and recall scores for the 20-NG dataset. Its oracle-to-niche documents cosine ratio is high at 15.8, showing that the oracle documents do not represent the niche well. The highest 20-NG topic results are for the “med” topic, with an oracle-to-niche documents cosine ratio of 7.1. This is slightly higher than the scores of 6.9 and 6.8 for the “religion” and “politics” topics. However, the cosine ratio for the niche documents to the corpus documents is 6.8, compared to 5.9 and 5.8 for the “religion” and “politics” topics, implying that the “med” niche topic is more distinct than the others. A similar pattern is seen in the Complaints dataset, where the “adding money” topic has the lowest oracle-to-niche documents ratio and the “bankruptcy” topic the highest. The combination of how well the oracle documents reflect the niche and how distinct the niche is in the corpus determines the level of success.

A further manual analysis of the “med” topic reveals a high level of heterogeneity, as defined in Section 3.1. It can be divided into distinct sub-themes such as ‘back pain’, ‘lactose intolerance’, ‘smoking’ and others. All the documents are clearly linked to the “med” topic. The oracle documents include one related to the ‘lactose intolerance’ sub-theme, and the results include similar documents.
However, the oracle documents do not include any relating to the ‘smoking’ sub-theme, and none of the ‘smoking’ documents in the niche topic are found. Although the relationship between the sub-themes is easily detected manually, the Niche+ process does not make the connection. The “electronics” topic shows a few clear sub-themes, such as ‘searches for circuits’, ‘data transmission’ and ‘car radar’. The oracle documents do not contain any documents relating to these sub-themes, and this may explain the poorer performance.

5 Case Study

5.1 Experimental Setup

Enterprise email archives can contain hundreds of millions of emails. The ability to discover niche topics in such archives can assist enterprises in auditing and managing business processes. A software manufacturer has extracted 279K emails from their email archive for our real-world use case. The emails are unlabeled and cover twelve years of activity from their customer support department. The niche topics provided for analysis relate to a ‘visa application’, an ‘accounting package upgrade’ and the ‘moving of email archive volumes’.

An initial clean-up of the emails is required. Only the subject and the main body of the email text are used. Details removed include forwarded messages, original messages, confidentiality notices, signatures, URLs, and email addresses. Only emails with at least 50 characters are selected for the creation of a TF-IDF normalized document-term matrix. Words are filtered based on a minimum document frequency of 30.

The Weak+ process generates 95 extra words for supervision from the original user-provided words. Table 2 shows the first 10 words generated per user word for the ‘accounting package’ topic. The words generated are semantically close to the user words in the context of an accounting package upgrade. The words selected for ‘quote’ relate to the seeking and providing of information on the cost of the accounting package upgrade.
This reflects the user’s domain, whereas a more generic approach might have interpreted ‘quote’ as relating to citations from literature. Niche+ is then run to find 20 topics; this choice is based on an inspection of the data. The supervision weights of the five documents and words supplied by the user are set to 10. The supervision weights of the Weak+-generated words are set to 1.

Table 2. Sample of words generated by Weak+.

User Word     Generated Words
accounting    structuring, budgeting, balances, evaluations, profitability, forecasting, advisory, accountability, competencies, methodology
system        configured, determine, improve, testing, reduce, process, utilizing, application, environment, development
quote         clarifies, mention, explanation, anyway, linked, basically, viewpoint, clearly, clarifying, incorrect
requirement   indicator, workable, disclaimers, consistent, scrutiny, definitions, unacceptable, furthermore, mandates, maintain
upgrade       updating, interface, cost, latest, test, invoicing, platform, automate, storage, service

5.2 Discussion of Results

Based on the topic model produced by our approach, the first 100 documents for each topic are ranked based on their weightings, as in Section 4.2. The judgment of a user expert, who is familiar with the data, is that 94% of the documents for the ‘Visa’ topic, 49% for the ‘Accounting Package’ and 29% for the ‘Archive’ topic relate to the topic, as seen in Figure 4. A word frequency search of the email corpus, as used in Section 4.2, with the user-given words results in finding 12% of the documents for the ‘Visa’ topic, 14% for the ‘Accounting Package’ and 40% for the ‘Archive’ topic. In the case of the ‘Accounting Package’ topic, many of the off-topic documents relate to other package upgrades, such as Microsoft Windows upgrades. The documents the user excludes from the ‘Archive’ topic include many relating to a similar product, from a competitor, that is not of user interest.
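This user feedback is turned into supervision by adjusting entries of the W_r supervision matrix and the M_W mask defined in Section 3.3. The helper below is our own simplified sketch of that bookkeeping, not the authors’ code: a positive weight (such as 10 for oracle documents) marks documents as belonging to the supervised topic, while a weight of 0 excludes them in an exclusion run.

```python
import numpy as np


def set_document_supervision(Wr, MW, doc_ids, topic, weight):
    """Set supervision weights for the given documents on one topic.

    A positive weight (e.g. 10 for oracle documents) provides positive
    supervision; a weight of 0 excludes documents in an exclusion run.
    The mask MW is activated for the supervised topic in either case.
    Returns new matrices; the inputs are not modified.
    """
    Wr = Wr.copy()
    MW = MW.copy()
    Wr[doc_ids, topic] = weight       # supervision weights for the documents
    MW[topic, topic] = 1.0            # activate the mask for this topic
    return Wr, MW
```

An analogous helper over H_r and M_H would handle the word-level supervision, including the 5 excluded words per topic.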
All documents and 5 words identified as not belonging to the topic are used for exclusion runs for each topic, as described in Section 4.1. The number of documents found increases in all supervision/exclusion runs, to 95% for the ‘Visa’ topic, 59% for the ‘Accounting Package’ and 55% for the ‘Archive’ topic. Overall, this use case shows that the Niche+ process can successfully find niche topics in real-world datasets, such as a large email corpus.

Fig. 4. Number of documents found for three niche topics in the email corpus, relative to human judgments (word frequency search, supervised, and supervised with exclusions).

6 Conclusions and Future Work

This paper has shown that input from an oracle (e.g. a “human-in-the-loop”) during topic modeling can improve results. In particular, when trying to identify small niche topics in a large unstructured text corpus, a user’s domain expertise can be essential. An initial set of inputs from a user helps the discovery of such niche topic documents. A second round of input, either in the form of inclusions or exclusions, can further improve the results. It has also been shown that the cosine ratio is a good predictor of the number of niche documents that are found. This opens up the opportunity to guide users in their selection of suitable documents for niche topic supervision by looking at the cosine ratios for the oracle documents. However, the Niche+ process is not always successful in finding documents relating to sub-themes in the niche that do not have oracle examples, such as in the case of the ‘smoking’ sub-theme, as seen in Section 4.2. The process cannot currently reach out to semantically linked sub-themes. This will be an area of further investigation.

References

1. Sanjeev Arora, Rong Ge, and Ankur Moitra. Learning topic models - Going beyond SVD.
Proceedings - Annual IEEE Symposium on Foundations of Computer Science (FOCS), pages 1–10, 2012.
2. Gerald Conheady and Derek Greene. Weak Supervision for Semi-supervised Topic Modeling via Word Embeddings. Language, Data, and Knowledge (LDK 2017), pages 150–155, 2017.
3. J.P. Cross and D. Greene. Capturing and explaining the policy agenda of the European Commission between 1986-2016: A quantitative text analysis approach. Under review, 2017.
4. Jingu Kim. nonnegfac-python, GitHub repository.
5. Patrik Ehrencrona Kjellin. A Survey On Interactivity in Topic Models. 7(4):456–461, 2016.
6. D. Kuang, J. Choo, and H. Park. Nonnegative Matrix Factorization for Interactive Topic Modeling and Document Clustering. Partitional Clustering Algorithms, pages 1–28, 2015.
7. D. D. Lee and H. S. Seung. Learning the parts of objects by non-negative matrix factorization. Nature, 401(6755):788–791, 1999.
8. Tao Li, Chris Ding, and Michael I. Jordan. Solving Consensus and Semi-supervised Clustering Problems Using Nonnegative Matrix Factorization. In Seventh IEEE International Conference on Data Mining (ICDM 2007), volume 2, pages 577–582, 2007.
9. Tomas Mikolov, Greg Corrado, Kai Chen, and Jeffrey Dean. Efficient Estimation of Word Representations in Vector Space. Proceedings of the International Conference on Learning Representations (ICLR 2013), pages 1–12, 2013.
10. Radim Rehurek. gensim 1.0.0rc1, Python Package Index.
11. Alexander Strehl and Joydeep Ghosh. Cluster Ensembles - A Knowledge Reuse Framework for Combining Multiple Partitions. Journal of Machine Learning Research, 3:583–617, 2002.