A Modular Approach to Topic Modeling for Heterogeneous Documents
Discussion Paper
Giovanni Toto², Emanuele Di Buccio¹,²,∗
¹ Department of Information Engineering, University of Padova, Via G. Gradenigo 6/b, 35131, Padova, Italy
² Department of Statistical Sciences, University of Padova, Via C. Battisti, 241, 35121, Padova, Italy

IIR2022: 12th Italian Information Retrieval Workshop, June 29–30, 2022, Milan, Italy
∗ Corresponding author.
toto@stat.unipd.it (G. Toto); emanuele.dibuccio@unipd.it (E. Di Buccio)
ORCID: 0000-0002-7444-5702 (G. Toto); 0000-0002-6506-617X (E. Di Buccio)

Abstract
Topic Modeling algorithms help unveil the latent thematic structure of large document collections. Previous works showed that traditional approaches can be less effective when applied to short texts, e.g., tweets; however, this can be mitigated by assuming that each document is about a single topic, as done in Twitter-LDA. In this work, we relax this assumption and propose a new model where a document can be about a single topic or multiple topics. Our model allows the generation of diverse types of descriptors from latent topics, e.g., words and hashtags, similarly to Hashtag-LDA. Moreover, words/hashtags can be generated from topics or from a background/global distribution. The proposed model is modular, and our goal is to tailor it to collections that can be heterogeneous both in the presence of single- or multiple-topic documents and in the adoption of diverse topic representations.

Keywords
Topic Modeling, Text Mining, Heterogeneous Text Topic Modeling, Topic Modeling for Microblogs

1. Introduction
Topic Modeling algorithms are Machine Learning approaches introduced to unveil the latent thematic structure of unstructured document corpora. In Probabilistic Topic Models [1, 2], whose most representative technique is arguably Latent Dirichlet Allocation (LDA) [3], a theme is represented through a topic, which is a probability distribution over the entire corpus vocabulary. Documents in the corpus can be represented as mixtures of topics. One of the benefits of this representation is interpretability: the weights (probabilities) of the words in a topic help the interpretation of the topic, i.e., they allow a topic to be associated with a theme, for instance, by looking at the words with the highest weights; moreover, the extracted topics allow users to get a preliminary idea of the themes covered in a possibly large document corpus; finally, each document can be represented in terms of topics, thus obtaining a denser representation than when words are used as descriptors. Topic Modeling has been adopted in many tasks and settings [4]. Previous works showed that, when traditional approaches are applied to short texts, e.g., microblog posts, the lack of word co-occurrence information can negatively affect their effectiveness [5]; therefore, ad-hoc solutions were proposed. Twitter-LDA [6] assumes that each tweet is generated by a single topic, moving the topic mixture from the document to the user; experimental results suggested that this assumption is promising. We hypothesize that this assumption might be too restrictive for generic short texts, and also on Twitter after the extension of the maximum number of characters per tweet.
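To make the notions recalled above concrete – topics as word distributions, documents as topic mixtures, interpretation via top-weighted words – the following is a minimal sketch using scikit-learn's standard LDA implementation; the toy corpus and parameter values (e.g., the number of topics) are illustrative placeholders, not the setup used in this paper.

```python
# Minimal LDA sketch: each topic is a distribution over the vocabulary and
# each document a mixture of topics. Corpus and parameter values are
# placeholders for illustration only.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

docs = [
    "vaccines and public health measures announced today",
    "football match results and the league standings",
    "new vaccine trial results published by the ministry of health",
]  # stand-in corpus

vectorizer = CountVectorizer(stop_words="english")
X = vectorizer.fit_transform(docs)            # document-term count matrix

lda = LatentDirichletAllocation(n_components=3, random_state=0)
doc_topics = lda.fit_transform(X)             # per-document topic mixtures

# Interpret each topic through its highest-weight words.
vocab = vectorizer.get_feature_names_out()
for t, weights in enumerate(lda.components_):
    top = weights.argsort()[::-1][:5]
    print(f"topic {t}:", ", ".join(vocab[i] for i in top))
```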
Our approach aims at relaxing the assumption of single-topic short texts. Besides text length, another issue is the heterogeneity of the descriptors. For instance, Twitter allows the use of hashtags: a hashtag is a sequence of characters – not including punctuation or spaces – starting with “#” which “is used to index keywords or topics on Twitter”.¹ Hashtag-LDA [7] relies on the same assumption of single-topic tweets but, differently from Twitter-LDA, not only words but also hashtags are generated by the topics. Even if previous works [8, 9, 10] explicitly include metadata/tags/labels, in Hashtag-LDA tags are generated by the latent topics, and not vice-versa. Our model shares the same intuition underlying Hashtag-LDA but relaxes the single-topic assumption and explicitly considers the possible generation of words and hashtags from a background/global distribution, as Twitter-LDA does for words.

¹ https://help.twitter.com/en/using-twitter/how-to-use-hashtags

2. Modeling Single and Multi-Topic Documents and Heterogeneous Descriptors
The overall model in plate notation is depicted in Fig. 1.

Figure 1: Full model in plate notation.

The model can be considered an extension of LDA, Twitter-LDA, and Hashtag-LDA: the latent structure of LDA is used to model multi-topic documents, while the latent structure of the other two is used to model single-topic documents. We will describe the model in the context of microblogs, e.g., Twitter; however, the model is modular, and we plan to apply it to heterogeneous document collections constituted by diverse types of documents, e.g., news, forum posts, blog posts, and tweets. Our model can be decomposed into four conceptual blocks, depicted in different colors in Fig. 1.
The first block, highlighted in red, models the key idea underlying our approach: two types of documents can be distinguished, those about a single topic and those about multiple topics. In the model, each user $u$ has her own inclination to write documents on a single topic or on multiple topics; this inclination is encoded in the probability $\pi^T_u$, which affects the type $x_{ud}$ of document written by $u$. This is a simplifying assumption, since aspects other than the user might affect the choice of writing on single or multiple topics. Our approach allows diverse types of users – in terms of their inclination towards single or multiple topics – to be modeled. For instance, influencers or politicians, through their official accounts, usually write long and elaborate posts to express their point of view; other users publish very concise messages, e.g., when answering other tweets.
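The following is a minimal sketch of this first block, under the assumption – made here purely for illustration – of a Beta prior on the per-user inclination; the variable names mirror the notation above ($\pi^T_u$, $x_{ud}$), while the hyperparameter values are placeholders.

```python
# Sketch of the first block: per-user inclination toward single- vs
# multi-topic documents. The Beta prior and its hyperparameters are
# illustrative assumptions, not the paper's exact specification.
import numpy as np

rng = np.random.default_rng(0)

n_users, n_docs_per_user = 3, 5
gamma = (1.0, 1.0)  # hypothetical Beta hyperparameters

# pi_u^T: probability that user u writes a multi-topic document.
pi = rng.beta(*gamma, size=n_users)

# x_ud: document type, 1 = multi-topic (LDA-like), 0 = single-topic
# (Twitter-LDA/Hashtag-LDA-like), drawn independently for each document.
x = rng.binomial(1, pi[:, None], size=(n_users, n_docs_per_user))
```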
The second and third blocks are highlighted in blue and green; they are responsible for the topic assignment to documents, words, and hashtags. The assignment depends on the document type identified in the first block: if the document is about multiple topics – blue block – the assignment is very close to that proposed in LDA, where a single topic is associated with each textual element, e.g., a word or a hashtag; if the document is about a single topic – green block – topic assignment follows Twitter-LDA and Hashtag-LDA, where a single topic is assigned to the whole document; we will refer to such a topic as the main topic.
When a document is about multiple topics – blue block in Fig. 1 –,
• a topic proportion, $\theta_{ud}$, is assigned to each document $ud$, where the $t$-th element denotes the importance of topic $t$ for document $ud$;
• a vector $\lambda_{ud}$ of active topics is assigned to each document $ud$: a non-active topic will have a very low weight, thus making the observation of words or hashtags associated with that topic very unlikely; $\delta$ denotes the probability that a topic is active;
• a topic $z^V_{udn}$ is assigned to each word $udn$, and topics with a larger weight in the document will generate words more frequently; similarly, a topic $z^H_{udl}$ is assigned to each hashtag $udl$, and topics with a larger weight will generate hashtags more frequently.
In the case of tweets, even longer documents will be focused on a limited number of topics, and the vector $\lambda_{ud}$ will introduce sparsity in the representation of documents as mixtures of topics.
When a document is about a single topic – green block in Fig. 1 –,
• a topic proportion, $\theta^*_u$, is assigned to each user, and its $t$-th element denotes the preference of the user for selecting the $t$-th topic as the main topic;
• a main topic, $z^*_{ud}$, is assigned to each document $ud$, and topics with a larger weight in $\theta^*_u$ are assigned more frequently.
In this case, no topic is associated with individual words and hashtags since they are generated from the main topic. The idea underlying this block is that simpler and more concise documents are focused on a single topic, and the selection of the topic depends on the personal preference of the user.
The last block is highlighted in orange and is the one responsible for the generation of words and hashtags from the topics. The model considers:
• a double representation of topics: there is a fixed number of topics, and each topic is represented both as a distribution over words and as a distribution over hashtags;
• background words common to all the topics: this group of words is modeled as a “dedicated” topic and therefore is represented as a distribution over the word vocabulary; similarly, global hashtags are used independently of the topic, modeled as a dedicated topic, and represented as a distribution over the hashtag vocabulary.
The generative processes of words and hashtags are basically identical; the difference lies in the latent variables and the parameters. In the case of words (hashtags), a source, $y^V_{udn}$ ($y^H_{udl}$), is assigned to each word $udn$ (hashtag $udl$) and indicates whether it was generated from a topic or is a background word (global hashtag). The observed word (hashtag) depends on the source, the type of document, and the topic: if it is a background word (global hashtag), the background distribution $\phi^B$ ($\psi^B$) is considered; otherwise, either the main topic distribution, $\phi_{z^*_{ud}}$ ($\psi_{z^*_{ud}}$), or that of the topic associated with the word (hashtag), $\phi_{z^V_{udn}}$ ($\psi_{z^H_{udl}}$), is considered, depending on whether the document is a single- or a multiple-topic document. The two generative processes are identical and independent of each other – the presence of certain words does not affect the presence of hashtags in the same document and vice-versa –; therefore, our approach can be extended with topic representations based on additional vocabularies, e.g., emojis.
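Putting the blocks together, the following is a condensed sketch of the generative story for one document's words; the hashtag side is symmetric, with $\psi$ in place of $\phi$. The specific distribution choices (Dirichlet, Bernoulli, categorical) and all hyperparameter values are illustrative assumptions rather than the paper's exact specification.

```python
# Condensed sketch of the word side of the generative process; the hashtag
# side is symmetric. All hyperparameter values are placeholders.
import numpy as np

rng = np.random.default_rng(0)
T, V, n_words = 5, 1000, 20          # topics, word vocabulary size, doc length
alpha, beta, delta, p_bg = 0.1, 0.01, 0.5, 0.2

phi = rng.dirichlet([beta] * V, size=T)   # per-topic word distributions
phi_B = rng.dirichlet([beta] * V)         # background word distribution

def generate_words(x_ud, theta_star_u):
    """Generate one document's words given its type x_ud (1 = multi-topic)."""
    if x_ud == 1:                         # blue block: LDA-like assignment
        lam = rng.binomial(1, delta, size=T)   # active-topic indicators
        theta = rng.dirichlet([alpha] * T)
        theta = theta * np.maximum(lam, 1e-12) # inactive topics: negligible mass
        theta /= theta.sum()
        z = rng.choice(T, size=n_words, p=theta)   # per-word topics z^V_udn
    else:                                 # green block: one main topic z*_ud
        z_star = rng.choice(T, p=theta_star_u)
        z = np.full(n_words, z_star)
    words = []
    for n in range(n_words):
        y = rng.binomial(1, p_bg)         # source y^V_udn: background or topic
        dist = phi_B if y == 1 else phi[z[n]]
        words.append(rng.choice(V, p=dist))
    return words

# Example: one multi-topic document and one single-topic document for a user
# whose main-topic preference theta*_u is uniform.
doc_multi = generate_words(1, np.full(T, 1.0 / T))
doc_single = generate_words(0, np.full(T, 1.0 / T))
```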
3. Ongoing and Future Work
We are currently focusing on the experimental evaluation of the proposed approach using Twitter datasets and “generic” short texts [5]. A first evaluation was carried out on a collection of tweets in Italian gathered through the Twitter API. The collection consists of 8895 tweets about COVID-19 published between Jan. 24 and Jan. 30, 2022. LDA, Twitter-LDA, and Hashtag-LDA were adopted as baselines. Since those methods are parametric in the number of topics, we selected the number of topics that maximized Topic Coherence (TC), specifically TC-PMI [11], for LDA. Topic Coherence was computed on the same collection, not on an external corpus. We used collapsed Gibbs sampling for learning the topic models.
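For reference, TC-PMI scores a topic via the pointwise mutual information of pairs of its top words, estimated here from document-level co-occurrence counts; the sketch below is one common formulation, and the helper name and smoothing constant are our own illustrative choices, not necessarily the exact variant of [11].

```python
# Sketch of a PMI-based topic coherence (TC-PMI): mean pointwise mutual
# information over pairs of a topic's top words, from document-level
# co-occurrence counts; the smoothing constant eps is illustrative.
from itertools import combinations
import numpy as np

def tc_pmi(top_words, docs_as_sets, eps=1.0):
    """top_words: top-N words of one topic; docs_as_sets: docs as sets of word types."""
    D = len(docs_as_sets)
    df = {w: sum(w in d for d in docs_as_sets) for w in top_words}
    scores = []
    for wi, wj in combinations(top_words, 2):
        co = sum((wi in d) and (wj in d) for d in docs_as_sets)
        # PMI = log p(wi, wj) / (p(wi) p(wj)), smoothed to avoid log(0).
        scores.append(np.log((co + eps) * D / (df[wi] * df[wj] + eps)))
    return float(np.mean(scores))

# Example: coherence of one topic's top words over a toy collection.
collection = [set("covid vaccine dose".split()), set("covid vaccine booster".split())]
print(tc_pmi(["covid", "vaccine", "booster"], collection))
```

Per-topic scores can then be averaged over all topics to compare models at a given number of topics.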
Our approach achieved results comparable with Twitter-LDA, which was the most effective baseline in terms of Topic Coherence – TC-PMI and TC-NZ [12] – and of Jensen-Shannon divergence between the distribution over words of each topic and the distribution over words of the collection. However, differently from Twitter-LDA, our approach provides two different representations of the same topic, one in terms of words and one in terms of hashtags; these representations might be beneficial for interpreting the topics. The subsequent steps will be: (i) investigate in detail the effect of the number of topics on the proposed approach; (ii) investigate how to tailor the model to heterogeneous text collections [13], since it was initially designed for microblogs; (iii) extend the set of adopted baselines, e.g., including the relevant ones among those surveyed in [13, 5]; (iv) evaluate the effectiveness in diverse tasks such as (hash)tag recommendation, text classification, and clustering; (v) perform a qualitative analysis through a case study, for instance, involving expert users.

References
[1] M. Steyvers, T. Griffiths, Probabilistic Topic Models, in: Latent Semantic Analysis: A Road To Meaning, Lawrence Erlbaum Associates Publishers, 2007, pp. 427–448.
[2] D. M. Blei, Probabilistic topic models, Communications of the ACM 55 (2012) 77–84. doi:10.1145/2133806.2133826.
[3] D. M. Blei, A. Y. Ng, M. I. Jordan, Latent dirichlet allocation, Journal of Machine Learning Research 3 (2003) 993–1022.
[4] J. Boyd-Graber, Y. Hu, D. Mimno, Applications of Topic Models, Foundations and Trends® in Information Retrieval 11 (2017) 143–296. doi:10.1561/1500000030.
[5] J. Qiang, Z. Qian, Y. Li, Y. Yuan, X. Wu, Short Text Topic Modeling Techniques, Applications, and Performance: A Survey, IEEE Transactions on Knowledge and Data Engineering 34 (2022) 1427–1445. doi:10.1109/TKDE.2020.2992485. arXiv:1904.07695.
[6] W. X. Zhao, J. Jiang, J. Weng, J. He, E.-P. Lim, H. Yan, X. Li, Comparing twitter and traditional media using topic models, in: P. Clough, C. Foley, C. Gurrin, G. J. F. Jones, W. Kraaij, H. Lee, V. Mudoch (Eds.), Advances in Information Retrieval, Springer Berlin Heidelberg, Berlin, Heidelberg, 2011, pp. 338–349.
[7] F. Zhao, Y. Zhu, H. Jin, L. T. Yang, A personalized hashtag recommendation approach using lda-based topic model in microblog environment, Future Gener. Comput. Syst. 65 (2016) 196–206. doi:10.1016/j.future.2015.10.012.
[8] D. Ramage, D. Hall, R. Nallapati, C. D. Manning, Labeled LDA: A supervised topic model for credit attribution in multi-labeled corpora, in: EMNLP 2009 - Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing, 2009, pp. 248–256.
[9] F. S. Tsai, A tag-topic model for blog mining, Expert Systems with Applications 38 (2011) 5330–5335. doi:10.1016/j.eswa.2010.10.025.
[10] Z. Ma, W. Dou, X. Wang, S. Akella, Tag-Latent Dirichlet Allocation: Understanding Hashtags and Their Relationships, in: 2013 IEEE/WIC/ACM International Joint Conferences on Web Intelligence (WI) and Intelligent Agent Technologies (IAT), volume 1, IEEE, 2013, pp. 260–267. doi:10.1109/WI-IAT.2013.38.
[11] D. Newman, J. H. Lau, K. Grieser, T. Baldwin, Automatic evaluation of topic coherence, in: Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics, HLT ’10, Association for Computational Linguistics, USA, 2010, pp. 100–108.
[12] J. Boyd-Graber, D. Mimno, D. Newman, Care and Feeding of Topic Models: Problems, Diagnostics, and Improvements, CRC Handbooks of Modern Statistical Methods, CRC Press, Boca Raton, Florida, 2014.
[13] J. Qiang, P. Chen, W. Ding, T. Wang, F. Xie, X. Wu, Heterogeneous-length text topic modeling for reader-aware multi-document summarization, ACM Transactions on Knowledge Discovery from Data 13 (2019). doi:10.1145/3333030.