<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Modular Approach to Topic Modeling for Heterogeneous Documents</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Discussion Paper</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Giovanni Toto</string-name>
          <email>toto@stat.unipd.it</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Emanuele Di Buccio</string-name>
          <email>emanuele.dibuccio@unipd.it</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Department of Information Engineering, University of Padova</institution>
          ,
          <addr-line>Via G. Gradenigo 6/b, 35131, Padova</addr-line>
          ,
          <country country="IT">Italy</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Department of Statistical Sciences, University of Padova</institution>
          ,
          <addr-line>Via C. Battisti, 241, 35121, Padova</addr-line>
          ,
          <country country="IT">Italy</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2022</year>
      </pub-date>
      <abstract>
        <p>Topic Modeling algorithms help unveil the latent thematic structure of large document collections. Previous works showed that traditional approaches can be less effective when applied to short texts, e.g., tweets; however, this can be mitigated by assuming that each document is about a single topic, as done in Twitter-LDA. In this work, we relax this assumption and propose a new model where a document can be about a single topic or multiple topics. Our model allows the generation of diverse types of descriptors from latent topics, e.g., words and hashtags, similarly to Hashtag-LDA. Moreover, words/hashtags can be generated from topics or from a background/global distribution. The proposed model is modular, and our goal is to tailor it to collections that can be heterogeneous both in the presence of single- or multiple-topic documents and in the adoption of diverse topic representations.</p>
      </abstract>
      <kwd-group>
        <kwd>Topic Modeling</kwd>
        <kwd>Text Mining</kwd>
        <kwd>Heterogeneous Text Topic Modeling</kwd>
        <kwd>Topic Modeling for Microblogs</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        Topic Modeling algorithms are Machine Learning approaches introduced to unveil the latent
thematic structure from unstructured document corpora. In Probabilistic Topic Models [
        <xref ref-type="bibr" rid="ref1 ref2">1, 2</xref>
        ],
whose most representative technique can be considered Latent Dirichlet Allocation (LDA) [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ], a
theme is represented through a topic which is a probability distribution over the entire corpus
vocabulary. Documents in the corpus can be represented as a mixture of topics. One of the
benefits of this representation is interpretability: the weights (probabilities) of the words in
a topic help the interpretation of the topic, i.e., associate a topic to a theme, for instance, by
looking at the words with the highest weights; moreover, the extracted topics allow users to
have a preliminary idea of the themes covered in a possibly large document corpus; finally, each
document can be represented in terms of topics, thus obtaining a denser representation
than when words are used as descriptors.
      </p>
      <p>
        Topic Modeling has been adopted in many tasks and settings [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]. Previous works showed that
when applied to short texts, e.g., microblog posts, the lack of word co-occurrence information can
negatively affect the effectiveness of traditional approaches [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]; therefore, ad-hoc solutions were
proposed. Twitter-LDA [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ] assumes that each tweet is generated by a single topic, moving the
topic mixture from the document to the user; experimental results suggested that this assumption
is promising. We hypothesize that this assumption might be too restrictive for generic short
texts, and also for tweets since the maximum number of characters per tweet was extended.
Our approach aims to relax the single-topic assumption for short texts.
      </p>
      <p>
        Besides text length, another issue is the heterogeneity of the descriptors. For instance,
Twitter allows the use of hashtags: a hashtag is a sequence of characters, not including
punctuation or spaces, starting with “#” which “is used to index keywords or topics on
Twitter”<sup>1</sup>. Hashtag-LDA [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ] relies on the same assumption of single-topic tweets but, unlike
Twitter-LDA, it generates not only words but also hashtags from the topics. Even if previous
works [
        <xref ref-type="bibr" rid="ref10 ref8 ref9">8, 9, 10</xref>
        ] explicitly include metadata/tags/labels, in Hashtag-LDA tags are generated by the
latent topics, and not vice versa. Our model shares the intuition underlying Hashtag-LDA
but relaxes the single-topic assumption and explicitly considers the possible generation of words
and hashtags from a background/global distribution, as Twitter-LDA does for words.
      </p>
    </sec>
    <sec id="sec-2">
      <title>2. Modeling Single and Multi-Topic Documents and Heterogeneous Descriptors</title>
    </sec>
    <sec id="sec-3">
      <p>The overall model in plate notation is depicted in Fig. 1. The model can be considered an
extension of LDA, Twitter-LDA, and Hashtag-LDA: the latent structure of LDA is used to model
multi-topic documents, while the latent structure of the other two is used to model single-topic
documents. We will describe the model in the context of microblogs, e.g., Twitter; however, the
model is modular and we plan to apply it to heterogeneous document collections consisting of
diverse types of documents, e.g., news, forum posts, blog posts, and tweets.</p>
      <p>Our model can be decomposed into four conceptual blocks, depicted in different colors in Fig. 1.</p>
      <p>The first block, highlighted in red, models the key idea underlying our approach: two types of
documents can be distinguished, those about a single topic and those about multiple topics. In
the model, each user has their own inclination to write documents on a single topic or on multiple
topics; this inclination is encoded in a user-specific probability, which affects the type of each document
written by that user. This is a simplifying assumption, since aspects other than the user might affect the
choice of writing on single or multiple topics. Our approach allows diverse types of users, in
terms of their inclination toward single or multiple topics, to be modeled. For instance, influencers
or politicians, through their official accounts, usually write long and elaborate posts to express
their point of view; other users publish very concise messages, e.g., when replying to other tweets.</p>
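      <p>As a rough illustration of the first block, the sketch below draws the type of a document from a per-user Bernoulli distribution. The names <monospace>user_single_topic_prob</monospace> and <monospace>draw_document_type</monospace>, the type labels, the choice of encoding the inclination as the probability of a single-topic document, and the example values are all assumptions made for illustration; they are not the notation or the estimates of the model.</p>
      <preformat><![CDATA[
```python
import random

# Hypothetical per-user probabilities of writing a single-topic document.
# User names and values are illustrative assumptions, not estimates.
user_single_topic_prob = {"politician": 0.2, "casual_user": 0.9}


def draw_document_type(user, rng):
    """Return 'single' with the user's probability, otherwise 'multi'."""
    p_single = user_single_topic_prob[user]
    return "single" if rng.random() < p_single else "multi"


rng = random.Random(0)
types = [draw_document_type("casual_user", rng) for _ in range(1000)]
# Empirical share of single-topic documents for this user.
single_share = types.count("single") / len(types)
```
]]></preformat>
      <p>With many documents per user, the empirical share of single-topic documents approaches the probability assumed for that user.</p>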
      <p>The second and third blocks are highlighted in blue and green; they are responsible for
the topic assignment to documents, words, and hashtags. The assignment depends on the
document type identified in the first block: if the document is about multiple topics (blue block),
the assignment is very close to that proposed in LDA, where a single topic is associated with each
textual element, e.g., a word or a hashtag; if the document is about a single topic (green block),
topic assignment follows Twitter-LDA and Hashtag-LDA, where a single topic is assigned to
each document; we will refer to this topic as the main topic.</p>
      <p><sup>1</sup> https://help.twitter.com/en/using-twitter/how-to-use-hashtags</p>
      <p>When a document is about multiple topics (blue block in Fig. 1):
        <list list-type="bullet">
          <list-item><p>a topic proportion vector is assigned to each document, where each element denotes
the importance of the corresponding topic for that document;</p></list-item>
          <list-item><p>a vector of active topics is assigned to each document: a non-active topic will
have a very low weight, thus making the observation of words or hashtags
associated with that topic very unlikely; a dedicated parameter denotes the probability that a topic is active;</p></list-item>
          <list-item><p>a topic is assigned to each word, and topics with a larger weight in the document
will generate words more frequently; similarly, a topic is assigned to each hashtag,
and topics with a larger weight will generate hashtags more frequently.</p></list-item>
        </list>
      </p>
      <p>In the case of tweets, longer documents will focus on a limited number of topics, and the
vector of active topics will introduce sparsity in the representation of documents as mixtures of topics.</p>
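      <p>The multi-topic block can be sketched as follows. This is only one possible reading: the text does not specify how non-active topics receive a very low weight, so the sketch assumes a Dirichlet draw in which active topics use concentration <monospace>alpha</monospace> and non-active topics use a much smaller <monospace>eps</monospace>. The function names, the default values, and the rule that forces at least one active topic are hypothetical.</p>
      <preformat><![CDATA[
```python
import random


def draw_topic_proportions(n_topics, active_prob, rng, alpha=1.0, eps=0.01):
    """Draw an active-topic vector and sparse topic proportions for one document.

    Active topics get Dirichlet concentration `alpha`; non-active topics get the
    much smaller `eps`, so their weight is very low but not forced to zero.
    """
    active = [1 if rng.random() < active_prob else 0 for _ in range(n_topics)]
    if sum(active) == 0:
        # Assumption of this sketch: keep at least one active topic.
        active[rng.randrange(n_topics)] = 1
    # A Dirichlet draw obtained by normalising independent Gamma draws.
    gammas = [rng.gammavariate(alpha if a else eps, 1.0) for a in active]
    total = sum(gammas)
    return active, [g / total for g in gammas]


def assign_topics(n_items, proportions, rng):
    """Assign one topic to each word (or hashtag) according to the proportions."""
    return rng.choices(range(len(proportions)), weights=proportions, k=n_items)


rng = random.Random(1)
active, theta = draw_topic_proportions(n_topics=10, active_prob=0.3, rng=rng)
word_topics = assign_topics(n_items=20, proportions=theta, rng=rng)
```
]]></preformat>
      <p>Hashtags would be handled by calling the hypothetical <monospace>assign_topics</monospace> again with the same proportions, so that topics with a larger weight generate both words and hashtags more frequently.</p>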
      <p>When a document is about a single topic (green block in Fig. 1):
        <list list-type="bullet">
          <list-item><p>a topic proportion vector is assigned to each user, and each element denotes the user's preference
for selecting the corresponding topic as the main topic;</p></list-item>
          <list-item><p>a main topic is assigned to each document, and topics with a larger weight in the user's topic
proportion are assigned more frequently.</p></list-item>
        </list>
      </p>
      <p>In this case, no topic is associated with individual words and hashtags, since they are all generated from the main
topic. The idea underlying this block is that simpler and more concise documents are focused on
a single topic, and the selection of the topic depends on the personal preference of the user.</p>
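      <p>A minimal sketch of the single-topic block, assuming the user's preference is a categorical distribution over topics. The variable <monospace>user_topic_preference</monospace>, its values, and the function name are hypothetical and only serve the illustration.</p>
      <preformat><![CDATA[
```python
import random

# Hypothetical preference of one user over three topics (sums to 1).
user_topic_preference = [0.7, 0.2, 0.1]


def draw_main_topic(preference, rng):
    """Draw the single main topic of a document from the user's preference."""
    return rng.choices(range(len(preference)), weights=preference, k=1)[0]


rng = random.Random(2)
main_topics = [draw_main_topic(user_topic_preference, rng) for _ in range(2000)]
# Topics with a larger weight are selected as main topic more frequently.
share_topic_0 = main_topics.count(0) / len(main_topics)
```
]]></preformat>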
      <p>The last block is highlighted in orange and is responsible for the generation of words
and hashtags from the topics. The model considers:
        <list list-type="bullet">
          <list-item><p>a double representation of topics: there is a fixed number of topics, and each topic is
represented both as a distribution over words and as a distribution over hashtags;</p></list-item>
          <list-item><p>background words common to all the topics: this group of words is modeled as a “dedicated”
topic and is therefore represented as a distribution over the word vocabulary; similarly,
global hashtags are used independently of the topic and are modeled as a dedicated
topic, represented as a distribution over the hashtag vocabulary.</p></list-item>
        </list>
      </p>
      <p>The generative processes of words and hashtags are basically identical; the difference lies
in the latent variables and the parameters. In the case of words (hashtags), a source
is assigned to each word (hashtag) and indicates whether it was generated from a topic or is
a background word (global hashtag). The observed word (hashtag) depends on the source, the
type of document, and the topic: if it is a background word (global hashtag), the background
(global) distribution is considered; otherwise, either the main topic distribution or that of
the topic associated with the word (hashtag) is considered, depending on whether the document is,
respectively, a single-topic or a multiple-topic document. The two generative processes are identical
and independent of each other (the presence of certain words does not affect the presence of
hashtags in the same document, and vice versa); therefore, our approach can be extended with
topic representations based on additional vocabularies, e.g., emojis.</p>
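      <p>The switch between source, document type, and topic described above can be sketched as a single function. The function <monospace>generate_token</monospace>, its parameters, and the toy distributions are assumptions made for illustration; in particular, the probability of choosing the background source is passed as a plain number, whereas the model treats the source as a latent variable.</p>
      <preformat><![CDATA[
```python
import random


def generate_token(doc_type, main_topic, token_topic, topic_dists,
                   background_dist, background_prob, rng):
    """Generate one word (or hashtag) following the source/type/topic switch."""
    vocab = list(background_dist)
    if rng.random() < background_prob:
        dist = background_dist            # background word / global hashtag
    elif doc_type == "single":
        dist = topic_dists[main_topic]    # main topic of the document
    else:
        dist = topic_dists[token_topic]   # topic assigned to this token
    return rng.choices(vocab, weights=[dist[v] for v in vocab], k=1)[0]


# Toy word distributions (illustrative values only).
background_words = {"the": 0.9, "vaccine": 0.05, "match": 0.05}
word_topics = [
    {"the": 0.0, "vaccine": 1.0, "match": 0.0},  # toy "health" topic
    {"the": 0.0, "vaccine": 0.0, "match": 1.0},  # toy "sport" topic
]

rng = random.Random(3)
# Single-topic document: every word comes from main topic 0.
single_doc = [
    generate_token("single", 0, None, word_topics, background_words, 0.0, rng)
    for _ in range(5)
]
# Multi-topic document: each word comes from its own assigned topic.
multi_doc = [
    generate_token("multi", None, z, word_topics, background_words, 0.0, rng)
    for z in [0, 1, 1]
]
```
]]></preformat>
      <p>Because the two processes are independent, hashtags (or an additional vocabulary such as emojis) could reuse the same hypothetical function with their own topic and global distributions.</p>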
    </sec>
    <sec id="sec-4">
      <title>3. Ongoing and Future Work</title>
      <p>
        We are currently focusing on the experimental evaluation of the proposed approach using
Twitter datasets and “generic” short texts [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]. A first evaluation was carried out on a collection
of tweets in Italian gathered through the Twitter API. The collection consists of 8895 tweets about
COVID-19 published between Jan. 24 and Jan. 30, 2022. LDA, Twitter-LDA, and Hashtag-LDA
were adopted as baselines. Since those methods are parametric in the number of topics, we
selected the number of topics that maximized Topic Coherence (TC), specifically TC-PMI [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ],
for LDA. Topic Coherence was computed on the same collection, not on an external corpus.
We used collapsed Gibbs sampling for learning the topic models.
      </p>
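      <p>For illustration, a PMI-based coherence score computed on the collection itself could look like the sketch below. It uses document-level co-occurrence and averages the PMI over all pairs of a topic's top words; the smoothing constant <monospace>eps</monospace>, the averaging, and the function name <monospace>tc_pmi</monospace> are assumptions of this sketch and may differ from the exact TC-PMI formulation used in the evaluation.</p>
      <preformat><![CDATA[
```python
import math
from itertools import combinations


def tc_pmi(top_words, documents, eps=1e-12):
    """Average PMI over all pairs of a topic's top words.

    Probabilities are document frequencies estimated on `documents` itself,
    i.e. on the same collection rather than on an external corpus.
    """
    doc_sets = [set(doc) for doc in documents]
    n_docs = len(doc_sets)

    def doc_prob(*words):
        # Fraction of documents that contain all the given words.
        return sum(all(w in d for w in words) for d in doc_sets) / n_docs

    pairs = list(combinations(top_words, 2))
    scores = [
        math.log((doc_prob(wi, wj) + eps) / (doc_prob(wi) * doc_prob(wj) + eps))
        for wi, wj in pairs
    ]
    return sum(scores) / len(pairs)


# Toy collection of tokenised documents (illustrative only).
docs = [
    ["covid", "vaccine", "dose"],
    ["covid", "vaccine"],
    ["match", "goal"],
    ["match", "goal", "team"],
]
coherent = tc_pmi(["covid", "vaccine"], docs)
incoherent = tc_pmi(["covid", "goal"], docs)
```
]]></preformat>
      <p>To select the number of topics, one would average such a score over the topics of each candidate model and keep the candidate with the highest value.</p>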
      <p>
        Our approach achieved results comparable with Twitter-LDA, which was the most effective
baseline in terms of Topic Coherence (TC-PMI and TC-NZ [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ]) and Jensen-Shannon
divergence between the distribution over the words of the topics and the distribution over the words
of the collection. However, unlike Twitter-LDA, our approach provides two different
representations of the same topic, one in terms of words and one in terms of hashtags; these
representations might be beneficial for interpreting the topics.
      </p>
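      <p>The Jensen-Shannon divergence between a topic's word distribution and the word distribution of the collection can be computed as in the sketch below. The base-2 logarithm (which bounds the value between 0 and 1) and the toy distributions are assumptions; the text does not state which base or aggregation over topics was used.</p>
      <preformat><![CDATA[
```python
import math


def kl_divergence(p, q):
    """Kullback-Leibler divergence in bits; terms with p_i = 0 contribute 0."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def js_divergence(p, q):
    """Jensen-Shannon divergence between two distributions on the same vocabulary."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)


# Toy distributions over a three-word vocabulary (illustrative only).
collection_dist = [0.5, 0.3, 0.2]
topic_dist = [0.1, 0.1, 0.8]
js_value = js_divergence(topic_dist, collection_dist)
```
]]></preformat>
      <p>A topic whose word distribution coincides with that of the collection yields a divergence of 0, while larger values indicate a topic that departs more from the overall word usage.</p>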
      <p>
        The subsequent steps will be: (i) investigate in detail the effect of the number of topics on the
proposed approach; (ii) investigate how to tailor the model to heterogeneous test collections [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ]
since it was initially designed for microblogs; (iii) extend the set of adopted baselines, e.g.,
including the relevant ones among those surveyed in [
        <xref ref-type="bibr" rid="ref13 ref5">13, 5</xref>
        ]; (iv) evaluate the effectiveness in
diverse tasks such as (hash)tag recommendation, text classification, and clustering; (v) perform
a qualitative analysis through a case study, for instance, involving expert users.
      </p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>M.</given-names>
            <surname>Steyvers</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Griffiths</surname>
          </string-name>
          ,
          <article-title>Probabilistic Topic Models</article-title>
          , in:
          <source>Latent Semantic Analysis: A Road To Meaning</source>
          , Lawrence Erlbaum Associates Publishers,
          <year>2007</year>
          , pp.
          <fpage>427</fpage>
          -
          <lpage>448</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>D. M.</given-names>
            <surname>Blei</surname>
          </string-name>
          ,
          <article-title>Probabilistic topic models</article-title>
          ,
          <source>Communications of the ACM</source>
          <volume>55</volume>
          (
          <year>2012</year>
          )
          <fpage>77</fpage>
          -
          <lpage>84</lpage>
          .
          doi:
          <pub-id pub-id-type="doi">10.1145/2133806.2133826</pub-id>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>D. M.</given-names>
            <surname>Blei</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. Y.</given-names>
            <surname>Ng</surname>
          </string-name>
          ,
          <string-name>
            <surname>M. I. Jordan</surname>
          </string-name>
          ,
          <article-title>Latent dirichlet allocation</article-title>
          ,
          <source>Journal of Machine Learning Research</source>
          <volume>3</volume>
          (
          <year>2003</year>
          )
          <fpage>993</fpage>
          -
          <lpage>1022</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>J.</given-names>
            <surname>Boyd-Graber</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Hu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Mimno</surname>
          </string-name>
          ,
          <article-title>Applications of Topic Models</article-title>
          ,
          <source>Foundations and Trends® in Information Retrieval</source>
          <volume>11</volume>
          (
          <year>2017</year>
          )
          <fpage>143</fpage>
          -
          <lpage>296</lpage>
          .
          doi:
          <pub-id pub-id-type="doi">10.1561/1500000030</pub-id>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>J.</given-names>
            <surname>Qiang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Qian</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Yuan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Wu</surname>
          </string-name>
          ,
          <article-title>Short Text Topic Modeling Techniques, Applications, and Performance: A Survey</article-title>
          ,
          <source>IEEE Transactions on Knowledge and Data Engineering</source>
          <volume>34</volume>
          (
          <year>2022</year>
          )
          <fpage>1427</fpage>
          -
          <lpage>1445</lpage>
          .
          doi:
          <pub-id pub-id-type="doi">10.1109/TKDE.2020.2992485</pub-id>
          . arXiv:
          <pub-id pub-id-type="arxiv">1904.07695</pub-id>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>W. X.</given-names>
            <surname>Zhao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Jiang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Weng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>He</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.-P.</given-names>
            <surname>Lim</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Yan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <article-title>Comparing twitter and traditional media using topic models</article-title>
          , in: P. Clough,
          <string-name>
            <given-names>C.</given-names>
            <surname>Foley</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Gurrin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G. J. F.</given-names>
            <surname>Jones</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Kraaij</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Lee</surname>
          </string-name>
          , V. Murdock (Eds.),
          <source>Advances in Information Retrieval</source>
          , Springer Berlin Heidelberg, Berlin, Heidelberg,
          <year>2011</year>
          , pp.
          <fpage>338</fpage>
          -
          <lpage>349</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>F.</given-names>
            <surname>Zhao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Zhu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Jin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L. T.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <article-title>A personalized hashtag recommendation approach using lda-based topic model in microblog environment</article-title>
          ,
          <source>Future Gener. Comput. Syst</source>
          .
          <volume>65</volume>
          (
          <year>2016</year>
          )
          <fpage>196</fpage>
          -
          <lpage>206</lpage>
          .
          doi:
          <pub-id pub-id-type="doi">10.1016/j.future.2015.10.012</pub-id>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>D.</given-names>
            <surname>Ramage</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Hall</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Nallapati</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C. D.</given-names>
            <surname>Manning</surname>
          </string-name>
          ,
          <article-title>Labeled LDA: A supervised topic model for credit attribution in multi-labeled corpora</article-title>
          , in:
          <source>EMNLP 2009 - Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing: A Meeting of SIGDAT, a Special Interest Group of ACL, Held in Conjunction with ACL-IJCNLP 2009</source>
          , August
          <year>2009</year>
          , pp.
          <fpage>248</fpage>
          -
          <lpage>256</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>F. S.</given-names>
            <surname>Tsai</surname>
          </string-name>
          ,
          <article-title>A tag-topic model for blog mining</article-title>
          ,
          <source>Expert Systems with Applications</source>
          <volume>38</volume>
          (
          <year>2011</year>
          )
          <fpage>5330</fpage>
          -
          <lpage>5335</lpage>
          .
          doi:
          <pub-id pub-id-type="doi">10.1016/j.eswa.2010.10.025</pub-id>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>Z.</given-names>
            <surname>Ma</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Dou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Akella</surname>
          </string-name>
          ,
          <article-title>Tag-Latent Dirichlet Allocation: Understanding Hashtags and Their Relationships</article-title>
          , in:
          <source>2013 IEEE/WIC/ACM International Joint Conferences on Web Intelligence (WI) and Intelligent Agent Technologies (IAT)</source>
          , volume 1
          , IEEE,
          <year>2013</year>
          , pp.
          <fpage>260</fpage>
          -
          <lpage>267</lpage>
          .
          doi:
          <pub-id pub-id-type="doi">10.1109/WI-IAT.2013.38</pub-id>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>D.</given-names>
            <surname>Newman</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. H.</given-names>
            <surname>Lau</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Grieser</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Baldwin</surname>
          </string-name>
          ,
          <article-title>Automatic evaluation of topic coherence</article-title>
          , in:
          <source>Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics</source>
          , HLT '10, Association for Computational Linguistics, USA,
          <year>2010</year>
          , pp.
          <fpage>100</fpage>
          -
          <lpage>108</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>J.</given-names>
            <surname>Boyd-Graber</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Mimno</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Newman</surname>
          </string-name>
          ,
          <article-title>Care and Feeding of Topic Models: Problems, Diagnostics, and Improvements, CRC Handbooks of Modern Statistical Methods</article-title>
          , CRC Press, Boca Raton, Florida,
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>J.</given-names>
            <surname>Qiang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Ding</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Xie</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Wu</surname>
          </string-name>
          ,
          <article-title>Heterogeneous-length text topic modeling for reader-Aware multi-document summarization</article-title>
          ,
          <source>ACM Transactions on Knowledge Discovery from Data</source>
          <volume>13</volume>
          (
          <year>2019</year>
          ).
          doi:
          <pub-id pub-id-type="doi">10.1145/3333030</pub-id>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>