<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Microblog Search Task at CLEF 2017: Query Generation using IR and LDA Topic Modeling Combination</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Malek Hajjem</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Chiraz Latiri</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>LIPAH research Laboratory, Faculty of Sciences of Tunis</institution>
          ,
          <addr-line>Tunis EL Manar University, Campus Universitaire Farhat Hached B.P. n94, 1068 Tunis</addr-line>
          ,
          <country country="TN">Tunisia</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>The microblog search task at CLEF 2017 consists of developing a system to search for the most relevant microblogs for a cultural query in a collection about festivals in all languages. Our general approach to reach this objective is the following: from the initial tweet queries provided for the task, we generate extended queries able to retrieve an answer-rich set of microblogs. This is achieved using a thematic representation of the tweet query extracted from a microblog corpus. In this paper we investigate a novel method to improve topics learned from Twitter content without modifying the basic machinery of LDA. The method relies on an Information Retrieval (IR) process to generate a query-specific set of similar tweets. This set then serves as the input of a basic LDA topic modeling process. Finally, the output thematic clusters serve as our source of expansion for the initial queries.</p>
      </abstract>
      <kwd-group>
        <kwd>CLEF</kwd>
        <kwd>Microblogs Search</kwd>
        <kwd>LDA</kwd>
        <kwd>Information Retrieval</kwd>
        <kwd>Aggregation</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>
        The microblog search task is the second task (https://mc2.talne.eu/spip/Tasks/2-microblog-search/) at CLEF 2017 [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ] (Conference and Labs of the Evaluation Forum), part of the Cultural Microblog Contextualization track. This task consists in developing a system to search for the most relevant microblogs for a cultural query in a collection about festivals in all languages. Topics, each announcing a cultural event, were gathered from different sources [
        <xref ref-type="bibr" rid="ref4 ref5">4, 5</xref>
        ]. The goal is to retrieve relevant and diverse tweets related to each event from a dataset of 70,000,000 microblogs. This corpus dates from May to September 2015 and is centered on the keyword "Festival".
      </p>
      <p>
        We consider that using topic models such as Latent Dirichlet Allocation (LDA) could be useful in this microblog search task. Given that a "topic model is often employed to mine 'latent topics' from high dimensionality of terms in text" [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ], these topics can be used to describe the content of a collection or of a query. In fact, the high-probability topics, and the words within those topics, can be viewed as a loose description of a text.
      </p>
      <p>The contribution of this work is to identify the "topic" or topics being discussed in a query, and then to use this knowledge to include semantically related words. To better match the ambiguous nature of tweet queries, topics are extracted from a microblog corpus. We investigate a novel method to improve topics learned from Twitter content without modifying the basic machinery of LDA. Following this introduction, the paper is organized as follows. Section 2 presents the state of the art. Section 3 describes the methodology. The Conclusion section wraps up the paper.</p>
    </sec>
    <sec id="sec-2">
      <title>Related works</title>
      <sec id="sec-2-1">
        <title>Topic modeling for short texts</title>
        <p>
          Topic models are used to uncover the latent semantic structure of a text corpus. A topic consists of a cluster of words that frequently occur together. Using contextual clues, topic models can connect words with similar meanings and distinguish between uses of words with multiple meanings. Traditional topic models, like LDA, rely on co-occurrence patterns of words in documents to learn latent topics [
          <xref ref-type="bibr" rid="ref2">2</xref>
          ]. Due to the messy nature of short texts, directly applying conventional topic models (e.g., LDA and pLSA) to them is not efficient. Indeed, naive topic models implicitly capture document-level word co-occurrence patterns to reveal topics. Thus, to avoid data sparsity, works such as [
          <xref ref-type="bibr" rid="ref12 ref7 ref9">7, 12,
9</xref>
          ] have applied topic models to tweets using a pooling strategy, which consists in aggregating similar short texts into one document. In [
          <xref ref-type="bibr" rid="ref12">12</xref>
          ], the authors proposed a tweet pooling strategy based on aggregating the messages of each user. The authors of [
          <xref ref-type="bibr" rid="ref7">7</xref>
          ] also experimented with several schemes to train a standard topic model and compared their quality and effectiveness. In [
          <xref ref-type="bibr" rid="ref1">1</xref>
          ], the authors proposed to gather tweets occurring in the same user-to-user conversation and showed that this technique outperforms other pooling methods in terms of clustering quality and document retrieval. Closest to our work, the authors of [
          <xref ref-type="bibr" rid="ref9">9</xref>
          ] proposed a new method of tweet pooling using hashtags, where documents sharing a common hashtag are gathered together. All these works showed that training a topic model on aggregated messages yields a higher-quality learned model.
        </p>
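        <p>The hashtag pooling strategy discussed above can be sketched in a few lines: tweets sharing a hashtag are concatenated into one pooled document, which is a better-sized input for LDA than an individual tweet. The tweets and hashtags below are invented toy data.</p>
        <preformat>
```python
# Sketch of hashtag-based tweet pooling: tweets with a common hashtag
# are merged into one pooled document. Toy data is invented.
import re
from collections import defaultdict

tweets = [
    "opening night was amazing #avignon",
    "queue for tickets is huge #avignon",
    "loved the headliner set #cannes",
]

pools = defaultdict(list)
for t in tweets:
    for tag in re.findall(r"#(\w+)", t):
        pools[tag].append(t)

# Each pooled document aggregates all tweets sharing a hashtag.
pooled_docs = {tag: " ".join(ts) for tag, ts in pools.items()}
for tag, doc in pooled_docs.items():
    print(tag, "->", doc)
```
        </preformat>
        <p>Such pooling depends entirely on hashtags being present, which is the metadata dependence our approach in Section 3 tries to avoid.</p>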
      </sec>
      <sec id="sec-2-2">
        <title>Topic model for IR: query expansion</title>
        <p>
          There are two obvious approaches to including topic models in IR. In the first, a document is represented by itself together with the topics to which it belongs. The second approach is to compute a query-related topic using topic models and use it for query expansion. In this case, queries are reformulated (i.e., usually expanded) to improve retrieval effectiveness. The authors of [
          <xref ref-type="bibr" rid="ref13">13</xref>
          ] proposed a method to find a good query-related topic based on LDA, where experiments confirm that query expansion based on the derived topics achieves statistically significant improvements. Others, in [
          <xref ref-type="bibr" rid="ref11">11</xref>
          ], implement one of the most common local approaches to query expansion, Pseudo-Relevance Feedback (PRF): the top k retrieved documents are assumed to be relevant, and their topic terms are extracted to expand the query.
        </p>
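        <p>The PRF idea above can be sketched with simple term counting: treat the top-k retrieved documents as relevant, take their most frequent non-query terms, and append them to the query. The documents, query, and stopword list are invented toy data; a real PRF implementation would weight terms (e.g., by tf-idf) rather than raw counts.</p>
        <preformat>
```python
# Sketch of Pseudo-Relevance Feedback (PRF): assume the top-k retrieved
# documents are relevant, extract their most frequent terms, and append
# them to the query. Toy data is invented.
from collections import Counter

query = "jazz festival"
top_k_docs = [   # pretend these came from a first retrieval pass
    "jazz festival lineup with famous saxophone players",
    "saxophone and trumpet stars at the jazz festival",
]

# Skip the original query terms and a tiny ad-hoc stopword list.
stop = set(query.split()) | {"with", "and", "at", "the"}
counts = Counter(
    w for doc in top_k_docs for w in doc.split() if w not in stop
)

expansion = [w for w, _ in counts.most_common(2)]
expanded_query = query + " " + " ".join(expansion)
print(expanded_query)
```
        </preformat>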
      </sec>
      <sec id="sec-2-3">
        <title>Topic model for microblog search</title>
        <p>
          With the rapid development of microblogs, microblog search has become one of the most active research areas in recent years. Microblog search differs significantly from traditional text retrieval. In fact, microblog users often issue short queries to find relevant information. Moreover, the restriction on message length makes it hard to discriminate terms within a given item. To improve retrieval effectiveness in microblogs, researchers tend to use query expansion techniques, with the goal of reducing the usual query/document mismatch. In this area, using a topic model like LDA could be useful. However, unlike on regular text, topic modeling is not good at processing short texts, and few works try to enhance microblog search using topic models such as LDA. We cite [
          <xref ref-type="bibr" rid="ref10">10</xref>
          ], where the authors present a method for contextualizing short messages using a thematic representation extracted from Wikipedia. This representation extends the vocabulary of short messages with a set of thematically related words, and the results show that the method contributes to a better understanding of short messages. Other works, like [
          <xref ref-type="bibr" rid="ref3">3</xref>
          ], propose a novel approach to locating microblogging experts for a given query. The authors first define experts by social influence and content relevance; for social influence, they present a global-ranking algorithm, GUserRank, and a topic-ranking algorithm, TUserRank, after applying the LDA topic model.
        </p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>Methodology</title>
      <sec id="sec-3-1">
        <title>Resources and data pre-processing</title>
        <p>
          To build a robust LDA model, a large amount of data is needed. From this perspective, two tweet corpora are used in different runs. Note that we deliberately chose microblog corpora in order to respect the noisy nature of tweet queries. The first corpus is a comparable tweet corpus about the Arab Spring, collected in Arabic and French through Twitter's API (via Otterapi, a real-time search engine that indexes the most influential tweets; search API: https://dev.twitter.com/rest/public/search). Basically, Twitter data is accessed by collecting tweets that contain specific keywords; more information about this corpus can be found in [
          <xref ref-type="bibr" rid="ref6">6</xref>
          ]. The second tweet corpus is the one provided to the participants in the evaluation campaign. It is composed of 70,000,000 microblogs, dates from May to September 2015, and is centered on the keyword "Festival". Note that before applying LDA, redundancy was eliminated by deleting retweets, and language detection was performed using a Java library (https://code.google.com/p/language-detection/) to remove foreign-language tweets.
        </p>
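        <p>The retweet-removal step above amounts to a simple filter, sketched below on invented toy tweets. We assume the conventional "RT @" prefix marks a retweet; the language-filtering step would additionally require a language detector such as the Java library cited above and is omitted here.</p>
        <preformat>
```python
# Sketch of the pre-processing described above: drop retweets to
# remove redundancy. Toy tweets are invented.
tweets = [
    "RT @user: festival starts tomorrow",
    "festival starts tomorrow",
    "le concert commence demain",
]

def is_retweet(text):
    # Classic retweets are conventionally prefixed with "RT @".
    return text.startswith("RT @")

cleaned = [t for t in tweets if not is_retweet(t)]
print(cleaned)
```
        </preformat>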
      </sec>
      <sec id="sec-3-2">
        <title>Information retrieval based approach for tweet pooling</title>
        <p>
          In this work, we present an unsupervised topic model based on aggregating tweets that are thematically close. The goal is to adapt the basic LDA process to short tweet texts, which improves the quality of the discovered latent topics. To perform tweet pooling, we propose to use an information retrieval strategy and hierarchical classification in order to avoid data sparsity in short texts, as illustrated in Figure 1. Our approach is an alternative to state-of-the-art methods based on tweet pooling via metadata (hashtags, user information, etc.); such methods are highly dependent on the metadata content of the tweet corpus. Our approach relies on three main steps, namely:
          - Step 1: Preliminary set generation. For each tweet, we retrieve a set of matched tweets out of a large tweet collection C of n tweets, using an information retrieval system. Thus, a tweet used as a query may match several tweets in C with different similarity degrees.
          - Step 2: Pooled set construction. The idea is to aggregate tweets into partitions by gathering the preliminary sets resulting from the IR process according to a combination criterion: if the same search results are assigned to different tweets, then those tweets are considered thematically close.
          - Step 3: Topic extraction. It consists in learning latent topics from the aggregated tweets via LDA.
        </p>
        <p>
          Based on this pooling process, the initial queries are expanded as follows:
          - Extract topics using the combined IR and aggregation strategy described above.
          - Project the resulting topics on the tweet text to detect the subset of thematically relevant terms.
          - Reformulate the initial query using this subset of thematic terms as enriched features, in the form of an Indri query.
          Two runs were submitted:
          - RUN 1: Expand the initial query using the latent topics extracted from the comparable tweet corpus [
          <xref ref-type="bibr" rid="ref5">5</xref>
          ] through IR-LDA.
          - RUN 2: Expand the initial query using the latent topics extracted from the Festival tweet corpus through IR-LDA.
        </p>
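        <p>Step 2's combination criterion, i.e. grouping tweets whose preliminary retrieval sets share a result, can be sketched with a small union-find over tweets. The retrieval results below are invented placeholders; in the actual pipeline they would come from Step 1's IR system.</p>
        <preformat>
```python
# Sketch of Step 2 (pooled set construction): tweets whose preliminary
# retrieval sets share a search result are merged into one pool.
# The retrieval results are invented toy data.
from collections import defaultdict

# tweet id -> ids of tweets returned by the IR system for that tweet-as-query
results = {
    "t1": {"d1", "d2"},
    "t2": {"d2", "d3"},   # shares d2 with t1, so same pool
    "t3": {"d9"},         # no overlap, so its own pool
}

# Union-find over tweets, merging those that share a search result.
parent = {t: t for t in results}

def find(x):
    while parent[x] != x:
        parent[x] = parent[parent[x]]  # path halving
        x = parent[x]
    return x

def union(a, b):
    parent[find(a)] = find(b)

owner = {}   # search-result id -> first tweet whose set contained it
for t, docs in results.items():
    for d in docs:
        if d in owner:
            union(t, owner[d])
        else:
            owner[d] = t

pools = defaultdict(list)
for t in results:
    pools[find(t)].append(t)
print(list(pools.values()))
```
        </preformat>
        <p>Each resulting pool would then be concatenated into one document before running LDA in Step 3.</p>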
      </sec>
    </sec>
    <sec id="sec-4">
      <title>Conclusion</title>
      <p>Large multilingual collections of posts have become publicly available due to the phenomenal growth of social networks and microblogs all over the world, which makes microblogs valuable information sources. However, little is known about how to search socially-generated content in an effective way. In this paper, we presented a method to expand short messages using a thematic representation, together with a novel tweet aggregation method that improves the quality of LDA-based topic modeling, and hence of the latent topics, for short texts.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <given-names>D.</given-names>
            <surname>Alvarez-Melis</surname>
          </string-name>
          and
          <string-name>
            <given-names>M.</given-names>
            <surname>Saveski</surname>
          </string-name>
          .
          <article-title>Topic modeling in twitter: Aggregating tweets by conversations</article-title>
          .
          <source>In Tenth International AAAI Conference on Web and Social Media</source>
          ,
          <year>2016</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <given-names>D. M.</given-names>
            <surname>Blei</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. Y.</given-names>
            <surname>Ng</surname>
          </string-name>
          , and
          <string-name>
            <given-names>M. I.</given-names>
            <surname>Jordan</surname>
          </string-name>
          .
          <article-title>Latent dirichlet allocation</article-title>
          .
          <source>J. Mach. Learn. Res.</source>
          ,
          <volume>3</volume>
          :
          <fpage>993</fpage>
          –
          <lpage>1022</lpage>
          , Mar.
          <year>2003</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <given-names>Q.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Q.</given-names>
            <surname>Hu</surname>
          </string-name>
          , and
          <string-name>
            <given-names>L.</given-names>
            <surname>He</surname>
          </string-name>
          .
          <article-title>Locating query-oriented experts in microblog search</article-title>
          .
          <source>In Proceedings of Workshop on Semantic Matching in Information Retrieval</source>
          co
          <article-title>-located with the 37th international ACM SIGIR conference on research and development in information retrieval</article-title>
          ,
          <source>SMIR@SIGIR</source>
          <year>2014</year>
          , Queensland, Australia, July 11, 2014, pages
          <fpage>16</fpage>
          –
          <lpage>23</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <given-names>J.-V.</given-names>
            <surname>Cossu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Gaillard</surname>
          </string-name>
          ,
          <string-name>
            <surname>T.-M. Juan-Manuel</surname>
            , and
            <given-names>M. El</given-names>
          </string-name>
          <string-name>
            <surname>Beze</surname>
          </string-name>
          .
          <article-title>Contextualisation de messages courts : l'importance des metadonnees</article-title>
          .
          <source>In EGC'2013 13e Conference Francophone sur l'Extraction et la Gestion des connaissances</source>
          , Toulouse, France, Jan.
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <given-names>M.</given-names>
            <surname>Hajjem</surname>
          </string-name>
          and
          <string-name>
            <given-names>C.</given-names>
            <surname>Latiri</surname>
          </string-name>
          .
          <article-title>Features extraction to improve comparable tweet corpora building</article-title>
          .
          <source>In JADT Acte</source>
          , Nice, France,
          <year>June 2016</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <given-names>M.</given-names>
            <surname>Hajjem</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C. C.</given-names>
            <surname>Latiri</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Y.</given-names>
            <surname>Slimani</surname>
          </string-name>
          .
          <article-title>Twitter as a multilingual source of comparable corpora</article-title>
          .
          <source>In Proceedings of the 12th International Conference on Advances in Mobile Computing and Multimedia</source>
          , Kaohsiung, Taiwan, December 8-10,
          <year>2014</year>
          , pages
          <fpage>342</fpage>
          –
          <lpage>345</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <given-names>L.</given-names>
            <surname>Hong</surname>
          </string-name>
          and
          <string-name>
            <given-names>B. D.</given-names>
            <surname>Davison</surname>
          </string-name>
          .
          <article-title>Empirical study of topic modeling in twitter</article-title>
          .
          <source>In Proceedings of the First Workshop on Social Media Analytics, SOMA '10</source>
          , pages
          <fpage>80</fpage>
          –
          <lpage>88</lpage>
          , New York, NY, USA,
          <year>2010</year>
          . ACM.
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <given-names>L.</given-names>
            <surname>Ermakova</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Goeuriot</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>SanJuan</surname>
          </string-name>
          , et al.
          <article-title>CLEF 2017 microblog cultural contextualization lab overview</article-title>
          .
          <source>In International Conference of the Cross-Language Evaluation Forum for European Languages (CLEF 2017) Proceedings, LNCS, Springer</source>
          , Dublin,
          <year>2017</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
          <string-name>
            <given-names>R.</given-names>
            <surname>Mehrotra</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Sanner</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Buntine</surname>
          </string-name>
          , and
          <string-name>
            <given-names>L.</given-names>
            <surname>Xie</surname>
          </string-name>
          .
          <article-title>Improving lda topic models for microblogs via tweet pooling and automatic labeling</article-title>
          .
          <source>In Proceedings of the 36th International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR '13</source>
          , pages
          <fpage>889</fpage>
          –
          <lpage>892</lpage>
          , New York, NY, USA,
          <year>2013</year>
          . ACM.
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          10.
          <string-name>
            <given-names>M.</given-names>
            <surname>Morchid</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Dufour</surname>
          </string-name>
          , and
          <string-name>
            <given-names>G.</given-names>
            <surname>Linares</surname>
          </string-name>
          .
          <article-title>Combinaison de themes latents pour la contextualisation de Tweets</article-title>
          .
          <source>In EGC'2013 13e Conference Francophone sur l'Extraction et la Gestion des connaissances</source>
          , Toulouse, France, Jan.
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          11.
          <string-name>
            <given-names>L.</given-names>
            <surname>Ren</surname>
          </string-name>
          .
          <article-title>Implement topic relevance model for query expansion</article-title>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          12.
          <string-name>
            <given-names>J.</given-names>
            <surname>Weng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.-P.</given-names>
            <surname>Lim</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Jiang</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Q.</given-names>
            <surname>He</surname>
          </string-name>
          .
          <article-title>Twitterrank: Finding topic-sensitive influential twitterers</article-title>
          .
          <source>In Proceedings of the Third ACM International Conference on Web Search and Data Mining, WSDM '10</source>
          , pages
          <fpage>261</fpage>
          –
          <lpage>270</lpage>
          , New York, NY, USA,
          <year>2010</year>
          . ACM.
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          13.
          <string-name>
            <given-names>X.</given-names>
            <surname>Yi</surname>
          </string-name>
          and
          <string-name>
            <given-names>J.</given-names>
            <surname>Allan</surname>
          </string-name>
          .
          <article-title>A comparative study of utilizing topic models for information retrieval</article-title>
          .
          <source>In Proceedings of the 31th European Conference on IR Research on Advances in Information Retrieval, ECIR '09</source>
          , pages
          <fpage>29</fpage>
          –
          <lpage>41</lpage>
          , Berlin, Heidelberg,
          <year>2009</year>
          . Springer-Verlag.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>