<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Monolingual and Cross-lingual information retrieval in cultural Microblog at CLEF 2018</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Chedi Bechikh Ali</string-name>
          <email>chedi.bechikh@gmail.com</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Hatem Haddad</string-name>
          <email>hatem.haddad@ulb.ac.be</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Institut Supérieur de Gestion, Université de Tunis</institution>
          ,
          <country country="TN">Tunisia</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Université libre de Bruxelles</institution>
          ,
          <country country="BE">Belgium</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>For CLEF 2018, we focus on cultural microblog search. The aim of this work is to find relevant microcritics about films in monolingual and cross-lingual contexts. This task is challenging due to the short length of both the queries and the documents. For the monolingual context we propose to expand the query using a probabilistic weighting scheme. For the French-English cross-language task, we used a state-of-the-art approach based on query translation.</p>
      </abstract>
      <kwd-group>
        <kwd>Microblog search</kwd>
        <kwd>Query expansion</kwd>
        <kwd>Query translation</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-2">
      <title>Introduction</title>
      <p>The Cross Language cultural microblog search is the first task<sup>3</sup> of the Multilingual Cultural Mining and Retrieval (MC2) lab.</p>
      <p>The goal of this task is to find relevant microblogs in different languages
in the MC2 corpus, a dataset of 70,000,000 microblogs, using 75 topics related to films.
The corpus was collected between May and September 2015 and is centered on
the keyword "festival". The topics were selected by the task organizers
from the VodKaster website and represent a selection of microcritics in French about
films and cinema festivals. Each topic is composed of a film title, a narrative
field containing a microcritic about the film, and a third field containing a list of
expressions extracted manually from the microcritic.</p>
      <p>Figure 1 shows an example of the query structure.</p>
      <p>In this work we chose to intervene at the topic level, because it is simpler
than changing the indexing scheme. For the monolingual search we based our
work on a query expansion approach; for the cross-lingual search we used query
translation.</p>
      <p>This paper is organized as follows. In Section 2 we describe related work on
query expansion and cross-language information retrieval. In Section 3
we describe the proposed query expansion approach and the query
translation techniques used. We conclude in Section 4.</p>
      <fig>
        <label>Figure 1</label>
        <preformat>&lt;topic&gt;
  &lt;id&gt;201800&lt;/id&gt;
  &lt;title&gt;Phantom of the Paradise&lt;/title&gt;
  &lt;narrative&gt;Palma d'or pour le festival de Swan&lt;/narrative&gt;
  &lt;nugguets&gt;Palma d'or;festival de Swan&lt;/nugguets&gt;
&lt;/topic&gt;</preformat>
      </fig>
      <p><sup>3</sup> https://mc2.talne.eu/</p>
    </sec>
    <sec>
      <title>Related work</title>
      <p>In this paper we propose to use query expansion for monolingual microblog
search, so we first present a brief state of the art on query expansion. We then
present the different approaches used for cross-language
information retrieval.</p>
      <sec>
        <title>Query expansion</title>
        <p>Given that user queries are usually short and that some words are ambiguous,
a simple retrieval model based on matching query terms against
document terms is prone to errors and omissions. Users of an information
retrieval system may also use words other than those present in relevant documents,
which leads to the term mismatch problem. Researchers have proposed query
expansion to address the problems caused by short queries and term mismatch.</p>
        <p>
          Different approaches have been proposed [
          <xref ref-type="bibr" rid="ref2">2</xref>
          ]:
        </p>
        <list list-type="bullet">
          <list-item><p>Interactive query refinement</p></list-item>
          <list-item><p>Relevance feedback</p></list-item>
          <list-item><p>Word sense disambiguation in IR</p></list-item>
          <list-item><p>Search results clustering</p></list-item>
        </list>
        <p>In our work we rely on relevance feedback. In relevance feedback, an initial
retrieval run is performed using the initial formulation of the query.
A number of top-ranked documents are then mined for additional query terms to
be added to the initial query. The content of the assessed documents is used to
adjust the weights of terms in the original query and/or to add words to the
query.</p>
      </sec>
      <sec>
        <title>Cross language information retrieval</title>
        <p>
          In cross-language information retrieval (CLIR), given a query in language A,
the goal is to retrieve documents in language B. CLIR is useful in different
contexts. For example, relevant documents may not exist in the query's language,
so the system must be able to retrieve relevant documents in other languages.
Several approaches have been used for CLIR: query translation, document translation,
or translation of both documents and queries. Most researchers rely on query
translation because it is easier than translating all the documents. For
this purpose, different approaches have been used [
          <xref ref-type="bibr" rid="ref1">1</xref>
          ]:
        </p>
        <list list-type="bullet">
          <list-item><p>Machine translation</p></list-item>
          <list-item><p>Dictionary</p></list-item>
          <list-item><p>Parallel or comparable corpora</p></list-item>
          <list-item><p>Inter-language representation</p></list-item>
        </list>
        <p>These different approaches can be used for cross-language microblog search
and adapted to our task context. The translated queries are then executed
against the target collection in a monolingual way.</p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>Methodology</title>
      <p>Document search in microblogs presents many difficulties. The queries are
short, so the retrieval system does not have enough context to understand the user's
needs. This can lead to different problems: word ambiguity and word mismatch
between queries and documents. The documents
(microcritics) are also short, so they too lack context.</p>
      <p>We submitted 3 runs:</p>
      <list list-type="bullet">
        <list-item><p>A French monolingual run: "Be-Ha-submission-fr-fr"</p></list-item>
        <list-item><p>Two French-English cross-lingual runs: "Be-Ha-Submission3Fr-English-dictionary" and "Be-HaSubmussion2-FR-English"</p></list-item>
      </list>
      <p>
        For the monolingual run we used query expansion based on the probabilistic
BM25 model [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]. The process consists of four steps:
      </p>
      <list list-type="simple">
        <list-item><p>Step 1: The original queries are used to retrieve the top 50 microcritics for each
query. These retrieved microcritics are used as a set of pseudo-relevant
documents.</p></list-item>
        <list-item><p>Step 2: For each query, we merge the retrieved documents into a single
document.</p></list-item>
        <list-item><p>Step 3: Since expansion terms are selected from this set of pseudo-relevant
documents, the merged document is used to compute the weight of each
candidate expansion term. The 30 terms with the highest scores are selected as
expansion terms and added to the original query.</p></list-item>
        <list-item><p>Step 4: The new query is matched against the collection of microblogs.</p></list-item>
      </list>
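      <p>The four steps above can be sketched in a few lines of Python. This is a minimal illustration, not the code used for the submitted runs: the function name expand_query, the term_weight callable and the toy documents are all assumptions introduced here, and the initial retrieval run (Step 1) is assumed to have already produced a ranked list of tokenized documents.</p>
      <preformat>
```python
from collections import Counter


def expand_query(query_terms, ranked_docs, term_weight, top_docs=50, top_terms=30):
    """Pseudo-relevance feedback: add the best-scoring feedback terms to the query.

    ranked_docs: tokenized documents from the initial run, best match first.
    term_weight: hypothetical callable (term, tf_in_merged_doc) returning a score.
    """
    # Steps 1-2: treat the top documents as pseudo-relevant and merge them.
    merged = Counter()
    for doc in ranked_docs[:top_docs]:
        merged.update(doc)

    # Step 3: score every candidate term that is not already in the query,
    # then keep the highest-scoring ones (ties broken alphabetically).
    seen = set(query_terms)
    scores = {t: term_weight(t, tf) for t, tf in merged.items() if t not in seen}
    best = sorted(scores, key=lambda t: (-scores[t], t))[:top_terms]

    # The expanded query (original terms first) is what Step 4 would run.
    return list(query_terms) + best


# Toy data, invented for illustration; raw frequency stands in for the weight.
ranked = [["festival", "cannes", "film"], ["cannes", "palme", "film"], ["festival", "jury"]]
expanded = expand_query(["festival"], ranked, lambda term, tf: tf, top_docs=2, top_terms=2)
print(expanded)
```
      </preformat>
      <p>In the paper's setting, top_docs would be 50, top_terms would be 30, and term_weight would be a BM25-style weight rather than the raw frequency used in this toy call.</p>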
      <p>
        We chose the BM25 model to score each term because it
is one of the most accurate models for information retrieval. This model is based on
a weighting scheme defined by the following formula [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]:
      </p>
      <disp-formula>
        <label>(1)</label>
        <tex-math>\mathrm{Score}(R \mid D) = \sum_{t} w^{(1)} \, \frac{(k_1 + 1)\, tf}{k_1 \left( (1 - b) + b \, \frac{|d|}{|d_{avg}|} \right) + tf} \, \frac{(k_3 + 1)\, qtf}{k_3 + qtf}</tex-math>
      </disp-formula>
      <p>where the sum runs over the query terms t, qtf is the number of times that the term t is present in the query, tf
is the frequency of the term t in the document, |d| is the number of terms in the
document, |d_avg| is the average length of a document, k1 is a parameter that
controls the saturation of tf, k3 is a parameter that controls the saturation of
qtf, and b is the document length normalization parameter.</p>
      <p>w<sup>(1)</sup> is similar to the inverse document frequency (idf) and is defined by [?]:</p>
      <disp-formula>
        <label>(2)</label>
        <tex-math>w^{(1)} = \log \frac{N - df + 0.5}{df + 0.5}</tex-math>
      </disp-formula>
      <p>where N is the number of documents in the collection and df is the number
of documents in the collection in which the term t occurs.</p>
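      <p>As a worked illustration of the weighting scheme, the following Python sketch computes the per-term contribution and sums it over the query terms. The function names and the default values k1 = 1.2, k3 = 8 and b = 0.75 are assumptions chosen for illustration; the paper does not state which parameter values were used.</p>
      <preformat>
```python
import math


def rsj_weight(n_docs, df):
    """w(1): the idf-like weight log((N - df + 0.5) / (df + 0.5))."""
    return math.log((n_docs - df + 0.5) / (df + 0.5))


def bm25_term_score(tf, qtf, doc_len, avg_doc_len, n_docs, df, k1=1.2, k3=8.0, b=0.75):
    """Contribution of one query term to the BM25 score of one document."""
    # Document length normalization: equals 1 for a document of average length.
    length_norm = (1.0 - b) + b * doc_len / avg_doc_len
    tf_part = (k1 + 1.0) * tf / (k1 * length_norm + tf)
    qtf_part = (k3 + 1.0) * qtf / (k3 + qtf)
    return rsj_weight(n_docs, df) * tf_part * qtf_part


def bm25_score(query_tf, doc_tf, doc_len, avg_doc_len, n_docs, doc_freq, **params):
    """Sum the per-term contributions over the query terms found in the document."""
    return sum(
        bm25_term_score(doc_tf[t], qtf, doc_len, avg_doc_len, n_docs, doc_freq[t], **params)
        for t, qtf in query_tf.items()
        if doc_tf.get(t)
    )


# Toy numbers, invented for illustration: one matching term in an average-length document.
score = bm25_score({"film": 1, "jury": 1}, {"film": 1}, 10, 10, 1000, {"film": 10, "jury": 5})
print(round(score, 4))
```
      </preformat>
      <p>With tf = 1, qtf = 1 and a document of average length, both saturation factors equal 1, so the score reduces to the idf-like weight of the matching term.</p>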
      <p>For cross-language microblog search, we translate the query using two
techniques:</p>
      <list list-type="bullet">
        <list-item><p>A French-English dictionary</p></list-item>
        <list-item><p>A bilingual WordNet</p></list-item>
      </list>
      <p>For the run "Be-Ha-Submission3Fr-English-dictionary" we used a dictionary
translation approach to translate the queries from French to
English.</p>
      <p>The bilingual dictionary used for query translation was constructed
from an online dictionary. It consists of 33k distinct English words and 28k
distinct French words, which constitute 76k translation pairs. It contains
lemmatized forms of content words (nouns, verbs, adjectives, adverbs).</p>
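      <p>A minimal Python sketch of dictionary-based query translation follows. The function name, the toy dictionary entries and the choice to keep out-of-dictionary terms unchanged are assumptions made for illustration; the paper does not describe how untranslatable terms were handled.</p>
      <preformat>
```python
def translate_query(query_terms, bilingual_dict):
    """Replace each French term by all of its English dictionary translations.

    Terms missing from the dictionary are kept unchanged, a common choice for
    proper nouns such as film titles (an assumption, not the paper's method).
    """
    translated = []
    for term in query_terms:
        for candidate in bilingual_dict.get(term.lower(), [term]):
            if candidate not in translated:
                translated.append(candidate)
    return translated


# Toy French-to-English entries, invented for illustration.
toy_dict = {"acteur": ["actor", "performer"], "film": ["film", "movie"]}
english_query = translate_query(["acteur", "film", "Swan"], toy_dict)
print(english_query)
```
      </preformat>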
      <p>
        For the run "Be-HaSubmussion2-FR-English" we used an inter-lingual
representation based on the English and French WordNets. WordNet is a large lexical
database of English that has been extended to many languages. Nouns, verbs,
adjectives and adverbs are grouped into sets of cognitive synonyms (synsets), each
expressing a distinct concept [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]. WordNet's structure makes it a useful tool for
computational linguistics and natural language processing.
      </p>
      <p>In WordNet each word is mapped to an identifier, so words with the same
meaning share the same identifier. For example, the French word "acteur" (actor)
has the identifier "09765278-n", which is the same identifier as for the words "actor",
"actress", "performer", etc.</p>
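      <p>The identifier-based mapping can be illustrated with a small Python sketch. The two lookup tables below are hypothetical stand-ins for the French and English WordNet data; only the identifier "09765278-n" and the words attached to it come from the example in the text.</p>
      <preformat>
```python
# Toy tables: French word to synset identifiers, and synset identifier to
# English words. The table layout is an assumption made for illustration.
FRENCH_SYNSETS = {"acteur": ["09765278-n"]}
ENGLISH_WORDS = {"09765278-n": ["actor", "actress", "performer"]}


def translate_via_synsets(term, source_synsets, target_words):
    """Map a source-language word to the target-language words sharing a synset id."""
    result = []
    for synset_id in source_synsets.get(term, []):
        for word in target_words.get(synset_id, []):
            if word not in result:
                result.append(word)
    return result


english_terms = translate_via_synsets("acteur", FRENCH_SYNSETS, ENGLISH_WORDS)
print(english_terms)
```
      </preformat>
      <p>Because the identifier is language-independent, the same lookup works in either direction once the two tables are swapped for their inverses.</p>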
    </sec>
    <sec id="sec-4">
      <title>Conclusion</title>
      <p>This paper describes our first participation in the first task, Cross Language
Microblog Search, of the MC2 lab at CLEF 2018. The aim of this work is to
find relevant microblogs given the title of a film and a microcritic about it. We
submitted 3 runs: one French monolingual run and two cross-lingual runs based on
query translation. This work is still in progress and needs further investigation:
other expansion approaches should be explored, and the indexing
process should also be studied.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Bechikh-Ali</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Haddad</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Slimani</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          :
          <article-title>Cross-language information retrieval based on bilingual formal concept mining</article-title>
          .
          <source>In: 14th IEEE/ACS International Conference on Computer Systems and Applications (AICCSA 2017)</source>
          , Hammamet, Tunisia, October 30-November 3, 2017. pp.
          <fpage>1</fpage>
          –
          <lpage>7</lpage>
          (
          <year>2017</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Carpineto</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Romano</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          :
          <article-title>A survey of automatic query expansion in information retrieval</article-title>
          .
          <source>ACM Comput. Surv.</source>
          <volume>44</volume>
          (
          <issue>1</issue>
          ),
          <fpage>1:1</fpage>
          –
          <lpage>1:50</lpage>
          (
          <year>2012</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Fellbaum</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          (ed.):
          <article-title>WordNet: an electronic lexical database</article-title>
          . MIT Press (
          <year>1998</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>Robertson</surname>
            ,
            <given-names>S.E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Walker</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Jones</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hancock-Beaulieu</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gatford</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          :
          <article-title>Okapi at TREC-3</article-title>
          .
          <source>In: Proceedings of The Third Text REtrieval Conference, TREC 1994</source>
          , Gaithersburg, Maryland, USA, November 2-4, 1994. pp.
          <fpage>109</fpage>
          –
          <lpage>126</lpage>
          (
          <year>1994</year>
          )
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>