<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Multimedia Information Modeling and Retrieval (MRIM)/Laboratoire d'Informatique de Grenoble (LIG) at CHiC 2013</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Kian Lam Tan</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Mohannad ALMasri</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Jean-Pierre Chevallet</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Philippe Mulhem</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Catherine Berrut</string-name>
          <email>Catherine.Berrut@imag.fr</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Kian-Lam.Tan</institution>
          ,
          <addr-line>Mohannad.Almasri,Jean-Pierre.Chevallet</addr-line>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>UJF-Grenoble 1</institution>
        </aff>
      </contrib-group>
      <abstract>
        <p>Numerous cultural heritage materials are accessible through online digital library portals. However, the conversion to digital form has led to problems of inconsistency and incompleteness. The Cultural Heritage in CLEF (CHiC) 2013 lab organizes an evaluation campaign that involves several tasks: 1) a multilingual task, 2) a Polish task and 3) an interactive task. We present the results of the MRIM/LIG team for the Ad-Hoc task and for the Semantic Enrichment task. For the Ad-Hoc task, we incorporate Term Links based on Wikipedia into the Language Model. Our approach has the following advantages: 1) the Term Similarity Matrix is easy and simple to generate from statistical information, and 2) its integration into the Language Model is lightweight. For the semantic query enrichment task, we deal with the short queries found in this collection. These short queries cannot describe a specific information need. Hence, the goal of this task is to find the best ten terms for a query in order to semantically enrich the topic and guess the user's information need or original query intent. We use Wikipedia as a semantic resource to find these related terms.</p>
      </abstract>
      <kwd-group>
        <kwd>Information Retrieval</kwd>
        <kwd>Language Model</kwd>
        <kwd>Query Enrichment</kwd>
        <kwd>Query Expansion</kwd>
        <kwd>Semantic Resource</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>Cultural heritage is an expression of the ways of living developed by a
community and passed on from generation to generation, including customs, practices,
places, objects, artistic expressions and values. Cultural heritage can be divided
into two types: artifacts and the built environment. Artifacts consist of books,
objects, documents and pictures, such as Leonardo da Vinci's Mona Lisa, displayed
at the Musée du Louvre, Paris, and The Last Supper, displayed at Santa Maria delle
Grazie, Milan.</p>
      <p>Europeana gives people around the world flexible access to cultural
heritage information such as text, image, audio and video. Cultural Heritage in
CLEF (CHiC) has organized an evaluation lab since 2012 to address the key problems
faced by Europeana.</p>
      <p>We participated in the English monolingual ad-hoc retrieval task and the English
monolingual semantic enrichment task.</p>
    </sec>
    <sec id="sec-2">
      <title>Ad-hoc Retrieval Task</title>
      <p>This is a standard ad-hoc retrieval task, which measures the effectiveness of an
Information Retrieval System (IRS). In this standard setting, the IRS returns a
relevance-ranked list of documents given a query and a collection of documents.</p>
      <sec id="sec-2-1">
        <title>Approach</title>
        <p>The main idea of this approach is to integrate the term links into the usual
Dirichlet formula. First, we assume that a document term w′ ∈ d can play the
role of a query term w ∈ q during the matching process. More specifically, we
consider the case where w does not occur in the initial document d, but occurs in the
document d<sub>ext</sub>, which is the result of extending d according to the query
and some knowledge<sup>1</sup>. The probability of the term is then defined according
to the extended document d<sub>ext</sub>.</p>
        <p>The knowledge is assumed to take the form of a symmetric similarity function
Sim : V × V → [0, 1], which denotes the strength of the similarity between two
terms of the vocabulary V (the larger the value, the stronger the similarity). We
set Sim(w, w′) = 1 if w and w′ match exactly, and Sim(w, w′) = 0 if w has no link
with w′, for all w ∈ V.</p>
        <p>To achieve this, we use some simple and sensible heuristics:
          <list list-type="order">
            <list-item><p>If a query term w occurs in a document d, then the term does not change the length of the document.</p></list-item>
            <list-item><p>If a query term w does not occur in a document d but has a link with a document term w′, then we define w″ = argmax<sub>w′ ∈ d, w′ ≠ w</sub> Sim(w, w′) as the document term that serves as the basis for counting the pseudo occurrences of w in d, namely c(w″, d) · Sim(w″, w). These pseudo occurrences are then included in the size of the extended document.</p></list-item>
            <list-item><p>If a query term w does not occur in the document and has no link, then its occurrences are counted in the extended document.</p></list-item>
          </list>
        </p>
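        <p>Finally, using the usual notation for the terms that occur in the
document and the query, the new length of the document, |d<sub>ext</sub>|, is:
          <disp-formula><label>(1)</label><tex-math>|d_{ext}| = \sum_{w \in d \cap q} c(w, d) + \sum_{w'' \in d \setminus q,\; \mathrm{Sim}(w, w'') \neq 0} c(w'', d) \cdot \mathrm{Sim}(w'', w) + \sum_{w' \in d \setminus q,\; \mathrm{Sim}(w, w') = 0} c(w', d)</tex-math></disp-formula>
        </p>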
        <sec id="sec-2-1-1">
          <title>1 The knowledge refers to the term links</title>
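          <p>with w″ defined above for one query term w so that:
            <disp-formula><tex-math>w'' = \arg\max_{w' \in d,\; w' \neq w} \mathrm{Sim}(w, w')</tex-math></disp-formula>
Using the fact above, the expression of |d<sub>ext</sub>| can be simplified into:
            <disp-formula><label>(2)</label><tex-math>|d_{ext}| = |d| + \sum_{w'' \in d \setminus q,\; \mathrm{Sim}(w, w'') \neq 0} c(w'', d) \cdot \mathrm{Sim}(w'', w)</tex-math></disp-formula>
          </p>
          <p>With all the elements described above, the extended Dirichlet smoothing
leads to the following probability for a term w of the vocabulary V in the
extended document d<sub>ext</sub> according to a query q, where p<sub>μ</sub>(w|d<sub>ext</sub>) is defined
as:
            <list list-type="order">
              <list-item><p>if w ∈ d ∩ q:
                <disp-formula><label>(3)</label><tex-math>p_{\mu}(w|d_{ext}) = \frac{c(w, d) + \mu P(w|C)}{|d_{ext}| + \mu}</tex-math></disp-formula></p></list-item>
              <list-item><p>if ∃ w″ ∈ d \ q, Sim(w, w″) ≠ 0:
                <disp-formula><label>(4)</label><tex-math>p_{\mu}(w|d_{ext}) = \frac{c(w'', d) \cdot \mathrm{Sim}(w, w'') + \mu P(w''|C)}{|d_{ext}| + \mu}</tex-math></disp-formula>
with w″ = argmax<sub>w′ ∈ d, w′ ≠ w</sub> Sim(w, w′).</p></list-item>
              <list-item><p>if ∄ w″ ∈ d \ q, Sim(w, w″) ≠ 0:
                <disp-formula><label>(5)</label><tex-math>p_{\mu}(w|d_{ext}) = \frac{c(w, d) + \mu P(w|C)}{|d_{ext}| + \mu}</tex-math></disp-formula></p></list-item>
            </list>
          </p>
          <p>In the specific case where all the query terms from q occur in the document
d, the first case above applies with |d<sub>ext</sub>| = |d|, which leads to
p<sub>μ</sub>(w|d) = p<sub>μ</sub>(w|d<sub>ext</sub>).</p>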
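          <p>The three cases of the extended Dirichlet smoothing can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the XIOTA implementation: the function name, the <monospace>sim</monospace> callable, the <monospace>p_coll</monospace> dictionary and the toy data are all hypothetical, and the best linked term w″ is searched among the document terms that are not query terms (d \ q), which is one possible reading of the definitions above.</p>
          <preformat><![CDATA[
```python
from collections import Counter


def extended_dirichlet(query, doc_tokens, sim, p_coll, mu=350.0):
    """Return {query term: p_mu(w | d_ext)} for every term of the query.

    sim(a, b) is a hypothetical callable giving the term-link strength in
    [0, 1]; p_coll maps a term to its collection probability P(w | C).
    """
    counts = Counter(doc_tokens)           # c(w, d)
    d_len = sum(counts.values())           # |d|
    query_terms = set(query)

    # For each query term absent from d, find w'' = the best linked term
    # among the document terms that are not query terms (assumption: d \ q).
    best_link = {}
    for w in query_terms:
        if counts[w] > 0:
            continue
        scored = [(sim(w, t), t) for t in counts if t not in query_terms]
        if scored:
            strength, term = max(scored)
            if strength > 0:
                best_link[w] = (term, strength)

    # |d_ext| = |d| + pseudo occurrences c(w'', d) * Sim(w'', w).
    d_ext_len = d_len + sum(counts[t] * s for t, s in best_link.values())

    probs = {}
    for w in query_terms:
        if counts[w] > 0:                  # case 1: w occurs in d
            numerator = counts[w] + mu * p_coll.get(w, 0.0)
        elif w in best_link:               # case 2: w is linked to w''
            term, strength = best_link[w]
            numerator = counts[term] * strength + mu * p_coll.get(term, 0.0)
        else:                              # case 3: no occurrence, no link
            numerator = mu * p_coll.get(w, 0.0)
        probs[w] = numerator / (d_ext_len + mu)
    return probs


# Toy data, invented for illustration only.
toy_links = {("meal", "supper"): 0.5}


def toy_sim(a, b):
    """Symmetric toy similarity: 1 for identical terms, else the toy link."""
    if a == b:
        return 1.0
    return toy_links.get(tuple(sorted((a, b))), 0.0)


toy_doc = ["last", "meal", "painting", "painting"]
toy_p_coll = {"supper": 0.01, "meal": 0.02, "painting": 0.03}
probs = extended_dirichlet(["supper", "painting"], toy_doc, toy_sim,
                           toy_p_coll, mu=100.0)
```
]]></preformat>
          <p>In this toy run, "supper" does not occur in the document but is linked to "meal", so the document length grows from 4 to 4.5 and "supper" receives a non-zero pseudo count; a query whose terms all occur in the document falls back to the usual Dirichlet estimate.</p>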
        </sec>
      </sec>
      <sec id="sec-2-2">
        <title>Term Links</title>
        <p>We make the assumption that two terms are linked to each
other if both terms co-occur in the same context. The term links therefore contain
the link between a term w and a term w′. In this experiment, we only used Cosine
Similarity (CS) to generate the term links.</p>
        <p>The CS between terms w and w′ is represented using a dot product and
magnitude as follows:</p>
        <p>
          <disp-formula><label>(6)</label><tex-math>\mathrm{Sim}_{cosine}(w, w') = \sqrt{\frac{n(w \cap w')}{n(w) \cdot n(w')}}</tex-math></disp-formula>
        </p>
        <p>All the experiments were run with the XIOTA engine [<xref ref-type="bibr" rid="ref3">3</xref>]. Performance
is measured by Mean Average Precision (MAP). The optimal value of the
Dirichlet prior is 100 for the baseline and 350 for all the Extended Dirichlet runs.
We use only the title of each query, without the description, and
index the title, subject, and description of the documents (CHiC collection).
For pre-processing, we remove stop words (a list of 571 words)
and non-character symbols, convert upper case to lower case, and apply the Porter
stemming method. In addition, we use the English
Wikipedia (version 2012-01-01), which contains 3.835 million articles, to generate
two types of Term Links, which we call "TermLinks1" and "TermLinks2",
based on Cosine Similarity (6). We do not apply Porter stemming to
"TermLinks1", while we apply it to "TermLinks2".
        </p>
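        <p>The generation of term links from co-occurrence counts can be sketched as follows. This is only an illustration under assumptions that the text does not state explicitly: a "context" is taken to be one article, n(w) is the number of contexts containing w, n(w ∩ w′) is the number of contexts containing both terms, and the similarity is computed as the square root of n(w ∩ w′) / (n(w) · n(w′)). The function name and the toy contexts are hypothetical.</p>
        <preformat><![CDATA[
```python
import math
from collections import Counter
from itertools import combinations


def build_term_links(contexts):
    """Map each unordered term pair (a, b), with a sorted before b,
    to its co-occurrence based similarity."""
    n = Counter()        # n(w): number of contexts that contain w
    n_both = Counter()   # n(w and w'): number of contexts containing both
    for tokens in contexts:
        vocab = sorted(set(tokens))          # count each term once per context
        n.update(vocab)
        n_both.update(combinations(vocab, 2))
    return {
        (a, b): math.sqrt(both / (n[a] * n[b]))
        for (a, b), both in n_both.items()
    }


# Toy contexts, invented for illustration only.
toy_contexts = [
    ["last", "supper", "leonardo"],
    ["last", "supper"],
    ["leonardo", "mona", "lisa"],
]
toy_term_links = build_term_links(toy_contexts)
```
]]></preformat>
        <p>Pairs that never co-occur are simply absent from the result, which corresponds to a similarity of 0. Building one table from raw tokens and another from stemmed tokens would mirror the distinction between TermLinks1 and TermLinks2.</p>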
        <p>The approaches used for the experiments in the following section are:
          <list list-type="bullet">
            <list-item><p>LMED-Cos-TL1: LM with Extended Dirichlet, CS, and TermLinks1</p></list-item>
            <list-item><p>LMED-Cos-TL2: LM with Extended Dirichlet, CS, and TermLinks2</p></list-item>
          </list>
        </p>
        <p>We submitted only two runs, both based on our proposed approach, since we
participated only in the English monolingual ad-hoc task. Table 1 shows the MAP
for the ad-hoc experiments. We achieved the highest MAP among the
participants in the English monolingual ad-hoc retrieval task. Moreover,
both of our runs (LMED-Cos-TL1 and LMED-Cos-TL2) outperform the rest
of the participants in the English multilingual ad-hoc retrieval task, except the
team from Chemnitz University of Technology, Germany.</p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>Semantic Query Enrichment</title>
      <p>In this part, we address the short queries of the CHiC collection, which do not carry
sufficient information to express their semantics. For example, consider the query "last
supper". A retrieval model will retrieve documents that contain these two words,
or one of them, without any attention to the meaning of this query in the
Christian religion. If this meaning is known, however, related terms such as
"Jesus", "crucifixion", "twelve apostles", and "Judas" can be
found, and the original query can be enriched with them.
The ability of an IRS to retrieve the documents relevant to this query can thus be
enhanced. Semantic query enrichment consists in finding and adding the terms that are
semantically related to a query. These added terms provide a semantic context
for the query, which the IRS uses to enhance its relevance estimation in
its retrieval task.</p>
      <p>
        Pseudo-Relevance Feedback is one of the most popular methods for finding
these enrichment terms, using the top k documents retrieved for the original query.
However, if the top retrieved documents for a given query contain only a few
relevant documents, the terms selected by Pseudo-Relevance Feedback
will not be strongly related to the original query and will introduce noise into
the enriched query. As a result, the relevance estimation for the enriched query
will be no better than for the original query [
        <xref ref-type="bibr" rid="ref1 ref2 ref4">4, 2, 1</xref>
        ].
      </p>
      <p>We present another method for selecting related terms for a given query,
using external knowledge. Many resources are available for this
task: ontologies, encyclopedias, lexical resources. In our task, we use Wikipedia
as external knowledge to achieve semantic query enrichment. In our case, a
query q is about one well-known thing: a person, a place,
an event, etc. Wikipedia is a freely available, large knowledge base that contains a huge
number of articles and links between them. First, we present the structure of
Wikipedia. Then, we present our semantic query enrichment approach, which is
based on this structure.</p>
      <sec id="sec-3-1">
        <title>Wikipedia Structure</title>
        <p>Wikipedia is a knowledge base which can be represented as a directed weighted
graph of articles. The basic entry in Wikipedia is an entity page, which is an
article that contains information focusing on one single entity. Furthermore, each
article is linked to other articles by a number of weighted links. These weights
represent how much the two entities are semantically related. An article points
to a collection of articles and is pointed to by a collection of other articles (Figure
1).</p>
      </sec>
      <sec id="sec-3-2">
        <title>Enrichment Steps</title>
        <p>As mentioned before, our semantic query enrichment uses Wikipedia as a
knowledge base. As seen in the previous section, Wikipedia is organized as a
directed weighted graph of articles. Each article is identified by its title, its links
in, and its links out. Using Wikipedia, any text can be mapped onto a collection of
articles. Relying on this structure, we present our semantic
query enrichment steps:</p>
        <list list-type="order">
          <list-item><p>Given a query, first find all the articles that correspond to this query in
Wikipedia; we call them the identified articles.</p></list-item>
          <list-item><p>Using the identified articles, we have different variants to enrich the original
query q:
            <list list-type="bullet">
              <list-item><p>Links in: the candidate articles for enriching the original query are, in this first
case, all the articles that point to at least one of the identified
articles.</p></list-item>
              <list-item><p>Links out: the candidate articles are, in this second
case, all the articles that are pointed to by at least one of the
identified articles.</p></list-item>
              <list-item><p>Mixed: the candidate articles are, in this last case,
the union of the articles from the first and second cases.</p></list-item>
            </list>
          </p></list-item>
          <list-item><p>Sort the candidate articles according to their relatedness to the identified
articles.</p></list-item>
          <list-item><p>Take the titles of the best k candidate articles and add them to the original
query.</p></list-item>
        </list>
        <p>
          For weighting these articles, we multiplied the relatedness values by
different values in [0, 1], namely (0, 0.1, 0.2, 0.3, ..., 1). The value
that provided the best precision enhancement was 0.3.
        </p>
        <p>Using these steps, we obtain the best k titles related to a given query, with their
weights. These titles are added to the query to obtain a long query. We claim
that this long query carries sufficient information to express the information need.
It is therefore expected to help the IRS enhance its relevance estimation, in
other words its precision.</p>
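        <p>The enrichment steps above can be sketched on a toy link graph as follows. This is a hedged illustration, not the WikipediaMiner API: the function name, the dictionary-based graph, the relatedness values and the article titles are all made up, candidates are scored by their relatedness to the closest identified article (an assumption), and identified articles are excluded from the candidates (also an assumption).</p>
        <preformat><![CDATA[
```python
def enrich_query(identified, links_out, links_in, relatedness,
                 variant="out", k=10, weight=0.3):
    """Return up to k (title, weight) pairs to add to the original query.

    links_out[a] lists the articles that a points to; links_in[a] lists the
    articles that point to a; relatedness[(a, b)] is a value in [0, 1].
    """
    identified = set(identified)
    if not identified:
        return []
    out_cands = {t for a in identified for t in links_out.get(a, ())}
    in_cands = {t for a in identified for t in links_in.get(a, ())}
    if variant == "out":
        candidates = out_cands
    elif variant == "in":
        candidates = in_cands
    elif variant == "mixed":
        candidates = out_cands | in_cands
    else:
        raise ValueError("variant must be 'in', 'out' or 'mixed'")
    candidates = candidates - identified

    def score(title):
        # Assumption: relatedness to the closest identified article.
        return max(relatedness.get((a, title), 0.0) for a in identified)

    # Most related first; ties are broken alphabetically for determinism.
    ranked = sorted(candidates, key=lambda t: (-score(t), t))
    return [(t, weight * score(t)) for t in ranked[:k]]


# Toy graph and relatedness values, invented for illustration only.
toy_out = {"Last Supper": ["Jesus", "Judas Iscariot", "Leonardo da Vinci"]}
toy_in = {"Last Supper": ["Santa Maria delle Grazie", "Jesus"]}
toy_rel = {
    ("Last Supper", "Jesus"): 0.8,
    ("Last Supper", "Judas Iscariot"): 0.7,
    ("Last Supper", "Leonardo da Vinci"): 0.6,
    ("Last Supper", "Santa Maria delle Grazie"): 0.9,
}
enriched_out = enrich_query(["Last Supper"], toy_out, toy_in, toy_rel,
                            variant="out", k=2)
enriched_mixed = enrich_query(["Last Supper"], toy_out, toy_in, toy_rel,
                              variant="mixed", k=3)
```
]]></preformat>
        <p>The default weight of 0.3 follows the value reported above as giving the best precision enhancement. The "mixed" variant here ranks the union of both candidate sets together; taking a fixed number of titles from each set separately would be a small change to the final selection.</p>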
      </sec>
      <sec id="sec-3-3">
        <title>Experiment and Result</title>
        <p>Experiments are done using WikipediaMiner<sup>2</sup>, an API for searching
and accessing Wikipedia content, that is, articles and their links.
WikipediaMiner is a toolkit for tapping the rich semantics encoded within Wikipedia.
It helps to integrate Wikipedia's knowledge into applications by 1) providing
simplified, object-oriented access to Wikipedia's structure and content, and 2)
measuring how terms and concepts in Wikipedia are semantically related to each
other.</p>
        <sec id="sec-3-3-1">
          <title>2 http://wikipedia-miner.cms.waikato.ac.nz/index.html</title>
          <p>We validate our approach on the CHiC 2013 English collection. For the query
enrichment task, we have 25 queries. These queries contain well-known entities
such as persons, events, etc. The task requires systems to present a ranked list of at
most 10 related terms for a query, to semantically enrich the topic and/or guess
the user's information need or original query intent. Related terms in our case
are extracted using WikipediaMiner.</p>
          <p>The evaluation metric for the semantic enrichment task is precision
(P@1, P@3, P@10), reported in Table 2. Precision at a given rank k measures whether the first
k enrichment terms for a given query are related to this query or not. Table 2
reports two runs. In the first run, the enrichment uses a mix of
links in and links out: we select the top 5 article titles from links in and
the top 5 article titles from links out. In the second run, we use the best 10
article titles from links out (best means most semantically related according to the
Wikipedia relatedness values between Wikipedia articles). Our second
run (MRIM_SE13_EN_WM_1) outperforms the other participants in
monolingual English enrichment in terms of P@1 and P@3, whereas it is slightly below
them in terms of P@10.</p>
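          <table-wrap>
            <label>Table 2</label>
            <table>
              <thead>
                <tr><th>Run Name</th><th>P@1</th><th>P@3</th><th>P@10</th></tr>
              </thead>
              <tbody>
                <tr><td>MRIM_SE13_EN_WM</td><td>0.2800</td><td>0.1333</td><td>0.1448</td></tr>
                <tr><td>MRIM_SE13_EN_WM_1</td><td>0.2800</td><td>0.1467</td><td>0.1598</td></tr>
              </tbody>
            </table>
          </table-wrap>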
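          <p>For reference, precision at rank k over a ranked list of enrichment terms can be computed as in the sketch below. The function names and the toy judgments are hypothetical, and the sketch assumes that the score is divided by k even when fewer than k terms are returned and that the per-query values are averaged over all queries; the official evaluation script may differ.</p>
          <preformat><![CDATA[
```python
def precision_at_k(ranked_terms, related_terms, k):
    """Fraction of the first k enrichment terms judged related to the query."""
    if k < 1:
        raise ValueError("k must be at least 1")
    hits = sum(1 for term in ranked_terms[:k] if term in related_terms)
    return hits / k


def mean_precision_at_k(runs, k):
    """Average P@k over (ranked_terms, related_terms) pairs, one per query."""
    return sum(precision_at_k(r, rel, k) for r, rel in runs) / len(runs)


# Toy judgments, invented for illustration only.
toy_ranked = ["Jesus", "Milan", "Judas", "fresco"]
toy_related = {"Jesus", "Judas"}
p_at_1 = precision_at_k(toy_ranked, toy_related, 1)
p_at_3 = precision_at_k(toy_ranked, toy_related, 3)
p_at_10 = precision_at_k(toy_ranked, toy_related, 10)
```
]]></preformat>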
        </sec>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>Conclusion</title>
      <p>For the ad-hoc retrieval task, our two runs
(LMED-Cos-TL1 and LMED-Cos-TL2) achieved almost the same MAP. Since the gap
between these two results is very small, we conclude that applying the Porter
stemming method to the Term Links makes little difference. In the semantic
enrichment task, our results show that using
links out is better than using the mix of links in and links out.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <given-names>Eneko</given-names>
            <surname>Agirre</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Paul D.</given-names>
            <surname>Clough</surname>
          </string-name>
          , Samuel Fernando, Mark Hall, Arantxa Otegi, and
          <string-name>
            <given-names>Mark</given-names>
            <surname>Stevenson</surname>
          </string-name>
          .
          <article-title>The Sheffield and Basque Country universities entry to CHiC: Using random walks and similarity to access cultural heritage</article-title>
          .
          <source>In CLEF (Online Working Notes/Labs/Workshop)</source>
          ,
          <year>2012</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <given-names>Mitra</given-names>
            <surname>Akasereh</surname>
          </string-name>
          , Nada Naji, and
          <string-name>
            <given-names>Jacques</given-names>
            <surname>Savoy</surname>
          </string-name>
          . Unine at clef
          <year>2012</year>
          . In CLEF (Online Working Notes/Labs/Workshop),
          <year>2012</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Jean-Pierre Chevallet</surname>
          </string-name>
          .
          <article-title>X-iota: An open xml framework for ir experimentation</article-title>
          .
          <source>In SungHyon Myaeng</source>
          , Ming Zhou,
          <string-name>
            <surname>Kam-Fai Wong</surname>
          </string-name>
          , and Hong-Jiang Zhang, editors,
          <source>Information Retrieval Technology</source>
          , volume
          <volume>3411</volume>
          of Lecture Notes in Computer Science, pages
          <volume>263</volume>
          –
          <fpage>280</fpage>
          . Springer Berlin Heidelberg,
          <year>2005</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <given-names>Jinxi</given-names>
            <surname>Xu</surname>
          </string-name>
          and
          <string-name>
            <given-names>W. Bruce</given-names>
            <surname>Croft</surname>
          </string-name>
          .
          <article-title>Query expansion using local and global document analysis</article-title>
          .
          <source>In Proceedings of the 19th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval</source>
          , pages
          <volume>4</volume>
          –
          <fpage>11</fpage>
          ,
          <year>1996</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>