<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Comparison of Several Word Embedding Sources for Medical Information Retrieval</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Julie Budaher</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Mohannad Almasri</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Lorraine Goeuriot</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Laboratoire d'Informatique de Grenoble, Université Grenoble Alpes</institution>
        </aff>
      </contrib-group>
      <abstract>
        <p>This paper describes the participation of the MRIM team in Task 3: Patient-Centered Information Retrieval - IRTask 1: Ad-hoc Search of the CLEF eHealth Evaluation lab 2016. The aim of this task is to evaluate the effectiveness of information retrieval systems when searching for health content on the web. Our submission investigates the effectiveness of word embedding for query expansion in the health domain. We experiment with two variants of a query expansion method using word embedding. Our first run is a baseline system with default stopping and stemming. The other two runs expand the queries using two different word embedding sources. All three runs are conducted on the Terrier platform using the Dirichlet language model. Keywords: Query expansion, Word embedding, Language model</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>The goal of the eHealth evaluation lab is to evaluate information retrieval
systems that help people understand their health information [2]. In this paper we
describe our participation in Task 3: Patient-Centered Information
Retrieval, which aims to evaluate the effectiveness of information retrieval
systems when searching for health content on the web, with the objective of fostering
research and development of search engines tailored to health information
seeking [4]. This task is divided into three sub-tasks: ad-hoc search, which extends
the evaluation framework used in 2015 (which considered, along with topical
relevance, the readability of the retrieved documents) to consider further
dimensions of relevance such as the reliability of the retrieved information; query
variation, which explores query variations for an information need; and
multilingual search, which offers parallel queries in several languages (Czech, French,
Hungarian, German, Polish and Swedish) [3].</p>
      <p>Our team MRIM submitted three runs, in order to investigate the
following research questions:
1. Is the word embedding approach for query expansion effective for consumer
health search?
2. What influence does the word embedding source have on the results?</p>
      <p>This paper is organized as follows. In Section 2, we describe our approach,
and in Section 3, we present our conclusion and future work.</p>
      <p>2 Approach: Using Word Embedding for Query Expansion</p>
      <p>2.1 Dataset</p>
      <p>
The dataset contains a document collection, a set of topics, and relevance
judgments. The document collection is ClueWeb12-B13 1, created by the Lemur Project
to support research on information retrieval and related human language
technologies. It contains more than 50 million documents on varied
topics. The collection was made available to participants via the Microsoft
Azure platform, along with standard indexes built with the Terrier and
Indri tools.</p>
      <p>The topics provided explore real health consumer cases, extracted from posts
on health web forums. The posts were extracted from the 'askDocs' forum of
Reddit and presented to query generators, who had to create queries based on
what they read and thought would be queried by the post's author. Different query
creators generated different queries for the same post, creating variations of the
same information need (forum post).</p>
      <p>For IRTask 1, participants had to treat each query individually, submitting
the returned documents for each query. Example queries follow:
&lt;queries&gt;
&lt;query&gt;
&lt;id&gt;900001&lt;/id&gt;
&lt;title&gt;medicine for nasal drip&lt;/title&gt;
&lt;/query&gt;
&lt;query&gt;
&lt;id&gt;900002&lt;/id&gt;
&lt;title&gt;bottle neck and nasal drip medicine&lt;/title&gt;
&lt;/query&gt;
....</p>
      <p>&lt;query&gt;
&lt;id&gt;904001&lt;/id&gt;
&lt;title&gt;omeprazole side effect&lt;/title&gt;
&lt;/query&gt;
....
&lt;/queries&gt;</p>
      <p>Example queries were provided, and a final set of 300 queries was distributed
for the runs.
1 http://lemurproject.org/clueweb12/specs.php</p>
      <p>2.2 Runs</p>
      <p>Three runs were submitted: the mandatory baseline run and two other runs with
query expansion based on word embedding, each using a different training set. We
used Terrier for indexing and retrieval on the Azure platform.</p>
      <p>Run1 - Baseline In this run, we apply the Dirichlet language model with the
default Mu value (2500) to the 300 queries. Stop-words are removed using the default
Terrier TermPipeline interface. We use PorterStemmer for stemming, as used in the
document index. This model is the simplest approximation and is the baseline for
comparing more complex models.</p>
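      <p>For reference, the Dirichlet language model scores each query term by smoothing its document frequency with the collection model, weighted by the parameter mu (2500 by default in Terrier):</p>

```latex
p(w \mid d) = \frac{\mathrm{tf}(w, d) + \mu \, p(w \mid C)}{|d| + \mu}
```

      <p>where tf(w, d) is the frequency of term w in document d, |d| is the document length, and p(w | C) is the probability of w in the whole collection; larger values of mu shift more weight towards the collection model.</p>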
      <p>Run2 and Run3 The other two runs use a query expansion method
based on word embedding. Term embeddings are learned using neural networks:
each term is represented by a real-valued vector, and the resulting vectors carry
relationships between terms. The similarity between two terms is measured by
the normalized cosine between their two vectors. The resulting expanded query
is the union of the original query terms and the expanded terms, which are the
k most similar terms to those of the query.</p>
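      <p>As a minimal sketch of this expansion scheme (the embedding table below is a toy example; the terms and vector values are invented for illustration, not taken from our trained models), the expanded query is the union of the original terms with the k nearest vocabulary terms under cosine similarity:</p>

```python
import math

# Toy embedding table: term -> real-valued vector.
# In our runs these vectors come from word2vec training; the
# values here are illustrative only.
EMB = {
    "nasal":    [0.9, 0.1, 0.0],
    "drip":     [0.8, 0.2, 0.1],
    "rhinitis": [0.85, 0.15, 0.05],
    "mucus":    [0.7, 0.3, 0.2],
    "bottle":   [0.0, 0.9, 0.4],
}

def cosine(u, v):
    """Normalized cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def expand_query(query_terms, k=2):
    """Union of the original query terms and, for each of them,
    the k most similar vocabulary terms."""
    expansion = set(query_terms)
    for q in query_terms:
        if q not in EMB:
            continue  # out-of-vocabulary terms are kept unexpanded
        scored = sorted(
            ((cosine(EMB[q], EMB[t]), t) for t in EMB if t != q),
            reverse=True,
        )
        expansion.update(t for _, t in scored[:k])
    return expansion
```

      <p>For instance, expanding the single-term query "nasal" with k=2 adds its two nearest neighbours in the toy table while leaving unrelated terms out.</p>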
      <p>The word2vec tool is used to learn the term vectors. It takes a large
corpus of text as input and produces the term vectors as output.</p>
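      <p>For reference, training with the original word2vec tool takes an invocation of this form (the flags are those of the word2vec command-line tool; the parameter values shown are illustrative, not the exact settings of our runs):</p>

```shell
# Learn term vectors from a plain-text corpus (one document per line);
# -size sets the vector dimensionality, -window the context width.
./word2vec -train corpus.txt -output vectors.bin \
    -size 200 -window 5 -min-count 5 -cbow 1 -binary 1
```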
      <p>
        The training corpus of Run2 is built using three different CLEF medical
collections: Image2009, Case2011 and Case2012. The training corpus consists of
about 400 million words. The vocabulary size for this training corpus is about
350,000 different terms [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ].
      </p>
      <p>The training corpus of the third run is built using the CLEF eHealth 2014 medical
collection. This training corpus consists of 1,056,629,741 words, and its vocabulary
size is 1,210,867 terms.</p>
      <p>3 Conclusion</p>
      <p>We have described our participation in CLEF eHealth 2016, Task 3. Our
purpose was to investigate the effectiveness of word embedding for query
expansion on consumer health search, as well as the effect of the learning resource
on the results. Our system was based on Terrier with the Dirichlet
language model. We applied query expansion with word embeddings learned from two
different training sets.</p>
      <p>References
2. Liadh Kelly, Lorraine Goeuriot, Hanna Suominen, Aurélie Névéol, João Palotti, and
Guido Zuccon. Overview of the CLEF eHealth evaluation lab 2016. In CLEF 2016
- 7th Conference and Labs of the Evaluation Forum, Lecture Notes in Computer
Science (LNCS). Springer, 2016.
3. João Palotti, Guido Zuccon, Lorraine Goeuriot, Liadh Kelly, Allan Hanbury, Gareth
Jones, Mihai Lupu, and Pavel Pecina. CLEF eHealth evaluation lab 2015, task 2:
Retrieving information about medical symptoms. In CLEF 2015 Working Notes, 2015.
4. Guido Zuccon, João Palotti, Lorraine Goeuriot, Liadh Kelly, Mihai Lupu, Pavel
Pecina, Henning Müller, Julie Budaher, and Anthony Deacon. The IR task at the
CLEF eHealth evaluation lab 2016: User-centred health information retrieval. In CLEF
2016 Evaluation Labs and Workshop: Online Working Notes, 2016.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <given-names>Mohannad</given-names>
            <surname>Almasri</surname>
          </string-name>
          , Catherine Berrut, and
          <string-name>
            <given-names>Jean-Pierre</given-names>
            <surname>Chevallet</surname>
          </string-name>
          .
          <article-title>A comparison of deep learning based query expansion with pseudo-relevance feedback and mutual information</article-title>
          . In
          <source>Advances in Information Retrieval</source>
          , pages
          <fpage>709</fpage>
          -
          <lpage>715</lpage>
          . Springer,
          <year>2016</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>