<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>UEvora at CLEF eHealth 2017 Task 3</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Hua Yang</string-name>
          <email>huayangchn@gmail.com</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Teresa Gonçalves</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Computer Science Department, University of Évora, Évora</institution>
          ,
          <country country="PT">Portugal</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>This paper describes the methods we used in our participation in CLEF eHealth 2017 Task 3, IRTask 1: ad-hoc search. The task aims at retrieving information relevant to people seeking health advice on the web. We present our work on query reformulation. We use cTAKES, a clinical natural language processing system, to identify UMLS concepts in the original query. Query expansion is then applied to the identified medical concepts, based either on the UMLS Metathesaurus or on a word2vec model trained on domain data. We also use other techniques, such as increasing the weight of the terms that are considered to capture the user's need better than the other terms.</p>
      </abstract>
      <kwd-group>
        <kwd>UMLS</kwd>
        <kwd>word2vec</kwd>
        <kwd>query reformulation</kwd>
        <kwd>query expansion</kwd>
        <kwd>cTAKES</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>
        The CLEF eHealth 2017 information retrieval (IR) Task 3 is a continuation of the
previous tasks that ran in 2013, 2014, 2015, and 2016. It embraces the
TREC-style evaluation process, with a shared collection of documents and queries, the
contribution of runs from participants, and the subsequent formation of relevance
assessments and evaluation of the participants' submissions [
        <xref ref-type="bibr" rid="ref5 ref6">5, 6</xref>
        ].
      </p>
      <p>CLEF eHealth 2017 Task 3 includes four subtasks this year. Our team
participates in Task 3 IRTask 1, which is a standard ad-hoc search task aiming at
retrieving information relevant to people seeking health advice on
the web.</p>
      <p>Data corpus. ClueWeb12-B13 is used as the corpus of CLEF eHealth 2017
Task 3. We use the indexes provided by the organizers on Microsoft Azure,
which are available in Terrier and Indri formats.</p>
      <p>
        Queries. All the queries used in the task are extracted from public health
web forums where users were seeking advice about specific symptoms,
diagnoses, conditions or treatments [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ]. The queries are considered to be real health
information needs expressed by the general public. For each forum post, a set of
6 query variants is generated, representing different ways to express the same
information need. A total of 300 queries are created for the task.
      </p>
      <p>Evaluation. The evaluation measures for IRTask 1 are NDCG@10, BPref and
RBP.</p>
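      <p>As an illustration of the first of these measures, the following minimal Python sketch computes NDCG@k for one ranked list of graded relevance labels. It assumes a linear gain and a log2(rank + 1) discount, which is one common formulation; the exact variant used by the task's evaluation tooling is not stated here, and the function names are hypothetical.</p>
      <preformat>
```python
import math


def dcg_at_k(rels, k=10):
    """Discounted cumulative gain over the top-k graded relevance labels."""
    # Assumption: linear gain, discount log2(rank + 1), ranks starting at 1.
    return sum(rel / math.log2(rank + 1) for rank, rel in enumerate(rels[:k], start=1))


def ndcg_at_k(ranked_rels, judged_rels, k=10):
    """DCG of the ranking divided by the DCG of the ideal ordering of all judged labels."""
    ideal = dcg_at_k(sorted(judged_rels, reverse=True), k)
    return dcg_at_k(ranked_rels, k) / ideal if ideal > 0 else 0.0
```
      </preformat>
      <p>Here ranked_rels holds the relevance labels of the returned documents in rank order, and judged_rels holds the labels of all judged documents for the query.</p>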
      <p>The rest of this paper is organized as follows. The methods we used for
participating in the task are presented in Section 2. The experiments and submission
runs are described in Section 3.</p>
    </sec>
    <sec id="sec-2">
      <title>Methods</title>
      <p>In this work, we use query reformulation techniques to reformulate the original
queries. Figure 1 illustrates the framework we used.</p>
      <p>
        We first use natural language processing tools to identify the medical
concepts in the original query. For each identified medical concept, we then use query
expansion techniques to find related terms or synonyms that share the same
concept. The expanded queries are then issued to the retrieval platform. Using a
weighting model in the IR platform, a ranked list of documents is returned.
Complex semantic relationships exist in health articles, such as term dependency
and vocabulary mismatch [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. Natural language processing tools for the
clinical domain can extract concepts from free text and normalise them with respect
to a gold-standard ontology, which alleviates vocabulary mismatch [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ].
      </p>
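      <p>The expansion step can be sketched as follows. This is a minimal illustration rather than our actual implementation: the two lookup tables stand in for concept identification and for a thesaurus lookup, the concept identifiers are invented placeholders, and the function name is hypothetical.</p>
      <preformat>
```python
# Toy stand-in for a thesaurus lookup: the identifiers below are invented placeholders.
TERM_TO_CONCEPT = {"heart attack": "CONCEPT_A", "headache": "CONCEPT_B"}
CONCEPT_TO_TERMS = {
    "CONCEPT_A": ["heart attack", "myocardial infarction"],
    "CONCEPT_B": ["headache", "cephalalgia"],
}


def expand_query(query, term_to_concept=TERM_TO_CONCEPT, concept_to_terms=CONCEPT_TO_TERMS):
    """Append every term that shares a concept with a phrase found in the query."""
    text = query.lower()
    expanded = [text]
    for phrase, concept in term_to_concept.items():
        # Naive substring match, standing in for real concept identification.
        if phrase in text:
            expanded.extend(t for t in concept_to_terms[concept] if t != phrase)
    return " ".join(expanded)
```
      </preformat>
      <p>With these toy tables, the query "Heart attack symptoms" becomes "heart attack symptoms myocardial infarction", while a query without a known concept is returned unchanged (in lower case).</p>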
      <p>
        In our system, we use a clinical NLP tool to identify the medical concepts in
the original queries. We regard the identified medical concepts
as important information reflecting the users’ needs, so we increase the weight of
these terms or phrases. We call the result the reformed query.
The UMLS Metathesaurus or word2vec models trained on domain data are used for
query expansion. Pseudo-relevance feedback is also
used for automatic expansion. We call a query expanded with UMLS or
word2vec models an expanded query. We first use cTAKES<sup>1</sup> to
identify medical concepts. The terms identified as ‘anatomy’ or ‘disorder’ are
expanded using UMLS: we include all the terms with the same Concept Unique Identifier (CUI).
We also use word embeddings to find the two terms in the
original query that are nearest to each other. These two terms are regarded as a loose phrase, which is added to
the original query.
Term dependency is a characteristic of health articles. For example, “inguinal
hernia” means a hernia that occurs in the inguinal region, not in other parts of the
body. In our work, we treat such a phrasal medical concept as an integral unit and
implement phrase search in our system [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ].
      </p>
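      <p>Finding the two query terms that are nearest to each other can be sketched as below. The sketch assumes that "nearest" means highest cosine similarity between embedding vectors; the function names and the two-dimensional toy vectors are invented for illustration and are not taken from our trained model.</p>
      <preformat>
```python
import math
from itertools import combinations


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def closest_pair(terms, vectors):
    """Return the two distinct query terms with the most similar embeddings, or None."""
    known = [t for t in dict.fromkeys(terms) if t in vectors]  # drop repeats and unknown terms
    pairs = list(combinations(known, 2))
    if not pairs:
        return None
    return max(pairs, key=lambda p: cosine(vectors[p[0]], vectors[p[1]]))


# Invented 2-d vectors, for illustration only.
TOY_VECTORS = {"inguinal": [0.9, 0.1], "hernia": [0.8, 0.2], "surgery": [0.1, 0.9]}
```
      </preformat>
      <p>For the terms "inguinal", "hernia" and "surgery" with these toy vectors, closest_pair returns the pair ("inguinal", "hernia"), which would then be treated as a loose phrase.</p>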
    </sec>
    <sec id="sec-3">
      <title>Experiments</title>
      <sec id="sec-3-1">
        <title>Terrier</title>
        <p>In this section, we introduce the platforms and models that are used in our work
and then we describe the submission runs for the task.</p>
        <p>The Terrier<sup>2</sup> retrieval platform version 4.17 was used as the search engine. Terrier
is described as “a highly flexible, efficient, and effective open source search
engine, readily deployable on large-scale collections of documents”. It
implements state-of-the-art indexing and retrieval functionalities, and provides an
ideal platform for the rapid development and evaluation of large-scale retrieval
applications. In our experiments, we use BM25 as the retrieval model, with all
parameters set to their default values.</p>
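        <p>For readers unfamiliar with BM25, the following sketch shows a textbook form of the scoring function. It is not Terrier's code, and Terrier's implementation may differ in details; the idf variant, the values k1 = 1.2 and b = 0.75, and all function names are assumptions made for illustration.</p>
        <preformat>
```python
import math


def bm25_term_score(tf, doc_len, avg_doc_len, n_docs, doc_freq, k1=1.2, b=0.75):
    """Textbook BM25 contribution of one query term to one document's score."""
    # Assumption: a non-negative idf variant; k1 and b are commonly quoted values,
    # not values confirmed for the runs described in this paper.
    idf = math.log((n_docs - doc_freq + 0.5) / (doc_freq + 0.5) + 1.0)
    length_norm = 1.0 - b + b * doc_len / avg_doc_len
    return idf * tf * (k1 + 1.0) / (tf + k1 * length_norm)


def bm25_score(query_terms, doc_tf, doc_len, avg_doc_len, n_docs, doc_freqs):
    """Sum the per-term contributions of the query terms that occur in the document."""
    return sum(
        bm25_term_score(doc_tf[t], doc_len, avg_doc_len, n_docs, doc_freqs[t])
        for t in query_terms
        if doc_tf.get(t, 0) > 0
    )
```
        </preformat>
        <p>The score grows with the term frequency but saturates, and documents longer than average are penalised through the length normalisation controlled by b.</p>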
        <p>
          cTAKES. In our work, we use cTAKES to identify the medical concepts in the query.
Apache cTAKES is an open-source natural language processing system for
extracting information from the clinical free text of electronic medical records [
          <xref ref-type="bibr" rid="ref2">2</xref>
          ]. It
includes the following components:
          <list list-type="bullet">
            <list-item><p>Sentence boundary detector</p></list-item>
            <list-item><p>Tokenizer</p></list-item>
            <list-item><p>Normalizer</p></list-item>
            <list-item><p>Part-of-speech (POS) tagger</p></list-item>
            <list-item><p>Shallow parser</p></list-item>
            <list-item><p>Named entity recognition (NER) annotator, including status and negation annotators</p></list-item>
          </list>
        </p>
        <p>1 http://ctakes.apache.org/index.html</p>
        <p>2 http://terrier.org/</p>
      </sec>
      <sec id="sec-3-2">
        <title>Word2vec models</title>
        <p>
          In our work, we produce word embeddings using word2vec algorithms [
          <xref ref-type="bibr" rid="ref4">4</xref>
          ]. Word2vec
uses shallow, two-layer neural networks and includes two model architectures for
learning distributed representations of words: the Continuous Bag-of-Words model
(CBOW) and the Continuous Skip-gram model (Skip-gram). We used the CBOW
model with a context window of size five and word vectors of size 100 in our
experiments. We trained on a snapshot of the PMC Open
Access Subset<sup>3</sup> taken on 16 February 2017; the resulting word embeddings cover 25,140,380 word
types.
        </p>
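        <p>To make the role of the context window concrete, the sketch below generates the (context, target) training examples that a CBOW model learns from, using a window of five as in our experiments. It is a simplified illustration with a hypothetical function name; library implementations add further details, such as sub-sampling of frequent words, that are omitted here.</p>
        <preformat>
```python
def cbow_examples(tokens, window=5):
    """Yield (context_words, target_word) pairs of the kind used to train a CBOW model."""
    for i, target in enumerate(tokens):
        left = tokens[max(0, i - window):i]      # up to `window` words before the target
        right = tokens[i + 1:i + 1 + window]     # up to `window` words after the target
        context = left + right
        if context:  # a one-word sentence has no context to learn from
            yield context, target
```
        </preformat>
        <p>CBOW is trained to predict each target word from its context words, whereas Skip-gram reverses the direction and predicts the context words from the target.</p>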
      </sec>
      <sec id="sec-3-3">
        <title>Runs</title>
        <p>We submit 5 runs for IRTask 1. For all runs, stop words are removed and
the Porter stemmer is used for word stemming. We use BM25 as the weighting
model, with the parameters set to their default values in Terrier.</p>
        <p>UEvora_EN_Run1: We use cTAKES to identify the medical concepts. The
medical concept identified as a phrase replaces the single terms in the original
query. Meanwhile, we expand the concepts with UMLS synonyms and increase
their weight.</p>
        <p>UEvora_EN_Run2: Based on Run1; for the terms that are not identified by
cTAKES, we use our trained word2vec model to do the expansion.</p>
        <p>UEvora_EN_Run3: For identified medical concept terms, we expand them
with UMLS synonyms and increase their weight.</p>
        <p>UEvora_EN_Run4: The medical concepts are identified with cTAKES. The
concept identified as a phrase replaces the single terms in the original query.</p>
        <p>UEvora_EN_Run5: For identified medical concept terms, we expand them
with our trained word2vec model and increase their weight.</p>
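        <p>The differences between the five runs can be summarised as a small configuration table. The sketch below is only a reading aid: the class, the field names and the helper function are hypothetical, and a flag is set to False wherever the run description above does not mention the corresponding step.</p>
        <preformat>
```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class RunConfig:
    phrase_concepts: bool = False   # concept phrase replaces the single terms
    umls_expansion: bool = False    # add UMLS synonyms of identified concepts
    w2v_concepts: bool = False      # expand identified concepts with word2vec
    w2v_other_terms: bool = False   # expand terms not identified by cTAKES with word2vec
    boost: bool = False             # increase the weight of concept terms


RUNS = {
    "UEvora_EN_Run1": RunConfig(phrase_concepts=True, umls_expansion=True, boost=True),
    "UEvora_EN_Run2": RunConfig(phrase_concepts=True, umls_expansion=True,
                                w2v_other_terms=True, boost=True),
    "UEvora_EN_Run3": RunConfig(umls_expansion=True, boost=True),
    "UEvora_EN_Run4": RunConfig(phrase_concepts=True),
    "UEvora_EN_Run5": RunConfig(w2v_concepts=True, boost=True),
}


def enabled_steps(run_name):
    """List the reformulation steps switched on for the given run."""
    cfg = RUNS[run_name]
    return [f.name for f in fields(cfg) if getattr(cfg, f.name)]
```
        </preformat>
        <p>For example, enabled_steps("UEvora_EN_Run4") returns only the phrase step, which makes Run4 the natural baseline for judging the effect of expansion and weighting in the other runs.</p>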
      </sec>
    </sec>
    <sec id="sec-4">
      <title>Acknowledgement</title>
      <p>This work was supported by EACEA under the Erasmus Mundus Action 2,
Strand 1 project LEADER - Links in Europe and Asia for engineering,
eDucation, Enterprise and Research exchanges.</p>
      <p>3 https://www.ncbi.nlm.nih.gov/pmc/tools/openftlist/</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name><surname>Cogley</surname>, <given-names>James</given-names></string-name>.
          <article-title>Applying natural language processing to clinical information retrieval</article-title>
          (<year>2014</year>)
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name><surname>Savova</surname>, <given-names>Guergana K.</given-names></string-name>,
          <string-name><given-names>James J.</given-names> <surname>Masanz</surname></string-name>,
          <string-name><given-names>Philip V.</given-names> <surname>Ogren</surname></string-name>,
          <string-name><given-names>Jiaping</given-names> <surname>Zheng</surname></string-name>,
          <string-name><given-names>Sunghwan</given-names> <surname>Sohn</surname></string-name>,
          <string-name><given-names>Karin C.</given-names> <surname>Kipper-Schuler</surname></string-name>,
          and <string-name><given-names>Christopher G.</given-names> <surname>Chute</surname></string-name>.
          <article-title>Mayo clinical Text Analysis and Knowledge Extraction System (cTAKES): architecture, component evaluation and applications</article-title>.
          <source>Journal of the American Medical Informatics Association</source>
          <volume>17</volume>, no. <issue>5</issue>
          (<year>2010</year>)
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name><surname>Manning</surname>, <given-names>Christopher D.</given-names></string-name>,
          <string-name><given-names>Prabhakar</given-names> <surname>Raghavan</surname></string-name>,
          and <string-name><given-names>Hinrich</given-names> <surname>Schütze</surname></string-name>.
          <article-title>Introduction to information retrieval</article-title>.
          Vol. <volume>1</volume>, no. <issue>1</issue>.
          Cambridge: Cambridge University Press,
          <year>2008</year>
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name><surname>Mikolov</surname>, <given-names>Tomas</given-names></string-name>,
          <string-name><given-names>Kai</given-names> <surname>Chen</surname></string-name>,
          <string-name><given-names>Greg</given-names> <surname>Corrado</surname></string-name>,
          and <string-name><given-names>Jeffrey</given-names> <surname>Dean</surname></string-name>.
          <article-title>Efficient estimation of word representations in vector space</article-title>.
          <source>arXiv preprint arXiv:1301.3781</source>
          (<year>2013</year>)
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
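          5.
          <string-name><given-names>Lorraine</given-names> <surname>Goeuriot</surname></string-name>,
          <string-name><given-names>Liadh</given-names> <surname>Kelly</surname></string-name>,
          <string-name><given-names>Hanna</given-names> <surname>Suominen</surname></string-name>,
          <string-name><given-names>Aurélie</given-names> <surname>Névéol</surname></string-name>,
          <string-name><given-names>Aude</given-names> <surname>Robert</surname></string-name>,
          <string-name><given-names>Evangelos</given-names> <surname>Kanoulas</surname></string-name>,
          <string-name><given-names>Rene</given-names> <surname>Spijker</surname></string-name>,
          <string-name><given-names>João</given-names> <surname>Palotti</surname></string-name>,
          and <string-name><given-names>Guido</given-names> <surname>Zuccon</surname></string-name>.
          <article-title>CLEF 2017 eHealth Evaluation Lab Overview</article-title>.
          <source>CLEF 2017 - 8th Conference and Labs of the Evaluation Forum, Lecture Notes in Computer Science (LNCS)</source>
          , Springer, September,
          <year>2017</year>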
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name><surname>Palotti</surname>, <given-names>J.</given-names></string-name>,
          <string-name><surname>Zuccon</surname>, <given-names>G.</given-names></string-name>,
          <string-name><surname>Jimmy</surname></string-name>,
          <string-name><surname>Pecina</surname>, <given-names>P.</given-names></string-name>,
          <string-name><surname>Lupu</surname>, <given-names>M.</given-names></string-name>,
          <string-name><surname>Goeuriot</surname>, <given-names>L.</given-names></string-name>,
          <string-name><surname>Kelly</surname>, <given-names>L.</given-names></string-name>,
          <string-name><surname>Hanbury</surname>, <given-names>A.</given-names></string-name>:
          <article-title>CLEF 2017 Task Overview: The IR Task at the eHealth Evaluation Lab</article-title>
          . In:
          <source>Working Notes of Conference and Labs of the Evaluation (CLEF) Forum</source>
          . CEUR Workshop Proceedings (
          <year>2017</year>
          )
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>