LIG-Health at Adhoc and Spoken IR Consumer Health Search: expanding queries using UMLS and FastText

Philippe Mulhem, Gabriela Gonzalez Saez, Aidan Mannion, Didier Schwab, and Jibril Frej
Univ. Grenoble Alpes, CNRS, Grenoble INP (Institute of Engineering Univ. Grenoble Alpes), LIG, 38000 Grenoble, France

Abstract. This paper describes the work done by the LIG of Grenoble for the Adhoc and Spoken Consumer Health Search subtasks. Our focus for this participation is to study the effectiveness of simple query expansions for health-related retrieval. We experimented with several query expansions, using knowledge-based or embedding-based techniques, with and without weighting of the expansion terms, and with and without Pseudo Relevance Feedback. The results obtained for Adhoc queries show that our baseline run outperforms the proposed query expansions. The results obtained for spoken queries show that different speakers lead to very different results, and that merging the results from several users improves the quality of the system.

Keywords: Query Expansion, UMLS, FastText, Query fusion

Copyright © 2020 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CLEF 2020, 22-25 September 2020, Thessaloniki, Greece.

1 Introduction

This paper describes the experiments carried out by the LIG-Health team for the CLEF 2020 evaluation campaign [7]. We participated in the Consumer Health Search task of CLEF eHealth 2020 [12], and more specifically in the Adhoc subtask and the Spoken Queries subtask [6]. The people involved in these experiments are members of the Information Retrieval group (MRIM) and the Natural Language Processing group (GETALP) of the Laboratoire d'Informatique de Grenoble (http://liglab.imag.fr).

Our work targeted the two subtasks proposed: adhoc and spoken queries. For both subtasks, we explored two query expansion methods: one knowledge-supported, using the UMLS Metathesaurus [2], and one embedding-based, using FastText [3]. Binary and weighted expansions were produced in both cases. For the retrieval stage, we considered both "Straight" (SR) and Relevance Feedback (RF) processing. We study how such simple processes may be adapted to both textual and spoken queries. In the case of spoken queries, query expansion may be questionable because of the possible errors introduced by the speech-to-text step. In all cases, we used the assessments of CLEF eHealth 2018 to select our submissions.

Fig. 1. Overview of Adhoc LIG-Health runs.

We tackled the spoken queries by considering all the transcriptions provided, and by applying the two expansions and the two retrieval settings described above, so as to determine the best configurations to submit. For the fusion of runs, we considered a simple fusion of result lists.

The remainder of the paper is organized as follows. Section 2 describes in detail the two expansion approaches used, before we present our proposal in Section 3. Section 4 focuses on the features and parameters of the information retrieval system used, and Section 5 describes the submitted runs. The official results are presented in Section 6. We discuss the results in Section 7 before concluding in Section 8.

2 Expansion Approaches

2.1 FastText-based

This first expansion relies on FastText [3, 10]. FastText provides a framework to learn and manage word embeddings. It is able to consider subwords (using character n-grams), as opposed to more classical embedding models such as Word2Vec [11], which create embeddings only for whole-word tokens.
The FastText embedding vector of a word is the sum of the vectors of its component n-grams.

We used the pre-trained word vectors for English, trained on Common Crawl and Wikipedia with FastText. The features of the model used are as follows:
– Continuous bag-of-words (CBOW) with positional weighting
– Vector embeddings of dimension d = 300
– Character n-grams of length 5
– Context window of size 5
– Sampling of 10 negative examples per positive example

Using these embeddings, we expand each query with the terms whose cosine similarity to the original query terms is greater than an experimentally determined threshold t. Denoting the cosine similarity function as $FT_{cos}$, for each term $w$ of the preprocessed query we compute its FastText embedding vector $f(w)$ and then add to the query every term $w'$ for which $FT_{cos}(f(w), f(w')) \geq t$.

2.2 UMLS-based

The second expansion strategy used in this work relies on the Unified Medical Language System (UMLS) Metathesaurus [2], a comprehensive biomedical thesaurus incorporating a network of semantically related concepts that links a large number of medical language resources. Among the many information sources of the UMLS Metathesaurus, we restricted our expansion search to one that is specifically designed to deal with consumer-level medical vocabulary: the Open Access and Collaborative Consumer Health Vocabulary (https://www.nlm.nih.gov/research/umls/sourcereleasedocs/current/CHV/index.html), known as the CHV, which contains more than 88,000 synonyms for more than 57,000 concepts. The CHV is used to get the synonyms of the query terms, and in the following we denote the function mapping a term to its CHV synonyms as $CHV_{syn}$. As the synonyms were often too general or too numerous in initial experiments, we introduced an additional filtering step based on the FastText similarity $FT_{cos}$. Given that the goal of the UMLS-based expansion is to find expansion terms that are semantically rather than syntactically related to the query terms, a CHV synonym was only included in the expanded query if the cosine similarity between its FastText embedding and that of the original query term it was associated with was less than 0.6.
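To make the expansion mechanics concrete, the sketch below illustrates the FastText-based candidate selection. It is a minimal illustration, not the exact pipeline used for our runs: the gensim loader, the file name cc.en.300.bin, the topn cut-off and the example query are assumptions introduced here for clarity.

```python
# Minimal sketch of the FastText-based expansion of Section 2.1.
# Assumes gensim to load the pre-trained Common Crawl vectors (cc.en.300.bin);
# topn bounds the neighbourhood scanned for each query term.
from gensim.models.fasttext import load_facebook_vectors

kv = load_facebook_vectors("cc.en.300.bin")  # FastText vectors with subword information

def ft_expand(query_terms, t, topn=50):
    """Binary FastText expansion: add every vocabulary neighbour whose
    cosine similarity with a query term is at least the threshold t."""
    expansion = {}
    for q in query_terms:
        for word, sim in kv.most_similar(q, topn=topn):  # (neighbour, cosine similarity)
            if sim >= t and word not in query_terms:
                # keeping sim as a weight would give the weighted variant of Section 3.1
                expansion[word] = max(sim, expansion.get(word, 0.0))
    return list(query_terms) + list(expansion)

print(ft_expand(["insomnia", "treatment"], t=0.75))
```

The UMLS-based variant follows the same pattern, except that the candidate terms are the CHV synonyms of each query term rather than FastText neighbours, with the same kind of cosine filter applied to them.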
3 Query expansions proposed

We now describe the two query expansions proposed. Each of them has two versions: a binary one and a weighted one. As their names suggest, the binary expansions do not assign any weight to the query terms, whereas the weighted ones indicate a level of importance for each term of the query. We detail them in the following.

3.1 Embedding-based only expansion

This approach is quite similar to [1]; one major difference is that the embeddings consider subwords, as described above in Section 2.1. We do not use any manually defined knowledge for these expansions. Equations (1) and (2) describe the binary expansion based on FastText. In Equation (1), the set $VOC_{FT}$ denotes the vocabulary managed by FastText. The manually defined threshold considered here, 0.75, is lower than the one used in the UMLS expansion: it is consistent with [13] and represents a trade-off between the quality of the suggested terms and the quantity of terms found.

\[ Texp\_FT_{binary}(q_i) = \{ e \mid e \in VOC_{FT} \setminus q \ \wedge \ FT_{cos}(q_i, e) \geq 0.75 \} \tag{1} \]

\[ ExpQuery\_FT_{binary}(q) = q \cup \bigcup_{q_i \in q} Texp\_FT_{binary}(q_i) \tag{2} \]

For the weighted FT-based expansion, the principle is the same as before, but:
– the initial query terms have a weight of 1;
– the expanded terms are weighted by the cosine similarity between their FastText embedding and that of the query term they expand;
– if an expansion term occurs several times in the expansion, each (weighted) occurrence is kept in the expansion.

More formally, Equations (3) and (4) describe this expansion:

\[ Texp\_FT_{weighted}(q_i) = \{ (e, FT_{cos}(q_i, e)) \mid e \in VOC_{FT} \setminus q \ \wedge \ FT_{cos}(q_i, e) \geq 0.75 \} \tag{3} \]

\[ ExpQuery\_FT_{weighted}(q) = q \cup \bigcup_{q_i \in q} Texp\_FT_{weighted}(q_i) \tag{4} \]

3.2 UMLS-based expansion

The use of knowledge-based query expansion is well studied, as in [8]. In the specific case of medical search, the use of the UMLS Metathesaurus is classical, as in [16]. The binary expansion is processed query term by query term, as described in Equations (5) and (6). For each query term $q_i$ of a query $q$, we look for the synonyms of $q_i$ in the Consumer Health Vocabulary. Then, we apply a filtering that keeps a synonym only if its FastText [3] similarity with $q_i$ is larger than 0.8. Again, this threshold has been manually defined and is consistent with [13] (even if [5] showed that such a threshold cannot be considered as a rule of thumb). This filtering consolidates the trust we have in the synonyms provided by the CHV.

\[ Texp\_UMLS_{binary}(q_i) = \{ e \mid e \in CHV_{syn}(q_i) \ \wedge \ FT_{cos}(q_i, e) \geq 0.8 \} \tag{5} \]

\[ ExpQuery\_UMLS_{binary}(q) = q \cup \bigcup_{q_i \in q} Texp\_UMLS_{binary}(q_i) \tag{6} \]

For the weighted UMLS-based expansion, the principle is the same as before, but:
– the initial query terms have a weight of 1;
– as the CHV does not weight synonymy relationships, we propose that the expanded terms get the weight provided by FastText;
– if an expansion term occurs several times in the expansion, each (weighted) occurrence is kept in the expansion.

More formally, Equations (7) and (8) describe this expansion:

\[ Texp\_UMLS_{weighted}(q_i, q) = \{ (e, FT_{cos}(q_i, e)) \mid e \in CHV_{syn}(q_i) \setminus q \ \wedge \ FT_{cos}(q_i, e) \geq 0.8 \} \tag{7} \]

\[ ExpQuery\_UMLS_{weighted}(q) = \{ (q_i, 1) \mid q_i \in q \} \cup \bigcup_{q_i \in q} Texp\_UMLS_{weighted}(q_i) \tag{8} \]

4 Information Retrieval System

The information retrieval system used for the experiments is Terrier v5.2 (http://terrier.org/) [9]. We did not index the corpus ourselves, but used the index provided by the organizers. This had an impact on retrieval: simple tests showed that the index seems corrupted, leading to duplicate document identifiers in the result lists. We therefore post-processed the result lists to remove these duplicated documents. Because this removal was applied to the top-1000 documents, our result lists contain fewer than 1000 documents.

The IR model used is BM25 [14], with b = 0.75 after preliminary experiments, and the other parameters kept at their default values. Many experiments show that BM25 is a very strong model [15]. The Relevance Feedback model is Bose-Einstein (the Bo1 model of Terrier), with default parameters (top 3 documents considered, and 10 expansion terms). The Bo1 relevance feedback model provides very good results.
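The sketch below illustrates the duplicate-removal post-processing mentioned above for a run in the standard TREC format (qid Q0 docno rank score tag). The file names and the run tag are placeholders; the actual scripts used to prepare the submissions may differ.

```python
# Sketch of the duplicate-removal post-processing applied to the Terrier result lists.
# Input and output are TREC run files: qid Q0 docno rank score tag.
from collections import defaultdict

def dedup_run(in_path, out_path, tag="LIG"):
    """Keep only the first (best-ranked) occurrence of each document per query."""
    kept = defaultdict(list)  # qid -> [(docno, score)] in original ranking order
    seen = defaultdict(set)   # qid -> docnos already kept
    with open(in_path) as run:
        for line in run:
            qid, _, docno, _, score, _ = line.split()
            if docno not in seen[qid]:
                seen[qid].add(docno)
                kept[qid].append((docno, float(score)))
    with open(out_path, "w") as out:
        for qid, docs in kept.items():
            for rank, (docno, score) in enumerate(docs):
                # ranks are renumbered, so the de-duplicated lists may hold fewer than 1000 entries
                out.write(f"{qid} Q0 {docno} {rank} {score} {tag}\n")

dedup_run("bm25_top1000.res", "bm25_top1000_dedup.res")
```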
5 Runs description

The runs submitted are the best four runs among several configurations. As described above in Section 3, taking the expansion variants into account, we get a total of 10 configurations:

1. Noexp: no expansion, straight query processing (i.e., without Relevance Feedback) ‡;
2. Noexp RF: no expansion, RF query processing †;
3. FT Straight binary: FastText-based query expansion, binary expansion mode, no RF ‡;
4. FT Straight weighted: FastText-based weighted query expansion, no RF;
5. FT RF binary: FastText-based binary expansion, RF query processing †;
6. FT RF weighted: FastText-based weighted query expansion, RF query processing;
7. UMLS Straight binary: UMLS-based query expansion, binary expansion mode, straight query processing ‡;
8. UMLS Straight weighted: UMLS-based weighted query expansion, straight query processing;
9. UMLS RF binary: UMLS-based binary query expansion, RF query processing †;
10. UMLS RF weighted: UMLS-based weighted query expansion, RF query processing ‡ †.

As described below, we selected among these configurations our submissions for the two subtasks: Adhoc (marked ‡) and Spoken queries (marked †).

5.1 Adhoc subtask

To select our submitted runs, we evaluated the MAP of the 10 configurations above on the qrels of the CLEF eHealth 2018 Adhoc task. The results obtained are presented in Table 1. The best reference run between Noexp and Noexp RF, plus the top three expanded runs, were submitted as our official runs. The Adhoc subtask, dedicated to retrieving documents for single written queries, provides a set of 50 queries; the configurations with the best results over these 50 queries were chosen (marked with a ‡ in the list above).

Table 1. LIG-Health configuration results on the CLEF eHealth 2018 Adhoc subtask (‡: selected for submission).

expansion  query processing  expansion mode  MAP     selected
Noexp      Straight          /               0.2575  ‡
Noexp      RF                /               0.2471
FT         Straight          binary          0.2239  ‡
FT         Straight          weighted        0.2239
FT         RF                binary          0.2137
FT         RF                weighted        0.2137
UMLS       Straight          binary          0.2287  ‡
UMLS       Straight          weighted        0.2287
UMLS       RF                binary          0.2155
UMLS       RF                weighted        0.2225  ‡

Table 1 shows that, on the CLEF eHealth 2018 reference, the best run is the non-expanded, non-RF one. When a binary configuration achieves the same quality as its weighted counterpart, we choose the binary configuration; this explains why the UMLS and FT-based binary expansions with straight query processing are selected. Overall, we notice that Relevance Feedback query processing underperforms straight query processing for the FastText-based expansions, and that the weighted FastText expansions behave the same as their binary counterparts.
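The selection procedure of Section 5.1 can be reproduced with a few lines of code; a sketch is given below. It is indicative only: we assume the pytrec_eval package and TREC-formatted run and qrels files, which are not necessarily the tools used for the official submissions.

```python
# Sketch of the configuration selection of Section 5.1: score every candidate
# run with MAP against the CLEF eHealth 2018 qrels and keep the best ones.
# pytrec_eval and the file names are assumptions made for this illustration.
import pytrec_eval

def mean_map(run_path, qrels_path="qrels_clef2018_adhoc.txt"):
    """Mean Average Precision of a TREC-format run over a TREC-format qrels file."""
    with open(qrels_path) as f:
        qrels = pytrec_eval.parse_qrel(f)
    with open(run_path) as f:
        run = pytrec_eval.parse_run(f)
    per_query = pytrec_eval.RelevanceEvaluator(qrels, {"map"}).evaluate(run)
    return sum(scores["map"] for scores in per_query.values()) / len(per_query)

candidates = ["noexp.res", "noexp_rf.res", "ft_straight_binary.res",
              "umls_straight_binary.res", "umls_rf_weighted.res"]
for run in sorted(candidates, key=mean_map, reverse=True)[:4]:
    print(run)  # the four configurations to submit
```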
5.2 Spoken subtask

For the Spoken subtask, the 50 topics of the Adhoc task were recorded by six users (Participant 1 to Participant 6). For each participant, six transcriptions are provided: default enhanced transcription, ESPNET commonvoice, ESPNET librispeech, ESPNET librispeech rnnlm, phone enhanced, and video enhanced. We explored all of these transcriptions for each participant, which leads to a total of 36 (= 6 participants × 6 transcriptions) versions of the query set.

The selection of the four submitted runs per user considers the 10 configurations described previously over these versions of the queries. It follows two steps: the first step selects one transcription per user, and the second step chooses the configurations used for the submission. More precisely:

Fig. 2. MAP evaluations (with standard deviation error bars) of non-expanded spoken transcriptions per user, wrt. CLEF eHealth Adhoc 2018 assessments.

1. Selection of one transcription per participant. We first choose the transcription that achieves the highest MAP value (according to the CLEF eHealth 2018 qrels) over the non-expanded runs. These results are presented in Figure 2. We see that the transcription quality varies a lot depending on the speaker: for instance, the default enhanced transcription is very good for Participants 1, 2, 3 and 6, but fails for Participants 4 and 5. By analyzing this figure, we select the following transcriptions:
   – default enhanced transcription for Participant 1
   – default enhanced transcription for Participant 2
   – video enhanced for Participant 3
   – phone enhanced for Participant 4
   – phone enhanced for Participant 5
   – default enhanced transcription for Participant 6
2. Selection of four configurations per participant over the transcription chosen in step 1. We then computed the averaged MAP (over the 6 participants, using the CLEF eHealth 2018 Adhoc assessments) for each IR configuration. These results are presented in Figure 3. Considering these averaged MAP values, we chose the top four configurations, with only one non-expanded run (marked with a † in Section 5):
   (a) Noexp RF: no expansion, RF query processing;
   (b) FT RF binary: FastText-based binary query expansion, RF query processing;
   (c) UMLS RF binary: UMLS-based binary query expansion, RF query processing;
   (d) UMLS RF weighted: UMLS-based weighted query expansion, RF query processing.

We notice that all the selected configurations use Relevance Feedback.

Fig. 3. MAP evaluations of expanded spoken transcriptions per user for the selected transcriptions, wrt. CLEF eHealth Adhoc 2018 assessments.

Fused runs. We also submitted 4 runs that fuse, for each configuration, the results obtained for the six participants. To integrate these results, we used a simple sum of scores. This allows us to study whether the integration of several transcriptions from several participants outperforms single-participant transcriptions. The MAP evaluations of these four configurations on the CLEF eHealth 2018 assessments, presented in Figure 4, show a slight increase compared to the top results per user.

Fig. 4. MAP evaluations of the fused spoken configurations, wrt. CLEF eHealth Adhoc 2018 assessments.
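The sum-of-scores fusion of the per-participant result lists is straightforward; a minimal sketch follows. The in-memory run representation (a query identifier mapped to document scores) and the toy example are assumptions made for illustration.

```python
# Sketch of the sum-of-scores fusion used for the merged spoken runs:
# for every query, the scores the per-participant runs assign to a document
# are summed, and documents are re-ranked by the summed score.
from collections import defaultdict

def fuse_sum(runs):
    """runs: list of {qid: {docno: score}}; returns {qid: [(docno, fused_score), ...]}."""
    fused = defaultdict(lambda: defaultdict(float))
    for run in runs:
        for qid, doc_scores in run.items():
            for docno, score in doc_scores.items():
                fused[qid][docno] += score
    return {qid: sorted(scores.items(), key=lambda item: item[1], reverse=True)
            for qid, scores in fused.items()}

# Toy example with two participants and one query
run_p1 = {"q151": {"docA": 12.3, "docB": 10.1}}
run_p2 = {"q151": {"docB": 11.0, "docC": 9.4}}
print(fuse_sum([run_p1, run_p2])["q151"])  # docB ranks first (10.1 + 11.0)
```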
6 Results

We present here the official results obtained by our runs for the Adhoc and Spoken Queries subtasks. We consider the following evaluation measures: MAP, to assess globally the quality of the configurations; Bpref [4], which takes into account the fact that the evaluation relies on incomplete assessments; and the classical ndcg@10, which focuses on the relevance of the top-10 results. For these runs, the other measures provided by the organizers, such as the RBP-based ones, lead to similar rankings of the configurations tested.

6.1 Adhoc

Our official results for the Adhoc query runs are presented in Table 2. In this table, we see that the best MAP and ndcg@10 results are obtained without any query expansion and without any relevance feedback. This means that none of the query expansions are able to increase the quality according to these evaluation measures. However, the binary UMLS and FT-based expansions with straight query processing slightly outperform the un-expanded run with straight query processing on the Bpref measure. We also studied the results obtained for the RBP measures, reported in Table 3. We see that the UMLS expanded run with RF outperforms the non-expanded one for RBP and readability RBP (rRBP).

Table 2. LIG-Health official results for the Adhoc subtask (best in bold).

expansion  query processing  expansion mode  MAP     Bpref   ndcg@10
Noexp      straight          /               0.2627  0.3640  0.5919
UMLS       straight          binary          0.2340  0.3665  0.5769
UMLS       RF                weighted        0.2258  0.3616  0.5918
FT         straight          binary          0.2318  0.3669  0.5617

Table 3. LIG-Health RBP official results for the Adhoc subtask (best in bold).

expansion  query processing  expansion mode  RBP 0.80  rRBP 0.80  cRBP 0.80
Noexp      straight          /               0.7094    0.2993     0.4615
UMLS       straight          binary          0.7058    0.3062     0.4614
UMLS       RF                weighted        0.7172    0.3123     0.4593
FT         straight          binary          0.6912    0.2909     0.4555

6.2 Spoken queries

Our official per-participant results are presented in Table 4. This table confirms our evaluations on the 2018 assessments: there are large variations of quality (for all the measures) depending on the participant considered. Among our four submitted configurations, the best MAP for Participant 6 is 0.1744 (Noexp with RF), whereas for Participant 5 the best MAP (UMLS-based binary expansion with RF) is only 0.1036. The best values (among the 3 measures presented) per user are obtained without any query expansion in 12 cases out of 24. The UMLS-based binary expansion provides the best value only twice (MAP and Bpref for Participant 5). The weighted UMLS expansion outperforms the other configurations in 4 cases, and it outperforms the binary UMLS expansion in 12 cases out of 18. The FT-based expansion never produces the best results, but it obtains the second best Bpref and ndcg@10 for Participant 1.

Table 4. LIG-Health official per-participant results for the Spoken Queries subtask (best per user in bold).

Participant  expansion  query processing  expansion mode  MAP     Bpref   ndcg@10
1            Noexp      RF                /               0.1726  0.3192  0.5178
1            UMLS       RF                binary          0.1271  0.2783  0.4540
1            UMLS       RF                weighted        0.1416  0.2928  0.4628
1            FT         RF                binary          0.1565  0.3096  0.5179
2            Noexp      RF                /               0.1206  0.2634  0.4247
2            UMLS       RF                binary          0.1017  0.2384  0.3963
2            UMLS       RF                weighted        0.1133  0.2575  0.4419
2            FT         RF                binary          0.0995  0.2314  0.3885
3            Noexp      RF                /               0.1447  0.3023  0.4583
3            UMLS       RF                binary          0.1385  0.2984  0.4217
3            UMLS       RF                weighted        0.1485  0.3114  0.4605
3            FT         RF                binary          0.1274  0.2664  0.4290
4            Noexp      RF                /               0.1301  0.2880  0.4310
4            UMLS       RF                binary          0.1246  0.2852  0.4042
4            UMLS       RF                weighted        0.1282  0.2877  0.4273
4            FT         RF                binary          0.1090  0.2582  0.3805
5            Noexp      RF                /               0.1035  0.2412  0.3539
5            UMLS       RF                binary          0.1036  0.2470  0.3462
5            UMLS       RF                weighted        0.0917  0.2227  0.3275
5            FT         RF                binary          0.0952  0.2287  0.3097
6            Noexp      RF                /               0.1744  0.3238  0.4807
6            UMLS       RF                binary          0.1478  0.3019  0.4355
6            UMLS       RF                weighted        0.1594  0.3072  0.4439
6            FT         RF                binary          0.1509  0.2921  0.4468

For the submitted merged runs, presented in Table 5, we see that the merging always increases the evaluation measures for each configuration over the best single participant. Here, the binary UMLS-based expansion outperforms its weighted counterpart for Bpref and ndcg@10, and the FT-based expansion underperforms the UMLS-based expansions. For the RBP evaluations of the merged spoken runs, reported in Table 6, the non-expanded run still outperforms our other submissions.

Table 5. LIG-Health official results for the Spoken Queries subtask, merged runs (best in bold).

expansion  query processing  expansion mode  MAP     Bpref   ndcg@10
Noexp      RF                /               0.1810  0.3279  0.5411
UMLS       RF                binary          0.1582  0.3085  0.5203
UMLS       RF                weighted        0.1671  0.2964  0.5203
FT         RF                binary          0.1626  0.3054  0.4873

Table 6. LIG-Health RBP official results for the Spoken Queries subtask, merged runs (best in bold).

expansion  query processing  expansion mode  RBP 0.80  rRBP 0.80  cRBP 0.80
Noexp      RF                /               0.6186    0.2601     0.4067
UMLS       RF                binary          0.6017    0.2557     0.3864
UMLS       RF                weighted        0.5847    0.2407     0.3722
FT         RF                binary          0.5629    0.2264     0.3592
7 Discussion

We focus first on the Adhoc subtask. According to the MAP values obtained on the 2018 assessments, our official results are consistent: the best run is the non-expanded one without relevance feedback. We see that the UMLS expanded query with relevance feedback performs as well as the non-expanded run for the top results, as the P@10 and ndcg@10 values are almost equal. The binary UMLS and FT-based expansions slightly outperform our non-expanded runs for the Bpref measure, which shows that such expansions can be beneficial. We see in Figures 5 and 6 (i.e., our two best results according to MAP) that the expanded run is much more unstable compared to the per-query median results.

Fig. 5. MAP evaluations per query for Adhoc Straight without expansion. (Image courtesy of the CLEF eHealth 2020 Task 2 organizers.)

Fig. 6. MAP evaluations per query for the UMLS binary expansion with Relevance Feedback. (Image courtesy of the CLEF eHealth 2020 Task 2 organizers.)

When considering the per-participant spoken runs, we observe again that some of the proposed expansions are able to outperform the non-expanded configurations. More precisely, UMLS-based expansions obtain larger MAP and Bpref values than the non-expanded configuration for 33% (= 2/6) of the participants.

The fused evaluation results that we get for the spoken queries are consistent with the Adhoc results: the expansions underperform the non-expanded runs. With merged results, the expansions never come close to the quality of the non-expanded runs: the reason is that the expansions of each user tend to disperse an initial query expression that is already subject to transcription errors. We present in Figures 7 and 8 (i.e., our two best results according to ndcg@10) the results per query. We see again that the expanded run underperforms the median results more often than the non-expanded run. A more detailed fusion process may improve the overall quality, but in any case merging results for spoken queries requires recognizing that similar queries were asked by several users, which is not an easy task.

Fig. 7. ndcg@10 evaluations per query for the merged run with Relevance Feedback, without expansion. (Image courtesy of the CLEF eHealth 2020 Task 2 organizers.)

Fig. 8. ndcg@10 evaluations per query for the merged run with UMLS binary query expansion and Relevance Feedback. (Image courtesy of the CLEF eHealth 2020 Task 2 organizers.)

This work focuses only on simple expansions, and these expansions do not succeed in increasing the quality of the results: the expansion terms are not strongly enough related to the initial query. Future experiments will be conducted to check exactly why these expansions fail. For the official evaluation measures related to credibility, the UMLS expanded runs outperform the non-expanded one by 4.3% (cRBP 0.80) for the Adhoc search, but the non-expanded runs still achieve a higher quality than the expanded ones for the spoken runs.

8 Conclusion

We presented in this paper our retrieval configurations for the Adhoc and Spoken Queries subtasks of the Consumer Health Search task of CLEF eHealth 2020. We focused our proposal on several query expansions, relying on the UMLS Metathesaurus and on word embeddings computed with FastText. The main finding is that the expansions proposed for the classical Adhoc task underperform simple retrieval (with or without a Relevance Feedback strategy). For the spoken runs, we were able to detect that some query expansions (based on UMLS) do compete well with simple retrieval without expansion. Other approaches based on reranking should be studied in the future, in order to avoid the noise generated by the query expansions.
Acknowledgement

This work was partially supported by the ANR Kodicare project, grant ANR-19-CE23-0029 of the French Agence Nationale de la Recherche.

References

1. Mohannad Almasri, Catherine Berrut, and Jean-Pierre Chevallet. A Comparison of Deep Learning Based Query Expansion with Pseudo-Relevance Feedback and Mutual Information. In ECIR, volume 42, pages 369-715, Padua, Italy, March 2016.
2. Olivier Bodenreider. The Unified Medical Language System (UMLS): Integrating Biomedical Terminology. Nucleic Acids Research, 32(Database-Issue):267-270, 2004.
3. Piotr Bojanowski, Edouard Grave, Armand Joulin, and Tomas Mikolov. Enriching Word Vectors with Subword Information. CoRR, abs/1607.04606, 2016.
4. Chris Buckley and Ellen M. Voorhees. Retrieval Evaluation with Incomplete Information. In Proceedings of the 27th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR '04, pages 25-32, New York, NY, USA, 2004. Association for Computing Machinery.
5. A. Elekes, M. Schaeler, and K. Boehm. On the Various Semantics of Similarity in Word Embedding Models. In 2017 ACM/IEEE Joint Conference on Digital Libraries (JCDL), pages 1-10, 2017.
6. Lorraine Goeuriot, Hanna Suominen, Liadh Kelly, Zhengyang Liu, Gabriella Pasi, Gabriela Saez Gonzales, Marco Viviani, and Chenchen Xu. Overview of the CLEF eHealth 2020 Task 2: Consumer Health Search with Ad Hoc and Spoken Queries. In Working Notes of Conference and Labs of the Evaluation (CLEF) Forum, CEUR Workshop Proceedings, 2020.
7. Lorraine Goeuriot, Hanna Suominen, Liadh Kelly, Antonio Miranda-Escalada, Martin Krallinger, Zhengyang Liu, Gabriella Pasi, Gabriela Saez Gonzales, Marco Viviani, and Chenchen Xu. Overview of the CLEF eHealth Evaluation Lab 2020. In Avi Arampatzis, Evangelos Kanoulas, Theodora Tsikrika, Stefanos Vrochidis, Hideo Joho, Christina Lioma, Carsten Eickhoff, Aurélie Névéol, Linda Cappellato, and Nicola Ferro, editors, Experimental IR Meets Multilinguality, Multimodality, and Interaction: Proceedings of the Eleventh International Conference of the CLEF Association (CLEF 2020), LNCS volume 12260, 2020.
8. Alexander Kotov and ChengXiang Zhai. Tapping into Knowledge Base for Concept Feedback: Leveraging ConceptNet to Improve Search Results for Difficult Queries. In Eytan Adar, Jaime Teevan, Eugene Agichtein, and Yoelle Maarek, editors, WSDM, pages 403-412. ACM, 2012.
9. Craig Macdonald, Richard McCreadie, Rodrygo L. T. Santos, and Iadh Ounis. From Puppy to Maturity: Experiences in Developing Terrier. In Proceedings of the OSIR Workshop at SIGIR, pages 60-63, 2012.
10. Tomas Mikolov, Edouard Grave, Piotr Bojanowski, Christian Puhrsch, and Armand Joulin. Advances in Pre-Training Distributed Word Representations. In Proceedings of the International Conference on Language Resources and Evaluation (LREC 2018), 2018.
11. Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S. Corrado, and Jeff Dean. Distributed Representations of Words and Phrases and their Compositionality. In C. J. C. Burges, L. Bottou, M. Welling, Z. Ghahramani, and K. Q. Weinberger, editors, Advances in Neural Information Processing Systems 26, pages 3111-3119. Curran Associates, Inc., 2013.
12. Antonio Miranda-Escalada, Aitor Gonzalez-Agirre, Jordi Armengol-Estapé, and Martin Krallinger. Overview of Automatic Clinical Coding: Annotations, Guidelines, and Solutions for non-English Clinical Cases at CodiEsp Track of CLEF eHealth 2020. In Working Notes of Conference and Labs of the Evaluation (CLEF) Forum, CEUR Workshop Proceedings, 2020.
13. Navid Rekabsaz, Mihai Lupu, and Allan Hanbury. Exploration of a Threshold for Similarity Based on Uncertainty in Word Embedding. In Joemon M. Jose, Claudia Hauff, Ismail Sengor Altıngovde, Dawei Song, Dyaa Albakour, Stuart Watt, and John Tait, editors, Advances in Information Retrieval, pages 396-409, Cham, 2017. Springer International Publishing.
14. Stephen Robertson, S. Walker, S. Jones, M. M. Hancock-Beaulieu, and M. Gatford. Okapi at TREC-3. In Overview of the Third Text REtrieval Conference (TREC-3), pages 109-126. Gaithersburg, MD: NIST, January 1995.
15. Peilin Yang and Hui Fang. A Reproducibility Study of Information Retrieval Models. In ICTIR '16, pages 77-86, New York, NY, USA, 2016. Association for Computing Machinery.
16. Zhenyu Liu and Wesley W. Chu. Knowledge-based Query Expansion to Support Scenario-specific Retrieval of Medical Free Text. Information Retrieval, 10(2):173-202, 2007.