SandiDoc at CLEF 2020 - Consumer Health
             Search : AdHoc IR Task

     Sandaru Seneviratne1 , Eleni Daskalaki1[0000−0002−7665−7039] , Md Zakir
     Hossain1[0000−0003−1892−831X] , and Artem Lenskiy1[0000−0002−4745−6756]

Research School of Computer Science, College of Engineering and Computer Science,
                       The Australian National University


        Abstract. Information retrieval (IR) is the retrieval of ranked
        documents based on the similarity between the documents and the
        keywords specified in a user’s query. The CLEF 2020 eHealth task on
        consumer health search (adhoc IR) addresses the need for improved
        information retrieval techniques in the health domain to provide
        relevant information to users. This paper presents our work in the
        CLEF eHealth 2020 consumer health search task, covering the term
        frequency-inverse document frequency and word vector
        representation-based techniques adopted for the adhoc IR task. The
        goal of our work is to experiment with different information
        retrieval techniques and to examine how different word vector
        representations of text affect the final results.

        Keywords: Information Retrieval · TF-IDF Score · Word Vector Rep-
        resentations.


1     Introduction
With the rapid expansion of online content, efforts to retrieve health
information online in order to obtain medical knowledge have grown. These
efforts, pursued not only by medical specialists but also by the general
public, have led to improved mechanisms for health information retrieval. Given
the enormous amount of available information, it is vital to provide users with
documents that fit their requests. Information retrieval (IR) can be described
as the automatic retrieval of a ranked list of documents that are relevant to
a given user query, based on similarity measures between the query and the
documents. Different theoretical models, such as boolean, probabilistic, and
vector models, are used in IR; they utilise distinct matching and ranking
algorithms to retrieve the documents relevant to a given query [7].
    Most early IR systems were based on boolean models [11], which use boolean
logic and set theory to represent the presence or absence of a term in a
document. Another major approach is probabilistic retrieval models [11], which
estimate the probability that a document is relevant to a query from the term
weights in queries and documents. In vector space models [11], all queries and
documents are represented by vectors in an n-dimensional vector space, where n
is the number of distinct terms in the collection.

    Copyright © 2020 for this paper by its authors. Use permitted under Creative
    Commons License Attribution 4.0 International (CC BY 4.0). CLEF 2020, 22-25
    September 2020, Thessaloniki, Greece.

    CLEF eHealth 2020 [4] task 2 on consumer health search [3] consists of two
subtasks: adhoc IR and spoken query retrieval. Adhoc IR is a traditional IR task
that retrieves documents relevant to written queries, whereas the spoken query
retrieval task uses spoken queries for document retrieval. We participated in
subtask 1 of the consumer health search task to experiment with adhoc IR.
    This paper is organized as follows. Section 2 introduces the data set, queries
and additional resources used in the task. Section 3 describes the methodology
used in the experimental setup. Section 4 presents the results, and Sections 5
and 6 cover the discussion and future work, respectively.


2     Resources
2.1   Dataset
The document collection used in the document retrieval task was acquired from
the Common Crawl dump of 2018-19. It includes web pages in formats such as
HTML, XHTML, and XML. The dataset used for the task is clefehealth2018 B,
which is a subset of the initial dataset of 1903 domains. The clefehealth B
dataset contains web pages from 1653 website domains and was created by
removing a number of websites that were not strictly related to health. The
size of this corpus is 294 GB, of which a 30 GB subset was used in the task.

2.2   Queries
For subtask 1 (adhoc IR), 50 topics/queries were provided. These queries
were chosen from a set of sample queries collected over 6 months by domain
experts. The 50 queries were raw queries with no preprocessing performed
beforehand. Fig. 1 provides an example query input, which contains the id and
the query. The first 3 digits of the id refer to the topic, whereas the last 3
digits identify the creator of the query.


                           Fig. 1: Example of a query.

2.3   Word Embedding Model
As additional resources, Medical Continuous Bag of Words (CBOW) and Skip-gram
word embeddings created using the TREC (Text REtrieval Conference) Medical
Records collection were provided. Both models use a neural network architecture
to learn the word representations. In the CBOW architecture, the model uses the
context words to predict the target word, whereas the Skip-gram architecture
uses the target word to predict the context words [5].
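
The difference between the two architectures can be illustrated by the training examples each one is built from. The following is a minimal sketch, not the training code of the provided models: the function names, the toy sentence, and the window size of 1 are assumptions made for illustration only.

```python
from typing import List, Tuple


def skipgram_pairs(tokens: List[str], window: int) -> List[Tuple[str, str]]:
    """(target, context) pairs: the target word is used to predict each context word."""
    pairs = []
    for i, target in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((target, tokens[j]))
    return pairs


def cbow_examples(tokens: List[str], window: int) -> List[Tuple[List[str], str]]:
    """(context words, target) examples: the context is used to predict the target word."""
    examples = []
    for i, target in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        context = [tokens[j] for j in range(lo, hi) if j != i]
        examples.append((context, target))
    return examples


# Hypothetical toy sentence, already tokenised.
tokens = ["insulin", "lowers", "blood", "sugar"]
sg = skipgram_pairs(tokens, window=1)
cb = cbow_examples(tokens, window=1)
print(sg[:3])
print(cb[1])
```

Skip-gram yields one training pair per (target, context word) combination, while CBOW yields one example per position with the whole context as input.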

3     Methodology
In this section, we describe the different techniques we used in the document
retrieval task. Fig. 2 gives an overview of the complete process which includes
preprocessing, representation of queries and documents, and, finally, a matching
and ranking algorithm to get the most relevant document list for the queries.


               Fig. 2: Elements of a document retrieval system.


3.1   Preprocessing
Preprocessing is an important initial step in natural language processing (NLP)
tasks [8]: it converts text into a simpler and more uniform format so that NLP
and machine learning (ML) techniques can perform better. Both the clefehealth B
dataset and the queries are raw data with no preprocessing performed on them.
To obtain clean text from the dataset and the queries, we apply several
preprocessing steps [9].
    The clefehealth B dataset contains web pages of different formats crawled
from the web. These files include the content of the web page along with HTML
tags, scripting, and styling. To obtain the important content from the web
pages, we parsed them using the Beautiful Soup library [12]. Converting text to
lower case is one of the simplest forms of preprocessing and is useful for
entity normalisation; if it is skipped, the same entity can be identified as
several distinct entities, which can affect the final result of a system. Both
the queries and the text obtained from HTML parsing were therefore converted to
lower case. Next, digits and numbers in the text were converted to words in
order to facilitate entity normalisation. Stop words carry little to no
important information, so they were removed from both the queries and the
documents using the stop word list provided by the NLTK library. Punctuation
and other unnecessary characters were then removed to obtain clean text. To
reduce spelling errors in queries and documents, spell correction was performed
using the edit distance between each word and the words in the given word
embedding model. Finally, stemming was performed using the Porter stemming
algorithm to reduce each word to its stem, so that different forms of a word
are identified as one word [10]. Fig. 3 gives an overview of the preprocessing
function.
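
The first steps of this pipeline can be sketched as follows. This is an illustrative approximation only: it uses the standard-library `html.parser` as a stand-in for Beautiful Soup, a small hand-picked stop word set instead of the NLTK list, and a digit-by-digit conversion of numbers. Spell correction and Porter stemming are omitted here, and all names in the block are hypothetical.

```python
import re
import string
from html.parser import HTMLParser

# Illustrative subset only; the paper uses the NLTK stop word list.
STOP_WORDS = {"a", "an", "and", "are", "for", "in", "is", "of", "the", "to", "with"}
DIGIT_WORDS = ["zero", "one", "two", "three", "four",
               "five", "six", "seven", "eight", "nine"]


class _TextExtractor(HTMLParser):
    """Collects visible text, skipping the content of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0:
            self.parts.append(data)


def html_to_text(html: str) -> str:
    """Strip tags, scripts and styles (stand-in for Beautiful Soup parsing)."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.parts)


def preprocess(text: str) -> list:
    """Lower-case, spell out digits, drop punctuation and stop words."""
    text = text.lower()
    # Assumption: digits are spelled out one at a time ("25" -> "two five").
    text = re.sub(r"\d", lambda m: " " + DIGIT_WORDS[int(m.group())] + " ", text)
    # Replace every punctuation character with a space.
    text = text.translate(str.maketrans(string.punctuation, " " * len(string.punctuation)))
    return [tok for tok in text.split() if tok not in STOP_WORDS]


page = ("<html><head><style>p{color:red}</style></head>"
        "<body><p>The 2 Types of Diabetes.</p><script>var x=1;</script></body></html>")
tokens = preprocess(html_to_text(page))
print(tokens)
```

The same `preprocess` function would be applied to queries, so that query tokens and document tokens are normalised in the same way.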


                 Fig. 3: Elements in the preprocessing function.
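
The spell-correction step described in Section 3.1 can be sketched with a plain Levenshtein distance. This is a sketch under stated assumptions: the paper does not specify a distance threshold or a tie-breaking rule, so the `max_distance` parameter, the alphabetical tie-break, and the toy vocabulary (standing in for the vocabulary of the provided embedding model) are all hypothetical.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance computed row by row with dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]


def correct(word: str, vocabulary: set, max_distance: int = 2) -> str:
    """Return the closest vocabulary word, or the word itself if none is close."""
    if word in vocabulary:
        return word
    # Assumption: ties are broken alphabetically to keep the result deterministic.
    best = min(vocabulary, key=lambda v: (edit_distance(word, v), v))
    return best if edit_distance(word, best) <= max_distance else word


# Hypothetical stand-in for the embedding model's vocabulary.
vocab = {"diabetes", "diabetic", "insulin"}
print(correct("insuln", vocab))
```

Scanning the whole vocabulary for every word is slow for a large collection; it is shown here only because it is the most direct reading of the description.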


3.2   Document Retrieval

This section describes TF-IDF and word embedding based techniques used for
the IR task.

TF-IDF score based: TF-IDF (Eq. 1) is a popular IR technique used in many
applications [2]. It is a statistical weight used to evaluate the importance
of a word in a document with respect to the whole collection of documents
[6]. The weight is calculated from the number of occurrences of a term in a
document (term frequency, TF) and the inverse document frequency (IDF). There
are different variants of the TF and IDF scores used in calculating the
relevance of a document to a user query. In our work, we use the variant in
Eq. 1.
    Using TF alone to calculate the scores may give too much weight to terms
that are frequent but not relevant. To dampen the effect of TF, the IDF score
is incorporated. However, a linear IDF function may excessively boost the
scores of documents containing high-IDF terms. To address this, the logarithm
of the IDF (a sub-linear function) is used.


        $$ w_{i,d} = \mathrm{tf}_{i,d} \cdot \log\left(\frac{n}{\mathrm{df}_i}\right) \tag{1} $$
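
As a worked illustration of Eq. 1, the sketch below computes the weights for a toy collection. It assumes the usual reading of the symbols (tf is the raw count of term i in document d, n is the number of documents, and df is the number of documents containing term i) and uses the natural logarithm, since the paper does not state the base. The function name and the toy documents are hypothetical.

```python
import math
from collections import Counter


def tfidf_weights(docs):
    """Return one {term: weight} dict per document, with w = tf * log(n / df)."""
    n = len(docs)
    df = Counter()
    for tokens in docs:
        df.update(set(tokens))  # count each term once per document
    weights = []
    for tokens in docs:
        tf = Counter(tokens)
        weights.append({t: tf[t] * math.log(n / df[t]) for t in tf})
    return weights


# Toy collection of already preprocessed (stemmed) documents.
docs = [
    ["diabet", "insulin", "diabet"],
    ["insulin", "dose"],
    ["diabet", "diet"],
]
w = tfidf_weights(docs)
print(w[0])
```

A term that occurs in every document gets a weight of zero under this variant, because log(n/n) = 0.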


Word Embedding based: Word embeddings can capture semantic relationships among
words, which is a major advantage in IR tasks. Word embeddings rely on the
distributional hypothesis, which uses the context of words to derive the word
representation [1]. The word embedding architecture used in the task is
Skip-gram, which predicts the context words in a window given the target word.
Of the different Skip-gram models provided, we use the model with a
500-dimensional embedding space and a context window of size 5.
   To represent the documents in the collection using word embeddings, we
compute average, minimum and maximum vectors over the embeddings of the 100
most frequent terms in each document. Along with the average vector (Eq. 2), we
average the minimum and maximum vectors to obtain a second vector
representation (Eq. 3) for the document. Similarly, we obtain two vector
representations (the average vector and the average of the minimum and maximum
vectors) for each query.


        $$ \text{vector representation1}_i = \frac{\sum_{n=1}^{100} x_n}{100} \tag{2} $$

        $$ \text{vector representation2}_i = \frac{\text{min vec}_i + \text{max vec}_i}{2} \tag{3} $$
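
The two representations in Eqs. 2 and 3 can be sketched in plain Python as follows. Several details are assumptions, since the paper does not specify them: terms missing from the embedding model are skipped, the minimum and maximum are taken component-wise, and documents with fewer than 100 distinct terms are averaged over the terms they do have. The 3-dimensional embeddings are hypothetical toy values, not the 500-dimensional model used in the task.

```python
from collections import Counter


def top_term_vectors(tokens, embeddings, k=100):
    """Embeddings of the k most frequent in-vocabulary terms of a token list."""
    counts = Counter(t for t in tokens if t in embeddings)
    return [embeddings[t] for t, _ in counts.most_common(k)]


def average_vector(vectors):
    """Eq. 2: component-wise mean of the term vectors."""
    return [sum(col) / len(vectors) for col in zip(*vectors)]


def min_max_vector(vectors):
    """Eq. 3: mean of the component-wise minimum and maximum vectors."""
    return [(min(col) + max(col)) / 2 for col in zip(*vectors)]


# Hypothetical toy embeddings.
emb = {
    "diabet": [1.0, 0.0, 2.0],
    "insulin": [3.0, 4.0, 0.0],
    "dose": [2.0, -2.0, 1.0],
}
doc_tokens = ["diabet", "insulin", "diabet", "dose", "unknown"]
vecs = top_term_vectors(doc_tokens, emb)
avg = average_vector(vecs)
minmax = min_max_vector(vecs)
print(avg, minmax)
```

The same functions apply unchanged to a tokenised query, which gives the two query representations mentioned above.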


    For the TF-IDF technique, we compute the score of each document by summing
the TF-IDF weights of the query tokens in that document, and rank the documents
by score in descending order. For the two vector representations (the average
vector and the average of the minimum and maximum vectors), we compute the
cosine similarity between the query and each document and rank the documents by
similarity in descending order. For each query, we retrieve the 1000 most
similar documents as results. Table 1 gives the fields in each row of the
results file.

                        Table 1: Fields in the results file.

Field   Details
qid     query id
Q0      literal Q0
docno   document id
rank    rank of the document
score   similarity score for the document with respect to the query
tag     system identifier
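
The scoring, ranking and output steps can be sketched as below. This is a simplified illustration rather than the submitted system: the query id, document ids and run tag are hypothetical, ranks are assumed to start at 1, ties are broken by document id, and the space-separated row layout simply follows the field order of Table 1.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def tfidf_score(query_tokens, doc_weights):
    """Sum of the document's TF-IDF weights over the query tokens."""
    return sum(doc_weights.get(t, 0.0) for t in query_tokens)


def run_lines(qid, scores, tag, k=1000):
    """Format the top-k documents as 'qid Q0 docno rank score tag' rows."""
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))[:k]
    return [f"{qid} Q0 {docno} {rank} {score:.4f} {tag}"
            for rank, (docno, score) in enumerate(ranked, start=1)]


# Hypothetical scores for one query over three documents.
scores = {"doc_a": 0.25, "doc_b": 0.9, "doc_c": 0.5}
lines = run_lines("151001", scores, "example_tfidf", k=2)
print("\n".join(lines))
```

Either `tfidf_score` or `cosine` can be used to fill the `scores` dictionary, depending on which representation of the query and documents is being evaluated.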


4   Results
The experiment was performed on a 30 GB subset of the clefehealth B dataset,
and only the results from the TF-IDF document retrieval algorithm were
submitted for evaluation due to time constraints and computational limitations.
Using a subset of the dataset has a large impact on the accuracy of the results,
since relevant documents outside the subset cannot be retrieved. Table 2
provides the scores for the IR task using the TF-IDF technique on the 30 GB
subset of the 294 GB dataset.


                     Table 2: Results of the adhoc IR task.

Evaluation Metric                                                   Result
Mean Average Precision (MAP)                                        0.0239
Precision at 10 (P@10)                                              0.426
Normalized Discounted Cumulative Gain through position 10 (NDCG@10) 0.3235
Accuracy of credibility                                             0.1744
Relevance-ranked biased precision (RBP 0.95)                        0.2981 +0.2934
Credibility-ranked biased precision (cRBP 0.95)                     0.1801 +0.2934
Understandability-ranked biased precision (uRBP 0.95)               0.1633 +0.2934
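
For readers unfamiliar with the rank-based metrics in Table 2, the sketch below computes precision at k and rank-biased precision from a list of binary relevance labels. It assumes the standard definition RBP = (1 - p) * sum_i r_i * p^(i-1) with ranks starting at 1; the official evaluation tooling, graded relevance, and the values after the "+" signs in Table 2 (which look like RBP residuals) are not reproduced here. The relevance labels are hypothetical.

```python
def precision_at_k(relevance, k=10):
    """Fraction of the top-k results that are relevant (binary labels)."""
    return sum(relevance[:k]) / k


def rbp(relevance, p=0.95):
    """Rank-biased precision: (1 - p) * sum_i r_i * p**(i - 1), ranks from 1."""
    # enumerate() starts at 0, which matches the exponent (rank - 1).
    return (1 - p) * sum(r * p ** i for i, r in enumerate(relevance))


# Hypothetical relevance labels for the top 10 results of one query.
rel = [1, 0, 1, 1, 0, 0, 1, 0, 0, 0]
print(precision_at_k(rel, k=10))
print(rbp(rel, p=0.95))
```

With p = 0.95 the weights decay slowly, so relevant documents well below rank 10 still contribute to the score.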


5   Discussion
In this paper, we present our methodology for the adhoc IR subtask of CLEF 2020
using TF-IDF scores and word vector representations. TF-IDF is considered a
simple yet effective algorithm that provides a good baseline for IR tasks, on
which more complex IR algorithms can be built. Despite these advantages, TF-IDF
does not use context information, unlike word embedding models, which take
context into account when learning the embeddings for words. TF-IDF matches
terms exactly, so if a user query contains “diabetes” as a keyword, the
algorithm would not consider documents that contain the variant “diabetic”.
Similarly, it would not consider documents that contain misspelled terms such
as “diabetec”, however relevant they are to the user query. To retrieve the
most relevant documents using the TF-IDF algorithm, it is therefore vital to
preprocess the data before applying the algorithm, which can have a
significant effect on the results.
    Word embedding models have been successfully used in many NLP and ML
tasks since they consider contextual information when learning the
representations for words. However, one of their major limitations is that
they cannot distinguish words that are spelled the same but have different
meanings (homonyms), and instead create a single vector representation for such
words. This limitation can be avoided by using approaches that produce
multi-sense embeddings for words.


6   Future Work
In the future, we will further improve and expand our IR algorithms, building
on the TF-IDF and word embedding baseline models. Moreover, we will extend our
current work to incorporate query expansion, which generates different forms of
the original query to improve the results of the IR task. One popular query
expansion technique is synonym identification and substitution, which is mostly
done using existing vocabularies. In the medical domain, vocabularies such as
UMLS (Unified Medical Language System), SNOMED CT (SNOMED Clinical Terms), and
OAC-CHV (open-access and collaborative consumer health vocabulary) can be used
for query expansion along with word embedding techniques. In addition, we will
explore techniques for multi-sense embeddings to improve the word embedding
based model for IR.


Acknowledgements
This research was funded by and has been delivered in partnership with Our
Health in Our Hands (OHIOH), a strategic initiative of the Australian National
University, which aims to transform health care by developing new personalized
health technologies and solutions in collaboration with patients, clinicians and
health-care providers.


References
 1. Croft, W.B., Zamani, H.: Relevance-based Word Embedding. SIGIR ’17: Proceed-
    ings of the 40th International ACM SIGIR Conference on Research and Develop-
    ment in Information Retrieval (2017)
 2. Fautsch, C., Savoy, J.: Adapting the tf idf Vector-Space Model to Domain Specific
    Information Retrieval. SAC ’10: Proceedings of the 2010 ACM Symposium on
    Applied Computing (2010), http://www.lucene.apache.org/
 3. Goeuriot, L., Suominen, H., Kelly, L., Liu, Z., Pasi, G., Gonzales, G.S., Viviani,
    M., Xu, C.: Overview of the CLEF eHealth 2020 task 2: Consumer health search
    with ad hoc and spoken queries. In: Working Notes of Conference and Labs of the
    Evaluation (CLEF) Forum. CEUR Workshop Proceedings (2020)
 4. Goeuriot, L., Suominen, H., Kelly, L., Miranda-Escalada, A., Krallinger, M., Liu,
    Z., Pasi, G., Saez Gonzales, G., Viviani, M., Xu, C.: Overview of the CLEF eHealth
    evaluation lab 2020. In: Arampatzis, A., Kanoulas, E., Tsikrika, T., Vrochidis, S.,
    Joho, H., Lioma, C., Eickhoff, C., Névéol, A., Cappellato, L., Ferro, N. (eds.)
    Experimental IR Meets Multilinguality, Multimodality, and Interaction: Proceedings
    of the Eleventh International Conference of the CLEF Association (CLEF 2020).
    LNCS, vol. 12260. Springer, Heidelberg, Germany (2020)
 5. Mikolov, T., Chen, K., Corrado, G., Dean, J.: Efficient Estimation of
    Word Representations in Vector Space. Conference: Proceedings of the In-
    ternational Conference on Learning Representations (ICLR 2013) (2013),
    http://ronan.collobert.com/senna/
 6. Ramos, J.: Using TF-IDF to Determine Word Relevance in Document Queries.
    Tech. rep. (2003)
 7. Singhal, A.: Modern Information Retrieval: A Brief Overview. Tech. rep.
    (2001), http://trec.nist.gov
 8. Sun, X., Liu, X., Hu, J., Zhu, J.: Empirical Studies on the NLP Techniques
    for Source Code Data Preprocessing. EAST 2014: Proceedings of the 2014 3rd
    International Workshop on Evidential Assessment of Software Technologies,
    vol. 14 (2014). https://doi.org/10.1145/2627508.2627514
 9. Uysal, A.K., Gunal, S.: The impact of preprocessing on text classification.
    Information Processing and Management 50(1), 104–112 (2014).
    https://doi.org/10.1016/j.ipm.2013.08.006
10. Willett, P.: The Porter stemming algorithm: Then and now. Elec-
    tronic    library     and    information     systems    40(3),    219–223     (2006).
    https://doi.org/10.1108/00330330610681295
11. Yu, B.: Research on information retrieval model based on ontology.
    EURASIP Journal on Wireless Communications and Networking (2019).
    https://doi.org/10.1186/s13638-019-1354-z
12. Zheng, C., He, G., Peng, Z.: A Study of Web Information Extraction Tech-
    nology Based on Beautiful Soup. Journal of Computers 10(6), 381–387 (2015).
    https://doi.org/10.17706/jcp.10.6.381-387