<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <journal-title-group>
<journal-title>Working Notes of CLEF 2020 - Conference and Labs of the Evaluation Forum</journal-title>
      </journal-title-group>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>SandiDoc at CLEF 2020 - Consumer Health Search: AdHoc IR Task</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Sandaru Seneviratne</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Eleni Daskalaki</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Md Zakir Hossain</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Artem Lenskiy</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Research School of Computer Science, College of Engineering and Computer Science, The Australian National University</institution>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2020</year>
      </pub-date>
      <volume>14</volume>
      <issue>12260</issue>
      <fpage>22</fpage>
      <lpage>25</lpage>
      <abstract>
        <p>Information retrieval (IR) processes deal with the retrieval of ranked documents based on similarity measures between the documents and a given user query.</p>
      </abstract>
      <kwd-group>
        <kwd>Information Retrieval</kwd>
        <kwd>TF-IDF Score</kwd>
        <kwd>Word Vector Representations</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>With the increasing expansion of online content, there has been a growth in online health information retrieval efforts aimed at obtaining medical knowledge. These efforts, pursued not only by medical specialists but also by the general public, have led to improved mechanisms of health information retrieval. Given the enormous amount of available information, it is vital to provide users with documents fitting their requests. Information retrieval (IR) can be described as the automatic retrieval of a list of ranked documents that are relevant to a given user query based on similarity measures between the query and the documents. Different theoretical models such as the boolean, probabilistic, and vector space models are used in IR; they employ distinct matching and ranking algorithms to retrieve the documents relevant to a given query [7].</p>
      <p>Most early IR systems were based on boolean models [11], which use boolean logic and set theory to represent the presence or absence of a term in a document. Another major approach to IR is probabilistic retrieval models [11], which estimate the probability of relevance of documents to queries by calculating term weights in the queries and documents. In vector space models [11], all queries and documents are represented as vectors in an n-dimensional vector space, where n is the number of distinct terms in the collection.</p>
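      <p>As a toy illustration (not from the paper), the vector space representation described above can be sketched in Python; the documents, vocabulary, and counts below are invented for the example:</p>

```python
def build_vocab(docs):
    """Collect the distinct terms of the collection; the vocabulary size
    is the dimensionality n of the vector space."""
    return sorted({term for doc in docs for term in doc})

def to_vector(tokens, vocab):
    """Represent a tokenised document or query as a vector of raw term counts."""
    return [tokens.count(term) for term in vocab]

docs = [["diabetes", "diet", "diet"], ["insulin", "diabetes"]]
vocab = build_vocab(docs)                      # n = 3 distinct terms
vectors = [to_vector(d, vocab) for d in docs]  # one 3-dimensional vector per doc
```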
      <p>
        CLEF eHealth 2020 [4] task 2 on consumer health search [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] consists of two sub tasks: adhoc IR and spoken query retrieval. Adhoc IR is a traditional IR task that retrieves documents relevant to written queries, whereas the spoken query retrieval task uses spoken queries for document retrieval. We participated in sub task 1 of the consumer health search task to experiment with adhoc IR.
      </p>
      <p>This paper is organized as follows. Section 2 introduces the data set, queries, and other additional resources used in the task. Section 3 describes the methodology used in the experimental setup. In Section 4, we present the results, and Sections 5 and 6 include the discussion and future work respectively.</p>
    </sec>
    <sec id="sec-2">
      <title>Resources</title>
      <sec id="sec-2-1">
        <title>Dataset</title>
        <p>The document collection used in the document retrieval task was acquired from the CommonCrawl dump of 2018-19. It included web pages in formats such as HTML, XHTML, and XML. The data set used for the task is clefehealth2018 B, a subset of the initial dataset of 1903 domains. The clefehealth B dataset contains web pages from 1653 website domains and was created by removing a number of websites that were not strictly related to health. The size of this corpus was 294 GB, out of which a subset of 30 GB was used in the task.</p>
      </sec>
      <sec id="sec-2-2">
        <title>Queries</title>
        <p>For sub task 1, adhoc IR, 50 topics/queries were provided. These queries were chosen from a set of sample queries collected over 6 months by domain experts. The 50 queries were raw queries with no preprocessing performed beforehand. Fig. 1 provides an example query input, which contains the id and the query. The first 3 digits in the id refer to the topic, whereas the last 3 digits identify the creator of the query.</p>
        <p>As additional resources, Medical Continuous Bag of Words (CBOW) and Skip-gram word embeddings created using the TREC (Text REtrieval Conference) Medical Records collection were provided. These models use a neural network architecture to develop the word representations. In the CBOW architecture, the model takes the context words into account when predicting the target word, whereas the Skip-gram architecture uses the target word to predict the context words [5].</p>
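        <p>As a rough sketch (not the provided TREC models), the Skip-gram objective can be illustrated by how it forms (target, context) training pairs from a token window; the tokens here are invented:</p>

```python
def skipgram_pairs(tokens, window=5):
    """Skip-gram predicts each context word within the window from the
    target word, so training data is a list of (target, context) pairs.
    CBOW inverts this: the context words jointly predict the target."""
    pairs = []
    for i, target in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((target, tokens[j]))
    return pairs

pairs = skipgram_pairs(["blood", "sugar", "level"], window=1)
```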
      </sec>
    </sec>
    <sec id="sec-3">
      <title>Methodology</title>
      <p>In this section, we describe the different techniques we used in the document retrieval task. Fig. 2 gives an overview of the complete process, which includes preprocessing, representation of queries and documents, and, finally, a matching and ranking algorithm to obtain the most relevant document list for the queries. Preprocessing is an important initial step in natural language processing (NLP) tasks [8]; it converts text into a simpler, more tractable format so that NLP and machine learning (ML) techniques can perform better. Both the clefehealth B dataset and the queries are raw data with no preprocessing performed on them. In order to obtain clean text from the data set and the queries, we follow several preprocessing steps [9].</p>
      <p>The clefehealth B dataset contains web pages of different formats crawled from the web. These files include the content of the web page along with HTML tags, scripting, and styling. In order to obtain the important content from the web pages, the pages are parsed using the beautifulsoup library [12]. Converting text to lower case is one of the simplest forms of preprocessing and is useful for entity normalisation; if ignored, the same entity can be identified as distinct entities, which can eventually affect the final result of a system. Both the queries and the text obtained from HTML parsing were converted to lower case. Next, the digits or numbers in the text were converted to words in order to facilitate entity normalisation. Stop words carry little to no important information, so, as a next step, stop words were removed from both the queries and the documents using the stop word list provided by the nltk library. Punctuation and other unnecessary characters were removed in order to obtain clean text. To ensure that queries and documents are free of spelling errors, spell correction was performed using the edit distance between the words and the words in the given word embedding model. Finally, stemming was performed using the Porter stemming algorithm to reduce each word to its stem so that different forms of a word are identified as one word [10]. Fig. 3 gives an overview of the preprocessing function.</p>
      <p>This section describes the TF-IDF and word embedding based techniques used for the IR task.</p>
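      <p>The preprocessing steps above can be sketched as follows. This is a minimal, illustrative standard library version: the stop word list and digit map are toy stand-ins, and the paper's beautifulsoup parsing, nltk stop word list, spell correction against the embedding vocabulary, and Porter stemming are omitted:</p>

```python
import re

# Toy stop word list for illustration; the paper uses nltk's list.
STOP_WORDS = {"the", "a", "an", "of", "is", "to", "and", "in"}

# Minimal digit-to-word map; a full version would spell out any number.
DIGIT_WORDS = {"1": "one", "2": "two", "3": "three"}

def preprocess(text):
    """Lower-case, strip punctuation, map digits to words, drop stop words."""
    text = text.lower()
    text = re.sub(r"[^\w\s]", " ", text)      # remove punctuation
    tokens = []
    for tok in text.split():
        tok = DIGIT_WORDS.get(tok, tok)       # digit -> word when known
        if tok not in STOP_WORDS:
            tokens.append(tok)
    return tokens
```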
      <p>
        TF-IDF score based: TF-IDF (Eq. 1) is a popular IR technique used in many applications [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. It is a statistical weighting measure used to evaluate the importance of a word in a document with respect to the whole collection of documents [6]. The number of occurrences of a term in a document (term frequency, TF) and the inverse document frequency (IDF) are used to calculate the TF-IDF weight. Different variants of the TF and IDF scores are used to calculate the relevance of a document to a user query; in our work, we use the variant in Eq. 1.
      </p>
      <p>Using the TF alone to calculate the scores may give undue weight to non-relevant terms. To dampen the effect of the TF, the IDF score is incorporated. However, a linear IDF function may overly boost the scores of documents containing high-IDF terms; to address this and dampen the effect of a linear IDF function, the logarithm of the IDF (a sub-linear function) is used.</p>
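      <p>A minimal sketch of the weighting in Eq. 1 on an invented toy corpus (the documents and values are illustrative only):</p>

```python
import math

def tf_idf(term, doc, docs):
    """Eq. 1: w = tf (count of term in doc) * log(n / df),
    where n is the collection size and df the document frequency."""
    tf = doc.count(term)
    df = sum(1 for d in docs if term in d)
    if tf == 0 or df == 0:
        return 0.0
    return tf * math.log(len(docs) / df)

docs = [
    ["diabetes", "diet", "diabetes"],
    ["diabetes", "insulin"],
    ["exercise", "diet"],
]
w = tf_idf("diabetes", docs[0], docs)  # 2 * log(3/2)
```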
      <p>
        w<sub>i,d</sub> = tf<sub>i,d</sub> · log(n/df<sub>i</sub>) (1)
      </p>
      <p>
        Word Embedding based: Word embeddings can capture semantic relationships among words, which is a major advantage in IR tasks. Word embeddings rely on the distributional hypothesis, which uses the context of words to derive the word representation [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. The word embedding model architecture used in the task is Skip-gram, which predicts the context words in a window given the target word. Out of the different Skip-gram models, the one with a 500 dimensional embedding space and a context window of size 5 is used in this task.
      </p>
      <p>To represent the documents in the collection with word embeddings, we use the average, minimum, and maximum vectors of the 100 most frequent terms in each document. Along with the average vector representation (Eq. 2), we average the minimum and maximum vectors to obtain a second vector representation (Eq. 3) for the document. Similarly, we obtain the two vector representations (the average vector and the average of the minimum and maximum vectors) for each given query.</p>
      <p>vector representation1<sub>i</sub> = ( Σ<sub>n=1</sub><sup>100</sup> x<sub>n,i</sub> ) / 100 (2)</p>
      <p>vector representation2<sub>i</sub> = ( min vec<sub>i</sub> + max vec<sub>i</sub> ) / 2 (3)</p>
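      <p>The two representations in Eqs. 2 and 3 can be sketched as follows; the toy 2-dimensional term vectors stand in for the 500-dimensional Skip-gram embeddings of a document's 100 most frequent terms:</p>

```python
def average_vector(vectors):
    """Eq. 2: element-wise mean of the term vectors."""
    n = len(vectors)
    return [sum(v[i] for v in vectors) / n for i in range(len(vectors[0]))]

def min_max_vector(vectors):
    """Eq. 3: element-wise (min + max) / 2 of the term vectors."""
    dims = range(len(vectors[0]))
    mins = [min(v[i] for v in vectors) for i in dims]
    maxs = [max(v[i] for v in vectors) for i in dims]
    return [(mn + mx) / 2 for mn, mx in zip(mins, maxs)]

term_vectors = [[0.0, 4.0], [1.0, 0.0], [5.0, 2.0]]  # toy embeddings
rep1 = average_vector(term_vectors)   # [2.0, 2.0]
rep2 = min_max_vector(term_vectors)   # [2.5, 2.0]
```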
      <p>We calculate the similarity for the TF-IDF technique by summing the TF-IDF scores of the query tokens in each document and ranking the documents by score in descending order. For the two vector representations (average vectors, and averages of the minimum and maximum vectors) we calculate the similarity between query and document using cosine similarity and rank the documents by score in descending order. For each query, we retrieve the 1000 most similar documents as results. Table 1 gives the fields in each row of the results file.</p>
      <p>The experiment was performed on a 30 GB subset of the clefehealth B dataset, and only the results from the TF-IDF document retrieval algorithm were submitted for evaluation due to time constraints and computational limitations. Using a subset of the dataset has a considerable impact on the accuracy of the results, since only part of the relevant documents can be retrieved, missing a significant number of other relevant documents in the dataset. Table 2 provides the result scores for the IR task using the TF-IDF technique on the 30 GB (of 294 GB) dataset.</p>
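      <p>The cosine similarity ranking described above can be sketched as follows (standard library only; the document ids and vectors are invented):</p>

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def rank(query_vec, doc_vecs, k=1000):
    """Return the ids of the k documents most similar to the query."""
    scored = sorted(doc_vecs, key=lambda d: cosine(query_vec, d[1]), reverse=True)
    return [doc_id for doc_id, _ in scored[:k]]

docs = [("d1", [1.0, 0.0]), ("d2", [0.7, 0.7]), ("d3", [0.0, 1.0])]
top = rank([1.0, 0.2], docs, k=2)  # d1 is closest to the query, then d2
```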
      <table-wrap id="tbl2">
        <label>Table 2</label>
        <table>
          <thead>
            <tr><th>Evaluation Metric</th><th>Result</th></tr>
          </thead>
          <tbody>
            <tr><td>Mean Average Precision (MAP)</td><td>0.0239</td></tr>
            <tr><td>Precision at 10 (P@10)</td><td>0.426</td></tr>
            <tr><td>Normalized Discounted Cumulative Gain through position 10 (NDCG@10)</td><td>0.3235</td></tr>
            <tr><td>Accuracy of credibility</td><td>0.1744</td></tr>
            <tr><td>Relevance-ranked biased precision (RBP 0.95)</td><td>0.2981 +0.2934</td></tr>
            <tr><td>Credibility-ranked biased precision (cRBP 0.95)</td><td>0.1801 +0.2934</td></tr>
            <tr><td>Understandability-ranked biased precision (uRBP 0.95)</td><td>0.1633 +0.2934</td></tr>
          </tbody>
        </table>
      </table-wrap>
    </sec>
    <sec id="sec-4">
      <title>Discussion</title>
      <p>In this paper, we present our methodology for the adhoc IR subtask of CLEF 2020 using TF-IDF scores and word vector representations. TF-IDF is considered a simple yet effective algorithm that provides an ideal baseline for IR tasks, on which we can build and expand to more sophisticated IR algorithms. Despite these advantages, TF-IDF lacks the use of context information compared to models such as word embedding models, which take context information into account when developing the embeddings for words. If a user query contains "diabetes" as a key word, the TF-IDF algorithm would not consider documents which contain the variation "diabetic" in the IR task. Similarly, the algorithm would not consider documents which contain "diabetec" (a misspelled term), regardless of how relevant they are to the user query. In order to retrieve the most relevant documents using the TF-IDF algorithm, it is therefore vital to preprocess the data before applying the algorithm, which can have a significant effect on the results.</p>
      <p>Word embedding models have been successfully used in many NLP and ML tasks since they consider contextual information when developing the representations for words. However, one of the major limitations of word embedding models is that they cannot distinguish words that are identical in text but differ in meaning (homonyms), creating a single vector representation for such words. This limitation can be avoided by using approaches which produce multi sense embeddings for words.</p>
    </sec>
    <sec id="sec-5">
      <title>Future Work</title>
      <p>In future, we will further improve and expand algorithms for IR, building on the baseline TF-IDF and word embedding models. Moreover, we will expand our current work to incorporate query expansion, which can be used to obtain different forms of the original query and thereby improve the results of the IR task. One popular query expansion technique is synonym identification and substitution, which is mostly done using existing vocabularies. In the medical domain, vocabularies like UMLS (Unified Medical Language System), SNOMED CT (SNOMED Clinical Terms), and OAC-CHV (open-access and collaborative consumer health vocabulary) can be used for query expansion along with word embedding techniques. In addition, we will explore techniques for multi sense embeddings to improve the word embedding based model for IR.</p>
    </sec>
    <sec id="sec-6">
      <title>Acknowledgements</title>
      <p>This research was funded by and has been delivered in partnership with Our
Health in Our Hands (OHIOH), a strategic initiative of the Australian National
University, which aims to transform health care by developing new personalized
health technologies and solutions in collaboration with patients, clinicians and
health-care providers.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Croft</surname>
            ,
            <given-names>W.B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zamani</surname>
          </string-name>
          , H.:
          <article-title>Relevance-based Word Embedding</article-title>
          .
          <source>SIGIR '17: Proceedings of the 40th International ACM SIGIR Conference on Research and Development in Information Retrieval</source>
          (
          <year>2017</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Fautsch</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Savoy</surname>
          </string-name>
          , J.:
          <article-title>Adapting the tf idf Vector-Space Model to Domain Specific Information Retrieval</article-title>
          .
          <source>SAC '10: Proceedings of the 2010 ACM Symposium on Applied Computing</source>
          (
          <year>2010</year>
          ), http://www.lucene.apache.org/
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Goeuriot</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Suominen</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kelly</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Liu</surname>
            ,
            <given-names>Z.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Pasi</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gonzales</surname>
            ,
            <given-names>G.S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Viviani</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Xu</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          :
          <article-title>Overview of the CLEF eHealth 2020 task 2: Consumer health search with ad hoc and spoken queries</article-title>
          . In: Working Notes of Conference and
          <article-title>Labs of the Evaluation (CLEF) Forum</article-title>
          . CEUR Workshop Proceedings (
          <year>2020</year>
          )
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>