<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>KISTI at CLEF eHealth 2016 Task 3: Ranking Medical Documents using Word Vectors</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Heung-Seon Oh</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Yuchul Jung</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Korea Institute of Science and Technology Information</institution>
        </aff>
      </contrib-group>
      <abstract>
        <p>Searching for relevant medical information has become very common as the general public uses the Web as a source of health information. In response to this phenomenon, there have been a number of approaches to finding useful information for diagnosing or understanding health conditions on the Web or in the medical literature. As an ongoing effort to deliver useful medical information, we attempted two different approaches using word vectors learned by Word2Vec on Wikipedia. First, initial documents are obtained using a search engine. Based on the retrieved documents, pseudo-relevance feedback is applied with two different usages of the word vectors. In the first approach, a feedback model is constructed from new relevance scores computed with the word vectors, while in the second it is constructed from an expanded query.</p>
      </abstract>
      <kwd-group>
        <kwd>medical information retrieval</kwd>
        <kwd>language models</kwd>
        <kwd>pseudo relevance feedback</kwd>
        <kwd>word vectors</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>Laypeople use the Web to acquire medical information such as symptoms, diagnoses,
treatments, diseases, and hospitals. Unfortunately, they may fail to find relevant
information due to the difficulty of representing their information needs. This happens because they
are often not only unfamiliar with medical terminology but also uncertain about their
exact questions. To mitigate this problem, CLEF eHealth [2, 4] aims to support
laypeople in finding and understanding medical documents on the Web by leveraging
medical text processing techniques.</p>
      <p>CLEF 2016 eHealth [3] continues this effort with the same purpose. We
participate in Task 3 (patient-centred information retrieval), which focuses on evaluating the
effectiveness of medical information retrieval on the Web [10]. This task utilizes a vast
Web document collection, ClueWeb12-B, while the previous tasks employed about
1M Web documents collected from several health-related web sites. In this paper, we
propose two different approaches that use word vectors obtained from Word2Vec to
perform pseudo-relevance feedback.</p>
    </sec>
    <sec id="sec-2">
      <title>Method</title>
      <sec id="sec-2-1">
        <title>Ranking framework</title>
        <p>Our method ranks medical documents using word vectors constructed from a
medical resource, specifically medical Wikipedia. The aim of using the word vectors is to
capture the information need of a query properly. For a query Q, a set of documents
S = {D_1, D_2, …, D_|S|} is retrieved from a collection C using a search engine. As the
retrieval model, the query-likelihood method with Dirichlet smoothing (QLD) is chosen [8].
Based on S, pseudo-relevance feedback (PRF) using the word vectors is performed to
re-rank the documents in S with a feedback model. In this step, the word vectors are
adopted in two different ways. In the first approach, they are used to compute
relevance scores score_WV(Q, D) between Q and D, while, in the second approach, they
are used to directly expand Q to Q_WV by adding words that do not appear in Q.
For each approach, final scores are computed by the KL-divergence method with a
feedback model constructed using score_WV(Q, D) or Q_WV.</p>
        </sec>
        <sec id="sec-2-2">
          <title>Basic Foundation</title>
          <p>The KL-divergence method (KLD) is adopted to compute a relevance score between Q and
D by estimating language models [5, 7, 9], because it offers a principled way to incorporate
feedback information into a query in PRF:</p>
          <p>score(Q, D) = exp(−KL(θ_Q ‖ θ_D)) = exp(−∑_w p(w|θ_Q) log [p(w|θ_Q) / p(w|θ_D)]) (1)</p>
          <p>where θ_Q and θ_D are the query and document unigram language models, respectively.</p>
          <p>A query model is estimated by maximum likelihood estimation (MLE), as shown below:</p>
          <p>p(w|θ_Q) = c(w, Q) / |Q| (2)</p>
          <p>where c(w, Q) is the count of a word w in query Q and |Q| is the number of words in Q.</p>
          <p>A document model is estimated using Dirichlet smoothing to improve retrieval
performance [8]:</p>
          <p>p(w|θ_D) = (c(w, D) + μ · p(w|C)) / (∑_w c(w, D) + μ) (3)</p>
          <p>where c(w, D) is the count of a word w in document D, p(w|C) is the probability of
a word w in collection C, and μ is the Dirichlet prior parameter.</p>
          <p>Pseudo-relevance feedback (PRF) is a popular query expansion approach to update
a query. It assumes that the top-ranked documents F = {D_1, D_2, …, D_|F|} are relevant to a
given query and that the words in F are useful for revealing hidden information needs. A
relevance model (RM) is a multinomial distribution p(w|Q), the likelihood of a
word w given a query Q based on F. The first version of the relevance model (RM1) is
defined as follows:</p>
          <p>p_RM1(w|Q) = ∑_{D∈F} p(w|θ_D) p(θ_D|Q) = ∑_{D∈F} p(w|θ_D) p(Q|θ_D) p(θ_D) / p(Q) ∝ ∑_{D∈F} p(w|θ_D) p(θ_D) p(Q|θ_D) (4)</p>
          <p>RM1 is composed of three components: the document prior p(θ_D), the document
weight p(Q|θ_D), and the term weight in a document p(w|θ_D). In general, p(θ_D) is
assumed to have a uniform distribution without prior knowledge of document D. p(Q|θ_D) =
∏_{q∈Q} p(q|θ_D)^c(q,Q) indicates the query-likelihood score.</p>
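          <p>A minimal Python sketch of the scoring in Equations 1–3; the toy term counts, the uniform background model, and the μ value are illustrative assumptions, not the paper's settings:</p>

```python
import math
from collections import Counter

def dirichlet_doc_model(doc_counts, coll_prob, mu=2000.0):
    """Equation 3: Dirichlet-smoothed document model p(w|theta_D)."""
    doc_len = sum(doc_counts.values())
    return lambda w: (doc_counts.get(w, 0) + mu * coll_prob(w)) / (doc_len + mu)

def kld_score(query, doc_counts, coll_prob, mu=2000.0):
    """Equation 1: score(Q, D) = exp(-KL(theta_Q || theta_D))."""
    q_counts = Counter(query)
    p_d = dirichlet_doc_model(doc_counts, coll_prob, mu)
    kl = 0.0
    for w, c in q_counts.items():
        p_q = c / len(query)                # Equation 2: MLE query model
        kl += p_q * math.log(p_q / p_d(w))  # per-word KL contribution
    return math.exp(-kl)

# Toy data: a uniform background model over a 1,000-word vocabulary.
coll_prob = lambda w: 1.0 / 1000
doc = Counter({"diabetes": 3, "treatment": 2, "insulin": 1})
print(kld_score(["diabetes", "treatment"], doc, coll_prob))
```

          <p>Because p(w|θ_Q) is fixed for a given query, ranking by exp(−KL) is equivalent to ranking by query likelihood, which is why QLD can produce the initial ranking.</p>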
        <p>Finally, a new query model is estimated by combining the original query model and
RM1. Documents are re-scored and re-ranked using the new query model. RM3 [1] is
a variant of the relevance model, used here to estimate a new query model with
RM1:</p>
        <p>p(w|θ_Q′) = (1 − λ) · p(w|θ_Q) + λ · p_RM1(w|Q) (5)</p>
        <p>where λ is a parameter controlling the balance between the original query model and the feedback
model.</p>
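        <p>Equations 4 and 5 can be sketched as below; the feedback documents, the lightly smoothed document model, and the interpolation weight lam are illustrative assumptions:</p>

```python
from collections import Counter

def rm1(query, feedback_docs, doc_model):
    """Equation 4: p_RM1(w|Q) proportional to
    sum over D in F of p(w|theta_D) * p(theta_D) * p(Q|theta_D),
    with a uniform document prior p(theta_D)."""
    weights = {}
    for d in feedback_docs:
        ql = 1.0
        for q in query:            # p(Q|theta_D): query likelihood
            ql *= doc_model(d, q)
        weights[d] = ql
    vocab = {w for d in feedback_docs for w in d}
    rm = Counter()
    for d in feedback_docs:
        for w in vocab:
            rm[w] += doc_model(d, w) * weights[d]
    z = sum(rm.values())
    return {w: v / z for w, v in rm.items()}  # normalize to a distribution

def rm3(query_model, rm1_model, lam=0.5):
    """Equation 5: p(w|theta_Q') = (1-lam)*p(w|theta_Q) + lam*p_RM1(w|Q)."""
    vocab = set(query_model) | set(rm1_model)
    return {w: (1 - lam) * query_model.get(w, 0.0) + lam * rm1_model.get(w, 0.0)
            for w in vocab}

# Toy feedback documents as word tuples, with a lightly smoothed MLE model.
docs = [("diabetes", "insulin", "treatment"), ("diabetes", "diet")]
def doc_model(d, w):
    return (d.count(w) + 0.01) / (len(d) + 0.1)

fb = rm1(("diabetes",), docs, doc_model)
new_q = rm3({"diabetes": 1.0}, fb, lam=0.5)
```

        <p>Words that co-occur with the query across the feedback documents (here, "insulin" and "diet") receive probability mass in the new query model even though they never appear in the query itself.</p>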
        <sec id="sec-2-1-1">
          <title>Word Vectors</title>
          <p>Word2Vec [6] learns a vector representation for a word using a neural network
language model. The resulting vector representations for words (i.e., word vectors) can be
used in various tasks because a word is represented by a small-size vector. Learning the
word vectors is entirely unsupervised and it can be computed on the text corpus
according to purposes.</p>
          <p>In our approach, Wikipedia was chosen as the input to train Word2Vec. We
assumed that non-medical pages are not useful for learning medical-related word vectors.
Therefore, we focused on medical pages by filtering out non-medical pages. To this end,
categories were first collected from a root to leaves. We set
Health/Diseases_and_disorders and Health/Health_care/Medicine as the roots because it is assumed that general
medical queries seek information about diseases and treatments. This
filtering procedure produced 7,672 categories. Then, all pages associated with those
categories were used as input. The details of the medical Wikipedia pages are summarized in
Table 1.</p>
          <p>In the first approach, a similarity sim_WV(Q, D) between Q and D is computed using
the word vectors. To do that, cosine similarity is computed between Q and D by averaging
their associated word vectors, respectively. Then, a new relevance score score_WV(Q, D) is
computed by multiplying score(Q, D) and sim_WV(Q, D), and PRF is performed with
score_WV(Q, D). In detail, p_RM1(w|Q) is estimated in Equation 4 with score_WV(Q, D).
In Equation 5, p(w|θ_Q′) is constructed by combining p_RM1(w|Q) and p(w|θ_Q).
Finally, re-ranking is performed with p(w|θ_Q′) using Equation 1.</p>
          <p>In the second approach, a query Q is directly expanded to Q_WV using the word
vectors. To do that, the average word vector over all query words is computed. Then,
cosine similarity is computed between this average vector and the word vector of each
candidate word w. The top-5 words with the highest cosine similarity that do not appear
in Q are chosen and added to Q_WV. Then, PRF is performed with Q_WV using
Equations 1, 4, and 5.</p>
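          <p>Both uses of the word vectors reduce to cosine similarity over averaged vectors. The sketch below uses toy 3-dimensional vectors in place of the 200-dimensional Word2Vec output; the vocabulary and values are invented for illustration:</p>

```python
import math

def avg_vector(words, vectors):
    """Average the word vectors of all in-vocabulary words."""
    vecs = [vectors[w] for w in words if w in vectors]
    n = len(vecs)
    return [sum(v[i] for v in vecs) / n for i in range(len(vecs[0]))]

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def sim_wv(query, doc, vectors):
    """First approach: cosine between averaged query and document vectors;
    the new score is score_WV(Q, D) = score(Q, D) * sim_WV(Q, D)."""
    return cosine(avg_vector(query, vectors), avg_vector(doc, vectors))

def expand_query(query, vectors, k=5):
    """Second approach: add the top-k most similar words not already in Q."""
    q_vec = avg_vector(query, vectors)
    cands = [(cosine(q_vec, v), w) for w, v in vectors.items() if w not in query]
    return list(query) + [w for _, w in sorted(cands, reverse=True)[:k]]

# Toy vectors (illustrative; the paper trains 200-d vectors on medical Wikipedia).
vectors = {"diabetes": [0.9, 0.1, 0.0], "insulin": [0.8, 0.2, 0.1],
           "glucose": [0.7, 0.3, 0.0], "football": [0.0, 0.1, 0.9]}
print(expand_query(["diabetes"], vectors, k=2))
```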
        </sec>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>Experiments</title>
      <p>This task used the ClueWeb12-Disk-B (ClueWeb12B) collection, which contains about
50M pages. The text of the pages was extracted by removing HTML tags with the JSOUP
parser (https://jsoup.org/). Table 2 shows a summary of data statistics of ClueWeb12B.
Lucene (http://lucene.apache.org/) was exploited to index the collection and retrieve the
initial documents S. For text processing, stop-words were removed using the 419 stop-words
of the INQUERY list (http://sourceforge.net/p/lemur/galago/ci/default/tree/core/src/main/resources/stopwords/inquery).
|S| was set to 2500 and obtained using QLD.</p>
      <p>To generate the word vectors, a Java version of Word2Vec
(https://github.com/medallia/Word2VecJava) was used. The CBOW architecture was used with
200-dimensional word vectors. For input, we removed all punctuation and lowercased words
without removing stop-words.</p>
      <p>We submitted three runs for this task. Run1 is our baseline, while the other two runs are our
proposed approaches using the word vectors. Run2 is PRF with new relevance scores
using the word vectors. Run3 is PRF with an expanded query using the word vectors.</p>
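      <p>The input preparation described above (strip punctuation, lowercase, keep stop-words) can be sketched as follows; the example sentence is invented, and the resulting token lists would then be fed to a CBOW Word2Vec trainer such as the Java port cited above:</p>

```python
import re

def preprocess(text):
    """Word2Vec input as described: remove punctuation, lowercase,
    and keep stop-words (they are NOT removed for vector training)."""
    text = re.sub(r"[^\w\s]", " ", text.lower())  # punctuation -> spaces
    return text.split()

# Hypothetical sentence; the real input is the filtered medical Wikipedia pages.
print(preprocess("Diabetes mellitus, often called diabetes, is a disease."))
```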
      <sec id="sec-3-1">
        <title>1 https://jsoup.org/</title>
      </sec>
      <sec id="sec-3-2">
        <title>2 http://lucene.apache.org/</title>
        <p>3 http://sourceforge.net/p/lemur/galago/ci/default/tree/core/src/main/resources/stopwords/inquer
y</p>
      </sec>
      <sec id="sec-3-3">
        <title>4 https://github.com/medallia/Word2VecJava</title>
        <p>Run
1
2</p>
        <sec id="sec-3-3-1">
          <title>Description</title>
          <p>Scoring by KLD with RM1
Scoring by KLD with RM1 using 
Scoring by KLD with RM1 using</p>
        </sec>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          <string-name>
            <surname>Abdul-Jaleel</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          et al.:
          <source>UMass at TREC</source>
          <year>2004</year>
          :
          <article-title>Novelty and HARD</article-title>
          .
          <source>In: Proceedings of Text REtrieval Conference (TREC)</source>
          .
          (
          <year>2004</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          <string-name>
            <surname>Goeuriot</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          et al.:
          <article-title>Overview of the CLEF eHealth Evaluation Lab 2015</article-title>
          .
          <source>In: CLEF 2015 - 6th Conference and Labs of the Evaluation Forum. Lecture Notes in Computer Science (LNCS)</source>
          , Springer (
          <year>2015</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          <string-name>
            <surname>Kelly</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          , Goeuriot, L., Suominen, H., Névéol, A., Palotti, J., Zuccon, G.:
          <article-title>Overview of the CLEF eHealth Evaluation Lab 2016</article-title>
          .
          <source>In: CLEF 2016 - 7th Conference and Labs of the Evaluation Forum</source>
          . Springer (
          <year>2016</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          <string-name>
            <surname>Kelly</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          et al.:
          <source>Overview of the ShARe/CLEF eHealth Evaluation Lab</source>
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          <source>In: Proceedings of CLEF 2014</source>
          . Springer (
          <year>2014</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          <string-name>
            <surname>Kurland</surname>
            ,
            <given-names>O.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lee</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          :
          <article-title>PageRank without hyperlinks: Structural re-ranking using links induced by language models</article-title>
          .
          <source>In: Proceedings of the 28th annual international ACM SIGIR conference on Research and development in information retrieval - SIGIR '05</source>
          . pp.
          <fpage>306</fpage>
          -
          <lpage>313</lpage>
          ACM Press, New York, New York, USA (
          <year>2005</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          <string-name>
            <surname>Mikolov</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          et al.:
          <article-title>Efficient Estimation of Word Representations in Vector Space</article-title>
          .
          <source>In: Proceedings of the International Conference on Learning Representations (ICLR</source>
          <year>2013</year>
          ). pp.
          <fpage>1</fpage>
          -
          <lpage>12</lpage>
          (
          <year>2013</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          <string-name>
            <surname>Oh</surname>
            ,
            <given-names>H.-S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Jung</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          :
          <article-title>Cluster-based query expansion using external collections in medical information retrieval</article-title>
          .
          <source>J. Biomed. Inform</source>
          .
          <volume>58</volume>
          ,
          <fpage>70</fpage>
          -
          <lpage>79</lpage>
          (
          <year>2015</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          <string-name>
            <surname>Zhai</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lafferty</surname>
            ,
            <given-names>J.:</given-names>
          </string-name>
          <article-title>A study of smoothing methods for language models applied to information retrieval</article-title>
          .
          <source>ACM Trans. Inf. Syst</source>
          .
          <volume>22</volume>
          ,
          <issue>2</issue>
          ,
          <fpage>179</fpage>
          -
          <lpage>214</lpage>
          (
          <year>2004</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          <string-name>
            <surname>Zhai</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lafferty</surname>
          </string-name>
          , J.:
          <article-title>Model-based feedback in the language modeling approach to information retrieval</article-title>
          .
          <source>In: Proceedings of the tenth international conference on Information and knowledge management</source>
          . pp.
          <fpage>403</fpage>
          -
          <lpage>410</lpage>
          ACM, New York, New York, USA (
          <year>2001</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          <string-name>
            <surname>Zuccon</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          et al.:
          <source>The IR Task at the CLEF eHealth Evaluation Lab</source>
          <year>2016</year>
          :
          <article-title>User-centred Health Information Retrieval</article-title>
          . In:
          <article-title>CLEF 2016 Evaluation Labs</article-title>
          and Workshop: Online Working Notes. (
          <year>2016</year>
          ).
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>