CUNI team: CLEF eHealth Consumer Health Search Task 2018

Shadi Saleh and Pavel Pecina
Charles University, Faculty of Mathematics and Physics
Institute of Formal and Applied Linguistics, Czech Republic
{saleh,pecina}@ufal.mff.cuni.cz

Abstract. In this paper, we present our participation in the CLEF Consumer Health Search Task 2018, namely its monolingual and multilingual subtasks: IRTask1 and IRTask4. In IRTask1, we build our runs using a language-model-based retrieval model, a vector-space model, and a Kullback-Leibler divergence query expansion mechanism. In IRTask4, we submitted 4 runs for each of Czech, French and German. We follow the query-translation approach, in which a Statistical Machine Translation (SMT) system produces a ranked list of translation hypotheses in English. We use this list in two systems: the first uses the 1-best translation to construct queries, and the second uses a hypotheses reranker to select the best translation (in terms of retrieval performance) to construct queries. We also present our term reranking model for query expansion, in which we exploit a feature set drawn from different resources (the document collection, Wikipedia articles, translation hypotheses). These features are used to train a logistic regression model that predicts the performance obtained when a candidate term is added to a base query.

Keywords: Multilingual information retrieval, statistical machine translation, hypotheses reranking, term reranking

1 Introduction

Internet searches for medical topics have been increasing in recent years and have attracted the attention of information retrieval researchers. Fox [3] reported that about 80% of Internet users in the United States look for medical information online. A main challenge for medical information retrieval systems is that people with different levels of expertise express their information needs in different ways [14]. Laypeople express their medical information needs using non-medical terms, while medical experts tend to use advanced medical terminology; retrieval systems therefore need to be robust to such query variations. The significant increase of non-English digital content on the World Wide Web has been accompanied by an increase in users searching for this content. Grefenstette and Nioche [8] estimated the amount of web content per language in 1996, late 1999 and early 2000. Their study showed that over this period English content grew by 800%, German by 1,500%, and Spanish by 1,800%. Furthermore, users have started to search for information contained in documents that are not available in their native languages. A system that searches for information in a language different from that of the user is called a Cross-Lingual (multilingual) Information Retrieval (CLIR) system. It lets users formulate queries (information needs) in one language (language A) and returns results from a document collection written in a different language (language B). A common baseline in CLIR is to take the 1-best translations returned by a statistical machine translation (SMT) system and perform monolingual retrieval, as in previous CLEF eHealth Information Retrieval tasks [6]. Nikoulina et al. [10] presented an approach to CLIR based on reranking the translation hypotheses produced by the SMT system.
Saleh and Pecina [20] took Nikoulina's work as a starting point and extended it with a rich set of features for training. Their approach translates queries from Czech, French and German into English and reranks the alternative translations to predict the hypothesis that gives the best CLIR performance. In this paper, we describe our participation in the CLEF 2018 eHealth Consumer Health Search task [23], focusing on the multilingual IR task. We present our machine learning model, which reranks the alternative translations produced by the machine translation system towards better IR results. We also present our new approach to expanding translated queries using the same kind of machine learning model.

2 Task Description

The CLEF eHealth Consumer Health Search Task 2018 [9] is similar to the IR tasks of previous years (2013–2017). Participants this year are required to retrieve relevant web pages from the provided document collection in response to users' queries, which represent information needs in the medical domain. The IR task consists of IRTask1, a standard ad-hoc monolingual search task. IRTask2 is similar to the 2017 personalised search task [16, 7]: the retrieved documents are personalised to match user expertise (how likely the user is to understand the content of the retrieved documents). IRTask3 provides query variations for the same information need, and participants have to design a search system that remains stable when the same information need is expressed through different query variations. In the multilingual ad-hoc search task (IRTask4), the monolingual English queries were translated by experts into Czech, French and German, and participants are asked to design a search system that retrieves documents relevant to these queries from the English document collection.

2.1 Document Collection

The document collection for the CLEF 2018 Consumer Health Search task was created using the CommonCrawl platform (http://commoncrawl.org/). First, the query set (described in Section 2.2) was submitted to the Microsoft Bing APIs, and a list of domains was extracted from the top retrieved results. This list was extended by adding reliable health websites; after excluding non-medical websites such as news sites, the final list, clefehealth2018 B (which we use in this work), contained 1,653 sites. These domains were then crawled and provided to the participants as an indexed collection. Two indexes are provided: in the first one, documents are stemmed and a stop-word list is applied, while no preprocessing is done in the second one. The collection contains 5,560,074 documents; the stemmed index contains 14,213,903 unique terms, while the non-stemmed index contains 15,298,904.

2.2 Queries

The query set this year consists of 50 English queries. This set is a subset of 150 medical queries created from HON and TRIP query logs within the Khresmoi project [4]. Table 1 shows the average number of terms in the 50 test queries in all languages. Although the average number of terms in the English queries is 5.64, some queries are much longer (e.g. query 199001), as shown in Table 2. Queries might contain typos since they are constructed from real query logs, as in query 175001, which contains Emugel instead of Emulgel.

Table 1. The average number of terms in the query test set of the CLEF eHealth 2018 IR task

  EN     CS     FR     DE
  5.64   5.28   6.08   4.62
Table 2. Query samples from the English query test set of the CLEF eHealth 2018 IR task

  id       title
  156001   food allergy test
  168001   hiv vaccine phase
  175001   Voltaren Emugel 1%
  199001   why is there a minimum drinking age and what are the consequences of underage drinking ?
  200001   feeling of fullness with hiccups with a feeling of a lump in the back of the throat

3 The training data

The data we use to train our systems was introduced in CLEF eHealth 2014 Task 3 - Information Retrieval [5] and CLEF eHealth 2015 Task 2: User-Centred Health Information Retrieval [15]. It is almost identical to the collection used in CLEF eHealth 2013 Task 3 - User-Centred Health Information Retrieval, which contained a few additional documents that were excluded from the 2014/2015 collection due to license issues. The document collection includes a total of 1,104,298 web pages in HTML, automatically crawled from various English medical websites such as Genetics Home Reference, ClinicalTrials.gov and Diagnosia. To clean the HTML pages in the collection, we follow the work of Saleh and Pecina [19]. The queries have also been adopted from the CLEF eHealth series and include all the test queries from the IR tasks of 2013 (50 queries), 2014 (50 queries), and 2015 (66 queries). We joined them to create a more representative and balanced sample for IR experiments. The set of all 166 queries was split into 100 queries for training and 66 queries for testing. The two sets are stratified with respect to the year of origin, the number of relevant/non-relevant documents, and query length (number of words).

4 Methods

4.1 Translation system

For the multilingual task (IRTask4), we follow the query-translation approach, in which a query is translated into the collection language (English) before retrieval is conducted. This approach reduces the task to a monolingual one (both queries and documents are expressed in the same language). We use the Khresmoi statistical machine translation (SMT) system [2] for the language pairs Czech-English, French-English and German-English to translate the queries into English. The Khresmoi SMT system was trained to translate queries and tuned on parallel and monolingual data taken from medical-domain resources such as Wikipedia, UMLS concept descriptions and the UMLS Metathesaurus. Such domain-specific data makes Khresmoi perform better when translating medical-domain sentences such as the queries in our case. Feature weights in SMT systems are generally tuned towards BLEU [17], an automatic evaluation metric for SMT that correlates with human judgments. However, general translation quality does not necessarily correlate with CLIR performance [18]; therefore, the Khresmoi SMT system was tuned using MERT [12] towards PER (position-independent word error rate), which does not penalise word reordering, since word order is not important for the performance of IR systems.

4.2 Hypotheses reranking

For each sentence in the source language, the Khresmoi SMT system produces a ranked list of translations in the target language, called the n-best list. However, this list is ranked by translation quality rather than retrieval performance. Saleh and Pecina [20] presented an approach to rerank the n-best list and predict the translation that gives the best retrieval performance in terms of P@10.
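As a minimal sketch of the input to this step: the Khresmoi system is built on the Moses toolkit [2], so we assume here a Moses-style n-best file in which each line has the form "query_id ||| hypothesis ||| feature scores ||| total score" (the exact output format of the system we used may differ). The snippet reads such a file and extracts the 1-best translation per query, which is what the baseline CLIR run uses as its English query.

```python
# Sketch: parse an n-best list and take the 1-best translation as the baseline query.
# Assumes a Moses-style format "query_id ||| hypothesis ||| feature scores ||| total score".
from collections import defaultdict

def read_nbest(path, n=15):
    """Return {query_id: [up to n hypotheses, best-scored first]}."""
    nbest = defaultdict(list)
    with open(path, encoding="utf-8") as f:
        for line in f:
            qid, hyp, _feats, _score = [field.strip() for field in line.split("|||")]
            if len(nbest[qid]) < n:
                nbest[qid].append(hyp)
    return nbest

def one_best_queries(nbest):
    """Baseline: the top-ranked (1-best) hypothesis becomes the English query."""
    return {qid: hyps[0] for qid, hyps in nbest.items() if hyps}
```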
The reranker is a generalised linear regression model that uses a set of features which can be divided according to their source into:
1) The SMT system: features derived from the verbose output of the Khresmoi SMT system (e.g. the phrase translation model, the target language model, the reordering model and the word penalty).
2) Document collection: IDF scores and features based on the blind relevance feedback approach.
3) External resources: resources such as Wikipedia articles and the UMLS Metathesaurus [22], employed to create a rich set of features for each query hypothesis.
4) Retrieval status value (RSV): the score of the retrieval scoring function when a query is constructed from a translation hypothesis. It brings additional information from the collection into the reranking process by assigning to each hypothesis its retrieval score. This feature is based on the work of Nottelmann and Fuhr [11], who investigated the correlation between RSV and the probability of relevance.

To train the model, we join the training and test sets presented in Section 3 into one set, calculate feature values for each language, and merge the instances from all seven languages into one training set. The test set is the CLEF eHealth 2018 query set in Czech, French and German.
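The reranking step itself reduces to fitting a regression model on hypothesis feature vectors with observed P@10 as the target, and selecting the hypothesis with the highest predicted P@10 at test time. The sketch below illustrates this under simplifying assumptions: feature extraction is done elsewhere, and scikit-learn's ordinary least-squares regression stands in for the generalised linear model; it is an illustration, not our actual implementation.

```python
# Sketch: fit a linear model to predict P@10 from hypothesis features,
# then pick the hypothesis whose predicted P@10 is highest.
import numpy as np
from sklearn.linear_model import LinearRegression

def train_reranker(train_features, train_p10):
    """train_features: (n_hypotheses, n_features); train_p10: observed P@10 per hypothesis."""
    return LinearRegression().fit(np.asarray(train_features), np.asarray(train_p10))

def select_best_hypothesis(hypotheses, features, model):
    """Return the translation hypothesis with the highest predicted P@10."""
    predicted = model.predict(np.asarray(features))
    return hypotheses[int(np.argmax(predicted))]
```

At query time, the feature vectors of all hypotheses for a query would be passed to select_best_hypothesis, and the returned hypothesis would be used as the English query.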
4.3 Query Expansion

Query expansion is a process that reformulates the user's initial query in order to represent the information need more completely and eventually improve retrieval performance. In this section, we present our approach to reformulating the user's query in the CLIR task using a machine learning model, based on the work of Saleh and Pecina [21]. The approach expands a query with candidate terms from a pool: the candidate terms are reranked by a machine learning model towards better IR performance, and the top-ranked terms are added to the original query. To create a pool of candidate terms for each query, we use two main resources:

– Translation hypotheses: this pool is built by merging the n-best-list translations for each query, after filtering out stopwords and terms that already appear in the 1-best translation.
– Wikipedia titles: first, we index English Wikipedia articles (titles and abstracts, without any preprocessing) using Terrier [13] and its implementation of the Dirichlet language model as the IR model; we then retrieve documents from this index using each query's 1-best translation, select the top 10 ranked Wikipedia articles and add their titles to the pool.

To train the model, we use the training data presented in Section 3; for testing, we use the queries provided for the CLEF eHealth 2018 IR task in Czech, French and German. After building a pool of candidate terms, we generate the following features for each term:

– IDF: the inverse document frequency of the term, calculated over the document collection.
– Translation pool frequency: how many times the term appears in the translation pool. The more hypotheses a term appears in, the more likely it is a relevant translation of one of the terms in the original query.
– Wikipedia frequency: the frequency of the term in the top 10 retrieved Wikipedia articles. Retrieval is conducted using the 1-best translation of the query we want to expand with the candidate term.
– Retrieval Status Value difference: to calculate this feature, we conduct two retrievals, the first using the original query (the 1-best translation) and the second using the original query expanded with the candidate term; we then take the score of the highest-ranked document in each retrieval and compute the difference between them. This feature captures the contribution of the candidate term to the retrieval status value.
– Similarity: to calculate the similarity between a candidate term t_m and the query terms, we use word2vec embeddings trained on 25 million articles from PubMed (https://www.ncbi.nlm.nih.gov/CBBresearch/Wilbur/IRET/DATASET/). First, we take the word embedding of each term in the original query and sum these embeddings to obtain a vector representing the entire query. We then take the embedding of t_m and calculate the cosine similarity between the query vector and the t_m vector.
– Co-occurrence frequency: the co-occurrence of a candidate term t_m with the query terms t_i ∈ Q indicates how likely t_m is related to the original query Q. We sum the co-occurrence frequencies of each query term and the candidate term t_m over all documents d_j in the collection C, as shown in Equation 1.

  co(t_m, Q) = Σ_{d_j ∈ C, t_i ∈ Q} tf(d_j, t_i) · tf(d_j, t_m)    (1)

– Term frequency: first, we retrieve documents from the collection using a query constructed from the 1-best translation; we then compute the frequency of the candidate term t_m in the top 10 ranked documents of the result.
– Medical term count: how many times the term appears in the UMLS lexicon, as an attempt to give more weight to medical terms.

Our goal is to design a model that can predict the retrieval performance obtained when a query is expanded with a term from the pool, and to add the terms that improve performance. To train the model, we perform the following steps:

– Generate a pool of candidate terms for each query in the training and test sets.
– Add one term from the pool to the query we want to expand (the 1-best translation) and perform retrieval using the baseline system (Dirichlet model).
– Calculate the feature values for each term as described above.
– For training queries, evaluate the performance of each expanded query using P@10 as the main metric; P@10 is the objective value for our model.
– Merge the training queries from the seven languages together to enrich the training set with more instances.
– After preparing the training set, normalise the feature values using standard scaling, i.e. removing the mean and scaling to unit variance. This is done independently for each feature; the fitted scaler is then used to standardise the test set. Scaling is important since the ranges of the feature values vary widely.

The term reranker is a generalised linear regression model that predicts the P@10 value obtained when the original query is expanded with a given term; we choose the term with the highest predicted P@10.
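To make this concrete, the sketch below shows how two of the features (the co-occurrence feature of Equation 1 and the word2vec similarity feature) and the final selection step could be implemented. The term-frequency index, the embedding model, the fitted scaler and the trained regression model are assumed to exist already; all names are illustrative and this is not our actual implementation.

```python
# Sketch: two candidate-term features and the expansion step.
import numpy as np

def cooccurrence(term, query_terms, tf):
    """Equation 1: sum of tf(d_j, t_i) * tf(d_j, t_m) over documents and query terms.
    tf is a dict of dicts: tf[document_id][token] -> term frequency."""
    return sum(doc.get(t_i, 0) * doc.get(term, 0)
               for doc in tf.values() for t_i in query_terms)

def similarity(term, query_terms, w2v):
    """Cosine similarity between the candidate-term vector and the summed query vector.
    All terms are assumed to be present in the embedding vocabulary."""
    query_vec = np.sum([w2v[t] for t in query_terms], axis=0)
    term_vec = np.asarray(w2v[term])
    return float(np.dot(query_vec, term_vec) /
                 (np.linalg.norm(query_vec) * np.linalg.norm(term_vec)))

def expand_query(query, candidate_terms, feature_matrix, scaler, model):
    """Append the candidate term with the highest predicted P@10 to the base query."""
    predicted_p10 = model.predict(scaler.transform(feature_matrix))
    best_term = candidate_terms[int(np.argmax(predicted_p10))]
    return query + " " + best_term
```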
5 Systems

We submitted runs for the monolingual task (IRTask1) and the multilingual task (IRTask4), as presented in the following sections.

5.1 Monolingual system

In the monolingual task, we submitted four runs:

– Run 1: we use the Terrier index provided by the organisers without any data preprocessing. Terrier's implementation of the Dirichlet-smoothed language model is used as the retrieval model with its default parameters.
– Run 2: this system uses the same retrieval model as Run 1, but as the index it uses the Terrier index built with Porter stemming and an English stop-word list.
– Run 3: this system uses Terrier's implementation of the TF-IDF model, in order to compare a vector-space model with the language model used in Run 1; it uses the same index as Run 1.
– Run 4: in this run, we use Terrier's implementation of Kullback-Leibler divergence (KLD) [1] for query expansion, with the number of top documents set to 10 and the number of expansion terms set to 3. These 3 terms are selected as follows: first, an initial retrieval is done using the base query and the top 10 documents are chosen as pseudo-relevant documents. Each term in these documents is then scored as shown in Equation 2, where P_r(t) is the probability of term t in the pseudo-relevant documents (treated as a bag of words), and P_c(t) is the probability of term t in the document collection c. Finally, the top 3 scored terms are added to the base query and a final retrieval is done using the new expanded query.

  Score(t) = P_r(t) · log(P_r(t) / P_c(t))    (2)

5.2 Cross-lingual system

– Run 1: in this run, we translate the queries in the source languages into English and take the 1-best translations. Retrieval is conducted using the Dirichlet model and the non-stemmed index. The same retrieval settings are used in the following runs.
– Run 2: this run uses the hypotheses reranking approach, in which each query is translated into English and, from the 15-best translations, the one predicted to give the best retrieval quality is selected, as described in Section 4.2.
– Run 3: first we translate the queries into English and take the 1-best translation produced by the SMT system as the base query; this query is then expanded by one term using the term reranking approach presented in Section 4.3.
– Run 4: this run is similar to Run 1, the only difference being that Google Translate (translate.google.com) is used to translate the queries into English.

Table 3. Similarity (in percent) of the top 10 retrieved documents between the submitted runs in IRTask4

  runs        CS      DE      FR
  run1-run2   48.80   50.20   55.60
  run1-run3   38.00   26.60   35.20
  run1-run4   52.80   54.40   62.80
  run2-run3   32.20   22.00   33.80
  run2-run4   46.20   43.20   43.40
  run3-run4   27.80   14.40   28.00

Table 3 shows, for each pair of runs, the percentage of documents shared among the top 10 retrieved. The table makes it clear that different approaches tend to retrieve different documents. For example, Run 3 uses a query-expansion-based approach: the query is extended with additional terms to include more information, which leads to retrieving different documents, and this is why this run has the lowest similarity to the other runs. Runs 1 and 4 both use the 1-best translation, from two different machine translation systems (Khresmoi and Google Translate, respectively), to construct the queries; this explains why these two runs share more retrieved documents than any other pair. Run 2 uses the hypotheses reranking approach to select the best translation for retrieval, while Run 1 constructs queries from the 1-best translation as ranked by the SMT system.
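The overlap figures in Table 3 can be reproduced by a simple computation over the run files: for each query, count the documents shared between the top 10 results of two runs and average over queries. The sketch below assumes the standard TREC run format ("qid Q0 docid rank score tag"); it is an illustration, not the script actually used to produce the table.

```python
# Sketch: percentage of overlapping top-10 documents between two TREC-format runs.
from collections import defaultdict

def top10(run_file):
    """Return {query_id: set of the 10 highest-ranked document ids}."""
    ranked = defaultdict(list)
    with open(run_file) as f:
        for line in f:
            qid, _q0, docid, rank, _score, _tag = line.split()
            ranked[qid].append((int(rank), docid))
    return {qid: {d for _, d in sorted(docs)[:10]} for qid, docs in ranked.items()}

def overlap_percentage(run_a, run_b):
    """Average percentage of shared documents in the top 10 across common queries."""
    a, b = top10(run_a), top10(run_b)
    common = set(a) & set(b)
    return 100.0 * sum(len(a[q] & b[q]) for q in common) / (10 * len(common))
```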
In a further analysis of the differences between the documents retrieved by these two runs, we found that 23 queries (out of 50) have 100% overlap in the top 10 retrieved documents. This is in line with the finding of Saleh and Pecina [20] that in about 50% of cases the SMT system fails to select the translation that yields the best retrieval performance.

6 Conclusion

We presented our participation in the CLEF eHealth Consumer Health Search Task 2018 (monolingual and multilingual subtasks). Four runs were submitted to the monolingual task: two runs use language-model IR with Dirichlet smoothing and differ only in the index used (one uses a stemmed index and one an index without stemming), the third uses a TF-IDF vector-space model, and the fourth applies Kullback-Leibler divergence query expansion. For the multilingual task, we submitted four runs for each of Czech, French and German. The first uses the 1-best translation from a statistical machine translation system, the second reranks the translation hypotheses, the third implements query expansion using our term reranker model, and the last uses Google Translate to translate the provided queries into English. Our analysis of the results shows that similar approaches tend to share more retrieved documents than dissimilar approaches.

Acknowledgments

This research was supported by the Czech Science Foundation (grant n. P103/12/G084).

References

1. Amati, G., Carpineto, C., Romano, G.: Query difficulty, robustness, and selective application of query expansion. In: European Conference on Information Retrieval. pp. 127–137. Springer (2004)
2. Dušek, O., Hajič, J., Hlaváčová, J., Novák, M., Pecina, P., Rosa, R., et al.: Machine translation of medical texts in the Khresmoi project. In: Proceedings of the Ninth Workshop on Statistical Machine Translation. pp. 221–228. ACL, Baltimore, USA (2014)
3. Fox, S.: Health Topics: 80% of internet users look for health information online. Tech. rep., Pew Research Center (2011)
4. Goeuriot, L., Hamon, O., Hanbury, A., Jones, G.J., Kelly, L., Robertson, J.: D7.2 Meta-analysis of the first phase of empirical and user-centered evaluations. Tech. rep. (August 2013)
5. Goeuriot, L., Kelly, L., Li, W., Palotti, J., Pecina, P., Zuccon, G., Hanbury, A., Jones, G., Mueller, H.: ShARe/CLEF eHealth Evaluation Lab 2014, Task 3: User-centred health information retrieval. In: Proceedings of CLEF 2014. pp. 1–22. Springer, Sheffield, UK (2014)
6. Goeuriot, L., Kelly, L., Suominen, H., Hanlen, L., Névéol, A., Grouin, C., Palotti, J., Zuccon, G.: Overview of the CLEF eHealth evaluation lab 2015. In: The 6th Conference and Labs of the Evaluation Forum. pp. 1–15. Springer, Berlin, Germany (2015)
7. Goeuriot, L., Kelly, L., Suominen, H., Névéol, A., Robert, A., Kanoulas, E., Spijker, R., Palotti, J., Zuccon, G.: CLEF 2017 eHealth evaluation lab overview. In: CLEF 2017 - 8th Conference and Labs of the Evaluation Forum, Lecture Notes in Computer Science (LNCS). Springer (2017)
8. Grefenstette, G., Nioche, J.: Estimation of English and non-English language use on the WWW. In: Content-Based Multimedia Information Access - Volume 1. pp. 237–246. RIAO, Centre de hautes études internationales d'informatique documentaire, Paris, France (2000)
9. Jimmy, Zuccon, G., Palotti, J., Goeuriot, L., Kelly, L.: Overview of the CLEF 2018 consumer health search task. In: CLEF 2018 Evaluation Labs and Workshop: Online Working Notes. CEUR-WS, Avignon, France (2018)
10. Nikoulina, V., Kovachev, B., Lagos, N., Monz, C.: Adaptation of statistical machine translation model for cross-lingual information retrieval in a service context. In: Proceedings of the 13th Conference of the European Chapter of the Association for Computational Linguistics. pp. 109–119. Avignon, France (2012)
11. Nottelmann, H., Fuhr, N.: From retrieval status values to probabilities of relevance for advanced IR applications. Information Retrieval 6, 363–388 (2003)
12. Och, F.J.: Minimum error rate training in statistical machine translation. In: Proceedings of the 41st Annual Meeting on Association for Computational Linguistics - Volume 1. pp. 160–167. Sapporo, Japan (2003)
13. Ounis, I., Amati, G., Plachouras, V., He, B., Macdonald, C., Lioma, C.: Terrier: A high performance and scalable information retrieval platform. In: Proceedings of the Workshop on Open Source Information Retrieval. pp. 18–25. ACM, Seattle, WA, USA (2006)
14. Palotti, J.R.M., Hanbury, A., Müller, H., Kahn Jr., C.E.: How users search and what they search for in the medical domain - understanding laypeople and experts through query logs. Inf. Retr. Journal 19(1-2), 189–224 (2016)
15. Palotti, J.R., Zuccon, G., Goeuriot, L., Kelly, L., Hanbury, A., Jones, G.J., Lupu, M., Pecina, P.: CLEF eHealth Evaluation Lab 2015, Task 2: Retrieving information about medical symptoms. In: CLEF (Working Notes). pp. 1–22. Springer, Berlin, Germany (2015)
16. Palotti, J., Zuccon, G., Jimmy, Pecina, P., Lupu, M., Goeuriot, L., Kelly, L., Hanbury, A.: CLEF 2017 task overview: The IR Task at the eHealth evaluation lab. In: Working Notes of Conference and Labs of the Evaluation (CLEF) Forum. CEUR-WS, Dublin, Ireland (2017)
17. Papineni, K., Roukos, S., Ward, T., Zhu, W.J.: BLEU: A method for automatic evaluation of machine translation. In: Proceedings of the 40th Annual Meeting on Association for Computational Linguistics. pp. 311–318. Philadelphia, USA (2002)
18. Pecina, P., Dušek, O., Goeuriot, L., Hajič, J., Hlaváčová, J., Jones, G.J., et al.: Adaptation of machine translation for multilingual information retrieval in the medical domain. Artificial Intelligence in Medicine 61(3), 165–185 (2014)
19. Saleh, S., Pecina, P.: CUNI at the ShARe/CLEF eHealth Evaluation Lab 2014. In: Working Notes of CLEF 2014 - Conference and Labs of the Evaluation Forum. vol. 1180, pp. 226–235. Sheffield, UK (2014)
20. Saleh, S., Pecina, P.: Reranking hypotheses of machine-translated queries for cross-lingual information retrieval. In: Experimental IR Meets Multilinguality, Multimodality, and Interaction. The 7th International Conference of the CLEF Association, CLEF 2016. pp. 54–66. Springer, Évora, Portugal (2016)
21. Saleh, S., Pecina, P.: Task 3 patient-centred information retrieval: Team CUNI. In: Working Notes of CLEF 2017 - Conference and Labs of the Evaluation Forum, CEUR Workshop Proceedings. vol. 1866. Dublin, Ireland (2017)
22. Schuyler, P.L., Hole, W.T., Tuttle, M.S., Sherertz, D.D.: The UMLS Metathesaurus: representing different views of biomedical concepts. Bulletin of the Medical Library Association 81(2), 217 (1993)
23. Suominen, H., Kelly, L., Goeuriot, L., Kanoulas, E., Azzopardi, L., Spijker, R., Li, D., Névéol, A., Ramadier, L., Robert, A., Palotti, J., Jimmy, Zuccon, G.: Overview of the CLEF eHealth evaluation lab 2018. In: CLEF 2018 - 8th Conference and Labs of the Evaluation Forum. Lecture Notes in Computer Science (LNCS), Springer, Avignon, France (2018)