Multi-facet Document Representation and Retrieval

Karam Abdulahhad*, Jean-Pierre Chevallet**, and Catherine Berrut*
* UJF-Grenoble 1, ** UPMF-Grenoble 2, LIG laboratory, MRIM group
{karam.abdulahhad, jean-pierre.chevallet, catherine.berrut}@imag.fr

Abstract. This paper presents our participation in ImageCLEF 2011, in two tasks of the medical retrieval track: ad-hoc image-based retrieval and case-based retrieval. We participated with a simple IR model based on three hypotheses: 1) the amount of overlap between a document and a query, 2) the descriptive power of an indexing element, and 3) the discriminative power of an indexing element. We used three types of indexing elements (ngrams, keywords, and concepts) to build a multi-facet document representation and to check its effectiveness. Despite the simplicity of our model, in both document representation and retrieval, we obtained good results: eighth out of 64 runs in the ad-hoc image-based retrieval task, and fifth out of 35 runs in the case-based retrieval task.

1 Introduction

Each information retrieval (IR) model has its advantages and drawbacks; an IR model may perform well in some cases but badly in others. In general, no IR model performs well in all cases [9], and ours is no exception. Therefore, a general mechanism is needed to evaluate the performance of an IR model and to compare it with the performance of others. In this context, there are several evaluation campaigns in the IR field, e.g. CLEF (http://www.clef-campaign.org/).

First, our model is a very simple text-based retrieval model. It uses multiple types of indexing elements, e.g. ngrams, keywords, or concepts. The goals of this research are: 1) to study the performance of the model itself when using one type of indexing element, and 2) to study the effect on performance of using multiple types of indexing elements at the same time.

Second, CLEF (Cross-Language Evaluation Forum) is a yearly evaluation campaign in the multilingual information retrieval field, held since 2000. ImageCLEF (http://www.imageclef.org/) is a part of CLEF; it concerns searching for medical images through documents that contain text and images [14]. This year, ImageCLEF 2011 (http://www.imageclef.org/2011) [12] contains four main tracks: 1) medical retrieval, 2) photo annotation, 3) plant identification, and 4) Wikipedia retrieval. The medical retrieval track contains three tasks: 1) modality classification; 2) ad-hoc image-based retrieval, which is an image retrieval task using textual, image, or mixed queries; and 3) case-based retrieval, in which the documents are journal articles extracted from PubMed (http://www.ncbi.nlm.nih.gov/pubmed/) and the queries are case descriptions.

Third, we participated in the last two tasks: ad-hoc image-based retrieval and case-based retrieval. The 2011 test collection contains, for each task: 1) ad-hoc image-based retrieval: about 230,000 images with their text captions written in English, and 30 queries written in three languages (English, French, and German); 2) case-based retrieval: about 55,000 articles written in English, and 10 queries written in English.

This paper is structured as follows: Section 2 describes in detail our model and the different types of indexing elements that are used. Section 3 presents some technical aspects of applying this model to the ImageCLEF 2011 test collection. Section 4 discusses the obtained results. We conclude in Section 5.
2 The Proposed Model

2.1 Three Types of Indexing Elements

Any IR model contains two main components: an indexing function and a matching function. The goal of the indexing function is to convert documents and queries from their original form into another form that is easier to use:

    Index : D \cup Q \rightarrow E^{*}    (1)

where
  D is the set of documents,
  Q is the set of queries,
  E is the set of indexing elements,
  E^{*} is the set of all subsets of E.

Concerning indexing elements, three different types are used: ngrams (NG), keywords (K), and concepts (C). "Concepts" can be defined as "human understandable unique abstract notions independent from any direct material support, independent from any language or information representation, and used to organize perception and knowledge" [8]; in the IR domain, to achieve conceptual indexing, each concept is associated with a set of terms that describe it [3][7]. Therefore, three indexing functions exist (one for each type):

    Index_{NG} : D \cup Q \rightarrow E_{NG}^{*}    (2)
    Index_{K} : D \cup Q \rightarrow E_{K}^{*}    (3)
    Index_{C} : D \cup Q \rightarrow E_{C}^{*}    (4)

where E_{NG} is the set of ngrams, E_{K} is the set of keywords, and E_{C} is the set of concepts.

We believe that no single type of indexing element can completely represent the content of documents and queries, because: 1) there is no perfect indexing function [3][2][11]; it is always an approximate function; 2) concerning concepts, most of the resources that contain concepts, e.g. UMLS (Unified Medical Language System, a meta-thesaurus in the medical domain: http://www.ncbi.nlm.nih.gov/bookshelf/br.fcgi?book=nlmumls), are incomplete [5][6][1]; and 3) each type covers one aspect of documents and queries [10]: ngrams cover the statistical aspect, keywords the lexical aspect, and concepts the conceptual aspect.

2.2 Matching Function

Our model, like almost all models, depends on some hypotheses; namely, the following three:

1. The more shared elements a document and a query have, the semantically closer they are.
2. The descriptive power of an element (local weight): the more frequently an element occurs in a document, the better it describes the document [13][3].
3. The discriminative power of an element (global weight): the fewer documents an element appears in, the more important it is [13][3].

By taking these hypotheses into account, our model can be formulated. For any type of indexing element, the Relevance Status Value (RSV) between a document d and a query q is:

    RSV(d, q) = \|d \cap q\| \times \sum_{e \in q} \left( \frac{N}{N_e} \times \frac{f_{d,e}}{\|d\|} \right)    (5)

where
  d = {e | e ∈ Index(d)} is a document,
  q = {e | e ∈ Index(q)} is a query,
  ‖d ∩ q‖ = ‖{e | e ∈ Index(d) ∩ Index(q)}‖ is the number of shared elements between a document d and a query q,
  N is the number of documents in the corpus,
  N_e = ‖{d | e ∈ Index(d)}‖ is the number of documents that contain the element e,
  f_{d,e} is the number of occurrences of an element e in a document d,
  ‖d‖ is the number of elements in a document d.
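To make Formula (5) concrete, the following is a minimal Python sketch of the matching function. It is our own illustration, not the code of the official runs: the names (rsv, doc_freq, n_docs) are assumptions, documents and queries are assumed to be already indexed into elements, and ‖d‖ is read here as the total number of element occurrences in d.

```python
from collections import Counter

def rsv(doc_elements, query_elements, doc_freq, n_docs):
    """Formula (5): RSV(d, q) = ||d ∩ q|| * sum over e in q of (N / N_e) * (f_{d,e} / ||d||).

    doc_elements   -- list of the document's indexing elements (with repetitions)
    query_elements -- set of the query's indexing elements
    doc_freq       -- dict mapping an element e to N_e, its document frequency
    n_docs         -- N, the number of documents in the corpus
    """
    f_d = Counter(doc_elements)                              # f_{d,e}: occurrences of e in d
    d_len = len(doc_elements)                                # ||d||: number of elements in d
    overlap = len(set(doc_elements) & set(query_elements))   # ||d ∩ q||
    score = 0.0
    for e in query_elements:
        if doc_freq.get(e, 0) > 0:    # elements absent from the corpus contribute nothing
            score += (n_docs / doc_freq[e]) * (f_d[e] / d_len)
    return overlap * score
```

The same function serves for ngrams, keywords, or concepts; only the extraction step that produces the element lists changes.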
2.3 Matching Function According to Each Type of Indexing Element

Now, after presenting our model in general, it must be instantiated according to each type of indexing element.

Ngrams:

    RSV_{NG}(d, q) = \|d \cap q\|_{NG} \times \sum_{ng \in q} \left( \frac{N}{N_{ng}} \times \frac{f_{d,ng}}{\|d\|_{NG}} \right)    (6)

where
  d = {ng | ng ∈ Index_{NG}(d)} is a document,
  q = {ng | ng ∈ Index_{NG}(q)} is a query,
  ‖d ∩ q‖_{NG} = ‖{ng | ng ∈ Index_{NG}(d) ∩ Index_{NG}(q)}‖ is the number of shared ngrams between a document d and a query q,
  N is the number of documents in the corpus,
  N_{ng} = ‖{d | ng ∈ Index_{NG}(d)}‖ is the number of documents that contain the ngram ng,
  f_{d,ng} is the number of occurrences of an ngram ng in a document d,
  ‖d‖_{NG} is the number of ngrams in a document d.

Keywords: we added a new component to this instance, the length of the keyword (its number of characters). Here we supposed that the longer a keyword is, the more information it contains [15]:

    RSV_{K}(d, q) = \|d \cap q\|_{K} \times \sum_{k \in q} \left( \frac{N}{N_k} \times \frac{f_{d,k}}{\|d\|_{K}} \times \|k\| \right)    (7)

where ‖k‖ is the number of characters in a keyword k.

Concepts:

    RSV_{C}(d, q) = \|d \cap q\|_{C} \times \sum_{c \in q} \left( \frac{N}{N_c} \times \frac{f_{d,c}}{\|d\|_{C}} \right)    (8)

2.4 The Three Types in One Model

As we said earlier, no single type of indexing element can cover all aspects of documents and queries. Therefore, merging all types (aspects) in one model could enhance its performance [9][4][16]. One possible merging formula is:

    RSV_{all}(d, q) = RSV_{NG}(d, q) + RSV_{K}(d, q) + RSV_{C}(d, q)    (9)

To increase the chance of retrieving more documents, another component can be added to Formula (9). This component represents an expansion of the query q. It is ‖d ∩ q‖_{expan}: the number of shared keywords between a document d and a query q after replacing each query concept c that does not occur in d by the set of keywords that represent c (we supposed that each concept is represented by a set of keywords in its resource; we used UMLS as the resource):

    RSV_{all}(d, q) = RSV_{NG}(d, q) + RSV_{K}(d, q) + RSV_{C}(d, q) + \|d \cap q\|_{expan}    (10)

3 Model Validation

3.1 Ad-hoc Image-Based Retrieval

In this task, image captions are used as documents, and only the English part of the queries is taken into account.

Text indexing: we extracted three types of indexing elements:

1. 5grams (a 5gram is an ngram consisting of five characters; we picked 5grams because they gave the best results on ImageCLEF 2010 compared to the other ngram sizes): before extracting 5grams from documents and queries, we deleted all non-ASCII characters. Then we used a five-character-wide window to extract the 5grams, shifting the window one character at a time (a sketch of this windowing is given at the end of this section).
2. Keywords: before extracting keywords from documents and queries, we deleted all non-ASCII characters. Then we eliminated the stop words and stemmed the remaining keywords using the Porter algorithm, to finally obtain the list of keywords that index documents and queries.
3. Concepts: before mapping the text of documents and queries to concepts, we deleted all non-ASCII characters. Then we mapped the text to UMLS concepts using MetaMap [2].

Model variants: we experimented with five variants of our model in this task:

    RSV_{all}(d, q) = RSV_{5G}(d, q) + RSV_{K}(d, q) + RSV_{C}(d, q)    (11)
    RSV_{all}(d, q) = (\|d \cap q\|_{5G,K,C}) \times (sum_{5G,K,C})    (12)
    RSV_{all}(d, q) = RSV_{5G}(d, q) + RSV_{K}(d, q) + RSV_{C}(d, q) + \|d \cap q\|_{expan}    (13)
    RSV_{all}(d, q) = (\|d \cap q\|_{5G,K,C}) \times (sum_{5G,K,C} + \|d \cap q\|_{expan})    (14)
    RSV_{all}(d, q) = ((\|d \cap q\|_{5G,K,C}) \times (sum_{5G,K,C})) + \|d \cap q\|_{expan}    (15)

where
  ‖d ∩ q‖_{5G,K,C} = ‖d ∩ q‖_{5G} + ‖d ∩ q‖_{K} + ‖d ∩ q‖_{C},
  sum_{5G,K,C} = sum_{5G} + sum_{K} + sum_{C},
  sum_{5G} = Σ_{5g ∈ q} (N/N_{5g} × f_{d,5g}/‖d‖_{5G}),
  sum_{K} = Σ_{k ∈ q} (N/N_k × f_{d,k}/‖d‖_{K}),
  sum_{C} = Σ_{c ∈ q} (N/N_c × f_{d,c}/‖d‖_{C}).
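The character-window extraction described above can be sketched as follows. This is our own illustration under stated assumptions: the paper does not specify how whitespace is treated inside the window, so the text is kept as-is after removing non-ASCII characters.

```python
import re

def extract_ngrams(text, n=5):
    """Extract character ngrams with an n-character-wide window, shifted one character at a time."""
    text = re.sub(r'[^\x00-\x7F]', '', text)   # delete all non-ASCII characters (Section 3.1)
    return [text[i:i + n] for i in range(len(text) - n + 1)]

# Example: extract_ngrams("lung cancer") returns
# ['lung ', 'ung c', 'ng ca', 'g can', ' canc', 'cance', 'ancer']
```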
3.2 Case-Based Retrieval

In this task, articles are used as documents. We indexed documents and queries in the same way as in the previous task. However, we extracted only two types of indexing elements (4grams and keywords), for technical reasons.

Model variants: we experimented with two variants of our model in this task:

    RSV_{all}(d, q) = RSV_{4G}(d, q) + RSV_{K}(d, q)    (16)
    RSV_{all}(d, q) = (\|d \cap q\|_{4G,K}) \times (sum_{4G,K})    (17)

where
  ‖d ∩ q‖_{4G,K} = ‖d ∩ q‖_{4G} + ‖d ∩ q‖_{K},
  sum_{4G,K} = sum_{4G} + sum_{K},
  sum_{4G} = Σ_{4g ∈ q} (N/N_{4g} × f_{d,4g}/‖d‖_{4G}),
  sum_{K} = Σ_{k ∈ q} (N/N_k × f_{d,k}/‖d‖_{K}).
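The two variants differ only in how the per-type quantities are combined. The following minimal Python sketch (our own illustration; the helper names are assumptions) makes the difference explicit, taking for each element type the pair (overlap ‖d ∩ q‖, weighted sum from Formula (5)):

```python
def rsv_sum_fusion(per_type_scores):
    """Formula (16) style: sum the per-type RSV values (each is overlap * weighted sum)."""
    return sum(overlap * weighted_sum for overlap, weighted_sum in per_type_scores)

def rsv_product_fusion(per_type_scores):
    """Formula (17) style: add the overlaps and the weighted sums across types, then multiply."""
    total_overlap = sum(overlap for overlap, _ in per_type_scores)
    total_sum = sum(weighted_sum for _, weighted_sum in per_type_scores)
    return total_overlap * total_sum

# e.g. per_type_scores = [(overlap_4g, sum_4g), (overlap_k, sum_k)]
```

In the product form, the element types are coupled: the overlap of one type multiplies the weighted sums of all types, which is one possible reading of the sensitivity discussed in Section 4.2.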
4 Results and Discussion

Before presenting and discussing the obtained results, we introduce the names that we will use to refer to each variant, together with its corresponding formula and the run name used in the official campaign (all official results are available at http://www.imageclef.org/2011/medical); see Table 1.

Table 1. Variants' names

  Variant's name   Corresponding formula   Corresponding run's name
  I_CK5G_1         Formula 11              IVSCT5G
  I_CK5G_2         Formula 12              IVPCT5G
  I_CK5G_Q_1       Formula 13              IVSCT5GK
  I_CK5G_Q_2       Formula 14              IVPCT5GKin
  I_CK5G_Q_3       Formula 15              IVPCT5GKout
  C_K4G_1          Formula 16              MRIM_KJ_A_VM_Sop_T4G
  C_K4G_2          Formula 17              MRIM_KJ_A_VM_Pos_T4G

4.1 Ad-hoc Image-Based Retrieval

Table 2 contains the obtained results. The first row (Best) is the result of the first-ranked run in the ad-hoc image-based retrieval task.

Table 2. The results of the ad-hoc image-based retrieval task

              MAP      P@10     P@20     # rel ret   Rank
  Best        0.2172   0.3467   0.3017   1471        1
  I_CK5G_1    0.2008   0.3033   0.3050   1544        8
  I_CK5G_Q_1  0.2008   0.3033   0.3050   1543        9
  I_CK5G_Q_2  0.1975   0.2967   0.2833   1517        10
  I_CK5G_2    0.1974   0.2967   0.2833   1520        11
  I_CK5G_Q_3  0.1973   0.2967   0.2833   1519        12

On the one hand, although we used a very simple structure (a set of elements) to represent the content of documents and queries, and a simple formula (see Formula 5) to compute the matching value between a document and a query, we obtained good results: the run I_CK5G_1 is ranked eighth out of 64 runs. This is because we explicitly represented multiple facets (aspects) of documents and queries; in other words, three types of elements (5grams, keywords, and concepts) are used and involved in the matching process.

On the other hand, using different fusion formulas to merge the results of the different types of indexing elements does not change much; compare the results of the two runs I_CK5G_1 and I_CK5G_2 (see Table 2). In addition, the query expansion, and the different formulas for integrating it into the model, did not add anything; compare the results of the three runs I_CK5G_Q_1, I_CK5G_Q_2, and I_CK5G_Q_3 with the result of the run I_CK5G_1 (see Table 2). This is because the added value of query expansion is already provided by the use of keywords and 5grams.

Another notable result is that our model retrieved the most relevant documents compared to the other runs (http://www.imageclef.org/2011/medical).

4.2 Case-Based Retrieval

Table 3 contains the obtained results. The first row (Best) is the result of the first-ranked run in the case-based retrieval task.

Table 3. The results of the case-based retrieval task

            MAP      P@10     P@20     # rel ret   Rank
  Best      0.1297   0.1889   0.1500   144         1
  C_K4G_1   0.1114   0.1444   0.1444   149         5
  C_K4G_2   0.0911   0.1111   0.1278   146         10

Here also, we obtained good results: the run C_K4G_1 is ranked fifth out of 35 runs, even though we used a very simple structure (a set of elements), a simple matching formula (see Formula 5), and only two simple types of indexing elements (keywords and 4grams). The two-facet (keywords and 4grams) representation of documents and queries was useful.

However, the formulas for merging the results of the different types of indexing elements were more sensitive than in the ad-hoc image-based retrieval task; compare the results of the two runs C_K4G_1 and C_K4G_2 (see Table 3). This is perhaps because we used fewer element types: two (keywords and 4grams) in case-based retrieval compared to three (concepts, keywords, and 5grams) in ad-hoc image-based retrieval.

5 Conclusion

We presented in this paper our approach to indexing and retrieving documents. We used three types of indexing elements (ngrams, keywords, and concepts) to build a multi-facet document representation, and then used a simple formula based on three hypotheses (the amount of overlap between a document and a query, the descriptive power of an indexing element, and the discriminative power of an indexing element) to retrieve documents, considering all facets (element types) of the documents.

We obtained good results, eighth out of 64 runs in the ad-hoc image-based retrieval task and fifth out of 35 runs in the case-based retrieval task, even though we used a very simple structure for representing documents and queries and a very simple ranking formula.

References

1. Karam Abdulahhad, Jean-Pierre Chevallet, and Catherine Berrut. Solving concept mismatch through Bayesian framework by extending UMLS meta-thesaurus. In La huitième édition de la COnférence en Recherche d'Information et Applications (CORIA 2011), Avignon, France, March 16-18, 2011.
2. Alan R. Aronson. MetaMap: mapping text to the UMLS Metathesaurus, 2006.
3. Mustapha Baziz. Indexation conceptuelle guidée par ontologie pour la recherche d'information. Thèse de doctorat, Université Paul Sabatier, Toulouse, France, décembre 2005.
4. Nicholas J. Belkin, C. Cool, W. Bruce Croft, and James P. Callan. The effect of multiple query representations on information retrieval system performance. In Proceedings of the 16th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR '93, pages 339-346, New York, NY, USA, 1993. ACM.
5. O. Bodenreider, A. Burgun, G. Botti, M. Fieschi, P. Le Beux, and F. Kohler. Evaluation of the Unified Medical Language System as a medical knowledge source. J Am Med Inform Assoc, 5(1):76-87, 1998.
6. Olivier Bodenreider, Anita Burgun, and Thomas C. Rindflesch. Lexically-suggested hyponymic relations among medical terms and their representation in the UMLS. In Proceedings of TIA 2001, 2001.
7. Jean-Pierre Chevallet. endogènes et exogènes pour une indexation conceptuelle inter-média. Mémoire d'Habilitation à Diriger des Recherches, 2009.
8. Jean-Pierre Chevallet, Joo Hwee Lim, and Thi Hoang Diem Le. Domain knowledge conceptual inter-media indexing, application to multilingual multimedia medical reports. In ACM Sixteenth Conference on Information and Knowledge Management (CIKM 2007), Lisboa, Portugal, November 6-9, 2007.
9. W. Bruce Croft. Incorporating different search models into one document retrieval system. SIGIR Forum, 16:40-45, May 1981.
10. Padmini Das-Gupta and Jeffrey Katzer. A study of the overlap among document representations. SIGIR Forum, 17:106-114, June 1983.
11. Christopher Dozier, Ravi Kondadadi, Khalid Al-Kofahi, Mark Chaudhary, and Xi S. Guo. Fast tagging of medical terms in legal text. In ICAIL, pages 253-260, 2007.
12. Jayashree Kalpathy-Cramer, Henning Müller, Steven Bedrick, Ivan Eggel, Alba García Seco de Herrera, and Theodora Tsikrika. The CLEF 2011 medical image retrieval and classification tasks, 2011.
13. H. P. Luhn. The automatic creation of literature abstracts. IBM J. Res. Dev., 2:159-165, April 1958.
14. Henning Müller, Ivan Eggel, Joe Reisetter, Charles E. Kahn, and William Hersh. Overview of the CLEF 2010 medical image retrieval track, 2010.
15. Jamie Reilly and Jacob Kean. Information content and word frequency in natural language: word length matters. Proceedings of the National Academy of Sciences, 108(20):E108, May 2011.
16. Joseph A. Shaw and Edward A. Fox. Combination of multiple searches. In Text REtrieval Conference, 1994.