Introduction

Multi-facet Document Representation and Retrieval

Karam Abdulahhad

Jean-Pierre Chevallet

Catherine Berrut

catherine.berrut@imag.fr 0 0 UJF-Grenoble 1

This paper presents our participation in ImageCLEF2011, in the two tasks: ad-hoc image-based retrieval and case-based retrieval, of the medical retrieval track. We participated through a simple IR model based on three hypotheses: 1) the amount of overlap between a document and a query, 2) the descriptive power of an indexing element, and 3) the discriminative power of an indexing element. We used three types of indexing elements: ngrams,

Introduction

Each information retrieval (IR) model has its advantages and drawbacks. In other words, an IR model may perform well in some cases but badly in others. In general, there is no IR model performs well in all cases [ 9 ].

Our model is not an exception. Therefore, a general mechanism to evaluate the performance of an IR model and compare it with the performance of others is needed. In this context, there are many campaigns of evaluation in IR eld, e.g. CLEF1.

First, our model is a text-based retrieval and very simple model. It uses multiple types of indexing elements, e.g. ngrams, keywords, or concepts. The goals of this research are: 1) studying the performance of the model itself, using one of the indexing element types, 2) studying the e ects of using multiple indexing element types at the same time on the performance.

Second, CLEF (Cross-Language Evaluation Forum) is a yearly evaluation campaign in Multilanguage information retrieval eld since 2000. ImageCLEF2 is a part of CLEF. It concerns searching medical images through documents that contain text and images [ 14 ]. 1 http://www.clef-campaign.org/ 2 http://www.imageclef.org/

This year, ImageCLEF20113 [ 12 ] contains four main tracks: 1) medical retrieval, 2) photo annotation, 3) plant identi cation, and 4) Wikipedia retrieval. Medical retrieval track contains three tasks: 1) modality classi cation, 2) ad-hoc image-based retrieval which is an image retrieval task using textual, image or mixed queries, and 3) case-based retrieval: in this task the documents are journal articles extracted from PubMed4 and the queries are case descriptions.

Third, we participated in the last two tasks: ad-hoc image-based retrieval and case-based retrieval.

The test collection of this year 2011 contains, according to each task: 1) adhoc image-based retrieval: about 230,000 images with their text caption written in English and 30 queries written in three languages: English, French, and German. 2) case-based retrieval: about 55,000 articles written in English and 10 queries written in English.

This paper is structured as follows: Section 2 describes in details our model and the di erent types of indexing elements that are used. Section 3 presents some technical aspects of applying this model for ImageCLEF2011 test collection. Section 4 discusses the obtained results. We conclude in section 5. 2 2.1

The Proposed Model Three Types of Indexing Elements

Any IR model should contain two main components: indexing function and matching function. The goal of the indexing function is to convert documents and queries from their original form to another easy to use form.

Index : D [ Q ! E Where D set of documents Q set of queries E set of indexing elements E the set of all subsets of E

Concerning indexing elements, three di erent types are used: ngrams (N G), keywords (K), and concepts5 (C). Therefore, three indexing functions are exist (one for each type).

IndexNG : D [ Q ! ENG IndexK : D [ Q ! EK

3 http://www.imageclef.org/2011 4 http://www.ncbi.nlm.nih.gov/pubmed/ 5 "Concepts" can be de ned as "Human understandable unique abstract notions independent from any direct material support, independent from any language or information representation, and used to organize perception and knowledge" [ 8 ]. In IR domain, to achive the conceptual indexing, each concept is associated to a set of terms that describe it [ 3 ] [ 7 ]. (1) (2) (3)

IndexC : D [ Q ! EC

(4) Where ENG set of ngrams EK set of keywords EC set of concepts

We believe that no single type of indexing elements could completely represent the content of documents and queries, because: 1) there is no perfect indexing function [ 3 ] [ 2 ] [ 11 ]. It is always an approximate function, 2) concerning concepts, the most of resources that contain concepts, e.g. UMLS6, are incomplete [ 5 ] [ 6 ] [ 1 ], 3) each type covers an aspect of documents and queries [ 10 ]. Ngrams cover the statistical aspect, keywords cover the lexical aspect, and concepts cover the conceptual aspect. 2.2

Matching Function

Our model, as almost all models, depends on some hypotheses. Actually, it depends on the following three hypotheses: 1. The more shared elements a document and a query have, the semantically closer they are. 2. The descriptive power of an element (local weight): the more frequently an element occurs in a document, the better it describes the document [ 13 ] [ 3 ]. 3. The discriminative power of an element (global weight): the less number of documents an element appears in, the more important it is [ 13 ] [ 3 ].

By taking these hypotheses into account, our model could be formulated. For any type of indexing elements the Relevance Status Value (RSV) between a document d and a query q is:

RSV (d; q) = kd \ qk (5) X N e2q Ne fd;e kdk ! Where d = feje 2 Index (d)g a document q = feje 2 Index (q)g a query kd \ qk = kfeje 2 Index (d) \ Index (q)gk the number of shared elements between a document d and a query q N the number of documents in the corpus Ne = kfdje 2 Index (d)gk the number of documents that contain the element e fd;e the number of occurances of an element e in a document d kdk the number of elements in a document d 6 Uni ed Medical Language System. It is a meta-thesaurus in medical domain.

http://www.ncbi.nlm.nih.gov/bookshelf/br.fcgi?book=nlmumls 2.3

Matching Function According to each Type of Indexing Elements

Now after presenting our model in general, it should be instantiated according to each type of indexing elements.

Ngrams

RSVNG (d; q) = kd \ qkNG

X N ng2q Nng fd;ng kdkNG ! Where d = fngjng 2 IndexNG (d)g a document q = fngjng 2 IndexNG (q)g a query kd \ qkNG = kfngjng 2 IndexNG (d) \ IndexNG (q)gk the number of shared ngrams between a document d and a query q N the number of documents in the corpus Nng = kfdjng 2 IndexNG (d)gk the number of documents that contain the ngram ng fd;ng the number of occurances of a ngram ng in a document d kdkNG the number of ngrams in a document d

Keywords: We added to this instance a new component, which is the length of keyword (the number of characters). Here we supposed that the longer a keyword is, the more information it contains [ 15 ].

RSVK (d; q) = kd \ qkK Where kkk the number of characters in a keywords k

Concepts

0 kdkK 1 kkkA RSVC (d; q) = kd \ qkC X N c2q Nc fd;c kdkC ! 2.4

The Three Types in one Model

As we said earlier, no single type of indexing elements could cover all aspects of documents and queries. Therefore, merging all types (aspects) in one model could enhance the performance of our model [ 9 ] [ 4 ] [ 16 ]. One of the merging formulas is:

RSVall (d; q) = RSVNG (d; q) + RSVK (d; q) + RSVC (d; q) (9)

To increase the chance of retrieving more documents, another component could be added to the Formula (9). The component represents an expansion (6) (7) (8) of the query q. It is kd \ qkexpan: the number of shared keywords between a document d and a query q after replacing each query's concept c that does not occur in d by the set of keywords that represent c7.

RSVall (d; q) = RSVNG (d; q) + RSVK (d; q) + RSVC (d; q) + kd \ qkexpan (10) 3 3.1

Model Validation Ad-hoc Image-Based Retrieval

In this task, image captions are used as documents and the English part of queries is just taken into account.

Text indexing: we extracted three types of indexing elements: 1. 5gram8: before extracting 5grams from documents and queries, we deleted all non-ASCII characters. Then we used ve-characters-wide window for extracting 5grams with shifting the window one character each time. 2. Keywords: before extracting keywords from documents and queries, we deleted all non-ASCII characters. Then we eliminated the stop words and stem the remaining keywords using Porter algorithm to get nally the list of keywords that index documents and queries. 3. Concepts: before mapping the text of documents and queries to concepts, we deleted all non-ASCII characters. Then we mapped the text to UMLS's concepts using MetaMap [ 2 ].

Model variants: actually we experimented ve variants of our model in this task, which are:

RSVall (d; q) = RSV5G (d; q) + RSVK (d; q) + RSVC (d; q)

RSVall (d; q) = (kd \ qk5G;K;C ) (sum5G;K;C ) (11) (12)

RSVall (d; q) = RSV5G (d; q) + RSVK (d; q) + RSVC (d; q) + kd \ qkexpan (13) Where kd \ qk5G;K;C = kd \ qk5G + kd \ qkK + kd \ qkC sum5G;K;C = sum5G + sumK + sumC sum5G = P5g2q NN5g kfddk;55gG sumK = Pk2q NNk kfddk;kK sumC = Pc2q NNc kfddk;cC

RSVall (d; q) = (kd \ qk5G;K;C ) (sum5G;K;C + kd \ qkexpan) RSVall (d; q) = ((kd \ qk5G;K;C ) (sum5G;K;C )) + kd \ qkexpan (14) (15) 7 We supposed that each concept is represented by a set of keywords in its resource (we used UMLS as resource). 8 5gram is a ngram consists of ve characters. We picked out 5grams because they gave the best results using ImageCLEF2010 comparing to the other ngrams. 3.2 In this task, articles are used as documents. We indexed documents and queries in the same way as in the previous task. However, we extracted only two types of indexing elements (4grams, keywords) because of technical reasons.

Model variants: actually we experimented two variants of our model in this task, which are:

RSVall = RSV4G + RSVK RSVall (d; q) = (kd \ qk4G;K ) (sum4G;K ) (16) (17) Where kd \ qk4G;K = kd \ qk4G + kd \ qkK sum4G;K = sum4G + sumK sum4G = P4g2q NN4g kfddk;44gG sumK = Pk2q NNk kfddk;kK 4

Results and Discussion

Before we start representing and discussing the obtained results, we will present the names that we will use in our discussion for refering to each variant with its corresponding formula and run's name that used in the o cial campaign9 (see Table 1). 9 To see all results: http://www.imageclef.org/2011/medical (see Formula 5) to compute the matching value between a document and a query, we obtained good results. The run I CK5G 1 is ranked eighth out of 64 runs. That's because, we explicitly represented multi-facet (multi aspects) of documents and queries. In other words, three types of elements: 5Grams, Keywords, and Concepts are used and involved in the matching process.

From another side, using di erent fusion formulas to merge the results of using di erent types of indexing elements does not change a lot of things. Compare the results of the two runs: I CK5G 1 and I CK5G 2 (see Table 2).

In addition, the query expansion and the di erent formulas to integrate it into the model did not add anything. Compare the results of the three runs: I CK5G Q 1, I CK5G Q 2, and I CK5G Q 3 with the result of the run I CK5G 1 (see Table 2). That's because, the added value of query expansion is already compensated by using keywords and 5Grams.

Another noted result is that our model could retrieve the most relevant documents comparing to the other runs10. 4.2

Case-Based Retrieval

The following table (see Table 3) contains the obtained results. The rst row (Best) is the result of the rst ranked run in the case-based retrieval task. 0.1500 0.1444 0.1278 # rel ret

Rank

Here also, we obtained good results. The run C K4G 1 is ranked fth out of 35 runs, knowing that, we used a very simple structure (set of elements), a 10 http://www.imageclef.org/2011/medical simple matching formula (see Formula 5), and also two simple types of indexing elements: Keywords and 4Grams. The two-facet (Keywords and 4Grams) representation of documents and queries was useful.

However, the formulas of merging the results of using di erent types of indexing elements were more sensitive comparing to the formulas in ad-hoc image-based retrieval task. Compare the results of the two runs: C K4G 1 and C K4G 2 (see Table 3). That's maybe because, we used less number of elements' types, two types (Keywords and 4Grams) in case-based retrieval comparing to three types (Concepts, Keywords, and 5Grams) in ad-hoc image-based retrieval. 5

Conclusion

We presented in this paper our approach to index and retrieve documents. We used three types of indexing elements (ngrams, keywords, concepts) for building a multi-facet document representation, and then we used a simple formula based on three hypotheses (the amount of overlap between a document and a query, the descriptive power of an indexing element, and the discriminative power of an indexing element) for retrieving documents, considering all facets (elements' types) of documents.

We obtained good results. The eighth out of 64 runs in the ad-hoc imagebased retrieval task, and the fth out of 35 runs in the case-based retrieval task, knowing that, we used a very simple structure for representing documents and queries and also a very simple ranking formula.

Karam

Abdulahhad , Jean-Pierre Chevallet , and Catherine Berrut . Solving concept mismatch through bayesian framework by extending umls meta-thesaurus. In la huitime dition de la COnfrence en Recherche d'Information et Applications (CORIA 2011 ), Avignon, France, March 16 { 18 2011.

2. Alan

Aronson . Metamap: Mapping text to the umls metathesaurus , 2006 .

Mustapha

Baziz . Indexation conceptuelle guide par ontologie pour la recherche d'information . Thse de doctorat, Universit Paul Sabatier, Toulouse, France, dcembre 2005 .

4. Nicholas

Belkin , C.

Cool , W. Bruce

Croft , and James P.

Callan . The e ect multiple query representations on information retrieval system performance . In Proceedings of the 16th annual international ACM SIGIR conference on Research and development in information retrieval , SIGIR '93 , pages 339 { 346 , New York, NY, USA, 1993 . ACM.

Bodenreider ,

Burgun ,

Botti ,

Fieschi ,

P Le

Beux , and

Kohler . Evaluation of the uni ed medical language system as a medical knowledge source . J Am Med Inform Assoc , 5 ( 1 ): 76 { 87 , 1998 .

Olivier

Bodenreider , Anita Burgun, and Thomas C. Rind esch. Lexically-suggested hyponymic relations among medical terms and their representation . In in the UMLS, in Proceedings of TIA2001 , 1121, 2001 .

7. Jean-Pierre Chevallet . endognes et exognes pour une indexation conceptuelle intermdia . Mmoire d'Habilitation a Diriger des Recherches , 2009 .

8. Jean-Pierre

Chevallet

, Joo Hwee Lim, and Thi Hoang Diem Le . Domain knowledge conceptual inter-media indexing, application to multilingual multimedia medical reports . In ACM Sixteenth Conference on Information and Knowledge Management (CIKM 2007 ), Lisboa, Portugal, November 6 { 9 2007 .

W. Bruce

Croft . Incorporating di erent search models into one document retrieval system . SIGIR Forum , 16 : 40 { 45 , May 1981 .

10. Padima Das-Gupta and Je rey Katzer. A study of the overlap among document representations . SIGIR Forum , 17 : 106 { 114 , June 1983 .

11. Christopher

Dozier

, Ravi Kondadadi, Khalid Al-Kofahi,

Mark

Chaudhary , and Xi

Guo . Fast tagging of medical terms in legal text . In ICAIL , pages 253 { 260 , 2007 .

12. Jayashree Kalpathy-Cramer, Henning Muller, Steven Bedrick, Ivan Eggel, Alba Garcia Seco de Herrera, and Theodora Tsikrika . The clef 2011 medical image retrieval and classi cation tasks , 2011 .

13.

H. P.

Luhn . The automatic creation of literature abstracts . IBM J. Res. Dev. , 2: 159 { 165 , April 1958 .

14. Henning Muller, Ivan Eggel, Joe Reisetter,

Charles E.

Kahn , and

William

Hersh . Overview of the clef 2010 medical image retrieval track , 2010 .

15.

Jamie

Reilly and

Jacob

Kean . Information content and word frequency in natural language: Word length matters . Proceedings of the National Academy of Sciences , 108 ( 20 ): E108 , May 2011 .

16. Joseph A. Shaw and Edward A. Fox. Combination of multiple searches . In Text REtrieval Conference , 1994 .