Multiplying Concept Sources for Graph Modeling
                  Loïc Maisonnasse, Eric Gaussier, Jean-Pierre Chevallet
                                   LIG-UJF, IPAL-I2R
         loic.maisonnasse, eric.gaussier@imag.fr, viscjp@i2r.a-star.edu.sg


                                             Abstract
     This paper presents the LIG contribution to the CLEF 2007 medical retrieval task (i.e.
     ImageCLEFmed). The main idea in this paper is to incorporate medical knowledge into
     the language modeling approach to information retrieval (IR). Our model makes use of
     the textual part of the ImageCLEFmed corpus and of the medical knowledge found in the
     Unified Medical Language System (UMLS) knowledge sources. The use of UMLS allows
     us to create a conceptual representation of each sentence in the corpus. We use these
     sentence representations to create a graph model for each document. As in the standard
     language modeling approach, we evaluate the probability that a document graph model
     generates the query graph. Graphs are created from medical texts and queries, and
     are built for different languages with different methods. The use of a conceptual
     representation allows the system to work at a higher semantic level, which solves some
     information retrieval problems, such as term variation. After developing the graph
     model in the first part of the paper, we present our tests, which involve mixing different
     concept sources (i.e. languages and methods) for the matching of query and text
     graphs. Results show that using a language model on concepts provides good results in
     IR. Multiplying the concept sources further improves the results. Lastly, using relations
     between concepts (provided by the graphs under consideration) improves the results when
     only a few concept sources are used to analyze the query.

Categories and Subject Descriptors
H.3 [Information Storage and Retrieval]: H.3.1 Content Analysis and Indexing; H.3.3 Information
Search and Retrieval; H.3.4 Systems and Software

General Terms
Algorithms, Theory

Keywords
Information retrieval, language model


1    Introduction
Previous ImageCLEFmed campaigns have raised interest in the use of semantic resources for information
retrieval, especially in specialised domains such as the medical domain. Indeed, some of the best
performing methods at ImageCLEFmed used such resources for concept extraction. As concepts can be
defined as human-understandable unique abstract notions, independent of any direct material support,
of any language, and of any information representation, indexing at the conceptual level solves term
variation problems, and conceptual indexing is naturally multilingual. Most previously proposed
works integrate concepts into a vector space model. We propose to improve such conceptual indexing
in two ways. First, we use a richer representation of the document that includes relations between
concepts; a document is thus represented as a graph. Secondly, we integrate this representation
into a more advanced model than the vector space model. We propose to extend the graph language
modeling approach developed in [5] by considering that relations between terms or concepts are
labeled (both syntactic and semantic relations are generally labeled; the model we present here
thus addresses a common situation). This paper first presents a short overview of the use of
concepts in medical document indexing and of language modeling for complex structures. A graph
modeling approach is then proposed. The different graph extraction processes used for documents
and queries are then described. Finally, the different results obtained on the CLEF 2007 medical
retrieval task are presented.


2     State of the Art
This section explores previous work on the use of conceptual indexing in the medical domain as
well as previous work on the use of structure in language modeling.

2.1    Graphical Representations in the Medical Domain
The usefulness of concepts has been shown in previous ImageCLEFmed tasks, where some
of the best performing systems on text [2] used conceptual indexing methods based on vector
space models. On TREC Genomics, [12] use the MeSH and Entrez databases to select terms from
medical publications. They use terms related to concepts; by identifying these terms in documents
and in queries, they improve over bag-of-words results. They also carried out different experiments
adding domain-specific knowledge to the query, in the form of related terms, thesaurus relations,
or computed variants. Results show that adding the term variants of concepts gives the best
improvement. Had the authors directly used concepts instead of the terms associated with concepts,
these variants would presumably have been subsumed by the concepts, so the improvement brought by
term variants is directly tied to concepts. The authors state that retrieval at the concept level
can achieve substantial improvements over a purely term-based retrieval model.
    Other researchers have tried to go beyond the use of concepts by exploiting relations between
concepts. [11] evaluates the usefulness of UMLS concepts and semantic relations in medical IR.
They first extract concepts and relations from documents and queries. To do so, they hypothesize
that if two concepts have a semantic relation in a thesaurus and appear in the same sentence, then
the semantic relation holds between the associated words. To select relations, they rely on two
further assumptions: (1) interesting relations occur between interesting concepts; (2) relations are
expressed by typical lexical markers such as verbs. They implement assumption (1) through a
selection based on the IDF of concepts linked by a relation, whereas assumption (2) leads them
to use a co-occurrence matrix between lexical verbs and relations. The method is evaluated on a
medical collection of 25 queries, with relevance assessments provided by medical experts. In their
experiments, concepts and relations are considered in different vector spaces. The experiments
show that using both concepts and relations lowers the results obtained with concepts alone.

2.2    Structure Language Modeling
The language modeling approach to IR was first proposed in [7]. The basic idea is to
view each document as a language sample and querying as a generative process. Even though
smoothed unigram models have yielded good performance in IR, several works have investigated,
within the language modeling framework, the use of more advanced representations. Works like
[10] and [9] proposed to combine unigram models with bigram models. Other works, e.g. [4]
or [3], incorporated syntactic dependencies in the language model. [3], for example, introduces
a dependence language model for IR which integrates syntactic dependencies in the computation
a dependence language model for IR which integrates syntactic dependencies in the computation
of document relevance scores. This model relies on a variable L, defined as a “linkage” over
query terms, which is generated from a document according to P (L|Md), where Md represents
a document model. The query is then generated given L and Md , according to P (Q|L, Md). In
principle, the probability of the query, P(Q|Md), should be computed over all linkages L, but, for
efficiency reasons, the authors make the standard assumption that these linkages are dominated
by a single one, the most probable one: L = argmax_L P(L|Q). P(Q|Md) is then formulated as:

                                 P(Q|Md) = P(L|Md) P(Q|L, Md)                                  (1)
    In the case of a dependency parser, such as the one used in [3], each term has exactly one
governor in each linkage L, so that the above quantity can be further decomposed, leading to:

      log P(Q|Md) = log P(L|Md) + Σ_{i=1..n} log P(qi|Md) + Σ_{(i,j)∈L} MI(qi, qj|L, Md)      (2)

where MI denotes the mutual information, and:

                                 P(L|Md) ∝ ∏_{(i,j)∈L} P̂(R|qi, qj)                            (3)


P̂(R|qi, qj) in the above equation represents the empirical estimate of the probability that concepts
qi and qj are related through a parse in document d. As the reader may have noticed, there is a
certain ambiguity in the way the linkage L is used in this model. Consequently, this model is not
completely satisfying (see [5] for a short discussion of this problem), and we rely on a different model
to account for graphical structures in the language modeling approach to IR. We now describe
this model, which we will refer to as the graph model.


3    Graph Model
We propose a graph modeling approach to IR (which generalizes the one proposed in [5]) in which
each relation is labelled with one or more labels. We assume that a semantic analysis of a query
q can be represented as a graph Gq = (C, E), where C is the set of terms (or concepts) in q, and
E is a mapping from C × C to the set of label sets (E(ci, cj) = {labels} if ci and cj are related
through a relation labelled with the labels in {labels}, and ∅ otherwise). The probability
that the graph of query q is generated by the model of document d can be decomposed as:

                                 P(Gq|Md) = P(C|Md) P(E|C, Md)                                 (4)
    Assuming that, conditioned on Md , query concepts are independent of one another (a standard
assumption in the language model), and that, conditioned on Md and C, edges are independent
of one another (again a standard assumption), we can write:

                               P(C|Md) = ∏_{ci∈C} P(ci|Md)                                     (5)
                               P(E|C, Md) = ∏_{(i,j)} P(E(qi, qj)|C, Md)                       (6)

   Equation 5 corresponds to the standard language model (potentially applied to concepts), and
equation 6 carries the contribution of edges. The quantity P (ci |Md ) of equation 5 is computed
through a simple Jelinek-Mercer smoothing:


                         P(ci|Md) = (1 − λu) D(ci)/D(∗) + λu C(ci)/C(∗)                        (7)

where D(ci) (resp. C(ci)) is the number of times ci appears in the document (resp. collection), and
D(∗) (resp. C(∗)) is the total number of concepts in the document (resp. collection).
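    As an illustration, this estimate is easy to compute from simple frequency tables. The following
is a minimal Python sketch, assuming hypothetical dictionaries d_count and c_count that map each
concept to its frequency in the document and in the collection respectively:

    def concept_prob(ci, d_count, c_count, lambda_u=0.2):
        """Jelinek-Mercer estimate of P(ci|Md), equation 7.

        d_count / c_count are hypothetical concept-frequency tables for the
        (non-empty) document and collection; lambda_u is the smoothing weight."""
        p_doc = d_count.get(ci, 0) / sum(d_count.values())  # D(ci) / D(*)
        p_col = c_count.get(ci, 0) / sum(c_count.values())  # C(ci) / C(*)
        return (1 - lambda_u) * p_doc + lambda_u * p_col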
    The quantities P(E(qi, qj)|C, Md) of equation 6 can be decomposed as:

            P(E(qi, qj)|C, Md) = ∏_{label∈E(qi,qj)} P(R(qi, qj, label)|qi, qj, Md)             (8)

where R(qi , qj , label) indicates that there is a relation between qi and qj , the label set of which
contains label.
    An edge probability is thus equal to the product of the corresponding single-label relations.
Following standard practice in language modeling, one can furthermore “smooth” this estimate
by adding a contribution from the collection. This results in:


      P(R(ci, cj, label)|C, Md) = (1 − λe) D(ci, cj, label)/D(ci, cj) + λe C(ci, cj, label)/C(ci, cj)   (9)

where D(ci, cj, label) (resp. C(ci, cj, label)) is the number of times ci and cj are linked with a
relation labeled label in the document (resp. collection), and D(ci, cj) (resp. C(ci, cj)) is the
number of times ci and cj are observed together in the document (resp. collection).
   The above model can be applied to any graphical representation of queries and documents,
and relies on only two terms, which are easy to estimate. We now show how this model behaves
experimentally.
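    To make the model concrete, the following sketch puts equations 4 to 9 together to score one
query graph against a document model. It is a minimal illustration, not our actual implementation:
it reuses concept_prob from the previous sketch, and the count tables doc_rel, doc_pair and their
collection counterparts are hypothetical interfaces.

    import math

    def edge_prob(ci, cj, label, doc_rel, doc_pair, col_rel, col_pair, lambda_e=0.9):
        """Smoothed single-label relation probability, equation 9.

        doc_rel[(ci, cj, label)] = D(ci, cj, label); doc_pair[(ci, cj)] = D(ci, cj);
        col_rel / col_pair are the collection counterparts (hypothetical tables)."""
        p_doc = doc_rel.get((ci, cj, label), 0) / doc_pair[(ci, cj)] if doc_pair.get((ci, cj)) else 0.0
        p_col = col_rel.get((ci, cj, label), 0) / col_pair[(ci, cj)] if col_pair.get((ci, cj)) else 0.0
        return (1 - lambda_e) * p_doc + lambda_e * p_col

    def log_graph_prob(concepts, edges, d_count, c_count,
                       doc_rel, doc_pair, col_rel, col_pair,
                       lambda_u=0.1, lambda_e=0.9):
        """log P(Gq|Md) = log P(C|Md) + log P(E|C, Md), equations 4 to 6 and 8.

        concepts is the set C; edges maps a pair (ci, cj) to its label set E(ci, cj).
        Assumes every query concept and relation occurs somewhere in the collection,
        so the smoothed probabilities are non-zero."""
        score = sum(math.log(concept_prob(ci, d_count, c_count, lambda_u))
                    for ci in concepts)                         # equation 5
        for (ci, cj), labels in edges.items():                  # equations 6 and 8
            for label in labels:
                score += math.log(edge_prob(ci, cj, label, doc_rel, doc_pair,
                                            col_rel, col_pair, lambda_e))
        return score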


4     Graph Extractions
UMLS is a good candidate as a knowledge source for medical text indexing. It is more than a
terminology, because it describes terms with associated concepts. This knowledge source is large
(more than 1 million concepts and 5.5 million terms in 17 languages). Unfortunately, UMLS is
constituted of different sources (thesauri, term lists) and is neither complete nor consistent.
UMLS is a "meta-thesaurus", i.e. a merger of existing thesauri. It is not an ontology, because
there is no formal description of concepts, but its large set of terms and variants, restricted
to the medical domain, enables a full-scale conceptual indexing system. In UMLS, all concepts are
assigned at least one semantic type from the Semantic Network. This provides a consistent
categorization of all concepts in the meta-thesaurus at the relatively general level represented
in the Semantic Network, and makes it possible to detect, between concepts, the general semantic
relations defined in this network.
    Graphs are produced in two steps: concept detection, then relation detection. For concept
detection, we use three different methods. The first one uses MetaMap [1]; as MetaMap only handles
English, we use it only on the English part of the collection. No equivalent tool is available for
French or German, so we use our own mapping tool, which selects from UMLS all concepts that have
a textual instance in the document. We showed in [8] that such a strategy is better than extracting
only precise concepts and can rival or improve on text-based IR. This concept detection relies on
a syntactic analysis of sentences followed by a term mapping. The syntactic analysis is provided
by MiniPar (for English) or by TreeTagger (for all languages). To improve the term mapping, we
apply some filtering on words and/or on UMLS. First, we consider that a stop word cannot be an
instance of a concept, even if such an instance exists in UMLS. We also eliminate some specific
thesauri/ontologies from UMLS (those that are very precise or not relevant to our task). Lastly,
we remove from UMLS some hierarchies that we consider irrelevant for our task (e.g. the hierarchy
corresponding to "Geographic Area"). Such filtering reduces term ambiguity and consequently
improves the mapping. The result of concept extraction for a sentence can be seen in Figure 1;
a sketch of the mapping itself is given after the list below.
    We therefore have three concept extraction methods:
    • (1) MetaMap
    • (2) Mapping tool with MiniPar
    • (3) Mapping tool with TreeTagger

             Figure 1: FCG produced for "Show me chest CT images with emphysema"
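    The mapping used in methods (2) and (3) can be sketched as follows, assuming the sentence has
already been lemmatised by MiniPar or TreeTagger. The names umls_term_to_concepts (a table from
filtered UMLS term strings to concept identifiers) and stop_words are hypothetical stand-ins for
the actual resources:

    def map_concepts(lemmas, umls_term_to_concepts, stop_words, max_len=5):
        """Select all UMLS concepts that have a textual instance in a sentence.

        lemmas: the lemmatised tokens of one sentence; umls_term_to_concepts:
        hypothetical dict from a (filtered) UMLS term to its concept identifiers."""
        concepts = set()
        for i in range(len(lemmas)):
            for j in range(i + 1, min(i + max_len, len(lemmas)) + 1):
                if j == i + 1 and lemmas[i] in stop_words:
                    continue  # a stop word alone cannot be a concept instance
                concepts.update(umls_term_to_concepts.get(" ".join(lemmas[i:j]), ()))
        return concepts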
    After concept detection, we add conceptual relations between the detected concepts. The
relations used are those defined in the Semantic Network. We make the hypothesis that a relation
exists in a semantic graph if two concepts are detected in the same sentence and a relation between
these concepts is defined in the Semantic Network. To find relations, we first tag concepts with
their semantic types, and we then add the semantic relations that link concepts carrying the
corresponding types. The result of relation extraction for a sentence can be seen in Figure 1.
We do not perform any further disambiguation on relations.
    We compute the frequency D(ci, cj, label) (C(ci, cj, label)) as the number of times that ci
and cj are linked within a sentence of the document (collection), and D(ci, cj) (C(ci, cj)) as
the number of times they appear in the same sentence in the document (collection). Given our
relation extraction method, the probability of a relation in a document is 1 if the two concepts
appear in the same sentence of the document, and 0 otherwise.
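    This relation detection step can be sketched as follows, with hypothetical lookup tables
semantic_types (concept → its UMLS semantic types) and network_relations ((type, type) → relation
labels from the Semantic Network):

    from itertools import permutations

    def detect_relations(sentence_concepts, semantic_types, network_relations):
        """Build the labelled edges E(ci, cj) between the concepts of one sentence."""
        edges = {}
        for ci, cj in permutations(sentence_concepts, 2):
            labels = set()
            for ti in semantic_types.get(ci, ()):
                for tj in semantic_types.get(cj, ()):
                    labels |= network_relations.get((ti, tj), set())
            if labels:  # E(ci, cj) = {labels}; pairs without labels stay absent
                edges[(ci, cj)] = labels
        return edges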


5     Evaluation
We present here the results obtained with these methods on the CLEFmed 2007 corpus [6].

5.1    Corpus Analysis
We chose to analyse the English part of the collection with MetaMap (1). The French and the
German parts of the collection are analysed with TreeTagger.
    For queries, we propose to combine different analyses; a query is therefore a set of graphs
Q = {Gq}. The probability of a query given a document graph model is obtained as the product of
the probabilities of generating each graph:

                               P(Q = {Gq}|Md) = ∏_{Gq} P(Gq|Md)                               (10)

    We propose to group the analyses as follows:

    • (E) one English graph extracted by (1)
    • (E Mix) English graphs extracted by (1), (2) and (3)
    • (EFG) the English graph extracted by (1), with French and German graphs extracted by (3)
    • (EFG Mix) English graphs extracted by (1), (2) and (3), with French and German graphs
      extracted by (3)

    For example, in group EFG Mix a query is represented by 5 different graphs; the combination of
their scores is sketched below.
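    In log space the product of equation 10 becomes a sum, as in this minimal sketch, where
score_graph stands for any per-graph scorer, e.g. log_graph_prob from Section 3 with the document
counts bound in:

    def log_query_prob(query_graphs, score_graph):
        """log P(Q = {Gq}|Md): sum of per-graph log-probabilities, equation 10.

        query_graphs: the analyses of the query (up to 5 graphs for EFG Mix);
        score_graph: a callable returning log P(Gq|Md) for one graph."""
        return sum(score_graph(g) for g in query_graphs)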
Table 1: best results for mean average precision (MAP) and precision at five documents (P@5)
                              unigram model                        graph model
                               2005-2006     2007                     2005-2006       2007
                      λu     text    image  image      λu    λe     text    image    image
          MAP
          E           0.2   0.2468   0.2284   0.3494   0.2   0.9   0.2463   0.2277   0.3638
          E Mix       0.1   0.2610   0.2359   0.3784   0.1   0.9   0.2620   0.2363   0.3779
          EFG         0.1   0.2547   0.2274   0.3643   0.1   0.9   0.2556   0.2313   0.3733
          EFG Mix     0.1   0.2673   0.2395   0.3962   0.1   0.9   0.2670   0.2394   0.3969
          P@5
          E           0.2   0.4618   0.4436   0.4533   0.2   0.9   0.4582   0.4400   0.4867
          E Mix       0.1   0.4727   0.4582   0.4767   0.1   0.8   0.4800   0.4582   0.4767
          EFG         0.2   0.4582   0.4364   0.5333   0.1   0.8   0.4618   0.4473   0.5667
          EFG Mix     0.1   0.4836   0.4691     0.5    0.1   0.8   0.4909   0.4655     0.5


5.2    Results
We first evaluated our system on the two previous years of CLEFmed and selected the best performing
methods at the textual level. At this level, we consider a textual annotation relevant if one of its
associated images is relevant at the image level. Table 1 shows the results obtained on CLEF for
the different collections, with the best parameters estimated on the textual part of CLEFmed 2005
and 2006. We report mean average precision (MAP), since it gives an overview of the results, and
precision at 5 documents (P@5), since this measure shows the system precision on the first results.
    Results show that the best performing method, in terms of MAP, is the one that uses all the
concept sources presented in this paper (EFG Mix). Using different concept sources for the query
improves the overall IR results: such a method helps in finding all query concepts and improves
recall. For precision at five documents, however, the best results are obtained with EFG, which
uses one concept source per language. Using only the three languages provides the best concepts;
adding other concept sources may add some false concepts that lower the precision.
    For graph indexing, MAP results show a behaviour similar to concepts alone. The only difference
is on EFG, where relation results are better than concept results, on both MAP and P@5. This
supports our view that concepts and relations are well extracted by this method, which is why its
P@5 results are the best and why using relations improves on the concept results.


6     Conclusion
We have proposed a framework for using semantic resources in the medical domain. We described a
method for creating graph representations of texts, and we proposed a graph modeling approach to
IR. Within this framework, we evaluated the impact of multiplying concept extraction sources on
the query side. Results show that graph indexing can be useful for improving the first results of
the system and that multiplying concept sources improves the overall IR results. In this paper we
only worked on the query side; in future work, we intend to evaluate the multiplication of concept
sources on the document side. Our relation extraction method is simple; using a more precise
method should improve the results.


References
 [1] A. Aronson. Effective mapping of biomedical text to the UMLS metathesaurus: The MetaMap
     program. In Proc AMIA 2001, pages 17–21, 2001.
 [2] Caroline Lacoste, Jean-Pierre Chevallet, Joo-Hwee Lim, Xiong Wei, Daniel Racoceanu,
     Diem Le Thi Hoang, Roxana Teodorescu, and Nicolas Vuillenemot. IPAL knowledge-based
     medical image retrieval in ImageCLEFmed 2006. In Working Notes for the CLEF 2006
     Workshop, 20-22 September, Alicante, Spain, 2006.
 [3] Jianfeng Gao, Jian-Yun Nie, Guangyuan Wu, and Guihong Cao. Dependence language model
     for information retrieval. In Research and Development in Information Retrieval, 2004.
 [4] Changki Lee, Gary Geunbae Lee, and Myung Gil Jang. Dependency structure language model
     for information retrieval. In ETRI journal, 2006.
 [5] Loïc Maisonnasse, Eric Gaussier, and Jean-Pierre Chevallet. Revisiting the dependence
     language model for information retrieval. In Research and Development in Information
     Retrieval, 2007.
 [6] Henning Müller, Thomas Deselaers, Eugene Kim, Jayashree Kalpathy-Cramer, Thomas M.
     Deserno, Paul Clough, and William Hersh. Overview of the ImageCLEFmed 2007 medical
     retrieval and annotation tasks. In Working Notes of the 2007 CLEF Workshop, Budapest,
     Hungary, September 2007.
 [7] J. M. Ponte and W. B. Croft. A language modeling approach to information retrieval. In
     Research and Development in Information Retrieval, 1998.
 [8] Saïd Radhouani, Loïc Maisonnasse, Joo-Hwee Lim, Thi-Hoang-Diem Le, and Jean-Pierre
     Chevallet. Une indexation conceptuelle pour un filtrage par dimensions, expérimentation
     sur la base médicale ImageCLEFmed avec le méta-thésaurus UMLS. In Conférence en Recherche
     d'Information et Applications (CORIA 2006), pages 257–271, March 2006.
 [9] Fei Song and W. Bruce Croft. A general language model for information retrieval. In CIKM
     '99: Proceedings of the eighth international conference on Information and knowledge
     management, pages 316–321, New York, NY, USA, 1999. ACM Press.
[10] M. Srikanth and R. Srikanth. Biterm language models for document retrieval. In Research
     and Development in Information Retrieval, 2002.
[11] S. Vintar, P. Buitelaar, and M. Volk. Semantic relations in concept-based cross-language
     medical information retrieval. In Proceedings of the ECML/PKDD Workshop on Adaptive Text
     Extraction and Mining (ATEM), 2003.
[12] Wei Zhou, Clement Yu, Neil Smalheiser, Vetle Torvik, and Jie Hong. Knowledge-intensive
     conceptual retrieval and passage extraction of biomedical literature. In Research and
     Development in Information Retrieval, 2007.