An evaluation of the concept retrieval annotation for Spanish-English CLEFER parallel corpora

Rafael Berlanga 1, Antonio Jimeno-Yepes 2, María Pérez-Catalán 1, and Dietrich Rebholz-Schuhmann 3

1 Department of Languages and Computer Systems, Universitat Jaume I, Castelló, Spain
  email: {berlanga,maria.perez}@uji.es
2 National ICT Australia
  email: antonio.jimeno@gmail.com
3 Department of Computational Linguistics, University of Zürich, Switzerland
  email: rebholz@ifi.uzh.ch

1 Introduction

This paper presents a study of the concept retrieval annotation method applied to parallel corpora. The concept retrieval annotation (CRA) method consists of treating concepts as documents and text chunks as queries [1]. The concepts with the highest similarity to a text chunk are considered for generating the final semantic annotation. CRA makes use of an existing knowledge resource (KR) from which lexicons are extracted to perform the semantic annotation. Until now, CRA has been applied to mono-lingual scenarios, showing good performance over both very large collections (e.g., the CALBCII-SSC, http://www.ebi.ac.uk/Rebholz-srv/CALBC/) and very large lexicons (e.g., UMLS® [2]). We have also applied this semantic annotator to different tasks in Biomedicine, such as resource discovery [3], relation extraction [4], and scientific bibliography analysis [5].

In this work, we apply CRA in a bi-lingual scenario. For this purpose, we make use of the lexicons provided at the CLEFER workshop; more specifically, the English and Spanish lexicons. In this extended abstract, we first summarize the main features of CRA as a cross-lingual annotator, and then present the results obtained over the two provided parallel corpora, EMEA and MEDLINE®.

2 Concept retrieval-based semantic annotation

Performing the semantic annotation of a document D consists of finding mappings between text chunks of D (i.e., sequences of adjacent terms) and the concepts provided by a knowledge resource (KR) that best describe the contents of D. As the concepts of a KR are usually expressed as noun phrases, text chunks are usually associated with these syntactic structures. We assume that there exists a function lex_KR(C) that returns the set of strings describing the concept C. This set of strings can contain different lexical variants of C, synonyms of these variants, and a short definition of the concept. Given a text chunk T = (w_1 \cdots w_n) and a concept C of the KR, the retrieval score of C w.r.t. T is calculated as follows [1]:

    sim(T, C) = \max_{s \in lex_{KR}(C)} \frac{info(s \cap T) - info(T - s)}{info(s)}

The function info(s) gives the information content of the string s, calculated as

    info(s) = \sum_{w \in s} -\log p(w \mid Background)

The retrieval of candidate concepts is performed efficiently using an inverted file in which each entry is a vocabulary word and its hit list contains the concept strings that include the word. In this way, the text chunk T is executed as a query over this inverted file, and the retrieved concept strings are ranked according to sim(T, C). Finally, the top-ranked concepts that best cover the words in T are included in the semantic annotation of T.

We propose a simple strategy to annotate texts given a multi-lingual KR (both the scoring function and this lookup strategy are sketched in code after the list):

– Build a different inverted file for each language supported by the KR.
– Define a series of simple lexical rules to generate variants from one language to the other (e.g., proteína → protein).
– Submit the query to each inverted file with the variants corresponding to its language.
– Return the set of all concepts retrieved by each lexicon.
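To make the scoring concrete, the following Python sketch computes sim(T, C) for a toy lexicon entry. This is a minimal illustration rather than the actual CRA implementation: the background probabilities, the fallback probability for unseen words, and all names are assumptions of ours.

    import math

    # Toy background word probabilities; the real system estimates them from
    # large per-language corpora (these values are purely illustrative).
    BACKGROUND = {"protein": 1e-4, "kinase": 1e-5, "human": 1e-3}
    DEFAULT_P = 1e-6  # assumed fallback probability for out-of-vocabulary words

    def info(words):
        """Information content of a bag of words: sum of -log p(w | Background)."""
        return sum(-math.log(BACKGROUND.get(w, DEFAULT_P)) for w in words)

    def sim(chunk, lex_C):
        """Retrieval score of a concept w.r.t. a chunk T: the maximum over the
        concept's lexical strings s of (info(s & T) - info(T - s)) / info(s)."""
        T = set(chunk)
        return max((info(set(s) & T) - info(T - set(s))) / info(set(s))
                   for s in lex_C)

    # Score the concept whose lexicon entry is "protein kinase" against a chunk.
    print(sim(["human", "protein", "kinase"], [["protein", "kinase"]]))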
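The multi-lingual lookup itself can be sketched in the same spirit. Here the per-language inverted files are toy dictionaries mapping words to concept identifiers (CUIs), and the suffix-rewriting rules stand in for whatever rule set is actually used; all names and contents are illustrative.

    from collections import defaultdict

    # One inverted file per language: word -> CUIs whose lexical strings
    # contain that word (toy contents; accents are assumed to be stripped).
    INDEXES = {
        "en": defaultdict(set, {"protein": {"C0033684"}}),
        "es": defaultdict(set, {"proteina": {"C0033684"}}),
    }

    # Assumed suffix-rewriting rules, e.g. Spanish "proteina" -> English "protein".
    RULES = {("es", "en"): [("ina", "in"), ("dad", "ty")]}

    def variants(word, src, tgt):
        """Generate target-language variants of a word via the suffix rules."""
        out = {word}
        for suffix, repl in RULES.get((src, tgt), []):
            if word.endswith(suffix):
                out.add(word[: -len(suffix)] + repl)
        return out

    def retrieve(chunk, lang):
        """Query every per-language inverted file with the chunk's words plus
        their generated variants, and return the union of candidate concepts."""
        candidates = set()
        for tgt, index in INDEXES.items():
            for word in chunk:
                for v in variants(word.lower(), lang, tgt):
                    candidates |= index[v]
        return candidates

    print(retrieve(["proteina"], "es"))  # both lexicons hit: {'C0033684'}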
In the multi-lingual scenario we also have to estimate the word probabilities in large text collections for each language. Fortunately, there exist several publicly available resources providing such word estimates (e.g., http://invokeit.wordpress.com/frequency-word-lists/). We have performed Word Sense Disambiguation (WSD) based on the MRD (Machine Readable Dictionary) method [6], built on UMLS 2012AB, for both English and Spanish. In EMEA, the context for disambiguation is the document rather than the unit, since a broader context has been shown to produce better disambiguation results; in the MEDLINE annotation there is only one unit per document.
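As a rough illustration of the MRD idea, each candidate concept is represented by a word profile built from its dictionary entry (here, UMLS definitions and synonyms), and the candidate whose profile is most similar to the context is selected. The sketch below assumes pre-built toy profiles; the CUIs, the profile contents, and the choice of cosine similarity are illustrative rather than the exact configuration of [6].

    import math
    from collections import Counter

    # Hypothetical concept profiles: bags of words gathered from each candidate
    # concept's UMLS definition and synonyms (contents invented for the example).
    PROFILES = {
        "C0000001": Counter({"liver": 2, "toxicity": 2, "drug": 1}),
        "C0000002": Counter({"cell": 3, "culture": 2, "growth": 1}),
    }

    def cosine(a, b):
        """Cosine similarity between two bag-of-words vectors."""
        dot = sum(a[w] * b[w] for w in a if w in b)
        na = math.sqrt(sum(v * v for v in a.values()))
        nb = math.sqrt(sum(v * v for v in b.values()))
        return dot / (na * nb) if na and nb else 0.0

    def disambiguate(context_words, candidates):
        """Pick the candidate whose profile best matches the context; for EMEA
        the context is the whole document, for MEDLINE the single unit."""
        ctx = Counter(context_words)
        return max(candidates, key=lambda cui: cosine(ctx, PROFILES[cui]))

    print(disambiguate(["liver", "toxicity", "study"], ["C0000001", "C0000002"]))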
3 Results

Table 1 shows the main features of the annotated collections. The annotated collections provided at CLEFER are indicated with SSC (Silver Standard Corpus), and they are in English. The number of annotations is the number of text chunks with at least one associated concept. The average size of an annotation is the average number of words in the annotated text chunks. We also measure the percentage of ambiguous annotations, i.e., annotations with more than one entity type associated with the text chunk.

    Annotated Collection   Documents/Units   Annotations   Ann. avg. size   Ambiguity
    EMEA SSC               879/364005            971715    1.25               5.2%
    EMEA EN                879/364005            427013    1.3               10.8%
    EMEA EN (Ed)           879/364005            373971    1.3                5.2%
    EMEA ES (Ed)           895/140552            433671    1.5                5.6%
    MEDLINE SSC            1593546              4101813    1.3                5.2%
    MEDLINE EN (Ed)        1593546              3529800    1.5                5.6%
    MEDLINE ES (Ed)        247655                610636    1.5                7.0%

Table 1. Features of the annotations generated for the selected datasets.

In general, the English collections generate more ambiguous annotations than the Spanish ones. However, this result is mainly due to the higher noise of the English lexicon. We noticed that many ambiguous annotations were derived from contextual descriptions. To alleviate this problem, we developed a simple heuristic to detect these false ambiguities and edited the lexicons accordingly. Thus, the collections EMEA EN/ES (Ed) and MEDLINE EN/ES (Ed) have been annotated using the edited lexicons. Notice that, for the EMEA English collection, this heuristic notably reduced the ambiguity of the annotations (compare the second and third rows of the table).

Table 2 shows the concept overlap at both the collection and the aligned unit levels. As can be seen, editing the lexicon increases the overlap between the collections. This is because wrongly annotated concepts are unlikely to appear in the parallel collection. Note also that the overlap at the collection level is much higher than at the unit level.

    Collection       Collection level   Unit level
    EMEA EN/SS            80%               75%
    EMEA ES/SS            80%               58%
    EMEA EN/ES            79.7%             53%
    MEDLINE EN/SS         76.9%             68%
    MEDLINE ES/SS         46.7%             56%
    MEDLINE EN/ES         42.0%             51%

Table 2. Overlap of concepts at collection and aligned unit levels (% common CUIs).

Acknowledgements

This work has been partially funded by the Spanish National R&D Programme project with contract number TIN2011-24147 of the "Ministerio de Economía y Competitividad", and by the EU STREP project grant 296410 ("Mantra") under the 7th EU Framework Programme, Theme "Information Content Technologies, Technologies for Digital Content and Languages" [FP7-ICT-2011-4.1]. It has also been supported by the Australian Federal and Victoria State Governments and the Australian Research Council through the ICT Centre of Excellence program, National ICT Australia (NICTA).

References

1. Berlanga, R., Nebot, V., Jiménez, E.: Semantic annotation of biomedical texts through concept retrieval. Procesamiento del Lenguaje Natural 45 (2010) 247–250
2. Bodenreider, O.: The Unified Medical Language System (UMLS): integrating biomedical terminology. Nucleic Acids Research 32(Database issue) (2004) D267
3. Pérez, M., Berlanga, R., Sanz, I., Aramburu, M.J.: A semantic approach for the requirement-driven discovery of web resources in the life sciences. Knowledge and Information Systems 34(3) (2013) 671–690
4. Nebot, V., Berlanga, R.: Exploiting semantic annotations for open information extraction: an experience in the biomedical domain. Knowledge and Information Systems (2012) 1–25
5. Berlanga, R., Jiménez-Ruiz, E., Nebot, V.: Exploring and linking biomedical resources through multidimensional semantic spaces. BMC Bioinformatics 13(S-1) (2012) S6
6. Jimeno-Yepes, A., Aronson, A.: Knowledge-based biomedical word sense disambiguation: comparison of approaches. BMC Bioinformatics 11:565 (2010)