Evaluation of String Normalisation Modules for String-based Biomedical Vocabularies Alignment with AnAGram

Anique van Berne, Veronique Malaisé
Elsevier BV
A.vanBerne@Elsevier.com, V.Malaise@Elsevier.com

Abstract: Biomedical vocabularies have specific characteristics that make their lexical alignment challenging. We have built a string-based vocabulary alignment tool, AnAGram, dedicated to efficiently comparing terms in the biomedical domain, and we evaluate this tool's results against an algorithm based on the Jaro-Winkler edit distance. AnAGram is modular, enabling us to evaluate the precision and recall of different normalization procedures. Globally, our normalization and replacement strategy improves the F-measure score of the edit-distance experiment by more than 100%. Most of this increase can be explained by targeted transformations of the strings, with the use of a dictionary of adjective/noun correspondences yielding useful results. However, we found that the classic Porter stemming algorithm needs to be adapted to the biomedical domain to give good quality results.

1. Introduction

Elsevier has a number of online tools in the biomedical domain. Improving their interoperability involves aligning the vocabularies these tools are built on. The vocabulary alignment tool needs to be generic enough to work with any of our vocabularies, but each alignment requires specific conditions to be optimal, due to the vocabularies' lexical idiosyncrasies.

We have designed a modular, step-wise alignment tool: AnAGram. Its normalization procedures are based on previous research [1], basic Information Retrieval normalization processes, and our own observations. We chose a string-based alignment method because such methods perform well on the anatomical datasets of the OAEI campaign [1], and string-based alignment is an important step in most of the methods identified in [3][4].

We compare the precision and recall of AnAGram against an implementation of the Jaro-Winkler edit-distance method (JW) [7] and evaluate the precision of each step of the alignment process. We gain over 100% F-measure compared to the edit-distance method. We evaluate the contribution and quality of the string normalization modules independently and show that the Porter stemmer [2] does not give optimal results in the biomedical domain.

In Section 2 we present our use case: aligning Dorland's to Elsevier's Merged Medical Taxonomy (EMMeT)^1. Section 3 describes related work in vocabulary alignment in the biomedical domain. Sections 4 and 5 present AnAGram and evaluate it against the Jaro-Winkler edit distance. Section 6 presents future work and conclusions.

^1 http://river-valley.tv/elsevier-merged-medical-taxonomy-emmet-from-smart-content-to-smart-collection/

2. Use case: Dorland's definition alignment to EMMeT

Elsevier's Merged Medical Taxonomy (EMMeT) is used in "Smart Content" applications^2; it contains more than 1 million biomedical concepts and their hierarchical, linguistic and semantic relationships. We aim at expanding EMMeT with definitions from the authoritative biomedical dictionary Dorland's^3 by aligning the two resources.

^2 http://info.clinicalkey.com/docs/Smart_Content.pdf
^3 http://www.dorlands.com/

3. Related work

Cheatham and Hitzler [1] list the types of linguistic processes used by at least one alignment tool in the Ontology Alignment Evaluation Initiative (OAEI) [5]. AnAGram implements all of the syntactic linguistic transformations they list; instead of a generic synonym expansion system, we use a correspondence dictionary of adjective/noun pairs. This dictionary is a manually curated list based on information automatically extracted from Dorland's. It contains pairs that would not be solved by stemming, such as saturnine/lead. Ambiguous entries, such as gluteal/natal, were removed.

Chua and Kim's [6] approach to string-based vocabulary alignment is the closest to AnAGram: they use WordNet^4, a lexical knowledge base, to gather adjective/noun pairs and improve the coverage of their matches after applying string normalization steps; our set of pairs is larger than the one derived from WordNet.

^4 http://wordnet.princeton.edu/

4. AnAGram: biomedical vocabularies alignment tool

AnAGram was built for use on a local system^5 and is tuned for performance by using hash-table lookups to find matches. Currently, no partial matching is possible. The matching steps are built in a modular way: one can select the set of desired steps. The source taxonomy is processed using these steps, and the target taxonomy is processed sequentially: the alignment stops at the first match. Modules are ordered by increasing distance between the original and the transformed string, simulating a confidence value. The steps are the following (a sketch of the pipeline in code follows the list):

- Exact matching: corresponds to a JW edit-distance score of 1.
- Normalization: special characters are removed or transformed (Sjögren's syndrome to Sjogren's syndrome; punctuation marks to space) and the string is lower-cased.
- Stop word removal: tokenization by splitting on spaces, then removal of stop words, using a list that was fine-tuned over several rounds of indexing with EMMeT.
- Re-ordering: tokens are sorted alphabetically, enabling matches for inverted terms.
- Substitution: sequences of tokens are replaced with the corresponding value from our dictionary, applying a longest-string-matching principle.
- Stemming: using the Porter stemming algorithm [2] (Perl module Lingua::Stem::Snowball). The substitution step is then repeated, using stemmed dictionary entries.
- Independent lists: the stop word list and the substitution dictionary are kept in independent files.

^5 Dell™ Precision™ T7500, 2x Intel® Xeon® CPU E5620 2.4 GHz processors, 64 GB RAM. Software: Windows 7 Professional 64 bit, Service Pack 1; Perl v5.16.3.
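To make the matching loop concrete, here is a minimal Python sketch of the sequential, hash-table-based pipeline described above. This is an illustration under stated assumptions, not the authors' implementation: the actual tool is written in Perl, the stop word list and substitution dictionary below are tiny stand-ins for the curated external files, and the stemming pass (with its repeated substitution on stemmed entries) is omitted for brevity.

```python
import re
import unicodedata

# Stand-in resources (illustrative only; the real lists are curated files).
STOP_WORDS = {"of", "the", "s"}              # single-letter entries are risky (cf. hepatitis A)
SUBSTITUTIONS = {("saturnine",): ("lead",)}  # adjective/noun pairs as token tuples

def normalize(term):
    """Fold special characters (Sjögren -> Sjogren), map punctuation to space,
    lower-case, and tokenize."""
    folded = unicodedata.normalize("NFKD", term)
    folded = "".join(c for c in folded if not unicodedata.combining(c))
    return re.sub(r"[^\w\s]", " ", folded).lower().split()

def remove_stop_words(tokens):
    return [t for t in tokens if t not in STOP_WORDS]

def reorder(tokens):
    """Sort tokens alphabetically so inverted terms match."""
    return sorted(tokens)

def substitute(tokens):
    """Replace token sequences from the dictionary, longest match first,
    then re-sort to keep the canonical token order."""
    out, i = [], 0
    while i < len(tokens):
        for n in range(len(tokens) - i, 0, -1):
            if tuple(tokens[i:i + n]) in SUBSTITUTIONS:
                out.extend(SUBSTITUTIONS[tuple(tokens[i:i + n])])
                i += n
                break
        else:
            out.append(tokens[i])
            i += 1
    return sorted(out)

# Modules ordered by increasing distance between original and transformed string.
MODULES = [remove_stop_words, reorder, substitute]

def transformations(term):
    """Yield the cumulative transformation of a term after each stage."""
    yield term                      # stage 0: exact matching
    tokens = normalize(term)
    yield " ".join(tokens)          # stage 1: normalized
    for module in MODULES:          # stages 2+: stop words, word order, substitution
        tokens = module(tokens)
        yield " ".join(tokens)

def build_indexes(source_terms):
    """One hash table per stage, giving O(1) lookup of source terms."""
    indexes = []
    for term in source_terms:
        for stage, key in enumerate(transformations(term)):
            if stage == len(indexes):
                indexes.append({})
            indexes[stage].setdefault(key, term)
    return indexes

def align(target_term, indexes):
    """Apply the stages sequentially; stop at the first hash-table hit,
    i.e. the match with the highest simulated confidence."""
    for stage, key in enumerate(transformations(target_term)):
        hit = indexes[stage].get(key)
        if hit is not None:
            return stage, hit
    return None

indexes = build_indexes(["Sjögren's syndrome", "lead poisoning"])
print(align("Sjogren syndrome", indexes))    # matches after stop word removal
print(align("saturnine poisoning", indexes)) # matches after substitution
```

Stopping at the first hit is what gives the earlier, less transformed stages their role as a confidence proxy; only terms that fail the cheaper stages ever reach the more aggressive ones.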
5. Experimentation and results

We align EMMeT version 3.2 (13/12/13) (1,027,717 preferred labels) to Dorland's 32nd edition (115,248 entries). We evaluate AnAGram as a whole against JW, with a 0.92 threshold (established experimentally). The JW implementation can work only with preferred labels. To evaluate the recall of AnAGram vs. the JW implementation, we use a manual gold set of 115 mappings created by domain experts (Table 1). AnAGram gives better recall and better precision than the JW method.

               Correct mappings   Incorrect mappings   Recall (%)   Precision (%)   F-measure
Jaro-Winkler         46                   8                43             85            0.57
AnAGram              80                   3                71             96            0.82

Table 1 - Results of AnAGram vs. Jaro-Winkler on Dorland's gold set pairs

To get better insight into AnAGram's normalization process, we evaluate the performance of each normalization step on a random sample of 25 non-exact alignments from each of AnAGram's modules separately^6. Each result is rated as Correct, Related (useful but not exactly correct), or Incorrect (Table 2 and Figure 1). AnAGram gives more correct results, but JW is useful for finding related matches.

Preferred labels     C    R    I
Jaro-Winkler        16   40   44
AnAGram non-exact   77   14    9
Normalised          25    0    0
No stop words       16    3    6
Word order          25    0    0
Substituted         16    9    0
Stemmed             11   11    3
Subst. & stem       13    7    5

Table 2 - Results for AnAGram's modules (C: correct; R: related; I: incorrect)

[Figure 1 - Quality of matches returned by AnAGram's modules: proportions of correct, related and incorrect matches per module, plotted from the data in Table 2.]

Normalization does very well (100% correct results). Removal of stop words causes some errors and related matches: single-letter stop words can be meaningful, like A in hepatitis A. Word order rearranging ranks second: it does not often change the meaning of the term. Substitution performs reasonably well; most of the non-correct results are related matches. Stemming gives the poorest results, with false positives due to nouns and verbs stemmed to the same root, such as ciliated/ciliate (see the sketch at the end of this section). The substituted-and-stemmed matches have results similar to the stemmed ones. Still, even the worst results from any AnAGram module are better than the overall results of the non-exact matches from the JW algorithm. One reason for this is that JW does not stop the alignment at the best match, but delivers everything that satisfies the threshold of 0.92.

Not all modules account for an equal portion of the non-exact results. The normalization module delivers around 70% of the matches, stemming accounts for 15 to 20%, and the other modules account for 2% to 4% of the matches each.

^6 Some modules are based on the result of a previous transformation, so the later a module comes in the chain, the more complicated the matches it faces.
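As a concrete illustration of the over-stemming problem just described, the following minimal check shows the adjective "ciliated" and the unrelated noun "ciliate" collapsing to the same root. NLTK's PorterStemmer is assumed here purely as a convenient stand-in; AnAGram itself uses the Perl module Lingua::Stem::Snowball.

```python
# Sketch only: NLTK's PorterStemmer stands in for Perl's Lingua::Stem::Snowball.
from nltk.stem.porter import PorterStemmer

stemmer = PorterStemmer()
# The adjective "ciliated" and the noun "ciliate" (a protozoan) reduce to the
# same root, so the two unrelated terms would match spuriously.
print(stemmer.stem("ciliated"))  # ciliat
print(stemmer.stem("ciliate"))   # ciliat
```

A biomedical adaptation of the stemmer would need to keep such pairs apart, which motivates the future work below.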
6. Future work and conclusion

Our results are good compared with the results obtained by string-based tools on the OAEI large biomedical vocabularies alignment task [1]. We will work on the stemming algorithm, on the improvement of our stop word list and substitution dictionary, and on adding an optimized version of the JW algorithm as a final, optional module for AnAGram, to improve results further. In this way we will benefit from additional related matches in cases where no previous match was found.

References

[1] Michelle Cheatham, Pascal Hitzler. String Similarity Metrics for Ontology Alignment. International Semantic Web Conference (ISWC 2013) (2) 2013: 294-309.
[2] Cornelis J. van Rijsbergen, Stephen E. Robertson, Martin F. Porter. New models in probabilistic information retrieval. London: British Library (British Library Research and Development Report, no. 5587), 1980.
[3] Jérôme Euzenat (Coordinator) et al. State of the art on ontology alignment. Knowledge Web D 2.2.3, 2004.
[4] Jérôme Euzenat, Pavel Shvaiko. Ontology Matching. Springer-Verlag, Berlin Heidelberg, 2013.
[5] Jérôme Euzenat, Christian Meilicke, Heiner Stuckenschmidt, Pavel Shvaiko, Cássia Trojahn. Ontology Alignment Evaluation Initiative: Six Years of Experience. Journal on Data Semantics XV, Lecture Notes in Computer Science (6720) 2011: 158-192.
[6] Watson W. K. Chua, Jung-Jae Kim. BOAT: Automatic alignment of biomedical ontologies using term informativeness and candidate selection. Journal of Biomedical Informatics (45) 2012: 337-349.
[7] William E. Winkler. String Comparator Metrics and Enhanced Decision Rules in the Fellegi-Sunter Model of Record Linkage. Proceedings of the Section on Survey Research Methods (American Statistical Association), 1990: 354-359.