<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Evaluation of String Normalisation Modules for String-based Biomedical Vocabularies Alignment with AnAGram</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Anique van Berne</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>A.vanBerne@Elsevier.com Elsevier BV</string-name>
        </contrib>
      </contrib-group>
      <abstract>
        <p>We evaluate the precision and recall of the different normalization modules of AnAGram, a modular string-based vocabulary alignment tool we built for biomedical vocabularies. The main feature of AnAGram is a targeted transformation using a dictionary of adjective/noun correspondences, which gives interesting results. We find that the classic Porter stemming algorithm needs adaptation to the biomedical domain in order to produce quality results.</p>
        <p>Footnotes: [1] http://oaei.ontologymatching.org/2013/anatomy/index.html [2] Dell™ Precision™ T7500, 2x Intel® Xeon® CPU E5620 2.4 GHz processors, 64 GB RAM. Software: Windows 7 Professional 64 bit, Service Pack 1; Perl v5.16.3 [3] http://tartarus.org/martin/PorterStemmer/ [4] http://disi.unitn.it/~p2p/RelatedWork/Matching/strings-iswc13.pdf [5] http://oaei.ontologymatching.org/2014/ [6] http://www.ncbi.nlm.nih.gov/pubmed/22155335 [7] http://wordnet.princeton.edu/</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction: AnAGram and Related Work</title>
      <p>This paper stems from a product interoperability effort in the biomedical domain
through taxonomy alignment. Although the effort calls for a generic tool, each individual
alignment needs specific conditions to be optimal, because of lexical idiosyncrasies.
AnAGram is therefore constructed as a modular, step-wise, string-based alignment tool
(string-based tools perform well on the anatomical datasets of the OAEI campaign<sup>1</sup>).</p>
      <p>AnAGram runs on a local system<sup>2</sup> and uses hash-table lookup for performance.
Matching is modular: a user selects one or more modules for processing the source
taxonomy. The alignment stops at the first match in the target taxonomy. The modules
are ordered to produce results of increasing distance from the original string (similar
to a confidence value) and include: exact match; stop word removal (using an
independent fine-tuned list); re-ordering (sorting tokens alphabetically to match
multi-word terms); stemming (with the Porter stemmer<sup>3</sup>); normalization (of
non-alphanumeric characters); and substitution (replacing adjectives and nouns using our
substitution dictionary).</p>
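      <p>The first-match cascade described above can be sketched as follows. This is a minimal illustration under stated assumptions, not AnAGram's implementation (which is written in Perl): the stop-word list, the substitution pairs and all function names are hypothetical, each module is assumed to be applied on its own to both source and target strings, and the stemming module is left out because it would need a full Porter stemmer.</p>
      <preformat><![CDATA[
```python
import re

# Hypothetical resources, for illustration only (not AnAGram's actual lists).
STOP_WORDS = {"of", "the", "and"}
SUBSTITUTIONS = {"renal": "kidney", "hepatic": "liver"}


def remove_stop_words(term):
    """Drop tokens that appear in the stop-word list."""
    return " ".join(t for t in term.split() if t.lower() not in STOP_WORDS)


def reorder(term):
    """Sort tokens alphabetically so that word order no longer matters."""
    return " ".join(sorted(term.split()))


def normalise(term):
    """Lower-case the term and turn non-alphanumeric runs into single spaces."""
    return " ".join(re.sub(r"[^a-z0-9]+", " ", term.lower()).split())


def substitute(term):
    """Replace each token that has an entry in the substitution dictionary."""
    return " ".join(SUBSTITUTIONS.get(t, t) for t in term.split())


# Modules in the order listed in the text; stemming is omitted in this sketch.
MODULES = [
    ("exact", lambda term: term),
    ("stop_words", remove_stop_words),
    ("reorder", reorder),
    ("normalise", normalise),
    ("substitute", substitute),
]


def build_indexes(target_terms):
    """Build one hash table per module: transformed target string -> target term."""
    indexes = {}
    for name, transform in MODULES:
        index = {}
        for term in target_terms:
            # Keep the first target term seen for each transformed key.
            index.setdefault(transform(term), term)
        indexes[name] = index
    return indexes


def align(source_term, indexes):
    """Return (module name, target term) for the first module that matches."""
    for name, transform in MODULES:
        match = indexes[name].get(transform(source_term))
        if match is not None:
            return name, match
    return None
```
]]></preformat>
      <p>With target terms such as "kidney failure" and "carotid artery", a call like align("artery carotid", indexes) is resolved by the re-ordering module, while align("renal failure", indexes) falls through to the substitution module. The name of the matching module can then serve as a rough indication of distance from the original string.</p>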
      <p>The modules correspond to the list by Cheatham and Hitzler<sup>4</sup> of syntactic linguistic
processes used by at least one alignment tool in the Ontology Alignment Evaluation
Initiative (OAEI)<sup>5</sup>. Chua and Kim’s<sup>6</sup> approach is closest to AnAGram: it uses
WordNet<sup>7</sup> to build adjective/noun pairs that improve their matches, whereas our pairs are
built from the biomedical reference Dorland’s, which yields a larger substitution dictionary.</p>
      <p>As a test case, we align EMMeT<sup>8</sup> to Dorland’s (32nd edition). We evaluate a random
sample of 100 non-exact alignments, comparing them with a baseline Jaro-Winkler
(JW) matching approach. AnAGram gives more correct results, whereas JW finds more
related matches (Table 1, top two lines, and Figure 1).</p>
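      <p>For readers unfamiliar with the baseline, the sketch below shows the standard Jaro-Winkler similarity together with a threshold-based matcher that returns every candidate scoring at or above the threshold. The prefix scale of 0.1, the maximum prefix length of 4 and the threshold of 0.9 are conventional or assumed values; the text does not state the settings used for the JW baseline, and the function names are hypothetical.</p>
      <preformat><![CDATA[
```python
def jaro(s1, s2):
    """Jaro similarity between two strings, in the range 0.0 to 1.0."""
    if s1 == s2:
        return 1.0
    len1, len2 = len(s1), len(s2)
    if len1 == 0 or len2 == 0:
        return 0.0
    # Characters count as matching only within this distance of each other.
    window = max(max(len1, len2) // 2 - 1, 0)
    matched1 = [False] * len1
    matched2 = [False] * len2
    matches = 0
    for i, ch in enumerate(s1):
        start = max(0, i - window)
        end = min(len2, i + window + 1)
        for j in range(start, end):
            if not matched2[j] and s2[j] == ch:
                matched1[i] = matched2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Transpositions: matched characters that appear in a different order.
    seq1 = [c for c, m in zip(s1, matched1) if m]
    seq2 = [c for c, m in zip(s2, matched2) if m]
    transpositions = sum(a != b for a, b in zip(seq1, seq2)) // 2
    return (matches / len1 + matches / len2
            + (matches - transpositions) / matches) / 3


def jaro_winkler(s1, s2, prefix_scale=0.1, max_prefix=4):
    """Jaro similarity boosted for strings that share a common prefix."""
    score = jaro(s1, s2)
    prefix = 0
    for a, b in zip(s1[:max_prefix], s2[:max_prefix]):
        if a != b:
            break
        prefix += 1
    return score + prefix * prefix_scale * (1 - score)


def matches_above(source, targets, threshold=0.9):
    """Return all (target, score) pairs at or above the threshold, best first.

    The threshold value is an assumption made for this sketch.
    """
    scored = [(t, jaro_winkler(source, t)) for t in targets]
    kept = [(t, s) for t, s in scored if s >= threshold]
    return sorted(kept, key=lambda pair: -pair[1])
```
]]></preformat>
      <p>Unlike a first-match cascade, matches_above keeps every target whose score reaches the threshold, so a single source term can produce several candidate alignments.</p>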
      <p>[Figure 1. Vertical axis: 0% to 100%. Categories: Preferred labels, Jaro-Winkler, AnAGram non-exact, Normalised, No stop words.]</p>
      <p>The performance of each normalization is evaluated using 25 random results for
each of AnAGram’s modules separately<sup>9</sup> (Table 1, bottom, and Figure 1). Normalization
does very well (100% correct results). Removal of stop words causes some errors and
related matches, because stop words can be meaningful (for example, the A in hepatitis A).
Re-ordering ranks second: changing the word order rarely changes the meaning of the term.
Substitution performs reasonably well: most of the non-correct results are related matches.
Stemming gives the poorest results, with false positives caused by different words being
stemmed to the same root, such as ciliated/ciliate. The substituted-and-stemmed matches
score similarly to the stemmed matches. Still, even the worst results from any
AnAGram module are better than the overall results of the non-exact matches from the JW
algorithm. One possible reason is that JW does not stop the alignment at the best
match, but delivers everything that satisfies the threshold.</p>
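      <p>The stop-word problem mentioned above (the A in hepatitis A) can be made concrete with a small diagnostic that lists target terms collapsing to the same key after a transformation. This is a hypothetical check, not a described part of AnAGram; the stop-word list is an assumption and deliberately contains "a".</p>
      <preformat><![CDATA[
```python
from collections import defaultdict

# Hypothetical stop-word list; "a" is included on purpose to show the problem.
STOP_WORDS = {"a", "of", "the"}


def remove_stop_words(term):
    """Lower-case the term and drop tokens found in the stop-word list."""
    return " ".join(t for t in term.lower().split() if t not in STOP_WORDS)


def collisions(terms, transform):
    """Group terms by transformed key and keep keys shared by several terms."""
    groups = defaultdict(set)
    for term in terms:
        groups[transform(term)].add(term)
    return {key: sorted(group) for key, group in groups.items() if len(group) > 1}
```
]]></preformat>
      <p>Applied to the terms "hepatitis", "hepatitis A" and "hepatitis B", the check reports that "hepatitis" and "hepatitis A" share the key "hepatitis" once stop words are removed, which is the kind of false match a fine-tuned stop-word list has to avoid. The same function can be given any other transformation, such as a stemmer, to look for conflated terms.</p>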
      <p>The modules do not contribute equally to the non-exact results: the
normalization module delivers around 70% of the matches, stemming accounts for 15% to 20%,
and each of the other modules accounts for 2% to 4% of the matches.</p>
      <p>AnAGram’s results are good compared to the performance of string-based
methods in the OAEI large biomedical vocabularies alignment<sup>10</sup>. We will work on the
stemming algorithm, on improving our stop word list and substitution dictionary,
and on adding an optimized version of the JW algorithm, so that we can benefit from
additional related matches where no previous match was found.</p>
    </sec>
  </body>
  <back>
    <ref-list />
  </back>
</article>