<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Evaluation of String Normalisation Modules for String-based Biomedical Vocabularies Alignment with AnAGram</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Anique van Berne</string-name>
          <email>A.vanBerne@Elsevier.com</email>
          <aff>Elsevier BV</aff>
        </contrib>
      </contrib-group>
      <abstract>
        <p>Biomedical vocabularies have specific characteristics that make their lexical alignment challenging. We have built a string-based vocabulary alignment tool, AnAGram, designed to compare terms in the biomedical domain efficiently, and we evaluate its results against an algorithm based on Jaro-Winkler's edit-distance. AnAGram is modular, enabling us to evaluate the precision and recall of different normalization procedures. Globally, our normalization and replacement strategy improves the F-measure score of the edit-distance experiment by more than 100%. Most of this increase can be explained by targeted transformations of the strings using a dictionary of adjective/noun correspondences. However, we found that the classic Porter stemming algorithm needs to be adapted to the biomedical domain to give good-quality results in this area.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>Elsevier has a number of online tools in the biomedical domain. Improving their
interoperability involves aligning the vocabularies these tools are built on. The
vocabulary alignment tool needs to be generic enough to work with any of our
vocabularies, but each alignment requires specific conditions to be optimal, due to
vocabularies’ specific lexical idiosyncrasies.</p>
      <p>
        We have designed a modular, step-wise alignment tool: AnAGram. Its
normalization procedures are based on previous research[
        <xref ref-type="bibr" rid="ref1">1</xref>
        ], basic Information Retrieval
normalization processes, and our own observations. We chose a string-based alignment
method as these perform well on the anatomical datasets of the OAEI campaign[
        <xref ref-type="bibr" rid="ref1">1</xref>
        ],
and string-based alignment is an important step in most methods identified in [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ][
        <xref ref-type="bibr" rid="ref4">4</xref>
        ].
      </p>
      <p>
        We compare the precision and recall of AnAGram against an implementation of
Jaro-Winkler’s edit-distance method (JW)[
        <xref ref-type="bibr" rid="ref7">7</xref>
        ] and evaluate the precision of each step
of the alignment process. We gain over 100% in F-measure compared to the
edit-distance method. We evaluate the contribution and quality of the string normalization
modules independently and show that the Porter stemmer[
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] does not give optimal
results in the biomedical domain.
      </p>
      <p>In Section 2 we present our use-case: aligning Dorland’s to Elsevier’s Merged
Medical Taxonomy (EMMeT)1. Section 3 describes related work in vocabulary
alignment in the biomedical domain. Sections 4 and 5 present AnAGram and evaluate it
against Jaro-Winkler’s edit-distance. Section 6 presents future work and conclusions.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Use case: Dorland’s definition alignment to EMMeT</title>
      <p>Elsevier’s Merged Medical Taxonomy (EMMeT) is used in “Smart Content”
applications2; it contains more than 1 million biomedical concepts and their hierarchical,
linguistic and semantic relationships. We aim to expand EMMeT with definitions
from the authoritative biomedical dictionary Dorland’s3 by aligning the two resources.</p>
    </sec>
    <sec id="sec-3">
      <title>3. Related work</title>
      <p>
        Cheatham and Hitzler[
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] list the types of linguistic processes used by at least one
alignment tool in the Ontology Alignment Evaluation Initiative (OAEI)[
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]. AnAGram
implements all syntactic linguistic transformations listed; instead of a generic
synonym expansion system, we used a correspondence dictionary of adjective/noun pairs.
This dictionary is a manually curated list based on information automatically
extracted from Dorland’s. It contains pairs that would not be solved by stemming, such as
saturnine/lead. Ambiguous entries, such as gluteal/natal, were removed.
      </p>
      <p>
        Chua and Kim’s[
        <xref ref-type="bibr" rid="ref6">6</xref>
        ] approach for string-based vocabulary alignment is the closest
to AnAGram: they use WordNet4, a lexical knowledge base, to gather adjective/noun
pairs to improve the coverage of their matches, after using string normalization steps;
our set of pairs is larger than the one derived from WordNet.
      </p>
    </sec>
    <sec id="sec-4">
      <title>4. AnAGram: biomedical vocabularies alignment tool</title>
      <p>AnAGram was built for use on a local system5 and is tuned for performance,
using hash-table lookups to find matches. Currently, no partial matching is possible. The
matching steps are built in a modular way: one can select the set of desired steps. The
source taxonomy is processed using these steps and the target taxonomy is processed
sequentially: the alignment stops at the first match. Modules are ordered by increasing
distance between the original and the transformed string, simulating a confidence value.
Exact matching: corresponds to a JW edit-distance of 1.</p>
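      <p>The step-wise, first-match-wins lookup described above can be sketched as follows. This is a minimal Python sketch (AnAGram itself is written in Perl), and the module transforms shown are simplified stand-ins, not the actual implementation:</p>
```python
# Minimal sketch of AnAGram's step-wise matching loop. Module transforms
# here are simplified stand-ins for the real modules.

def normalize(term):
    return term.lower()

def reorder(term):
    return " ".join(sorted(term.split()))

# Modules ordered by increasing distance between original and
# transformed string, simulating a confidence value.
MODULES = [
    ("exact", lambda t: t),
    ("normalized", normalize),
    ("reordered", lambda t: reorder(normalize(t))),
]

def align(source_terms, target_terms):
    matches = {}
    for name, transform in MODULES:
        # Hash-table lookup: index the transformed target terms once per module.
        index = {}
        for target in target_terms:
            index.setdefault(transform(target), target)
        for term in source_terms:
            if term in matches:
                continue  # the alignment stops at the first (earliest) match
            hit = index.get(transform(term))
            if hit is not None:
                matches[term] = (hit, name)
    return matches

result = align(["Heart Attack"], ["attack heart", "heart failure"])
```
      <p>Because each module is a plain dictionary lookup over pre-transformed target labels, each pass stays linear in the number of terms.</p>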
      <p>Normalization: special characters are removed or transformed (Sjögren’s syndrome
to Sjogren’s syndrome; punctuation marks to space) and the string is lower-cased.
Stop word removal: the string is tokenized by splitting on spaces and stop words are removed,
using a list that was fine-tuned over several rounds of indexing with EMMeT.
1 http://river-valley.tv/elsevier-merged-medical-taxonomy-emmet-from-smart-content-to-smartcollection/
2 http://info.clinicalkey.com/docs/Smart_Content.pdf
3 http://www.dorlands.com/
4 http://wordnet.princeton.edu/
5 Dell™ Precision™ T7500, 2x Intel® Xeon® CPU E5620 2.4 GHz processors, 64 GB RAM.</p>
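      <p>The normalization and stop-word removal steps can be sketched as follows (illustrative Python; the stop-word list shown is a tiny stand-in for the list tuned against EMMeT):</p>
```python
# Sketch of the normalization and stop-word removal modules.
import re
import unicodedata

STOP_WORDS = {"of", "the", "and"}  # stand-in for the list tuned against EMMeT

def normalize(term):
    # Transform special characters: strip diacritics (Sjögren to Sjogren) ...
    ascii_form = unicodedata.normalize("NFKD", term).encode("ascii", "ignore").decode("ascii")
    # ... replace punctuation marks with spaces, collapse whitespace, lower-case.
    no_punct = re.sub(r"[^\w\s]", " ", ascii_form)
    return re.sub(r"\s+", " ", no_punct).strip().lower()

def remove_stop_words(term):
    # Tokenize on spaces and drop stop words. Note that single-letter
    # tokens such as the A in "hepatitis A" can be meaningful, which is
    # why the list had to be tuned carefully.
    return " ".join(tok for tok in term.split() if tok not in STOP_WORDS)

print(normalize("Sjögren's syndrome"))          # sjogren s syndrome
print(remove_stop_words("cancer of the lung"))  # cancer lung
```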
      <p>Software: Windows 7 Professional 64 bit, Service Pack 1; Perl v5.16.3.</p>
      <p>Re-ordering: tokens are sorted alphabetically, enabling matches for inverted terms.
Substitution: sequences of tokens are replaced with the corresponding value from our
dictionary, applying a longest-string-matching principle.</p>
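      <p>The re-ordering and substitution steps, with longest-string matching over token sequences, can be sketched as follows (Python sketch; the saturnine/lead pair is from the paper, while the multi-token entry is a hypothetical example):</p>
```python
# Sketch of the re-ordering and substitution modules.

SUBSTITUTIONS = {
    ("saturnine",): ("lead",),  # adjective/noun pair not solved by stemming
    ("heart", "attack"): ("myocardial", "infarction"),  # hypothetical multi-token entry
}
MAX_KEY_LEN = max(len(key) for key in SUBSTITUTIONS)

def reorder(tokens):
    # Alphabetical sort enables matches for inverted terms.
    return sorted(tokens)

def substitute(tokens):
    # Longest-string-matching principle: try the longest token sequence first.
    out = []
    i = 0
    while i != len(tokens):
        for n in range(min(MAX_KEY_LEN, len(tokens) - i), 0, -1):
            key = tuple(tokens[i:i + n])
            if key in SUBSTITUTIONS:
                out.extend(SUBSTITUTIONS[key])
                i += n
                break
        else:
            out.append(tokens[i])
            i += 1
    return out
```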
      <p>
        Stemming: using the Porter stemming algorithm[
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] (Perl module Lingua::Stem::
Snowball). The substitution step is then repeated, using stemmed dictionary entries.
Independent lists: the stop-word list and substitution dictionary are independent files.
      </p>
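      <p>A crude suffix-stripping stemmer (a deliberately simplified stand-in for the Porter algorithm, not the real thing) illustrates how stemming can conflate distinct biomedical terms:</p>
```python
# Illustration of stemming over-conflation in the biomedical domain.
# This crude longest-suffix stripper is a stand-in for the Porter stemmer;
# it reproduces the kind of false positive discussed in Section 5, where
# distinct terms such as ciliated and ciliate collapse to the same root.

SUFFIXES = ["ing", "ed", "es", "s", "e"]  # tried longest-first

def crude_stem(word):
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[:len(word) - len(suffix)]
    return word

print(crude_stem("ciliated"))  # ciliat
print(crude_stem("ciliate"))   # ciliat  -- distinct terms, same root
```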
    </sec>
    <sec id="sec-5">
      <title>5. Experimentation and results</title>
      <p>We align EMMeT version 3.2 (13/12/13) (1,027,717 preferred labels) to Dorland’s
32nd edition (115,248 entries). We evaluate AnAGram as a whole against JW, with a
0.92 threshold (established experimentally). The JW implementation can work only
with preferred labels.</p>
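      <p>For reference, the standard Jaro-Winkler similarity can be sketched as follows (illustrative Python with the usual prefix scale p = 0.1; the JW implementation we evaluate is a separate program):</p>
```python
# Reference sketch of the Jaro-Winkler similarity, for illustration of
# the 0.92 threshold used in the experiment.

def jaro(s, t):
    if s == t:
        return 1.0
    if not s or not t:
        return 0.0
    window = max(max(len(s), len(t)) // 2 - 1, 0)
    s_matched = [False] * len(s)
    t_matched = [False] * len(t)
    matches = 0
    for i in range(len(s)):
        start = max(0, i - window)
        end = min(i + window + 1, len(t))
        for j in range(start, end):
            if not t_matched[j] and s[i] == t[j]:
                s_matched[i] = True
                t_matched[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Count transpositions among matched characters.
    t_seq = [t[j] for j in range(len(t)) if t_matched[j]]
    k = 0
    transpositions = 0
    for i in range(len(s)):
        if s_matched[i]:
            if s[i] != t_seq[k]:
                transpositions += 1
            k += 1
    m = float(matches)
    return (m / len(s) + m / len(t) + (m - transpositions // 2) / m) / 3.0

def jaro_winkler(s, t, p=0.1):
    sim = jaro(s, t)
    prefix = 0
    for a, b in zip(s[:4], t[:4]):  # common prefix, capped at 4
        if a != b:
            break
        prefix += 1
    return sim + prefix * p * (1.0 - sim)
```
      <p>On the classic example pair MARTHA/MARHTA this yields approximately 0.961, above the 0.92 threshold used here.</p>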
      <p>To evaluate the recall of AnAGram vs the JW implementation, we use a manual
gold set of 115 mappings created by domain experts (Table 1). AnAGram gives better
recall and better precision than the JW method.</p>
      <table-wrap id="table1">
        <label>Table 1</label>
        <caption>
          <p>Results of AnAGram vs. Jaro-Winkler on Dorland’s Gold Set pairs</p>
        </caption>
        <table>
          <thead>
            <tr><th/><th>Correct mapping</th><th>Incorrect mapping</th><th>Recall (%)</th><th>Precision (%)</th><th>F-measure</th></tr>
          </thead>
          <tbody>
            <tr><td>Jaro-Winkler</td><td>46</td><td>8</td><td>43%</td><td>85%</td><td>0.57</td></tr>
            <tr><td>AnAGram</td><td>80</td><td>3</td><td>71%</td><td>96%</td><td>0.82</td></tr>
          </tbody>
        </table>
      </table-wrap>
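      <p>The F-measures in Table 1 follow from the reported precision and recall as their harmonic mean:</p>
```python
# Checking the F-measures in Table 1 from the reported precision and recall.

def f_measure(precision, recall):
    # F-measure as the harmonic mean of precision and recall.
    return 2 * precision * recall / (precision + recall)

print(round(f_measure(0.85, 0.43), 2))  # 0.57  (Jaro-Winkler)
print(round(f_measure(0.96, 0.71), 2))  # 0.82  (AnAGram)
```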
      <p>We evaluate a random sample of 25 non-exact alignments from each module to get
better insight into AnAGram’s normalization process. Each result is rated Correct,
Related (useful but not exactly correct), or Incorrect (Table 2 and Figure 1).
AnAGram gives more correct results, but JW is useful for finding related matches.</p>
      <p>[Figure 1 - Proportion of Correct, Related and Incorrect results per module:
preferred labels (Jaro-Winkler), AnAGram non-exact, normalised, no stop words]</p>
      <p>6 Some modules are based on the result of a previous transformation, so the later the module
comes in the chain, the more complicated matches it faces.</p>
      <p>Normalization does very well (100% correct results). Removal of stop words causes some
errors and related matches: single-letter stop words can be meaningful, like A for
hepatitis A. Word order re-arranging ranks second: it does not often change the
meaning of the term. Substitution performs reasonably well; most of the non-correct results
are related matches. Stemming gives the poorest results, with false positives due to
nouns/verbs stemmed to the same root, such as ciliated/ciliate. The substituted and
stemmed matches have results similar to the stemmed results. Still, even the worst
results from any AnAGram module are better than the overall results of the non-exact
matches from the JW algorithm. One reason for this is that JW does not stop the
alignment at the best match, but delivers everything that satisfies the threshold of
0.92.</p>
      <p>Not all modules account for an equal portion of the non-exact results. The
normalization module delivers around 70% of matches, stemming accounts for 15 to 20%
and the other modules account for 2% to 4% of the matches each.</p>
    </sec>
    <sec id="sec-6">
      <title>6. Future work and conclusion</title>
      <p>
Our results compare well with the results of string-based tools on the OAEI large
biomedical vocabularies alignment track[
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. We will work on adapting the stemming algorithm, on improving our stop-word
list and substitution dictionary, and on adding an optimized version of the JW
algorithm as a final optional module for AnAGram to improve results further. In
this way we will benefit from additional related matches in cases where no
previous match was found.
      </p>
    </sec>
  </body>
  <back>
    <ref-list>
      <title>References</title>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>Michelle</given-names>
            <surname>Cheatham</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Pascal</given-names>
            <surname>Hitzler</surname>
          </string-name>
          .
          <article-title>String Similarity Metrics for Ontology Alignment</article-title>
          .
          <source>International Semantic Web Conference (ISWC2013) (2)</source>
          <year>2013</year>
          :
          <fpage>294</fpage>
          -
          <lpage>309</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>Cornelis J.</given-names>
            <surname>van Rijsbergen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Stephen E.</given-names>
            <surname>Robertson</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Martin F.</given-names>
            <surname>Porter</surname>
          </string-name>
          .
          <article-title>New models in probabilistic information retrieval</article-title>
          .
          <source>London: British Library. (British Library Research and Development Report, no. 5587)</source>
          ,
          <year>1980</year>
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>Jérôme</given-names>
            <surname>Euzenat</surname>
          </string-name>
          (coordinator) et al.
          <article-title>State of the art on Ontology alignment</article-title>
          .
          <source>Knowledge Web D 2.2.3</source>
          ,
          <year>2004</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>Jérôme</given-names>
            <surname>Euzenat</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Pavel</given-names>
            <surname>Shvaiko</surname>
          </string-name>
          .
          <source>Ontology Matching</source>
          . Springer-Verlag, Berlin Heidelberg,
          <year>2013</year>
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>Jérôme</given-names>
            <surname>Euzenat</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Christian</given-names>
            <surname>Meilicke</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Heiner</given-names>
            <surname>Stuckenschmidt</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Pavel</given-names>
            <surname>Shvaiko</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Cássia</given-names>
            <surname>Trojahn</surname>
          </string-name>
          .
          <article-title>Ontology Alignment Evaluation Initiative: Six Years of Experience</article-title>
          .
          <source>Journal on Data Semantics XV, Lecture Notes in Computer Science</source>
          (
          <volume>6720</volume>
          )
          <year>2011</year>
          :
          <fpage>158</fpage>
          -
          <lpage>192</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>Watson W. K.</given-names>
            <surname>Chua</surname>
          </string-name>
          and
          <string-name>
            <given-names>Jung-Jae</given-names>
            <surname>Kim</surname>
          </string-name>
          .
          <article-title>BOAT: Automatic alignment of biomedical ontologies using term informativeness and candidate selection</article-title>
          .
          <source>Journal of Biomedical Informatics (45)</source>
          <year>2012</year>
          :
          <fpage>337</fpage>
          -
          <lpage>349</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>William E.</given-names>
            <surname>Winkler</surname>
          </string-name>
          .
          <article-title>String Comparator Metrics and Enhanced Decision Rules in the Fellegi-Sunter Model of Record Linkage</article-title>
          .
          <source>Proceedings of the Section on Survey Research Methods (American Statistical Association)</source>
          <year>1990</year>
          :
          <fpage>354</fpage>
          -
          <lpage>359</lpage>
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>