<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Efficient synonym search by semantic linking of multiple data sets</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Kenny Knecht</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Berenice Wulbrecht</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Filip Pattyn</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Hans Constant</string-name>
        </contrib>
        <aff>ONTOFORCE NV, Belgium</aff>
        <email>kenny@ontoforce.com</email>
      </contrib-group>
      <abstract>
        <p>We describe a method to automatically pick a highly relevant subset of synonyms to broaden a keyword-based text search. Public data sets in the biomedical area tend to provide a plethora of synonyms or alternative names. It is not uncommon for chemicals or diseases to have more than 50 different alternative names in data sets like UMLS or ChEMBL. Using all of these to extend an initial search may result in inefficient searches and sometimes even in false positives. Through semantic linking of several data sets we define a heuristic that increases the power of the search while making it more efficient. We evaluated the method on the 500 most common keyword searches used in the first 6 months of 2017 in the semantic web platform DISQOVER (www.disqover.com). More than 98% of the hits are retrieved by submitting only 16% of the synonyms. We implemented this method as a visual suggestion, which the user can override manually at any time. Although we focus our examples and concrete implementation on the biomedical databases in the publicly available DISQOVER, we would like to stress that the method is much more generally applicable.</p>
      </abstract>
      <kwd-group>
        <kwd>semantic web</kwd>
        <kwd>text search</kwd>
        <kwd>synonyms</kwd>
        <kwd>data integration</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>The problem</title>
      <p>Traditionally search engines start with a text query. All the documents
containing that text are subsequently returned: in DISQOVER this can be publications,
clinical trials or funded research programs, but also other concepts like diseases,
genes, variants or other chemicals.</p>
      <p>The central example will be all documents about the concept aspirin. Merely
typing "aspirin" will surely return a lot of relevant results, but some will also
be missed. For example, some documents may only mention the more scientific
name "Acetylsalicylic acid". So logically the user would like to expand the search
with synonyms to broaden his or her search, i.e. query for "aspirin" OR
"Acetylsalicylic acid".</p>
      <p>Since DISQOVER brings together many data sources, many of them
actually contributing synonyms and other alternative names, this can be easily
automated. The record for aspirin, for example, collects data from no fewer than 13
different databases (HMDB, DrugCentral, DrugBank, HSDB, UNII, ChEMBL,
UMLS, ChEBI, IUPHAR Compendium, SureChEMBL, RxNorm, MeSH,
PubChem), of which 8 contribute alternative names. In total the public
databases give 75 distinct alternative names for aspirin, many of which are not
strictly synonyms but hyponyms like "Nu-Seals 300" or "Bayer Extra Strength"
(ChEMBL via http://www.w3.org/2004/02/skos/core#altLabel).
Including all those alternative names in our search will give a fairly complete
result, but may also be very inefficient.</p>
      <p>There is also another risk involved. Elaborating on the previous example,
one of the alternative names is ASA. Although this is used as an alternative
name for aspirin, it is also the abbreviation of anti-sarcolemmal autoantibodies
Mus Musculus gene, of the disease Argininosuccinic aciduria and many more.
A search for this word will inevitably introduce many false positives. Another
source of ambiguity may be hypernymy: synonyms that are broader than the
actual submitted keyword.</p>
    </sec>
    <sec id="sec-2">
      <title>Related work</title>
      <p>
        While query expansion is a much-studied subject, we apply it here in the highly
specialized field of biomedical sciences. This makes the use of tools like WordNet,
as is done in [
        <xref ref-type="bibr" rid="ref1 ref2">1, 2</xref>
        ], less effective. However, contrary to general language queries,
we have the advantage that the biomedical field is covered excellently by
several ontologies. Combining these multiple semantic ontologies in a very simple
heuristic to obtain an optimal query expansion makes our approach distinct
from previous approaches.
      </p>
    </sec>
    <sec id="sec-3">
      <title>The solution</title>
      <p>We have opted to address the issues raised above with some simple heuristics,
which rely heavily on the fact that semantic platforms like DISQOVER bring
together multiple data sources. While each data source separately does not have
adequate power to discriminate, together they do.</p>
      <p>We use the following algorithm to prune the synonym list:</p>
      <list list-type="bullet">
        <list-item>
          <p><bold>Retrieve documents.</bold> We retrieve all the documents that match the query
string exactly, either with their preferred label or with one of their alternative
names. The provenance of each of these labels or names is also retrieved.</p>
        </list-item>
        <list-item>
          <p><bold>Merge documents.</bold> The documents are merged if there is sufficient overlap
between their names and if the classes of the documents are compatible. As
an example, consider a broad term like lung cancer. Multiple disease instances
match this term. By merging these instances and their synonyms we cover
the landscape the user probably wants to investigate.</p>
        </list-item>
        <list-item>
          <p><bold>Score synonyms.</bold> Synonyms are scored based on the number of data sets
that support them:
            <disp-formula>
              <tex-math>\mathrm{score} = \frac{\#\,\text{data sets} - 1}{\max(\#\,\text{data sets}) - 1}</tex-math>
            </disp-formula>
This scales the score between 0 and 1. If the number of data sets is 1, the score is set to 1.</p>
          <p>If a synonym is sufficiently short (currently 8 characters), it is checked
whether it gives rise to false positives. This is done by retrieving the classes
of the documents having exact matches for that synonym. If more classes are
found than the current concept possesses, the synonym gets a negative score for being
ambiguous. If a synonym is a number with fewer than 5 digits or has fewer than 3
alphanumeric characters, it also gets a negative score for the same reason.</p>
        </list-item>
        <list-item>
          <p><bold>Remove containing synonyms.</bold> If a synonym contains another, shorter
synonym, there is no reason to put it in a query. If the shorter word has a
lower score than the containing word, it inherits the highest score.</p>
        </list-item>
      </list>
      <p>We only retain the synonyms which have a score larger than 0.</p>
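      <p>As an illustration only, the scoring step above could be sketched in Python as follows. The function names, the input shape (a mapping from each synonym to the set of data sets that list it), the penalty value of -1 and the <monospace>is_ambiguous</monospace> callback standing in for the class lookup are all assumptions made for this sketch, not the DISQOVER implementation. The formula is read here as (n - 1) / (max - 1), with the score set to 1 when no synonym is supported by more than one data set.</p>
      <preformat>
```python
from typing import Callable, Dict, Set

SHORT_LENGTH = 8         # "sufficiently short" threshold quoted in the text
AMBIGUOUS_SCORE = -1.0   # assumed penalty; the text only says "negative"


def is_trivial(synonym: str) -> bool:
    """True for numbers with fewer than 5 digits or fewer than 3 alphanumerics."""
    alnum = [ch for ch in synonym if ch.isalnum()]
    if synonym.isdigit() and 5 > len(synonym):
        return True
    return 3 > len(alnum)


def score_synonyms(
    sources: Dict[str, Set[str]],
    is_ambiguous: Callable[[str], bool],
) -> Dict[str, float]:
    """Score each synonym by the number of data sets that support it."""
    max_sets = max(len(data_sets) for data_sets in sources.values())
    scores = {}
    for synonym, data_sets in sources.items():
        if max_sets == 1:
            score = 1.0  # a single supporting data set cannot discriminate
        else:
            score = (len(data_sets) - 1) / (max_sets - 1)
        short = SHORT_LENGTH >= len(synonym)
        # Only short synonyms are sent to the (hypothetical) ambiguity check.
        if is_trivial(synonym) or (short and is_ambiguous(synonym)):
            score = AMBIGUOUS_SCORE
        scores[synonym] = score
    return scores


# Toy input: the data set names are placeholders, not real provenance.
sources = {
    "aspirin": {"set_a", "set_b", "set_c"},
    "acetylsalicylic acid": {"set_a", "set_b"},
    "nu-seals 300": {"set_a"},
    "asa": {"set_a", "set_c"},
}
scores = score_synonyms(sources, is_ambiguous=lambda s: s == "asa")
retained = sorted(s for s, value in scores.items() if value > 0)
print(retained)
```
      </preformat>
      <p>With this toy input, the name supported by a single data set scores 0 and the ambiguous abbreviation scores negative, so only the two well-supported names are retained.</p>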
      <p>In the example of aspirin we retain 12 possible synonyms (16%), the first
three being aspirin, acetylsalicylic acid and salicylic acid acetate.</p>
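      <p>The containment step of the pruning algorithm can be illustrated with a short Python sketch. It assumes that "contains" means plain substring containment and that scores arrive as a mapping from synonym to score. Both are assumptions of this sketch, as is the function name <monospace>prune_containing</monospace>. The score inheritance follows the text literally.</p>
      <preformat>
```python
from typing import Dict


def prune_containing(scores: Dict[str, float]) -> Dict[str, float]:
    """Drop synonyms that contain a shorter synonym, then keep scores above 0."""
    # Shortest first, so each synonym is only compared with shorter kept ones.
    ordered = sorted(scores, key=lambda s: (len(s), s))
    kept: Dict[str, float] = {}
    for synonym in ordered:
        contained = [k for k in kept if k in synonym]
        if not contained:
            kept[synonym] = scores[synonym]
            continue
        # The longer synonym is dropped; each shorter synonym it contains
        # inherits the higher of the two scores.
        for shorter in contained:
            kept[shorter] = max(kept[shorter], scores[synonym])
    return {s: value for s, value in kept.items() if value > 0}


# Toy scores, invented for illustration.
example = {
    "aspirin": 0.4,
    "aspirin sodium": 1.0,
    "acetylsalicylic acid": 0.6,
    "nu-seals 300": 0.0,
}
pruned = prune_containing(example)
print(pruned)
```
      </preformat>
      <p>In this toy run "aspirin sodium" is dropped because it contains "aspirin", which inherits its score of 1.0, and "nu-seals 300" is dropped because its score is not larger than 0.</p>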
    </sec>
    <sec id="sec-4">
      <title>Evaluation and results</title>
      <p>We have analyzed the results of the 500 most prevalent keywords which were
recorded in DISQOVER in the first 6 months of 2017. For each keyword we have
ordered all the synonyms by score and, in case of a tie, by descending word
length. As an evaluation we submitted all these synonyms to the search
engine in this order, instead of submitting only the optimal synonyms, and we
recorded the following per synonym:</p>
      <list list-type="bullet">
        <list-item>
          <p>How many hits we found in total by accumulating the synonyms with OR</p>
        </list-item>
        <list-item>
          <p>How many hits we get by only submitting the current synonym</p>
        </list-item>
      </list>
      <p>An example of this output is shown in Appendix A. Per merged document we
register:</p>
      <list list-type="bullet">
        <list-item>
          <p>The optimal number of synonyms. We determine this by re-ordering the synonyms
by descending number of hits they add and counting how many synonyms we would
minimally need to obtain 98% of the hits</p>
        </list-item>
        <list-item>
          <p>The total number of synonyms per merged document</p>
        </list-item>
        <list-item>
          <p>The number of synonyms with score &gt; 0</p>
        </list-item>
      </list>
      <p>This enables us to measure what we miss if we only submit synonyms with a
score &gt; 0.</p>
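      <p>One possible reading of the "optimal number of synonyms" measurement is a greedy selection: repeatedly take the synonym that adds the most new hits until 98% of all hits are covered. The Python sketch below assumes that the hits of each synonym are available as a set of document identifiers. The <monospace>hits</monospace> mapping, the function name and the toy numbers are hypothetical.</p>
      <preformat>
```python
from typing import Dict, List, Set


def minimal_synonyms(hits: Dict[str, Set[int]], target: float = 0.98) -> List[str]:
    """Greedily pick synonyms by the number of new hits they add."""
    all_hits = set().union(*hits.values())
    covered: Set[int] = set()
    chosen: List[str] = []
    remaining = dict(hits)
    while remaining and target * len(all_hits) > len(covered):
        # Synonym adding the most uncovered documents (ties broken by name).
        best = max(remaining, key=lambda s: (len(remaining[s] - covered), s))
        if not remaining[best] - covered:
            break  # nothing left that adds a new hit
        covered |= remaining.pop(best)
        chosen.append(best)
    return chosen


# Toy hit sets over 100 invented document ids.
hits = {
    "aspirin": set(range(0, 90)),
    "acetylsalicylic acid": set(range(60, 99)),
    "asa": set(range(95, 100)),
}
print(minimal_synonyms(hits))
```
      </preformat>
      <p>Here two synonyms suffice: the second one contributes only the documents the first did not already cover, which is the same overlap effect described for the accumulated OR query.</p>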
      <p>We can split the results into two big groups, which we analyze separately. On
the one hand we have keywords for which the score is not able to discriminate, so all
synonyms need to be checked. Basically this happens when only one data
source contributes to the concept or when all the data sources completely agree
on all alternative names. On the other hand we have the group for which the
method does make a difference and allows us to skip some synonyms.</p>
      <p>The first group has on average 6.16 synonyms per keyword, while the second
group has 30.60 synonyms per keyword. In other words, in the cases where the method
does not make any difference, there was not really a need for it to begin with.</p>
      <p>Table columns: Group; All synonyms considered; Filtered by method.</p>
      <p>In the second group we submit only 15.8% of all the synonyms if we apply
the method. We observe that the query time scales roughly with the
number of synonyms we submit. If we compare the number of hits we obtain
with this subset to the hits obtained by submitting all the synonyms, on average we miss
9.4%. However, the median of the fraction of missed hits is 1.7%, so we have
a few very high outliers. What is causing this? The worst example is DNM2,
where we miss 94% of all the hits by only considering the synonyms with score
&gt; 0. This is almost exclusively caused by one alternative name for this gene, i.e.
Cytoskeletal protein (through UMLS). Although this gene is indeed related to
this concept, the concept is much broader. So it is a hypernym of the submitted
concept and excluding it is actually beneficial: it would have given 16 times
more false positives if it were included. This is the pattern for most examples with a
high miss fraction. Consider atezolizumab: here the hypernym anti-pd-l1 (from
ChEMBL) is successfully excluded by our method. So we consider it justified
to exclude the tail of high misses and focus on the median: more than 98% of
the hits were found by less than 16% of the synonyms. The minimum number
of synonyms needed to get 98% of the hits is actually 9.01%, meaning that we
submit less than double this absolute minimum.</p>
      <p>For ambiguous synonyms we conducted a manual check for false positives
in a small subset of 20 clinical studies prominently containing ASA. Nine out
of twenty are not about aspirin at all: we found 3 about 5-aminosalicylates, 3
about the ASA-PS classification and 1 each about advanced surface ablation,
Avonex-Steroid-Azathioprine and Argininosuccinic Aciduria. Of the other 11,
only one does not contain one of the other synonyms included by our method.
So we avoid 45% false positives and trade these for 9% false negatives.</p>
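      <p>The two percentages follow directly from the counts of the manual check. The short Python calculation below only re-derives them. The variable names are chosen for this illustration.</p>
      <preformat>
```python
total_studies = 20                      # clinical studies containing "ASA"
not_about_aspirin = 3 + 3 + 1 + 1 + 1   # 5-ASA, ASA-PS and three single cases
about_aspirin = total_studies - not_about_aspirin
missed_without_asa = 1                  # aspirin study with no other retained synonym

false_positive_rate = not_about_aspirin / total_studies
false_negative_rate = missed_without_asa / about_aspirin

print(f"{false_positive_rate:.0%} false positives avoided")
print(f"{false_negative_rate:.0%} false negatives introduced")
```
      </preformat>
      <p>Note that the two rates use different denominators: the false positives are counted over all 20 studies, the false negatives over the 11 studies that are actually about aspirin.</p>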
    </sec>
    <sec id="sec-5">
      <title>Conclusion</title>
      <p>We have reduced the number of submitted synonyms by 84%, thereby losing only
1.7% of the hits. The false positive exclusion, which is a lot harder to check, also
seems to work well based on the small manually curated sample.</p>
      <p>On a broader level we see that the actual number of synonyms needed to
attain most of the text hits is even lower: on average we only need between 2 and 3
synonyms, and the median is even 1! So for public data sets it might be a hint to focus
more on quality than on quantity when choosing alternative names for concepts.</p>
      <p>Overall we can conclude that the method works well, despite the fact that
it is very simple. It is a clear demonstration of the cheap gains we get by
combining multiple data sets in one semantic framework.</p>
    </sec>
    <sec id="sec-6">
      <title>APPENDIX Complete example: Aspirin</title>
      <p>The example output we generated when evaluating Aspirin is presented here.
The synonyms are submitted in the order of Table 1, which also contains the
results for each synonym. As you can see, the first 2 synonyms bring the bulk
of the hits. The second one (Acetylsalicylic acid) has about 8 times fewer hits
than Aspirin. But of those 10000 hits only about a third is unique: the others
already have a hit for Aspirin. This pattern recurs, although all subsequent
synonyms return even fewer hits. The only synonym returning a significant number
of hits is ASA, but as mentioned in the text this is a very ambiguous word
with many different meanings. Many of these hits are identified as false positives.</p>
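      <p>Overall Aspirin has 75 alternative names. In the table we omit the containing
synonyms (such as aspirin sodium). In total 17.3% of the synonyms are
submitted (13) and we miss 1.04% of the hits. In the optimal case only 2.7% of the
synonyms have to be submitted to obtain 98% of the hits. The timings for the
queries are: 70 ms for retrieving hits for Aspirin only, 190 ms for retrieving all
hits for synonyms with score &gt; 0 and 2650 ms for all 75 synonyms.</p>
      <table-wrap id="tab-1">
        <label>Table 1</label>
        <table>
          <thead>
            <tr><th>Synonym</th></tr>
          </thead>
          <tbody>
            <tr><td>measurin</td></tr>
            <tr><td>aspirine</td></tr>
            <tr><td>postmi 75</td></tr>
            <tr><td>gencardia</td></tr>
            <tr><td>aspro clr</td></tr>
            <tr><td>equi-prin</td></tr>
            <tr><td>disprin cv</td></tr>
            <tr><td>alka rapid</td></tr>
            <tr><td>postmi 300</td></tr>
            <tr><td>polopiryna</td></tr>
            <tr><td>acetophen</td></tr>
            <tr><td>angettes 75</td></tr>
            <tr><td>nu-seals 75</td></tr>
            <tr><td>nu-seals 300</td></tr>
            <tr><td>nu-seals 600</td></tr>
            <tr><td>8-hour bayer</td></tr>
            <tr><td>micropirin ec</td></tr>
            <tr><td>disprin direct</td></tr>
            <tr><td>acetosalic acid</td></tr>
            <tr><td>acetylsalic acid</td></tr>
            <tr><td>anadin all night</td></tr>
            <tr><td>2-acetoxybenzoate</td></tr>
            <tr><td>acetyl salicylate</td></tr>
            <tr><td>acetylsalicylsäure</td></tr>
            <tr><td>nu-seals cardio 75</td></tr>
            <tr><td>azetylsalizylsäure</td></tr>
            <tr><td>azetylsalizylsaeure</td></tr>
            <tr><td>acetylsalicylsaeure</td></tr>
            <tr><td>acetylsalisylic acid</td></tr>
            <tr><td>bayer extra strength</td></tr>
            <tr><td>acetylsalicyclic acid</td></tr>
            <tr><td>acetyl salicylic acid</td></tr>
            <tr><td>ácido acetilsalicílico</td></tr>
            <tr><td>acetyl salicyclic acid</td></tr>
            <tr><td>2-acetoxy-benzoic acid</td></tr>
            <tr><td>acide acétylsalicylique</td></tr>
            <tr><td>acetylsalicylicum acidum</td></tr>
            <tr><td>acide 2-(acétyloxy)benzoïque</td></tr>
            <tr><td>acide 2-(acetyloxy)benzoique</td></tr>
            <tr><td>(aspirin)2-acetoxy-benzoic acid</td></tr>
            <tr><td>2-(methoxycarbonyl)benzoic acid</td></tr>
            <tr><td>ecotrin</td></tr>
            <tr><td>asa</td></tr>
          </tbody>
        </table>
      </table-wrap>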
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Voorhees</surname>, <given-names>Ellen M.</given-names>
          </string-name>
          <article-title>Query expansion using lexical-semantic relations</article-title>.
          <source>Proceedings of the 17th annual international ACM SIGIR conference on Research and development in information retrieval</source>.
          Springer-Verlag New York, Inc.,
          <year>1994</year>.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Mandala</surname>, <given-names>Rila</given-names>
          </string-name>,
          <string-name>
            <given-names>Takenobu</given-names> <surname>Tokunaga</surname>
          </string-name>, and
          <string-name>
            <given-names>Hozumi</given-names> <surname>Tanaka</surname>
          </string-name>.
          <article-title>Combining multiple evidence from different types of thesaurus for query expansion</article-title>.
          <source>Proceedings of the 22nd annual international ACM SIGIR conference on Research and development in information retrieval</source>.
          ACM,
          <year>1999</year>.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>