
Efficient synonym search by semantic linking of multiple data sets

Kenny Knecht

Berenice Wulbrecht

Filip Pattyn

Hans Constant

ONTOFORCE NV

Belgium

kenny@ontoforce.com

We describe a method to automatically pick a highly relevant subset of synonyms to broaden a keyword-based text search. Public datasets in the biomedical area tend to provide a plethora of synonyms or alternative names. It is not uncommon for chemicals or diseases to have more than 50 different alternative names in data sets like UMLS or ChEMBL. Using all of these to extend an initial search may result in inefficient searches and sometimes even in false positives. Through semantic linking of several datasets we define a heuristic which increases the power of the search while making it more efficient. We evaluated the method on the 500 most common keyword searches used in the first 6 months of 2017 in the semantic web platform DISQOVER (www.disqover.com). More than 98% of the hits are retrieved by submitting only 16% of the synonyms. We implemented this method as a visual suggestion, which the user can override manually at any time. Although we focus our examples and concrete implementation on the biomedical databases in the publicly available DISQOVER, we would like to stress that the method is much more generally applicable.

Keywords: semantic web, text search, synonyms, data integration

The problem

Traditionally, search engines start with a text query. All the documents containing that text are subsequently returned: in DISQOVER these can be publications, clinical trials or funded research programs, but also other concepts like diseases, genes, variants or chemicals.

Our central example is the search for all documents about the concept aspirin. Merely typing "aspirin" will surely return many relevant results, but some will be missed. For example, some documents may only mention the more scientific name "Acetylsalicylic acid". So logically the user would like to expand the search with synonyms, i.e. query for "aspirin" OR "Acetylsalicylic acid".
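As a minimal sketch of this kind of expansion, the snippet below joins a keyword and its synonyms into one OR query. The function name `build_or_query` and the quoted-phrase syntax are assumptions made for illustration; they are not taken from DISQOVER.

```python
def build_or_query(terms):
    """Join search terms into one OR query, quoting each term as a phrase.

    Hypothetical helper: the query syntax is illustrative only.
    """
    seen = set()
    unique_terms = []
    for term in terms:
        cleaned = term.strip()
        key = cleaned.lower()
        # Skip blanks and case-insensitive duplicates, keep first spelling.
        if key and key not in seen:
            seen.add(key)
            unique_terms.append(cleaned)
    return " OR ".join(f'"{t}"' for t in unique_terms)


query = build_or_query(["aspirin", "Acetylsalicylic acid", "Aspirin"])
print(query)  # "aspirin" OR "Acetylsalicylic acid"
```

Deduplicating case-insensitively keeps the query short when several sources contribute the same name with different capitalization.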

Since DISQOVER brings together many data sources, many of them contributing synonyms and other alternative names, this can be easily automated. The record for Aspirin, for example, collects data from no less than 13 different databases (HMDB, DrugCentral, DrugBank, HSDB, UNII, ChEMBL, UMLS, ChEBI, IUPHAR Compendium, SureChEMBL, RxNorm, MeSH, PubChem), of which 8 contribute alternative names. In total the public databases give 75 distinct alternative names for aspirin, many of which are not strictly synonyms but hyponyms like "Nu-Seals 300" or "Bayer Extra Strength" (ChEMBL via http://www.w3.org/2004/02/skos/core#altLabel). Including all those alternative names in our search will give a fairly complete result, but may also be very inefficient.

There is also another risk involved. Elaborating on the previous example, one of the alternative names is ASA. Although this is used as an alternative name for aspirin, it is also the abbreviation of the anti-sarcolemmal autoantibodies Mus musculus gene, of the disease Argininosuccinic aciduria and many more. A search for this word will inevitably introduce many false positives. Another source of ambiguity may be hypernymy: alternative names that are broader than the submitted keyword.

Related work

While query expansion is a much-studied subject, we apply it here in the highly specialized field of biomedical sciences. This makes the use of tools like WordNet, as is done in [1, 2], less effective. However, contrary to general language queries, we have the advantage that the biomedical field is covered excellently by several ontologies. Combining these multiple ontologies in a very simple heuristic to obtain an optimal query expansion makes our approach distinct from previous approaches.

The solution

We have opted to address the issues raised above with some simple heuristics, which rely heavily on the fact that semantic platforms like DISQOVER bring together multiple data sources. While each data source separately does not have adequate power to discriminate, together they do.

We use the following algorithm to prune the synonym list:

- Retrieve documents: We retrieve all the documents that match the query string exactly, either with their preferred label or with one of their alternative names. The provenance of each of these labels or names is also retrieved.
- Merge documents: The documents are merged if there is sufficient overlap between their names and if the classes of the documents are compatible. As an example, consider a broad term like lung cancer. Multiple disease instances match this term. By merging these instances and their synonyms we cover the landscape the user probably wants to investigate.
- Score synonyms: Synonyms are scored based on the number of data sets that support them:

  $$\text{score} = \frac{\#\text{data sets} - 1}{\max(\#\text{data sets}) - 1}$$

  If the maximum number of data sets is 1, the score is set to 1. This scales the score between 0 and 1. If a synonym is sufficiently short (the threshold is currently 8 characters), it is checked whether it gives rise to false positives. This happens by retrieving the classes of the documents having exact matches for that synonym. If more classes are found than the current concept possesses, the synonym gets a negative score for being ambiguous. If a synonym is a number with fewer than 5 digits or has fewer than 3 alphanumeric characters, it also gets a negative score for the same reason.
- Remove containing synonyms: If a synonym contains another, shorter synonym, there is no reason to put it in a query. If the shorter word has a lower score than the containing word, it inherits the higher score.

We only retain the synonyms which have a score larger than 0.
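The scoring and pruning steps can be sketched as follows. This is an illustration under stated assumptions, not the DISQOVER implementation: the score is read as (n - 1) / (max - 1), the penalty is an arbitrary -1.0, containment is a case-insensitive substring test, "more classes" is read as "any class the concept does not have", and `classes_for_exact_match` is a hypothetical lookup supplied by the caller. The data set names and class labels are toy values.

```python
def support_scores(provenance):
    """Score each synonym as (n - 1) / (max_n - 1), n = supporting data sets."""
    counts = {name: len(sources) for name, sources in provenance.items()}
    max_n = max(counts.values())
    if max_n == 1:  # nothing to discriminate on: every synonym scores 1
        return {name: 1.0 for name in counts}
    return {name: (n - 1) / (max_n - 1) for name, n in counts.items()}


def is_too_generic(name):
    """Numbers with fewer than 5 digits, or fewer than 3 alphanumeric chars."""
    if name.isdigit() and len(name) < 5:
        return True
    return sum(ch.isalnum() for ch in name) < 3


def prune_synonyms(provenance, concept_classes, classes_for_exact_match,
                   max_short_len=8, penalty=-1.0):
    scores = support_scores(provenance)
    # Penalize generic or ambiguous short synonyms.
    for name in list(scores):
        if is_too_generic(name):
            scores[name] = penalty
        elif len(name) <= max_short_len:
            if classes_for_exact_match(name) - concept_classes:
                scores[name] = penalty
    # Drop synonyms that contain a shorter one; the shorter one inherits
    # the higher of the two scores.
    names = sorted(scores, key=len)
    dropped = set()
    for i, short in enumerate(names):
        if short in dropped:
            continue
        for longer in names[i + 1:]:
            if (longer not in dropped and len(longer) > len(short)
                    and short.lower() in longer.lower()):
                scores[short] = max(scores[short], scores[longer])
                dropped.add(longer)
    ranked = sorted(scores.items(), key=lambda item: -item[1])
    return {n: s for n, s in ranked if n not in dropped and s > 0}


# Toy provenance: synonym -> data sets that list it (placeholder names).
provenance = {
    "aspirin": {"source_a", "source_b", "source_c"},
    "acetylsalicylic acid": {"source_a", "source_b"},
    "aspirin sodium": {"source_a", "source_b"},
    "asa": {"source_a", "source_b"},
    "nu-seals 300": {"source_a"},
}
toy_classes = {"aspirin": {"chemical"}, "asa": {"chemical", "gene", "disease"}}
kept = prune_synonyms(provenance, {"chemical"},
                      lambda name: toy_classes.get(name, set()))
print(kept)  # {'aspirin': 1.0, 'acetylsalicylic acid': 0.5}
```

In this toy run "asa" is rejected as ambiguous, "nu-seals 300" scores 0 because only one data set supports it, and "aspirin sodium" is dropped because it contains "aspirin".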

In the example of aspirin we retain 12 possible synonyms (16%), the first three being aspirin, acetylsalicylic acid and salicylic acid acetate.

Evaluation and results

We have analyzed the results of the 500 most prevalent keywords which were recorded in DISQOVER in the first 6 months of 2017. For each keyword we ordered all the synonyms by score and, in case of a tie, by descending word length. As an evaluation we submitted all these synonyms to the search engine in this order, instead of submitting only the optimal synonyms, and we recorded the following per synonym:

- How many hits we found in total by accumulating the synonyms with OR
- How many hits we get by submitting only the current synonym

An example of this output is shown in Appendix A. Per merged document we register:

- The optimal number of synonyms. We determine this by re-ordering the synonyms by descending number of hits they add and counting how many synonyms we would minimally need to obtain 98% of the hits
- The total number of synonyms per merged document
- The number of synonyms with score > 0

This enables us to measure what we miss if we only submit synonyms with a score > 0.
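The measurements described above can be sketched with plain sets of document identifiers. The function names, the greedy reading of "re-ordering by descending number of hits they add", and the toy hit sets are all assumptions made for illustration; the real evaluation ran against the search engine.

```python
def cumulative_hits(ordered_synonyms, hits_by_synonym):
    """Per synonym: (name, own hit count, cumulative hit count under OR)."""
    seen = set()
    rows = []
    for synonym in ordered_synonyms:
        own = hits_by_synonym[synonym]
        seen |= own
        rows.append((synonym, len(own), len(seen)))
    return rows


def minimal_synonyms_for_coverage(hits_by_synonym, target=0.98):
    """Greedily pick the synonym adding the most new hits until the target
    fraction of all hits is covered (one reading of the 'optimal number')."""
    all_hits = set().union(*hits_by_synonym.values())
    remaining = dict(hits_by_synonym)
    covered, chosen = set(), []
    while all_hits and len(covered) < target * len(all_hits):
        best = max(remaining, key=lambda s: len(remaining[s] - covered))
        covered |= remaining.pop(best)
        chosen.append(best)
    return chosen


def missed_fraction(hits_by_synonym, scores):
    """Fraction of all hits lost when only synonyms with score > 0 are used."""
    all_hits = set().union(*hits_by_synonym.values())
    kept = set()
    for synonym, hits in hits_by_synonym.items():
        if scores[synonym] > 0:
            kept |= hits
    return 1 - len(kept) / len(all_hits)


# Toy data: each synonym maps to the ids of the documents it matches.
hits = {
    "aspirin": set(range(0, 90)),
    "acetylsalicylic acid": set(range(60, 99)),
    "asa": set(range(95, 100)),
}
scores = {"aspirin": 1.0, "acetylsalicylic acid": 0.5, "asa": -1.0}

for row in cumulative_hits(["aspirin", "acetylsalicylic acid", "asa"], hits):
    print(row)
print(minimal_synonyms_for_coverage(hits))
print(round(missed_fraction(hits, scores), 4))
```

Working on sets makes the overlap between synonyms explicit: a synonym with many hits of its own may still add very few new documents to the accumulated result.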

We can split the results into two big groups, which we analyze separately. On the one hand we have keywords for which the score is not able to discriminate, so all synonyms need to be checked. Basically this happens when there is only one data source contributing to the concept or when all the data sources completely agree on all alternative names. On the other hand we have the group for which the method does make a difference and allows us to skip some synonyms.

The first group has on average 6.16 synonyms per keyword, while the second group has 30.60 synonyms per keyword. In other words, in the cases where the method does not make any difference, there was not really a need for it to begin with.

Table: results per group (columns: Group, All synonyms considered, Filtered by method).

In the second group we submit only 15.8% of all the synonyms if we apply the method. We observe that the run time roughly scales with the number of synonyms we submit. If we compare the number of hits we obtain with this subset to the hits obtained by submitting all the synonyms, on average we miss 9.4%. However, the median of the fraction of missed hits is 1.7%, so we have a few very high outliers.

What is causing this? The worst example is DNM2, where we miss 94% of all the hits by only considering the synonyms with score > 0. This is almost exclusively caused by one alternative name for this gene, i.e. Cytoskeletal protein (through UMLS). Although this gene is indeed related to this concept, the concept is much broader. It is a hypernym of the submitted concept and excluding it is actually beneficial: it would have given 16 times more false positives if it were included. This pattern holds for most examples with a high miss fraction. Consider atezolizumab: here the hypernym anti-pd-l1 (from ChEMBL) is successfully excluded by our method. So we consider it justified to exclude the tail of high misses and focus on the median: more than 98% of the hits were found with less than 16% of the synonyms. The minimum fraction of synonyms needed to get 98% of the hits is actually 9.01%, meaning that we submit less than double this absolute minimum.

For ambiguous synonyms we conducted a manual check for false positives in a small subset of 20 clinical studies prominently containing ASA. Nine out of twenty are not about aspirin at all: we found 3 about 5-aminosalicylates, 3 about the ASA-PS classification and 1 each about advanced surface ablation, Avonex-Steroid Azathioprine and Argininosuccinic Aciduria. Of the other 11, only one does not contain any of the other synonyms included by our method. So we avoid 45% false positives and trade these for 9% false negatives.
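The two percentages follow directly from the counts in the manual check. The short calculation below only restates that arithmetic; the variable names are chosen here for illustration.

```python
total_studies = 20
# 3 on 5-aminosalicylates, 3 on ASA-PS, 3 single cases on other meanings.
not_about_aspirin = 3 + 3 + 3
about_aspirin = total_studies - not_about_aspirin
# Aspirin studies that none of the retained synonyms would have found.
only_found_via_asa = 1

false_positive_share = not_about_aspirin / total_studies
false_negative_share = only_found_via_asa / about_aspirin

print(about_aspirin)                   # 11
print(f"{false_positive_share:.0%}")   # 45%
print(f"{false_negative_share:.0%}")   # 9%
```

Note that the two shares use different denominators: the false positives are counted against all 20 studies, the false negatives against the 11 studies that really are about aspirin.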

Conclusion

We have reduced the number of submitted synonyms by 84%, thereby losing only 1.7% of the hits (median). The false positive exclusion, which is a lot harder to check, also seems to work well based on the small manually curated sample.

On a broader level we see that the actual number of synonyms needed to attain most of the text hits is even lower: on average we only need between 2 and 3 synonyms, and the median is even 1! So for public data sets this might be a hint to focus more on quality than on quantity when choosing alternative names for concepts.

Overall we can conclude that the method works well, despite the fact that it is very simple. It clearly demonstrates the cheap gains we get by combining multiple data sets in one semantic framework.

Appendix A. Complete example: Aspirin

The example output we generated when evaluating Aspirin is presented here. The synonyms are submitted in the order of Table 1, which also contains the results for each synonym. As you can see, the first 2 synonyms bring the bulk of the hits. The second one (Acetylsalicylic acid) has about 8 times fewer hits than Aspirin. But of those 10000 hits only about a third are unique: the others already have a hit for Aspirin. This pattern recurs, although all subsequent synonyms return even fewer hits. The only other synonym returning a significant number of hits is ASA, but as mentioned in the text this is a very ambiguous word with many different meanings. Many of these hits are identified as false positives.

Overall Aspirin has 75 alternative names. In the table we omit the containing synonyms (such as aspirin sodium). In total 17.3% of the synonyms are submitted (13) and we miss 1.04% of the hits. In the optimal case only 2.7% of the synonyms have to be submitted to obtain 98% of the hits. The timings for the queries are: 70 ms for retrieving hits for Aspirin only, 190 ms for retrieving all hits for synonyms with score > 0 and 2650 ms for all 75 synonyms.

synonym
measurin
aspirine
postmi 75
gencardia
aspro clr
equi-prin
disprin cv
alka rapid
postmi 300
polopiryna
acetophen
angettes 75
nu-seals 75
nu-seals 300
nu-seals 600
8-hour bayer
micropirin ec
disprin direct
acetosalic acid
acetylsalic acid
anadin all night
2-acetoxybenzoate
acetyl salicylate
acetylsalicylsäure
nu-seals cardio 75
azetylsalizylsäure
azetylsalizylsaeure
acetylsalicylsaeure
acetylsalisylic acid
bayer extra strength
acetylsalicyclic acid
acetyl salicylic acid
ácido acetilsalicílico
acetyl salicyclic acid
2-acetoxy-benzoic acid
acide acétylsalicylique
acetylsalicylicum acidum
acide 2-(acétyloxy)benzoïque
acide 2-(acetyloxy)benzoique
(aspirin)2-acetoxy-benzoic acid
2-(methoxycarbonyl)benzoic acid
ecotrin
asa

1. Voorhees, Ellen M. "Query expansion using lexical-semantic relations." Proceedings of the 17th annual international ACM SIGIR conference on Research and development in information retrieval. Springer-Verlag New York, Inc., 1994.

2. Mandala, Rila, Takenobu Tokunaga, and Hozumi Tanaka. "Combining multiple evidence from different types of thesaurus for query expansion." Proceedings of the 22nd annual international ACM SIGIR conference on Research and development in information retrieval. ACM, 1999.