A web prototype for detecting chemical compounds and drugs


    Daniel Sánchez-Cisneros1, Sara Lana-Serrano2, Isabel Segura-Bedmar1, Leonardo
                            Campillos3, Paloma Martínez3

    1 Computer Science Department, Universidad Carlos III de Madrid, Spain
      {dscisner, isegura, pmf}@inf.uc3m.es
    2 Universidad Politécnica de Madrid, Spain
      slana@diatel.upm.es
    3 Universidad Autónoma de Madrid
      leonardo.campillos@uam.es


        Abstract. This paper introduces a web prototype for named entity recognition
        of chemical compounds and drugs. The tool is based on a system developed to
        participate in the CHEMDNER task organized as part of the BioCreative 2013
        workshop. The system combines the ChemSpot tool with a set of semantic-based
        rules, which were defined according to the guidelines provided to task
        participants. The prototype is available at
        http://multimedica.uc3m.es:8080/biocreative2013demo/

        Keywords: Drug named entity recognition, information extraction


1       Introduction

Most research on named entity recognition (NER) in the biomedical domain is based
on dictionary-based methods and supervised machine learning (SML) methods. The
main problems with the former approach are its domain dependency and its inability
to recognize terms not included in the dictionaries. Machine learning techniques
build classification models from annotated corpora and produce the best results [1],
but they depend on the availability of such corpora.
Current trends aim to develop hybrid systems that combine the best of the two
approaches. In this work we present a prototype that combines existing systems such
as ChemSpot [2] and MetaMap [3] with gazetteers extracted from biomedical resources
such as MeSH1, DrugBank2, Wikipedia3 and ChEBI [4]. Lastly, based on an error analysis
of the development set, we defined a set of semantic rules to detect false negatives
and discard false positives generated by the previous processes. This paper presents
a web tool built on this system. The tool allows the user to enter a text and then
detects the chemical compounds and drugs occurring in it.
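
The hybrid idea described above (a base tagger whose output is corrected with
gazetteers) can be illustrated with a minimal Python sketch. The two term lists, the
function names and the pretend tagger output are all hypothetical and chosen only for
illustration; they are not the resources or rules of the actual system.

```python
import re

# Hypothetical toy resources (assumption): the real gazetteers are not reproduced here.
CHEMICAL_GAZETTEER = {"aspirin", "acetylsalicylic acid"}
FALSE_POSITIVE_TERMS = {"gold standard"}


def gazetteer_matches(text, terms):
    """Return (start, end, surface) for every case-insensitive whole-word match."""
    spans = []
    for term in terms:
        pattern = r"\b" + re.escape(term) + r"\b"
        for match in re.finditer(pattern, text, flags=re.IGNORECASE):
            spans.append((match.start(), match.end(), match.group(0)))
    return spans


def combine(text, base_spans):
    """Drop base spans found in the stop list, then add gazetteer hits."""
    kept = {s for s in base_spans if s[2].lower() not in FALSE_POSITIVE_TERMS}
    kept.update(gazetteer_matches(text, CHEMICAL_GAZETTEER))
    return sorted(kept)


text = "Aspirin was compared against the gold standard."
start = text.index("gold standard")
# Pretend output of a statistical tagger: one false positive, one missed mention.
base = [(start, start + len("gold standard"), "gold standard")]
result = combine(text, base)
print(result)
```

In this toy run the stop list removes the spurious "gold standard" span and the
gazetteer recovers the missed mention of "Aspirin".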


                                 Fig. 1. Pipeline architecture


2       Description of the prototype

Figure 1 shows the pipeline architecture of the prototype. In the first step, texts are
processed by the ChemSpot tool, which identifies mentions of chemicals. The next three
processes extract semantic knowledge from the ChEBI ontology, the MeSH vocabulary and
the UMLS Metathesaurus (using the MetaMap tool). In particular, the semantic features
used are: MeSH semantic types, MeSH type, MeSH_TreeNumbers, UMLS semantic types, and
ancestors from ChEBI, obtained by recursively traversing the relationships is_a,
has_role, is_conjugate_acid_of and is_conjugate_base_of. In the next phase, a gazetteer
tagger implemented in the GATE4 environment is used. Based on an error analysis of the
development set, a set of 27 gazetteers with more than 340,000 entries was compiled to
process texts in order to rule out false positive instances and to annotate false
negative instances that were not recognized in the previous steps. The sixth module is
the ANNIE PoS tagger included in GATE. PoS tags are used to discard some instances as
well as to define the rules used in the last two steps, which classify the entities
according to PoS tagging, affix processing and multiword processing. More information
about the processes and resources used can be found in [5].

1 http://www.nlm.nih.gov/mesh/meshhome.html
2 http://www.drugbank.ca
3 http://wikipedia.org
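
The recursive collection of ChEBI ancestors mentioned in this section can be sketched
as follows. This is only an illustration under stated assumptions: the ontology is a
hand-made toy dictionary with invented node names, not real ChEBI content, and the
function name ancestors is hypothetical. Only the four relationship names come from
the text.

```python
# Relationship names taken from the description above.
RELATIONS = ("is_a", "has_role", "is_conjugate_acid_of", "is_conjugate_base_of")

# Toy graph (assumption): node names and edges are invented for illustration.
TOY_ONTOLOGY = {
    "compound_x": {"is_a": ["carboxylic_acid"], "has_role": ["drug"]},
    "carboxylic_acid": {
        "is_a": ["organic_acid"],
        "is_conjugate_acid_of": ["carboxylate"],
    },
    "carboxylate": {"is_conjugate_base_of": ["carboxylic_acid"]},
    "organic_acid": {},
    "drug": {},
}


def ancestors(term, ontology, relations=RELATIONS):
    """Collect every node reachable from `term` through the given relations."""
    visited = {term}  # guards against cycles such as conjugate acid/base pairs

    def visit(node):
        for relation in relations:
            for parent in ontology.get(node, {}).get(relation, []):
                if parent not in visited:
                    visited.add(parent)
                    visit(parent)

    visit(term)
    return visited - {term}


print(sorted(ancestors("compound_x", TOY_ONTOLOGY)))
```

The visited set matters because is_conjugate_acid_of and is_conjugate_base_of point at
each other, so a naive recursion over these relations would not terminate.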

Figure 2 shows a screenshot of the web tool. The tool allows users to enter a text to be
processed by the system. As a result of the processing, chemical compounds and drugs
appear highlighted in the text. In addition, the identified chemical compounds are
linked to the ChEBI database.
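
A minimal sketch of how recognized mentions could be rendered as highlighted, linked
text is given below. It does not describe the prototype's actual front end: the
function name highlight, the choice of the mark element, the URL pattern in CHEBI_URL
and the placeholder identifier CHEBI:XXXXX are all assumptions made for illustration.

```python
from html import escape

# Assumed URL pattern for a ChEBI entry page; the prototype may link differently.
CHEBI_URL = "https://www.ebi.ac.uk/chebi/searchId.do?chebiId={}"


def highlight(text, mentions):
    """Wrap each mention in <mark>, linking it when an identifier is given.

    `mentions` is a list of non-overlapping (start, end, chebi_id_or_None) tuples.
    """
    parts, cursor = [], 0
    for start, end, chebi_id in sorted(mentions, key=lambda m: m[0]):
        parts.append(escape(text[cursor:start]))  # plain text before the mention
        surface = escape(text[start:end])
        if chebi_id:
            link = escape(CHEBI_URL.format(chebi_id), quote=True)
            surface = '<a href="{}">{}</a>'.format(link, surface)
        parts.append("<mark>{}</mark>".format(surface))
        cursor = end
    parts.append(escape(text[cursor:]))  # remaining plain text
    return "".join(parts)


sample = "Aspirin inhibits COX-1 & COX-2."
# "CHEBI:XXXXX" is a placeholder, not a real identifier.
fragment = highlight(sample, [(0, 7, "CHEBI:XXXXX")])
print(fragment)
```

Escaping the text between mentions keeps characters such as "&" from breaking the
generated markup.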


                 Fig. 2. A web tool for identifying chemical compounds and drugs.


4 http://gate.ac.uk

The system was evaluated on the test dataset provided by BioCreative IV
(CHEMDNER 2013 task5). It recognized chemical and drug named entities with an
F-measure of 0.594 in the Chemical Entity Mention (CEM) evaluation. As future work,
we plan to conduct an evaluation with users to measure the usability of our tool.
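
For readers unfamiliar with the metric, the sketch below shows how a mention-level
F-measure can be computed from gold and predicted annotations. It assumes that a
prediction counts as correct only when document, start offset and end offset all match
exactly. The annotations are made-up toy values and do not reproduce the reported
result or the official evaluation script.

```python
def precision_recall_f1(gold, predicted):
    """Exact-match precision, recall and F1 over sets of (doc_id, start, end)."""
    gold, predicted = set(gold), set(predicted)
    true_positives = len(gold & predicted)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Toy annotations (assumption): invented offsets, for illustration only.
gold_mentions = {("doc1", 0, 7), ("doc1", 20, 29), ("doc2", 5, 12)}
predicted_mentions = {("doc1", 0, 7), ("doc2", 5, 12), ("doc2", 30, 34), ("doc2", 40, 44)}

p, r, f = precision_recall_f1(gold_mentions, predicted_mentions)
print(round(p, 3), round(r, 3), round(f, 3))
```

Here two of the four predictions match two of the three gold mentions, so precision is
2/4, recall is 2/3 and the F-measure is their harmonic mean, 4/7.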


Acknowledgments

This work was supported by the EU project TrendMiner [FP7-ICT287863], by the
project MultiMedica [TIN 2010-20644-C03-01], and by the research network
MA2VICMR [S2009/TIC-1542].


References

    1. Rocktäschel, T., Huber, T., Weidlich, M., Leser, U.: WBI-NER: The impact of
       domain-specific features on the performance of identifying and classifying mentions
       of drugs. Proceedings of SemEval 2013, pp. 356-363, (2013).
    2. Rocktäschel, T., Weidlich, M., Leser, U.: ChemSpot: a hybrid system for chemical named
       entity recognition. Bioinformatics, 28(12), pp. 1633-1640, (2012).
    3. Aronson, A.R.: Effective mapping of biomedical text to the UMLS Metathesaurus: the
       MetaMap program. Proceedings of the AMIA Symposium, American Medical Informatics
       Association, pp. 17-21, (2001).
    4. Degtyarenko, K., De Matos, P., Ennis, M., Hastings, J., Zbinden, M., McNaught, A.,
       Alcántara, R., Darsow, M., Guedj, M., Ashburner, M.: ChEBI: a database and ontology for
       chemical entities of biological interest. Nucleic Acids Research, 36, pp. 344-350, (2008).
    5. Lana-Serrano, S., Sánchez-Cisneros, D., Campillos, L., Segura-Bedmar, I.: Recognizing
       chemical compounds and drugs: a rule-based approach using semantic information.
       Proceedings of the Fourth BioCreative Challenge Evaluation Workshop, vol. 2, (2013).


5 http://www.biocreative.org/tasks/biocreative-iv/chemdner/