Using a Hybrid Approach for Entity Recognition in the Biomedical Domain

Marco Basaldella
Università degli Studi di Udine
Via delle Scienze 208, Udine
basaldella.marco.1@spes.uniud.it

Lenz Furrer
Institute of Computational Linguistics, University of Zurich
Andreasstrasse 15, CH-8050 Zürich
lenz.furrer@uzh.ch

Nico Colic
Institute of Computational Linguistics, University of Zurich
ncolic@gmail.com

Tilia R. Ellendorff
Institute of Computational Linguistics, University of Zurich
ellendorff@cl.uzh.ch

Carlo Tasso
Università degli Studi di Udine
carlo.tasso@uniud.it

Fabio Rinaldi
Institute of Computational Linguistics, University of Zurich
fabio.rinaldi@uzh.ch

Abstract

This paper presents an approach towards high performance extraction of biomedical entities from the literature, which is built by combining a high-recall dictionary-based technique with a high-precision machine learning filtering step. The technique is then evaluated on the CRAFT corpus. We present the performance we obtained, analyze the errors and propose a possible follow-up of this work.

1 Introduction

Technical term extraction (herein TTE) is the problem of extracting relevant technical terms from a scientific paper. It can be seen as related to Named Entity Recognition (NER), where the entities one wants to extract are technical terms belonging to a given field. For example, while in traditional NER the entities that one is looking for are of the types "Person", "Date", "Location", etc., in TTE we look for terms belonging to a particular domain, e.g. "Gene", "Protein", "Disease", and so on (Nadeau and Sekine, 2007). A further evolution is the task of Concept Recognition (CR), where the entity is also matched to a concept in an ontology.

NER (and thus TTE) can be solved using very different techniques:

- Rule-based approach: a set of manually written rules is used to identify entities. This technique may require deep domain and linguistic knowledge. A simple example is the task of recognizing US phone numbers, which can be solved by a simple regular expression (see the sketch after this list).

- Machine learning-based approach: a statistical classifier, such as Naive Bayes or Conditional Random Fields, is used to recognize entities. Several different types of features can be used by such systems, for example prefixes and suffixes of the entity candidates, the number of capital letters, etc. A major drawback of this approach is that it typically requires a large, manually annotated corpus for algorithm training and testing.

- Dictionary-based approach: candidate entities are matched against a dictionary of known entities. The obvious drawback of this approach is that it is not able to recognize new entities, making this technique ineffective e.g. in documents which present new discoveries.

- Hybrid approaches: two or more of the previous techniques are used together. For example, Sasaki et al. (2008) as well as Akhondi et al. (2016) combine the dictionary and ML-based approaches to exploit the strengths of both.
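As a concrete illustration of the rule-based approach, the following is a minimal sketch of a rule-based recognizer for US phone numbers. It is our own illustration, not part of any of the systems discussed here; the pattern and the sample text are invented.

```python
import re

# A deliberately simple pattern for US phone numbers such as
# "555-123-4567" or "(555) 123-4567"; real-world rules would
# need to cover many more formatting variants.
US_PHONE = re.compile(r"\(?\d{3}\)?[-. ]?\d{3}[-. ]?\d{4}")

text = "Call (555) 123-4567 or 555-987-6543 for assistance."
print(US_PHONE.findall(text))
# ['(555) 123-4567', '555-987-6543']
```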
The aim of this work is to propose a hybrid approach based on two stages. First, we have a dictionary phase, where a list of all the possible terms is generated by looking for matches in a database. This aims to build a low-precision, high-recall set containing all the candidate TTs. Then, this set is filtered using a machine learning algorithm that ideally is able to discriminate between "good" and "bad" terms selected in the dictionary matching phase, in order to increase the precision.

This approach is realized by using two software modules. The first phase is performed by the OntoGene pipeline (Rinaldi et al., 2012b; Rinaldi, 2012), which performs TTE on documents in the biomedical field using a dictionary approach. Then, OntoGene's results are handed to Distiller, a framework for information extraction introduced in Basaldella et al. (2015), which performs the machine learning filtering phase.
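The two-stage architecture can be summarized with a short sketch. This is our own schematic rendering, not the actual OntoGene or Distiller code; `dictionary_candidates` and `ml_filter` are hypothetical stand-ins for the two modules.

```python
from typing import Callable, List

def hybrid_tte(text: str,
               dictionary_candidates: Callable[[str], List[str]],
               ml_filter: Callable[[str, str], bool]) -> List[str]:
    """Two-stage hybrid TTE: high-recall dictionary lookup,
    followed by a high-precision machine learning filter."""
    # Stage 1: collect every dictionary match (low precision, high recall).
    candidates = dictionary_candidates(text)
    # Stage 2: keep only the candidates the classifier accepts.
    return [term for term in candidates if ml_filter(term, text)]
```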
2 Related Work

The field of technical term extraction has about 20 years of history, with early works focusing on extracting a single category of terms, such as protein names, from scientific papers (Fukuda et al., 1998). Later on, "term extraction" became the common name for this task, and some scholars started to introduce the use of terminological resources as a starting point for solving this problem (Aubin and Hamon, 2006).

While the most recent state-of-the-art performance is obtained by machine learning based systems (Leaman et al., 2015), there is growing interest in hybrid machine learning and dictionary systems such as the one described by Akhondi et al. (2016), which obtains interesting performance on chemical entity recognition in patent texts. In the field of concept recognition, there are different strategies for improving the coverage of the recognized entities. For example, known orthologous relations between proteins of different species can be exploited for the detection of protein interactions in full text (Szklarczyk et al., 2015). Groza and Verspoor (2015) explore the impact of case sensitivity and of the information gain of individual tokens in multi-word terms on the performance of a concept recognition system.

The CRAFT corpus (Bada et al., 2012) has been built specifically for evaluating this kind of system, and is described in detail in Section 3.1. Funk et al. (2014) used the corpus to evaluate several CR tools, showing how they perform on the single ontologies in the corpus. Later, Tseytlin et al. (2016) compared their own NOBLE Coder software against other CR algorithms, showing a best F1-score of 0.44. Another system that makes use of CRAFT for evaluation purposes is described in Campos et al. (2013).

3 System Design

3.1 CRAFT Corpus

The CRAFT corpus is a set of 67 manually annotated journal articles from the biomedical field. [1] These articles are taken from the PubMed Central Open Access Subset, [2] a part of the PubMed Central archive licensed under Creative Commons licenses.

[1] The full CRAFT corpus comprises another 30 annotated articles, which are reserved for future competitions and have to date not been released.
[2] http://www.ncbi.nlm.nih.gov/pmc/tools/openftlist/

The corpus contains about 100,000 concept annotations which point to seven ontologies/terminologies:

- Chemical Entities of Biological Interest (ChEBI) (Degtyarenko et al., 2008)
- Cell Ontology [3]
- Entrez Gene (Maglott et al., 2005)
- Gene Ontology (biological process, cellular component, and molecular function) (Ashburner et al., 2000)
- the US National Center for Biotechnology Information (NCBI) Taxonomy [4]
- Protein Ontology [5]
- Sequence Ontology (Eilbeck et al., 2005)

[3] https://github.com/obophenotype/cell-ontology/
[4] http://www.ncbi.nlm.nih.gov/taxonomy
[5] http://pir.georgetown.edu/pirwww/index.shtml

Each of the 67 articles also contains linguistic information, such as tokenized sentences, part-of-speech information, parse trees, and dependency trees. Articles are represented in different formats, such as plain text or XML, and are easily navigable with common resources, such as the Knowtator plugin for the Protégé software. [6]

[6] http://knowtator.sourceforge.net/

To make references to documents in the CRAFT corpus easily retrievable for the reader, when we refer to an article contained in the corpus we list the name of its corresponding XML file as contained in the corpus distribution, its PubMed Central ID (PMCID), and its PubMed ID (PMID). [7]

[7] We will not include articles from the CRAFT corpus in the references, as they are not actual bibliography for the purposes of this work.

3.2 OntoGene

The OntoGene group has developed an approach for biomedical entity recognition based on dictionary lookup and flexible matching. Their approach has been used in several competitive evaluations of biomedical text mining technologies, often obtaining top-ranked results (Rinaldi et al., 2008; Rinaldi et al., 2010; Rinaldi et al., 2012a; Rinaldi et al., 2014). Recently, the core parts of the pipeline have been reimplemented in a more efficient framework using Python (Colic, 2016). It offers a flexible interface for performing dictionary-based TTE.

OntoGene's term annotation pipeline accepts a range of input formats, e.g. PubMed Central full-text XML, gzipped chunks of Medline abstracts, BioC, [8] or simply plain text. It provides the annotated terms along with the corresponding identifiers either in a simple tab-separated text file, in brat's standoff format, [9] or, again, in BioC. It allows for easily plugging in additional components, such as alternative NLP preprocessing methods or postfiltering routines.

[8] http://bioc.sourceforge.net/
[9] http://brat.nlplab.org/standoff.html

In the present work, the pipeline was configured as follows. After sentence splitting, the input documents were tokenized with a simple method based on character class: any contiguous sequence of either alphabetical or numerical characters was considered a token, whereas all other characters (punctuation and whitespace) were considered token boundaries and were ignored during the dictionary look-up. This lossy tokenization already has a normalizing effect, in that it collapses spelling variants which arise from inconsistent use of punctuation symbols, e.g. "SRC 1" vs. "SRC-1" vs. "SRC1". (A similar approach is described by Verspoor et al. (2010), who refer to it as "regularization".) All tokens are then converted to lowercase, except for acronyms that collide with a word from the general language (e.g. "WAS"); we enforced a case-sensitive match in these cases by using a list of the most frequent English words. As a further normalization step, Greek letters were expanded to their letter name in Latin spelling, e.g. α → alpha, since this is a common alternation.
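The tokenization and normalization just described can be sketched as follows. This is our own simplified reconstruction, not the actual OntoGene code; the set of protected acronyms is a hypothetical stand-in for the frequency-based English word list.

```python
import re

GREEK = {"α": "alpha", "β": "beta", "γ": "gamma"}  # excerpt only
PROTECTED = {"WAS"}  # acronyms colliding with common words (hypothetical list)

def normalize_tokens(text: str) -> list:
    """Character-class tokenization with lossy normalization:
    punctuation is dropped, so "SRC 1", "SRC-1" and "SRC1" all
    yield the tokens ["src", "1"]."""
    # Expand Greek letters to their spelled-out Latin names.
    for letter, name in GREEK.items():
        text = text.replace(letter, " %s " % name)
    # Alphabetical and numerical runs become separate tokens.
    tokens = re.findall(r"[A-Za-z]+|[0-9]+", text)
    # Lowercase everything except protected acronyms.
    return [t if t in PROTECTED else t.lower() for t in tokens]

print(normalize_tokens("SRC-1 binds β-catenin"))
# ['src', '1', 'binds', 'beta', 'catenin']
```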
For term matching, we compiled a dictionary resource using the Bio Term Hub (Ellendorff et al., 2015). The Bio Term Hub is a large biomedical terminology resource automatically compiled from a number of curated terminology databases. Its advantage lies in the ease of access, in that it provides terms and identifiers from different sources in a uniform format. It is accessible through a web interface, [10] which recompiles the resource on request and provides it as a tab-separated text file.

[10] http://pub.cl.uzh.ch/purl/biodb/

Selecting the seven ontologies used in CRAFT resulted in a term dictionary with 20.2 million entries. Based on preliminary tests, we removed all entries with terms shorter than two characters or terms consisting of digits only; this reduced the number of entries by less than 0.3%. In the OntoGene system, the entries of the term dictionary were then preprocessed in the same way as the documents. Finally, the input documents were compared to the dictionary with an exact-match strategy.
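A minimal sketch of the exact-match stage, assuming both the dictionary entries and the documents have been normalized with the same `normalize_tokens` routine shown above (again our own illustration, not the OntoGene implementation):

```python
def match_terms(tokens, dictionary, max_len=8):
    """Return (start, end, term) for every token n-gram that
    exactly matches a preprocessed dictionary entry."""
    matches = []
    for start in range(len(tokens)):
        for end in range(start + 1, min(start + 1 + max_len, len(tokens) + 1)):
            candidate = " ".join(tokens[start:end])
            if candidate in dictionary:
                matches.append((start, end, candidate))
    return matches

# Usage: dictionary entries are normalized the same way as documents.
dictionary = {"src 1", "beta catenin"}
print(match_terms(["src", "1", "binds", "beta", "catenin"], dictionary))
# [(0, 2, 'src 1'), (3, 5, 'beta catenin')]
```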
3.3 Distiller

Distiller [11] is an open source framework for machine learning, written in Java and R and introduced in Basaldella et al. (2015). While the framework has its roots in the work of Pudota et al. (2010), and thus focuses on the task of automatic keyphrase extraction (herein AKE), Distiller's design allows us to adapt its pipeline to various purposes.

[11] https://github.com/ailab-uniud/distiller-CORE

AKE is the problem of extracting relevant phrases from a document (Turney, 2000). The difference with respect to TTE is that, while the former is interested in a small set of relevant phrases from the source document, the latter is interested in all domain-specific terms.

While AKE can be performed using unsupervised techniques, the most successful results have been obtained using a supervised machine learning approach (Lopez and Romary, 2010). Supervised AKE is performed using a quite common pipeline: first, the candidate keyphrases are generated, using some kind of linguistic knowledge; then, the AKE algorithm filters the candidates, assigning them features which are in turn used to train a machine learning algorithm that is able to classify "correct" keyphrases. These keyphrases can then be used for several purposes, such as document indexing, filtering and recommendation (De Nart et al., 2013).

To adapt Distiller to perform TTE effectively, we substituted the candidate generation phase with the output of OntoGene, i.e. candidate technical terms become the potential "keyphrases". This configuration is then evaluated as our baseline. Next, we gradually add new features into the system to train a machine learning model specialized in the actual TTE task, and assess the improvements in the performance of the system.

4 Features

4.1 Baseline

First, we evaluated the performance of the OntoGene/Distiller system using the same feature set used in the original keyphrase extraction model presented by Basaldella et al. (2015), which contains:

Frequency: the frequency of the candidate in the document, also known as TF.

Height: the relative position of the first appearance of the candidate in the document.

Depth: the relative position of the last appearance of the candidate in the document.

Lifespan: the distance between the first and the last appearance of the candidate.

TF-IDF: the peculiarity of the candidate with respect to the current document and the CRAFT corpus. This is a very common feature both in the AKE and TTE fields.

Abstract Presence: a flag set to 1 if the candidate appears in the abstract, 0 otherwise. This is motivated by the fact that keyphrases are often found to appear in the abstract.

This small feature set is the baseline of the experimental evaluation performed on the proposed approach.
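A rough sketch of how the positional baseline features can be computed for a candidate term follows. This is one plausible formalization of ours (the actual Distiller implementation, in Java and R, may normalize differently); TF-IDF is omitted because it requires corpus-level document frequencies.

```python
def baseline_features(candidate, doc_tokens, abstract_tokens):
    """Positional AKE features for one candidate term
    (single-token candidates, for simplicity)."""
    positions = [i for i, tok in enumerate(doc_tokens) if tok == candidate]
    assert positions, "candidate must occur in the document"
    n = float(len(doc_tokens))
    return {
        "frequency": len(positions),                     # TF
        "height": positions[0] / n,                      # first appearance
        "depth": positions[-1] / n,                      # last appearance
        "lifespan": (positions[-1] - positions[0]) / n,  # first-to-last distance
        "abstract_presence": int(candidate in abstract_tokens),
        # TF-IDF omitted: needs document frequencies over the whole corpus.
    }
```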
4.2 Feature Set 1

To improve the performance on the TTE task, we start to augment our feature set by introducing features that should be able to capture more fine-grained information about the candidate terms:

Title Presence: a flag which is set to 1 if the term appears in the title of the document and 0 otherwise, much like the Abstract Presence feature.

Symbols Count: a counter for the number of punctuation symbols, i.e. characters that are neither whitespace nor alphanumeric, appearing in the candidate term.

Uppercase Count: a counter for the number of uppercase characters in the candidate term.

Lowercase Count: a counter for the number of lowercase characters in the candidate term.

Digits Count: a counter for the number of digits in the candidate term.

Space Count: a counter for the number of spaces in the candidate term.

Greek Flag: a flag that is set to 1 if the candidate contains a Greek letter in spelled-out form, like "alpha", "beta", and so on.

These features offer a good improvement in detecting the particular shape that a technical term can have. For example, from the document PLoS Biol-2-1-314463.nxml (PMCID: PMC314463, PMID: 14737183) we have the term "5-bromo-4-chloro-3-indolyl beta-D-galactoside". This term contains:

- a spelled-out Greek letter, beta;
- an uppercase letter;
- seven symbols (dashes);
- a whitespace.

Without the new features this information would have been lost, and it may have been much harder to recognize the term as a technical one.
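These shape features are straightforward to compute; a minimal sketch of ours:

```python
import re

GREEK_NAMES = {"alpha", "beta", "gamma", "delta"}  # excerpt only

def shape_features(term):
    """Orthographic features from Feature Set 1 for a candidate term."""
    words = set(re.split(r"[^a-zA-Z]+", term.lower()))
    return {
        "symbols_count": sum(not c.isalnum() and not c.isspace() for c in term),
        "uppercase_count": sum(c.isupper() for c in term),
        "lowercase_count": sum(c.islower() for c in term),
        "digits_count": sum(c.isdigit() for c in term),
        "space_count": term.count(" "),
        "greek_flag": int(bool(words & GREEK_NAMES)),
    }

print(shape_features("5-bromo-4-chloro-3-indolyl beta-D-galactoside"))
# {'symbols_count': 7, 'uppercase_count': 1, 'lowercase_count': 33,
#  'digits_count': 3, 'space_count': 1, 'greek_flag': 1}
```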
4.3 Feature Set 2

In this step we add further features aimed at detecting more fine-grained information about candidate terms. The new features are:

Dash Flag: dashes are one of the most common symbols (if not the most common) found in technical terms. This flag is set to 1 if the term contains a dash, 0 otherwise.

Ending Number Flag: this flag is set to 1 if the term ends with a number, 0 otherwise.

Inside Capitalization: this flag is set to 1 if the term contains an uppercase letter which is not at the beginning of a token.

All Uppercase: this flag is set to 1 if the term contains only uppercase letters, 0 otherwise.

All Lowercase: this flag is set to 1 if the term contains only lowercase letters, 0 otherwise.

4.4 Feature Set 3: Affixes

This feature set adds information about the affixes (i.e. prefixes and suffixes) of the words. This information is particularly useful in the biomedical field, since affixes often convey a particular meaning here: for example, words ending in "ism" are typically diseases, words starting with "zoo" refer to animal life, and so on. Another example is the naming of chemical compounds: many ionic compounds have the suffix "ide", such as Sodium Chloride (the common table salt).

Using the Bio Term Hub resource, we compiled a list of all the prefixes and suffixes of two or three letters from the following databases:

- Cellosaurus, [12] from the Swiss Institute of Bioinformatics;
- chemical compounds found in the Comparative Toxicogenomics Database (CTD), [13] from North Carolina State University;
- diseases found in the CTD;
- Entrez Gene (Maglott et al., 2005);
- Medical Subject Headings (MeSH), [14] from the US National Center for Biotechnology Information (restricted to the subtrees "organisms", "diseases", and "chemicals and drugs");
- reviewed records from the Universal Protein Resource (Swiss-Prot), [15] developed by the UniProt consortium, which is a joint USA-EU-Switzerland project.

[12] http://web.expasy.org/cellosaurus/
[13] http://ctdbase.org/
[14] http://www.ncbi.nlm.nih.gov/mesh
[15] http://www.uniprot.org/

Since not all affixes are equally important, the affix list needs to be cut at some point. While a trivial decision could have been to pick the top 100 or the top 10% of ranked prefixes and suffixes, our choice was to let the machine learning algorithm decide by itself where to apply the cut. To this end, each affix a from a database D is assigned a normalized score s ∈ [0, 1] computed as

    s(a) = freq(a, D) / max{freq(a_1, D), ..., freq(a_|D|, D)}

where freq(a, D) is the frequency of the affix a in D. This way we obtain a simple yet effective mechanism to let the ML algorithm learn which affixes are the most important.

It is also worth noting that, since we generate scores for prefixes and suffixes of two and three letters from six databases, we have a total of 2 × 2 × 6 = 24 features generated with this approach.
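A small sketch of ours of the affix scoring; the list of disease names stands in for a Bio Term Hub database dump:

```python
from collections import Counter

def affix_scores(terms, length, suffix=False):
    """Score each affix of the given length by its frequency in a
    terminology database, normalized by the most frequent affix,
    so that every score s(a) lies in [0, 1]."""
    counts = Counter(
        term[-length:] if suffix else term[:length]
        for term in terms
        if len(term) >= length
    )
    top = max(counts.values())
    return {affix: freq / top for affix, freq in counts.items()}

# Usage: 2 affix lengths x {prefix, suffix} x 6 databases = 24 features.
diseases = ["alcoholism", "dwarfism", "anemia", "embolism"]
print(affix_scores(diseases, 3, suffix=True))
# {'ism': 1.0, 'mia': 0.333...}
```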
4.5 Feature Set 4: Removing AKE Features

Now that we have many features that are specific to the technical term extraction field, we remove the baseline feature set, which was tailored to keyphrase extraction, and use only the features aimed at recognizing technical terms.

The removed features (depth, height, lifespan, frequency, abstract presence, title presence, TF-IDF) are specific to the AKE field and supposedly carry little information about whether a term is technical or not. In fact, a term may appear just once in a random position of the text and still be technical; the same does not hold for a keyphrase, which is assumed to appear many times in specific positions (introduction, conclusions, ...) in the text.

4.6 Test Hardware

Both OntoGene and Distiller have been tested on a laptop computer with an Intel i7-4720HQ processor running at 2.6 GHz, 16 GB of RAM and a Crucial M.2 M550 SSD. The operating system was Ubuntu 15.10.

The throughput was 16,275 words/second for OntoGene and 4,745 words/second for Distiller. OntoGene requires an additional time of about 25 seconds to load the dictionary at start-up, but since this operation is run only once we do not consider it in the average.

Table 1: Scores obtained with the Distiller/OntoGene pipeline using an MLP trained on the CRAFT corpus. In the column headers, "FSn" stands for "Feature Set n".

    Metric     | OntoGene | Baseline | FS1   | FS2   | FS3   | FS4
    -----------|----------|----------|-------|-------|-------|------
    Precision  | 0.342    | 0.692    | 0.682 | 0.710 | 0.771 | 0.853
    Recall     | 0.550    | 0.187    | 0.247 | 0.264 | 0.325 | 0.368
    F1-Score   | 0.421    | 0.294    | 0.362 | 0.385 | 0.457 | 0.515

Table 2: Comparison of the scores obtained with OntoGene, with the combined OntoGene/Distiller pipeline, and the scores reported in Tseytlin et al. (2016).

    System                   | Precision | Recall | F1
    -------------------------|-----------|--------|-----
    MMTx                     | 0.43      | 0.40   | 0.42
    MGrep                    | 0.48      | 0.12   | 0.19
    Concept Mapper           | 0.48      | 0.34   | 0.40
    cTAKES Dictionary Lookup | 0.51      | 0.43   | 0.47
    cTAKES Fast Lookup       | 0.41      | 0.40   | 0.41
    NOBLE Coder              | 0.44      | 0.43   | 0.43
    OntoGene                 | 0.34      | 0.55   | 0.42
    OntoGene+Distiller       | 0.85      | 0.37   | 0.51

5 Results

Using the feature sets defined above, we trained a neural network to classify technical terms. The network used is a simple multi-layer perceptron with one hidden layer containing twice the number of neurons of the input layer, configured to use maximum conditional likelihood. The network is trained using 47 documents of the CRAFT corpus as training set, and its performance is evaluated on the remaining 20, which form the test set.
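As an illustration of this setup, here is a minimal sketch using scikit-learn; this is our choice for the sketch only, since the actual system is built on Distiller's Java/R stack. The toy data stands in for the per-candidate feature vectors of Section 4, and the feature dimensionality is arbitrary.

```python
import numpy as np
from sklearn.neural_network import MLPClassifier

# Toy stand-in data: one row of feature values per candidate term;
# labels: 1 = annotated technical term, 0 = spurious dictionary match.
rng = np.random.default_rng(0)
X_train = rng.random((200, 36))          # dimensionality is arbitrary here
y_train = rng.integers(0, 2, 200)

# One hidden layer with twice as many neurons as input features;
# log-loss training corresponds to maximizing conditional likelihood.
clf = MLPClassifier(hidden_layer_sizes=(2 * X_train.shape[1],),
                    max_iter=1000, random_state=0)
clf.fit(X_train, y_train)
keep = clf.predict(X_train)  # 1 = keep candidate, 0 = filter out
```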
We also experimented with a C5.0 decision tree, but with unsatisfactory results (the performance decreases as the number of features grows), so we do not include its analysis in this paper.

The metrics used are simple Precision, Recall and F1-Score. Table 1 presents the performance of the different iterations of the proposed system. Plain OntoGene obtains 55.0% recall and 34.2% precision, while the baseline AKE feature set improves precision to 69.2% but shows a dramatic drop in recall, to 18.7%.

It can be seen that the introduction of TTE-specific features brings an important improvement in recall, with a 6-point gain between the baseline and Feature Set 1. Together with a small drop in precision of 1 point, this raises the F1-score by 7 points.

Feature Set 2 performs slightly better than Feature Set 1, with a general improvement between 2 and 3 points. Feature Set 3, which adds the affixes, then brings a large improvement of 7 points in F1-score, thanks to improvements of the same order in both precision and recall.

Finally, it is clear that Feature Set 4 (i.e. all the TTE-focused features, without the AKE-focused ones) is the best performing one. The obtained precision of 85.3% is a large improvement over the baseline of 69% and more than twice the precision of the raw OntoGene output, which is just 34.2%. More importantly, recall rises from 18.7% to 36.8% (against a theoretical maximum of 55.0%, the recall of the raw OntoGene output). Feature Sets 3 and 4 also obtain a better F1-score than OntoGene, with 45.7% and 51.5%, respectively, while the score obtained by the OntoGene system alone is 42.1%.

To compare our pipeline with similar TTE/CR software, we use the results of Tseytlin et al. (2016), who compared NOBLE Coder with MMTx, [16] Concept Mapper, [17] cTAKES [18] and MGrep (Dai et al., 2008), as shown in Table 2. Our result outperforms the 0.47 F1-score obtained by the best performing system in that comparison, i.e. cTAKES Dictionary Lookup. This result is achieved thanks to the high precision obtained by Distiller's machine learning stage, which boosts precision to 78%, while the precision of the best performing system in the same comparison is just 51%.

[16] https://mmtx.nlm.nih.gov/MMTx/
[17] https://uima.apache.org/sandbox.html#concept.mapper.annotator
[18] http://ctakes.apache.org/

We must stress that our results are not directly comparable to the ones in Tseytlin et al. (2016), for three reasons. Firstly, we evaluate the combined pipeline only on a portion of the dataset, since a training set is needed for the Distiller system. Secondly, we do not perform concept disambiguation; rather, we count a true positive whenever our pipeline marks a term that spans the same text region as a CRAFT annotation, regardless of which entity is associated with that term, which is an easier task than concept recognition. Thirdly, Tseytlin et al. (2016) also count partial positives, i.e. if the software annotation does not exactly overlap with the gold annotation, they allocate a one-half match in both precision and recall, while we count only exact matches, which puts our system at a disadvantage.
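Our exact-match counting can be made concrete with a short sketch of ours: predicted and gold annotations are compared as (start, end) character spans, and a candidate counts as a true positive only if a gold annotation covers exactly the same region.

```python
def span_prf(predicted, gold):
    """Exact-match precision, recall and F1 over (start, end) spans."""
    predicted, gold = set(predicted), set(gold)
    tp = len(predicted & gold)
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(gold) if gold else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

# Usage: spans are character offsets into the document.
print(span_prf(predicted={(0, 8), (15, 20)}, gold={(0, 8), (30, 42)}))
# (0.5, 0.5, 0.5)
```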
Still, the more than doubled precision with respect to the dictionary-only approach is noteworthy, especially because it compensates for the loss in recall well enough to yield an overall improvement in F1-score. The comparison, while not completely fair, shows that the high precision of our system is hardly matched by other approaches.

The biggest drawback of our approach is the relatively low recall of the OntoGene pipeline, which puts an upper bound on the recall obtainable by the complete pipeline. The 55% recall score obtained on the CRAFT corpus is not a bad result per se, as it is better than the best performance obtained in Tseytlin et al. (2016) by NOBLE Coder and cTAKES Dictionary Lookup. Nevertheless, we believe that recall can be improved by addressing some specific issues, which we analyze in greater detail in Section 6.2.

6 Error Analysis

6.1 False Positives and CRAFT Problems

Looking at the errors made by our system, we believe that some outcomes counted as false positives should actually be marked as true positives. Take as an example document PLoS Genet-1-6-1342629.nxml (PMCID: PMC1315279, PMID: 16362077). In the Discussion section, we have (emphasis ours):

    Serum levels of *estrogen* decreased in aging Sam68−/− females as expected; however, the leptin levels decreased in aged Sam68−/− females.

The term estrogen is not annotated in the CRAFT corpus, even though it is found in the ChEBI resource. OntoGene, on the other hand, recognizes it as a relevant term. The same holds for the two other occurrences of this term in the same article.

In the Results section of the same document, we have

    Given the apparent enhancement of mineralized nodule formation by Sam68−/− bone marrow stromal cells ex vivo and the phenotype observed with short hairpin RNA (shRNA)-treated C3H10T1/2, we stained sections of bone from 4- and 12-month-old mice for evidence of changes in marrow adiposity.

Here, OntoGene annotates the Sequence Ontology term shRNA both in its full and in its abbreviated form. Nevertheless, both are missing from the CRAFT annotations (along with 6 more occurrences of shRNA); however, CRAFT provides annotations for parts of the term (hairpin and RNA).

Then, in the Materials and Methods section, we have

    Briefly, cells were plated on glass coverslips or on dentin slices in 24-well cluster plates for assessment of cell number and pit number, respectively.

Again, the term dentin, which is present in the Cell Ontology, is found by OntoGene but absent from the CRAFT annotations, together with 5 more occurrences of the same term.

Looking at this example document, we can see that the annotation of the CRAFT corpus seems to be somewhat inconsistent. While the reasons may be various and perfectly reasonable (e.g. the guidelines might explicitly exclude the mentioned terms in that context), this fact may affect the training and evaluation of our system.

6.2 Causes of Low Recall

Many terms annotated in the CRAFT corpus are missed by the OntoGene pipeline. As a general observation, the OntoGene pipeline, originally geared towards matching gene and protein names, is not optimally adapted to the broad range of term types to be annotated. A small number of the misses (less than 1%) are caused by the enforced case-sensitive match for words from the general vocabulary (such as "Animal" at the beginning of a sentence). Another portion (around 5%) are due to the matching strategy, in that the aggressive tokenization method removed relevant information, such as trailing punctuation symbols or terms consisting entirely of punctuation (e.g. "+"). Approximately 9% are short terms of one or two characters' length, which had been excluded from the dictionary a priori, as described above. A major portion, though, are inflectional and derivational variants, such as plural forms or derived adjectives (e.g. missed "mammalian" besides matched "mammal"). Some CRAFT annotations include modifiers that are missing from the dictionary, e.g. the protein name "TACC1" is matched on its own, but not when disambiguated with a species modifier such as "mouse TACC1"/"human TACC1". Other occasional misses include paraphrases ("piece of sequence") or spelling errors ("phophatase" instead of "phosphatase").

7 Conclusions and Future Work

In this paper we have presented and evaluated an approach towards efficient recognition of biomedical entities in the scientific literature. Although some limitations are still present in our system, we believe that this approach has the potential to deliver high quality entity recognition, not only for the scientific literature, but for any related form of textual document. We have analyzed the limitations of our approach, clearly discussing the causes of the low recall when evaluating over the CRAFT corpus. The results show that the post-annotation filtering step can significantly increase precision at the cost of a small loss of recall. Additionally, the approach provides a good ranking of the candidate entities, thus enabling a manual selection of the best terms in the context of an assisted curation environment.

As for future work, we intend to improve the coverage of the OntoGene pipeline with respect to the CRAFT annotations. Based on the false-negative analysis, the next steps include: (1) using a stemmer or lemmatizer, (2) optimizing the punctuation handling, and (3) revising the case-sensitive strategy.

We also plan to improve Distiller's machine learning phase, adding more features to the neural network classifier or switching to other approaches used in the literature, such as conditional random fields (Leaman et al., 2015). Another approach that we will investigate is to make the algorithm able to disambiguate between the different term types proposed by the OntoGene pipeline, using a multi-class classifier.
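As a hint of what step (1) could look like, here is a small sketch of ours, using NLTK's WordNet lemmatizer as one possible choice, that lemmatizes tokens before the dictionary look-up. Note that a lemmatizer handles inflection ("mice" → "mouse") but not derivation, so "mammalian" would still be missed.

```python
from nltk.stem import WordNetLemmatizer  # requires the 'wordnet' data package

lemmatizer = WordNetLemmatizer()

def lemmatize_tokens(tokens):
    """Reduce inflectional variants so that plural forms in the text
    can match singular dictionary entries."""
    return [lemmatizer.lemmatize(tok) for tok in tokens]

print(lemmatize_tokens(["mice", "genes", "mammalian"]))
# ['mouse', 'gene', 'mammalian']  -- derivation is left untouched
```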
References

Saber A Akhondi, Ewoud Pons, Zubair Afzal, Herman van Haagen, Benedikt FH Becker, Kristina M Hettne, Erik M van Mulligen, and Jan A Kors. 2016. Chemical entity recognition in patents by combining dictionary-based and statistical approaches. Database, 2016:baw061.

Michael Ashburner, Catherine A Ball, Judith A Blake, David Botstein, Heather Butler, J Michael Cherry, Allan P Davis, Kara Dolinski, Selina S Dwight, Janan T Eppig, et al. 2000. Gene Ontology: tool for the unification of biology. Nature Genetics, 25(1):25–29.

Sophie Aubin and Thierry Hamon. 2006. Improving term extraction with terminological resources. In Advances in Natural Language Processing, pages 380–387. Springer.

Michael Bada, Miriam Eckert, Donald Evans, Kristin Garcia, Krista Shipley, Dmitry Sitnikov, William A Baumgartner, K Bretonnel Cohen, Karin Verspoor, Judith A Blake, et al. 2012. Concept annotation in the CRAFT corpus. BMC Bioinformatics, 13(1):1.

Marco Basaldella, Dario De Nart, and Carlo Tasso. 2015. Introducing Distiller: a unifying framework for knowledge extraction. In Proceedings of the 1st AI*IA Workshop on Intelligent Techniques at Libraries and Archives, co-located with the XIV Conference of the Italian Association for Artificial Intelligence (AI*IA 2015). Associazione Italiana per l'Intelligenza Artificiale.

David Campos, Sérgio Matos, and José Luís Oliveira. 2013. A modular framework for biomedical concept recognition. BMC Bioinformatics, 14:281.

Nicola Colic. 2016. Dependency parsing for relation extraction in biomedical literature. Master's thesis, University of Zurich, Switzerland.

Manhong Dai, Nigam H Shah, Wei Xuan, Mark A Musen, Stanley J Watson, Brian D Athey, Fan Meng, et al. 2008. An efficient solution for mapping free text to ontology terms. AMIA Summit on Translational Bioinformatics, 21.

Dario De Nart, Felice Ferrara, and Carlo Tasso. 2013. Personalized access to scientific publications: from recommendation to explanation. In User Modeling, Adaptation, and Personalization, pages 296–301. Springer Berlin Heidelberg.

Kirill Degtyarenko, Paula De Matos, Marcus Ennis, Janna Hastings, Martin Zbinden, Alan McNaught, Rafael Alcántara, Michael Darsow, Mickaël Guedj, and Michael Ashburner. 2008. ChEBI: a database and ontology for chemical entities of biological interest. Nucleic Acids Research, 36(suppl 1):D344–D350.

Karen Eilbeck, Suzanna E Lewis, Christopher J Mungall, Mark Yandell, Lincoln Stein, Richard Durbin, and Michael Ashburner. 2005. The Sequence Ontology: a tool for the unification of genome annotations. Genome Biology, 6(5):R44.

Tilia Renate Ellendorff, Adrian van der Lek, Lenz Furrer, and Fabio Rinaldi. 2015. A combined resource of biomedical terminology and its statistics. In Thierry Poibeau and Pamela Faber, editors, Proceedings of the 11th International Conference on Terminology and Artificial Intelligence, pages 39–49, Granada, Spain.

Ken-ichiro Fukuda, Tatsuhiko Tsunoda, Ayuchi Tamura, Toshihisa Takagi, et al. 1998. Toward information extraction: identifying protein names from biological papers. In Pac Symp Biocomput, pages 707–718.

Christopher Funk, William Baumgartner, Benjamin Garcia, Christophe Roeder, Michael Bada, K Bretonnel Cohen, Lawrence E Hunter, and Karin Verspoor. 2014. Large-scale biomedical concept recognition: an evaluation of current automatic annotators and their parameters. BMC Bioinformatics, 15(1):1.

Tudor Groza and Karin Verspoor. 2015. Assessing the impact of case sensitivity and term information gain on biomedical concept recognition. PLoS ONE, 10(3):e0119091.

Robert Leaman, Chih-Hsuan Wei, and Zhiyong Lu. 2015. tmChem: a high performance approach for chemical named entity recognition and normalization. Journal of Cheminformatics, 7(S-1):S3.

Patrice Lopez and Laurent Romary. 2010. HUMB: automatic key term extraction from scientific articles in GROBID. In Proceedings of the 5th International Workshop on Semantic Evaluation, pages 248–251. Association for Computational Linguistics.

Donna Maglott, Jim Ostell, Kim D Pruitt, and Tatiana Tatusova. 2005. Entrez Gene: gene-centered information at NCBI. Nucleic Acids Research, 33(suppl 1):D54–D58.

David Nadeau and Satoshi Sekine. 2007. A survey of named entity recognition and classification. Lingvisticae Investigationes, 30(1):3–26.

Nirmala Pudota, Antonina Dattolo, Andrea Baruzzo, Felice Ferrara, and Carlo Tasso. 2010. Automatic keyphrase extraction and ontology mining for content-based tag recommendation. International Journal of Intelligent Systems, 25(12):1158–1186.

Fabio Rinaldi, Thomas Kappeler, Kaarel Kaljurand, Gerold Schneider, Manfred Klenner, Simon Clematide, Michael Hess, Jean-Marc von Allmen, Pierre Parisot, Martin Romacker, and Therese Vachon. 2008. OntoGene in BioCreative II. Genome Biology, 9(Suppl 2):S13.

Fabio Rinaldi, Gerold Schneider, Kaarel Kaljurand, Simon Clematide, Therese Vachon, and Martin Romacker. 2010. OntoGene in BioCreative II.5. IEEE/ACM Transactions on Computational Biology and Bioinformatics, 7(3):472–480.

Fabio Rinaldi, Simon Clematide, and Simon Hafner. 2012a. Ranking of CTD articles and interactions using the OntoGene pipeline. In Proceedings of the 2012 BioCreative Workshop, Washington D.C., April.

Fabio Rinaldi, Gerold Schneider, Simon Clematide, and Gintare Grigonyte. 2012b. Notes about the OntoGene pipeline. In AAAI-2012 Fall Symposium on Information Retrieval and Knowledge Discovery in Biomedical Text, November 2–4, Arlington, Virginia, USA.

Fabio Rinaldi, Simon Clematide, Hernani Marques, Tilia Ellendorff, Raul Rodriguez-Esteban, and Martin Romacker. 2014. OntoGene web services for biomedical text mining. BMC Bioinformatics, 15(Suppl 14):S6.

Fabio Rinaldi. 2012. The OntoGene system: an advanced information extraction application for biological literature. EMBnet.journal, 18(Suppl B):47–49.

Yutaka Sasaki, Yoshimasa Tsuruoka, John McNaught, and Sophia Ananiadou. 2008. How to make the most of NE dictionaries in statistical NER. BMC Bioinformatics, 9(11):1.

Damian Szklarczyk, Andrea Franceschini, Stefan Wyder, Kristoffer Forslund, Davide Heller, Jaime Huerta-Cepas, Milan Simonovic, Alexander Roth, Alberto Santos, Kalliopi P Tsafou, Michael Kuhn, Peer Bork, Lars J Jensen, and Christian von Mering. 2015. STRING v10: protein–protein interaction networks, integrated over the tree of life. Nucleic Acids Research, 43(D1):D447–D452.

Eugene Tseytlin, Kevin Mitchell, Elizabeth Legowski, Julia Corrigan, Girish Chavan, and Rebecca S Jacobson. 2016. NOBLE – flexible concept recognition for large-scale biomedical natural language processing. BMC Bioinformatics, 17(1):1.

Peter D Turney. 2000. Learning algorithms for keyphrase extraction. Information Retrieval, 2(4):303–336.

Karin Verspoor, Christophe Roeder, Helen L Johnson, Kevin Bretonnel Cohen, William A Baumgartner Jr, and Lawrence E Hunter. 2010. Exploring species-based strategies for gene normalization. IEEE/ACM Transactions on Computational Biology and Bioinformatics, 7(3):462–471.