Learning Representations for Biomedical Named Entity Recognition

Ivano Lauriola1,2, Riccardo Sella1, Fabio Aiolli1, Alberto Lavelli2, and Fabio Rinaldi2,3

1 University of Padova - Department of Mathematics, Via Trieste 63, 35121 Padova, Italy
2 Fondazione Bruno Kessler, Via Sommarive 18, 38123 Trento, Italy
3 University of Zurich - Institute of Computational Linguistics, Andreasstrasse 15, CH-8050 Zurich, Switzerland

ivano.lauriola@phd.unipd.it

Abstract. Biomedical Named Entity Recognition is a common task in Natural Language Processing applications, whose purpose is to recognize and categorize different types of entities in biomedical documents. Recently, the literature has shown effective methods based on combinations of Machine Learning algorithms and Natural Language Processing techniques. However, a critical issue in such applications is the choice of the data representation. Generic and abstract word embeddings can easily be used to train a learning algorithm without prior knowledge of the domain. On the other hand, dedicated hand-crafted features are expensive to define, but they may represent the specific problem better. In this work, an extensive experimental assessment is carried out, in which different representations are analyzed. Then, a general framework that learns the representation by combining general and domain-specific features is proposed and evaluated, with empirical results on the CRAFT corpus.

Keywords: Named Entity Recognition, Representation learning, Multiple Kernel Learning

1 Introduction

The constant growth of the biomedical literature requires increasingly complex methods to index, categorize and retrieve documents from large-scale online repositories. The aim of Biomedical Named Entity Recognition (BNER) is to recognize and extract relevant entities and concepts from the biomedical literature. These entities can be names of proteins, cellular components, diseases, species and so on, and they can help large-scale search algorithms to retrieve relevant documents.

One of the main difficulties of this task is the ambiguity of terms: a single term can refer to different concepts. A classical example is provided by the token CAT, which can refer to an animal, or can be the acronym for Computer Aided Tomography or for Chloramphenicol Acetyl Transferase. Another source of difficulty is that proteins and other biomedical entities can be written in different ways (e.g. HIV-1 versus HIV 1).

Natural Language Processing (NLP) techniques have been widely used in the literature to solve these tasks [20]. Standard approaches include the usage of human-designed rules applied to the document, or exact matching against a dictionary which contains all possible entities. However, these methods have some issues, such as the human effort required to maintain and update the dictionary, and the difficulty of designing powerful and expressive rules. Recently, Machine Learning algorithms have been combined with standard NLP techniques [8], aiming to improve the performance of these systems. State-of-the-art methods include the application of Deep Neural Networks, in particular 1D Convolutional Neural Networks (CNN) and Long Short-Term Memory networks (LSTM).

One of the main issues in the application of machine learning algorithms to the BNER task is the choice of the data representation which describes tokens and entities.
It is shown in the literature [6] that different representations emphasize different aspects of the problem and provide different results. Hence, the selection of the representation is a key aspect for building a powerful predictor. Several representations have been analyzed in the literature to solve BNER tasks, each of them defining a particular point of view on the main problem. In [5], a set of hand-crafted and domain-specific character-level features has been considered. These features describe the inner structure of tokens, such as the number and position of upper and lower case characters, the affixes, the presence of symbols and so on. The idea behind this representation is that biomedical entities have a particular inner structure which is easily recognizable through these characteristics. In other works, more general representations based on word embeddings have been used to represent the tokens (see [24, 11]), reducing the human effort in the feature engineering phase and making it easier to adapt these systems to new biomedical entity types. However, neither of these representations is able to solve the disambiguation problem, since they describe tokens in isolation: the same word always receives the same representation, independently of its position in the text.

On the other hand, word-level representations consider the spatial and semantic information of tokens and entities in the document, aiming to solve the disambiguation problem. These representations consider the position of the entity with respect to the other tokens, or to the other entities.

The main contribution of this work is an extensive analysis and comparison of different data representations in the BNER task, where each representation emphasizes a different viewpoint of the problem and corresponds to a different abstraction level. Then, a general framework based on the Multiple Kernel Learning paradigm is proposed to learn the best representation directly from the training data. Several baselines based on deep and shallow machine learning techniques have been compared with the proposed method, showing its empirical effectiveness in terms of F1 score.

The paper is organized as follows. Section 2 provides background on the BNER task, including a description of the Multiple Kernel Learning paradigm and the related work. Then, the proposed method is defined in Section 3. Eventually, Section 4 contains the experimental assessment and the results reached by the baselines and the proposed method.

2 Background and Related work

NLP applications rely on a sequence of steps that extract structured textual features from the document. Usually, the first step is to divide the text into sentences (sentence splitting) and the sentences into tokens (tokenization). Additional normalization steps can follow at the token level, such as determining the lexical root of words (lemmatization). Through morpho-syntactic analysis it is then possible to determine the part of speech of words (e.g. noun, verb, adjective).
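As an illustration of these preprocessing steps, the short sketch below runs sentence splitting, tokenization, part-of-speech tagging and a simple lemmatization with NLTK. The use of NLTK here is our own choice for the example and is not the toolchain used in this paper.

```python
import nltk
from nltk.stem import WordNetLemmatizer

# One-off resource downloads (uncomment on first run):
# nltk.download("punkt"); nltk.download("averaged_perceptron_tagger"); nltk.download("wordnet")

text = "CAT overexpression was observed. HIV-1 protease cleaves the Gag polyprotein."

lemmatizer = WordNetLemmatizer()
for sentence in nltk.sent_tokenize(text):                 # sentence splitting
    tokens = nltk.word_tokenize(sentence)                  # tokenization
    tagged = nltk.pos_tag(tokens)                          # part-of-speech tagging
    lemmas = [lemmatizer.lemmatize(tok.lower()) for tok in tokens]  # crude lemmatization
    print(tagged)
    print(lemmas)
```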
NER can be performed either on general texts (e.g., newspaper articles), to recognize concepts like persons, organizations or locations, or on technical documents (e.g., the biomedical literature), to recognize concepts like cells, diseases or proteins. NER can be used by itself, with the goal of recognizing the presence of a term in a given document, or as a preliminary step for further, more complex tasks (e.g., relation extraction).

Several approaches exist in the literature to solve the NER task. They can be grouped into the following categories:

– Rule-based: these methods consist of domain-specific hand-written rules which are able to recognize entities in documents. The rules rely on regular expressions or on particular characteristics of the entities. Generally, these rules are defined by groups of biomedical and linguistic experts.
– Dictionary-based: the simplest approach, which finds the occurrences of entities in a document by looking them up in a precompiled dictionary or ontology that contains all of the possible entities. However, the maintenance and constant update of dictionaries for specific domains is an expensive task.
– Machine Learning methods: shallow machine learning techniques have been widely applied to the NER/BNER task, such as the Support Vector Machine (SVM), Conditional Random Fields (CRF) and Hidden Markov Models (HMM), showing good results with domain-specific features. Recently, deep learning algorithms have been considered, like the 1D CNN and the LSTM, with promising results.

The interest in the NER task in the biomedical domain has produced an extensive literature. Here we briefly discuss the major advances. In [25], Conditional Random Fields are used, together with hand-crafted features, in order to improve previous state-of-the-art results. High-performance BNER systems often rely on hybrid approaches, where rule- or dictionary-based approaches and machine learning techniques are combined. A multiclass BNER problem has been analyzed in [18], where the authors proposed a two-step algorithm: in the first phase, entities are recognized by means of the SVM algorithm; then a dictionary look-up is applied to classify them. The authors of [5] proposed a different hybrid approach, which consists of a dictionary look-up as the first step and a machine learning filter applied to its output as the second step. This ensemble system is shown to empirically achieve state-of-the-art results on the CRAFT corpus. In [9], an extensive quantitative analysis of word vectors trained on millions of documents from the biomedical literature is described, suggesting that using word vectors could improve the results in various related tasks. The claim of [9] found application in a fairly complex model based on Long Short-Term Memory deep neural networks and Conditional Random Fields [17, 10].

2.1 Multiple Kernel Learning

Kernel Machines are a large family of Machine Learning algorithms widely used in the literature to solve classification, regression and clustering problems. A kernelized algorithm comprises two elements. The first element is the learning algorithm, whose solution is expressed by dot-products between training examples. The second is a symmetric positive semi-definite kernel function k : X × X → R, which computes the dot-product in a Reproducing Kernel Hilbert Space (RKHS). This means that there is a function φ : X → K which maps data from the input space X to the kernel space K such that k(x_i, x_j) = ⟨φ(x_i), φ(x_j)⟩, where x_i, x_j ∈ X. Usually, an expert user chooses the kernel function exploiting his or her domain-specific knowledge, or via a validation procedure.
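To make the feature-map view concrete, the short example below (our own illustration, not code from the paper) checks numerically that a degree-2 homogeneous polynomial kernel, of the kind used later in this work, coincides with an explicit dot product between degree-2 feature maps.

```python
import numpy as np
from itertools import product

def hpk(x, z, degree=2):
    """Homogeneous polynomial kernel k(x, z) = (<x, z>)^degree."""
    return np.dot(x, z) ** degree

def phi_degree2(x):
    """Explicit degree-2 feature map: all ordered products x_i * x_j."""
    return np.array([x[i] * x[j] for i, j in product(range(len(x)), repeat=2)])

x = np.array([1.0, 2.0, 3.0])
z = np.array([0.5, -1.0, 2.0])
print(hpk(x, z, 2))                             # 20.25
print(np.dot(phi_degree2(x), phi_degree2(z)))   # 20.25: same value, computed in feature space
```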
Recently, the literature has shown mechanisms to learn the kernel function directly from the training data. The most well-known kernel learning paradigm is Multiple Kernel Learning (MKL) [16], which learns the kernel as a linear non-negative combination of P base kernels, with the form

    k_µ(x_i, x_j) = Σ_{r=1}^{P} µ_r k_r(x_i, x_j),    µ_r ≥ 0,

where k_r is the r-th kernel function defined on the r-th representation φ_r, and µ is the weight vector which parametrizes the combination. These P base kernels correspond to different sources, or to different notions of similarity between examples.

3 Method

This work describes an extended version of the learning pipeline proposed in [5], which is a two-stage hybrid procedure to recognize entities in the biomedical literature. The next subsections describe this hybrid system (3.1) and the proposed one (3.2), emphasizing their differences and strengths.

3.1 A two-stage hybrid pipeline

The authors of [5] proposed a hybrid system which combines NLP and machine learning techniques to recognize entities in documents. The system acts by means of a two-stage pipeline. The first phase of this pipeline consists of a dictionary-based filter, where a set of candidate entities is recognized in the corpus by means of dictionary look-up. The aim of this filter is to discard most of the non-entities in the documents, resulting in high recall but low precision. Then, a feature vector is computed for each candidate by means of a hand-crafted representation, which considers a set of character-level features and affixes. Eventually, a classifier based on neural networks is used to recognize entities among the candidates. Dictionary look-up is applied to both the training and the test documents, and the training candidates are used to train the machine learning algorithm.

There are two weaknesses in this approach. Firstly, the training set used to train the neural network is composed exclusively of the output of the dictionary-based classifier, i.e. the set of candidate entities: the positive class consists of the candidates that correspond to annotated entities, whereas the remaining candidates form the negative class. As a consequence, entities discarded by the dictionary filter are not used in the training phase, with a consequent loss of useful information. Secondly, when the first layer of the system works well, there is a further lack of negative examples. Hence, the application of complex Neural Networks is expected to result in lower performance.

3.2 The proposed extension

In order to overcome the above-mentioned limitations, an extended training set is used to train the machine learning algorithms. On the one hand, the whole set of annotated entities defines the positive examples. On the other hand, the negative set is composed of the False Positive candidates produced by the dictionary-based filter. Furthermore, additional negative tokens have been included in the training set to reduce the lack of negative examples. These tokens are words of the training corpus that are neither entities nor stop-words and that are discarded by the dictionary classifier; their number corresponds to 50% of the positive examples. Hence, if the corpus contains N annotated entities, and the dictionary filter provides a candidate set composed of TP (< N) True Positives and FP False Positives, the dataset will contain N positive examples and FP + N/2 negative examples.
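As a minimal sketch of this construction (with hypothetical helper and variable names, and entities simplified to plain strings rather than document-level mentions), the extended training set can be built as follows.

```python
import random

def build_extended_training_set(annotated, candidates, corpus_tokens, stop_words, rng):
    """Sketch of the extended training set described above. All names here are our own."""
    positives = set(annotated)                    # all N annotated entities
    negatives = set(candidates) - positives       # False Positive candidates of the filter
    # Extra negatives: non-entity, non-stop-word tokens discarded by the dictionary,
    # sampled so that they amount to 50% of the positive examples.
    pool = [t for t in corpus_tokens
            if t not in positives and t not in candidates and t not in stop_words]
    extra = rng.sample(pool, k=min(len(pool), len(positives) // 2))
    return sorted(positives), sorted(negatives | set(extra))

pos, neg = build_extended_training_set(
    annotated={"HIV-1", "Gag", "protease", "kinase"},
    candidates={"HIV-1", "Gag", "cat"},            # "cat" is a false positive
    corpus_tokens=["the", "cat", "enzyme", "binds", "substrate", "cleaves"],
    stop_words={"the"},
    rng=random.Random(0),
)
print(pos)   # 4 positive examples
print(neg)   # 1 false positive + 2 sampled tokens = 3 negative examples
```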
The main extension proposed in this work concerns the generalization of the pipeline, by including a mechanism to learn the best representation directly from the data, exploiting the MKL framework.

Three different explicit representations have been considered as descriptors of each candidate entity. The first representation is a word embedding computed by means of the Word2Vec algorithm [19]. The embedding has been trained on the PubMed corpus, and it is available in the Gensim package for the Python programming language [22]. Moreover, the hand-crafted representation defined in [5] has been used. This representation consists of two main groups of features: the former focuses on the affixes, while the latter describes the structure of the token from an orthographic point of view, i.e. the number of upper and lower case characters, the presence of symbols and numbers, and so on. See [5] for the complete list of these features. In our work, these two groups of features have been divided into two different representations to improve the expressiveness of the MKL algorithm.

5 Homogeneous Polynomial Kernels with degrees {1 . . . 5} are computed for each representation. Eventually, an MKL algorithm provides the combination of these 15 kernels. The choice of such kernels derives from theoretical results on the generalization of dot-product kernels; see [13] for more details.

Several MKL algorithms exist in the literature. In this work the EasyMKL [2] algorithm has been considered. EasyMKL is an efficient state-of-the-art MKL algorithm which finds the combination of base kernels that maximizes the distance between the convex hull of the positive examples and the convex hull of the negative ones.

The proposed approach has two main advantages with respect to standard machine learning approaches. The first is scalability with respect to the number of representations: the computational complexity of MKL algorithms generally increases linearly with the number of kernels, which means that adding a novel representation only requires the computation of the associated feature vectors, which is generally not an expensive task. Furthermore, the proposed procedure does not require the validation and selection of a single representation: the larger the pool of representations, the more expressive the MKL algorithm.

A depiction of the proposed system is shown in Fig. 1.

[Fig. 1: A depiction of the proposed system: (a) training pipeline, (b) test pipeline. During the training phase (a) the MKL algorithm learns the weights of the linear non-negative combination of base kernels. Then (b), the learned representation is used to classify candidates by means of the two-stage pipeline.]
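The kernel construction just described can be sketched in a few lines of NumPy and scikit-learn. The snippet below is a simplified illustration on synthetic data: it builds the 15 homogeneous polynomial kernels (degrees 1 to 5 over three representations) and combines them with fixed uniform weights, whereas the actual system learns the weights with EasyMKL; a precomputed-kernel SVC stands in for the base learner.

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
n = 40  # number of candidate entities in this toy example

# Toy stand-ins for the three explicit representations of each candidate
# (word2vec embedding, affix features, orthographic features).
reps = [rng.normal(size=(n, 200)),   # word2vec-like embedding
        rng.normal(size=(n, 30)),    # affix features
        rng.normal(size=(n, 20))]    # orthographic features
reps = [X / np.linalg.norm(X, axis=1, keepdims=True) for X in reps]  # unit-norm rows
y = np.where(rng.random(n) > 0.5, 1, -1)                              # toy binary labels

# 5 Homogeneous Polynomial Kernels (degrees 1..5) per representation -> 15 base kernels.
base_kernels = [(X @ X.T) ** d for X in reps for d in range(1, 6)]

# MKL combines them as k_mu = sum_r mu_r * k_r with mu_r >= 0. EasyMKL learns mu from
# the training data; here we use uniform weights purely to illustrate the combination.
mu = np.full(len(base_kernels), 1.0 / len(base_kernels))
K = sum(m * k for m, k in zip(mu, base_kernels))

clf = SVC(kernel="precomputed").fit(K, y)     # base learner on the combined kernel
print(clf.predict(K[:5]))                     # predictions for the first five candidates
```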
4 Experimental assessment

The whole set of experiments is described in this section. The dataset, the algorithms and the baselines are also discussed.

4.1 Dataset

The experimental analysis has been conducted on the Colorado Richly Annotated Full Text (CRAFT) corpus v2.0 [3]. The CRAFT corpus contains a set of 67 documents from the PubMed Central Open Access Subset. These documents have been manually annotated with respect to the following ontologies:

– Chemical Entities of Biological Interest (ChEBI) [12]: contains chemical names;
– Cell Ontology (CL) [4]: contains names of cell types;
– Gene Ontology (GO) [7]: the CRAFT corpus is annotated with two sub-categories, which are Cellular Components (GO CC) and Biological Processes and Molecular Functions (GO BPMF);
– National Center for Biotechnology Information (NCBI) Taxonomy [15]: includes names of species and taxonomic ranks;
– Protein Ontology (PR) [1]: contains protein names;
– Sequence Ontology (SO) [14]: contains names of biological sequence features and attributes.

The corpus includes 570 000 tokens, with approximately 100 000 annotated concepts and more than 21 000 sentences. The 7 ontologies have been analyzed individually by using both the base training set used in [5] and the extended one discussed in the previous section.

4.2 Baselines

Several hard baselines have been analyzed and compared:

– Multi-Layer Perceptron (MLP): the approach proposed in [5], with the same architecture.
– Random Forest (RF): due to its generalization capability in several domains and its computational efficiency, the RF classifier has been considered as a further baseline.
– Convolutional Neural Network (CNN): this algorithm has been considered with the aim of combining the character-level features with context and semantic information.

The MLP and RF algorithms consider exclusively the information that the candidate entity itself provides, that is, its representation. The CNN, instead, considers a small window of tokens around the candidate. Hence it has more information than the other baselines, which can be used to improve the overall classification performance and to reduce the disambiguation problem.

4.3 Evaluation

The CRAFT corpus has been divided into training (47) and test (20) documents, following the same split used in [5]. The dictionary-based classifier has been used on the training corpus to produce the training sets and on the test corpus to recognize the candidate set. Both the base training set ([5]) and the extended training set (see Section 3) have been considered individually. To accomplish this step, the OntoGene Entity Recognizer (OGER) [23] framework has been used.

Each document has been split into tokens by using a lossy tokenization method which splits every time a non-alphanumeric character is found. Each token has then been converted to lowercase, and stemming (using the Lancaster stemmer) has been applied, except for acronyms. Greek letters have been expanded (α → alpha), with the aim of further normalizing the final tokens.
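A rough sketch of this normalization step is shown below. It assumes NLTK's Lancaster stemmer, uses a deliberately partial Greek-letter mapping, and approximates acronym detection with an all-caps heuristic, since the paper does not specify how acronyms are identified.

```python
import re
from nltk.stem import LancasterStemmer

GREEK = {"α": "alpha", "β": "beta", "γ": "gamma"}   # partial mapping, for illustration only
stemmer = LancasterStemmer()

def normalize(text):
    """Lossy tokenization and normalization roughly as described above: expand Greek
    letters, split on non-alphanumeric characters, lowercase and apply the Lancaster
    stemmer, skipping acronyms."""
    for letter, name in GREEK.items():
        text = text.replace(letter, name)
    tokens = [t for t in re.split(r"[^A-Za-z0-9]+", text) if t]
    normalized = []
    for tok in tokens:
        if tok.isupper() and len(tok) > 1:       # crude acronym detection (an assumption)
            normalized.append(tok)
        else:
            normalized.append(stemmer.stem(tok.lower()))
    return normalized

print(normalize("TNF-α signalling activates NF-kB in HeLa cells"))
```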
The details and results of the first phase of the architecture, which corresponds to the dictionary look-up, are shown in Table 1, including, for each ontology, the number of annotated entities, candidates, True Positives and False Positives on both the training and the test corpus.

A hold-out procedure has been applied by splitting the training set into training (80%) and validation/development (20%) parts to select the hyperparameters, which are:

MLP: the architecture presented in [5] has been used. No additional hyperparameters have been validated. The validation set has been used to prevent overfitting by means of an early-stopping procedure.
RF: the number of trees, with values {10, 50, 100, 200, 500, 1000}. Other hyperparameters have been set to their default values, as defined in the Scikit-learn implementation [21].
1D-CNN: the number of convolutional layers, from 1 up to 4, each of them with 128 filters. The size of the window around the candidate has also been validated, from 5 tokens up to 14.
EasyMKL: the λ value of the algorithm, with values {0.0, 0.1, . . . , 1.0}. This value regularizes the solution by maximizing the distance between the centroids which represent the classes rather than the margin. A hard-margin SVM has been used as base learner.

Table 1: Detailed results of the dictionary filter, including the number of candidate entities, true positives and false positives computed on the training documents (first row of each pair) and test documents (second row). F1 (precision, recall) scores are also reported.

            # annotated  # candidates     TP      FP   pos/neg (%)   F1 (P, R)
ChEBI           5736          9284       4033    5251     43/57      54 (43, 70)
                1800          3020       1319    1710                55 (44, 73)
CL              4612          3804       3423     381     90/10      81 (90, 75)
                1266          1044        923     121                80 (88, 73)
GO BPMF        15608         10870       3821    7049     35/65      29 (35, 25)
                5608          3573       1377    2196                30 (39, 25)
GO CC           6302          7457       4419    3038     59/41      64 (59, 70)
                2075          2431       1236    1195                55 (51, 60)
NCBI            5432         17696       4832   12864     27/73      42 (27, 89)
                2021          6312       1854    4458                44 (30, 92)
PR             11827         19240       9599    9641     50/50      62 (50, 81)
                3814          6502       3199    3303                62 (49, 84)
SO             15143         24027      11093   12934     46/54      57 (46, 73)
                6093          8796       4056    4740                54 (46, 67)
all            87337        124056      55184   68881     44/56      52 (44, 63)

In order to find entities in the test documents, the dictionary-based classification has been performed to obtain the candidate set. Then, the trained machine learning algorithm has been applied to further classify the examples. Algorithms and representations have been compared by considering precision, recall and F1. Results reached by the baselines on the two representations and by the proposed method are shown in Table 2. Results are also summarized in Table 3, where the average rank of each baseline is shown.
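For reference, these scores follow directly from the counts of true and false positives. The short snippet below recomputes the first training row of Table 1 (ChEBI) from its TP and FP counts.

```python
def prf(tp, fp, n_gold):
    """Precision, recall and F1 given true positives, false positives and
    the number of gold-standard entities (fn = n_gold - tp)."""
    precision = tp / (tp + fp)
    recall = tp / n_gold
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# ChEBI, training documents (first data row of Table 1):
# 5736 annotated entities, 9284 candidates, TP = 4033, FP = 5251
p, r, f = prf(tp=4033, fp=5251, n_gold=5736)
print(f"precision={p:.2f} recall={r:.2f} F1={f:.2f}")   # ~0.43, 0.70, 0.54
```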
Table 2: F1 (precision, recall) scores computed on the NER classification task by using different representations. For each ontology, the first row refers to ML models trained on the output of the dictionary, whereas the second row refers to ML models trained on the extended training set.

                         [5]                                 word2vec
            MLP         RF          CNN         MLP         RF          CNN         MKL
ChEBI       76 (89,66)  79 (92,69)  80 (95,70)  78 (87,70)  78 (87,70)  75 (87,66)  79 (91,70)
            77 (88,69)  80 (92,70)  80 (94,70)  79 (89,71)  78 (87,70)  75 (87,66)  80 (95,70)
CL          76 (87,67)  81 (89,74)  83 (98,72)  81 (89,74)  81 (89,74)  83 (96,74)  80 (89,74)
            78 (90,69)  82 (90,75)  83 (96,72)  81 (89,75)  81 (89,74)  84 (96,74)  80 (89,73)
GO BPMF     35 (67,24)  36 (80,23)  31 (57,22)  35 (71,24)  36 (72,24)  36 (79,23)  36 (79,23)
            36 (70,24)  37 (85,24)  38 (65,27)  37 (73,25)  36 (72,24)  36 (80,23)  36 (72,24)
GO CC       70 (92,56)  68 (92,54)  44 (39,50)  69 (87,57)  67 (89,54)  70 (88,57)  69 (86,57)
            70 (92,56)  69 (93,54)  47 (44,50)  69 (87,57)  70 (87,57)  69 (86,57)  70 (88,57)
NCBI        94 (98,91)  95 (99,91)  94 (99,88)  91 (90,92)  90 (89,91)  95 (98,92)  95 (99,91)
            94 (98,91)  95 (99,91)  95 (99,91)  90 (90,89)  90 (89,91)  94 (98,91)  95 (99,91)
PR          80 (87,74)  83 (86,80)  88 (94,83)  77 (80,74)  79 (81,77)  80 (88,74)  82 (88,76)
            81 (88,75)  83 (89,78)  88 (95,83)  80 (83,76)  80 (82,77)  81 (89,74)  82 (88,77)
SO          75 (92,63)  75 (93,63)  72 (78,64)  74 (92,62)  75 (91,63)  75 (92,63)  75 (93,63)
            74 (92,62)  75 (93,63)  72 (79,64)  75 (92,64)  75 (91,63)  76 (93,65)  75 (92,63)

Table 3: Average rank of the F1, precision and recall scores reached by the algorithms by using the base training set (first row) and the extended training set (second row).

                  [5]                word2vec
            MLP    RF     CNN    MLP    RF     CNN    MKL
F1          4.57   2.71   4.00   4.57   4.71   3.14   2.71
            4.86   2.86   3.00   4.28   4.85   4.28   3.00
precision   4.14   2.00   3.57   4.86   4.86   3.14   2.57
            3.71   1.71   3.57   4.43   5.58   3.14   3.00
recall      4.00   3.14   4.29   2.43   2.29   2.86   2.29
            5.00   2.71   2.86   2.71   2.43   3.86   2.71

4.4 Discussion

Several algorithms have been analyzed in this work. The MLP architecture proposed in [5] provides lower results than the RF algorithm, whose training is less expensive by orders of magnitude. A notable result is the low F1 reached by the deep CNN, which was expected to be the most promising algorithm. It is clear that each method has its own most suitable representation: the hand-crafted one for the RF algorithm, and the embedding computed by the Word2Vec algorithm for the Neural Networks. Besides, the MKL approach achieves high results while avoiding the selection of the representation. However, in this work only 3 types of kernels have been taken into consideration, bounding the expressiveness of the proposed approach.

The inclusion of additional training examples provides an empirical improvement of the overall performance, with a general increase of 0-2 points of F1 score. Eventually, the experimental assessment of the MLP algorithm confirms the results reached in [5], with the exception of the ChEBI ontology, where we reach 2 points less than the previous work. This difference could depend on the random component of the optimizer used.

5 Conclusions

In this work, a general framework for learning the best representation for the Biomedical Named Entity Recognition task is presented and analyzed. The proposed method combines several weak representations into a single one by means of the Multiple Kernel Learning paradigm. These representations define different points of view, and emphasize different aspects of the problem through different sets of features.

An empirical evaluation against hard baselines has been performed, showing the generalization capability of the proposed framework on the CRAFT corpus, with promising results.

In the future, we plan to extend the proposed approach in different directions. Firstly, more representations will be included, such as Word2Vec models pre-trained on different domains (e.g. Wikipedia, GoogleNews), character-level embeddings, word-normalization features, and Part-Of-Speech information. Secondly, the weights assigned by the Multiple Kernel Learning algorithm will be analyzed: we aim to understand which are the most relevant feature sets in the combination for each ontology. Other points that will be taken into account are the analysis of the efficiency of these systems and their effectiveness on more corpora.

Acknowledgments

The work described in this paper is partially supported by grant CR30I1 162758 of the Swiss National Science Foundation.

References

1. Protein Ontology (2017), http://pir.georgetown.edu/pro/pro.shtml
2. Aiolli, F., Donini, M.: EasyMKL: a scalable multiple kernel learning algorithm. Neurocomputing 169, 215–224 (2015)
3. Bada, M., Eckert, M., Evans, D., Garcia, K., Shipley, K., Sitnikov, D., Baumgartner, W.A., Cohen, K.B., Verspoor, K., Blake, J.A., et al.: Concept annotation in the CRAFT corpus. BMC Bioinformatics 13(1), 161 (2012)
4. Bard, J., Rhee, S.Y., Ashburner, M.: An ontology for cell types. Genome Biology 6(2), R21 (2005)
5. Basaldella, M., Furrer, L., Tasso, C., Rinaldi, F.: Entity recognition in the biomedical domain using a hybrid approach. Journal of Biomedical Semantics 8(1), 51 (2017)
6. Bengio, Y., Courville, A., Vincent, P.: Representation learning: a review and new perspectives. IEEE Transactions on Pattern Analysis and Machine Intelligence 35(8), 1798–1828 (2013)
7. Botstein, D., Cherry, J.M., Ashburner, M., Ball, C., Blake, J., Butler, H., Davis, A., Dolinski, K., Dwight, S., Eppig, J., et al.: Gene Ontology: tool for the unification of biology. Nature Genetics 25(1), 25–29 (2000)
8. Campos, D., Matos, S., Oliveira, J.L.: Biomedical named entity recognition: a survey of machine-learning tools. In: Theory and Applications for Advanced Text Mining. InTech (2012)
9. Chiu, B., Crichton, G., Korhonen, A., Pyysalo, S.: How to train good word embeddings for biomedical NLP. In: Proceedings of the 15th Workshop on Biomedical Natural Language Processing, pp. 166–174 (2016)
10. Dang, T.H., Le, H.Q., Nguyen, T.M., Vu, S.T.: D3NER: biomedical named entity recognition using CRF-BiLSTM improved with fine-tuned embeddings of various linguistic information. Bioinformatics 1, 8 (2018)
11. Das, A., Ganguly, D., Garain, U.: Named entity recognition with word embeddings and Wikipedia categories for a low-resource language. ACM Transactions on Asian and Low-Resource Language Information Processing (TALLIP) 16(3), 18 (2017)
12. Degtyarenko, K., De Matos, P., Ennis, M., Hastings, J., Zbinden, M., McNaught, A., Alcántara, R., Darsow, M., Guedj, M., Ashburner, M.: ChEBI: a database and ontology for chemical entities of biological interest. Nucleic Acids Research 36(suppl 1), D344–D350 (2007)
13. Donini, M., Aiolli, F.: Learning deep kernels in the space of dot product polynomials. Machine Learning 106(9-10), 1245–1269 (2017)
14. Eilbeck, K., Lewis, S.E., Mungall, C.J., Yandell, M., Stein, L., Durbin, R., Ashburner, M.: The Sequence Ontology: a tool for the unification of genome annotations. Genome Biology 6(5), R44 (2005)
15. Federhen, S.: The NCBI taxonomy database. Nucleic Acids Research 40(D1), D136–D143 (2011)
16. Gönen, M., Alpaydın, E.: Multiple kernel learning algorithms. Journal of Machine Learning Research 12(Jul), 2211–2268 (2011)
17. Habibi, M., Weber, L., Neves, M., Wiegandt, D.L., Leser, U.: Deep learning with word embeddings improves biomedical named entity recognition. Bioinformatics 33(14), i37–i48 (2017)
18. Lee, K.J., Hwang, Y.S., Kim, S., Rim, H.C.: Biomedical named entity recognition using two-phase model based on SVMs. Journal of Biomedical Informatics 37(6), 436–447 (2004)
19. Mikolov, T., Chen, K., Corrado, G., Dean, J.: Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781 (2013)
20. Nadeau, D., Sekine, S.: A survey of named entity recognition and classification. Lingvisticae Investigationes 30(1), 3–26 (2007)
21. Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., Blondel, M., Prettenhofer, P., Weiss, R., Dubourg, V., et al.: Scikit-learn: machine learning in Python. Journal of Machine Learning Research 12(Oct), 2825–2830 (2011)
22. Řehůřek, R., Sojka, P.: Software framework for topic modelling with large corpora. In: Proceedings of the LREC 2010 Workshop on New Challenges for NLP Frameworks, pp. 45–50. ELRA, Valletta, Malta (May 2010), http://is.muni.cz/publication/884893/en
23. Rinaldi, F., Clematide, S., Marques, H., Ellendorff, T., Romacker, M., Rodriguez-Esteban, R.: OntoGene web services for biomedical text mining. BMC Bioinformatics 15(14), S6 (2014)
24. Seok, M., Song, H.J., Park, C.Y., Kim, J.D., Kim, Y.S.: Named entity recognition using word embedding as a feature. Int. J. Softw. Eng. Appl. 10(2), 93–104 (2016)
25. Settles, B.: Biomedical named entity recognition using conditional random fields and rich feature sets. In: Proceedings of the International Joint Workshop on Natural Language Processing in Biomedicine and Its Applications, pp. 104–107. JNLPBA '04, Association for Computational Linguistics, Stroudsburg, PA, USA (2004), http://dl.acm.org/citation.cfm?id=1567594.1567618