Using a Hybrid Approach for Entity Recognition in the Biomedical Domain

Marco Basaldella
Università degli Studi di Udine
Via delle Scienze 208, Udine
basaldella.marco.1@spes.uniud.it

Lenz Furrer
Institute of Computational Linguistics, University of Zurich
Andreasstrasse 15, CH-8050 Zürich
lenz.furrer@uzh.ch

Nico Colic
Institute of Computational Linguistics, University of Zurich
ncolic@gmail.com

Tilia R. Ellendorff
Institute of Computational Linguistics, University of Zurich
ellendorff@cl.uzh.ch

Carlo Tasso
Università degli Studi di Udine
carlo.tasso@uniud.it

Fabio Rinaldi
Institute of Computational Linguistics, University of Zurich
fabio.rinaldi@uzh.ch

Abstract

This paper presents an approach towards high performance extraction of biomedical entities from the literature, which is built by combining a high-recall dictionary-based technique with a high-precision machine learning filtering step. The technique is then evaluated on the CRAFT corpus. We present the performance we obtained, analyze the errors and propose a possible follow-up of this work.

1 Introduction

Technical term extraction (herein TTE) is the problem of extracting relevant technical terms from a scientific paper. It can be seen as related to Named Entity Recognition (NER), where the entities one wants to extract are technical terms belonging to a given field. For example, while in traditional NER the entities that one is looking for are of the types "Person", "Date", "Location", etc., in TTE we look for terms belonging to a particular domain, e.g. "Gene", "Protein", "Disease", and so on (Nadeau and Sekine, 2007). A further evolution is the task of Concept Recognition (CR), where the entity is also matched to a concept in an ontology.

NER (and thus TTE) can be solved using very different techniques:

- Rule-based approach: a set of manually written rules is used to identify entities. This technique may require deep domain and linguistic knowledge. A simple example is the task of recognizing US phone numbers, which can be solved by a simple regular expression (see the sketch after this list).

- Machine learning-based approach: a statistical classifier, such as Naive Bayes or Conditional Random Fields, is used to recognize entities. Several different types of features can be used by such systems, for example prefixes and suffixes of the entity candidates, the number of capital letters, etc. A major drawback of this approach is that it typically requires a large, manually annotated corpus for algorithm training and testing.

- Dictionary-based approach: candidate entities are matched against a dictionary of known entities. The obvious drawback of this approach is that it is not able to recognize new entities, making this technique ineffective e.g. in documents which present new discoveries.

- Hybrid approaches: two or more of the previous techniques are used together. For example, Sasaki et al. (2008) as well as Akhondi et al. (2016) combine the dictionary and ML-based approaches to exploit the strengths of both.
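As a concrete illustration of the rule-based approach, the following is a minimal sketch of a rule-based recognizer for US phone numbers. It is our own illustration, not part of any of the systems discussed here; the pattern and the sample text are invented.

```python
import re

# A deliberately simple pattern for US phone numbers such as
# "555-123-4567" or "(555) 123-4567"; real-world rules would
# need to cover many more formatting variants.
US_PHONE = re.compile(r"\(?\d{3}\)?[-. ]?\d{3}[-. ]?\d{4}")

text = "Call (555) 123-4567 or 555-987-6543 for assistance."
print(US_PHONE.findall(text))
# ['(555) 123-4567', '555-987-6543']
```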
The aim of this work is to propose a hybrid approach based on two stages. First, we have a dictionary phase, where a list of all the possible terms is generated by looking for matches in a database. This aims to build a low-precision, high-recall set containing all the candidate TTs. Then, this set is filtered using a machine learning algorithm that ideally is able to discriminate between "good" and "bad" terms selected in the dictionary matching phase, in order to increase the precision.

This approach is realized by using two software modules. The first phase is performed by the OntoGene pipeline (Rinaldi et al., 2012b; Rinaldi, 2012), which performs TTE on documents in the biomedical field using a dictionary approach. Then, OntoGene's results are handed to Distiller, a framework for information extraction introduced in Basaldella et al. (2015), which performs the machine learning filtering phase.
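The two-stage architecture can be summarized with a short sketch. This is our own schematic rendering, not the actual OntoGene or Distiller code; `dictionary_candidates` and `ml_filter` are hypothetical stand-ins for the two modules.

```python
from typing import Callable, List

def hybrid_tte(text: str,
               dictionary_candidates: Callable[[str], List[str]],
               ml_filter: Callable[[str, str], bool]) -> List[str]:
    """Two-stage hybrid TTE: high-recall dictionary lookup,
    followed by a high-precision machine learning filter."""
    # Stage 1: collect every dictionary match (low precision, high recall).
    candidates = dictionary_candidates(text)
    # Stage 2: keep only the candidates the classifier accepts.
    return [term for term in candidates if ml_filter(term, text)]
```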
2 Related Work

The field of technical term extraction has about 20 years of history, with early works focusing on extracting a single category of terms, such as protein names, from scientific papers (Fukuda et al., 1998). Later on, "term extraction" became the common name for this task, and some scholars started to introduce the use of terminological resources as a starting point for solving this problem (Aubin and Hamon, 2006).

While the most recent state-of-the-art performance is obtained by machine learning based systems (Leaman et al., 2015), there is growing interest in hybrid machine learning and dictionary systems such as the one described by Akhondi et al. (2016), which obtains interesting performance on chemical entity recognition in patent texts. In the field of concept recognition, there are different strategies for improving the coverage of the recognized entities. For example, known orthologous relations between proteins of different species can be exploited for the detection of protein interactions in full text (Szklarczyk et al., 2015). Groza and Verspoor (2015) explore the impact of case sensitivity and of the information gain of individual tokens in multi-word terms on the performance of a concept recognition system.

The CRAFT corpus (Bada et al., 2012) has been built specifically for evaluating this kind of system, and is described in detail in Section 3.1. Funk et al. (2014) used the corpus to evaluate several CR tools, showing how they perform on the single ontologies in the corpus. Later, Tseytlin et al. (2016) compared their own NOBLE Coder software against other CR algorithms, showing a best F1-score of 0.44. Another system that makes use of CRAFT for evaluation purposes is described in Campos et al. (2013).

3 System Design

3.1 CRAFT Corpus

The CRAFT corpus is a set of 67 manually annotated journal articles from the biomedical field. [1] These articles are taken from the PubMed Central Open Access Subset, [2] a part of the PubMed Central archive licensed under Creative Commons licenses.

[1] The full CRAFT corpus comprises another 30 annotated articles, which are reserved for future competitions and have to date not been released.
[2] http://www.ncbi.nlm.nih.gov/pmc/tools/openftlist/

The corpus contains about 100,000 concept annotations which point to seven ontologies/terminologies:

- Chemical Entities of Biological Interest (ChEBI) (Degtyarenko et al., 2008)
- Cell Ontology [3]
- Entrez Gene (Maglott et al., 2005)
- Gene Ontology (biological process, cellular component, and molecular function) (Ashburner et al., 2000)
- the US National Center for Biotechnology Information (NCBI) Taxonomy [4]
- Protein Ontology [5]
- Sequence Ontology (Eilbeck et al., 2005)

[3] https://github.com/obophenotype/cell-ontology/
[4] http://www.ncbi.nlm.nih.gov/taxonomy
[5] http://pir.georgetown.edu/pirwww/index.shtml

Each of the 67 articles also contains linguistic information, such as tokenized sentences, part-of-speech information, parse trees, and dependency trees. Articles are represented in different formats, such as plain text or XML, and are easily navigable with common resources, such as the Knowtator plugin for the Protégé software. [6]

[6] http://knowtator.sourceforge.net/

To make references to documents in the CRAFT corpus easily retrievable for the reader, when we refer to an article contained in the corpus we list the name of its corresponding XML file as contained in the corpus distribution, its PubMed Central ID (PMCID), and its PubMed ID (PMID). [7]

[7] We will not include articles from the CRAFT corpus in the references, as they are not actual bibliography for the purposes of this work.

3.2 OntoGene

The OntoGene group has developed an approach for biomedical entity recognition based on dictionary lookup and flexible matching. Their approach has been used in several competitive evaluations of biomedical text mining technologies, often obtaining top-ranked results (Rinaldi et al., 2008; Rinaldi et al., 2010; Rinaldi et al., 2012a; Rinaldi et al., 2014). Recently, the core parts of the pipeline have been reimplemented in a more efficient framework using Python (Colic, 2016). It offers a flexible interface for performing dictionary-based TTE.

OntoGene's term annotation pipeline accepts a range of input formats, e.g. PubMed Central full-text XML, gzipped chunks of Medline abstracts, BioC, [8] or simply plain text. It provides the annotated terms along with the corresponding identifiers either in a simple tab-separated text file, in brat's standoff format, [9] or, again, in BioC. It allows for easily plugging in additional components, such as alternative NLP preprocessing methods or postfiltering routines.

[8] http://bioc.sourceforge.net/
[9] http://brat.nlplab.org/standoff.html

In the present work, the pipeline was configured as follows. After sentence splitting, the input documents were tokenized with a simple method based on character class: any contiguous sequence of either alphabetical or numerical characters was considered a token, whereas all other characters (punctuation and whitespace) were considered token boundaries and were ignored during the dictionary look-up. This lossy tokenization already has a normalizing effect, in that it collapses spelling variants which arise from inconsistent use of punctuation symbols, e.g. "SRC 1" vs. "SRC-1" vs. "SRC1". (A similar approach is described by Verspoor et al. (2010), who refer to it as "regularization".) All tokens are then converted to lowercase, except for acronyms that collide with a word from the general language (e.g. "WAS"); we enforced a case-sensitive match in these cases by using a list of the most frequent English words. As a further normalization step, Greek letters were expanded to their letter name in Latin spelling, e.g. α → alpha, since this is a common alternation.
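The tokenization and normalization just described can be sketched as follows. This is our own simplified reconstruction, not the actual OntoGene code; the set of protected acronyms is a hypothetical stand-in for the frequency-based English word list.

```python
import re

GREEK = {"α": "alpha", "β": "beta", "γ": "gamma"}  # excerpt only
PROTECTED = {"WAS"}  # acronyms colliding with common words (hypothetical list)

def normalize_tokens(text: str) -> list:
    """Character-class tokenization with lossy normalization:
    punctuation is dropped, so "SRC 1", "SRC-1" and "SRC1" all
    yield the tokens ["src", "1"]."""
    # Expand Greek letters to their spelled-out Latin names.
    for letter, name in GREEK.items():
        text = text.replace(letter, " %s " % name)
    # Alphabetical and numerical runs become separate tokens.
    tokens = re.findall(r"[A-Za-z]+|[0-9]+", text)
    # Lowercase everything except protected acronyms.
    return [t if t in PROTECTED else t.lower() for t in tokens]

print(normalize_tokens("SRC-1 binds β-catenin"))
# ['src', '1', 'binds', 'beta', 'catenin']
```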
For term matching, we compiled a dictionary resource using the Bio Term Hub (Ellendorff et al., 2015). The Bio Term Hub is a large biomedical terminology resource automatically compiled from a number of curated terminology databases. Its advantage lies in the ease of access, in that it provides terms and identifiers from different sources in a uniform format. It is accessible through a web interface, [10] which recompiles the resource on request and provides it as a tab-separated text file.

[10] http://pub.cl.uzh.ch/purl/biodb/

Selecting the seven ontologies used in CRAFT resulted in a term dictionary with 20.2 million entries. Based on preliminary tests, we removed all entries with terms shorter than two characters or terms consisting of digits only; this reduced the number of entries by less than 0.3%. In the OntoGene system, the entries of the term dictionary were then preprocessed in the same way as the documents. Finally, the input documents were compared to the dictionary with an exact-match strategy.
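A minimal sketch of the exact-match stage, assuming both the dictionary entries and the documents have been normalized with the same `normalize_tokens` routine shown above (again our own illustration, not the OntoGene implementation):

```python
def match_terms(tokens, dictionary, max_len=8):
    """Return (start, end, term) for every token n-gram that
    exactly matches a preprocessed dictionary entry."""
    matches = []
    for start in range(len(tokens)):
        for end in range(start + 1, min(start + 1 + max_len, len(tokens) + 1)):
            candidate = " ".join(tokens[start:end])
            if candidate in dictionary:
                matches.append((start, end, candidate))
    return matches

# Usage: dictionary entries are normalized the same way as documents.
dictionary = {"src 1", "beta catenin"}
print(match_terms(["src", "1", "binds", "beta", "catenin"], dictionary))
# [(0, 2, 'src 1'), (3, 5, 'beta catenin')]
```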
3.3 Distiller

Distiller [11] is an open source framework for machine learning, written in Java and R and introduced in Basaldella et al. (2015). While the framework has its roots in the work of Pudota et al. (2010), and thus focuses on the task of automatic keyphrase extraction (herein AKE), Distiller's design allows us to adapt its pipeline to various purposes.

[11] https://github.com/ailab-uniud/distiller-CORE

AKE is the problem of extracting relevant phrases from a document (Turney, 2000). The difference with respect to TTE is that, while the former is interested in a small set of relevant phrases from the source document, the latter is interested in all domain-specific terms.

While AKE can be performed using unsupervised techniques, the most successful results have been obtained using a supervised machine learning approach (Lopez and Romary, 2010). Supervised AKE is performed using a quite common pipeline: first, the candidate keyphrases are generated, using some kind of linguistic knowledge; then, the AKE algorithm filters the candidates, assigning them features which are in turn used to train a machine learning algorithm that is able to classify "correct" keyphrases. These keyphrases can then be used for several purposes, such as document indexing, filtering and recommendation (De Nart et al., 2013).

To adapt Distiller to perform TTE effectively, we substituted the candidate generation phase with the output of OntoGene, i.e. candidate technical terms become the potential "keyphrases". This configuration is then evaluated as our baseline. Next, we gradually add new features into the system to train a machine learning model specialized in the actual TTE task, and assess the improvements in the performance of the system.

4 Features

4.1 Baseline

First, we evaluated the performance of the OntoGene/Distiller system using the same feature set used in the original keyphrase extraction model presented by Basaldella et al. (2015), which contains:

Frequency: the frequency of the candidate in the document, also known as TF.

Height: the relative position of the first appearance of the candidate in the document.

Depth: the relative position of the last appearance of the candidate in the document.

Lifespan: the distance between the first and the last appearance of the candidate.

TF-IDF: the peculiarity of the candidate with respect to the current document and the CRAFT corpus. This is a very common feature both in the AKE and TTE fields.

Abstract Presence: a flag set to 1 if the candidate appears in the abstract, 0 otherwise. This is motivated by the fact that keyphrases are often found to appear in the abstract.

This small feature set is the baseline of the experimental evaluation performed on the proposed approach.
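A rough sketch of how the positional baseline features can be computed for a candidate term follows. This is one plausible formalization of ours (the actual Distiller implementation, in Java and R, may normalize differently); TF-IDF is omitted because it requires corpus-level document frequencies.

```python
def baseline_features(candidate, doc_tokens, abstract_tokens):
    """Positional AKE features for one candidate term
    (single-token candidates, for simplicity)."""
    positions = [i for i, tok in enumerate(doc_tokens) if tok == candidate]
    assert positions, "candidate must occur in the document"
    n = float(len(doc_tokens))
    return {
        "frequency": len(positions),                     # TF
        "height": positions[0] / n,                      # first appearance
        "depth": positions[-1] / n,                      # last appearance
        "lifespan": (positions[-1] - positions[0]) / n,  # first-to-last distance
        "abstract_presence": int(candidate in abstract_tokens),
        # TF-IDF omitted: needs document frequencies over the whole corpus.
    }
```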
4.2 Feature Set 1

To improve the performance on the TTE task, we start to augment our feature set by introducing features that should be able to capture more fine-grained information about the candidate terms:

Title Presence: a flag which is set to 1 if the term appears in the title of the document and 0 otherwise, much like the Abstract Presence feature.

Symbols Count: a counter for the number of punctuation symbols, i.e. characters that are neither whitespace nor alphanumeric, appearing in the candidate term.

Uppercase Count: a counter for the number of uppercase characters in the candidate term.

Lowercase Count: a counter for the number of lowercase characters in the candidate term.

Digits Count: a counter for the number of digits in the candidate term.

Space Count: a counter for the number of spaces in the candidate term.

Greek Flag: a flag that is set to 1 if the candidate contains a Greek letter in spelled-out form, like "alpha", "beta", and so on.

These features offer a good improvement in detecting the particular shape that a technical term can have. For example, from the document PLoS Biol-2-1-314463.nxml (PMCID: PMC314463, PMID: 14737183) we have the term "5-bromo-4-chloro-3-indolyl beta-D-galactoside". This term contains:

- a spelled-out Greek letter, beta;
- an uppercase letter;
- seven symbols (dashes);
- a whitespace.

Without the new features this information would have been lost, and it may have been much harder to recognize the term as a technical one.
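These shape features are straightforward to compute; a minimal sketch of ours:

```python
import re

GREEK_NAMES = {"alpha", "beta", "gamma", "delta"}  # excerpt only

def shape_features(term):
    """Orthographic features from Feature Set 1 for a candidate term."""
    words = set(re.split(r"[^a-zA-Z]+", term.lower()))
    return {
        "symbols_count": sum(not c.isalnum() and not c.isspace() for c in term),
        "uppercase_count": sum(c.isupper() for c in term),
        "lowercase_count": sum(c.islower() for c in term),
        "digits_count": sum(c.isdigit() for c in term),
        "space_count": term.count(" "),
        "greek_flag": int(bool(words & GREEK_NAMES)),
    }

print(shape_features("5-bromo-4-chloro-3-indolyl beta-D-galactoside"))
# {'symbols_count': 7, 'uppercase_count': 1, 'lowercase_count': 33,
#  'digits_count': 3, 'space_count': 1, 'greek_flag': 1}
```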
4.3 Feature Set 2

In this step we add further features aimed at detecting more fine-grained information about candidate terms. The new features are:

Dash Flag: dashes are one of the most common symbols (if not the most common) found in technical terms. This flag is set to 1 if the term contains a dash, 0 otherwise.

Ending Number Flag: this flag is set to 1 if the term ends with a number, 0 otherwise.

Inside Capitalization: this flag is set to 1 if the term contains an uppercase letter which is not at the beginning of a token.

All Uppercase: this flag is set to 1 if the term contains only uppercase letters, 0 otherwise.

All Lowercase: this flag is set to 1 if the term contains only lowercase letters, 0 otherwise.

4.4 Feature Set 3: Affixes

This feature set adds information about the affixes (i.e. prefixes and suffixes) of the words. This information is particularly useful in the biomedical field, since affixes often convey a particular meaning here: for example, words ending in "ism" are typically diseases, words starting with "zoo" refer to animal life, and so on. Another example is the naming of chemical compounds: many ionic compounds have the suffix "ide", such as Sodium Chloride (the common table salt).

Using the Bio Term Hub resource, we compiled a list of all the prefixes and suffixes of two or three letters from the following databases:

- Cellosaurus, [12] from the Swiss Institute of Bioinformatics;
- chemical compounds found in the Comparative Toxicogenomics Database (CTD), [13] from North Carolina State University;
- diseases found in the CTD;
- Entrez Gene (Maglott et al., 2005);
- Medical Subject Headings (MeSH), [14] from the US National Center for Biotechnology Information (restricted to the subtrees "organisms", "diseases", and "chemicals and drugs");
- reviewed records from the Universal Protein Resource (Swiss-Prot), [15] developed by the UniProt consortium, which is a joint USA-EU-Switzerland project.

[12] http://web.expasy.org/cellosaurus/
[13] http://ctdbase.org/
[14] http://www.ncbi.nlm.nih.gov/mesh
[15] http://www.uniprot.org/

Since not all affixes are equally important, the affix list needs to be cut at some point. While a trivial decision could have been to pick the top 100 or the top 10% of ranked prefixes and suffixes, our choice was to let the machine learning algorithm decide by itself where to apply the cut. To this end, each affix a from a database D is assigned a normalized score s ∈ [0, 1] computed as

    s(a) = freq(a, D) / max{freq(a_1, D), ..., freq(a_|D|, D)}

where freq(a, D) is the frequency of the affix a in D. This way we obtain a simple yet effective mechanism to let the ML algorithm learn which affixes are the most important.

It is also worth noting that, since we generate scores for prefixes and suffixes of two and three letters from six databases, we have a total of 2 × 2 × 6 = 24 features generated with this approach.
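A small sketch of ours of the affix scoring; the list of disease names stands in for a Bio Term Hub database dump:

```python
from collections import Counter

def affix_scores(terms, length, suffix=False):
    """Score each affix of the given length by its frequency in a
    terminology database, normalized by the most frequent affix,
    so that every score s(a) lies in [0, 1]."""
    counts = Counter(
        term[-length:] if suffix else term[:length]
        for term in terms
        if len(term) >= length
    )
    top = max(counts.values())
    return {affix: freq / top for affix, freq in counts.items()}

# Usage: 2 affix lengths x {prefix, suffix} x 6 databases = 24 features.
diseases = ["alcoholism", "dwarfism", "anemia", "embolism"]
print(affix_scores(diseases, 3, suffix=True))
# {'ism': 1.0, 'mia': 0.333...}
```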
4.5 Feature Set 4: Removing AKE Features

Now that we have many features that are specific to the technical term extraction field, we remove the baseline feature set, which was tailored to keyphrase extraction, and use only the features aimed at recognizing technical terms.

The removed features (depth, height, lifespan, frequency, abstract presence, title presence, TF-IDF) are specific to the AKE field and supposedly carry little information about whether a term is technical or not. In fact, a term may appear just once in a random position of the text and still be technical; the same does not hold for a keyphrase, which is assumed to appear many times in specific positions (introduction, conclusions, ...) in the text.

4.6 Test Hardware

Both OntoGene and Distiller have been tested on a laptop computer with an Intel i7-4720HQ processor running at 2.6 GHz, 16 GB of RAM and a Crucial M.2 M550 SSD. The operating system was Ubuntu 15.10.

The throughput was 16,275 words/second for OntoGene and 4,745 words/second for Distiller. OntoGene requires an additional time of about 25 seconds to load the dictionary at start-up, but since this operation is run only once we do not consider it in the average.

Table 1: Scores obtained with the Distiller/OntoGene pipeline using an MLP trained on the CRAFT corpus. In the column headers, "FSn" stands for "Feature Set n".

    Metric     | OntoGene | Baseline | FS1   | FS2   | FS3   | FS4
    -----------|----------|----------|-------|-------|-------|------
    Precision  | 0.342    | 0.692    | 0.682 | 0.710 | 0.771 | 0.853
    Recall     | 0.550    | 0.187    | 0.247 | 0.264 | 0.325 | 0.368
    F1-Score   | 0.421    | 0.294    | 0.362 | 0.385 | 0.457 | 0.515

Table 2: Comparison of the scores obtained with OntoGene, with the combined OntoGene/Distiller pipeline, and the scores reported in Tseytlin et al. (2016).

    System                   | Precision | Recall | F1
    -------------------------|-----------|--------|-----
    MMTx                     | 0.43      | 0.40   | 0.42
    MGrep                    | 0.48      | 0.12   | 0.19
    Concept Mapper           | 0.48      | 0.34   | 0.40
    cTAKES Dictionary Lookup | 0.51      | 0.43   | 0.47
    cTAKES Fast Lookup       | 0.41      | 0.40   | 0.41
    NOBLE Coder              | 0.44      | 0.43   | 0.43
    OntoGene                 | 0.34      | 0.55   | 0.42
    OntoGene+Distiller       | 0.85      | 0.37   | 0.51

5 Results

Using the feature sets defined above, we trained a neural network to classify technical terms. The network used is a simple multi-layer perceptron with one hidden layer containing twice the number of neurons of the input layer, configured to use maximum conditional likelihood. The network is trained using 47 documents of the CRAFT corpus as training set, and its performance is evaluated on the remaining 20, which form the test set.
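As an illustration of this setup, here is a minimal sketch using scikit-learn; this is our choice for the sketch only, since the actual system is built on Distiller's Java/R stack. The toy data stands in for the per-candidate feature vectors of Section 4, and the feature dimensionality is arbitrary.

```python
import numpy as np
from sklearn.neural_network import MLPClassifier

# Toy stand-in data: one row of feature values per candidate term;
# labels: 1 = annotated technical term, 0 = spurious dictionary match.
rng = np.random.default_rng(0)
X_train = rng.random((200, 36))          # dimensionality is arbitrary here
y_train = rng.integers(0, 2, 200)

# One hidden layer with twice as many neurons as input features;
# log-loss training corresponds to maximizing conditional likelihood.
clf = MLPClassifier(hidden_layer_sizes=(2 * X_train.shape[1],),
                    max_iter=1000, random_state=0)
clf.fit(X_train, y_train)
keep = clf.predict(X_train)  # 1 = keep candidate, 0 = filter out
```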
We also experimented with a C5.0 decision tree, but with unsatisfactory results (the performance decreases as the number of features grows), so we do not include its analysis in this paper.

The metrics used are simple Precision, Recall and F1-Score. Table 1 presents the performance of the different iterations of the proposed system. Plain OntoGene obtains 55.0% recall and 34.2% precision, while the baseline AKE feature set improves precision to 69.2% but shows a dramatic drop in recall, to 18.7%.

It can be seen that the introduction of TTE-specific features brings an important improvement in recall, with a 6-point gain between the baseline and Feature Set 1. Together with a small drop in precision of 1 point, this raises the F1-score by 7 points.

Feature Set 2 performs slightly better than Feature Set 1, with a general improvement between 2 and 3 points. Feature Set 3, which adds the affixes, then brings a large improvement of 7 points in F1-score, thanks to improvements of the same order in both precision and recall.

Finally, it is clear that Feature Set 4 (i.e. all the TTE-focused features, without the AKE-focused ones) is the best performing one. The obtained precision of 85.3% is a large improvement over the baseline of 69% and more than twice the precision of the raw OntoGene output, which is just 34.2%. More importantly, recall rises from 18.7% to 36.8% (against a theoretical maximum of 55.0%, the recall of the raw OntoGene output). Feature Sets 3 and 4 also obtain a better F1-score than OntoGene, with 45.7% and 51.5%, respectively, while the score obtained by the OntoGene system alone is 42.1%.

To compare our pipeline with similar TTE/CR software, we use the results of Tseytlin et al. (2016), who compared NOBLE Coder with MMTx, [16] Concept Mapper, [17] cTAKES [18] and MGrep (Dai et al., 2008), as shown in Table 2. Our result outperforms the 0.47 F1-score obtained by the best performing system in that comparison, i.e. cTAKES Dictionary Lookup. This result is achieved thanks to the high precision obtained by Distiller's machine learning stage, which boosts precision to 78%, while the precision of the best performing system in the same comparison is just 51%.

[16] https://mmtx.nlm.nih.gov/MMTx/
[17] https://uima.apache.org/sandbox.html#concept.mapper.annotator
[18] http://ctakes.apache.org/

We must stress that our results are not directly comparable to the ones in Tseytlin et al. (2016), for three reasons. Firstly, we evaluate the combined pipeline only on a portion of the dataset, since a training set is needed for the Distiller system. Secondly, we do not perform concept disambiguation; rather, we count a true positive whenever our pipeline marks a term that spans the same text region as a CRAFT annotation, regardless of which entity is associated with that term, which is an easier task than concept recognition. Thirdly, Tseytlin et al. (2016) also count partial positives, i.e. if the software annotation does not exactly overlap with the gold annotation, they allocate a one-half match in both precision and recall, while we count only exact matches, which puts our system at a disadvantage.
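Our exact-match counting can be made concrete with a short sketch of ours: predicted and gold annotations are compared as (start, end) character spans, and a candidate counts as a true positive only if a gold annotation covers exactly the same region.

```python
def span_prf(predicted, gold):
    """Exact-match precision, recall and F1 over (start, end) spans."""
    predicted, gold = set(predicted), set(gold)
    tp = len(predicted & gold)
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(gold) if gold else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

# Usage: spans are character offsets into the document.
print(span_prf(predicted={(0, 8), (15, 20)}, gold={(0, 8), (30, 42)}))
# (0.5, 0.5, 0.5)
```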
Still, the more than doubled precision with respect to the dictionary-only approach is noteworthy, especially because it compensates for the loss in recall well enough to yield an overall improvement in F1-score. The comparison, while not completely fair, shows that the high precision of our system is hardly matched by other approaches.

The biggest drawback of our approach is the relatively low recall of the OntoGene pipeline, which puts an upper bound on the recall obtainable by the complete pipeline. The 55% recall score obtained on the CRAFT corpus is not a bad result per se, as it is better than the best performance obtained in Tseytlin et al. (2016) by NOBLE Coder and cTAKES Dictionary Lookup. Nevertheless, we believe that recall can be improved by addressing some specific issues, which we analyze in greater detail in Section 6.2.

6 Error Analysis

6.1 False Positives and CRAFT Problems

Looking at the errors made by our system, we believe that some outcomes counted as false positives should actually be marked as true positives. Take as an example document PLoS Genet-1-6-1342629.nxml (PMCID: PMC1315279, PMID: 16362077). In the Discussion section, we have (emphasis ours):

    Serum levels of *estrogen* decreased in aging Sam68−/− females as expected; however, the leptin levels decreased in aged Sam68−/− females.

The term estrogen is not annotated in the CRAFT corpus, even though it is found in the ChEBI resource. OntoGene, on the other hand, recognizes it as a relevant term. The same holds for the two other occurrences of this term in the same article.

In the Results section of the same document, we have

    Given the apparent enhancement of mineralized nodule formation by Sam68−/− bone marrow stromal cells ex vivo and the phenotype observed with short hairpin RNA (shRNA)-treated C3H10T1/2, we stained sections of bone from 4- and 12-month-old mice for evidence of changes in marrow adiposity.

Here, OntoGene annotates the Sequence Ontology term shRNA both in its full and in its abbreviated form. Nevertheless, both are missing from the CRAFT annotations (along with 6 more occurrences of shRNA); however, CRAFT provides annotations for parts of the term (hairpin and RNA).

Then, in the Materials and Methods section, we have

    Briefly, cells were plated on glass coverslips or on dentin slices in 24-well cluster plates for assessment of cell number and pit number, respectively.

Again, the term dentin, which is present in the Cell Ontology, is found by OntoGene but absent from the CRAFT annotations, together with 5 more occurrences of the same term.

Looking at this example document, we can see that the annotation of the CRAFT corpus seems to be somewhat inconsistent. While the reasons may be various and perfectly reasonable (e.g. the guidelines might explicitly exclude the mentioned terms in that context), this fact may affect the training and evaluation of our system.

6.2 Causes of Low Recall

Many terms annotated in the CRAFT corpus are missed by the OntoGene pipeline. As a general observation, the OntoGene pipeline, originally geared towards matching gene and protein names, is not optimally adapted to the broad range of term types to be annotated. A small number of the misses (less than 1%) are caused by the enforced case-sensitive match for words from the general vocabulary (such as "Animal" at the beginning of a sentence). Another portion (around 5%) are due to the matching strategy, in that the aggressive tokenization method removed relevant information, such as trailing punctuation symbols or terms consisting entirely of punctuation (e.g. "+"). Approximately 9% are short terms of one or two characters' length, which had been excluded from the dictionary a priori, as described above. A major portion, though, are inflectional and derivational variants, such as plural forms or derived adjectives (e.g. missed "mammalian" besides matched "mammal"). Some CRAFT annotations include modifiers that are missing from the dictionary, e.g. the protein name "TACC1" is matched on its own, but not when disambiguated with a species modifier such as "mouse TACC1"/"human TACC1". Other occasional misses include paraphrases ("piece of sequence") or spelling errors ("phophatase" instead of "phosphatase").

7 Conclusions and Future Work

In this paper we have presented and evaluated an approach towards efficient recognition of biomedical entities in the scientific literature. Although some limitations are still present in our system, we believe that this approach has the potential to deliver high quality entity recognition, not only for the scientific literature, but for any related form of textual document. We have analyzed the limitations of our approach, clearly discussing the causes of the low recall when evaluating over the CRAFT corpus. The results show that the post-annotation filtering step can significantly increase precision at the cost of a small loss of recall. Additionally, the approach provides a good ranking of the candidate entities, thus enabling a manual selection of the best terms in the context of an assisted curation environment.

As for future work, we intend to improve the coverage of the OntoGene pipeline with respect to the CRAFT annotations. Based on the false-negative analysis, the next steps include: (1) using a stemmer or lemmatizer, (2) optimizing the punctuation handling, and (3) revising the case-sensitive strategy.

We also plan to improve Distiller's machine learning phase, adding more features to the neural network classifier or switching to other approaches used in the literature, such as conditional random fields (Leaman et al., 2015). Another approach that we will investigate is to make the algorithm able to disambiguate between the different term types proposed by the OntoGene pipeline, using a multi-class classifier.
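As a hint of what step (1) could look like, here is a small sketch of ours, using NLTK's WordNet lemmatizer as one possible choice, that lemmatizes tokens before the dictionary look-up. Note that a lemmatizer handles inflection ("mice" → "mouse") but not derivation, so "mammalian" would still be missed.

```python
from nltk.stem import WordNetLemmatizer  # requires the 'wordnet' data package

lemmatizer = WordNetLemmatizer()

def lemmatize_tokens(tokens):
    """Reduce inflectional variants so that plural forms in the text
    can match singular dictionary entries."""
    return [lemmatizer.lemmatize(tok) for tok in tokens]

print(lemmatize_tokens(["mice", "genes", "mammalian"]))
# ['mouse', 'gene', 'mammalian']  -- derivation is left untouched
```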
References

Saber A Akhondi, Ewoud Pons, Zubair Afzal, Herman van Haagen, Benedikt FH Becker, Kristina M Hettne, Erik M van Mulligen, and Jan A Kors. 2016. Chemical entity recognition in patents by combining dictionary-based and statistical approaches. Database, 2016:baw061.

Michael Ashburner, Catherine A Ball, Judith A Blake, David Botstein, Heather Butler, J Michael Cherry, Allan P Davis, Kara Dolinski, Selina S Dwight, Janan T Eppig, et al. 2000. Gene Ontology: tool for the unification of biology. Nature Genetics, 25(1):25–29.

Sophie Aubin and Thierry Hamon. 2006. Improving term extraction with terminological resources. In Advances in Natural Language Processing, pages 380–387. Springer.

Michael Bada, Miriam Eckert, Donald Evans, Kristin Garcia, Krista Shipley, Dmitry Sitnikov, William A Baumgartner, K Bretonnel Cohen, Karin Verspoor, Judith A Blake, et al. 2012. Concept annotation in the CRAFT corpus. BMC Bioinformatics, 13(1):1.

Marco Basaldella, Dario De Nart, and Carlo Tasso. 2015. Introducing Distiller: a unifying framework for knowledge extraction. In Proceedings of the 1st AI*IA Workshop on Intelligent Techniques at Libraries and Archives, co-located with the XIV Conference of the Italian Association for Artificial Intelligence (AI*IA 2015). Associazione Italiana per l'Intelligenza Artificiale.

David Campos, Sérgio Matos, and José Luís Oliveira. 2013. A modular framework for biomedical concept recognition. BMC Bioinformatics, 14:281.

Nicola Colic. 2016. Dependency parsing for relation extraction in biomedical literature. Master's thesis, University of Zurich, Switzerland.

Manhong Dai, Nigam H Shah, Wei Xuan, Mark A Musen, Stanley J Watson, Brian D Athey, Fan Meng, et al. 2008. An efficient solution for mapping free text to ontology terms. AMIA Summit on Translational Bioinformatics, 21.

Dario De Nart, Felice Ferrara, and Carlo Tasso. 2013. Personalized access to scientific publications: from recommendation to explanation. In User Modeling, Adaptation, and Personalization, pages 296–301. Springer Berlin Heidelberg.

Kirill Degtyarenko, Paula De Matos, Marcus Ennis, Janna Hastings, Martin Zbinden, Alan McNaught, Rafael Alcántara, Michael Darsow, Mickaël Guedj, and Michael Ashburner. 2008. ChEBI: a database and ontology for chemical entities of biological interest. Nucleic Acids Research, 36(suppl 1):D344–D350.

Karen Eilbeck, Suzanna E Lewis, Christopher J Mungall, Mark Yandell, Lincoln Stein, Richard Durbin, and Michael Ashburner. 2005. The Sequence Ontology: a tool for the unification of genome annotations. Genome Biology, 6(5):R44.

Tilia Renate Ellendorff, Adrian van der Lek, Lenz Furrer, and Fabio Rinaldi. 2015. A combined resource of biomedical terminology and its statistics. In Thierry Poibeau and Pamela Faber, editors, Proceedings of the 11th International Conference on Terminology and Artificial Intelligence, pages 39–49, Granada, Spain.

Ken-ichiro Fukuda, Tatsuhiko Tsunoda, Ayuchi Tamura, Toshihisa Takagi, et al. 1998. Toward information extraction: identifying protein names from biological papers. In Pac Symp Biocomput, pages 707–718.

Christopher Funk, William Baumgartner, Benjamin Garcia, Christophe Roeder, Michael Bada, K Bretonnel Cohen, Lawrence E Hunter, and Karin Verspoor. 2014. Large-scale biomedical concept recognition: an evaluation of current automatic annotators and their parameters. BMC Bioinformatics, 15(1):1.

Tudor Groza and Karin Verspoor. 2015. Assessing the impact of case sensitivity and term information gain on biomedical concept recognition. PLoS ONE, 10(3):e0119091.

Robert Leaman, Chih-Hsuan Wei, and Zhiyong Lu. 2015. tmChem: a high performance approach for chemical named entity recognition and normalization. Journal of Cheminformatics, 7(S-1):S3.

Patrice Lopez and Laurent Romary. 2010. HUMB: automatic key term extraction from scientific articles in GROBID. In Proceedings of the 5th International Workshop on Semantic Evaluation, pages 248–251. Association for Computational Linguistics.

Donna Maglott, Jim Ostell, Kim D Pruitt, and Tatiana Tatusova. 2005. Entrez Gene: gene-centered information at NCBI. Nucleic Acids Research, 33(suppl 1):D54–D58.

David Nadeau and Satoshi Sekine. 2007. A survey of named entity recognition and classification. Lingvisticae Investigationes, 30(1):3–26.

Nirmala Pudota, Antonina Dattolo, Andrea Baruzzo, Felice Ferrara, and Carlo Tasso. 2010. Automatic keyphrase extraction and ontology mining for content-based tag recommendation. International Journal of Intelligent Systems, 25(12):1158–1186.

Fabio Rinaldi, Thomas Kappeler, Kaarel Kaljurand, Gerold Schneider, Manfred Klenner, Simon Clematide, Michael Hess, Jean-Marc von Allmen, Pierre Parisot, Martin Romacker, and Therese Vachon. 2008. OntoGene in BioCreative II. Genome Biology, 9(Suppl 2):S13.

Fabio Rinaldi, Gerold Schneider, Kaarel Kaljurand, Simon Clematide, Therese Vachon, and Martin Romacker. 2010. OntoGene in BioCreative II.5. IEEE/ACM Transactions on Computational Biology and Bioinformatics, 7(3):472–480.

Fabio Rinaldi, Simon Clematide, and Simon Hafner. 2012a. Ranking of CTD articles and interactions using the OntoGene pipeline. In Proceedings of the 2012 BioCreative Workshop, Washington D.C., April.

Fabio Rinaldi, Gerold Schneider, Simon Clematide, and Gintare Grigonyte. 2012b. Notes about the OntoGene pipeline. In AAAI-2012 Fall Symposium on Information Retrieval and Knowledge Discovery in Biomedical Text, November 2–4, Arlington, Virginia, USA.

Fabio Rinaldi, Simon Clematide, Hernani Marques, Tilia Ellendorff, Raul Rodriguez-Esteban, and Martin Romacker. 2014. OntoGene web services for biomedical text mining. BMC Bioinformatics, 15(Suppl 14):S6.

Fabio Rinaldi. 2012. The OntoGene system: an advanced information extraction application for biological literature. EMBnet.journal, 18(Suppl B):47–49.

Yutaka Sasaki, Yoshimasa Tsuruoka, John McNaught, and Sophia Ananiadou. 2008. How to make the most of NE dictionaries in statistical NER. BMC Bioinformatics, 9(11):1.

Damian Szklarczyk, Andrea Franceschini, Stefan Wyder, Kristoffer Forslund, Davide Heller, Jaime Huerta-Cepas, Milan Simonovic, Alexander Roth, Alberto Santos, Kalliopi P Tsafou, Michael Kuhn, Peer Bork, Lars J Jensen, and Christian von Mering. 2015. STRING v10: protein–protein interaction networks, integrated over the tree of life. Nucleic Acids Research, 43(D1):D447–D452.

Eugene Tseytlin, Kevin Mitchell, Elizabeth Legowski, Julia Corrigan, Girish Chavan, and Rebecca S Jacobson. 2016. NOBLE – flexible concept recognition for large-scale biomedical natural language processing. BMC Bioinformatics, 17(1):1.

Peter D Turney. 2000. Learning algorithms for keyphrase extraction. Information Retrieval, 2(4):303–336.

Karin Verspoor, Christophe Roeder, Helen L Johnson, Kevin Bretonnel Cohen, William A Baumgartner Jr, and Lawrence E Hunter. 2010. Exploring species-based strategies for gene normalization. IEEE/ACM Transactions on Computational Biology and Bioinformatics, 7(3):462–471.