Species taxonomy for gene normalization
György Móra∗1 and Richárd Farkas∗2

1 University of Szeged, Department of Informatics, Szeged, Hungary

2 Hungarian Academy of Sciences, Research Group on Artificial Intelligence, Szeged, Hungary


Email: ∗ gymora@inf.u-szeged.hu; ∗ rfarkas@inf.u-szeged.hu;

∗ Corresponding author


Abstract
Background: The task of gene normalization is to assign a unique identifier from a database to the gene mentions.
Using these identifiers a great deal of information can be gathered from external databases such as interactions,
pathways, sequences and protein structures. Normalizing gene mentions in articles is a difficult task as the
inter-species ambiguity of the gene mentions in biomedical publications is high. The experiences gained from
the BioCreative II Gene Normalization Task indicate that the biggest challenge in gene normalization is the
recognition of the species that a specific gene mention belongs to. In biomedical scientific articles the authors
often use taxonomic entities besides concrete species mentions as references to different groups of organisms.
Species taxonomies are hierarchical systems (trees) of living creatures and therefore provide a classification of
species. Here we investigate the added value of the utilization of taxonomic entity mentions in the inter-species
gene normalization task.
Results: We present a method that marks all words mentioning taxonomic entities (genus, family, etc.)
and applies filtering heuristics to select those taxonomic entities that refer to species mentioned in the document.
These entities are then treated as species mentions together with standard species annotations and we employ
them in gene normalization.
Conclusion: Experiments on the BioCreative III Gene Normalization Task dataset, carried out to investigate the
contribution of the additional species mentions to the gene disambiguation task, show that our approach
improves the performance of the inter-species gene mention disambiguator, both in terms of precision
and recall.


Background
A vast amount of information is present in biological scientific publications. Even the subset of
these documents on a particular scientific topic is now too large for any scientist to read. This is
why search engines and information extraction systems have been developed to support life
scientists in finding the information they need. The key building block of an information
extraction system is a named entity recognizer, which can identify biological entities such as
genes and gene products, cell lines and organism names in a text.

Besides the identification of entity mentions it is important to normalize them. Gene normalization
(GN) is a process where unique database identifiers are assigned to gene mentions, where these mentions
refer to a specific gene entry. Gene databases may contain information related to these genes, such
as sequences, gene products, and interaction and pathway information. For instance, a system
applying entity normalization can assist automatic pathway finding systems and support
pharmacological investigations.

Recent studies [1, 2] indicate that the intra-species ambiguity of gene symbols is much lower than
the ambiguity between species, so it is important to determine which species a gene mention belongs
to. The use of synonyms instead of official gene symbols also increases ambiguity, and some authors
prefer these synonyms to the official symbols.

Current inter-species GN approaches focus on species words and employ species mention detectors to
recognize them [2, 3]. Normalization systems then use machine-learnt models or hand-crafted rules to
determine the species associated with a particular gene mention. The NACTEM Species Disambiguator
that we applied makes use of a natural language parser to exploit the linguistic relations between
the gene and species mentions [3].

There are species mention detectors available with suitable precision and recall, but these systems
focus on identifying exact species mentions, such as the scientific and common names of living
organisms, and not the names of groups, classes, genera or other taxonomic categories. Authors can
use these taxonomic names in a general way for a group of species, or as references to a finite set
of species mentioned earlier in the document. If the taxonomic name refers to an exact species, then
an inter-species gene disambiguation system can exploit this information.

We used the BioCreative III full-text, document-level corpus for our evaluation because we found no
suitable mention-level gold-standard dataset with inter-species gene normalization. Current trends
in biomedical text mining are directed towards systems that work on full-text articles rather than
just abstracts. The two other corpora [2, 3] available for inter-species GN evaluation are based on
the BioCreative II Gene Normalization Task’s dataset and consist of biomedical article abstracts [4].

The corpus used by Hakenberg et al. [2] contains all of the gene identifiers mentioned in the
abstract, but it is annotated only at the document level. Although the corpus introduced by Wang et
al. [3] is annotated at the mention level, every entity was annotated with only one gene identifier,
whereas in cases like ‘rat and human BMP4’ multiple identifiers should be assigned. We compared the
annotations of the two corpora, which consist of the same document set, and there were significant
differences in the genes annotated for a given document. We decided to use the BioCreative III
corpus and not the document-level corpus used by Hakenberg et al. [2] because the latter contains
only abstracts and we think that full-text documents contain more taxonomic entity mentions because
of their writing style.

We implemented a taxonomic name identification system that tags expressions mentioning taxonomic
entities (TE) in biomedical scientific texts and uses heuristic rules to determine the exact
species that they refer to. Our approach was extrinsically evaluated in an inter-species gene
mention normalization setting. Our results show that the annotation of TEs does indeed improve the
performance of a state-of-the-art GN system.


Methods

Dataset
We used the BioCreative III Shared Task Gene Normalization dataset for the evaluation [5]. The
dataset consists of manually annotated full-text articles. A subset of the documents was fully
annotated, and in the remaining part only the important genes were recognized. Here we used the
fully annotated subset of documents for the evaluation of our approach.

Only genes and gene products from the Entrez Gene database [6] that were clearly related to a
species were annotated. Genes which had no Entrez Gene identifier and gene mentions that refer to a
group of genes were not annotated at all. The annotation also omits a gene mention when the species
associated with the gene cannot be determined, even with domain knowledge. The Entrez Gene
identifiers of the genes contained in each document were provided without the corresponding gene
mentions being marked in the text.

We annotated the species names in the documents with the LINNAEUS species name identification
system for biomedical literature [7], which assigns NCBI Taxonomy [6] identifiers to species mentions.


Figure 1: Flowchart of the experimental set-up (the TE mention recognition and mapping subsystems are
marked with red)
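
The TE mapping subsystem highlighted in Figure 1 links each TE mention to at most one species,
following the heuristic rules listed under "Recognizing alternative species mentions". The
following is only a minimal Python sketch of those rules, not the authors' implementation. The toy
PARENT map stands in for the NCBI Taxonomy, the function names descends_from and resolve_te are
hypothetical, and discarding a TE as soon as one scope yields several candidate species is just one
possible reading of the ambiguity rule.

```python
from typing import Dict, List, Optional, Set

# Toy child -> parent links standing in for the NCBI Taxonomy (illustrative data only).
PARENT: Dict[str, str] = {
    "Drosophila melanogaster": "Drosophila",
    "Drosophila simulans": "Drosophila",
    "Drosophila": "Insecta",
    "Arabidopsis thaliana": "Viridiplantae",
}


def descends_from(species: str, taxon: str) -> bool:
    """Return True if `taxon` is an ancestor of `species` in the toy taxonomy."""
    node = PARENT.get(species)
    while node is not None:
        if node == taxon:
            return True
        node = PARENT.get(node)
    return False


def resolve_te(taxon: str, scopes: List[Set[str]]) -> Optional[str]:
    """Map a TE mention to a single species, or return None to discard it.

    `scopes` lists the species annotated in the TE's sentence, paragraph,
    section and document, in that order (narrowest scope first).
    """
    for species_in_scope in scopes:
        # Only descendants of the TE's taxonomic category are candidates.
        candidates = {s for s in species_in_scope if descends_from(s, taxon)}
        if len(candidates) == 1:
            return next(iter(candidates))  # unambiguous reference
        if len(candidates) > 1:
            return None  # several candidates: assumed to be a general term
    return None  # no descendant species annotated anywhere in the document


# TE "Drosophila": no species in the sentence, one candidate in the paragraph.
sentence: Set[str] = set()
paragraph = {"Drosophila melanogaster"}
document = {"Drosophila melanogaster", "Arabidopsis thaliana"}
print(resolve_te("Drosophila", [sentence, paragraph, document, document]))
```

In this toy run the sentence scope is empty, so the search widens to the paragraph, where
Drosophila melanogaster is the only descendant of the genus and is returned as the alternative
species mention.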


Gene mention tagging
The gene mentions were tagged in the document by our dictionary-based gene mention tagger, which
assigned all of the possible Entrez Gene identifiers to the gene mentions. The dictionary mapping is
based on the NLM’s string normalizing method. The normalized substrings of each sentence are matched
against the normalized synonyms of Entrez Gene names in our database. Then hand-crafted rules are
applied to filter out false positive entity mentions and eliminate overlapping annotations of the
same gene mention. One-token entities are only accepted when they contain numerals or non-standard
capitalization and are at least two characters long. Mentions longer than one token are accepted
without restriction. Each gene mention was assigned its possible Entrez Gene identifiers together
with the NCBI Taxonomy species identifier of each gene.


Experimental set-up
We used NACTEM’s Species Disambiguator component from the U-Compare system to provide inter-species
gene normalization [3, 8]. This component assigns NCBI Taxonomy identifiers to each gene mention.
The module applies the species annotations in the document to determine the species associated with
the gene mentions. Two different types of analysis were carried out. One used just the species
mentions tagged by LINNAEUS (baseline), and the other used an extended set of species mentions
containing TEs mapped to species by our system. The only difference was that the second set-up
included our TE mention mapping module and the TE mentions were mapped to species mentions before
gene mention normalization. The gene mention normalization was then evaluated at the document level.
A flowchart of the experimental set-up can be seen in Figure 1. With this we investigated the added
value of TEs inside a state-of-the-art gene mention recognizer (leaving the other components of the
system unchanged).


Recognizing alternative species mentions
The annotation of taxonomic entities (TE) was done using the same method as that for gene tagging.
The synonyms of the NCBI Taxonomy entries were matched against the text and taxonomy identifiers
were assigned to the mentions. TE mentions referring to taxonomic groups that had no members
annotated in the text were filtered out.

The references between TE and species mentions were identified by using the following set of
heuristic rules:

   • Only species descending from the taxonomic category of the TE in the NCBI Taxonomy were
     regarded as possibly referred species.

   • If the sentence containing the TE mention also contained a candidate species mention like


               TE -     TE +     Tagged TE -    Tagged TE +
Precision      0.668    0.695       0.668          0.695
Recall         0.571    0.610       0.798          0.853
F-measure      0.616    0.650       0.727          0.766

Table 1: Performance values of the GN setting without (TE -) and with alternative species (TE +) utilized in
the normalization. Results marked with ”tagged” were produced by an evaluation applied only on a subset
of the genes taken from the evaluation set, which were then successfully mapped to the documents by the
dictionary mapper.
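
The F-measure row of Table 1 can be reproduced from the precision and recall rows, assuming that
the F-measure reported here is the balanced F1 score, i.e. the harmonic mean of precision and
recall. The reported values are consistent with that reading. A small Python check follows; the
helper name f_measure is chosen purely for illustration.

```python
def f_measure(precision: float, recall: float) -> float:
    """Balanced F-measure (F1): the harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision, recall) pairs copied from Table 1.
table_1 = {
    "TE -": (0.668, 0.571),
    "TE +": (0.695, 0.610),
    "Tagged TE -": (0.668, 0.798),
    "Tagged TE +": (0.695, 0.853),
}

for setting, (p, r) in table_1.items():
    # Prints 0.616, 0.650, 0.727 and 0.766, matching the F-measure row.
    print(f"{setting}: F = {f_measure(p, r):.3f}")
```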


     this, then the TE was considered to refer to this species.

   • If multiple species satisfied the descendant criteria then the taxonomic entry was considered
     to refer to multiple species or to the general taxonomic class, and in both cases it was
     removed.

   • If there was no species annotated in the sentence, the search was continued at the paragraph,
     section, and document level, in that order.

   • At the end only TE mentions annotated with one species were kept and used as alternative
     species mentions in our experiments.


Results
The NACTEM inter-species gene normalization system tagged the gene mentions with a species
identifier, but the datasets available consist of Entrez Gene identifiers assigned to the
documents. To evaluate the performance of our approach we mapped the species identifiers assigned
by the normalizer to gene identifiers and evaluated the resulting set of Entrez identifiers at the
document level with the standard F-measure metric (see Table 1).

The dictionary mapper does not provide a mapping for each gene identifier of the evaluation
dataset. Therefore we provide additional scores, which focus on the performance of the inter-species
normalization instead of the performance of the dictionary lookup, by removing false negatives
which were not annotated by the dictionary lookup ("tagged" in Table 1).

Both the precision and the recall of the inter-species gene mention normalization rose when
utilizing TE mentions present in biomedical articles: according to Table 1, precision rose by about
3 percentage points and recall by about 4 percentage points (5.5 points on the tagged subset).


Discussion
The performance of our approach compared to the state-of-the-art baseline method has an interesting
distribution. Gene normalization with alternative species mentions outperformed the baseline system
in 7 out of the 32 documents, and there were only two cases where our approach achieved a lower
F-measure. In each of these two cases our method added only one false positive, and hence it did
not affect the overall results significantly.

There were 5 documents where no alternative species mentions were tagged by our system, so the
performance of the disambiguation was the same. In the remaining 17 documents, where TE + and TE -
achieved the same results, only a few TEs were recognized. A manual inspection of the document set
showed that these differences were caused by the different writing styles of the authors. Some
authors exclusively use concrete species names when referring to an organism, whereas others also
use TE names to refer to species.

We evaluated the 10 documents containing a significant number of TE mentions, and the overall
F-score rose from 0.40 to 0.65. The precision and recall values went up from 0.37 to 0.56 and from
0.44 to 0.77, respectively. This subset of the BioCreative III documents represents those
biomedical articles where the authors often refer to organisms using broader terms instead of exact
organism names.

The following examples show how the TE mentions can aid gene normalization.

     "Indeed, elevated expression of Drosophila MOF, which counteracts ISWI activity ..."

Here the exact organism name (D. melanogaster) was mentioned elsewhere in the document, so the TE
(Drosophila) was successfully mapped to Drosophila
melanogaster because no other species belonging to the Drosophila subgenus was found in the given
context. The species identifier of the gene mention (MOF) was correctly determined by utilizing the
identified alternative species mention.

Broader TE terms (like plants) were also successfully mapped to the corresponding species mentions
in the text and produced correct gene normalization.

     "By studying plants with mutations in this gene, we found that CBP60g contributes to the
     increases ..."

When no plants other than Arabidopsis thaliana were mentioned in the given context, it was possible
to label the TE plant as A. thaliana.

There were some documents where both of the procedures achieved low scores. A later analysis
revealed that the LINNAEUS species detector was unable to identify species mentions in some cases
where the authors used only short and ambiguous variants of the organism name, like Drosophila
instead of D. melanogaster or Drosophila melanogaster. When TEs were identified in a document but
no species mentions were annotated, the TEs were filtered out. If there was no species identified
in a document, the NACTEM gene disambiguator chose Homo sapiens (human) as the default organism for
gene normalization.

If a TE covers a large number of organisms (like the TE animal), then false positive species
associations can occur, for example when the author uses a TE as a general term rather than as a
taxonomic category. In the next negative example the word animal was linked to C. elegans by
mistake, although it was used in the sense of mammals such as humans rather than a worm like
C. elegans. As a result, the HCF-1 protein was incorrectly identified as a gene product belonging to
C. elegans instead of a human protein.

     "... we have undertaken a genetic analysis in C. elegans to study HCF-1-protein function in
     animal development. The C. elegans HCF-1-related protein is an amino acid protein encoded by
     the hcf-1 gene and referred to here as Ce HCF-1."

Another source of incorrect normalization is when the author refers globally to a group of
organisms, but our heuristics link the TE to an exact species mention in the document. In the next
example the TE vertebrate was used only to name the vertebrates in general, but it was incorrectly
linked to D. simulans, the only vertebrate species identified in the document. This species mention
was itself an error: LINNAEUS had identified D. simulans as a rodent (Dipodomys simulans) and not
as an insect (Drosophila simulans).

     "In spite of the similar global function of insect and vertebrate OBPs ..."


Conclusions
By utilizing the TE mentions as alternative species mentions, the approach we presented here
improves the performance of a state-of-the-art inter-species gene normalization tool. The overall
F-scores measured on the BioCreative III GN dataset rose from 0.61 to 0.65. On a subset of the
dataset, where the writing style of the authors causes the classical approach to achieve poorer
results than on the rest of the test set, our method increased the F-score from 0.40 to 0.64.

A subsequent error analysis indicated that more sophisticated methods are required to resolve the
references between TE mentions and species mentions. We plan to develop an integrated species
mention and alternative organism mention system in the near future.


Authors’ contributions
György Móra developed the software tools used for mention detection, taxonomy browsing,
TE-to-species linking and the evaluation of the results. He was responsible for the statistical
analysis done in this study. Richárd Farkas supervised the work and participated in the writing of
the manuscript. The authors would like to thank those who maintain the Entrez Gene and NCBI
Taxonomy databases [6], the authors of the NACTEM Species Disambiguator [3] and the authors of the
LINNAEUS species name recognizer [7] for making these tools available.


Acknowledgements
This work was supported in part by the NKTH grant (project codename TEXTREND) of the Hungarian
government.


References
1. Chen L, Liu H, Friedman C: Gene name ambiguity of eukaryotic nomenclatures. Bioinformatics 2005,
   21(2):248–256.
2. Hakenberg J, Plake C, Leaman R, Schroeder M, Gonzalez G: Inter-species normalization of gene
   mentions with GNAT. Bioinformatics 2008, 24(16):i126–i132.
3. Wang X, Tsujii J, Ananiadou S: Disambiguating the species of biomedical named entities using
   natural language parsers. Bioinformatics 2010, 26(5):661–667.
4. Morgan A, Lu Z, Wang X, Cohen A, Fluck J, Ruch P, Divoli A, Fundel K, Leaman R, Hakenberg J,
   Sun C, Liu Hh, Torres R, Krauthammer M, Lau W, Liu H, Hsu CN, Schuemie M, Cohen KB, Hirschman L:
   Overview of BioCreative II gene normalization. Genome Biology 2008, 9(Suppl 2):S3,
   [http://genomebiology.com/2008/9/S2/S3].
5. BioCreative III Gene Normalization Task [http://www.biocreative.org/tasks/biocreative-iii/gn/].
6. The NCBI handbook. Bethesda (MD): National Library of Medicine (US), National Center for
   Biotechnology Information 2002, [http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=Books].
7. Gerner M, Nenadic G, Bergman CM: LINNAEUS: a species name identification system for biomedical
   literature. BMC Bioinformatics 2010, 11:85, [http://dx.doi.org/10.1186/1471-2105-11-85].
8. Kano Y, Baumgartner WA, McCrohon L, Ananiadou S, Cohen KB, Hunter L, Tsujii J: U-Compare: share
   and compare text mining tools with UIMA. Bioinformatics 2009, 25(15):1997–1998,
   [http://dx.doi.org/10.1093/bioinformatics/btp289].

