=Paper=
{{Paper
|id=None
|storemode=property
|title=Species taxonomy for gene name normalization
|pdfUrl=https://ceur-ws.org/Vol-714/ShortPaper05_Mora.pdf
|volume=Vol-714
|dblpUrl=https://dblp.org/rec/conf/smbm/MoraF10
}}
==Species taxonomy for gene name normalization==
Species taxonomy for gene normalization
György Móra∗1 and Richárd Farkas∗2
1 University of Szeged, Department of Informatics, Szeged, Hungary
2 Hungarian Academy of Sciences, Research Group on Artificial Intelligence, Szeged, Hungary
Email: ∗ gymora@inf.u-szeged.hu; ∗ rfarkas@inf.u-szeged.hu;
∗ Corresponding author
Abstract
Background: The task of gene normalization is to assign a unique database identifier to each gene mention. With these identifiers, a great deal of information, such as interactions, pathways, sequences and protein structures, can be gathered from external databases. Normalizing gene mentions in articles is difficult because the inter-species ambiguity of gene mentions in biomedical publications is high. The experience gained from the BioCreative II Gene Normalization Task indicates that the biggest challenge in gene normalization is recognizing the species that a specific gene mention belongs to. In biomedical scientific articles, authors often use taxonomic entities besides concrete species mentions to refer to different groups of organisms. Species taxonomies are hierarchical systems (trees) of living creatures and therefore provide a classification of species. Here we investigate the added value of using taxonomic entity mentions in the inter-species gene normalization task.
Results: We present a method that marks all words mentioning taxonomic entities (genus, family, etc.) and applies filtering heuristics to select the taxonomic entities that refer to species mentioned in the document. These entities are then treated as species mentions, together with the standard species annotations, and employed in gene normalization.
Conclusion: Experiments on the BioCreative III Gene Normalization Task dataset, carried out to investigate the contribution of the additional species mentions to gene disambiguation, show that our approach improves the performance of the inter-species gene mention disambiguator in terms of both precision and recall.
Background

A vast amount of information is present in biological scientific publications. Even the complete subset of these documents on one particular scientific topic is too large for any scientist to read nowadays. This is why search engines and information extraction systems have been developed to support life scientists in finding the information they need. The key building block of an information extraction system is a named entity recognizer, which can identify biological entities such as genes and gene products, cell lines and organism names in a text.

Besides the identification of entity mentions it is important to normalize them. Gene normalization (GN) is a process where unique database identifiers are assigned to gene mentions, where these mentions
refer to a specific gene entry. Gene databases may contain information related to these genes, such as sequences, gene products, and interaction and pathway information. For instance, a system applying entity normalization can assist automatic pathway finding systems and support pharmacological investigations.

Recent studies [1, 2] indicate that the intra-species ambiguity of gene symbols is much lower than the ambiguity between species, so it is important to determine which species a gene mention belongs to. Ambiguity increases further because some authors prefer synonyms to the official gene symbols.

Current inter-species GN approaches focus on species words and employ species mention detectors to recognize them [2, 3]. Normalization systems then use machine-learnt models or hand-crafted rules to determine the species associated with a particular gene mention. The NACTEM Species Disambiguator that we applied uses a natural language parser to exploit the linguistic relations between the gene and species mentions [3].

There are species mention detectors available with suitable precision and recall, but these systems focus on identifying exact species mentions, such as the scientific and common names of living organisms, and not the names of groups, classes, genera or other taxonomic categories. Authors can use these taxonomic names in a general way for a group of species or as references to a finite set of species mentioned earlier in the document. If the taxonomic name refers to an exact species, then an inter-species gene disambiguation system can exploit this information.

We used the BioCreative III full-text, document-level corpus for our evaluation because we found no suitable mention-level gold-standard dataset with inter-species gene normalization. Current trends in biomedical text mining are directed towards systems that work on full-text articles rather than just abstracts. The two other corpora [2, 3] available for inter-species GN evaluation are based on the BioCreative II Gene Normalization Task's dataset and consist of biomedical article abstracts [4].

The corpus used by Hakenberg et al. [2] contains all of the gene identifiers mentioned in the abstract, but it is annotated only at the document level. Although the corpus introduced by Wang et al. [3] is annotated at the mention level, every entity was annotated with only one gene id, whereas in cases like 'rat and human BMP4' multiple identifiers should be assigned. We compared the annotations of the two corpora, which consist of the same document set, and there were significant differences in the genes annotated for a given document. We decided to use the BioCreative III corpus and not the document-level corpus used by Hakenberg et al. [2] because the latter contains only abstracts, and we think that full-text documents contain more taxonomic entity mentions because of their writing style.

We implemented a taxonomic name identification system that tags expressions mentioning taxonomic entities (TE) in biomedical scientific texts and uses heuristic rules to determine the exact species that they refer to. Our approach was extrinsically evaluated in an inter-species gene mention normalization setting. Our results show that the annotation of TEs does indeed improve the performance of a state-of-the-art GN system.

Methods

Dataset

We used the BioCreative III Shared Task Gene Normalization dataset for the evaluation [5]. The dataset consists of manually annotated full-text articles. A subset of the documents was fully annotated, and in the remaining part only the important genes were recognized. Here we used the fully annotated subset of documents for the evaluation of our approach.

Only genes and gene products from the Entrez Gene database [6] that were clearly related to a species were annotated. Genes that had no Entrez Gene identifier and gene mentions that refer to a group of genes were not annotated at all. The annotation also omits a gene mention when the species associated with the gene cannot be determined, even with domain knowledge. The Entrez Gene identifiers of the genes contained in each document were provided without the corresponding gene mentions being marked.

We annotated the species names in the documents with the LINNAEUS species name identification system for biomedical literature [7], which assigns NCBI Taxonomy [6] identifiers to species mentions.
Gene mention tagging

The gene mentions were tagged in the document by our dictionary-based gene mention tagger, which assigned all of the possible Entrez Gene identifiers to the gene mentions. The dictionary mapping is based on the NLM's string normalizing method. The normalized substrings of each sentence are matched against the normalized synonyms of Entrez Gene names in our database. Hand-crafted rules are then applied to filter out false positive entity mentions and eliminate overlapping annotations of the same gene mention. One-token entities are accepted only if they are at least two characters long and contain numerals or non-standard capitalization. Mentions longer than one token are accepted without restriction. Each gene mention was assigned its possible Entrez Gene identifiers together with the NCBI Taxonomy species id of each gene.

Experimental set-up

We used the NACTEM Species Disambiguator component from the U-Compare system to provide inter-species gene normalization [3, 8]. This component assigns NCBI Taxonomy identifiers to each gene mention. The module applies the species annotations in the document to determine the species associated with the gene mentions. Two different types of analysis were carried out. One used just the species mentions tagged by LINNAEUS (baseline) and the other used an extended set of species mentions containing TEs mapped to species by our system. The only difference was that the second set-up included our TE mention mapping module, and the TE mentions were mapped to species mentions before gene mention normalization. The gene mention normalization was then evaluated at the document level. A flowchart of the experimental set-up can be seen in Figure 1. With this we investigated the added value of TEs inside a state-of-the-art gene mention recognizer (leaving the other components of the system unchanged).

Figure 1: Flowchart of the experimental set-up (the TE mention recognition and mapping subsystems are marked in red)

Recognizing alternative species mentions

The annotation of taxonomic entities (TE) was done using the same method as that for gene tagging. The synonyms of the NCBI Taxonomy entries were matched against the text and taxonomy identifiers were assigned to the mentions. TE mentions referring to taxonomic groups that had no members annotated in the text were filtered out.

The references between TE and species mentions were identified by using the following set of heuristic rules:

• Only species descending from the taxonomic category of the TE in the NCBI Taxonomy were regarded as possibly referred species.

• If the sentence containing the TE mention also contained a candidate species mention like
Measure      TE -     TE +     Tagged TE -   Tagged TE +
Precision    0.668    0.695    0.668         0.695
Recall       0.571    0.610    0.798         0.853
F-measure    0.616    0.650    0.727         0.766
Table 1: Performance of the GN setting without (TE -) and with (TE +) alternative species mentions utilized in the normalization. Results marked "Tagged" were produced by evaluating only the subset of genes from the evaluation set that were successfully mapped to the documents by the dictionary mapper.
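The F-measure row of Table 1 can be reproduced from its precision and recall rows. The following is a minimal sketch, assuming that the "standard F-measure" is the balanced harmonic mean of precision and recall (beta = 1); the weighting is not stated in the text, and the names `f_measure` and `table_f_scores` are illustrative only.

```python
# Precision and recall as reported in Table 1
# (TE - = baseline, TE + = with alternative species mentions).
TABLE_1 = {
    "TE -": (0.668, 0.571),
    "TE +": (0.695, 0.610),
    "Tagged TE -": (0.668, 0.798),
    "Tagged TE +": (0.695, 0.853),
}


def f_measure(precision: float, recall: float, beta: float = 1.0) -> float:
    """F-beta score; beta = 1 (assumed here) gives the harmonic mean."""
    if precision == 0.0 and recall == 0.0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


def table_f_scores() -> dict:
    """Recompute the F-measure of every Table 1 column, rounded to 3 decimals."""
    return {name: round(f_measure(p, r), 3) for name, (p, r) in TABLE_1.items()}


if __name__ == "__main__":
    for name, score in table_f_scores().items():
        print(f"{name}: F = {score:.3f}")
```

Under this assumption the recomputed values agree with the F-measure row of the table to three decimal places.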
this, then the TE was considered to refer to this species.

• If multiple species satisfied the descendant criterion, then the taxonomic entry was considered to refer either to multiple species or to the general taxonomic class, and in both cases it was removed.

• If there was no species annotated in the sentence, the search was continued at the paragraph, section, and document level, in that order.

• At the end only TE mentions annotated with one species were kept and used as alternative species mentions in our experiments.

Results

The NACTEM inter-species gene normalization system tagged the gene mentions with a species identifier, but the available datasets consist of Entrez Gene identifiers assigned to the documents. To evaluate the performance of our approach we mapped the species identifiers assigned by the normalizer to gene identifiers and evaluated the resulting set of Entrez Gene identifiers at the document level with the standard F-measure metric (see Table 1).

The dictionary mapper does not provide a mapping for each gene identifier of the evaluation dataset. Therefore we provide additional scores that focus on the performance of the inter-species normalization instead of the performance of the dictionary lookup, obtained by removing the false negatives that were not annotated by the dictionary lookup ("tagged" in Table 1).

Both the precision and the recall of the inter-species gene mention normalization rose by 4-5 percentage points when utilizing TE mentions present in biomedical articles.

Discussion

The performance gain of our approach over the state-of-the-art baseline method is unevenly distributed across the documents. The gene normalization with alternative species mentions outperformed the baseline system in 7 out of the 32 documents and there were only two cases where our approach achieved a lower F-measure. In these two cases our method added only one false positive each and hence it did not affect the overall results significantly.

There were 5 documents where no alternative species mentions were tagged by our system, so the performance of the disambiguation was the same. In the remaining 17 documents, where TE + and TE - achieved the same results, only a few TEs were recognized. A manual inspection of the document set showed that these differences were caused by the different writing styles of the authors. Some authors exclusively use concrete species names when referring to an organism, while others also use TE names to refer to species.

We evaluated the 10 documents containing a significant amount of TE mentions and the overall F-score rose from 0.40 to 0.65. The precision and recall values went up from 0.37 to 0.56 and from 0.44 to 0.77, respectively. This subset of the BioCreative III documents represents those biomedical articles where the authors often refer to organisms using broader terms instead of using exact organism names.

The following examples show how the TE mentions can aid gene normalization.

"Indeed, elevated expression of Drosophila MOF, which counteracts ISWI activity . . . "

Here the exact organism name (D. melanogaster) was mentioned elsewhere in the document, so the TE (in bold type) was successfully mapped to Drosophila
melanogaster because no other species belonging to the Drosophila subgenus was found in the given context. The species identifier of the gene mention (in italics) was correctly determined by utilizing the identified alternative species mention.

Wider TE terms (like plants) were also successfully mapped to the corresponding species mentions in the text and produced correct gene normalization.

"By studying plants with mutations in this gene, we found that CBP60g contributes to the increases . . . "

When no plants other than Arabidopsis thaliana were mentioned in the given context it was possible to label the TE plant as A. thaliana.

There were some documents where both of the procedures achieved low scores. A later analysis revealed that the LINNAEUS species detector was unable to identify species mentions in some cases where the authors used only short and ambiguous variants of the organism name, like Drosophila instead of D. melanogaster or Drosophila melanogaster. When TEs were identified in a document but no species mentions were annotated, the TEs were filtered out. If there was no species identified in a document, the NACTEM gene disambiguator chose Homo sapiens (human) as the default organism for gene normalization.

If a TE covers a large number of organisms (like the TE animal), then false positive species associations can occur, for example when the author uses a TE as a general term rather than as a taxonomic category. In the next negative example the word animal was linked to C. elegans by mistake, although the word was used in the sense of mammals such as humans rather than a worm like C. elegans. As a result the HCF-1 protein was incorrectly identified as a gene product belonging to C. elegans instead of a human protein.

". . . we have undertaken a genetic analysis in C. elegans to study HCF-1-protein function in animal development. The C. elegans HCF-1-related protein is an amino acid protein encoded by the hcf-1 gene and referred to here as Ce HCF-1."

Another source of incorrect normalization is when the author refers globally to a group of organisms, but our heuristics link the TE to an exact species mention in the document. In the next example the TE vertebrate was used only to name the vertebrates in general, but it was incorrectly linked to D. simulans, the only vertebrate species identified in the document. Moreover, D. simulans had itself been incorrectly identified by LINNAEUS as a rodent (Dipodomys simulans) and not as an insect (Drosophila simulans).

"In spite of the similar global function of insect and vertebrate OBPs . . . "

Conclusions

By utilizing the TE mentions as alternative species mentions, the approach we presented here improves the performance of a state-of-the-art inter-species gene normalization tool. The overall F-scores measured on the BioCreative III GN dataset rose from 0.61 to 0.65. On a subset of the dataset, where the writing style of the authors causes the classical approach to achieve poorer results than on the rest of the test set, our method increased the F-score from 0.40 to 0.64.

A subsequent error analysis indicated that more sophisticated methods are required to resolve the references between TE mentions and species mentions. We plan to develop an integrated species mention and alternative organism mention system in the near future.

Authors' contributions

György Móra developed the software tools used for mention detection, taxonomy browsing, TE-to-species linking and the evaluation of the results. He was responsible for the statistical analysis done in this study. Richárd Farkas supervised the work and participated in the writing of the manuscript. The authors would like to thank those who maintain the Entrez Gene and NCBI Taxonomy databases [6], the authors of the NACTEM Species Disambiguator [3] and the authors of the LINNAEUS species name recognizer [7] for making these tools available.

Acknowledgements

This work was supported in part by the NKTH grant (project codename TEXTREND) of the Hungarian government.
References

1. Chen L, Liu H, Friedman C: Gene name ambiguity of eukaryotic nomenclatures. Bioinformatics 2005, 21(2):248–256.

2. Hakenberg J, Plake C, Leaman R, Schroeder M, Gonzalez G: Inter-species normalization of gene mentions with GNAT. Bioinformatics 2008, 24(16):i126–i132.

3. Wang X, Tsujii J, Ananiadou S: Disambiguating the species of biomedical named entities using natural language parsers. Bioinformatics 2010, 26(5):661–667.

4. Morgan A, Lu Z, Wang X, Cohen A, Fluck J, Ruch P, Divoli A, Fundel K, Leaman R, Hakenberg J, Sun C, Liu Hh, Torres R, Krauthammer M, Lau W, Liu H, Hsu CN, Schuemie M, Cohen KB, Hirschman L: Overview of BioCreative II gene normalization. Genome Biology 2008, 9(Suppl 2):S3, [http://genomebiology.com/2008/9/S2/S3].

5. BioCreative III Gene Normalization Task [http://www.biocreative.org/tasks/biocreative-iii/gn/].

6. The NCBI handbook. Bethesda (MD): National Library of Medicine (US), National Center for Biotechnology Information 2002, [http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=Books].

7. Gerner M, Nenadic G, Bergman CM: LINNAEUS: a species name identification system for biomedical literature. BMC Bioinformatics 2010, 11:85, [http://dx.doi.org/10.1186/1471-2105-11-85].

8. Kano Y, Baumgartner WA, McCrohon L, Ananiadou S, Cohen KB, Hunter L, Tsujii J: U-Compare: share and compare text mining tools with UIMA. Bioinformatics 2009, 25(15):1997–1998, [http://dx.doi.org/10.1093/bioinformatics/btp289].