<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Species taxonomy for gene normalization</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>György Móra</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Richárd Farkas</string-name>
          <email>rfarkas@inf.u-szeged.hu</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Hungarian Academy of Sciences, Research Group on Artificial Intelligence</institution>
          ,
          <addr-line>Szeged</addr-line>
          ,
          <country country="HU">Hungary</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>University of Szeged, Department of Informatics</institution>
          ,
          <addr-line>Szeged</addr-line>
          ,
          <country country="HU">Hungary</country>
        </aff>
      </contrib-group>
      <fpage>122</fpage>
      <lpage>127</lpage>
      <abstract>
        <p>Background: The task of gene normalization is to assign a unique database identifier to each gene mention. Using these identifiers, a great deal of information, such as interactions, pathways, sequences and protein structures, can be gathered from external databases. Normalizing gene mentions in articles is difficult because the inter-species ambiguity of gene mentions in biomedical publications is high. Experience gained from the BioCreative II Gene Normalization Task indicates that the biggest challenge in gene normalization is recognizing the species that a specific gene mention belongs to. In biomedical scientific articles, authors often use taxonomic entities besides concrete species mentions to refer to different groups of organisms. Species taxonomies are hierarchical systems (trees) of living creatures and therefore provide a classification of species. Here we investigate the added value of taxonomic entity mentions in the inter-species gene normalization task. Results: We present a method which marks all words that mention taxonomic entities (genus, family, etc.) and applies filtering heuristics to select the taxonomic entities that refer to species mentioned in the document. These entities are then treated as species mentions together with standard species annotations, and we employ them in gene normalization. Conclusion: We carried out experiments on the BioCreative III Gene Normalization Task's dataset to investigate the contribution of the additional species mentions to gene disambiguation, and found that our approach improves the performance of the inter-species gene mention disambiguator in terms of both precision and recall.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Background</title>
      <p>A vast amount of information is present in
biological scientific publications. Even the subset
of these documents on a particular scientific topic is
too large for any scientist to read nowadays. This is
why search engines and information extraction
systems have been developed to support life
scientists in finding the information they need. The key
building block of an information extraction system
is a named entity recognizer which can identify
biological entities such as genes and gene products, cell
lines and organism names in a text.</p>
      <p>Besides the identification of entity mentions it is
important to normalize them. Gene normalization
(GN) is a process where unique database identifiers
are assigned to gene mentions, where these mentions
refer to a specific gene entry. Gene databases may
contain information related to these genes such as
sequences, gene products and interaction and pathway
information. For instance, a system applying entity
normalization can assist automatic pathway finding
systems and support pharmacological investigations.</p>
      <p>
        Recent studies [
        <xref ref-type="bibr" rid="ref1 ref2">1, 2</xref>
        ] indicate that the
intraspecies ambiguity of gene symbols is much lower
than the ambiguity between species, so it is
important to determine which species the gene mention
belongs to. The use of synonyms instead of official
gene symbols also increases ambiguity and some
authors prefer to use these synonyms instead of official
symbols.
      </p>
      <p>
        Current inter-species GN approaches focus on
species words and employ species mention detectors
to recognize them [
        <xref ref-type="bibr" rid="ref2 ref3">2, 3</xref>
        ]. Normalization
systems then use machine-learnt models or hand-crafted rules
to determine the species associated with a
particular gene mention. The NACTEM’s Species
Disambiguator we applied makes use of a natural language
parser to exploit the linguistic relations between
the gene and species mentions [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ].
      </p>
      <p>Species mention detectors with suitable precision
and recall are available, but these systems
focus on identifying exact species mentions, such as
the scientific and common names of living
organisms, and not the names of groups, classes, genera or
other taxonomic categories. Authors can use
these taxonomic names in a general way for a group
of species or as references to a finite set of species
mentioned earlier in the document. If the taxonomic
name refers to an exact species, then an inter-species
gene disambiguation system can exploit this
information.</p>
      <p>
        We used the BioCreative III full-text,
document-level corpus for our evaluation because we found
no suitable mention-level gold-standard dataset with
inter-species gene normalization. Current trends in
biomedical text mining are directed towards systems
that work on full-text articles rather than just
abstracts. The two other corpora [
        <xref ref-type="bibr" rid="ref2 ref3">2, 3</xref>
        ] available for
inter-species GN evaluation are based on the
BioCreative II Gene Normalization Task’s dataset and
consist of biomedical article abstracts [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ].
      </p>
      <p>
        The corpus used by Hakenberg et al. [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] contains
all of the gene identifiers mentioned in the abstract,
but it is annotated only at the document level.
Although the corpus introduced by Wang et al. [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] is
annotated at the mention level, every entity was
annotated with only one gene id, whereas in cases like ‘rat
and human BMP4’ multiple identifiers should be
assigned. We compared the annotations of the two
corpora, which consist of the same document set,
and there were significant differences in the genes
annotated for a given document. We decided to use
the BioCreative III corpus and not the
document-level corpus used by Hakenberg et al. [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] because
the latter contains only abstracts and we expect
full-text documents to contain more taxonomic entity
mentions because of their writing style.
      </p>
      <p>We implemented a taxonomic name identification
system that tags expressions in biomedical scientific
texts mentioning taxonomic entities (TEs), and uses
heuristic rules to determine the exact species that
they refer to. Our approach was evaluated extrinsically
in an inter-species gene mention normalization
setting. Our results show that the annotation of TEs
does indeed improve the performance of a
state-of-the-art GN system.</p>
    </sec>
    <sec id="sec-2">
      <title>Methods</title>
      <sec id="sec-2-1">
        <title>Dataset</title>
        <p>
          We used the BioCreative III Shared Task Gene
Normalization dataset for the evaluation [
          <xref ref-type="bibr" rid="ref5">5</xref>
          ]. The
dataset consists of manually annotated full-text
articles. A subset of the documents was fully
annotated and in the remaining part only the important
genes were recognized. Here we used the fully
annotated subset of documents for the evaluation of our
approach.
        </p>
        <p>
          Only genes and gene products from the Entrez
Gene database [
          <xref ref-type="bibr" rid="ref6">6</xref>
          ] that were clearly related to a
species were annotated. Genes which had no
Entrez Gene identifier and gene mentions that refer to
a group of genes were not annotated at all. The
annotation also omits a gene mention when
the species associated with the gene cannot be
determined, even with domain knowledge. The Entrez
Gene identifiers of the genes contained in each document
were provided, but the corresponding gene mentions
were not marked.
        </p>
        <p>
          We annotated the species names in the
documents with the LINNAEUS species name identification
system [
          <xref ref-type="bibr" rid="ref7">7</xref>
          ] for biomedical literature, which assigns
NCBI Taxonomy [
          <xref ref-type="bibr" rid="ref6">6</xref>
          ] identifiers to species mentions.
        </p>
      </sec>
      <sec id="sec-2-2">
        <title>Gene mention tagging</title>
        <p>The gene mentions were tagged in the document
by our dictionary-based gene mention tagger, which
assigned all of the possible Entrez Gene identifiers
to the gene mentions. The dictionary mapping is
based on the NLM’s string normalizing method. The
normalized substrings of each sentence are matched
against the normalized synonyms of Entrez Gene
names in our database. Then hand-crafted rules are
applied to filter out false positive entity mentions
and eliminate overlapping annotations of the same
gene mention. One-token entities are only
accepted when they are at least two characters long
and contain numerals or non-standard
capitalization. Mentions longer than one token are accepted
without restriction. Each gene mention was assigned
its possible Entrez Gene identifiers together with the NCBI
Taxonomy species id of each gene.</p>
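        <p>The acceptance rule for tagged mentions can be sketched as follows. This is only an illustration under stated assumptions: the function names are hypothetical, whitespace tokenization is assumed, and "non-standard capitalization" is interpreted here as any token whose letters are neither all lower case nor in ordinary initial-capital form. The text does not define the rule more precisely.</p>
        <preformat preformat-type="code"><![CDATA[
```python
def has_nonstandard_capitalization(token: str) -> bool:
    """Assumed reading: the letters are neither all lower case nor 'Capitalized'."""
    word = "".join(c for c in token if c.isalpha())
    if not word:
        return False
    return word != word.lower() and word != word.capitalize()


def accept_mention(mention: str) -> bool:
    """Apply the length and capitalization filter to a dictionary match."""
    tokens = mention.split()  # whitespace tokenization is an assumption
    if len(tokens) > 1:
        return True  # multi-token mentions are accepted without restriction
    if not tokens:
        return False
    token = tokens[0]
    if len(token) < 2:
        return False  # one-token mentions need at least two characters
    has_digit = any(c.isdigit() for c in token)
    return has_digit or has_nonstandard_capitalization(token)
```
]]></preformat>
        <p>Under these assumptions a token such as BMP4 or MOF passes the filter, whereas an ordinary word such as insulin does not.</p>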
      </sec>
      <sec id="sec-2-3">
        <title>Experimental set-up</title>
        <p>
          We used the NACTEM’s Species Disambiguator
component from the uCompare system to provide
inter-species gene-normalization [
          <xref ref-type="bibr" rid="ref3 ref8">3, 8</xref>
          ]. This
component assigns NCBI Taxonomy identifiers to each gene
mention. The module applies the species
annotations in the document to determine the species
associated with the gene mentions. Two different types
of analysis were carried out. One was with just the
species mentions tagged by LINNAEUS (baseline)
and the other used an extended set of species
mentions containing TEs mapped to species by our
system. The only difference was that the second set-up
included our TE mention mapping module and the
TE mentions were mapped to species mentions
before gene mention normalization. The gene-mention
normalization was then evaluated at the document
level. A flowchart of the experimental set-up can
be seen in Figure 1. In this way we investigated the
added value of TEs inside a state-of-the-art gene
mention recognizer (leaving the other components of
the system unchanged).
        </p>
      </sec>
      <sec id="sec-2-4">
        <title>Recognizing alternative species mentions</title>
        <p>The annotation of taxonomic entities (TE) was done
using the same method as that for gene tagging.
The synonyms of the NCBI Taxonomy entries were
matched against the text and taxonomy identifiers
were assigned to the mentions. TE mentions
referring to taxonomic groups that had no members
annotated in the text were filtered out.</p>
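        <p>The references between TE and species mentions
were identified by using the following set of
heuristic rules:</p>
        <list list-type="bullet">
          <list-item>
            <p>Only species descending from the taxonomic
category of the TE in the NCBI Taxonomy
were regarded as possibly referred species.</p>
          </list-item>
          <list-item>
            <p>If the sentence containing the TE mention
also contained a candidate species mention, that
species was assigned to the TE.</p>
          </list-item>
          <list-item>
            <p>If there was no species annotated in the
sentence, the search was continued at the
paragraph, section, and document level,
respectively.</p>
          </list-item>
          <list-item>
            <p>At the end only TE mentions annotated with
one species were kept and used as alternative
species mentions in our experiments.</p>
          </list-item>
        </list>
        <table-wrap>
          <label>Table 1</label>
          <table>
            <thead>
              <tr>
                <th />
                <th>Precision</th>
                <th>Recall</th>
                <th>F-measure</th>
              </tr>
            </thead>
            <tbody>
              <tr>
                <td>TE</td>
                <td>0.668</td>
                <td>0.571</td>
                <td>0.616</td>
              </tr>
            </tbody>
          </table>
        </table-wrap>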
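        <p>The context-widening search in the heuristic rules above can be sketched as follows. This is a minimal illustration, not the implementation used in the experiments: the names SCOPES, location and resolve_te are hypothetical, the candidate list is assumed to be already restricted to species descending from the TE, and a TE is assumed to be discarded when the first scope that contains any candidate contains more than one species.</p>
        <preformat preformat-type="code"><![CDATA[
```python
SCOPES = ("sentence", "paragraph", "section", "document")


def location(section, paragraph, sentence):
    """Describe where a mention sits, from the narrowest to the widest scope."""
    return {
        "sentence": (section, paragraph, sentence),
        "paragraph": (section, paragraph),
        "section": section,
        "document": 0,  # every mention shares the same document
    }


def resolve_te(te_loc, candidates):
    """Return the single species a TE refers to, or None.

    candidates is a list of (species_id, location) pairs for species
    mentions that descend from the TE in the taxonomy.
    """
    for scope in SCOPES:
        found = {sp for sp, loc in candidates if loc[scope] == te_loc[scope]}
        if found:
            # keep the TE only when exactly one species is in scope
            return found.pop() if len(found) == 1 else None
    return None
```
]]></preformat>
        <p>For example, a TE whose sentence and paragraph contain no candidate species, but whose section contains mentions of a single species, would be linked to that species.</p>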
      </sec>
    </sec>
    <sec id="sec-3">
      <title>Results</title>
      <p>The NACTEM’s inter-species gene normalization
system tagged the gene mentions with a species
identifier, but the datasets available consist of Entrez
Gene identifiers assigned to the documents. To
evaluate the performance of our approach we mapped
the species identifiers assigned by the normalizer to
gene identifiers and evaluated the resulting set of
Entrez identifiers at the document level with the
standard F-measure metric (see Table 1).</p>
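      <p>The mapping from species identifiers back to gene identifiers can be pictured with a short sketch. It assumes that each gene mention carries its candidate Entrez Gene identifiers together with the species of each candidate, as produced by the dictionary tagger; the function name and the identifiers in the usage note are hypothetical placeholders.</p>
      <preformat preformat-type="code"><![CDATA[
```python
def genes_for_document(mentions):
    """Collect the document-level set of gene ids.

    mentions is a list of (candidates, species) pairs, where candidates maps
    a candidate gene id to its species id and species is the species id the
    disambiguator assigned to that mention.
    """
    genes = set()
    for candidates, species in mentions:
        # keep only the candidates that belong to the assigned species
        genes.update(gene for gene, tax in candidates.items() if tax == species)
    return genes
```
]]></preformat>
      <p>A mention with candidates for both a human and a rat gene that was assigned the species rat would contribute only the rat gene identifier to the document-level set.</p>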
      <p>The dictionary mapper does not provide a
mapping for each gene identifier of the evaluation data
set. Therefore we provide additional scores, which focus
on the performance of the inter-species
normalization rather than the performance of the dictionary
lookup, by removing false negatives which were not
annotated by the dictionary lookup (“tagged” in
Table 1).</p>
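      <p>The two kinds of score can be sketched as set operations on gene identifiers. The sketch assumes that the standard F-measure is the balanced harmonic mean of precision and recall and that scores are computed per document set; the function names are hypothetical, and how the scores were aggregated over documents is not stated in the text.</p>
      <preformat preformat-type="code"><![CDATA[
```python
def prf(predicted, gold):
    """Precision, recall and balanced F-measure for two sets of gene ids."""
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    total = precision + recall
    f_measure = 2 * precision * recall / total if total else 0.0
    return precision, recall, f_measure


def prf_tagged(predicted, gold, dictionary_hits):
    """Same scores after dropping gold ids the dictionary lookup never proposed."""
    return prf(predicted, gold & dictionary_hits)
```
]]></preformat>
      <p>Restricting the gold set in this way leaves precision unchanged in the typical case and raises recall, because the false negatives caused by missing dictionary entries no longer count against the normalizer.</p>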
      <p>Both the precision and the recall of the
inter-species gene mention normalization rose by 4 to 5
percentage points when utilizing TE mentions present
in biomedical articles.</p>
    </sec>
    <sec id="sec-4">
      <title>Discussion</title>
      <p>The performance of our approach compared to the
state-of-the-art baseline method has an interesting
distribution. The gene normalization with
alternative species mentions outperformed the baseline
system in 7 out of the 32 documents and there were
only two cases where our approach achieved a lower
F-measure. In each of these two cases our method added
only one false positive and hence did not affect
the overall results significantly.</p>
      <p>There were 5 documents where there were no
alternative species mentions tagged by our system,
so the performance of the disambiguation was the
same. In the remaining 17 documents, where the
TE + and TE - set-ups achieved the same results, only
a few TEs were recognized. A manual inspection
of the document set showed that these differences
were caused by the different writing styles of the
authors. Some authors exclusively use concrete species
names when referring to an organism, while others also use
TE names to refer to species.</p>
      <p>We evaluated the 10 documents containing a
significant number of TE mentions and the overall
F-score rose from 0.40 to 0.65. The precision and
recall values went up from 0.37 to 0.56 and from 0.44
to 0.77, respectively. This subset of the
BioCreative III documents represents those biomedical
articles where the authors often refer to organisms
using broader terms instead of using exact organism
names.</p>
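      <p>As a consistency check, and assuming that the F-score here is the balanced harmonic mean of precision P and recall R (the weighting is not stated in the text), the F-scores for this subset can be recomputed from the rounded precision and recall values given above:</p>
      <preformat preformat-type="code"><![CDATA[
```latex
\begin{align*}
F &= \frac{2PR}{P + R} \\
F_{\text{baseline}} &= \frac{2 \cdot 0.37 \cdot 0.44}{0.37 + 0.44}
  = \frac{0.3256}{0.81} \approx 0.40 \\
F_{\text{TE}} &= \frac{2 \cdot 0.56 \cdot 0.77}{0.56 + 0.77}
  = \frac{0.8624}{1.33} \approx 0.65
\end{align*}
```
]]></preformat>
      <p>Because the inputs are themselves rounded to two decimals, the recomputed values are only approximate.</p>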
      <p>The following examples show how the TE
mentions can aid gene normalization.</p>
      <p>“Indeed, elevated expression of
<bold>Drosophila</bold> <italic>MOF</italic>, which counteracts
ISWI activity . . . ”</p>
      <p>Here the exact organism name (D. melanogaster)
was mentioned elsewhere in the document, so the TE
(in bold type) was successfully mapped to Drosophila
melanogaster because no other species belonging to
the Drosophila subgenus was found in the given
context. The species identifier of the gene mention
(in italics) was correctly determined by utilizing the
identified alternative species mention.</p>
      <p>Broader TE terms (like plants) were also
successfully mapped to the corresponding species mentions
in the text and produced correct gene normalization.</p>
      <p>“By studying plants with mutations in
this gene, we found that CBP60g
contributes to the increases . . . ”</p>
      <p>When no plants other than Arabidopsis thaliana
were mentioned in the given context, it was possible
to label the TE plants as A. thaliana.</p>
      <p>There were some documents where both of the
procedures achieved low scores. A later analysis
revealed that the LINNAEUS species detector was
unable to identify species mentions in some cases where
the authors used only short and ambiguous variants
of the organism name, like Drosophila instead of
D. melanogaster or Drosophila melanogaster. When
TEs were identified in a document but
no species mentions were annotated, the TEs were
filtered out. If no species was identified in a
document, the NACTEM’s gene disambiguator chose
Homo sapiens (human) as the default organism for
gene normalization.</p>
      <p>If a TE covers a large number of organisms (like
the TE animal), then false positive species
associations can occur, for example when the author uses
a TE as a general term rather than as a taxonomic
category. In the next negative example the word
animal was linked to C. elegans by mistake, although
the word was used in the sense of mammals
such as human rather than a worm like C. elegans. As
a result, the HCF-1 protein was incorrectly identified as
a gene product belonging to C. elegans instead of a
human protein.</p>
      <p>“. . . we have undertaken a genetic
analysis in C. elegans to study
HCF-1 protein function in animal development.
The C. elegans HCF-1-related protein is
an amino acid protein encoded by the
hcf-1 gene and referred to here as Ce
HCF-1.”</p>
      <p>Another source of incorrect normalization is
when the author refers globally to the group of
organisms, but our heuristics link the TE to an
exact species mention in the document. In the next
example the TE vertebrate was used only to name
the vertebrates in general, but it was incorrectly
linked to D. simulans, the only vertebrate
species identified in the document. Also, D.
simulans was incorrectly identified by LINNAEUS as a
rodent (Dipodomys simulans) and not as an insect
(Drosophila simulans).</p>
      <p>“In spite of the similar global function of
insect and vertebrate OBPs . . . ”</p>
    </sec>
    <sec id="sec-5">
      <title>Conclusions</title>
      <p>By utilizing the TE mentions as alternative species
mentions, the approach we presented here improves
the performance of a state-of-the-art inter-species
gene normalization tool. The overall F-score
measured on the BioCreative III GN dataset rose from
0.61 to 0.65. On a subset of the dataset, where
the writing style of the authors causes the classical
approach to achieve poorer results than on the rest
of the test set, our method increased the F-score
from 0.40 to 0.64.</p>
      <p>A subsequent error analysis indicated that more
sophisticated methods are required to resolve the
references between TE mentions and species mentions.
We plan to develop an integrated species mention
and alternative organism mention system in the near
future.</p>
    </sec>
    <sec id="sec-6">
      <title>Authors' contributions</title>
      <p>
        György Móra developed the software tools used
for mention detection, taxonomy browsing,
TE-to-species linking and the evaluation of the results. He
was responsible for the statistical analysis done in
this study. Richárd Farkas supervised the work and
participated in the writing of the manuscript. The
authors would like to thank those who maintain the
Entrez Gene and NCBI Taxonomy databases [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ], the
authors of the NACTEM’s Species Disambiguator [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]
and the authors of the LINNAEUS species name
recognizer [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ] for making these tools available.
      </p>
    </sec>
    <sec id="sec-7">
      <title>Acknowledgements</title>
      <p>This work was supported in part by the NKTH grant
(project codename TEXTREND) of the Hungarian
government.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Chen</surname>
            <given-names>L</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Liu</surname>
            <given-names>H</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Friedman</surname>
            <given-names>C</given-names>
          </string-name>
          :
          <article-title>Gene name ambiguity of eukaryotic nomenclatures</article-title>
          .
          <source>Bioinformatics</source>
          <year>2005</year>
          ,
          <volume>21</volume>
          (
          <issue>2</issue>
          ):
          <fpage>248</fpage>
          -
          <lpage>256</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Hakenberg</surname>
            <given-names>J</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Plake</surname>
            <given-names>C</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Leaman</surname>
            <given-names>R</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Schroeder</surname>
            <given-names>M</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gonzalez</surname>
            <given-names>G</given-names>
          </string-name>
          :
          <article-title>Inter-species normalization of gene mentions with GNAT</article-title>
          .
          <source>Bioinformatics</source>
          <year>2008</year>
          ,
          <volume>24</volume>
          (
          <issue>16</issue>
          ):
          <fpage>i126</fpage>
          -
          <lpage>i132</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Wang</surname>
            <given-names>X</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Tsujii</surname>
            <given-names>J</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ananiadou</surname>
            <given-names>S</given-names>
          </string-name>
          :
          <article-title>Disambiguating the species of biomedical named entities using natural language parsers</article-title>
          .
          <source>Bioinformatics</source>
          <year>2010</year>
          ,
          <volume>26</volume>
          (
          <issue>5</issue>
          ):
          <fpage>661</fpage>
          -
          <lpage>667</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>Morgan</surname>
            <given-names>A</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lu</surname>
            <given-names>Z</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Wang</surname>
            <given-names>X</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Cohen</surname>
            <given-names>A</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Fluck</surname>
            <given-names>J</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ruch</surname>
            <given-names>P</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Divoli</surname>
            <given-names>A</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Fundel</surname>
            <given-names>K</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Leaman</surname>
            <given-names>R</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hakenberg</surname>
            <given-names>J</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Sun</surname>
            <given-names>C</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Liu</surname>
            <given-names>Hh</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Torres</surname>
            <given-names>R</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Krauthammer</surname>
            <given-names>M</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lau</surname>
            <given-names>W</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Liu</surname>
            <given-names>H</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hsu</surname>
            <given-names>CN</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Schuemie</surname>
            <given-names>M</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Cohen</surname>
            <given-names>KB</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hirschman</surname>
            <given-names>L</given-names>
          </string-name>
          :
          <article-title>Overview of BioCreative II gene normalization</article-title>
          .
          <source>Genome Biology</source>
          <year>2008</year>
          ,
          <volume>9</volume>
          (
          <issue>Suppl 2</issue>
          ):
          <fpage>S3</fpage>
          , [http://genomebiology.com/2008/9/S2/S3].
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <source>BioCreative III Gene Normalization Task</source>
          [http://www.biocreative.org/tasks/biocreative-iii/gn/].
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <article-title>The NCBI handbook</article-title>
          .
          <source>Bethesda (MD): National Library of Medicine (US)</source>
          ,
          <source>National Center for Biotechnology Information</source>
          <year>2002</year>
          , [http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=Books].
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <surname>Gerner</surname>
            <given-names>M</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Nenadic</surname>
            <given-names>G</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bergman</surname>
            <given-names>CM</given-names>
          </string-name>
          :
          <article-title>LINNAEUS: a species name identification system for biomedical literature</article-title>
          .
          <source>BMC bioinformatics</source>
          <year>2010</year>
          ,
          <volume>11</volume>
          :
          <fpage>85</fpage>
          +, [http://dx.doi.org/10.1186/1471-2105-11-85].
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <surname>Kano</surname>
            <given-names>Y</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Baumgartner</surname>
            <given-names>WA</given-names>
          </string-name>
          ,
          <string-name>
            <surname>McCrohon</surname>
            <given-names>L</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ananiadou</surname>
            <given-names>S</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Cohen</surname>
            <given-names>KB</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hunter</surname>
            <given-names>L</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Tsujii</surname>
            <given-names>J</given-names>
          </string-name>
          :
          <article-title>U-Compare: share and compare text mining tools with UIMA</article-title>
          .
          <source>Bioinformatics</source>
          <year>2009</year>
          ,
          <volume>25</volume>
          (
          <issue>15</issue>
          ):
          <fpage>1997</fpage>
          -
          <lpage>1998</lpage>
          , [http://dx.doi.org/10.1093/bioinformatics/btp289].
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>