Species taxonomy for gene normalization

György Móra∗1 and Richárd Farkas∗2

1 University of Szeged, Department of Informatics, Szeged, Hungary
2 Hungarian Academy of Sciences, Research Group on Artificial Intelligence, Szeged, Hungary

Email: ∗ gymora@inf.u-szeged.hu; ∗ rfarkas@inf.u-szeged.hu
∗ Corresponding author

Abstract

Background: The task of gene normalization is to assign a unique database identifier to each gene mention. Using these identifiers, a great deal of information can be gathered from external databases, such as interactions, pathways, sequences and protein structures. Normalizing gene mentions in articles is difficult because the inter-species ambiguity of gene mentions in biomedical publications is high. The experience gained from the BioCreative II Gene Normalization Task indicates that the biggest challenge in gene normalization is recognizing the species that a specific gene mention belongs to. In biomedical scientific articles, authors often use taxonomic entities besides concrete species mentions to refer to different groups of organisms. Species taxonomies are hierarchical systems (trees) of living creatures and therefore provide a classification of species. Here we investigate the added value of utilizing taxonomic entity mentions in the inter-species gene normalization task.

Results: We present a method that marks all mentions of taxonomic entities (genus, family, etc.) and applies filtering heuristics to select those taxonomic entities that refer to species mentioned in the document. These entities are then treated as species mentions together with the standard species annotations, and we employ them in gene normalization.
Conclusion: In experiments on the BioCreative III Gene Normalization Task dataset, we investigated the contribution of the additional species mentions to the gene disambiguation task. We found that our approach improves the performance of the inter-species gene mention disambiguator in terms of both precision and recall.

Background

A vast amount of information is present in biological scientific publications. Even the complete subset of these documents on a particular scientific topic is now too large for any scientist to read. This is why search engines and information extraction systems have been developed to support life scientists in finding the information they need. The key building block of an information extraction system is a named entity recognizer, which can identify biological entities such as genes and gene products, cell lines and organism names in a text.

Besides identifying entity mentions, it is important to normalize them. Gene normalization (GN) is a process in which unique database identifiers are assigned to gene mentions that refer to a specific gene entry. Gene databases may contain information related to these genes, such as sequences, gene products, and interaction and pathway information. For instance, a system applying entity normalization can assist automatic pathway finding systems and support pharmacological investigations.

Recent studies [1, 2] indicate that the intra-species ambiguity of gene symbols is much lower than the ambiguity between species, so it is important to determine which species a gene mention belongs to. Ambiguity is increased further because some authors prefer to use synonyms instead of official gene symbols.

Current inter-species GN approaches focus on species words and employ species mention detectors to recognize them [2, 3]. Normalization systems then use machine-learned models or hand-crafted rules to determine the species associated with a particular gene mention. The NACTEM's Species Disambiguator that we applied makes use of a natural language parser to exploit the linguistic relations between the gene and species mentions [3].

Species mention detectors with suitable precision and recall are available, but these systems focus on identifying exact species mentions, such as the scientific and common names of living organisms, and not the names of groups, classes, genera or other taxonomic categories. Authors can use these taxonomic names in a general way for a group of species, or as references to a finite set of species mentioned earlier in the document. If the taxonomic name refers to an exact species, then an inter-species gene disambiguation system can exploit this information.

We used the BioCreative III full-text, document-level corpus for our evaluation because we found no suitable mention-level gold-standard dataset with inter-species gene normalization. Current trends in biomedical text mining are directed towards systems that work on full-text articles rather than just abstracts. The two other corpora [2, 3] available for inter-species GN evaluation are based on the BioCreative II Gene Normalization Task's dataset and consist of biomedical article abstracts [4].

The corpus used by Hakenberg et al. [2] contains all of the gene identifiers mentioned in the abstract, but it is annotated only at the document level. Although the corpus introduced by Wang et al. [3] is annotated at the mention level, every entity was annotated with only one gene id, whereas in cases like 'rat and human BMP4' multiple identifiers should be assigned. We compared the annotations of the two corpora, which consist of the same document set, and found significant differences in the genes annotated for a given document. We decided to use the BioCreative III corpus and not the document-level corpus used by Hakenberg et al. [2] because the latter contains only abstracts, and we think that full-text documents contain more taxonomic entity mentions because of their writing style.

We implemented a taxonomic name identification system that tags expressions mentioning taxonomic entities (TE) in biomedical scientific texts and uses heuristic rules to determine the exact species that they refer to. Our approach was evaluated extrinsically in an inter-species gene mention normalization setting. Our results show that the annotation of TEs does indeed improve the performance of a state-of-the-art GN system.

Methods

Dataset

We used the BioCreative III Shared Task Gene Normalization dataset for the evaluation [5]. The dataset consists of manually annotated full-text articles. A subset of the documents was fully annotated, and in the remaining part only the important genes were recognized. Here we used the fully annotated subset of documents for the evaluation of our approach.

Only genes and gene products from the Entrez Gene database [6] that were clearly related to a species were annotated. Genes that had no Entrez Gene identifier and gene mentions that refer to a group of genes were not annotated at all. The annotation also omits a gene mention when the species associated with the gene cannot be determined, even with domain knowledge. The Entrez Gene identifiers of the genes contained in each document were provided without the corresponding gene mentions being marked in the text.

We annotated the species names in the documents with the LINNAEUS species name identification system for biomedical literature [7], which assigns NCBI Taxonomy [6] identifiers to species mentions.
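Because the gold standard lists only the Entrez Gene identifiers belonging to each document, the output of a system can be compared with it as sets of identifiers per document. The following is a minimal sketch of such a document-level scoring. The micro-averaging over documents, the function name document_level_prf and the toy identifiers are illustrative assumptions and are not taken from the BioCreative III evaluation.

```python
from typing import Dict, Set, Tuple


def document_level_prf(
    gold: Dict[str, Set[str]],
    predicted: Dict[str, Set[str]],
) -> Tuple[float, float, float]:
    """Micro-averaged precision, recall and F-measure over documents.

    Both arguments map a document id to the set of gene identifiers
    assigned to that document (hypothetical data layout).
    """
    tp = fp = fn = 0
    for doc_id in set(gold) | set(predicted):
        gold_ids = gold.get(doc_id, set())
        pred_ids = predicted.get(doc_id, set())
        tp += len(gold_ids & pred_ids)  # identifiers found and correct
        fp += len(pred_ids - gold_ids)  # identifiers found but not in gold
        fn += len(gold_ids - pred_ids)  # gold identifiers that were missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denominator = precision + recall
    f_measure = 2 * precision * recall / denominator if denominator else 0.0
    return precision, recall, f_measure


# Toy data: the identifiers are made up for illustration only.
gold = {"doc1": {"101", "102", "103"}, "doc2": {"201"}}
predicted = {"doc1": {"101", "102", "999"}, "doc2": {"201", "202"}}
precision, recall, f_measure = document_level_prf(gold, predicted)
```

Working on sets means that repeated mentions of the same gene within a document count only once, which matches an annotation that records identifiers per document rather than per mention.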
Gene mention tagging

The gene mentions in the documents were tagged by our dictionary-based gene mention tagger, which assigned all of the possible Entrez Gene identifiers to each gene mention. The dictionary mapping is based on the NLM's string normalizing method. The normalized substrings of each sentence are matched against the normalized synonyms of the Entrez Gene names in our database. Hand-crafted rules are then applied to filter out false positive entity mentions and to eliminate overlapping annotations of the same gene mention. One-token entities are accepted only when they contain numerals or non-standard capitalization and are at least two characters long. Mentions longer than one token are accepted without restriction. Each gene mention was assigned its possible Entrez Gene identifiers together with the NCBI Taxonomy species id of each gene.

Experimental set-up

We used the NACTEM's Species Disambiguator component from the U-Compare system to provide inter-species gene normalization [3, 8]. This component assigns an NCBI Taxonomy identifier to each gene mention. The module uses the species annotations in the document to determine the species associated with the gene mentions. Two different types of analysis were carried out. One used just the species mentions tagged by LINNAEUS (baseline), and the other used an extended set of species mentions that also contained the TEs mapped to species by our system. The only difference was that the second set-up included our TE mention mapping module, so the TE mentions were mapped to species mentions before gene mention normalization. The gene mention normalization was then evaluated at the document level. A flowchart of the experimental set-up can be seen in Figure 1. In this way we investigated the added value of TEs inside a state-of-the-art gene mention normalization system, leaving the other components of the system unchanged.

Figure 1: Flowchart of the experimental set-up (the TE mention recognition and mapping subsystems are marked with red)

Recognizing alternative species mentions

The taxonomic entities (TE) were annotated using the same method as that used for gene tagging. The synonyms of the NCBI Taxonomy entries were matched against the text and taxonomy identifiers were assigned to the mentions. TE mentions referring to taxonomic groups that had no members annotated in the text were filtered out.

The references between TE and species mentions were identified using the following set of heuristic rules:

• Only species descending from the taxonomic category of the TE in the NCBI Taxonomy were regarded as candidate referents.

• If the sentence containing the TE mention also contained such a candidate species mention, then the TE was considered to refer to this species.

• If multiple species satisfied the descendant criterion, then the TE was considered to refer either to multiple species or to the general taxonomic class, and in both cases it was removed.

• If there was no species annotated in the sentence, the search was continued at the paragraph, section and document level, in that order.

• At the end, only TE mentions annotated with exactly one species were kept and used as alternative species mentions in our experiments.

Results

The NACTEM's inter-species gene normalization system tagged the gene mentions with a species identifier, but the available datasets consist of Entrez Gene identifiers assigned to the documents. To evaluate the performance of our approach, we mapped the species identifiers assigned by the normalizer to gene identifiers and evaluated the resulting set of Entrez Gene identifiers at the document level with the standard F-measure metric (see Table 1).

                 TE -     TE +     Tagged TE -   Tagged TE +
Precision        0.668    0.695    0.668         0.695
Recall           0.571    0.610    0.798         0.853
F-measure        0.616    0.650    0.727         0.766

Table 1: Performance values of the GN setting without (TE -) and with (TE +) alternative species utilized in the normalization. Results marked with "tagged" were produced by an evaluation restricted to the subset of genes in the evaluation set that were successfully mapped to the documents by the dictionary mapper.

The dictionary mapper does not provide a mapping for every gene identifier of the evaluation data set. Therefore we provide additional scores that focus on the performance of the inter-species normalization instead of the performance of the dictionary lookup. These scores were obtained by removing the false negatives that were not annotated by the dictionary lookup ("tagged" in Table 1).

When TE mentions present in the articles were utilized, the precision of the inter-species gene mention normalization rose by about 3 percentage points (from 0.668 to 0.695), and the recall rose by about 4 to 6 percentage points (from 0.571 to 0.610 on all genes and from 0.798 to 0.853 on the tagged subset).

Discussion

The performance of our approach compared to the state-of-the-art baseline method has an interesting distribution. Gene normalization with alternative species mentions outperformed the baseline system in 7 out of the 32 documents, and there were only two cases where our approach achieved a lower F-measure. In these two cases our method added only one false positive each, and hence it did not affect the overall results significantly.

There were 5 documents where no alternative species mentions were tagged by our system, so the performance of the disambiguation was the same. In the remaining 17 documents, where TE + and TE - achieved the same results, only a few TEs were recognized. A manual inspection of the document set showed that these differences were caused by the different writing styles of the authors. Some authors exclusively use concrete species names when referring to an organism, while others also use TE names to refer to species.

We evaluated the 10 documents containing a significant number of TE mentions separately, and on these the overall F-score rose from 0.40 to 0.65. The precision and recall values went up from 0.37 to 0.56 and from 0.44 to 0.77, respectively. This subset of the BioCreative III documents represents those biomedical articles where the authors often refer to organisms using broader terms instead of exact organism names.

The following examples show how TE mentions can aid gene normalization.

    "Indeed, elevated expression of Drosophila MOF, which counteracts ISWI activity . . . "

Here the exact organism name (D. melanogaster) was mentioned elsewhere in the document, so the TE (Drosophila) was successfully mapped to Drosophila melanogaster because no other species belonging to the Drosophila subgenus was found in the given context. The species identifier of the gene mention (MOF) was correctly determined by utilizing the identified alternative species mention.

Wider TE terms (like plants) were also successfully mapped to the corresponding species mentions in the text and produced correct gene normalization.

    "By studying plants with mutations in this gene, we found that CBP60g contributes to the increases . . . "

Since no plants other than Arabidopsis thaliana were mentioned in the given context, it was possible to label the TE plant as A. thaliana.

There were some documents where both procedures achieved low scores. A later analysis revealed that the LINNAEUS species detector was unable to identify species mentions in some cases where the authors used only short and ambiguous variants of the organism name, like Drosophila instead of D. melanogaster or Drosophila melanogaster. When TEs were identified in such a document but no species mentions were annotated, the TEs were filtered out. If no species was identified in a document, the NACTEM's gene disambiguator chose Homo sapiens (human) as the default organism for gene normalization.

If a TE covers a large number of organisms (like the TE animal), then false positive species associations can occur, for example when the author uses a TE as a general term rather than as a taxonomic category. In the next negative example, the word animal was linked to C. elegans by mistake, although it was used in the sense of other mammals like human rather than a worm like C. elegans. As a result, the HCF-1 protein was incorrectly identified as a gene product belonging to C. elegans instead of a human protein.

    ". . . we have undertaken a genetic analysis in C. elegans to study HCF-1-protein function in animal development. The C. elegans HCF-1-related protein is an amino acid protein encoded by the hcf-1 gene and referred to here as Ce HCF-1."

Another source of incorrect normalization is when the author refers globally to a group of organisms, but our heuristics link the TE to an exact species mention in the document. In the next example, the TE vertebrate was used only to name vertebrates in general, but it was incorrectly linked to D. simulans, the only vertebrate species identified in the document. D. simulans itself had been incorrectly identified by LINNAEUS as a rodent (Dipodomys simulans) and not as an insect (Drosophila simulans).

    "In spite of the similar global function of insect and vertebrate OBPs . . . "

Conclusions

By utilizing TE mentions as alternative species mentions, the approach we presented here improves the performance of a state-of-the-art inter-species gene normalization tool. The overall F-score measured on the BioCreative III GN dataset rose from 0.616 to 0.650. On a subset of the dataset, where the writing style of the authors causes the classical approach to achieve poorer results than on the rest of the test set, our method increased the F-score from 0.40 to 0.65.

A subsequent error analysis indicated that more sophisticated methods are required to resolve the references between TE mentions and species mentions. We plan to develop an integrated species mention and alternative organism mention system in the near future.

Authors' contributions

György Móra developed the software tools used for mention detection, taxonomy browsing, TE-to-species linking and the evaluation of the results. He was responsible for the statistical analysis done in this study. Richárd Farkas supervised the work and participated in the writing of the manuscript. The authors would like to thank those who maintain the Entrez Gene and NCBI Taxonomy databases [6], the authors of the NACTEM's Species Disambiguator [3] and the authors of the LINNAEUS species name recognizer [7] for making these tools available.

Acknowledgements

This work was supported in part by the NKTH grant (project codename TEXTREND) of the Hungarian government.

References

1. Chen L, Liu H, Friedman C: Gene name ambiguity of eukaryotic nomenclatures. Bioinformatics 2005, 21(2):248–256.
2. Hakenberg J, Plake C, Leaman R, Schroeder M, Gonzalez G: Inter-species normalization of gene mentions with GNAT. Bioinformatics 2008, 24(16):i126–i132.
3. Wang X, Tsujii J, Ananiadou S: Disambiguating the species of biomedical named entities using natural language parsers. Bioinformatics 2010, 26(5):661–667.
4. Morgan A, Lu Z, Wang X, Cohen A, Fluck J, Ruch P, Divoli A, Fundel K, Leaman R, Hakenberg J, Sun C, Liu Hh, Torres R, Krauthammer M, Lau W, Liu H, Hsu CN, Schuemie M, Cohen KB, Hirschman L: Overview of BioCreative II gene normalization. Genome Biology 2008, 9(Suppl 2):S3, [http://genomebiology.com/2008/9/S2/S3].
5. BioCreative III Gene Normalization Task [http://www.biocreative.org/tasks/biocreative-iii/gn/].
6. The NCBI handbook. Bethesda (MD): National Library of Medicine (US), National Center for Biotechnology Information 2002, [http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=Books].
7. Gerner M, Nenadic G, Bergman CM: LINNAEUS: a species name identification system for biomedical literature. BMC Bioinformatics 2010, 11:85, [http://dx.doi.org/10.1186/1471-2105-11-85].
8. Kano Y, Baumgartner WA, McCrohon L, Ananiadou S, Cohen KB, Hunter L, Tsujii J: U-Compare: share and compare text mining tools with UIMA. Bioinformatics 2009, 25(15):1997–1998, [http://dx.doi.org/10.1093/bioinformatics/btp289].
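Supplementary sketch. The heuristic rules listed under "Recognizing alternative species mentions" can be illustrated in code. The following is a minimal sketch and not the software used in the experiments. The toy taxonomy, the data layout and all names (PARENT, is_descendant, resolve_te) are hypothetical, and the taxonomy is reduced to a small child-to-parent map instead of the NCBI Taxonomy. It also assumes one reading of the rules, namely that the search widens to the next level whenever the current level contains no candidate species.

```python
from typing import Dict, List, Optional, Set

# Hypothetical toy taxonomy as a child -> parent map (not real NCBI data).
PARENT: Dict[str, str] = {
    "Drosophila melanogaster": "Drosophila",
    "Drosophila simulans": "Drosophila",
    "Drosophila": "Insecta",
    "Insecta": "Metazoa",
    "Homo sapiens": "Mammalia",
    "Mammalia": "Metazoa",
}


def is_descendant(species: str, taxon: str) -> bool:
    """Return True if `taxon` is an ancestor of `species` in PARENT."""
    node = PARENT.get(species)
    while node is not None:
        if node == taxon:
            return True
        node = PARENT.get(node)
    return False


def resolve_te(te_taxon: str, scopes: List[Set[str]]) -> Optional[str]:
    """Map a TE mention to a single species, or return None to discard it.

    `scopes` holds the species annotated in the sentence, paragraph,
    section and document of the TE mention, narrowest scope first.
    """
    for species_in_scope in scopes:
        # Only species below the TE in the taxonomy are candidates.
        candidates = {s for s in species_in_scope if is_descendant(s, te_taxon)}
        if len(candidates) == 1:
            return next(iter(candidates))  # unambiguous reference
        if len(candidates) > 1:
            return None  # several species or a general class: discard
        # No candidate in this scope: continue with the next, wider scope.
    return None


# Usage: the sentence has no species, the paragraph mentions two.
sentence: Set[str] = set()
paragraph = {"Drosophila melanogaster", "Homo sapiens"}
resolved = resolve_te("Drosophila", [sentence, paragraph])
ambiguous = resolve_te("Metazoa", [sentence, paragraph])
```

In this toy example the TE Drosophila is resolved to Drosophila melanogaster at the paragraph level, whereas the broad TE Metazoa matches both annotated species and is therefore discarded, mirroring the rule that only TE mentions linked to exactly one species are kept.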