Benchmarking infrastructure for mutation text mining Artjom Klein∗1 , Alexandre Riazanov1 , Matthew M Hindle2 and Christopher JO Baker1 1 Computational Statistics And Science Department, University of New Brunswick,Saint John, Canada 2 Synthetic and Systems Biology, Edinburgh University, Edinburgh, UK Email: Artjom Klein∗ - aklein@unb.ca; Alexandre Riazanov - alexr@unb.ca; Matthew M Hindle -matthew.hindle@ed.ac.uk; Christopher JO Baker - bakerc@unb.ca; ∗ Corresponding author Abstract Background: Research work on the automatic extraction of information about mutations from texts is greatly hindered by the lack of consensus evaluation facilities and easy-to-use infrastructure for testing and benchmarking of mutation text mining systems. Results: We propose a community-oriented annotation and benchmarking infrastructure to support development, testing, benchmarking, and comparison of mutation text mining systems. The design is based on semantic standards, where RDF is used to represent the annotations, an OWL ontology provides an extensible schema for the data and SPARQL is used to compute various performance metrics, so that in many cases programming is not needed to analyze system results. While large benchmark corpora for biological entity and relation extraction are focused mostly on gene, proteins, diseases, and species, our benchmarking infrastructure fills the gap for mutation information. The core infrastructure comprises of: 1) an ontology for modelling annotations, 2) SPARQL queries for performance metrics computation, and 3) a sizeable collection of manually curated documents, that can minimally support mutation grounding and mutation impact extraction. Conclusion: This is the first example of benchmarking infrastructure for mutation text mining. It is designed for community uptake. 1 Introduction Mutation text mining. The use of knowledge derived from text mining for mentions of mutations and their consequences is increasingly important for systems biology, genomics and genotype-phenotype studies. Mutation text mining facilitates a wide range of activities in multiple scenarios including the modelling of cell signalling pathways [1], protein structure annotation [2, 3] , the expansion of disease-mutation database annotations [4] and the development of tools predicting the impacts of mutations [5, 6]. The types of useful text mining tasks specific to mutations range from the relatively simple identification of mutation mentions [7], to very complex tasks such as linking (”grounding”) identified mutations to the corresponding genes and proteins [8], or identifying mutation impacts [9, 10] and related phenotypes [11]. Benchmarking and evaluation difficulties. Although the demand for mutation text mining software has lead to a significant growth of the experimental research in this area, the development of such systems and the publication of results is greatly hindered by the lack of adequate benchmarking facilities. In the first place, developers of mutation text mining systems need input data – texts annotated with target information – to simply test new versions of their implementations. They further need facilities to benchmark different versions of their systems in order to monitor development progress. Finally, the developers need to be able to convincingly evaluate their systems performance by comparing their results with extensive gold standard data and results of other systems. Ideally, there should be community-based consensus corpora and utilities to make such benchmarking and evaluation easy. However, such facilities currently do not exist and developers are forced to spend time and effort on creating ad hoc corpora and scripts. As a result, the time required for benchmarking in the total development work is disproportionally high. Moreover, since only relatively small corpora are affordable to many research groups, the quality of evaluation suffers too. In developing a mutation grounding system [8] showing an encouraging level of performance accuracy, 0.73, on a homogeneous corpus of 76 documents, the authors achieved only 0.13 on a heterogeneous corpus of larger size. When the system was reimplemented (see, [12]), the authors encountered another challenge – the evaluation of the new system by comparing it to the state-of-the-art was practically unaffordable, despite the existence of similar systems, due to the lack of consensus benchmarking infrastructure. The lack of adequate community-based benchmarking infrastructure is a great hindrance to progress in the area of mutation text mining. We propose to improve this situation by developing a publicly accessible infrastructure. Requirements. To guide our work, we impose the following requirements on the infrastructure to be created: 2 • To maximize its utility for system testing and evaluation, the infrastructure must include as big a gold standard corpus (a collection of annotated texts) as possible. It must also contain results of the runs of different systems to facilitate comparison of their performance. • To be useful to a larger community, the infrastructure should support multiple mutation-related text mining tasks, such as identifying mutations both on DNA and protein levels, mutation grounding to gene and proteins, identifying effects of mutations, etc. • The infrastructure must be easy to use requiring only minimal effort from system developers. Ideally, many development tasks should be facilitated so that the developers do not need to create new data formats or write additional scripts in order to leverage the infrastructure. • The infrastructure should not only be publicly available but also support a sufficiently straightforward submission of both gold standard annotations and system results by the mutation text mining community. Content overview. In this paper we report the results on the design and implementation of a community-oriented annotation and benchmarking infrastructure to support development, testing, benchmarking, and comparison of systems for mining information about mutations. The paper is outlined as follows. The Methods section starts by describing the motivation for the choice of representation format, it continues to outline a specification of ontology for modelling annotations and describes a method to calculate evaluation metrics. The Results and Discussion section presents details of seed corpora, methods for calculation of performance metrics, and utilities supporting benchmarking infrastructure. At the end of this section, we outline a testing infrastructure use-case and future work. Finally we provide a Conclusion summarizing results and highlight the availability of the benchmarking infrastructure. Methods Representing gold standard annotations and system results in RDF Typically, document annotations intended for text mining system testing and evaluation, are represented in various custom XML-based or tabular formats. XML is a standard and widely used format for corpora annotation which comes with a large number of tools. Nevertheless, the processing of complex annotations in XML – parsing, storing, querying, evaluation – is usually practically impossible with off-the-shelf XML tools [13]. Developers need to create schema-specific parsers and processing scripts and change them each time the schema is changed or extended. This was the primary reason we chose RDF over custom 3 XML-based formats, because the reusability and extensibility of data are among the design goals of RDF. We also use OWL ontologies as highly extensible data schemas. An existing example of the successful use of RDF with OWL for representing biological data is the BIOPAX [14] format for representing biological pathway data. The advantages of using the RDF/OWL bundle can be summarized as follows: • Extensibility. Since the benchmarking infrastructure is going to be used for different mutation text mining tasks and all requirements can not be foreseen, we need extensible representations. Moreover, the same data may be used for different tasks (e. g., we have reused mutation impact corpora for improving mutation grounding system [12]). The use of RDF data with classes and properties defined in OWL ontologies makes it possible to support easy integration of new corpora with annotation schemas that need not be identical, as long as they are compatible. This simply amounts to using compatible OWL ontologies and modelling patterns for RDF. Data defined modulo one ontology can be simply merged with data modulo another ontology. Moreover, additional alignments between the ontologies can be provided by the annotation providers – corpus curators or text mining system developers. • Tool availability. RDF and OWL are popular open formats and supported by a large number of open source and commercial tools. The following types of tools can be leveraged for the purpose of text mining annotation processing: – OWL reasoners can be used for data integrity checking. – RDF and OWL APIs for multiple programming languages, including Java, C++, Perl and Python, facilitate easy programmatic generation and manipulation of annotations or RDF data representing text mining results. – The SPARQL query language can be directly used for calculating system performance metrics as well as for various searches in the gold standard corpora. There is no need to implement custom querying mechanisms. – Multiple implementations of RDF databases (triplestores) are available that facilitate efficient storing and querying of large volumes of annotations. The diversity of available RDF tools enables out-of-the-box use of the annotation data in the main use scenarios, such as system testing and evaluation. 4 Core Ontologies and Modelling Ontologies The Mutation Impact Extraction Ontology (MIEO) [15] is central to our infrastructure. It currently describes classes and properties necessary to represent information about mutations at protein level, identified in texts, and extracted mutation impacts on molecular functions. For example, AminoAcidSequenceChange is the class for mutations at protein level. Instances of ProteinVariant are most specific types of protein molecules that completely identify the corresponding amino acid sequences. Instances of ProteinPropertyChange are the identified changes of protein properties that can be linked to: the properties that change, the corresponding documents and specific text fragments, and the mutations they result from. To characterize a property change, e. g., as positive, which may correspond to increased activity, we can use the subclass PositiveProteinPropertyChange. Protein properties, such as molecular functions, are also modelled as individuals whose types are currently taken from the Gene Ontology [16]. Note that some of our target mutation tasks are related to the extraction of relations between entities rather than just identifying some entities of interest. We use custom reification for such relations, in particular to facilitate linking them to documents and more specific provenance information. For example, extracted statements of mutations impacting protein properties are represented as instance of the class StatementOfMutationEffect. Note that our MIEO uses the Semanticscience Integrated Ontology (SIO) [17] as an upper ontology, and the LSRN ontology [18] to represent records and identifiers, as illustrated in the next section. Modelling example. We provide an RDF graph in pseudo-N3 as an example of how the gold standard corpus data and results of mutation impact text mining are represented in our infrastructure. Note that non-mnemonic ontological identifiers are replaced with pseudo-identifiers using the corresponding labels: e. g., sio:SIO 000011 and sio:SIO 000300 are replaced respectively with sio:’has attribute’ and sio:’has value’. # D e s c r i p t i o n of a singular amino acid s u b s t i t u t i o n N30A : : s i n g u l a r _ m u t a t i o n 1 rdf : type mieo : A m i n o A c i d S u b s t i t u t i o n . : s i n g u l a r _ m u t a t i o n 1 mieo : m u t a t i o n H a s W i l d t y p e R e s i d u e mieo : A s p a r a g i n e . : s i n g u l a r _ m u t a t i o n 1 mieo : m u t a t i o n H a s M u t a n t R e s i d u e mieo : Alanine . : s i n g u l a r _ m u t a t i o n 1 mieo : m u t a t i o n H a s P o s i t i o n : p o s i t i o n 1 . : p o s i t i o n 1 rdf : type sio : ’ position ’ . : p o s i t i o n 1 sio : ’ has value ’ "30"^^ xsd : integer . # D e s c r i p t i o n of a singular amino acid s u b s t i t u t i o n N50A : : s i n g u l a r _ m u t a t i o n 2 rdf : type mieo : A m i n o A c i d S u b s t i t u t i o n . : s i n g u l a r _ m u t a t i o n 2 mieo : m u t a t i o n H a s W i l d t y p e R e s i d u e mieo : A s p a r a g i n e . : s i n g u l a r _ m u t a t i o n 2 mieo : m u t a t i o n H a s M u t a n R e s i d u e mieo : Alanine . : s i n g u l a r _ m u t a t i o n 2 mieo : m u t a t i o n H a s P o s i t i o n : p o s i t i o n 2 . : p o s i t i o n 2 rdf : type sio : ’ position ’ . : p o s i t i o n 2 sio : ’ has value ’ "50"^^ xsd : integer . 5 # Combined mutation (" mutation series ") c o n s i s t i n g of the two singular m u t a t i o n s: : mutation rdf : type mieo : C o m b i n e d A m i n o A c i d C h a n g e . : mutation sio : ’ has member ’ : s i n g u l a r _ m u t a t i o n 1 . : mutation sio : ’ has member ’ : s i n g u l a r _ m u t a t i o n 2 . : mutation sio : ’ has attribute ’ : n u m b e r _ o f _ s i n g u l a r _ m u t a t i o n s . : n u m b e r _ o f _ s i n g u l a r _ m u t a t i o n s rdf : type sio : ’ count ’ : n u m b e r _ o f _ s i n g u l a r _ m u t a t i o n s sio : ’ has value ’ "2"^^ xsd : integer . # Mutation a p p l i c a t i o n (" g r o u n d i n g") to a specific protein : : m u t a t i o n _ a p p l i c a t i o n rdf : type mieo : P r o t e i n M u t a t i o n A p p l i c a t i o n . : m u t a t i o n _ a p p l i c a t i o n mieo : i s A p p l i c a t i o n O f M u t a t i o n : mutation . : m u t a t i o n _ a p p l i c a t i o n mieo : i s A p p l i c a t i o n O f M u t a t i o n T o P r o t e i n : protein . # D e s c r i p t i o n of the protein : : protein rdf : type mieo : P r o t e i n V a r i a n t . # it ’ s a specific variant ( uniquely i d e n t i f i e s the sequence ) : protein mieo : p r o t e i n H a s S e q u e n c e : p r o t e i n _ s e q u e n c e . : protein sio : ’ is subject of ’ : u n i p r o t _ r e c o r d . # Standard SIO way to link entities , DB records and IDs : : u n i p r o t _ r e c o r d rdf : type lsrn : U n i P r o t _ R e c o r d . : u n i p r o t _ r e c o r d sio : ’ has attribute ’ : u n i p r o t _ r e c o r d _ i d . : u n i p r o t _ r e c o r d _ i d rdf : type lsrn : U n i P r o t _ I d e n t i f i e r . : u n i p r o t _ r e c o r d _ i d sio : ’ has value ’ " P22635 " . # P r o v e n a n c e is mostly done with sio : ’ refers to ’ : : document rdf : type sio : ’ article ’ . : document sio : ’ refers to ’ : s i n g u l a r _ m u t a t i o n 1 . : document sio : ’ refers to ’ : s i n g u l a r _ m u t a t i o n 2 . : document sio : ’ refers to ’ : mutation . : document sio : ’ refers to ’ : m u t a t i o n _ a p p l i c a t i o n . : document sio : ’ refers to ’ : protein . : document sio : ’ has unique identifier ’ : d o c u m e n t _ i d e n t i f i e r . : d o c u m e n t _ i d e n t i f i e r rdf : type mieo : P u b M e d U R I . # subclass of mieo : URI : d o c u m e n t _ i d e n t i f i e r sio : ’ has value ’ " http :// www . ncbi . nlm . nih . gov / pubmed / 1 7 5 2 6 7 9 5 " ^ ^ xsd : anyURI . Note that, for simplicity, RDF data in this example are in “flat” RDF. In practice this is not convenient because we need to somehow separate the gold standard data from system results. Moreover, it is necessary to separate results coming from different systems or different experiments. We use named graphs [19] for this purpose: results from different experiments, and even gold standard data from different corpora, are placed in different named graphs. Benchmarking with SPARQL An infrastructure intended for benchmarking and evaluation must support the computation of performance metrics, such as precision and recall. Note that different flavours of these statistics are used by system developers: e. g., [20] proposes over 15 different metrics to evaluation protein mutation extraction systems. Moreover, text mining results sometimes need to be evaluated with different granularities, e. g., the mutant protein property change may be evaluated by considering binary outcomes (has effect vs no effect ) or with higher granularity when the outcome may also identify the direction of the effect – e. g., positive effect or negative effect. Our infrastructure has to be sufficiently flexible to accommodate many such uses. This is achieved by using SPARQL to retrieve entities, such as different flavours of true and false positives, that need to be counted 6 in order to calculate a particular metric. The current version of SPARQL (1.1) offers a sufficient degree of flexibility. In particular, the negation-as-failure related features – FILTER NOT EXISTS and MINUS – allow, e. g., for easy qualification of some results as false positives by checking whether they are absent from the gold standard data. Design of the seed corpora To facilitate a preliminary evaluation of our infrastructure, we seeded it with several corpora supporting at least two mutation text mining tasks: mutation grounding to proteins and extraction of mutation impacts on molecular functions of proteins. The document annotations for mutation grounding identify extracted mutations and proteins, and relations between them. The annotations for mutation impact extraction additionally identify molecular function of proteins and changes of these properties causally linked to some mutations, and provide references to supporting text fragments. Results and Discussion Contents of the corpora EnzyMiner-based corpus. One of our seed corpora is based on an extract from the EnzyMiner [21] abstract database. It was annotated manually and comprises 38 semi-randomly selected full text documents with 176 different singular mutations linked to 48 different protein sequences. The selection was adjusted to ensure maximal diversity by having documents with proteins from all enzyme families and 24 different species. The corpus currently contains 488 statements (occurrences of impact information in text), 61 molecular functions and 29 combined mutations. In what follows, we call it simply “the EnzyMiner corpus”. We annotated documents with mutation impact information which includes: • Studied protein-level mutations, in the form of singular amino acid substitutions. They are represented as triples specifying the wild type and mutant residues, and the absolute positions of the mutations on the corresponding amino acid sequences. For situations when the effects of several simultaneous amino acid substitutions are studied, we allow them to be expressed as combined mutations. 7 • Proteins to which the mutations are related, identified with UniProt IDs. The host organisms and sets of specific protein sequences can be identified via the UniProt IDs. • Protein properties specified as Gene Ontology Molecular function classes. • Mutation impacts qualified as Positive, Negative or Neutral. • Text fragments the information was extracted from. Typical fragments contain mentions of protein properties, impact directionality words, such as “increased” or “worse”, mutation mentions, protein and organism names, etc. • Documents identified with PubMed IDs. DHLA corpus. This is a small corpus comprising 13 documents with 52 unique per document mutations on Haloalkane Dehalogenases, manually annotated similarly to the EnzyMiner documents (see [2]). COSMIC-based corpus. We have an extract from the COSMIC database [22] containing 63 documents for three target genes: FGFR3, MEN1 and PIK3CA. Unlike the EnzyMiner and DHLA corpora, this corpus does not identify mutation impacts, although it links mutations to proteins and, thus, is suitable for mutation grounding benchmarking. KinMutBase-based corpus. We retrieved 201 documents annotated with singular amino acid substitutions grounded to proteins, from the KinMutBase [23] database. We additionally curated the selection by running MutationFinder, which is a reliable tool for this purpose due to its very high recall, and comparing the results with the annotations in the database. Based on this comparison, we discarded about 70 documents that appear annotated with protein-level mutations that they don’t seem to mention directly, although this may be due to the translation from SNPs made by the curators. The final size of the corpus is 128 documents. In total, we have 271 mutations linked to 26 different UniProt identifiers. Corpora statistics. The statistics for the corpora are summarized in Table 1. 8 Corpus size UniProt IDs Mutations (unique per document) EnzyMiner 38 49 176 KinMutBase 128 26 271 DHLA 13 4 52 PIK3CA 30 1 169 FGFR3 26 1 174 MEN1 7 1 22 Table 1: Corpus Statistics. RDF database The RDF files representing our corpora are already relatively large, so for the purposes of efficient SPARQL querying we deploy the data to a Sesame triplestore. Users have the option of downloading the RDF data and using their own querying machinery, or accessing our DB via a public SPARQL endpoint. The details can be found in [24]. SPARQL queries for performance metrics To test the idea of using SPARQL for performance metrics computation, we have formulated several SPARQL queries sufficient for computing precision and recall for systems implementing two text mining tasks: mutation grounding to proteins and the extraction of impacts of mutations on protein properties. For each task we wrote (1) a SPARQL query that selects relevant annotations in the gold standard data, representing correct cases, (2) a query that selects all relevant results of the text mining system being evaluated, and (3) a query that selects only correct results. These selections are enough to calculate precision and recall. We illustrate this by presenting a slightly simplified version of the query used to select the correct results from mutation-impact extraction, which can be used for evaluation according to the metric definitions from, e. g., [2, 10]. According to the definitions, a result is a set – document, protein, mutation, protein property changed by the mutation, and a direction of the property change. If the gold standard data contain the same set, the result is considered correct. Technically we have to compare two RDF graphs and get the corresponding intersection. Note that the query assumes that the gold standard data is kept in the named graph http://example.com/gold-standard.rdf and the system results come from another named graph http://example.com/experiment.rdf. Note that as in modelling example for readability we replace non-mnemonic SIO identifiers with their labels. 9 1 PREFIX sio : < http :// s e m a n t i c s c i e n c e. org / resource / > 2 PREFIX lsrn : < http :// purl . oclc . org / SADI / LSRN / > 3 SELECT DISTINCT ? p u b m e d _ i d ? w t _ r e s i d u e ? p o s i t i o n _ v a l u e ? m u t _ r e s i d u e ? u n i p r o t _ r e c o r d _ i d ? property_change_class ? protein_property_class 4 WHERE { 5 GRAPH < http :// example . com / gold - standard . rdf > { 6 ? document a sio : ’ article ’ . 7 ? document sio : ’ is subject of ’ ? p u b m e d _ r e c o r d . 8 ? p u b m e d _ r e c o r d sio : ’ has attribute ’ ? p u b m e d _ i d e n t i f i e r . 9 ? p u b m e d _ i d e n t i f i e r sio : ’ has value ’ ? p u b m e d _ i d . 10 ? document sio : ’ refers to ’ ? m u t a t i o n _ a p p l i c a t i o n . 11 ? m u t a t i o n _ a p p l i c a t i o n a mieo : P r o t e i n M u t a t i o n A p p l i c a t i o n . 12 ? m u t a t i o n _ a p p l i c a t i o n mieo : i s A p p l i c a t i o n O f M u t a t i o n ? mutation . 13 ? m u t a t i o n _ a p p l i c a t i o n mieo : i s A p p l i c a t i o n O f M u t a t i o n T o P r o t e i n ? protein . 14 ? document sio : ’ refers to ’ ? mutation . 15 ? mutation a mieo : C o m b i n e d A m i n o A c i d S e q u e n c e C h a n g e . 16 ? mutation sio : ’ has member ’ ? s i n g u l a r _ m u t a t i o n . 17 ? s i n g u l a r _ m u t a t i o n mieo : m u t a t i o n H a s W i l d t y p e R e s i d u e ? w t _ r e s i d u e . 18 ? s i n g u l a r _ m u t a t i o n mieo : m u t a t i o n H a s M u t a n t R e s i d u e ? m u t _ r e s i d u e . 19 ? s i n g u l a r _ m u t a t i o n mieo : m u t a t i o n H a s P o s i t i o n ? position . 20 ? position a sio : ’ position ’ . 21 ? position sio : ’ has value ’ ? p o s i t i o n _ v a l u e . 22 ? document sio : ’ refers to ’ ? protein . 23 ? protein sio : ’ is subject of ’ ? u n i p r o t _ r e c o r d . 24 ? u n i p r o t _ r e c o r d a lsrn : U n i P r o t _ R e c o r d . 25 ? u n i p r o t _ r e c o r d sio : ’ has attribute ’ ? u n i p r o t _ r e c o r d _ i d e n t i f i e r . 26 ? u n i p r o t _ r e c o r d _ i d e n t i f i e r a lsrn : U n i P r o t _ I d e n t i f i e r . 27 ? u n i p r o t _ r e c o r d _ i d e n t i f i e r sio : ’ has value ’ ? u n i p r o t _ r e c o r d _ i d . 28 ? document sio : ’ refers to ’ ? p r o p e r t y _ c h a n g e . 29 ? m u t a t i o n _ a p p l i c a t i o n mieo : m u t a t i o n A p p l i c a t i o n C a u s e s C h a n g e ? p r o p e r t y _ c h a n g e . 30 ? property_change a ? property_change_class . 31 ? document sio : ’ refers to ’ ? p r o t e i n _ p r o p e r t y . 32 ? p r o p e r t y _ c h a n g e mieo : p r o p e r t y C h a n g e A p p l i e s T o ? p r o t e i n _ p r o p e r t y . 33 ? p r o t e i n _ p r o p e r t y sio : ’ is property of ’ ? protein . 34 ? protein_property a ? protein_property_class . 35 } 36 GRAPH < http :// example . com / e x p e r i m e n t. rdf > { 37 ? d o c u m e n t 2 a sio : ’ article ’ . 38 ? d o c u m e n t 2 sio : ’ is subject of ’ ? p u b m e d _ r e c o r d 2 . 39 ? p u b m e d _ r e c o r d 2 sio : ’ has attribute ’ ? p u b m e d _ i d e n t i f i e r 2 . 40 ? p u b m e d _ i d e n t i f i e r 2 sio : ’ has value ’ ? p u b m e d _ i d . 41 ? d o c u m e n t 2 sio : ’ refers to ’ ? m u t a t i o n _ a p p l i c a t i o n 2 . 42 ? m u t a t i o n _ a p p l i c a t i o n 2 a mieo : P r o t e i n M u t a t i o n A p p l i c a t i o n . 43 ? m u t a t i o n _ a p p l i c a t i o n 2 mieo : i s A p p l i c a t i o n O f M u t a t i o n ? m u t a t i o n 2 . 44 ? m u t a t i o n _ a p p l i c a t i o n 2 mieo : i s A p p l i c a t i o n O f M u t a t i o n T o P r o t e i n ? protein2 . 45 ? d o c u m e n t 2 sio : ’ refers to ’ ? m u t a t i o n 2 . 46 ? m u t a t i o n 2 a mieo : C o m b i n e d A m i n o A c i d S e q u e n c e C h a n g e . 47 ? m u t a t i o n 2 sio : ’ has member ’ ? s i n g u l a r _ m u t a t i o n 2 . 48 ? s i n g u l a r _ m u t a t i o n 2 mieo : m u t a t i o n H a s W i l d t y p e R e s i d u e ? w t _ r e s i d u e . 49 ? s i n g u l a r _ m u t a t i o n 2 mieo : m u t a t i o n H a s M u t a n t R e s i d u e ? m u t _ r e s i d u e . 50 ? s i n g u l a r _ m u t a t i o n 2 mieo : m u t a t i o n H a s P o s i t i o n ? p o s i t i o n 2 . 51 ? p o s i t i o n 2 a sio : ’ position ’ . 52 ? p o s i t i o n 2 sio : ’ has value ’ ? p o s i t i o n _ v a l u e . 53 ? d o c u m e n t 2 sio : ’ refers to ’ ? protein2 . 54 ? protein2 sio : ’ is subject of ’ ? u n i p r o t _ r e c o r d 2 . 55 ? u n i p r o t _ r e c o r d 2 a lsrn : U n i P r o t _ R e c o r d . 56 ? u n i p r o t _ r e c o r d 2 sio : ’ has attribute ’ ? u n i p r o t _ r e c o r d _ i d e n t i f i e r 2 . 57 ? u n i p r o t _ r e c o r d _ i d e n t i f i e r 2 a lsrn : U n i P r o t _ I d e n t i f i e r . 58 ? u n i p r o t _ r e c o r d _ i d e n t i f i e r 2 sio : ’ has value ’ ? u n i p r o t _ r e c o r d _ i d . 59 ? d o c u m e n t 2 sio : ’ refers to ’ ? p r o p e r t y _ c h a n g e 2 . 60 ? m u t a t i o n _ a p p l i c a t i o n 2 mieo : m u t a t i o n A p p l i c a t i o n C a u s e s C h a n g e ? p r o p e r t y _ c h a n g e 2 . 61 ? property_change2 a ? property_change_class . 62 ? d o c u m e n t 2 sio : ’ refers to ’ ? p r o t e i n _ p r o p e r t y 2 . 63 ? p r o p e r t y _ c h a n g e 2 mieo : p r o p e r t y C h a n g e A p p l i e s T o ? p r o t e i n _ p r o p e r t y 2 . 64 ? p r o t e i n _ p r o p e r t y 2 sio : ’ is property of ’ ? protein2 . 65 ? protein_property2 a ? protein_property_class . 66 }} 10 We comment briefly on the query composition, the two halves of the query (lines 5-35 and 36-66) correspond to the selection of relevant data from the gold standard corpora and from the experimental system results. Since our goal is to select only correct results, the two selections are joined on the instances of the variables ?pubmed id (identifying documents), ?wt residue, ?mut residue and ?position value (for the wildtype and mutant residues, and positions of the corresponding mutations), ?uniprot record id (identifying proteins), ?protein property class (identifying studied properties) and ?property change class (identifying the direction of the property change). Note that the query can only be used to implement micro averaging that treats the whole corpus as one large document. If, for some reason, we were interested in macro averaging we would have to additionally group the results by the PubMed ID values. Utilities As a part of our infrastructure, we created a small set of simple utilities, which facilitate data access: • The evaluator utility calculates standard performance metrics by executing some user-provided SPARQL queries, counting the results and making necessary calculations. The user can supply the queries in a simple configuration file. • The Sesame loader and query client are simple command line applications that allow loading RDF graphs into a Sesame triplestore and executing queries from files. • The provenance enhancement utility helps in situations when the sources of annotation data only provide fragments of texts as provenance, without specifying their positions in the text. The utility simply searches the document texts for the corresponding fragments in order to provide more precise provenance information. Testing the infrastructure For concept validation, we have used our infrastructure for testing and iterative performance evaluation during a project dedicated to the development of a robust mutation impact extraction system [10], and for the evaluation of the mutation grounding subtask, intended for publication (see [12]). The purpose of the system is to identify protein-level mutations, ground them to the corresponding UniProt IDs and, most importantly, to extract the information about what properties of the proteins are affected and how, if this is described in the processed document. 11 Since early versions of the system already produced output in RDF modelled according to an ontology similar to the MIEO, it was trivial to adjust the system to produce output in a format compatible with our infrastructure. This was the major prerequisite, which was required to enable the evaluation of the system on our gold standard corpora and the subsequent comparison of results from different versions of the mutation grounding system. Although the system previously showed reasonable performance on the 76 documents, the performance on the larger and more representative data set comprising the Enzyminer and KinMutBase corpora, was very low. After an investigation in which we relied heavily on the analysis of system runs based on our annotations, including the provenance information, we have identified the mutation grounding module as a major performance bottleneck having only 0.32 precision and 0.08 recall. We focused our attention on the mutation grounding subtask, in which our infrastructure was very instrumental as the task is also supported by existing gold standard annotations, and eventually improved the performance to 0.83 precision and 0.82 recall. More details on this effort can be found in [12]. Future work In current work we are defining a procedure for the submission of third-party human-curated annotations and system results. In the future work we will further stress-test the infrastructure with text mining tasks other than mutation grounding and mutation impact extraction, and a third-party mutation text mining system. We plan to extend the ontology based on the new requirements identified through community involvement and our own research. In the near future, we plan to extend the infrastructure to include protein properties other than molecular functions, such as enzyme kinetics, and DNA-level mutations. The ontologies we are using provide only very basic means for attaching provenance information to identified entities and relations by simply linking them to the documents they were mined from. We are planning to use one of the existing ontologies for modelling sentence level provenance (Annotation Ontology [25], CALBC semantic annotation schema [26], or NLP Interchange Format [27]) to provide more precise pointers to text fragments supporting annotations. Conclusions We report preliminary results on the development of a community-oriented benchmarking infrastructure intended to relieve the developers of mutation text mining software from the burden of developing ad hoc 12 corpora and scripts for testing, benchmarking and evaluation of multiple mutation-related text mining tasks. While large benchmark corpora for biological entity and relation extraction (such as CALBC [28], BioCreative [29], GENIA [30], etc.) are focused mostly on gene, proteins, diseases, and species, our benchmarking infrastructure fills the gap for mutation information. We have seeded the infrastructure with a sizeable gold standard corpus (242 documents). To maximize the reusability and extensibility of our infrastructure, we use RDF and OWL for annotation data representation and SPARQL queries as a means of flexible analysis of text mining results. The infrastructure was tested for benchmarking and evaluation of a mutation impact extraction system. We have done this work with the goal of initiating a community effort, and the future evolution of the benchmarking infrastructure will be based on feedback and contributions from the community. Availability The benchmark corpora, ontology for modelling annotations, example output of our mutation text mining system, benchmarking SPARQL query templates, and infrastructure support tools are available on the project page [24]. Acknowledgement This research was funded in part by the New Brunswick Innovation Foundation, New Brunswick, Canada; the NSERC, Discovery Grant Program, Canada and the Quebec-New Brunswick University Co-operation in Advanced Education - Research Program, Government of New Brunswick, Canada. References 1. Bauer-Mehren A, Furlong LI, Rautschka M, Sanz F: From SNPs to pathways: integration of functional effect of sequence variations on models of cell signalling pathways. BMC Bioinformatics 2009, 10(Suppl 8):S6. 2. Baker CJO, Witte R: Mutation Mining—A Prospector’s Tale. Information Systems Frontiers (ISF) 2006, 8:47–57. 3. Kanagasabai R, Choo KH, Ranganathan S, Baker CJO: A workflow for mutation extraction and structure annotation. J Bioinform Comput Biol 2007, 5(6):1319–37. 4. Doughty E, Kertesz-Farkas A, Bodenreider O, Thompson G, Adadey A, Peterson T, Kann MG: Toward an automatic method for extracting cancer- and other disease-related point mutations from the biomedical literature. Bioinformatics 2011, 27(3):408–15. 5. Bromberg Y, Overton J, Vaisse C, Leibel RL, Rost B: In silico mutagenesis: a case study of the melanocortin 4 receptor. FASEB J 2009, 23(9):3059–69. 6. Winnenburg R, Plake C, Schroeder M: Improved mutation tagging with gene identifiers applied to membrane protein stability prediction. BMC Bioinformatics 2009, 10(Suppl 8):S3. 13 7. Caporaso JG, Jr WAB, Randolph DA, Cohen KB, Hunter L: MutationFinder: a high-performance system for extracting point mutation mentions from text. Bioinformatics 2007, 23(14):1862–1865. 8. Laurila JB, Kanagasabai R, Baker CJO: Algorithm for grounding mutation mentions from text to protein sequences. In Proceedings of the 7th international conference on Data integration in the life sciences, DILS’10, Berlin, Heidelberg: Springer-Verlag 2010:122–131. 9. Naderi N, Witte R: Automated extraction and semantic analysis of mutation impacts from the biomedical literature. BMC Genomics 2012, 13(Suppl 4):S10. 10. Laurila JB, Naderi N, Witte R, Riazanov A, Kouznetsov A, Baker CJO: Algorithms and semantic infrastructure for mutation impact extraction and grounding. BMC Genomics 2010, 11(Suppl 4):S24. 11. Rebholz-Schuhmann D, Marcel S, Albert S, Tolle R, Casari G, Kirsch H: Automatic extraction of mutations from Medline and cross-validation with OMIM. Nucleic Acids Res 2004, 32:135–42. 12. Klein A, Riazanov A, Al-Rababah K, Baker CJ: Towards a next generation protein mutation grounding system for full texts. Accepted at SMBM2012. 13. Eckart R: Choosing an XML database for linguistically annotated corpora. Sprache und Datenverarbeitung 2008, 32:7–22. 14. Demir E, et al: The BioPAX community standard for pathway data sharing. Nature Biotechnology 2010, 28(9):935–942. 15. The Mutation Impact Extraction Ontology (MIEO). http:// unbsj.biordf.net/ ontologies/ mutation-impact-extraction-ontology.owl . 16. Ashburner M, Ball CA, Blake JA, Botstein D, Butler H, Cherry JM, Davis AP, Dolinski K, Dwight SS, Eppig JT, Harris MA, Hill DP, Issel-Tarver L, Kasarskis A, Lewis S, Matese JC, Richardson JE, Ringwald M, Rubin GM, Sherlock G: Gene ontology: tool for the unification of biology. The Gene Ontology Consortium. Nat Genet 2000, 25:25–9. 17. The Semanticscience Integrated Ontology (SIO). http:// semanticscience.org/ ontology/ sio.owl . 18. Life Science Record Name (LSRN). http:// lsrn.org. 19. Carroll JJ, Bizer C, Hayes P, Stickler P: Named graphs, provenance and trust. In Proceedings of the 14th international conference on World Wide Web, WWW ’05, New York, NY, USA: ACM 2005:613–622. 20. Witte R, Baker CJO: Towards a systematic evaluation of protein mutation extraction systems. J Bioinform Comput Biol 2007, 5(6):1339–59. 21. Yeniterzi S, Sezerman U: EnzyMiner: automatic identification of protein level mutations and their impact on target enzymes from PubMed abstracts. BMC Bioinformatics 2009, 10(Suppl 8):S2. 22. Forbes SA, Tang G, Bindal N, Bamford S, Dawson E, Cole C, Kok CY, Jia M, Ewing R, Menzies A, Teague JW, Stratton MR, Futreal PA: COSMIC (the Catalogue of Somatic Mutations in Cancer): a resource to investigate acquired mutations in human cancer. Nucleic Acids Res 2010, 38(Database issue):D652–7. 23. Ortutay C, Väliaho J, Stenberg K, Vihinen M: KinMutBase: a registry of disease-causing mutations in protein kinase domains. Hum Mutat 2005, 25(5):435–42. 24. Mutation text mining benchmarking infrastructure. http:// code.google.com/ p/ mutation-text-mining. 25. Ciccarese P, Ocana M, Garcia Castro LJ, Das S, Clark T: An open annotation ontology for science on web 3.0. J Biomed Semantics 2011, 2:S4. 26. Croset S, Grabmüller C, Li C, Kavaliauskas S, Rebholz-Schuhmann D: The CALBC RDF Triple Store: Retrieval over Large Literature Content. In SWAT4LS’10 2010:–1–1. 27. Hellmann S, Lehmann J, Auer S: NIF: An ontology-based and linked-data-aware NLP Interchange Format[http://svn.aksw.org/papers/2012/WWW NIF/public.pdf]. 28. Rebholz-Schuhmann D, Yepes AJJ, Van Mulligen EM, Kang N, Kors J, Milward D, Corbett P, Buyko E, Beisswanger E, Hahn U: CALBC silver standard corpus. J Bioinform Comput Biol 2010, 8:163–79. 29. The BioCreAtIvE challenge evaluation. http:// biocreative.sourceforge.net. 30. Kim JD, Ohta T, Tateisi Y, Tsujii J: GENIA corpus–semantically annotated corpus for bio-textmining. Bioinformatics 2003, 19(Suppl 1):i180–2. 14