<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Benchmarking infrastructure for mutation text mining</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Artjom Klein</string-name>
          <email>aklein@unb.ca</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Alexandre Riazanov</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Matthew M Hindle</string-name>
          <email>matthew.hindle@ed.ac.uk</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Christopher JO Baker</string-name>
          <email>bakerc@unb.ca</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Computational Statistics And Science Department, University of New Brunswick</institution>
          ,
          <addr-line>Saint John</addr-line>
          ,
          <country country="CA">Canada</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Synthetic and Systems Biology, Edinburgh University</institution>
          ,
          <addr-line>Edinburgh</addr-line>
          ,
          <country country="UK">UK</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>Background: Research work on the automatic extraction of information about mutations from texts is greatly hindered by the lack of consensus evaluation facilities and easy-to-use infrastructure for testing and benchmarking of mutation text mining systems. Results: We propose a community-oriented annotation and benchmarking infrastructure to support development, testing, benchmarking, and comparison of mutation text mining systems. The design is based on semantic standards, where RDF is used to represent the annotations, an OWL ontology provides an extensible schema for the data and SPARQL is used to compute various performance metrics, so that in many cases programming is not needed to analyze system results. While large benchmark corpora for biological entity and relation extraction are focused mostly on genes, proteins, diseases, and species, our benchmarking infrastructure fills the gap for mutation information. The core infrastructure comprises: 1) an ontology for modelling annotations, 2) SPARQL queries for performance metrics computation, and 3) a sizeable collection of manually curated documents that can minimally support mutation grounding and mutation impact extraction. Conclusion: This is the first example of benchmarking infrastructure for mutation text mining. It is designed for community uptake.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>
        Mutation text mining. The use of knowledge derived from text mining for mentions of mutations and
their consequences is increasingly important for systems biology, genomics and genotype-phenotype studies.
Mutation text mining facilitates a wide range of activities in multiple scenarios including the modelling of
cell signalling pathways [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ], protein structure annotation [
        <xref ref-type="bibr" rid="ref2 ref3">2, 3</xref>
        ], the expansion of disease-mutation database
annotations [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] and the development of tools predicting the impacts of mutations [
        <xref ref-type="bibr" rid="ref5 ref6">5, 6</xref>
        ]. The types of useful
text mining tasks specific to mutations range from the relatively simple identification of mutation
mentions [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ], to very complex tasks such as linking (“grounding”) identified mutations to the corresponding
genes and proteins [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ], or identifying mutation impacts [
        <xref ref-type="bibr" rid="ref10 ref9">9, 10</xref>
        ] and related phenotypes [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ].
      </p>
      <p>
        Benchmarking and evaluation difficulties. Although the demand for mutation text mining software
has led to significant growth of experimental research in this area, the development of such systems
and the publication of results are greatly hindered by the lack of adequate benchmarking facilities. First,
developers of mutation text mining systems need input data – texts annotated with target
information – simply to test new versions of their implementations. They further need facilities to
benchmark different versions of their systems in order to monitor development progress. Finally,
developers need to be able to convincingly evaluate their systems' performance by comparing their results
with extensive gold standard data and with the results of other systems.
      </p>
      <p>
        Ideally, there should be community-based consensus corpora and utilities to make such benchmarking and
evaluation easy. However, such facilities currently do not exist and developers are forced to spend time and
effort on creating ad hoc corpora and scripts. As a result, benchmarking takes up a disproportionately
large share of the total development effort. Moreover, since only relatively small corpora are affordable to
many research groups, the quality of evaluation suffers too. For example, a mutation grounding system [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ]
that showed an encouraging accuracy of 0.73 on a homogeneous corpus of 76 documents
achieved only 0.13 on a larger heterogeneous corpus. When the system was
reimplemented (see [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ]), the authors encountered another challenge – the evaluation of the new system by
comparing it to the state-of-the-art was practically unaffordable, despite the existence of similar systems,
due to the lack of consensus benchmarking infrastructure. The lack of adequate community-based
benchmarking infrastructure is a great hindrance to progress in the area of mutation text mining. We
propose to improve this situation by developing a publicly accessible infrastructure.
      </p>
      <p>Requirements. To guide our work, we impose the following requirements on the infrastructure to be
created:
        <list list-type="bullet">
          <list-item><p>To maximize its utility for system testing and evaluation, the infrastructure must include as large a
gold standard corpus (a collection of annotated texts) as possible. It must also contain results of the
runs of different systems to facilitate comparison of their performance.</p></list-item>
          <list-item><p>To be useful to a larger community, the infrastructure should support multiple mutation-related text
mining tasks, such as identifying mutations at both the DNA and protein levels, grounding mutations to
genes and proteins, identifying effects of mutations, etc.</p></list-item>
          <list-item><p>The infrastructure must be easy to use, requiring only minimal effort from system developers. Ideally,
many development tasks should be facilitated so that the developers do not need to create new data
formats or write additional scripts in order to leverage the infrastructure.</p></list-item>
          <list-item><p>The infrastructure should not only be publicly available but also support sufficiently
straightforward submission of both gold standard annotations and system results by the mutation
text mining community.</p></list-item>
        </list>
      </p>
      <p>Content overview. In this paper we report on the design and implementation of a
community-oriented annotation and benchmarking infrastructure to support development, testing,
benchmarking, and comparison of systems for mining information about mutations. The paper is organized
as follows. The Methods section starts by describing the motivation for the choice of representation format,
then outlines a specification of the ontology for modelling annotations and describes a method to
calculate evaluation metrics. The Results and Discussion section presents details of the seed corpora, methods
for calculation of performance metrics, and utilities supporting the benchmarking infrastructure. At the end of
this section, we outline a use case testing the infrastructure and future work. Finally, we provide a Conclusion
summarizing the results and highlighting the availability of the benchmarking infrastructure.</p>
    </sec>
    <sec id="sec-2">
      <title>Methods</title>
      <sec id="sec-2-1">
        <title>Representing gold standard annotations and system results in RDF</title>
        <p>
          Typically, document annotations intended for text mining system testing and evaluation are represented in
various custom XML-based or tabular formats. XML is a standard and widely used format for corpus
annotation that comes with a large number of tools. Nevertheless, the processing of complex annotations
in XML – parsing, storing, querying, evaluation – is usually practically impossible with off-the-shelf XML
tools [
          <xref ref-type="bibr" rid="ref13">13</xref>
          ]. Developers need to create schema-specific parsers and processing scripts and change them each
time the schema is changed or extended. This was the primary reason we chose RDF over custom
XML-based formats, because the reusability and extensibility of data are among the design goals of RDF.
We also use OWL ontologies as highly extensible data schemas. An existing example of the successful use
of RDF with OWL for representing biological data is the BioPAX [
          <xref ref-type="bibr" rid="ref14">14</xref>
          ] format for representing biological
pathway data.
        </p>
        <p>
          The advantages of using the RDF/OWL bundle can be summarized as follows:
          <list list-type="bullet">
            <list-item>
              <p>Extensibility. Since the benchmarking infrastructure is going to be used for different mutation text
mining tasks and not all requirements can be foreseen, we need extensible representations. Moreover,
the same data may be used for different tasks (e.g., we have reused mutation impact corpora to
improve a mutation grounding system [
          <xref ref-type="bibr" rid="ref12">12</xref>
          ]).</p>
              <p>The use of RDF data with classes and properties defined in OWL ontologies makes it possible to
support easy integration of new corpora with annotation schemas that need not be identical, as long
as they are compatible. This simply amounts to using compatible OWL ontologies and modelling
patterns for RDF. Data defined modulo one ontology can be simply merged with data modulo
another ontology. Moreover, additional alignments between the ontologies can be provided by the
annotation providers – corpus curators or text mining system developers.</p>
            </list-item>
            <list-item>
              <p>Tool availability. RDF and OWL are popular open formats supported by a large number of
open source and commercial tools. The following types of tools can be leveraged for the purpose of
text mining annotation processing:
                <list list-type="bullet">
                  <list-item><p>OWL reasoners can be used for data integrity checking.</p></list-item>
                  <list-item><p>RDF and OWL APIs for multiple programming languages, including Java, C++, Perl and
Python, facilitate easy programmatic generation and manipulation of annotations or RDF data
representing text mining results.</p></list-item>
                  <list-item><p>The SPARQL query language can be directly used for calculating system performance metrics
as well as for various searches in the gold standard corpora. There is no need to implement
custom querying mechanisms.</p></list-item>
                  <list-item><p>Multiple implementations of RDF databases (triplestores) are available that facilitate efficient
storing and querying of large volumes of annotations.</p></list-item>
                </list>
              </p>
            </list-item>
          </list>
        </p>
        <p>The diversity of available RDF tools enables out-of-the-box use of the annotation data in the main
use scenarios, such as system testing and evaluation.</p>
      </sec>
      <sec id="sec-2-2">
        <title>Core Ontologies and Modelling</title>
        <sec id="sec-2-2-1">
          <title>Ontologies</title>
          <p>
            The Mutation Impact Extraction Ontology (MIEO) [
            <xref ref-type="bibr" rid="ref15">15</xref>
            ] is central to our infrastructure. It currently
describes the classes and properties necessary to represent information about mutations at the protein level,
identified in texts, and extracted mutation impacts on molecular functions. For example,
AminoAcidSequenceChange is the class for mutations at the protein level. Instances of ProteinVariant are
the most specific types of protein molecules, which completely identify the corresponding amino acid sequences.
Instances of ProteinPropertyChange are the identified changes of protein properties that can be linked to
the properties that change, the corresponding documents and specific text fragments, and the mutations
they result from. To characterize a property change, e.g., as positive, which may correspond to increased
activity, we can use the subclass PositiveProteinPropertyChange. Protein properties, such as molecular
functions, are also modelled as individuals whose types are currently taken from the Gene Ontology [
            <xref ref-type="bibr" rid="ref16">16</xref>
            ].
Note that some of our target mutation tasks are related to the extraction of relations between entities
rather than just identifying some entities of interest. We use custom reification for such relations, in
particular to facilitate linking them to documents and more specific provenance information. For example,
extracted statements of mutations impacting protein properties are represented as instances of the class
StatementOfMutationEffect.
          </p>
          <p>
            Note that our MIEO uses the Semanticscience Integrated Ontology (SIO) [
            <xref ref-type="bibr" rid="ref17">17</xref>
            ] as an upper ontology, and
the LSRN ontology [
            <xref ref-type="bibr" rid="ref18">18</xref>
            ] to represent records and identifiers, as illustrated in the next section.
          </p>
        </sec>
        <sec id="sec-2-2-2">
          <title>Modelling example.</title>
          <p>
            We provide an RDF graph in pseudo-N3 as an example of how the gold standard corpus data and results of
mutation impact text mining are represented in our infrastructure. Note that non-mnemonic ontological
identifiers are replaced with pseudo-identifiers using the corresponding labels: e.g., sio:SIO_000011 and
sio:SIO_000300 are replaced respectively with sio:'has attribute' and sio:'has value'.
            <preformat>
# Description of a singular amino acid substitution N30A:
:singular_mutation1 rdf:type mieo:AminoAcidSubstitution .
:singular_mutation1 mieo:mutationHasWildtypeResidue mieo:Asparagine .
:singular_mutation1 mieo:mutationHasMutantResidue mieo:Alanine .
:singular_mutation1 mieo:mutationHasPosition :position1 .
:position1 rdf:type sio:'position' .
:position1 sio:'has value' "30"^^xsd:integer .

# Description of a singular amino acid substitution N50A:
:singular_mutation2 rdf:type mieo:AminoAcidSubstitution .
:singular_mutation2 mieo:mutationHasWildtypeResidue mieo:Asparagine .
:singular_mutation2 mieo:mutationHasMutantResidue mieo:Alanine .
:singular_mutation2 mieo:mutationHasPosition :position2 .
:position2 rdf:type sio:'position' .
:position2 sio:'has value' "50"^^xsd:integer .

# Combined mutation ("mutation series") consisting of the two singular mutations:
:mutation rdf:type mieo:CombinedAminoAcidChange .
:mutation sio:'has member' :singular_mutation1 .
:mutation sio:'has member' :singular_mutation2 .
:mutation sio:'has attribute' :number_of_singular_mutations .
:number_of_singular_mutations rdf:type sio:'count' .
:number_of_singular_mutations sio:'has value' "2"^^xsd:integer .

# Mutation application ("grounding") to a specific protein:
:mutation_application rdf:type mieo:ProteinMutationApplication .
:mutation_application mieo:isApplicationOfMutation :mutation .
:mutation_application mieo:isApplicationOfMutationToProtein :protein .

# Description of the protein:
:protein rdf:type mieo:ProteinVariant .  # a specific variant (uniquely identifies the sequence)
:protein mieo:proteinHasSequence :protein_sequence .
:protein sio:'is subject of' :uniprot_record .

# Standard SIO way to link entities, DB records and IDs:
:uniprot_record rdf:type lsrn:UniProt_Record .
:uniprot_record sio:'has attribute' :uniprot_record_id .
:uniprot_record_id rdf:type lsrn:UniProt_Identifier .
:uniprot_record_id sio:'has value' "P22635" .

# Provenance is mostly done with sio:'refers to':
:document rdf:type sio:'article' .
:document sio:'refers to' :singular_mutation1 .
:document sio:'refers to' :singular_mutation2 .
:document sio:'refers to' :mutation .
:document sio:'refers to' :mutation_application .
:document sio:'refers to' :protein .
:document sio:'has unique identifier' :document_identifier .
:document_identifier rdf:type mieo:PubMedURI .  # subclass of mieo:URI
:document_identifier sio:'has value' "http://www.ncbi.nlm.nih.gov/pubmed/17526795"^^xsd:anyURI .
            </preformat>
Note that, for simplicity, RDF data in this example are in “flat” RDF. In practice this is not convenient
because we need to somehow separate the gold standard data from system results. Moreover, it is
necessary to separate results coming from different systems or different experiments. We use named
graphs [
            <xref ref-type="bibr" rid="ref19">19</xref>
            ] for this purpose: results from different experiments, and even gold standard data from different
corpora, are placed in different named graphs.
          </p>
        </sec>
      </sec>
      <sec id="sec-2-3">
        <title>Benchmarking with SPARQL</title>
        <p>
          An infrastructure intended for benchmarking and evaluation must support the computation of performance
metrics, such as precision and recall. Note that different flavours of these statistics are used by system
developers: e. g., [
          <xref ref-type="bibr" rid="ref20">20</xref>
          ] proposes over 15 different metrics to evaluate protein mutation extraction systems.
Moreover, text mining results sometimes need to be evaluated at different granularities, e.g., the mutant
protein property change may be evaluated by considering binary outcomes (has effect vs. no effect) or at a
higher granularity, where the outcome may also identify the direction of the effect, e.g., positive effect or
negative effect.
        </p>
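        <p>The difference between the two granularities can be illustrated with a minimal Python sketch. The annotation pairs below are hypothetical, and the mapping of Neutral to “no effect” is our assumption for illustration only; it is not prescribed by the infrastructure.</p>
        <preformat>
```python
# Hypothetical annotations: (mutation, impact direction). The labels follow the
# Positive / Negative / Neutral qualification used in the corpora.
gold = {("N30A", "Positive"), ("N50A", "Neutral")}
system = {("N30A", "Negative"), ("N50A", "Neutral")}


def to_binary(annotations):
    """Collapse directions to a binary outcome (assumed: Neutral means no effect)."""
    return {
        (mutation, "no effect" if direction == "Neutral" else "has effect")
        for mutation, direction in annotations
    }


def count_correct(gold_annotations, system_annotations):
    """Number of system annotations that also occur in the gold standard."""
    return len(system_annotations.intersection(gold_annotations))


# Directional evaluation: the direction of the effect must match.
fine_grained = count_correct(gold, system)
# Binary evaluation: only the presence or absence of an effect must match.
binary = count_correct(to_binary(gold), to_binary(system))
print(fine_grained, binary)
```
        </preformat>
        <p>In this toy case the system result for N30A counts as correct under the binary evaluation but not under the directional one, so the same run can yield different scores depending on the chosen granularity.</p>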
        <p>Our infrastructure has to be sufficiently flexible to accommodate many such uses. This is achieved by using
SPARQL to retrieve entities, such as different flavours of true and false positives, that need to be counted
in order to calculate a particular metric. The current version of SPARQL (1.1) offers a sufficient degree of
flexibility. In particular, the negation-as-failure related features – FILTER NOT EXISTS and MINUS – make it
easy, e.g., to qualify some results as false positives by checking whether they are absent from the
gold standard data.</p>
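        <p>The logic behind this use of negation can be sketched outside SPARQL with plain set operations. The following Python fragment is only an analogy under our own assumptions: the tuple layout is hypothetical, and Q00000 is a made-up protein identifier used to produce a wrong grounding.</p>
        <preformat>
```python
# Hypothetical result tuples:
# (pubmed_id, wt_residue, position, mut_residue, uniprot_id)
gold = {
    ("17526795", "N", 30, "A", "P22635"),
    ("17526795", "N", 50, "A", "P22635"),
}
system = {
    ("17526795", "N", 30, "A", "P22635"),  # also in gold: true positive
    ("17526795", "N", 50, "A", "Q00000"),  # wrong protein: false positive
}


def classify(gold_results, system_results):
    """Split results into true positives, false positives and false negatives."""
    true_positives = system_results.intersection(gold_results)
    # Results absent from the gold standard; this plays the role that
    # MINUS or FILTER NOT EXISTS plays in a SPARQL query.
    false_positives = system_results.difference(gold_results)
    false_negatives = gold_results.difference(system_results)
    return true_positives, false_positives, false_negatives


tp, fp, fn = classify(gold, system)
print(len(tp), len(fp), len(fn))
```
        </preformat>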
      </sec>
      <sec id="sec-2-4">
        <title>Design of the seed corpora</title>
        <p>To facilitate a preliminary evaluation of our infrastructure, we seeded it with several corpora supporting at
least two mutation text mining tasks: mutation grounding to proteins and extraction of mutation impacts
on molecular functions of proteins.</p>
        <p>The document annotations for mutation grounding identify extracted mutations and proteins, and
relations between them. The annotations for mutation impact extraction additionally identify molecular
functions of proteins and changes of these properties causally linked to some mutations, and provide
references to supporting text fragments.</p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>Results and Discussion</title>
      <sec id="sec-3-1">
        <title>Contents of the corpora</title>
        <sec id="sec-3-1-1">
          <title>EnzyMiner-based corpus.</title>
          <p>
            One of our seed corpora is based on an extract from the EnzyMiner [
            <xref ref-type="bibr" rid="ref21">21</xref>
            ] abstract database. It was
annotated manually and comprises 38 semi-randomly selected full text documents with 176 different
singular mutations linked to 48 different protein sequences. The selection was adjusted to ensure maximal
diversity by having documents with proteins from all enzyme families and 24 different species. The corpus
currently contains 488 statements (occurrences of impact information in text), 61 molecular functions and
29 combined mutations.
          </p>
          <p>In what follows, we call it simply “the EnzyMiner corpus”.</p>
          <p>We annotated documents with mutation impact information, which includes:
            <list list-type="bullet">
              <list-item><p>Studied protein-level mutations, in the form of singular amino acid substitutions. They are
represented as triples specifying the wild type and mutant residues, and the absolute positions of the
mutations on the corresponding amino acid sequences. For situations when the effects of several
simultaneous amino acid substitutions are studied, we allow them to be expressed as combined
mutations.</p></list-item>
              <list-item><p>Proteins to which the mutations are related, identified with UniProt IDs. The host organisms and
sets of specific protein sequences can be identified via the UniProt IDs.</p></list-item>
              <list-item><p>Protein properties specified as Gene Ontology Molecular Function classes.</p></list-item>
              <list-item><p>Mutation impacts qualified as Positive, Negative or Neutral.</p></list-item>
              <list-item><p>Text fragments the information was extracted from. Typical fragments contain mentions of protein
properties, impact directionality words, such as “increased” or “worse”, mutation mentions, protein
and organism names, etc.</p></list-item>
              <list-item><p>Documents identified with PubMed IDs.</p></list-item>
            </list>
          </p>
        </sec>
        <sec id="sec-3-1-2">
          <title>DHLA corpus.</title>
          <p>
            This is a small corpus comprising 13 documents with 52 mutations (unique within each document) on haloalkane
dehalogenases, manually annotated similarly to the EnzyMiner documents (see [
            <xref ref-type="bibr" rid="ref2">2</xref>
            ]).
          </p>
        </sec>
        <sec id="sec-3-1-3">
          <title>COSMIC-based corpus.</title>
          <p>
            We have an extract from the COSMIC database [
            <xref ref-type="bibr" rid="ref22">22</xref>
            ] containing 63 documents for three target genes:
FGFR3, MEN1 and PIK3CA. Unlike the EnzyMiner and DHLA corpora, this corpus does not identify
mutation impacts, although it links mutations to proteins and, thus, is suitable for mutation grounding
benchmarking.
          </p>
        </sec>
        <sec id="sec-3-1-4">
          <title>KinMutBase-based corpus.</title>
          <p>
            We retrieved 201 documents annotated with singular amino acid substitutions grounded to proteins, from
the KinMutBase [
            <xref ref-type="bibr" rid="ref23">23</xref>
            ] database. We additionally curated the selection by running MutationFinder, which is
a reliable tool for this purpose due to its very high recall, and comparing the results with the annotations
in the database. Based on this comparison, we discarded about 70 documents that appear to be annotated with
protein-level mutations that they do not seem to mention directly, although this may be due to the
translation from SNPs made by the curators. The final size of the corpus is 128 documents. In total, we
have 271 mutations linked to 26 different UniProt identifiers.
          </p>
        </sec>
        <sec id="sec-3-1-5">
          <title>Corpora statistics.</title>
          <p>
            The statistics for the corpora are summarized in Table 1.
[Table 1: corpora statistics; rows: EnzyMiner, KinMutBase, DHLA, PIK3CA, FGFR3, MEN1]
The RDF files representing our corpora are already relatively large, so for the purposes of efficient
SPARQL querying we deploy the data to a Sesame triplestore. Users have the option of downloading the
RDF data and using their own querying machinery, or accessing our DB via a public SPARQL endpoint.
The details can be found in [
            <xref ref-type="bibr" rid="ref24">24</xref>
            ].
          </p>
        </sec>
      </sec>
      <sec id="sec-3-2">
        <title>SPARQL queries for performance metrics</title>
        <p>To test the idea of using SPARQL for performance metrics computation, we have formulated several
SPARQL queries sufficient for computing precision and recall for systems implementing two text mining
tasks: mutation grounding to proteins and the extraction of impacts of mutations on protein properties.
For each task we wrote (1) a SPARQL query that selects relevant annotations in the gold standard data,
representing correct cases, (2) a query that selects all relevant results of the text mining system being
evaluated, and (3) a query that selects only correct results. These selections are enough to calculate
precision and recall.</p>
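        <p>Once the three selections are available, the arithmetic is straightforward. The Python sketch below assumes that only the sizes of the three result sets are needed; the function name and the convention of returning 0.0 for an empty denominator are our own choices, and the counts are made-up numbers, not corpus results.</p>
        <preformat>
```python
def precision_recall(n_gold, n_system, n_correct):
    """Compute precision and recall from the sizes of the three selections.

    n_gold    -- number of relevant annotations in the gold standard (query 1)
    n_system  -- number of relevant results produced by the system (query 2)
    n_correct -- number of system results that are correct (query 3)
    """
    # Assumed convention: a metric with an empty denominator is reported as 0.0.
    precision = n_correct / n_system if n_system else 0.0
    recall = n_correct / n_gold if n_gold else 0.0
    return precision, recall


# Made-up counts for illustration only.
precision, recall = precision_recall(n_gold=50, n_system=40, n_correct=30)
print(precision, recall)
```
        </preformat>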
        <p>
          We illustrate this by presenting a slightly simplified version of the query used to select the correct results
from mutation-impact extraction, which can be used for evaluation according to the metric definitions
from, e.g., [
          <xref ref-type="bibr" rid="ref10 ref2">2, 10</xref>
          ]. According to the definitions, a result is a tuple: document, protein, mutation, protein
property changed by the mutation, and direction of the property change. If the gold standard data
contain the same tuple, the result is considered correct. Technically, we have to compare two RDF graphs and
get the corresponding intersection. Note that the query assumes that the gold standard data are kept in the
named graph http://example.com/gold-standard.rdf and the system results come from another named
graph http://example.com/experiment.rdf.
        </p>
        <p>Note that, as in the modelling example, we replace non-mnemonic SIO identifiers with their
labels for readability.</p>
        <preformat>
 1  PREFIX sio: &lt;http://semanticscience.org/resource/&gt;
 2  PREFIX lsrn: &lt;http://purl.oclc.org/SADI/LSRN/&gt;
 3  SELECT DISTINCT ?pubmed_id ?wt_residue ?position_value ?mut_residue ?uniprot_record_id
                    ?property_change_class ?protein_property_class
 4  WHERE {
 5    GRAPH &lt;http://example.com/gold-standard.rdf&gt; {
 6      ?document a sio:'article' .
 7      ?document sio:'is subject of' ?pubmed_record .
 8      ?pubmed_record sio:'has attribute' ?pubmed_identifier .
 9      ?pubmed_identifier sio:'has value' ?pubmed_id .
10      ?document sio:'refers to' ?mutation_application .
11      ?mutation_application a mieo:ProteinMutationApplication .
12      ?mutation_application mieo:isApplicationOfMutation ?mutation .
13      ?mutation_application mieo:isApplicationOfMutationToProtein ?protein .
14      ?document sio:'refers to' ?mutation .
15      ?mutation a mieo:CombinedAminoAcidSequenceChange .
16      ?mutation sio:'has member' ?singular_mutation .
17      ?singular_mutation mieo:mutationHasWildtypeResidue ?wt_residue .
18      ?singular_mutation mieo:mutationHasMutantResidue ?mut_residue .
19      ?singular_mutation mieo:mutationHasPosition ?position .
20      ?position a sio:'position' .
21      ?position sio:'has value' ?position_value .
22      ?document sio:'refers to' ?protein .
23      ?protein sio:'is subject of' ?uniprot_record .
24      ?uniprot_record a lsrn:UniProt_Record .
25      ?uniprot_record sio:'has attribute' ?uniprot_record_identifier .
26      ?uniprot_record_identifier a lsrn:UniProt_Identifier .
27      ?uniprot_record_identifier sio:'has value' ?uniprot_record_id .
28      ?document sio:'refers to' ?property_change .
29      ?mutation_application mieo:mutationApplicationCausesChange ?property_change .
30      ?property_change a ?property_change_class .
31      ?document sio:'refers to' ?protein_property .
32      ?property_change mieo:propertyChangeAppliesTo ?protein_property .
33      ?protein_property sio:'is property of' ?protein .
34      ?protein_property a ?protein_property_class .
35    }
36    GRAPH &lt;http://example.com/experiment.rdf&gt; {
37      ?document2 a sio:'article' .
38      ?document2 sio:'is subject of' ?pubmed_record2 .
39      ?pubmed_record2 sio:'has attribute' ?pubmed_identifier2 .
40      ?pubmed_identifier2 sio:'has value' ?pubmed_id .
41      ?document2 sio:'refers to' ?mutation_application2 .
42      ?mutation_application2 a mieo:ProteinMutationApplication .
43      ?mutation_application2 mieo:isApplicationOfMutation ?mutation2 .
44      ?mutation_application2 mieo:isApplicationOfMutationToProtein ?protein2 .
45      ?document2 sio:'refers to' ?mutation2 .
46      ?mutation2 a mieo:CombinedAminoAcidSequenceChange .
47      ?mutation2 sio:'has member' ?singular_mutation2 .
48      ?singular_mutation2 mieo:mutationHasWildtypeResidue ?wt_residue .
49      ?singular_mutation2 mieo:mutationHasMutantResidue ?mut_residue .
50      ?singular_mutation2 mieo:mutationHasPosition ?position2 .
51      ?position2 a sio:'position' .
52      ?position2 sio:'has value' ?position_value .
53      ?document2 sio:'refers to' ?protein2 .
54      ?protein2 sio:'is subject of' ?uniprot_record2 .
55      ?uniprot_record2 a lsrn:UniProt_Record .
56      ?uniprot_record2 sio:'has attribute' ?uniprot_record_identifier2 .
57      ?uniprot_record_identifier2 a lsrn:UniProt_Identifier .
58      ?uniprot_record_identifier2 sio:'has value' ?uniprot_record_id .
59      ?document2 sio:'refers to' ?property_change2 .
60      ?mutation_application2 mieo:mutationApplicationCausesChange ?property_change2 .
61      ?property_change2 a ?property_change_class .
62      ?document2 sio:'refers to' ?protein_property2 .
63      ?property_change2 mieo:propertyChangeAppliesTo ?protein_property2 .
64      ?protein_property2 sio:'is property of' ?protein2 .
65      ?protein_property2 a ?protein_property_class .
66    }
67  }
        </preformat>
        <p>We comment briefly on the query composition. The two halves of the query (lines 5-35 and 36-66)
correspond to the selection of relevant data from the gold standard corpora and from the experimental
system results. Since our goal is to select only correct results, the two selections are joined on the instances
of the variables ?pubmed_id (identifying documents), ?wt_residue, ?mut_residue and ?position_value
(for the wildtype and mutant residues, and positions of the corresponding mutations),
?uniprot_record_id (identifying proteins), ?protein_property_class (identifying studied properties) and
?property_change_class (identifying the direction of the property change).</p>
        <p>Note that the query can only be used to implement micro averaging that treats the whole corpus as one
large document. If, for some reason, we were interested in macro averaging we would have to additionally
group the results by the PubMed ID values.</p>
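        <p>The distinction can be made concrete with a short Python sketch. It assumes that query results have been fetched as rows whose first element is the PubMed ID; the function name, the row layout and the example rows are hypothetical. Only precision is shown, as recall is handled analogously.</p>
        <preformat>
```python
from collections import defaultdict


def micro_macro_precision(system_rows, correct_rows):
    """Return (micro, macro) precision for rows keyed by PubMed ID in row[0]."""
    if not system_rows:
        raise ValueError("system_rows must not be empty")
    # Micro averaging: the whole corpus is treated as one large document.
    micro = len(correct_rows) / len(system_rows)
    # Macro averaging: group by PubMed ID, score each document, then average.
    system_per_doc = defaultdict(int)
    correct_per_doc = defaultdict(int)
    for row in system_rows:
        system_per_doc[row[0]] += 1
    for row in correct_rows:
        correct_per_doc[row[0]] += 1
    per_doc = [correct_per_doc[doc] / n for doc, n in system_per_doc.items()]
    macro = sum(per_doc) / len(per_doc)
    return micro, macro


# Hypothetical rows: (pubmed_id, mutation). Document 111 has four system
# results of which one is correct; document 222 has one correct result.
system_rows = [("111", "N30A"), ("111", "N50A"), ("111", "A12G"),
               ("111", "K7R"), ("222", "D5E")]
correct_rows = [("111", "N30A"), ("222", "D5E")]
micro, macro = micro_macro_precision(system_rows, correct_rows)
print(micro, macro)
```
        </preformat>
        <p>Here micro averaging gives 2/5, whereas macro averaging gives the mean of 1/4 and 1/1, so a document with many wrong results weighs less heavily under macro averaging.</p>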
      </sec>
      <sec id="sec-3-3">
        <title>Utilities</title>
        <p>As part of our infrastructure, we created a small set of simple utilities that facilitate data access:
• The evaluator utility calculates standard performance metrics by executing user-provided
SPARQL queries, counting the results and performing the necessary calculations. The user supplies the
queries in a simple configuration file.
• The Sesame loader and query client are simple command-line applications that load RDF
graphs into a Sesame triplestore and execute queries from files.
• The provenance enhancement utility helps when the sources of annotation data
provide only fragments of text as provenance, without specifying their positions in the document. The utility
searches the document texts for the corresponding fragments in order to provide more precise
provenance information.</p>
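        <p>The idea behind the provenance enhancement utility can be sketched in a few lines of Python. This is a hedged illustration only: it assumes exact substring matching, which may differ from the matching strategy of the actual utility, and the function name locate_fragment and the example sentence are invented for this sketch.</p>
        <preformat>
```python
from typing import List, Tuple


def locate_fragment(document_text: str, fragment: str) -> List[Tuple[int, int]]:
    """Return (start, end) character offsets of every exact occurrence of fragment."""
    spans: List[Tuple[int, int]] = []
    if not fragment:
        return spans
    start = document_text.find(fragment)
    while start != -1:
        spans.append((start, start + len(fragment)))
        # Continue searching one character further so overlapping hits are found too.
        start = document_text.find(fragment, start + 1)
    return spans


# Invented example text, not taken from the corpora.
text = "The A123T mutant lost activity. Activity of A123T was not restored."
spans = locate_fragment(text, "A123T")
print(spans)
```
        </preformat>
        <p>When a fragment occurs more than once, as in this example, all candidate positions are returned, and choosing among them would require additional context.</p>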
      </sec>
      <sec id="sec-3-4">
        <title>Testing the infrastructure</title>
        <p>
          As a proof of concept, we used our infrastructure for testing and iterative performance evaluation
during a project dedicated to the development of a robust mutation impact extraction system [
          <xref ref-type="bibr" rid="ref10">10</xref>
          ], and for
the evaluation of the mutation grounding subtask, intended for publication (see [
          <xref ref-type="bibr" rid="ref12">12</xref>
          ]). The purpose of the
system is to identify protein-level mutations, ground them to the corresponding UniProt IDs and, most
importantly, to extract the information about what properties of the proteins are affected and how, if this
is described in the processed document.
Since early versions of the system already produced RDF output modelled according to an ontology
similar to the MIEO, it was trivial to adjust the system to produce output in a format compatible with our
infrastructure. This compatibility was the main prerequisite for evaluating the system
on our gold standard corpora and for subsequently comparing results from different versions of the
mutation grounding system.
        </p>
        <p>
          Although the system previously showed reasonable performance on the 76 documents, its performance on
the larger and more representative data set comprising the EnzyMiner and KinMutBase corpora was very
low. After an investigation that relied heavily on the analysis of system runs based on our
annotations, including the provenance information, we identified the mutation grounding module as a
major performance bottleneck, with only 0.32 precision and 0.08 recall. We then focused on the
mutation grounding subtask, for which our infrastructure was instrumental because the task is also
supported by existing gold standard annotations, and eventually improved the performance to 0.83
precision and 0.82 recall. More details on this effort can be found in [
          <xref ref-type="bibr" rid="ref12">12</xref>
          ].
        </p>
      </sec>
      <sec id="sec-3-5">
        <title>Future work</title>
        <p>We are currently defining a procedure for the submission of third-party human-curated annotations
and system results.</p>
        <p>In future work, we will further stress-test the infrastructure with text mining tasks other than mutation
grounding and mutation impact extraction, and with a third-party mutation text mining system. We plan to
extend the ontology based on new requirements identified through community involvement and our
own research. In the near future, we also plan to extend the infrastructure to cover DNA-level mutations and
protein properties other than molecular functions, such as enzyme kinetics.</p>
        <p>
          The ontologies we are using provide only very basic means for attaching provenance information to
identified entities and relations by simply linking them to the documents they were mined from. We are
planning to use one of the existing ontologies for modelling sentence-level provenance (Annotation
Ontology [
          <xref ref-type="bibr" rid="ref25">25</xref>
          ], CALBC semantic annotation schema [
          <xref ref-type="bibr" rid="ref26">26</xref>
          ], or NLP Interchange Format [
          <xref ref-type="bibr" rid="ref27">27</xref>
          ]) to provide more
precise pointers to text fragments supporting annotations.
        </p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>Conclusions</title>
      <p>
        We report preliminary results on the development of a community-oriented benchmarking infrastructure
intended to relieve the developers of mutation text mining software from the burden of developing ad hoc
corpora and scripts for testing, benchmarking and evaluation of multiple mutation-related text mining
tasks. While large benchmark corpora for biological entity and relation extraction (such as CALBC [
        <xref ref-type="bibr" rid="ref28">28</xref>
        ],
BioCreative [
        <xref ref-type="bibr" rid="ref29">29</xref>
        ], GENIA [
        <xref ref-type="bibr" rid="ref30">30</xref>
        ], etc.) are focused mostly on genes, proteins, diseases, and species, our
benchmarking infrastructure fills the gap for mutation information. We have seeded the infrastructure with
a sizeable gold standard corpus (242 documents). To maximize the reusability and extensibility of our
infrastructure, we use RDF and OWL for annotation data representation and SPARQL queries as a means
of flexible analysis of text mining results. The infrastructure was tested for benchmarking and evaluation
of a mutation impact extraction system.
      </p>
      <p>We have done this work with the goal of initiating a community effort, and the future evolution of the
benchmarking infrastructure will be based on feedback and contributions from the community.</p>
    </sec>
    <sec id="sec-5">
      <title>Availability</title>
      <p>
        The benchmark corpora, ontology for modelling annotations, example output of our mutation text mining
system, benchmarking SPARQL query templates, and infrastructure support tools are available on the
        project page [
        <xref ref-type="bibr" rid="ref24">24</xref>
        ].
      </p>
    </sec>
    <sec id="sec-6">
      <title>Acknowledgement</title>
      <p>This research was funded in part by the New Brunswick Innovation Foundation, New Brunswick, Canada;
the NSERC, Discovery Grant Program, Canada and the Quebec-New Brunswick University Co-operation
in Advanced Education - Research Program, Government of New Brunswick, Canada.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Bauer-Mehren</surname>
            <given-names>A</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Furlong</surname>
            <given-names>LI</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Rautschka</surname>
            <given-names>M</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Sanz</surname>
            <given-names>F</given-names>
          </string-name>
          :
          <article-title>From SNPs to pathways: integration of functional effect of sequence variations on models of cell signalling pathways</article-title>
          .
          <source>BMC Bioinformatics</source>
          <year>2009</year>
          ,
          <volume>10</volume>
          (
          <issue>Suppl 8</issue>
          ):
          <fpage>S6</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Baker</surname>
            <given-names>CJO</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Witte</surname>
            <given-names>R</given-names>
          </string-name>
          :
          <article-title>Mutation Mining-A Prospector's Tale</article-title>
          .
          <source>Information Systems Frontiers (ISF)</source>
          <year>2006</year>
          ,
          <volume>8</volume>
          :
          <fpage>47</fpage>
          -
          <lpage>57</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Kanagasabai</surname>
            <given-names>R</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Choo</surname>
            <given-names>KH</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ranganathan</surname>
            <given-names>S</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Baker</surname>
            <given-names>CJO</given-names>
          </string-name>
          :
          <article-title>A workflow for mutation extraction and structure annotation</article-title>
          .
          <source>J Bioinform Comput Biol</source>
          <year>2007</year>
          ,
          <volume>5</volume>
          (
          <issue>6</issue>
          ):
          <fpage>1319</fpage>
          -
          <lpage>37</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>Doughty</surname>
            <given-names>E</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kertesz-Farkas</surname>
            <given-names>A</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bodenreider</surname>
            <given-names>O</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Thompson</surname>
            <given-names>G</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Adadey</surname>
            <given-names>A</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Peterson</surname>
            <given-names>T</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kann</surname>
            <given-names>MG</given-names>
          </string-name>
          :
          <article-title>Toward an automatic method for extracting cancer- and other disease-related point mutations from the biomedical literature</article-title>
          .
          <source>Bioinformatics</source>
          <year>2011</year>
          ,
          <volume>27</volume>
          (
          <issue>3</issue>
          ):
          <fpage>408</fpage>
          -
          <lpage>15</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <surname>Bromberg</surname>
            <given-names>Y</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Overton</surname>
            <given-names>J</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Vaisse</surname>
            <given-names>C</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Leibel</surname>
            <given-names>RL</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Rost</surname>
            <given-names>B</given-names>
          </string-name>
          :
          <article-title>In silico mutagenesis: a case study of the melanocortin 4 receptor</article-title>
          .
          <source>FASEB J</source>
          <year>2009</year>
          ,
          <volume>23</volume>
          (
          <issue>9</issue>
          ):
          <fpage>3059</fpage>
          -
          <lpage>69</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <surname>Winnenburg</surname>
            <given-names>R</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Plake</surname>
            <given-names>C</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Schroeder</surname>
            <given-names>M</given-names>
          </string-name>
          :
          <article-title>Improved mutation tagging with gene identifiers applied to membrane protein stability prediction</article-title>
          .
          <source>BMC Bioinformatics</source>
          <year>2009</year>
          ,
          <volume>10</volume>
          (
          <issue>Suppl 8</issue>
          ):
          <fpage>S3</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <surname>Caporaso</surname>
            <given-names>JG</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Baumgartner</surname>
            <given-names>WA</given-names>
            <suffix>Jr</suffix>
          </string-name>
          ,
          <string-name>
            <surname>Randolph</surname>
            <given-names>DA</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Cohen</surname>
            <given-names>KB</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hunter</surname>
            <given-names>L</given-names>
          </string-name>
          :
          <article-title>MutationFinder: a high-performance system for extracting point mutation mentions from text</article-title>
          .
          <source>Bioinformatics</source>
          <year>2007</year>
          ,
          <volume>23</volume>
          (
          <issue>14</issue>
          ):
          <fpage>1862</fpage>
          -
          <lpage>1865</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <surname>Laurila</surname>
            <given-names>JB</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kanagasabai</surname>
            <given-names>R</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Baker</surname>
            <given-names>CJO</given-names>
          </string-name>
          :
          <article-title>Algorithm for grounding mutation mentions from text to protein sequences</article-title>
          .
          <source>In Proceedings of the 7th international conference on Data integration in the life sciences, DILS'10</source>
          , Berlin, Heidelberg: Springer-Verlag
          <year>2010</year>
          :
          <fpage>122</fpage>
          -
          <lpage>131</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
          <string-name>
            <surname>Naderi</surname>
            <given-names>N</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Witte</surname>
            <given-names>R</given-names>
          </string-name>
          :
          <article-title>Automated extraction and semantic analysis of mutation impacts from the biomedical literature</article-title>
          .
          <source>BMC Genomics</source>
          <year>2012</year>
          ,
          <volume>13</volume>
          (
          <issue>Suppl 4</issue>
          ):
          <fpage>S10</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          10.
          <string-name>
            <surname>Laurila</surname>
            <given-names>JB</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Naderi</surname>
            <given-names>N</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Witte</surname>
            <given-names>R</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Riazanov</surname>
            <given-names>A</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kouznetsov</surname>
            <given-names>A</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Baker</surname>
            <given-names>CJO</given-names>
          </string-name>
          :
          <article-title>Algorithms and semantic infrastructure for mutation impact extraction and grounding</article-title>
          .
          <source>BMC Genomics</source>
          <year>2010</year>
          ,
          <volume>11</volume>
          (
          <issue>Suppl 4</issue>
          ):
          <fpage>S24</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          11.
          <string-name>
            <surname>Rebholz-Schuhmann</surname>
            <given-names>D</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Marcel</surname>
            <given-names>S</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Albert</surname>
            <given-names>S</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Tolle</surname>
            <given-names>R</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Casari</surname>
            <given-names>G</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kirsch</surname>
            <given-names>H</given-names>
          </string-name>
          :
          <article-title>Automatic extraction of mutations from Medline and cross-validation with OMIM</article-title>
          .
          <source>Nucleic Acids Res</source>
          <year>2004</year>
          ,
          <volume>32</volume>
          :
          <fpage>135</fpage>
          -
          <lpage>42</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          12.
          <string-name>
            <surname>Klein</surname>
            <given-names>A</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Riazanov</surname>
            <given-names>A</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Al-Rababah</surname>
            <given-names>K</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Baker</surname>
            <given-names>CJ</given-names>
          </string-name>
          :
          <article-title>Towards a next generation protein mutation grounding system for full texts</article-title>
          .
          <source>Accepted at SMBM2012.</source>
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          13.
          <string-name>
            <surname>Eckart</surname>
            <given-names>R</given-names>
          </string-name>
          :
          <article-title>Choosing an XML database for linguistically annotated corpora</article-title>
          .
          <source>Sprache und Datenverarbeitung</source>
          <year>2008</year>
          ,
          <volume>32</volume>
          :
          <fpage>7</fpage>
          -
          <lpage>22</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          14.
          <string-name>
            <surname>Demir</surname>
            <given-names>E</given-names>
          </string-name>
          , et al:
          <article-title>The BioPAX community standard for pathway data sharing</article-title>
          .
          <source>Nature Biotechnology</source>
          <year>2010</year>
          ,
          <volume>28</volume>
          (
          <issue>9</issue>
          ):
          <fpage>935</fpage>
          -
          <lpage>942</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          15.
          <article-title>The Mutation Impact Extraction Ontology (MIEO)</article-title>
          . http://unbsj.biordf.net/ontologies/mutation-impact-extraction-ontology.owl.
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          16.
          <string-name>
            <surname>Ashburner</surname>
            <given-names>M</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ball</surname>
            <given-names>CA</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Blake</surname>
            <given-names>JA</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Botstein</surname>
            <given-names>D</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Butler</surname>
            <given-names>H</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Cherry</surname>
            <given-names>JM</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Davis</surname>
            <given-names>AP</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Dolinski</surname>
            <given-names>K</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Dwight</surname>
            <given-names>SS</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Eppig</surname>
            <given-names>JT</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Harris</surname>
            <given-names>MA</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hill</surname>
            <given-names>DP</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Issel-Tarver</surname>
            <given-names>L</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kasarskis</surname>
            <given-names>A</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lewis</surname>
            <given-names>S</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Matese</surname>
            <given-names>JC</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Richardson</surname>
            <given-names>JE</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ringwald</surname>
            <given-names>M</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Rubin</surname>
            <given-names>GM</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Sherlock</surname>
            <given-names>G</given-names>
          </string-name>
          :
          <article-title>Gene ontology: tool for the unification of biology</article-title>
          .
          The Gene Ontology Consortium.
          <source>Nat Genet</source>
          <year>2000</year>
          ,
          <volume>25</volume>
          :
          <fpage>25</fpage>
          -
          <lpage>9</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          17.
          <article-title>The Semanticscience Integrated Ontology (SIO)</article-title>
          . http://semanticscience.org/ontology/sio.owl.
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          18.
          <article-title>Life Science Record Name (LSRN)</article-title>
          . http://lsrn.org.
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          19.
          <string-name>
            <surname>Carroll</surname>
            <given-names>JJ</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bizer</surname>
            <given-names>C</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hayes</surname>
            <given-names>P</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Stickler</surname>
            <given-names>P</given-names>
          </string-name>
          :
          <article-title>Named graphs, provenance and trust</article-title>
          .
          <source>In Proceedings of the 14th international conference on World Wide Web, WWW '05</source>
          , New York, NY, USA: ACM
          <year>2005</year>
          :
          <fpage>613</fpage>
          -
          <lpage>622</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          20.
          <string-name>
            <surname>Witte</surname>
            <given-names>R</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Baker</surname>
            <given-names>CJO</given-names>
          </string-name>
          :
          <article-title>Towards a systematic evaluation of protein mutation extraction systems</article-title>
          .
          <source>J Bioinform Comput Biol</source>
          <year>2007</year>
          ,
          <volume>5</volume>
          (
          <issue>6</issue>
          ):
          <fpage>1339</fpage>
          -
          <lpage>59</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          21.
          <string-name>
            <surname>Yeniterzi</surname>
            <given-names>S</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Sezerman</surname>
            <given-names>U</given-names>
          </string-name>
          :
          <article-title>EnzyMiner: automatic identification of protein level mutations and their impact on target enzymes from PubMed abstracts</article-title>
          .
          <source>BMC Bioinformatics</source>
          <year>2009</year>
          ,
          <volume>10</volume>
          (
          <issue>Suppl 8</issue>
          ):
          <fpage>S2</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          22.
          <string-name>
            <surname>Forbes</surname>
            <given-names>SA</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Tang</surname>
            <given-names>G</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bindal</surname>
            <given-names>N</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bamford</surname>
            <given-names>S</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Dawson</surname>
            <given-names>E</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Cole</surname>
            <given-names>C</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kok</surname>
            <given-names>CY</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Jia</surname>
            <given-names>M</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ewing</surname>
            <given-names>R</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Menzies</surname>
            <given-names>A</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Teague</surname>
            <given-names>JW</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Stratton</surname>
            <given-names>MR</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Futreal</surname>
            <given-names>PA</given-names>
          </string-name>
          :
          <article-title>COSMIC (the Catalogue of Somatic Mutations in Cancer): a resource to investigate acquired mutations in human cancer</article-title>
          .
          <source>Nucleic Acids Res</source>
          <year>2010</year>
          ,
          <volume>38</volume>
          (Database issue):
          <fpage>D652</fpage>
          -
          <lpage>7</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>
          23.
          <string-name>
            <surname>Ortutay</surname>
            <given-names>C</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Väliaho</surname>
            <given-names>J</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Stenberg</surname>
            <given-names>K</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Vihinen</surname>
            <given-names>M</given-names>
          </string-name>
          :
          <article-title>KinMutBase: a registry of disease-causing mutations in protein kinase domains</article-title>
          .
          <source>Hum Mutat</source>
          <year>2005</year>
          ,
          <volume>25</volume>
          (
          <issue>5</issue>
          ):
          <fpage>435</fpage>
          -
          <lpage>42</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref24">
        <mixed-citation>
          24.
          <article-title>Mutation text mining benchmarking infrastructure</article-title>
          . http://code.google.com/p/mutation-text-mining.
        </mixed-citation>
      </ref>
      <ref id="ref25">
        <mixed-citation>
          25.
          <string-name>
            <surname>Ciccarese</surname>
            <given-names>P</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ocana</surname>
            <given-names>M</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Garcia Castro</surname>
            <given-names>LJ</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Das</surname>
            <given-names>S</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Clark</surname>
            <given-names>T</given-names>
          </string-name>
          :
          <article-title>An open annotation ontology for science on web 3.0</article-title>
          .
          <source>J Biomed Semantics</source>
          <year>2011</year>
          ,
          <volume>2</volume>
          :
          <fpage>S4</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref26">
        <mixed-citation>
          26.
          <string-name>
            <surname>Croset</surname>
            <given-names>S</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Grabmüller</surname>
            <given-names>C</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Li</surname>
            <given-names>C</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kavaliauskas</surname>
            <given-names>S</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Rebholz-Schuhmann</surname>
            <given-names>D</given-names>
          </string-name>
          :
          <article-title>The CALBC RDF Triple Store: Retrieval over Large Literature Content</article-title>
          .
          <source>In SWAT4LS'10</source>
          <year>2010</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref27">
        <mixed-citation>
          27.
          <string-name>
            <surname>Hellmann</surname>
            <given-names>S</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lehmann</surname>
            <given-names>J</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Auer</surname>
            <given-names>S</given-names>
          </string-name>
          :
          <article-title>NIF: An ontology-based and linked-data-aware NLP Interchange Format</article-title>
          [http://svn.aksw.org/papers/2012/WWW NIF/public.pdf].
        </mixed-citation>
      </ref>
      <ref id="ref28">
        <mixed-citation>
          28.
          <string-name>
            <surname>Rebholz-Schuhmann</surname>
            <given-names>D</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Yepes</surname>
            <given-names>AJJ</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Van Mulligen</surname>
            <given-names>EM</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kang</surname>
            <given-names>N</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kors</surname>
            <given-names>J</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Milward</surname>
            <given-names>D</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Corbett</surname>
            <given-names>P</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Buyko</surname>
            <given-names>E</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Beisswanger</surname>
            <given-names>E</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hahn</surname>
            <given-names>U</given-names>
          </string-name>
          :
          <article-title>CALBC silver standard corpus</article-title>
          .
          <source>J Bioinform Comput Biol</source>
          <year>2010</year>
          ,
          <volume>8</volume>
          :
          <fpage>163</fpage>
          -
          <lpage>79</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref29">
        <mixed-citation>
          29.
          <article-title>The BioCreAtIvE challenge evaluation</article-title>
          . http://biocreative.sourceforge.net.
        </mixed-citation>
      </ref>
      <ref id="ref30">
        <mixed-citation>
          30.
          <string-name>
            <surname>Kim</surname>
            <given-names>JD</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ohta</surname>
            <given-names>T</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Tateisi</surname>
            <given-names>Y</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Tsujii</surname>
            <given-names>J</given-names>
          </string-name>
          :
          <article-title>GENIA corpus-semantically annotated corpus for bio-textmining</article-title>
          .
          <source>Bioinformatics</source>
          <year>2003</year>
          ,
          <volume>19</volume>
          (
          <issue>Suppl 1</issue>
          ):
          <fpage>i180</fpage>
          -
          <lpage>2</lpage>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>