<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>A semantic assistant for mutation mentions in PubMed abstracts</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Jonas B Laurila</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Alexandre Kouznetsov</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Christopher J O Baker</string-name>
          <email>bakerc@unb.ca</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Department of Computer Science &amp; Applied Statistics, University of New Brunswick</institution>
          ,
          <addr-line>Saint John, New Brunswick, E2L 4L5</addr-line>
          ,
          <country country="CA">Canada</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>Biomedical researchers consume and analyze PubMed abstracts on a daily basis, seeking to update their existing knowledge with insights from newly published literature. Plain-text descriptions fail to deliver contextual knowledge to users who require a comprehensive understanding of the content of a paper before deciding to access it. To achieve this, biological named entities described in the abstracts must be linked to their related entries in biological databases and established controlled vocabularies such as Swiss-Prot and the Gene Ontology. Semantic Assistants support users in content retrieval, analysis, and development by offering context-sensitive NLP services directly integrated into standard desktop clients, such as a word processor. They are implemented through an open service-oriented architecture using Semantic Web ontologies and W3C Web Services. Here we present a deployment of the Semantic Assistants framework that links mutation, protein, protein property, gene and organism mentions in abstracts to their related entries in standardized biological databases and controlled vocabularies. The underlying text mining pipeline used to identify named entities has previously shown high levels of precision, and we make this functionality easily accessible to end users reviewing PubMed abstracts through a Firefox client.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>
        The proliferation of annotation services for biologists reviewing scientific literature is based on the
adoption of web services designed to provide links to contextual information held in online databases.
These annotation services, such as Reflect [1], tag gene, protein and small-molecule names and link them to
external resources with sequence and structure information for given default organisms. Further
deployments of annotation services, such as BioDEAL [
        <xref ref-type="bibr" rid="ref1">2</xref>
        ], facilitate a feedback loop whereby scientists
manually link biological concepts to published evidence through a web-browser frontend, making it possible
for biologists to collectively build and share knowledge. BioDEAL also supports users who want to make
use of natural language processing tools in their annotation work. Whereas BioDEAL leverages social
networking between biologists, a standardized framework allowing rapid customization to new application
scenarios by multiple stakeholders is still required.
      </p>
      <p>
        Recent work on Semantic Assistants [
        <xref ref-type="bibr" rid="ref2">3</xref>
        ] has opened up the possibility of providing server-side natural
language processing of texts being reviewed or drafted in client-side applications through web services. In
so doing, annotations can be pushed directly to users of dedicated desktop clients or to users browsing the web with
browsers equipped with plug-in extensions. The Semantic Assistants framework was previously applied in the domain
of telecom [
        <xref ref-type="bibr" rid="ref3">4</xref>
        ], where annotations, in the form of OWL-DL axioms from a telecom ontology, were linked
to named entities via canonical names mapped to semantic classes in the telecom ontology. In this
contribution we illustrate the deployment of a Semantic Assistant for the relay of grounded mutations and
mutation impacts; moreover, our implementation uses Greasemonkey, an extension to Mozilla
Firefox, as the Semantic Assistant client, allowing the delivery of these mutation annotations when
browsing PubMed on the web.
      </p>
    </sec>
    <sec id="sec-2">
      <title>Methods</title>
      <sec id="sec-2-1">
        <title>Semantic assistants</title>
        <p>
          In previous work [
          <xref ref-type="bibr" rid="ref2">3</xref>
          ] the notion of semantic assistants and the Semantic Assistants architecture was
established. Primarily this involves the integration of text analysis services and end-user clients. The
architecture consists of four tiers, described here briefly: (i) the Client tier; clients are typically word
processors, web browsers or any other applications that render text; (ii) the Presentation and Interaction tier; a
web server containing modules that translate results from NLP services into formats compatible with the
clients; (iii) the Analysis and Retrieval tier; the actual NLP systems, which for a given text produce output
annotations; (iv) the Resource tier, which supports the NLP systems with external information from other
documents or databases.
        </p>
      </sec>
      <sec id="sec-2-2">
        <title>Semantic assistant clients</title>
        <p>Reported Semantic Assistant deployments rely on clients that are plug-ins for text editors such as
OpenOffice. In the current work we have implemented a Semantic Assistant client with the use of
Greasemonkey [5], an extension to Mozilla Firefox allowing users to install scripts supporting augmented
browsing, i.e. making changes to webpage content. The script we have built pre-processes a PubMed entry
page, sends the abstract, via a Java servlet wrapping the Semantic Assistants Java API, to our mutation
impact extraction service, and annotates the abstract with the results as shown in Figure 1. An important
difference between text in word processors and web page content is that the latter is not editable and
contains both text of interest and, depending on the web page selected, content that can be considered
noise. For a given page, the list of web services has to be extended according to the content the services will
process (which webpage and which part of that webpage). As an example, the NLP service "mutation
impact extractor" can be customized to a "mutation impact extractor for PubMed abstracts". Moreover,
text processing algorithms for some NLP services may be context dependent in that they may rely on
co-occurrence of terms within a certain distance of named entities. Consequently, multiple texts or other
surrounding content might disrupt the underlying algorithms for entity recognition and disambiguation.
This makes the construction of Semantic Assistant clients for web browsers somewhat more tedious. We propose
a client that supports browsing, augmented with semantic tags, by using Ajax technologies in
combination with the existing Semantic Assistants architecture. The core of the client is a Java servlet
making use of a Java API containing a precompiled client-side abstraction layer which performs the actual
communication with the server. The exterior of the client consists of a set of scripts which
make changes to HTML pages with annotations retrieved via asynchronous calls to the Java servlet.</p>
      </sec>
      <sec id="sec-2-3">
        <title>Text mining pipeline</title>
        <p>
          The backbone of a Semantic Assistant is the NLP service that processes text and outputs annotations or
deduced facts. We have previously developed such NLP algorithms and a semantic infrastructure which
finds, annotates and disambiguates biological entities by using a combination of gazetteer-based approaches
and methods for relation detection. The following entities are currently extracted: proteins, genes,
organisms, point mutations, protein properties and mutational impacts on protein properties. In particular,
these algorithms facilitate the grounding (linking) of proteins and mutations to Swiss-Prot entries and
correct positions on amino acid sequences [
          <xref ref-type="bibr" rid="ref4">6</xref>
          ], and of protein properties to Gene Ontology concepts [
          <xref ref-type="bibr" rid="ref5">7</xref>
          ]. The
results of grounding are made available to end users through the Semantic Assistant. Here we briefly
outline the algorithms that deliver grounded mutations and impacts to end users of the Semantic Assistant
client.
        </p>
        <p>
          Firstly, cross-linking entities found in text with their real-world counterparts, called grounding, requires
the extraction and normalization of mutation mentions, for which we used the MutationFinder system [
          <xref ref-type="bibr" rid="ref6">8</xref>
          ]
in combination with the GATE text mining framework, as well as custom gazetteer lists built from the
text-format version of Swiss-Prot. To facilitate grounding, a local store of mappings between names,
primary accession numbers and amino acid sequences was created. To ground mutations to
proteins and to the correct amino acid residues on those proteins, a set of candidate protein sequences must be
established. Based on protein names found in target documents, a pool of protein accessions and
corresponding sequences is identified. Subsequently, mutations extracted from the text are mapped onto
the candidate sequences using regular expressions generated from the mutation mentions,
where each regular expression is constructed from multiple wildtype residues and the
distances between them. For regular expressions matching a candidate sequence, further matching of
mutations from the target document is explored, taking into account the numbering displacement found
when using the regular expression. The sequence behind the accession number of a correctly grounded protein is considered to
be the wildtype sequence of the protein in the document.
        </p>
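The regular-expression grounding step described above can be sketched as follows. This is an illustrative reconstruction, not the published implementation: mutation mentions such as F294A yield a wildtype residue and a claimed position, and the regex encodes the wildtype residues separated by the exact distances between their positions.

```javascript
// Parse a point mutation mention like "F294A" into wildtype residue,
// claimed sequence position, and mutant residue.
function parseMutation(mention) {
  const m = /^([A-Z])(\d+)([A-Z])$/.exec(mention);
  return m ? { wt: m[1], pos: Number(m[2]), mut: m[3] } : null;
}

// Build a regex such as /F.{3}W/ from mutations at positions 3 and 7:
// wildtype residues in position order, separated by their exact distance.
function groundingRegex(mutations) {
  const ms = [...mutations].sort((a, b) => a.pos - b.pos);
  let pattern = ms[0].wt;
  for (let i = 1; i < ms.length; i++) {
    pattern += ".{" + (ms[i].pos - ms[i - 1].pos - 1) + "}" + ms[i].wt;
  }
  return new RegExp(pattern);
}

// Try the regex against a candidate Swiss-Prot sequence; a match yields the
// numbering displacement between the document's positions and the sequence.
function ground(mutations, sequence) {
  const ms = [...mutations].sort((a, b) => a.pos - b.pos);
  const hit = groundingRegex(ms).exec(sequence);
  if (!hit) return null;
  // displacement = (0-based match index) - (claimed position - 1)
  return { displacement: hit.index - (ms[0].pos - 1) };
}
```

A non-null result identifies the candidate sequence as a plausible wildtype, and the displacement can then be applied when checking the remaining mutations from the document.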
        <p>
          Secondly, the extraction of mutation impacts from documents relies on both the identification of protein
functions, found in noun phrases, and the extraction of directionality terms found in sentences adjacent
to mutation mentions. This is achieved by identifying Gene Ontology Molecular Function
terms involving activity, binding, affinity or specificity as the head noun in noun phrases identified using
the multi-lingual noun phrase extractor [
          <xref ref-type="bibr" rid="ref7">9</xref>
          ]. Identification of the directionality of a mutational change is
achieved using custom gazetteer lists. Lastly, relations must be established between directionality
words and protein properties in order to identify impact statements, and between mutants and these
impact statements. A rule-based approach is used to identify relations, and a scoring of significance is
achieved using heuristics based on entity distance. Full details of these algorithms are given in [
          <xref ref-type="bibr" rid="ref4 ref5">6, 7</xref>
          ].
        </p>
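The rule-based relation step with distance-based scoring can be sketched as below. The gazetteer contents and the scoring formula are illustrative assumptions; the actual heuristics are detailed in [6, 7].

```javascript
// Illustrative directionality gazetteer (the real lists are custom built).
const DIRECTIONALITY = ["increased", "decreased", "enhanced", "reduced", "abolished"];

// tokens: array of {text, type}, where type is "property" for GO Molecular
// Function head nouns (e.g. "activity", "binding") and "plain" otherwise.
// Each directionality word is paired with the nearest property mention,
// and the pair is scored by token distance (closer = higher score).
function impactStatements(tokens) {
  const pairs = [];
  tokens.forEach((t, i) => {
    if (!DIRECTIONALITY.includes(t.text.toLowerCase())) return;
    let best = null;
    tokens.forEach((u, j) => {
      if (u.type !== "property") return;
      const dist = Math.abs(i - j);
      if (!best || dist < best.dist) best = { property: u.text, dist };
    });
    if (best) {
      pairs.push({
        direction: t.text,
        property: best.property,
        score: 1 / (1 + best.dist), // simple distance heuristic, assumed here
      });
    }
  });
  return pairs;
}
```

For a sentence such as "the mutant showed reduced catalytic activity", this pairs "reduced" with "activity", yielding one candidate impact statement.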
      </sec>
      <sec id="sec-2-4">
        <title>Discussion</title>
        <p>
          Although we can show that our NLP services work for PubMed abstracts, we know that the performance of
the underlying grounding algorithms decreases when switching from full text to abstracts only. To
improve this performance and to retrieve more information, the Semantic Assistant client should send the
full-text content to the service; this can be done via existing web services like EFetch [
          <xref ref-type="bibr" rid="ref8">10</xref>
          ], whenever the
paper in question is publicly available as full text.
        </p>
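Retrieving the full text through EFetch could look as follows. This is a sketch under assumptions: the article must be deposited in PubMed Central for full text to be available, and the example id is hypothetical.

```javascript
// Base URL of the NCBI EFetch utility.
const EFETCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi";

// Build an EFetch request URL for a given database, record id and return mode.
function efetchUrl(db, id, retmode) {
  const q = new URLSearchParams({ db, id, retmode });
  return EFETCH + "?" + q.toString();
}

// Usage (browser or Node 18+), with a hypothetical PMC id:
// fetch(efetchUrl("pmc", "PMC2704134", "xml"))
//   .then(r => r.text())
//   .then(xml => { /* forward full text to the NLP service */ });
```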
        <p>For future work, we will host a web site listing the available services. When choosing a
service, users can also restrict it to run only on certain web sites, e.g.:
http://www.ncbi.nlm.nih.gov/pubmed/*
To ensure that new customized services can be constructed and consumed by both service providers and
end-users, template client scripts should be made available in which parameters can be set for content
restriction, i.e. to send only content in specific document elements, e.g.:
&lt;p id="abstract"&gt;content&lt;/p&gt;
and to set custom styles for output annotations. These customized client scripts can then be stored
and listed as services, available to other users as well.</p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>Conclusion</title>
      <p>The adoption of new frameworks providing online annotations to already published content is an emerging
trend in life science knowledge discovery. In this brief work we have shown that existing algorithms
for grounding mutation mentions and related content can be deployed to great effect, albeit in a prototype-scale
application. This serves both as a strong motivation for deploying further semantic assistants for
other named entities and as a test bed to solicit further requirements from end users. In the process of
deploying this prototype in an open source web browser we have identified constraints that will impact the
design of next-generation Semantic Assistants.</p>
    </sec>
    <sec id="sec-4">
      <title>Abbreviations used</title>
      <p>NLP: Natural Language Processing; OWL-DL: a Web Ontology sub-Language; API: Application
Programming Interface; HTML: HyperText Markup Language.</p>
    </sec>
    <sec id="sec-5">
      <title>Competing interests</title>
      <p>The authors declare that they have no competing interests.
References
1. Pa lis E, O'Donoghue SI, Jensen LJ, Horn H, Kuhn M, Brown NP, Schneider R: Re ect: augmented
browsing for the life scientist. Nature Biotechnology 2009, 27:508{510.
5. Greasemonkey. [http://www.greasespot.net].</p>
      <p>Figure 1. The abstract is tagged according to the output of our mutation impact NLP service. Here we can see that
the protein haloalkane dehalogenase is tagged together with two mutations on the same protein.
Algorithms for protein and mutation grounding used by the back-end NLP system ensure that the
mutations refer to the correct positions on the correct amino acid sequence, as seen in the popup box
produced when the mouse pointer hovers over the tag surrounding the F294A point mutation mention.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          2.
          <string-name>
            <surname>Breimyer</surname>
            <given-names>P</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Green</surname>
            <given-names>N</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kumar</surname>
            <given-names>V</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Samatova</surname>
            <given-names>N</given-names>
          </string-name>
          :
          <article-title>BioDEAL: community generation of biological annotations</article-title>
          .
          <source>BMC Medical Informatics and Decision Making</source>
          <year>2009</year>
          ,
          <volume>9</volume>
          (
          <issue>Suppl 1</issue>
          ):
          <fpage>S5</fpage>
          , [http://www.biomedcentral.com/1472-6947/9/S1/S5].
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          3.
          <string-name>
            <surname>Witte</surname>
            <given-names>R</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gitzinger</surname>
            <given-names>T</given-names>
          </string-name>
          :
          <article-title>Semantic Assistants - User-Centric Natural Language Processing Services for Desktop Clients</article-title>
          .
          <source>In 3rd Asian Semantic Web Conference (ASWC 2008), Volume 5367 of LNCS</source>
          , Bangkok, Thailand: Springer
          <year>2009</year>
          :
          <fpage>360</fpage>
          -
          <lpage>374</lpage>
          , [http://rene-witte.net/semantic-assistants-aswc08].
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          4.
          <string-name>
            <surname>Kouznetsov</surname>
            <given-names>A</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Shoebottom</surname>
            <given-names>B</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Witte</surname>
            <given-names>R</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Baker</surname>
            <given-names>CJO</given-names>
          </string-name>
          :
          <article-title>Leverage of OWL-DL axioms in a Contact Centre for Technical Product Support</article-title>
          .
          <source>In OWL: Experiences and Directions (OWLED 2010)</source>
          , San Francisco, California, USA
          <year>2010</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          6.
          <string-name>
            <surname>Laurila</surname>
            <given-names>JB</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kanagasabai</surname>
            <given-names>R</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Baker</surname>
            <given-names>CJO</given-names>
          </string-name>
          :
          <article-title>Algorithm for Grounding Mutation Mentions from Text to Protein Sequences</article-title>
          .
          <source>Seventh International Conference on Data Integration in the Life Sciences, Gothenburg, Sweden</source>
          ,
          <year>2010</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          7.
          <string-name>
            <surname>Laurila</surname>
            <given-names>JB</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Naderi</surname>
            <given-names>N</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Witte</surname>
            <given-names>R</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Riazanov</surname>
            <given-names>A</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kouznetsov</surname>
            <given-names>A</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Baker</surname>
            <given-names>CJO</given-names>
          </string-name>
          :
          <article-title>Algorithms and semantic infrastructure for mutation impact extraction and grounding</article-title>
          .
          <source>Ninth International Conference on Bioinformatics (InCoB2010), Tokyo, Japan</source>
          ,
          <year>2010</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          8.
          <string-name>
            <surname>Caporaso</surname>
            <given-names>J</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Baumgartner</surname>
            <given-names>WA Jr</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Randolph</surname>
            <given-names>D</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Cohen</surname>
            <given-names>K</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hunter</surname>
            <given-names>L</given-names>
          </string-name>
          :
          <article-title>MutationFinder: a high-performance system for extracting point mutation mentions from text</article-title>
          .
          <source>Bioinformatics</source>
          <year>2007</year>
          ,
          <volume>23</volume>
          :
          <fpage>1862</fpage>
          -
          <lpage>1865</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          9.
          Multi-lingual Noun Phrase Extractor. [http://www.semanticsoftware.info/munpex].
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>10. EFetch. [http://eutils.ncbi.nlm.nih.gov/corehtml/query/static/efetch_help.html].</mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>