<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Crowdsourcing Protein Family Database Curation</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Matt Jeffryes</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Maria Liakata</string-name>
          <email>m.liakata@warwick.ac.uk</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Alex Bateman</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>European Bioinformatics Institute (EMBL-EBI)</institution>
          ,
          <addr-line>Cambridge</addr-line>
          ,
          <country country="UK">United Kingdom</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>University of Warwick</institution>
          ,
          <addr-line>Coventry</addr-line>
          ,
          <country country="UK">United Kingdom</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>We propose a novel method for crowdsourcing a protein family database. We discuss how we intend to identify novel groupings of proteins from user sequence similarity search, and how text mining will be applied to assist in annotation of these novel groupings, and more broadly as an enrichment of protein sequence similarity search results. We intend to use entity linking to identify literature which discusses proteins found in the search results, and present those publications which are likely to be the most useful to curators and sequence similarity search users alongside the sequence search results.</p>
      </abstract>
      <kwd-group>
        <kwd>crowdsourcing</kwd>
        <kwd>biocuration</kwd>
        <kwd>databases</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>I. INTRODUCTION</title>
      <p>
        Protein families are groupings of proteins which have an
evolutionary relationship. Pfam is a database of protein families,
which has been maintained since 1996 [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. During this time,
Pfam has been a human-curated database. Each protein family
in the database is defined by an alignment which has been
constructed by curation staff. Using this approach, Pfam has
reached coverage of 47% of the residues in the protein sequence
database upon which it is based [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. Curators use the software
package HMMER to construct Pfam families. For each family,
a number of exemplar ‘seed sequences’ are aligned, and HMMER is
used to produce a sequence profile hidden Markov model
(HMM). This HMM is then used to query pfamseq, the protein
sequence database upon which Pfam is based. The regions in the
database which are significant matches for the HMM are the
members of the protein family [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ].
      </p>
      <p>
        HMMER is also a valuable tool for discovering families. It
can perform sensitive sequence similarity searches against a
target protein sequence database, and is able to find homologous
proteins distantly related to the query sequence [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]. When used
in this way, HMMER builds a sequence profile HMM de novo
from the query sequence, and retrieves matching sequences from
the sequence database. HMMER can either be run at the
command line against a local database, or via the HMMER web
service [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]. In addition to the sequence profile HMM which
defines a family, Pfam curators annotate families with literature
citations which give further evidence of the existence of the
family, and cross-references to other databases, via InterPro.
      </p>
      <p>
        Since 2011, Pfam has used Wikipedia as the primary
repository for family annotation. If an article already exists,
curators link it to the family, and the contents of the article are
mirrored on the protein family's web page. For families which
do not have a Wikipedia article, users are encouraged to create
a new article [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]. This method of annotation was pioneered by
Pfam's sister database Rfam and has resulted in higher quality
annotation. As an additional benefit, crowdsourcing annotation
has increased the visibility of the databases to the scientific
community [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ].
      </p>
      <p>Constructing a new Pfam entry requires that a curator
searches the scientific literature for evidence of the existence of
the new protein family, and for the possible function and
structure of the members of the family. The curator’s aim is to
identify literature which mentions a member of the family, and
the most useful papers will be those which go into detail about
the protein’s structure or function.</p>
      <p>
        Identifying literature which mentions a particular protein is
more complex than searching the literature for the protein’s
name. Biologists may use several different names and
abbreviations for the same protein. Additionally, the same name
may be shared across several species [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ].
      </p>
      <p>
        Previously, PubServer combined protein sequence similarity
search using PSI-BLAST with literature retrieval [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ]. Our
approach differs in two ways. Firstly, PubServer retrieves only
publications which are already attached to protein database
entries, whereas our approach is to search the open-access subset
of PubMed Central for relevant mentions of proteins. Secondly,
our approach uses article full-text rather than being limited to
title, abstract and MeSH terms.
      </p>
      <p>Our aim is to introduce crowdsourcing to the construction of
the families, and to facilitate easier annotation of families by the
application of text mining.</p>
    </sec>
    <sec id="sec-2">
      <title>II. PROPOSED SYSTEM</title>
      <sec id="sec-2-1">
        <title>A. Crowdsourcing Profile HMMs</title>
        <p>HMMER can be used to identify new protein families. This
could happen intentionally, when a Pfam curator makes
additions to the database, but it can also occur incidentally, when
a sequence similarity search user's query sequence happens to
produce an HMM which matches against sequence regions
which are as yet not matched by any existing Pfam family. It can
also occur that a user's search matches a very high proportion of
the sequences matched by an existing Pfam family, and further
additional sequences which are likely to be homologous with the
Pfam family. In this case, the HMM produced by the user's
search is a potential improvement over the HMM which
currently defines the family.</p>
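        <p>The two situations described above can be illustrated with a minimal sketch. It assumes, for simplicity, that a search result and each existing family are reduced to sets of sequence identifiers, whereas Pfam families are defined over sequence regions. The function name, the labels and the 0.9 coverage threshold are hypothetical choices made for illustration and are not part of Pfam or HMMER.</p>
        <preformat><![CDATA[
```python
from typing import Dict, Set, Tuple


def classify_search(
    hits: Set[str],
    families: Dict[str, Set[str]],
    min_family_coverage: float = 0.9,  # assumed threshold, not from the paper
) -> Tuple[str, str]:
    """Label a search as 'novel', 'improvement', 'known' or 'empty'.

    Returns (label, family_id); family_id is '' when no family applies.
    """
    if not hits:
        return ("empty", "")

    # Every sequence already matched by some existing family.
    covered = set().union(*families.values())
    if hits.isdisjoint(covered):
        # None of the hits is matched by any existing family.
        return ("novel", "")

    # Find the family whose members are best covered by the search hits.
    best_id, best_cov = "", 0.0
    for fam_id in sorted(families):
        members = families[fam_id]
        if not members:
            continue
        cov = len(hits & members) / len(members)
        if cov > best_cov:
            best_id, best_cov = fam_id, cov

    # An improvement covers most of the family and adds further sequences.
    has_extra = bool(hits - families[best_id])
    if best_cov >= min_family_coverage and has_extra:
        return ("improvement", best_id)
    return ("known", best_id)
```
]]></preformat>
        <p>In this sketch, a search that recovers every member of a family together with one additional sequence is labelled an improvement, whereas a search that recovers only half of the members is labelled as known. A real implementation would also need to assess whether the additional sequences are plausibly homologous.</p>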
        <p>Presently, such potential improvements to Pfam will only be
incorporated if the user identifies that they have found a
grouping of proteins unknown to Pfam, and makes the effort to
contact Pfam's helpdesk. We propose that users should be
alerted to this situation and prompted to submit their search's
sequence profile HMM to Pfam.</p>
        <p>We believe that this has two advantages over an alternative
system which would silently capture HMMs from sequence
similarity search. Firstly, the user who identifies the new or
improved family gets credit for their improvement, which we
consider to be both an ethical necessity, and an opportunity for
increasing community engagement. Secondly, we hypothesise
that the user who performs a sequence similarity search which
identifies a novel protein family is more likely to have the
knowledge required to annotate that family with relevant
literature or other metadata than Pfam curators are, resulting in
higher quality annotation. This second advantage also applies
over a potential alternative system which automatically
performs sequence similarity search with random sequences
which are not matched by any Pfam family. While this would
likely produce many novel families, the curation workload to
actually incorporate these families into Pfam would be high.</p>
      </sec>
      <sec id="sec-2-2">
        <title>B. Crowdsourcing Annotation</title>
        <p>Once a user has been made aware that their sequence
similarity search represents a potential improvement to Pfam,
we would like to ease the process of adding annotations to their
new family. In particular, we would like to highlight literature
which is relevant to their sequence similarity search.</p>
        <p>
          Presently, when researching a protein found by a sequence
similarity search, users could look at the literature linked to the
protein in a protein database such as UniProt. However, the
literature citations for an entry are not intended to be exhaustive,
but a representative sample [
          <xref ref-type="bibr" rid="ref9">9</xref>
          ]. It is also hard to use this method
to search for literature which mentions multiple proteins found
in a search. Moreover, relevant sections within known papers
are not available. Our proposed system will use named entity
recognition to identify literature mentioning protein sequences
matched by a HMMER sequence similarity search result. We
will then seek to extract the passages from the text which are
most relevant for annotation and use these to rank the relevance
of publications. We hypothesise that information about protein
function, structure, homology, and phylogeny will be the most
useful. Literature which mentions multiple proteins found by a
single sequence similarity search is also likely relevant.
        </p>
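        <p>As an illustration of the ranking idea only, the following sketch scores a publication by the number of distinct search-hit proteins it mentions, plus a bonus for each passage containing a term related to function, structure, homology or phylogeny. The term list, the 0.5 weight and the input layout are assumptions made for this example. It is a hand-weighted heuristic and not the learned ranking model proposed in this paper.</p>
        <preformat><![CDATA[
```python
from typing import Dict, List, Set

# Hypothetical topic stems standing in for a trained passage classifier.
TOPIC_TERMS = ("function", "structure", "homolog", "phylogen")


def score_publication(mentioned: Set[str], passages: List[str],
                      hits: Set[str]) -> float:
    """Score one publication against the proteins found by a search."""
    proteins = len(mentioned & hits)
    if proteins == 0:
        return 0.0  # publication mentions none of the search hits
    topical = sum(
        1 for text in passages
        if any(term in text.lower() for term in TOPIC_TERMS)
    )
    return proteins + 0.5 * topical  # 0.5 is an arbitrary illustrative weight


def rank_publications(pubs: Dict[str, dict], hits: Set[str]) -> List[str]:
    """Return publication ids with a non-zero score, best first."""
    scored = {
        pub_id: score_publication(pub["mentions"], pub["passages"], hits)
        for pub_id, pub in pubs.items()
    }
    relevant = [pub_id for pub_id, score in scored.items() if score > 0]
    # Sort by descending score, breaking ties by identifier for stability.
    return sorted(relevant, key=lambda pub_id: (-scored[pub_id], pub_id))
```
]]></preformat>
        <p>Under these assumptions, a publication that mentions two of the search hits and discusses a crystal structure is ranked above one that mentions a single hit in passing, and a publication mentioning none of the hits is excluded.</p>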
        <p>We anticipate that literature search will be useful not only to
curators and users constructing new protein families, but also to
sequence similarity search users in general, as a tool for
researching homologues of an uncharacterised protein sequence.</p>
        <p>
          Our proposed method is to first use the BANNER named
entity recogniser, trained on the BioCreative II gene mention
data set, to identify all possible mentions of genes and proteins
within the open access subset of PubMed Central [
          <xref ref-type="bibr" rid="ref10">10</xref>
          ]. We will
then train a ranking SVM classifier to identify the most relevant
mentions for entities in the protein database UniProt. This
method is based upon [
          <xref ref-type="bibr" rid="ref11">11</xref>
          ], but with a reversal in the direction
of association: instead of identifying candidate entities for each
possible mention in the text, we will identify candidate mentions
for each entity within the knowledge base.
        </p>
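        <p>The reversed direction of association can be sketched as follows. The example assumes that mention detection has already produced pairs of document identifier and surface form, and that each knowledge base entity carries a list of names. It covers candidate generation by normalised string matching only and omits the ranking step. The normalisation rule and all identifiers shown are hypothetical.</p>
        <preformat><![CDATA[
```python
import re
from collections import defaultdict
from typing import Dict, List, Tuple

Mention = Tuple[str, str]  # (document id, surface form as written)


def normalise(name: str) -> str:
    """Lower-case a name and drop everything except letters and digits."""
    return re.sub(r"[^a-z0-9]", "", name.lower())


def index_mentions(mentions: List[Mention]) -> Dict[str, List[Mention]]:
    """Group detected mentions by their normalised surface form."""
    index: Dict[str, List[Mention]] = defaultdict(list)
    for doc_id, surface in mentions:
        index[normalise(surface)].append((doc_id, surface))
    return index


def candidate_mentions(entities: Dict[str, List[str]],
                       index: Dict[str, List[Mention]]) -> Dict[str, List[Mention]]:
    """For each entity, collect the mentions matching any of its names."""
    result: Dict[str, List[Mention]] = {}
    for entity_id, names in entities.items():
        keys = sorted({normalise(name) for name in names})
        result[entity_id] = [m for key in keys for m in index.get(key, [])]
    return result
```
]]></preformat>
        <p>Starting from the entity rather than from the mention means that only names belonging to proteins of interest, such as those returned by a search, need to be looked up. The candidates would then be passed to the ranking model for disambiguation.</p>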
        <p>We intend to train and evaluate this method by using the
references attached to each entry in the manually annotated
subset of UniProt, SwissProt. Each publication in the reference
list of a SwissProt entry has a ‘scope’, which describes the
elements of the entry that the publication has been cited for. For
example, an entry may have one reference with a scope of
‘nucleotide sequence’, which describes the sequencing of the
gene which codes for the protein. The proposed system can be
evaluated by its ability to extract the elements of the manually
curated reference list from the literature, and rank the
publications in order of the relevance which the manually
annotated scopes imply.</p>
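        <p>One possible way to quantify this evaluation is sketched below, using recall and average precision of a ranked list of publication identifiers against the curated reference list of an entry. The choice of these two measures is an assumption made for illustration. The sketch treats every curated reference as equally relevant and does not model the graded relevance implied by scopes.</p>
        <preformat><![CDATA[
```python
from typing import List, Set


def recall(ranked: List[str], gold: Set[str]) -> float:
    """Fraction of curated references that appear anywhere in the ranking."""
    if not gold:
        return 0.0
    return len(set(ranked) & gold) / len(gold)


def average_precision(ranked: List[str], gold: Set[str]) -> float:
    """Mean of precision values at the rank of each curated reference.

    Assumes `ranked` contains no duplicate identifiers. Curated
    references missing from the ranking contribute zero.
    """
    if not gold:
        return 0.0
    found = 0
    total = 0.0
    for rank, pub_id in enumerate(ranked, start=1):
        if pub_id in gold:
            found += 1
            total += found / rank
    return total / len(gold)
```
]]></preformat>
        <p>A graded measure could replace average precision if each scope were mapped to a relevance level, but such a mapping would itself have to be defined and justified.</p>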
      </sec>
      <sec id="sec-2-3">
        <title>C. Curation Interface</title>
        <p>Both of these components will form a new curation interface
to the HMMER web service. This will be available for use by
Pfam's curators, and by users who have identified potentially
new or improved Pfam families through their protein sequence
similarity search query.</p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>ACKNOWLEDGEMENT</title>
      <p>We would like to thank the anonymous reviewers for their
helpful comments.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <surname>Sonnhammer</surname>
            ,
            <given-names>E. L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Eddy</surname>
            ,
            <given-names>S. R.</given-names>
          </string-name>
          , &amp;
          <string-name>
            <surname>Durbin</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          (
          <year>1997</year>
          ).
          <article-title>Pfam: a comprehensive database of protein domain families based on seed alignments</article-title>
          .
          <source>Proteins</source>
          ,
          <volume>28</volume>
          (
          <issue>3</issue>
          ),
          <fpage>405</fpage>
          -
          <lpage>20</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <surname>Finn</surname>
            ,
            <given-names>R. D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Coggill</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Eberhardt</surname>
            ,
            <given-names>R. Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Eddy</surname>
            ,
            <given-names>S. R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Mistry</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Mitchell</surname>
            ,
            <given-names>A. L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Potter</surname>
            ,
            <given-names>S.C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Punta</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Qureshi</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Sangrador-Vegas</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Salazar</surname>
            ,
            <given-names>G.A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Tate</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bateman</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          (
          <year>2016</year>
          ).
          <article-title>The Pfam protein families database: towards a more sustainable future</article-title>
          .
          <source>Nucleic Acids Research</source>
          ,
          <volume>44</volume>
          (
          <issue>Database issue</issue>
          ),
          <fpage>D279</fpage>
          -
          <lpage>85</lpage>
          . doi:
          <pub-id pub-id-type="doi">10.1093/nar/gkv1344</pub-id>
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <surname>Eddy</surname>
            ,
            <given-names>S. R.</given-names>
          </string-name>
          (
          <year>2011</year>
          ).
          <article-title>Accelerated Profile HMM Searches</article-title>
          .
          <source>PLoS Computational Biology</source>
          ,
          <volume>7</volume>
          (
          <issue>10</issue>
          ),
          <elocation-id>e1002195</elocation-id>
          . doi:
          <pub-id pub-id-type="doi">10.1371/journal.pcbi.1002195</pub-id>
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <surname>Finn</surname>
            ,
            <given-names>R. D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Clements</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          , &amp;
          <string-name>
            <surname>Eddy</surname>
            ,
            <given-names>S. R.</given-names>
          </string-name>
          (
          <year>2011</year>
          ).
          <article-title>HMMER web server: interactive sequence similarity searching</article-title>
          .
          <source>Nucleic Acids Research</source>
          ,
          <volume>39</volume>
          (
          <issue>Web Server issue</issue>
          ),
          <fpage>W29</fpage>
          -
          <lpage>37</lpage>
          . doi:
          <pub-id pub-id-type="doi">10.1093/nar/gkr367</pub-id>
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <surname>Punta</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Coggill</surname>
            ,
            <given-names>P. C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Eberhardt</surname>
            ,
            <given-names>R. Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Mistry</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Tate</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Boursnell</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Pang</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Forslund</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ceric</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Clements</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Heger</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Holm</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Sonnhammer</surname>
            ,
            <given-names>E.L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Eddy</surname>
            ,
            <given-names>S.R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bateman</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Finn</surname>
            ,
            <given-names>R. D.</given-names>
          </string-name>
          (
          <year>2012</year>
          ).
          <article-title>The Pfam protein families database</article-title>
          .
          <source>Nucleic Acids Research</source>
          ,
          <volume>40</volume>
          (Database issue),
          <fpage>D290</fpage>
          -
          <lpage>301</lpage>
          . doi:
          <pub-id pub-id-type="doi">10.1093/nar/gkr1065</pub-id>
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <surname>Gardner</surname>
            ,
            <given-names>P. P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Daub</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Tate</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Moore</surname>
            ,
            <given-names>B. L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Osuch</surname>
            ,
            <given-names>I. H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Griffiths-Jones</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Finn</surname>
            ,
            <given-names>R. D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Nawrocki</surname>
            ,
            <given-names>E.P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kolbe</surname>
            ,
            <given-names>D.L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Eddy</surname>
            ,
            <given-names>S.R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bateman</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          (
          <year>2011</year>
          ).
          <article-title>Rfam: Wikipedia, clans and the “decimal” release</article-title>
          .
          <source>Nucleic Acids Research</source>
          ,
          <volume>39</volume>
          (Database issue),
          <fpage>D141</fpage>
          -
          <lpage>5</lpage>
          . doi:
          <pub-id pub-id-type="doi">10.1093/nar/gkq1129</pub-id>
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <surname>Krallinger</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Valencia</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          , &amp;
          <string-name>
            <surname>Hirschman</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          (
          <year>2008</year>
          ).
          <article-title>Linking genes to literature: text mining, information extraction, and retrieval applications for biology</article-title>
          .
          <source>Genome Biology</source>
          ,
          <volume>9</volume>
          (
          <issue>Suppl 2</issue>
          ),
          <elocation-id>S8</elocation-id>
          . doi:
          <pub-id pub-id-type="doi">10.1186/gb-2008-9-s2-s8</pub-id>
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <surname>Jaroszewski</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Koska</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Sedova</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          , &amp;
          <string-name>
            <surname>Godzik</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          (
          <year>2014</year>
          ).
          <article-title>PubServer: literature searches by homology</article-title>
          .
          <source>Nucleic Acids Research</source>
          ,
          <volume>42</volume>
          (
          <issue>Web Server issue</issue>
          ),
          <fpage>W430</fpage>
          -
          <lpage>5</lpage>
          . doi:
          <pub-id pub-id-type="doi">10.1093/nar/gku450</pub-id>
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <collab>The UniProt Consortium</collab>
          .
          (
          <year>2014</year>
          ).
          <article-title>UniProt: a hub for protein information</article-title>
          .
          <source>Nucleic Acids Research</source>
          ,
          <volume>43</volume>
          (
          <issue>D1</issue>
          ),
          <fpage>D204</fpage>
          -
          <lpage>212</lpage>
          . doi:
          <pub-id pub-id-type="doi">10.1093/nar/gku989</pub-id>
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <surname>Leaman</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          , &amp;
          <string-name>
            <surname>Gonzalez</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          (
          <year>2008</year>
          ).
          <article-title>BANNER: an executable survey of advances in biomedical named entity recognition</article-title>
          .
          <source>Pacific Symposium on Biocomputing</source>
          ,
          <fpage>652</fpage>
          -
          <lpage>63</lpage>
          . PMID:
          <pub-id pub-id-type="pmid">18229723</pub-id>
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <surname>Cucerzan</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          (
          <year>2007</year>
          ).
          <article-title>Large-Scale Named Entity Disambiguation Based on Wikipedia Data</article-title>
          .
          <source>In Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning (EMNLP-CoNLL)</source>
          (pp.
          <fpage>708</fpage>
          -
          <lpage>716</lpage>
          ). Prague, Czech Republic: Association for Computational Linguistics.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>