=Paper= {{Paper |id=Vol-1747/BT104_ICBO2016 |storemode=property |title=Crowdsourcing Protein Family Database Curation |pdfUrl=https://ceur-ws.org/Vol-1747/BT104_ICBO2016.pdf |volume=Vol-1747 |authors=Matt Jeffryes,Maria Liakata,Alex Bateman |dblpUrl=https://dblp.org/rec/conf/icbo/JeffryesLB16 }} ==Crowdsourcing Protein Family Database Curation == https://ceur-ws.org/Vol-1747/BT104_ICBO2016.pdf
     Crowdsourcing Protein Family Database Curation

Matt Jeffryes¹, Maria Liakata², Alex Bateman¹
¹ European Bioinformatics Institute (EMBL-EBI), Cambridge, United Kingdom, {mjj, agb}@ebi.ac.uk
² University of Warwick, Coventry, United Kingdom, m.liakata@warwick.ac.uk


Abstract— We propose a novel method for crowdsourcing a protein family database. We discuss how we intend to identify novel groupings of proteins from user sequence similarity search, and how text mining will be applied to assist in annotation of these novel groupings, and more broadly as an enrichment of protein sequence similarity search results. We intend to use entity linking to identify literature which discusses proteins found in the search results, and present those publications which are likely to be the most useful to curators and sequence similarity search users alongside the sequence search results.

Keywords—crowdsourcing; biocuration; databases

I. INTRODUCTION

Protein families are groupings of proteins which have an evolutionary relationship. Pfam is a database of protein families, which has been maintained since 1996 [1]. During this time, Pfam has been a human-curated database. Each protein family in the database is defined by an alignment which has been constructed by curation staff. Using this approach, Pfam has reached coverage of 47% of the residues in the protein sequence database upon which it is based [2]. Curators use the software package HMMER to construct Pfam families. For each family, a number of exemplar ‘seed sequences’ are aligned, and HMMER is used to produce a sequence profile hidden Markov model (HMM). This HMM is then used to query pfamseq, the protein sequence database upon which Pfam is based. The regions in the database which are significant matches for the HMM are the members of the protein family [1].

HMMER is also a valuable tool for discovering families. It can perform sensitive sequence similarity searches against a target protein sequence database, and is able to find homologous proteins distantly related to the query sequence [3]. When used in this way, HMMER builds a sequence profile HMM de novo from the query sequence, and retrieves matching sequences from the sequence database. HMMER can either be run at the command line against a local database, or via the HMMER web service [4]. In addition to the sequence profile HMM which defines a family, Pfam curators annotate families with literature citations which give further evidence of the existence of the family, and with cross-references to other databases, via InterPro.

Since 2011, Pfam has used Wikipedia as the primary repository for family annotation. If an article already exists, curators link it to the family, and the contents of the article are mirrored on the protein family's web page. For families which do not have a Wikipedia article, users are encouraged to create a new article [5]. This method of annotation was pioneered by Pfam's sister database Rfam and has resulted in higher quality annotation. As an additional benefit, crowdsourcing annotation has increased the visibility of the databases to the scientific community [6].

Constructing a new Pfam entry requires that a curator searches the scientific literature for evidence of the existence of the new protein family, and for the possible function and structure of the members of the family. The curator’s aim is to identify literature which mentions a member of the family, and the most useful papers will be those which go into detail about the protein’s structure or function.

Identifying literature which mentions a particular protein is more complex than searching the literature for the protein’s name. Biologists may use several different names and abbreviations for the same protein. Additionally, the same name may be shared across several species [7].

Previously, PubServer combined protein sequence similarity search using PSI-BLAST with literature retrieval [8]. Our approach differs in two ways. Firstly, PubServer retrieves only publications which are already attached to protein database entries, whereas our approach is to search the open-access subset of PubMed Central for relevant mentions of proteins. Secondly, our approach uses article full text rather than being limited to title, abstract and MeSH terms.

Our aim is to introduce crowdsourcing to the construction of the families, and to facilitate easier annotation of families by the application of text mining.

II. PROPOSED SYSTEM

A. Crowdsourcing Profile HMMs

HMMER can be used to identify new protein families. This could happen intentionally, when a Pfam curator makes additions to the database, but it can also occur incidentally, when a sequence similarity search user's query sequence happens to produce an HMM which matches against sequence regions which are as yet not matched by any existing Pfam family. It can also occur that a user's search matches a very high proportion of the sequences matched by an existing Pfam family, and further additional sequences which are likely to be homologous with the Pfam family. In this case, the HMM produced by the user's search is a potential improvement over the HMM which currently defines the family.

Presently, such potential improvements to Pfam will only be incorporated if the user identifies that they have found a
grouping of proteins unknown to Pfam, and makes the effort to contact Pfam's helpdesk. We propose that users should be alerted to this situation and prompted to submit their search's sequence profile HMM to Pfam.

We believe that this has two advantages over an alternative system which would silently capture HMMs from sequence similarity search. Firstly, the user who identifies the new or improved family gets credit for their improvement, which we consider to be both an ethical necessity, and an opportunity for increasing community engagement. Secondly, we hypothesise that the user who performs a sequence similarity search which identifies a novel protein family is more likely to have the knowledge required to annotate that family with relevant literature or other metadata than Pfam curators are, resulting in higher quality annotation. This second advantage also applies over a potential alternative system which automatically performs sequence similarity search with random sequences which are not matched by any Pfam family. While this would likely produce many novel families, the curation workload to actually incorporate these families into Pfam would be high.

B. Crowdsourcing Annotation

Once a user has been made aware that their sequence similarity search represents a potential improvement to Pfam, we would like to ease the process of adding annotations to their new family. In particular, we would like to highlight literature which is relevant to their sequence similarity search.

Presently, when researching a protein found by a sequence similarity search, users could look at the literature linked to the protein in a protein database such as UniProt. However, the literature citations for an entry are not intended to be exhaustive, but a representative sample [9]. It is also hard to use this method to search for literature which mentions multiple proteins found in a search. Moreover, relevant sections within known papers are not available. Our proposed system will use named entity recognition to identify literature mentioning protein sequences matched by a HMMER sequence similarity search result. We will then seek to extract the passages from the text which are most relevant for annotation and use these to rank the relevance of publications. We hypothesise that information about protein function, structure, homology, and phylogeny will be the most useful. Literature which mentions multiple proteins found by a single sequence similarity search is also likely to be relevant.

We anticipate that literature search will be useful not only to curators and users constructing new protein families, but also to sequence similarity search users in general, as a tool for researching homologues of an uncharacterised protein sequence.

Our proposed method is to first use the BANNER named entity recogniser, trained on the BioCreative II gene mention data set, to identify all possible mentions of genes and proteins within the open-access subset of PubMed Central [10]. We will then train a ranking SVM classifier to identify the most relevant mentions for entities in the protein database UniProt. This method is based upon [11], but with a reversal in the direction of association: instead of identifying candidate entities for each possible mention in the text, we will identify candidate mentions for each entity within the knowledge base.

We intend to train and evaluate this method by using the references attached to each entry in the manually annotated subset of UniProt, SwissProt. Each publication in the reference list of a SwissProt entry has a ‘scope’, which describes the elements of the entry that the publication has been cited for. For example, an entry may have one reference with a scope of ‘nucleotide sequence’, which describes the sequencing of the gene which codes for the protein. The proposed system can be evaluated by its ability to extract the elements of the manually curated reference list from the literature, and rank the publications in order of the relevance which the manually annotated scopes imply.

C. Curation Interface

Both of these components will form a new curation interface to the HMMER web service. This will be available for use by Pfam's curators, and by users who have identified potentially new or improved Pfam families through their protein sequence similarity search query.

ACKNOWLEDGEMENT

We would like to thank the anonymous reviewers for their helpful comments.

REFERENCES

[1] Sonnhammer, E. L., Eddy, S. R., & Durbin, R. (1997). Pfam: a comprehensive database of protein domain families based on seed alignments. Proteins, 28(3), 405–20.
[2] Finn, R. D., Coggill, P., Eberhardt, R. Y., Eddy, S. R., Mistry, J., Mitchell, A. L., Potter, S. C., Punta, M., Qureshi, M., Sangrador-Vegas, A., Salazar, G. A., Tate, J., Bateman, A. (2016). The Pfam protein families database: towards a more sustainable future. Nucleic Acids Research, 44(Database issue), D279–85. doi:10.1093/nar/gkv1344
[3] Eddy, S. R. (2011). Accelerated Profile HMM Searches. PLoS Computational Biology, 7(10), e1002195. doi:10.1371/journal.pcbi.1002195
[4] Finn, R. D., Clements, J., & Eddy, S. R. (2011). HMMER web server: interactive sequence similarity searching. Nucleic Acids Research, 39(Web Server issue), W29–37. doi:10.1093/nar/gkr367
[5] Punta, M., Coggill, P. C., Eberhardt, R. Y., Mistry, J., Tate, J., Boursnell, C., Pang, N., Forslund, K., Ceric, G., Clements, J., Heger, A., Holm, L., Sonnhammer, E. L., Eddy, S. R., Bateman, A., Finn, R. D. (2012). The Pfam protein families database. Nucleic Acids Research, 40(Database issue), D290–301. doi:10.1093/nar/gkr1065
[6] Gardner, P. P., Daub, J., Tate, J., Moore, B. L., Osuch, I. H., Griffiths-Jones, S., Finn, R. D., Nawrocki, E. P., Kolbe, D. L., Eddy, S. R., Bateman, A. (2011). Rfam: Wikipedia, clans and the “decimal” release. Nucleic Acids Research, 39(Database issue), D141–5. doi:10.1093/nar/gkq1129
[7] Krallinger, M., Valencia, A., & Hirschman, L. (2008). Linking genes to literature: text mining, information extraction, and retrieval applications for biology. Genome Biology, 9 Suppl 2(Suppl 2), S8. doi:10.1186/gb-2008-9-s2-s8
[8] Jaroszewski, L., Koska, L., Sedova, M., & Godzik, A. (2014). PubServer: literature searches by homology. Nucleic Acids Research, 42(Web Server issue), W430–5. doi:10.1093/nar/gku450
[9] The UniProt Consortium. (2014). UniProt: a hub for protein information. Nucleic Acids Research, 43(D1), D204–212. doi:10.1093/nar/gku989
[10] Leaman, R., & Gonzalez, G. (2008). BANNER: an executable survey of advances in biomedical named entity recognition. Pacific Symposium on Biocomputing, 652–63. PMID:18229723
[11] Cucerzan, S. (2007). Large-Scale Named Entity Disambiguation Based on Wikipedia Data. In Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning (EMNLP-CoNLL) (pp. 708–716). Prague, Czech Republic: Association for Computational Linguistics.
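
ILLUSTRATIVE SKETCH (SECTION II.A)

Section II.A distinguishes two situations in which a user's search could be offered to Pfam: the search matches only sequences that no existing family covers, or it recovers a very high proportion of an existing family and also finds additional sequences. The following Python sketch shows one way that decision might be expressed. It is not part of the system as described in the paper. The function name `classify_search`, the use of whole sequence identifiers instead of sequence regions, and the 0.9 threshold standing in for "a very high proportion" are all assumptions made for illustration.

```python
from typing import Dict, Set, Tuple


def classify_search(user_hits: Set[str],
                    pfam_families: Dict[str, Set[str]],
                    min_overlap: float = 0.9) -> Tuple[str, str]:
    """Label a search result as 'novel', 'improvement', or 'none'.

    user_hits     -- identifiers of sequences matched by the user's HMM
    pfam_families -- hypothetical mapping: family id -> member identifiers
    min_overlap   -- assumed cut-off for "a very high proportion"

    Returns (label, family_id); family_id is "" unless the label is
    'improvement'.
    """
    # Every sequence already covered by at least one existing family.
    covered = set().union(*pfam_families.values())

    # Case 1: none of the user's matches are covered by any family.
    if user_hits and not (user_hits & covered):
        return ("novel", "")

    # Case 2: the search recovers most of one family and adds sequences.
    for family_id, members in pfam_families.items():
        if not members:
            continue
        recovered = len(user_hits & members) / len(members)
        extra = user_hits - members
        if recovered >= min_overlap and extra:
            return ("improvement", family_id)

    return ("none", "")


if __name__ == "__main__":
    # Toy data; the identifiers are placeholders, not real entries.
    families = {"FAM_A": {"s1", "s2", "s3", "s4"}, "FAM_B": {"s8", "s9"}}
    print(classify_search({"q1", "q2"}, families))
    print(classify_search({"s1", "s2", "s3", "s4", "s5"}, families))
```

A real implementation would have to compare matched regions, not whole sequences, and would need some evidence that the additional sequences are homologous to the family, which this sketch does not attempt.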