<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Sequence Clustering Methods and Completeness of Biological Database Search</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Qingyu Chen</string-name>
          <email>qingyuc1@student.unimelb.edu.au</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Xiuzhen Zhang</string-name>
          <email>xiuzhen.zhang@rmit.edu.au</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Yu Wan</string-name>
          <email>wanyuac@gmail.com</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Justin Zobel</string-name>
          <email>jzobel@unimelb.edu.au</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Karin Verspoor</string-name>
          <email>karin.verspoor@unimelb.edu.au</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>RMIT University</institution>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>The University of Melbourne</institution>
        </aff>
      </contrib-group>
      <abstract>
        <p>Sequence clustering methods have been widely used to facilitate sequence database search. These methods convert a sequence database into clusters of similar sequences. Users then search against the resulting non-redundant database, which typically comprises one representative sequence per cluster, and expand search results by exploring records from matching clusters. Compared to direct search of original databases, the search results are expected to be more diverse and also more complete. While several studies have assessed diversity, completeness has not gained the same attention. We analysed the BLAST results on non-redundant versions of the UniProtKB/Swiss-Prot database generated by the clustering method CD-HIT. Our findings are that (1) a more rigorous assessment of completeness is necessary, as an expanded set can have so many answers that Recall is uninformative; and (2) the Precision of expanded sets on top-ranked representatives drops by 7%. We propose a simple solution that returns a user-specified proportion of top similar records, modelled by a ranking function that aggregates sequence and annotation similarities. It removes millions of returned sequences, increases Precision by 3%, and does not need additional processing time.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>Biological sequence databases accumulate a wide variety of
observations of biological sequences and provide access to
a massive number of sequence records submitted from
individual labs [Baxevanis and Bateman, 2015]. Their primary
use is sequence database search, in which:
database users prepare query sequences such as
uncharacterised proteins; perform sequence similarity search of a
query sequence against deposited database records, often via
BLAST [Altschul et al., 1990]; and judge the output, that is,
a ranked list of retrieved sequence records.</p>
      <p>A key challenge for database search is redundancy, as
database records contain very similar or even identical
sequences [Bursteinas et al., 2016]. Redundancy has two
immediate impacts on database search: the top-ranked retrieved
sequences can be highly similar, and may not be
independently informative (such as shown in Figure 1(a)); and it
makes it difficult to find potentially interesting sequences that
are distantly similar. A possible solution is to remove
redundant records. However, the notion of redundancy is
context-dependent; removed records may be redundant in some
contexts but important in others [Chen et al., 2017].</p>
      <p>Machine learning techniques are often used to solve
biological problems. In this case clustering methods have been
widely applied [Fu et al., 2012]. These cluster a sequence
database at a user-defined sequence identity threshold,
creating a non-redundant database. Users search against the
non-redundant database and expand search results by
exploring records from the same clusters. It is thus expected that
the search results will be more diverse, as retrieved
representatives may be distantly similar. The results are also expected
to be more complete: the expanded search results should be similar
enough to those of direct search of the original database that potentially
interesting records will still be found. Existing studies
measured search effectiveness primarily from the perspective of
diversity [Fu et al., 2012; Chen et al., 2016a], but, largely,
have not examined completeness. An exception is a study that
measured completeness but did not address user behaviour or
satisfaction [Suzek et al., 2015].</p>
      <p>We study search completeness in more depth by
analysing BLAST results on non-redundant versions of the
UniProtKB/Swiss-Prot database. We find that a more rigorous
assessment of completeness is necessary; for example, an expanded
set brings 40 million more query-target pairs, making Recall
uninformative. Moreover, Precision of expanded sets on
top-ranked representatives drops by 7%. We propose a simple
solution that returns a user-specified proportion of top
similar records, modelled by a ranking function that aggregates
sequence and annotation similarities. It removes millions of
returned query-target pairs, increases Precision by 3%, and
does not need additional processing time.</p>
    </sec>
    <sec id="sec-2">
      <title>Sequence clustering methods</title>
      <p>Clustering is an unsupervised machine learning technique
that groups records based on a similarity function. It has
wide applications in bioinformatics, such as creating
non-redundant databases [Mirdita et al., 2016] and classifying
sequence records into Operational Taxonomic Units [Chen et
al., 2013]. Here we explain how CD-HIT, a widely used
clustering method, generates non-redundant databases. From an
input sequence database and a user-defined sequence
identity threshold, it constructs a non-redundant database in three
steps [Fu et al., 2012]: (1) Sequences are sorted by decreasing
length. The longest sequence is by default the representative
of the first cluster. (2) The remaining sequences are processed
in order. Each is compared with the existing cluster representatives.
If its sequence identity with some representative is no less than the
user-defined threshold, it is assigned to that cluster; if there is
no satisfactory representative, it becomes a new cluster
representative. (3) Two outputs are generated: the representatives
and the complete clusters. These comprise the non-redundant
database. As sequence databases are often large, greedy
procedures and heuristics are used to speed up clustering. For
example, a sequence is assigned to a cluster
immediately, as soon as its sequence identity with that cluster's
representative satisfies the threshold, without checking whether a
later representative would match better.</p>
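      <p>The three steps above can be sketched as follows. This is a minimal illustration under stated assumptions, not CD-HIT's implementation: the function names are hypothetical, and <monospace>toy_identity</monospace> is a stand-in similarity based on Python's <monospace>difflib</monospace>, whereas CD-HIT computes identity from alignments with short-word filtering.</p>
      <preformat>
```python
from difflib import SequenceMatcher


def toy_identity(a, b):
    """Stand-in similarity in [0, 1]; NOT CD-HIT's alignment-based identity."""
    return SequenceMatcher(None, a, b, autojunk=False).ratio()


def greedy_cluster(sequences, threshold, identity=toy_identity):
    """Greedy incremental clustering in the style described in the text.

    Returns a list of clusters; the first element of each cluster is its
    representative.
    """
    clusters = []
    # Step 1: process sequences from longest to shortest.
    for seq in sorted(sequences, key=len, reverse=True):
        for cluster in clusters:
            representative = cluster[0]
            # Step 2: join the first cluster whose representative is
            # similar enough (greedy: later clusters are not checked).
            if identity(seq, representative) >= threshold:
                cluster.append(seq)
                break
        else:
            # No satisfactory representative: seq starts a new cluster.
            clusters.append([seq])
    return clusters


def representatives(clusters):
    """Step 3: the representatives form the non-redundant database."""
    return [cluster[0] for cluster in clusters]
```
      </preformat>
      <p>Calling <monospace>greedy_cluster(records, 0.5)</monospace> on a list of sequence strings returns the complete clusters, and <monospace>representatives</monospace> extracts the entries that would be searched first. Because assignment stops at the first satisfactory representative, the result depends on processing order.</p>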
      <p>Sequence search on non-redundant databases consists of
two steps. Users first search query sequences against the
non-redundant database only, as shown in Figure 1(b). The
retrieved records are effectively a ranked list of representatives
in the non-redundant database. This step aims for diversity.
Users then expand search results by looking at the complete
clusters, that is, retrieved representatives and the associated
member records, as shown in Figure 1(c). This step focuses
on completeness.</p>
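      <p>The two-step search can be sketched as below. This is only an illustration: <monospace>toy_score</monospace> (a character-set overlap) is a hypothetical stand-in for BLAST scoring, and <monospace>cluster_members</monospace> is an assumed mapping from each representative to its other cluster members.</p>
      <preformat>
```python
def toy_score(query, record):
    """Hypothetical stand-in for a BLAST score: overlap of character sets."""
    q, r = set(query), set(record)
    return len(q.intersection(r)) / len(q.union(r))


def search_representatives(query, representatives, min_score, score=toy_score):
    """Step 1: return representatives ranked by score (aims for diversity)."""
    hits = [(score(query, rep), rep) for rep in representatives]
    kept = [hit for hit in hits if hit[0] >= min_score]
    ranked = sorted(kept, key=lambda hit: hit[0], reverse=True)
    return [rep for _, rep in ranked]


def expand(ranked_representatives, cluster_members):
    """Step 2: add the members of each retrieved cluster (aims for completeness)."""
    expanded = []
    for rep in ranked_representatives:
        expanded.append(rep)
        expanded.extend(cluster_members.get(rep, []))
    return expanded
```
      </preformat>
      <p>The list returned by <monospace>search_representatives</monospace> corresponds to the ranked representatives, and <monospace>expand</monospace> yields the expanded set, whose members carry no ranking of their own within a cluster.</p>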
    </sec>
    <sec id="sec-3">
      <title>Measurement of search effectiveness</title>
      <p>To quantify whether clustering methods indeed achieve both
diverse and complete search results, search effectiveness on
the non-redundant databases has been measured. Many
studies focus on diversity; for example, the remaining redundancy
between representatives in CD-HIT has been considered [Fu
et al., 2012] and a recent study found that this remaining
redundancy is higher as the identity threshold is reduced [Chen
et al., 2016a]. Completeness has been overlooked, despite its
value to users as indicated by several studies:</p>
      <p>Suzek et al. constructed UniRef databases using
CD-HIT at different thresholds [Suzek et al., 2015]. They
measured diversity of representatives in a case study of
determining remote protein family relationships and
measured the completeness of the expanded set in a case
study of searching sequences against UniProtKB.</p>
      <p>Mirdita et al. constructed Uniclust databases using a
similar clustering procedure to that of CD-HIT [Mirdita
et al., 2016]. They assessed cluster consistency by
measuring Gene Ontology (GO) annotation similarity and
protein-name similarity to ensure that users obtain
consistent views when expanding search results.</p>
      <p>Cole et al. created a protein sequence structure
prediction website that searches user-submitted sequences
against UniRef and selects the top retrieved
representatives based on e-values [Cole et al., 2008].</p>
      <p>Remita et al. searched against UniRef for miRNAs
regulating glutathione S-transferases and expanded the
results from the associated UniRef clusters to obtain
alignment information, Gene Ontology (GO) annotations,
and expression details to ensure they did not miss any
other related data [Remita et al., 2016].</p>
      <p>The first two examples directly show that database staff
care about diversity and completeness when creating
non-redundant databases; the last two further illustrate that
database users in practice may use only representatives for
diversity or expand search results for completeness. There
are many further instances [Capriotti et al., 2012; Sato et
al., 2011; Liew et al., 2016]. These examples demonstrate
that both diversity and completeness are critical and the
associated assessments are necessary. When UniRef staff
measured search completeness, they used all-against-all BLAST
search results on UniProtKB as a gold standard [Suzek et al.,
2015]. Then they evaluated the overall Precision and Recall
of the expanded set (Formulas 1 and 5): Precision
quantifies whether expanded records are identified as relevant in the
gold standard and Recall quantifies whether the results in the
gold standard can be found in the expanded set. UniRef is one
of the best known clustered protein databases. The
measurement shows that assessing search completeness is of value.</p>
      <p>However, this measurement of completeness has
limitations. A major limitation is that neither database user behaviour nor
user satisfaction is examined. Given a query, the adopted
overall Precision measures all the records in the expanded
set. However, users may only examine retrieved
representatives without expanding the search results [Sato et al., 2011].
Also, they may only examine the top-ranked representatives
and expand the associated search results [Remita et al., 2016].
Measuring only overall Precision on an expanded set fails to
reflect this behaviour. The proposed metrics should reflect
user satisfaction [Moffat et al., 2013].</p>
      <p>The adopted measure of Recall also has failings. It has
been a long-term concern that Recall may not be
effective for information retrieval measurement [Zobel, 1998;
Webber, 2010; Walters, 2016]. In this case the Recall might
be higher if the expanded set has more records than the gold
standard. But this means users will have to browse more
results. Also, users may only examine and expand the top
retrieved representatives, so the associated expanded set will
always be a small subset of the complete search results. Recall
is not applicable in those cases. We propose a more
comprehensive approach below.</p>
    </sec>
    <sec id="sec-4">
      <title>Data and Methods</title>
      <p><bold>Dataset, tools, and experiments.</bold>
We used the full-size UniProtKB/Swiss-Prot Release 2016-15
as our experimental dataset. It consists of 551,193 protein
sequence records. CD-HIT (4.6.5) was used to construct
the associated non-redundant UniProtKB/Swiss-Prot; NCBI
BLAST (2.3.0+) was used to perform all-against-all searches.</p>
      <p>CD-HIT by default removes sequences of length no greater
than 10 since such short sequences are generally not
informative. We removed those records correspondingly in full-size
UniProtKB/Swiss-Prot. The updated dataset has 550,047
sequences. We used them as queries and performed BLAST
searches on the updated UniProtKB/Swiss-Prot and its
non-redundant version at 50% threshold generated by CD-HIT.
The non-redundant database at 50% consists of 120,043
sequences. 547,476 out of 550,047 query sequences have at
least one retrieved sequence in both databases. The BLAST
results are commonly called query-target pairs or hits. We
removed two types of query-target pairs: those where the target is
the query itself; and those where the same sequence is retrieved more than
once for a query. BLAST performs local alignment, so it is
reasonable that multiple regions of a sequence are similar to the
query sequence. However, repeated query-target pairs would
bias the statistical analysis.</p>
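      <p>The two filtering rules can be sketched as below, assuming the BLAST output has already been parsed into (query identifier, target identifier) tuples in rank order. The function name and the identifiers are hypothetical; the original processing scripts are not given in the text.</p>
      <preformat>
```python
def filter_pairs(pairs):
    """Drop self-hits and repeated targets, keeping the first hit per pair."""
    seen = set()
    kept = []
    for query_id, target_id in pairs:
        if query_id == target_id:
            continue  # the target is the query itself
        if (query_id, target_id) in seen:
            continue  # same target retrieved again via another local alignment
        seen.add((query_id, target_id))
        kept.append((query_id, target_id))
    return kept
```
      </preformat>
      <p>Keeping the first occurrence preserves the best-ranked alignment for each query-target pair when the input is in rank order.</p>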
      <p>The commands for running CD-HIT<sup>1</sup> and BLAST<sup>2</sup> strictly
follow user guidance. NCBI BLAST staff (personal
communication via email) advised on the maximum number of
output sequences, to ensure sensible results. Note also that
this study focuses on general uses of the tools, while, for
instance, UniRef and Uniclust may use different parameters to
construct non-redundant databases for specific purposes.</p>
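      <p><sup>1</sup> <monospace>./cd-hit -i input_path -o output_path -c 0.5 -n 2</monospace>, where <monospace>-i</monospace> and <monospace>-o</monospace>
stand for the input and output paths, <monospace>-c</monospace> stands for the identity threshold, and <monospace>-n</monospace>
specifies the word size recommended in the user guide.</p>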
      <p><sup>2</sup> <monospace>./blastp -task blastp -query query_path -db database_path -max_target_seqs 100000</monospace>, where <monospace>blastp</monospace> specifies protein sequence search,
<monospace>-query</monospace> and <monospace>-db</monospace> specify the query and database paths, and <monospace>-max_target_seqs</monospace>
is the maximum number of returned sequences for a query.</p>
      <p>We measured the search effectiveness on the non-redundant
data set as follows. Given a query Q, let F be the list of
fetched (retrieved) representatives from the non-redundant
database, E its expanded set, and R the set of relevant
sequences. Here, F is a ranked list, consisting of
representatives ordered by BLAST scores, whereas E contains
representatives and the associated cluster members, which may
not have a particular order. R in this case stands for all the
sequences fetched for Q from the original
UniProtKB/Swiss-Prot, which serve as the gold standard. Each sequence, either in F or E,
is scored by a function S: 0 if it is not in R, 1 otherwise.
We compared the number of query-target pairs in F, E, and
R. This examines how many retrieved results
users need to browse in the non-redundant version compared
with the original database. We also employed standard
evaluation metrics from information retrieval, adapted specifically
for our study, as below.</p>
      <p>Since users may or may not expand the search results, we
measured Precision of both the representatives and the expanded set:
<disp-formula id="eq-1"><label>(1)</label><tex-math>\mathrm{Precision}(F) = \frac{|F \cap R|}{|F|}, \qquad \mathrm{Precision}(E) = \frac{|E \cap R|}{|E|}</tex-math></disp-formula></p>
      <p>Users may focus on top-ranked retrieved representatives and
expand only those. Overall Precision cannot capture such
cases. We therefore measured P@K, Precision at the top K
retrieved sequences. P@K for F measures the Precision at K
representatives, which is a standard metric used in
Information Retrieval evaluation [Webber, 2010]:
<disp-formula id="eq-2"><label>(2)</label><tex-math>P@K(F) = \frac{1}{K} \sum_{i=1}^{K} S(F_i)</tex-math></disp-formula></p>
      <p>P@K for E, however, is not straightforward. K in this
context refers to K clusters, which contain many more than K
records; thus it is not directly comparable. We propose two
P@K metrics for E, summarised in Formulas 3 and 4:
<disp-formula id="eq-3"><label>(3)</label><tex-math>P@K_{\mathrm{equal}}(E) = \sum_{i=1}^{K} \frac{1}{K\,|C_i|} \sum_{j=1}^{|C_i|} S(C_{i,j})</tex-math></disp-formula>
<disp-formula id="eq-4"><label>(4)</label><tex-math>P@K_{\mathrm{weight}}(E) = \sum_{i=1}^{K} \frac{|C_i|}{\sum_{i'=1}^{K} |C_{i'}|} \cdot \frac{1}{|C_i|} \sum_{j=1}^{|C_i|} S(C_{i,j})</tex-math></disp-formula>
In these formulas, C<sub>i</sub>, |C<sub>i</sub>|, and C<sub>i,j</sub> are an expanded cluster, the expanded
cluster size, and a sequence in the expanded cluster,
respectively. The idea is to transform the score of a sequence
relative to the cluster size; for example, the score of a sequence in
a cluster of 10 records will be 1/10. The former formula treats
every cluster equally, that is, each has weight 1/K. The latter weights clusters
such that larger clusters have higher weights.</p>
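      <p>As a sketch of how Formulas 1 to 4 might be computed for a single query, the functions below take the ranked representatives (or ranked expanded clusters) and the relevant set R. The function names are hypothetical, clusters are assumed to be non-empty lists, and, as a further assumption, when fewer than K items were retrieved the sums are normalised by the number actually available rather than by K.</p>
      <preformat>
```python
def score(record, relevant):
    """S: 1 if the record is in the relevant set R, 0 otherwise."""
    return 1 if record in relevant else 0


def precision(retrieved, relevant):
    """Formula 1: fraction of retrieved records that are relevant."""
    retrieved = set(retrieved)
    if not retrieved:
        return 0.0
    return len(retrieved.intersection(relevant)) / len(retrieved)


def p_at_k(ranked_representatives, relevant, k):
    """Formula 2: Precision over the top K representatives."""
    top = ranked_representatives[:k]
    if not top:
        return 0.0
    return sum(score(rep, relevant) for rep in top) / len(top)


def cluster_precision(cluster, relevant):
    """Mean score within one expanded cluster: each record counts 1/|C_i|."""
    return sum(score(member, relevant) for member in cluster) / len(cluster)


def p_at_k_equal(ranked_clusters, relevant, k):
    """Formula 3: each of the top K clusters has the same weight."""
    top = ranked_clusters[:k]
    if not top:
        return 0.0
    return sum(cluster_precision(c, relevant) for c in top) / len(top)


def p_at_k_weight(ranked_clusters, relevant, k):
    """Formula 4: clusters weighted by their share of the expanded records."""
    top = ranked_clusters[:k]
    total = sum(len(c) for c in top)
    if total == 0:
        return 0.0
    return sum((len(c) / total) * cluster_precision(c, relevant) for c in top)
```
      </preformat>
      <p>For two clusters of sizes 2 and 4 with 2 and 1 relevant records respectively, the equal-weight variant averages 1.0 and 0.25 to give 0.625, while the size-weighted variant gives 0.5, which is the pooled fraction of relevant records (3 of 6) across the two clusters.</p>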
      <p>We also measured Recall and Jaccard similarity to assess
whether E is (near) identical to R. Recall was used in the
previous study [Suzek et al., 2015]. However, it may be biased if an expanded set has
more hits than the original search. Jaccard similarity is thus used
as a complementary metric because it can better illustrate the
differences between two sets of results. Note that these two
metrics are not applicable to F, since F is intended to
retrieve only a subset of the complete results.</p>
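      <p><disp-formula id="eq-5"><label>(5)</label><tex-math>\mathrm{Recall}(E) = \frac{|E \cap R|}{|R|}, \qquad \mathrm{Jaccard}(E) = \frac{|E \cap R|}{|E \cup R|}</tex-math></disp-formula></p>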
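      <p>A small sketch of Formula 5 illustrates why the two measures can disagree. The function names are hypothetical, and returning 1.0 for the Jaccard similarity of two empty sets is a convention chosen here, not something stated in the text.</p>
      <preformat>
```python
def recall(expanded, relevant):
    """Recall(E): fraction of the gold standard found in the expanded set."""
    expanded, relevant = set(expanded), set(relevant)
    if not relevant:
        return 0.0
    return len(expanded.intersection(relevant)) / len(relevant)


def jaccard(expanded, relevant):
    """Jaccard(E): overlap of E and R relative to their union."""
    expanded, relevant = set(expanded), set(relevant)
    union = expanded.union(relevant)
    if not union:
        return 1.0  # convention chosen here for two empty sets
    return len(expanded.intersection(relevant)) / len(union)


# Toy example: E contains all of R plus as many extra records again.
gold = {"r1", "r2", "r3", "r4"}
expanded_set = gold.union({"x1", "x2", "x3", "x4"})
print(recall(expanded_set, gold), jaccard(expanded_set, gold))  # prints 1.0 0.5
```
      </preformat>
      <p>In this toy case Recall is perfect although half of the expanded set is absent from the gold standard; Jaccard similarity penalises the extra records.</p>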
    </sec>
    <sec id="sec-5">
      <title>Results and Discussion</title>
      <p>Our experiments on the number of query-target pairs in
the clustered non-redundant data as compared with the original
database demonstrate that Recall is over-estimated, and in turn
not informative, because the expanded set has even more
query-target pairs than the original dataset. Figure 2(a)
compares the number of query-target pairs. The retrieved pairs
among representatives include only about 15% of the pairs
from the original dataset. On the one hand this indicates that
users can browse the search results more efficiently. On the
other hand it shows that expansion of results is valuable since
potentially interesting records may be in the other 85%.
However, the expanded set produces 40,095,619 more pairs than
the original. Figure 2(b) further shows that the expanded set
produces more pairs on over 89% of queries (492,129 out of
547,476), and on average produces about 10 pairs per query
(Figure 2(c)). Having more pairs results in high Recall. Both
median and mean Recall (Figure 2(d)) are above 90%, but
this comes at the cost of producing over 40 million more pairs.
Jaccard similarity by comparison is almost 20% lower than
Recall, which clearly shows the results of the expanded set
are not similar to those of the original database.</p>
      <p>In addition, the Precision of the expanded set distinctly
degrades at top-ranked hits. Table 1 shows different levels of
Precision on representatives and the expanded sets. We
assessed both measures at depths 10, 20, 50, 100, and 200
to quantify the Precision of the top-ranked hits that
are more likely to be examined by users. In general, top-ranked
hits from representatives are valuable: Precision is over 96%
across different K. The Precision of the expanded set, either
P @Kequal or P @Kweight, is always lower than that of
representatives, with degradation of up to 7% at K = 200. It
may be argued that, for a representative, if its relevance is 1,
the relevance of the associated expanded set will almost always be
lower, since each record in the expanded set would also have
to be relevant. Conversely, the relevance of the expanded set
is likely to be higher if the relevance of the representative is 0,
since a single relevant record will improve on this.</p>
      <p>We further compared Precision in detail on an individual
query level, as summarised in Figure 3. The Precision of
representatives at the top K positions is higher than that of the
expanded sets for at least 80% of the queries; the proportion
increases as K grows.</p>
      <p>Driven by these observations, we propose a simple
solution that ranks records in terms of their similarity with cluster
representatives and returns only the top X%, a user-defined
proportion, when users expand search results. To our
knowledge, existing databases such as UniRef select
representatives based on whether a record is reviewed by biocurators,
is from a model organism, and other such record-external
factors. They do not compare and rank the similarity between
records. Also, they expand all the records in a cluster rather
than choosing only a subset.</p>
      <p>[Table 1, fragment recovered from the source layout: Representatives; 0.968; 0.958; 0.958, 0.966; 0.959, 0.967.]</p>
      <p>In our proposal, the notion of similarity between a record
and its cluster representative is modelled based on sequence
identity and annotation similarity. This similarity function
is shown in Formula 6, where R and M refer to a
representative and an associated cluster member record (R here is
distinct from the relevant set R used above). Sim<sub>seq</sub>
and Sim<sub>annotation</sub> stand for their sequence identity and
annotation similarity respectively. Annotations are based on record
metadata, such as GO terms, literature references and
descriptions. Sequence identity is arguably the dominant feature, but
existing studies for other tasks demonstrate that combining
sequence identity and metadata similarity is valuable [Chen
et al., 2016b]. α and β refer to their corresponding weights;
for example, sequence identity accounts for 80% of the
aggregated similarity and annotation similarity accounts for
the other 20% when α is 0.8 and β is 0.2.</p>
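      <p><disp-formula id="eq-6"><label>(6)</label><tex-math>\mathrm{Sim}(R, M) = \alpha \, \mathrm{Sim}_{\mathrm{seq}}(R, M) + \beta \, \mathrm{Sim}_{\mathrm{annotation}}(R, M)</tex-math></disp-formula>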
The records in each cluster are thus ranked by this similarity
function in descending order. The top-ranked X% records,
with X specified by a user, will be presented when the user
expands search results. The ranked model can be adjusted
by both database staff and database users. On the one hand,
database staff can customise the ranking function, such as
adjusting weights and selecting different types of annotations,
when creating non-redundant databases. On the other hand,
database users can select how many records to browse rather
than seeing all records when expanding search results.</p>
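      <p>The ranking and top-X% selection could look like the sketch below. It assumes each cluster member comes with precomputed sequence identity and annotation similarity to its representative, both scaled to the range 0 to 1. The function names, the parameter names <monospace>alpha</monospace> and <monospace>beta</monospace> for the two weights, and the choice to round the number of returned records up are assumptions made for illustration.</p>
      <preformat>
```python
import math


def aggregate_similarity(seq_identity, annotation_similarity, alpha=0.8, beta=0.2):
    """Formula 6: weighted sum of sequence identity and annotation similarity."""
    return alpha * seq_identity + beta * annotation_similarity


def top_proportion(members, proportion, alpha=0.8, beta=0.2):
    """Return the ids of the top proportion of members, most similar first.

    members: list of (record_id, seq_identity, annotation_similarity) tuples,
    each measured against the cluster representative.
    proportion: fraction of members to return, e.g. 0.2 for the top 20%.
    """
    ranked = sorted(
        members,
        key=lambda m: aggregate_similarity(m[1], m[2], alpha, beta),
        reverse=True,
    )
    # Assumption: round up, so any non-zero proportion of a non-empty
    # cluster returns at least one member. round() guards against
    # floating-point noise such as 0.3 * 10 == 3.0000000000000004.
    keep = math.ceil(round(proportion * len(ranked), 9))
    return [record_id for record_id, _, _ in ranked[:keep]]
```
      </preformat>
      <p>With weights of 0.8 and 0.2, a member with identity 0.6 and annotation similarity 1.0 (aggregate 0.68) ranks above a member with identity 0.7 and annotation similarity 0.0 (aggregate 0.56); with weights of 1.0 and 0.0 the order is reversed.</p>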
      <p>In this study, we used the sequence identity reported by
CD-HIT, and Molecular Function (MF) GO term similarity as the
annotation similarity. MF GO terms are extracted from the
UniProt-GOA dataset [Courtot et al., 2015] and the
similarity is calculated using the well-known LinAVG metric [Lin,
1998]. We applied the ranking function with two sets of
weights: the first is α = 100% and β = 0%, i.e., ranking
based only on sequence identity, whereas the second is α =
80% and β = 20%. We then measured returned proportions of
20%, 30%, 50%, 70%, and 80% to reflect what
proportion users want to expand. RA(seq, annotation, proportion),
used in Figure 4, shows the values of α, β, and the returned
proportion, respectively.</p>
      <p>Table 1 compares detailed P @K measures for the ranked
model with the original unranked expanded set. The ranked
model always has higher Precision across different ratios and
values of K. Figure 3 shows that for over 85% of queries,
Precision is higher in representatives than in the expanded set.
The ranked model decreases this dramatically, to about 35%,
showing that the ranked model has the potential to maintain
Precision over expanded search results. Results in Figure 4
further confirm the findings. Figure 4(b) illustrates that
user-defined proportions can significantly reduce the
number of expanded query-target pairs: even the highest
proportion, 80%, has about 50 million fewer query-target pairs
than the full expanded set, and its median and mean
Precision are higher than those of the full expanded set (shown in
Figure 4(a)). In practice, users can therefore browse
many fewer results. This shows the plausibility of our
solution and also demonstrates that metadata is effective in the
context of sequence search. Another advantage of our
solution is that it does not require additional time in sequence
searching: CD-HIT by default reports the identities between
representatives and members, and MF GO term similarities can
be pre-computed.</p>
      <p>A limitation of the approach is that it has lower Recall and
Jaccard similarity than the full expanded set (shown in
Figure 4(c,d)). However, it is our view that the number of
expanded query-target pairs and Precision measures are more
critical to user satisfaction. For instance, a proportion of 20%
produces around 200 million fewer query-target pairs and has
2% higher P@K and mean Precision. Users may already find
enough interesting results from the expanded 20% results.</p>
    </sec>
    <sec id="sec-6">
      <title>Conclusion</title>
      <p>We have analysed the search effectiveness of sequence
clustering from the perspective of completeness. The detailed
assessment results illustrate that the Precision of representatives
is high, but that expansion of search results can degrade
Precision and reduce user satisfaction by producing large numbers
of additional hits. We proposed a simple solution that ranks
records in terms of sequence identity and annotation
similarity. The comparative results show that it has the potential to
bring more precise results while still providing users with
expanded results.</p>
    </sec>
    <sec id="sec-7">
      <title>Acknowledgments</title>
      <p>We appreciate the advice of the NCBI BLAST team on
BLAST-related commands and parameters. Qingyu Chen’s
work is supported by a Melbourne International Research
Scholarship from the University of Melbourne. The project
receives funding from the Australian Research Council
through a Discovery Project grant, DP150101550.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [Altschul et al.,
          <year>1990</year>
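          ]
          <string-name>
            <given-names>Stephen F</given-names>
            <surname>Altschul</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Warren</given-names>
            <surname>Gish</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Webb</given-names>
            <surname>Miller</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Eugene W</given-names>
            <surname>Myers</surname>
          </string-name>
          , and
          <string-name>
            <given-names>David J</given-names>
            <surname>Lipman</surname>
          </string-name>
          .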
          <article-title>Basic local alignment search tool</article-title>
          .
          <source>Journal of molecular biology</source>
          ,
          <volume>215</volume>
          (
          <issue>3</issue>
          ):
          <fpage>403</fpage>
          -
          <lpage>410</lpage>
          ,
          <year>1990</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [Baxevanis and Bateman, 2015]
          <string-name>
            <given-names>Andreas D</given-names>
            <surname>Baxevanis</surname>
          </string-name>
          and
          <string-name>
            <given-names>Alex</given-names>
            <surname>Bateman</surname>
          </string-name>
          .
          <article-title>The importance of biological databases in biological discovery</article-title>
          .
          <source>Current protocols in bioinformatics</source>
          , pages
          <fpage>1</fpage>
          -
          <lpage>1</lpage>
          ,
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [Bursteinas et al.,
          <year>2016</year>
          ]
          <string-name>
            <given-names>Borisas</given-names>
            <surname>Bursteinas</surname>
          </string-name>
          , Ramona Britto, Benoit Bely, Andrea Auchincloss, Catherine Rivoire, Nicole Redaschi,
          <string-name>
            <given-names>Claire</given-names>
            <surname>O'Donovan</surname>
          </string-name>
          , and Maria Jesus Martin.
          <article-title>Minimizing proteome redundancy in the UniProt Knowledgebase</article-title>
          .
          <source>Database: The Journal of Biological Databases and Curation</source>
          ,
          <year>2016</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [Capriotti et al.,
          <year>2012</year>
          ]
          <string-name>
            <given-names>Emidio</given-names>
            <surname>Capriotti</surname>
          </string-name>
          , Nathan L Nehrt, Maricel G Kann, and Yana Bromberg.
          <article-title>Bioinformatics for personal genome interpretation</article-title>
          .
          <source>Briefings in bioinformatics</source>
          ,
          <volume>13</volume>
          (
          <issue>4</issue>
          ):
          <fpage>495</fpage>
          -
          <lpage>512</lpage>
          ,
          <year>2012</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [Chen et al.,
          <year>2013</year>
          ]
          <string-name>
            <given-names>Wei</given-names>
            <surname>Chen</surname>
          </string-name>
          , Clarence K Zhang, Yongmei Cheng, Shaowu Zhang, and
          <string-name>
            <given-names>Hongyu</given-names>
            <surname>Zhao</surname>
          </string-name>
          .
          <article-title>A comparison of methods for clustering 16S rRNA sequences into OTUs</article-title>
          .
          <source>PloS one</source>
          ,
          <volume>8</volume>
          (
          <issue>8</issue>
          ):e70837,
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [Chen et al., 2016a]
          <string-name>
            <given-names>Qingyu</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Yu</given-names>
            <surname>Wan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Yang</given-names>
            <surname>Lei</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Justin</given-names>
            <surname>Zobel</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Karin</given-names>
            <surname>Verspoor</surname>
          </string-name>
          .
          <article-title>Evaluation of CD-HIT for constructing non-redundant databases</article-title>
          .
          In
          <source>Bioinformatics and Biomedicine (BIBM), 2016 IEEE International Conference on</source>
          , pages
          <fpage>703</fpage>
          -
          <lpage>706</lpage>
          . IEEE,
          <year>2016</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [Chen et al., 2016b]
          <string-name>
            <given-names>Qingyu</given-names>
            <surname>Chen</surname>
          </string-name>
          , Justin Zobel, Xiuzhen Zhang, and
          <string-name>
            <given-names>Karin</given-names>
            <surname>Verspoor</surname>
          </string-name>
          .
          <article-title>Supervised learning for detection of duplicates in genomic sequence databases</article-title>
          .
          <source>PloS one</source>
          ,
          <volume>11</volume>
          (
          <issue>8</issue>
          ):e0159644,
          <year>2016</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [Chen et al.,
          <year>2017</year>
          ]
          <string-name>
            <given-names>Qingyu</given-names>
            <surname>Chen</surname>
          </string-name>
          , Justin Zobel, and
          <string-name>
            <given-names>Karin</given-names>
            <surname>Verspoor</surname>
          </string-name>
          .
          <article-title>Duplicates, redundancies and inconsistencies in the primary nucleotide databases: a descriptive study</article-title>
          .
          <source>Database: The Journal of Biological Databases and Curation</source>
          ,
          <volume>2017</volume>
          (
          <issue>1</issue>
          )
          ,
          <year>2017</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [Cole et al.,
          <year>2008</year>
          ]
          <string-name>
            <given-names>Christian</given-names>
            <surname>Cole</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Jonathan D</given-names>
            <surname>Barber</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Geoffrey J</given-names>
            <surname>Barton</surname>
          </string-name>
          .
          <article-title>The Jpred 3 secondary structure prediction server</article-title>
          .
          <source>Nucleic acids research</source>
          ,
          <volume>36</volume>
          (
          <issue>suppl 2</issue>
          ):
          <fpage>W197</fpage>
          -
          <lpage>W201</lpage>
          ,
          <year>2008</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [Courtot et al.,
          <year>2015</year>
          ] Mélanie Courtot, Aleksandra Shypitsyna, Elena Speretta, Alexander Holmes, Tony Sawford, Tony Wardell, Maria Jesus Martin, and
          <string-name>
            <given-names>Claire</given-names>
            <surname>O'Donovan</surname>
          </string-name>
          .
          <article-title>UniProt-GOA: A central resource for data integration and GO annotation</article-title>
          .
          In
          <source>SWAT4LS</source>
          , pages
          <fpage>227</fpage>
          -
          <lpage>228</lpage>
          ,
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [Fu et al.,
          <year>2012</year>
          ]
          <string-name>
            <given-names>Limin</given-names>
            <surname>Fu</surname>
          </string-name>
          , Beifang Niu, Zhengwei Zhu,
          <string-name>
            <given-names>Sitao</given-names>
            <surname>Wu</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Weizhong</given-names>
            <surname>Li</surname>
          </string-name>
          .
          <article-title>CD-HIT: accelerated for clustering the next-generation sequencing data</article-title>
          .
          <source>Bioinformatics</source>
          ,
          <volume>28</volume>
          (
          <issue>23</issue>
          ):
          <fpage>3150</fpage>
          -
          <lpage>3152</lpage>
          ,
          <year>2012</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [Liew et al.,
          <year>2016</year>
          ]
          <string-name>
            <given-names>Yi Jin</given-names>
            <surname>Liew</surname>
          </string-name>
          , Taewoo Ryu, Manuel Aranda, and Timothy Ravasi.
          <article-title>miRNA repertoires of demosponges Stylissa carteri and Xestospongia testudinaria</article-title>
          .
          <source>PLOS ONE</source>
          ,
          <volume>11</volume>
          (
          <issue>2</issue>
          ):e0149080,
          <year>2016</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [Lin,
          <year>1998</year>
          ]
          <string-name>
            <given-names>Dekang</given-names>
            <surname>Lin</surname>
          </string-name>
          .
          <article-title>An information-theoretic definition of similarity</article-title>
          . In
          <source>ICML</source>
          , volume
          <volume>98</volume>
          , pages
          <fpage>296</fpage>
          -
          <lpage>304</lpage>
          ,
          <year>1998</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [Mirdita et al.,
          <year>2016</year>
          ]
          <string-name>
            <given-names>Milot</given-names>
            <surname>Mirdita</surname>
          </string-name>
          , Lars von den Driesch, Clovis Galiez, Maria J Martin, Johannes Söding, and Martin Steinegger
          .
          <article-title>Uniclust databases of clustered and deeply annotated protein sequences and alignments</article-title>
          .
          <source>Nucleic acids research</source>
          ,
          <volume>45</volume>
          (
          <issue>D1</issue>
          ):
          <fpage>D170</fpage>
          -
          <lpage>D176</lpage>
          ,
          <year>2016</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [Moffat et al.,
          <year>2013</year>
          ]
          <string-name>
            <given-names>Alistair</given-names>
            <surname>Moffat</surname>
          </string-name>
          , Paul Thomas, and
          <string-name>
            <given-names>Falk</given-names>
            <surname>Scholer</surname>
          </string-name>
          .
          <article-title>Users versus models: What observation tells us about effectiveness metrics</article-title>
          .
          In
          <source>Proceedings of the 22nd ACM international conference on Information &amp; Knowledge Management</source>
          , pages
          <fpage>659</fpage>
          -
          <lpage>668</lpage>
          . ACM,
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [Remita et al.,
          <year>2016</year>
          ]
          <string-name>
            <given-names>Mohamed Amine</given-names>
            <surname>Remita</surname>
          </string-name>
          , Etienne Lord, Zahra Agharbaoui, Mickael Leclercq, Mohamed A Badawi, Fathey Sarhan, and Abdoulaye Baniré Diallo.
          <article-title>A novel comprehensive wheat miRNA database, including related bioinformatics software</article-title>
          .
          <source>Current Plant Biology</source>
          ,
          <volume>7</volume>
          :
          <fpage>31</fpage>
          -
          <lpage>33</lpage>
          ,
          <year>2016</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [Sato et al.,
          <year>2011</year>
          ]
          <string-name>
            <given-names>Shusei</given-names>
            <surname>Sato</surname>
          </string-name>
          , Hideki Hirakawa, Sachiko Isobe, Eigo Fukai, Akiko Watanabe, Midori Kato, Kumiko Kawashima, Chiharu Minami, Akiko Muraki,
          <string-name>
            <given-names>Naomi</given-names>
            <surname>Nakazaki</surname>
          </string-name>
          , et al.
          <article-title>Sequence analysis of the genome of an oil-bearing tree, Jatropha curcas L</article-title>
          .
          <source>DNA research</source>
          ,
          <volume>18</volume>
          (
          <issue>1</issue>
          ):
          <fpage>65</fpage>
          -
          <lpage>76</lpage>
          ,
          <year>2011</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [Suzek et al.,
          <year>2015</year>
          ] Baris E Suzek,
          <string-name>
            <given-names>Yuqi</given-names>
            <surname>Wang</surname>
          </string-name>
          , Hongzhan Huang,
          <string-name>
            <given-names>Peter B</given-names>
            <surname>McGarvey</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Cathy H</given-names>
            <surname>Wu</surname>
          </string-name>
          .
          <article-title>UniRef clusters: a comprehensive and scalable alternative for improving sequence similarity searches</article-title>
          .
          <source>Bioinformatics</source>
          ,
          <volume>31</volume>
          (
          <issue>6</issue>
          ):
          <fpage>926</fpage>
          -
          <lpage>932</lpage>
          ,
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          [Walters,
          <year>2016</year>
          ] William H Walters.
          <article-title>Beyond use statistics: Recall, precision, and relevance in the assessment and management of academic libraries</article-title>
          .
          <source>Journal of Librarianship and Information Science</source>
          ,
          <volume>48</volume>
          (
          <issue>4</issue>
          ):
          <fpage>340</fpage>
          -
          <lpage>352</lpage>
          ,
          <year>2016</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          [Webber,
          <year>2010</year>
          ]
          <string-name>
            <given-names>William Edward</given-names>
            <surname>Webber</surname>
          </string-name>
          .
          <article-title>Measurement in information retrieval evaluation</article-title>
          .
          <source>PhD thesis</source>
          ,
          <year>2010</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          [Zobel,
          <year>1998</year>
          ]
          <string-name>
            <given-names>Justin</given-names>
            <surname>Zobel</surname>
          </string-name>
          .
          <article-title>How reliable are the results of large-scale information retrieval experiments?</article-title>
          In
          <source>Proceedings of the 21st annual international ACM SIGIR conference on Research and development in information retrieval</source>
          , pages
          <fpage>307</fpage>
          -
          <lpage>314</lpage>
          . ACM,
          <year>1998</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>