<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>A De Novo Robust Clustering Approach for Amplicon-Based Sequence Data</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Alexandre BAZIN</string-name>
          <email>alexandre.bazin@isima.fr</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Didier DEBROAS</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Engelbert MEPHU NGUIFO</string-name>
          <email>mephu@isima.fr</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>University Clermont Auvergne</institution>
          ,
          <addr-line>CNRS, LIMOS, F-63000 CLERMONT-FERRAND</addr-line>
          ,
          <institution>University Clermont Auvergne</institution>
          ,
          <addr-line>CNRS, LMGE, F-63000 CLERMONT-FERRAND</addr-line>
          ,
          <country country="FR">FRANCE</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>When analyzing microbial communities, an active computational challenge concerns the categorization of 16S rRNA gene sequences into operational taxonomic units (OTUs). Established clustering tools use a one-pass algorithm in order to tackle high numbers of gene sequences and produce OTUs in reasonable time. However, all of the current tools are based on a crisp clustering approach, where a gene sequence is assigned to exactly one cluster. The weak quality of the output compared to more complex clustering algorithms forces the user to post-process the obtained OTUs. Providing a membership degree when assigning a gene sequence to an OTU will help the user during this post-processing task. Moreover, it is possible to use this membership degree to automatically evaluate the quality of the obtained OTUs. The goal of this work is therefore to propose a new clustering approach that takes uncertainty into account when producing OTUs, and improves both the quality and the presentation of the OTU results.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>-</title>
      <p>Studying the structure of the communities in an ecosystem is
central in environmental microbiology [Hugoni et al., 2013;
Roux et al., 2011]. The biosphere’s diversity can be
determined by amplifying and sequencing specific
phylogenetic markers (e.g. 16S rRNA). From there, these
amplicons need to be clustered into “species” named Operational
Taxonomic Units (OTUs) [Chen et al., 2013; Li et al., 2012;
Mahé et al., 2014; Westcott and Schloss, 2015]. As the
volume of sequences has drastically increased in recent times,
new clustering tools have emerged to treat the data in
reasonable time. The currently used algorithms are, from the
point of view of algorithmic complexity, the fastest available
that do not produce random results. However, due to their
simplicity, the reliability of their results is often questioned.
These tools are essentially black boxes, and their sensitivity to
the sequence order, clustering threshold and structure of the
data means that users have no way of knowing whether
better OTUs could have been
obtained with different parameters, or even whether the OTUs
correctly represent the data. In these circumstances, there is no
choice but to blindly trust them.</p>
      <p>Distance-based greedy (DBG) clustering algorithms such as the
ones implemented in OTUclust [Albanese et al., 2015],
VSEARCH [Rognes et al., 2016], CD-HIT [Li and Godzik,
2006] or USEARCH [Edgar, 2010] all share the same base
algorithm, shown in Algorithm 1.</p>
    </sec>
    <sec id="sec-2">
      <title>Algorithm 1: DBG Clustering principle</title>
    </sec>
    <sec id="sec-3">
      <title>Input: a set of sequences</title>
      <p>Output: a set of OTUs to which the sequences are assigned</p>
      <preformat>Clusters = ∅
foreach sequence S do
    foreach known cluster C do
        Compute distance(S, C)
    end
    if a suitable cluster exists then
        Assign S to it
    else
        Create a new cluster with S as the center
    end
end
Return Clusters</preformat>
      <sec id="sec-3-1">
        <title>-</title>
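        <p>The greedy principle of Algorithm 1 can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the implementation of any of the cited tools: the names greedy_cluster and mismatch_fraction are hypothetical, the “suitable cluster” test is assumed to be “distance to the center at most a threshold”, and the toy distance stands in for a real alignment-based distance.</p>
        <preformat preformat-type="code"><![CDATA[
```python
from typing import Callable, List, Sequence


def greedy_cluster(
    sequences: Sequence[str],
    distance: Callable[[str, str], float],
    threshold: float,
) -> List[List[str]]:
    """Assign each sequence to the first cluster whose center is close enough."""
    clusters: List[List[str]] = []  # clusters[k][0] is the center of cluster k
    for seq in sequences:
        for cluster in clusters:
            if distance(seq, cluster[0]) <= threshold:
                cluster.append(seq)  # the first suitable cluster wins
                break
        else:
            # No suitable cluster was found: seq becomes a new center.
            clusters.append([seq])
    return clusters


def mismatch_fraction(a: str, b: str) -> float:
    """Toy distance (assumption): fraction of differing positions, equal lengths."""
    return sum(x != y for x, y in zip(a, b)) / len(a)


if __name__ == "__main__":
    print(greedy_cluster(["AAAA", "AAAT", "GGGG"], mismatch_fraction, 0.25))
```
]]></preformat>
        <p>Because the inner loop stops at the first center within the threshold, the result depends on the order in which the sequences are read, which is the sensitivity discussed in the text.</p>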
        <p>While more sophisticated algorithms [Antoine et al., 2014;
Gath and Geva, 1989; Pérez-Suárez et al., 2013; Hariz et
al., 2006; Antoine et al., 2012] could produce better results
quality-wise, their runtime would render them unusable on
millions of sequences. As the quality of the OTUs is
important, we have to find a way to improve it without increasing
the runtime. The different available implementations use a
variety of heuristics to counterbalance the simplicity of the
algorithm but, to the best of our knowledge, no approach has
tried to add a measure of uncertainty to the process. This is
why, in order to help increase the quality and trustworthiness
of the clustering, we propose to add uncertainty to this simple
algorithm through the use of fuzzy clustering.</p>
      </sec>
      <sec id="sec-3-2">
        <title>Motivation</title>
        <sec id="sec-3-2-1">
          <title>Adding uncertainty to clustering</title>
          <p>Distance-based greedy clustering algorithms, such as the one
in VSEARCH, produce a number of OTUs and assign each
sequence to one of them. The OTU to which a sequence is
said to belong is usually the first one encountered
that is sufficiently close, i.e. within the specified threshold.
This creates two problems:</p>
          <list list-type="bullet">
            <list-item><p>A sequence can only belong to a single OTU</p></list-item>
            <list-item><p>An OTU either includes or does not include a sequence</p></list-item>
          </list>
        </sec>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>-</title>
      <p>Having a sequence associated to a single OTU is expected
as the ultimate output of the algorithm. For this reason,
algorithms can stop after finding the first OTU that is close enough
to a sequence, which speeds the computation up. However,
not considering all the OTUs a sequence could be assigned
to increases the sensitivity to the order (a weakness of these
algorithms) and reduces the quality of the clustering. Indeed,
what if two different OTUs are close enough? Giving priority
to the first generated OTU only creates a bias that no heuristic,
such as sorting the sequences, could hope to overcome.</p>
      <p>Moreover, by using strict thresholds, it is possible to have
two nearly identical sequences such that one belongs to a
particular OTU while the other does not. This strictness means
that an OTU partitions the set of sequences into two sets,
inside each of which sequences are considered the same regardless
of their distance to the center of the OTU. This lack of
distinction between sequences that are isolated and sequences on the
border of OTUs hides information that could help understand
the data.</p>
      <p>While these would not be problems were the clustering
optimal, the need for fast algorithms gives rise to results that are
not always trustworthy. Because the OTUs are presented as
absolute, the end user has no choice but to consider them correct
and cannot know whether the algorithm has encountered
ambiguity. We believe that being less strict in the way the OTUs
partition sequences would help produce better results from
the end user’s point of view.</p>
      <sec id="sec-4-1">
        <title>Fuzzy Clustering</title>
        <p>To help increase the quality of the clustering and maximize
the information that can be gathered from the data, we
propose to add uncertainty to the clustering by means of fuzzy
sets.</p>
        <p>We define a membership function f<sub>C</sub>(S) that, for an OTU
C, associates a membership value to a sequence S. Usually,
this value is either 0 or 1. Here, we propose to have f<sub>C</sub>(S)
take its value in {n/10 | n = 0, …, 10}. This value represents
the degree of membership and, as such, 1 means that the
sequence certainly belongs to the OTU while 0 means that the
sequence certainly does not belong to it. Other values
represent uncertainty and are used to express that the sequence
nearly belongs to the OTU. This membership value can
easily be computed from the distance between the sequence and
the center of the OTU using two thresholds t<sub>1</sub> and t<sub>2</sub> such
that t<sub>1</sub> ≤ t<sub>2</sub>. If the distance is less than the threshold t<sub>1</sub>, the
membership value is 1. If the distance is greater than t<sub>2</sub>, the
value is 0. If the distance is between t<sub>1</sub> and t<sub>2</sub>, the value
decreases gradually from 1 to 0 as the distance grows.</p>
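        <p>As an illustration, the two-threshold membership function could be written as follows. This is a minimal sketch: the text only says that the value changes gradually between the two thresholds, so the linear interpolation, the rounding to the nearest tenth and the handling of distances exactly equal to a threshold are assumptions, and the function name membership is hypothetical.</p>
        <preformat preformat-type="code"><![CDATA[
```python
def membership(distance: float, t1: float, t2: float) -> float:
    """Fuzzy membership in {0.0, 0.1, ..., 1.0} from a distance to an OTU center."""
    if not 0 <= t1 <= t2:
        raise ValueError("expected 0 <= t1 <= t2")
    if distance <= t1:
        return 1.0  # inside the strict threshold: certain member
    if distance >= t2:
        return 0.0  # beyond the fuzzy threshold: certainly not a member
    # Assumption: linear decrease from 1 (at t1) to 0 (at t2),
    # rounded to the nearest multiple of 0.1.
    fraction = (t2 - distance) / (t2 - t1)
    return round(fraction * 10) / 10


if __name__ == "__main__":
    for d in (0.5, 1.0, 1.2, 2.0, 2.9, 3.5):
        print(d, membership(d, t1=1.0, t2=3.0))
```
]]></preformat>
        <p>If the distance is taken as one minus the similarity (again an assumption), similarity thresholds of 0.97 and 0.95 would correspond to t1 = 0.03 and t2 = 0.05.</p>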
        <p>Using fuzzy OTUs allows us to discern the difference
between sequences close to the OTU and sequences extremely
far from it. Using the parameters t<sub>1</sub> and t<sub>2</sub>, we can tune the
“detection radius” around OTUs to gather information that would
normally be discarded by the clustering algorithm.</p>
        <sec id="sec-4-1-1">
          <title>Evaluating fuzzy OTUs</title>
          <p>Having a non-binary membership function produces OTUs
that partition the sequences into multiple sets. If we
consider only the sequences that belong (more or less) to an
OTU, the distribution of their membership values provides
information on the topology of the OTU. An ideal OTU would
contain only sequences with a membership value of 1,
meaning that a group of sequences has been perfectly grouped with
a good threshold and no sequence lies ambiguously on the
border. More realistically, a good OTU would contain many
sequences with high membership values and few sequences
with low values. A bad OTU with the majority of its
sequences having low membership values could mean that the
algorithm has chosen as a center a sequence on the border of
a group or, even worse, between two distinct groups.</p>
          <p>We can quickly evaluate the quality of an OTU with this
distribution. If we suppose that each sequence lowers the
quality of the OTU depending on its membership value, we can
use the following formula:</p>
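          <p><disp-formula><tex-math>\mathrm{Quality}(\mathrm{OTU}) = 1 - \sum_{i=1}^{9} \omega_i \times \frac{\#\,\text{sequences with membership value } i \times 0.1}{\#\,\text{sequences in the OTU}}</tex-math></disp-formula></p>
          <p>with ω<sub>i</sub> being the “cost” of having a sequence with
membership value i × 0.1. In our previous examples, and with the
following values of ω<sub>i</sub></p>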
          <p>we obtain qualities of 0.71 and 0.26 for OTU1 and OTU2
respectively, showing that OTU1 is better.</p>
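          <p>The quality score can be computed directly from the list of membership values of an OTU. The sketch below is illustrative only: the cost table ASSUMED_COSTS is a hypothetical linear choice and not the set of weights used for the figures quoted in the text, so it is not expected to reproduce them, and the function name otu_quality is an assumption as well.</p>
          <preformat preformat-type="code"><![CDATA[
```python
from typing import Dict, Iterable

# Hypothetical costs (assumption): key i stands for membership value i * 0.1,
# and a lower membership value costs more.
ASSUMED_COSTS: Dict[int, float] = {i: 1 - i / 10 for i in range(1, 10)}


def otu_quality(
    memberships: Iterable[float],
    costs: Dict[int, float] = ASSUMED_COSTS,
) -> float:
    """Quality = 1 - sum_i cost_i * (members at level i) / (number of members)."""
    levels = [round(m * 10) for m in memberships]  # 0.7 becomes level 7
    members = [lv for lv in levels if lv >= 1]  # membership 0 is not in the OTU
    if not members:
        raise ValueError("an OTU needs at least one member")
    # Level 10 (membership 1) has no entry in the table and costs nothing.
    penalty = sum(costs.get(lv, 0.0) for lv in members)
    return 1 - penalty / len(members)


if __name__ == "__main__":
    print(otu_quality([1.0, 1.0, 0.5, 0.5]))
    print(otu_quality([1.0]))
```
]]></preformat>
          <p>With any cost table, an OTU whose members all have membership 1 gets a quality of 1, which is why singletons always look perfect.</p>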
          <p>A problem arises with singletons, which always have perfect
quality, but these can safely be treated separately.</p>
        </sec>
        <sec id="sec-4-1-2">
          <title>Choosing an OTU</title>
          <p>A sequence can belong to multiple OTUs due to fuzzy
membership. However, in the end, we want each sequence to be
assigned to a single OTU. Hence, we have to choose one of
the possible OTUs. We have two types of values left from the
clustering process: membership and quality. The first one
is based on the distance between the OTU and the sequence
and the second one is used to recognize bad OTUs. Choosing
the OTU with the best membership value is akin to running
VSEARCH. Choosing the OTU with the best quality tends to
create bigger OTUs that absorb distant sequences. To reach a better
compromise, we can use a linear combination of both values, with
weights α and β:</p>
          <p><disp-formula><tex-math>\alpha \times \mathrm{quality} + \beta \times \mathrm{membership}</tex-math></disp-formula></p>
          <p>Increasing the importance of the quality reduces the
number of OTUs containing sequences. When the quality weight α is low, the “best”
OTUs quality-wise absorb very close sequences that would
have been attributed to other OTUs. When α gets too high,
the best OTUs start absorbing all the sequences around them,
effectively acting like an increase of the distance threshold.</p>
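          <p>The choice among candidate OTUs then reduces to maximizing a weighted score. The following sketch assumes that each candidate OTU of a sequence is described by a (quality, membership) pair; the function name choose_otu, the parameter names alpha and beta and the tie-breaking rule (the first candidate wins) are illustrative assumptions.</p>
          <preformat preformat-type="code"><![CDATA[
```python
from typing import Dict, Tuple


def choose_otu(
    candidates: Dict[str, Tuple[float, float]],
    alpha: float,
    beta: float,
) -> str:
    """Return the OTU id maximizing alpha * quality + beta * membership.

    `candidates` maps an OTU id to (quality, membership) for one sequence.
    """
    if not candidates:
        raise ValueError("the sequence has no candidate OTU")
    return max(
        candidates,
        key=lambda otu: alpha * candidates[otu][0] + beta * candidates[otu][1],
    )


if __name__ == "__main__":
    # Made-up values for a single sequence with two candidate OTUs.
    cands = {"otu_a": (0.9, 0.6), "otu_b": (0.4, 1.0)}
    print(choose_otu(cands, alpha=1.0, beta=0.0))  # quality only
    print(choose_otu(cands, alpha=0.0, beta=1.0))  # membership only
    print(choose_otu(cands, alpha=0.5, beta=0.5))  # compromise
```
]]></preformat>
          <p>Setting the quality weight to 0 recovers a purely distance-based choice, while setting the membership weight to 0 lets the best-quality OTU absorb every sequence in its fuzzy surroundings.</p>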
        </sec>
        <sec id="sec-4-1-3">
          <title>5 Identifying ambiguous sequences</title>
          <p>Distance-based greedy algorithms are good at clustering
objects that are easy to cluster. Groups of very similar sequences
that are different from the rest of the dataset are expected to
give rise to a new OTU, while isolated singletons should be
identified to be either removed or treated separately. A problem
arises when groups of sequences are close to each other but
not close enough to form a single OTU. In this case, and supposing
the algorithm ideally chooses the centers of the OTUs,
sequences can lie just between these OTUs. In the current
implementations, these ambiguous sequences that must be
assigned are usually put in OTUs of their own, increasing the
number of OTUs and reducing the overall quality of the
clustering.</p>
          <p>Using fuzzy clustering allows us to identify these
ambiguous sequences. Using the previously mentioned choice
strategy, they can be assigned to a good OTU even though they lie
slightly outside of the distance threshold. However, their
ambiguity may be significant for the user. It is thus
important to highlight their existence and the various fuzzy OTUs
they could have alternatively been assigned to.</p>
          <p>We used our algorithm on a dataset containing 5977
sequences with lengths between 900 and 3081 (1442 on
average) and taxonomies extracted from the SILVA database. We
used a threshold of 0.97 (97% similarity) for determining new
OTUs and a threshold of 0.95 for fuzzy membership. For the
choice of the OTU for each sequence, we present the results
of three strategies, where α weights the quality of the OTU and β
the membership value: best quality (α = 1 and β = 0),
compromise (α = 0.5 and β = 0.5) and distance (α = 0 and β = 1).
The comparison with VSEARCH is done using identical
parameters when applicable.</p>
          <p>The program, dataset and corresponding taxonomy are
available at http://projets.isima.fr/sclust/Expe.html.</p>
        </sec>
      </sec>
      <sec id="sec-4-2">
        <title>Relevant Metrics</title>
        <p>To measure the effects of introducing uncertainty to the
clustering, we consider the following metrics :
centered on isolated sequences near good OTUs. That
isolation lowers their quality and the good OTUs absorb their
sequences.</p>
        <p>The distance between two sequences in the taxonomy is
defined as the sum of the lengths of the paths from each
sequence’s classification to their nearest common ancestor.
For example, if one sequence is classified as
“bacteria;proteobacteria;betaproteobacteria” and the other as
“bacteria;proteobacteria;alphaproteobacteria”, their
distance is 2, as each of them is at distance 1 from their
nearest common ancestor “bacteria;proteobacteria”.</p>
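        <p>This taxonomy distance is straightforward to compute from two semicolon-separated lineage strings. The sketch below assumes that format (as in the example above) and uses a hypothetical function name, taxonomy_distance; it counts the ranks of each lineage that lie below the longest common prefix.</p>
        <preformat preformat-type="code"><![CDATA[
```python
def taxonomy_distance(taxonomy_a: str, taxonomy_b: str, sep: str = ";") -> int:
    """Sum of the path lengths from each lineage to the nearest common ancestor."""
    ranks_a = [r.strip() for r in taxonomy_a.split(sep) if r.strip()]
    ranks_b = [r.strip() for r in taxonomy_b.split(sep) if r.strip()]
    shared = 0  # length of the common prefix of the two lineages
    for rank_a, rank_b in zip(ranks_a, ranks_b):
        if rank_a != rank_b:
            break
        shared += 1
    return (len(ranks_a) - shared) + (len(ranks_b) - shared)


if __name__ == "__main__":
    print(taxonomy_distance(
        "bacteria;proteobacteria;betaproteobacteria",
        "bacteria;proteobacteria;alphaproteobacteria",
    ))
```
]]></preformat>
        <p>Two identical lineages are at distance 0, and a lineage is at distance 1 from its direct parent.</p>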
      </sec>
      <sec id="sec-4-3">
        <title>Results</title>
        <p>Table 3 shows the results obtained using the default
values for --maxaccepts and --maxrejects.</p>
        <p>Table 4 shows the results obtained using --maxaccepts 10000 and
--maxrejects 10000.</p>
      </sec>
      <sec id="sec-4-4">
        <title>Analysis</title>
        <p>Results show that the choice strategy affects every metric
relevant to the quality of the clustering: number of OTUs,
singletons and pairs, and average misclassification. The fuzzy
approach uses slightly more memory than VSEARCH but
all choice strategies are similar on this metric. When using
the default --maxaccepts and --maxrejects values, computation
time is lower for VSEARCH. However, when using higher
values for these parameters (and thus more precise
clustering), the computation time is the same for both approaches.</p>
        <p>Using the quality also lowers the number of singletons and
increases the number of pairs. This most likely means that
singletons were created close to either good clusters or one
another. The fuzzy approach allows the algorithm to merge
those sequences that were slightly too far from the center with
their corresponding OTU. The increase in the number of pairs
appears to be due to the merging of singletons lying too close
to one another.</p>
        <p>The average taxonomy distance in OTUs varies
widely. Using only the quality to choose OTUs increases this
number as the “best” OTUs attract all the sequences in their
fuzzy surroundings. This causes some sequences belonging
to different species to be classified together. However, using a
compromise between quality and distance lowers this metric,
as the best clusters only absorb sequences that are sufficiently
close to them and should probably be together, while rejecting
the sequences that are too different.</p>
        <sec id="sec-4-4-1">
          <title>Discussion</title>
          <p>The experimental results confirm that adding
uncertainty to the clustering helps improve the quality of the
output by reducing the number of singletons. Using fuzzy
clusters, we are able to extend the clustering threshold to
gather additional information on the OTUs’ surroundings
and use it to quickly assess their quality. This quality can
be used together with the distance to choose an OTU for each
sequence. The resulting output contains fewer singletons and
misclassifications. Being able to choose the weight of both
distance and quality allows for additional tuning.</p>
          <p>As previously mentioned, the fuzziness also makes it
possible to detect ambiguous sequences and clusters. In our
opinion, this is where further work is required. An ambiguous
sequence could be arbitrarily assigned to a nearby OTU,
become the center of its own OTU or even be considered as an
error and deleted but these operations imply such a
knowledge of the domain that interactions with the human user
become necessary. However, on datasets containing millions of
sequences, the number of alerts would render manual
treatment impractical or even impossible. Automating this
treatment would require being able to adapt to the type of data,
domain and preferences of the user. We suggest that machine
learning techniques be introduced in the process to
automatically learn how to handle these ambiguities.</p>
          <p>We observe that increasing the importance of the quality
in the OTU choice strategy lowers the final number of OTUs.
This is due to the fact that some OTUs are initially created
centered on isolated sequences near good OTUs. That
isolation lowers their quality and the good OTUs absorb their
sequences.</p>
        </sec>
        <sec id="sec-4-4-2">
          <title>Acknowledgements</title>
          <p>This work was supported by the European Union’s “Fonds
Européen de Développement Régional (FEDER)” program
and the Auvergne-Rhône-Alpes region.</p>
        </sec>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [Albanese et al.,
          <year>2015</year>
          ]
          <string-name>
            <given-names>Davide</given-names>
            <surname>Albanese</surname>
          </string-name>
          , Paolo Fontana, Carlotta De Filippo, Duccio Cavalieri, and
          <string-name>
            <given-names>Claudio</given-names>
            <surname>Donati</surname>
          </string-name>
          .
          <article-title>Micca: a complete and accurate software for taxonomic profiling of metagenomic data</article-title>
          .
          <source>Scientific reports</source>
          ,
          <volume>5</volume>
          :
          <fpage>9743</fpage>
          ,
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [Antoine et al.,
          <year>2012</year>
          ]
          <string-name>
            <given-names>Violaine</given-names>
            <surname>Antoine</surname>
          </string-name>
          , Benjamin Quost,
          Marie-Hélène Masson, and Thierry Denoeux.
          <article-title>CECM: constrained evidential c-means algorithm</article-title>
          .
          <source>Computational Statistics &amp; Data Analysis</source>
          ,
          <volume>56</volume>
          (
          <issue>4</issue>
          ):
          <fpage>894</fpage>
          -
          <lpage>914</lpage>
          ,
          <year>2012</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [Antoine et al.,
          <year>2014</year>
          ]
          <string-name>
            <given-names>Violaine</given-names>
            <surname>Antoine</surname>
          </string-name>
          , Benjamin Quost,
          Marie-Hélène Masson, and Thierry Denoeux.
          <article-title>CEVCLUS: evidential clustering with instance-level constraints for relational data</article-title>
          .
          <source>Soft Comput.</source>
          ,
          <volume>18</volume>
          (
          <issue>7</issue>
          ):
          <fpage>1321</fpage>
          -
          <lpage>1335</lpage>
          ,
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [Chen et al.,
          <year>2013</year>
          ]
          <string-name>
            <given-names>Wei</given-names>
            <surname>Chen</surname>
          </string-name>
          , Clarence K Zhang, Yongmei Cheng, Shaowu Zhang, and
          <string-name>
            <given-names>Hongyu</given-names>
            <surname>Zhao</surname>
          </string-name>
          .
          <article-title>A comparison of methods for clustering 16S rRNA sequences into OTUs</article-title>
          .
          <source>PloS one</source>
          ,
          <volume>8</volume>
          (
          <issue>8</issue>
          ):e70837,
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [Edgar, 2010] Robert C Edgar.
          <article-title>Search and clustering orders of magnitude faster than BLAST</article-title>
          .
          <source>Bioinformatics</source>
          ,
          <volume>26</volume>
          (
          <issue>19</issue>
          ):
          <fpage>2460</fpage>
          -
          <lpage>2461</lpage>
          ,
          <year>2010</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [Gath and Geva, 1989]
          <string-name>
            <given-names>Isak</given-names>
            <surname>Gath</surname>
          </string-name>
          and Amir B. Geva
          .
          <article-title>Unsupervised optimal fuzzy clustering</article-title>
          .
          <source>IEEE Transactions on pattern analysis and machine intelligence</source>
          ,
          <volume>11</volume>
          (
          <issue>7</issue>
          ):
          <fpage>773</fpage>
          -
          <lpage>780</lpage>
          ,
          <year>1989</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [Hariz et al.,
          <year>2006</year>
          ]
          <string-name>
            <given-names>Sarra</given-names>
            <surname>Ben Hariz</surname>
          </string-name>
          , Zied Elouedi, and
          <string-name>
            <given-names>Khaled</given-names>
            <surname>Mellouli</surname>
          </string-name>
          .
          <article-title>Clustering approach using belief function theory</article-title>
          .
          In
          <source>International Conference on Artificial Intelligence: Methodology, Systems, and Applications</source>
          , pages
          <fpage>162</fpage>
          -
          <lpage>171</lpage>
          . Springer,
          <year>2006</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [Hugoni et al.,
          <year>2013</year>
          ] Mylène Hugoni, Najwa Taib, Didier Debroas, Isabelle Domaizon, Isabelle Jouan Dufournel, Gisèle Bronner, Ian Salter, Hélène Agogué,
          <string-name>
            <given-names>Isabelle</given-names>
            <surname>Mary</surname>
          </string-name>
          , and Pierre E Galand.
          <article-title>Structure of the rare archaeal biosphere and seasonal dynamics of active ecotypes in surface coastal waters</article-title>
          .
          <source>Proceedings of the National Academy of Sciences</source>
          ,
          <volume>110</volume>
          (
          <issue>15</issue>
          ):
          <fpage>6004</fpage>
          -
          <lpage>6009</lpage>
          ,
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [Li and Godzik, 2006]
          <string-name>
            <given-names>Weizhong</given-names>
            <surname>Li</surname>
          </string-name>
          and
          <string-name>
            <given-names>Adam</given-names>
            <surname>Godzik</surname>
          </string-name>
          .
          <article-title>Cd-hit: a fast program for clustering and comparing large sets of protein or nucleotide sequences</article-title>
          .
          <source>Bioinformatics</source>
          ,
          <volume>22</volume>
          (
          <issue>13</issue>
          ):
          <fpage>1658</fpage>
          -
          <lpage>1659</lpage>
          ,
          <year>2006</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [Li et al.,
          <year>2012</year>
          ]
          <string-name>
            <given-names>Weizhong</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Limin</given-names>
            <surname>Fu</surname>
          </string-name>
          , Beifang Niu, Sitao Wu, and
          <string-name>
            <given-names>John</given-names>
            <surname>Wooley</surname>
          </string-name>
          .
          <article-title>Ultrafast clustering algorithms for metagenomic sequence analysis</article-title>
          .
          <source>Briefings in Bioinformatics</source>
          , page bbs035
          ,
          <year>2012</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [Mahé et al.,
          <year>2014</year>
          ] Frédéric Mahé,
          <string-name>
            <given-names>Torbjørn</given-names>
            <surname>Rognes</surname>
          </string-name>
          , Christopher Quince, Colomban de Vargas, and
          <string-name>
            <given-names>Micah</given-names>
            <surname>Dunthorn</surname>
          </string-name>
          .
          <article-title>Swarm: robust and fast clustering method for amplicon-based studies</article-title>
          .
          <source>PeerJ</source>
          ,
          <volume>2</volume>
          :
          <fpage>e593</fpage>
          ,
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [Pérez-Suárez et al.,
          <year>2013</year>
          ] Airel Pérez-Suárez, José F Martínez-Trinidad, Jesús A Carrasco-Ochoa, and José E Medina-Pagola.
          <article-title>Oclustr: A new graph-based algorithm for overlapping clustering</article-title>
          .
          <source>Neurocomputing</source>
          ,
          <volume>121</volume>
          :
          <fpage>234</fpage>
          -
          <lpage>247</lpage>
          ,
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [Rognes et al.,
          <year>2016</year>
          ]
          <string-name>
            <given-names>Torbjørn</given-names>
            <surname>Rognes</surname>
          </string-name>
          , Tomáš Flouri, Ben Nichols, Christopher Quince, and Frédéric Mahé.
          <article-title>Vsearch: a versatile open source tool for metagenomics</article-title>
          .
          <source>PeerJ</source>
          ,
          <volume>4</volume>
          :
          <fpage>e2584</fpage>
          ,
          <year>2016</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [Roux et al.,
          <year>2011</year>
          ]
          <string-name>
            <given-names>Simon</given-names>
            <surname>Roux</surname>
          </string-name>
          , Michaël Faubladier, Antoine Mahul, Nils Paulhe, Aurélien Bernard, Didier Debroas, and François Enault.
          <article-title>Metavir: a web server dedicated to virome analysis</article-title>
          .
          <source>Bioinformatics</source>
          ,
          <volume>27</volume>
          (
          <issue>21</issue>
          ):
          <fpage>3074</fpage>
          -
          <lpage>3075</lpage>
          ,
          <year>2011</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [Westcott and Schloss, 2015] Sarah L Westcott and
          <string-name>
            <given-names>Patrick D</given-names>
            <surname>Schloss</surname>
          </string-name>
          .
          <article-title>De novo clustering methods outperform reference-based methods for assigning 16S rRNA gene sequences to operational taxonomic units</article-title>
          .
          <source>PeerJ</source>
          ,
          <volume>3</volume>
          :
          <fpage>e1487</fpage>
          ,
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>