<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>A De Novo Robust Clustering Approach for Amplicon-Based Sequence Data</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Alexandre Bazin</string-name>
          <email>contact@alexandrebazin.com</email>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Didier Debroas</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Engelbert Mephu Nguifo</string-name>
        </contrib>
      </contrib-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>
        Studying the structure of the communities in an
ecosystem is central in environmental microbiology
[
        <xref ref-type="bibr" rid="ref14 ref8">8, 14</xref>
        ]. The biosphere's diversity can be determined by
amplifying and sequencing speci c phylogenetic
markers (e.g. 16S rRNA). From there, these amplicons need
to be clusterized in "species" named Operational
Taxonomic Units (OTUs) [
        <xref ref-type="bibr" rid="ref11 ref15 ref4 ref9">4, 9, 11, 15</xref>
        ]. As the volume
of sequences has drastically increased in recent times,
new clustering tools have emerged to treat the data
in reasonable time. The currently used algorithms
are, from the point of view of algorithmic
complexity, the fastest available that do not produce random
results. However, due to their simplicity, the
reliability of the results are often discussed. These tools
being essentially black boxes, their sensitivity to the
sequence order, clustering threshold and structure of the
data makes it that the users have no way of knowing
whether better Operational Taxonomic Units (OTUs)
could have been obtained with di erent parameters or
even whether they correctly represent the data. In
these circumstances, there is no choice but to blindly
trust them.
      </p>
      <p>
        Distance-based greedy clustering algorithms such as
the ones implemented in OTUclust [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ], VSEARCH
[
        <xref ref-type="bibr" rid="ref13">13</xref>
        ], CD-HIT [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ] or USEARCH [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] all share the same
base naive algorithm. While more sophisticated
algorithms [
        <xref ref-type="bibr" rid="ref12 ref2 ref3 ref6 ref7">3, 6, 12, 7, 2</xref>
        ] could produce better results
quality-wise, their runtime would render them
unusable on millions of sequences. As the quality of the
OTUs is important, we have to nd a way to
improve it without increasing the runtime. The di
erent available implementations use a variety of
heuristics to counterbalance the simplicity of the algorithm
but, to the best of our knowledge, no approach has
tried to add a measure of uncertainty to the process.
This is why, in order to help increase the quality and
trustworthiness of the clustering, we propose to add
uncertainty to this simple algorithm through the use
of fuzzy clustering.
      </p>
      <p>uncertainty to
clus2</p>
    </sec>
    <sec id="sec-2">
      <title>Adding tering</title>
      <p>Distance-based greedy clustering algorithms, such as
the one in VSEARCH, produce a number of OTUs
and assign each sequence to one of them. The OTU
to which a sequence is said to belong to is usually the
rst one to be encountered that is su ciently close,
i.e. within the speci ed threshold. This makes a
sequence belong only to a single OTU and OTUs either
completely include or exclude a sequence. While these
would not be problems were the clustering optimal, the
need for fast algorithms gives rise to results that are
not always trustworthy. The OTUs being presented as
absolute, the end user has no choice, should consider
them correct and cannot know whether the algorithm
has encountered ambiguity. We believe that being less
strict in the way the OTUs partition sequences would
help produce better results from the end user's point
of view. To help increase the quality of the clustering
and maximize the information that can be gathered
from the data, we propose to add uncertainty to the
clustering by means of fuzzy sets.</p>
      <p>Using fuzzy OTUs allows us to discern the di erence
between sequences close to the OTU and sequences
extremely far. Using the parameters t1 and t2, we can
tune the \detection radius" around OTUs to gather
information that would normally be discarded by the
clustering algorithm.</p>
    </sec>
    <sec id="sec-3">
      <title>Evaluating fuzzy OTUs</title>
      <p>An ideal OTU would contain only sequences with a
membership value of 1, meaning a group of sequences
has been perfectly regrouped with a good threshold
and no sequence lies ambiguously on the border. More
realistically, a good OTU would contain many
sequences with high membership values and little
sequences with low values. A bad OTU with the
majority of its sequences having low membership values
could mean that the algorithm has chosen as a center
a sequence on the border of a group or, even worse,
between two distinct groups.</p>
      <p>We can quickly evaluate the quality of an OTU with
this repartition. If we suppose that each sequence
lowers the quality of the OTU depending on its
membership value, we can use the following formula :
Quality(OT U ) = 1</p>
      <p>Pi9=1 !i
# sequences with membership value i 0:1
# sequences in the OTU
with !i being the \cost" of having a sequence with
membership value i 0:1.</p>
      <p>A problem arises with singletons that always have
perfect quality but these can safely be treated separately.
A sequence can belong to multiple OTUs due to fuzzy
membership. However, in the end, we want each
sequence to be assigned to a single OTU. Hence, we
have to choose one of the possible OTUs. We have
two types of values left from the clustering process :
membership and quality. The rst one is based on the
distance between the OTU and the sequence and the
second one is used to recognize bad OTUs. Choosing
the OTU with the best membership value is akin to
running VSEARCH. Choosing the OTU with the best
quality tends to create bigger OTUs that absorb
distant sequences. To better compromise, we can use a
linear combination of both values :
quality +
membership
Increasing the importance of the quality reduces the
number of OTUs containing sequences. When is
low, the \best" OTUs quality-wise absorb very close
sequences that would have been attributed to other
OTUs. When gets too high, the best OTUs start
absorbing all the sequences around them, e ectively
acting like an increase of the distance threshold.
4
4.1</p>
    </sec>
    <sec id="sec-4">
      <title>Experimental Results</title>
      <sec id="sec-4-1">
        <title>Data</title>
        <p>We used our algorithm on a dataset containing 5977
sequences of length between 900 and 3081 for an
average of 1442 and taxonomies extracted from the SILVA
database. We used a threshold of 0.97 (97% similarity)
for determining new OTUs and a threshold of 0.95 for
fuzzy membership. For the choice of the OTU for each
sequence, we present the results of three strategies :
best quality ( = 1 and = 0), compromise ( = 0:5
and = 0:5) and distance ( = 0 and = 1). The
comparison with VSEARCH is done using identical
parameters when applicable.</p>
        <p>The program, dataset and corresponding taxonomy
are available on http://projets.isima.fr/sclust/
Expe.html.
4.2
To measure the e ects of introducing uncertainty to
the clustering, we consider the computation time,
memory usage, number of OTUs, singletons and pairs
and average distance in the taxonomy tree between
sequences in the same cluster.
4.3</p>
      </sec>
      <sec id="sec-4-2">
        <title>Analysis</title>
        <p>Results show that the choice strategy a ects every
metric relevant to the quality of the clustering :
number of OTUs, singletons and pairs, average misclassi
cation. The fuzzy approach uses slightly more memory
than VSEARCH but all choice strategies are similar
on this metric. When using the default {maxaccepts
and {maxrejects values, computation time is lower for
VSEARCH. However, when using higher values for
these parameters { and thus more precise clustering
the computation time is the same for both approaches.
We observe that increasing the importance of the
quality in the OTU choice strategy lowers the nal number
of OTUs. This is due to the fact that some OTUs are
initially created centered on isolated sequences near
good OTUs. That isolation lowers their quality and
the good OTUs absorb their sequences.</p>
        <p>Using the quality also lowers the number of
singletons and increases the number of pairs. This most
likely means that singletons were created close to
either good clusters or one another. The fuzzy approach
allows the algorithm to merge those sequences that
were slightly too far from the center with their
corresponding OTU. The increase in the number of pairs
appears to be due to the merging of singletons lying
too close to one another. The average taxonomy
distance in OTUs is shown to vary wildly. Using only
the quality to choose OTUs increases this number as
the \best" OTUs attract all the sequences in their
fuzzy surroundings. This causes some sequences
belonging to di erent species to be classi ed together.
However, using a compromise between quality and
distance lowers this metric as the best clusters only
absorb sequences that are su ciently close to them and
should probably be together while rejecting the
sequences that are too di erent.
acknowledgements
This work was supported by the European Union's
\Fonds Europeen de Developpement Regional
(FEDER)" program and the Auvergne-Rhone-Alpes
region.</p>
        <p>Method
Fuzzy (best quality)
Fuzzy (compromise)</p>
        <p>Fuzzy (distance)</p>
        <p>VSEARCH</p>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>Davide</given-names>
            <surname>Albanese</surname>
          </string-name>
          , Paolo Fontana, Carlotta De Filippo, Duccio Cavalieri, and
          <string-name>
            <given-names>Claudio</given-names>
            <surname>Donati</surname>
          </string-name>
          .
          <article-title>Micca: a complete and accurate software for taxonomic pro ling of metagenomic data</article-title>
          .
          <source>Scienti c reports</source>
          ,
          <volume>5</volume>
          :
          <fpage>9743</fpage>
          ,
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>Violaine</given-names>
            <surname>Antoine</surname>
          </string-name>
          , Benjamin Quost,
          <string-name>
            <surname>Marie-Helene Masson</surname>
          </string-name>
          , and Thierry Denoeux.
          <article-title>CECM: constrained evidential c-means algorithm</article-title>
          .
          <source>Computational Statistics &amp; Data Analysis</source>
          ,
          <volume>56</volume>
          (
          <issue>4</issue>
          ):
          <volume>894</volume>
          {
          <fpage>914</fpage>
          ,
          <year>2012</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>Violaine</given-names>
            <surname>Antoine</surname>
          </string-name>
          , Benjamin Quost,
          <string-name>
            <surname>Marie-Helene Masson</surname>
          </string-name>
          , and Thierry Denoeux.
          <article-title>CEVCLUS: evidential clustering with instance-level constraints for relational data</article-title>
          .
          <source>Soft Comput.</source>
          ,
          <volume>18</volume>
          (
          <issue>7</issue>
          ):
          <volume>1321</volume>
          {
          <fpage>1335</fpage>
          ,
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>Wei</given-names>
            <surname>Chen</surname>
          </string-name>
          , Clarence K Zhang, Yongmei Cheng, Shaowu Zhang, and
          <string-name>
            <given-names>Hongyu</given-names>
            <surname>Zhao</surname>
          </string-name>
          .
          <article-title>A comparison of methods for clustering 16s rrna sequences into otus</article-title>
          .
          <source>PloS one</source>
          ,
          <volume>8</volume>
          (
          <issue>8</issue>
          ):e70837,
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <surname>Robert</surname>
            <given-names>C</given-names>
          </string-name>
          <string-name>
            <surname>Edgar</surname>
          </string-name>
          .
          <article-title>Search and clustering orders of magnitude faster than blast</article-title>
          .
          <source>Bioinformatics</source>
          ,
          <volume>26</volume>
          (
          <issue>19</issue>
          ):
          <volume>2460</volume>
          {
          <fpage>2461</fpage>
          ,
          <year>2010</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>Isak</given-names>
            <surname>Gath and Amir B. Geva</surname>
          </string-name>
          .
          <article-title>Unsupervised optimal fuzzy clustering</article-title>
          .
          <source>IEEE Transactions on pattern analysis and machine intelligence</source>
          ,
          <volume>11</volume>
          (
          <issue>7</issue>
          ):
          <volume>773</volume>
          {
          <fpage>780</fpage>
          ,
          <year>1989</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>Sarra</given-names>
            <surname>Ben</surname>
          </string-name>
          <string-name>
            <surname>Hariz</surname>
          </string-name>
          , Zied Elouedi, and
          <string-name>
            <given-names>Khaled</given-names>
            <surname>Mellouli</surname>
          </string-name>
          .
          <article-title>Clustering approach using belief function theory</article-title>
          .
          <source>In International Conference on Arti cial Intelligence: Methodology, Systems, and Applications</source>
          , pages
          <volume>162</volume>
          {
          <fpage>171</fpage>
          . Springer,
          <year>2006</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>Mylene</given-names>
            <surname>Hugoni</surname>
          </string-name>
          , Najwa Taib, Didier Debroas, Isabelle Domaizon, Isabelle Jouan Dufournel, Gisele Bronner, Ian Salter, Helene Agogue, Isabelle Mary, and Pierre E Galand.
          <article-title>Structure of the rare archaeal biosphere and seasonal dynamics of active ecotypes in surface coastal waters</article-title>
          .
          <source>Proceedings of the National Academy of Sciences</source>
          ,
          <volume>110</volume>
          (
          <issue>15</issue>
          ):
          <volume>6004</volume>
          {
          <fpage>6009</fpage>
          ,
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>Weizhong</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Limin</given-names>
            <surname>Fu</surname>
          </string-name>
          , Beifang Niu,
          <string-name>
            <surname>Sitao Wu</surname>
            ,
            <given-names>and John</given-names>
          </string-name>
          <string-name>
            <surname>Wooley</surname>
          </string-name>
          .
          <article-title>Ultrafast clustering algorithms for metagenomic sequence analysis</article-title>
          .
          <source>Brie ngs in bioinformatics, page bbs035</source>
          ,
          <year>2012</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>Weizhong</given-names>
            <surname>Li</surname>
          </string-name>
          and
          <string-name>
            <given-names>Adam</given-names>
            <surname>Godzik</surname>
          </string-name>
          .
          <article-title>Cd-hit: a fast program for clustering and comparing large sets of protein or nucleotide sequences</article-title>
          .
          <source>Bioinformatics</source>
          ,
          <volume>22</volume>
          (
          <issue>13</issue>
          ):
          <volume>1658</volume>
          {
          <fpage>1659</fpage>
          ,
          <year>2006</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <surname>Frederic</surname>
            <given-names>Mahe</given-names>
          </string-name>
          , Torbj rn Rognes, Christopher Quince, Colomban de Vargas, and
          <string-name>
            <given-names>Micah</given-names>
            <surname>Dunthorn</surname>
          </string-name>
          .
          <article-title>Swarm: robust and fast clustering method for amplicon-based studies</article-title>
          .
          <source>PeerJ</source>
          ,
          <volume>2</volume>
          :
          <fpage>e593</fpage>
          ,
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>Airel</given-names>
            <surname>Perez-Suarez</surname>
          </string-name>
          , Jose F Mart nez
          <article-title>-Trinidad, Jesus A Carrasco-</article-title>
          <string-name>
            <surname>Ochoa</surname>
          </string-name>
          , and Jose E MedinaPagola.
          <article-title>Oclustr: A new graph-based algorithm for overlapping clustering</article-title>
          .
          <source>Neurocomputing</source>
          ,
          <volume>121</volume>
          :
          <fpage>234</fpage>
          {
          <fpage>247</fpage>
          ,
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <surname>Torbj</surname>
            <given-names>rn Rognes</given-names>
          </string-name>
          , Tomas Flouri, Ben Nichols, Christopher Quince, and
          <string-name>
            <given-names>Frederic</given-names>
            <surname>Mahe</surname>
          </string-name>
          .
          <article-title>Vsearch: a versatile open source tool for metagenomics</article-title>
          .
          <source>PeerJ</source>
          ,
          <volume>4</volume>
          :
          <fpage>e2584</fpage>
          ,
          <year>2016</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <surname>Simon</surname>
            <given-names>Roux</given-names>
          </string-name>
          , Michael Faubladier, Antoine Mahul, Nils Paulhe, Aurelien Bernard, Didier Debroas, and
          <string-name>
            <given-names>Francois</given-names>
            <surname>Enault</surname>
          </string-name>
          .
          <article-title>Metavir: a web server dedicated to virome analysis</article-title>
          .
          <source>Bioinformatics</source>
          ,
          <volume>27</volume>
          (
          <issue>21</issue>
          ):
          <volume>3074</volume>
          {
          <fpage>3075</fpage>
          ,
          <year>2011</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <surname>Sarah L Westcott and Patrick D Schloss</surname>
          </string-name>
          .
          <article-title>De novo clustering methods outperform referencebased methods for assigning 16s rrna gene sequences to operational taxonomic units</article-title>
          .
          <source>PeerJ</source>
          ,
          <volume>3</volume>
          :
          <fpage>e1487</fpage>
          ,
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>