=Paper= {{Paper |id=Vol-2133/cnia-short2 |storemode=property |title=A de novo robust clustering approach for amplicon-based sequence data |pdfUrl=https://ceur-ws.org/Vol-2133/cnia-short2.pdf |volume=Vol-2133 |dblpUrl=https://dblp.org/rec/conf/rjcia/BazinDN18 }} ==A de novo robust clustering approach for amplicon-based sequence data== https://ceur-ws.org/Vol-2133/cnia-short2.pdf
       A De Novo Robust Clustering Approach for Amplicon-Based
                           Sequence Data

             Alexandre Bazin                 Didier Debroas            Engelbert Mephu Nguifo

                                     contact@alexandrebazin.com


1    Introduction                                           2    Adding uncertainty to clus-
                                                                 tering
Studying the structure of the communities in an             Distance-based greedy clustering algorithms, such as
ecosystem is central in environmental microbiology          the one in VSEARCH, produce a number of OTUs
[8, 14]. The biosphere’s diversity can be determined by     and assign each sequence to one of them. The OTU
amplifying and sequencing specific phylogenetic mark-       to which a sequence is said to belong to is usually the
ers (e.g. 16S rRNA). From there, these amplicons need       first one to be encountered that is sufficiently close,
to be clusterized in ”species” named Operational Tax-       i.e. within the specified threshold. This makes a se-
onomic Units (OTUs) [4, 9, 11, 15]. As the volume           quence belong only to a single OTU and OTUs either
of sequences has drastically increased in recent times,     completely include or exclude a sequence. While these
new clustering tools have emerged to treat the data         would not be problems were the clustering optimal, the
in reasonable time. The currently used algorithms           need for fast algorithms gives rise to results that are
are, from the point of view of algorithmic complex-         not always trustworthy. The OTUs being presented as
ity, the fastest available that do not produce random       absolute, the end user has no choice, should consider
results. However, due to their simplicity, the reliabil-    them correct and cannot know whether the algorithm
ity of the results are often discussed. These tools be-     has encountered ambiguity. We believe that being less
ing essentially black boxes, their sensitivity to the se-   strict in the way the OTUs partition sequences would
quence order, clustering threshold and structure of the     help produce better results from the end user’s point
data makes it that the users have no way of knowing         of view. To help increase the quality of the clustering
whether better Operational Taxonomic Units (OTUs)           and maximize the information that can be gathered
could have been obtained with different parameters or       from the data, we propose to add uncertainty to the
even whether they correctly represent the data. In          clustering by means of fuzzy sets.
these circumstances, there is no choice but to blindly
trust them.
                                                            Using fuzzy OTUs allows us to discern the difference
                                                            between sequences close to the OTU and sequences
                                                            extremely far. Using the parameters t1 and t2 , we can
Distance-based greedy clustering algorithms such as         tune the “detection radius” around OTUs to gather
the ones implemented in OTUclust [1], VSEARCH               information that would normally be discarded by the
[13], CD-HIT [10] or USEARCH [5] all share the same         clustering algorithm.
base naive algorithm. While more sophisticated al-
gorithms [3, 6, 12, 7, 2] could produce better results      3    Evaluating fuzzy OTUs
quality-wise, their runtime would render them unus-
able on millions of sequences. As the quality of the        An ideal OTU would contain only sequences with a
OTUs is important, we have to find a way to im-             membership value of 1, meaning a group of sequences
prove it without increasing the runtime. The differ-        has been perfectly regrouped with a good threshold
ent available implementations use a variety of heuris-      and no sequence lies ambiguously on the border. More
tics to counterbalance the simplicity of the algorithm      realistically, a good OTU would contain many se-
but, to the best of our knowledge, no approach has          quences with high membership values and little se-
tried to add a measure of uncertainty to the process.       quences with low values. A bad OTU with the ma-
This is why, in order to help increase the quality and      jority of its sequences having low membership values
trustworthiness of the clustering, we propose to add        could mean that the algorithm has chosen as a center
uncertainty to this simple algorithm through the use        a sequence on the border of a group or, even worse,
of fuzzy clustering.                                        between two distinct groups.
We can quickly evaluate the quality of an OTU with                           4.2    Relevant Metrics
this repartition. If we suppose that each sequence low-
ers the quality of the OTU depending on its member-                          To measure the effects of introducing uncertainty to
ship value, we can use the following formula :                               the clustering, we consider the computation time,
                                                                             memory usage, number of OTUs, singletons and pairs
                       P9          # sequences with membership value i×0.1
                                                                             and average distance in the taxonomy tree between
Quality(OT U ) = 1 −    i=1 ωi ×           # sequences in the OTU            sequences in the same cluster.

with ωi being the “cost” of having a sequence with
membership value i × 0.1.
                                                                             4.3    Analysis
A problem arises with singletons that always have per-
fect quality but these can safely be treated separately.                     Results show that the choice strategy affects every
                                                                             metric relevant to the quality of the clustering : num-
A sequence can belong to multiple OTUs due to fuzzy
                                                                             ber of OTUs, singletons and pairs, average misclassifi-
membership. However, in the end, we want each se-
                                                                             cation. The fuzzy approach uses slightly more memory
quence to be assigned to a single OTU. Hence, we
                                                                             than VSEARCH but all choice strategies are similar
have to choose one of the possible OTUs. We have
                                                                             on this metric. When using the default –maxaccepts
two types of values left from the clustering process :
                                                                             and –maxrejects values, computation time is lower for
membership and quality. The first one is based on the
                                                                             VSEARCH. However, when using higher values for
distance between the OTU and the sequence and the
                                                                             these parameters – and thus more precise clustering -
second one is used to recognize bad OTUs. Choosing
                                                                             the computation time is the same for both approaches.
the OTU with the best membership value is akin to
                                                                             We observe that increasing the importance of the qual-
running VSEARCH. Choosing the OTU with the best
                                                                             ity in the OTU choice strategy lowers the final number
quality tends to create bigger OTUs that absorb dis-
                                                                             of OTUs. This is due to the fact that some OTUs are
tant sequences. To better compromise, we can use a
                                                                             initially created centered on isolated sequences near
linear combination of both values :
                                                                             good OTUs. That isolation lowers their quality and
                                                                             the good OTUs absorb their sequences.
             α × quality + β × membership

Increasing the importance of the quality reduces the                         Using the quality also lowers the number of single-
number of OTUs containing sequences. When α is                               tons and increases the number of pairs. This most
low, the “best” OTUs quality-wise absorb very close                          likely means that singletons were created close to ei-
sequences that would have been attributed to other                           ther good clusters or one another. The fuzzy approach
OTUs. When α gets too high, the best OTUs start                              allows the algorithm to merge those sequences that
absorbing all the sequences around them, effectively                         were slightly too far from the center with their corre-
acting like an increase of the distance threshold.                           sponding OTU. The increase in the number of pairs
                                                                             appears to be due to the merging of singletons lying
                                                                             too close to one another. The average taxonomy dis-
4     Experimental Results                                                   tance in OTUs is shown to vary wildly. Using only
                                                                             the quality to choose OTUs increases this number as
4.1     Data
                                                                             the “best” OTUs attract all the sequences in their
We used our algorithm on a dataset containing 5977                           fuzzy surroundings. This causes some sequences be-
sequences of length between 900 and 3081 for an aver-                        longing to different species to be classified together.
age of 1442 and taxonomies extracted from the SILVA                          However, using a compromise between quality and dis-
database. We used a threshold of 0.97 (97% similarity)                       tance lowers this metric as the best clusters only ab-
for determining new OTUs and a threshold of 0.95 for                         sorb sequences that are sufficiently close to them and
fuzzy membership. For the choice of the OTU for each                         should probably be together while rejecting the se-
sequence, we present the results of three strategies :                       quences that are too different.
best quality (α = 1 and β = 0), compromise (α = 0.5
and β = 0.5) and distance (α = 0 and β = 1). The
comparison with VSEARCH is done using identical
                                                                             acknowledgements
parameters when applicable.
                                                                             This work was supported by the European Union’s
The program, dataset and corresponding taxonomy                              “Fonds Européen de Développement Régional
are available on http://projets.isima.fr/sclust/                             (FEDER)” program and the Auvergne-Rhone-Alpes
Expe.html.                                                                   region.
            Method            Time (min)    Memory       #OTUs    #Singletons    #Doubletons     Distance
      Fuzzy (best quality)       1:06       652744        3461       2581           442            0.75
      Fuzzy (compromise)         1:06       651980        3596       2776           413            0.54
       Fuzzy (distance)          1:06       683772        3631       2837           395            0.59
          VSEARCH                0:21       632832        3716       2935           388            0.57


                                       Table 1: Results of the clustering.


References                                                   [9] Weizhong Li, Limin Fu, Beifang Niu, Sitao Wu,
                                                                 and John Wooley. Ultrafast clustering algorithms
[1] Davide Albanese, Paolo Fontana, Carlotta De Fil-             for metagenomic sequence analysis. Briefings in
    ippo, Duccio Cavalieri, and Claudio Donati.                  bioinformatics, page bbs035, 2012.
    Micca: a complete and accurate software for tax-
    onomic profiling of metagenomic data. Scientific       [10] Weizhong Li and Adam Godzik. Cd-hit: a fast
    reports, 5:9743, 2015.                                      program for clustering and comparing large sets
                                                                of protein or nucleotide sequences. Bioinformat-
[2] Violaine Antoine, Benjamin Quost, Marie-Hélène            ics, 22(13):1658–1659, 2006.
    Masson, and Thierry Denoeux. CECM: con-
                                                           [11] Frédéric Mahé, Torbjørn Rognes, Christopher
    strained evidential c-means algorithm. Computa-
                                                                Quince, Colomban de Vargas, and Micah Dun-
    tional Statistics & Data Analysis, 56(4):894–914,
                                                                thorn. Swarm: robust and fast clustering method
    2012.
                                                                for amplicon-based studies. PeerJ, 2:e593, 2014.
[3] Violaine Antoine, Benjamin Quost, Marie-Hélène       [12] Airel Pérez-Suárez, José F Martı́nez-Trinidad,
    Masson, and Thierry Denoeux. CEVCLUS: evi-                  Jesús A Carrasco-Ochoa, and José E Medina-
    dential clustering with instance-level constraints          Pagola.    Oclustr: A new graph-based algo-
    for relational data. Soft Comput., 18(7):1321–              rithm for overlapping clustering. Neurocomput-
    1335, 2014.                                                 ing, 121:234–247, 2013.

[4] Wei Chen, Clarence K Zhang, Yongmei Cheng,             [13] Torbjørn Rognes, Tomáš Flouri, Ben Nichols,
    Shaowu Zhang, and Hongyu Zhao. A comparison                 Christopher Quince, and Frédéric Mahé. Vsearch:
    of methods for clustering 16s rrna sequences into           a versatile open source tool for metagenomics.
    otus. PloS one, 8(8):e70837, 2013.                          PeerJ, 4:e2584, 2016.

[5] Robert C Edgar. Search and clustering orders           [14] Simon Roux, Michaël Faubladier, Antoine Mahul,
    of magnitude faster than blast. Bioinformatics,             Nils Paulhe, Aurélien Bernard, Didier Debroas,
    26(19):2460–2461, 2010.                                     and François Enault. Metavir: a web server
                                                                dedicated to virome analysis. Bioinformatics,
[6] Isak Gath and Amir B. Geva. Unsupervised                    27(21):3074–3075, 2011.
    optimal fuzzy clustering. IEEE Transactions
                                                           [15] Sarah L Westcott and Patrick D Schloss. De
    on pattern analysis and machine intelligence,
                                                                novo clustering methods outperform reference-
    11(7):773–780, 1989.
                                                                based methods for assigning 16s rrna gene se-
                                                                quences to operational taxonomic units. PeerJ,
[7] Sarra Ben Hariz, Zied Elouedi, and Khaled Mel-
                                                                3:e1487, 2015.
    louli. Clustering approach using belief function
    theory. In International Conference on Artificial
    Intelligence: Methodology, Systems, and Applica-
    tions, pages 162–171. Springer, 2006.

[8] Mylène Hugoni, Najwa Taib, Didier Debroas,
    Isabelle Domaizon, Isabelle Jouan Dufournel,
    Gisèle Bronner, Ian Salter, Hélène Agogué, Is-
    abelle Mary, and Pierre E Galand. Structure of
    the rare archaeal biosphere and seasonal dynam-
    ics of active ecotypes in surface coastal waters.
    Proceedings of the National Academy of Sciences,
    110(15):6004–6009, 2013.