A De Novo Robust Clustering Approach for Amplicon-Based Sequence Data

                Alexandre BAZIN, Didier D EBROAS and Engelbert M EPHU N GUIFO
      University Clermont Auvergne, CNRS, LIMOS, F-63000 CLERMONT-FERRAND, FRANCE
      University Clermont Auvergne, CNRS, LMGE, F-63000 CLERMONT-FERRAND, FRANCE
                             alexandre.bazin@isima.fr, mephu@isima.fr


                         Abstract                                 better Operational Taxonomic Units (OTUs) could have been
                                                                  obtained with different parameters or even whether they cor-
     When analyzing microbial communities, an active              rectly represent the data. In these circumstances, there is no
     and computational challenge concerns the catego-             choice but to blindly trust them.
     rization of 16S rRNA gene sequences into opera-
     tional taxonomic units (OTUs). Established clus-                Distance-based greedy clustering algorithm such as the
     tering tools use a one pass algorithm in order to            ones implemented in OTUclust [Albanese et al., 2015],
     tackle high numbers of gene sequences and pro-               VSEARCH [Rognes et al., 2016], CD-HIT [Li and Godzik,
     duce OTUs in reasonable time. However, all of                2006] or USEARCH [Edgar, 2010] all share the same base
     the current tools are based on a crisp clustering ap-        algorithm as shown in Algorithm 1.
     proach, where a gene sequence is assigned to one
     cluster. The weak quality of the output compared
     to more complex clustering algorithms, forces the              Algorithm 1: DBG Clustering principle
     user to post-process the obtained OTUs. Provid-
     ing a membership degree when assigning a gene                   Input : A set of sequences
     sequence to an OTU, will help the user during the               Output: A set of OTUs to which the sequences are
     post-processing task. Moreover it is possible to use                      assigned
                                                                   1 Clusters = ∅
     this membership degree to automatically evaluate
                                                                   2 foreach sequence S do
     the quality of the obtained OTUs. So the goal of
     this work is to propose a new clustering approach             3     foreach known cluster C do
     that takes into account uncertainty when producing            4          Compute distance(S, C)
     OTUs, and improves both the quality and the pre-              5     end
     sentation of the OTUs results.                                6     if a suitable cluster exists then
                                                                   7          Assign S to it
                                                                   8     else
                                                                   9          Create a new cluster with S as the center
1   Introduction                                                  10     end
                                                                  11 end
Studying the structure of the communities in an ecosystem is      12 Return Clusters
central in environmental microbiology [Hugoni et al., 2013;
Roux et al., 2011]. The biosphere’s diversity can be de-
termined by amplifying and sequencing specific phyloge-
netic markers (e.g. 16S rRNA). From there, these ampli-              While more sophisticated algorithms [Antoine et al., 2014;
cons need to be clusterized in ”species” named Operational        Gath and Geva, 1989; Pérez-Suárez et al., 2013; Hariz et
Taxonomic Units (OTUs) [Chen et al., 2013; Li et al., 2012;       al., 2006; Antoine et al., 2012] could produce better results
Mahé et al., 2014; Westcott and Schloss, 2015]. As the vol-      quality-wise, their runtime would render them unusable on
ume of sequences has drastically increased in recent times,       millions of sequences. As the quality of the OTUs is impor-
new clustering tools have emerged to treat the data in rea-       tant, we have to find a way to improve it without increasing
sonable time. The currently used algorithms are, from the         the runtime. The different available implementations use a
point of view of algorithmic complexity, the fastest available    variety of heuristics to counterbalance the simplicity of the
that do not produce random results. However, due to their         algorithm but, to the best of our knowledge, no approach has
simplicity, the reliability of the results are often discussed.   tried to add a measure of uncertainty to the process. This is
These tools being essentially black boxes, their sensitivity to   why, in order to help increase the quality and trustworthiness
the sequence order, clustering threshold and structure of the     of the clustering, we propose to add uncertainty to this simple
data makes it that the users have no way of knowing whether       algorithm through the use of fuzzy clustering.
2     Adding uncertainty to clustering                               nearly belongs to the OTU. This membership value can eas-
                                                                     ily be computed from the distance between the sequence and
2.1    Motivation                                                    the center of the OTU using two thresholds t1 and t2 such
                                                                     that t1 ≥ t2 . If the distance is less than the threshold t1 , the
Distance-based greedy clustering algorithms, such as the one         membership value is 1. If the distance is greater than t2 the
in VSEARCH, produce a number of OTUs and assign each                 value is 0. If the distance is between t1 and t2 , it increases
sequence to one of them. The OTU to which a sequence is              gradually.
said to belong to is usually the first one to be encountered
that is sufficiently close, i.e. within the specified threshold.
This creates two problems :                                            Algorithm 2: Fuzzy DBG Clustering DBG Clustering
                                                                        Input : A set of sequences
    • A sequence can only belong to a single OTU                        Output: A set of OTUs to which the sequences are
    • An OTU either includes or does not include a sequence                       assigned
                                                                      1 Clusters = ∅
   Having a sequence associated to a single OTU is expected           2 foreach sequence S do
as the ultimate output of the algorithm. For this reason, algo-       3     foreach known cluster C do
rithms can stop after finding the first OTU that is close enough      4         Compute distance(S, C)
to a sequence, which speeds the computation up. However,              5         Assign S to C with value fC (S)
not considering all the OTUs a sequence could be assigned             6     end
to increases the sensitivity to the order - a weakness of these       7     if S has not been sufficiently assigned then
algorithms - and reduces the quality of the clustering. Indeed,       8         Create a new cluster with S as the center
what if two different OTUs are close enough ? Giving priority         9     end
to the first generated OTU only creates a bias that no heuristic     10 end
- such as sorting the sequences - could hope to overcome.            11 Return Clusters

   Moreover, by using strict thresholds, it is possible to have
two nearly identical sequences such that one belongs to a par-
ticular OTU while the other does not. This strictness makes
it so an OTU partitions the set of sequences into two sets in-       Figure 1: Representations of a Crisp (Left) and a Fuzzy
side of which sequences are considered the same regardless           (Right) Cluster.
of their distance to the center of the OTU. This lack of distinc-
tion between sequences that are isolated and sequences on the
border of OTUs hides information that could help understand
the data.

   While these would not be problems were the clustering op-
timal, the need for fast algorithms gives rise to results that are
not always trustworthy. The OTUs being presented as abso-
lute, the end user has no choice, should consider them correct          Using fuzzy OTUs allows us to discern the difference be-
and cannot know whether the algorithm has encountered am-            tween sequences close to the OTU and sequences extremely
biguity. We believe that being less strict in the way the OTUs       far. Using the parameters t1 and t2 , we can tune the “detec-
partition sequences would help produce better results from           tion radius” around OTUs to gather information that would
the end user’s point of view.                                        normally be discarded by the clustering algorithm.
2.2    Fuzzy Clustering
                                                                     3    Evaluating fuzzy OTUs
To help increase the quality of the clustering and maximize
the information that can be gathered from the data, we pro-          Having a non-binary membership function produces OTUs
pose to add uncertainty to the clustering by means of fuzzy          that partition the sequences into multiple sets. If we con-
sets.                                                                sider only the sequences that belong (more or less) to an
                                                                     OTU, the repartition of their membership values provides in-
   We define a membership function fC (S) that, for an OTU           formation on the topology of the OTU. An ideal OTU would
C, associates a membership value to a sequence S. Usually,           contain only sequences with a membership value of 1, mean-
this value is either 0 or 1. Here, we propose to have fC (S)         ing a group of sequences has been perfectly regrouped with
                     n
take its value in { 10  | n = 0..10}. This value represents          a good threshold and no sequence lies ambiguously on the
the degree of membership and, as such, 1 means that the se-          border. More realistically, a good OTU would contain many
quence certainly belongs to the OTU while 0 means that the           sequences with high membership values and little sequences
sequence certainly does not belong to it. Other values rep-          with low values. A bad OTU with the majority of its se-
resent uncertainty and are used to express that the sequence         quences having low membership values could mean that the
           0.1     0.2    0.3   0.4    0.5    0.6    0.7     0.8    0.9       1        5     Identifying ambiguous sequences
    OTU1    6       4      1     1      0      3      8      13     29       88
    OTU2   70      41     30    41     34     19     11       6      5       16        Distance-based greedy algorithms are good at clustering ob-
                                                                                       jects that are easy to cluster. Groups of very similar sequences
Table 1: Two example OTUs with the number of sequences                                 that are different from the rest of the dataset are supposed to
that belong to them with each possible membership value.                               birth a new OTU while isolated singletons should be identi-
                                                                                       fied to be either removed or treated separately. A problem
     ω1    ω2       ω3      ω4        ω5     ω6      ω7       ω8       ω9              arises when groups of sequences are close to each other but
     1     0.9      0.8     0.7       0.6    0.5     0.4      0.3      0.2             not enough to be the same OTU. In this case and supposing
                                                                                       the algorithm ideally chooses the centers of the OTUs, se-
                                                                                       quences can lie just between these OTUs. In the current im-
                 Table 2: Example of weight values.                                    plementations, these ambiguous sequences that must be as-
                                                                                       signed are usually put in OTUs of their own, increasing the
algorithm has chosen as a center a sequence on the border of                           number of OTUs and reducing the overall quality of the clus-
a group or, even worse, between two distinct groups.                                   tering.


   We can quickly evaluate the quality of an OTU with this                                      Figure 2: A Case of Ambiguous Sequences
repartition. If we suppose that each sequence lowers the qual-
ity of the OTU depending on its membership value, we can
use the following formula :
                                P9           # sequences with membership value i×0.1
    Quality(OT U ) = 1 −          i=1 ωi ×           # sequences in the OTU


   with ωi being the “cost” of having a sequence with mem-
bership value i × 0.1. In our previous examples, and with the
following values of ωi
                                                                                          Using fuzzy clustering allows us to identify these ambigu-
  we obtain a quality of respectively 0.71 and 0.26 for OTU1                           ous sequences. Using the previously mentioned choice strat-
and OTU2, showing OTU1 is better.                                                      egy, they can be assigned to a good OTU even though they lie
                                                                                       slightly outside of the distance threshold. However, their am-
  A problem arises with singletons that always have perfect                            biguousness may be significant for the user. It is thus impor-
quality but these can safely be treated separately.                                    tant to highlight their existence and the various fuzzy OTUs
                                                                                       they could have alternatively been assigned to.
4    Choosing an OTU
                                                                                       6     Experimental Results
A sequence can belong to multiple OTUs due to fuzzy mem-
bership. However, in the end, we want each sequence to be                              6.1    Data
assigned to a single OTU. Hence, we have to choose one of
the possible OTUs. We have two types of values left from the                           We used our algorithm on a dataset containing 5977 se-
clustering process : membership and quality. The first one                             quences of length between 900 and 3081 for an average of
is based on the distance between the OTU and the sequence                              1442 and taxonomies extracted from the SILVA database. We
and the second one is used to recognize bad OTUs. Choosing                             used a threshold of 0.97 (97% similarity) for determining new
the OTU with the best membership value is akin to running                              OTUs and a threshold of 0.95 for fuzzy membership. For the
VSEARCH. Choosing the OTU with the best quality tends to                               choice of the OTU for each sequence, we present the results
create bigger OTUs that absorb distant sequences. To better                            of three strategies : best quality (α = 1 and β = 0), compro-
compromise, we can use a linear combination of both values                             mise (α = 0.5 and β = 0.5) and distance (α = 0 and β = 1).
:                                                                                      The comparison with VSEARCH is done using identical pa-
                                                                                       rameters when applicable.

                 α × quality + β × membership                                            The program, dataset and corresponding taxonomy are
                                                                                       available on http://projets.isima.fr/sclust/
                                                                                       Expe.html.
   Increasing the importance of the quality reduces the num-
ber of OTUs containing sequences. When α is low, the “best”
OTUs quality-wise absorb very close sequences that would                               6.2    Relevant Metrics
have been attributed to other OTUs. When α gets too high,                              To measure the effects of introducing uncertainty to the clus-
the best OTUs start absorbing all the sequences around them,                           tering, we consider the following metrics :
effectively acting like an increase of the distance threshold.
       Method           Time (min)   Memory   #OTUs   #Singletons   #Doubletons   Distance
 Fuzzy (best quality)      1:06      652744    3461      2581          442          0.75     centered on isolated sequences near good OTUs. That iso-
 Fuzzy (compromise)
   Fuzzy (distance)
                           1:06
                           1:06
                                     651980
                                     683772
                                               3596
                                               3631
                                                         2776
                                                         2837
                                                                       413
                                                                       395
                                                                                    0.54
                                                                                    0.59
                                                                                             lation lowers their quality and the good OTUs absorb their
     VSEARCH               0:21      632832    3716      2935          388          0.57     sequences.

Table 3: Results of the clustering using default maxaccepts                                     Using the quality also lowers the number of singletons and
and maxrejects.                                                                              increases the number of pairs. This most likely means that
                                                                                             singletons were created close to either good clusters or one
       Method           Time (min)   Memory   #OTUs   #Singletons   #Doubletons   Distance   another. The fuzzy approach allows the algorithm to merge
 Fuzzy (best quality)     27:01      720968    3431      2575          413          0.60
 Fuzzy (compromise)       29:14      734604    3566      2767          398          0.47     those sequences that were slightly too far from the center with
   Fuzzy (distance)       28:27      723693    3631      2835          391          0.48
     VSEARCH              27:51      648052    3631      2859          394          0.52
                                                                                             their corresponding OTU. The increase in the number of pairs
                                                                                             appears to be due to the merging of singletons lying too close
                                                                                             to one another.
Table 4: Results of the clustering using maxaccepts 10000
and maxrejects 10000.
                                                                                                The average taxonomy distance in OTUs is shown to vary
                                                                                             wildly. Using only the quality to choose OTUs increases this
  • Computation time in minutes                                                              number as the “best” OTUs attract all the sequences in their
  • Memory usage                                                                             fuzzy surroundings. This causes some sequences belonging
                                                                                             to different species to be classified together. However, using a
  • Number of OTUs containing at least a sequence                                            compromise between quality and distance lowers this metric
  • Number of OTUs containing a single sequence                                              as the best clusters only absorb sequences that are sufficiently
                                                                                             close to them and should probably be together while rejecting
  • Number of OTUs containing only two sequences                                             the sequences that are too different.
  • Average distance in the taxonomy between sequences in
    the same cluster
                                                                                             7   Discussion
   The distance between two sequences in the taxonomy is de-                                 We observe that the experimental results confirm that adding
fined as the sum of the lengths of the path from their nearest                               uncertainty to the clustering helps improve the quality of the
commonality. For example, if a sequence is classified as ”bac-                               output by reducing the number of singletons. Using fuzzy
teria;proteobacteria;betaproteobacteria” and the other is clas-                              clusters, we are able to extend the clustering threshold to
sified as ”bacteria;proteobacteria;alphaproteobacteria ”, their                              gather additional information on the OTUs’s surroundings
distance is 2 as each of them is at a distance 1 from their                                  and use it to quickly assess their quality. This quality can
commonality ””bacteria;proteobacteria”.                                                      be used together with the distance to choose an OTU for each
                                                                                             sequence. The resulting output contains less singletons and
6.3      Results                                                                             misclassifications. Being able to choose the weight of both
                                                                                             distance and quality allows for additional tuning.
First, let us begin with the results obtained using the default
values for –maxaccepts and –maxrejects in Table 3.
                                                                                                As previously mentioned, the fuzziness also makes it possi-
                                                                                             ble to detect ambiguous sequences and clusters. In our opin-
  Then, the results obtained using –maxaccepts 10000 and                                     ion, this is where further work is required. An ambiguous
–maxrejects 10000 in Table 4.                                                                sequence could be arbitrarily assigned to a nearby OTU, be-
                                                                                             come the center of its own OTU or even be considered as an
                                                                                             error and deleted but these operations imply such a knowl-
6.4      Analysis                                                                            edge of the domain that interactions with the human user be-
                                                                                             come necessary. However, on datasets containing millions of
Results show that the choice strategy affects every metric                                   sequences, the number of alerts would render manual treat-
relevant to the quality of the clustering : number of OTUs,                                  ment impractical or even impossible. Automatizing this treat-
singletons and pairs, average misclassification. The fuzzy                                   ment would require being able to adapt to the type of data,
approach uses slightly more memory than VSEARCH but                                          domain and preferences of the user. We suggest that machine
all choice strategies are similar on this metric. When using                                 learning techniques be introduced in the process to automati-
the default –maxaccepts and –maxrejects values, computation                                  cally learn how to handle these ambiguities.
time is lower for VSEARCH. However, when using higher
values for these parameters – and thus more precise cluster-
ing - the computation time is the same for both approaches.                                  Acknowledgements
   We observe that increasing the importance of the quality                                  This work was supported by the European Union’s “Fonds
in the OTU choice strategy lowers the final number of OTUs.                                  Européen de Développement Régional (FEDER)” program
This is due to the fact that some OTUs are initially created                                 and the Auvergne-Rhone-Alpes region.
References                                                         [Rognes et al., 2016] Torbjørn Rognes, Tomáš Flouri, Ben
                                                                     Nichols, Christopher Quince, and Frédéric Mahé.
[Albanese et al., 2015] Davide Albanese, Paolo Fontana,
                                                                     Vsearch: a versatile open source tool for metagenomics.
   Carlotta De Filippo, Duccio Cavalieri, and Claudio Do-            PeerJ, 4:e2584, 2016.
   nati. Micca: a complete and accurate software for taxo-
   nomic profiling of metagenomic data. Scientific reports,        [Roux et al., 2011] Simon Roux, Michaël Faubladier, An-
   5:9743, 2015.                                                     toine Mahul, Nils Paulhe, Aurélien Bernard, Didier De-
                                                                     broas, and François Enault. Metavir: a web server ded-
[Antoine et al., 2012] Violaine Antoine, Benjamin Quost,             icated to virome analysis. Bioinformatics, 27(21):3074–
   Marie-Hélène Masson, and Thierry Denoeux. CECM:                 3075, 2011.
   constrained evidential c-means algorithm. Computational
   Statistics & Data Analysis, 56(4):894–914, 2012.                [Westcott and Schloss, 2015] Sarah L Westcott and
                                                                     Patrick D Schloss. De novo clustering methods out-
[Antoine et al., 2014] Violaine Antoine, Benjamin Quost,             perform reference-based methods for assigning 16s rrna
   Marie-Hélène Masson, and Thierry Denoeux. CEVCLUS:              gene sequences to operational taxonomic units. PeerJ,
   evidential clustering with instance-level constraints for re-     3:e1487, 2015.
   lational data. Soft Comput., 18(7):1321–1335, 2014.
[Chen et al., 2013] Wei Chen, Clarence K Zhang, Yongmei
   Cheng, Shaowu Zhang, and Hongyu Zhao. A compari-
   son of methods for clustering 16s rrna sequences into otus.
   PloS one, 8(8):e70837, 2013.
[Edgar, 2010] Robert C Edgar. Search and clustering or-
   ders of magnitude faster than blast. Bioinformatics,
   26(19):2460–2461, 2010.
[Gath and Geva, 1989] Isak Gath and Amir B. Geva. Unsu-
   pervised optimal fuzzy clustering. IEEE Transactions on
   pattern analysis and machine intelligence, 11(7):773–780,
   1989.
[Hariz et al., 2006] Sarra Ben Hariz, Zied Elouedi, and
   Khaled Mellouli. Clustering approach using belief func-
   tion theory. In International Conference on Artificial Intel-
   ligence: Methodology, Systems, and Applications, pages
   162–171. Springer, 2006.
[Hugoni et al., 2013] Mylène Hugoni, Najwa Taib, Didier
   Debroas, Isabelle Domaizon, Isabelle Jouan Dufournel,
   Gisèle Bronner, Ian Salter, Hélène Agogué, Isabelle Mary,
   and Pierre E Galand. Structure of the rare archaeal bio-
   sphere and seasonal dynamics of active ecotypes in sur-
   face coastal waters. Proceedings of the National Academy
   of Sciences, 110(15):6004–6009, 2013.
[Li and Godzik, 2006] Weizhong Li and Adam Godzik. Cd-
   hit: a fast program for clustering and comparing large
   sets of protein or nucleotide sequences. Bioinformatics,
   22(13):1658–1659, 2006.
[Li et al., 2012] Weizhong Li, Limin Fu, Beifang Niu, Sitao
   Wu, and John Wooley. Ultrafast clustering algorithms for
   metagenomic sequence analysis. Briefings in bioinformat-
   ics, page bbs035, 2012.
[Mahé et al., 2014] Frédéric Mahé, Torbjørn Rognes,
   Christopher Quince, Colomban de Vargas, and Micah
   Dunthorn. Swarm: robust and fast clustering method for
   amplicon-based studies. PeerJ, 2:e593, 2014.
[Pérez-Suárez et al., 2013] Airel Pérez-Suárez, José F
   Martı́nez-Trinidad, Jesús A Carrasco-Ochoa, and José E
   Medina-Pagola. Oclustr: A new graph-based algorithm for
   overlapping clustering. Neurocomputing, 121:234–247,
   2013.