A De Novo Robust Clustering Approach for Amplicon-Based Sequence Data Alexandre BAZIN, Didier D EBROAS and Engelbert M EPHU N GUIFO University Clermont Auvergne, CNRS, LIMOS, F-63000 CLERMONT-FERRAND, FRANCE University Clermont Auvergne, CNRS, LMGE, F-63000 CLERMONT-FERRAND, FRANCE alexandre.bazin@isima.fr, mephu@isima.fr Abstract better Operational Taxonomic Units (OTUs) could have been obtained with different parameters or even whether they cor- When analyzing microbial communities, an active rectly represent the data. In these circumstances, there is no and computational challenge concerns the catego- choice but to blindly trust them. rization of 16S rRNA gene sequences into opera- tional taxonomic units (OTUs). Established clus- Distance-based greedy clustering algorithm such as the tering tools use a one pass algorithm in order to ones implemented in OTUclust [Albanese et al., 2015], tackle high numbers of gene sequences and pro- VSEARCH [Rognes et al., 2016], CD-HIT [Li and Godzik, duce OTUs in reasonable time. However, all of 2006] or USEARCH [Edgar, 2010] all share the same base the current tools are based on a crisp clustering ap- algorithm as shown in Algorithm 1. proach, where a gene sequence is assigned to one cluster. The weak quality of the output compared to more complex clustering algorithms, forces the Algorithm 1: DBG Clustering principle user to post-process the obtained OTUs. Provid- ing a membership degree when assigning a gene Input : A set of sequences sequence to an OTU, will help the user during the Output: A set of OTUs to which the sequences are post-processing task. Moreover it is possible to use assigned 1 Clusters = ∅ this membership degree to automatically evaluate 2 foreach sequence S do the quality of the obtained OTUs. So the goal of this work is to propose a new clustering approach 3 foreach known cluster C do that takes into account uncertainty when producing 4 Compute distance(S, C) OTUs, and improves both the quality and the pre- 5 end sentation of the OTUs results. 6 if a suitable cluster exists then 7 Assign S to it 8 else 9 Create a new cluster with S as the center 1 Introduction 10 end 11 end Studying the structure of the communities in an ecosystem is 12 Return Clusters central in environmental microbiology [Hugoni et al., 2013; Roux et al., 2011]. The biosphere’s diversity can be de- termined by amplifying and sequencing specific phyloge- netic markers (e.g. 16S rRNA). From there, these ampli- While more sophisticated algorithms [Antoine et al., 2014; cons need to be clusterized in ”species” named Operational Gath and Geva, 1989; Pérez-Suárez et al., 2013; Hariz et Taxonomic Units (OTUs) [Chen et al., 2013; Li et al., 2012; al., 2006; Antoine et al., 2012] could produce better results Mahé et al., 2014; Westcott and Schloss, 2015]. As the vol- quality-wise, their runtime would render them unusable on ume of sequences has drastically increased in recent times, millions of sequences. As the quality of the OTUs is impor- new clustering tools have emerged to treat the data in rea- tant, we have to find a way to improve it without increasing sonable time. The currently used algorithms are, from the the runtime. The different available implementations use a point of view of algorithmic complexity, the fastest available variety of heuristics to counterbalance the simplicity of the that do not produce random results. However, due to their algorithm but, to the best of our knowledge, no approach has simplicity, the reliability of the results are often discussed. tried to add a measure of uncertainty to the process. This is These tools being essentially black boxes, their sensitivity to why, in order to help increase the quality and trustworthiness the sequence order, clustering threshold and structure of the of the clustering, we propose to add uncertainty to this simple data makes it that the users have no way of knowing whether algorithm through the use of fuzzy clustering. 2 Adding uncertainty to clustering nearly belongs to the OTU. This membership value can eas- ily be computed from the distance between the sequence and 2.1 Motivation the center of the OTU using two thresholds t1 and t2 such that t1 ≥ t2 . If the distance is less than the threshold t1 , the Distance-based greedy clustering algorithms, such as the one membership value is 1. If the distance is greater than t2 the in VSEARCH, produce a number of OTUs and assign each value is 0. If the distance is between t1 and t2 , it increases sequence to one of them. The OTU to which a sequence is gradually. said to belong to is usually the first one to be encountered that is sufficiently close, i.e. within the specified threshold. This creates two problems : Algorithm 2: Fuzzy DBG Clustering DBG Clustering Input : A set of sequences • A sequence can only belong to a single OTU Output: A set of OTUs to which the sequences are • An OTU either includes or does not include a sequence assigned 1 Clusters = ∅ Having a sequence associated to a single OTU is expected 2 foreach sequence S do as the ultimate output of the algorithm. For this reason, algo- 3 foreach known cluster C do rithms can stop after finding the first OTU that is close enough 4 Compute distance(S, C) to a sequence, which speeds the computation up. However, 5 Assign S to C with value fC (S) not considering all the OTUs a sequence could be assigned 6 end to increases the sensitivity to the order - a weakness of these 7 if S has not been sufficiently assigned then algorithms - and reduces the quality of the clustering. Indeed, 8 Create a new cluster with S as the center what if two different OTUs are close enough ? Giving priority 9 end to the first generated OTU only creates a bias that no heuristic 10 end - such as sorting the sequences - could hope to overcome. 11 Return Clusters Moreover, by using strict thresholds, it is possible to have two nearly identical sequences such that one belongs to a par- ticular OTU while the other does not. This strictness makes it so an OTU partitions the set of sequences into two sets in- Figure 1: Representations of a Crisp (Left) and a Fuzzy side of which sequences are considered the same regardless (Right) Cluster. of their distance to the center of the OTU. This lack of distinc- tion between sequences that are isolated and sequences on the border of OTUs hides information that could help understand the data. While these would not be problems were the clustering op- timal, the need for fast algorithms gives rise to results that are not always trustworthy. The OTUs being presented as abso- lute, the end user has no choice, should consider them correct Using fuzzy OTUs allows us to discern the difference be- and cannot know whether the algorithm has encountered am- tween sequences close to the OTU and sequences extremely biguity. We believe that being less strict in the way the OTUs far. Using the parameters t1 and t2 , we can tune the “detec- partition sequences would help produce better results from tion radius” around OTUs to gather information that would the end user’s point of view. normally be discarded by the clustering algorithm. 2.2 Fuzzy Clustering 3 Evaluating fuzzy OTUs To help increase the quality of the clustering and maximize the information that can be gathered from the data, we pro- Having a non-binary membership function produces OTUs pose to add uncertainty to the clustering by means of fuzzy that partition the sequences into multiple sets. If we con- sets. sider only the sequences that belong (more or less) to an OTU, the repartition of their membership values provides in- We define a membership function fC (S) that, for an OTU formation on the topology of the OTU. An ideal OTU would C, associates a membership value to a sequence S. Usually, contain only sequences with a membership value of 1, mean- this value is either 0 or 1. Here, we propose to have fC (S) ing a group of sequences has been perfectly regrouped with n take its value in { 10 | n = 0..10}. This value represents a good threshold and no sequence lies ambiguously on the the degree of membership and, as such, 1 means that the se- border. More realistically, a good OTU would contain many quence certainly belongs to the OTU while 0 means that the sequences with high membership values and little sequences sequence certainly does not belong to it. Other values rep- with low values. A bad OTU with the majority of its se- resent uncertainty and are used to express that the sequence quences having low membership values could mean that the 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1 5 Identifying ambiguous sequences OTU1 6 4 1 1 0 3 8 13 29 88 OTU2 70 41 30 41 34 19 11 6 5 16 Distance-based greedy algorithms are good at clustering ob- jects that are easy to cluster. Groups of very similar sequences Table 1: Two example OTUs with the number of sequences that are different from the rest of the dataset are supposed to that belong to them with each possible membership value. birth a new OTU while isolated singletons should be identi- fied to be either removed or treated separately. A problem ω1 ω2 ω3 ω4 ω5 ω6 ω7 ω8 ω9 arises when groups of sequences are close to each other but 1 0.9 0.8 0.7 0.6 0.5 0.4 0.3 0.2 not enough to be the same OTU. In this case and supposing the algorithm ideally chooses the centers of the OTUs, se- quences can lie just between these OTUs. In the current im- Table 2: Example of weight values. plementations, these ambiguous sequences that must be as- signed are usually put in OTUs of their own, increasing the algorithm has chosen as a center a sequence on the border of number of OTUs and reducing the overall quality of the clus- a group or, even worse, between two distinct groups. tering. We can quickly evaluate the quality of an OTU with this Figure 2: A Case of Ambiguous Sequences repartition. If we suppose that each sequence lowers the qual- ity of the OTU depending on its membership value, we can use the following formula : P9 # sequences with membership value i×0.1 Quality(OT U ) = 1 − i=1 ωi × # sequences in the OTU with ωi being the “cost” of having a sequence with mem- bership value i × 0.1. In our previous examples, and with the following values of ωi Using fuzzy clustering allows us to identify these ambigu- we obtain a quality of respectively 0.71 and 0.26 for OTU1 ous sequences. Using the previously mentioned choice strat- and OTU2, showing OTU1 is better. egy, they can be assigned to a good OTU even though they lie slightly outside of the distance threshold. However, their am- A problem arises with singletons that always have perfect biguousness may be significant for the user. It is thus impor- quality but these can safely be treated separately. tant to highlight their existence and the various fuzzy OTUs they could have alternatively been assigned to. 4 Choosing an OTU 6 Experimental Results A sequence can belong to multiple OTUs due to fuzzy mem- bership. However, in the end, we want each sequence to be 6.1 Data assigned to a single OTU. Hence, we have to choose one of the possible OTUs. We have two types of values left from the We used our algorithm on a dataset containing 5977 se- clustering process : membership and quality. The first one quences of length between 900 and 3081 for an average of is based on the distance between the OTU and the sequence 1442 and taxonomies extracted from the SILVA database. We and the second one is used to recognize bad OTUs. Choosing used a threshold of 0.97 (97% similarity) for determining new the OTU with the best membership value is akin to running OTUs and a threshold of 0.95 for fuzzy membership. For the VSEARCH. Choosing the OTU with the best quality tends to choice of the OTU for each sequence, we present the results create bigger OTUs that absorb distant sequences. To better of three strategies : best quality (α = 1 and β = 0), compro- compromise, we can use a linear combination of both values mise (α = 0.5 and β = 0.5) and distance (α = 0 and β = 1). : The comparison with VSEARCH is done using identical pa- rameters when applicable. α × quality + β × membership The program, dataset and corresponding taxonomy are available on http://projets.isima.fr/sclust/ Expe.html. Increasing the importance of the quality reduces the num- ber of OTUs containing sequences. When α is low, the “best” OTUs quality-wise absorb very close sequences that would 6.2 Relevant Metrics have been attributed to other OTUs. When α gets too high, To measure the effects of introducing uncertainty to the clus- the best OTUs start absorbing all the sequences around them, tering, we consider the following metrics : effectively acting like an increase of the distance threshold. Method Time (min) Memory #OTUs #Singletons #Doubletons Distance Fuzzy (best quality) 1:06 652744 3461 2581 442 0.75 centered on isolated sequences near good OTUs. That iso- Fuzzy (compromise) Fuzzy (distance) 1:06 1:06 651980 683772 3596 3631 2776 2837 413 395 0.54 0.59 lation lowers their quality and the good OTUs absorb their VSEARCH 0:21 632832 3716 2935 388 0.57 sequences. Table 3: Results of the clustering using default maxaccepts Using the quality also lowers the number of singletons and and maxrejects. increases the number of pairs. This most likely means that singletons were created close to either good clusters or one Method Time (min) Memory #OTUs #Singletons #Doubletons Distance another. The fuzzy approach allows the algorithm to merge Fuzzy (best quality) 27:01 720968 3431 2575 413 0.60 Fuzzy (compromise) 29:14 734604 3566 2767 398 0.47 those sequences that were slightly too far from the center with Fuzzy (distance) 28:27 723693 3631 2835 391 0.48 VSEARCH 27:51 648052 3631 2859 394 0.52 their corresponding OTU. The increase in the number of pairs appears to be due to the merging of singletons lying too close to one another. Table 4: Results of the clustering using maxaccepts 10000 and maxrejects 10000. The average taxonomy distance in OTUs is shown to vary wildly. Using only the quality to choose OTUs increases this • Computation time in minutes number as the “best” OTUs attract all the sequences in their • Memory usage fuzzy surroundings. This causes some sequences belonging to different species to be classified together. However, using a • Number of OTUs containing at least a sequence compromise between quality and distance lowers this metric • Number of OTUs containing a single sequence as the best clusters only absorb sequences that are sufficiently close to them and should probably be together while rejecting • Number of OTUs containing only two sequences the sequences that are too different. • Average distance in the taxonomy between sequences in the same cluster 7 Discussion The distance between two sequences in the taxonomy is de- We observe that the experimental results confirm that adding fined as the sum of the lengths of the path from their nearest uncertainty to the clustering helps improve the quality of the commonality. For example, if a sequence is classified as ”bac- output by reducing the number of singletons. Using fuzzy teria;proteobacteria;betaproteobacteria” and the other is clas- clusters, we are able to extend the clustering threshold to sified as ”bacteria;proteobacteria;alphaproteobacteria ”, their gather additional information on the OTUs’s surroundings distance is 2 as each of them is at a distance 1 from their and use it to quickly assess their quality. This quality can commonality ””bacteria;proteobacteria”. be used together with the distance to choose an OTU for each sequence. The resulting output contains less singletons and 6.3 Results misclassifications. Being able to choose the weight of both distance and quality allows for additional tuning. First, let us begin with the results obtained using the default values for –maxaccepts and –maxrejects in Table 3. As previously mentioned, the fuzziness also makes it possi- ble to detect ambiguous sequences and clusters. In our opin- Then, the results obtained using –maxaccepts 10000 and ion, this is where further work is required. An ambiguous –maxrejects 10000 in Table 4. sequence could be arbitrarily assigned to a nearby OTU, be- come the center of its own OTU or even be considered as an error and deleted but these operations imply such a knowl- 6.4 Analysis edge of the domain that interactions with the human user be- come necessary. However, on datasets containing millions of Results show that the choice strategy affects every metric sequences, the number of alerts would render manual treat- relevant to the quality of the clustering : number of OTUs, ment impractical or even impossible. Automatizing this treat- singletons and pairs, average misclassification. The fuzzy ment would require being able to adapt to the type of data, approach uses slightly more memory than VSEARCH but domain and preferences of the user. We suggest that machine all choice strategies are similar on this metric. When using learning techniques be introduced in the process to automati- the default –maxaccepts and –maxrejects values, computation cally learn how to handle these ambiguities. time is lower for VSEARCH. However, when using higher values for these parameters – and thus more precise cluster- ing - the computation time is the same for both approaches. Acknowledgements We observe that increasing the importance of the quality This work was supported by the European Union’s “Fonds in the OTU choice strategy lowers the final number of OTUs. Européen de Développement Régional (FEDER)” program This is due to the fact that some OTUs are initially created and the Auvergne-Rhone-Alpes region. References [Rognes et al., 2016] Torbjørn Rognes, Tomáš Flouri, Ben Nichols, Christopher Quince, and Frédéric Mahé. [Albanese et al., 2015] Davide Albanese, Paolo Fontana, Vsearch: a versatile open source tool for metagenomics. Carlotta De Filippo, Duccio Cavalieri, and Claudio Do- PeerJ, 4:e2584, 2016. nati. Micca: a complete and accurate software for taxo- nomic profiling of metagenomic data. Scientific reports, [Roux et al., 2011] Simon Roux, Michaël Faubladier, An- 5:9743, 2015. toine Mahul, Nils Paulhe, Aurélien Bernard, Didier De- broas, and François Enault. Metavir: a web server ded- [Antoine et al., 2012] Violaine Antoine, Benjamin Quost, icated to virome analysis. Bioinformatics, 27(21):3074– Marie-Hélène Masson, and Thierry Denoeux. CECM: 3075, 2011. constrained evidential c-means algorithm. Computational Statistics & Data Analysis, 56(4):894–914, 2012. [Westcott and Schloss, 2015] Sarah L Westcott and Patrick D Schloss. De novo clustering methods out- [Antoine et al., 2014] Violaine Antoine, Benjamin Quost, perform reference-based methods for assigning 16s rrna Marie-Hélène Masson, and Thierry Denoeux. CEVCLUS: gene sequences to operational taxonomic units. PeerJ, evidential clustering with instance-level constraints for re- 3:e1487, 2015. lational data. Soft Comput., 18(7):1321–1335, 2014. [Chen et al., 2013] Wei Chen, Clarence K Zhang, Yongmei Cheng, Shaowu Zhang, and Hongyu Zhao. A compari- son of methods for clustering 16s rrna sequences into otus. PloS one, 8(8):e70837, 2013. [Edgar, 2010] Robert C Edgar. Search and clustering or- ders of magnitude faster than blast. Bioinformatics, 26(19):2460–2461, 2010. [Gath and Geva, 1989] Isak Gath and Amir B. Geva. Unsu- pervised optimal fuzzy clustering. IEEE Transactions on pattern analysis and machine intelligence, 11(7):773–780, 1989. [Hariz et al., 2006] Sarra Ben Hariz, Zied Elouedi, and Khaled Mellouli. Clustering approach using belief func- tion theory. In International Conference on Artificial Intel- ligence: Methodology, Systems, and Applications, pages 162–171. Springer, 2006. [Hugoni et al., 2013] Mylène Hugoni, Najwa Taib, Didier Debroas, Isabelle Domaizon, Isabelle Jouan Dufournel, Gisèle Bronner, Ian Salter, Hélène Agogué, Isabelle Mary, and Pierre E Galand. Structure of the rare archaeal bio- sphere and seasonal dynamics of active ecotypes in sur- face coastal waters. Proceedings of the National Academy of Sciences, 110(15):6004–6009, 2013. [Li and Godzik, 2006] Weizhong Li and Adam Godzik. Cd- hit: a fast program for clustering and comparing large sets of protein or nucleotide sequences. Bioinformatics, 22(13):1658–1659, 2006. [Li et al., 2012] Weizhong Li, Limin Fu, Beifang Niu, Sitao Wu, and John Wooley. Ultrafast clustering algorithms for metagenomic sequence analysis. Briefings in bioinformat- ics, page bbs035, 2012. [Mahé et al., 2014] Frédéric Mahé, Torbjørn Rognes, Christopher Quince, Colomban de Vargas, and Micah Dunthorn. Swarm: robust and fast clustering method for amplicon-based studies. PeerJ, 2:e593, 2014. [Pérez-Suárez et al., 2013] Airel Pérez-Suárez, José F Martı́nez-Trinidad, Jesús A Carrasco-Ochoa, and José E Medina-Pagola. Oclustr: A new graph-based algorithm for overlapping clustering. Neurocomputing, 121:234–247, 2013.