=Paper=
{{Paper
|id=Vol-2133/cnia-short2
|storemode=property
|title=A de novo robust clustering approach for amplicon-based sequence data
|pdfUrl=https://ceur-ws.org/Vol-2133/cnia-short2.pdf
|volume=Vol-2133
|dblpUrl=https://dblp.org/rec/conf/rjcia/BazinDN18
}}
==A de novo robust clustering approach for amplicon-based sequence data==
A De Novo Robust Clustering Approach for Amplicon-Based Sequence Data Alexandre Bazin Didier Debroas Engelbert Mephu Nguifo contact@alexandrebazin.com 1 Introduction 2 Adding uncertainty to clus- tering Studying the structure of the communities in an Distance-based greedy clustering algorithms, such as ecosystem is central in environmental microbiology the one in VSEARCH, produce a number of OTUs [8, 14]. The biosphere’s diversity can be determined by and assign each sequence to one of them. The OTU amplifying and sequencing specific phylogenetic mark- to which a sequence is said to belong to is usually the ers (e.g. 16S rRNA). From there, these amplicons need first one to be encountered that is sufficiently close, to be clusterized in ”species” named Operational Tax- i.e. within the specified threshold. This makes a se- onomic Units (OTUs) [4, 9, 11, 15]. As the volume quence belong only to a single OTU and OTUs either of sequences has drastically increased in recent times, completely include or exclude a sequence. While these new clustering tools have emerged to treat the data would not be problems were the clustering optimal, the in reasonable time. The currently used algorithms need for fast algorithms gives rise to results that are are, from the point of view of algorithmic complex- not always trustworthy. The OTUs being presented as ity, the fastest available that do not produce random absolute, the end user has no choice, should consider results. However, due to their simplicity, the reliabil- them correct and cannot know whether the algorithm ity of the results are often discussed. These tools be- has encountered ambiguity. We believe that being less ing essentially black boxes, their sensitivity to the se- strict in the way the OTUs partition sequences would quence order, clustering threshold and structure of the help produce better results from the end user’s point data makes it that the users have no way of knowing of view. To help increase the quality of the clustering whether better Operational Taxonomic Units (OTUs) and maximize the information that can be gathered could have been obtained with different parameters or from the data, we propose to add uncertainty to the even whether they correctly represent the data. In clustering by means of fuzzy sets. these circumstances, there is no choice but to blindly trust them. Using fuzzy OTUs allows us to discern the difference between sequences close to the OTU and sequences extremely far. Using the parameters t1 and t2 , we can Distance-based greedy clustering algorithms such as tune the “detection radius” around OTUs to gather the ones implemented in OTUclust [1], VSEARCH information that would normally be discarded by the [13], CD-HIT [10] or USEARCH [5] all share the same clustering algorithm. base naive algorithm. While more sophisticated al- gorithms [3, 6, 12, 7, 2] could produce better results 3 Evaluating fuzzy OTUs quality-wise, their runtime would render them unus- able on millions of sequences. As the quality of the An ideal OTU would contain only sequences with a OTUs is important, we have to find a way to im- membership value of 1, meaning a group of sequences prove it without increasing the runtime. The differ- has been perfectly regrouped with a good threshold ent available implementations use a variety of heuris- and no sequence lies ambiguously on the border. More tics to counterbalance the simplicity of the algorithm realistically, a good OTU would contain many se- but, to the best of our knowledge, no approach has quences with high membership values and little se- tried to add a measure of uncertainty to the process. quences with low values. A bad OTU with the ma- This is why, in order to help increase the quality and jority of its sequences having low membership values trustworthiness of the clustering, we propose to add could mean that the algorithm has chosen as a center uncertainty to this simple algorithm through the use a sequence on the border of a group or, even worse, of fuzzy clustering. between two distinct groups. We can quickly evaluate the quality of an OTU with 4.2 Relevant Metrics this repartition. If we suppose that each sequence low- ers the quality of the OTU depending on its member- To measure the effects of introducing uncertainty to ship value, we can use the following formula : the clustering, we consider the computation time, memory usage, number of OTUs, singletons and pairs P9 # sequences with membership value i×0.1 and average distance in the taxonomy tree between Quality(OT U ) = 1 − i=1 ωi × # sequences in the OTU sequences in the same cluster. with ωi being the “cost” of having a sequence with membership value i × 0.1. 4.3 Analysis A problem arises with singletons that always have per- fect quality but these can safely be treated separately. Results show that the choice strategy affects every metric relevant to the quality of the clustering : num- A sequence can belong to multiple OTUs due to fuzzy ber of OTUs, singletons and pairs, average misclassifi- membership. However, in the end, we want each se- cation. The fuzzy approach uses slightly more memory quence to be assigned to a single OTU. Hence, we than VSEARCH but all choice strategies are similar have to choose one of the possible OTUs. We have on this metric. When using the default –maxaccepts two types of values left from the clustering process : and –maxrejects values, computation time is lower for membership and quality. The first one is based on the VSEARCH. However, when using higher values for distance between the OTU and the sequence and the these parameters – and thus more precise clustering - second one is used to recognize bad OTUs. Choosing the computation time is the same for both approaches. the OTU with the best membership value is akin to We observe that increasing the importance of the qual- running VSEARCH. Choosing the OTU with the best ity in the OTU choice strategy lowers the final number quality tends to create bigger OTUs that absorb dis- of OTUs. This is due to the fact that some OTUs are tant sequences. To better compromise, we can use a initially created centered on isolated sequences near linear combination of both values : good OTUs. That isolation lowers their quality and the good OTUs absorb their sequences. α × quality + β × membership Increasing the importance of the quality reduces the Using the quality also lowers the number of single- number of OTUs containing sequences. When α is tons and increases the number of pairs. This most low, the “best” OTUs quality-wise absorb very close likely means that singletons were created close to ei- sequences that would have been attributed to other ther good clusters or one another. The fuzzy approach OTUs. When α gets too high, the best OTUs start allows the algorithm to merge those sequences that absorbing all the sequences around them, effectively were slightly too far from the center with their corre- acting like an increase of the distance threshold. sponding OTU. The increase in the number of pairs appears to be due to the merging of singletons lying too close to one another. The average taxonomy dis- 4 Experimental Results tance in OTUs is shown to vary wildly. Using only the quality to choose OTUs increases this number as 4.1 Data the “best” OTUs attract all the sequences in their We used our algorithm on a dataset containing 5977 fuzzy surroundings. This causes some sequences be- sequences of length between 900 and 3081 for an aver- longing to different species to be classified together. age of 1442 and taxonomies extracted from the SILVA However, using a compromise between quality and dis- database. We used a threshold of 0.97 (97% similarity) tance lowers this metric as the best clusters only ab- for determining new OTUs and a threshold of 0.95 for sorb sequences that are sufficiently close to them and fuzzy membership. For the choice of the OTU for each should probably be together while rejecting the se- sequence, we present the results of three strategies : quences that are too different. best quality (α = 1 and β = 0), compromise (α = 0.5 and β = 0.5) and distance (α = 0 and β = 1). The comparison with VSEARCH is done using identical acknowledgements parameters when applicable. This work was supported by the European Union’s The program, dataset and corresponding taxonomy “Fonds Européen de Développement Régional are available on http://projets.isima.fr/sclust/ (FEDER)” program and the Auvergne-Rhone-Alpes Expe.html. region. Method Time (min) Memory #OTUs #Singletons #Doubletons Distance Fuzzy (best quality) 1:06 652744 3461 2581 442 0.75 Fuzzy (compromise) 1:06 651980 3596 2776 413 0.54 Fuzzy (distance) 1:06 683772 3631 2837 395 0.59 VSEARCH 0:21 632832 3716 2935 388 0.57 Table 1: Results of the clustering. References [9] Weizhong Li, Limin Fu, Beifang Niu, Sitao Wu, and John Wooley. Ultrafast clustering algorithms [1] Davide Albanese, Paolo Fontana, Carlotta De Fil- for metagenomic sequence analysis. Briefings in ippo, Duccio Cavalieri, and Claudio Donati. bioinformatics, page bbs035, 2012. Micca: a complete and accurate software for tax- onomic profiling of metagenomic data. Scientific [10] Weizhong Li and Adam Godzik. Cd-hit: a fast reports, 5:9743, 2015. program for clustering and comparing large sets of protein or nucleotide sequences. Bioinformat- [2] Violaine Antoine, Benjamin Quost, Marie-Hélène ics, 22(13):1658–1659, 2006. Masson, and Thierry Denoeux. CECM: con- [11] Frédéric Mahé, Torbjørn Rognes, Christopher strained evidential c-means algorithm. Computa- Quince, Colomban de Vargas, and Micah Dun- tional Statistics & Data Analysis, 56(4):894–914, thorn. Swarm: robust and fast clustering method 2012. for amplicon-based studies. PeerJ, 2:e593, 2014. [3] Violaine Antoine, Benjamin Quost, Marie-Hélène [12] Airel Pérez-Suárez, José F Martı́nez-Trinidad, Masson, and Thierry Denoeux. CEVCLUS: evi- Jesús A Carrasco-Ochoa, and José E Medina- dential clustering with instance-level constraints Pagola. Oclustr: A new graph-based algo- for relational data. Soft Comput., 18(7):1321– rithm for overlapping clustering. Neurocomput- 1335, 2014. ing, 121:234–247, 2013. [4] Wei Chen, Clarence K Zhang, Yongmei Cheng, [13] Torbjørn Rognes, Tomáš Flouri, Ben Nichols, Shaowu Zhang, and Hongyu Zhao. A comparison Christopher Quince, and Frédéric Mahé. Vsearch: of methods for clustering 16s rrna sequences into a versatile open source tool for metagenomics. otus. PloS one, 8(8):e70837, 2013. PeerJ, 4:e2584, 2016. [5] Robert C Edgar. Search and clustering orders [14] Simon Roux, Michaël Faubladier, Antoine Mahul, of magnitude faster than blast. Bioinformatics, Nils Paulhe, Aurélien Bernard, Didier Debroas, 26(19):2460–2461, 2010. and François Enault. Metavir: a web server dedicated to virome analysis. Bioinformatics, [6] Isak Gath and Amir B. Geva. Unsupervised 27(21):3074–3075, 2011. optimal fuzzy clustering. IEEE Transactions [15] Sarah L Westcott and Patrick D Schloss. De on pattern analysis and machine intelligence, novo clustering methods outperform reference- 11(7):773–780, 1989. based methods for assigning 16s rrna gene se- quences to operational taxonomic units. PeerJ, [7] Sarra Ben Hariz, Zied Elouedi, and Khaled Mel- 3:e1487, 2015. louli. Clustering approach using belief function theory. In International Conference on Artificial Intelligence: Methodology, Systems, and Applica- tions, pages 162–171. Springer, 2006. [8] Mylène Hugoni, Najwa Taib, Didier Debroas, Isabelle Domaizon, Isabelle Jouan Dufournel, Gisèle Bronner, Ian Salter, Hélène Agogué, Is- abelle Mary, and Pierre E Galand. Structure of the rare archaeal biosphere and seasonal dynam- ics of active ecotypes in surface coastal waters. Proceedings of the National Academy of Sciences, 110(15):6004–6009, 2013.