Definition
Sequence clustering is the process of partitioning sequences into groups (clusters) so that similar sequences fall into the same cluster and each cluster can potentially be represented by a single representative sequence. CD-HIT uses a greedy incremental clustering algorithm, enhanced by an efficient word-filtering heuristic and an effective parallelization technique, to cluster large sequence datasets efficiently.
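The idea of greedy incremental clustering with a word filter can be illustrated with a minimal Python sketch. This is not the CD-HIT implementation. Several details are assumptions made for illustration only: the function names, the longest-first processing order, the use of the best gap-free placement as a stand-in for a real alignment, and the particular bound on shared words. Parallelization is not shown.

```python
import math
from collections import Counter


def kmer_counts(seq, k):
    """Count every overlapping word of length k in seq."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def shared_kmers(a, b):
    """Number of words two count tables have in common (multiset intersection)."""
    return sum((a & b).values())


def best_ungapped_matches(query, rep):
    """Most positional matches over all gap-free placements of query inside rep.

    Illustrative stand-in for a real alignment; assumes len(rep) >= len(query).
    """
    best = 0
    for offset in range(len(rep) - len(query) + 1):
        matches = sum(q == r for q, r in zip(query, rep[offset:]))
        best = max(best, matches)
    return best


def greedy_cluster(seqs, threshold=0.9, k=2):
    """Return clusters as lists; the first member of each is its representative."""
    order = sorted(seqs, key=len, reverse=True)  # assumption: longest first
    reps = []      # (representative sequence, its word counts)
    clusters = []  # clusters[i] holds the members represented by reps[i]
    for seq in order:
        counts = kmer_counts(seq, k)
        # Matches needed to reach the identity threshold (epsilon guards
        # against floating-point round-up, e.g. 0.7 * 10).
        required = math.ceil(threshold * len(seq) - 1e-9)
        max_mismatches = len(seq) - required
        # Each mismatch destroys at most k of the query's words, so a true
        # hit must still share at least this many words.
        min_shared = (len(seq) - k + 1) - k * max_mismatches
        for i, (rep, rep_counts) in enumerate(reps):
            if shared_kmers(counts, rep_counts) < min_shared:
                continue  # word filter: skip the expensive comparison
            if best_ungapped_matches(seq, rep) >= required:
                clusters[i].append(seq)  # greedy: join the first match
                break
        else:
            # No representative was similar enough: start a new cluster.
            reps.append((seq, counts))
            clusters.append([seq])
    return clusters


if __name__ == "__main__":
    example = ["ACGTACGTAC", "ACGTACGTAA", "TTTTGGGGCC", "ACGTACGT"]
    for cluster in greedy_cluster(example, threshold=0.9, k=2):
        print(cluster[0], "represents", len(cluster), "sequence(s)")
```

Each sequence is compared only against existing representatives, never against all other sequences, and the cheap word count rejects most candidate pairs before the costlier comparison runs. Under the gap-free identity used here the filter is safe: it never rejects a pair that would have passed the identity test.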
Introduction
Since the development of high-throughput sequencing technologies, the number of available biological sequences has increased dramatically and continues to grow rapidly. Efficient handling and effective analysis of such massive amounts of data have become major challenges in much sequencing-based research. These challenges are typically dominated by two factors: huge data size and high sequence redundancy. Sequence clustering is a key technique that...
References
Fu L, Niu B, Zhu Z, et al. CD-HIT: accelerated for clustering the next generation sequencing data. Bioinformatics. 2012;28(23):3150–2.
Holm L, Sander C. Removing near-neighbour redundancy from large protein sequence collections. Bioinformatics. 1998;14:423–9.
Li W, Godzik A. Cd-hit: a fast program for clustering and comparing large sets of protein or nucleotide sequences. Bioinformatics. 2006;22:1658–9.
Li W, Jaroszewski L, Godzik A. Clustering of highly homologous sequences to reduce the size of large protein database. Bioinformatics. 2001;17:282–3.
Li W, Jaroszewski L, Godzik A. Tolerating some redundancy significantly speeds up clustering of large protein databases. Bioinformatics. 2002;18:77–82.
Li W, Fu L, Niu B, Wu S, Wooley J. Ultrafast clustering algorithms for metagenomic sequence analysis. Brief Bioinform. 2012;13:656–68.
Copyright information
© 2014 Springer Science+Business Media New York
About this entry
Cite this entry
Li, W. (2014). Fast Program for Clustering and Comparing Large Sets of Protein or Nucleotide Sequences. In: Nelson, K. (eds) Encyclopedia of Metagenomics. Springer, New York, NY. https://doi.org/10.1007/978-1-4614-6418-1_221-2
DOI: https://doi.org/10.1007/978-1-4614-6418-1_221-2
Publisher Name: Springer, New York, NY
Online ISBN: 978-1-4614-6418-1
eBook Packages: Springer Reference Biomedicine and Life Sciences; Reference Module Biomedical and Life Sciences