Fast Program for Clustering and Comparing Large Sets of Protein or Nucleotide Sequences

Li, Weizhong

doi:10.1007/978-1-4614-6418-1_221-2

Living reference work entry
First Online: 01 January 2014

Synonyms

CD-HIT is a fast program for clustering large sets of protein and nucleotide sequences

Definition

Sequence clustering is the process of grouping sequences into clusters such that similar sequences are placed together and can potentially be represented by a single representative sequence. CD-HIT uses a greedy incremental clustering algorithm, enhanced by an efficient word-filtering heuristic and an effective parallelization technique, to cluster large sequence datasets efficiently.
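
The idea of greedy incremental clustering with a word filter can be illustrated with a minimal Python sketch. This is not CD-HIT's implementation: the function names, the default word length k=3, and the ungapped toy comparison that stands in for a real sequence alignment are all assumptions made for illustration. The word-count bound is derived under the same ungapped assumption: a sequence of length L that must match at least ceil(t * L) positions can have at most L - ceil(t * L) mismatches, and each mismatch can disrupt at most k overlapping words.

```python
import math
from collections import Counter


def kmer_counts(seq, k):
    """Count every overlapping word (k-mer) of length k in seq."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def required_matches(length, threshold):
    """Smallest number of identical positions that reaches the threshold."""
    # The small epsilon guards against floating-point noise in threshold * length.
    return math.ceil(threshold * length - 1e-9)


def min_shared_kmers(length, k, threshold):
    """Lower bound on shared k-mers, assuming an ungapped comparison."""
    mismatches = length - required_matches(length, threshold)
    return max(0, length - k + 1 - k * mismatches)


def best_ungapped_matches(rep, seq):
    """Toy stand-in for alignment: best count of identical positions
    when seq is slid along rep without gaps (rep is at least as long)."""
    best = 0
    for offset in range(len(rep) - len(seq) + 1):
        matches = sum(1 for x, y in zip(rep[offset:], seq) if x == y)
        best = max(best, matches)
    return best


def greedy_cluster(sequences, threshold=0.9, k=3):
    """Return clusters as lists; the first member of each is its representative."""
    # Longest first, so every representative is the longest member of its cluster.
    ordered = sorted(sequences, key=len, reverse=True)
    clusters = []  # each entry: (k-mer counts of representative, members)
    for seq in ordered:
        counts = kmer_counts(seq, k)
        needed_words = min_shared_kmers(len(seq), k, threshold)
        needed_matches = required_matches(len(seq), threshold)
        for rep_counts, members in clusters:
            # Word filter: skip the costly comparison if too few words are shared.
            if sum((rep_counts & counts).values()) < needed_words:
                continue
            if best_ungapped_matches(members[0], seq) >= needed_matches:
                members.append(seq)
                break
        else:
            # No representative was similar enough, so seq starts a new cluster.
            clusters.append((counts, [seq]))
    return [members for _, members in clusters]
```

Calling greedy_cluster on a list of sequence strings returns one list per cluster. Because each sequence is compared only against existing representatives, and most of those comparisons are rejected by the cheap word count, the expensive comparison step runs on a small fraction of all pairs.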

Introduction

Since the development of high-throughput sequencing technologies, the number of available biological sequences has increased dramatically and continues to grow rapidly. Efficient handling and effective analysis of such massive amounts of data have become major challenges in much sequencing-based research. These challenges are typically dominated by two factors: huge data size and high sequence redundancy. Sequence clustering is a key technique that...

References

Fu L, Niu B, Zhu Z, et al. CD-HIT: accelerated for clustering the next generation sequencing data. Bioinformatics. 2012;28(23):3150–2.
Holm L, Sander C. Removing near-neighbour redundancy from large protein sequence collections. Bioinformatics. 1998;14:423–9.
Li W, Godzik A. Cd-hit: a fast program for clustering and comparing large sets of protein or nucleotide sequences. Bioinformatics. 2006;22:1658–9.
Li W, Jaroszewski L, Godzik A. Clustering of highly homologous sequences to reduce the size of large protein database. Bioinformatics. 2001;17:282–3.
Li W, Jaroszewski L, Godzik A. Tolerating some redundancy significantly speeds up clustering of large protein databases. Bioinformatics. 2002;18:77–82.
Li W, Fu L, Niu B, Wu S, Wooley J. Ultrafast clustering algorithms for metagenomic sequence analysis. Brief Bioinform. 2012;13:656–68.

Author information

Authors and Affiliations

Center for Research in Biological Systems, University of California San Diego, 9500 Gilman Drive MC0446, La Jolla, CA, 92093-0446, USA
Weizhong Li

Corresponding author

Correspondence to Weizhong Li.

Editor information

Editors and Affiliations

J. Craig Venter Institute (JCVI), Rockville, Maryland, USA
Karen E. Nelson

About this entry

Cite this entry

Li, W. (2014). Fast Program for Clustering and Comparing Large Sets of Protein or Nucleotide Sequences. In: Nelson, K. (eds) Encyclopedia of Metagenomics. Springer, New York, NY. https://doi.org/10.1007/978-1-4614-6418-1_221-2

DOI: https://doi.org/10.1007/978-1-4614-6418-1_221-2
Received: 08 January 2014
Accepted: 08 January 2014
Published: 12 April 2014
Publisher Name: Springer, New York, NY
Online ISBN: 978-1-4614-6418-1
eBook Packages: Springer Reference Biomedicine and Life Sciences; Reference Module Biomedical and Life Sciences
