Fast Program for Clustering and Comparing Large Sets of Protein or Nucleotide Sequences

  • Living reference work entry

Synonyms

CD-HIT is a fast program for clustering large numbers of protein and nucleotide sequences.

Definition

Sequence clustering is the process of partitioning sequences into groups (clusters) such that similar sequences are placed together and each cluster can potentially be represented by a single representative sequence. CD-HIT uses a greedy incremental clustering algorithm, enhanced by an efficient word-filtering heuristic and an effective parallelization technique, to cluster large sequence datasets efficiently.
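To make the definition concrete, the following is a minimal sketch of greedy incremental clustering with a word (k-mer) filter. It is an illustration under stated assumptions, not CD-HIT's actual implementation: the function names, the default word length `k=3`, the integer percent threshold, and the ungapped position-by-position identity measure are all hypothetical simplifications chosen so that the example stays short and self-contained. The filter relies on a simple counting argument: under this ungapped measure, each mismatch can destroy at most `k` of a sequence's words, so a pair that shares too few words cannot reach the identity threshold and the more expensive comparison can be skipped. Parallelization is omitted, and sequences are assumed to be non-empty.

```python
def kmers(seq, k):
    """Return the set of all words (k-mers) that occur in seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def shared_kmer_count(seq, rep_kmers, k):
    """Count the k-mer positions of seq whose word also occurs in the representative."""
    return sum(1 for i in range(len(seq) - k + 1) if seq[i:i + k] in rep_kmers)


def min_shared_kmers(length, k, min_identity_pct):
    """Lower bound on shared k-mers for a sequence that meets the threshold.

    Assumption of this sketch: identity is ungapped, so each mismatch
    destroys at most k of the (length - k + 1) words of the sequence.
    """
    max_mismatches = length * (100 - min_identity_pct) // 100
    return max(0, length - k + 1 - k * max_mismatches)


def meets_identity(seq, rep, min_identity_pct):
    """Ungapped identity over the shorter sequence (a stand-in for a real alignment)."""
    matches = sum(a == b for a, b in zip(seq, rep))
    return matches * 100 >= min_identity_pct * len(seq)


def greedy_cluster(sequences, min_identity_pct=90, k=3):
    """Greedy incremental clustering: longest sequences are processed first.

    Each sequence joins the first representative it is similar enough to;
    otherwise it becomes the representative of a new cluster.
    Returns a dict mapping each representative to its cluster members.
    """
    ordered = sorted(sequences, key=len, reverse=True)
    representatives = []  # list of (representative, its k-mer set)
    clusters = {}
    for seq in ordered:
        required = min_shared_kmers(len(seq), k, min_identity_pct)
        for rep, rep_kmers in representatives:
            # Word filter: skip the comparison when too few words are shared.
            if shared_kmer_count(seq, rep_kmers, k) < required:
                continue
            if meets_identity(seq, rep, min_identity_pct):
                clusters[rep].append(seq)
                break
        else:
            # No representative was similar enough: start a new cluster.
            representatives.append((seq, kmers(seq, k)))
            clusters[seq] = [seq]
    return clusters
```

Calling `greedy_cluster` on a list of sequence strings returns a dictionary keyed by representative. Because sequences are processed from longest to shortest, every representative is at least as long as the other members of its cluster, which is what allows one sequence to stand in for the whole cluster.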

Introduction

Since the development of high-throughput sequencing technologies, the number of available biological sequences has increased dramatically and continues to grow rapidly. Efficient handling and effective analysis of such massive amounts of data have become a major challenge in much sequencing-based research. This challenge is typically dominated by two factors: huge data size and high sequence redundancy. Sequence clustering is a key technique that...

Author information

Correspondence to Weizhong Li.

Copyright information

© 2014 Springer Science+Business Media New York

About this entry

Cite this entry

Li, W. (2014). Fast Program for Clustering and Comparing Large Sets of Protein or Nucleotide Sequences. In: Nelson, K. (eds) Encyclopedia of Metagenomics. Springer, New York, NY. https://doi.org/10.1007/978-1-4614-6418-1_221-2

  • DOI: https://doi.org/10.1007/978-1-4614-6418-1_221-2

  • Publisher Name: Springer, New York, NY

  • Online ISBN: 978-1-4614-6418-1

  • eBook Packages: Springer Reference Biomedicine and Life Sciences, Reference Module Biomedical and Life Sciences
