Probabilistic Approach for DNA Compression

  • Chapter

Part of the book series: Studies in Computational Intelligence (SCI, volume 190)

Abstract

Rapid advancements in DNA sequence discovery research have led to a wide range of compression algorithms. Storing each of the four possible bases of a DNA sequence requires two bits, but efficient algorithms have pushed this limit lower. With the constant decrease in the prices of memory and communication channel bandwidth, one often doubts the need for such compression algorithms. The algorithm discussed in this chapter compresses the DNA sequence and also allows one to generate finite-length sequences, which can be used to find approximate pattern matches. DNA sequences are mainly of two types: repetitive and non-repetitive. The compression technique used is meant for the non-repetitive parts of the sequence, where we make use of the fact that a DNA sequence consists of only four characters. The algorithm achieves a bit/base ratio of 1.3-1.4 (dependent on the database), but, more importantly, one of the stages of the algorithm can be used for efficient discovery of approximate patterns.
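To make the two-bits-per-base baseline mentioned in the abstract concrete, here is a minimal Python sketch of fixed-width packing. It is not the chapter's algorithm, only the naive encoding that such algorithms aim to beat; the names `CODE`, `BASES`, `pack`, and `unpack`, and the A/C/G/T-to-number assignment, are assumptions made for illustration.

```python
# Illustrative baseline only (not the chapter's method): fixed 2-bit packing.
# The mapping A=0, C=1, G=2, T=3 is an arbitrary assumption.
CODE = {"A": 0, "C": 1, "G": 2, "T": 3}
BASES = "ACGT"


def pack(seq: str) -> bytes:
    """Pack a DNA string into bytes, four bases per byte."""
    out = bytearray()
    for i in range(0, len(seq), 4):
        chunk = seq[i:i + 4]
        byte = 0
        for base in chunk:
            byte = (byte << 2) | CODE[base]
        # Left-align a short final chunk so every byte unpacks the same way.
        byte <<= 2 * (4 - len(chunk))
        out.append(byte)
    return bytes(out)


def unpack(data: bytes, n: int) -> str:
    """Recover the first n bases from packed bytes."""
    bases = []
    for byte in data:
        for shift in (6, 4, 2, 0):
            bases.append(BASES[(byte >> shift) & 3])
    return "".join(bases[:n])
```

Because the final byte may be padded, the sequence length has to be stored alongside the packed data. A reported ratio of 1.3-1.4 bits per base would correspond to roughly 65-70% of the size this baseline produces.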

Copyright information

© 2009 Springer-Verlag Berlin Heidelberg

About this chapter

Cite this chapter

Venugopal, K.R., Srinivasa, K.G., Patnaik, L.M. (2009). Probabilistic Approach for DNA Compression. In: Soft Computing for Data Mining Applications. Studies in Computational Intelligence, vol 190. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-00193-2_14

  • DOI: https://doi.org/10.1007/978-3-642-00193-2_14

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-642-00192-5

  • Online ISBN: 978-3-642-00193-2

  • eBook Packages: Engineering, Engineering (R0)
