BINOS4DNA: Bitmap Indexes and NoSQL for Identifying Species with DNA Signatures through Metagenomics Samples

  • Conference paper

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 8649)

Abstract

The advancement of next generation sequencing (NGS) and shotgun sequencing technologies has produced massive amounts of genomic data. Metagenomics, a powerful technique for studying the genetic material of uncultivable microorganisms obtained directly from their natural environment, deals with high-throughput sequencing read data sets. Assembling, binning and aligning short reads in order to identify the microorganisms in a metagenomics sample is expensive and time-consuming, quite apart from other restrictions. A DNA signature is a short nucleotide sequence fragment that distinguishes one species from all other species. It can serve as a basis for identifying microorganisms in both environmental and clinical samples directly from the short reads, without assembly and alignment. In this paper, we propose a scalable method that uses an optimization technique borrowed from database technology, namely bitmap indexes. They are used to speed up searching and matching of billions of DNA signatures in the short reads of thousands of different microorganisms, using commodity high-performance computing platforms such as Hadoop MapReduce, Hive and HBase.
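As a rough illustration of the general idea of a bitmap index over reads, the sketch below keeps one bitmap per k-mer, with bit i set when read i contains that k-mer, and looks up a species' signatures by OR-ing their bitmaps. It is a toy in-memory version and not the paper's Hadoop, Hive or HBase implementation. The function names, the sample reads, the species labels and the assumption that every signature has the same fixed length k are all hypothetical.

```python
from collections import defaultdict


def build_bitmap_index(reads, k):
    """Map each k-mer to an integer bitmap; bit i is set when read i contains it."""
    index = defaultdict(int)
    for i, read in enumerate(reads):
        for start in range(len(read) - k + 1):
            index[read[start:start + k]] |= 1 << i
    return index


def reads_matching_species(index, signatures):
    """OR the bitmaps of a species' signatures and return the matching read ids."""
    bitmap = 0
    for sig in signatures:
        # .get() avoids creating empty entries for signatures that never occur
        bitmap |= index.get(sig, 0)
    return [i for i in range(bitmap.bit_length()) if (bitmap >> i) & 1]


# Hypothetical toy data: three short reads and one signature per species.
reads = ["ACGTAC", "TTGACA", "GGACGT"]
signatures = {
    "species_a": ["ACGT"],
    "species_b": ["TGAC"],
    "species_c": ["CCCC"],
}

index = build_bitmap_index(reads, k=4)
hits = {name: reads_matching_species(index, sigs) for name, sigs in signatures.items()}
print(hits)
```

The index is built once in a single pass over the reads. After that, each signature lookup is a dictionary access followed by a bitwise OR, so no read is rescanned or aligned.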

This work was performed while Ramin Karimi was visiting the LIAS/ISAE-ENSMA Lab. The visit was funded by the ERASMUS mobility program. The work was also supported in part by the projects TMOP-4.2.2.C-11/1/KONV-2012-0001 and TMOP 4.2.4. A/2-11-1-2012-0001, supported by the European Union and co-financed by the European Social Fund, and by the OTKA grant NK101680.

Copyright information

© 2014 Springer International Publishing Switzerland

About this paper

Cite this paper

Karimi, R., Bellatreche, L., Girard, P., Boukorca, A., Hajdu, A. (2014). BINOS4DNA: Bitmap Indexes and NoSQL for Identifying Species with DNA Signatures through Metagenomics Samples. In: Bursa, M., Khuri, S., Renda, M.E. (eds) Information Technology in Bio- and Medical Informatics. ITBAM 2014. Lecture Notes in Computer Science, vol 8649. Springer, Cham. https://doi.org/10.1007/978-3-319-10265-8_1

  • DOI: https://doi.org/10.1007/978-3-319-10265-8_1

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-10264-1

  • Online ISBN: 978-3-319-10265-8

  • eBook Packages: Computer Science, Computer Science (R0)
