Evaluating Genomic Big Data Operations on SciDB and Spark
We are developing a new, holistic data management system for genomics, which provides high-level abstractions for querying large genomic datasets. We designed our system so that it leverages data management engines for low-level data access. Such a design can be adapted to two different kinds of data engines: the family of scientific databases (among them, SciDB) and the broader family of generic platforms (among them, Spark). The trade-offs are not obvious: scientific databases are expected to outperform generic platforms when they use features embedded within their specialized design, whereas generic platforms are expected to outperform scientific databases on general-purpose operations.
In this paper, we compare our SciDB and Spark implementations at work on genomic abstractions. As a benchmark, we use four typical genomic operations that stem from the concrete requirements of our project and are encoded in both SciDB and Spark. We discuss their common aspects and differences, focusing on how genomic regions and operations can be expressed using SciDB arrays. We then comparatively evaluate the performance and scalability of the two implementations over datasets consisting of billions of genomic regions.
The authors would like to thank the SciDB support team for help during Simone Cattani’s thesis and for comments at his seminar, given at SciDB on July 19, 2016. This work is supported by the ERC Advanced Grant GeCo (Data-Driven Genomic Computing).