Abstract
R is a powerful programming environment for data analysis. However, R is a main-memory-based functional programming environment, so when it deals with big data, data movement and memory swapping become the major performance bottlenecks. As a result, executing a big-data-intensive R program can be many orders of magnitude less efficient than processing a SQL query for the same analytic task directly inside the database. Although a number of “parallel-R” solutions exist, pushing R operations down to the parallel database layer, while retaining the natural R interface and the virtual R analytics flow, remains a very competitive alternative.
This has motivated us to develop the R-Vertica framework, which scales out R applications through in-DB, data-parallel analytics. To extend the R programming environment to parallel query processing transparently to R users, we introduce the notion of the R Proxy: an R object whose instance is maintained in the parallel database as partitioned data sets, while its schema (header) is retained in the memory-based R environment. A function (such as an aggregation) applied to a proxy is pushed down to the parallel database layer as SQL queries or procedures, and the query results are automatically returned and converted to R objects. Transparent two-way mappings between several major types of R objects and database tables or query results integrate the R environment seamlessly with the underlying parallel database. R proxies may be created from database table schemas, from in-DB operations, or from operations that persist R objects to the database. The instances of R proxies can be retrieved into regular R objects using SQL queries. With this framework, an R application is expressed as an analytics flow in which R objects bear small data and R proxies represent, but do not bear, big data. The big data are manipulated, or flow, underneath the in-memory R environment through in-DB, data-parallel operations.
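The proxy idea can be illustrated with a minimal sketch. This is not the R-Vertica implementation: it is written in Python rather than R, uses the standard-library sqlite3 module as a stand-in for the parallel database, and the class name TableProxy and its methods persist, aggregate, and collect are hypothetical names chosen for illustration only. It shows a proxy that keeps only the schema in memory, pushes an aggregation down as SQL, and retrieves the full instance only on request.

```python
import sqlite3


class TableProxy:
    """Hypothetical stand-in for an R proxy: only the schema lives in memory."""

    AGGREGATES = {"SUM", "AVG", "MIN", "MAX", "COUNT"}

    def __init__(self, conn, table):
        self.conn = conn
        self.table = table
        # Fetch the header (column names) only; LIMIT 0 returns no rows.
        cur = conn.execute(f'SELECT * FROM "{table}" LIMIT 0')
        self.columns = [d[0] for d in cur.description]

    @classmethod
    def persist(cls, conn, table, columns, rows):
        """Persist an in-memory data set to the database and return its proxy."""
        cols = ", ".join(f'"{c}"' for c in columns)
        marks = ", ".join("?" for _ in columns)
        conn.execute(f'CREATE TABLE "{table}" ({cols})')
        conn.executemany(f'INSERT INTO "{table}" VALUES ({marks})', rows)
        return cls(conn, table)

    def aggregate(self, func, value_col, group_col):
        """Push an aggregation down as SQL; only the small result comes back."""
        func = func.upper()
        if func not in self.AGGREGATES:
            raise ValueError(f"unsupported aggregate: {func}")
        for col in (value_col, group_col):
            if col not in self.columns:
                raise KeyError(f"unknown column: {col}")
        sql = (
            f'SELECT "{group_col}", {func}("{value_col}") '
            f'FROM "{self.table}" GROUP BY "{group_col}"'
        )
        return dict(self.conn.execute(sql).fetchall())

    def collect(self):
        """Retrieve the full instance as regular in-memory rows."""
        return self.conn.execute(f'SELECT * FROM "{self.table}"').fetchall()


# Usage: the data stay in the database; the proxy carries only the header.
conn = sqlite3.connect(":memory:")  # assumption: SQLite replaces Vertica here
sales = TableProxy.persist(
    conn,
    "sales",
    ["region", "amount"],
    [("east", 10.0), ("west", 5.0), ("east", 2.5)],
)
totals = sales.aggregate("SUM", "amount", "region")
```

In this sketch, totals is an ordinary small dictionary produced by the database, while sales itself never holds the rows. A real parallel database would additionally partition the table across nodes and evaluate the aggregate in parallel, which SQLite does not do.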
We have implemented the proposed approach and used it to integrate several large-scale R applications with the multi-node Vertica parallel database system. Our experience illustrates the unique features and efficiency of the R-Vertica framework.
Copyright information
© 2012 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Chen, Q., Hsu, M., Wu, R., Shan, J. (2012). R-Proxy Framework for In-DB Data-Parallel Analytics. In: Liddle, S.W., Schewe, KD., Tjoa, A.M., Zhou, X. (eds) Database and Expert Systems Applications. DEXA 2012. Lecture Notes in Computer Science, vol 7447. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-32597-7_24
DOI: https://doi.org/10.1007/978-3-642-32597-7_24
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-32596-0
Online ISBN: 978-3-642-32597-7
eBook Packages: Computer Science, Computer Science (R0)