Encyclopedia of Big Data Technologies

Living Edition
Editors: Sherif Sakr, Albert Zomaya

The R Language: A Powerful Tool for Taming Big Data

  • Norman Matloff
  • Clark Fitzgerald
  • Robin Yancey
Living reference work entry
DOI: https://doi.org/10.1007/978-3-319-63962-8_294-1

Definition

The R language (R Core Team 2017; Chambers 2008; Matloff 2011) is currently the most popular tool in the general data science field. It features outstanding graphics capabilities and a rich set of more than 10,000 library packages to draw upon, and its interfaces to SQL databases and to C/C++ are first rate. (Other notable languages in data science are Python and Julia: Python is popular among those trained in computer science, while Julia, a new language, has the production of fast code as its top priority.) All of this, along with recent developments regarding memory issues, leaves R well poised to serve as a highly effective tool in Big Data applications. This chapter presents the use of R in Big Data settings.

Big Data can be “big” in either of two ways, phrased in terms of the classical n × p matrix representing a dataset (n rows of data points, p columns of variables):
  • Big-n: Large number of data points.

  • Big-p: Large number of variables/features.

Both senses will come into play later. For now, though,...


References

  1. Breshears C (2009) The art of concurrency: a thread monkey’s guide to writing parallel applications. O’Reilly Media, Sebastopol
  2. Bühlmann P, Drineas P, Kane M, van der Laan M (2016) Handbook of big data. Chapman & Hall/CRC handbooks of modern statistical methods. CRC Press, Boca Raton
  3. Chambers J (2008) Software for data analysis: programming with R. Statistics and computing. Springer, New York
  4. Chang W (2013) R graphics cookbook. Oreilly and associate series. O’Reilly Media, Sebastopol, CA
  5. Eddelbuettel D (2013) Seamless R and C++ integration with Rcpp. Use R! Springer, New York
  6. Inselberg A (2009) Parallel coordinates: visual multidimensional geometry and its applications. Springer, New York
  7. Kane MJ, Emerson J, Weston S (2013) Scalable strategies for computing with massive data. J Stat Softw 55(14):1–19
  8. Lichman M (2017) UCI machine learning repository. http://archive.ics.uci.edu/ml
  9. Luraschi J, Ushey K, Allaire J (2017) Sparklyr: R interface to Apache Spark. https://CRAN.R-project.org/package=sparklyr
  10. Matloff N (2011) The art of R programming: a tour of statistical software design. No starch press series. No Starch Press, San Francisco
  11. Matloff N (2015) Parallel computing for data science: with examples in R, C++ and CUDA. Chapman & Hall/CRC the R series. CRC Press, Boca Raton
  12. Matloff N (2016) Software Alchemy: turning complex statistical computations into embarrassingly-parallel ones. J Stat Softw 71(4):1–15
  13. Matloff N, Fitzgerald C, Davis R, Yancey R, Huang S (2017a) partools: tools for the ‘Parallel’ package. https://github.com/matloff/partools
  14. Matloff N, Yang V, Nguyen H (2017b) cdparcoord: top frequency-based parallel coordinates. https://CRAN.R-project.org/package=cdparcoord
  15. Murrell P (2011) R graphics, 2nd edn. Chapman & Hall/CRC the R series. Taylor & Francis, Boca Raton, FL
  16. Nielsen F (2016) Introduction to HPC with MPI for data science. Undergraduate topics in computer science. Springer International Publishing, Cham
  17. Plotly Technologies Inc (2015) Collaborative data science. https://plot.ly
  18. R Core Team (2017) R: a language and environment for statistical computing. R Foundation for Statistical Computing, Vienna. https://www.R-project.org/
  19. Reinders J (2007) Intel threading building blocks: outfitting C++ for multi-core processor parallelism. O’Reilly series. O’Reilly Media, Sebastopol
  20. Sarkar D (2008) Lattice: multivariate data visualization with R. Use R! Springer, New York
  21. Unwin A, Theus M, Hofmann H (2007) Graphics of large datasets: visualizing a million. Statistics and computing. Springer, New York
  22. Weston S (2017) foreach: provides foreach looping construct for R. https://CRAN.R-project.org/package=foreach
  23. Wickham H (2016) Ggplot2: elegant graphics for data analysis. Use R! Springer International Publishing, New York
  24. Yang V, Nguyen H, Matloff N, Xie Y (2017) Top-frequency parallel coordinates plots. arXiv:1709.00665

Copyright information

© Springer International Publishing AG 2018

Authors and Affiliations

  • Norman Matloff (1)
  • Clark Fitzgerald (2)
  • Robin Yancey (3)

  1. Department of Computer Science, University of California, Davis, USA
  2. Department of Statistics, University of California, Davis, USA
  3. Department of Electrical and Computer Engineering, University of California, Davis, USA

Section editors and affiliations

  • Sherif Sakr (1)

  1. School of Computer Science and Engineering (CSE), University of New South Wales