Exploring Apache Spark Data APIs for Water Big Data Management

  • Conference paper

Part of the book series: Advances in Intelligent Systems and Computing (AISC, volume 913)

Abstract

Managing data complexity is a recurring problem in many domains related to water resources management, such as utilities and hydrological and meteorological modelling. Since the advent of intelligent sensors, we have observed systematic growth in the volume of collected data. These sensors also generate near-real-time data in various formats. To extract value from such water datasets, we need new solutions that are efficient enough to manage massive data arriving from intelligent sensors in near real time and in various formats. In this paper, we present a reference architecture for managing massive data collected from smart meters. We also show how recent advances in big data technologies, mainly the Apache Spark project, can be used effectively to obtain insights from massive datasets. Finally, we present the advantages of Spark's distributed execution model by exploring three Apache Spark APIs: RDD, DataFrame, and SparkR.

Notes

  1. Such advanced queries will be developed separately in further works.

  2. https://code.google.com/p/smart-meter-information-portal/

Author information

Corresponding author

Correspondence to Nassif El Hassane.

Copyright information

© 2019 Springer Nature Switzerland AG

About this paper

Cite this paper

El Hassane, N., Hajji, H. (2019). Exploring Apache Spark Data APIs for Water Big Data Management. In: Ezziyyani, M. (eds) Advanced Intelligent Systems for Sustainable Development (AI2SD’2018). AI2SD 2018. Advances in Intelligent Systems and Computing, vol 913. Springer, Cham. https://doi.org/10.1007/978-3-030-11881-5_10
