Abstract
Managing data complexity is a recurring problem in several domains related to water resources management, such as utilities and hydrological and meteorological modelling. Since the advent of intelligent sensors, the volume of collected data has grown steadily. In addition, these sensors generate data in near real time and in various formats. Extracting value from such water datasets requires new solutions that are efficient enough to manage massive data arriving from intelligent sensors in near real time and in various formats. In this paper, we present a reference architecture for managing massive data collected from smart meters. We also show how recent advances in big data technologies, mainly the Apache Spark project, can be used effectively to obtain insights from massive datasets. Finally, we present the advantages of Spark's distributed execution model by exploring three Apache Spark APIs: RDD, DataFrame, and SparkR.
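As a rough illustration of the distributed execution model mentioned above, the following plain-Python sketch mimics the partition, local-reduce, and merge pattern that Spark's RDD API expresses with operations such as `map` and `reduceByKey`. It does not use Spark itself and is not the paper's implementation; the meter identifiers, the readings, and all function names are hypothetical and chosen only for this example.

```python
from collections import defaultdict
from functools import reduce

# Hypothetical smart-meter readings: (meter_id, litres) pairs.
readings = [
    ("meter-a", 12.0),
    ("meter-b", 7.5),
    ("meter-a", 3.0),
    ("meter-c", 9.0),
    ("meter-b", 2.5),
]


def split_into_partitions(records, n):
    """Deal records round-robin into n partitions, as a cluster spreads data over workers."""
    return [records[i::n] for i in range(n)]


def reduce_partition(partition):
    """Local step: sum litres per meter inside a single partition."""
    totals = defaultdict(float)
    for meter_id, litres in partition:
        totals[meter_id] += litres
    return dict(totals)


def merge_totals(left, right):
    """Global step: combine two partial per-meter results into one."""
    merged = dict(left)
    for meter_id, litres in right.items():
        merged[meter_id] = merged.get(meter_id, 0.0) + litres
    return merged


partitions = split_into_partitions(readings, 2)
# On a real cluster each partition would be reduced in parallel on a different worker.
partials = [reduce_partition(p) for p in partitions]
totals_per_meter = reduce(merge_totals, partials, {})
print(totals_per_meter)
```

The point of the sketch is that each partition is reduced independently before the partial results are merged, which is what allows this kind of aggregation to scale out across machines.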
Notes
- 1. Such advanced queries will be developed separately in further works.
Copyright information
© 2019 Springer Nature Switzerland AG
About this paper
Cite this paper
El Hassane, N., Hajji, H. (2019). Exploring Apache Spark Data APIs for Water Big Data Management. In: Ezziyyani, M. (eds) Advanced Intelligent Systems for Sustainable Development (AI2SD’2018). AI2SD 2018. Advances in Intelligent Systems and Computing, vol 913. Springer, Cham. https://doi.org/10.1007/978-3-030-11881-5_10
DOI: https://doi.org/10.1007/978-3-030-11881-5_10
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-11880-8
Online ISBN: 978-3-030-11881-5
eBook Packages: Intelligent Technologies and Robotics (R0)