Abstract
A stream is a continuously arriving, unbounded sequence of data. Analyzing such data in real time has become a necessity, and it calls for technology that can compute efficiently over data distributed across several clusters. Earlier parallel streaming systems lacked consistency, had difficulty combining historical data with streaming data, and coped poorly with slow nodes. These needs led to the Apache Spark Streaming API, a framework for scalable, fault-tolerant, high-throughput stream processing. This chapter introduces the main concepts of Spark Streaming, including the operations it supports. Finally, it explores two other important platforms, Apache Kafka and Amazon Kinesis, and their integration with Spark.
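The idea of treating an unbounded stream as a series of small batches can be sketched in a few lines of plain Python. This is an illustrative sketch only, not Spark's API: the helper names `micro_batches` and `word_counts_per_batch` are hypothetical, a short list stands in for an unbounded source, and batches are cut by record count for simplicity, whereas Spark Streaming cuts them by time interval.

```python
from collections import Counter
from itertools import islice
from typing import Iterable, Iterator, List


def micro_batches(stream: Iterable[str], batch_size: int) -> Iterator[List[str]]:
    """Cut a (possibly unbounded) stream into consecutive fixed-size batches."""
    it = iter(stream)
    while True:
        batch = list(islice(it, batch_size))
        if not batch:
            return  # source exhausted; a real stream would simply keep going
        yield batch


def word_counts_per_batch(stream: Iterable[str], batch_size: int) -> Iterator[Counter]:
    """Apply an ordinary batch computation (word count) to each micro-batch."""
    for batch in micro_batches(stream, batch_size):
        counts: Counter = Counter()
        for line in batch:
            counts.update(line.split())
        yield counts


# Assumption: a small finite list stands in for a live source such as a socket.
lines = ["spark streaming", "spark kafka", "kinesis spark", "kafka"]
results = list(word_counts_per_batch(lines, batch_size=2))
for i, counts in enumerate(results):
    print(f"batch {i}: {dict(counts)}")
```

Because each batch is processed by the same code that would process static data, the same logic can in principle be reused across historical and streaming inputs, which is one of the difficulties the abstract attributes to earlier systems.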
Copyright information
© 2019 Springer Nature Singapore Pte Ltd.
About this chapter
Cite this chapter
Bansal, A., Jain, R., Modi, K. (2019). Big Data Streaming with Spark. In: Mittal, M., Balas, V., Goyal, L., Kumar, R. (eds) Big Data Processing Using Spark in Cloud. Studies in Big Data, vol 43. Springer, Singapore. https://doi.org/10.1007/978-981-13-0550-4_2
DOI: https://doi.org/10.1007/978-981-13-0550-4_2
Publisher Name: Springer, Singapore
Print ISBN: 978-981-13-0549-8
Online ISBN: 978-981-13-0550-4
eBook Packages: Engineering, Engineering (R0)