Big Data Streaming with Spark

Bansal, Ankita; Jain, Roopal; Modi, Kanika

doi:10.1007/978-981-13-0550-4_2

Part of the book series: Studies in Big Data (SBD, volume 43)

Abstract

A stream is continuously arriving, unbounded data. Analyzing such data in real time has become a necessity, and it calls for technology that can compute efficiently over data distributed across several clusters. Parallelized streaming systems of the time lacked consistency and had difficulty both combining historical data with streaming data and handling slow nodes. These needs led to the Apache Spark API, which provides a framework for scalable, fault-tolerant streaming with high throughput. This chapter introduces the main concepts of Spark Streaming, including a discussion of the supported operations. Finally, it explores two other important platforms, Apache Kafka and Amazon Kinesis, and their integration with Spark.

Author information

Authors and Affiliations

Department of Information Technology, Netaji Subhash Institute of Technology, New Delhi, India
Ankita Bansal
Department of Computer Engineering, Netaji Subhash Institute of Technology, New Delhi, India
Roopal Jain & Kanika Modi

Corresponding author

Correspondence to Ankita Bansal.

Editor information

Editors and Affiliations

Department of Computer Science and Engineering, GB Pant Government Engineering College, New Delhi, India
Mamta Mittal
Department of Automation and Applied Informatics, Aurel Vlaicu University of Arad, Arad, Romania
Valentina E. Balas
Department of Computer Science and Engineering, Bharati Vidyapeeth’s College of Engineering, New Delhi, India
Lalit Mohan Goyal
Department of Computer Science and Engineering, Laxmi Narayan College of Technology, Jabalpur, Madhya Pradesh, India
Raghvendra Kumar

About this chapter

Cite this chapter

Bansal, A., Jain, R., Modi, K. (2019). Big Data Streaming with Spark. In: Mittal, M., Balas, V., Goyal, L., Kumar, R. (eds) Big Data Processing Using Spark in Cloud. Studies in Big Data, vol 43. Springer, Singapore. https://doi.org/10.1007/978-981-13-0550-4_2

DOI: https://doi.org/10.1007/978-981-13-0550-4_2
Published: 17 June 2018
Publisher Name: Springer, Singapore
Print ISBN: 978-981-13-0549-8
Online ISBN: 978-981-13-0550-4
eBook Packages: Engineering, Engineering (R0)
