Performance Analysis and Optimization of Spark Streaming Applications Through Effective Control Parameters Tuning

Prasad, Bakshi Rohit; Agarwal, Sonali

doi:10.1007/978-981-10-3376-6_11

Part of the book series: Advances in Intelligent Systems and Computing (AISC, volume 719)

Abstract

High-speed data stream processing is in demand, and the performance analysis and optimization of streaming applications are active research areas. Apache Spark is one of the most widely used frameworks for in-memory data stream computing and is capable of handling high-speed data streams. In streaming applications, controlling and processing data streams for optimized and stable performance within the available resources is an essential requirement. Various parameters can be tuned to achieve optimum performance of streaming applications deployed on Spark. This work explores the performance of streaming applications in light of the tunable parameters in Spark. Further, a relationship between the performance response and the controlling parameters is established using linear regression. This regression model enables prediction of the performance response before a streaming application is actually deployed. The work also determines an interrelationship between block interval and number of threads for optimized performance of a streaming application.
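
The regression idea in the abstract can be illustrated with a minimal sketch. It assumes the performance response is a per-batch processing time and that the predictors are block interval and number of threads; the abstract does not list the exact model terms. The function names, the measurement settings, and the linear law used to generate the response are all hypothetical, so the numbers are not the authors' data or results.

```python
def fit_linear(rows, y):
    """Ordinary least squares via the normal equations (X^T X) b = X^T y."""
    # Prepend a constant 1.0 so the first coefficient is the intercept.
    X = [[1.0] + [float(v) for v in r] for r in rows]
    n = len(X[0])
    # Build the augmented matrix [X^T X | X^T y].
    A = [[sum(x[i] * x[j] for x in X) for j in range(n)]
         + [sum(x[i] * t for x, t in zip(X, y))] for i in range(n)]
    # Gauss-Jordan elimination with partial pivoting.
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(A[r][c]))
        A[c], A[p] = A[p], A[c]
        for r in range(n):
            if r != c:
                f = A[r][c] / A[c][c]
                A[r] = [a - f * b for a, b in zip(A[r], A[c])]
    return [A[i][n] / A[i][i] for i in range(n)]


def predict(coef, row):
    """Evaluate the fitted model: intercept plus weighted predictors."""
    return coef[0] + sum(c * v for c, v in zip(coef[1:], row))


# Hypothetical settings: (block interval in ms, number of threads).
settings = [(100, 1), (100, 2), (200, 1), (200, 4), (400, 2), (400, 4)]
# Synthetic response (ms) from an assumed linear law, for illustration only.
times = [500.0 + 0.8 * b - 40.0 * t for b, t in settings]

coef = fit_linear(settings, times)       # [intercept, per-ms, per-thread]
estimate = predict(coef, (300, 3))       # prediction for an untried setting
```

With real measurements in place of the synthetic `times`, `predict` would give an estimate of the response for a configuration before it is deployed, which is the use the abstract describes for the regression model.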

Author information

Authors and Affiliations

Indian Institute of Information Technology, Allahabad, India
Bakshi Rohit Prasad & Sonali Agarwal

Corresponding author

Correspondence to Bakshi Rohit Prasad.

Editor information

Editors and Affiliations

Department of Computer Science and Engineering, National Institute of Technology, Rourkela, Odisha, India
Pankaj Kumar Sa
Department of Computer Science and Engineering, National Institute of Technology, Rourkela, Odisha, India
Manmath Narayan Sahoo
School of Mechatronic Engineering, Universiti Malaysia Perlis (UniMAP), Arau, Perlis, Malaysia
M. Murugappan
The University of Exeter, Exeter, Devon, United Kingdom
Yulei Wu
Department of Computer Science and Engineering, National Institute of Technology, Rourkela, Odisha, India
Banshidhar Majhi

About this paper

Cite this paper

Prasad, B.R., Agarwal, S. (2018). Performance Analysis and Optimization of Spark Streaming Applications Through Effective Control Parameters Tuning. In: Sa, P., Sahoo, M., Murugappan, M., Wu, Y., Majhi, B. (eds) Progress in Intelligent Computing Techniques: Theory, Practice, and Applications. Advances in Intelligent Systems and Computing, vol 719. Springer, Singapore. https://doi.org/10.1007/978-981-10-3376-6_11

DOI: https://doi.org/10.1007/978-981-10-3376-6_11
Published: 05 August 2017
Publisher Name: Springer, Singapore
Print ISBN: 978-981-10-3375-9
Online ISBN: 978-981-10-3376-6
eBook Packages: Engineering, Engineering (R0)