TPICDS: A Two-Phase Parallel Approach for Incremental Clustering of Data Streams

  • Ammar Al Abd Alazeez
  • Sabah Jassim
  • Hongbo Du
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 11339)

Abstract

Parallel and distributed solutions are essential for clustering data streams because of the large volumes of data involved. This paper first examines a direct adaptation of a recently developed prototype-based algorithm to three existing parallel frameworks. Based on the performance evaluation, the paper then presents a customised pipeline framework that combines incremental and two-phase learning into a balanced approach that dynamically allocates the available processing resources. The new framework is evaluated on a collection of synthetic datasets. The experimental results show that the framework not only produces correct final clusters but also significantly improves clustering efficiency.
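The abstract does not give algorithmic details, so the following is only a minimal sketch of how incremental and two-phase prototype-based clustering could be combined in a parallel pipeline. It is not the TPICDS method itself. The function names (`summarise_chunk`, `merge_prototypes`, `cluster_stream`), the fixed `radius` threshold, and the use of a thread pool are all assumptions made for illustration. Dynamic allocation of processing resources is not modelled here.

```python
from concurrent.futures import ThreadPoolExecutor
from functools import partial
from math import dist


def summarise_chunk(chunk, radius):
    """Phase 1 (hypothetical): reduce one chunk to [centroid, count] prototypes."""
    prototypes = []
    for point in chunk:
        for proto in prototypes:
            centre, n = proto
            if dist(point, centre) <= radius:
                # Incremental mean update: absorb the point into this prototype.
                proto[0] = tuple(c + (p - c) / (n + 1) for c, p in zip(centre, point))
                proto[1] = n + 1
                break
        else:
            # No prototype is close enough, so the point starts a new one.
            prototypes.append([tuple(point), 1])
    return prototypes


def merge_prototypes(global_protos, new_protos, radius):
    """Phase 2 (hypothetical): fold chunk prototypes into the running global model."""
    for centre, n in new_protos:
        for proto in global_protos:
            g_centre, g_n = proto
            if dist(centre, g_centre) <= radius:
                total = g_n + n
                # Count-weighted mean of the two centroids.
                proto[0] = tuple(
                    (gc * g_n + c * n) / total for gc, c in zip(g_centre, centre)
                )
                proto[1] = total
                break
        else:
            global_protos.append([centre, n])
    return global_protos


def cluster_stream(chunks, radius, workers=2):
    """Summarise chunks in parallel, then merge the summaries in arrival order."""
    global_protos = []
    with ThreadPoolExecutor(max_workers=workers) as pool:
        summarise = partial(summarise_chunk, radius=radius)
        for protos in pool.map(summarise, chunks):
            merge_prototypes(global_protos, protos, radius)
    return global_protos


chunks = [
    [(0.0, 0.0), (0.2, 0.0), (10.0, 10.0)],
    [(0.1, 0.1), (10.2, 10.0)],
]
result = cluster_stream(chunks, radius=1.0)
for centre, count in result:
    print(centre, count)
```

In this sketch each worker sees only its own chunk, and the merge step keeps only centroids and counts, so the raw points need not be retained once a chunk has been summarised. A thread pool keeps the example self-contained; a process pool or a distributed framework would be the more realistic choice for CPU-bound workloads.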

Keywords

Big data; Data stream clustering algorithms; Distributed and parallel frameworks

Notes

Acknowledgements

The first author wishes to thank the University of Mosul and Government of Iraq/Ministry of Higher Education and Research (MOHESR) for funding him to conduct this research at the University of Buckingham.

Copyright information

© Springer Nature Switzerland AG 2019

Authors and Affiliations

  • Ammar Al Abd Alazeez (1, corresponding author)
  • Sabah Jassim (1)
  • Hongbo Du (1)
  1. Department of Applied Computing, The University of Buckingham, Buckingham, UK
