Cooperative Preprocessing at Petabytes on High Performance Computing System

  • Rujun Sun
  • Lufei Zhang
  • Xiyang Wang
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 11335)


With the explosion of data, there is an urgent demand for data throughput in high performance computing (HPC) systems. Data-intensive applications are becoming increasingly common in HPC environments. As data scale grows faster than systems do, it is time to fully utilize resources in every aspect, including computing power, storage capacity and data throughput. Data preprocessing can no longer be ignored, since it is an important procedure, especially when dealing with large amounts of data. How can data preprocessing be performed efficiently in current HPC systems? How can system resources be fully used by data-intensive applications? What should be valued when designing new HPC architectures? All these questions need answers. In this paper, we sketch the procedure of data-intensive applications, which leads to an adaptive resource allocation scheme based on the requirements of the procedure. We analyze the characteristics of preprocessing and design a preprocessing model for data-intensive applications in HPC systems. The model meets not only the demand for computing but also the need for throughput, with the storage system and the storage management system working cooperatively. Experiments were carried out on Sunway TaihuLight, one of the world’s fastest supercomputers. The whole preprocessing procedure at petabyte scale can be completed in hours without interfering with other ongoing applications.


HPC; Data-intensive applications; Cooperative preprocessing; High throughput computing



Copyright information

© Springer Nature Switzerland AG 2018

Authors and Affiliations

  1. State Key Laboratory of Mathematical Engineering and Advanced Computing, Wuxi, China
  2. National Super Computing Wuxi Center, Wuxi, China
