Journal of Computer Science and Technology, Volume 14, Issue 3, pp 224–233

Hierarchical bulk synchronous parallel model and performance optimization

  • Huang Linpeng 
  • Sun Yongqiang 
  • Yuan Wei 
Regular Papers


Based on the Bulk Synchronous Parallel (BSP) framework, this paper introduces a Hierarchical Bulk Synchronous Parallel (HBSP) performance model. HBSP captures the performance optimization problem at the various stages of parallel program development and predicts the performance of a parallel program accurately by accounting for the factors that cause variance in local computation and global communication. The methodology has been applied to several real applications, and the results show that HBSP is a suitable model for optimizing parallel programs.


Keywords: parallel programming; bulk synchronous parallel model; performance optimization
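The BSP-style cost accounting that underlies models of this kind can be illustrated with a minimal sketch. Assuming the standard BSP cost formula for a superstep, T = w + h·g + l (w: maximum local computation, h: maximum communication volume of any processor, g: per-word communication gap, l: barrier synchronization cost) — the function names and example figures below are hypothetical, not taken from the paper:

```python
# Illustrative sketch of the standard BSP cost model (not the paper's
# HBSP model): a superstep costs T = w + h*g + l, and a program costs
# the sum of its supersteps. All names and numbers are hypothetical.

def bsp_superstep_cost(local_work, comm_volume, g, l):
    """Cost of one BSP superstep.

    local_work  -- per-processor computation times for this superstep
    comm_volume -- per-processor message volumes in words (the h-relation)
    g           -- communication gap: time per word under continuous traffic
    l           -- barrier synchronization latency
    """
    w = max(local_work)    # the slowest processor bounds the computation phase
    h = max(comm_volume)   # the busiest processor bounds the communication phase
    return w + h * g + l

def bsp_program_cost(supersteps, g, l):
    """Total predicted cost: sum of the per-superstep costs."""
    return sum(bsp_superstep_cost(w, h, g, l) for w, h in supersteps)

# Hypothetical example: two supersteps on four processors.
steps = [
    ([100, 120, 110, 90], [10, 12, 8, 11]),   # w = 120, h = 12
    ([200, 180, 190, 210], [5, 7, 6, 4]),     # w = 210, h = 7
]
print(bsp_program_cost(steps, g=4.0, l=50.0))  # (120 + 48 + 50) + (210 + 28 + 50)
```

The paper's contribution, per the abstract, is a hierarchical refinement of this flat model that also accounts for variance within the local-computation and global-communication terms.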





Copyright information

© Science Press, Beijing China and Allerton Press Inc. 1999

Authors and Affiliations

  • Huang Linpeng (1)
  • Sun Yongqiang (1)
  • Yuan Wei (1)

  1. Department of Computer Science and Engineering, Shanghai Jiao Tong University, Shanghai, P.R. China
