Abstract
We study the numerical behavior of heterogeneous systems, such as CPUs paired with GPUs or IBM Cell processors, for some orthogonalization processes. We focus on how the different floating-point arithmetic handling of these accelerators affects Gram-Schmidt orthogonalization in single and double precision. For dense matrices, we observe a loss of at worst 1 digit on CUDA-enabled GPUs together with a 20x speed-up, and a loss of 2 digits on the Cell processor for a 7x speed-up. For sparse matrices, the CPU and GPU results are very close and the speed-up is 10x. We conclude that the Cell processor is a good accelerator for double precision because of its full IEEE compliance, but is not sufficient for single precision applications. The GPU achieves a better speed-up than the Cell, and its decent IEEE support delivers results close to the CPU ones in both precisions.
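As an illustration of the kind of measurement described above, the following minimal sketch runs classical Gram-Schmidt once in double precision and once in emulated single precision, then reports the loss of orthogonality as the largest entry of |QᵀQ − I|. It is not the authors' code and does not model GPU or Cell arithmetic. Single precision is only emulated on the CPU by rounding every intermediate result through `struct`, and the test matrix (three columns of a 6x6 Hilbert-like matrix) and all function names are assumptions chosen for the example.

```python
import math
import struct


def to_single(x):
    """Round a Python float (double) to the nearest IEEE single-precision value."""
    return struct.unpack("f", struct.pack("f", x))[0]


def identity(x):
    """Keep full double precision."""
    return x


def dot(u, v, rnd):
    """Dot product with every product and partial sum rounded by rnd."""
    acc = 0.0
    for a, b in zip(u, v):
        acc = rnd(acc + rnd(a * b))
    return acc


def classical_gram_schmidt(columns, rnd):
    """Orthonormalize the column vectors with classical Gram-Schmidt."""
    basis = []
    for a in columns:
        a = [rnd(x) for x in a]
        # Classical variant: every coefficient is computed from the
        # original vector a, not from the partially updated one.
        coeffs = [dot(q, a, rnd) for q in basis]
        w = list(a)
        for c, q in zip(coeffs, basis):
            w = [rnd(wi - rnd(c * qi)) for wi, qi in zip(w, q)]
        norm = rnd(math.sqrt(dot(w, w, rnd)))
        basis.append([rnd(wi / norm) for wi in w])
    return basis


def orthogonality_loss(basis):
    """Largest |(Q^T Q - I)_ij|, evaluated in double precision."""
    worst = 0.0
    for i, qi in enumerate(basis):
        for j, qj in enumerate(basis):
            target = 1.0 if i == j else 0.0
            worst = max(worst, abs(dot(qi, qj, identity) - target))
    return worst


# Hypothetical test matrix: 3 columns of a 6x6 Hilbert-like matrix.
n, k = 6, 3
columns = [[1.0 / (i + j + 1) for i in range(n)] for j in range(k)]

loss_double = orthogonality_loss(classical_gram_schmidt(columns, identity))
loss_single = orthogonality_loss(classical_gram_schmidt(columns, to_single))
print(f"double: {loss_double:.2e}  single: {loss_single:.2e}")
```

Comparing `-log10` of the two losses gives a rough count of the decimal digits of orthogonality retained in each precision, which is the sort of "digits lost" figure quoted in the abstract.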
Copyright information
© 2011 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Dubois, J., Calvin, C., Petiton, S. (2011). Performance and Numerical Accuracy Evaluation of Heterogeneous Multicore Systems for Krylov Orthogonal Basis Computation. In: Palma, J.M.L.M., Daydé, M., Marques, O., Lopes, J.C. (eds) High Performance Computing for Computational Science – VECPAR 2010. VECPAR 2010. Lecture Notes in Computer Science, vol 6449. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-19328-6_7
DOI: https://doi.org/10.1007/978-3-642-19328-6_7
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-19327-9
Online ISBN: 978-3-642-19328-6
eBook Packages: Computer Science (R0)