Abstract
Applying appropriate data structures is critical to attaining superior performance on heterogeneous many-core systems. A heterogeneous many-core system comprises a host for control-flow management and a device for massively parallel data processing. However, the host and device favor different data structures: the host prefers Array-of-Structures (AoS) for ease of programming, while the device requires Structure-of-Arrays (SoA) for efficient data access. These conflicting preferences cost programmers excessive effort in transforming data structures between the two parts. Separately designed kernels with different coding styles also make programs difficult to maintain. This paper addresses this issue by proposing a fully automated data layout transformation framework. Programmers can maintain code in AoS style on the host, while the data layout is converted to SoA when transferred to the device. The proposed framework streamlines the design flow and delivers up to a 177% performance improvement.
Copyright information
© 2014 IFIP International Federation for Information Processing
Cite this paper
Tseng, YY., Huang, YH., Lai, BC.C., Lin, JL. (2014). Automatic Data Layout Transformation for Heterogeneous Many-Core Systems. In: Hsu, CH., Shi, X., Salapura, V. (eds) Network and Parallel Computing. NPC 2014. Lecture Notes in Computer Science, vol 8707. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-662-44917-2_18
DOI: https://doi.org/10.1007/978-3-662-44917-2_18
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-662-44916-5
Online ISBN: 978-3-662-44917-2