A Non-Stop Double Buffering Mechanism for Dataflow Architecture

Tan, Xu; Shen, Xiao-Wei; Ye, Xiao-Chun; Wang, Da; Fan, Dong-Rui; Zhang, Lunkai; Li, Wen-Ming; Zhang, Zhi-Min; Tang, Zhi-Min

doi:10.1007/s11390-017-1747-6

Regular Paper
Published: 26 January 2018

Volume 33, pages 145–157, (2018)
Journal of Computer Science and Technology

Xu Tan¹,²,
Xiao-Wei Shen¹,²,
Xiao-Chun Ye¹,³,
Da Wang¹,
Dong-Rui Fan¹,²,
Lunkai Zhang⁴,
Wen-Ming Li¹,
Zhi-Min Zhang¹ &
…
Zhi-Min Tang¹

Abstract

Double buffering is an effective mechanism for hiding the latency of data transfers between on-chip and off-chip memory. In a dataflow architecture, however, swapping the two buffers during the execution of many tiles degrades performance, because the dataflow accelerator must be filled and drained repeatedly. In this work, we propose a non-stop double buffering mechanism for dataflow architecture. By optimizing the control logic of the dataflow architecture, the proposed mechanism assigns tiles to the processing element array without stopping the execution of the processing elements. We also propose a work-flow program that cooperates with the non-stop double buffering mechanism. With both the control logic and the work-flow program optimized, the array needs to be filled and drained only once across the execution of all tiles belonging to the same dataflow graph. Experimental results show that the proposed double buffering mechanism achieves a 16.2% average efficiency improvement over the baseline without the optimization.
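To make the fill-and-drain argument concrete, the following is a minimal sketch of a toy cycle model, not the authors' simulator or evaluation method. It assumes that off-chip transfers are fully hidden by double buffering, that the conventional scheme pays one fill and one drain per tile, and that the non-stop scheme pays them once per dataflow graph. The function names and the cycle counts in the usage example are hypothetical and chosen only for illustration; they are not taken from the article.

```python
def stop_and_swap_cycles(num_tiles, fill, compute, drain):
    """Toy model (assumption): the PE array is filled and drained for every tile."""
    return num_tiles * (fill + compute + drain)


def non_stop_cycles(num_tiles, fill, compute, drain):
    """Toy model (assumption): the PE array is filled and drained once for all tiles."""
    return fill + num_tiles * compute + drain


def efficiency(num_tiles, compute, total_cycles):
    """Fraction of total cycles spent in steady-state computation."""
    return num_tiles * compute / total_cycles


if __name__ == "__main__":
    # Hypothetical parameters, not measurements from the article.
    tiles, fill, compute, drain = 8, 10, 100, 10
    baseline = stop_and_swap_cycles(tiles, fill, compute, drain)
    non_stop = non_stop_cycles(tiles, fill, compute, drain)
    print("baseline cycles:", baseline, "efficiency:", efficiency(tiles, compute, baseline))
    print("non-stop cycles:", non_stop, "efficiency:", efficiency(tiles, compute, non_stop))
```

In this model the gap between the two schemes grows with the number of tiles and with the ratio of fill and drain time to per-tile compute time, which is the effect the abstract describes qualitatively.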

Author information

Authors and Affiliations

State Key Laboratory of Computer Architecture, Institute of Computing Technology, Chinese Academy of Sciences, Beijing, 100190, China
Xu Tan, Xiao-Wei Shen, Xiao-Chun Ye, Da Wang, Dong-Rui Fan, Wen-Ming Li, Zhi-Min Zhang & Zhi-Min Tang
School of Computer and Control Engineering, University of Chinese Academy of Sciences, Beijing, 100049, China
Xu Tan, Xiao-Wei Shen & Dong-Rui Fan
State Key Laboratory of Mathematical Engineering and Advanced Computing, Wuxi, 214125, China
Xiao-Chun Ye
Department of Computer Science, The University of Chicago, Chicago, IL, 60637, U.S.A.
Lunkai Zhang

Corresponding author

Correspondence to Dong-Rui Fan.

About this article

Cite this article

Tan, X., Shen, XW., Ye, XC. et al. A Non-Stop Double Buffering Mechanism for Dataflow Architecture. J. Comput. Sci. Technol. 33, 145–157 (2018). https://doi.org/10.1007/s11390-017-1747-6

Received: 02 September 2016
Revised: 13 March 2017
Published: 26 January 2018
Issue Date: January 2018
DOI: https://doi.org/10.1007/s11390-017-1747-6
