Design of Instruction Analyzer with Semantic-Based Loop Unrolling Mechanism in the Hyperscalar Architecture

  • Conference paper
New Trends in Computer Technologies and Applications (ICS 2018)

Part of the book series: Communications in Computer and Information Science (CCIS, volume 1013)

Abstract

Current ILP processors cannot analyze the semantic information of an instruction thread and change the instruction sequence automatically to increase the degree of ILP. Programs that demand high performance, such as image processing or machine learning, contain many loop structures. A loop structure is bounded by the number of instructions in one basic block, which makes it hard for processors to improve computing efficiency. Loop structures in a program have the following characteristics: (1) instructions are fetched from the cache and decoded repeatedly; (2) the issued instructions are bounded by the loop body; (3) there is data dependence between iterations. These factors worsen the already poor ILP of loop code. In this paper, we propose an architecture called the semantic-based dynamic loop unrolling mechanism. The proposed architecture buffers the instruction sequence of a nested loop, analyzes the instruction flow to find the loop body from the semantics of the loop instructions, unrolls the loop automatically, stores the unrolled instructions in the instruction buffer, and dispatches them to the target processor cores. The architecture consists of three units: the loop detect unit (LDU), the unrolling control unit (UCU), and the loop unrolling unit (LUU). The LDU parses the semantics of instructions to find the closed interval of the loop body instructions. The UCU controls the LUU throughout the whole process. The LUU unrolls the loop based on the information collected by the LDU. A loop controller handles the complementation overhead for branch misprediction and the loop finish-up code. The verification uses ARM instructions generated by the Keil \( \mu \)Vision5 compiler. The results show that eliminating iteration dependence can improve ILP by 140% to 180%.
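
The division of labor between detecting a loop and unrolling it can be illustrated in software. The following is a minimal sketch, not the hardware design from the paper: it assumes a hypothetical toy instruction encoding in which a branch carries the index of its target, treats the first backward branch as the end of a loop (a simple stand-in for the semantic analysis attributed to the LDU), and replicates the loop body a fixed number of times (a stand-in for the LUU). The names `find_loop_interval`, `unroll_loop`, and `BRANCH_OPS` are illustrative assumptions.

```python
from typing import List, Optional, Tuple

# Hypothetical toy encoding: (mnemonic, operand). For a branch, the operand is
# the index of its target instruction; for anything else it is operand text.
Instr = Tuple[str, object]

# Assumed set of branch mnemonics for this sketch (ARM-like names).
BRANCH_OPS = {"b", "bne", "beq", "bgt", "blt"}


def find_loop_interval(program: List[Instr]) -> Optional[Tuple[int, int]]:
    """Return the closed interval (start, end) of the first loop found.

    A loop is recognized here by a backward branch: `end` is the index of the
    branch and `start` is the index it jumps back to.
    """
    for idx, (op, arg) in enumerate(program):
        if op in BRANCH_OPS and isinstance(arg, int) and arg <= idx:
            return arg, idx
    return None


def unroll_loop(program: List[Instr], interval: Tuple[int, int],
                factor: int) -> List[Instr]:
    """Return the loop with its body repeated `factor` times.

    Only the loop itself is returned: `factor` copies of the body followed by
    the original closing branch. The result is equivalent to the original loop
    only when the trip count is a multiple of `factor`.
    """
    start, end = interval
    body = program[start:end]          # loop body without the closing branch
    return body * factor + [program[end]]


# Example: sum a down-counter into r0, then store the result.
program: List[Instr] = [
    ("mov", "r0, #0"),        # 0
    ("mov", "r1, #8"),        # 1
    ("add", "r0, r0, r1"),    # 2  loop start
    ("subs", "r1, r1, #1"),   # 3
    ("bne", 2),               # 4  backward branch closes the loop
    ("str", "r0, [r2]"),      # 5
]

interval = find_loop_interval(program)
unrolled = unroll_loop(program, interval, factor=2)
for instr in unrolled:
    print(instr)
```

In this sketch the unrolled loop executes one branch for every two copies of the body, so fewer branches separate the issued instructions. When the trip count is not a multiple of the unroll factor, leftover iterations need separate finish-up code, which the abstract assigns to the loop controller.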

Corresponding author

Correspondence to Jih-Ching Chiu.

Copyright information

© 2019 Springer Nature Singapore Pte Ltd.

About this paper

Cite this paper

Lu, YX., Chiu, JC., Chao, SJ., Ye, YB. (2019). Design of Instruction Analyzer with Semantic-Based Loop Unrolling Mechanism in the Hyperscalar Architecture. In: Chang, CY., Lin, CC., Lin, HH. (eds) New Trends in Computer Technologies and Applications. ICS 2018. Communications in Computer and Information Science, vol 1013. Springer, Singapore. https://doi.org/10.1007/978-981-13-9190-3_1

  • DOI: https://doi.org/10.1007/978-981-13-9190-3_1

  • Publisher Name: Springer, Singapore

  • Print ISBN: 978-981-13-9189-7

  • Online ISBN: 978-981-13-9190-3

  • eBook Packages: Computer Science (R0)
