Central Processing Unit

Bindal, Ahmet

doi:10.1007/978-3-030-00223-7_6


Abstract

This chapter describes a basic Central Processing Unit (CPU) operating with a simplified Reduced Instruction Set (RISC).


Author information

Authors and Affiliations

Computer Engineering Department, San Jose State University, San Jose, CA, USA
Ahmet Bindal

Corresponding author

Correspondence to Ahmet Bindal.

Appendix: Iterative Fixed-Point Multiplication

The MUL instruction and its operational equation, shown earlier in this chapter, are repeated below:

MUL RS1, RS2, RD1, RD2
Reg[RS1] * Reg[RS2] → {Reg[RD2], Reg[RD1]}

According to the equation above, the contents of the source registers, RS1 and RS2, are multiplied by a 32-bit fixed-point multiplier, and the 64-bit result is divided into two parts. The most significant 32 bits of the result are returned to RD2, and the least significant 32 bits to RD1.
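The split of the 64-bit product into two destination registers can be sketched as follows. This is a minimal Python model, not the chapter's hardware: the function name `mul_split` is hypothetical, and the operands are assumed to be unsigned, since signedness is not stated here.

```python
MASK32 = 0xFFFFFFFF  # keeps the low 32 bits of a value


def mul_split(rs1_value, rs2_value):
    """Multiply two 32-bit values (assumed unsigned) and split the 64-bit product.

    Returns (rd2_high, rd1_low), mirroring {Reg[RD2], Reg[RD1]}.
    """
    product = (rs1_value & MASK32) * (rs2_value & MASK32)
    rd1_low = product & MASK32            # least significant 32 bits -> RD1
    rd2_high = (product >> 32) & MASK32   # most significant 32 bits -> RD2
    return rd2_high, rd1_low


high, low = mul_split(0xFFFFFFFF, 0xFFFFFFFF)
print(hex(high), hex(low))  # 0xfffffffe 0x1
```

The largest unsigned operands give 0xFFFFFFFE00000001, which shows why a single 32-bit destination register is not enough in general.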

The iterative method suggests an algorithm that uses only SLI and ADD instructions. This algorithm does not require any additional hardware for the CPU; however, it takes many clock cycles to complete. The algorithm for a four-bit multiplication is shown in Figs. 6.212 and 6.213. The same method can be extended to 32 bits.

Assume that the quantities, r = {r3 r2 r1 r0} and c = {c3 c2 c1 c0}, represent a four-bit multiplier and a four-bit multiplicand, respectively. Also assume that the term, pp, corresponds to a partial product sum that adds a newly generated partial product to the old partial product in each iteration. The variable, i, represents the iteration index bounded by 0 and (n-1), where n signifies the number of multiplier and multiplicand bits.

The first iteration (i = 0) generates the partial product pp0, which is equal to {c3 c2 c1 c0} if r0 = 1; otherwise, pp0 becomes equal to {0 0 0 0} as shown in Fig. 6.212.

The second iteration is composed of three steps. The first step evaluates pp1 much like pp0. If r1 = 1, then pp1 = {c3 c2 c1 c0} else pp1 = {0 0 0 0}. In the second step, pp1 is shifted one bit to the left before adding it to pp0. This step produces pp1 = {c3 c2 c1 c0 0} if r1 = 1; otherwise, pp1 = {0 0 0 0 0}. The third step adds pp1 to the old partial product, pp0, and forms the partial product sum, pp = {s4 s3 s2 s1 s0}, as mentioned earlier.

The third and fourth iterations are also composed of three steps as shown in Fig. 6.213. The difference between them is that pp2 and pp3 are shifted to the left by two bits and three bits, respectively. This way, they will be in the correct bit position before they are added to the old partial product.
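As an illustration of the four iterations, the following worked example traces the partial products for one arbitrarily chosen pair of operands, r = 1011 (decimal 11) and c = 1101 (decimal 13). The operand values are assumptions picked for this example and do not come from the figures.

```latex
\begin{align*}
i = 0:&\quad r_0 = 1, & pp_0 &= 1101_2 = 13,     & pp &= 13 \\
i = 1:&\quad r_1 = 1, & pp_1 &= 11010_2 = 26,    & pp &= 13 + 26 = 39 \\
i = 2:&\quad r_2 = 0, & pp_2 &= 000000_2 = 0,    & pp &= 39 + 0 = 39 \\
i = 3:&\quad r_3 = 1, & pp_3 &= 1101000_2 = 104, & pp &= 39 + 104 = 143
\end{align*}
```

The final sum, 143 = 10001111 in binary, matches 11 × 13.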

Finally, the generation, left-shift and addition steps for the partial products are combined into the compact flow chart shown in Fig. 6.214. Note that each r-term in this figure is indexed by the variable i. Therefore, r = {r0 r1 r2 r3} is identical to {r[0] r[1] r[2] r[3]}. Similarly, each pp-term uses indexed representations. Therefore, pp = {pp0, pp1, pp2, pp3} is identical to {pp[0], pp[1], pp[2], pp[3]}.
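The generate, shift and add loop described in this appendix can be sketched in Python as shown below. This is a software model under stated assumptions rather than the flow chart itself: the function name `iterative_multiply` and the parameter `n` are hypothetical, and the operands are treated as unsigned.

```python
def iterative_multiply(r, c, n=4):
    """Shift-and-add product of an n-bit multiplier r and an n-bit multiplicand c."""
    mask = (1 << n) - 1
    r &= mask                      # keep only the n multiplier bits
    c &= mask                      # keep only the n multiplicand bits
    pp = 0                         # running partial product sum
    for i in range(n):
        r_i = (r >> i) & 1         # multiplier bit r[i]
        pp_i = c if r_i else 0     # step 1: generate the partial product
        pp_i <<= i                 # step 2: shift it left by i bits
        pp += pp_i                 # step 3: add it to the old partial product sum
    return pp


print(iterative_multiply(0b0110, 0b0101))  # 30
```

Calling the function with `n=32` models the 32-bit case, at the cost of 32 iterations, which reflects the many clock cycles mentioned above.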

Review Questions

1. A 32-bit RISC CPU organized in Big Endian format has three pipeline stages to execute only the following two instructions:

Draw the detailed ALU and the CPU schematic that executes these two instructions. Label all interconnections, bus widths and control signals.

2. The following specification is given for implementing a 32-bit RISC processor that executes only integer multiply-add (MADD) and add (ADD) instructions:
(i)
Data, a, b, c and d are read at the same time from the DOut1, DOut2, DOut3 and DOut4 ports of a 32-bit RF with 32 general purpose registers.
(ii)
There are four stages in the processor. The ALU consists of two stages.
(iii)
Multiplication is the first ALU stage for the MADD instruction between a and b, and between c and d. It takes one clock cycle to produce results which are eventually written to the DinH (for higher 32 data input bits) and DinL (for lower 32 data input bits) ports of the RF simultaneously. This stage can be bypassed if addition is performed between a and c.
(iv)
Addition is the second ALU stage, and it also takes one clock cycle to produce results.
(v)
For the MADD instruction, RS1 is the first source address that contains a, RS2 is the second source address that contains b, RS3 is the third source address that contains c, and RS4 is the fourth source address that contains d. RD1 is the first destination address that stores the lower 32 bits of the result, and RD2 is the second destination address that stores the higher 32 bits. For the ADD instruction, RS1 is the first source address that contains a, RS3 is the second source address that contains c, and RD1 is the destination address that stores the result.

(a)
This CPU executes only these two instructions. Draw the instruction bit field format, indicating the opcode and operand fields for MADD and ADD instructions.
(b)
Draw the architectural diagram of the processor that executes ADD and MADD, indicating all the necessary hardware such as the required memories, the RF, and the detailed ALU with all the port names and bit widths. Show how the opcode decoder enables multiplexers and other hardware in each stage.

Note: The reader should also attempt to implement the hardware that executes the integer multiply (MUL) instruction and superimpose it on top of the data-path that executes ADD and MADD instructions.

3. The area under y = x is calculated until the area equals 18. Here, x increments by one as shown in the figure below.

The incremental area is calculated by the flow chart given below.

(a)
Assuming Reg[R0] = 0, write a program using the instruction set given in Chapter 6. Make comments next to each instruction in the program.
(b)
Form an instruction chart for this program, executing in a five-stage CPU, and show all the data dependencies that require forwarding loops. Stall the pipeline using the NOP instruction if necessary. Consider the branch or jump delay penalty to be 1 cycle.

4. A RISC CPU computes the following:

X = 2A² + 1

A is located at the data cache address 100. X needs to be stored at the address 200. All instructions take one cycle except multiply, which takes three cycles. The RF contains only R0 and R1. Reg[R0] = 0.

Make sure to have only 16-bit values in source registers, RS1 and RS2, in order to avoid the overflow condition in the destination register, RD, when the MUL instruction is used.

(a)
Write an assembly code to compute and store the value of X. Make sure to write comments next to each instruction to keep track of the register values.
(b)
Rewrite the assembly code with an instruction chart. Indicate all stalls caused by NOP instructions and forwarding loops on this chart.
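The 16-bit restriction on RS1 and RS2 in question 4 can be checked numerically. The sketch below assumes unsigned operands and a single 32-bit destination register RD; the helper name `fits_in_32_bits` is hypothetical.

```python
MAX16 = (1 << 16) - 1   # largest unsigned 16-bit value, 65535
MAX32 = (1 << 32) - 1   # largest unsigned 32-bit value


def fits_in_32_bits(a, b):
    """Return True if the product a * b fits in one unsigned 32-bit register."""
    return a * b <= MAX32


print(fits_in_32_bits(MAX16, MAX16))          # True
print(fits_in_32_bits(MAX16 + 1, MAX16 + 1))  # False
```

Under these assumptions, the product of any two 16-bit values fits in RD, while 17-bit operands can already overflow it. The check covers the product only, not the later doubling and increment.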

5. Design a four-way set-associative write-through cache for an eight-bit CPU. The cache is organized in Little Endian format. It has four sets, and each data block in the set contains two eight-bit words.

The replacement policy on a cache miss is as follows:

(i)
An entire block of data is transferred between the CPU and the cache
(ii)
The block with the fewest references is replaced
(iii)
The least significant block is replaced if all the memory references are the same in a set

The CPU transactions and the contents of the main memory before these transactions are shown below:

(a)
Draw the block diagram of the cache and tag memories. Show the field format of the CPU address in terms of tag, index and block offset.
(b)
Show the cache and tag memory contents after the eighth, tenth and twelfth transactions by individually drawing the cache and tag contents. Update the main memory contents if there is any change.
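How a CPU address divides into tag, index and block offset for the cache in question 5 can be sketched as follows. The sketch assumes an eight-bit, word-addressable CPU address, which is not stated explicitly in the question; the field widths and the function name `split_address` follow from that assumption.

```python
ADDRESS_BITS = 8   # assumption: the eight-bit CPU issues eight-bit word addresses
OFFSET_BITS = 1    # two words per block -> 1 block-offset bit
INDEX_BITS = 2     # four sets -> 2 index bits
TAG_BITS = ADDRESS_BITS - INDEX_BITS - OFFSET_BITS  # remaining 5 bits


def split_address(address):
    """Split an address into (tag, index, offset) under the assumptions above."""
    offset = address & ((1 << OFFSET_BITS) - 1)
    index = (address >> OFFSET_BITS) & ((1 << INDEX_BITS) - 1)
    tag = address >> (OFFSET_BITS + INDEX_BITS)
    return tag, index, offset


print(split_address(0b10110011))  # (22, 1, 1)
```

With a different address width only `TAG_BITS` changes; the index and offset widths are fixed by the number of sets and the block size.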

6. A 32-bit, five-stage RISC CPU organized in Little Endian format executes the flow chart below. The CPU contains an integer RF with 32 registers where Reg[R0] = 0. The integer values, such as SUM = 0, are stored at the data memory address 100, i = 1 is stored at 101, and the compare value of 100 for i is stored at 102. The final SUM value needs to be stored in the data memory address of 200.

(a)
Write an assembly program using the following instruction set. Accompany each instruction in the program with register data and comments.

(b)
Draw the CPU schematic that executes the instructions in the flow chart above.

7. The function, \( Y = \frac{5(A - B)}{32} \), needs to be executed using the instruction set below.

A is located at the memory address 100.

B is located at the memory address 101.

Y needs to be stored at the memory address 102.

Reg[R0] = 0.

(a)
Write a program to compute Y.
(b)
This program executes in a six-stage CPU. Two clock cycles are required to access data memory for a LOAD operation. Rewrite the program to accommodate this requirement. Show all forwarding loops and include all the necessary NOPs in the instruction chart.
(c)
Indicate the minimum number of clock cycles to execute the program in part (b).
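One way to evaluate the function in question 7 without a multiplier or divider is to use shifts and an addition, since 5x = 4x + x and dividing by 32 is a right shift by five bits. The sketch below is a Python model of that arithmetic only; it assumes integer operands, truncating division, and that subtract, add and shift instructions are available, because the instruction set for this question is not reproduced here. The name `compute_y` is hypothetical.

```python
def compute_y(a, b):
    """Compute 5 * (a - b) / 32 using only subtract, shift and add."""
    diff = a - b                   # A - B
    times5 = (diff << 2) + diff    # 5 * diff = 4 * diff + diff
    return times5 >> 5             # divide by 32 (arithmetic right shift)


print(compute_y(100, 36))  # 10
```

Multiplying before shifting right keeps the low-order bits of the difference until the final step, so less precision is lost than when shifting first.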

8. A 32-bit CPU organized in Big Endian format has 32 general purpose registers (R0 is also a general purpose register whose contents are not zero). This CPU executes the following flow chart:

The instruction set and the bit-field format for each instruction are shown below.

The CPU maintains the following rules:

(i)
Every instruction is executed in a different number of clock cycles
(ii)
No NOP instruction is allowed
(iii)
LOAD does not have an ALU cycle but requires two data memory cycles
(iv)
INVERT does not have a data memory cycle but requires one ALU cycle
(v)
MUL does not have a data memory cycle but requires three ALU cycles
(vi)
ADD does not have a data memory cycle but requires two ALU cycles
(vii)
STORE does not have an ALU cycle but requires one data memory cycle

Construct the instruction chart to execute the flow chart above. Show all the necessary forwarding loops and possible data hazards. Show the cases in which there may be structural hazards and indicate how to prevent them.

9. The following instruction set needs to be executed in a 32-bit RISC CPU organized in Little Endian format. The CPU has three pipeline stages where the ALU and write-back stages are combined. The CPU is capable of executing the integer (ADDI, SLI and SRI) and floating-point (ADDF and MULF) instructions. The CPU stores the fixed and floating-point numbers in two separate register files, each containing 32 registers.

In the instruction set below, RS and RD are defined as the source and destination addresses for the integer registers, and FS1, FS2 and FD are the source and destination addresses for the floating-point registers, respectively.

Show a detailed data-path of this CPU, indicating all internal bus widths and port names. Include only the necessary functional units.

Projects

1.
Implement a 32-bit four-stage RISC CPU that executes only the ADD instruction using Verilog. On a timing diagram, trace through the data and control signals at the output ports of the instruction memory, RF, ALU and write-back stages.
2.
Implement ADD, SUB, AND, NAND, OR, NOR, XOR, XNOR, SL and SR instructions in a 32-bit four-stage RISC CPU, and perform complete verification using Verilog.
3.
Implement a 32-bit five-stage RISC CPU that executes LOAD, STORE, MOVE and MOVEI instructions using Verilog. Trace through the data and control signals at the output ports of the instruction memory, RF, ALU, data memory and write-back stages in a timing diagram.
4.
Implement a 32-bit four-stage RISC CPU that executes only the BRA instruction using Verilog. Trace through the data and control signals at the output ports of the instruction memory and RF stages on a timing diagram.
5.
Implement and verify the 32-bit floating-point adder using Verilog. Verify the validity of data at the outputs of every major stage using timing diagrams and perform functional verification for the entire adder.
6.
Implement and verify the 32-bit floating-point multiplier using Verilog. Verify the validity of data at the outputs of every major stage using timing diagrams, and perform functional verification for the entire multiplier. Use behavioral Verilog to mimic the exponent adder and the integer multiplier.


Cite this chapter

Bindal, A. (2019). Central Processing Unit. In: Fundamentals of Computer Architecture and Design. Springer, Cham. https://doi.org/10.1007/978-3-030-00223-7_6

DOI: https://doi.org/10.1007/978-3-030-00223-7_6
Published: 01 February 2019
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-00222-0
Online ISBN: 978-3-030-00223-7
eBook Packages: Engineering, Engineering (R0)
