Tuesday, November 26, 2013

Ex 5.21 and 5.22 Solution : Modern Processor Design by John Paul Shen and Mikko H. Lipasti : Solution manual



Q.5.21: Simulate the execution of the following code snippet using Tomasulo’s algorithm. Show the contents of the reservation station entries, register file busy, tag (the tag is the RS ID number), and data fields for each cycle (make copy of the table below for each cycle that you simulate). Indicate which instruction is executing in  each functional unit in each cycle. Also indicate any result forwarding across a common
data bus by circling the producer and consumer and connecting them with an arrow.
i: R4  <-  R0  +  R8
j: R2  <-  R0  *  R4
k: R4  <-  R4  +  R8
l: R8   <-  R4  *  R2
Assume dual dispatch and dual CDB (common data bus). Add latency is two cycles, and multiply latency is 3 cycles. An instruction can begin execution in the same cycle that it is dispatched, assuming all dependencies are satisfied. 

Sol: 









Q.5.22: Determine whether or not the code executes at the data flow limit for Problem 1. Explain why or why not. Show your work.

Sol: 

       Critical path is 1+3+3=7 cycles
       Prob. 18 executes in 7 cycles.
       Therefore, Tomasulo executes this at the dataflow limit.





Tomasulo algorithm

The Tomasulo algorithm is a hardware algorithm developed in 1967 by Robert Tomasulo from IBM. It allows sequential instructions that would normally be stalled due to certain dependencies to execute non-sequentially (out-of-order execution). It was first implemented for the IBM System/360 Model 91’s floating point unit.
This algorithm differs from scoreboarding in that it utilizes register renaming. Where scoreboarding resolves Write-after-Write (WAW), Read-after-Write (RAW) and Write-after-Read (WAR)computer architecture hazards by stalling, register renaming allows the continual issuing of instructions. The Tomasulo algorithm also uses a common data bus (CDB) on which computed values are broadcast to all the reservation stations that may need it. This allows for improved parallel execution of instructions which may otherwise stall under the use of scoreboarding.

Implementation concept

The following are the concepts necessary to the implementation of Tomasulo's Algorithm.
  • Instructions are issued sequentially so that the effects of a sequence of instructions such as exceptions raised by these instructions occur in the same order as they would on an in-order processor, regardless of the fact that they are being executed out-of-order (i.e. non-sequentially).
  • All general-purpose and reservation station registers hold either real or virtual values. If a real value is unavailable to a destination register during the issue stage, a virtual value is initially used. The functional unit that is computing the real value is assigned as the virtual value. The virtual register values are converted to real values as soon as the designated functional unit completes its computation.
  • Functional units use reservation stations with multiple slots. Each slot holds information needed to execute a single instruction, including the operation and the operands. The functional unit begins processing when it is free and when all source operands needed for an instruction are real.

Instruction lifecycle

The three stages listed below are the stages through which each instruction passes from the time it is issued to the time its execution is complete.

Stage 1: issue

In the issue stage, instructions are issued for execution if all operands and reservation stations are ready or else they are stalled. Registers are renamed in this step, eliminating WAR and WAW hazards.
  • Retrieve the next instruction from the head of the instruction queue. If the instruction operands are currently in the registers, then
    • If a matching functional unit is available, issue the instruction.
    • Else, as there is no available functional unit, stall the instruction until a station or buffer is free.
  • Otherwise, we can assume the operands are not in the registers, so use virtual values; the functional unit must calculate the real value, in order to keep track of the functional units that will produce the operand.

Stage 2: execute

In the execute stage, the instruction operations are carried out. Instructions are delayed in this step until all of their operands are available, eliminating RAW hazards. Program correctness is maintained through effective address calculation to prevent hazards through memory.
  • If one or more of the operands is not yet available then: wait for operand to become available on the CDB.
  • When all operands are available, then: if the instruction is a load or store
    • Compute the effective address when the base register is available, and place it in the load/store buffer
      • If the instruction is a load then: execute as soon as the memory unit is available
      • Else, if the instruction is a store then: wait for the value to be stored before sending it to the memory unit
  • Else, the instruction is an arithmetic logic unit (ALU) operation then: execute the instruction at the corresponding functional unit

Stage 3: write result

In the write Result stage, ALU operations results are written back to registers and store operations are written back to memory.
  • If the instruction was an ALU operation
    • If the result is available, then: write it on the CDB and from there into the registers and any reservation stations waiting for this result
  • Else, if the instruction was a store then: write the data to memory during this step



Next Topic:
Q.5.23: As presented in this chapter, load bypassing is a technique for enhancing memory data flow. With load bypassing, load instructions are allowed to jump ahead of earlier store instructions. Once address generation is done, a store instruction can be completed architecturally and can then enter the store buffer to await available bus cycle for writing to memory. Trailing loads are allowed to bypass these stores in the store buffer if there is no address aliasing. 
In this problem you are to simulate such load bypassing (there is no load forwarding). You are given a sequence of load/store instructions and their addresses (symbolic). The number to the left of each instruction indicates the cycle in which that instruction is dispatched to the reservation station; it can begin execution in that same cycle. Each store instruction will have an additional number to its right, indicating the cycle in which it is ready to retire, i.e., exit the store buffer and write to the memory.
Assumptions:
•All operands needed for address calculation are available at dispatch.
•One load and one store can have their addresses calculated per cycle.
•One load OR store can be executed, i.e., allowed to access the cache, per cycle.
•The reservation station entry is deallocated the cycle after address calculation and issue.
•The store buffer entry is deallocated when the cache is accessed.
•A store instruction can access the cache the cycle after it is ready to retire.
•Instructions are issued in order from the reservation stations.
•Assume 100% cache hits.


Previous Topic:
Q.5.19 and 5.20: Register Renaming Given the DAXPY kernel shown in Figure 5.31 and the IBM RS/6000 (RIOS-I) floating-point load renaming scheme also discussed in class (both are shown in the following figure), simulate the execution of two iterations of the DAXPY loop and show the state of the floating-point map table, the pending target return queue, and the free list.
• Assume the initial state shown in the table for Problem 5.19.
• Note the table only contains columns for the registers that are referenced in the DAXPY loop.
• As in the RS/6000 implementation discussed, assume only a single load instruction is renamed per cycle and that only a single floating-point instruction can complete per cycle.
• Only floating-point load, multiply, and add instructions are shown in the table, since only these are relevant to the renaming scheme.
• Remember that only load destination registers are renamed.
• The first load from the loop prologue is filled in for you.

Q.5.19:  Fill in the remaining rows in the following table with the map table state and pending target return queue state after the instruction is renamed, and the free list state after the instruction completes.

No comments:

Post a Comment