Q.2.4: Consider that you would like to add a load-immediate instruction to the TYP instruction set and pipeline. This instruction extracts a 16-bit immediate value from the instruction word, sign-extends the immediate value to 32 bits, and stores the result in the destination register specified in the instruction word. Since the extraction and sign-extension can be accomplished without the ALU, your colleague suggests that such instructions be able to write their results into the register in the decode (ID) stage. Using the hazard detection algorithm described in Figure 2-15, identify what additional hazards such a change might introduce.
Solution: Since there are now two stages that write the register file (ID and WB), WAW hazards may occur in addition to RAW hazards. WAW hazards exist with respect to instructions that are ahead of the load-immediate in the pipeline: the load-immediate writes in ID, before an older instruction with the same destination has reached WB. WAR hazards also exist if the register file is write-before-read, because the load-immediate writes in ID during the same cycle in which the preceding instruction reads its operands in RD, so the older instruction would see the new value. If the register file is read-before-write, the ID write occurs after the RD read, and therefore WAR hazards do not exist.
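The reasoning above can be checked with a small timing model. This is only a sketch, not the algorithm of Figure 2-15: it assumes the six TYP stages IF, ID, RD, ALU, MEM, WB, no forwarding and no stalls, and the names `Instr` and `find_hazards` are made up for illustration. An instruction that issues `distance` cycles after an older one reaches each stage `distance` cycles later.

```python
from dataclasses import dataclass

# Assumed TYP stage order; registers are read in RD.
STAGES = ["IF", "ID", "RD", "ALU", "MEM", "WB"]
STAGE_INDEX = {name: i for i, name in enumerate(STAGES)}
READ_STAGE = STAGE_INDEX["RD"]


@dataclass(frozen=True)
class Instr:
    """Hypothetical instruction record: destination, sources, write stage."""
    dest: str
    srcs: tuple = ()
    write_stage: str = "WB"  # "ID" for the proposed load-immediate


def find_hazards(older, younger, distance, write_before_read=True):
    """Return the hazards between two instructions issued `distance` cycles apart."""
    older_read = READ_STAGE
    older_write = STAGE_INDEX[older.write_stage]
    younger_read = distance + READ_STAGE
    younger_write = distance + STAGE_INDEX[younger.write_stage]
    found = set()

    # RAW: the younger instruction reads before the older one has written.
    if older.dest in younger.srcs:
        same_cycle = younger_read == older_write
        if younger_read < older_write or (same_cycle and not write_before_read):
            found.add("RAW")

    # WAW: the younger write does not land strictly after the older write.
    if older.dest == younger.dest and younger_write <= older_write:
        found.add("WAW")

    # WAR: the younger write is visible to the older instruction's read.
    if younger.dest in older.srcs:
        same_cycle = younger_write == older_read
        if younger_write < older_read or (same_cycle and write_before_read):
            found.add("WAR")

    return found


load_imm = Instr(dest="r1", write_stage="ID")
older_writer = Instr(dest="r1", srcs=("r2", "r3"))
older_reader = Instr(dest="r5", srcs=("r1",))
print(find_hazards(older_writer, load_imm, distance=1))
print(find_hazards(older_reader, load_imm, distance=1, write_before_read=True))
print(find_hazards(older_reader, load_imm, distance=1, write_before_read=False))
```

Under these assumptions the WAR case appears only when the load-immediate directly follows the reader and the register file is write-before-read, which matches the argument in the solution.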
Q.2.5: Ignoring pipeline interlock hardware (discussed in Problem 6), what additional pipeline resources does the change outlined in Problem 4 require? Discuss these resources and their cost.
Solution: Since there are two stages that write the register file (ID and WB), the register file must have two write ports. Additional write ports are expensive, since they require the RF array and bit cells to be redesigned to support multiple writes per cycle. Alternatively, a banked RF design with bank conflict resolution logic could be used. However, this would require additional control logic to stall the pipeline on bank conflicts.
Q.2.6: Considering the change outlined in Problem 4, redraw the pipeline interlock hardware shown in Figure 2-18 to correctly handle the load-immediate instructions.
Solution: The modified figure should show the ID stage destination latch connected to a second write port register identifier input.
Further, comparators that check the ID stage destination latch against the destination latch of instructions further in the pipeline should drive a stall signal to handle WAW hazards.
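The comparator logic described above can be sketched in software. This is an illustrative model only, not a reproduction of Figure 2-18: the function name `waw_stall` and the dictionary of downstream destination latches are hypothetical, and the sketch conservatively assumes that a match in any later stage (RD, ALU, MEM, or WB) forces a stall.

```python
# Stages behind which an older instruction may still hold a pending write.
DOWNSTREAM_STAGES = ("RD", "ALU", "MEM", "WB")


def waw_stall(id_is_load_imm, id_dest, downstream_dest):
    """Return True if the load-immediate in ID must stall for a WAW hazard.

    downstream_dest maps a stage name to the destination register held in
    that stage's latch, or None if the instruction there writes no register.
    """
    if not id_is_load_imm or id_dest is None:
        # Only the early-writing load-immediate can create this WAW hazard.
        return False
    # One equality comparator per downstream stage; the outputs are ORed.
    return any(downstream_dest.get(stage) == id_dest for stage in DOWNSTREAM_STAGES)


latches = {"RD": None, "ALU": "r1", "MEM": "r4", "WB": None}
print(waw_stall(True, "r1", latches))
print(waw_stall(True, "r2", latches))
```

Treating a match in WB as a stall is a conservative choice in this sketch; a real design might instead let the WB write complete and suppress or order the two writes.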
Pipelined Processors:
Pipelining is a powerful way to increase system throughput without massive replication of hardware. There are two types of pipelines: arithmetic pipelines and instruction pipelines. Instruction pipelines have played an important role in the evolution of processors. Higher throughput is the main motivation for pipelining a processor. Pipelining increases throughput when many tasks require the use of the same system; for each individual task, the latency stays the same or increases slightly. Pipelining partitions the system into multiple stages with added buffering between the stages; these stages and the buffers between them constitute the pipeline. The ultimate physical limit on pipeline depth is set by clocking. The maximum pipeline depth may also not be the optimal design once cost and pipelining overhead are considered. The foregoing tradeoff model is based purely on hardware design considerations; it does not account for the dynamic behavior of the pipeline or the computations being performed.
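The tradeoff model itself is not reproduced in this post. One commonly used form, given here as an assumption and not as the exact model meant above, takes a non-pipelined design with logic delay T and logic cost G, and a k-stage pipeline in which each stage adds latch delay S and latch cost L. The symbols, function names, and numbers below are illustrative only.

```python
import math


def clock_period(T, S, k):
    """Cycle time of a k-stage pipeline: logic delay per stage plus latch delay."""
    return T / k + S


def throughput(T, S, k):
    """Tasks completed per unit time once the pipeline is full."""
    return 1.0 / clock_period(T, S, k)


def cost(G, L, k):
    """Hardware cost: original logic plus one set of latches per stage."""
    return G + k * L


def cost_performance_ratio(G, L, T, S, k):
    """Cost divided by throughput; lower is better."""
    return cost(G, L, k) * clock_period(T, S, k)


def optimal_depth(G, L, T, S):
    """Depth that minimizes cost/performance: k = sqrt(G*T / (L*S))."""
    return math.sqrt((G * T) / (L * S))


# Hypothetical values chosen only to make the arithmetic easy to follow.
G, L, T, S = 100, 25, 64, 1
k_opt = optimal_depth(G, L, T, S)
print(k_opt)
for k in (8, 16, 32):
    print(k, throughput(T, S, k), cost_performance_ratio(G, L, T, S, k))
```

In this sketch throughput keeps rising with depth, while the cost/performance ratio bottoms out at an intermediate depth, which is the sense in which the deepest pipeline need not be the best design.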
Next Topic:
Q.2.8: Consider adding a store instruction with indexed addressing mode to the TYP pipeline. This store differs from the existing store with register+immediate addressing mode by computing its effective address as the sum of two source registers, that is, stx r3, r4, r5 performs MEM[r4+r5]<-r3. Describe the additional pipeline resources needed to support such an instruction in the TYP pipeline. Discuss the advantages and disadvantages of such an instruction.
Q.2.9: Consider adding a load-update instruction with register+immediate and postupdate addressing mode. In this addressing mode, the effective address for the load is computed as register+immediate, and the resulting address is written back into the base register. That is, lwu r3,8(r4) performs r3<-MEM[r4+8]; r4<-r4+8. Describe the additional pipeline resources needed to support such an instruction in the TYP pipeline.
Q.2.15: The IBM study of pipelined processor performance assumed an instruction mix based on popular C programs in use in the 1980s. Since then, object oriented languages like C++ and Java have become much more common. One of the effects of these languages is that object inheritance and polymorphism can be used to replace conditional branches with virtual function calls. Given the IBM instruction mix and CPI shown in the following table, perform the following transformations to reflect the use of C++/Java, and recompute the overall CPI and speedup or slowdown due to this change:
• Replace 50% of taken conditional branches with a load instruction followed by a jump register instruction (the load and jump register implement a virtual function call).
• Replace 25% of not-taken branches with a load instruction followed by a jump register instruction.
Previous Topic:
Q.1.8: Recent processors like the Pentium 4 processors do not implement single-cycle shifts. Given the scenario of Problem 7, assume that s = 50% of the additional integer and shift instructions introduced by strength reduction are shifts, and shifts now take four cycles to execute. Recompute the cycles per instruction and overall program speedup. Is strength reduction still a good optimization?
Q.1.9: Given the assumptions of Problem 8, solve for the break-even ratio s (percentage of additional instructions that are shifts). That is, find the value of s (if any) for which program performance is identical to the baseline case without strength reduction (Problem 6).
Q.1.10: Given the assumptions of Problem 8, assume you are designing the shift unit on the Pentium 4 processor. You have concluded there are two possible implementation options for the shift unit: 4-cycle shift latency at a frequency of 2 GHz, or 2-cycle shift latency at 1.9 GHz. Assume the rest of the pipeline could run at 2 GHz, and hence the 2-cycle shifter would set the entire processor’s frequency to 1.9 GHz. Which option will provide better overall performance?