Q.1.8: Recent processors like the Pentium 4 processors do not implement single-cycle shifts. Given the scenario of Problem 7, assume that s = 50% of the additional integer and shift instructions introduced by strength reduction are shifts, and shifts now take four cycles to execute. Recompute the cycles per instruction and overall program speedup. Is strength reduction still a good optimization?
Solution:
CPI computation:
Type       Old Mix   New Mix   Cost (cycles)   CPI contribution
store      15%       15%       1               0.15
load       25%       25%       2               0.50
branch     15%       15%       4               0.60
integer    35%       38.75%    1               0.3875
shift      5%        8.75%     4               0.35
multiply   5%        2.5%      10              0.25
Total      100%      105%                      2.2375/105% = 2.131
The speedup is now a slowdown: 2.15/2.2375 = 0.96, i.e. about 4% slower than the Problem 6 baseline. Hence, under these assumptions, strength reduction is a bad idea.
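The table arithmetic can be checked with a short Python sketch. The dictionary and function names are illustrative, and the 1-cycle store cost is an assumption read off the table (0.15 for 15% stores), since the problem text does not state it.

```python
# Cycle costs from Problems 6-8. The 1-cycle store cost is an assumption
# taken from the table; the problem statement does not give it explicitly.
COST_FAST_SHIFT = {"store": 1, "load": 2, "branch": 4,
                   "integer": 1, "shift": 1, "multiply": 10}
COST_SLOW_SHIFT = dict(COST_FAST_SHIFT, shift=4)  # shifts now take 4 cycles

# Instruction mixes as fractions of the ORIGINAL instruction count.
OLD_MIX = {"store": 0.15, "load": 0.25, "branch": 0.15,
           "integer": 0.35, "shift": 0.05, "multiply": 0.05}
NEW_MIX = {"store": 0.15, "load": 0.25, "branch": 0.15,
           "integer": 0.3875, "shift": 0.0875, "multiply": 0.025}


def total_cycles(mix, cost):
    """Cycles per instruction of the original program (sum of mix x cost)."""
    return sum(fraction * cost[kind] for kind, fraction in mix.items())


def cpi(mix, cost):
    """Cycles per executed instruction: total cycles / relative instruction count."""
    return total_cycles(mix, cost) / sum(mix.values())


baseline = total_cycles(OLD_MIX, COST_FAST_SHIFT)   # Problem 6 baseline
reduced = total_cycles(NEW_MIX, COST_SLOW_SHIFT)    # strength reduction, 4-cycle shifts

print(f"baseline cycles: {baseline:.4f}")
print(f"reduced cycles:  {reduced:.4f}")
print(f"new CPI:         {cpi(NEW_MIX, COST_SLOW_SHIFT):.3f}")
print(f"speedup:         {baseline / reduced:.3f}")
```

A speedup below 1.0 means the transformed program runs slower than the baseline.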
Q.1.9: Given the assumptions of Problem 8, solve for the break-even ratio s (percentage of additional instructions that are shifts). That is, find the value of s (if any) for which program performance is identical to the baseline case without strength reduction (Problem 6).
Solution:
2.15 = 0.15 + 0.50 + 0.60 + 0.35 + (0.05 x 4) + 0.25 + (1-s) x 0.075 x 1 + s x 0.075 x 4
2.15 = 2.125 + 0.225s
0.025 = 0.225s
=> s = 0.111 = 11.1%
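Because the cycle count is linear in s, the break-even point can also be solved in closed form. The sketch below is one way to do that in Python; the function and parameter names are hypothetical, and the default values are the ones used in the equation above.

```python
def break_even_shift_fraction(target_cycles=2.15, extra=0.075,
                              int_cost=1, shift_cost=4):
    """Solve target = fixed + (1 - s) * extra * int_cost + s * extra * shift_cost for s."""
    # Cycles that do not depend on s: stores, loads, branches, the original
    # integer ops, the original 5% shifts at 4 cycles, and the remaining multiplies.
    fixed = 0.15 + 0.50 + 0.60 + 0.35 + 0.05 * shift_cost + 0.25
    numerator = target_cycles - fixed - extra * int_cost
    denominator = extra * (shift_cost - int_cost)
    return numerator / denominator


s = break_even_shift_fraction()
print(f"break-even s = {s:.3f}")
```

If fewer than this fraction of the added instructions are shifts, strength reduction still pays off under the Problem 8 assumptions.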
Q.1.10: Given the assumptions of Problem 8, assume you are designing the shift unit on the Pentium 4 processor. You have concluded there are two possible implementation options for the shift unit: 4-cycle shift latency at a frequency of 2 GHz, or 2-cycle shift latency at 1.9 GHz. Assume the rest of the pipeline could run at 2 GHz, and hence the 2-cycle shifter would set the entire processor’s frequency to 1.9 GHz. Which option will provide better overall performance?
Solution:
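4-cycle shifter:
Time per program = 1.05 IPP x 2.131 CPI x 1/2.0GHz = 1.119e-9 s
2-cycle shifter:
Time per program = 1.05 IPP x 1.964 CPI x 1/1.9GHz = 1.086e-9 s
Note that 2.2375 is already the cycle count per original instruction (1.05 x 2.131), so the instruction-count factor must not be applied to it a second time. With 2-cycle shifts the total drops by 0.0875 x 2 = 0.175 to 2.0625, giving a CPI of 2.0625/1.05 = 1.964.
Hence, the 2-cycle shifter is the better option if strength reduction is applied.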
If there is no strength reduction (back to Problem 6):
4-cycle shifter:
Time per program = 1.00 IPP x 2.30 CPI x 1/2.0GHz = 1.150e-9 s
2-cycle shifter:
Time per program = 1.00 IPP x 2.20 CPI x 1/1.9GHz = 1.158e-9 s
Hence, the 4-cycle shifter is a better option.
Overall, the best choice is still strength reduction with a 2-cycle shifter at 1.9 GHz.
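The four design points can be compared side by side with a small Python sketch. It is an illustration under one stated assumption: each cycle count below is taken as total cycles per instruction of the original program, so the 1.05 instruction-count factor is already folded into 2.2375 and 2.0625. The names are hypothetical.

```python
def time_per_original_instruction(cycles, freq_hz):
    """Execution time in seconds, normalized to one original-program instruction."""
    return cycles / freq_hz


# (total cycles per original instruction, clock frequency in Hz)
options = {
    ("strength reduction", "4-cycle shifter @ 2.0 GHz"): (2.2375, 2.0e9),
    ("strength reduction", "2-cycle shifter @ 1.9 GHz"): (2.2375 - 0.175, 1.9e9),
    ("no strength reduction", "4-cycle shifter @ 2.0 GHz"): (2.30, 2.0e9),
    ("no strength reduction", "2-cycle shifter @ 1.9 GHz"): (2.20, 1.9e9),
}

times = {name: time_per_original_instruction(cycles, freq)
         for name, (cycles, freq) in options.items()}

# Print the options from fastest to slowest.
for name, seconds in sorted(times.items(), key=lambda item: item[1]):
    print(f"{name[0]:>22} | {name[1]} | {seconds:.3e} s")

best = min(times, key=times.get)
```

Sorting by time makes the ranking explicit: the lower clock frequency is only worth paying for when the program executes enough shifts.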
Instruction-Level Parallelism (ILP):
Instruction-level parallelism (ILP) is one of the driving factors behind the microprocessor revolution, and nowadays almost every microprocessor exploits it. With ILP, multiple instructions are processed concurrently. A traditional sequential microprocessor, by contrast, executes one instruction at a time: the current instruction has to complete before the next one can begin. To remove this restriction and improve performance, scientists and engineers developed pipelined processors, which achieve instruction-level parallelism to some extent. A pipeline with k stages can have up to k instructions in flight at once. By overlapping the processing of multiple instructions in the pipeline, the effective average cycles per instruction (CPI) can be reduced to close to one. However, a scalar pipelined processor fetches and initiates at most one instruction per cycle. Because of this limitation, the best possible throughput of a scalar processor is one instruction per cycle (IPC = 1).
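As a rough illustration of why the CPI of a scalar pipeline approaches one, the Python sketch below models an idealized pipeline. It assumes no stalls or hazards and one instruction entering per cycle; the function name and the 5-stage depth are illustrative choices, not taken from the text.

```python
def ideal_pipelined_cpi(num_instructions, num_stages):
    """Average CPI of an ideal scalar pipeline with no stalls.

    The first instruction needs num_stages cycles to complete; every later
    instruction completes one cycle after its predecessor.
    """
    if num_instructions < 1 or num_stages < 1:
        raise ValueError("need at least one instruction and one stage")
    total_cycles = num_stages + (num_instructions - 1)
    return total_cycles / num_instructions


for n in (1, 10, 100, 10_000):
    print(f"{n:>6} instructions, 5 stages: CPI = {ideal_pipelined_cpi(n, 5):.4f}")
```

The CPI tends to one as the instruction count grows but never drops below it, which matches the one-instruction-per-cycle limit of a scalar pipeline.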
Some processors are capable of an IPC greater than one; these are termed superscalar processors. A scalar processor fetches and issues at most one instruction per machine cycle, whereas a superscalar processor can fetch and issue multiple instructions per machine cycle. The overall performance improvement is very sensitive to the vectorizability of the program, that is, the fraction of it that can be executed in parallel. As the machine parallelism increases, the overall speedup due to parallel processing is increasingly dictated by the sequential part of the program.
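The dependence on the sequential part is commonly expressed with Amdahl's law. The Python sketch below uses that formulation as an illustration; the symbols f (parallelizable fraction) and n (machine parallelism) are assumptions introduced here, not notation from the text above.

```python
def amdahl_speedup(f, n):
    """Speedup when a fraction f of the work runs n times faster.

    The remaining (1 - f) stays sequential and is not accelerated.
    """
    if not 0.0 <= f <= 1.0:
        raise ValueError("f must be between 0 and 1")
    if n < 1:
        raise ValueError("n must be at least 1")
    return 1.0 / ((1.0 - f) + f / n)


def speedup_limit(f):
    """Upper bound on the speedup as n grows without bound (requires f < 1)."""
    return 1.0 / (1.0 - f)


for n in (2, 6, 100):
    print(f"f = 0.8, n = {n:>3}: speedup = {amdahl_speedup(0.8, n):.2f}")
print(f"limit for f = 0.8: {speedup_limit(0.8):.2f}")
```

Even with a large n, the speedup stays below 1/(1 - f), so the sequential fraction sets the ceiling.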
Next Topic
Q.2.4: Consider that you would like to add a load-immediate instruction to the TYP instruction set and pipeline. This instruction extracts a 16-bit immediate value from the instruction word, sign-extends the immediate value to 32 bits, and stores the result in the destination register specified in the instruction word. Since the extraction and sign-extension can be accomplished without the ALU, your colleague suggests that such instructions be able to write their results into the register in the decode (ID) stage. Using the hazard detection algorithm described in Figure 2-15, identify what additional hazards such a change might introduce.
Q.2.5: Ignoring pipeline interlock hardware (discussed in Problem 6), what additional pipeline resources does the change outline in Problem 4 require? Discuss these resources and their cost.
Q.2.6: Considering the change outlined in Problem 4, redraw the pipeline interlock hardware shown in Figure 2-18 to correctly handle the load-immediate instructions.
Previous Topic
Q.1.6: A program's run time is determined by the product of instructions per program, cycles per instruction, and clock frequency. Assume the following instruction mix for a MIPS-like RISC instruction set: 15% stores, 25% loads, 15% branches, and 35% integer arithmetic, 5% integer shift, and 5% integer multiply. Given that load instructions require two cycles, branches require four cycles, integer ALU instructions require one cycle, and integer multiplies require ten cycles, compute the overall CPI.