Friday, December 6, 2013

Ex 6.3, 6.11 & 6.12 : Modern Processor Design by John Paul Shen and Mikko H. Lipasti : Solution Manual



Q.6.3: Given the dispatch and retirement bandwidth specified, how many integer ARF (architected register file) read and write ports are needed to sustain peak throughput? Given instruction mixes in Table 5-2, also compute average ports needed for each benchmark. Explain why you would not just build for the average case. Given the actual number of read and write ports specified, how likely is it that dispatch will be port-limited? How likely is it that retirement will be port-limited?


Sol: For peak throughput, 8 ARF read ports are needed at dispatch (4 instructions x 2 source operands each) and 4 ARF write ports are needed at retirement (4 instructions x 1 destination each). Table 5-2 gives an average mix of 36.14% ALU ops (2 read ports, 1 write port), 1.44% multicycle ALU ops (2 read ports, 1 write port), 14.64% integer loads (1 read port, 1 write port), 11.03% FP loads (1 read port, no integer write port; they write only the FP register file), 7.18% integer stores (2 read ports, no write port), and 4.12% FP stores (1 read port, no integer write port).

The weighted average number of read ports for 4-wide dispatch is
                       = 4 x (.3614x2 + .0144x2 + .1464x1 + .1103x1 + .0718x2 + .0412x1)
                       = 4.77 read ports.
The weighted average number of write ports for 4-wide retirement is
                       = 4 x (.3614x1 + .0144x1 + .1464x1)
                       = 2.09 write ports.

You would not build only for the average case because any dispatch or retirement group whose instruction mix demands more ports than the average would stall; such groups occur frequently, so average-case porting would routinely limit both dispatch and retirement throughput.


Q.6.11: The IBM POWER3 can detect up to four regular access streams and issue prefetches for future references. Construct an address reference trace that will utilize all four streams.



Sol:    Example code : for(i=0;i<10000;++i) sum += A[i] + B[i] + C[i] + D[i];

    Reference stream : A, B, C, D, A+4, B+4, C+4, D+4, A+8, B+8, C+8, D+8, etc.




Q.6.12: The IBM POWER4 can detect up to eight regular access streams and issue prefetches for future references. Construct an address reference trace that will utilize all eight streams.



Sol:    Example code : for(i=0;i<10000;++i) sum += A[i] + B[i] + C[i] + D[i] + E[i] + F[i] + G[i] + H[i];

    Reference stream : A, B, C, D, E, F, G, H, A+4, B+4, C+4, D+4, E+4, F+4, G+4, H+4, etc.








Next Topic:
Q.7.10: If the P6 microarchitecture had to support an instruction set that included predication, what effect would that have on the register renaming process?
Q.7.11: As described in the text, the P6 microarchitecture splits store operations into a STA and STD pair for handling address generation and data movement.  Explain why this makes sense from a microarchitectural implementation perspective.
Q.7.12: Following up on Problem 7.11, would there be a performance benefit (measured in instructions per cycle) if stores were not split? Explain why or why not.
Q.11.1: Using the syntax in Figure 11-2, show how to use the load-linked/store conditional primitives to synthesize a compare-and-swap operation.
Q.11.8: Real coherence controllers include numerous transient states in addition to the ones shown in Figure  to support split-transaction buses. For example, when a processor issues a bus read for an invalid line (I), the line is placed in an IS transient state until the processor has received a valid data response that then causes the line to transition into shared state (S). Given a split-transaction bus that separates each bus command (bus read, bus write, and bus upgrade) into a request and response, augment the state table and state transition diagram of Figure  to incorporate all necessary transient states and bus responses. For simplicity, assume that any bus command for a line in a transient state gets a negative acknowledge (NAK) response that forces it to be retried after some delay.
Q.11.10: Assuming a processor frequency of 1 GHz, a target CPI of 2, a per-instruction level-2 cache miss rate of 1%, a snoop-based cache-coherent system with 32 processors, and 8-byte address messages (including command and snoop addresses), compute the inbound and outbound snoop bandwidth required at each processor node.

Previous Topic:
Q.5.23: As presented in this chapter, load bypassing is a technique for enhancing memory data flow. With load bypassing, load instructions are allowed to jump ahead of earlier store instructions. Once address generation is done, a store instruction can be completed architecturally and can then enter the store buffer to await available bus cycle for writing to memory. Trailing loads are allowed to bypass these stores in the store buffer if there is no address aliasing. 
In this problem you are to simulate such load bypassing (there is no load forwarding). You are given a sequence of load/store instructions and their addresses (symbolic). The number to the left of each instruction indicates the cycle in which that instruction is dispatched to the reservation station; it can begin execution in that same cycle. Each store instruction will have an additional number to its right, indicating the cycle in which it is ready to retire, i.e., exit the store buffer and write to the memory.
Assumptions:
•All operands needed for address calculation are available at dispatch.
•One load and one store can have their addresses calculated per cycle.
•One load OR store can be executed, i.e., allowed to access the cache, per cycle.
•The reservation station entry is deallocated the cycle after address calculation and issue.
•The store buffer entry is deallocated when the cache is accessed.
•A store instruction can access the cache the cycle after it is ready to retire.
•Instructions are issued in order from the reservation stations.
•Assume 100% cache hits.
