Saturday, December 7, 2013

Ex 7.10, 7.11, 7.12, 11.1, 11.8 & 11.10 Solution : Modern Processor Design by John Paul Shen and Mikko H. Lipasti : Solution Manual


Q.7.10: If the P6 microarchitecture had to support an instruction set that included predication, what effect would that have on the register renaming process?

Sol: Predicated instructions complicate renaming, since a false predicate nullifies the register write of an instruction that would otherwise write a register. Until the predicate is known, the renamer therefore does not know whether subsequent instructions should read the previous or the new definition of a register written by a predicated instruction. Hence, the renamer could stall until the predicate is determined. Alternatively, it could insert a move operation after the predicated op that reads both the old and new definitions of the predicated instruction's destination register and copies one or the other to its own (non-architected) destination; all subsequent readers are then renamed to the output of that move operation.
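To make the second option concrete, here is a minimal sketch of that rename step in C. The map table, tag allocator, and injected select micro-op are hypothetical names chosen for illustration, not P6 internals.

/* Minimal sketch (hypothetical structures, not P6 internals): renaming a
 * predicated write "p ? rD = ..." by injecting a select micro-op so that
 * later readers of rD always name a single physical tag. */
#include <stdio.h>

#define NUM_ARCH_REGS 16

static int map_table[NUM_ARCH_REGS];   /* architected reg -> physical tag  */
static int next_tag = NUM_ARCH_REGS;   /* trivial free-tag counter         */

/* Rename a predicated instruction that writes architected register rd.
 * Returns the tag the predicated op itself writes. */
int rename_predicated_write(int rd, int pred_tag)
{
    int old_tag = map_table[rd];       /* previous definition of rd        */
    int new_tag = next_tag++;          /* destination of the predicated op */
    int sel_tag = next_tag++;          /* destination of the injected move */

    /* Injected select micro-op: sel_tag = pred ? new_tag : old_tag.
     * It reads the predicate and both definitions, so the rename stage
     * never has to stall waiting for the predicate value. */
    printf("select t%d = t%d ? t%d : t%d\n",
           sel_tag, pred_tag, new_tag, old_tag);

    map_table[rd] = sel_tag;           /* later readers see the select     */
    return new_tag;
}

int main(void)
{
    for (int i = 0; i < NUM_ARCH_REGS; i++)
        map_table[i] = i;              /* identity initial mapping         */
    rename_predicated_write(3 /* rd */, 7 /* predicate tag */);
    return 0;
}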


Q.7.11: As described in the text, the P6 microarchitecture splits store operations into a STA and STD pair for handling address generation and data movement.  Explain why this makes sense from a microarchitectural implementation perspective.

Sol: Logically, the STA and STD perform two different operations that interact with different control portions of the microarchitecture: the STA uses an AGEN unit to generate the address and then resides in the MOB to resolve memory dependences against newer loads, while the STD simply transfers data from the register file to the store port at commit. Hence, it makes sense to split them. Note that the newer Banias (Centrino) designs based on the P6 core no longer split stores into STA and STD, but treat them as a single micro-op; this increases effective decode bandwidth and reduces ROB and RS occupancy.
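As a rough illustration of the split, the sketch below decodes one store into its STA/STD micro-op pair; the structure and field names are assumptions for illustration only, not the P6's actual micro-op encoding.

/* Minimal sketch (hypothetical names): decoding one store into the
 * STA/STD micro-op pair described above. */
#include <stdio.h>

typedef enum { UOP_STA, UOP_STD } uop_kind;

typedef struct {
    uop_kind kind;
    int      src_reg;   /* base register (STA) or data register (STD) */
    int      imm;       /* displacement for address generation (STA)  */
} uop;

/* Split "store rdata, disp(rbase)" into its two micro-ops: the STA goes
 * to the AGU/MOB to compute and hold the address, while the STD only
 * reads the data register and hands the value to the store buffer. */
static void decode_store(int rdata, int rbase, int disp, uop out[2])
{
    out[0] = (uop){ UOP_STA, rbase, disp };  /* address-generation uop */
    out[1] = (uop){ UOP_STD, rdata, 0    };  /* data-movement uop      */
}

int main(void)
{
    uop pair[2];
    decode_store(5, 2, 16, pair);            /* store r5, 16(r2) */
    printf("STA: base r%d + %d\nSTD: data r%d\n",
           pair[0].src_reg, pair[0].imm, pair[1].src_reg);
    return 0;
}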


Q.7.12: Following up on Problem 7, would there be a performance benefit (measured in instructions per cycle) if stores were not split?  Explain why or why not?

Sol: Front-end decode bandwidth would increase, while ROB and RS occupancy would decrease, permitting an effectively larger window.  Also, it is possible that commit bandwidth would increase.



Q.11.1: Using the syntax in Figure 11-2, show how to use the load-linked/store conditional primitives to synthesize a compare-and-swap operation.

Sol:  /* r1 contains the compare value, r2 contains the swap value */
      cmpswap: ll    r0, A        /* load-linked the current value of A  */
               cmp   r0, r1       /* does it match the compare value?    */
               bne   fail         /* mismatch: compare-and-swap fails    */
               stc   r2, A        /* store-conditional the swap value    */
               bfail cmpswap      /* store-conditional failed, so retry  */
      fail:    ...
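For reference, the LL/SC sequence above provides the same guarantee as the C sketch below. The GCC/Clang __atomic builtin is used only to state the intended semantics (a toolchain assumption, not part of the exercise), since the point of the exercise is to synthesize exactly this behavior from ll/stc.

/* Semantics the LL/SC loop above synthesizes: atomically,
 * if (*A == cmp) then *A = swp; returns the old value of *A. */
#include <stdint.h>

uint32_t compare_and_swap(volatile uint32_t *A, uint32_t cmp, uint32_t swp)
{
    uint32_t expected = cmp;
    /* Builtin shown only to express the intended behavior. */
    __atomic_compare_exchange_n(A, &expected, swp, 0,
                                __ATOMIC_SEQ_CST, __ATOMIC_SEQ_CST);
    return expected;   /* equals cmp exactly when the swap succeeded */
}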



Q.11.8: Real coherence controllers include numerous transient states in addition to the ones shown in Figure  to support split-transaction buses. For example, when a processor issues a bus read for an invalid line (I), the line is placed in an IS transient state until the processor has received a valid data response that then causes the line to transition into the shared state (S). Given a split-transaction bus that separates each bus command (bus read, bus write, and bus upgrade) into a request and a response, augment the state table and state transition diagram of Figure  to incorporate all necessary transient states and bus responses. For simplicity, assume that any bus command for a line in a transient state gets a negative acknowledge (NAK) response that forces it to be retried after some delay.

Sol: 
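The augmented table is easiest to see in code form. Below is a minimal sketch, assuming a plain MSI-style protocol with three hypothetical transient states (IS: read data pending, IM: write data pending, SM: upgrade acknowledgement pending) and the NAK-on-transient rule from the problem statement; the state, event, and action names are assumptions, not the book's figure.

/* Minimal sketch (not the book's full table): stable MSI states plus the
 * transient states needed once bus commands are split into a request and
 * a response. */
typedef enum { ST_I, ST_S, ST_M,      /* stable states                     */
               ST_IS,                 /* I -> S, read data pending         */
               ST_IM,                 /* I -> M, write data pending        */
               ST_SM } state_t;       /* S -> M, upgrade ack pending       */

typedef enum { CPU_READ, CPU_WRITE,               /* processor-side events */
               RSP_DATA, RSP_ACK,                 /* bus responses         */
               BUS_READ, BUS_WRITE, BUS_UPGRADE } event_t;  /* snoops      */

typedef enum { ACT_NONE, ACT_ISSUE_REQ, ACT_NAK } action_t;

/* One step of the per-line controller. Any snooped bus command that hits
 * a line in a transient state is NAKed and must be retried later. */
state_t next_state(state_t s, event_t e, action_t *act)
{
    *act = ACT_NONE;
    switch (s) {
    case ST_I:
        if (e == CPU_READ)  { *act = ACT_ISSUE_REQ; return ST_IS; }
        if (e == CPU_WRITE) { *act = ACT_ISSUE_REQ; return ST_IM; }
        return ST_I;
    case ST_S:
        if (e == CPU_WRITE) { *act = ACT_ISSUE_REQ; return ST_SM; }
        if (e == BUS_WRITE || e == BUS_UPGRADE) return ST_I;
        return ST_S;
    case ST_M:
        if (e == BUS_READ)  return ST_S;   /* plus data writeback */
        if (e == BUS_WRITE) return ST_I;
        return ST_M;
    case ST_IS:                            /* waiting for read data    */
        if (e == RSP_DATA) return ST_S;
        break;
    case ST_IM:                            /* waiting for write data   */
        if (e == RSP_DATA) return ST_M;
        break;
    case ST_SM:                            /* waiting for upgrade ack  */
        if (e == RSP_ACK)  return ST_M;
        break;
    }
    /* Transient state hit by a snooped bus command: NAK it. */
    if (e == BUS_READ || e == BUS_WRITE || e == BUS_UPGRADE)
        *act = ACT_NAK;
    return s;
}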




Q.11.10: Assuming a processor frequency of 1 GHz, a target CPI of 2, a level-2 cache miss rate of 1% per instruction, a snoop-based cache-coherent system with 32 processors, and 8-byte address messages (including command and snoop addresses), compute the inbound and outbound snoop bandwidth required at each processor node.

Sol: Outbound snoop rate = 0.01 miss/inst x 1 inst/2 cyc x 1 cyc/ns x 8 bytes/miss
                         = 0.04 bytes/ns
                         = 40 million bytes per second

     Inbound snoop rate  = 31 x 40 million bytes per second
                         = 1240 million bytes per second
                         = 1182 MB/sec (taking 1 MB = 2^20 bytes)
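The arithmetic above can be double-checked with a short program; the variable names are just for this sketch, and MB is taken as 2^20 bytes.

/* Check of the snoop-bandwidth arithmetic above. */
#include <stdio.h>

int main(void)
{
    const double freq_hz   = 1e9;    /* 1 GHz                      */
    const double cpi       = 2.0;    /* target cycles/instruction  */
    const double miss_rate = 0.01;   /* L2 misses per instruction  */
    const double msg_bytes = 8.0;    /* address message size       */
    const int    nprocs    = 32;

    double insts_per_sec = freq_hz / cpi;
    double outbound      = insts_per_sec * miss_rate * msg_bytes;  /* B/s */
    double inbound       = outbound * (nprocs - 1);                /* B/s */

    printf("outbound: %.0f bytes/s (%.1f MB/s)\n",
           outbound, outbound / (1 << 20));
    printf("inbound : %.0f bytes/s (%.1f MB/s)\n",
           inbound,  inbound  / (1 << 20));
    return 0;
}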








Previous Topic:
Q.6.3: Given the dispatch and retirement bandwidth specified, how many integer ARF (architected register file) read and write ports are needed to sustain peak throughput? Given instruction mixes in Table 5-2, also compute average ports needed for each benchmark. Explain why you would not just build for the average case. Given the actual number of read and write ports specified, how likely is it that dispatch will be port-limited? How likely is it that retirement will be port-limited?
Q.6.11: The IBM POWER3 can detect up to four regular access streams and issue prefetches for future references. Construct an address reference trace that will utilize all four streams.
Q.6.12: The IBM POWER4 can detect up to eight regular access streams and issue prefetches for future references. Construct an address reference trace that will utilize all eight streams.
