Q.6.3: Given the dispatch and retirement bandwidth specified, how many integer ARF (architected register file) read and write ports are needed to sustain peak throughput? Given instruction mixes in Table 5-2, also compute average ports needed for each benchmark. Explain why you would not just build for the average case. Given the actual number of read and write ports specified, how likely is it that dispatch will be port-limited? How likely is it that retirement will be port-limited?
Sol: For peak throughput, 8 ARF read ports are needed at dispatch (4 instructions x 2 source operands each), and 4 ARF write ports are needed at retirement (4 instructions x 1 destination each). The average instruction mix is: 36.14% ALU, which require two read ports and one write port; 1.44% multicycle ALU, which require two read ports and one write port; 14.64% int loads, which require one read port and one write port; 11.03% FP loads, which require one read port but no write port (they write only the FP register file); 7.18% int stores, which require two read ports and no write port; and 4.12% FP stores, which require one read port and no write port. The weighted average for 4-wide dispatch read ports is
= 4 x (.3614x2 + .0144x2 + .1464x1 + .1103x1 + .0718x2 + .0412x1)
= 4.77 read ports.
The weighted average for 4-wide retirement write ports is
= 4 x (.3614 + .0144 + .1464)
= 2.1 write ports.
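The two weighted averages above can be checked mechanically, and the same routine can be rerun with a different mix for each benchmark. The following is a minimal sketch in C, assuming the average-mix fractions quoted in the solution; the names `mix_entry`, `avg_mix`, `avg_read_ports`, and `avg_write_ports` are illustrative and do not come from the text.

```c
#include <assert.h>
#include <stddef.h>

/* One row of an instruction mix: the fraction of dynamic instructions of
   this class and the integer ARF ports each such instruction uses.
   (Hypothetical type, introduced only for this sketch.) */
struct mix_entry {
    const char *name;
    double fraction;  /* e.g. 0.3614 for 36.14% */
    int reads;        /* integer ARF read ports per instruction */
    int writes;       /* integer ARF write ports per instruction */
};

/* Average-mix figures as quoted in the solution above. */
static const struct mix_entry avg_mix[] = {
    {"ALU",            0.3614, 2, 1},
    {"multicycle ALU", 0.0144, 2, 1},
    {"int load",       0.1464, 1, 1},
    {"FP load",        0.1103, 1, 0},
    {"int store",      0.0718, 2, 0},
    {"FP store",       0.0412, 1, 0},
};
static const size_t avg_mix_len = sizeof avg_mix / sizeof avg_mix[0];

/* Average read ports used per cycle at the given dispatch width. */
double avg_read_ports(const struct mix_entry *mix, size_t n, int width)
{
    assert(mix != NULL && width > 0);
    double per_insn = 0.0;
    for (size_t i = 0; i < n; ++i)
        per_insn += mix[i].fraction * mix[i].reads;
    return width * per_insn;
}

/* Average write ports used per cycle at the given retirement width. */
double avg_write_ports(const struct mix_entry *mix, size_t n, int width)
{
    assert(mix != NULL && width > 0);
    double per_insn = 0.0;
    for (size_t i = 0; i < n; ++i)
        per_insn += mix[i].fraction * mix[i].writes;
    return width * per_insn;
}
```

Calling `avg_read_ports(avg_mix, avg_mix_len, 4)` and `avg_write_ports(avg_mix, avg_mix_len, 4)` from a `main` reproduces the read-port and write-port averages; substituting a per-benchmark table gives the per-benchmark figures.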
Q.6.11: The IBM POWER3 can detect up to four regular access streams and issue prefetches for future references. Construct an address reference trace that will utilize all four streams.
Sol: Example code: for (i = 0; i < 10000; ++i) sum += A[i] + B[i] + C[i] + D[i];
Reference stream (byte addresses, assuming 4-byte array elements): A, B, C, D, A+4, B+4, C+4, D+4, A+8, B+8, C+8, D+8, etc.
Q.6.12: The IBM POWER4 can detect up to eight regular access streams and issue prefetches for future references. Construct an address reference trace that will utilize all eight streams.
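Sol: Example code: for (i = 0; i < 10000; ++i) sum += A[i] + B[i] + C[i] + D[i] + E[i] + F[i] + G[i] + H[i];
Reference stream (byte addresses, assuming 8-byte array elements): A, B, C, D, E, F, G, H, A+8, B+8, C+8, D+8, E+8, F+8, G+8, H+8, A+16, B+16, C+16, D+16, E+16, F+16, G+16, H+16, etc.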
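The interleaved reference streams in the two solutions above can also be generated mechanically. The sketch below assumes each array is one sequential stream with its own base address and a fixed element size (4 bytes matches the A+4 trace, 8 bytes the A+8 trace). The function name `build_trace` and the base addresses passed to it are hypothetical, chosen only for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Fill trace[] with the addresses touched, in order, by a loop of the form
       for (i = 0; i < iters; ++i) sum += S0[i] + S1[i] + ...;
   bases[s] is the (hypothetical) base address of stream s.
   Writes at most cap entries and returns the number written. */
size_t build_trace(const uintptr_t *bases, size_t nstreams,
                   size_t elem_size, size_t iters,
                   uintptr_t *trace, size_t cap)
{
    assert(bases != NULL && trace != NULL);
    size_t n = 0;
    for (size_t i = 0; i < iters; ++i) {
        /* One reference per stream per iteration, each at a fixed stride. */
        for (size_t s = 0; s < nstreams; ++s) {
            if (n == cap)
                return n;
            trace[n++] = bases[s] + i * elem_size;
        }
    }
    return n;
}
```

Passing one base address per array gives one ascending stream per array, so the number of entries in `bases` sets how many prefetch streams the trace exercises.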