Q.2.16: In a TYP-based pipeline design with a data cache, load instructions check the tag array for a cache hit in parallel with accessing the data array to read the corresponding memory location. Pipelining stores to such a cache is more difficult, since the processor must check the tag first, before it overwrites the data array. Otherwise, in the case of a cache miss, the wrong memory location may be overwritten by the store. Design a solution to this problem that does not require sending the store down the pipe twice, or stalling the pipe for every store instruction. Referring to Figure 2-15, are there any new RAW, WAR, and/or WAW memory hazards?
Solution: One possible design is to do the tag check for the store in the EX stage, but buffer the store data. On a store miss, the pipeline will stall until the line is filled.
On a hit (or once the miss is resolved), the store data stays in the buffer until the next store comes down the pipe. In parallel with checking the tag for the next store, the data for the previous store is written into the data array. This is a fully pipelined solution.
However, there is now a later memory write stage: since the write does not occur until the next store reaches the MEM stage, store operations behave as if they sit in an arbitrarily deep pipeline with the write stage at the bottom. RAW memory hazards are therefore possible, since a subsequent load from the same address could enter the MEM stage before the store data has been written from the buffer into the cache. Hazard detection logic must compare the address held in the store buffer against subsequent loads, and must either stall the load until the store has drained into the cache or bypass the data to the load directly from the buffer. No new WAR or WAW memory hazards arise: loads still read the cache no later than before, so a delayed store cannot overwrite data that an earlier load has yet to read, and buffered stores drain into the cache in program order, so writes to the same address still complete in order.
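The buffering and forwarding described above can be sketched in C. This is a minimal single-entry store-buffer model, not the actual pipeline logic; the names (`StoreBuffer`, `pipe_store`, `pipe_load`) and the toy 256-word data array are illustrative assumptions.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical single-entry store buffer: a store's data waits here
 * until the next store reaches MEM, at which point it drains into
 * the data array (in parallel with the new store's tag check). */
typedef struct {
    bool     valid;
    uint32_t addr;
    uint32_t data;
} StoreBuffer;

static uint32_t cache_data[256];  /* toy data array, word-addressed */

/* A new store first drains the previously buffered store,
 * then occupies the buffer itself. */
static void pipe_store(StoreBuffer *sb, uint32_t addr, uint32_t data) {
    if (sb->valid)
        cache_data[sb->addr % 256] = sb->data;  /* drain previous store */
    sb->valid = true;                           /* buffer current store */
    sb->addr  = addr;
    sb->data  = data;
}

/* A load must check the buffer first: on an address match it bypasses
 * the pending store data directly (RAW hazard resolution). */
static uint32_t pipe_load(const StoreBuffer *sb, uint32_t addr) {
    if (sb->valid && sb->addr == addr)
        return sb->data;               /* forward from the store buffer */
    return cache_data[addr % 256];     /* normal data-array read */
}
```

Note that because stores pass through the single buffer entry strictly in order, the model also makes the in-order drain (and hence the absence of new WAW hazards) easy to see.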
Q.2.17: The MIPS pipeline shown in Table 2-7 employs a two-phase clocking scheme that makes efficient use of a shared TLB, since instruction fetch accesses the TLB in phase one and data fetch accesses it in phase two. However, when resolving a conditional branch, both the branch target address and the branch fall-through address need to be translated during phase one in parallel with the branch condition check in phase one of the ALU stage to enable instruction fetch from either the target or the fall-through during phase two. This seems to imply a dual-ported TLB. Suggest an architected solution to this problem that avoids dual-porting the TLB.
Solution: Two solutions are possible. In either case, the instruction translation for the branch instruction is reused for the subsequent fetch. The first solution requires that all branch targets lie within the same physical page as the branch instruction, so the physical page number of the branch instruction can be reused for the branch target. The compiler and programmer must take special care to ensure that this is the case. Alternatively, the branch fall-through path can be restricted to lie on the same page. In this scenario, the pipeline reuses the physical page number of the branch when fetching the fall-through path, and uses the TLB only to translate the target address. This restriction is simpler, since it merely forbids the compiler or programmer from placing a branch instruction in the last instruction slot of a physical page. Whenever a branch would fall into such a location, the compiler can insert NOPs before it, pushing the branch onto the next page.
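The fall-through restriction only bites when a branch occupies the last instruction slot of a page. A minimal check, assuming 4 KB pages and 4-byte fixed-width instructions (both assumptions, and `needs_nop_padding` is a hypothetical helper name):

```c
#include <stdbool.h>
#include <stdint.h>

#define PAGE_SIZE 4096u   /* assumed 4 KB page size */
#define INSN_SIZE 4u      /* assumed 4-byte fixed-width instructions */

/* The fall-through same-page restriction is violated only when the
 * branch is the last instruction on its page, i.e. when PC + 4 lands
 * on a different page than the branch itself. The compiler would then
 * insert NOPs before the branch to push it onto the next page. */
static bool needs_nop_padding(uint32_t branch_pc) {
    uint32_t fallthrough_pc = branch_pc + INSN_SIZE;
    return (branch_pc / PAGE_SIZE) != (fallthrough_pc / PAGE_SIZE);
}
```

For example, a branch at 0x0FFC (the last word of page 0) needs padding, while one at 0x0FF8 does not.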