

# Reading

 <u>Data prefetch mechanisms</u>, Steven P. Vanderwiel, David J. Lilja, ACM Computing Surveys, Vol. 32, Issue 2 (June 2000)

## Prefetching

- Predict future cache misses
- Issue a fetch to memory system in advance of the actual memory reference
- Hide memory access latency




### **Basic Questions**

1. When to initiate prefetches?
   - Timely
   - Too early → replaces other useful data (cache pollution) or is itself replaced before being used
   - Too late → cannot hide the processor stall
2. Where to place prefetched data?
   - Cache or dedicated buffer
3. What to prefetch?


## Prefetching Approaches

- Software-based
  - Explicit "fetch" instructions
  - Additional instructions executed
- Hardware-based
  - Special hardware
  - Unnecessary prefetches (without compile-time information)






## Loop-based Prefetching

- Loops of large array calculations
  - Common in scientific codes
  - Poor cache utilization
  - Predictable array referencing patterns
- `fetch` instructions can be placed inside loop bodies so that the current iteration prefetches data for a future iteration



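## Example: Vector Product (Cont.)

- Prefetching + loop unrolling

```
/* fetch() is the slides' non-blocking prefetch pseudo-instruction */
for (i = 0; i < N; i += 4) {
    fetch(&a[i+4]);
    fetch(&b[i+4]);
    sum += a[i]*b[i];
    sum += a[i+1]*b[i+1];
    sum += a[i+2]*b[i+2];
    sum += a[i+3]*b[i+3];
}
```

- Problem: first and last iterations (nothing is prefetched for the first iteration, and the last iteration prefetches past the end of the arrays)
- Prefetching + software pipelining

```
/* Prologue: prefetch for the first iteration */
fetch(&sum);
fetch(&a[0]);
fetch(&b[0]);
/* Main loop: each iteration prefetches for the next one */
for (i = 0; i < N-4; i += 4) {
    fetch(&a[i+4]);
    fetch(&b[i+4]);
    sum += a[i]*b[i];
    sum += a[i+1]*b[i+1];
    sum += a[i+2]*b[i+2];
    sum += a[i+3]*b[i+3];
}
/* Epilogue: last iteration, no prefetches */
for (i = N-4; i < N; i++)
    sum = sum + a[i]*b[i];
```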
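
The software-pipelined version can be sketched as compilable C. This is a minimal sketch, not the slides' code: it assumes GCC or Clang, where `__builtin_prefetch` stands in for the `fetch` pseudo-instruction, and the names `PREFETCH` and `dot_prefetch` are hypothetical. The loop bound is written as `i + 4 < n` so that the function also handles lengths that are not a multiple of 4.

```c
#include <assert.h>
#include <stddef.h>

/* PREFETCH wraps the GCC/Clang builtin; on other compilers it is a no-op
 * (assumption: a prefetch is only a hint, so dropping it keeps the result). */
#if defined(__GNUC__) || defined(__clang__)
#define PREFETCH(p) __builtin_prefetch((p))
#else
#define PREFETCH(p) ((void)0)
#endif

/* Dot product of a[0..n-1] and b[0..n-1] with software-pipelined prefetching. */
double dot_prefetch(const double *a, const double *b, size_t n)
{
    double sum = 0.0;
    size_t i = 0;

    assert(a != NULL && b != NULL);

    if (n >= 4) {
        /* Prologue: prefetch the data used by the first iteration. */
        PREFETCH(&a[0]);
        PREFETCH(&b[0]);

        /* Main loop: each iteration prefetches the data for the next one.
         * The bound i + 4 < n keeps every prefetched address in range. */
        for (; i + 4 < n; i += 4) {
            PREFETCH(&a[i + 4]);
            PREFETCH(&b[i + 4]);
            sum += a[i] * b[i];
            sum += a[i + 1] * b[i + 1];
            sum += a[i + 2] * b[i + 2];
            sum += a[i + 3] * b[i + 3];
        }
    }

    /* Epilogue: remaining elements, with no further prefetches. */
    for (; i < n; i++)
        sum += a[i] * b[i];

    return sum;
}
```

Calling `dot_prefetch(a, b, n)` returns the same value as a plain loop; whether it runs faster depends on the machine, since many processors already prefetch sequential streams in hardware.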

## Example: Vector Product (Cont.)

- Previous assumption: prefetching 1 iteration ahead is sufficient to hide the memory latency.
- When loops contain small computational bodies, it may be necessary to initiate prefetches $\delta$ iterations before the data is referenced: $\delta = \left\lceil \frac{l}{s} \right\rceil$
- $\delta$: prefetch distance; $l$: average memory latency; $s$: estimated cycle time of the shortest possible execution path through one loop iteration
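
The prefetch distance can be computed with integer arithmetic, as in the small sketch below. The function name `prefetch_distance` and the cycle counts in the usage note are hypothetical, and the result is rounded up because the distance counts whole loop iterations.

```c
#include <assert.h>

/* Prefetch distance in loop iterations: average memory latency divided by
 * the shortest cycle time of one iteration, rounded up. */
unsigned prefetch_distance(unsigned latency_cycles, unsigned iter_cycles)
{
    assert(iter_cycles > 0);
    return (latency_cycles + iter_cycles - 1) / iter_cycles;
}
```

For example, with an assumed latency of 100 cycles and a 45-cycle loop body, `prefetch_distance(100, 45)` gives 3, so the loop would issue `fetch(&a[i + 3*4])` rather than `fetch(&a[i + 4])` when each iteration covers 4 elements.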

### Limitations of Software-based Prefetching

- Normally restricted to loops with array accesses
- Hard for general applications with irregular access patterns
- Processor execution overhead
- Significant code expansion
- Performed statically

## Hardware Instruction and Data Prefetching

- No need for programmer or compiler intervention
- No changes to existing executables
- Take advantage of run-time information
- E.g., instruction prefetching
  - Alpha 21064 fetches 2 blocks on a miss
  - Extra block is placed in a "stream buffer"
  - On a miss, check the stream buffer
- Works with data blocks too:
  - Jouppi [1990]: 1 data stream buffer got 25% of misses from a 4KB cache; 4 streams got 43%
  - Palacharla & Kessler [1994]: for scientific programs, 8 streams got 50% to 70% of misses from 2 64KB, 4-way set associative caches
- Prefetching relies on having extra memory bandwidth that can be used without penalty

## Sequential Prefetching

- Take advantage of spatial locality
- One block lookahead (OBL) approach
  - Initiate a prefetch for block *b+1* when block *b* is accessed
  - Prefetch-on-miss
    - Prefetch block *b+1* whenever an access to block *b* results in a cache miss
  - Tagged prefetch
    - Associates a tag bit with every memory block
    - Prefetch block *b+1* when block *b* is demand-fetched or a prefetched block *b* is referenced for the first time
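
The difference between the two OBL policies shows up on a purely sequential sweep, which the following toy simulation illustrates. It is a sketch under simplifying assumptions: the cache never evicts anything, prefetched blocks arrive before they are needed, and all names (`obl_cache`, `obl_access`, `NBLOCKS`) are hypothetical. The `tag` flag here marks a block that was prefetched and has not been referenced yet.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

#define NBLOCKS 64  /* hypothetical size of the simulated address space, in blocks */

enum obl_policy { PREFETCH_ON_MISS, TAGGED_PREFETCH };

struct obl_cache {
    bool present[NBLOCKS + 1]; /* +1 so that block b+1 is always a valid index */
    bool tag[NBLOCKS + 1];     /* true: prefetched and not yet referenced */
};

/* Bring block b+1 into the cache as a prefetch, unless it is already there. */
static void prefetch_next(struct obl_cache *c, int b)
{
    if (!c->present[b + 1]) {
        c->present[b + 1] = true;
        c->tag[b + 1] = true;
    }
}

/* Access block b under the given policy; returns true on a cache miss. */
bool obl_access(struct obl_cache *c, enum obl_policy policy, int b)
{
    assert(b >= 0 && b < NBLOCKS);
    if (!c->present[b]) {
        c->present[b] = true;  /* demand fetch */
        c->tag[b] = false;
        prefetch_next(c, b);   /* both policies prefetch b+1 on a miss */
        return true;
    }
    if (policy == TAGGED_PREFETCH && c->tag[b]) {
        c->tag[b] = false;     /* first reference to a prefetched block */
        prefetch_next(c, b);
    }
    return false;
}

/* Misses for a sequential sweep over blocks 0..n-1, starting with an empty cache. */
int obl_sequential_misses(enum obl_policy policy, int n)
{
    struct obl_cache c;
    int misses = 0;

    memset(&c, 0, sizeof c);
    for (int b = 0; b < n; b++)
        misses += obl_access(&c, policy, b);
    return misses;
}
```

Under these assumptions, a sweep over 8 blocks misses on every other block with prefetch-on-miss (4 misses) and only on the first block with tagged prefetch (1 miss).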

## **OBL** Approaches

[Figure: prefetch-on-miss vs. tagged prefetch, with blocks marked as demand-fetched or prefetched]

## Degree of Prefetching

- OBL may not initiate prefetches far enough in advance to avoid a processor memory stall
- Prefetch K > 1 subsequent blocks
  - Additional traffic and cache pollution
- Adaptive sequential prefetching
  - Vary the value of K during program execution
  - High spatial locality → large K value
  - Prefetch efficiency metric
    - Periodically calculated
    - Ratio of useful prefetches to total prefetches
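
One way to picture the adaptive scheme is a periodic update that nudges K up or down based on the efficiency ratio. The sketch below is illustrative only: the thresholds, the cap `K_MAX`, the floor of 1, and the name `adapt_degree` are all assumptions, not values taken from the slides.

```c
#include <assert.h>

#define K_MAX 8               /* hypothetical upper bound on the degree */
#define UPPER_THRESHOLD 0.75  /* hypothetical: above this, prefetch more */
#define LOWER_THRESHOLD 0.40  /* hypothetical: below this, prefetch less */

/* Returns the degree of prefetching for the next period, given the current
 * degree k and the prefetch counts observed during the period that just ended. */
unsigned adapt_degree(unsigned k, unsigned useful, unsigned total)
{
    assert(useful <= total);
    if (total == 0)
        return k;  /* nothing was measured this period */

    double efficiency = (double)useful / (double)total;

    if (efficiency > UPPER_THRESHOLD && k < K_MAX)
        return k + 1;  /* high spatial locality: larger K */
    if (efficiency < LOWER_THRESHOLD && k > 1)
        return k - 1;  /* many wasted prefetches: smaller K */
    return k;
}
```

For instance, with K = 2 and 9 useful prefetches out of 10, the next period would use K = 3; with only 1 useful prefetch out of 10, it would drop to K = 1.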

### Stream Buffer

- K prefetched blocks → FIFO stream buffer
- As each buffer entry is referenced
  - Move it to cache
  - Prefetch a new block into the stream buffer
- Avoid cache pollution
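
The FIFO behavior can be sketched as a small circular buffer that is consulted on each cache miss. This sketch assumes that only the head entry is compared against the missing block and that a miss in the buffer flushes it and starts a new stream; the depth `SB_DEPTH` and all function names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define SB_DEPTH 4  /* K: hypothetical number of buffer entries */

struct stream_buffer {
    long block[SB_DEPTH]; /* prefetched block numbers; block[head] is the oldest */
    int head;
    bool valid;
    long next;            /* next block number to prefetch */
};

/* Start a new stream after a miss on block b: prefetch b+1 .. b+SB_DEPTH. */
void sb_restart(struct stream_buffer *sb, long b)
{
    for (int i = 0; i < SB_DEPTH; i++)
        sb->block[i] = b + 1 + i;
    sb->head = 0;
    sb->next = b + 1 + SB_DEPTH;
    sb->valid = true;
}

/* Called on a cache miss for block b.
 * Returns true if the head of the stream buffer supplies the block. */
bool sb_lookup(struct stream_buffer *sb, long b)
{
    assert(sb != NULL);
    if (sb->valid && sb->block[sb->head] == b) {
        /* Hit: the entry moves to the cache, and the freed slot is refilled
         * with a newly prefetched block, which becomes the FIFO tail. */
        sb->block[sb->head] = sb->next++;
        sb->head = (sb->head + 1) % SB_DEPTH;
        return true;
    }
    sb_restart(sb, b);  /* miss in the buffer too: flush and start over */
    return false;
}

/* Memory stalls for a sequential sweep over blocks 0..n-1,
 * assuming every access misses in the cache itself. */
int sb_sequential_stalls(int n)
{
    struct stream_buffer sb = {0};
    int stalls = 0;

    for (long b = 0; b < n; b++)
        if (!sb_lookup(&sb, b))
            stalls++;
    return stalls;
}
```

In this model a sequential sweep stalls only on its first block, and the prefetched blocks never occupy a cache line until they are actually referenced.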



## Prefetching with Arbitrary Strides

- Employ special logic to monitor the processor's address referencing pattern
- Detect constant stride array references originating from looping structures
- Compare successive addresses used by load or store instructions
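
The comparison of successive addresses can be sketched as a per-instruction record that remembers the previous address and the previous stride. This is a simplified illustration, assuming one entry per load or store instruction and a prediction as soon as two consecutive strides match; real designs typically add a state machine per entry. The names `stride_entry` and `stride_observe` are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

struct stride_entry {
    uintptr_t prev_addr; /* last address used by this instruction */
    uintptr_t stride;    /* last observed stride, kept modulo 2^N so that
                            negative strides work through wraparound */
    int seen;            /* addresses observed so far, saturating at 2 */
};

/* Record one address used by the instruction that owns entry e.
 * Returns true and sets *prefetch_addr when the two most recent strides
 * are equal and nonzero, i.e. a constant stride has been detected. */
bool stride_observe(struct stride_entry *e, uintptr_t addr,
                    uintptr_t *prefetch_addr)
{
    bool predict = false;

    assert(e != NULL && prefetch_addr != NULL);

    if (e->seen >= 1) {
        uintptr_t stride = addr - e->prev_addr;
        if (e->seen >= 2 && stride == e->stride && stride != 0) {
            *prefetch_addr = addr + stride;  /* next expected address */
            predict = true;
        }
        e->stride = stride;
    }
    if (e->seen < 2)
        e->seen++;
    e->prev_addr = addr;
    return predict;
}
```

With a zero-initialized entry and the address sequence 100, 108, 116, the third call returns true with a predicted address of 124; a jump to an unrelated address breaks the pattern until two matching strides are seen again.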


## Software vs. Hardware Prefetching

#### Software

- Compile-time analysis; schedules fetch instructions within the user program

#### Hardware

- Run-time analysis without any compiler or user support

#### Integration

- E.g., the compiler calculates the degree of prefetching (K) for a particular reference stream and passes it on to the prefetch hardware.
