### Lecture 24: Two-level Cache; Cache Performance

#### Two-level cache

[this section adapted from Gottlieb's notes]

Modern high end PCs and workstations all have at least two levels of caches: A very fast, and hence not very big, first level (L1) cache together with a larger but slower L2 cache.  Some recent microprocessors (e.g., Core i7) have 3 levels.

When a miss occurs in L1, L2 is examined, and only if a miss occurs there is main memory referenced.

So the average miss penalty for an L1 miss is

`	(L2 hit rate)*(L2 time) + (L2 miss rate)*(L2 time + memory time)`

We are assuming that the L2 access time is the same whether L2 hits or misses, and that the memory access does not begin until the L2 miss has been detected.

An example

• Assume
1. L1 I-cache miss rate 4%
2. L1 D-cache miss rate 5%
3. 40% of instructions reference data (load or store)
4. L2 miss rate 6%
5. L2 time of 10 ns
6. Memory access time 40 ns
7. Base CPI of 2
8. Clock rate 2 GHz
• How many instructions per second does this machine execute?
• How many instructions per second would this machine execute if the L2 cache were eliminated?
• How many instructions per second would this machine execute if both caches were eliminated?
• How many instructions per second would this machine execute if the L2 cache had a 0% miss rate (L1 as originally specified)?
• How many instructions per second would this machine execute if both L1 caches had a 0% miss rate?

#### Effect of cache size on code

For programs operating on large arrays, there can be a significant speed-up if the array fits in the cache.  For example, if we try the program

```c
#include <stdio.h>
#include <time.h>

#define ARRAYSIZE 200000
#define ITERATIONS 5000

int ray[ARRAYSIZE];
int sum;

int main(void) {
    int iter, i;
    clock_t time1, time2;
    long msec;

    time1 = clock();
    for (iter = 0; iter < ITERATIONS; iter++)
        for (i = 0; i < ARRAYSIZE; i++)
            sum += ray[i];
    time2 = clock();

    msec = (time2 - time1) * 1000 / CLOCKS_PER_SEC;  /* CLK_TCK is obsolete */
    printf("CPU time: %ld milliseconds\n", msec);
    return 0;
}
```

with different values of ARRAYSIZE and ITERATIONS (keeping their product, and hence the number of instructions executed, constant), the program runs faster when the array fits in the cache.  Even when the array exceeds the cache size, we benefit from the cache fetching an entire line at a time, so we do not take a miss on every element.  If we skip array elements

```c
#include <stdio.h>
#include <time.h>

#define ARRAYSIZE 800000
#define SKIPSIZE 4
#define ITERATIONS 5000

int ray[ARRAYSIZE];
int sum;

int main(void) {
    int iter, i;
    clock_t time1, time2;
    long msec;

    time1 = clock();
    for (iter = 0; iter < ITERATIONS; iter++)
        for (i = 0; i < ARRAYSIZE; i += SKIPSIZE)
            sum += ray[i];
    time2 = clock();

    msec = (time2 - time1) * 1000 / CLOCKS_PER_SEC;  /* CLK_TCK is obsolete */
    printf("CPU time: %ld milliseconds\n", msec);
    return 0;
}
```

this benefit decreases.  Large scientific programs can structure their array operations to take advantage of these cache properties.

#### Virtual memory

Virtual memory provides the next step in the memory hierarchy after main memory: disk. The gap in access times, however, is much larger (roughly 100 ns for main memory versus 10 ms for disk), and this changes the design parameters: pages are much larger than cache lines, and hit ratios must be much higher for the machine to run efficiently. As technology changes, we can expect the parameters of the individual memory levels to change, but the basic idea of a memory hierarchy will remain.