======== START LECTURE #19
2.3: Relating the metrics
The execution time for a given job on a given computer is
(CPU) execution time = (#CPU clock cycles required) * (cycle time)
= (#CPU clock cycles required) / (clock rate)
The number of CPU clock cycles required equals the number of
instructions executed times the average number of cycles in each
- In our single cycle implementation, the number of cycles required
is just the number of instructions executed.
- If every instruction took 5 cycles, the number of cycles required
would be five times the number of instructions executed.
But real systems are more complicated than that!
- Some instructions take more cycles than others.
- With pipelining, several instructions are in progress at different
stages of their execution.
- With super scalar (or VLIW) many instructions are issued at once.
- Since modern superscalars (and VLIWs) are also pipelined we have
many many instructions executing at once.
Through a great many measurement, one calculates for a given machine
the average CPI (cycles per instruction).
The number of instructions required for a given program depends on
the instruction set.
For example, we saw in chapter 3 that 1 Vax instruction is often
accomplishes more than 1 MIPS instruction.
Complicated instructions take longer; either more cycles or longer cycle
Older machines with complicated instructions (e.g. VAX in 80s) had CPI>>1.
With pipelining can have many cycles for each instruction but still
have CPI nearly 1.
Modern superscalar machines have CPI < 1.
They issue many instructions each cycle.
They are pipelined so the instructions don't finish for several cycles.
If we consider a 4-issue superscalar and assume that all
instructions require 5 (pipelined) cycles, there are
up to 20=5*4 instructions in progress (often called in flight) at
Putting this together, we see that
Time (in seconds) = #Instructions * CPI * Cycle_time (in seconds).
Time (in ns) = #Instructions * CPI * Cycle_time (in ns).
Carefully go through and understand the example on page 59
Make sure you can easily do all the problems with a rating of
 and can do all with a rating of .
What is the MIPS rating for a computer and how useful is it?
MIPS stands for Millions of Instructions Per Second.
- It is a unit of rate or speed (like MHz), not of time (like ns.).
It is not the same as the MIPS computer (but the name
similarity is not a coincidence).
- The number of seconds required to execute a given (machine
language) program is
the number of instructions executed / the number executed per second.
- The number of microseconds required to execute a given (machine
language) program is
the number of instructions executed / MIPS.
BUT ... .
- The same program in C (or Java, or Ada, etc) might need
different number of instructions
on different computers.
For example, one VAX instruction might require 2
instructions on a power-PC and 3 instructions on an X86).
- The same program in C when compiled by two different compilers
for the same computer architecture, might need to executed
different numbers of instructions.
Different programs may achieve different MIPS ratings on the
- Some programs execute more long instructions
than do other programs.
- Some programs have more cache misses and hence cause
more waiting for memory.
- Some programs inhibit full pipelining
(e.g., they may have more mispredicted branches).
- Some programs inhibit full superscalar behavior
(e.g., they may have unhideable data dependencies).
One can often raise the MIPS rating by adding NOPs, despite
increasing execution time. How?
Ans. MIPS doesn't require useful instructions and
NOPs, while perhaps useless, are nonetheless very fast.
- So, unlike MHz, MIPS is not a value that be defined for a specific
computer; it depends on other factors, e.g., language/compiler used,
problem solved, and algorithm employed.
Carefully go through and understand the example on pages 61-3
How about MFLOPS (Million of FLoating point OPerations per Second)?
For numerical calculations floating point operations are the
ones you are interested in; the others are ``overhead'' (a
very rough approximation to reality).
It has similar problems to MIPS.
- The same program needs different numbers of floating point operations
on different machines (e.g., is sqrt one instruction or several?).
- Compilers effect the MFLOPS rating.
- MFLOPS is Not as bad as MIPS since adding NOPs lowers the MFLOPs
- But you can insert unnecessary floating point ADD instructions
and this will probably raise the MFLOPS rating. Why?
Because it will lower the percentage of ``overhead'' (i.e.,
non-floating point) instructions.
Benchmarks are better than MIPS or MFLOPS, but still have difficulties.
- It is hard to find benchmarks that represent your future
- Compilers can be ``tuned'' for important benchmarks.
- Benchmarks can be chosen to favor certain architectures.
- If your processor has 256KB of cache memory and
your competitor's has 128MB, you try to find a benchmark that
frequently accesses a region of memory having size between 128MB
- If your 128MB cache is 2 way set associative (defined later this
month) while your competitors 256MB cache is direct mapped, then
you build/choose a benchmark that frequently accesses exactly two
10K arrays separated by an exact multiple of 256KB.
Carefully go through and understand 2.7 ``fallacies and pitfalls''.
Chapter 7: Memory
Read Chapter 7
Ideal memory is
Big (in capacity; not physical size).
So we use a memory hierarchy ...
Cache (really L1, L2, and maybe L3)
... and try to catch most references
in the small fast memories near the top of the hierarchy.
There is a capacity/performance/price gap between each pair of
We will study the cache <---> memory gap
In modern systems there are many levels of caches.
Similar considerations apply to the other gaps (e.g.,
memory<--->disk, where virtual memory techniques are applied).
This is the gap studied in 202.
But the terminology is often different, e.g., in architecture we
evict cache blocks or lines whereas in OS we evict pages.
In fall 97 my OS class was studying ``the same thing'' at this
exact point (memory management). That wasn't true last year since
the OS text changed and memory management came earlier. This year
memory management is a little later so we are back to the fall 97
We observe empirically (and teach in 202).
Temporal Locality: The word referenced now is likely to be
referenced again soon. Hence it is wise to keep the currently
accessed word handy (high in the memory hierarchy) for a while.
Spatial Locality: Words near the currently referenced
word are likely to be referenced soon. Hence it is wise to
prefetch words near the currently referenced word and keep them
handy (high in the memory hierarchy) for a while.
A cache is a small fast memory between the
processor and the main memory. It contains a subset of the contents
of the main memory.
A Cache is organized in units of blocks.
Common block sizes are 16, 32, and 64 bytes.
This is the smallest unit we can move to/from a cache.
We view memory as organized in blocks as well. If the block
size is 16, then bytes 0-15 of memory are in block 0, bytes 16-31
are in block 1, etc.
Transfers from memory to cache and back are one block.
Big blocks make good use of spatial locality.
If you remember memory management in OS, think of pages and page
A hit occurs when a memory reference is found in
the upper level of memory hierarchy.
We will be interested in cache hits (OS courses
study page hits), when the reference is found in the cache (OS:
when found in main memory).
A miss is a non-hit.
The hit rate is the fraction of memory references
that are hits.
The miss rate is 1 - hit rate, which is the
fraction of references that are misses.
The hit time is the time required for a hit.
The miss time is the time required for a miss.
The miss penalty is Miss time - Hit time.