Class 13
CS 202
24 March 2015

On the board
------------
1. Last time
2. Page faults: uses
3. Page faults: costs
4. Page replacement policies
5. Thrashing

---------------------------------------------------------------------------

1. Last time

--paging, page tables
--virtual memory on the x86: page table walking
--TLB
--page faults: mechanics. discussed. also, see handout. also, see
  chapter 21 in OSTEP
  (NOTE: the book assumes that a PTE (page table entry) has a "valid"
  bit and a "present" bit, both of which the hardware understands. The
  x86 doesn't work this way. The x86 defines a "present" bit in page
  table entries. The OS can define one of the unused bits to mean
  "valid", but this is not something the hardware would understand.)

2. Uses of page faults

--Best example: overcommitting physical memory (the classical use of
  "virtual memory")

    --your program thinks it has, say, 512 MB of memory, but your
      hardware has only 256 MB of memory
    --the way that this worked is that the disk was (is) used to store
      memory pages
    --advantage: the address space looks huge
    --disadvantage: accesses to "paged" memory (as memory pages that
      live on the disk are known) are sllooooowwwww
    --Rough implementation:
        --on a page fault, the kernel reads in the faulting page
        --QUESTION: what is listed in the page structures? how does
          the kernel know whether the address is invalid, in memory,
          paged, what?
        --the kernel may need to send a page to disk (under what
          conditions? answer: two conditions must hold for the kernel
          to HAVE to write to disk:
            (1) the kernel is out of memory, and
            (2) the page that it selects to write out is dirty)
    --Computers have lots of memory, so it's less common to hear the
      sound of swapping these days. You would need multiple large
      memory consumers running on the same computer.

--Many other uses

    --store memory pages across the network!
      (Distributed Shared Memory)
        --the basic idea was that, on a page fault, the page fault
          handler went and retrieved the needed page from some other
          machine
    --copy-on-write
        --when creating a copy of another process, don't copy its
          memory. just copy its page tables, and mark the pages as
          read-only
        --QUESTION: do you need to mark the parent's pages as
          read-only as well?
        --program semantics aren't violated when programs do reads
        --when a write happens, a page fault results. at that point,
          the kernel allocates a new page, copies the memory over,
          marks the page writable, and restarts the user program at
          the faulting instruction
        --thus, memory is copied only when there is a fault as a
          result of a write
        --this idea is all over the place; used in fork(), mmap(), etc.
    --accounting
        --a good way to sample what percentage of the memory pages are
          written to in any time slice: mark a fraction of them not
          present, and see how often you get faults
    --if you are interested in this, check out the paper "Virtual
      Memory Primitives for User Programs", by Andrew W. Appel and Kai
      Li, Proc. ASPLOS, 1991.
    --high-level idea: by giving the kernel (or even a user-level
      program) the opportunity to do interesting things on page
      faults, you can build interesting functionality

--Paging in day-to-day use

    --Demand paging: bring program code into memory "lazily"
    --Growing the stack (contiguous in virtual space, probably not in
      physical space)
    --BSS page allocation (the BSS segment contains the part of the
      address space with global variables, statically initialized to
      zero. The OS can delay allocating and zeroing a page until the
      program accesses a variable on the page.)
    --Shared text
    --Shared libraries
    --Shared memory

3. Page faults: costs

--What does paging from the disk cost?

    --let's look at average memory access time (AMAT)

    --AMAT = (1-p)*(memory access time) + p*(page fault time), where p
      is the probability of a page fault.
        memory access time ~ 100 ns
        disk access time   ~ 10 ms = 10^7 ns

    --QUESTION: what does p need to be to ensure that paging hurts
      performance by less than 10%?

        1.1*t_M = (1-p)*t_M + p*t_D
        p = .1*t_M / (t_D - t_M) ~ 10^1 ns / 10^7 ns = 10^{-6}

      so only one access out of 1,000,000 can be a page fault!!

    --basically, page faults are super-expensive (good thing the
      machine can do other things during a page fault)

    The concept is much larger than OSes: need to pay attention to the
    slow case if it's really slow and common enough to matter.

4. Page replacement policies

--the fundamental problem/question:

    --some entity holds a cache of entries and gets a cache miss. The
      entity now needs to decide which entry to throw away. How does
      it decide?

    --make sure you understand why page faults that result from
      "page-not-present in memory" are a particular kind of cache miss

      (the answer is that in the world of virtual memory, the pages
      resident in memory are basically a cache for the backing store
      on the disk; make sure you see why this claim, about virtual
      memory vis-a-vis the disk, is true)

    --the system needs to decide which entry to throw away, which
      calls for a *replacement policy*

    --let's cover some policies....

Specific policies

* FIFO: throw out the oldest page. (this results in every page
  spending the same number of references in memory. not a good idea:
  pages are not accessed uniformly.)

* MIN (also known as OPT): throw away the entry that won't be used for
  the longest time. this is optimal. our textbook and other references
  assert its optimality, but they do not prove it. it's a good idea to
  get in the habit of convincing yourselves of (or disproving)
  assertions. Here's a proof, under the assumption that the cache is
  always full:

    Choose any other scheme. Call it ALT. Now let's sum the number of
    misses under ALT or OPT, and induct over the number of references.
    Four cases at any given reference: {OPT hits, ALT hits}, {OPT
    hits, ALT misses}, {OPT misses, ALT misses}, {OPT misses, ALT
    hits}.
    The only interesting case is the last one (in the other cases, OPT
    does as well as or better than ALT, so OPT keeps pace with, or
    beats, the competition at every reference). Say that the last case
    happens at a reference, r. By the induction hypothesis, OPT was
    optimal right up until the *last* miss OPT experienced, at
    reference, say, r-a. After that reference, OPT has had only one
    miss (the current one, at r). ALT couldn't have done better than
    OPT up until r-a (by the induction hypothesis), and since r-a, OPT
    has had only one miss. But ALT could not have had 0 misses between
    r-a and now, because if it did, it would mean that OPT replaced
    the wrong entry at r-a (another way to say the same thing: OPT
    chose which page to evict so that a is maximal). Thus, OPT is no
    worse than ALT at r. In the remaining cases, OPT is as good as or
    better than ALT in terms of contributing to the number of misses.
    So, by induction, OPT is optimal.

--evaluating these algorithms

    input
        --reference string: sequence of page accesses
        --cache (e.g., physical memory) size
    output
        --number of cache evictions (e.g., number of swaps)

--examples......

    --time goes left to right
    --cache hit = h

    ------------------------------------
    FIFO
    phys_slot    A  B  C  A  B  D  A  D  B  C  B
    S1           A        h     D     h     C
    S2              B        h     A
    S3                 C                 B     h

    7 swaps, 4 hits

    ------------------------------------
    OPTIMAL
    phys_slot    A  B  C  A  B  D  A  D  B  C  B
    S1           A        h        h        C
    S2              B        h           h     h
    S3                 C        D     h

    5 swaps, 6 hits

    ------------------------------------

* LRU: throw out the least recently used page. (this is often a good
  idea, but it depends on the future looking like the past. what if we
  chuck a page from our cache and then were about to use it?)

    LRU
    phys_slot    A  B  C  A  B  D  A  D  B  C  B
    S1           A        h        h        C
    S2              B        h           h     h
    S3                 C        D     h

    5 swaps, 6 hits

    --LRU looks awesome!
    --but what if our reference string were ABCDABCDABCD?

    LRU
    phys_slot    A  B  C  D  A  B  C  D  A  B  C  D
    S1           A           D           C           B
    S2              B           A           D           C
    S3                 C           B           A           D

    12 swaps, 0 hits. BUMMER.

    --the same thing happens with FIFO.
    --what about OPT? [not as much of a bummer at all.]
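The examples above are easy to check mechanically. Below is a small simulator (our sketch, not part of the original notes; the function name `count_misses` is invented) that reproduces the swap counts for FIFO, OPT, and LRU:

```python
# Count misses ("swaps") for FIFO, LRU, and OPT on a reference string.
# cache is a list; for FIFO it is kept in insertion order and for LRU
# in recency order, so the victim is always cache[0] for those two.

def count_misses(refs, nframes, policy):
    cache, nmiss = [], 0
    for i, page in enumerate(refs):
        if page in cache:
            if policy == "LRU":         # refresh recency on a hit
                cache.remove(page)
                cache.append(page)
            continue
        nmiss += 1
        if len(cache) == nframes:       # cache full: pick a victim
            if policy in ("FIFO", "LRU"):
                victim = cache[0]       # oldest / least recently used
            else:                       # OPT: evict the page whose
                fut = refs[i+1:]        # next use is farthest away
                victim = max(cache, key=lambda p:
                             fut.index(p) if p in fut else len(fut))
            cache.remove(victim)
        cache.append(page)
    return nmiss

refs = list("ABCABDADBCB")
print(count_misses(refs, 3, "FIFO"))  # -> 7  (4 hits)
print(count_misses(refs, 3, "OPT"))   # -> 5  (6 hits)
print(count_misses(refs, 3, "LRU"))   # -> 5  (6 hits)
print(count_misses(list("ABCDABCDABCD"), 3, "LRU"))  # -> 12 (0 hits)
```

The same function also reproduces the FIFO counts in the Belady's anomaly example that follows (9 misses with 3 slots, 10 with 4).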
--other weirdness: Belady's anomaly: what happens if you add memory
  under a FIFO policy?

    FIFO
    phys_slot    A  B  C  D  A  B  E  A  B  C  D  E
    S1           A        D        E              h
    S2              B        A        h     C
    S3                 C        B        h     D

    9 swaps, 3 hits. not great. let's add some slots. maybe we can do
    better:

    FIFO
    phys_slot    A  B  C  D  A  B  E  A  B  C  D  E
    S1           A           h     E           D
    S2              B           h     A           E
    S3                 C                 B
    S4                    D                 C

    10 swaps, 2 hits. this is worse.

--do these anomalies always happen?

    --answer: no. with a policy like LRU, the contents of memory with
      X pages are a subset of the contents with X+1 pages, so adding
      memory can never add misses

--all things considered, LRU is pretty good. let's try to implement
  it......

--implementing LRU

    --reasonable to do in application programs like Web servers that
      cache pages (or in dedicated Web caches). [use a queue to track
      the least recently accessed entry and a hash map to implement
      the (k,v) lookup]

    --in the OS, LRU itself does not sound great: it would double
      memory traffic (after every reference, we'd have to move some
      structure to the head of some list)

    --and in hardware, it's way too much work to timestamp each
      reference and keep the list ordered (remember that the TLB may
      also be implementing these solutions)

    --how can we approximate LRU?

--another algorithm:

* CLOCK

    --arrange the physical page frames in a circle. a hand sweeps
      around, clearing a use bit; the bit is set when the page is
      accessed. evict a page if the hand points to it when its bit is
      clear.

    --approximates LRU ... because we're evicting pages that haven't
      been used in a while ... though of course we may not be evicting
      the *least* recently used one (why not?)

--can generalize CLOCK:

* NTH CHANCE

    --don't throw a page out until the hand has swept by N times

    --the OS keeps a counter per page: # sweeps

    --on a page fault, the OS looks at the page pointed to by the
      hand, and checks that page's use bit:
        1 --> clear the use bit and clear the counter
        0 --> increment the counter
              if counter < N, keep going
              if counter = N, replace the page: it hasn't been used in
              a while

    --How to pick N?
        Large N --> better approximation to LRU
        Small N --> more efficient.
              (otherwise we go around the circle a lot: we might need
              to keep sweeping until some page's counter reaches N)

    --modification:
        --dirty pages are more expensive to evict (why?)
        --so give dirty pages an extra chance before replacing them

        common approach (supposedly on Solaris, but I don't know):
            --clean pages use N = 1
            --dirty pages use N = 2 (but initiate writeback when N=1,
              i.e., try to get the page clean at N=1)

--Summary:

    --the optimal policy is known as OPT or MIN (our textbook asserts
      but doesn't prove its optimality)
    --LRU is usually a good approximation to optimal
    --implementing LRU in hardware or at the OS/hardware interface is
      a pain
    --so implement CLOCK or NTH CHANCE ... decent approximations to
      LRU, which is in turn a good approximation to OPT *assuming that
      the past is a good predictor of the future* (this assumption
      does not always hold!)

Miscellaneous implementation points

Note that many machines, the x86 included, maintain 4 bits per page
table entry:

    --*use*: set when the page is referenced; cleared by an algorithm
      like CLOCK (the bit is called "Accessed" on the x86)

    --*modified*: set when the page is modified; cleared when the page
      is written to disk (the bit is called "Dirty" on the x86)

    --*valid*: the program can reference this page without getting a
      page fault. Set if the page is in memory? [no. it is "only if",
      not "if". *valid*=1 implies the page is in physical memory, but
      a page's being in physical memory does not imply *valid*=1; in
      other words, *valid*=0 does not imply that the page is not in
      physical memory.]

    --*read-only*: the program can read the page, but not modify it.
      Set if the page is truly read-only? [no. a similar case to the
      above, but slightly confusing because the bit is called
      "writable". if a page's bits are such that it appears to be
      read-only, that page may or may not be truly "read only".
      meanwhile, if a page is truly read-only, it had better have its
      bits set to be read-only.]

Do we actually need the Use and Modified bits in the page tables to be
set by the hardware?

    --[again, the x86 calls these the Accessed and Dirty bits]
    --answer: no.
    --how could we simulate them?

    --the OS maintains the bits itself, in unused page table entry
      bits or in a parallel data structure (it doesn't matter which,
      since we're just talking about how the OS can "create" these
      bits abstractly)

    --for the Modified [x86: Dirty] bit, just mark all pages
      read-only. Then, if a write happens, the OS gets a page fault
      and can set the bit itself. The OS should then mark the page
      writable so that this page fault doesn't happen again

    --for the Use [x86: Accessed] bit, just mark all pages as not
      present (even if they are present). Then, if a reference
      happens, the OS gets a page fault and can set the bit, after
      which the OS should mark the page present (i.e., set the PRESENT
      bit)

Fairness

    --if the OS needs to swap a page out, does it consider all pages
      in one pool or only those of the process that caused the page
      fault?

    --what is the trade-off between local and global policies?

        --global: more flexible but less fair
        --local: less flexible but fairer

5. Thrashing

[The points below apply to any caching system, but for the sake of
concreteness, let's assume that we're talking about page replacement
in particular.]

What is thrashing?

    Processes require more memory than the system has. Specifically,
    each time a page is brought in, another page, whose contents will
    soon be referenced, is thrown out.

    Example:

    --one program touches 50 pages (each equally likely); we have only
      40 physical page frames

    --If we have enough physical pages: 100 ns/ref

    --If we have too few physical pages, assume every 5th reference
      leads to a page fault:

        --4 refs x 100 ns, plus 1 page fault x 10 ms for disk I/O
        --this gets us 5 refs per (10 ms + 400 ns), i.e., roughly
          2 ms/ref: a 20,000x slowdown!!!
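The arithmetic above (and the AMAT threshold from section 3) can be checked with a few lines of Python; this is just our back-of-the-envelope check, not part of the original notes:

```python
# Thrashing arithmetic: every 5th reference faults to disk.
MEM_NS  = 100          # one memory reference: 100 ns
DISK_NS = 10_000_000   # one disk access: 10 ms = 10^7 ns

# 5 references cost 4 memory accesses plus one page fault's disk I/O.
per_ref_ns = (4 * MEM_NS + DISK_NS) / 5
slowdown = per_ref_ns / MEM_NS
print(per_ref_ns / 1e6)   # roughly 2 (ms per reference)
print(slowdown)           # roughly 20,000x

# Section 3's threshold: paging hurts by less than 10% only if the
# fault probability p satisfies 1.1*t_M = (1-p)*t_M + p*t_D.
p = 0.1 * MEM_NS / (DISK_NS - MEM_NS)
print(p)                  # roughly 10^-6
```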
    --What we wanted: virtual memory the size of the disk, with access
      time at the speed of physical memory

    --What we have here: memory with access time roughly that of the
      disk (2 ms/mem_ref compared to 10 ms/disk_access)

    As stated earlier, this concept is much larger than OSes: need to
    pay attention to the slow case if it's really slow and common
    enough to matter.

Reasons/cases:

    --process doesn't reuse memory (i.e., it has no temporal locality)
    --process reuses memory, but the memory that is absorbing most of
      the accesses doesn't fit
    --individually, all processes fit, but together they are too much
      for the system

what do we do?

    --well, in the first two cases above, there's nothing you can do,
      other than restructuring your computation or buying more memory
      (e.g., expensive hardware that keeps an entire customer database
      in RAM)

    --in the third case, the system can and must shed load. how? two
      approaches:

        a. working set
        b. page fault frequency

    a. working set

        --only run a set of processes s.t. the union of their working
          sets fits in memory
        --definition of working set (short version): the pages a
          process has touched over some trailing window of time

    b. page fault frequency

        --track the metric (# page faults / # instructions executed)
        --if that metric rises above a threshold, and there is not
          enough memory on the system, swap out the process

[Acknowledgments: David Mazieres, Mike Dahlin]
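To make the two load-shedding metrics concrete, here is a toy sketch (ours; the function names are invented): the working set is the set of distinct pages touched in a trailing window of the trace, and page fault frequency is faults per reference under some concrete policy (LRU here):

```python
# Working set: the distinct pages touched in a trailing window of the
# reference trace.  Page fault frequency: fraction of references that
# fault under a given number of frames (LRU used here just to have a
# concrete policy).

def working_set(trace, t, window):
    """Pages touched in the `window` references up to and including t."""
    return set(trace[max(0, t - window + 1) : t + 1])

def fault_frequency(trace, nframes):
    """Fraction of references that miss under LRU with nframes frames."""
    cache, faults = [], 0
    for page in trace:
        if page in cache:
            cache.remove(page)        # refresh recency
        else:
            faults += 1
            if len(cache) == nframes:
                cache.pop(0)          # evict least recently used
        cache.append(page)
    return faults / len(trace)

trace = list("AABBCCDDAABB")
print(working_set(trace, 5, 4))       # pages touched in refs 2..5
print(fault_frequency(trace, 3))      # -> 0.5
```

A real system would track these incrementally (e.g., by sampling use bits) rather than from a full trace, but the definitions are the same.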