Class 16 CS 439 7 March 2013

On the board
------------
1. Last time
2. Finish page replacement policies
3. Miscellaneous points about replacement
4. Heap memory management

---------------------------------------------------------------------------

1. Last time

--uses of page faults, other page structures, page replacement policies
--interesting use of page faults: DSM

2. Page replacement policies, contd.

--implementing LRU

    --reasonable to do in application programs like Web servers that
      cache pages (or dedicated Web caches). [use a queue to track the
      least recently accessed item, and use a hash map to implement the
      (k,v) lookup]

    --in the OS, LRU itself does not sound great: it would double
      memory traffic (after every reference, we'd have to move some
      structure to the head of some list)

    --and in hardware, it's way too much work to timestamp each
      reference and keep the list ordered (remember that the TLB may
      also be implementing these solutions)

    --so how can we approximate LRU?

--another algorithm:

* CLOCK

    --arrange the slots in a circle. a hand sweeps around, clearing a
      bit; the bit is set when the page is accessed. evict a page if
      the hand points to it when the bit is clear.

    --approximates LRU ... because we're evicting pages that haven't
      been used in a while ... though of course we may not be evicting
      the *least* recently used one (why not?)

--can generalize this:

* NTH CHANCE

    --don't throw a page out until the hand has swept by N times.

    --OS keeps a counter per page: # sweeps

    --on a page fault, the OS looks at the page pointed to by the hand
      and checks that page's use bit:

        1 --> clear use bit and clear counter
        0 --> increment counter
              if counter < N, keep going
              if counter = N, replace the page: it hasn't been used in
              a while

    --how to pick N?

        Large N --> better approximation to LRU
        Small N --> more efficient. otherwise we go around the circle a
                    lot (might need to keep going around and around
                    until some page's counter reaches N)

    --modification:

        --dirty pages are more expensive to evict (why?)
        --so give dirty pages an extra chance before replacing

          common approach (supposedly on Solaris, but I don't know):
            --clean pages use N = 1
            --dirty pages use N = 2 (but initiate writeback when N = 1,
              i.e., try to get the page clean at N = 1)

--our summary:

    --the optimal policy is known as OPT or MIN (the textbook asserts
      but doesn't prove its optimality)
    --LRU is usually a good approximation to optimal
    --implementing LRU in hardware or at the OS/hardware interface is a
      pain
    --so implement CLOCK or NTH CHANCE ... decent approximations to
      LRU, which is in turn a good approximation to OPT *assuming that
      the past is a good predictor of the future*

--note that caching doesn't always save the day: there may simply be
  too much demand on memory
    --so what do we do? see below ...

3. Miscellaneous points about replacement

These miscellaneous points apply to any caching system, but for the
sake of concreteness, and to illustrate some important points, let's
assume that we're talking about page replacement in particular.

A. Implementation points

Note that many machines, x86 included, maintain 4 bits per page table
entry:

    --*use*: set when the page is referenced; cleared by an algorithm
      like CLOCK (the bit is called "Accessed" on x86)

    --*modified*: set when the page is modified; cleared when the page
      is written to disk (the bit is called "Dirty" on x86)

    --*valid*: the program can reference this page without getting a
      page fault. Set if the page is in memory? [no. it is "only if",
      not "if". *valid*=1 implies the page is in physical memory, but
      a page's being in physical memory does not imply *valid*=1; in
      other words, *valid*=0 does not imply that the page is not in
      physical memory.]

    --*read-only*: the program can read the page but not modify it. Set
      if the page is truly read-only? [no. similar to the case above,
      but slightly confusing because the bit is called "writable". if a
      page's bits make it appear read-only, that may or may not be
      because it is truly read-only. but if a page is truly read-only,
      it had better have its bits set to read-only.]
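The Nth-chance sweep above can be sketched in C. This is a minimal user-space sketch under simplifying assumptions: the `frames` array, `use` flags, and `NFRAMES` stand in for real page-table state that the MMU would maintain (on x86, the use bit is the hardware-set Accessed bit).

```c
#include <assert.h>

#define NFRAMES 4

/* Per-frame replacement state; in a real kernel the use bit lives in
   the PTE and is set by hardware on every reference to the page. */
struct frame {
    int use;     /* set on reference, cleared as the hand sweeps past */
    int count;   /* # of sweeps since the page was last referenced    */
};

static struct frame frames[NFRAMES];
static int hand = 0;

/* Advance the hand until some frame has gone N sweeps unreferenced,
   clearing use bits (and counters) along the way; return the victim. */
int nth_chance_evict(int N)
{
    for (;;) {
        int i = hand;
        hand = (hand + 1) % NFRAMES;
        struct frame *f = &frames[i];
        if (f->use) {
            f->use = 0;              /* referenced recently: another chance */
            f->count = 0;
        } else if (++f->count >= N) {
            f->count = 0;            /* unreferenced for N sweeps: evict */
            return i;
        }
    }
}
```

With N = 1 this degenerates to plain CLOCK; larger N tracks LRU more closely at the cost of more sweeping.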
Do we actually need the Modified and Referenced bits in the page tables
to be set by the hardware?

    --[again, x86 calls these the Dirty and Accessed bits]

    --answer: no.

    --how could we simulate them?

        --for the Modified [x86: Dirty] bit, just mark all pages
          read-only. Then if a write happens, the OS gets a page fault
          and can set the bit itself. The OS should then mark the page
          writable so that this page fault doesn't happen again.

        --for the Use [x86: Accessed] bit, just mark all pages as not
          present (even if they are present). Then if a reference
          happens, the OS gets a page fault and can set the bit, after
          which the OS should mark the page present (i.e., set the
          PRESENT bit).

B. What if caching doesn't work?

reasons:

    --the process doesn't reuse memory
    --the process reuses memory, but it doesn't fit
    --individually, all processes fit, but together they are too much
      for the system

what do we do?

    --in the first two cases, there's nothing you can do, other than
      restructuring your computation or buying more memory (e.g.,
      expensive hardware that keeps an entire customer database in RAM)

    --in the third case, we can and must shed load. how? two
      approaches:

        a. working set
        b. page fault frequency

    a. working set
        --only run a set of processes such that the union of their
          working sets fits in memory
        --the book defines the working set. short version: the pages a
          process has touched over some trailing window of time

    b. page fault frequency
        --track the metric (# page faults / # instructions executed)
        --if that metric rises above a threshold, and there is not
          enough memory on the system, swap out the process

C. Fairness

    --if the OS needs to swap a page out, does it consider all pages in
      one pool, or only those of the process that caused the page
      fault?

    --what is the trade-off between local and global policies?

        --global: more flexible but less fair
        --local: less flexible but fairer

4. Heap memory management

A. Intro

big picture:

    process's address space: [bunch of regions, including the heap]

we are going to talk about the management of the heap.
this is mostly an application-level consideration; the kernel is not
involved. how exactly is the heap managed/used/etc.? this is the
subject of this section of the class.

dynamic memory allocation is required for useful programs (without it,
the programmer would have to statically specify how much memory they
needed; what happens when the input size changes, and hence the program
does something different?)

it can affect performance

unfortunately, there's no perfect allocator

when is memory allocation/freeing invoked?

    --automatically (garbage-collected languages, for example Java and
      Lisp). also known as automatic memory management

    --when the programmer calls malloc() and free(). also known as
      explicit memory management.
        --we focus on this one

B. Challenges

    --satisfy an arbitrary sequence of alloc()/free().
        --NOTE: if you didn't have free(), this would be easy:

            [        [free mem]        ]
                     |
                     curr. free pos.

    --free() creates holes ("fragmentation"). The result is lots of
      free space, but some requests cannot be satisfied:

            [  |  |  |  |  |  |  |  ]

    --more abstractly, here's the game board:

        --the allocator has a list of free regions (the book describes
          how this list is implemented: basically pointers inside the
          free blocks, which induce a linked-list structure)

        --the allocator can decide which block to use to satisfy an
          allocation request. the allocator's goal is to avoid wasting
          space and to have very little overhead

        --the allocator cannot:
            --control the number and size of requested blocks
            --move allocated regions (bad placement decisions are
              permanent)

    --the core fight is to minimize fragmentation

        --fragmentation requires two things:

            --different-sized requests

                [ X |  | X |   | X |    | X | ... |     ]

            --different lifetimes

                [   |XXXXXXXXXXXXXXXXXX ]

          (if all requested blocks are the same size, and that size is
          known, OR all objects are freed/allocated together, then
          fragmentation is not a concern.)

C. Choices by the allocator

    --placement: where in free memory does a requested block go?
      ideally, it can be put where it won't cause fragmentation later.

    --split free blocks to satisfy smaller requests? (design decision:
      which blocks to split?)

    --coalesce free blocks to yield larger blocks? (design decision:
      when to do this?)

theoretical result: for any possible allocation algorithm, there are
adversarial request patterns that cause that allocator's decisions to
result in severe fragmentation.

D. Pathological examples

    --given an allocation of 7 20-byte chunks, what's a bad stream of
      free()s followed by allocations?

        [ 20 | 20 | 20 | 20 | .... | 20 ]

      (free every other chunk, then alloc 21 bytes)

    --given a 128-byte limit on malloc()ed space, what's a bad
      combination of mallocs and frees?

        Malloc 128 1-byte chunks, free every other one
        Malloc 32 2-byte chunks, free every other 1- and 2-byte chunk

E. Best fit, first fit

The heap is a list of free blocks. Each block has a header that holds
the block size and a pointer to the next block. (See B&O Fig. 9.48)

1. Best fit: allocate space from the block that is closest in size to
   the request. The ideal is an exact match. (During free(), coalesce.)

   Problem: sawdust: we're left with little bits everywhere. (This
   doesn't seem to be a problem in practice.)

   Pathological example: alloc 19, 21, 19, 21, 19, .....

        [19 | 21 | 19 | 21 | 19 | ... ]

   free 19, 19, 19; now alloc 20? fails! lots of wasted space. (we can
   ask the OS for more memory, but still.)

   [in lecture, we used this as a pathological case for BF, but it is
   also a pathological case for FF (below).]

2.
First fit: pick the first block that fits.

   put free blocks on the list in LIFO order:
        simple
        results in higher fragmentation (sometimes wastes space):
            intermix 2n-byte allocations that are short-lived with
            (n+1)-byte allocations that are long-lived. when a 2n-byte
            object is freed, a chunk of size (n+1) will be taken from
            it, leaving a useless fragment (of size n-1)

   could also sort the free block list in address order (which makes
   coalescing easy):
        blocks at the front are preferentially split; ones at the back
        are split only when no larger block is found before them.
        this roughly sorts the free list by size,
        which makes first fit operationally similar to best fit. The
        insight is that the first fit on a sorted list *is* best fit.
        issue: large requests skip over small blocks, so some kind of
        data structure is required

3. FF vs. BF

   assume a free block list of [20, 15]

   if the alloc() requests are 10 then 20, best fit wins (BF puts the
   10 in the 15-byte block, leaving the 20 intact; FF splits the 20 and
   then can't satisfy the 20)

   what about FF? what if the requests are 8, 12, 12? then first fit
   wins.

4. supposedly FF and BF perform similarly in practice, perhaps because
   over time the list composition is similar in both: small free blocks
   at the beginning, and large open spaces at the end

F. App request patterns

   ramp pattern: the app asks for memory and never gives it back.
   (then we're not really evaluating fragmentation)

   peaks: allocate a bunch of objects, use them briefly, return them
   all.
        --this can be exploited with arena allocation:
            --allocate a big area. don't manage it. just keep
              incrementing the "beginning of the free block" pointer.
            --at the end, return the whole arena

G. sbrk

   Use sbrk() to ask the OS to expand the size of the heap. It's not a
   good idea to use this for large allocations, since it's hard to use
   sbrk to give the pages _back_. Instead, use mmap() in MAP_ANON mode.

[thanks to David Mazieres and Alison Norman for portions of this content]