Class 16 CS 439 7 March 2013

On the board
------------
1. Last time
2. Finish page replacement policies
3. Miscellaneous points about replacement
4. Heap memory management

---------------------------------------------------------------------------

1. Last time

--uses of page faults, other page structures, page replacement policies
--interesting use of page faults: DSM

2. Page replacement policies, contd.

--implementing LRU

    --reasonable to do in application programs like Web servers that
      cache pages (or dedicated Web caches). [use a queue to track the
      least recently accessed item, and use a hash map to implement the
      (k,v) lookup]

    --in the OS, LRU itself does not sound great: it would double
      memory traffic (after every reference, we'd have to move some
      structure to the head of some list)

    --and in hardware, it's way too much work to timestamp each
      reference and keep the list ordered (remember that the TLB may
      also be implementing these solutions)

    --so how can we approximate LRU?

--another algorithm:

* CLOCK

    --arrange the slots in a circle. a hand sweeps around, clearing a
      bit; the bit is set when the page is accessed. evict a page if
      the hand points to it when the bit is clear.

    --approximates LRU ... because we're evicting pages that haven't
      been used in a while ... though of course we may not be evicting
      the *least* recently used one (why not?)

--can generalize this:

* NTH CHANCE

    --don't throw a page out until the hand has swept by N times.

    --OS keeps a counter per page: # sweeps

    --on a page fault, the OS looks at the page pointed to by the hand
      and checks that page's use bit:

        1 --> clear use bit and clear counter
        0 --> increment counter
              if counter < N, keep going
              if counter = N, replace the page: it hasn't been used in
              a while

    --how to pick N?

        Large N --> better approximation to LRU
        Small N --> more efficient. otherwise we go around the circle a
                    lot (might need to keep going around and around
                    until some page's counter reaches N)

    --modification:

        --dirty pages are more expensive to evict (why?)
        --so give dirty pages an extra chance before replacing

          common approach (supposedly on Solaris, but I don't know):
            --clean pages use N = 1
            --dirty pages use N = 2 (but initiate writeback when N = 1,
              i.e., try to get the page clean at N = 1)

--our summary:

    --the optimal policy is known as OPT or MIN (the textbook asserts
      but doesn't prove its optimality)
    --LRU is usually a good approximation to optimal
    --implementing LRU in hardware or at the OS/hardware interface is a
      pain
    --so implement CLOCK or NTH CHANCE ... decent approximations to
      LRU, which is in turn a good approximation to OPT *assuming that
      the past is a good predictor of the future*

--note that caching doesn't always save the day: there may simply be
  too much demand on memory
    --so what do we do? see below ...

3. Miscellaneous points about replacement

These miscellaneous points apply to any caching system, but for the
sake of concreteness, and to illustrate some important points, let's
assume that we're talking about page replacement in particular.

A. Implementation points

Note that many machines, x86 included, maintain 4 bits per page table
entry:

    --*use*: set when the page is referenced; cleared by an algorithm
      like CLOCK (the bit is called "Accessed" on x86)

    --*modified*: set when the page is modified; cleared when the page
      is written to disk (the bit is called "Dirty" on x86)

    --*valid*: the program can reference this page without getting a
      page fault. Set if the page is in memory? [no. it is "only if",
      not "if". *valid*=1 implies the page is in physical memory, but
      a page's being in physical memory does not imply *valid*=1; in
      other words, *valid*=0 does not imply that the page is not in
      physical memory.]

    --*read-only*: the program can read the page but not modify it. Set
      if the page is truly read-only? [no. similar to the case above,
      but slightly confusing because the bit is called "writable". if a
      page's bits make it appear read-only, that may or may not be
      because it is truly read-only. but if a page is truly read-only,
      it had better have its bits set to read-only.]
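The Nth-chance sweep above can be sketched in C. This is a minimal user-space sketch under simplifying assumptions: the `frames` array, `use` flags, and `NFRAMES` stand in for real page-table state that the MMU would maintain (on x86, the use bit is the hardware-set Accessed bit).

```c
#include <assert.h>

#define NFRAMES 4

/* Per-frame replacement state; in a real kernel the use bit lives in
   the PTE and is set by hardware on every reference to the page. */
struct frame {
    int use;     /* set on reference, cleared as the hand sweeps past */
    int count;   /* # of sweeps since the page was last referenced    */
};

static struct frame frames[NFRAMES];
static int hand = 0;

/* Advance the hand until some frame has gone N sweeps unreferenced,
   clearing use bits (and counters) along the way; return the victim. */
int nth_chance_evict(int N)
{
    for (;;) {
        int i = hand;
        hand = (hand + 1) % NFRAMES;
        struct frame *f = &frames[i];
        if (f->use) {
            f->use = 0;              /* referenced recently: another chance */
            f->count = 0;
        } else if (++f->count >= N) {
            f->count = 0;            /* unreferenced for N sweeps: evict */
            return i;
        }
    }
}
```

With N = 1 this degenerates to plain CLOCK; larger N tracks LRU more closely at the cost of more sweeping.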
Do we actually need the Modified and Referenced bits in the page tables
to be set by the hardware?

    --[again, x86 calls these the Dirty and Accessed bits]

    --answer: no.

    --how could we simulate them?

        --for the Modified [x86: Dirty] bit, just mark all pages
          read-only. Then if a write happens, the OS gets a page fault
          and can set the bit itself. The OS should then mark the page
          writable so that this page fault doesn't happen again.

        --for the Use [x86: Accessed] bit, just mark all pages as not
          present (even if they are present). Then if a reference
          happens, the OS gets a page fault and can set the bit, after
          which the OS should mark the page present (i.e., set the
          PRESENT bit).

B. What if caching doesn't work?

reasons:

    --the process doesn't reuse memory
    --the process reuses memory, but it doesn't fit
    --individually, all processes fit, but together they are too much
      for the system

what do we do?

    --in the first two cases, there's nothing you can do, other than
      restructuring your computation or buying more memory (e.g.,
      expensive hardware that keeps an entire customer database in RAM)

    --in the third case, we can and must shed load. how? two
      approaches:

        a. working set
        b. page fault frequency

    a. working set
        --only run a set of processes such that the union of their
          working sets fits in memory
        --the book defines the working set. short version: the pages a
          process has touched over some trailing window of time

    b. page fault frequency
        --track the metric (# page faults / # instructions executed)
        --if that metric rises above a threshold, and there is not
          enough memory on the system, swap out the process

C. Fairness

    --if the OS needs to swap a page out, does it consider all pages in
      one pool, or only those of the process that caused the page
      fault?

    --what is the trade-off between local and global policies?

        --global: more flexible but less fair
        --local: less flexible but fairer

4. Heap memory management

A. Intro

big picture:

    process's address space: [bunch of regions, including the heap]

we are going to talk about the management of the heap.
this is mostly an application-level consideration; the kernel is not
involved. how exactly is the heap managed/used/etc.? this is the
subject of this section of the class.

dynamic memory allocation is required for useful programs (without it,
the programmer would have to statically specify how much memory they
needed; what happens when the input size changes, and hence the program
does something different?)

it can affect performance

unfortunately, there's no perfect allocator

when is memory allocation/freeing invoked?

    --automatically (garbage-collected languages, for example Java and
      Lisp). also known as automatic memory management

    --when the programmer calls malloc() and free(). also known as
      explicit memory management.
        --we focus on this one

B. Challenges

    --satisfy an arbitrary sequence of alloc()/free().
        --NOTE: if you didn't have free(), this would be easy:

            [        [free mem]        ]
                     |
                     curr. free pos.

    --free() creates holes ("fragmentation"). The result is lots of
      free space, but some requests cannot be satisfied:

            [  |  |  |  |  |  |  |  ]

    --more abstractly, here's the game board:

        --the allocator has a list of free regions (the book describes
          how this list is implemented: basically pointers inside the
          free blocks, which induce a linked-list structure)

        --the allocator can decide which block to use to satisfy an
          allocation request. the allocator's goal is to avoid wasting
          space and to have very little overhead

        --the allocator cannot:
            --control the number and size of requested blocks
            --move allocated regions (bad placement decisions are
              permanent)

    --the core fight is to minimize fragmentation

        --fragmentation requires two things:

            --different-sized requests

                [ X |  | X |   | X |    | X | ... |     ]

            --different lifetimes

                [   |XXXXXXXXXXXXXXXXXX ]

          (if all requested blocks are the same size, and that size is
          known, OR all objects are freed/allocated together, then
          fragmentation is not a concern.)

C. Choices by the allocator

    --placement: where in free memory does a requested block go?
      ideally, it can be put where it won't cause fragmentation later.

    --split free blocks to satisfy smaller requests? (design decision:
      which blocks to split?)

    --coalesce free blocks to yield larger blocks? (design decision:
      when to do this?)

theoretical result: for any possible allocation algorithm, there are
adversarial request patterns that cause that allocator's decisions to
result in severe fragmentation.

D. Pathological examples

    --given an allocation of 7 20-byte chunks, what's a bad stream of
      free()s followed by allocations?

        [ 20 | 20 | 20 | 20 | .... | 20 ]

      (free every other chunk, then alloc 21 bytes)

    --given a 128-byte limit on malloc()ed space, what's a bad
      combination of mallocs and frees?

        Malloc 128 1-byte chunks, free every other one
        Malloc 32 2-byte chunks, free every other 1- and 2-byte chunk

E. Best fit, first fit

The heap is a list of free blocks. Each block has a header that holds
the block size and a pointer to the next block. (See B&O Fig. 9.48)

1. Best fit: allocate space from the block that is closest in size to
   the request. The ideal is an exact match. (During free(), coalesce.)

   Problem: sawdust: we're left with little bits everywhere. (This
   doesn't seem to be a problem in practice.)

   Pathological example: alloc 19, 21, 19, 21, 19, .....

        [19 | 21 | 19 | 21 | 19 | ... ]

   free 19, 19, 19; now alloc 20? fails! lots of wasted space. (we can
   ask the OS for more memory, but still.)

   [in lecture, we used this as a pathological case for BF, but it is
   also a pathological case for FF (below).]

2.
First fit: pick the first block that fits.

   put free blocks on the list in LIFO order:
        simple
        results in higher fragmentation (sometimes wastes space):
            intermix 2n-byte allocations that are short-lived with
            (n+1)-byte allocations that are long-lived. when a 2n-byte
            object is freed, a chunk of size (n+1) will be taken from
            it, leaving a useless fragment (of size n-1)

   could also sort the free block list in address order (which makes
   coalescing easy):
        blocks at the front are preferentially split; ones at the back
        are split only when no larger block is found before them.
        this roughly sorts the free list by size,
        which makes first fit operationally similar to best fit. The
        insight is that the first fit on a sorted list *is* best fit.
        issue: large requests skip over small blocks, so some kind of
        data structure is required

3. FF vs. BF

   assume a free block list of [20, 15]

   if the alloc() requests are 10 then 20, best fit wins (BF puts the
   10 in the 15-byte block, leaving the 20 intact; FF splits the 20 and
   then can't satisfy the 20)

   what about FF? what if the requests are 8, 12, 12? then first fit
   wins.

4. supposedly FF and BF perform similarly in practice, perhaps because
   over time the list composition is similar in both: small free blocks
   at the beginning, and large open spaces at the end

F. App request patterns

   ramp pattern: the app asks for memory and never gives it back.
   (then we're not really evaluating fragmentation)

   peaks: allocate a bunch of objects, use them briefly, return them
   all.
        --this can be exploited with arena allocation:
            --allocate a big area. don't manage it. just keep
              incrementing the "beginning of the free block" pointer.
            --at the end, return the whole arena

G. sbrk

   Use sbrk() to ask the OS to expand the size of the heap. It's not a
   good idea to use this for large allocations, since it's hard to use
   sbrk to give the pages _back_. Instead, use mmap() in MAP_ANON mode.

[thanks to David Mazieres and Alison Norman for portions of this content]