Class 12
CS 439
21 February 2013

On the board
------------

1. Last time
2. scheduling
    --finish up scheduling disciplines
    --lessons and conclusions
3. virtual memory intro
4. segmentation
5. paging

---------------------------------------------------------------------------

1. Last time

2. Scheduling disciplines, continued

    E. multilevel feedback queues

        [first used in CTSS; also used in FreeBSD. Linux up until 2.6.23
        did something roughly similar.]

        two ideas:

        --*multiple queues, each with different priority*. OS does RR at
        each non-empty priority level before moving on to the next
        priority level. 32 levels, for example.

        --feedback: a process's priority changes, for example based on
        how little or how much it has used the CPU

        the result is to favor interactive jobs that use less CPU,
        but without starving the other jobs

        a process's priority might be set like this:

            --decreases whenever a timer interrupt finds the process
            running
            --increases while the process is runnable but not running

        advantages:
            --approximates SRTCF

        disadvantages:
            --gameable: user puts in meaningless I/O to keep the job's
            priority high
            --can't donate priority
            --not very flexible
            --not good for real-time and multimedia

    F. Real time

        --examples: cars, video playback, robots on assembly lines

        --Soft real time: miss a deadline and the CD will sound funny
        --Hard real time: miss a deadline and the plane will crash

        --Many strategies. Long literature here. Basically, as long as

            \sum_i (CPU_needed_i / period_i) <= 1,

          then earliest-deadline-first works.

3. Scheduling lessons and conclusions

    --Scheduling comes up all over the place

        --m requests share n resources
        --disk arm: which read/write request to do next?
        --memory: which process to take a physical page from?

    --This topic was popular in the days of time sharing, when there was
    a shortage of resources all around, but many scheduling problems
    become not very interesting when you can just buy a faster CPU or a
    faster network.

        --Exception 1: web sites and large-scale networks often cannot
        be made fast enough to handle peak demand (flash crowds,
        attacks), so scheduling becomes important again. For example,
        one may want to prioritize paying customers, or address
        denial-of-service attacks.

        --Exception 2: some scheduling decisions have non-linear effects
        on overall system behavior, not just different performance for
        different users. For example, the livelock scenario, which we
        are discussing.

        --Exception 3: real-time systems:

            soft real time: miss a deadline and the CD or MPEG decode
            will skip
            hard real time: miss a deadline and the plane will crash

          Plus, at some level, every system with a human at the other
          end is a real-time system. If a Web server delays too long,
          the user gives up. So the ultimate effect of the system may in
          fact depend on scheduling!

    --In principle, scheduling decisions shouldn't affect a program's
    results

        --This is good because it's rare to be able to calculate the
        best schedule

        --So instead, we build the kernel so that it's correct to do a
        context switch and restore at any time, and then *any* schedule
        will get the right answer for the program

        --This is a case of a concept that comes up a fair bit in
        computer systems: the policy/mechanism split. In this case, the
        idea is that the *mechanism* allows the OS to switch at any time
        while the *policy* determines when to switch in order to meet
        whatever goals are desired by the scheduling designer. (A
        minimal sketch of this split in code follows.)
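        [extra sketch, not from lecture: a minimal illustration of the
        policy/mechanism split in a scheduler. All names here
        (pick_next, context_switch, ptable) are made up for
        illustration; no real kernel is this simple. The point is that
        the mechanism is correct no matter what the policy decides.]

            #include <stdio.h>

            struct proc {
                int pid;
                int priority;   /* higher number = higher priority */
                int runnable;
            };

            #define NPROC 4
            struct proc ptable[NPROC] = {
                { 1, 10, 1 }, { 2, 30, 1 }, { 3, 20, 0 }, { 4, 20, 1 }
            };

            /* MECHANISM: save the old process's state, restore the new
               one's. Built so it is *correct* to invoke at any moment;
               stubbed here to just report the decision. */
            void context_switch(struct proc *from, struct proc *to) {
                printf("switch: pid %d -> pid %d\n",
                       from ? from->pid : 0, to->pid);
            }

            /* POLICY: decide who runs next. Swapping in a different
               body (RR, MLFQ, EDF, ...) changes the schedule but never
               the correctness of the programs being scheduled. */
            struct proc *pick_next(void) {
                struct proc *best = 0;
                for (int i = 0; i < NPROC; i++)
                    if (ptable[i].runnable &&
                        (!best || ptable[i].priority > best->priority))
                        best = &ptable[i];
                return best;    /* here: strict priority */
            }

            int main(void) {
                struct proc *current = &ptable[0];
                struct proc *next = pick_next();   /* picks pid 2 */
                if (next && next != current)
                    context_switch(current, next);
                return 0;
            }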
        [[--In my view, the notion of "policy/mechanism split" is way
        overused in computer systems, for two reasons:

            --when someone says they separated policy from mechanism in
            some system, usually what's going on is that they separated
            the hard problem from the easy problem and solved the easy
            problem; or

            --it's simply not the case that the two are separate.
            *every* mechanism encodes a range of possible policies, and
            by choice of mechanism you are usually constraining what
            policies are possible. That point is obvious but tends to be
            overlooked when people advertise that they've "fully
            separated policy from mechanism".]]

    --But there are cases when the schedule *can* affect correctness:

        --multimedia: delay too long, and the result looks or sounds
        wrong
        --Web server: delay too long, and users give up

    --Three "systems" or "engineering" lessons (besides the
    policy/mechanism split):

        (i) Know your goals; write them down

        (ii) Compare against optimal, even if optimal can't be built.

            --It's a useful benchmark. Don't waste your time improving
            something if it's already at 99% of optimal.

            --Provides helpful insight. (For example, we know from the
            fact that SJF is optimal that it's impossible to be optimal
            and fair, so don't spend time looking for an optimal
            algorithm that is also fair.)

        (iii) There are actually many different schedulers in the
        system, and they interact:

            --mutexes, etc. are implicitly making scheduling decisions
            --interrupts: likewise (by invoking handlers)
            --disk: the disk scheduler doesn't know to favor one
            process's I/O above another's
            --network: same thing: how does the network code know which
            process's packets to favor? (it doesn't)

            --example of multiple interacting schedulers: you can
            optimize the CPU's scheduler and still find that it does
            nothing (e.g., if you're getting interrupted 200,000 times
            per second, only the interrupt handler is going to get the
            CPU, so you need to solve that problem before you worry
            about how the main CPU scheduler allocates the CPU to jobs)

        --Basically, the _existence_ of interrupts is bad for scheduling
        (also true in life)

---------------------------------------------------------------------------

admin announcement: lab 4 is challenging. start it now

---------------------------------------------------------------------------

4. Virtual memory

    A. top-most idea:

        --let programs use addresses like 0, 0xc000, whatever.

        --OS arranges for hardware to translate these addresses

        --what piece of hardware does this? (A: the MMU)

        --why doesn't the OS just translate the addresses itself?
        [too slow]

        the idea is to fool programs, but the OS also fools itself! (JOS
        thinks it is running at the top of physical memory [0xf0000000],
        but it is not)

        --draw picture:

            [CPU ---> translation box ---> physical addresses]

        that translation box gives us a bunch of things:

        --protection: processes can't touch each other's memory

            --idea: if you cannot name it, you cannot use it. deep idea.

        --relocation: two instances of program foo are each loaded, and
        each thinks it's using memory addresses like 0, 0x1234,
        whatever, but of course they're not using the same actual memory
        cells

        --sharing: processes share memory under controlled
        circumstances, but that physical memory may show up at very
        different virtual addresses

            --that is, two processes have a different way to refer to
            the same physical memory cells

        (a toy model of the translation box is sketched below)
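        [extra sketch, not from lecture: a toy "translation box" as a
        per-process table mapping virtual pages to physical pages. The
        names are hypothetical, and this is NOT the x86 mechanism (that
        is next); it just makes protection, relocation, and sharing
        concrete.]

            #include <stdio.h>

            #define NPAGES 4
            struct box { unsigned vpage_to_ppage[NPAGES]; };

            unsigned translate(struct box *b, unsigned vaddr) {
                return b->vpage_to_ppage[vaddr / 4096] * 4096
                       + vaddr % 4096;
            }

            int main(void) {
                /* relocation: both processes use address 0x0, but they
                   land on different physical cells */
                struct box p1 = {{ 7, 8, 9, 10 }};
                struct box p2 = {{ 3, 4, 5, 6 }};
                printf("p1: 0x0 -> 0x%x\n", translate(&p1, 0x0)); /* 0x7000 */
                printf("p2: 0x0 -> 0x%x\n", translate(&p2, 0x0)); /* 0x3000 */

                /* sharing: different virtual addresses, same physical
                   page */
                p2.vpage_to_ppage[2] = 7;
                printf("p2: 0x2000 -> 0x%x\n",
                       translate(&p2, 0x2000));                   /* 0x7000 */

                /* protection: a physical page not in your table cannot
                   be named, so it cannot be used */
                return 0;
            }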
    B. applied to the x86:

        logical [virtual] addresses --> linear addresses --> physical
        addresses

        --logical addresses are also known as virtual addresses

        --physical addresses are what go out on the CPU's address pins

            --do they address RAM?
            --no, they refer to the physical memory map (i.e., hardware
            may do more translation)

        the first translation happens via *segment translation*
        the second translation happens via *page translation*

        segmentation is old-school and these days mostly an annoyance
        (but it cannot be turned off!)

        --however, it comes in handy every now and then for things like
        sandboxing (advanced topic) or thread-local memory

5. Segmentation

    A. segmentation in general

        segmentation means: memory addresses are treated like offsets
        into a contiguous region.

        QUESTION: if segmentation can't be turned off, how do we pretend
        it's not there?

            set its mapping to be the identity function: a base of 0 and
            no limit

    B. segmentation on the x86

        linear address = base + virtual_address

        (virtual_address is the offset here)

        what's the interface to segmentation?

            there are tables: the GDT and the LDT

            the processor is told where these tables live via
            LLDT/LGDT (and reports them via SLDT/SGDT)

            every instruction comes with an implicit *or* explicit
            segment register (the implicit case is the usual one):

                pop %ebx                   ; implicitly uses %ss
                call $0x7000               ; implicitly uses %cs
                movl $0x1234, (%eax)       ; implicitly uses %ds
                movl $0x1234, %gs:(%eax)   ; explicitly uses %gs

            [all references through %eip (such as instruction fetches)
            use %cs for translation.]

            some instructions can take "far addresses":

                ljmp $selector, $offset

            a segment register holds a segment selector

            there are different registers for the stack (%ss), data
            (%ds), code (%cs), string/"extra" operations (%es), and
            other fun stuff (%fs, %gs)

            a selector indexes into the LDT or GDT: it chooses *which*
            table and which *entry* in that table

            the entry determines the base, the limit, **protection**
            (R/W/X, user/kernel, etc.), and the type

            the offset had better be less than the limit

        example #1: say that %ds refers to this descriptor entry:

            base  0x30000000
            limit 0x0f0

            now, when the program does:

                mov 0x50, %eax

            what happens? [0x50 gets translated into 0x30000050]

        example #2: what about if the program does:

                mov 0x100, %eax

            ? [error: 0x100 is not less than the limit.]

        NOTES:

            --the current privilege level (CPL) is in the low 2 bits of
            %cs
            --CPL=0 is the privileged O/S; CPL=3 is user

            --can an app modify the descriptors in the LDT? it's in
            memory... yes it can. useful for certain things, like one
            user-level program sandboxing another.

            --but an app cannot just lower the CPL

            --don't confuse the LDT and GDT with the **IDT** (which
            you'll see in lab 3)

        (a code sketch of the base+limit translation follows)
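        [extra sketch, not from lecture: the base+limit check from the
        examples above, reduced to C. Real x86 descriptors also carry
        type, privilege, and granularity bits, all ignored here.]

            #include <stdio.h>

            struct segdesc { unsigned base, limit; };

            /* linear address = base + offset, provided the offset is
               less than the limit */
            unsigned seg_translate(struct segdesc *d, unsigned offset) {
                if (offset >= d->limit) {
                    printf("fault: offset 0x%x not less than limit 0x%x\n",
                           offset, d->limit);
                    return 0;   /* a real CPU raises a protection fault */
                }
                return d->base + offset;
            }

            int main(void) {
                /* the descriptor from example #1 above */
                struct segdesc ds = { .base = 0x30000000, .limit = 0x0f0 };

                /* example #1: mov 0x50, %eax */
                printf("0x50 -> 0x%x\n",
                       seg_translate(&ds, 0x50));    /* 0x30000050 */

                /* example #2: mov 0x100, %eax */
                seg_translate(&ds, 0x100);           /* faults */
                return 0;
            }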
---------------------------------------------------------------------------

potentially useful reference:

    4KB    = 2^{12} =  0x00001000 = 0x00000fff + 1
    4MB    = 2^{22} =  0x00400000 = 0x003fffff + 1
    256 MB = 2^{28} =  0x10000000 = 0x0fffffff + 1
    4GB    = 2^{32} = 0x100000000 = 0xffffffff + 1
                      (and note 0xffffffff = ~0x00000000)

    (0xef400000 >> 22) = 0x3bd = 957
    (0xef800000 >> 22) = 0x3be = 958
    (0xefc00000 >> 22) = 0x3bf = 959
    (0xf0000000 >> 22) = 0x3c0 = 960

---------------------------------------------------------------------------

[didn't make it to here in class. leaving it in the notes for reference.
will be useful for lab 4. we will cover this on Tuesday.]

6. Paging

    --Basic idea: all of memory (physical and virtual) gets broken up
    into chunks called **PAGES**. those chunks have size = **PAGE SIZE**

    --we will be working almost exclusively with PAGES of
    PAGE SIZE = 4096 B = 4KB = 2^{12} bytes

    --how many pages are there on a 32-bit architecture?

        --2^{32} bytes / (2^{12} bytes/page) = 2^{20} pages

    --it is proper and fitting to talk about pages having **NUMBERS**.

        --page 0:        [0, 4095]
        --page 1:        [4096, 8191]
        --page 2:        [8192, 12287]
        --page 3:        [12288, 16383]
        .....
        --page 2^{20}-1: [2^{32} - 4096, 2^{32} - 1]

    --unfortunately, it is also proper and fitting to talk about _both_
    virtual and physical pages having numbers.

        --sometimes we will try to be clear with terms like:

            vpn (virtual page number)
            ppn (physical page number)

    --why isn't segmentation enough?

        segmentation can be a bummer when a segment grows or shrinks

        paging is much more flexible: instead of mapping a large range
        onto a large range, we are going to independently control the
        mapping for every 4KB.

        [wow! how are we going to do that? seems like a lot of
        information to keep track of, since every virtual page in every
        process can conceivably be backed by *any* physical page.]

        still, segments have uses

            easy to share: just use the same segment registers

7. Paging on the x86

    [for the rest of this course, we will assume that segmentation on
    the x86 is configured to implement the identity mapping.]

    A. page mapping

        --4KB pages and a 4GB address space, so 2^{20} pages

        --the top bits of a VA give the (virtual) page number

        --the bottom bits indicate where in the page the memory
        reference happens; this part is sometimes called the offset

        --QUESTION: if our pages are of size 4KB = 2^{12}, then how many
        bottom bits are we talking about, and how many top bits are used
        for the layer of indirection?

            [answer: the top 20 bits do the indirection. the bottom 12
            bits just say where on the page the access takes place.]

        --conceptual model: there is in the sky a 2^{20}-entry array
        that maps each linear page number to a *physical* page number:

            table[20-bit linear page #] = 20-bit physical page #

        so now all we have to do is create this mapping

        why is this hard? why not just create the mapping?

            --answer: then you need, per process, roughly 4MB
            (2^{20} entries * 32 bits per entry).

        so here's an idea:

            --break the 4MB table up into 4096-byte chunks, and
            reference those chunks in another table.

            --so how many entries does that other table need?
                --1024

            --so how big is that other table?
                --4096 bytes!

            --so basically every data structure here is going to be 4096
            bytes

        here's how it works in the standard configuration on the x86
        (but there are others):

            two-level mapping structure.......

            [refer to handout as we go through this example....]

            --%cr3 holds the (physical) address of the page directory

            --the top 10 bits of the VA select an entry in the page
            directory, which picks a **page table**

            --the next 10 bits select the entry in the page table, which
            gives a physical page number

            --so there are 1024 entries in the page directory

            --how big is an entry in the page directory? 4 bytes

            --entry in page directory and page table:

                [ base address | bunch of bits | U/S | R/W | P ]
                  31 ....... 12

                why 20 bits for the base address? [answer: there are
                2^{20} 4KB pages in the system]

                is that base address a physical address, a linear
                address, a virtual address, what? [answer: it is a
                physical address. the hardware needs to be able to
                follow the page table structure on its own.]

        --EXAMPLE: JOS maps

            0xf0000000 to 0x00000000
            0xf0001000 to 0x00001000

            WHAT DOES THIS LOOK LIKE?

                [ pgdir entry 960 points to a page table
                  (put that page table at, say, PPN 3)
                      page table entry 0 holds PPN 0
                      page table entry 1 holds PPN 1 ]

            (the sketch below builds exactly this structure in code)
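        [extra sketch, not from lecture: the JOS example above in toy C.
        "Physical memory" is a flat array of 4KB pages indexed by PPN,
        and entries carry only the present bit; the placement of the
        pgdir at PPN 2 is an arbitrary assumption of the sketch.]

            #include <stdio.h>

            #define PTE_P    0x1
            #define NENTRIES 1024

            /* pretend RAM: one 4KB page per row */
            unsigned phys_pages[NENTRIES][NENTRIES];

            int main(void) {
                unsigned *pgdir     = phys_pages[2]; /* pgdir at PPN 2 */
                unsigned *pagetable = phys_pages[3]; /* table at PPN 3 */

                /* pgdir entry 960 (= 0xf0000000 >> 22) points at PPN 3 */
                pgdir[0xf0000000 >> 22] = (3 << 12) | PTE_P;
                /* page table entries 0 and 1 hold PPNs 0 and 1 */
                pagetable[(0xf0000000 >> 12) & 0x3ff] = (0 << 12) | PTE_P;
                pagetable[(0xf0001000 >> 12) & 0x3ff] = (1 << 12) | PTE_P;

                /* walk the structure by hand for VA 0xf0001000 */
                unsigned va  = 0xf0001000;
                unsigned pde = pgdir[va >> 22];
                unsigned *pt = phys_pages[pde >> 12];
                unsigned pte = pt[(va >> 12) & 0x3ff];
                unsigned pa  = (pte & 0xfffff000) | (va & 0xfff);
                printf("VA 0x%x -> PA 0x%x\n", va, pa);  /* PA 0x1000 */
                return 0;
            }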
        --EXAMPLE: what if JOS wanted to map

            0xf0001000 to 0x91210000

            [no problem]

            point of this example: the mapping from VA to PA can be all
            over the place

        --ALWAYS REMEMBER

            --each entry in the page *directory* corresponds to 4MB of
            virtual address space

            --each entry in the page *table* corresponds to 4KB of
            virtual address space

            --so how much virtual memory is each page *table*
            responsible for translating? 4KB? 4MB? something else?

            --each page directory and each page table itself consumes
            4KB of physical memory, i.e., each one of these structures
            fits exactly on a page

        --So this is the picture we have so far:

            a VA is 32 bits:

                   pg dir        table        offset
                31 ....... 22  21 ...... 12  11 ....... 0

        --go back to the entry in the page directory and page table:

                [ base address | bunch of bits | U/S | R/W | P ]
                  31 ....... 12

            the "bunch of bits" includes: dirty, accessed,
            cache-disable, write-through

            --what do these U/S and R/W bits do?
            --are these for the kernel, the hardware, what?
            --who is setting them? what is the point?
            --what happens if U/S and R/W differ between the pgdir entry
            and the page table entry? [the processor does something
            deterministic; look it up in the references]

        --can the user modify the page tables? they are in
        memory.......

            --but how can the user even see them?
            --the page tables themselves can be mapped into the user's
            address space!
            --we will see this in the case of JOS below

------------------------------------------------------------------

    putting it all together.... here is how the x86's MMU translates a
    linear address to a physical address:

    [not discussed in class, but make sure you perfectly understand what
    is written below.]

        uint translate (uint la, bool user, bool write)
        {
            uint pde, pte;

            pde = read_mem (%CR3 + 4*(la >> 22));
            access (pde, user, write);

            pte = read_mem ((pde & 0xfffff000) + 4*((la >> 12) & 0x3ff));
            access (pte, user, write);

            return (pte & 0xfffff000) + (la & 0xfff);
        }

        // check protection. pxe is a pte or pde.
        // user is true if CPL==3.
        // write is true if the attempted access was a write.
        // PG_P, PG_U, PG_W refer to the P, U/S, and R/W bits in the
        // entry pictured above.
        void access (uint pxe, bool user, bool write)
        {
            if (!(pxe & PG_P))
                => page fault        // page not present
            if (!(pxe & PG_U) && user)
                => page fault        // no access for user
            if (write && !(pxe & PG_W)) {
                if (user)
                    => page fault    // not writable by user
                if (%CR0 & CR0_WP)
                    => page fault    // not writable, even by the kernel
            }
        }

--------------------------------------------------------------------

    B. TLBs

        --so it looks like the CPU (specifically its MMU) has to go out
        to memory on every memory reference?

            --this is called "walking the page tables"
            --to make this fast, we need a cache

        --TLB: translation lookaside buffer

            hardware that caches virtual address --> physical address
            translations; it is the reason that all of this page table
            walking does not slow the processor down too much

        --hardware managed?

        --software managed? (MIPS: the OS's job is to load the TLB when
        the OS receives a "TLB miss". Not the same thing as a page
        fault.)

        --what happens to the TLB when %cr3 is loaded?
            [answer: it is flushed]

        --can we flush individual entries in the TLB otherwise?
            yes: INVLPG addr

        --how does stuff get into the TLB?
            --answer: the hardware populates it

        --questions:
            --does a TLB miss imply a page fault?
            --does the existence of a page fault imply that there was a
            TLB miss?

        (a toy model of a TLB in front of the page-table walk is
        sketched below)
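        [extra sketch, not from lecture: a toy, direct-mapped TLB in
        front of the page-table walk. All names are hypothetical, and a
        real x86 TLB is set-associative and also caches permission bits;
        the point is only the hit/miss/flush structure.]

            #include <stdbool.h>
            #include <stdio.h>

            #define TLB_SIZE 16
            struct tlb_entry { unsigned vpn, ppn; bool valid; };
            struct tlb_entry tlb[TLB_SIZE];

            /* the slow path (the translate() walk above), stubbed as
               the identity mapping so this sketch runs standalone */
            unsigned walk_page_tables(unsigned vpn) { return vpn; }

            unsigned tlb_translate(unsigned la) {
                unsigned vpn = la >> 12;
                struct tlb_entry *e = &tlb[vpn % TLB_SIZE];
                if (!e->valid || e->vpn != vpn) {    /* TLB miss */
                    e->vpn = vpn;
                    e->ppn = walk_page_tables(vpn);  /* may fault here */
                    e->valid = true;
                }
                /* TLB hit: no page-table memory references at all */
                return (e->ppn << 12) | (la & 0xfff);
            }

            /* what "loading %cr3 flushes the TLB" means in this model */
            void tlb_flush(void) {
                for (int i = 0; i < TLB_SIZE; i++)
                    tlb[i].valid = false;
            }

            int main(void) {
                printf("0x%x\n", tlb_translate(0x00403004)); /* miss */
                printf("0x%x\n", tlb_translate(0x00403008)); /* hit  */
                tlb_flush();   /* e.g., after a %cr3 reload */
                return 0;
            }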
    C. memory in JOS

        --segments are used only to switch privilege levels into and out
        of the kernel

        --paging structures the address space

        --paging limits a process's memory access to its own address
        space

        --see handout for the JOS virtual memory map

        --why are the kernel and the current process both mapped into
        the address space?

            --convenient for the kernel

        --why is all of physical memory mapped at the top? that must
        mean that there are physical memory pages that are mapped in
        multiple places....

            --need to be able to get access to physical memory when
            setting up page tables: the *kernel* has to be able to use
            physical addresses from time to time

        --what the heck is UVPT? ......

            --remember how we wanted a contiguous set of entries?

            --wouldn't it be awesome if the 4MB worth of page table
            appeared inside the virtual address space, at address, say,
            0xef400000 (which we call UVPT)?

            --to do that, we sneakily insert a pointer in the pgdir back
            to the pgdir itself, like this:

                1023 |              |
                     |   ........   |
                 960 |              |
                 959 |    <...>     |
                 958 |    <...>     |
                 957 | self..     U |
                     |   ........   |
                   0 | not present  |

            --result: the page tables *themselves* show up in the
            program's virtual address space

            --in more detail, the virtual address space looks like this:

                [0xef400000, 0xef800000) --> looks like one contiguous
                page table, visible to users. read-only to user; r/w to
                kernel.

            more specifically, the picture of [UVPT, UVPT+4MB) in
            virtual space is:

                UVPT+4MB  __________________

                             PGTABLE 1023
                          __________________
                                . . .
                          __________________

                             PGTABLE 2
                          __________________

                             PGTABLE 1
                          __________________

                             PGTABLE 0
                UVPT      __________________

            --QUESTION: where does the pgdir itself live in the virtual
            address space?

                --0xef400000 ?
                --0xef400000 + 4KB ?
                --0xef400000 + 4KB * 957 ?

                (the worked sketch at the end of these notes computes
                the answer)

            --so user processes can also see their own page tables, but
            we will set the R/W bit to 0 so that they cannot modify them

            --but the kernel maps another copy where it can work on the
            page tables even when there's no user process running

        --something it is probably worth internalizing: one of the
        things that a second-level page table does is to take as many as
        1024 disparate physical pages, perhaps scattered throughout RAM,
        and glue them together in a logical way, making them appear as a
        contiguous 4MB region in virtual space (just as the entire
        paging structure glues disparate physical pages into a 4GB
        "region"). if the second-level page table chosen for this gluing
        is the page directory itself, then the disparate physical pages
        that appear as a contiguous 4MB region wind up being the page
        tables themselves.

        --with the above as background, here is further detail on the
        JOS implementation trick. it works because the page directory
        has the same structure as a page table and because the CPU just
        "follows arrows", namely:

            (1) From the relevant entry in the pgdir [which entry,
            recall, covers 4MB worth of VA space] to the physical page
            number where the relevant page table lives

            (2) From the physical page number where the relevant page
            table lives, more specifically from the relevant entry in
            that page table (which entry is relevant to 4KB of address
            space), to the physical page number that is the target of
            the mapping.
        now, if you "trick" the CPU into following the first arrow back
        to the pgdir itself, and the program references an address
        0xef400000+x, where x < 4MB, then the logic goes like this
        (compare the exact words below to the exact words of the
        numbered items above):

            (1) From the relevant entry in the pgdir [which entry,
            recall, covers the 4MB worth of VA space from
            [0xef400000, 0xef800000)] to the physical page number where
            the page directory lives

            (2) From the physical page number where the page directory
            lives, more specifically from the relevant entry in the page
            directory (which entry is now relevant to only 4KB of
            address space), to the physical page number that is the
            target of the <0xef400000+x, PA> mapping. that physical page
            holds a second-level page table!

        result: the second-level page table appears at 0xef400000+x
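        [extra worked example, not from lecture: the UVPT arithmetic in
        plain C, using the constants from the reference table above. It
        also answers the QUESTION earlier: following the self-pointer
        twice lands back on the pgdir, so the pgdir appears at the third
        choice, 0xef400000 + 4KB * 957.]

            #include <stdio.h>

            #define UVPT   0xef400000u
            #define PGSIZE 4096u

            int main(void) {
                /* which pgdir entry covers [UVPT, UVPT+4MB)? that is
                   where the self-pointer sits */
                unsigned pdx = UVPT >> 22;   /* 0x3bd = 957 */
                printf("self-pointer lives at pgdir entry %u\n", pdx);

                /* within the UVPT window, the PTE for virtual page n
                   sits at UVPT + 4*n: the page tables look like one
                   contiguous array of PTEs */
                unsigned va = 0xf0001000u;
                printf("PTE for VA 0x%x is at VA 0x%x\n",
                       va, UVPT + 4 * (va >> 12));    /* 0xef7c0004 */

                /* the pgdir itself appears at UVPT + 4KB * PDX(UVPT) */
                printf("pgdir appears at VA 0x%x\n",
                       UVPT + PGSIZE * pdx);          /* 0xef7bd000 */
                return 0;
            }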