Class 15
CS 372H 09 March 2010

On the board
------------
1. Last time
   --exokernel
   --microkernels vs monolithic kernels
2. Liedtke paper
3. Therac-25

---------------------------------------------------------------------------

1. Last time

--microkernels vs monolithic kernels vs exokernels
--More about the trade-off:
    --monolithic makes it easier to have sub-systems cooperate (such as
      the pager and the file system; these modules can read each other's
      data structures)
    --monolithic creates lots of strange interactions, which leads to
      complexity and bugs
    --message passing is another way to handle concurrency. every
      process has its own memory region, and the kernel's job is to
      shuttle messages back and forth.

2. Liedtke paper

L3: commercially available microkernel. evolved into L4, which is still
in use.

--Remember, everything works via message passing
--Therefore message passing, i.e., IPC, needs to be fast

list:
    background
    principles
    interface
    optimizations
    discussion

A. Background

--What's an RPC? What's an IPC?
    --IPC: message from thread (or process) A to thread (or process) B
    --RPC: a round trip of IPCs (there and back)
--What's the minimum needed to do an IPC?
    --See Table 3, back page: 127 cycles
    --The big cost: our friends "int" and "iret"
    --Why are these expensive?
        --pipeline flushed
        --registers dumped on stack
    --TLB misses in the wake of context switches
    --Why are 5 TLB misses needed?
        (i) B's thread control block
        (ii) loading %cr3 flushes the TLB, so kernel text causes a miss
             on the next kernel instruction after "lcr3"
        (iii,iv) iret accesses both the *kernel* stack and *user* text
             -- two pages
        (v) B's user code looks at the message
--How do you think this trend has progressed since the paper?
    --ANSWER:
        1. Worse now. Faster processors are optimized for straight-line
           code
        2. Traps/exceptions flush a deeper pipeline, and cache misses
           cost more cycles
--Actual IPC time of optimized L3: 5 usec
--Is that expensive? Compared to what?
    --accessing a disk?
        (milliseconds to access a disk, so no problem)
    --network interrupts when packets arrive? well, what if you wanted
      to handle 50,000 packets/second? two IPCs/packet = 100,000
      IPCs/second. if the processor can only do 200,000 IPCs/second,
      then IPCs would take up 100,000/200,000 = 50% of the CPU

B. Principles

-- *IPC performance is the master*
-- Plus a bunch of other things that emphasize IPC performance:
    --All design decisions require a *performance discussion*
    --If something performs poorly, look for new techniques
    --*Synergistic effects* have to be taken into consideration
      [What does this mean? That a lot of little things might add up to
      a big gain, or a big loss if two changes interact poorly. Need to
      test each combination of features?!]
    --The design has to *cover all levels* from architecture down to
      coding
    --The design has to be made on a *concrete basis*
-- Up until this point, a bunch of principles that argue that you
   should do endless IPC optimization!
    --How do we know when to stop?
    --How do we know when we can't optimize further?
    --Answer: One of the nicer principles in L3: "The design has to aim
      at a concrete performance goal."
        -- Without this, you'd get lost optimizing things that don't
           matter
        -- Take the minimum IPC time (172 cycles), multiply by 2 --
           350 cycles = 7 usec (at 50 MHz)
        -- set *T* = 5 usec
        -- The minimum null RPC is already at 69% of T!
        -- System calls + address space switches = 60% of T
        -- L3 achieves 250 cycles = 5 usec
-- Basic approach: Design the microkernel for a specific CPU

C. Interface

old:

    send (threadID, send-message, timeout);    /* nonblocking */
    receive (receive-message, timeout);        /* nonblocking */

    if A sends to B:

    A: send();
       receive();

    B: while (1) {
           select();
           receive(&requestbuf);
           replybuf = process();
           send(replybuf);
       }

new:

    call (threadID, send-message, receive-message, timeout);
    reply_and_receive_next (reply-message, receive-message, timeout);

now:

    A: call(threadID, send-buf, receive-buf, timeout);

    B: receive(&requestbuf);
       do {
           replybuf = process(requestbuf);
           reply_and_receive_next(replybuf, &requestbuf, timeout);
       } while (1);

D. Optimizations

(i) new system call: 2 system calls per RPC, instead of 4.

(ii) complex messages: send one message instead of a bunch

(iii) direct transfer with memory mapping

    what's going on here?

    naive solution: two copies: A --> kernel --> B

    okay, so why not share user-level pages between A and B, and have
    the sender copy into a shared buffer? well, then the receiver might
    need write access to signal when it's done processing.
    problem:
        --security issue: information can flow back from B to A
    other problems with shared buffers in this context:
        --receiver checks message legality, then the message changes
          (if the receiver copies the message first, then we're back
          where we started)
        --with many clients, a server could run out of VA space
        --somehow need to coordinate first
        --not app-friendly. why? [have to copy data anyway; can't
          generate the data directly into the buffer, etc.]

    Liedtke's approach: one copy: A --> remapped B.
        --The kernel does the copy inside A
        --How to do this maximally cheaply?
        --ANSWER: Copy two PDEs (8 MB) from B's address space into the
          kernel range of A's pgdir. Then execute the copy in A's
          kernel space.
        --Literally copy the entries?
            --No! copy the entry *except* that the PTE_U bit needs to
              be cleared, because only the kernel should be using this
              window in A
        --Why two PDEs? The maximum message size is 4 MB, so the copy
          is guaranteed to work regardless of how B aligned the
          message buffer
        --Why not just copy PTEs?
            --Would be much more expensive

    --What does it mean for the TLB to be "window clean"? Why do we
      care? Why can't we just invalidate the mappings?
        --Means the TLB contains no mappings within the communication
          window
        --We care because mapping is cheap (copy a PDE), but
          invalidation is not: the x86 only lets you invalidate one
          page at a time, or the whole TLB
        --Why isn't it enough to invalidate the two pages?
            --trick question. it's not two pages. it's two PDEs
              --> 8 MB.
        --We need to invalidate because the same kernel virtual
          address range may refer to multiple physical pages (the
          kernel's window is in the same virtual place in every
          process)
        --Does TLB invalidation of the communication window turn out
          to be a problem? Not usually, because we have to load %cr3
          during IPC anyway (unless the address space doesn't change)

(iv) Thread control block (TCB)

    the tcb contains basic info about a thread
        --registers, links for various doubly-linked lists, pgdir,
          uid, ...
        --commonly accessed fields packed together on the same cache
          line

    [Draw picture of array, with kernel stack inside TCB]

    The kernel stack is on the same page as the tcb. Why?

    a. Minimizes TLB misses (since accessing the kernel stack will
       bring in the tcb)
        --consider the alternative
        --NOTE: in Table 3, switching stacks doesn't cause a TLB miss.
          the reason is that B's TCB was accessed earlier in Table 3.
    b. Very efficient access to the current TCB -- just mask off the
       lower 12 bits of %esp

    Another nice thing: can access *any* TCB efficiently, given the
    thread id. why?
        --the actual thread number sits inside the 32-bit thread id in
          a very particular place -- shifted left by b bits, where
          tcb size = 2^b:

              [ ...... | thr_num | <-- b bits --> ]

          so masking the id yields a ready-made byte offset into the
          TCB array
        --doing it this way replaces an {"and", "multiply", "add"}
          with an {"and", "add"}!
        --Note that the thread ID here is like the JOS env ID (has a
          number that serves as an index, a generation, etc.)

(v) Lazy scheduling

    conventional approach to scheduling:

        A sends a message to B:
            Move A from the ready queue to a waiting queue
            Move B from the waiting queue to the ready queue

        This requires 58 cycles, including 4 TLB misses.
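Stepping back to (iii): the PDE-copy trick for the communication window can be sketched in C. This is only a sketch under assumptions -- JOS-style 32-bit page directories, a hypothetical `map_comm_window` helper, and `PDX`/`PTE_U` as in JOS; L3's actual code differs.

```c
#include <stdint.h>

#define PTE_U  0x004u                 /* user-accessible bit, JOS-style */
#define PDX(va) (((uint32_t)(va)) >> 22)   /* page-directory index of va */

/* Map an 8 MB communication window: copy the two PDEs covering B's
 * message buffer into a fixed kernel-only slot of A's page directory.
 * Two PDEs guarantee room for a 4 MB message at any alignment. */
void map_comm_window(uint32_t *a_pgdir, const uint32_t *b_pgdir,
                     uint32_t b_buf_va, uint32_t window_va)
{
    for (int i = 0; i < 2; i++) {
        uint32_t pde = b_pgdir[PDX(b_buf_va) + i];
        /* clear PTE_U: only the kernel may use the window in A */
        a_pgdir[PDX(window_va) + i] = pde & ~PTE_U;
    }
    /* caller must ensure the TLB is "window clean" before relying on
     * these mappings (usually free: %cr3 is reloaded during IPC) */
}
```

Note the single invariant that makes this safe: user code in A can never see the window, because the copied entries lack PTE_U.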
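Stepping back to (iv): the two TCB-addressing tricks (mask %esp for the current TCB; and+add for any TCB) can be sketched in C. The constants here are assumptions for illustration -- a 4 KB TCB-plus-kernel-stack page and a hypothetical bit layout for the thread id -- not L3's exact layout.

```c
#include <stdint.h>

#define TCB_BITS  12u                     /* assumed: tcb size = 2^12 = 4 KB */
#define TCB_SIZE  (1u << TCB_BITS)
#define TCB_BASE  0xC0000000u             /* hypothetical TCB array base */

/* (b) Current TCB: kernel stack and TCB share a page, so masking off
 * the low 12 bits of the stack pointer lands on the TCB. */
static inline uint32_t tcb_of_esp(uint32_t esp)
{
    return esp & ~(TCB_SIZE - 1);
}

/* Any TCB from a thread id: the thread number is stored pre-shifted by
 * TCB_BITS inside the id (assumed field: bits 12..25), so an "and"
 * plus an "add" suffice -- no multiply. */
#define TID_INDEX_MASK 0x03FFF000u
static inline uint32_t tcb_of_tid(uint32_t tid)
{
    return TCB_BASE + (tid & TID_INDEX_MASK);
}
```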
        What are the TLB misses? The queues use doubly linked lists
        [go over best implementation]. The most efficient approach
        would be to insert A at B's old position in the list, so the
        previous and next elements in each list must be touched.

    lazy scheduling:

        Insight: After A blocks, *don't take it off the ready queue
        yet!* It will probably get right back on very quickly.

        The ready queue must contain all ready threads, EXCEPT
        POSSIBLY THE CURRENT ONE
            Might contain other threads that aren't actually ready,
            though
        Each wakeup queue contains AT LEAST all threads waiting in
        that queue
            Again, it might contain other threads, too

        The scheduler removes inappropriate queue entries when
        scanning a queue

        Why does this help performance?
            There are only three situations in which a thread gives up
            the CPU but stays ready: the "send" syscall (as opposed to
            call), preemption, and hardware interrupts
            [these are the only cases when the thread needs to be put
            on the ready list]
            So very often we can IPC into a thread without putting it
            on the ready list
            The "ipc : lazy queue update" ratio can reach 50:1 with
            high ipc rates

(vi) Segment register optimization

    --Loading segment registers is slow -- have to access the GDT,
      etc.
    --But the common case is that users don't change their segment
      registers
    --Observation: It's faster to check a segment register than to
      load it
        So just check that the segment registers are okay
        Only load them if user code changed them

(vii) Various other tricks

    --Short messages passed through registers
    --Minimize TLB misses by putting things on the same page
    --Put commonly used data on the same cache lines
    --Other coding tricks: short offsets, avoid jumps, etc.

E. Discussion

--Great performance numbers! Much better than other microkernels
  (Fig 7, 8)
--Too bad microbenchmark performance might not matter
--Too bad, too, that hardware evolution has made ipc inherently more
  expensive
--What do you think of the theme of the paper?
    --Liedtke was fighting a losing battle against CPU makers:
      hardware evolution keeps making IPC inherently more expensive
    [But a very nice series of design decisions (or hacks).]
--Is fast IPC something that computer architects should design
  hardware to take into account?

------------------------------------------------------------------------

Admin notes

--review session was Monday
--notes from the review session will be posted
--remember to check announcements every 24 hours (or subscribe via
  RSS)
--am having office hours on Wed and will do further review then
--ground rules for exam
    --75 minute exam
    --at 70 minutes, you have to stay seated; do not get up and
      distract your classmates.
    --you must hand your exam to me (we are not going to collect
      them). the purpose of this is so everyone gets the same amount
      of time.
    --at 78 minutes, I will walk out of the room and won't accept any
      exams when I leave
    --thus you must hand in your exam at time x minutes, where:
      x <= 70 OR 75 <= x < 78
--bring ONE two-sided sheet of notes; formatting requirements are
  listed on the Web page
--bring your ID

------------------------------------------------------------------------

3. Therac-25

A. Mechanics
B. What went wrong?
C. What could/should they have done?

A. Mechanics

[draw picture of this thing]

dual-mode machine (actually, triple mode, given the disasters)

    intended settings:

                            beam       beam      beam modifier
                            energy     current   (given by TT position)
    ----------------------------------------------------------------
    for electron therapy    5-25 MeV   low       magnets
    for X-ray therapy       25 MeV     high      flattener
      (photon mode)                    (100x)
    for field light mode    0          0         none

    (b/c of the flattener, more current is needed in X-ray mode)

What can go wrong?

    (a) if the beam has high current, but the turntable has the
        magnets in place, not the flattener, it is a disaster: the
        patient gets hit with a high-current electron beam

    (b) another way to kill a patient is to turn the beam on with the
        turntable in the field-light position

So what's going on?
    (Multiple modes, and mixing them up is very, very bad)

B. What actually went wrong?

    --two software problems
    --a bunch of non-technical problems

    (i) software problem #1:

    [this is our best guess; it's actually hard to know for sure,
    given the way that the paper is written.]

    --three threads:
        --keyboard
        --turntable
        --general parameter setting
    --see handout for the pseudocode
    --now, if the operator sets a consistent set of parameters for x
      (X-ray (photon) mode), realizes that the doctor ordered
      something different, and then edits very quickly to e (electron)
      mode, then what happens?
        --if the re-editing takes less than 8 seconds, the general
          parameter setting thread never sees that the editing
          happened, because it's busy doing something else. when it
          returns, it misses the setup signal
        --now the turntable is in the 'e' position (magnets)
        --but the beam is a high-intensity beam, because 'Treat' never
          saw the request to go to electron mode
        --each thread, and the operator, thinks everything is okay
        --operator presses BEAM ON --> patient mortally injured
    --so why doesn't the computer check the set-up for consistency
      before turning on the beam? [all it does is check that there's
      no more input processing.]
      alternatives:
        --double-check with the operator
        --end-to-end consistency check in software
        --hardware interlocks
      [probably want all of the above]

    (ii) software problem #2:

    how it's supposed to work:
        --operator sets up parameters on the screen
        --operator moves the turntable to field-light mode and
          visually checks that the patient is properly positioned
        --operator hits "set" to store the parameters
        --at this point, the class3 "interlock" is supposed to tell
          the software to check and perhaps modify the turntable
          position
        --operator presses "beam on"

    how they implemented this:
        --see pseudocode on handout

    but it doesn't always work out that way. why?
        --because this boolean flag is implemented as a counter
        --(why implemented as a counter? the PDP-11 had an Increment
          Byte instruction that added 1 ("inc A"). this increment
          presumably took a bit less code space than materializing the
          constant 1 in an instruction like "A = 1".)
        --so what goes wrong? the counter is one byte, so every 256th
          increment it wraps around to 0, and on those passes the
          turntable check is skipped
        --operator presses "beam on", and a beam is delivered in the
          field-light position, with no scanning magnets or flattener
          --> patient injured or killed

    (iii) Lots of larger issues here too

    --***No end-to-end consistency checks***. What you actually want
      is:
        --right before turning the beam on, the software checks that
          the parameters line up
        --hardware that won't turn the beam on if the parameters are
          inconsistent
        --then double-check that by using a radiation "phantom"
    --too easy to say 'go'; errors reported by number; no
      documentation
    --false alarms (operators learn the following response: "it'll
      probably work the next time")
    --unnecessarily complex and poor code
    --weird software reuse: wrote their own OS ... but used code from
      a different machine
    --measuring devices that report _underdoses_ when they are
      ridiculously saturated
    --no real quality control, unit tests, etc.
    --no error documentation, no documentation on software design
    --no follow-through on the Therac-20's blown fuses
    --the company lied; didn't tell users about each other's failures
    --the company assumed software wasn't the problem

C. What could/should they have done?

    --Address the stuff above
    --You might be thinking, "So many things went wrong. There was no
      single cause of failure. Does that mean no single design change
      could have contributed to success?"
    --Answer: no! do end-to-end consistency checks! that single change
      would have prevented these errors!

D. What happened in the disasters reported by the NYT?

    --Hard to know for sure
    --Looks like: the software lost the treatment plan, and it
      defaulted to "all leaves open". The analog of the field-light
      position.

    What could/should have been done?
    --a good rule is: "software should have sensible defaults". it
      looks like this rule was violated here.
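Stepping back to software problem #2: the counter-rollover bug can be sketched in C. The names (`class3`, `set_up_test`, the commented-out collimator check) are hypothetical stand-ins -- the real code was PDP-11 assembly -- but the one-byte-wraparound behavior is the documented failure.

```c
#include <stdint.h>

/* Class3 was meant to be a boolean "check the turntable" flag, but it
 * was implemented as a one-byte counter, incremented ("inc A" on the
 * PDP-11) on every pass through the setup loop. */
static uint8_t class3 = 0;      /* hypothetical name for the flag */
static int checks_skipped = 0;  /* instrumentation for this sketch */

static void set_up_test(void)
{
    class3++;                   /* every 256th pass this wraps to 0 ... */
    if (class3 != 0) {
        /* check_collimator_position();  -- turntable verified */
    } else {
        checks_skipped++;       /* ... and the safety check is skipped */
    }
}
```

If the operator happens to press "set" on a pass where the counter has just wrapped to zero, the turntable position is never verified. The fix was correspondingly small: store a constant into the flag instead of incrementing it.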
    --in a system like this, there should be hardware interlocks (for
      example: no turning on the beam unless the leaves are closed)

E. Amateur ethics/philosophy

    (i) Philosophical/ethical question: you have a 999/1000 chance of
        being cured by this machine. 1/1000 times it will cause you to
        die a gruesome death. do you pick it? most people would.
        --> then, what *should* the FDA do?

    (ii) should people have to be licensed to write software?
         (food for thought)