Class 12
CS 372H
25 February 2010

On the board
------------

1. Trade-offs and problems from synchronization primitives
    A. Performance
    B. Performance v. complexity trade-off
    C. Deadlock
    D. Starvation
    E. Priority inversion
    F. Broken modularity
    G. Careful coding required (more advice)

2. Reflections and conclusions

3. Alternatives

---------------------------------------------------------------------------

1. Trade-offs and problems from synchronization primitives

Locking (in all its forms: mutexes, monitors, semaphores) raises many
issues:

    A. performance
    B. performance v. complexity trade-off
    C. deadlock
    D. starvation
    E. priority inversion
    F. broken modularity
    G. careful coding required (here we give some more advice)

We'll discuss these now:

A. Performance

    quick digression:

    --_dance hall_ architecture: any CPU can "dance with" any memory
      equally (equally slowly)

    --NUMA (non-uniform memory access): each CPU has fast access to some
      "close" memory and slower access to memory that is further away
        --AMD Opterons are like this
        --Intel CPUs are moving toward this

    --two further choices: cache coherent or not. in the former case,
      hardware runs a cache coherence (cc) protocol to invalidate caches
      when a local change happens. in the latter case, it does not. the
      former case is far more common.

    let's assume cache-coherent NUMA machines... back to performance
    issues....

    our baseline is a test-and-test-and-set spinlock, which is basically
    what Linux uses:

        void acquire(Lock* lock) {
            pushcli();
            while (xchg_val(&lock->locked, 1) == 1) {
                while (lock->locked)
                    ;
            }
        }

        void release(Lock* lock) {
            xchg_val(&lock->locked, 0);
            popcli();
        }

    the performance issues are:

    (i) fairness
        --one CPU gets the lock because the memory holding the "locked"
          variable is closer to that CPU
        --allegedly, Google had fairness problems on Opterons (I have no
          proof of this)

    (ii) lots of traffic over the memory bus: if there is lots of
         contention for the lock, then the cache coherence protocol
         creates lots of remote invalidations every time someone tries
         to do a lock acquisition

    (iii) cache line bounces (same reason as (ii))

    (iv) locking inherently reduces concurrency

    mitigation of (i)--(iii): better locks

        --MCS locks
            --see handout (and the sketch just below)
            --advantages:
                --guarantees FIFO ordering of lock acquisitions
                  (addresses (i))
                --spins on a local variable only (addresses (ii), (iii))
                --[not discussing this, but: works equally well on
                  machines with and without coherent caches]
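            --to make "spins on a local variable" concrete, here is a
              minimal MCS-style sketch. this is NOT the handout's code:
              the mcs_qnode/mcs_lock names and the GCC-style __atomic
              builtins are assumptions made for illustration.

                #include <stdbool.h>
                #include <stddef.h>

                typedef struct mcs_qnode {
                    struct mcs_qnode* next;
                    bool locked;      /* true while this CPU must spin */
                } mcs_qnode;

                typedef struct {
                    mcs_qnode* tail;  /* last waiter, or NULL if free */
                } mcs_lock;

                void mcs_acquire(mcs_lock* lk, mcs_qnode* me) {
                    me->next = NULL;
                    me->locked = true;
                    /* atomically join the tail of the waiter queue */
                    mcs_qnode* prev =
                        __atomic_exchange_n(&lk->tail, me, __ATOMIC_ACQ_REL);
                    if (prev != NULL) {
                        /* lock is held: link in behind prev, then spin
                           on OUR OWN flag, not on a shared variable */
                        __atomic_store_n(&prev->next, me, __ATOMIC_RELEASE);
                        while (__atomic_load_n(&me->locked, __ATOMIC_ACQUIRE))
                            ;
                    }
                }

                void mcs_release(mcs_lock* lk, mcs_qnode* me) {
                    if (__atomic_load_n(&me->next, __ATOMIC_ACQUIRE) == NULL) {
                        /* no known successor: try to mark the lock free */
                        mcs_qnode* expected = me;
                        if (__atomic_compare_exchange_n(
                                &lk->tail, &expected, NULL, false,
                                __ATOMIC_ACQ_REL, __ATOMIC_ACQUIRE))
                            return;
                        /* a waiter is mid-enqueue; wait for the link */
                        while (__atomic_load_n(&me->next, __ATOMIC_ACQUIRE)
                               == NULL)
                            ;
                    }
                    /* FIFO handoff: the only remote write is to the
                       successor's flag */
                    __atomic_store_n(&me->next->locked, false,
                                     __ATOMIC_RELEASE);
                }

            --each CPU passes its own qnode to acquire/release; waiters
              spin on a flag in their own qnode, so contended
              acquisitions do not bounce a shared cache line around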
        --NOTE: with fewer cores, plain spinlocks are better. why?
          [under low contention, the test-and-test-and-set acquire is a
          single atomic operation, whereas MCS pays for queue
          bookkeeping on every acquire and release]
        --in fact, if there is high contention, performance will be poor
          either way; MCS locks just make it a little less poor. more on
          that in a bit.

    mitigation of (iv): more fine-grained locking
        --unfortunately, fine-grained locking leads to the next issue,
          which is also fundamental

B. Performance v. complexity trade-off

    --one big lock is often not great for performance, even when we use
      the fancier locks above
        --indeed, locking itself is the issue, which is one reason why
          Linux uses test-and-test-and-set spinlocks rather than MCS
          locks: the gain from the fancier lock is non-existent with a
          small number of CPUs and only marginal relative to the gain
          from restructuring the code
    --the fundamental issue with coarse-grained locking is that only one
      CPU at a time can execute anywhere in your code. if your code is
      called a lot, this may reduce the performance of an expensive
      multiprocessor to that of a single CPU.
        --if this happens inside the kernel, applications inherit the
          performance problems from the kernel
    --perhaps locking at smaller granularity would get higher
      performance through more concurrency
        --but how best to reduce lock granularity is a bit of an art
        --and unfortunately finer-grained locking makes incorrect code
          far more likely
        --and modularity also suffers (a theme we will return to)
    --two examples of the above issues:

    --Example 1:
        --imagine that every file in the file system is represented by a
          number, in a big table
        --you might inspect the file system code and notice that most
          operations use just one file or directory, leading you to have
          one lock per file
        --you could imagine the code implementing directories exporting
          various operations like:

            dir_lookup(d, name)
            dir_add(d, name, file_number)
            dir_del(d, name)

        --with fine-grained locking, these directory operations would
          *internally* acquire the lock on d, do their work, and release
          the lock
        --then higher-level code could implement operations like moving
          a file from one directory to another:

            move(olddir, oldname, newdir, newname) {
                file_number = dir_lookup(olddir, oldname)
                dir_del(olddir, oldname)
                dir_add(newdir, newname, file_number)
            }

        --unfortunately, this isn't great: there is a period of time
          when the file is visible in neither directory. fixing that
          requires that the directory locks _not_ be hidden inside the
          dir_* operations. so we need something like this:

            move(olddir, oldname, newdir, newname) {
                acquire(olddir.lock)
                acquire(newdir.lock)
                file_number = dir_lookup(olddir, oldname)
                dir_del(olddir, oldname)
                dir_add(newdir, newname, file_number)
                release(newdir.lock)
                release(olddir.lock)
            }

        --the above code is a bummer in that it exposes the
          implementation of directories to move(), but (if all you have
          is locks) you have to do it this way

    --Example 2: see filemap.c at the end of the handout for an extreme
      case

    --mitigation? unfortunately, there is no way around this trade-off.
        --worse, it is easy to get this stuff wrong: correct code is
          harder to write than buggy code
        --if you have fine-grained locking (i.e., you are trading away
          simplicity), then you are much more likely to encounter the
          two types of errors:
            (i) safety errors (race conditions)
            (ii) liveness errors (deadlocks, etc.)

    --***So what do people do?***

        --in app space:
            --don't worry too much about performance up front; that
              makes it easier to keep your code free of safety problems
              *and* liveness problems
            --if you are worrying about performance, make sure there are
              no race conditions. that is much more important than
              worrying about deadlock.
                --SAFETY FIRST.
                --it is almost always far better for your program to do
                  nothing than to do the wrong thing (example of using a
                  linear accelerator for radiation therapy: **way**
                  better not to subject the patient to the radiation
                  beam at all than to subject the patient to a beam that
                  is 100x too strong, leading to gruesome, atrocious
                  injuries)
                --if the program deadlocks, the evidence is intact, and
                  we can go back and see what the problem was
            --there are ways around deadlock, as we will discuss in a
              moment
            --but we shouldn't be too cavalier about liveness issues
              either, because they can lead to catastrophic cases.
              example: Mars Pathfinder (which was addressed; see below),
              but still.

        --in kernel space:
            --same thing, to some extent
            --but performance matters more in kernel space, so you are
              likely to be dealing with more complex issues
            --here again, SAFETY FIRST:
                --lock more aggressively
                --worry about deadlock later
            --not a satisfying answer, but there is no silver bullet for
              concurrency-related issues

    --by the way, if there is lots of contention, then the style and
      granularity of locks will not eliminate the problem. where does
      contention come from?
        --application requirements: lots of contention comes from
          applications that inherently require global resources or
          shared data
        --example of Apache: every CPU needs to write to a global
          logfile, which causes contention in the kernel

C. Deadlock

    --deadlock is more *likely* with finer-grained locking, but it is a
      definite *possibility* even with coarse-grained locking. that is,
      you always have to worry about deadlock.
    --see handout: simple example based on two locks (a minimal pthread
      sketch of the same pattern follows below)
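    --the handout's version is not reproduced here; the thread and mutex
      names below are made up. two threads acquire the same two mutexes
      in opposite orders, so each can end up holding the lock the other
      needs:

        #include <pthread.h>

        static pthread_mutex_t mutex_a = PTHREAD_MUTEX_INITIALIZER;
        static pthread_mutex_t mutex_b = PTHREAD_MUTEX_INITIALIZER;

        void* thread_1(void* arg) {
            pthread_mutex_lock(&mutex_a);   /* holds A ... */
            pthread_mutex_lock(&mutex_b);   /* ... waits for B */
            /* critical section */
            pthread_mutex_unlock(&mutex_b);
            pthread_mutex_unlock(&mutex_a);
            return NULL;
        }

        void* thread_2(void* arg) {
            pthread_mutex_lock(&mutex_b);   /* holds B ... */
            pthread_mutex_lock(&mutex_a);   /* ... waits for A: deadlock
                                               if thread_1 holds A */
            /* critical section */
            pthread_mutex_unlock(&mutex_a);
            pthread_mutex_unlock(&mutex_b);
            return NULL;
        }

        int main(void) {
            pthread_t t1, t2;
            pthread_create(&t1, NULL, thread_1, NULL);
            pthread_create(&t2, NULL, thread_2, NULL);
            pthread_join(t1, NULL);   /* with unlucky timing, this never
                                         returns */
            pthread_join(t2, NULL);
            return 0;
        }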
    --see handout: more complex example
        --M calls N
        --N waits on a condition
        --but say the condition can only become true if N is invoked
          through M
        --now the lock inside N is unlocked, but M remains locked; that
          is, no one is going to be able to enter M
        --lesson: it is dangerous to hold locks (M's mutex in the case
          on the handout) when crossing abstraction barriers

    --deadlocks without mutexes: the real issue is resources and how
      they are requested
        --non-computer example **[picture of bridge]**
            --the bridge only allows traffic in one direction
            --each section of the bridge can be viewed as a resource
            --if a deadlock occurs, it can be resolved if one car backs
              up (preempt resources and roll back)
            --several cars may have to be backed up if a deadlock occurs
            --starvation is possible
        --other example:
            --one thread/process grabs the disk and then tries to grab
              the scanner
            --another thread/process grabs the scanner and then tries to
              grab the disk

    --how do we get around deadlock?

    (i) ignore it: worry about it when it happens

    (ii) detect and recover: not great
        --could imagine attaching a debugger
            --not really viable for production software, but works well
              in development
        --the threads package can keep track of the resource-allocation
          graph
            --see book
            --for each lock acquired, order it with the other locks held
            --if a cycle occurs, abort with an error
            --this detects potential deadlocks even if they do not occur

    (iii) avoid deadlock algorithmically
        --banker's algorithm (see book)
            --very elegant but impractical
        --if you're using the banker's algorithm, the gameboard looks
          like this:

            ResourceMgr::Request(ResourceID resc, RequestorID thrd) {
                acquire(&mutex);
                assert(system in a safe state);
                while (state that would result from giving resc to thrd
                       is not safe) {
                    wait(&cv, &mutex);
                }
                update state by giving resc to thrd;
                assert(system in a safe state);
                release(&mutex);
            }

          now we need to determine whether a state is safe.... to do so,
          see book
        --disadvantages of the banker's algorithm:
            --requires every single resource request to go through a
              single broker
            --requires every thread to state its maximum resource needs
              up front. unfortunately, if threads are conservative and
              claim they need huge quantities of resources, the
              algorithm will reduce concurrency.

    (iv) prevent deadlock by careful coding
        --negate one of the four conditions:
            1. mutual exclusion
            2. hold-and-wait
            3. no preemption
            4. circular wait
        --can sort of negate 1:
            --put a queue in front of resources, like the printer
            --virtualize memory
        --not much hope of negating 2
        --can sort of negate 3:
            --consider physical memory: virtualized with VM, so we can
              take a physical page away and give it to another process!
        --what about negating #4? in practice, this is what people do
            --idea: partial order on locks
            --establish an order on all locks, and make sure that every
              thread acquires its locks in that order
            --for the files-and-directories example, the rule might be
              to lock files in order of file # (see the sketch just
              below)
            --see filemap.c
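            --here is a hedged sketch of that rule applied to the
              earlier move() example. this is NOT filemap.c's actual
              code: the Dir type, its inum (file number) and lock
              fields, and the function shapes are assumptions.

                typedef struct Dir {
                    int  inum;   /* this directory's file number */
                    Lock lock;   /* per-directory lock, as in Example 1 */
                } Dir;

                void move(Dir* olddir, const char* oldname,
                          Dir* newdir, const char* newname) {
                    Dir* first  = olddir;
                    Dir* second = newdir;

                    /* global rule: always lock the directory with the
                       smaller file number first, so every thread orders
                       these locks the same way and no acquisition cycle
                       can form */
                    if (newdir->inum < olddir->inum) {
                        first  = newdir;
                        second = olddir;
                    }

                    acquire(&first->lock);
                    if (second != first)   /* same dir: lock only once */
                        acquire(&second->lock);

                    int file_number = dir_lookup(olddir, oldname);
                    dir_del(olddir, oldname);
                    dir_add(newdir, newname, file_number);

                    if (second != first)
                        release(&second->lock);
                    release(&first->lock);
                }

            --the point: move() still has to see the directories' locks,
              but now every caller that needs both locks takes them in
              the same global order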
            --why this works:
                --we can view deadlock as a cycle in the resource
                  acquisition graph
                --a partial order implies no cycles and hence no
                  deadlock
            --three bummers:
                1. it is hard to represent condition variables inside
                   this framework; it works best only for locks
                2. the compiler can't check at compile time that the
                   partial order is being adhered to, because the
                   calling pattern is impossible to determine without
                   running the program (thanks to function pointers)
                3. picking and obeying the order on *all* locks requires
                   that modules make public their locking behavior, and
                   requires them to know about other modules' locking.
                   this can be painful and error-prone.

    (v) static and dynamic detection tools
        --see, for example, these citations, the citations therein, and
          the papers that cite them:

            Engler, D. and K. Ashcraft. RacerX: effective, static
            detection of race conditions and deadlocks. Proc. ACM
            Symposium on Operating Systems Principles (SOSP), October
            2003, pp. 237-252.
            http://portal.acm.org/citation.cfm?id=945468

            Savage, S., M. Burrows, G. Nelson, P. Sobalvarro, and T.
            Anderson. Eraser: a dynamic data race detector for
            multithreaded programs. ACM Transactions on Computer Systems
            (TOCS), Volume 15, No. 4, November 1997, pp. 391-411.
            http://portal.acm.org/citation.cfm?id=265927

          there is a long literature on this stuff
        --disadvantage of dynamic checking: it slows the program down
        --disadvantage of static checking: many false alarms (the tool
          says "there is a deadlock" when in fact there is none) or else
          missed problems

D. Starvation

    --a thread waits indefinitely (for example, because other threads
      keep acquiring the lock ahead of it)
    --livelock example: threads keep executing, but none makes progress

---------------------------------------------------------------------------

Admin details

--comments about the confusing nature of the labs: tell us, so that we
  can clarify
    --of course, this requires that you start early
    --but starting early has another benefit: you may actually wind up
      spending less total time if you spread the lab over two days than
      if you do it in one chunk. the reason is that you catch things on
      a second look, or after a night of sleep, that you did not catch
      at first.

---------------------------------------------------------------------------