Class 12
CS 372H
25 February 2010

On the board
------------

1. Trade-offs and problems from synchronization primitives
    A. Performance
    B. Performance v. complexity trade-off
    C. Deadlock
    D. Starvation
    E. Priority inversion
    F. Broken modularity
    G. Careful coding required (more advice)

2. Reflections and conclusions

3. Alternatives

---------------------------------------------------------------------------

1. Trade-offs and problems from synchronization primitives

Locking (in all its forms: mutexes, monitors, semaphores) raises many
issues:

    A. performance
    B. performance v. complexity trade-off
    C. deadlock
    D. starvation
    E. priority inversion
    F. broken modularity
    G. careful coding required (here we give some more advice)

We'll discuss these now:

A. Performance

    quick digression:

    --_dance hall_ architecture: any CPU can "dance with" any memory
      equally (equally slowly)

    --NUMA (non-uniform memory access): each CPU has fast access to some
      "close" memory and slower access to memory that is further away
        --AMD Opterons are like this
        --Intel CPUs are moving toward this

    --two further choices: cache coherent or not. in the former case,
      hardware runs a cache coherence (cc) protocol to invalidate caches
      when a local change happens. in the latter case, it does not. the
      former case is far more common.

    let's assume cache-coherent NUMA machines... back to performance
    issues....

    our baseline is a test-and-test-and-set spinlock, which is basically
    what Linux uses:

        void acquire(Lock* lock) {
            pushcli();
            while (xchg_val(&lock->locked, 1) == 1) {
                while (lock->locked)
                    ;
            }
        }

        void release(Lock* lock) {
            xchg_val(&lock->locked, 0);
            popcli();
        }

    the performance issues are:

    (i) fairness
        --one CPU gets the lock because the memory holding the "locked"
          variable is closer to that CPU
        --allegedly, Google had fairness problems on Opterons (I have no
          proof of this)

    (ii) lots of traffic over the memory bus: if there is lots of
         contention for the lock, then the cache coherence protocol
         creates lots of remote invalidations every time someone tries
         to do a lock acquisition

    (iii) cache line bounces (same reason as (ii))

    (iv) locking inherently reduces concurrency

    mitigation of (i)--(iii): better locks

        --MCS locks
            --see handout (and the sketch just below)
            --advantages:
                --guarantees FIFO ordering of lock acquisitions
                  (addresses (i))
                --spins on a local variable only (addresses (ii), (iii))
                --[not discussing this, but: works equally well on
                  machines with and without coherent caches]
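            --to make "spins on a local variable" concrete, here is a
              minimal MCS-style sketch. this is NOT the handout's code:
              the mcs_qnode/mcs_lock names and the GCC-style __atomic
              builtins are assumptions made for illustration.

                #include <stdbool.h>
                #include <stddef.h>

                typedef struct mcs_qnode {
                    struct mcs_qnode* next;
                    bool locked;      /* true while this CPU must spin */
                } mcs_qnode;

                typedef struct {
                    mcs_qnode* tail;  /* last waiter, or NULL if free */
                } mcs_lock;

                void mcs_acquire(mcs_lock* lk, mcs_qnode* me) {
                    me->next = NULL;
                    me->locked = true;
                    /* atomically join the tail of the waiter queue */
                    mcs_qnode* prev =
                        __atomic_exchange_n(&lk->tail, me, __ATOMIC_ACQ_REL);
                    if (prev != NULL) {
                        /* lock is held: link in behind prev, then spin
                           on OUR OWN flag, not on a shared variable */
                        __atomic_store_n(&prev->next, me, __ATOMIC_RELEASE);
                        while (__atomic_load_n(&me->locked, __ATOMIC_ACQUIRE))
                            ;
                    }
                }

                void mcs_release(mcs_lock* lk, mcs_qnode* me) {
                    if (__atomic_load_n(&me->next, __ATOMIC_ACQUIRE) == NULL) {
                        /* no known successor: try to mark the lock free */
                        mcs_qnode* expected = me;
                        if (__atomic_compare_exchange_n(
                                &lk->tail, &expected, NULL, false,
                                __ATOMIC_ACQ_REL, __ATOMIC_ACQUIRE))
                            return;
                        /* a waiter is mid-enqueue; wait for the link */
                        while (__atomic_load_n(&me->next, __ATOMIC_ACQUIRE)
                               == NULL)
                            ;
                    }
                    /* FIFO handoff: the only remote write is to the
                       successor's flag */
                    __atomic_store_n(&me->next->locked, false,
                                     __ATOMIC_RELEASE);
                }

            --each CPU passes its own qnode to acquire/release; waiters
              spin on a flag in their own qnode, so contended
              acquisitions do not bounce a shared cache line around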
        --NOTE: with fewer cores, plain spinlocks are better. why?
          [under low contention, the test-and-test-and-set acquire is a
          single atomic operation, whereas MCS pays for queue
          bookkeeping on every acquire and release]
        --in fact, if there is high contention, performance will be poor
          either way; MCS locks just make it a little less poor. more on
          that in a bit.

    mitigation of (iv): more fine-grained locking
        --unfortunately, fine-grained locking leads to the next issue,
          which is also fundamental

B. Performance v. complexity trade-off

    --one big lock is often not great for performance, even when we use
      the fancier locks above
        --indeed, locking itself is the issue, which is one reason why
          Linux uses test-and-test-and-set spinlocks rather than MCS
          locks: the gain from the fancier lock is non-existent with a
          small number of CPUs and only marginal relative to the gain
          from restructuring the code
    --the fundamental issue with coarse-grained locking is that only one
      CPU at a time can execute anywhere in your code. if your code is
      called a lot, this may reduce the performance of an expensive
      multiprocessor to that of a single CPU.
        --if this happens inside the kernel, applications inherit the
          performance problems from the kernel
    --perhaps locking at smaller granularity would get higher
      performance through more concurrency
        --but how best to reduce lock granularity is a bit of an art
        --and unfortunately finer-grained locking makes incorrect code
          far more likely
        --and modularity also suffers (a theme we will return to)
    --two examples of the above issues:

    --Example 1:
        --imagine that every file in the file system is represented by a
          number, in a big table
        --you might inspect the file system code and notice that most
          operations use just one file or directory, leading you to have
          one lock per file
        --you could imagine the code implementing directories exporting
          various operations like:

            dir_lookup(d, name)
            dir_add(d, name, file_number)
            dir_del(d, name)

        --with fine-grained locking, these directory operations would
          *internally* acquire the lock on d, do their work, and release
          the lock
        --then higher-level code could implement operations like moving
          a file from one directory to another:

            move(olddir, oldname, newdir, newname) {
                file_number = dir_lookup(olddir, oldname)
                dir_del(olddir, oldname)
                dir_add(newdir, newname, file_number)
            }

        --unfortunately, this isn't great: there is a period of time
          when the file is visible in neither directory. fixing that
          requires that the directory locks _not_ be hidden inside the
          dir_* operations. so we need something like this:

            move(olddir, oldname, newdir, newname) {
                acquire(olddir.lock)
                acquire(newdir.lock)
                file_number = dir_lookup(olddir, oldname)
                dir_del(olddir, oldname)
                dir_add(newdir, newname, file_number)
                release(newdir.lock)
                release(olddir.lock)
            }

        --the above code is a bummer in that it exposes the
          implementation of directories to move(), but (if all you have
          is locks) you have to do it this way

    --Example 2: see filemap.c at the end of the handout for an extreme
      case

    --mitigation? unfortunately, there is no way around this trade-off.
        --worse, it is easy to get this stuff wrong: correct code is
          harder to write than buggy code
        --if you have fine-grained locking (i.e., you are trading away
          simplicity), then you are much more likely to encounter the
          two types of errors:
            (i) safety errors (race conditions)
            (ii) liveness errors (deadlocks, etc.)

    --***So what do people do?***

        --in app space:
            --don't worry too much about performance up front; that
              makes it easier to keep your code free of safety problems
              *and* liveness problems
            --if you are worrying about performance, make sure there are
              no race conditions. that is much more important than
              worrying about deadlock.
                --SAFETY FIRST.
                --it is almost always far better for your program to do
                  nothing than to do the wrong thing (example of using a
                  linear accelerator for radiation therapy: **way**
                  better not to subject the patient to the radiation
                  beam at all than to subject the patient to a beam that
                  is 100x too strong, leading to gruesome, atrocious
                  injuries)
                --if the program deadlocks, the evidence is intact, and
                  we can go back and see what the problem was
            --there are ways around deadlock, as we will discuss in a
              moment
            --but we shouldn't be too cavalier about liveness issues
              either, because they can lead to catastrophic cases.
              example: Mars Pathfinder (which was addressed; see below),
              but still.

        --in kernel space:
            --same thing, to some extent
            --but performance matters more in kernel space, so you are
              likely to be dealing with more complex issues
            --here again, SAFETY FIRST:
                --lock more aggressively
                --worry about deadlock later
            --not a satisfying answer, but there is no silver bullet for
              concurrency-related issues

    --by the way, if there is lots of contention, then the style and
      granularity of locks will not eliminate the problem. where does
      contention come from?
        --application requirements: lots of contention comes from
          applications that inherently require global resources or
          shared data
        --example of Apache: every CPU needs to write to a global
          logfile, which causes contention in the kernel

C. Deadlock

    --deadlock is more *likely* with finer-grained locking, but it is a
      definite *possibility* even with coarse-grained locking. that is,
      you always have to worry about deadlock.
    --see handout: simple example based on two locks (a minimal pthread
      sketch of the same pattern follows below)
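    --the handout's version is not reproduced here; the thread and mutex
      names below are made up. two threads acquire the same two mutexes
      in opposite orders, so each can end up holding the lock the other
      needs:

        #include <pthread.h>

        static pthread_mutex_t mutex_a = PTHREAD_MUTEX_INITIALIZER;
        static pthread_mutex_t mutex_b = PTHREAD_MUTEX_INITIALIZER;

        void* thread_1(void* arg) {
            pthread_mutex_lock(&mutex_a);   /* holds A ... */
            pthread_mutex_lock(&mutex_b);   /* ... waits for B */
            /* critical section */
            pthread_mutex_unlock(&mutex_b);
            pthread_mutex_unlock(&mutex_a);
            return NULL;
        }

        void* thread_2(void* arg) {
            pthread_mutex_lock(&mutex_b);   /* holds B ... */
            pthread_mutex_lock(&mutex_a);   /* ... waits for A: deadlock
                                               if thread_1 holds A */
            /* critical section */
            pthread_mutex_unlock(&mutex_a);
            pthread_mutex_unlock(&mutex_b);
            return NULL;
        }

        int main(void) {
            pthread_t t1, t2;
            pthread_create(&t1, NULL, thread_1, NULL);
            pthread_create(&t2, NULL, thread_2, NULL);
            pthread_join(t1, NULL);   /* with unlucky timing, this never
                                         returns */
            pthread_join(t2, NULL);
            return 0;
        }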
    --see handout: more complex example
        --M calls N
        --N waits on a condition
        --but say the condition can only become true if N is invoked
          through M
        --now the lock inside N is unlocked, but M remains locked; that
          is, no one is going to be able to enter M
        --lesson: it is dangerous to hold locks (M's mutex in the case
          on the handout) when crossing abstraction barriers

    --deadlocks without mutexes: the real issue is resources and how
      they are requested
        --non-computer example **[picture of bridge]**
            --the bridge only allows traffic in one direction
            --each section of the bridge can be viewed as a resource
            --if a deadlock occurs, it can be resolved if one car backs
              up (preempt resources and roll back)
            --several cars may have to be backed up if a deadlock occurs
            --starvation is possible
        --other example:
            --one thread/process grabs the disk and then tries to grab
              the scanner
            --another thread/process grabs the scanner and then tries to
              grab the disk

    --how do we get around deadlock?

    (i) ignore it: worry about it when it happens

    (ii) detect and recover: not great
        --could imagine attaching a debugger
            --not really viable for production software, but works well
              in development
        --the threads package can keep track of the resource-allocation
          graph
            --see book
            --for each lock acquired, order it with the other locks held
            --if a cycle occurs, abort with an error
            --this detects potential deadlocks even if they do not occur

    (iii) avoid deadlock algorithmically
        --banker's algorithm (see book)
            --very elegant but impractical
        --if you're using the banker's algorithm, the gameboard looks
          like this:

            ResourceMgr::Request(ResourceID resc, RequestorID thrd) {
                acquire(&mutex);
                assert(system in a safe state);
                while (state that would result from giving resc to thrd
                       is not safe) {
                    wait(&cv, &mutex);
                }
                update state by giving resc to thrd;
                assert(system in a safe state);
                release(&mutex);
            }

          now we need to determine whether a state is safe.... to do so,
          see book
        --disadvantages of the banker's algorithm:
            --requires every single resource request to go through a
              single broker
            --requires every thread to state its maximum resource needs
              up front. unfortunately, if threads are conservative and
              claim they need huge quantities of resources, the
              algorithm will reduce concurrency.

    (iv) prevent deadlock by careful coding
        --negate one of the four conditions:
            1. mutual exclusion
            2. hold-and-wait
            3. no preemption
            4. circular wait
        --can sort of negate 1:
            --put a queue in front of resources, like the printer
            --virtualize memory
        --not much hope of negating 2
        --can sort of negate 3:
            --consider physical memory: virtualized with VM, so we can
              take a physical page away and give it to another process!
        --what about negating #4? in practice, this is what people do
            --idea: partial order on locks
            --establish an order on all locks, and make sure that every
              thread acquires its locks in that order
            --for the files-and-directories example, the rule might be
              to lock files in order of file # (see the sketch just
              below)
            --see filemap.c
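            --here is a hedged sketch of that rule applied to the
              earlier move() example. this is NOT filemap.c's actual
              code: the Dir type, its inum (file number) and lock
              fields, and the function shapes are assumptions.

                typedef struct Dir {
                    int  inum;   /* this directory's file number */
                    Lock lock;   /* per-directory lock, as in Example 1 */
                } Dir;

                void move(Dir* olddir, const char* oldname,
                          Dir* newdir, const char* newname) {
                    Dir* first  = olddir;
                    Dir* second = newdir;

                    /* global rule: always lock the directory with the
                       smaller file number first, so every thread orders
                       these locks the same way and no acquisition cycle
                       can form */
                    if (newdir->inum < olddir->inum) {
                        first  = newdir;
                        second = olddir;
                    }

                    acquire(&first->lock);
                    if (second != first)   /* same dir: lock only once */
                        acquire(&second->lock);

                    int file_number = dir_lookup(olddir, oldname);
                    dir_del(olddir, oldname);
                    dir_add(newdir, newname, file_number);

                    if (second != first)
                        release(&second->lock);
                    release(&first->lock);
                }

            --the point: move() still has to see the directories' locks,
              but now every caller that needs both locks takes them in
              the same global order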
            --why this works:
                --we can view deadlock as a cycle in the resource
                  acquisition graph
                --a partial order implies no cycles and hence no
                  deadlock
            --three bummers:
                1. it is hard to represent condition variables inside
                   this framework; it works best only for locks
                2. the compiler can't check at compile time that the
                   partial order is being adhered to, because the
                   calling pattern is impossible to determine without
                   running the program (thanks to function pointers)
                3. picking and obeying the order on *all* locks requires
                   that modules make public their locking behavior, and
                   requires them to know about other modules' locking.
                   this can be painful and error-prone.

    (v) static and dynamic detection tools
        --see, for example, these citations, the citations therein, and
          the papers that cite them:

            Engler, D. and K. Ashcraft. RacerX: effective, static
            detection of race conditions and deadlocks. Proc. ACM
            Symposium on Operating Systems Principles (SOSP), October
            2003, pp. 237-252.
            http://portal.acm.org/citation.cfm?id=945468

            Savage, S., M. Burrows, G. Nelson, P. Sobalvarro, and T.
            Anderson. Eraser: a dynamic data race detector for
            multithreaded programs. ACM Transactions on Computer Systems
            (TOCS), Volume 15, No. 4, November 1997, pp. 391-411.
            http://portal.acm.org/citation.cfm?id=265927

          there is a long literature on this stuff
        --disadvantage of dynamic checking: it slows the program down
        --disadvantage of static checking: many false alarms (the tool
          says "there is a deadlock" when in fact there is none) or else
          missed problems

D. Starvation

    --a thread waits indefinitely (for example, because other threads
      keep acquiring the lock ahead of it)
    --livelock example: threads keep executing, but none makes progress

---------------------------------------------------------------------------

Admin details

--comments about the confusing nature of the labs: tell us, so that we
  can clarify
    --of course, this requires that you start early
    --but starting early has another benefit: you may actually wind up
      spending less total time if you spread the lab over two days than
      if you do it in one chunk. the reason is that you catch things on
      a second look, or after a night of sleep, that you did not catch
      at first.

---------------------------------------------------------------------------