Class 8
CS372H
9 February 2012

On the board
------------
1. Last time
2. Managing concurrency: protecting critical sections (cont'd.)
3. Condition variables
4. Semaphores
5. Monitors
6. Standards for concurrent programming
7. Getting practice with concurrent programming
8. Trade-offs and problems from locking

---------------------------------------------------------------------------

1. Last time

    --intro. to concurrency
    --what makes concurrency hard
    --clarify what happens when there isn't sequential consistency:
      operations that happen on another processor may appear to happen
      out of order [draw this]
        --see http://www.kernel.org/doc/Documentation/memory-barriers.txt
          for how Linux deals with memory barriers
    --last time, we said that memory barriers aren't needed on a
      uniprocessor system. this is not quite right. they are needed for
      handling interactions with memory-mapped I/O devices.
    --on the x86, the memory consistency model is hard to define, but it
      amounts to this: writes happen in program order (aside from string
      operations), but reads can be reordered.
        --see section 7.2 of the IA-32 System Programming Guide
    --how to provide atomicity: use hardware

2. Protecting critical sections

Recall gameboard:
    --we are trying to protect _critical sections_
    --we need some interface like lock()/unlock(), acquire()/release(),
      enter()/leave()
    --lots of names for this:
        --mutex_init(mutex_t* m), mutex_lock(mutex_t* m),
          mutex_unlock(mutex_t* m), ....
        --pthread_mutex_init(), pthread_mutex_lock(), ...
    --in each case, the semantics are that once a thread of execution is
      executing inside the critical section, no other thread of execution
      is executing there

Peterson's [algorithm: a software-only approach to mutual exclusion]

A. Disable interrupts [last time]

B. Spinlocks

    --How do spinlocks guarantee mutual exclusion?
      [see handout. draw picture of two CPUs, memory cell, and atomic
      exchange. a code sketch appears at the end of this section.]
    --Fine for quick operations in the kernel
    --Not good in user space, or even for waiting for long periods of
      time in the kernel
        --Question: why not use spinlocks for access to a disk drive?
        --answer: wastes CPU
    --note: it's unavoidable that we need hardware support because, at
      the lowest level, we're trying to decide which particular thread
      is doing something first
    --Wait, why do spinlocks have to disable and reenable interrupts?
        --consider memory shared between an interrupt handler and a
          thread-inside-the-kernel (e.g., the interrupt handler enqueues
          I/O events and the inside-kernel thread handles those events).
        --interrupt handlers do not themselves get interrupted. So if
          acquiring a spinlock *didn't* disable interrupts, we could get
          the following:
            --thread holds spinlock. interrupts enabled. interrupt
              happens. interrupt routine tries to acquire the spinlock;
              spins forever --> machine wedged
        --solution: turn off interrupts before trying to acquire the
          spinlock: spinlock.acquire() "pushes" interrupts (saves their
          current state), and spinlock.release() "pops" interrupts
          (restores their current state).
        --Perhaps confusingly, "interrupt" here should be viewed
          *abstractly*:
            --certainly if the code is executing inside the kernel,
              disabling and enabling interrupts means "turning off the
              processor's interrupts"
            --but if we're talking about a preemptive user-level
              threading package, then "interrupt" might just mean a
              non-deterministic timer signal that invokes the thread
              scheduler. in that case, "turning off interrupts" could
              mean "deregistering for the signal from the timer that
              would otherwise invoke the run-time" or else "recording
              the fact that signals were delivered but not acting on
              that fact until 'interrupts' are reenabled".
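    --to make this concrete, here is a minimal spinlock sketch built on
      an atomic exchange. [this is illustration, not the handout's code:
      spinlock_t, spin_acquire, and spin_release are made-up names, and
      GCC's __sync builtins stand in for raw xchg. a kernel version
      would also push/pop interrupts around these, as described above.]

        typedef struct {
            volatile int locked;    /* 0 = free, 1 = held */
        } spinlock_t;

        void spin_acquire(spinlock_t *l) {
            /* atomically { old = l->locked; l->locked = 1; return old; }
               if old was 0, we now hold the lock; if old was 1,
               someone else holds it, so try again */
            while (__sync_lock_test_and_set(&l->locked, 1) == 1)
                ;   /* spin */
        }

        void spin_release(spinlock_t *l) {
            __sync_lock_release(&l->locked);  /* store 0, release semantics */
        }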
in that case, "turning off interrupts" could mean "deregistering for the signal from the timer that would otherwise invoke the run-time" or else "record the fact that signals were delivered but don't act on that fact until 'interrupts' are reenabled". C. Mutexes --not going to cover in much detail --several important points: --mutexes are *everywhere*. when you program in the "real world", you will probably use these a lot. --mutexes are implemented in terms of a lower-level lock. we again have the pattern that the lower-level lock is protecting a list (more generally, data structure). here, the list (more generally, data structure) is the queue of threads waiting for the mutex (more generally, all of the fields of the mutex). Review how we got here --to deal with concurrency, need atomic operations --atomic operations ultimately requires hardware support --single CPU: turning off interrupts sometimes enough --multiple CPUs: use special hardware instructions --different options on different architectures --test_and_set() very common --on the x86, one uses xchg to implement test_and_set() 3. Condition variables A. Motivation --producer/consumer queue --very common paradigm. also called "bounded buffer": --producer puts things into a shared buffer --consumer takes them out --producer must wait if buffer is full; consumer must wait if buffer is empty --shows up everywhere --Soda machine: producer is delivery person, consumer is soda drinkers, shared buffer is the machine --DMA buffers --producer/consumer queue using mutexes (see handout, 3b) --what's the problem with that? --answer: a form of busy waiting. not quite as bad as spinlock, but the pattern is similar: thread keeps checking a condition -- (count == BUFFER_SIZE) or (COUNT == 0) -- until the respective condition is true. --It is convenient to break synchronization into two types: --*mutual exclusion*: allow only one thread to access a given set of shared state at a time --*scheduling constraints*: wait for some other thread to do something (finish a job, produce work, consume work, accept a connection, get bytes off the disk, etc.) B. Usage --API --void cond_init (Cond *, ...); --Initialize --void cond_wait(Cond *c, Mutex* m); --Atomically unlock m and sleep until c signaled --Then re-acquire m and resume executing --void cond_signal(Cond* c); --Wake one thread waiting on c [in some pthreads implementations, the analogous call wakes *at least* one thread waiting on c. Check the the documentation (or source code) to be sure of the semantics. But, actually, your implementation shouldn't change since you need to be prepared to be "woken" at any time, not just when another thread calls signal(). More on this below.] --void cond_broadcast(Cond* c); --Wake all threads waiting on c --QUESTION: Why must cond_wait both release the mutex and sleep? (see handout, 3c) --Answer: can get stuck waiting. Producer: while (count == BUFFER_SIZE) Producer: release() Consumer: acquire() Consumer: ..... Consumer: cond_signal(&nonfull) Producer: cond_wait(&nonfull) --Producer will never hear the signal! --QUESTION: Why not use "if"? (Why use "while"?) --Answer: we can get an interleaving like this: --The signal() puts the waiting thread on the ready list but doesn't run it --That now-ready thread is ready to acquire() the mutex (inside cond_wait()). 
4. Semaphores

    --Don't use these. We're mentioning them only for completeness and
      for historical reasons: they were the first general-purpose
      synchronization primitive, and they were the first synchronization
      primitive that Unix supported.
    --Introduced by Edsger Dijkstra in the late 1960s
        --Dijkstra was a highly notable figure in computer science who
          spent the latter part of his career here at UT
    --A semaphore is initialized with an integer, N
    --Two functions:
        --Down() and Up() [also known as P() and V()]
        --The guarantee is that Down() can return only N more times than
          Up() has been called (that is, # completed Down()s is at most
          # completed Up()s plus N)
        --Basically a counter that, when it reaches 0, causes a thread
          to sleep
    --Another way to say the same thing:
        --The semaphore holds a count
        --Down() is an atomic operation that waits for the count to
          become positive; it then decrements the count by 1
        --Up() is an atomic operation that increments the count by 1 and
          then wakes up a thread waiting on Down(), if any
          (these semantics are sketched in code below)
    --Don't use these! (Notice that Andrew Birrell [who is a Threading
      Ninja] doesn't even mention them in his paper.)
    --Problems:
        --semaphores are dual-purpose (used for both mutual exclusion
          and scheduling constraints), so the code is hard to read and
          hard to get right
        --semaphores have hidden internal state
        --getting a program right requires careful interleaving of
          "synchronization" and "mutex" semaphores
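    --for reference only (again: don't use semaphores in your code),
      here is a sketch of the semantics above, expressed with a mutex
      and a condition variable from the earlier API. [names are made up;
      this is illustration, not a library you should call.]

        typedef struct {
            Mutex m;
            Cond  positive;   /* signaled when count goes above 0 */
            int   count;      /* initialized to N */
        } Semaphore;

        void Down(Semaphore *s) {
            mutex_lock(&s->m);
            while (s->count == 0)               /* wait for count > 0 */
                cond_wait(&s->positive, &s->m);
            s->count--;
            mutex_unlock(&s->m);
        }

        void Up(Semaphore *s) {
            mutex_lock(&s->m);
            s->count++;
            cond_signal(&s->positive);  /* wake a waiter in Down(), if any */
            mutex_unlock(&s->m);
        }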
5. Monitors

Monitors = mutex + condition variables

[note: this year, we're not covering most of this in class, but
programming with monitors will be covered on the exams.]

    --High-level idea: an object (as in object-oriented systems)
        --in which methods do not execute concurrently; and
        --that has one or more condition variables
    --More detail
        --Every method call starts with acquire(&mutex) and ends with
          release(&mutex)
        --Technically, these acquire()/release() are invisible to the
          programmer because it is the programming language (i.e., the
          compiler+run-time) that is implementing the monitor
        --So, technically, a monitor is a programming language concept
            --The book follows this technical definition
            --But the technical definition isn't hugely useful because
              no programming language in widespread use has true
              monitors
                --Java has something close: a class in which every
                  method is set by the programmer to be "synchronized"
                  (i.e., implicitly protected by a mutex)
                --Not exactly a monitor because there's nothing forcing
                  every method to be synchronized
            --And we can *use* mutexes and condition variables to
              implement our own manual versions of monitors, though we
              have to be careful
    --Given the above, we are going to use the term "monitor" more
      loosely, to refer to both the technical definition and also a
      "manually constructed" monitor, wherein:
        --all method calls are protected by a mutex (that is, the
          programmer inserts those acquire()/release() on entry and exit
          from every procedure *inside* the object)
        --synchronization happens with condition variables whose
          associated mutex is the mutex that protects the method calls
    --In other words, we will use the term "monitor" to refer to the
      programming conventions that you should follow when building
      multithreaded applications
        --you must follow these conventions on lab T
    --Example: see handout, #4
    --RULE: acquire/release at beginning/end of functions
    --RULE: hold the lock when doing condition variable operations
        --Some (e.g., Birrell) will say: "for experts only, no need to
          hold the lock when signaling". IGNORE THIS. Putting the signal
          outside the lock is only a small performance optimization, and
          it is likely to lead you to write incorrect code.
        --to get credit in lab T, you must hold the associated mutex
          when doing a condition variable operation
    --Different styles of monitors:
        --Hoare-style: signal() immediately wakes the waiter
        --What the book calls Hansen-style: signal() is required to be
          the last statement in a procedure
        --What everyone else calls Hansen-style, and what we will use:
          signal() eventually wakes the waiter. Not an immediate
          transfer.
    --Can we replace SIGNAL with BROADCAST, given our monitor semantics?
      (Answer: yes, always.) Why?
        --the while() condition tests the needed invariant. the program
          doesn't progress past the while() unless the needed invariant
          is true.
        --result: spurious wake-ups are acceptable....
        --...which implies you can wake up a thread at any moment with
          no loss of correctness....
        --...which implies you can replace SIGNAL with BROADCAST [though
          it may hurt performance to have a bunch of needlessly awake
          threads contending for a mutex that they will then acquire()
          and release()].
    --RULE: a thread that is in wait() must be prepared to be restarted
      at any time, not just when another thread calls signal().
        --why? because the implementor of the threads and condition
          variables package *assumes* that the user of the threads
          package is doing while(){wait()}.
    --Can we replace BROADCAST with SIGNAL?
        --Answer: not always. Example (sketched in code below):
            --memory allocator
            --threads allocate and free memory in variable-sized chunks
            --if no memory is free, a thread waits on a condition
              variable
            --now posit:
                --two threads waiting to allocate chunks of memory
                --no memory free at all
                --then, a third thread frees 10,000 bytes
            --SIGNAL alone does the wrong thing: we need to awaken both
              threads
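    --here is that example as a manually constructed monitor, showing
      both the conventions (acquire on entry, release on exit,
      while(){wait()}, CV operations with the mutex held) and why
      BROADCAST is needed. [made-up names; the sketch tracks only a byte
      count rather than real chunks of memory.]

        typedef struct {
            Mutex mutex;       /* protects the fields below */
            Cond  mem_freed;   /* broadcast when memory is freed */
            int   bytes_free;
        } Allocator;

        void alloc(Allocator *a, int nbytes) {
            mutex_lock(&a->mutex);
            while (a->bytes_free < nbytes)  /* each waiter re-tests its
                                               own request size */
                cond_wait(&a->mem_freed, &a->mutex);
            a->bytes_free -= nbytes;
            mutex_unlock(&a->mutex);
        }

        void free_mem(Allocator *a, int nbytes) {
            mutex_lock(&a->mutex);
            a->bytes_free += nbytes;
            cond_broadcast(&a->mem_freed);  /* SIGNAL would wake only one
                                               waiter; after a 10,000-byte
                                               free, *both* small waiters
                                               may be satisfiable */
            mutex_unlock(&a->mutex);
        }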
6. Standards [and advice] for concurrent programming

A. Standards

    --see Mike D's "Programming With Threads", linked from lab T
        --You are required to follow this document
        --You will lose points (potentially many!) on the lab and on the
          exam if you stray from these standards
        --Note that in his example in section 4, there needs to be
          another line: right before mutex->release(), he should have:
          assert(invariants hold)
          (a generic illustration appears at the end of this section)
    --the primitives may seem strange, and the rules may seem arbitrary:
      why one thing and not another?
        --there is no absolute answer here
        --**However**, history has tested the approach that we're using.
          If you use the recommended primitives and follow their
          suggested use, you will find it easier to write correct code
        --For now, just take the recommended approaches as a given, and
          use them for a while. If you can come up with something better
          after that, by all means do so!
        --But please remember three things:
            a. lots of really smart people have thought really hard
               about the right abstractions, so a day or two of thinking
               about a new one or a new use is unlikely to yield an
               advance over the best practices.
            b. the consequences of getting code wrong can be atrocious.
               see for example:
               http://www.nytimes.com/2010/01/24/health/24radiation.html
               http://sunnyday.mit.edu/papers/therac.pdf
               http://en.wikipedia.org/wiki/Therac-25
            c. people who are confident about their abilities tend to
               perform *worse*, so if you are confident that you are a
               Threading and Concurrency Ninja and/or you think you
               truly understand how these things work, then you may wish
               to reevaluate.....
                --Dunning-Kruger effect
                --http://www.nytimes.com/2000/01/23/weekinreview/january-16-22-i-m-no-doofus-i-m-a-genius.html
    --MikeD stands on the desk when proclaiming the standards
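    --as a generic illustration of that assert-before-release
      convention [this is not Mike D's actual section-4 example; the
      List type and its invariant are made up]:

        #include <assert.h>

        typedef struct {
            Mutex mutex;
            int   hd, tl;        /* invariant: 0 <= hd <= tl */
        } List;

        void List_append(List *l) {
            mutex_lock(&l->mutex);
            l->tl++;                    /* ...modify the object's state... */
            assert(0 <= l->hd && l->hd <= l->tl);  /* invariants hold? */
            mutex_unlock(&l->mutex);    /* only now give up the mutex */
        }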
[NOTE: this year, we're not covering what's below in class, but
programming with monitors will be covered on the exams.]

B. Top-level piece of advice: SAFETY FIRST.

    --Locking at coarse grain is easiest to get right, so do that (one
      big lock for each big object or collection of them)
    --Don't worry about performance at first
    --In fact, don't even worry about liveness at first
        --In other words, don't view deadlock as a disaster
    --Key invariant: make sure your program never does the wrong thing

C. More detailed advice: design approach

[We will use item #5 on the handout as a case study.....]

Here's a four-step design approach:

1. Getting started:

    1a. Identify units of concurrency. Make each a thread with a go()
        method or main loop. Write down the actions a thread takes at a
        high level.

    1b. Identify shared chunks of state. Make each shared *thing* an
        object. Identify the methods on those objects, which should be
        the high-level actions made *by* threads *on* these objects.
        Plan to have these objects be monitors.

    1c. Write down the high-level main loop of each thread.

        Advice: stay high level here. Don't worry about synchronization
        yet. Let the objects do the work for you.

        Separate threads from objects. The code associated with a
        thread should not access shared state directly (and so there
        should be no access to locks/condition variables in the "main"
        procedure for the thread). Shared state and synchronization
        should be encapsulated in shared objects.

    --QUESTION: how does this apply to the example on the handout?
        --separate loops for producer() and consumer(), and
          synchronization happens inside MyBuffer

Now, for each object:

2. Write down the synchronization constraints on the solution. Identify
   the type of each constraint: mutual exclusion or scheduling. For
   scheduling constraints, ask, "when does a thread wait?"

    --NOTE: usually, the mutual exclusion constraint is satisfied by
      the fact that we're programming with monitors.
    --QUESTION: how does this apply to the example on the handout?
        --Only one thread can manipulate the buffer at a time (mutual
          exclusion constraint)
        --Producer must wait for the consumer to empty slots if all are
          full (scheduling constraint)
        --Consumer must wait for the producer to fill buffers if all
          are empty (scheduling constraint)

3. Create a lock or condition variable corresponding to each constraint.

    --QUESTION: how does this apply to the example on the handout?
        --Answer: we need a lock and two condition variables. But the
          lock was sort of a given from the monitor.

4. Write the methods, using locks and condition variables for
   coordination.

D. More advice

1. Don't manipulate synchronization variables or shared state variables
   in the code associated with a thread; do it with the code associated
   with a shared object.

    --Threads tend to have "main" loops. These loops tend to access
      shared objects. *However*, the "thread" piece of it should not
      include locks or condition variables. Instead, locks and CVs
      should be encapsulated in the shared objects.
    --Why?
        (a) Locks are for synchronizing across multiple threads. It
            doesn't make sense for one thread to "own" a lock.
        (b) Encapsulation -- the details of synchronization are internal
            details of a shared object. The caller should not know about
            these details. "Let the shared objects do the work."
    --Common confusion: trying to acquire and release locks inside the
      threads' code (i.e., not following this advice). Bad idea!
      Synchronization should happen within the shared objects. Mantra:
      "let the shared objects do the work".
    --Note: our first example of condition variables -- 4c on today's
      handout -- doesn't actually follow the advice, but that is in part
      so you can see all of the parts working together.

2. A different way to state what's above:

    --You want to decompose your problem into objects, as in the
      object-oriented style of programming.
    --Thus:
        (1) Shared object encapsulates code, synchronization variables,
            and state variables
        (2) Shared objects are separate from threads
    --Warning: most examples in the book talk about "thread 1's code"
      and "thread 2's code", etc. This is because most of the "classic"
      problems were studied before OO programming was widespread.

7. Practice with concurrent programming

[note: we're not covering this in class, but you're responsible for
programming with monitors, and this will be useful practice.]

    --example:
        --workers interact with a database
        --motivation: banking, airlines, etc.
        --readers never modify the database
        --writers read and modify data
        --using only a single mutex lock would be overly restrictive.
          Instead, we want
            --many readers at the same time
            --only one writer at a time
    --let's follow the concurrency advice from last time (and above).....

1. Getting started

    a. what are the units of concurrency? [readers/writers]
    b. what are the shared chunks of state? [the database]
    c. what does the main function look like?

        read()
            check in  -- wait until no writers are accessing the DB
            access DB
            check out -- wake up a waiting writer, if appropriate

        write()
            check in  -- wait until no readers or writers are accessing
                         the DB
            access DB
            check out -- wake up waiting readers or writers

2. and 3. Synchronization constraints and objects

    --a reader can access the DB when there are no writers
      (condition: okToRead)
    --a writer can access the DB when there are no other readers or
      writers (condition: okToWrite)
    --only one thread manipulates the shared variables at a time.
      NOTE: **this does not mean only one thread in the DB at a time**
      (mutex)

4. write the methods

    --inspiration required:

        int AR = 0;   // # active readers
        int AW = 0;   // # active writers
        int WR = 0;   // # waiting readers
        int WW = 0;   // # waiting writers

    --see the handout for the code (a sketch of its usual shape appears
      below)
    --QUESTION: why not just hold the lock all the way through
      "Execute req"? (Answer: the whole point was to provide more
      concurrency, i.e., to move away from exclusive access.)
    --QUESTION: what if we had shared locks? The implementation of
      shared locks is given on the handout.
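    --here is a sketch of the usual shape of that code, using the four
      counters above and the monitor conventions. [the handout's version
      is authoritative; this sketch makes a particular policy choice --
      writers get priority -- which is one reasonable option, not the
      only one.]

        Mutex mutex;                 /* protects AR, AW, WR, WW --
                                        NOT the database itself */
        Cond  okToRead, okToWrite;

        void read(void) {
            mutex_lock(&mutex);              /* check in */
            while (AW + WW > 0) {            /* writers get priority */
                WR++;
                cond_wait(&okToRead, &mutex);
                WR--;
            }
            AR++;
            mutex_unlock(&mutex);

            /* ... access DB: many readers can be here at once ... */

            mutex_lock(&mutex);              /* check out */
            AR--;
            if (AR == 0 && WW > 0)
                cond_signal(&okToWrite);     /* last reader wakes a writer */
            mutex_unlock(&mutex);
        }

        void write(void) {
            mutex_lock(&mutex);              /* check in */
            while (AR + AW > 0) {
                WW++;
                cond_wait(&okToWrite, &mutex);
                WW--;
            }
            AW++;
            mutex_unlock(&mutex);

            /* ... access DB: exclusive ... */

            mutex_lock(&mutex);              /* check out */
            AW--;
            if (WW > 0)
                cond_signal(&okToWrite);     /* prefer waiting writers */
            else if (WR > 0)
                cond_broadcast(&okToRead);   /* else wake all readers */
            mutex_unlock(&mutex);
        }

      note that read() and write() each hold the mutex only briefly, to
      check in and check out; the mutex protects the counters, not the
      database, which is what allows many readers in the DB at once.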
8. Trade-offs and problems from locking

A. Hard to get right (though the advice above helps).

    --example: doublecheck_alloc

B. Performance

    quick digression:
        --_dance hall_ architecture: any CPU can "dance with" any memory
          equally (equally slowly)
        --NUMA (non-uniform memory access): each CPU has fast access to
          some "close" memory; it is slower to access memory that is
          further away
            --AMD Opterons are like this
            --Intel CPUs are moving toward this
            --see the next-to-last page of the handout
        --two further choices: cache-coherent or not. in the former
          case, the hardware runs a cache coherence (cc) protocol to
          invalidate caches when a local change happens. in the latter
          case, it does not. the former case is far more common.

    let's assume ccNUMA machines...back to performance issues....

    our baseline is a test-and-test_and_set spinlock, which is basically
    what Linux uses:

        void acquire(Lock* lock) {
            pushcli();    /* disable ("push") interrupts */
            while (xchg_val(&lock->locked, 1) == 1) {
                /* lock was held: spin on a plain read until it looks
                   free, and only then retry the (expensive) atomic
                   exchange */
                while (lock->locked)
                    ;
            }
        }

        void release(Lock* lock) {
            xchg_val(&lock->locked, 0);
            popcli();     /* restore ("pop") interrupts */
        }

    the performance issues are:

    (i) fairness
        --one CPU gets the lock because the memory holding the "locked"
          variable is closer to that CPU
        --allegedly, Google had fairness problems on Opterons (I have no
          proof of this)

    (ii) lots of traffic over the memory bus: if there is lots of
         contention for the lock, then the cache coherence protocol
         creates lots of remote invalidations every time someone tries
         to do a lock acquisition

    (iii) cache line bounces (same reason as (ii))

    (iv) locking inherently reduces concurrency

    mitigation of (i)--(iii): better locks

        --MCS locks
            --see handout (and the sketch at the end of these notes)
            --advantages:
                --guarantees FIFO ordering of lock acquisitions
                  (addresses (i))
                --spins on a local variable only (addresses (ii), (iii))
                --[not discussing this, but: works equally well on
                  machines with and without coherent caches]
            --NOTE: with fewer cores, spinlocks are better. why?
            --In fact, if there is high contention, performance will be
              poor, though MCS locks will make it a little less poor.
              More on that in a bit.
        --futexes
            --see notes below or next time

    mitigation of (iv): more fine-grained locking
        --unfortunately, fine-grained locking leads to the next issue,
          which is also fundamental
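    --to make the "spin on a local variable" idea concrete, here is an
      MCS-lock sketch. [this is not the handout's code: the names are
      made up, GCC's __atomic builtins stand in for the atomic
      instructions, and a production version would use atomic
      loads/stores with explicit memory orders throughout.] each thread
      passes in its own node; the lock is a queue of waiters, and each
      waiter spins only on its own node's flag:

        typedef struct mcs_node {
            struct mcs_node *volatile next;
            volatile int              locked;
        } mcs_node;

        typedef struct {
            mcs_node *volatile tail;  /* last waiter in line, or NULL */
        } mcs_lock;

        void mcs_acquire(mcs_lock *l, mcs_node *me) {
            me->next = NULL;
            me->locked = 1;
            /* atomically append ourselves; pred is the old tail */
            mcs_node *pred =
                __atomic_exchange_n(&l->tail, me, __ATOMIC_ACQ_REL);
            if (pred != NULL) {
                pred->next = me;   /* predecessor will hand off to us */
                while (me->locked) /* spin on OUR OWN node (local memory), */
                    ;              /* not on a shared "locked" variable */
            }
        }

        void mcs_release(mcs_lock *l, mcs_node *me) {
            if (me->next == NULL) {
                /* no successor visible; if tail is still us, free the lock */
                mcs_node *expect = me;
                if (__atomic_compare_exchange_n(&l->tail, &expect, NULL, 0,
                                                __ATOMIC_ACQ_REL,
                                                __ATOMIC_ACQUIRE))
                    return;
                while (me->next == NULL)  /* successor is mid-enqueue; wait */
                    ;
            }
            me->next->locked = 0;  /* hand the lock to the next waiter */
        }

      usage: mcs_acquire(&l, &my_node); ...critical section...;
      mcs_release(&l, &my_node); -- note the FIFO hand-off, which is
      what gives the fairness property above.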