Class 7
CS372H
7 February 2012

On the board
------------
1. Last time
2. Intro. to concurrency
3. What makes concurrency hard to deal with?
4. Managing concurrency: provide atomicity
5. Managing concurrency: protecting critical sections

---------------------------------------------------------------------------

1. Last time

    --finished the shell
    --power of the fork/exec separation
    --discussion of Unix

2. Introduction to concurrency

    A. What are the sources of concurrency?

        1. Multiple processors but common memory

        2. Multiplexing in time, a/k/a scheduling: multiple processes or
           threads share memory (even if they are not running at the same
           time)

        3. Interrupts (actually a form of scheduling, and also used to
           implement scheduling)

            (a) from devices (e.g., disk finished, new data from the
                network, etc.)
            (b) from a periodic timer

        4. Anything else?

    B. Detour: the thread abstraction

        **This abstraction can be implemented at multiple levels**

            a. in-kernel, for the kernel itself
            b. in-kernel, for processes
            c. in-process, for user-level threads

        A thread is an independent unit of control accessing the same
        shared memory as the other threads.

        Roughly speaking, there are two reasons to use threads:

            --we want a single process to take advantage of multiple CPUs (*)

                --> but whether the process can in fact take advantage of
                    multiple CPUs depends on the implementation of threads

            --it is often very natural to structure some computation (or
              task or job or whatever) as multiple units of control that
              see the same memory

        A thread is a set of registers and a stack. Multiple threads share
        the same value of %cr3 (that is, the same page tables, and hence
        the same address space).

        For now, just describe a thread in terms of its API:

            tid thread_create (void (*fn) (void *), void *arg);
            void thread_exit ();
            void thread_join (tid thread);

        (A pthreads-flavored sketch of this API in action appears at the
        end of this section.)

        The execution of multiple threads is interleaved.

        Different kinds of threads, in terms of how "hard" it is to
        synchronize them:

            --non-preemptive threads: a thread executes exclusively until
              it makes a blocking call (e.g., a read() on a file)

            --preemptive threads: between any two instructions, another
              thread can run

                [how is this implemented? answer: with interrupts and
                context switches]

        Note that with multiple CPUs, we are inherently in a preemptive
        world: consider a thread T on CPU 0; another thread on CPU 1 can
        execute between any two instructions of T.

        We may talk later about the implementation of threads.

    C. How do various entities "observe" concurrency?

        1. the kernel

            --it's executing in one area, and then an interrupt comes in,
              so now it's executing somewhere else
            --multiple threads run inside the kernel

        2. user processes

            --multiple threads
            --sharing memory with another process
            --even signals, fault handling, etc. can be thought of as
              "observing" concurrency

        [One way, then, to think about threads is that they are an
        abstraction provided by the OS to expose concurrency, and possibly
        parallel hardware resources, to user processes.]

    D. Detour: context switches

        Question: When does the kernel switch which process or thread is
        running? (This is called a context switch.)

        Answer: under three scenarios:

            --interrupt (from a device or the timer)

            --trap (the running process performs a system call)

                reason: a syscall by one process can result in that
                process being put to sleep (e.g., if it does a read from
                the disk) or in some other process becoming runnable
                (e.g., if it wakes up another process with an
                inter-process message)

                confusingly, the 'int' instruction on the x86 generates
                traps

            --exception

                --divide by 0
                --page fault
                --....

        Question: How does the kernel switch which process is running?
        (This, too, is part of a context switch.)

        Answer: at a high level, the kernel implements it by saving the
        IP, the registers, and the VM translations of the running process.
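    [Aside: a minimal sketch of the thread API in action, assuming a POSIX
    threads environment. The mapping from the course API
    (thread_create/thread_exit/thread_join) onto
    pthread_create/pthread_exit/pthread_join is an assumption made here for
    illustration; compile with "gcc -pthread".]

        #include <pthread.h>
        #include <stdio.h>

        static int shared = 0;        /* same address space: visible to all threads */

        static void *worker(void *arg)
        {
            int id = *(int *)arg;
            shared++;                 /* unsynchronized: a race once threads interleave */
            printf("thread %d sees shared = %d\n", id, shared);
            return NULL;              /* returning here is like thread_exit() */
        }

        int main(void)
        {
            pthread_t t1, t2;
            int id1 = 1, id2 = 2;

            pthread_create(&t1, NULL, worker, &id1);  /* ~ thread_create(worker, &id1) */
            pthread_create(&t2, NULL, worker, &id2);

            pthread_join(t1, NULL);                   /* ~ thread_join(tid) */
            pthread_join(t2, NULL);

            printf("final: shared = %d\n", shared);
            return 0;
        }

    The output ordering (and, in principle, even the final value of
    'shared') can differ from run to run, which previews the next section.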
3. What makes concurrency hard to deal with?

    A. It is hard or impossible to reason about all possible interleavings

        --see handout; panels 1, 2, 3:

            2a: x = 1 or x = 2
            2b: x = 13 or x = 25
            2c: x = 1 or x = 2 or x = 3
            3:  incorrect list structure
            4:  incorrect count in buffer

        --all of these are called *race conditions*; not all of them are
          errors, though

        --the worst part of errors from race conditions is that a program
          may work fine most of the time and only occasionally show
          problems. why? (because the instructions of the various threads
          or processes or whatever get interleaved in a non-deterministic
          order.)

        --and it's worse than that, because inserting debugging code may
          change the timing so that the bug doesn't show up

    B. Sequential consistency is not always in effect

        --see panel 4

        --the correct answers are:

            --don't know, for (a)--(c)

            --reason: it depends on the hardware. if the hardware provides
              sequential consistency, then the answer to each is "no".

        (i) Definition of sequential consistency: "The result of any
            execution is the same as if the operations of all the
            processors were executed in some sequential order, and the
            operations of each individual processor appear in this
            sequence in the order specified by its program."

            [citation: L. Lamport. How to Make a Multiprocessor Computer
            that Correctly Executes Multiprocess Programs. _IEEE
            Transactions on Computers_, Volume C-28, Number 9, September
            1979, pp. 690-691.
            http://research.microsoft.com/en-us/um/people/lamport/pubs/multi.pdf]

        Basically this means:

            --maintaining program order on each individual processor

            --ensuring that writes to each memory location (viewed
              separately) happen in the order in which they are issued

        --NOTE: without SC, multiple CPUs can be "worse" than preemptive
          threads, because a program may see results that cannot occur
          with *any* interleaving on one CPU.

    C. Why don't we always have sequential consistency?

        --S.C. thwarts hardware optimizations, so hardware may not want to
          support it

        --S.C. complicates write buffering

        --S.C. means overlapping write operations cannot be re-ordered, so
          the following optimizations are out:

            --concurrent writes to different memory modules
            --coalescing writes to the same cache line

        --S.C. complicates non-blocking reads

            --for instance, speculative prefetching

        --cache coherence becomes more expensive

        --S.C. thwarts compiler optimizations:

            --moving code around
            --caching values in registers
            --common subexpression elimination (could cause memory to be
              read fewer times)
            --re-arranging loops for better cache performance
            --software pipelining

        --what does the x86 do?

            --x86 supports multiple consistency/caching models

                --Memory Type Range Registers (MTRR) specify consistency
                  for ranges of physical memory (e.g., the frame buffer)

                --the Page Attribute Table (PAT) allows control for each
                  4KB page

                --choices include:

                    WB: write-back caching (the default)
                    WT: write-through caching (all writes go to memory)
                    UC: uncacheable (for device memory)
                    WC: write-combining (weak consistency, no caching)

            --some instructions have weaker consistency

                --string instructions
                --special "non-temporal" instructions that bypass the
                  cache

            --under x86 WB consistency, a processor can read its own
              writes early

    D. Wait, if we don't have S.C., what do we do?

        --the LOCK prefix

            --the LOCK prefix makes a memory instruction atomic (by
              locking the bus for the duration of the instruction, which
              is expensive)
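    [Aside: a sketch of the kind of re-ordering that can occur without
    fences, and of a full barrier restoring the intuitive outcome. This
    assumes GCC or Clang on x86-64; __sync_synchronize() is a compiler
    builtin that emits a full memory barrier (typically MFENCE on x86) and
    also prevents the compiler from re-ordering across it. The test
    program itself is illustrative, not taken from the handout.]

        #include <pthread.h>
        #include <stdio.h>

        /* the classic "store buffering" test: each thread writes one
           variable and then reads the other */
        static volatile int x, y, r1, r2;

        static void *t1(void *arg)
        {
            (void)arg;
            x = 1;
            __sync_synchronize();   /* without this, r1 == r2 == 0 can occur */
            r1 = y;
            return NULL;
        }

        static void *t2(void *arg)
        {
            (void)arg;
            y = 1;
            __sync_synchronize();
            r2 = x;
            return NULL;
        }

        int main(void)
        {
            for (int i = 0; i < 100000; i++) {
                pthread_t a, b;
                x = 0; y = 0;
                pthread_create(&a, NULL, t1, NULL);
                pthread_create(&b, NULL, t2, NULL);
                pthread_join(a, NULL);
                pthread_join(b, NULL);
                if (r1 == 0 && r2 == 0)
                    printf("iteration %d: r1 == r2 == 0 (not S.C.!)\n", i);
            }
            return 0;
        }

    With the fences removed, r1 == r2 == 0 is an outcome that no
    interleaving on a single CPU can produce, yet it can appear on a real
    multiprocessor: each CPU may read the other's variable before its own
    store has drained from its store buffer. (It may take many runs to
    observe; thread-creation overhead keeps the window small.)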
            --all locked instructions are totally ordered

            --other memory instructions cannot be re-ordered with respect
              to locked ones

            --xchg is always locked (no prefix needed)

        --fence instructions (also called memory barriers) that prevent
          re-ordering:

            LFENCE -- can't be reordered with reads (or later writes)
            SFENCE -- can't be reordered with writes
            MFENCE -- can't be reordered with reads or writes

            --MFENCE: all memory operations before the barrier appear to
              all processors to have executed before all operations after
              the barrier

        (A sketch of the effect of these fences follows this section.)

4. Managing concurrency: provide atomicity

    --first attempt to deal with race conditions: make the needed
      operations atomic

    --how?

    A. A single-instruction add?

        'count' is in memory (that is what the example in #4 stipulates).
        assume that %ecx holds the address of 'count'.

        --then, can we use the x86 instruction addl? for instance:

            addl $1, (%ecx)        ; count++

        --so it looks like we can implement count++/count-- with one
          instruction?

        --so we're safe?

        --no: this is not atomic on a multiprocessor!

            --we will experience the same race condition, just at the
              hardware level

    B. How about using the x86 LOCK prefix?

        --we can make read-modify-write instructions atomic by preceding
          them with "LOCK". examples of such instructions are: XADD,
          CMPXCHG, INC, DEC, NOT, NEG, ADD, SUB, ... (when their
          destination operand refers to memory)

        --but using LOCK is very expensive (it flushes processor caches)
          and is not a "general-purpose abstraction"

            --it applies to only one instruction: what if we need to
              execute three or four instructions as a unit?

            --the compiler won't generate it by default; it assumes you
              don't want the penalty

    C. Critical sections

        --place count++ and count-- in critical sections

        --protect critical sections from concurrent execution

        --now we need a solution to the _critical section_ problem

        --the solution must satisfy three rules:

            1. mutual exclusion
                only one thread can be in the critical section at a time

            2. progress
                if no thread is executing in the critical section, one of
                the threads trying to enter a given critical section will
                eventually get in

            3. bounded waiting
                once a thread T starts trying to enter the critical
                section, there is a bound on the number of other threads
                that may enter the critical section before T enters

        --note progress vs. bounded waiting:

            --if no thread can enter the C.S., we don't have progress

            --if thread A is waiting to enter the C.S. while B repeatedly
              leaves and re-enters the C.S. ad infinitum, we don't have
              bounded waiting

        --the game plan is that we're now going to build primitives to
          protect critical sections

5. Managing concurrency: protecting critical sections

    --Peterson's algorithm....

        --see the book

        --*if* there is sequential consistency, then Peterson's algorithm
          satisfies mutual exclusion, progress, and bounded waiting

        --but it is expensive and not encapsulated

    --High-level view:

        --what we want: lock()/unlock() or enter()/leave() or
          acquire()/release()

            --lots of names for the same idea

            --mutex_init(mutex_t* m), mutex_lock(mutex_t* m),
              mutex_unlock(mutex_t* m), ....

            --pthread_mutex_init(), pthread_mutex_lock(), ...

        --in each case, the semantics are that once a thread of execution
          is executing inside the critical section, no other thread of
          execution is executing there

    --How to implement locks/mutexes/etc.?

        --we'll probably need hardware support (a spinlock sketch appears
          at the end of these notes)

        A. Disable interrupts

            --only works on a uniprocessor system

[thanks to David Mazieres for content in portions of this lecture.]
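    [Aside: a minimal spinlock sketch, as a preview of the
    hardware-supported approach. It uses GCC's __sync_lock_test_and_set(),
    which on x86 typically compiles to an xchg (implicitly locked). This
    is an illustrative sketch, not a definitive implementation; compile
    with "gcc -pthread".]

        #include <pthread.h>
        #include <stdio.h>

        typedef struct { volatile int locked; } spinlock_t;  /* 0 = free, 1 = held */

        static void spin_lock(spinlock_t *l)
        {
            /* atomically swap 1 into l->locked; if the old value was 1,
               someone else holds the lock, so keep trying */
            while (__sync_lock_test_and_set(&l->locked, 1))
                ;  /* spin */
        }

        static void spin_unlock(spinlock_t *l)
        {
            __sync_lock_release(&l->locked);  /* store 0, with release semantics */
        }

        static spinlock_t lk;
        static int count;

        static void *worker(void *arg)
        {
            (void)arg;
            for (int i = 0; i < 100000; i++) {
                spin_lock(&lk);    /* enter the critical section */
                count++;           /* count++ is now protected */
                spin_unlock(&lk);  /* leave the critical section */
            }
            return NULL;
        }

        int main(void)
        {
            pthread_t t1, t2;
            pthread_create(&t1, NULL, worker, NULL);
            pthread_create(&t2, NULL, worker, NULL);
            pthread_join(t1, NULL);
            pthread_join(t2, NULL);
            printf("count = %d (expect 200000)\n", count);
            return 0;
        }

    Note that this lock provides mutual exclusion and progress but not
    bounded waiting: an unlucky thread can lose the race to re-acquire the
    lock indefinitely.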