Class 7
CS372H
7 February 2012

On the board
------------
1. Last time
2. Intro. to concurrency
3. What makes concurrency hard to deal with?
4. Managing concurrency: provide atomicity
5. Managing concurrency: protecting critical sections

---------------------------------------------------------------------------

1. Last time

    --finished the shell
    --power of the fork/exec separation
    --discussion of Unix

2. Introduction to concurrency

    A. What are the sources of concurrency?

        1. Multiple processors but common memory

        2. Multiplexing in time, a/k/a scheduling: multiple processes or
           threads share memory (even if they are not running at the same
           time)

        3. Interrupts (actually a form of scheduling, and also used to
           implement scheduling)

            (a) from devices (e.g., disk finished, new data from the
                network, etc.)
            (b) from a periodic timer

        4. Anything else?

    B. Detour: the thread abstraction

        **This abstraction can be implemented at multiple levels**

            a. in-kernel, for the kernel itself
            b. in-kernel, for processes
            c. in-process, for user-level threads

        A thread is an independent unit of control accessing the same
        shared memory as the other threads.

        Roughly speaking, there are two reasons to use threads:

            --we want a single process to take advantage of multiple CPUs (*)

                --> but whether the process can in fact take advantage of
                    multiple CPUs depends on the implementation of threads

            --it is often very natural to structure some computation (or
              task or job or whatever) as multiple units of control that
              see the same memory

        A thread is a set of registers and a stack. Multiple threads share
        the same value of %cr3 (that is, the same page tables, and hence
        the same address space).

        For now, just describe a thread in terms of its API:

            tid thread_create (void (*fn) (void *), void *arg);
            void thread_exit ();
            void thread_join (tid thread);

        (A pthreads-flavored sketch of this API in action appears at the
        end of this section.)

        The execution of multiple threads is interleaved.

        Different kinds of threads, in terms of how "hard" it is to
        synchronize them:

            --non-preemptive threads: a thread executes exclusively until
              it makes a blocking call (e.g., a read() on a file)

            --preemptive threads: between any two instructions, another
              thread can run

                [how is this implemented? answer: with interrupts and
                context switches]

        Note that with multiple CPUs, we are inherently in a preemptive
        world: consider a thread T on CPU 0; another thread on CPU 1 can
        execute between any two instructions of T.

        We may talk later about the implementation of threads.

    C. How do various entities "observe" concurrency?

        1. the kernel

            --it's executing in one area, and then an interrupt comes in,
              so now it's executing somewhere else
            --multiple threads run inside the kernel

        2. user processes

            --multiple threads
            --sharing memory with another process
            --even signals, fault handling, etc. can be thought of as
              "observing" concurrency

        [One way, then, to think about threads is that they are an
        abstraction provided by the OS to expose concurrency, and possibly
        parallel hardware resources, to user processes.]

    D. Detour: context switches

        Question: When does the kernel switch which process or thread is
        running? (This is called a context switch.)

        Answer: under three scenarios:

            --interrupt (from a device or the timer)

            --trap (the running process performs a system call)

                reason: a syscall by one process can result in that
                process being put to sleep (e.g., if it does a read from
                the disk) or in some other process becoming runnable
                (e.g., if it wakes up another process with an
                inter-process message)

                confusingly, the 'int' instruction on the x86 generates
                traps

            --exception

                --divide by 0
                --page fault
                --....

        Question: How does the kernel switch which process is running?
        (This, too, is part of a context switch.)

        Answer: at a high level, the kernel implements it by saving the
        IP, the registers, and the VM translations of the running process.
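    [Aside: a minimal sketch of the thread API in action, assuming a POSIX
    threads environment. The mapping from the course API
    (thread_create/thread_exit/thread_join) onto
    pthread_create/pthread_exit/pthread_join is an assumption made here for
    illustration; compile with "gcc -pthread".]

        #include <pthread.h>
        #include <stdio.h>

        static int shared = 0;        /* same address space: visible to all threads */

        static void *worker(void *arg)
        {
            int id = *(int *)arg;
            shared++;                 /* unsynchronized: a race once threads interleave */
            printf("thread %d sees shared = %d\n", id, shared);
            return NULL;              /* returning here is like thread_exit() */
        }

        int main(void)
        {
            pthread_t t1, t2;
            int id1 = 1, id2 = 2;

            pthread_create(&t1, NULL, worker, &id1);  /* ~ thread_create(worker, &id1) */
            pthread_create(&t2, NULL, worker, &id2);

            pthread_join(t1, NULL);                   /* ~ thread_join(tid) */
            pthread_join(t2, NULL);

            printf("final: shared = %d\n", shared);
            return 0;
        }

    The output ordering (and, in principle, even the final value of
    'shared') can differ from run to run, which previews the next section.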
3. What makes concurrency hard to deal with?

    A. It is hard or impossible to reason about all possible interleavings

        --see handout; panels 1, 2, 3:

            2a: x = 1 or x = 2
            2b: x = 13 or x = 25
            2c: x = 1 or x = 2 or x = 3
            3:  incorrect list structure
            4:  incorrect count in buffer

        --all of these are called *race conditions*; not all of them are
          errors, though

        --the worst part of errors from race conditions is that a program
          may work fine most of the time and only occasionally show
          problems. why? (because the instructions of the various threads
          or processes or whatever get interleaved in a non-deterministic
          order.)

        --and it's worse than that, because inserting debugging code may
          change the timing so that the bug doesn't show up

    B. Sequential consistency is not always in effect

        --see panel 4

        --the correct answers are:

            --don't know, for (a)--(c)

            --reason: it depends on the hardware. if the hardware provides
              sequential consistency, then the answer to each is "no".

        (i) Definition of sequential consistency: "The result of any
            execution is the same as if the operations of all the
            processors were executed in some sequential order, and the
            operations of each individual processor appear in this
            sequence in the order specified by its program."

            [citation: L. Lamport. How to Make a Multiprocessor Computer
            that Correctly Executes Multiprocess Programs. _IEEE
            Transactions on Computers_, Volume C-28, Number 9, September
            1979, pp. 690-691.
            http://research.microsoft.com/en-us/um/people/lamport/pubs/multi.pdf]

        Basically this means:

            --maintaining program order on each individual processor

            --ensuring that writes to each memory location (viewed
              separately) happen in the order in which they are issued

        --NOTE: without SC, multiple CPUs can be "worse" than preemptive
          threads, because a program may see results that cannot occur
          with *any* interleaving on one CPU.

    C. Why don't we always have sequential consistency?

        --S.C. thwarts hardware optimizations, so hardware may not want to
          support it

        --S.C. complicates write buffering

        --S.C. means overlapping write operations cannot be re-ordered, so
          the following optimizations are out:

            --concurrent writes to different memory modules
            --coalescing writes to the same cache line

        --S.C. complicates non-blocking reads

            --for instance, speculative prefetching

        --cache coherence becomes more expensive

        --S.C. thwarts compiler optimizations:

            --moving code around
            --caching values in registers
            --common subexpression elimination (could cause memory to be
              read fewer times)
            --re-arranging loops for better cache performance
            --software pipelining

        --what does the x86 do?

            --x86 supports multiple consistency/caching models

                --Memory Type Range Registers (MTRR) specify consistency
                  for ranges of physical memory (e.g., the frame buffer)

                --the Page Attribute Table (PAT) allows control for each
                  4KB page

                --choices include:

                    WB: write-back caching (the default)
                    WT: write-through caching (all writes go to memory)
                    UC: uncacheable (for device memory)
                    WC: write-combining (weak consistency, no caching)

            --some instructions have weaker consistency

                --string instructions
                --special "non-temporal" instructions that bypass the
                  cache

            --under x86 WB consistency, a processor can read its own
              writes early

    D. Wait, if we don't have S.C., what do we do?

        --the LOCK prefix

            --the LOCK prefix makes a memory instruction atomic (by
              locking the bus for the duration of the instruction, which
              is expensive)
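    [Aside: a sketch of the kind of re-ordering that can occur without
    fences, and of a full barrier restoring the intuitive outcome. This
    assumes GCC or Clang on x86-64; __sync_synchronize() is a compiler
    builtin that emits a full memory barrier (typically MFENCE on x86) and
    also prevents the compiler from re-ordering across it. The test
    program itself is illustrative, not taken from the handout.]

        #include <pthread.h>
        #include <stdio.h>

        /* the classic "store buffering" test: each thread writes one
           variable and then reads the other */
        static volatile int x, y, r1, r2;

        static void *t1(void *arg)
        {
            (void)arg;
            x = 1;
            __sync_synchronize();   /* without this, r1 == r2 == 0 can occur */
            r1 = y;
            return NULL;
        }

        static void *t2(void *arg)
        {
            (void)arg;
            y = 1;
            __sync_synchronize();
            r2 = x;
            return NULL;
        }

        int main(void)
        {
            for (int i = 0; i < 100000; i++) {
                pthread_t a, b;
                x = 0; y = 0;
                pthread_create(&a, NULL, t1, NULL);
                pthread_create(&b, NULL, t2, NULL);
                pthread_join(a, NULL);
                pthread_join(b, NULL);
                if (r1 == 0 && r2 == 0)
                    printf("iteration %d: r1 == r2 == 0 (not S.C.!)\n", i);
            }
            return 0;
        }

    With the fences removed, r1 == r2 == 0 is an outcome that no
    interleaving on a single CPU can produce, yet it can appear on a real
    multiprocessor: each CPU may read the other's variable before its own
    store has drained from its store buffer. (It may take many runs to
    observe; thread-creation overhead keeps the window small.)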
            --all locked instructions are totally ordered

            --other memory instructions cannot be re-ordered with respect
              to locked ones

            --xchg is always locked (no prefix needed)

        --fence instructions (also called memory barriers) that prevent
          re-ordering:

            LFENCE -- can't be reordered with reads (or later writes)
            SFENCE -- can't be reordered with writes
            MFENCE -- can't be reordered with reads or writes

            --MFENCE: all memory operations before the barrier appear to
              all processors to have executed before all operations after
              the barrier

        (A sketch of the effect of these fences follows this section.)

4. Managing concurrency: provide atomicity

    --first attempt to deal with race conditions: make the needed
      operations atomic

    --how?

    A. A single-instruction add?

        'count' is in memory (that is what the example in #4 stipulates).
        assume that %ecx holds the address of 'count'.

        --then, can we use the x86 instruction addl? for instance:

            addl $1, (%ecx)        ; count++

        --so it looks like we can implement count++/count-- with one
          instruction?

        --so we're safe?

        --no: this is not atomic on a multiprocessor!

            --we will experience the same race condition, just at the
              hardware level

    B. How about using the x86 LOCK prefix?

        --we can make read-modify-write instructions atomic by preceding
          them with "LOCK". examples of such instructions are: XADD,
          CMPXCHG, INC, DEC, NOT, NEG, ADD, SUB, ... (when their
          destination operand refers to memory)

        --but using LOCK is very expensive (it flushes processor caches)
          and is not a "general-purpose abstraction"

            --it applies to only one instruction: what if we need to
              execute three or four instructions as a unit?

            --the compiler won't generate it by default; it assumes you
              don't want the penalty

    C. Critical sections

        --place count++ and count-- in critical sections

        --protect critical sections from concurrent execution

        --now we need a solution to the _critical section_ problem

        --the solution must satisfy three rules:

            1. mutual exclusion
                only one thread can be in the critical section at a time

            2. progress
                if no thread is executing in the critical section, one of
                the threads trying to enter a given critical section will
                eventually get in

            3. bounded waiting
                once a thread T starts trying to enter the critical
                section, there is a bound on the number of other threads
                that may enter the critical section before T enters

        --note progress vs. bounded waiting:

            --if no thread can enter the C.S., we don't have progress

            --if thread A is waiting to enter the C.S. while B repeatedly
              leaves and re-enters the C.S. ad infinitum, we don't have
              bounded waiting

        --the game plan is that we're now going to build primitives to
          protect critical sections

5. Managing concurrency: protecting critical sections

    --Peterson's algorithm....

        --see the book

        --*if* there is sequential consistency, then Peterson's algorithm
          satisfies mutual exclusion, progress, and bounded waiting

        --but it is expensive and not encapsulated

    --High-level view:

        --what we want: lock()/unlock() or enter()/leave() or
          acquire()/release()

            --lots of names for the same idea

            --mutex_init(mutex_t* m), mutex_lock(mutex_t* m),
              mutex_unlock(mutex_t* m), ....

            --pthread_mutex_init(), pthread_mutex_lock(), ...

        --in each case, the semantics are that once a thread of execution
          is executing inside the critical section, no other thread of
          execution is executing there

    --How to implement locks/mutexes/etc.?

        --we'll probably need hardware support (a spinlock sketch appears
          at the end of these notes)

        A. Disable interrupts

            --only works on a uniprocessor system

[thanks to David Mazieres for content in portions of this lecture.]
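    [Aside: a minimal spinlock sketch, as a preview of the
    hardware-supported approach. It uses GCC's __sync_lock_test_and_set(),
    which on x86 typically compiles to an xchg (implicitly locked). This
    is an illustrative sketch, not a definitive implementation; compile
    with "gcc -pthread".]

        #include <pthread.h>
        #include <stdio.h>

        typedef struct { volatile int locked; } spinlock_t;  /* 0 = free, 1 = held */

        static void spin_lock(spinlock_t *l)
        {
            /* atomically swap 1 into l->locked; if the old value was 1,
               someone else holds the lock, so keep trying */
            while (__sync_lock_test_and_set(&l->locked, 1))
                ;  /* spin */
        }

        static void spin_unlock(spinlock_t *l)
        {
            __sync_lock_release(&l->locked);  /* store 0, with release semantics */
        }

        static spinlock_t lk;
        static int count;

        static void *worker(void *arg)
        {
            (void)arg;
            for (int i = 0; i < 100000; i++) {
                spin_lock(&lk);    /* enter the critical section */
                count++;           /* count++ is now protected */
                spin_unlock(&lk);  /* leave the critical section */
            }
            return NULL;
        }

        int main(void)
        {
            pthread_t t1, t2;
            pthread_create(&t1, NULL, worker, NULL);
            pthread_create(&t2, NULL, worker, NULL);
            pthread_join(t1, NULL);
            pthread_join(t2, NULL);
            printf("count = %d (expect 200000)\n", count);
            return 0;
        }

    Note that this lock provides mutual exclusion and progress but not
    bounded waiting: an unlucky thread can lose the race to re-acquire the
    lock indefinitely.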