Class 23
CS 480-008
21 April 2016

On the board
------------

1. Last time
2. Finish peer-to-peer
3. Concurrency intro
4. Managing concurrency
5. Spinlocks
6. MCS locks
7. Non-scalable locks are dangerous

---------------------------------------------------------------------------

1. Last time

  --FDS
  --began P2P, DHTs

2. Peer-to-peer

How do DHTs work?

  Scalable DHT lookup:
    Key/value store spread over millions of nodes
    Typical DHT interface:
      put(key, value)
      get(key) -> value
    loose consistency: a get(k) will likely see a previous put(k),
      but there is no guarantee
    loose guarantees about keeping data alive

  Why is it hard?
    Millions of participating nodes
    Could broadcast/flood requests -- but that's too many messages
    Every node could know about every other node
      Then hashing is easy
      But keeping a million-node table up to date is hard
    We want modest state, and a modest number of messages per lookup

  Basic idea
    Impose a data structure (e.g., a tree) over the nodes
      Each node has references to only a few other nodes
    Lookups traverse the data structure -- "routing"
      I.e., hop from node to node
    The DHT should route a get() to the same node as the previous put()

Example: the "Chord" peer-to-peer lookup system
  By Stoica, Morris, Karger, Kaashoek, and Balakrishnan; 2001

Chord's ID-space topology
  Ring: all IDs are 160-bit numbers, viewed in a ring.
  Each node has an ID, randomly chosen.

Assignment of key IDs to node IDs?
  A key is stored on the first node whose ID is equal to or greater
    than the key's ID (wrapping around the ring).
  Closeness is defined as the "clockwise distance."
  If node and key IDs are uniform, we get reasonable load balance.
    So key IDs should be hashes (e.g., the bittorrent infohash).
  ** This is one way to instantiate _consistent hashing_ **

Basic routing -- correct but slow
  The query is at some node.
  That node needs to forward the query to a node "closer" to the key.
    If we keep moving the query closer, eventually we'll win.
  Each node knows its "successor" on the ring.
    n.lookup(k):
      if n < k <= n.successor
        return n.successor
      else
        forward to n.successor
  I.e., forward the query in a clockwise direction until done.
  n.successor must be correct!
    Otherwise we may skip over the responsible node,
    and get(k) won't see data inserted by put(k).

Forwarding through the successor is slow
  The data structure is a linked list: O(n) hops per lookup.
  Can we make it more like a binary search?
    We need to be able to halve the distance at each step.
  log(n) "finger table" routing:
    Keep track of nodes exponentially further away:
    New state: f[i] contains the successor of n + 2^i
    n.lookup(k):
      if n < k <= n.successor:
        return successor
      else:
        n' = closest_preceding_node(k)   -- in f[]
        forward to n'
    (A sketch in C appears at the end of this section.)
  For a six-bit ID space, node 8's finger table might look like this:
    0: 14
    1: 14
    2: 14
    3: 21
    4: 32
    5: 42
  Why do lookups now take log(n) hops?
    One of the fingers must take you roughly halfway to the target.
  There's a binary lookup tree rooted at every node,
    threaded through the other nodes' finger tables.
    This is *better* than simply arranging the nodes in a single tree:
      every node acts as a root, so there's no root hotspot,
      but there is a lot more state in total.

Is log(n) fast or slow?
  For a million nodes, it's 20 hops.
  If each hop takes 50 ms, a lookup takes a second.
  If each hop has a 10% chance of failure, that's a couple of timeouts.
  So in practice log(n) is better than O(n), but not great.
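Here is a minimal, single-address-space sketch of the finger-table
routing just described, using the six-bit ID space from the example.
The names (node_t, NBITS, in_half_open, ...) are illustrative, not
Chord's: the real system forwards lookups between hosts via RPC
rather than chasing local pointers.

    #include <stdint.h>
    #include <stddef.h>

    #define NBITS 6                      /* six-bit IDs: 0..63 */

    typedef struct node node_t;
    struct node {
        uint32_t id;
        node_t  *successor;              /* next node clockwise */
        node_t  *finger[NBITS];          /* finger[i] = successor(id + 2^i) */
    };

    /* Is x in the clockwise interval (a, b] on the ring? */
    static int in_half_open(uint32_t x, uint32_t a, uint32_t b)
    {
        if (a < b) return x > a && x <= b;
        return x > a || x <= b;          /* interval wraps past zero */
    }

    /* Is x strictly inside the clockwise interval (a, b)? */
    static int in_open(uint32_t x, uint32_t a, uint32_t b)
    {
        if (a < b) return x > a && x < b;
        return x > a || x < b;
    }

    /* Highest finger preceding k: the step that roughly halves the
     * remaining clockwise distance to the key. */
    static node_t *closest_preceding_node(node_t *n, uint32_t k)
    {
        for (int i = NBITS - 1; i >= 0; i--)
            if (n->finger[i] && in_open(n->finger[i]->id, n->id, k))
                return n->finger[i];
        return n;                        /* no finger helps */
    }

    /* Route a lookup for key k starting at node n: O(log n) hops. */
    node_t *lookup(node_t *n, uint32_t k)
    {
        while (!in_half_open(k, n->id, n->successor->id)) {
            node_t *next = closest_preceding_node(n, k);
            n = (next == n) ? n->successor : next;  /* always progress */
        }
        return n->successor;             /* first node with ID >= k */
    }

Note how the correctness argument lives entirely in in_half_open: as
long as every node's successor pointer is right, the loop can only
stop at the responsible node. The fingers are purely an accelerator
and may be imprecise or stale.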
How does a new node acquire correct tables?

Chord's routing is conceptually similar to Kademlia's
  The finger table is similar to Kademlia's bucket levels:
    both halve the metric distance at each step,
    and both are about speed, so both can be imprecise.
  n.successor is similar to Kademlia's requirement that each node
    know of all the nodes that are very close to it in xor-space.
  In both cases care is needed to ensure that different lookups for
    the same key converge on exactly the same node.

Retrospective
  DHTs seem very promising for finding data in large p2p systems
    Decentralization seems good for load, fault tolerance
  But: the security problems are difficult
  But: churn is a serious problem, particularly if log(n) is big
  So DHTs have not had the impact that many hoped for

3. Concurrency intro

Q: What is concurrency?
A: Stuff happening at the same time.

A. What are the sources of concurrency?

  1. Multiple processors sharing a common memory
  2. Multiplexing in time, a/k/a scheduling: multiple processes or
     threads share memory (even if they are not running at the same
     time)
  3. Interrupts (actually a form of scheduling, and also used to
     implement scheduling)
     (a) from devices (e.g., disk finished, new data from network, etc.)
     (b) from a periodic timer
  4. Anything else?

B. Detour: the thread abstraction

  --in-kernel
  --in-process, etc.
  We'll assume mostly in-kernel threads.
  Basically, threads within a process are like different processes
    that just happen to share an address space (the same %cr3).
  The execution of multiple threads is interleaved.

  Different kinds of threads, in terms of how "hard" it is to
  synchronize them:

  --non-preemptive threads: a thread executes exclusively until it
    makes a blocking call (e.g., a read() on a file).
  --preemptive threads: between any two instructions, another thread
    can run.
    [How is this implemented? Answer: with interrupts and context
    switches.]

  Note that with multiple CPUs, we are inherently in a preemptive
  world: consider a thread T on CPU 0. Another thread on CPU 1 can
  execute between any two instructions of T.

C. What makes concurrency hard to deal with?

  --It is hard or impossible to reason about all possible
    interleavings.
  --see handout; panels 1, 2, 3:
      2a: x = 1 or x = 2
      2b: x = 13 or x = 25
      2c: x = 1 or x = 2 or x = 3
      3:  incorrect list structure
      4:  incorrect count in buffer
  --All of these are called *race conditions*; not all of them are
    errors, though.
    (A runnable sketch of such a race appears at the end of this
    section.)
  --The worst part of errors from race conditions is that a program
    may work fine most of the time and only occasionally show
    problems. Why? Because the instructions of the various threads
    (or processes, or whatever) get interleaved in a non-deterministic
    order.
  --And it's worse than that, because inserting debugging code may
    change the timing so that the bug doesn't show up.
  --Hardware makes the problem harder by departing from sequential
    consistency (we'll mostly not cover that in this class).

D. A pervasive tension in multicore/concurrent programming:
   performance vs. correctness

  --There is usually a tension between performance and correctness
    when writing concurrent code.
  --Performance is compromised by too much locking or by too-large
    critical sections:
      serialization reduces opportunities for concurrent execution
      (and use of the multiple cores),
      and enforcing serialization itself has a cost.
  --Correctness is usually compromised by not enough locking, or by
    fine-grained locking (which is harder to get right).
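To make the lost-update race concrete before we fix it in the next
section, here is a minimal sketch (this is not the handout's code):
two threads increment a shared counter with no lock. count++ is a
load, an add, and a store, so increments from the two threads can
interleave and be lost.

    /* Build with: cc -pthread race.c */
    #include <pthread.h>
    #include <stdio.h>

    #define N 1000000
    static volatile long count = 0;  /* volatile forces the loads and
                                        stores but does NOT make the
                                        increment atomic */

    static void *worker(void *arg)
    {
        (void)arg;
        for (long i = 0; i < N; i++)
            count++;                 /* the unprotected critical section */
        return NULL;
    }

    int main(void)
    {
        pthread_t t1, t2;
        pthread_create(&t1, NULL, worker, NULL);
        pthread_create(&t2, NULL, worker, NULL);
        pthread_join(t1, NULL);
        pthread_join(t2, NULL);
        /* Expect 2*N, but on a multicore machine this usually prints
         * less -- and a different number on every run. */
        printf("count = %ld (expected %ld)\n", count, 2L * N);
        return 0;
    }

Re-running it gives different answers, which is exactly the "works
fine most of the time" problem above; the fix is the lock()/unlock()
machinery of the next section.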
4. Managing concurrency: protect critical sections

  * critical sections
  * protecting critical sections
  * implementing critical sections

--Step 1: the concept of a *critical section*

  --Regard accesses of shared variables (for example, "count" in the
    bounded buffer example) as being in a _critical section_.
  --Critical sections will be protected from concurrent execution.
  --Now we need a solution to the _critical section_ problem.
  --The solution must satisfy three properties:

    1. Mutual exclusion
        Only one thread can be in the c.s. at a time.
        [This is the notion of atomicity.]
    2. Progress
        If no thread is executing in the c.s., one of the threads
        trying to enter a given c.s. will eventually get in.
    3. Bounded waiting
        Once a thread T starts trying to enter the critical section,
        there is a bound on the number of other threads that may
        enter the critical section before T enters.

  Top-level idea: remove some concurrency.

--Step 2: protecting critical sections

  --We want lock()/unlock(), or enter()/leave(), or
    acquire()/release() -- lots of names for the same idea.
      --mutex_init(mutex_t* m), mutex_lock(mutex_t* m),
        mutex_unlock(mutex_t* m), ...
      --pthread_mutex_init(), pthread_mutex_lock(), ...
  --In each case, the semantics are that once a thread of execution
    is executing inside the critical section, no other thread of
    execution is executing there.

--Step 3: implementing critical sections

  --Peterson's algorithm:
      not modular
      doesn't generalize well
      requires sequential consistency
  --ASSUME KERNEL MODE:
    --"easy" way, assuming a uniprocessor machine:
        lock()   --> disable interrupts
        unlock() --> reenable interrupts
      [Convince yourself that this provides mutual exclusion.]
    --multiprocessor machine: spinlocks
        basic version: test-and-set. See the handout.
  --ASSUME USER SPACE:
    --spinlocks, mutexes, etc.
    --We won't study mutexes in this class, but you will use them a
      lot in application code.

---
Transition:

  --In user-level code, worry about correctness first.
  --In kernel code, care about both performance and correctness.
  --Performance issues:
      (a) the cost of acquiring the lock: we'll deal with this one
          for the rest of the day.
      (b) the cost of serializing in the first place: we won't deal
          with this much this semester.
---

5. Spinlocks

  test-and-set:          basic; we just saw it.
  test-and-test-and-set: this is 5b on the handout.
  ticket:                this is 6 on the handout.
  MCS:                   this is 7 on the handout.
  (Minimal sketches of the first three appear after the MESI
  discussion below.)

6. Non-scalable locks are dangerous

ASK: why is collapse bad in general?

  Make the point: upgrade the machine, and performance gets *worse*.
  Weird and counter-intuitive; the paper tells you why that happens.
  Normally you hope that when you increase load, throughput matches
  capacity. You don't want your system coming to a grinding halt.

What is this MESI concept?

  Idea: there is a directory that contains, for every single cache
  line, the following info:

    [tag | state | core_ID]

  The state can be Modified, Exclusive, Shared, or Invalid:

    Modified:  exactly one core has the line cached, with dirty data
               (DRAM is stale).
    Exclusive: exactly one core has the line cached, but there is no
               dirty data (it matches DRAM).
    Shared:    one or more cores have the line cached, and it matches
               DRAM.
    Invalid:   no one has it cached.

  Loads and stores can change the state, and generate cross-cache
  traffic; this is the cache coherence protocol in action. For
  example, a load of a cache line that is in the Modified state
  causes the cache coherence protocol to go get the latest value.

ASK: Figure 6: why does a store to a line in the Shared state
generate a "Broadcast Invalidate"?

  Answer: because the directory doesn't know which cores have the
  line cached; it could be anywhere in the machine.
  Result: inter-socket traffic! Hundreds of cycles!
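To tie sections 5 and 6 together, here are minimal sketches of the
test-and-set, test-and-test-and-set, and ticket locks, with comments
connecting each spin loop to the MESI traffic just described. These
use the GCC/Clang __sync atomic builtins (which supply the memory
barriers); the handout versions are the reference, and details like
backoff are omitted.

    /* test-and-set: every probe is an atomic read-modify-write, so
     * every waiter keeps yanking the cache line over in M state,
     * even while the lock is held -- maximum coherence traffic. */
    typedef struct { volatile int locked; } tas_lock_t;

    void tas_acquire(tas_lock_t *l) {
        while (__sync_lock_test_and_set(&l->locked, 1))
            ;                            /* spin: each probe is a store */
    }
    void tas_release(tas_lock_t *l) {
        __sync_lock_release(&l->locked); /* store 0, release barrier */
    }

    /* test-and-test-and-set (5b on the handout): spin on a plain
     * load, so waiters share the line in S state; traffic happens
     * only when the holder's release invalidates their copies. */
    void ttas_acquire(tas_lock_t *l) {
        for (;;) {
            while (l->locked)            /* read-only spin: line stays
                                            Shared */
                ;
            if (!__sync_lock_test_and_set(&l->locked, 1))
                return;                  /* won the race after release */
        }
    }

    /* ticket lock (6 on the handout): FIFO, so it adds bounded
     * waiting -- but every waiter spins on the same word.  Each
     * release stores to current (moving the line to M on the
     * releaser and broadcast-invalidating all waiters' copies), and
     * then every waiter re-fetches it.  That O(waiters) traffic per
     * handoff is what the paper blames for collapse. */
    typedef struct {
        volatile unsigned next;          /* next ticket to hand out */
        volatile unsigned current;       /* ticket now being served */
    } ticket_lock_t;

    void ticket_acquire(ticket_lock_t *l) {
        unsigned me = __sync_fetch_and_add(&l->next, 1);
        while (l->current != me)         /* everyone spins on l->current */
            ;
    }
    void ticket_release(ticket_lock_t *l) {
        __sync_fetch_and_add(&l->current, 1);  /* one store invalidates
                                                  every waiter's copy */
    }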
ASK: this paper is about what happens to locks under contention. But
where does the contention come from?

  Answer: loads of lock->current, especially after stores to that
  value.

[more next time]

---------------------------------------------------------------------------

Acknowledgment: P2P piece due to Robert Morris's 6.824 notes.