Class 23
CS 480-008
21 April 2016

On the board
------------

1. Last time
2. Finish peer-to-peer
3. Concurrency intro
4. Managing concurrency
5. Spinlocks
6. MCS locks
7. Non-scalable locks are dangerous

---------------------------------------------------------------------------

1. Last time

  --FDS
  --began P2P, DHTs

2. Peer-to-peer

How do DHTs work?

  Scalable DHT lookup:
    Key/value store spread over millions of nodes
    Typical DHT interface:
      put(key, value)
      get(key) -> value
    loose consistency: a get(k) will likely see a previous put(k),
      but there is no guarantee
    loose guarantees about keeping data alive

  Why is it hard?
    Millions of participating nodes
    Could broadcast/flood requests -- but that's too many messages
    Every node could know about every other node
      Then hashing is easy
      But keeping a million-node table up to date is hard
    We want modest state, and a modest number of messages per lookup

  Basic idea
    Impose a data structure (e.g., a tree) over the nodes
      Each node has references to only a few other nodes
    Lookups traverse the data structure -- "routing"
      I.e., hop from node to node
    The DHT should route a get() to the same node as the previous put()

Example: the "Chord" peer-to-peer lookup system
  By Stoica, Morris, Karger, Kaashoek, and Balakrishnan; 2001

Chord's ID-space topology
  Ring: all IDs are 160-bit numbers, viewed in a ring.
  Each node has an ID, randomly chosen.

Assignment of key IDs to node IDs?
  A key is stored on the first node whose ID is equal to or greater
    than the key's ID (wrapping around the ring).
  Closeness is defined as the "clockwise distance."
  If node and key IDs are uniform, we get reasonable load balance.
    So key IDs should be hashes (e.g., the bittorrent infohash).
  ** This is one way to instantiate _consistent hashing_ **

Basic routing -- correct but slow
  The query is at some node.
  That node needs to forward the query to a node "closer" to the key.
    If we keep moving the query closer, eventually we'll win.
  Each node knows its "successor" on the ring.
    n.lookup(k):
      if n < k <= n.successor
        return n.successor
      else
        forward to n.successor
  I.e., forward the query in a clockwise direction until done.
  n.successor must be correct!
    Otherwise we may skip over the responsible node,
    and get(k) won't see data inserted by put(k).

Forwarding through the successor is slow
  The data structure is a linked list: O(n) hops per lookup.
  Can we make it more like a binary search?
    We need to be able to halve the distance at each step.
  log(n) "finger table" routing:
    Keep track of nodes exponentially further away:
    New state: f[i] contains the successor of n + 2^i
    n.lookup(k):
      if n < k <= n.successor:
        return successor
      else:
        n' = closest_preceding_node(k)   -- in f[]
        forward to n'
    (A sketch in C appears at the end of this section.)
  For a six-bit ID space, node 8's finger table might look like this:
    0: 14
    1: 14
    2: 14
    3: 21
    4: 32
    5: 42
  Why do lookups now take log(n) hops?
    One of the fingers must take you roughly halfway to the target.
  There's a binary lookup tree rooted at every node,
    threaded through the other nodes' finger tables.
    This is *better* than simply arranging the nodes in a single tree:
      every node acts as a root, so there's no root hotspot,
      but there is a lot more state in total.

Is log(n) fast or slow?
  For a million nodes, it's 20 hops.
  If each hop takes 50 ms, a lookup takes a second.
  If each hop has a 10% chance of failure, that's a couple of timeouts.
  So in practice log(n) is better than O(n), but not great.
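Here is a minimal, single-address-space sketch of the finger-table
routing just described, using the six-bit ID space from the example.
The names (node_t, NBITS, in_half_open, ...) are illustrative, not
Chord's: the real system forwards lookups between hosts via RPC
rather than chasing local pointers.

    #include <stdint.h>
    #include <stddef.h>

    #define NBITS 6                      /* six-bit IDs: 0..63 */

    typedef struct node node_t;
    struct node {
        uint32_t id;
        node_t  *successor;              /* next node clockwise */
        node_t  *finger[NBITS];          /* finger[i] = successor(id + 2^i) */
    };

    /* Is x in the clockwise interval (a, b] on the ring? */
    static int in_half_open(uint32_t x, uint32_t a, uint32_t b)
    {
        if (a < b) return x > a && x <= b;
        return x > a || x <= b;          /* interval wraps past zero */
    }

    /* Is x strictly inside the clockwise interval (a, b)? */
    static int in_open(uint32_t x, uint32_t a, uint32_t b)
    {
        if (a < b) return x > a && x < b;
        return x > a || x < b;
    }

    /* Highest finger preceding k: the step that roughly halves the
     * remaining clockwise distance to the key. */
    static node_t *closest_preceding_node(node_t *n, uint32_t k)
    {
        for (int i = NBITS - 1; i >= 0; i--)
            if (n->finger[i] && in_open(n->finger[i]->id, n->id, k))
                return n->finger[i];
        return n;                        /* no finger helps */
    }

    /* Route a lookup for key k starting at node n: O(log n) hops. */
    node_t *lookup(node_t *n, uint32_t k)
    {
        while (!in_half_open(k, n->id, n->successor->id)) {
            node_t *next = closest_preceding_node(n, k);
            n = (next == n) ? n->successor : next;  /* always progress */
        }
        return n->successor;             /* first node with ID >= k */
    }

Note how the correctness argument lives entirely in in_half_open: as
long as every node's successor pointer is right, the loop can only
stop at the responsible node. The fingers are purely an accelerator
and may be imprecise or stale.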
How does a new node acquire correct tables?

Chord's routing is conceptually similar to Kademlia's
  The finger table is similar to Kademlia's bucket levels:
    both halve the metric distance at each step,
    and both are about speed, so both can be imprecise.
  n.successor is similar to Kademlia's requirement that each node
    know of all the nodes that are very close to it in xor-space.
  In both cases care is needed to ensure that different lookups for
    the same key converge on exactly the same node.

Retrospective
  DHTs seem very promising for finding data in large p2p systems
    Decentralization seems good for load, fault tolerance
  But: the security problems are difficult
  But: churn is a serious problem, particularly if log(n) is big
  So DHTs have not had the impact that many hoped for

3. Concurrency intro

Q: What is concurrency?
A: Stuff happening at the same time.

A. What are the sources of concurrency?

  1. Multiple processors sharing a common memory
  2. Multiplexing in time, a/k/a scheduling: multiple processes or
     threads share memory (even if they are not running at the same
     time)
  3. Interrupts (actually a form of scheduling, and also used to
     implement scheduling)
     (a) from devices (e.g., disk finished, new data from network, etc.)
     (b) from a periodic timer
  4. Anything else?

B. Detour: the thread abstraction

  --in-kernel
  --in-process, etc.
  We'll assume mostly in-kernel threads.
  Basically, threads within a process are like different processes
    that just happen to share an address space (the same %cr3).
  The execution of multiple threads is interleaved.

  Different kinds of threads, in terms of how "hard" it is to
  synchronize them:

  --non-preemptive threads: a thread executes exclusively until it
    makes a blocking call (e.g., a read() on a file).
  --preemptive threads: between any two instructions, another thread
    can run.
    [How is this implemented? Answer: with interrupts and context
    switches.]

  Note that with multiple CPUs, we are inherently in a preemptive
  world: consider a thread T on CPU 0. Another thread on CPU 1 can
  execute between any two instructions of T.

C. What makes concurrency hard to deal with?

  --It is hard or impossible to reason about all possible
    interleavings.
  --see handout; panels 1, 2, 3:
      2a: x = 1 or x = 2
      2b: x = 13 or x = 25
      2c: x = 1 or x = 2 or x = 3
      3:  incorrect list structure
      4:  incorrect count in buffer
  --All of these are called *race conditions*; not all of them are
    errors, though.
    (A runnable sketch of such a race appears at the end of this
    section.)
  --The worst part of errors from race conditions is that a program
    may work fine most of the time and only occasionally show
    problems. Why? Because the instructions of the various threads
    (or processes, or whatever) get interleaved in a non-deterministic
    order.
  --And it's worse than that, because inserting debugging code may
    change the timing so that the bug doesn't show up.
  --Hardware makes the problem harder by departing from sequential
    consistency (we'll mostly not cover that in this class).

D. A pervasive tension in multicore/concurrent programming:
   performance vs. correctness

  --There is usually a tension between performance and correctness
    when writing concurrent code.
  --Performance is compromised by too much locking or by too-large
    critical sections:
      serialization reduces opportunities for concurrent execution
      (and use of the multiple cores),
      and enforcing serialization itself has a cost.
  --Correctness is usually compromised by not enough locking, or by
    fine-grained locking (which is harder to get right).
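To make the lost-update race concrete before we fix it in the next
section, here is a minimal sketch (this is not the handout's code):
two threads increment a shared counter with no lock. count++ is a
load, an add, and a store, so increments from the two threads can
interleave and be lost.

    /* Build with: cc -pthread race.c */
    #include <pthread.h>
    #include <stdio.h>

    #define N 1000000
    static volatile long count = 0;  /* volatile forces the loads and
                                        stores but does NOT make the
                                        increment atomic */

    static void *worker(void *arg)
    {
        (void)arg;
        for (long i = 0; i < N; i++)
            count++;                 /* the unprotected critical section */
        return NULL;
    }

    int main(void)
    {
        pthread_t t1, t2;
        pthread_create(&t1, NULL, worker, NULL);
        pthread_create(&t2, NULL, worker, NULL);
        pthread_join(t1, NULL);
        pthread_join(t2, NULL);
        /* Expect 2*N, but on a multicore machine this usually prints
         * less -- and a different number on every run. */
        printf("count = %ld (expected %ld)\n", count, 2L * N);
        return 0;
    }

Re-running it gives different answers, which is exactly the "works
fine most of the time" problem above; the fix is the lock()/unlock()
machinery of the next section.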
4. Managing concurrency: protect critical sections

  * critical sections
  * protecting critical sections
  * implementing critical sections

--Step 1: the concept of a *critical section*

  --Regard accesses of shared variables (for example, "count" in the
    bounded buffer example) as being in a _critical section_.
  --Critical sections will be protected from concurrent execution.
  --Now we need a solution to the _critical section_ problem.
  --The solution must satisfy three properties:

    1. Mutual exclusion
        Only one thread can be in the c.s. at a time.
        [This is the notion of atomicity.]
    2. Progress
        If no thread is executing in the c.s., one of the threads
        trying to enter a given c.s. will eventually get in.
    3. Bounded waiting
        Once a thread T starts trying to enter the critical section,
        there is a bound on the number of other threads that may
        enter the critical section before T enters.

  Top-level idea: remove some concurrency.

--Step 2: protecting critical sections

  --We want lock()/unlock(), or enter()/leave(), or
    acquire()/release() -- lots of names for the same idea.
      --mutex_init(mutex_t* m), mutex_lock(mutex_t* m),
        mutex_unlock(mutex_t* m), ...
      --pthread_mutex_init(), pthread_mutex_lock(), ...
  --In each case, the semantics are that once a thread of execution
    is executing inside the critical section, no other thread of
    execution is executing there.

--Step 3: implementing critical sections

  --Peterson's algorithm:
      not modular
      doesn't generalize well
      requires sequential consistency
  --ASSUME KERNEL MODE:
    --"easy" way, assuming a uniprocessor machine:
        lock()   --> disable interrupts
        unlock() --> reenable interrupts
      [Convince yourself that this provides mutual exclusion.]
    --multiprocessor machine: spinlocks
        basic version: test-and-set. See the handout.
  --ASSUME USER SPACE:
    --spinlocks, mutexes, etc.
    --We won't study mutexes in this class, but you will use them a
      lot in application code.

---
Transition:

  --In user-level code, worry about correctness first.
  --In kernel code, care about both performance and correctness.
  --Performance issues:
      (a) the cost of acquiring the lock: we'll deal with this one
          for the rest of the day.
      (b) the cost of serializing in the first place: we won't deal
          with this much this semester.
---

5. Spinlocks

  test-and-set:          basic; we just saw it.
  test-and-test-and-set: this is 5b on the handout.
  ticket:                this is 6 on the handout.
  MCS:                   this is 7 on the handout.
  (Minimal sketches of the first three appear after the MESI
  discussion below.)

6. Non-scalable locks are dangerous

ASK: why is collapse bad in general?

  Make the point: upgrade the machine, and performance gets *worse*.
  Weird and counter-intuitive; the paper tells you why that happens.
  Normally you hope that when you increase load, throughput matches
  capacity. You don't want your system coming to a grinding halt.

What is this MESI concept?

  Idea: there is a directory that contains, for every single cache
  line, the following info:

    [tag | state | core_ID]

  The state can be Modified, Exclusive, Shared, or Invalid:

    Modified:  exactly one core has the line cached, with dirty data
               (DRAM is stale).
    Exclusive: exactly one core has the line cached, but there is no
               dirty data (it matches DRAM).
    Shared:    one or more cores have the line cached, and it matches
               DRAM.
    Invalid:   no one has it cached.

  Loads and stores can change the state, and generate cross-cache
  traffic; this is the cache coherence protocol in action. For
  example, a load of a cache line that is in the Modified state
  causes the cache coherence protocol to go get the latest value.

ASK: Figure 6: why does a store to a line in the Shared state
generate a "Broadcast Invalidate"?

  Answer: because the directory doesn't know which cores have the
  line cached; it could be anywhere in the machine.
  Result: inter-socket traffic! Hundreds of cycles!
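To tie sections 5 and 6 together, here are minimal sketches of the
test-and-set, test-and-test-and-set, and ticket locks, with comments
connecting each spin loop to the MESI traffic just described. These
use the GCC/Clang __sync atomic builtins (which supply the memory
barriers); the handout versions are the reference, and details like
backoff are omitted.

    /* test-and-set: every probe is an atomic read-modify-write, so
     * every waiter keeps yanking the cache line over in M state,
     * even while the lock is held -- maximum coherence traffic. */
    typedef struct { volatile int locked; } tas_lock_t;

    void tas_acquire(tas_lock_t *l) {
        while (__sync_lock_test_and_set(&l->locked, 1))
            ;                            /* spin: each probe is a store */
    }
    void tas_release(tas_lock_t *l) {
        __sync_lock_release(&l->locked); /* store 0, release barrier */
    }

    /* test-and-test-and-set (5b on the handout): spin on a plain
     * load, so waiters share the line in S state; traffic happens
     * only when the holder's release invalidates their copies. */
    void ttas_acquire(tas_lock_t *l) {
        for (;;) {
            while (l->locked)            /* read-only spin: line stays
                                            Shared */
                ;
            if (!__sync_lock_test_and_set(&l->locked, 1))
                return;                  /* won the race after release */
        }
    }

    /* ticket lock (6 on the handout): FIFO, so it adds bounded
     * waiting -- but every waiter spins on the same word.  Each
     * release stores to current (moving the line to M on the
     * releaser and broadcast-invalidating all waiters' copies), and
     * then every waiter re-fetches it.  That O(waiters) traffic per
     * handoff is what the paper blames for collapse. */
    typedef struct {
        volatile unsigned next;          /* next ticket to hand out */
        volatile unsigned current;       /* ticket now being served */
    } ticket_lock_t;

    void ticket_acquire(ticket_lock_t *l) {
        unsigned me = __sync_fetch_and_add(&l->next, 1);
        while (l->current != me)         /* everyone spins on l->current */
            ;
    }
    void ticket_release(ticket_lock_t *l) {
        __sync_fetch_and_add(&l->current, 1);  /* one store invalidates
                                                  every waiter's copy */
    }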
ASK: this paper is about what happens to locks under contention. But
where does the contention come from?

  Answer: loads of lock->current, especially after stores to that
  value.

[more next time]

---------------------------------------------------------------------------

Acknowledgment: P2P piece due to Robert Morris's 6.824 notes.