Class 20
CS 480-008
12 April 2016

On the board
------------

1. Last time
2. MapReduce model
3. admin notes
4. MapReduce, the system
        Background
        Basic problem
        Execution flow and implementation
        Fault-tolerance
        Load balancing
        Performance
5. Discussion

---------------------------------------------------------------------------

1. Last time

    --Web security

2. MapReduce model

    --map(), reduce(): these are constructs borrowed from functional
    languages

    --assumption of the MapReduce framework: the underlying implementation
    "runs" map() and reduce() appropriately:

        --feeds input to map()
        --does a "group by" or "shuffle"
        --feeds intermediate results to reduce()
            note that reduce() gets as input a pair: (key, list_of_values)

    --we'll see how this is implemented shortly. for now, we're
    concentrating on the programmer's interface.

    PROBLEM 1:

    --let's use map() and reduce() to count, for each word length, the
    number of words of that length. Assume that the input to the job is a
    list of key-value pairs:

        (document_id, document_text)

    and the output should be a list of key-value pairs:

        (length, number of words of that length)

        // key is doc_id, value is doc contents
        map(String key, String value) {
            for each word w in value
                EmitIntermediate(AsString(len(w)), "1");
        }

        // key: a length. values: a list of counts
        reduce(String key, Iterator values) {
            int result = 0;
            for each v in values
                result += ParseInt(v);
            Emit(AsString(result));
        }

    PROBLEM 2:

    --compute an "inverted index": input is a concatenated list of
    documents, which fits in memory. output is a list of associations:

        English_word --> [list of documents that contain the word]

    PROBLEM 3:

    --compute an "inverted index" when the input file = "THE WEB"
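    --one possible way to express Problems 2 and 3 as map() and reduce(),
    sketched in Python with a tiny in-memory stand-in for the framework's
    shuffle. the names map_func(), reduce_func(), and run_job() are made up
    for illustration; this is not the paper's interface.

        # toy sketch of the inverted-index job; illustrative only
        from collections import defaultdict

        def map_func(doc_id, doc_text):
            # key: document id, value: document contents
            # emit (word, doc_id) once per distinct word in the document
            return [(word, doc_id) for word in set(doc_text.split())]

        def reduce_func(word, doc_ids):
            # key: an English word
            # values: every doc_id whose map() emitted that word
            return (word, sorted(set(doc_ids)))

        def run_job(inputs):
            # stand-in for the framework: run map, "group by" key, run reduce
            groups = defaultdict(list)
            for doc_id, text in inputs:
                for k, v in map_func(doc_id, text):
                    groups[k].append(v)
            return [reduce_func(k, vs) for k, vs in groups.items()]

        # run_job([("d1", "the quick fox"), ("d2", "the lazy dog")])
        #   --> [("the", ["d1", "d2"]), ("quick", ["d1"]), ...]

    the same map()/reduce() shape works for Problem 3; what changes is the
    scale, which is the framework's problem, not the programmer's.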
3. admin notes

    --homework exercise on MapReduce, due Thursday. should be short (1-2
    hours).

4. MapReduce, the system

A. Background: Google environment

    Q: Why do/did they have tons and tons of commodity PCs?
    A: Economics.

    (At their scale, even reliable machines would fail, so they need
    mechanisms for fault-tolerance [replication, etc.]. Once they have such
    mechanisms, they can get away with components that are less reliable.)

    [draw picture]

    It's not just Google's employees who can run these jobs: we live in the
    era of "big data" and "the cloud". Services like Amazon Web Services,
    Hadoop, etc. make this kind of service available to anyone with an
    Internet connection and a working knowledge of Python/C++/Java/etc.

    Google File System (GFS): basically a giant store of replicated 64 MB
    chunks. each chunk is itself usually part of some larger file.
    Thursday's paper can be understood as a cousin of GFS.

    Top-level interactions in GFS conceptually look like this:

        client --> GFS master:   file name, offset
        GFS master --> client:   chunk ID, chunkserver
        client --> chunkserver:  chunk ID, byte range
        chunkserver --> client:  bytes

B. Basic problem solved by the authors

    --in Google's environment (and now, in many organizations and on many
    projects) there are lots and lots of cases where programmers need to
    run huge computations.

        "huge" = way too big to fit on one disk, or in one machine's
        memory, given the aforementioned machines.

        ----let's say you want to sort 10 terabytes of data. you can't just
        write a C program whose virtual address space is several terabytes
        and code up a 10-line quicksort against a 10^12-entry array.

    --there are more such cases than there are programmers who have the
    know-how to implement such computations.

    --why does it require know-how? because when you implement something at
    this scale, you have to worry about:

        ----distribution (I give you 5,000 machines to run your job: how do
        you even go about harnessing all of those machines to get your
        computation done?)
        ----synchronization (how does the job get coordinated?)
        ----load-balancing (how to avoid hot spots, bottlenecks, and
        unequal load across machines?)
        ----parallelism (what can be run simultaneously with what?)
        ----fault-tolerance (what happens if a job fails? what happens if a
        node incorrectly suspects another node of failure? how does one
        keep the system in a consistent state?)
        ----scheduling (which machines should run which jobs?)
        ----correctness (how do we ensure that the final result is what it
        "should" be?)
        ----latency (how do we keep a job from dragging on?)

    --strawman approach: teach all Google developers to become experts at
    distributed systems.

        problem 1: impractical.
        problem 2: doesn't address the issue that even the experts, like
        Dean and Ghemawat (the authors), were spending more time than they
        wanted implementing such computations, especially given the
        following observation.

    --observation: if we ignore the scale, the computations in question are
    not conceptually complex. Example: think about creating a frequency
    tally of words in a document; this is an easy program to write
    correctly on a single machine.

    --MapReduce solution: identify an abstraction that:

        ----is expressive enough to allow programmers to get their work
        done
        ----hides the details of large-scale computation
        ----aligns with a natural strategy for large-scale implementation

    --Result: the programmer writes simple map() and reduce() functions.
    The framework, aka the MapReduce implementation, does everything else.
    Meanwhile, because the "everything else" is behind an abstraction
    barrier, it can be implemented once and well.

    --This is a total classic of systems design: they identified the right
    abstraction, and it opened up a whole world.

    --So how did the authors come up with the abstraction? (They kept
    re-coding such jobs, and eventually perceived that there was a
    commonality that could be exposed as user-supplied map() and reduce(),
    perhaps with multiple such jobs strung together.)

C. Execution flow and implementation

    --flow:

        map phase
        shuffle phase [the framework implements this, with no help from
        the user]
            what's going on with hash(key) mod R?
        reduce phase

        [picture of phases]
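    --on hash(key) mod R: each map task partitions its intermediate pairs
    into R buckets by hashing the key, and bucket i is later read by reduce
    task i; since every worker uses the same hash function, all pairs with
    a given key end up at the same reducer. a rough sketch follows
    (illustrative only; R and the function names here are made up):

        # toy sketch of the partitioning step behind the shuffle
        R = 4                             # number of reduce tasks (hypothetical)

        def partition(key):
            # the paper's default partitioning function is hash(key) mod R;
            # a real system needs a hash that is identical on every machine
            # (Python's built-in hash() is randomized per process)
            return sum(key.encode()) % R  # stand-in for a stable hash

        def map_task_output(intermediate_pairs):
            buckets = [[] for _ in range(R)]
            for key, value in intermediate_pairs:
                buckets[partition(key)].append((key, value))
            return buckets                # real system: R files on local disk

        # map_task_output([("the", "1"), ("fox", "1"), ("the", "1")]) puts
        # both ("the", "1") pairs in the same bucket, so a single reduce
        # task sees all the counts for "the".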
    --question: why does the Map worker write its data to the local disk?

    --question (one of yours): "Between steps 4 and 5 of the Execution
    overview, how does the master know when to notify the reduce workers?"

    --question (one of yours): "Why don't they store the completed map
    tasks on the global file system as well. This way, when some worker
    dies, you won't have to restart the map task all over again."

D. Fault tolerance

    How does the framework tolerate worker faults?

        --start over

        --and why does that work?

            --because the computation was expressed in a way that made it
            okay to re-run things twice. (most computations are stateless
            and deterministic.)

                deterministic = "re-run, get the same result"

            --so the framework doesn't have to be too careful about not
            duplicating work, or about having multiple workers executing
            the same task

            --and the programmer doesn't have to reason about what happens
            if their computation starts over

        --how do the authors arrange for this picture?

            --for each task, there is a definitive worker that has
            produced it

            --for mappers, the master keeps track of the definitive
            intermediate file names

                --if those change (because of mapper failure), reducers
                are informed

            --for reducers, they do atomic renames, so that there is a
            definitive "winner" (a worker who produced the final results)

        --what is the assumption here?

            --that machines fail completely. if the failure amounts to
            delivering partial data, say, the framework produces the wrong
            result.

        --what's going on at the end of "semantics in the presence of
        failures"? (multiple students asked about this.)

            --the weakened semantics is that, in the case of
            non-deterministic map() jobs, different reduce tasks could
            *disagree* about what a given map task produced. and that in
            turn could happen because of how map tasks are "committed"
            (see above about the master vis-a-vis intermediate file
            names).

            --the ultimate result is that the output files might not be
            equivalent to the output of some sequential execution; it is
            in this sense that we say that the consistency semantics are
            weakened.

    How does the framework tolerate master faults?

        --It doesn't.

        --The master task is a single point of failure. Why did this
        otherwise fault-tolerant system get designed with a single point
        of failure?

            --Simplicity!

            --The computation model assumes no side effects, so it's not
            like anything gets messed up if the computation starts over.

        --How could they have avoided the single point of failure? (With
        complexity.)

    How does the framework detect worker faults?

        --(Timeouts.)

        --What if it got the declaration wrong, and the worker is still
        executing? (No problem: the master just ignores that worker's
        results when considering what to tell reducers about where their
        inputs are.)

E. Load balancing

    How do they get load balancing? (By creating many more tasks than
    there are machines, ensuring that tasks can be parceled out to
    machines dynamically.)

F. Performance

    Latency:

        --What has the biggest effect on performance? (answer:
        stragglers.)
        --How did the authors solve the problem? (Backup tasks: near the
        end of the job, schedule duplicate executions of the remaining
        in-progress tasks.)
        --How did the authors know that stragglers would be a problem?
            --They didn't!
            --Lessons: (1) don't optimize until you see that something is
            actually a problem. (2) tuning may require hacks.

    Throughput:

        --impressive. they scan a terabyte in 2.5 minutes (though the
        paper that we read next gets even higher throughput).
        --classic example of using parallelism to decrease latency.
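    --to make D-F concrete, here is a toy Python sketch of the master-side
    bookkeeping: tasks move between idle / in-progress / completed, a dead
    worker's tasks go back to idle (completed map tasks too, since their
    output lived on that worker's local disk), and once nothing is idle the
    master hands out backup copies of in-progress tasks to beat stragglers.
    this is illustrative pseudocode, not the paper's implementation; the
    class and method names are made up.

        import time

        class Task:
            def __init__(self, kind, idx):
                self.kind, self.idx = kind, idx   # kind is "map" or "reduce"
                self.state = "idle"               # idle / in-progress / completed
                self.worker = None
                self.started = None

        class Master:
            def __init__(self, tasks):
                # many more tasks than workers: that is what makes dynamic
                # assignment (and hence load balancing) work
                self.tasks = tasks

            def assign(self, worker):
                # hand an idle task to a worker asking for work
                for t in self.tasks:
                    if t.state == "idle":
                        t.state, t.worker = "in-progress", worker
                        t.started = time.time()
                        return t
                # nothing idle: near the end, duplicate an in-progress task
                # ("backup task"); whichever copy finishes first wins
                for t in self.tasks:
                    if t.state == "in-progress":
                        return t
                return None

            def handle_worker_timeout(self, dead_worker):
                # called when the master stops hearing from a worker
                # (the paper's master pings workers periodically)
                for t in self.tasks:
                    if t.worker == dead_worker:
                        # in-progress tasks restart; completed *map* tasks
                        # also restart, because their output sat on the dead
                        # worker's local disk (reduce output is already in
                        # the global file system)
                        if t.state == "in-progress" or (t.state == "completed"
                                                        and t.kind == "map"):
                            t.state, t.worker = "idle", None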
5. Discussion

    --A triumph of pragmatism over idealism:

        --Note how functional programming turns a hard problem (what to do
        about partial and multiple executions) into an easy problem
        (re-execute partially failed tasks from scratch).

            --> BUT this comes at the cost of a restricted model of
            computation.

        --how they handle buggy input (they skip it!)

        --they don't deal with the single point of failure (the master)

        --performance hacks

    --Are there computations that can't be conveniently expressed?

        --Computations that change data or do lots of processing of it.

        --Any computation that is not expressible as transformations of
        (k,v) pairs.

        --The not-nice way to say it is that the MapReduce programming
        model pushes work onto the programmer (this is a general design
        point: if you can constrain the programmer, then the framework
        itself can optimize, because the design space is more restricted).

    --How about interactive jobs? (Follow-up work has looked at this
    extensively. See the work surrounding Spark.)

    --How have the ideas evolved since the initial publication?

        --Massive attention to these problems

        --Improvements to the programming model: not just map() and
        reduce() but other functional operations: filter(), zip(), etc.

        --Improvements to scheduling

        --Improvements to the implementation: place intermediate results
        in memory, use caching, etc.

        --Data-parallel computation is now everywhere.

            Bottom line: today's systems are more expressive than just
            map + reduce, but:

                (a) the programmer still has to do the work of expressing
                the computation in a data-parallel fashion, and

                (b) lots of the implementation strategies are strongly
                influenced by the original MapReduce paper.

    --Will MapReduce come to look like unnecessary complexity, rendered
    obsolete by faster, bigger, better computers?

----

Observe that the authors built something very powerful out of simple
pieces. That's the essence of great systems design.

Key sentence in the paper:

    "MapReduce has been so successful because it makes it possible to
    write a simple program and run it efficiently on a thousand machines
    in the course of half an hour, greatly speeding up the development and
    prototyping cycle. Furthermore, it allows programmers who have no
    experience with distributed and/or parallel systems to exploit large
    amounts of resources easily."

---