Class 20
CS 480-008
12 April 2016

On the board
------------

1. Last time
2. MapReduce model
3. admin notes
4. MapReduce, the system
        Background
        Basic problem
        Execution flow and implementation
        Fault-tolerance
        Load balancing
        Performance
5. Discussion

---------------------------------------------------------------------------

1. Last time

    --Web security

2. MapReduce model

    --map(), reduce(): these are constructs borrowed from functional
    languages

    --assumption of the MapReduce framework: the underlying implementation
    "runs" map() and reduce() appropriately:

        --feeds input to map()
        --does a "group by" or "shuffle"
        --feeds intermediate results to reduce()
            note that reduce() gets as input a pair: (key, list_of_values)

    --we'll see how this is implemented shortly. for now, we're
    concentrating on the programmer's interface.

    PROBLEM 1:

    --let's use map() and reduce() to count, for each word length, the
    number of words of that length. Assume that the input to the job is a
    list of key-value pairs:

        (document_id, document_text)

    and the output should be a list of key-value pairs:

        (length, number of words of that length)

        // key is doc_id, value is doc contents
        map(String key, String value) {
            for each word w in value
                EmitIntermediate(AsString(len(w)), "1");
        }

        // key: a length. values: a list of counts
        reduce(String key, Iterator values) {
            int result = 0;
            for each v in values
                result += ParseInt(v);
            Emit(AsString(result));
        }

    PROBLEM 2:

    --compute an "inverted index": input is a concatenated list of
    documents, which fits in memory. output is a list of associations:

        English_word --> [list of documents that contain the word]

    PROBLEM 3:

    --compute an "inverted index" when the input file = "THE WEB"
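    --one possible way to express Problems 2 and 3 as map() and reduce(),
    sketched in Python with a tiny in-memory stand-in for the framework's
    shuffle. the names map_func(), reduce_func(), and run_job() are made up
    for illustration; this is not the paper's interface.

        # toy sketch of the inverted-index job; illustrative only
        from collections import defaultdict

        def map_func(doc_id, doc_text):
            # key: document id, value: document contents
            # emit (word, doc_id) once per distinct word in the document
            return [(word, doc_id) for word in set(doc_text.split())]

        def reduce_func(word, doc_ids):
            # key: an English word
            # values: every doc_id whose map() emitted that word
            return (word, sorted(set(doc_ids)))

        def run_job(inputs):
            # stand-in for the framework: run map, "group by" key, run reduce
            groups = defaultdict(list)
            for doc_id, text in inputs:
                for k, v in map_func(doc_id, text):
                    groups[k].append(v)
            return [reduce_func(k, vs) for k, vs in groups.items()]

        # run_job([("d1", "the quick fox"), ("d2", "the lazy dog")])
        #   --> [("the", ["d1", "d2"]), ("quick", ["d1"]), ...]

    the same map()/reduce() shape works for Problem 3; what changes is the
    scale, which is the framework's problem, not the programmer's.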
3. admin notes

    --homework exercise on MapReduce, due Thursday. should be short (1-2
    hours).

4. MapReduce, the system

A. Background: Google environment

    Q: Why do/did they have tons and tons of commodity PCs?
    A: Economics.

    (At their scale, even reliable machines would fail, so they need
    mechanisms for fault-tolerance [replication, etc.]. Once they have such
    mechanisms, they can get away with components that are less reliable.)

    [draw picture]

    It's not just Google's employees who can run these jobs: we live in the
    era of "big data" and "the cloud". Services like Amazon Web Services,
    Hadoop, etc. make this kind of service available to anyone with an
    Internet connection and a working knowledge of Python/C++/Java/etc.

    Google File System (GFS): basically a giant store of replicated 64 MB
    chunks. each chunk is itself usually part of some larger file.
    Thursday's paper can be understood as a cousin of GFS.

    Top-level interactions in GFS conceptually look like this:

        client --> GFS master:   file name, offset
        GFS master --> client:   chunk ID, chunkserver
        client --> chunkserver:  chunk ID, byte range
        chunkserver --> client:  bytes

B. Basic problem solved by the authors

    --in Google's environment (and now, in many organizations and on many
    projects) there are lots and lots of cases where programmers need to
    run huge computations.

        "huge" = way too big to fit on one disk, or in one machine's
        memory, given the aforementioned machines.

        ----let's say you want to sort 10 terabytes of data. you can't just
        write a C program whose virtual address space is several terabytes
        and code up a 10-line quicksort against a 10^12-entry array.

    --there are more such cases than there are programmers who have the
    know-how to implement such computations.

    --why does it require know-how? because when you implement something at
    this scale, you have to worry about:

        ----distribution (I give you 5,000 machines to run your job: how do
        you even go about harnessing all of those machines to get your
        computation done?)
        ----synchronization (how does the job get coordinated?)
        ----load-balancing (how to avoid hot spots, bottlenecks, and
        unequal load across machines?)
        ----parallelism (what can be run simultaneously with what?)
        ----fault-tolerance (what happens if a job fails? what happens if a
        node incorrectly suspects another node of failure? how does one
        keep the system in a consistent state?)
        ----scheduling (which machines should run which jobs?)
        ----correctness (how do we ensure that the final result is what it
        "should" be?)
        ----latency (how do we keep a job from dragging on?)

    --strawman approach: teach all Google developers to become experts at
    distributed systems.

        problem 1: impractical.
        problem 2: doesn't address the issue that even the experts, like
        Dean and Ghemawat (the authors), were spending more time than they
        wanted implementing such computations, especially given the
        following observation.

    --observation: if we ignore the scale, the computations in question are
    not conceptually complex. Example: think about creating a frequency
    tally of words in a document; this is an easy program to write
    correctly on a single machine.

    --MapReduce solution: identify an abstraction that:

        ----is expressive enough to allow programmers to get their work
        done
        ----hides the details of large-scale computation
        ----aligns with a natural strategy for large-scale implementation

    --Result: the programmer writes simple map() and reduce() functions.
    The framework, aka the MapReduce implementation, does everything else.
    Meanwhile, because the "everything else" is behind an abstraction
    barrier, it can be implemented once and well.

    --This is a total classic of systems design: they identified the right
    abstraction, and it opened up a whole world.

    --So how did the authors come up with the abstraction? (They kept
    re-coding such jobs, and eventually perceived that there was a
    commonality that could be exposed as user-supplied map() and reduce(),
    perhaps with multiple such jobs strung together.)

C. Execution flow and implementation

    --flow:

        map phase
        shuffle phase [the framework implements this, with no help from
        the user]
            what's going on with hash(key) mod R?
        reduce phase

        [picture of phases]
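    --on hash(key) mod R: each map task partitions its intermediate pairs
    into R buckets by hashing the key, and bucket i is later read by reduce
    task i; since every worker uses the same hash function, all pairs with
    a given key end up at the same reducer. a rough sketch follows
    (illustrative only; R and the function names here are made up):

        # toy sketch of the partitioning step behind the shuffle
        R = 4                             # number of reduce tasks (hypothetical)

        def partition(key):
            # the paper's default partitioning function is hash(key) mod R;
            # a real system needs a hash that is identical on every machine
            # (Python's built-in hash() is randomized per process)
            return sum(key.encode()) % R  # stand-in for a stable hash

        def map_task_output(intermediate_pairs):
            buckets = [[] for _ in range(R)]
            for key, value in intermediate_pairs:
                buckets[partition(key)].append((key, value))
            return buckets                # real system: R files on local disk

        # map_task_output([("the", "1"), ("fox", "1"), ("the", "1")]) puts
        # both ("the", "1") pairs in the same bucket, so a single reduce
        # task sees all the counts for "the".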
    --question: why does the Map worker write its data to the local disk?

    --question (one of yours): "Between steps 4 and 5 of the Execution
    overview, how does the master know when to notify the reduce workers?"

    --question (one of yours): "Why don't they store the completed map
    tasks on the global file system as well. This way, when some worker
    dies, you won't have to restart the map task all over again."

D. Fault tolerance

    How does the framework tolerate worker faults?

        --start over

        --and why does that work?

            --because the computation was expressed in a way that made it
            okay to re-run things twice. (most computations are stateless
            and deterministic.)

                deterministic = "re-run, get the same result"

            --so the framework doesn't have to be too careful about not
            duplicating work, or about having multiple workers executing
            the same task

            --and the programmer doesn't have to reason about what happens
            if their computation starts over

        --how do the authors arrange for this picture?

            --for each task, there is a definitive worker that has
            produced it

            --for mappers, the master keeps track of the definitive
            intermediate file names

                --if those change (because of mapper failure), reducers
                are informed

            --for reducers, they do atomic renames, so that there is a
            definitive "winner" (a worker who produced the final results)

        --what is the assumption here?

            --that machines fail completely. if the failure amounts to
            delivering partial data, say, the framework produces the wrong
            result.

        --what's going on at the end of "semantics in the presence of
        failures"? (multiple students asked about this.)

            --the weakened semantics is that, in the case of
            non-deterministic map() jobs, different reduce tasks could
            *disagree* about what a given map task produced. and that in
            turn could happen because of how map tasks are "committed"
            (see above about the master vis-a-vis intermediate file
            names).

            --the ultimate result is that the output files might not be
            equivalent to the output of some sequential execution; it is
            in this sense that we say that the consistency semantics are
            weakened.

    How does the framework tolerate master faults?

        --It doesn't.

        --The master task is a single point of failure. Why did this
        otherwise fault-tolerant system get designed with a single point
        of failure?

            --Simplicity!

            --The computation model assumes no side effects, so it's not
            like anything gets messed up if the computation starts over.

        --How could they have avoided the single point of failure? (With
        complexity.)

    How does the framework detect worker faults?

        --(Timeouts.)

        --What if it got the declaration wrong, and the worker is still
        executing? (No problem: the master just ignores that worker's
        results when considering what to tell reducers about where their
        inputs are.)

E. Load balancing

    How do they get load balancing? (By creating many more tasks than
    there are machines, ensuring that tasks can be parceled out to
    machines dynamically.)

F. Performance

    Latency:

        --What has the biggest effect on performance? (answer:
        stragglers.)
        --How did the authors solve the problem? (Backup tasks: near the
        end of the job, schedule duplicate executions of the remaining
        in-progress tasks.)
        --How did the authors know that stragglers would be a problem?
            --They didn't!
            --Lessons: (1) don't optimize until you see that something is
            actually a problem. (2) tuning may require hacks.

    Throughput:

        --impressive. they scan a terabyte in 2.5 minutes (though the
        paper that we read next gets even higher throughput).
        --classic example of using parallelism to decrease latency.
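    --to make D-F concrete, here is a toy Python sketch of the master-side
    bookkeeping: tasks move between idle / in-progress / completed, a dead
    worker's tasks go back to idle (completed map tasks too, since their
    output lived on that worker's local disk), and once nothing is idle the
    master hands out backup copies of in-progress tasks to beat stragglers.
    this is illustrative pseudocode, not the paper's implementation; the
    class and method names are made up.

        import time

        class Task:
            def __init__(self, kind, idx):
                self.kind, self.idx = kind, idx   # kind is "map" or "reduce"
                self.state = "idle"               # idle / in-progress / completed
                self.worker = None
                self.started = None

        class Master:
            def __init__(self, tasks):
                # many more tasks than workers: that is what makes dynamic
                # assignment (and hence load balancing) work
                self.tasks = tasks

            def assign(self, worker):
                # hand an idle task to a worker asking for work
                for t in self.tasks:
                    if t.state == "idle":
                        t.state, t.worker = "in-progress", worker
                        t.started = time.time()
                        return t
                # nothing idle: near the end, duplicate an in-progress task
                # ("backup task"); whichever copy finishes first wins
                for t in self.tasks:
                    if t.state == "in-progress":
                        return t
                return None

            def handle_worker_timeout(self, dead_worker):
                # called when the master stops hearing from a worker
                # (the paper's master pings workers periodically)
                for t in self.tasks:
                    if t.worker == dead_worker:
                        # in-progress tasks restart; completed *map* tasks
                        # also restart, because their output sat on the dead
                        # worker's local disk (reduce output is already in
                        # the global file system)
                        if t.state == "in-progress" or (t.state == "completed"
                                                        and t.kind == "map"):
                            t.state, t.worker = "idle", None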
5. Discussion

    --A triumph of pragmatism over idealism:

        --Note how functional programming turns a hard problem (what to do
        about partial and multiple executions) into an easy problem
        (re-execute partially failed tasks from scratch).

            --> BUT this comes at the cost of a restricted model of
            computation.

        --how they handle buggy input (they skip it!)

        --they don't deal with the single point of failure (the master)

        --performance hacks

    --Are there computations that can't be conveniently expressed?

        --Computations that change data or do lots of processing of it.

        --Any computation that is not expressible as transformations of
        (k,v) pairs.

        --The not-nice way to say it is that the MapReduce programming
        model pushes work onto the programmer (this is a general design
        point: if you can constrain the programmer, then the framework
        itself can optimize, because the design space is more restricted).

    --How about interactive jobs? (Follow-up work has looked at this
    extensively. See the work surrounding Spark.)

    --How have the ideas evolved since the initial publication?

        --Massive attention to these problems

        --Improvements to the programming model: not just map() and
        reduce() but other functional operations: filter(), zip(), etc.

        --Improvements to scheduling

        --Improvements to the implementation: place intermediate results
        in memory, use caching, etc.

        --Data-parallel computation is now everywhere.

            Bottom line: today's systems are more expressive than just
            map + reduce, but:

                (a) the programmer still has to do the work of expressing
                the computation in a data-parallel fashion, and

                (b) lots of the implementation strategies are strongly
                influenced by the original MapReduce paper.

    --Will MapReduce come to look like unnecessary complexity, rendered
    obsolete by faster, bigger, better computers?

----

Observe that the authors built something very powerful out of simple
pieces. That's the essence of great systems design.

Key sentence in the paper:

    "MapReduce has been so successful because it makes it possible to
    write a simple program and run it efficiently on a thousand machines
    in the course of half an hour, greatly speeding up the development and
    prototyping cycle. Furthermore, it allows programmers who have no
    experience with distributed and/or parallel systems to exploit large
    amounts of resources easily."

---