Class 22
CS 480-008
19 April 2016

On the board
------------
1. Last time
2. FDS
   A. Intro
   B. Design
   C. Replication
   D. Other questions
   E. Performance and evaluation
   F. Discussion
3. admin notes
4. Peer-to-peer

---------------------------------------------------------------------------

1. Last time

--Started FDS
--Motivation for blob stores in general:
    --thousands of computers want to do "big-data" processing
      (lots of data, processing that data in parallel, MapReduce-style
      computations)
    --need to put their data somewhere
    --need to amortize the work of reading (disk seek plus overhead of
      initiating communication to another node)
--Motivation for *this* blob store:
    --data center bandwidth might not be the bottleneck
    --make all reads cost the same, regardless of data placement?
    --the claim is that this leads to more flexible big-data application
      development

2. FDS

A. [last time] Intro

--common pattern:
    --lots of clients
    --lots of storage servers (known as "tractservers" here)
    --a partitioning strategy
    --master (known as the "metadata server" here) controls partitioning
    --replica groups for reliability, availability, durability

B. [last time] Basic Design

--key problem: how does the system partition the data, and how do the
  other nodes know where to find it?

--stop and ask: how fast will a client be able to read a single tract?
    [disk bandwidth: ~100 MB/sec]
    [network bandwidth to each client: 10 Gbps = 1.25 GB/sec]
--so where does the abstract's single-client 2 GB/s figure come from?
    [see end of 5.2: gave each client 20 Gbps of NICs; read from multiple
     tracts]

--why distribute a blob over multiple servers?
    (Because we want access bandwidth to a blob to be larger than
    single-disk bandwidth. That can happen if one client is suddenly
    reading a blob all at once [see above], or if multiple clients are
    simultaneously accessing different parts of the blob.)
--when would such distribution impede performance?

--abstract claims they recover from a lost disk (92 GB) in 6.2 seconds
    that's ~15 GB/sec
    how is that even possible?! that's ~150x the bandwidth of a single
    disk (at ~100 MB/s)!

C. Replication

--replication: what if the TLT looks like this:
    0: S1 S2
    1: S2 S1
    2: S3 S4
    3: S4 S3
    ...
  Why is this a bad idea?
  How long will repair take?
  What are the risks if two servers fail?

--Q: why is the paper's n^2 scheme better?
  TLT with n^2 entries, with every server pair occurring once
  (a toy sketch of this construction appears at the end of this section):
    0: S1 S2
    1: S1 S3
    2: S1 S4
    3: S2 S1
    4: S2 S3
    5: S2 S4
    ...
  How long will repair take?
  What are the risks if two servers fail?

--Q: why do they actually use a minimum replication level of 3?
  same n^2 table as before, but the third server is randomly chosen
  What is the effect on repair time?
  What is the effect of two servers failing?
  What if three disks fail?

* Replication
  A writing client sends a copy to each tractserver in the tract's TLT entry.
  A reading client asks one tractserver.

--how do they handle churn?
    --Adding a tractserver: replace relevant entries in the map,
      copy data from other nodes, etc.
    --How do they maintain the n^2-plus-one arrangement as servers
      leave/join? Unclear.

--can restore data extremely quickly. why?
    (because when a disk fails, they're not sitting there recopying from
    one disk to another. instead, they're asking m other disks for 1/m of
    the failed disk's contents each. if m=100, this cuts recovery latency
    by roughly 100x.)
    old idea (RAMCloud is a recent instantiation), but this paper goes
    disk to disk.
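A toy sketch (in Go) of the n^2 table construction referred to above: one
entry per pair of tractservers, plus a randomly chosen third replica for the
minimum replication level of 3. The names and the exact way the third replica
is picked are illustrative assumptions, not FDS's actual table-building code;
the point is that every pair of servers shares an entry, which is why a failed
disk can be rebuilt from (and onto) all the other disks in parallel.

    package main

    import (
        "fmt"
        "math/rand"
    )

    // buildTLT builds a toy tract locator table in the spirit of FDS's n^2
    // scheme: one entry for every ordered pair of distinct tractservers,
    // plus a randomly chosen third replica (illustrative, not FDS's code).
    func buildTLT(servers []string) [][]string {
        var tlt [][]string
        for _, a := range servers {
            for _, b := range servers {
                if a == b {
                    continue
                }
                // pick a random third replica distinct from a and b
                c := a
                for c == a || c == b {
                    c = servers[rand.Intn(len(servers))]
                }
                tlt = append(tlt, []string{a, b, c})
            }
        }
        return tlt
    }

    func main() {
        servers := []string{"S1", "S2", "S3", "S4", "S5"}
        tlt := buildTLT(servers)
        // n*(n-1) entries: every server pair appears, so every server holds
        // a little of every other server's data.
        fmt.Printf("%d servers -> %d TLT entries\n", len(servers), len(tlt))
        for i, entry := range tlt[:6] {
            fmt.Println(i, entry)
        }
    }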
--What happens after a tractserver fails?
    Metadata server stops getting heartbeat RPCs
    Picks a random replacement for each TLT entry the failed server was in
    New TLT gets a new version number
    Replacement servers fetch copies

--what if a client reads/writes but has an old tract locator table?
    (version number: the server is supposed to reject the request.
     but what if the server also has stale info?)

--how do they ensure consistency?
  they don't. version numbers help, but they can still get consistency
  deviations:
    --example:
       * metadata server times out on a tractserver
       * tractserver wasn't down; just slow
       * client 1 doesn't hear about the metadata update
       * client 2 hears about the metadata update
       * client 2 writes to three servers *not* including the timed-out one
       * client 1 reads from the old tractserver; now sees stale data

--will they lose data?
    (answer: not unless all of the replicas fail. why? because the writing
    client waits for all replicas to acknowledge before returning to the app.)

--what do they do about cluster growth?
    --add entries to the table?
      (no: that would change the table size, which would cause nearly
      everything to remap.)
    --so they must be fixing the table size at the beginning of time
      (they don't talk about this restriction), with lots of redundant
      entries.
    --then, to add a node, they replace a bunch of redundant entries with
      the new node
    --does that provide load balance?
      (answer: no. more accurately: it depends on how many redundant
      entries there were.)
      (consistent hashing handles this better.)
    --they mark new entries pending, etc.

--what happens if the metadata server (the thing distributing the TLT)
  fails?
    --answer: the system is hosed. a human operator has to fix it.
      (so they have a single point of failure)
    --while the metadata server is down, can the system proceed?
    --how does a rebooted metadata server get a copy of the TLT?

D. Other questions

Q: why do they need the scrubber application mentioned in 2.3?
   why don't they delete the tracts when the blob is deleted?
   can a blob be written after it is deleted?
   [because of inconsistency between metadata and data?]

--they say that their zero-copy model is different; how is it different?
    (answer: need to get buffers from the NIC, return them, etc.)

E. Performance and evaluation

--how do we know we're seeing "good" performance? what's the best you can
  expect?
--they do good science: establish a baseline by determining what the raw
  hardware can do, and use that as the comparison point.

--question: why don't they make the tract size 1 MB?
    (Answer: figure 3. their throughput wouldn't be as good.)

Q: Figure 4a: why does it start low? why does it go up? why does it level
   off? why does it level off at that particular performance?
    [eventually, aggregate client network bandwidth exceeds the
     tractservers' aggregate disk bandwidth]
    [notice: it levels off at 32 GB/s. 516 disks ==> ~62 MB/s per disk.
     approx half the disk throughput; see Fig 3 ... so they could do better!]

Q: Figure 4b shows random r/w as fast as sequential (Figure 4a). is this
   what you'd expect?
    [yes, because "sequential" is not "sequential on the disk" but rather
     "sequential through the blob": every change of tract generates the
     same workload profile]

Q: why are writes slower than reads with replication in Figure 4c?
    [need to send multiple copies]

Q: where does the 92 GB in 6.2 seconds come from?
    Table 1, 4th column
    180 GB (read + write) in 6.2 seconds is ~30 GB/s
    over 1000 disks, that's ~30 MB/s per disk
    128 servers
    what's the limiting resource? disk? cpu? net?
      --does not seem to be network or CPU
      --seems to be the "innermost, slowest" disk tracks.
        see the next-to-last paragraph of 5.3.
      --see the final paragraph of 5.3.
    a 1 TB disk in a 3000-disk cluster can be recovered in approx 17 s.
    Sounds great. But again, there is room for improvement!
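A small program that just redoes the back-of-the-envelope recovery arithmetic
above. All inputs are the figures quoted in the notes (92 GB lost disk, 6.2 s,
~1000 disks); nothing here comes from running FDS itself.

    package main

    import "fmt"

    func main() {
        const (
            failedGB = 92.0   // data on the lost disk
            secs     = 6.2    // reported recovery time
            nDisks   = 1000.0 // disks participating in recovery
        )
        // every byte is read from one disk and written to another
        // (the notes quote this as ~180 GB of traffic)
        totalGB := 2 * failedGB
        aggregate := totalGB / secs           // aggregate disk traffic, GB/s
        perDisk := aggregate / nDisks * 1000  // MB/s per disk

        fmt.Printf("aggregate: %.0f GB in %.1f s = %.0f GB/s\n",
            totalGB, secs, aggregate) // ~30 GB/s
        fmt.Printf("per disk:  ~%.0f MB/s, well under a disk's sequential bandwidth\n",
            perDisk) // ~30 MB/s
    }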
--MapReduce sort vs FDS sort

  MR style:
    * a mapper reads its split (1/Mth of the input file, e.g., a tract)
      map emits a <key, value> pair for each record in its split
      map partitions keys among R intermediate files
      (M*R intermediate files in total)
    * a reducer reads 1 of the R intermediate files produced by each mapper
      reads M intermediate files (together ~1/Rth of the data)
      sorts its input
      produces 1/Rth of the final sorted output file (R blobs)

  FDS style:
    * not totally different. 2-3 phases. buckets, etc. but:
    * mapper/reducer role is the same
    * no writing of intermediate content
      (not really a consequence of FDS or bisection bandwidth; could do
      this in a traditional data center)
    * relies on dynamic work allocation, which *is* a consequence of
      bisection bandwidth + the FDS design
      (MR couldn't have a head coordinating at such a fine grain, because
      a free node might not be near the data)
    * the workload does not have a high reduction factor, so it benefits
      from FDS's datacenter topology

  How big is each sort bucket? i.e., is the sort of each bucket in-memory?
  (the arithmetic is sketched in code below)
    1400 GB total
    128 compute servers
    between 12 and 96 GB of RAM each; hmm, say 50 on average,
      so total RAM may be 6400 GB
    thus the sort of each bucket is in memory and needs no extra
      read/write passes to FDS
    thus total time is just three transfers of 1400 GB
      client limit: 128 * 2 GB/s = 256 GB/sec
      disk limit: 1000 * 50 MB/s = 50 GB/sec
    thus the bottleneck is likely to be disk throughput

--penultimate paragraph of 6.1: very nice
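To make the bucket-size and bottleneck reasoning above concrete, here is the
same arithmetic as a tiny Go sketch. The 50 GB average RAM and 50 MB/s
per-disk figures are the notes' rough guesses, not measured values.

    package main

    import "fmt"

    func main() {
        const (
            dataGB       = 1400.0 // total data to sort
            servers      = 128.0  // compute servers
            ramPerServer = 50.0   // GB, rough average of the 12-96 GB range
            clientGBps   = 2.0    // per-client FDS bandwidth
            disks        = 1000.0
            diskMBps     = 50.0   // realistic per-disk throughput (Fig 3)
        )
        totalRAM := servers * ramPerServer   // 6400 GB >> 1400 GB
        clientLimit := servers * clientGBps  // aggregate client bandwidth
        diskLimit := disks * diskMBps / 1000 // aggregate disk bandwidth

        fmt.Printf("total RAM %.0f GB vs data %.0f GB -> buckets sort in memory\n",
            totalRAM, dataGB)
        fmt.Printf("client limit %.0f GB/s, disk limit %.0f GB/s -> disks are the bottleneck\n",
            clientLimit, diskLimit)
    }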
F. Discussion

3. admin notes

--start on lab 5b in advance!

4. Peer-to-peer

Kademlia: most commonly used DHT; basis of eDonkey
  done at NYU!!!
[DHTs, consistent hashing, Chord, Kademlia, BitTorrent]

Peer-to-peer
  [draw picture: user computers, files, direct xfers]
  users' computers talk directly to each other to implement a service,
  in contrast to user computers talking to central servers
  could be closed or open
  examples: Skype, video and music players, file sharing

Why might P2P be a win?
  spreads network/caching costs over users
  absence of a server may mean:
    easier to deploy
    less chance of overload
    single failure won't wreck the whole system
    harder to attack

Why don't all Internet services use P2P?
  can be hard to find data items over millions of users
  user computers are not as reliable as managed servers
  if open, can be attacked via evil participants

The result is that P2P has some successful niches:
  Client-client video/music, where serving costs are high
  Chat (user to user anyway; privacy and control)
  Popular data but the owning organization has no money
  No natural single owner or controller (Bitcoin)
  Illegal file sharing

Example: classic BitTorrent
  a cooperative download system, very popular!
  user clicks on a download link for, e.g., the latest Linux kernel
    distribution
  gets a torrent file w/ the content hash and the IP address of a tracker
  user's BT client talks to the tracker
  tracker tells it a list of other user clients w/ the downloaded file
  user's BT client talks to one or more clients w/ the file
  user's BT client tells the tracker it has a copy now too
  user's BT client serves the file to others for a while
  the point: provides huge download b/w w/o an expensive server/link

BitTorrent can also use a DHT instead of / as well as a tracker
  this is the topic of one of the optional readings for today
  BT clients cooperatively implement a giant key/value store,
    a "distributed hash table"
  the key is the file content hash ("infohash")
  the value is the IP address of a client willing to serve the file
    Kademlia can store multiple values for a key
  a client does get(infohash) to find other clients willing to serve
    and put(infohash, self) to register itself as willing to serve
  a client also joins the DHT to help implement it

Why might the DHT be a win for BitTorrent?
  a single giant tracker is less fragmented than many trackers,
    so clients are more likely to find each other
  maybe a classic tracker is too exposed to legal, etc., attacks
  it's not clear that BitTorrent depends heavily on the DHT;
    mostly a backup for classic trackers?

---------------------------------------------------------------------------

Acknowledgment: Some FDS pieces are from Robert Morris's 6.824 notes.