Class 22
CS 480-008
19 April 2016

On the board
------------
1. Last time
2. FDS
   A. Intro
   B. Design
   C. Replication
   D. Other questions
   E. Performance and evaluation
   F. Discussion
3. admin notes
4. Peer-to-peer

---------------------------------------------------------------------------

1. Last time

--Started FDS
--Motivation for blob stores in general:
    --thousands of computers want to do "big-data" processing
      (lots of data, processing that data in parallel, MapReduce-style
      computations)
    --need to put their data somewhere
    --need to amortize the work of reading (disk seek plus overhead of
      initiating communication to another node)
--Motivation for *this* blob store:
    --data center bandwidth might not be the bottleneck
    --make all reads cost the same, regardless of data placement?
    --the claim is that this leads to more flexible big-data application
      development

2. FDS

A. [last time] Intro

--common pattern:
    --lots of clients
    --lots of storage servers (known as "tractservers" here)
    --a partitioning strategy
    --master (known as the "metadata server" here) controls partitioning
    --replica groups for reliability, availability, durability

B. [last time] Basic Design

--key problem: how does the system partition the data, and how do the
  other nodes know where to find it?

--stop and ask: how fast will a client be able to read a single tract?
    [disk bandwidth: ~100 MB/sec]
    [network bandwidth to each client: 10 Gbps = 1.25 GB/sec]
--so where does the abstract's single-client 2 GB/s figure come from?
    [see end of 5.2: gave each client 20 Gbps of NICs; read from multiple
     tracts]

--why distribute a blob over multiple servers?
    (Because we want access bandwidth to a blob to be larger than
    single-disk bandwidth. That can happen if one client is suddenly
    reading a blob all at once [see above], or if multiple clients are
    simultaneously accessing different parts of the blob.)
--when would such distribution impede performance?

--abstract claims they recover from a lost disk (92 GB) in 6.2 seconds
    that's ~15 GB/sec
    how is that even possible?! that's ~150x the bandwidth of a single
    disk (at ~100 MB/s)!

C. Replication

--replication: what if the TLT looks like this:
    0: S1 S2
    1: S2 S1
    2: S3 S4
    3: S4 S3
    ...
  Why is this a bad idea?
  How long will repair take?
  What are the risks if two servers fail?

--Q: why is the paper's n^2 scheme better?
  TLT with n^2 entries, with every server pair occurring once
  (a toy sketch of this construction appears at the end of this section):
    0: S1 S2
    1: S1 S3
    2: S1 S4
    3: S2 S1
    4: S2 S3
    5: S2 S4
    ...
  How long will repair take?
  What are the risks if two servers fail?

--Q: why do they actually use a minimum replication level of 3?
  same n^2 table as before, but the third server is randomly chosen
  What is the effect on repair time?
  What is the effect of two servers failing?
  What if three disks fail?

* Replication
  A writing client sends a copy to each tractserver in the tract's TLT entry.
  A reading client asks one tractserver.

--how do they handle churn?
    --Adding a tractserver: replace relevant entries in the map,
      copy data from other nodes, etc.
    --How do they maintain the n^2-plus-one arrangement as servers
      leave/join? Unclear.

--can restore data extremely quickly. why?
    (because when a disk fails, they're not sitting there recopying from
    one disk to another. instead, they're asking m other disks for 1/m of
    the failed disk's contents each. if m=100, this cuts recovery latency
    by roughly 100x.)
    old idea (RAMCloud is a recent instantiation), but this paper goes
    disk to disk.
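A toy sketch (in Go) of the n^2 table construction referred to above: one
entry per pair of tractservers, plus a randomly chosen third replica for the
minimum replication level of 3. The names and the exact way the third replica
is picked are illustrative assumptions, not FDS's actual table-building code;
the point is that every pair of servers shares an entry, which is why a failed
disk can be rebuilt from (and onto) all the other disks in parallel.

    package main

    import (
        "fmt"
        "math/rand"
    )

    // buildTLT builds a toy tract locator table in the spirit of FDS's n^2
    // scheme: one entry for every ordered pair of distinct tractservers,
    // plus a randomly chosen third replica (illustrative, not FDS's code).
    func buildTLT(servers []string) [][]string {
        var tlt [][]string
        for _, a := range servers {
            for _, b := range servers {
                if a == b {
                    continue
                }
                // pick a random third replica distinct from a and b
                c := a
                for c == a || c == b {
                    c = servers[rand.Intn(len(servers))]
                }
                tlt = append(tlt, []string{a, b, c})
            }
        }
        return tlt
    }

    func main() {
        servers := []string{"S1", "S2", "S3", "S4", "S5"}
        tlt := buildTLT(servers)
        // n*(n-1) entries: every server pair appears, so every server holds
        // a little of every other server's data.
        fmt.Printf("%d servers -> %d TLT entries\n", len(servers), len(tlt))
        for i, entry := range tlt[:6] {
            fmt.Println(i, entry)
        }
    }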
--What happens after a tractserver fails?
    Metadata server stops getting heartbeat RPCs
    Picks a random replacement for each TLT entry the failed server was in
    New TLT gets a new version number
    Replacement servers fetch copies

--what if a client reads/writes but has an old tract locator table?
    (version number: the server is supposed to reject the request.
     but what if the server also has stale info?)

--how do they ensure consistency?
  they don't. version numbers help, but they can still get consistency
  deviations:
    --example:
       * metadata server times out on a tractserver
       * tractserver wasn't down; just slow
       * client 1 doesn't hear about the metadata update
       * client 2 hears about the metadata update
       * client 2 writes to three servers *not* including the timed-out one
       * client 1 reads from the old tractserver; now sees stale data

--will they lose data?
    (answer: not unless all of the replicas fail. why? because the writing
    client waits for all replicas to acknowledge before returning to the app.)

--what do they do about cluster growth?
    --add entries to the table?
      (no: that would change the table size, which would cause nearly
      everything to remap.)
    --so they must be fixing the table size at the beginning of time
      (they don't talk about this restriction), with lots of redundant
      entries.
    --then, to add a node, they replace a bunch of redundant entries with
      the new node
    --does that provide load balance?
      (answer: no. more accurately: it depends on how many redundant
      entries there were.)
      (consistent hashing handles this better.)
    --they mark new entries pending, etc.

--what happens if the metadata server (the thing distributing the TLT)
  fails?
    --answer: the system is hosed. a human operator has to fix it.
      (so they have a single point of failure)
    --while the metadata server is down, can the system proceed?
    --how does a rebooted metadata server get a copy of the TLT?

D. Other questions

Q: why do they need the scrubber application mentioned in 2.3?
   why don't they delete the tracts when the blob is deleted?
   can a blob be written after it is deleted?
   [because of inconsistency between metadata and data?]

--they say that their zero-copy model is different; how is it different?
    (answer: need to get buffers from the NIC, return them, etc.)

E. Performance and evaluation

--how do we know we're seeing "good" performance? what's the best you can
  expect?
--they do good science: establish a baseline by determining what the raw
  hardware can do, and use that as the comparison point.

--question: why don't they make the tract size 1 MB?
    (Answer: figure 3. their throughput wouldn't be as good.)

Q: Figure 4a: why does it start low? why does it go up? why does it level
   off? why does it level off at that particular performance?
    [eventually, aggregate client network bandwidth exceeds the
     tractservers' aggregate disk bandwidth]
    [notice: it levels off at 32 GB/s. 516 disks ==> ~62 MB/s per disk.
     approx half the disk throughput; see Fig 3 ... so they could do better!]

Q: Figure 4b shows random r/w as fast as sequential (Figure 4a). is this
   what you'd expect?
    [yes, because "sequential" is not "sequential on the disk" but rather
     "sequential through the blob": every change of tract generates the
     same workload profile]

Q: why are writes slower than reads with replication in Figure 4c?
    [need to send multiple copies]

Q: where does the 92 GB in 6.2 seconds come from?
    Table 1, 4th column
    180 GB (read + write) in 6.2 seconds is ~30 GB/s
    over 1000 disks, that's ~30 MB/s per disk
    128 servers
    what's the limiting resource? disk? cpu? net?
      --does not seem to be network or CPU
      --seems to be the "innermost, slowest" disk tracks.
        see the next-to-last paragraph of 5.3.
      --see the final paragraph of 5.3.
    a 1 TB disk in a 3000-disk cluster can be recovered in approx 17 s.
    Sounds great. But again, there is room for improvement!
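A small program that just redoes the back-of-the-envelope recovery arithmetic
above. All inputs are the figures quoted in the notes (92 GB lost disk, 6.2 s,
~1000 disks); nothing here comes from running FDS itself.

    package main

    import "fmt"

    func main() {
        const (
            failedGB = 92.0   // data on the lost disk
            secs     = 6.2    // reported recovery time
            nDisks   = 1000.0 // disks participating in recovery
        )
        // every byte is read from one disk and written to another
        // (the notes quote this as ~180 GB of traffic)
        totalGB := 2 * failedGB
        aggregate := totalGB / secs           // aggregate disk traffic, GB/s
        perDisk := aggregate / nDisks * 1000  // MB/s per disk

        fmt.Printf("aggregate: %.0f GB in %.1f s = %.0f GB/s\n",
            totalGB, secs, aggregate) // ~30 GB/s
        fmt.Printf("per disk:  ~%.0f MB/s, well under a disk's sequential bandwidth\n",
            perDisk) // ~30 MB/s
    }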
--MapReduce sort vs FDS sort

  MR style:
    * a mapper reads its split (1/Mth of the input file, e.g., a tract)
      map emits a <key, value> pair for each record in its split
      map partitions keys among R intermediate files
      (M*R intermediate files in total)
    * a reducer reads 1 of the R intermediate files produced by each mapper
      reads M intermediate files (together ~1/Rth of the data)
      sorts its input
      produces 1/Rth of the final sorted output file (R blobs)

  FDS style:
    * not totally different. 2-3 phases. buckets, etc. but:
    * mapper/reducer role is the same
    * no writing of intermediate content
      (not really a consequence of FDS or bisection bandwidth; could do
      this in a traditional data center)
    * relies on dynamic work allocation, which *is* a consequence of
      bisection bandwidth + the FDS design
      (MR couldn't have a head coordinating at such a fine grain, because
      a free node might not be near the data)
    * the workload does not have a high reduction factor, so it benefits
      from FDS's datacenter topology

  How big is each sort bucket? i.e., is the sort of each bucket in-memory?
  (the arithmetic is sketched in code below)
    1400 GB total
    128 compute servers
    between 12 and 96 GB of RAM each; hmm, say 50 on average,
      so total RAM may be 6400 GB
    thus the sort of each bucket is in memory and needs no extra
      read/write passes to FDS
    thus total time is just three transfers of 1400 GB
      client limit: 128 * 2 GB/s = 256 GB/sec
      disk limit: 1000 * 50 MB/s = 50 GB/sec
    thus the bottleneck is likely to be disk throughput

--penultimate paragraph of 6.1: very nice
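To make the bucket-size and bottleneck reasoning above concrete, here is the
same arithmetic as a tiny Go sketch. The 50 GB average RAM and 50 MB/s
per-disk figures are the notes' rough guesses, not measured values.

    package main

    import "fmt"

    func main() {
        const (
            dataGB       = 1400.0 // total data to sort
            servers      = 128.0  // compute servers
            ramPerServer = 50.0   // GB, rough average of the 12-96 GB range
            clientGBps   = 2.0    // per-client FDS bandwidth
            disks        = 1000.0
            diskMBps     = 50.0   // realistic per-disk throughput (Fig 3)
        )
        totalRAM := servers * ramPerServer   // 6400 GB >> 1400 GB
        clientLimit := servers * clientGBps  // aggregate client bandwidth
        diskLimit := disks * diskMBps / 1000 // aggregate disk bandwidth

        fmt.Printf("total RAM %.0f GB vs data %.0f GB -> buckets sort in memory\n",
            totalRAM, dataGB)
        fmt.Printf("client limit %.0f GB/s, disk limit %.0f GB/s -> disks are the bottleneck\n",
            clientLimit, diskLimit)
    }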
F. Discussion

3. admin notes

--start on lab 5b in advance!

4. Peer-to-peer

Kademlia: most commonly used DHT; basis of eDonkey
  done at NYU!!!
[DHTs, consistent hashing, Chord, Kademlia, BitTorrent]

Peer-to-peer
  [draw picture: user computers, files, direct xfers]
  users' computers talk directly to each other to implement a service,
  in contrast to user computers talking to central servers
  could be closed or open
  examples: Skype, video and music players, file sharing

Why might P2P be a win?
  spreads network/caching costs over users
  absence of a server may mean:
    easier to deploy
    less chance of overload
    single failure won't wreck the whole system
    harder to attack

Why don't all Internet services use P2P?
  can be hard to find data items over millions of users
  user computers are not as reliable as managed servers
  if open, can be attacked via evil participants

The result is that P2P has some successful niches:
  Client-client video/music, where serving costs are high
  Chat (user to user anyway; privacy and control)
  Popular data but the owning organization has no money
  No natural single owner or controller (Bitcoin)
  Illegal file sharing

Example: classic BitTorrent
  a cooperative download system, very popular!
  user clicks on a download link for, e.g., the latest Linux kernel
    distribution
  gets a torrent file w/ the content hash and the IP address of a tracker
  user's BT client talks to the tracker
  tracker tells it a list of other user clients w/ the downloaded file
  user's BT client talks to one or more clients w/ the file
  user's BT client tells the tracker it has a copy now too
  user's BT client serves the file to others for a while
  the point: provides huge download b/w w/o an expensive server/link

BitTorrent can also use a DHT instead of / as well as a tracker
  this is the topic of one of the optional readings for today
  BT clients cooperatively implement a giant key/value store,
    a "distributed hash table"
  the key is the file content hash ("infohash")
  the value is the IP address of a client willing to serve the file
    Kademlia can store multiple values for a key
  a client does get(infohash) to find other clients willing to serve
    and put(infohash, self) to register itself as willing to serve
  a client also joins the DHT to help implement it

Why might the DHT be a win for BitTorrent?
  a single giant tracker is less fragmented than many trackers,
    so clients are more likely to find each other
  maybe a classic tracker is too exposed to legal, etc., attacks
  it's not clear that BitTorrent depends heavily on the DHT;
    mostly a backup for classic trackers?

---------------------------------------------------------------------------

Acknowledgment: Some FDS pieces are from Robert Morris's 6.824 notes.