Class 17
CS 372H
24 March 2011

On the board
------------

1. Last time
2. Disks, continued
3. Flash memory
4. File systems

---------------------------------------------------------------------------

1. Last time

    --I/O architecture (high-level)
    --Disks

2. Disks, continued

E. How the driver interfaces to the disk

    --Sectors
        --Disk interface presents a linear array of **sectors**
        --generally 512 bytes, written atomically (even on power failure;
          the disk retains enough momentum to complete the write)
        --larger atomic units have to be synthesized by the OS (will
          discuss later)
            --this goes for multiple contiguous sectors or even a whole
              collection of unrelated sectors
            --the OS will find ways to make such writes *appear* atomic,
              though, of course, the disk itself can't write more than a
              sector atomically
            --analogy to critical sections in code: a thread holds a lock
              for a while, doing a bunch of things. to the other threads,
              whatever that thread does is atomic: they can observe the
              state before lock acquisition and after lock release, but
              not in the middle, even though, of course, the lock-holding
              thread is really doing a bunch of operations that are not
              atomic from the processor's perspective

    --Disk maps logical sector # to physical sectors
        --Zoning: puts more sectors on the longer (outer) tracks
        --Track skewing: sector 0 position varies by track, but let the
          disk worry about it. Why? (for speed when doing sequential
          access)
        --Sparing: flawed sectors are remapped elsewhere
        --all of this is invisible to the OS. stated more precisely, the
          OS does not know the logical-to-physical sector mapping. the OS
          specifies a platter, track, and sector, but who knows where the
          data really is?

    --In any case, a larger difference in logical sector # means a longer
      seek
        --Highly non-linear relationship (*and* depends on zone)
        --OS has no info on rotational positions
        --Can empirically build a table to estimate times
    --Turns out that sometimes the logical-->physical sector mapping is
      what you'd expect.

F. Disk performance, II

    --Disk cache used for read-ahead (disk keeps reading past the last
      host request)
        --Otherwise, sequential reads would incur a whole revolution of
          delay between requests
        --Policy decision: should read-ahead cross track boundaries? a
          head switch cannot be stopped, so there is a cost to aggressive
          read-ahead
    --Write caching can be a big win
        --if battery backed: data in the write buffer is often
          overwritten, so buffering can save disk writes. also, having
          many writes buffered means scheduling can happen optimally
        --if not battery backed: then there is a policy decision between
          the disk and the host about whether to report data in the cache
          as being on disk or not

    --Placement and ordering of requests is critical
        --Sequential I/O is much, much MUCH **MUCH** faster than random
        --Long seeks are much slower than short ones
        --Power might fail at any time, leaving inconsistent state
            --Must be careful about the order of writes, for crash
              recovery
            --More on this over the next few weeks
    --Try to achieve contiguous accesses where possible
        --for example, make big chunks of individual files contiguous
        --"The secret to making disks fast is to treat them like tape"
          (John Ousterhout)
    --Why? say you want to read 1KB at a random location. how much does
      that cost?

        average seek:  ~4 ms
        1/2 rotation:  ~3 ms    (10,000 RPM = 166 rotations/s
                                 = 6 ms/rotation)
        transfer:      ~.01 ms  because
                       512 bytes/sector * 1000 sectors/track
                         * 1 track/6 ms ~ 85 MB/s transfer speed,
                       so 1 KB / (85 MB/s) = 1 KB / (85 KB/ms) = ~.01 ms

      seek + rotation time dominates for small reads!
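    A quick sanity check of the arithmetic above, as a sketch in C. The
    constants are the rough lecture figures (4 ms seek, 3 ms half
    rotation, ~85 MB/s transfer), not measurements of any particular
    disk:

        /* Rough model of one disk request: seek + half rotation + transfer. */
        #include <stdio.h>

        int main(void)
        {
            double seek_ms      = 4.0;      /* average seek */
            double rotation_ms  = 3.0;      /* half of a 6 ms revolution */
            double bw_kb_per_ms = 85.0;     /* ~85 MB/s = 85 KB/ms */

            double sizes_kb[] = { 1, 64, 1024 };  /* 1 KB random read vs. bigger chunks */
            for (int i = 0; i < 3; i++) {
                double transfer_ms = sizes_kb[i] / bw_kb_per_ms;
                double total_ms    = seek_ms + rotation_ms + transfer_ms;
                printf("%6.0f KB: %5.2f ms total (%.2f ms of that is transfer)\n",
                       sizes_kb[i], total_ms, transfer_ms);
            }
            return 0;
        }

    The ~7 ms positioning cost is the same for every request; only the
    transfer term grows with the request size, which is the point of the
    next bullet.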
    --implication: can get 100s of times more data with almost no further
      overhead (more data affects only the transfer-time term)
    --more abstractly:

        effective_bandwidth(chunk_size)
            = chunk_size / (7 ms + chunk_size/actual_BW),

      where actual_BW ~ 85 MB/s

    --Try to order requests to minimize seek times
        --the OS (or disk) can only do this if it has multiple requests
          to order
        --Requires disk I/O concurrency
        --High-performance apps try to maximize I/O concurrency
            --or avoid I/O except to do write-logging (stick all your
              data structures in memory; write "backup" copies to disk
              sequentially; don't do random-access reads from the disk)

G. Disk scheduling (performance, III)

    --see 5.4.3 in the book

    --FCFS: process requests in the order they are received
        +: easy to implement
        +: good fairness
        -: cannot exploit request locality
        -: increases average latency, decreasing throughput

    --SPTF/SSTF/SSF (shortest positioning time first / shortest seek time
      first): pick the request with the shortest positioning (or seek)
      time
        +: exploits locality of requests
        +: higher throughput
        -: starvation
        -: don't always know which request will be fastest
        improvement: aged SPTF
            --give older requests priority
            --adjust the "effective" positioning time with a weighting
              [no pun intended] factor:

                T_eff = T_pos - W*T_wait

    --Elevator scheduling: like SPTF, but the next seek must be in the
      same direction; switch direction only if there are no further
      requests in that direction
        +: exploits locality
        +: bounded waiting
        -: cylinders in the middle get better service
        -: doesn't fully exploit locality
        modification: only sweep in one direction (treating the addresses
        as circular): very commonly used in Unix. (a small sketch of this
        circular variant appears at the end of part 2, below.)

H. Technology and systems trends

    --unfortunately, while seeks and rotational delays are getting a
      little faster, they have not kept up with the huge growth elsewhere
      in computers
        --transfer bandwidth has grown about 10x per decade
        --the thing that is growing fast is disk density (bytes_stored/$);
          that's because density is less constrained by the mechanical
          limitations
            --to improve density, need to get the head close to the
              surface
            --[aside: what happens if the head contacts the surface?
              called a "head crash": scrapes off the magnetic material
              ... and, with it, the data.]
    --Disk accesses are a huge system bottleneck, and it's getting worse.
      So what to do?
        --Bandwidth increases let the system (pre-)fetch large chunks for
          about the same cost as a small chunk
        --So trade latency for bandwidth if you can get lots of related
          stuff at roughly the same time. How to do that?
            --By clustering the related stuff together on the disk
    --The saving grace for big systems is that memory size is increasing
      faster than typical workload size
        --result: more and more of the workload fits in the file cache,
          which in turn means that the profile of traffic to the disk has
          changed: now mostly writes and new data
        --which means logging and journaling become viable (more on this
          over the next few classes)
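    A minimal sketch, in C, of the circular (one-direction) elevator idea
    mentioned in part G. The pending-request array, the cylinder numbers,
    and the function name are invented for illustration; a real scheduler
    would also consider rotational position and request age:

        /* Toy C-SCAN-style pick: sweep upward through pending cylinder
           numbers, wrapping around to the lowest request once the sweep
           passes the last one. */
        #include <stdio.h>

        static int pick_next(int head, const int pending[], int n)
        {
            int best_up = -1, lowest = -1;
            for (int i = 0; i < n; i++) {
                if (lowest < 0 || pending[i] < pending[lowest])
                    lowest = i;
                if (pending[i] >= head &&
                    (best_up < 0 || pending[i] < pending[best_up]))
                    best_up = i;
            }
            return (best_up >= 0) ? best_up : lowest;  /* wrap if nothing ahead */
        }

        int main(void)
        {
            int pending[] = { 98, 183, 37, 122, 14, 124, 65, 67 };
            int n = sizeof(pending) / sizeof(pending[0]);
            int head = 53;
            int i = pick_next(head, pending, n);
            printf("head at cylinder %d, serve cylinder %d next\n",
                   head, pending[i]);
            return 0;
        }

    Calling pick_next repeatedly (removing each served request) sweeps
    upward through the outstanding requests and then wraps around, which
    is the circular variant noted above.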
---------------------------------------------------------------------------

Admin note

    --no class one week from today
    --there will be a makeup by video in two weeks

---------------------------------------------------------------------------

3. Flash memory

A. Overview

    --Today, people are increasingly using flash memory
    --Completely solid state (no moving parts)
        --Remembers data by storing charge
        --Lower power consumption and heat
        --No mechanical seek times to worry about
    --Limited # of overwrites possible
        --Blocks wear out after 10,000 (MLC) to 100,000 (SLC) erases
        --Requires a _flash translation layer_ (FTL) to provide _wear
          leveling_, so repeated writes to a logical block don't wear out
          one physical block (a toy sketch appears at the end of part 3,
          below)
        --FTL can seriously impact performance
            --In particular, random writes are _very_ expensive (see
              http://research.microsoft.com/pubs/63681/TR-2005-176.pdf)
    --Limited durability
        --Charge leaks over time
        --Turn off the device for a year, and you can easily lose data

B. Types of flash memory

    --NAND flash (most prevalent for storage)
        --Higher density (most used for storage)
        --Faster erase and write
        --More errors internally, so need error correction
    --NOR flash
        --Faster reads in smaller data units
        --Can execute code straight out of NOR flash
        --Significantly slower erases
    --Single-level cell (SLC) vs. multi-level cell (MLC)
        --MLC encodes multiple bits in the voltage level
        --MLC is slower to write than SLC

    --NAND flash overview
        --Flash device has 2112-byte _pages_
            --2048 bytes of data + 64 bytes of metadata & ECC
        --_Blocks_ contain 64 (SLC) or 128 (MLC) pages
          (i.e., 128KB or 256KB blocks)
        --Blocks are divided into 2-4 _planes_
            --All planes contend for the same package pins
            --But they can access their blocks in parallel to overlap
              latencies
        --Can _read_ one page at a time
            --Takes 25 microseconds + time to get the data off the chip
        --Must _erase_ a whole block before _programming_ it
            --Erase sets all bits to 1: very expensive (2 msec)
            --Programming a pre-erased block requires moving data to an
              internal buffer, then 200 (SLC) to 800 (MLC) microseconds
    --so random reads and writes are way faster than on a disk. But......
        --sequential disk reads and writes are roughly as fast as flash
          memory (at least in terms of order of magnitude) and much
          cheaper in $/byte

    --Flash characteristics, from
      http://cseweb.ucsd.edu/~swanson/papers/Asplos2009Gordon.pdf

        Parameter                  SLC        MLC
        ---------------------------------------------------
        Density per die (GB)       4          8
        Page size (bytes)          2048+32    2048+64
        Block size (pages)         64         128
        Read latency (us)          25         25
        Write latency (us)         200        800
        Erase latency (us)         2000       2000

        40 MHz, 16-bit bus:
        Read b/w (MB/s)            75.8       75.8
        Program b/w (MB/s)         20.1       5.0

        133 MHz:
        Read b/w (MB/s)            126.4      126.4
        Program b/w (MB/s)         20.1       5.0

    --disk vs. MLC NAND flash vs. regular DRAM

                            disk          flash          DRAM
        ------------------------------------------------------------
        Smallest write      sector        sector         byte
        Atomic write        sector        sector         byte/word
        Random read         8 ms          75 us          50 ns
        Random write        8 ms          300 us*        50 ns
        Sequential read     100 MB/s      250 MB/s       > 1 GB/s
        Sequential write    100 MB/s      170 MB/s*      > 1 GB/s
        Cost                $0.08-1/GB    $3/GB          $10-25/GB
        Persistence         Non-volatile  Non-volatile   Volatile

        *flash write performance degrades over time
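    To make the FTL idea from part A concrete, here is a toy sketch in C
    of write remapping: each write of a logical page is steered to a
    fresh physical page and the map is updated, so repeated writes to one
    logical page are spread across the device. All names and sizes are
    invented; a real FTL also does garbage collection, block erasure,
    wear-count tracking, and must survive power loss:

        /* Toy flash translation layer: logical page -> physical page map. */
        #include <stdio.h>

        #define NPAGES 8                 /* tiny "device" for illustration */

        static int map[NPAGES];          /* logical page -> physical page */
        static int next_free = 0;        /* next fresh physical page */

        static void ftl_init(void)
        {
            for (int i = 0; i < NPAGES; i++)
                map[i] = -1;             /* -1 = logical page never written */
        }

        /* Write: never overwrite in place; claim a fresh physical page. */
        static void ftl_write(int logical)
        {
            int phys = next_free++ % NPAGES;  /* pretend old pages get reclaimed */
            map[logical] = phys;
            printf("logical page %d -> physical page %d\n", logical, phys);
        }

        int main(void)
        {
            ftl_init();
            /* Repeated writes to logical page 0 land on different physical
               pages, spreading the wear instead of hammering one block. */
            ftl_write(0);
            ftl_write(0);
            ftl_write(0);
            ftl_write(3);
            return 0;
        }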
4. File systems

[write on the board:]

    A. Intro
    B. Files
    C. Implementing files
        1. contiguous
        2. linked files
        3. FAT
        4. indexed files
    D. Directories
    E. FS performance (case study: FFS)

A. Intro

    --more papers on FSs than on any other single topic
    --probably also the hardest part of operating systems

    --what does a FS do?
        --provide persistence (don't go away ... ever)
        --somehow associate bytes on the disk with names (files)
        --somehow associate names with each other (directories)

    --where are FSes implemented?
        --can implement them on disk, over the network, in memory, in
          NVRAM (non-volatile RAM), on tape, with paper (!!!!)
        --we are going to focus on the disk and generalize later. we'll
          see what it means to implement a FS over the network

    --a few quick notes about disks in the context of FS design
        --the disk is the first thing we've seen that (a) doesn't go
          away; and (b) we can modify (BIOS ROM, hardware configuration,
          etc. don't go away, but we weren't able to modify those
          things). two implications here:
            (i)  we're going to have to put all of our important state on
                 the disk
            (ii) we have to live with what we put on the disk! scribble
                 randomly on memory --> reboot and hope it doesn't happen
                 again. scribble randomly on the disk --> now what?
                 (answer: in many cases, we're hosed.)
        --mismatch: the CPU and memory are *also* working with "important
          state", but they are vastly faster than disks
        --the disk is enormous: 100-1000x more data than memory
            --how to organize all of this information?
            --answer: by categorizing things (taxonomies). a FS is a kind
              of taxonomy ("/homes" has home directories,
              "/homes/bob/classes/cs372h" has bob's cs372h material, etc.)

B. Files

    --what is a file?
        --answer from the user's view: a bunch of named bytes on the disk
        --answer from the FS's view: a collection of disk blocks
        --big job of a FS: map names and offsets to disk blocks:

                                 FS
            {file, offset} --------------> disk address

    --operations are create(file), delete(file), read(), write()

    --***goal: operations should take as few disk accesses as possible
      and have minimal space overhead
        --wait, why do we want minimal space overhead, given that the
          disk is huge?
        --answer: cache space is never enough; the amount of data that
          can be retrieved in one fetch is never enough. hence, we really
          don't want to waste space.

    [[--note that we have seen translation/indirection before:

                                   page table
        page table:       virtual address ----------> physical address

                                        inode
        per-file metadata:       offset ----------> disk block address

        how'd we get the inode?

                                     directory
        directory:            file name ----------> file #

                              (file # *is* an inode in Unix)
    ]]

C. Implementing files

    --our task: meet the goal marked *** above
    --for now, we're going to assume that the file's metadata is given to
      us. when we look at directories in a bit, we'll see where the
      metadata comes from; the picture above should also give a hint

    access patterns we could imagine supporting:

        (i)   Sequential:
              --File data is processed in sequential order
              --By far the most common mode
              --Example: editor writes out a new file, compiler reads in
                a file, etc.
        (ii)  Random access:
              --Address any block in the file directly, without passing
                through the blocks before it
              --Examples: large data set, demand paging, databases
        (iii) Keyed access:
              --Search for blocks with particular values
              --Examples: associative database, index
              --This is everywhere in the field of databases and search
                engines, but...
              --...usually not provided by the FS in the OS

    helpful observations:

        (i)   All blocks in a file tend to be used together, sequentially
        (ii)  All files in a directory tend to be used together
        (iii) All *names* in a directory tend to be used together

    further design parameters:

        (i)   Most files are small
        (ii)  Much of the disk is allocated to large files
        (iii) Many of the I/O operations are made to large files
        (iv)  Want good sequential and good random access

    candidate designs........

    1. Contiguous allocation

        "extent based"

        --when creating a file, make the user pre-specify its length, and
          allocate all the space at once
        --file metadata contains location and size
        --example: IBM OS/360

            [ a1 a2 a3 b1 b2 ]

            what if a file c needs two sectors?!

        +: simple
        +: fast access, both sequential and random (a small sketch of the
           offset-to-block arithmetic follows just below)
        -: fragmentation

        where have we seen something similar? (answer: segmentation in
        virtual memory)
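    A minimal sketch, in C, of the {file, offset} --> disk address
    mapping for the contiguous (extent-based) design. The struct and
    field names are invented for illustration:

        /* Contiguous allocation: a file is one run of blocks, so mapping
           an offset to a disk block is pure arithmetic. */
        #include <stdio.h>

        #define BLOCK_SIZE 512

        struct extent_file {
            unsigned start_block;    /* first disk block of the file */
            unsigned nblocks;        /* length of the extent */
        };

        /* Return the disk block holding byte `offset` of file f, or -1. */
        static long block_for_offset(const struct extent_file *f,
                                     unsigned offset)
        {
            unsigned rel = offset / BLOCK_SIZE;
            if (rel >= f->nblocks)
                return -1;                        /* past end of file */
            return (long)f->start_block + rel;    /* one add: random access
                                                     is cheap */
        }

        int main(void)
        {
            struct extent_file f = { .start_block = 1000, .nblocks = 8 };
            printf("offset 0    -> block %ld\n", block_for_offset(&f, 0));
            printf("offset 2000 -> block %ld\n", block_for_offset(&f, 2000));
            return 0;
        }

    The linked and FAT designs below replace this one-line arithmetic
    with pointer chasing, which is why their random access is slower.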
    2. Linked files

        --keep a linked list of free blocks
        --metadata: pointer to the file's first block
        --each block holds a pointer to the next one

        +: no more fragmentation
        +: sequential access is easy (and probably mostly fast, assuming
           decent free-space management, since the pointers will point
           close by)
        -: random access is a disaster
        -: pointers take up room in blocks; messes up alignment of data

    3. Modification of linked files: FAT

        --keep the link structure in memory
            --in a fixed-size "FAT" (file allocation table)
            --pointer chasing now happens in RAM (a small sketch follows
              at the end of these notes)

        [DRAW PICTURE]

        --example: MS-DOS (and iPods, MP3 players, digital cameras)

        +: no need to maintain a separate free list (the table says
           what's free)
        +: low space overhead
        -: maximum size is limited:
            64K entries, 512-byte blocks --> 32MB max file system
            bigger blocks bring advantages and disadvantages, and ditto
            a bigger table

        note: to guard against bad sectors, better to store multiple
        copies of the FAT on the disk!!

[thanks to David Mazieres for portions of the above]
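    To make the FAT pointer chasing concrete, here is a minimal sketch in
    C. The table contents, the end-of-chain sentinel, and the names are
    invented for illustration:

        /* Toy FAT lookup: walk the in-memory table from the file's first
           block to find the block holding a given byte offset. */
        #include <stdio.h>

        #define BLOCK_SIZE 512
        #define FAT_EOF   (-1)    /* end-of-chain marker */
        #define FAT_FREE  (-2)    /* block not allocated to any file */

        /* fat[b] = next block of the file after block b. In this toy
           table, one file occupies blocks 0 -> 3 -> 6 -> 1. */
        static int fat[8] = { 3, FAT_EOF, FAT_FREE, 6,
                              FAT_FREE, FAT_FREE, 1, FAT_FREE };

        /* Return the disk block holding byte `offset`, given the file's
           first block (which comes from the file's metadata). */
        static int fat_block_for_offset(int first_block, unsigned offset)
        {
            int b = first_block;
            for (unsigned i = 0; i < offset / BLOCK_SIZE; i++) {
                if (b == FAT_EOF)
                    return -1;    /* offset is past the end of the file */
                b = fat[b];       /* one pointer chase per block */
            }
            return b;
        }

        int main(void)
        {
            printf("offset 0    -> block %d\n", fat_block_for_offset(0, 0));
            printf("offset 1500 -> block %d\n", fat_block_for_offset(0, 1500));
            return 0;
        }

    Every extra block of offset costs one more table lookup, so random
    access is linear in the offset; the upside over plain linked files is
    that the chasing happens in RAM, not on disk.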