Class 17
CS 372H
24 March 2011

On the board
------------

1. Last time
2. Disks, continued
3. Flash memory
4. File systems

---------------------------------------------------------------------------

1. Last time

    --I/O architecture (high-level)
    --Disks

2. Disks, continued

E. How the driver interfaces to the disk

    --Sectors
        --Disk interface presents a linear array of **sectors**
        --generally 512 bytes, written atomically (even on power failure;
          the disk retains enough momentum to complete the write)
        --larger atomic units have to be synthesized by the OS (will
          discuss later)
            --this goes for multiple contiguous sectors or even a whole
              collection of unrelated sectors
            --the OS will find ways to make such writes *appear* atomic,
              though, of course, the disk itself can't write more than a
              sector atomically
            --analogy to critical sections in code: a thread holds a lock
              for a while, doing a bunch of things. to the other threads,
              whatever that thread does is atomic: they can observe the
              state before lock acquisition and after lock release, but
              not in the middle, even though, of course, the lock-holding
              thread is really doing a bunch of operations that are not
              atomic from the processor's perspective

    --Disk maps logical sector # to physical sectors
        --Zoning: puts more sectors on the longer (outer) tracks
        --Track skewing: sector 0 position varies by track, but let the
          disk worry about it. Why? (for speed when doing sequential
          access)
        --Sparing: flawed sectors are remapped elsewhere
        --all of this is invisible to the OS. stated more precisely, the
          OS does not know the logical-to-physical sector mapping. the OS
          specifies a platter, track, and sector, but who knows where the
          data really is?

    --In any case, a larger difference in logical sector # means a longer
      seek
        --Highly non-linear relationship (*and* depends on zone)
        --OS has no info on rotational positions
        --Can empirically build a table to estimate times
    --Turns out that sometimes the logical-->physical sector mapping is
      what you'd expect.

F. Disk performance, II

    --Disk cache used for read-ahead (disk keeps reading past the last
      host request)
        --Otherwise, sequential reads would incur a whole revolution of
          delay between requests
        --Policy decision: should read-ahead cross track boundaries? a
          head switch cannot be stopped, so there is a cost to aggressive
          read-ahead
    --Write caching can be a big win
        --if battery backed: data in the write buffer is often
          overwritten, so buffering can save disk writes. also, having
          many writes buffered means scheduling can happen optimally
        --if not battery backed: then there is a policy decision between
          the disk and the host about whether to report data in the cache
          as being on disk or not

    --Placement and ordering of requests is critical
        --Sequential I/O is much, much MUCH **MUCH** faster than random
        --Long seeks are much slower than short ones
        --Power might fail at any time, leaving inconsistent state
            --Must be careful about the order of writes, for crash
              recovery
            --More on this over the next few weeks
    --Try to achieve contiguous accesses where possible
        --for example, make big chunks of individual files contiguous
        --"The secret to making disks fast is to treat them like tape"
          (John Ousterhout)
    --Why? say you want to read 1KB at a random location. how much does
      that cost?

        average seek:  ~4 ms
        1/2 rotation:  ~3 ms    (10,000 RPM = 166 rotations/s
                                 = 6 ms/rotation)
        transfer:      ~.01 ms  because
                       512 bytes/sector * 1000 sectors/track
                         * 1 track/6 ms ~ 85 MB/s transfer speed,
                       so 1 KB / (85 MB/s) = 1 KB / (85 KB/ms) = ~.01 ms

      seek + rotation time dominates for small reads!
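    A quick sanity check of the arithmetic above, as a sketch in C. The
    constants are the rough lecture figures (4 ms seek, 3 ms half
    rotation, ~85 MB/s transfer), not measurements of any particular
    disk:

        /* Rough model of one disk request: seek + half rotation + transfer. */
        #include <stdio.h>

        int main(void)
        {
            double seek_ms      = 4.0;      /* average seek */
            double rotation_ms  = 3.0;      /* half of a 6 ms revolution */
            double bw_kb_per_ms = 85.0;     /* ~85 MB/s = 85 KB/ms */

            double sizes_kb[] = { 1, 64, 1024 };  /* 1 KB random read vs. bigger chunks */
            for (int i = 0; i < 3; i++) {
                double transfer_ms = sizes_kb[i] / bw_kb_per_ms;
                double total_ms    = seek_ms + rotation_ms + transfer_ms;
                printf("%6.0f KB: %5.2f ms total (%.2f ms of that is transfer)\n",
                       sizes_kb[i], total_ms, transfer_ms);
            }
            return 0;
        }

    The ~7 ms positioning cost is the same for every request; only the
    transfer term grows with the request size, which is the point of the
    next bullet.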
    --implication: can get 100s of times more data with almost no further
      overhead (more data affects only the transfer-time term)
    --more abstractly:

        effective_bandwidth(chunk_size)
            = chunk_size / (7 ms + chunk_size/actual_BW),

      where actual_BW ~ 85 MB/s

    --Try to order requests to minimize seek times
        --the OS (or disk) can only do this if it has multiple requests
          to order
        --Requires disk I/O concurrency
        --High-performance apps try to maximize I/O concurrency
            --or avoid I/O except to do write-logging (stick all your
              data structures in memory; write "backup" copies to disk
              sequentially; don't do random-access reads from the disk)

G. Disk scheduling (performance, III)

    --see 5.4.3 in the book

    --FCFS: process requests in the order they are received
        +: easy to implement
        +: good fairness
        -: cannot exploit request locality
        -: increases average latency, decreasing throughput

    --SPTF/SSTF/SSF (shortest positioning time first / shortest seek time
      first): pick the request with the shortest positioning (or seek)
      time
        +: exploits locality of requests
        +: higher throughput
        -: starvation
        -: don't always know which request will be fastest
        improvement: aged SPTF
            --give older requests priority
            --adjust the "effective" positioning time with a weighting
              [no pun intended] factor:

                T_eff = T_pos - W*T_wait

    --Elevator scheduling: like SPTF, but the next seek must be in the
      same direction; switch direction only if there are no further
      requests in that direction
        +: exploits locality
        +: bounded waiting
        -: cylinders in the middle get better service
        -: doesn't fully exploit locality
        modification: only sweep in one direction (treating the addresses
        as circular): very commonly used in Unix. (a small sketch of this
        circular variant appears at the end of part 2, below.)

H. Technology and systems trends

    --unfortunately, while seeks and rotational delays are getting a
      little faster, they have not kept up with the huge growth elsewhere
      in computers
        --transfer bandwidth has grown about 10x per decade
        --the thing that is growing fast is disk density (bytes_stored/$);
          that's because density is less constrained by the mechanical
          limitations
            --to improve density, need to get the head close to the
              surface
            --[aside: what happens if the head contacts the surface?
              called a "head crash": scrapes off the magnetic material
              ... and, with it, the data.]
    --Disk accesses are a huge system bottleneck, and it's getting worse.
      So what to do?
        --Bandwidth increases let the system (pre-)fetch large chunks for
          about the same cost as a small chunk
        --So trade latency for bandwidth if you can get lots of related
          stuff at roughly the same time. How to do that?
            --By clustering the related stuff together on the disk
    --The saving grace for big systems is that memory size is increasing
      faster than typical workload size
        --result: more and more of the workload fits in the file cache,
          which in turn means that the profile of traffic to the disk has
          changed: now mostly writes and new data
        --which means logging and journaling become viable (more on this
          over the next few classes)
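    A minimal sketch, in C, of the circular (one-direction) elevator idea
    mentioned in part G. The pending-request array, the cylinder numbers,
    and the function name are invented for illustration; a real scheduler
    would also consider rotational position and request age:

        /* Toy C-SCAN-style pick: sweep upward through pending cylinder
           numbers, wrapping around to the lowest request once the sweep
           passes the last one. */
        #include <stdio.h>

        static int pick_next(int head, const int pending[], int n)
        {
            int best_up = -1, lowest = -1;
            for (int i = 0; i < n; i++) {
                if (lowest < 0 || pending[i] < pending[lowest])
                    lowest = i;
                if (pending[i] >= head &&
                    (best_up < 0 || pending[i] < pending[best_up]))
                    best_up = i;
            }
            return (best_up >= 0) ? best_up : lowest;  /* wrap if nothing ahead */
        }

        int main(void)
        {
            int pending[] = { 98, 183, 37, 122, 14, 124, 65, 67 };
            int n = sizeof(pending) / sizeof(pending[0]);
            int head = 53;
            int i = pick_next(head, pending, n);
            printf("head at cylinder %d, serve cylinder %d next\n",
                   head, pending[i]);
            return 0;
        }

    Calling pick_next repeatedly (removing each served request) sweeps
    upward through the outstanding requests and then wraps around, which
    is the circular variant noted above.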
---------------------------------------------------------------------------

Admin note

    --no class one week from today
    --there will be a makeup by video in two weeks

---------------------------------------------------------------------------

3. Flash memory

A. Overview

    --Today, people are increasingly using flash memory
    --Completely solid state (no moving parts)
        --Remembers data by storing charge
        --Lower power consumption and heat
        --No mechanical seek times to worry about
    --Limited # of overwrites possible
        --Blocks wear out after 10,000 (MLC) to 100,000 (SLC) erases
        --Requires a _flash translation layer_ (FTL) to provide _wear
          leveling_, so repeated writes to a logical block don't wear out
          one physical block (a toy sketch appears at the end of part 3,
          below)
        --FTL can seriously impact performance
            --In particular, random writes are _very_ expensive (see
              http://research.microsoft.com/pubs/63681/TR-2005-176.pdf)
    --Limited durability
        --Charge leaks over time
        --Turn off the device for a year, and you can easily lose data

B. Types of flash memory

    --NAND flash (most prevalent for storage)
        --Higher density (most used for storage)
        --Faster erase and write
        --More errors internally, so need error correction
    --NOR flash
        --Faster reads in smaller data units
        --Can execute code straight out of NOR flash
        --Significantly slower erases
    --Single-level cell (SLC) vs. multi-level cell (MLC)
        --MLC encodes multiple bits in the voltage level
        --MLC is slower to write than SLC

    --NAND flash overview
        --Flash device has 2112-byte _pages_
            --2048 bytes of data + 64 bytes of metadata & ECC
        --_Blocks_ contain 64 (SLC) or 128 (MLC) pages
          (i.e., 128KB or 256KB blocks)
        --Blocks are divided into 2-4 _planes_
            --All planes contend for the same package pins
            --But they can access their blocks in parallel to overlap
              latencies
        --Can _read_ one page at a time
            --Takes 25 microseconds + time to get the data off the chip
        --Must _erase_ a whole block before _programming_ it
            --Erase sets all bits to 1: very expensive (2 msec)
            --Programming a pre-erased block requires moving data to an
              internal buffer, then 200 (SLC) to 800 (MLC) microseconds
    --so random reads and writes are way faster than on a disk. But......
        --sequential disk reads and writes are roughly as fast as flash
          memory (at least in terms of order of magnitude) and much
          cheaper in $/byte

    --Flash characteristics, from
      http://cseweb.ucsd.edu/~swanson/papers/Asplos2009Gordon.pdf

        Parameter                  SLC        MLC
        ---------------------------------------------------
        Density per die (GB)       4          8
        Page size (bytes)          2048+32    2048+64
        Block size (pages)         64         128
        Read latency (us)          25         25
        Write latency (us)         200        800
        Erase latency (us)         2000       2000

        40 MHz, 16-bit bus:
        Read b/w (MB/s)            75.8       75.8
        Program b/w (MB/s)         20.1       5.0

        133 MHz:
        Read b/w (MB/s)            126.4      126.4
        Program b/w (MB/s)         20.1       5.0

    --disk vs. MLC NAND flash vs. regular DRAM

                            disk          flash          DRAM
        ------------------------------------------------------------
        Smallest write      sector        sector         byte
        Atomic write        sector        sector         byte/word
        Random read         8 ms          75 us          50 ns
        Random write        8 ms          300 us*        50 ns
        Sequential read     100 MB/s      250 MB/s       > 1 GB/s
        Sequential write    100 MB/s      170 MB/s*      > 1 GB/s
        Cost                $0.08-1/GB    $3/GB          $10-25/GB
        Persistence         Non-volatile  Non-volatile   Volatile

        *flash write performance degrades over time
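    To make the FTL idea from part A concrete, here is a toy sketch in C
    of write remapping: each write of a logical page is steered to a
    fresh physical page and the map is updated, so repeated writes to one
    logical page are spread across the device. All names and sizes are
    invented; a real FTL also does garbage collection, block erasure,
    wear-count tracking, and must survive power loss:

        /* Toy flash translation layer: logical page -> physical page map. */
        #include <stdio.h>

        #define NPAGES 8                 /* tiny "device" for illustration */

        static int map[NPAGES];          /* logical page -> physical page */
        static int next_free = 0;        /* next fresh physical page */

        static void ftl_init(void)
        {
            for (int i = 0; i < NPAGES; i++)
                map[i] = -1;             /* -1 = logical page never written */
        }

        /* Write: never overwrite in place; claim a fresh physical page. */
        static void ftl_write(int logical)
        {
            int phys = next_free++ % NPAGES;  /* pretend old pages get reclaimed */
            map[logical] = phys;
            printf("logical page %d -> physical page %d\n", logical, phys);
        }

        int main(void)
        {
            ftl_init();
            /* Repeated writes to logical page 0 land on different physical
               pages, spreading the wear instead of hammering one block. */
            ftl_write(0);
            ftl_write(0);
            ftl_write(0);
            ftl_write(3);
            return 0;
        }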
4. File systems

[write on the board:]

    A. Intro
    B. Files
    C. Implementing files
        1. contiguous
        2. linked files
        3. FAT
        4. indexed files
    D. Directories
    E. FS performance (case study: FFS)

A. Intro

    --more papers on FSs than on any other single topic
    --probably also the hardest part of operating systems

    --what does a FS do?
        --provide persistence (don't go away ... ever)
        --somehow associate bytes on the disk with names (files)
        --somehow associate names with each other (directories)

    --where are FSes implemented?
        --can implement them on disk, over the network, in memory, in
          NVRAM (non-volatile RAM), on tape, with paper (!!!!)
        --we are going to focus on the disk and generalize later. we'll
          see what it means to implement a FS over the network

    --a few quick notes about disks in the context of FS design
        --the disk is the first thing we've seen that (a) doesn't go
          away; and (b) we can modify (BIOS ROM, hardware configuration,
          etc. don't go away, but we weren't able to modify those
          things). two implications here:
            (i)  we're going to have to put all of our important state on
                 the disk
            (ii) we have to live with what we put on the disk! scribble
                 randomly on memory --> reboot and hope it doesn't happen
                 again. scribble randomly on the disk --> now what?
                 (answer: in many cases, we're hosed.)
        --mismatch: the CPU and memory are *also* working with "important
          state", but they are vastly faster than disks
        --the disk is enormous: 100-1000x more data than memory
            --how to organize all of this information?
            --answer: by categorizing things (taxonomies). a FS is a kind
              of taxonomy ("/homes" has home directories,
              "/homes/bob/classes/cs372h" has bob's cs372h material, etc.)

B. Files

    --what is a file?
        --answer from the user's view: a bunch of named bytes on the disk
        --answer from the FS's view: a collection of disk blocks
        --big job of a FS: map names and offsets to disk blocks:

                                 FS
            {file, offset} --------------> disk address

    --operations are create(file), delete(file), read(), write()

    --***goal: operations should take as few disk accesses as possible
      and have minimal space overhead
        --wait, why do we want minimal space overhead, given that the
          disk is huge?
        --answer: cache space is never enough; the amount of data that
          can be retrieved in one fetch is never enough. hence, we really
          don't want to waste space.

    [[--note that we have seen translation/indirection before:

                                   page table
        page table:       virtual address ----------> physical address

                                        inode
        per-file metadata:       offset ----------> disk block address

        how'd we get the inode?

                                     directory
        directory:            file name ----------> file #

                              (file # *is* an inode in Unix)
    ]]

C. Implementing files

    --our task: meet the goal marked *** above
    --for now, we're going to assume that the file's metadata is given to
      us. when we look at directories in a bit, we'll see where the
      metadata comes from; the picture above should also give a hint

    access patterns we could imagine supporting:

        (i)   Sequential:
              --File data is processed in sequential order
              --By far the most common mode
              --Example: editor writes out a new file, compiler reads in
                a file, etc.
        (ii)  Random access:
              --Address any block in the file directly, without passing
                through the blocks before it
              --Examples: large data set, demand paging, databases
        (iii) Keyed access:
              --Search for blocks with particular values
              --Examples: associative database, index
              --This is everywhere in the field of databases and search
                engines, but...
              --...usually not provided by the FS in the OS

    helpful observations:

        (i)   All blocks in a file tend to be used together, sequentially
        (ii)  All files in a directory tend to be used together
        (iii) All *names* in a directory tend to be used together

    further design parameters:

        (i)   Most files are small
        (ii)  Much of the disk is allocated to large files
        (iii) Many of the I/O operations are made to large files
        (iv)  Want good sequential and good random access

    candidate designs........

    1. Contiguous allocation

        "extent based"

        --when creating a file, make the user pre-specify its length, and
          allocate all the space at once
        --file metadata contains location and size
        --example: IBM OS/360

            [ a1 a2 a3 b1 b2 ]

            what if a file c needs two sectors?!

        +: simple
        +: fast access, both sequential and random (a small sketch of the
           offset-to-block arithmetic follows just below)
        -: fragmentation

        where have we seen something similar? (answer: segmentation in
        virtual memory)
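    A minimal sketch, in C, of the {file, offset} --> disk address
    mapping for the contiguous (extent-based) design. The struct and
    field names are invented for illustration:

        /* Contiguous allocation: a file is one run of blocks, so mapping
           an offset to a disk block is pure arithmetic. */
        #include <stdio.h>

        #define BLOCK_SIZE 512

        struct extent_file {
            unsigned start_block;    /* first disk block of the file */
            unsigned nblocks;        /* length of the extent */
        };

        /* Return the disk block holding byte `offset` of file f, or -1. */
        static long block_for_offset(const struct extent_file *f,
                                     unsigned offset)
        {
            unsigned rel = offset / BLOCK_SIZE;
            if (rel >= f->nblocks)
                return -1;                        /* past end of file */
            return (long)f->start_block + rel;    /* one add: random access
                                                     is cheap */
        }

        int main(void)
        {
            struct extent_file f = { .start_block = 1000, .nblocks = 8 };
            printf("offset 0    -> block %ld\n", block_for_offset(&f, 0));
            printf("offset 2000 -> block %ld\n", block_for_offset(&f, 2000));
            return 0;
        }

    The linked and FAT designs below replace this one-line arithmetic
    with pointer chasing, which is why their random access is slower.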
    2. Linked files

        --keep a linked list of free blocks
        --metadata: pointer to the file's first block
        --each block holds a pointer to the next one

        +: no more fragmentation
        +: sequential access is easy (and probably mostly fast, assuming
           decent free-space management, since the pointers will point
           close by)
        -: random access is a disaster
        -: pointers take up room in blocks; messes up alignment of data

    3. Modification of linked files: FAT

        --keep the link structure in memory
            --in a fixed-size "FAT" (file allocation table)
            --pointer chasing now happens in RAM (a small sketch follows
              at the end of these notes)

        [DRAW PICTURE]

        --example: MS-DOS (and iPods, MP3 players, digital cameras)

        +: no need to maintain a separate free list (the table says
           what's free)
        +: low space overhead
        -: maximum size is limited:
            64K entries, 512-byte blocks --> 32MB max file system
            bigger blocks bring advantages and disadvantages, and ditto
            a bigger table

        note: to guard against bad sectors, better to store multiple
        copies of the FAT on the disk!!

[thanks to David Mazieres for portions of the above]
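    To make the FAT pointer chasing concrete, here is a minimal sketch in
    C. The table contents, the end-of-chain sentinel, and the names are
    invented for illustration:

        /* Toy FAT lookup: walk the in-memory table from the file's first
           block to find the block holding a given byte offset. */
        #include <stdio.h>

        #define BLOCK_SIZE 512
        #define FAT_EOF   (-1)    /* end-of-chain marker */
        #define FAT_FREE  (-2)    /* block not allocated to any file */

        /* fat[b] = next block of the file after block b. In this toy
           table, one file occupies blocks 0 -> 3 -> 6 -> 1. */
        static int fat[8] = { 3, FAT_EOF, FAT_FREE, 6,
                              FAT_FREE, FAT_FREE, 1, FAT_FREE };

        /* Return the disk block holding byte `offset`, given the file's
           first block (which comes from the file's metadata). */
        static int fat_block_for_offset(int first_block, unsigned offset)
        {
            int b = first_block;
            for (unsigned i = 0; i < offset / BLOCK_SIZE; i++) {
                if (b == FAT_EOF)
                    return -1;    /* offset is past the end of the file */
                b = fat[b];       /* one pointer chase per block */
            }
            return b;
        }

        int main(void)
        {
            printf("offset 0    -> block %d\n", fat_block_for_offset(0, 0));
            printf("offset 1500 -> block %d\n", fat_block_for_offset(0, 1500));
            return 0;
        }

    Every extra block of offset costs one more table lookup, so random
    access is linear in the offset; the upside over plain linked files is
    that the chasing happens in RAM, not on disk.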