Class 21
CS 439
02 April 2013

On the board
------------

1. Last time

2. File systems

3. crash recovery / logging

4. midterm review

---------------------------------------------------------------------------

1. Last time

    got close to the end of file systems

    A. [last time] intro
    B. [last time] files
    C. [last time] implementing files
        1. [last time] contiguous
        2. [last time] linked files
        3. [last time] FAT
        4. [last time] indexed files
    D. [last time] Directories
    E. FS performance
    F. mmap

--Hierarchical Unix

    --used since CTSS (1960s); Unix picked it up and used it nicely

    --structure like:

        "/"
         |-- bin
         |    |-- awk
         |    |-- chmod
         |    ....
         |-- cdrom
         |-- dev
         |-- sbin
         |-- tmp

    --directories stored on disk just like regular files
        --here's the data in a directory file; this data is in the
          *data blocks* of the directory:

            [ (name, i-number), (name, i-number), .... ]

        --i-node for a directory contains a special flag bit
        --only special users can write directory files

    --key point: an i-number might reference another directory
        --this neatly turns the FS into a hierarchical tree, with almost
          no work

    --another nice thing about this: if you speed up file operations, you
      also speed up directory operations, because directories are just
      like files

    --bootstrapping: where do you start looking?
        --root dir is always inode #2 (0 and 1 are reserved)

    --and, voila, we have a namespace!
        --special names: "/", ".", ".."
        --given those names, we need only two operations to navigate the
          entire name space:
            --"cd name": (change context to directory "name")
            --"ls": (list all names in current directory)

    --example: [DRAW PICTURE]

    --links:
        --hard link: multiple dir entries point to the same inode; the
          inode contains a refcount

            "ln a b": creates a synonym ("b") for file ("a")

            --how do we avoid cycles in the graph?
              (answer: can't hard link to directories)

        --soft link: synonym for a *name*

            "ln -s /d/a b":
                --creates a new inode, not just a new directory entry
                --new inode has the "sym link" bit set
                --contents of that new file: "/d/a"

E. FS performance

    --the original Unix FS was simple, elegant, and ... slow

        --blocks too small
        --file index (inode) too large
        --too many layers of mapping indirection
        --transfer rate low (they were getting one block at a time)
        --poor clustering of related objects
            --consecutive file blocks not close together
            --inodes far from data blocks
            --inodes for a given directory not close together
            --result: poor enumeration performance, meaning things like
              "ls" and "grep foo *.c" were slowwwww
        --other problems:
            --14-character names were the limit
            --can't atomically update a file in a crash-proof way

    --FFS (the fast file system) fixes these problems to a degree.

        [Reference: M. K. McKusick, W. N. Joy, S. J. Leffler, and
        R. S. Fabry. A Fast File System for UNIX. ACM Trans. on Computer
        Systems, Vol. 2, No. 3, Aug. 1984, pp. 181-197.]

    what can we do about the above? [ask for suggestions]

    * make the block size bigger (4 KB, 8 KB, or 16 KB)

    * cluster related objects

        "cylinder groups" (one or more consecutive cylinders)

        [ superblock | bookkeeping info | inodes | bitmap | data blocks (512 bytes each) ]

        --try to put inodes and data blocks in the same cylinder group
        --try to put all inodes of files in the same directory in the
          same cylinder group
        --new directories are placed in the cylinder group with a
          greater-than-average number of free inodes
        --as files are allocated, use a heuristic: spill to the next
          cylinder group after 48 KB of file (which would be the point at
          which an indirect block would be required, assuming 4096-byte
          blocks) and at every megabyte thereafter

    * bitmaps (to track free blocks; see the sketch below)

        --easier to find contiguous blocks
        --can keep the entire thing in memory (as in lab 5)
            --100 GB disk / 4 KB disk blocks = 25,000,000 blocks, so
              25,000,000 bits, or about 3 MB. not outrageous these days.

    * reserve space

        --but don't tell users. (df makes a full disk look 110% full)

    * total performance

        --20-40% of disk bandwidth for large files
        --10-20x the original Unix file system!
        --still not the best we can do (metadata writes happen
          synchronously, which really hurts performance. but making them
          asynchronous requires a story for crash recovery.)
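    --a minimal sketch of the bitmap idea (NOT FFS's actual allocator; the
      "1 = free" bit convention and the helper names are made up for
      illustration): with the bitmap in memory, finding a free block near
      a "goal" block is a cheap scan, which is what makes it easy to keep
      a growing file contiguous

        #include <stdint.h>

        /* one bit per block; assumption here: bit set = block free */
        static int bit_is_set(const uint8_t *bm, uint32_t b) {
            return bm[b / 8] & (1u << (b % 8));
        }

        static void clear_bit(uint8_t *bm, uint32_t b) {
            bm[b / 8] &= ~(1u << (b % 8));
        }

        /* allocate a free block at or after 'goal' (wrapping around), so
         * consecutive allocations for the same file tend to be adjacent
         * on disk; returns (uint32_t)-1 if the disk is full */
        uint32_t balloc_near(uint8_t *bm, uint32_t nblocks, uint32_t goal) {
            for (uint32_t i = 0; i < nblocks; i++) {
                uint32_t b = (goal + i) % nblocks;
                if (bit_is_set(bm, b)) {
                    clear_bit(bm, b);      /* mark the block allocated */
                    return b;
                }
            }
            return (uint32_t)-1;           /* no free blocks */
        }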
    Others:

    --Most obvious: big file cache
        --the kernel maintains a *buffer cache* in memory
        --internally, all uses of ReadDisk(blockNum, readbuf) are
          replaced with:

            ReadDiskCache(blockNum, readbuf) {
                ptr = buffercache.get(blockNum);
                if (ptr) {
                    copy BLKSIZE bytes from ptr to readbuf
                } else {
                    newBuf = malloc(BLKSIZE);
                    ReadDisk(blockNum, newBuf);
                    buffercache.insert(blockNum, newBuf);
                    copy BLKSIZE bytes from newBuf to readbuf
                }
            }

    --no rotation delay if you're reading the whole track
        --so try to read the whole track
    --more generally, try to work with big chunks (lots of disk blocks)
        --write in big chunks
        --read ahead in big chunks (64 KB)
        --why not just read/write 1 MB at a time?
            --(for writes: may not get data to disk often enough)
            --(for reads: may waste read bandwidth)

F. mmap: memory mapping files

    --recall some syscalls:

        fd = open(pathname, mode)
        write(fd, buf, sz)
        read(fd, buf, sz)

    --what the heck is an fd?
        --it indexes into a table
        --what's in the given entry in the table?
            --inumber!
            --inode, probably!
            --and per-open-file data (file position, etc.)

    --syscall:

        void* mmap(void* addr, size_t len, int prot, int flags,
                   int fd, off_t offset);

    --map the specified open file (fd) into a region of my virtual memory
      (at addr, or at a kernel-selected place if addr is 0), and return a
      pointer to it

    --after this, loads and stores to addr[offset] are equivalent to
      reading and writing the file at the given offset

    --how's this implemented?!
        (answer: through virtual memory, with the VA being addr [or
        whatever the kernel selects] and the PA being what? answer: the
        physical address storing the given page in the kernel's buffer
        cache)

    --have to deal with eviction from the buffer cache, but this problem
      is not unique. in all operating systems besides JOS, the kernel
      designers *anyway* have to be able to invalidate VA-->PA mappings
      when a page is removed from RAM
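    --a small user-level sketch of mmap in action (standard POSIX calls;
      error handling kept minimal, and the newline-counting task is just
      an arbitrary example): after the setup, the program touches the
      file only through ordinary loads, no read() calls

        #include <fcntl.h>
        #include <stdio.h>
        #include <sys/mman.h>
        #include <sys/stat.h>
        #include <unistd.h>

        int main(int argc, char **argv) {
            if (argc != 2) {
                fprintf(stderr, "usage: %s file\n", argv[0]);
                return 1;
            }

            int fd = open(argv[1], O_RDONLY);
            if (fd < 0) { perror("open"); return 1; }

            struct stat st;
            if (fstat(fd, &st) < 0) { perror("fstat"); return 1; }
            if (st.st_size == 0) { printf("0 newlines\n"); return 0; }

            /* kernel picks the virtual address (first argument is NULL) */
            char *p = mmap(NULL, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
            if (p == MAP_FAILED) { perror("mmap"); return 1; }

            /* ordinary loads; pages are faulted in from the buffer cache */
            long newlines = 0;
            for (off_t i = 0; i < st.st_size; i++)
                if (p[i] == '\n')
                    newlines++;
            printf("%ld newlines\n", newlines);

            munmap(p, st.st_size);
            close(fd);
            return 0;
        }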
---------------------------------------------------------------------------

thanks to David Mazieres and Mike Dahlin

---------------------------------------------------------------------------

3. crash recovery

    --there are a lot of data structures used to implement the file
      system (bitmap of free blocks, directories, inodes, indirect
      blocks, data blocks, etc.)

    --requirement: crash anywhere, and the system can be recovered

    --options:

        --*write through*: write changes immediately to disk.
          problem: slow! have to wait for each write to complete before
          going on

        --*write back*: delay writing modified data back to disk.
          problem: can lose data.
          another problem: updates can go to the disk in the wrong order

        --if multiple updates are needed, do them in a specific order so
          that if a crash occurs, **fsck** can work

    --Approaches to crash recovery:

        A. ad-hoc
        B. ordered (soft) updates
        C. WAL (write-ahead logging)

    A. ad-hoc

        --can't have all data written asynchronously. if all data were
          written asynchronously, we could encounter the following
          unacceptable scenarios:

            (a) delete/truncate a file, append to another file, crash

                --the new file may reuse a block from the old one
                --the old inode may not have been updated
                --cross-allocation!
                --often the inode with the older mtime is the wrong one,
                  but we can't be sure

            (b) append to a file, allocate an indirect block, crash

                --inode points to the indirect block
                --but the indirect block may contain garbage

        --so what's the actual approach?

            --be careful about the order of updates. specifically:
                --write the new inode to disk before the directory entry
                --remove the directory name before deallocating the inode
                --write the cleared inode to disk before updating the
                  cylinder group free map

        --how is it implemented?
            --synchronous write-through for *metadata*
            --doing one metadata write at a time ensures ordering

            example: for file create:
                --write data to the file
                --update/write the inode
                --mark the inode "allocated" in the bitmap
                --mark the data blocks "allocated" in the bitmap
                --update the directory
                --(if the directory grew) mark the new directory block
                  "allocated" in the bitmap

            now, cases:
                --inode not marked allocated in the bitmap --> the only
                  writes were to unallocated, unreachable blocks; the
                  result is that the write "disappears"
                --inode allocated, data blocks not marked allocated in
                  the bitmap --> fsck must update the bitmap
                --file created, but not yet in any directory --> fsck
                  ultimately deletes the file (after all that!)

        Disadvantages to this ad-hoc approach:

            (a) need to get the ad-hoc reasoning exactly right

            (b) poor performance (synchronous writes of metadata; see the
                small experiment after this subsection)
                --multiple updates to the same block must be issued
                  separately. for example, imagine two updates to the
                  same directory block: the first must complete before
                  the second is issued (otherwise, it isn't synchronous)
                --more generally, the cost of crash recoverability is
                  enormous (a job like "untar" could be 10-20x slower)

            (c) slow recovery: fsck must scan the entire disk
                --recovery gets slower as disks get bigger. if fsck takes
                  one minute, what happens when the disk gets 10 times
                  bigger?

        [aside: why not use battery-backed RAM? answer:
            --expensive (requires specialized hardware)
            --often don't learn the battery has died until too late
            --a pain if the computer dies (can't just move the disk)
            --if an OS bug causes the crash, the RAM might be garbage]
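        --the cost in disadvantage (b) is easy to feel even from user
          space. here is a small experiment (ordinary POSIX calls; the
          file name "testfile" and the 512-byte record size are arbitrary
          choices): run it once plainly and once with the argument
          "sync". forcing each write to disk before continuing is roughly
          what synchronous metadata updates cost the kernel

            #include <fcntl.h>
            #include <stdio.h>
            #include <string.h>
            #include <sys/time.h>
            #include <unistd.h>

            int main(int argc, char **argv) {
                int sync_each = (argc > 1 && strcmp(argv[1], "sync") == 0);

                int fd = open("testfile", O_WRONLY | O_CREAT | O_TRUNC, 0644);
                if (fd < 0) { perror("open"); return 1; }

                char buf[512];
                memset(buf, 'x', sizeof(buf));

                struct timeval start, end;
                gettimeofday(&start, NULL);
                for (int i = 0; i < 1000; i++) {
                    if (write(fd, buf, sizeof(buf)) != (ssize_t)sizeof(buf)) {
                        perror("write"); return 1;
                    }
                    /* the expensive part: wait for the disk every time */
                    if (sync_each && fsync(fd) < 0) { perror("fsync"); return 1; }
                }
                gettimeofday(&end, NULL);

                double secs = (end.tv_sec - start.tv_sec)
                            + (end.tv_usec - start.tv_usec) / 1e6;
                printf("%s: %.3f seconds\n",
                       sync_each ? "fsync after each write" : "no fsync", secs);
                close(fd);
                return 0;
            }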
    B. ordered updates

        --could reason carefully about the precise order in which
          asynchronous writes should go to disk

        --advantages:
            --performance
            --fsck is very fast and can run in the background, since all
              it needs to do is fix up bookkeeping

        --limitations:
            --hard to get right
            --arguably ad-hoc: very specific to FFS data structures.
              unclear how to apply the approach to FSes that use data
              structures like B+-trees
            --metadata updates can happen out of order (for example:
              create A, create B, crash.... it might be that only B
              exists after the crash!)

        --to see this approach in action:

            [G. R. Ganger, M. K. McKusick, C. A. N. Soules, and
            Y. N. Patt. Soft Updates: A Solution to the Metadata Update
            Problem in File Systems. ACM Trans. on Computer Systems,
            Vol. 18, No. 2, May 2000, pp. 127-153.
            http://portal.acm.org/citation.cfm?id=350853.350863]

    C. Journaling

        Golden rule of atomicity, per Saltzer-Kaashoek:
        "never modify the only copy"

        --reserve a portion of the disk for a **write-ahead log**
        --write any metadata operation first to the log, then to the disk
        --after crash/reboot, re-play the log (efficient)
            --may re-do an already committed change, but won't miss
              anything
        --performance advantage:
            --the log is a consecutive portion of the disk
            --multiple log writes are very fast (at disk bandwidth)
            --consider updates committed when they are written to the log
        --example: delete a directory tree
            --record all freed blocks and changed directory entries in
              the log
            --return control to the user
            --write out the changed directories, bitmaps, etc. in the
              background (sort for good disk arm scheduling)

        --on recovery, must do three things:
            i.   find the oldest relevant log entry
            ii.  find the end of the log
            iii. read and replay the committed portion of the log

        i. find the oldest relevant log entry

            --otherwise, it is redundant and slow to replay the whole log
            --idea: checkpoints! (this idea is used throughout systems)
                --once all records up to log entry N have been processed,
                  and once all affected blocks are stably committed to
                  disk ...
                --record N to disk, either in a reserved checkpoint
                  location or in a checkpoint log record
                --never need to go back before the most recent
                  checkpointed N

        ii. find the end of the log

            --typically a circular buffer, so look at sequence numbers
            --can include begin-transaction/end-transaction records
                --but then need to make sure that "end transaction" only
                  gets to the disk after all the other disk blocks in the
                  transaction are on disk
                --but the disk can reorder requests, and then the system
                  crashes
                --to avoid that, need a separate disk write for "end
                  transaction", which is a performance hit
                --to avoid *that*, use checksums: a log entry is
                  committed when all of its disk blocks match its
                  checksum value (see the sketch after this list)

        iii. not much to say: read and replay!
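        --a sketch of the checksum idea for finding the end of the log
          (the record layout and the toy checksum are assumptions for
          illustration, not what any particular file system does): a
          record counts as committed only if the stored checksum matches
          a checksum computed over its payload; recovery replays records
          in sequence order and stops at the first mismatch

            #include <stdint.h>

            struct log_record {
                uint64_t seq;        /* sequence number; also used to find the newest record */
                uint32_t len;        /* payload length in bytes */
                uint32_t checksum;   /* checksum over the payload */
                uint8_t  payload[];  /* the metadata updates themselves */
            };

            static uint32_t checksum(const uint8_t *p, uint32_t len) {
                uint32_t sum = 0;
                for (uint32_t i = 0; i < len; i++)
                    sum = sum * 31 + p[i];   /* toy checksum; real systems use a CRC */
                return sum;
            }

            /* during recovery: a record whose checksum doesn't match was
             * only partially written, so it (and everything after it in
             * the log) is ignored */
            int record_is_committed(const struct log_record *r) {
                return checksum(r->payload, r->len) == r->checksum;
            }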
        --logs are key: they enable atomic complex operations. to see
          this, we'll take a slight detour..... [can skip, since the same
          points come up again under transactions]

        detour: some file systems (for example, XFS from SGI) use a
        B+-tree data structure. a few quick words about B+-trees:

            --key-value map
            --ordering defined on keys (where is the nearest key?)
            --data stored in blocks, so explicitly designed for efficient
              disk access
            --with n items stored, all operations are O(log n):
                --retrieve closest to target key k
                --insert a new pair
                --delete a pair
            --see any algorithms book (e.g., Cormen et al.) for details
            --**complex to implement**

        --wait, why are we mentioning B+-trees? because some file systems
          use them:

            --efficient implementation of large directories
                (map key = hash(filename) to value = inode #)
            --efficient implementation of the inode
                --instead of using FFS-style fixed block pointers, map:
                    file offset (key) --> {start block, # blocks} (value)
                --if a file consists of a small number of extents (i.e.,
                  segments), then inodes are small, even for large files
            --efficient implementation of the map from inode # to inode:
                    inode # --> {block #, # of consecutive inodes in use}
                [bonus: allows a fast way to identify a free inode!]

        --some B+-tree operations require multiple disk writes, and the
          intermediate states are incorrect. what happens if there's a
          crash in the middle? the B+-tree could be in an inconsistent
          state
            --journaling is a big help here
                --first write all changes to the log ("insert k,v",
                  "delete k", etc.)
                --if there's a crash while writing the log, the
                  incomplete log record will be discarded, and no change
                  is made
                --otherwise, if there's a crash while updating the
                  B+-tree, the entire log record will be replayed,
                  writing everything

        --limitations of journaling
            --fsync() syncs *all* operations' metadata to the log

        --write-ahead logging is everywhere

        --what's the problem? (all data is written twice, in the worst
          case)
            --(aside: it's less of a problem to write data twice if you
              have two disks. a common way to make systems fast: use
              multiple disks. then it's easier to avoid seeks)

        --the log started as a way to help with consistency, but now the
          log is authoritative, so do we actually need the *other* copy
          of the data? what if everything were just stored in the log????

            Transition to the log-structured file system (LFS)

    D. Summarize crash recovery

        --three viable approaches to crash recovery in file systems:

            (i) ad-hoc
                --worry about metadata consistency, not data consistency
                --accomplish metadata consistency by being careful about
                  the order of updates
                --write metadata synchronously

            (ii) ordered updates (soft updates), which is in OpenBSD
                --worry about metadata consistency
                --leads to great performance: metadata doesn't have to be
                  written synchronously (writes just have to obey a
                  partial order)

            (iii) journaling (the approach in most Linux file systems)
                --more flexible
                --easier to reason about
                --possibly worse performance

4. midterm review

    Ground rules: same as last time

    Material: everything we've covered: readings, labs, homeworks,
    lectures

    lecture topics:

        --finished scheduling

        --virtual memory
            --segmentation (how does it work in general? on the x86?)
            --paging (how does it work in general? on the x86?)
                --virtual address: [ 10 bits | 10 bits | 12 bits ]
                --entry in pgdir and page table:
                    [ 20-bit physical page number | more bits | bottom 3 bits ]
                    --protection (user/kernel | read/write | present/not)
                --(see the sketch at the end of these notes)
            --what's a TLB?
            --how does JOS handle virtual memory for the kernel? for user
              processes?
            --page faults (their uses and costs)
            --page replacement
            --heap management

        --interrupts: purpose and mechanics

        --I/O
            --architecture
            --how the kernel communicates with devices
            --device drivers
            --DMA
            --polling vs. interrupts

        --Disks
            --geometry, performance, interface, technology trends,
              scheduling, placement strategy

        --More concurrency
            --review spinlocks, MCS locks
            --ccNUMA machines
            --event-driven programming

        --File systems
            --basic objects: files, directories, metadata, links, inodes
            --how does naming work? what allows the system to map
              /usr/homes/bob/index.html to a file object?
            --types of file layout:
                --extents, FAT, indexed structure, classic Unix, FFS
                --tradeoffs
            --performance
                --caching
            --crash recovery
                --ad-hoc: write metadata synchronously and in the right
                  order
                    --depend on fsck
                --ordered updates: write metadata in the right order but
                  not necessarily synchronously
                    --depend on fsck
                --write-ahead logging (called *journaling* in the FS
                  context)
            --case study: LFS (log-structured file system)
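        --for the paging review item above: a tiny sketch of splitting a
          32-bit x86 virtual address into its page-directory index,
          page-table index, and page offset (the macro names follow
          JOS's PDX/PTX/PGOFF; the example address is arbitrary)

            #include <stdint.h>
            #include <stdio.h>

            #define PDX(va)   (((uint32_t)(va) >> 22) & 0x3FF)  /* bits 31..22: page directory index */
            #define PTX(va)   (((uint32_t)(va) >> 12) & 0x3FF)  /* bits 21..12: page table index     */
            #define PGOFF(va) ((uint32_t)(va) & 0xFFF)          /* bits 11..0:  offset within page   */

            int main(void) {
                uint32_t va = 0xef7bdc04;   /* arbitrary example address */
                printf("va=0x%08x  pdx=%u  ptx=%u  off=0x%03x\n",
                       (unsigned) va, (unsigned) PDX(va),
                       (unsigned) PTX(va), (unsigned) PGOFF(va));
                return 0;
            }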