Class 21
CS 439
02 April 2013

On the board
------------

1. Last time

2. File systems

3. crash recovery / logging

4. midterm review

---------------------------------------------------------------------------

1. Last time

    got close to the end of file systems

    A. [last time] intro
    B. [last time] files
    C. [last time] implementing files
        1. [last time] contiguous
        2. [last time] linked files
        3. [last time] FAT
        4. [last time] indexed files
    D. [last time] Directories
    E. FS performance
    F. mmap

--Hierarchical Unix

    --used since CTSS (1960s); Unix picked it up and used it nicely

    --structure like:

        "/"
         |-- bin
         |    |-- awk
         |    |-- chmod
         |    ....
         |-- cdrom
         |-- dev
         |-- sbin
         |-- tmp

    --directories stored on disk just like regular files
        --here's the data in a directory file; this data is in the
          *data blocks* of the directory:

            [ (name, i-number), (name, i-number), .... ]

        --i-node for a directory contains a special flag bit
        --only special users can write directory files

    --key point: an i-number might reference another directory
        --this neatly turns the FS into a hierarchical tree, with almost
          no work

    --another nice thing about this: if you speed up file operations, you
      also speed up directory operations, because directories are just
      like files

    --bootstrapping: where do you start looking?
        --root dir is always inode #2 (0 and 1 are reserved)

    --and, voila, we have a namespace!
        --special names: "/", ".", ".."
        --given those names, we need only two operations to navigate the
          entire name space:
            --"cd name": (change context to directory "name")
            --"ls": (list all names in current directory)

    --example: [DRAW PICTURE]

    --links:
        --hard link: multiple dir entries point to the same inode; the
          inode contains a refcount

            "ln a b": creates a synonym ("b") for file ("a")

            --how do we avoid cycles in the graph?
              (answer: can't hard link to directories)

        --soft link: synonym for a *name*

            "ln -s /d/a b":
                --creates a new inode, not just a new directory entry
                --new inode has the "sym link" bit set
                --contents of that new file: "/d/a"

E. FS performance

    --the original Unix FS was simple, elegant, and ... slow

        --blocks too small
        --file index (inode) too large
        --too many layers of mapping indirection
        --transfer rate low (they were getting one block at a time)
        --poor clustering of related objects
            --consecutive file blocks not close together
            --inodes far from data blocks
            --inodes for a given directory not close together
            --result: poor enumeration performance, meaning things like
              "ls" and "grep foo *.c" were slowwwww
        --other problems:
            --14-character names were the limit
            --can't atomically update a file in a crash-proof way

    --FFS (the fast file system) fixes these problems to a degree.

        [Reference: M. K. McKusick, W. N. Joy, S. J. Leffler, and
        R. S. Fabry. A Fast File System for UNIX. ACM Trans. on Computer
        Systems, Vol. 2, No. 3, Aug. 1984, pp. 181-197.]

    what can we do about the above? [ask for suggestions]

    * make the block size bigger (4 KB, 8 KB, or 16 KB)

    * cluster related objects

        "cylinder groups" (one or more consecutive cylinders)

        [ superblock | bookkeeping info | inodes | bitmap | data blocks (512 bytes each) ]

        --try to put inodes and data blocks in the same cylinder group
        --try to put all inodes of files in the same directory in the
          same cylinder group
        --new directories are placed in the cylinder group with a
          greater-than-average number of free inodes
        --as files are allocated, use a heuristic: spill to the next
          cylinder group after 48 KB of file (which would be the point at
          which an indirect block would be required, assuming 4096-byte
          blocks) and at every megabyte thereafter

    * bitmaps (to track free blocks; see the sketch below)

        --easier to find contiguous blocks
        --can keep the entire thing in memory (as in lab 5)
            --100 GB disk / 4 KB disk blocks = 25,000,000 blocks, so
              25,000,000 bits, or about 3 MB. not outrageous these days.

    * reserve space

        --but don't tell users. (df makes a full disk look 110% full)

    * total performance

        --20-40% of disk bandwidth for large files
        --10-20x the original Unix file system!
        --still not the best we can do (metadata writes happen
          synchronously, which really hurts performance. but making them
          asynchronous requires a story for crash recovery.)
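    --a minimal sketch of the bitmap idea (NOT FFS's actual allocator; the
      "1 = free" bit convention and the helper names are made up for
      illustration): with the bitmap in memory, finding a free block near
      a "goal" block is a cheap scan, which is what makes it easy to keep
      a growing file contiguous

        #include <stdint.h>

        /* one bit per block; assumption here: bit set = block free */
        static int bit_is_set(const uint8_t *bm, uint32_t b) {
            return bm[b / 8] & (1u << (b % 8));
        }

        static void clear_bit(uint8_t *bm, uint32_t b) {
            bm[b / 8] &= ~(1u << (b % 8));
        }

        /* allocate a free block at or after 'goal' (wrapping around), so
         * consecutive allocations for the same file tend to be adjacent
         * on disk; returns (uint32_t)-1 if the disk is full */
        uint32_t balloc_near(uint8_t *bm, uint32_t nblocks, uint32_t goal) {
            for (uint32_t i = 0; i < nblocks; i++) {
                uint32_t b = (goal + i) % nblocks;
                if (bit_is_set(bm, b)) {
                    clear_bit(bm, b);      /* mark the block allocated */
                    return b;
                }
            }
            return (uint32_t)-1;           /* no free blocks */
        }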
    Others:

    --Most obvious: big file cache
        --the kernel maintains a *buffer cache* in memory
        --internally, all uses of ReadDisk(blockNum, readbuf) are
          replaced with:

            ReadDiskCache(blockNum, readbuf) {
                ptr = buffercache.get(blockNum);
                if (ptr) {
                    copy BLKSIZE bytes from ptr to readbuf
                } else {
                    newBuf = malloc(BLKSIZE);
                    ReadDisk(blockNum, newBuf);
                    buffercache.insert(blockNum, newBuf);
                    copy BLKSIZE bytes from newBuf to readbuf
                }
            }

    --no rotation delay if you're reading the whole track
        --so try to read the whole track
    --more generally, try to work with big chunks (lots of disk blocks)
        --write in big chunks
        --read ahead in big chunks (64 KB)
        --why not just read/write 1 MB at a time?
            --(for writes: may not get data to disk often enough)
            --(for reads: may waste read bandwidth)

F. mmap: memory mapping files

    --recall some syscalls:

        fd = open(pathname, mode)
        write(fd, buf, sz)
        read(fd, buf, sz)

    --what the heck is an fd?
        --it indexes into a table
        --what's in the given entry in the table?
            --inumber!
            --inode, probably!
            --and per-open-file data (file position, etc.)

    --syscall:

        void* mmap(void* addr, size_t len, int prot, int flags,
                   int fd, off_t offset);

    --map the specified open file (fd) into a region of my virtual memory
      (at addr, or at a kernel-selected place if addr is 0), and return a
      pointer to it

    --after this, loads and stores to addr[offset] are equivalent to
      reading and writing the file at the given offset

    --how's this implemented?!
        (answer: through virtual memory, with the VA being addr [or
        whatever the kernel selects] and the PA being what? answer: the
        physical address storing the given page in the kernel's buffer
        cache)

    --have to deal with eviction from the buffer cache, but this problem
      is not unique. in all operating systems besides JOS, the kernel
      designers *anyway* have to be able to invalidate VA-->PA mappings
      when a page is removed from RAM
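    --a small user-level sketch of mmap in action (standard POSIX calls;
      error handling kept minimal, and the newline-counting task is just
      an arbitrary example): after the setup, the program touches the
      file only through ordinary loads, no read() calls

        #include <fcntl.h>
        #include <stdio.h>
        #include <sys/mman.h>
        #include <sys/stat.h>
        #include <unistd.h>

        int main(int argc, char **argv) {
            if (argc != 2) {
                fprintf(stderr, "usage: %s file\n", argv[0]);
                return 1;
            }

            int fd = open(argv[1], O_RDONLY);
            if (fd < 0) { perror("open"); return 1; }

            struct stat st;
            if (fstat(fd, &st) < 0) { perror("fstat"); return 1; }
            if (st.st_size == 0) { printf("0 newlines\n"); return 0; }

            /* kernel picks the virtual address (first argument is NULL) */
            char *p = mmap(NULL, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
            if (p == MAP_FAILED) { perror("mmap"); return 1; }

            /* ordinary loads; pages are faulted in from the buffer cache */
            long newlines = 0;
            for (off_t i = 0; i < st.st_size; i++)
                if (p[i] == '\n')
                    newlines++;
            printf("%ld newlines\n", newlines);

            munmap(p, st.st_size);
            close(fd);
            return 0;
        }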
---------------------------------------------------------------------------

thanks to David Mazieres and Mike Dahlin

---------------------------------------------------------------------------

3. crash recovery

    --there are a lot of data structures used to implement the file
      system (bitmap of free blocks, directories, inodes, indirect
      blocks, data blocks, etc.)

    --requirement: crash anywhere, and the system can be recovered

    --options:

        --*write through*: write changes immediately to disk.
          problem: slow! have to wait for each write to complete before
          going on

        --*write back*: delay writing modified data back to disk.
          problem: can lose data.
          another problem: updates can go to the disk in the wrong order

        --if multiple updates are needed, do them in a specific order so
          that if a crash occurs, **fsck** can work

    --Approaches to crash recovery:

        A. ad-hoc
        B. ordered (soft) updates
        C. WAL (write-ahead logging)

    A. ad-hoc

        --can't have all data written asynchronously. if all data were
          written asynchronously, we could encounter the following
          unacceptable scenarios:

            (a) delete/truncate a file, append to another file, crash

                --the new file may reuse a block from the old one
                --the old inode may not have been updated
                --cross-allocation!
                --often the inode with the older mtime is the wrong one,
                  but we can't be sure

            (b) append to a file, allocate an indirect block, crash

                --inode points to the indirect block
                --but the indirect block may contain garbage

        --so what's the actual approach?

            --be careful about the order of updates. specifically:
                --write the new inode to disk before the directory entry
                --remove the directory name before deallocating the inode
                --write the cleared inode to disk before updating the
                  cylinder group free map

        --how is it implemented?
            --synchronous write-through for *metadata*
            --doing one metadata write at a time ensures ordering

            example: for file create:
                --write data to the file
                --update/write the inode
                --mark the inode "allocated" in the bitmap
                --mark the data blocks "allocated" in the bitmap
                --update the directory
                --(if the directory grew) mark the new directory block
                  "allocated" in the bitmap

            now, cases:
                --inode not marked allocated in the bitmap --> the only
                  writes were to unallocated, unreachable blocks; the
                  result is that the write "disappears"
                --inode allocated, data blocks not marked allocated in
                  the bitmap --> fsck must update the bitmap
                --file created, but not yet in any directory --> fsck
                  ultimately deletes the file (after all that!)

        Disadvantages to this ad-hoc approach:

            (a) need to get the ad-hoc reasoning exactly right

            (b) poor performance (synchronous writes of metadata; see the
                small experiment after this subsection)
                --multiple updates to the same block must be issued
                  separately. for example, imagine two updates to the
                  same directory block: the first must complete before
                  the second is issued (otherwise, it isn't synchronous)
                --more generally, the cost of crash recoverability is
                  enormous (a job like "untar" could be 10-20x slower)

            (c) slow recovery: fsck must scan the entire disk
                --recovery gets slower as disks get bigger. if fsck takes
                  one minute, what happens when the disk gets 10 times
                  bigger?

        [aside: why not use battery-backed RAM? answer:
            --expensive (requires specialized hardware)
            --often don't learn the battery has died until too late
            --a pain if the computer dies (can't just move the disk)
            --if an OS bug causes the crash, the RAM might be garbage]
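        --the cost in disadvantage (b) is easy to feel even from user
          space. here is a small experiment (ordinary POSIX calls; the
          file name "testfile" and the 512-byte record size are arbitrary
          choices): run it once plainly and once with the argument
          "sync". forcing each write to disk before continuing is roughly
          what synchronous metadata updates cost the kernel

            #include <fcntl.h>
            #include <stdio.h>
            #include <string.h>
            #include <sys/time.h>
            #include <unistd.h>

            int main(int argc, char **argv) {
                int sync_each = (argc > 1 && strcmp(argv[1], "sync") == 0);

                int fd = open("testfile", O_WRONLY | O_CREAT | O_TRUNC, 0644);
                if (fd < 0) { perror("open"); return 1; }

                char buf[512];
                memset(buf, 'x', sizeof(buf));

                struct timeval start, end;
                gettimeofday(&start, NULL);
                for (int i = 0; i < 1000; i++) {
                    if (write(fd, buf, sizeof(buf)) != (ssize_t)sizeof(buf)) {
                        perror("write"); return 1;
                    }
                    /* the expensive part: wait for the disk every time */
                    if (sync_each && fsync(fd) < 0) { perror("fsync"); return 1; }
                }
                gettimeofday(&end, NULL);

                double secs = (end.tv_sec - start.tv_sec)
                            + (end.tv_usec - start.tv_usec) / 1e6;
                printf("%s: %.3f seconds\n",
                       sync_each ? "fsync after each write" : "no fsync", secs);
                close(fd);
                return 0;
            }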
    B. ordered updates

        --could reason carefully about the precise order in which
          asynchronous writes should go to disk

        --advantages:
            --performance
            --fsck is very fast and can run in the background, since all
              it needs to do is fix up bookkeeping

        --limitations:
            --hard to get right
            --arguably ad-hoc: very specific to FFS data structures.
              unclear how to apply the approach to FSes that use data
              structures like B+-trees
            --metadata updates can happen out of order (for example:
              create A, create B, crash.... it might be that only B
              exists after the crash!)

        --to see this approach in action:

            [G. R. Ganger, M. K. McKusick, C. A. N. Soules, and
            Y. N. Patt. Soft Updates: A Solution to the Metadata Update
            Problem in File Systems. ACM Trans. on Computer Systems,
            Vol. 18, No. 2, May 2000, pp. 127-153.
            http://portal.acm.org/citation.cfm?id=350853.350863]

    C. Journaling

        Golden rule of atomicity, per Saltzer-Kaashoek:
        "never modify the only copy"

        --reserve a portion of the disk for a **write-ahead log**
        --write any metadata operation first to the log, then to the disk
        --after crash/reboot, re-play the log (efficient)
            --may re-do an already committed change, but won't miss
              anything
        --performance advantage:
            --the log is a consecutive portion of the disk
            --multiple log writes are very fast (at disk bandwidth)
            --consider updates committed when they are written to the log
        --example: delete a directory tree
            --record all freed blocks and changed directory entries in
              the log
            --return control to the user
            --write out the changed directories, bitmaps, etc. in the
              background (sort for good disk arm scheduling)

        --on recovery, must do three things:
            i.   find the oldest relevant log entry
            ii.  find the end of the log
            iii. read and replay the committed portion of the log

        i. find the oldest relevant log entry

            --otherwise, it is redundant and slow to replay the whole log
            --idea: checkpoints! (this idea is used throughout systems)
                --once all records up to log entry N have been processed,
                  and once all affected blocks are stably committed to
                  disk ...
                --record N to disk, either in a reserved checkpoint
                  location or in a checkpoint log record
                --never need to go back before the most recent
                  checkpointed N

        ii. find the end of the log

            --typically a circular buffer, so look at sequence numbers
            --can include begin-transaction/end-transaction records
                --but then need to make sure that "end transaction" only
                  gets to the disk after all the other disk blocks in the
                  transaction are on disk
                --but the disk can reorder requests, and then the system
                  crashes
                --to avoid that, need a separate disk write for "end
                  transaction", which is a performance hit
                --to avoid *that*, use checksums: a log entry is
                  committed when all of its disk blocks match its
                  checksum value (see the sketch after this list)

        iii. not much to say: read and replay!
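        --a sketch of the checksum idea for finding the end of the log
          (the record layout and the toy checksum are assumptions for
          illustration, not what any particular file system does): a
          record counts as committed only if the stored checksum matches
          a checksum computed over its payload; recovery replays records
          in sequence order and stops at the first mismatch

            #include <stdint.h>

            struct log_record {
                uint64_t seq;        /* sequence number; also used to find the newest record */
                uint32_t len;        /* payload length in bytes */
                uint32_t checksum;   /* checksum over the payload */
                uint8_t  payload[];  /* the metadata updates themselves */
            };

            static uint32_t checksum(const uint8_t *p, uint32_t len) {
                uint32_t sum = 0;
                for (uint32_t i = 0; i < len; i++)
                    sum = sum * 31 + p[i];   /* toy checksum; real systems use a CRC */
                return sum;
            }

            /* during recovery: a record whose checksum doesn't match was
             * only partially written, so it (and everything after it in
             * the log) is ignored */
            int record_is_committed(const struct log_record *r) {
                return checksum(r->payload, r->len) == r->checksum;
            }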
        --logs are key: they enable atomic complex operations. to see
          this, we'll take a slight detour..... [can skip, since the same
          points come up again under transactions]

        detour: some file systems (for example, XFS from SGI) use a
        B+-tree data structure. a few quick words about B+-trees:

            --key-value map
            --ordering defined on keys (where is the nearest key?)
            --data stored in blocks, so explicitly designed for efficient
              disk access
            --with n items stored, all operations are O(log n):
                --retrieve closest to target key k
                --insert a new pair
                --delete a pair
            --see any algorithms book (e.g., Cormen et al.) for details
            --**complex to implement**

        --wait, why are we mentioning B+-trees? because some file systems
          use them:

            --efficient implementation of large directories
                (map key = hash(filename) to value = inode #)
            --efficient implementation of the inode
                --instead of using FFS-style fixed block pointers, map:
                    file offset (key) --> {start block, # blocks} (value)
                --if a file consists of a small number of extents (i.e.,
                  segments), then inodes are small, even for large files
            --efficient implementation of the map from inode # to inode:
                    inode # --> {block #, # of consecutive inodes in use}
                [bonus: allows a fast way to identify a free inode!]

        --some B+-tree operations require multiple disk writes, and the
          intermediate states are incorrect. what happens if there's a
          crash in the middle? the B+-tree could be in an inconsistent
          state
            --journaling is a big help here
                --first write all changes to the log ("insert k,v",
                  "delete k", etc.)
                --if there's a crash while writing the log, the
                  incomplete log record will be discarded, and no change
                  is made
                --otherwise, if there's a crash while updating the
                  B+-tree, the entire log record will be replayed,
                  writing everything

        --limitations of journaling
            --fsync() syncs *all* operations' metadata to the log

        --write-ahead logging is everywhere

        --what's the problem? (all data is written twice, in the worst
          case)
            --(aside: it's less of a problem to write data twice if you
              have two disks. a common way to make systems fast: use
              multiple disks. then it's easier to avoid seeks)

        --the log started as a way to help with consistency, but now the
          log is authoritative, so do we actually need the *other* copy
          of the data? what if everything were just stored in the log????

            Transition to the log-structured file system (LFS)

    D. Summarize crash recovery

        --three viable approaches to crash recovery in file systems:

            (i) ad-hoc
                --worry about metadata consistency, not data consistency
                --accomplish metadata consistency by being careful about
                  the order of updates
                --write metadata synchronously

            (ii) ordered updates (soft updates), which is in OpenBSD
                --worry about metadata consistency
                --leads to great performance: metadata doesn't have to be
                  written synchronously (writes just have to obey a
                  partial order)

            (iii) journaling (the approach in most Linux file systems)
                --more flexible
                --easier to reason about
                --possibly worse performance

4. midterm review

    Ground rules: same as last time

    Material: everything we've covered: readings, labs, homeworks,
    lectures

    lecture topics:

        --finished scheduling

        --virtual memory
            --segmentation (how does it work in general? on the x86?)
            --paging (how does it work in general? on the x86?)
                --virtual address: [ 10 bits | 10 bits | 12 bits ]
                --entry in pgdir and page table:
                    [ 20-bit physical page number | more bits | bottom 3 bits ]
                    --protection (user/kernel | read/write | present/not)
                --(see the sketch at the end of these notes)
            --what's a TLB?
            --how does JOS handle virtual memory for the kernel? for user
              processes?
            --page faults (their uses and costs)
            --page replacement
            --heap management

        --interrupts: purpose and mechanics

        --I/O
            --architecture
            --how the kernel communicates with devices
            --device drivers
            --DMA
            --polling vs. interrupts

        --Disks
            --geometry, performance, interface, technology trends,
              scheduling, placement strategy

        --More concurrency
            --review spinlocks, MCS locks
            --ccNUMA machines
            --event-driven programming

        --File systems
            --basic objects: files, directories, metadata, links, inodes
            --how does naming work? what allows the system to map
              /usr/homes/bob/index.html to a file object?
            --types of file layout:
                --extents, FAT, indexed structure, classic Unix, FFS
                --tradeoffs
            --performance
                --caching
            --crash recovery
                --ad-hoc: write metadata synchronously and in the right
                  order
                    --depend on fsck
                --ordered updates: write metadata in the right order but
                  not necessarily synchronously
                    --depend on fsck
                --write-ahead logging (called *journaling* in the FS
                  context)
            --case study: LFS (log-structured file system)
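        --for the paging review item above: a tiny sketch of splitting a
          32-bit x86 virtual address into its page-directory index,
          page-table index, and page offset (the macro names follow
          JOS's PDX/PTX/PGOFF; the example address is arbitrary)

            #include <stdint.h>
            #include <stdio.h>

            #define PDX(va)   (((uint32_t)(va) >> 22) & 0x3FF)  /* bits 31..22: page directory index */
            #define PTX(va)   (((uint32_t)(va) >> 12) & 0x3FF)  /* bits 21..12: page table index     */
            #define PGOFF(va) ((uint32_t)(va) & 0xFFF)          /* bits 11..0:  offset within page   */

            int main(void) {
                uint32_t va = 0xef7bdc04;   /* arbitrary example address */
                printf("va=0x%08x  pdx=%u  ptx=%u  off=0x%03x\n",
                       (unsigned) va, (unsigned) PDX(va),
                       (unsigned) PTX(va), (unsigned) PGOFF(va));
                return 0;
            }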