CS202 Spring 2015 Lab 6: File System

Released Friday, April 10, 2015
Due Friday, April 24, 2015, 9:00 PM

Lab 6: File System

Introduction

In this lab, you will implement (pieces of) a simple disk-based file system.

Some notes:

This lab is an edited version of a lab written by Isami Romanowski. (He in turn adapted code from MIT's JOS, porting it to the FUSE and Linux environment, adding inodes, etc.)

Getting Started

We recommend doing this lab inside the development platform that you set up in the previous labs (the virtual devbox or the CIMS machines).

From within this platform, change to your lab directory, use Git to commit changes you've made since handing in Lab 5 (if any), obtain the latest version of the course repository, and then create a local branch called lab6 based on our origin/lab6 (which is our lab6 branch). The commands are as follows, but you may not need all of them (for example, if git status indicates that there are no changes to commit):

$ cd ~/cs202
$ git commit -am 'my solution to lab5'
Created commit 254dac5: my solution to lab5
 3 files changed, 31 insertions(+), 6 deletions(-)
$ git pull
Already up-to-date.
$ git checkout -b lab6 origin/lab6
Branch lab6 set up to track remote branch refs/remotes/origin/lab6.
Switched to a new branch "lab6"
$ cd lab6

You are now in the lab6 directory. Your next step depends on whether you are using the CIMS machines or the devbox:

CIMS Users

You have to use one of the following machines:
crackle1.cims.nyu.edu 
crackle2.cims.nyu.edu 
crackle3.cims.nyu.edu 
crackle4.cims.nyu.edu 
crackle5.cims.nyu.edu 
Note that this requirement supersedes the usual list of machines.

Next, install required library files:

$ pwd     # make sure that you are in the lab6 directory
/home/your-netid/cs202/lab6
$ ./install-lab6-for-cims.sh
You only need this on CIMS machines, press <ENTER> to continue, or <Ctrl>-C to cancel
<ENTER>
Copying fuse library....
Installation successful

Devbox Users

Install the necessary library and binary using apt (the default password for sudo is labs):
$ sudo apt-get update # update the apt catalog
.....
$ sudo apt-get install fuse libfuse-dev
.....

The code provided in this lab handles the bulk of the work for creating a file system abstraction in Linux; you will implement a few fundamental operations. Stop now to look over the code that we have provided; the relevant files (fsdriver.c, disk_map.c, bitmap.c, dir.c, inode.c, and fs_types.h) are described in the sections below.

FUSE

The file system that we will build is implemented as a user-level process. This file system's storage will be a file (in the example given below, we call it testfs.img) that lives in the normal file system of the Linux machine (the CIMS machine or Ubuntu devbox) that you are using. Much of your code will treat this file as if it were a disk.

This entire arrangement (file system implemented in user space with arbitrary choice of storage) is due to software called FUSE (Filesystem in Userspace). In order to really understand what FUSE is doing, we need to take a brief detour to describe VFS. Linux (like Unix) has a layer of kernel software called VFS; conceptually, every file system sits below this layer, and exports a uniform interface to VFS. (You can think of any potential file system as being a "driver" for VFS; VFS asks the software below it to do things like "read", "write", etc.; that software fulfills these requests by interacting with a disk driver, and interpreting the contents of disk blocks.) The purpose of this architecture is to make it relatively easy to plug a new file system into the kernel: the file system writer simply implements the interface that VFS is expecting from it, and the rest of the OS uses the interface that VFS exports to it. In this way, we obtain the usual benefits of pluggability and modularity.

FUSE is just another "VFS driver", but it comes with a twist. Instead of FUSE implementing a disk-based file system (the usual picture), it responds to VFS's requests by asking a user-level process (which is called a "FUSE driver") to respond to it. So the FUSE kernel module is an adapter that speaks "fuse" to a user-level process (and you will be writing your code in this user-level process) and "VFS" to the rest of the kernel.

Meanwhile, a FUSE driver can use whatever implementation it wants. It could store its data in memory, across the network, on Jupiter, whatever. In the setup in this lab, the FUSE driver will interact with a traditional Linux file (as noted above), and pretend that this file is a sector-addressable disk.

The FUSE driver registers a set of callbacks with the FUSE system (via libfuse and ultimately the FUSE kernel module); these callbacks are things like read, write, etc. A FUSE driver is associated with a particular directory, or mount point. The concept of mounting was explained in OSTEP 39 (see 39.15). Any I/O operations requested on files and directories under this mount point are dispatched by the kernel (via VFS, the FUSE kernel module, and libfuse) to the callbacks registered by the FUSE driver.
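
To make the callback pattern concrete, here is a minimal, hypothetical FUSE driver in the style of libfuse's classic examples. It is not the lab's fsdriver.c -- the name and the lone getattr callback are illustrative -- but it shows the registration described above:

#define FUSE_USE_VERSION 26
#include <fuse.h>
#include <sys/stat.h>
#include <string.h>
#include <errno.h>

/* Called when VFS needs attributes for a path under our mount point. */
static int hello_getattr(const char *path, struct stat *st)
{
    memset(st, 0, sizeof(*st));
    if (strcmp(path, "/") == 0) {
        st->st_mode = S_IFDIR | 0755;   /* the mount point itself */
        st->st_nlink = 2;
        return 0;
    }
    return -ENOENT;                     /* nothing else exists here */
}

static struct fuse_operations hello_ops = {
    .getattr = hello_getattr,
    /* a real driver also registers .readdir, .open, .read, .write, ... */
};

int main(int argc, char *argv[])
{
    /* fuse_main parses the mount point from argv and runs the dispatch
       loop, invoking our callbacks for each request forwarded by VFS. */
    return fuse_main(argc, argv, &hello_ops, NULL);
}

Once such a driver is mounted, even a plain ls of the mount point reaches hello_getattr through the chain recapped below.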

To recap all of the above: the file system user interacts with the file system roughly in this fashion:

  1. When the file system user, Process A (for example, ls listing all files in a directory), makes a request to the system, it issues one or more system calls (stat, read, etc.).
  2. The kernel hands the system call to VFS.
  3. VFS finds that the system call is referencing a file or directory that is managed by FUSE.
  4. VFS then dispatches the request to FUSE, which dispatches it to the corresponding FUSE driver (which is where you will write your code).
  5. The FUSE driver handles the request by interacting with the "disk", which is implemented as an ordinary file. The FUSE driver then responds, and the responses go back through the chain.

Here's an example from the staff solution to show what this looks like, where testfs.img is a disk image with only the root directory and the file hello on its file system:

$ mkdir mnt # create a directory to serve as a mount point
$ df mnt # see what file system mnt is associated with
Filesystem     1K-blocks    Used Available Use% Mounted on
/dev/sda1        7092728 4536616   2172780  68% /
$ ls mnt
$ build/fsdriver testfs.img mnt # mount testfs.img at mnt
$ df mnt # note that mnt's file system is now different
Filesystem          Size  Used Avail Use% Mounted on
cs202fs#testfs.img  8.0M   24K  8.0M   1% /home/cs202/cs202/lab6/mnt
$ ls mnt # and there's the hello file
hello
$ cat mnt/hello # which we can read with any program
Hello, world!
$ fusermount -u mnt # unmount mnt
$ df mnt # and its associated file system is back to normal
Filesystem     1K-blocks    Used Available Use% Mounted on
/dev/sda1        7092728 4536616   2172780  68% /
$ ls mnt # and hello is gone, but still lives in testfs.img
$
Note that in the above example, after we run fsdriver, the kernel is actually dispatching all the open, read, readdir, etc. calls that ls and cat make to our FUSE driver. The FUSE driver takes care of searching for a file when open is called, reading file data when read is called, and so on. When fusermount is run, our file system is unmounted from mnt, and all I/O operations under mnt return to being serviced normally by the kernel.

Our File System

Below, we give an overview of the features that our file system will support; along the way, we review some of the file system concepts that we have studied in class and the reading.

On-Disk File System Structure

Most UNIX file systems divide available disk space into two main types of regions: inode regions and data regions. UNIX file systems assign one inode to each file in the file system; a file's inode holds a file's meta-data (pointers to data blocks, etc.). The data regions are divided into much larger (typically 8KB or more) data blocks, within which the file system stores file data and directory data. Directory entries (the "data" in a directory) contain file names and inode numbers; a file is said to be hard-linked if multiple directory entries in the file system refer to that file's inode. Both files and directories logically consist of a series of data blocks; these blocks can be scattered throughout the disk much as the pages of a process's virtual address space can be scattered throughout physical memory.

Unlike most UNIX file systems, we make a simplification in the layout of the file system: there is only one region on the disk, in which both inode blocks and data blocks reside. Furthermore, each inode is allocated its own disk block instead of being packed alongside other inodes in a single disk block.

Sectors and Blocks

Disks perform reads and writes in units of sectors, which are typically 512 bytes. However, file systems allocate and use disk storage in units of blocks (for example, 4KB, or 8 sectors). Notice the distinction between the two terms: sector size is a property of the disk hardware, whereas block size is a creation of the file system that uses the disk. A file system's block size must be a multiple of the sector size of the underlying disk. As explained in class, there are advantages to making the block size larger than the sector size.

Our file system will use a block size of 4096 bytes.
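
Since the block size is a file system choice, it is easy to check the multiple-of-sector-size requirement at compile time. A small illustration (the constants come from the text above; the assertion itself is ours, not lab code):

#define SECTSIZE 512   /* property of the disk hardware */
#define BLKSIZE  4096  /* our file system's choice: 8 sectors per block */
_Static_assert(BLKSIZE % SECTSIZE == 0,
               "block size must be a whole number of sectors");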

Superblocks

[Figure: disk layout]

File systems typically place important meta-data at reserved, well-known disk blocks (such as the very start of the disk). This meta-data describes properties of the entire file system (block size, disk size, meta-data required to find the root directory, the time the file system was last mounted, the time the file system was last checked for errors, and so on). These special blocks are called superblocks. Many "real" file systems maintain multiple replicas of superblocks, placing them far apart on the disk; that way, if one of them is corrupted or the disk develops a media error in that region, the other replicas remain accessible.

Our file system will have a single superblock, which will always be at block 0 on the disk. Its layout is defined by struct superblock in fs_types.h. Block 0 is typically reserved to hold boot loaders and partition tables, so file systems generally do not use the very first disk block. Since our file system is not meant to be used on a real disk, we use block 0 to store the superblock for simplicity.

The superblock in our file system contains a reference to a block containing the "root" inode (the s_root field in struct superblock). The "root" inode is the inode for the file system's root directory. This inode stores pointers to blocks; these blocks, together, contain a sequence of dirent structures. Each structure includes a file name and an inode number (one can think of this as assigning a given "name" to a given inode); the collection of these structures forms the content of the file system's root directory. These contents can include further directories, etc.
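
For orientation, here is a hedged sketch of what these two structures might look like. Only s_root and the pairing of a name with an inode number come from the description above; every other field name and size is illustrative, so consult fs_types.h for the real definitions:

#include <stdint.h>

struct superblock {
    uint32_t s_magic;      /* illustrative: identifies the file system */
    uint32_t s_nblocks;    /* illustrative: total blocks on the disk */
    uint32_t s_root;       /* block number of the root directory's inode */
};

struct dirent {
    uint32_t d_inum;       /* illustrative name: inode number for this entry */
    char     d_name[124];  /* illustrative size: the entry's file name */
};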

The Block Bitmap: Managing Free Disk Blocks

In the same way that the kernel must manage the system's physical memory to ensure that a given physical page is used for only one purpose at a time, a file system must manage the blocks of storage on a disk to ensure that a given disk block is used for only one purpose at a time. In Lab 5, you kept track of free physical pages in kernel.c with pageinfo, an array holding a physical_pageinfo structure for every physical page. In file systems it is common to keep track of free disk blocks using a bitmap (essentially, an array of bits, one for each resource that is being tracked). A given bit in the bitmap is set if the corresponding block is free, and clear if the corresponding block is in use.

The bitmap in our file system always starts at disk block 1, immediately after the superblock. For simplicity we will reserve enough bitmap blocks to hold one bit for each block in the entire disk, including the blocks containing the superblock and the bitmap itself. We will simply make sure that the bitmap bits corresponding to these special, "reserved" areas of the disk are always clear (marked in-use). Note that, since our file system uses 4096-byte blocks, each bitmap block contains 4096*8=32768 bits, or enough bits to track 32768 disk blocks.
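
Concretely, testing or clearing a block's bit is word-and-mask arithmetic on the in-memory bitmap. The sketch below assumes the bitmap is addressed as an array of 32-bit words; that word size is an assumption, so see block_is_free in bitmap.c for the real code:

#include <stdbool.h>
#include <stdint.h>

extern uint32_t *bitmap;   /* assumed type; points at the mapped bitmap */

/* A set bit means the corresponding block is free. */
static bool is_free(uint32_t blockno)
{
    return (bitmap[blockno / 32] & (1u << (blockno % 32))) != 0;
}

/* Clearing the bit marks the block in-use. */
static void mark_used(uint32_t blockno)
{
    bitmap[blockno / 32] &= ~(1u << (blockno % 32));
}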

File Meta-data

The layout of the meta-data describing a file in our file system is described by struct inode in fs_types.h. This meta-data includes the file's size, type (regular file, directory, symbolic link, etc.), time stamps, permission information, and pointers to the data blocks of the file. Because our file system supports hard links, one inode may be referred to by more than one name -- which is why the inode itself does not store the file "name". Instead, directory entries give names to inodes (as noted earlier).

The i_direct array in struct inode contains space to store the block numbers of the first 10 (N_DIRECT) blocks of the file, which we will call the file's direct blocks. For small files up to 10*4096 = 40KB in size, this means that the block numbers of all of the file's blocks will fit directly within the inode structure itself. For larger files, however, we need a place to hold the rest of the file's block numbers. For any file greater than 40KB, an additional disk block, called the file's indirect block, holds up to 4096/4 = 1024 additional block numbers, pushing the maximum file size up to (10 + 1024)*4096 = 4136KB, or a little over 4MB. The file system also supports double-indirect blocks. A double-indirect block (i_double in the inode structure) stores 4096/4 = 1024 additional indirect block numbers, which themselves each store 1024 additional direct block numbers. This affords an additional 1024*1024*4096 = 4GB worth of data, bringing the maximum file size to a little over 4GB, in theory. To support even larger files, real-world file systems typically support triple-indirect blocks (and sometimes beyond).
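
To make these ranges concrete, the sketch below classifies a file block number into the three regions. The constants follow the arithmetic above (N_DIRECT = 10; 1024 block numbers per indirect block), but the helper is ours, not part of the lab code:

#include <stdint.h>
#include <stdio.h>

#define N_DIRECT   10
#define N_INDIRECT (4096 / 4)   /* 1024 block numbers per indirect block */

/* Which region of the inode holds the 'filebno'th block of a file? */
static const char *region_of(uint32_t filebno)
{
    if (filebno < N_DIRECT)
        return "direct";
    if (filebno < N_DIRECT + N_INDIRECT)
        return "indirect";
    if (filebno < N_DIRECT + N_INDIRECT + (uint32_t)N_INDIRECT * N_INDIRECT)
        return "double-indirect";
    return "beyond the maximum file size";
}

int main(void)
{
    uint32_t samples[] = { 0, 9, 10, 1033, 1034, 1049609, 1049610 };
    for (int i = 0; i < 7; i++)
        printf("file block %7u -> %s\n", samples[i], region_of(samples[i]));
    return 0;
}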

Other Features

Our file system supports all the traditional UNIX notions of file ownership, permissions, hard and symbolic links, time stamps, and special device files. Perhaps surprisingly, much of this functionality will come for free (or very low cost) after writing just a small number of core file system operations. Some of the ease in supporting these traditional UNIX file system notions comes from FUSE.

Implementation

You will implement some components of the FUSE driver (and hence the file system): allocating disk blocks, mapping file offsets to disk blocks, and freeing disk blocks allocated in inodes. Because a lot of the implementation is provided, it is important that you familiarize yourself with the provided code and the various file system interfaces.

The main file system code that we've provided for you resides in fsdriver.c. This file contains all the FUSE callbacks to handle I/O syscalls, as well as the main function. Once the path to the disk image and the path to the mount point (testfs.img and mnt respectively in the supplied example) have been read from the command line arguments, our FUSE driver maps the specified disk image into memory using the map_disk_image function (defined in disk_map.c), which itself initializes some file system metadata. Then, fsdriver.c calls fuse_main, which handles kernel dispatches to our registered callbacks. These callbacks will invoke the functions that you write in the coming exercises.

The FUSE driver code we provide maps the disk image into memory using the mmap syscall. In short, mmap reserves a portion of the running process's virtual address space to provide read and write access to a file as if it were an array in memory. For example, if file is the pointer to the first byte of the file in memory after mmaping, writing ((char *)file)[5] = 3 is approximately equivalent to the two calls lseek(fd, 5, SEEK_SET) and then write(fd, <location of memory that holds byte 3>, 1). To flush any in-memory changes you've made to a file onto the disk, you would use the msync function. To unmap the file from virtual memory, use munmap. As always, you can check the man pages for more information on these syscalls. For this lab, you will not need to be intimately familiar with their operation, but you should have a high-level understanding of what they do.
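
Here is a minimal sketch of the idea, with error handling omitted (the provided map_disk_image does a more careful version of this for the disk image):

#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

int main(void)
{
    int fd = open("testfs.img", O_RDWR);
    struct stat st;
    fstat(fd, &st);                     /* error handling omitted throughout */
    char *disk = mmap(NULL, st.st_size, PROT_READ | PROT_WRITE,
                      MAP_SHARED, fd, 0);
    disk[5] = 3;                        /* like lseek(fd, 5, SEEK_SET) + write */
    msync(disk, st.st_size, MS_ASYNC);  /* schedule the change to reach the file */
    munmap(disk, st.st_size);           /* unmap when done */
    close(fd);
    return 0;
}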

The Block Bitmap

When you run the FUSE driver ($ build/fsdriver testfs.img mnt), the function map_disk_image will set the bitmap pointer. After this, we can treat bitmap as a packed array of bits, one for each block on the disk. See, for example, block_is_free, which simply checks whether a given block is marked free in the bitmap.

Exercise 1. Before starting any work on the driver, run ./chmod-walk in the Lab 6 directory. This will correctly set up permissions in the directories leading up to your lab directory, so that you will be able to run the driver successfully. (If you do not run this script, FUSE will be unable to mount or unmount file systems in your lab directory.)

Exercise 2. Implement alloc_block in bitmap.c. It should find a free disk block in the bitmap, mark it used, and return the number of that block. When you allocate a block, you should immediately flush the changed bitmap block to disk with flush_block, to help maintain file system consistency. Under the hood, flush_block calls msync to schedule the in-memory disk block to be flushed to the actual disk image.

Note: You should use free_block as a model.

Use make grade to test your code. Your code should now pass the "alloc_block" test. All code for "core file system operations" tests is in the fs_test function in fsdriver.c, so if this test or any other test specifically mentioned in the exercises fails, you can check there to see what might be going wrong.

Running our tests may not provide you with enough information to debug a problem. Thus, you may sometimes find it helpful to run the driver directly. First, you'll need to create a disk image to use for testing. The easiest way to do this is to run test/testbasic.bash. As a side effect, this creates a disk image, testfs.img, which is set up properly for the internal tests in the driver. Then, to run the driver, do one of the following three things:
  1. $ build/fsdriver testfs.img mnt --test-ops
    
    The command above runs the internal file system tests and then exits; it does not mount the file system.
  2. $ build/fsdriver -d testfs.img mnt
    
    This command runs the driver in debugging mode and mounts testfs.img at mnt. In debugging mode, the driver does not exit until you press Ctrl-C. This means that you cannot interact with the file system via the terminal containing the command above; instead, you will need to open up a new terminal and interact with the file system from there. While in debugging mode, fsdriver will print to the console a trace of the syscalls dispatched to the driver; any printfs that you inserted in the driver will also be displayed. Once you run the command above, you should see something like:
    FUSE library version: 2.9.2
    nullpath_ok: 0
    nopath: 0
    utime_omit_ok: 0
    unique: 1, opcode: INIT (26), nodeid: 0, insize: 56, pid: 0
    INIT: 7.23
    flags=0x0003f7fb
    max_readahead=0x00020000
    INIT: 7.19
    flags=0x00000011
    max_readahead=0x00020000
    max_write=0x00020000
    max_background=0
    congestion_threshold=0
    unique: 1, success, outsize: 40
    ...
    
    Now, open up a new terminal, and interact with the file system to see the debugging outputs. Something like:
    $ ls mnt 
    
    This should cause the original terminal to print germane output.
  3. You can run fsdriver in gdb.
To unmount testfs.img from mnt, run fusermount -u mnt.

File Operations

We have provided various functions in dir.c and inode.c to implement the basic facilities you will need to interpret and manage inode structures, scan and manage the entries of directories, and walk the file system from the root to resolve an absolute pathname. Read through all of the code in these files and make sure you understand what each function does before proceeding.

Exercise 3. Implement inode_block_walk and inode_get_block in inode.c.

They are the workhorses of the file system. For example, inode_read and inode_write are little more than the bookkeeping atop inode_get_block necessary to copy bytes between scattered blocks and a sequential buffer.

Their signatures are:
int inode_block_walk(struct inode *ino, uint32_t filebno, uint32_t **ppdiskbno, bool alloc);
int inode_get_block(struct inode *ino, uint32_t filebno, char **blk);
inode_block_walk finds the disk block number slot for the 'filebno'th block in inode 'ino', and sets '*ppdiskbno' to point to that slot. inode_get_block goes one step further and sets *blk to the start of the block, such that by using *blk, we can access the contents of the block. It also allocates a new block if necessary.
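
For a sense of how they are used, here is a simplified sketch of read bookkeeping in the style of the provided inode_read. It is not the exact code in inode.c: it assumes the lab's headers supply struct inode, BLKSIZE, and the declaration of inode_get_block, and it omits error handling and clamping the count to the file's size:

#include <string.h>
#include <sys/types.h>
#include "fs_types.h"   /* assumed: struct inode, BLKSIZE, inode_get_block */

#ifndef MIN
#define MIN(a, b) ((a) < (b) ? (a) : (b))
#endif

ssize_t inode_read_sketch(struct inode *ino, void *buf,
                          size_t count, off_t offset)
{
    size_t done = 0;
    char *blk;
    while (done < count) {
        /* Find the block holding this offset, then copy what we can. */
        if (inode_get_block(ino, (offset + done) / BLKSIZE, &blk) < 0)
            return -1;
        size_t bn = (offset + done) % BLKSIZE;        /* offset within block */
        size_t n  = MIN(BLKSIZE - bn, count - done);  /* bytes this round */
        memcpy((char *)buf + done, blk + bn, n);
        done += n;
    }
    return done;
}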

Use make grade to test your code. Your code should pass the "inode_open" and "inode_get_block" tests.

After Exercise 3, you should be able to read and write to the file system. Try something like

$ echo "hello" > "mnt/world"; cat "mnt/world"

Exercise 4. Implement inode_truncate_blocks in inode.c. inode_truncate_blocks frees data and metadata blocks that an inode allocated but no longer needs. This function is used, for instance, when an inode is deleted; the space reserved by the inode must be freed so that other files can be created on the system.

Use make grade to test your code. Your code should pass the "inode_flush/inode_truncate/file rewrite" tests.

Exercise 5. Implement inode_link and inode_unlink in inode.c. inode_link links an inode referenced by one path to another location, and inode_unlink removes a reference to an inode at a specified path. Make sure that you properly increment the link count in an inode when linking and decrement the link count when unlinking. Don't forget to free an inode when its link count reaches zero!

inode_link and inode_unlink allow us to exploit the level of indirection provided by using inodes in our file system (as opposed to storing all file meta-data inside of directories, for instance) and manage referencing inodes with multiple names. The inode_unlink operation is particularly important as it allows us to release the space reserved for an inode, acting as a "remove" operation when an inode's link count is one.

Use make grade to test your code. Your code should pass the "inode_link/inode_unlink" tests.

After Exercise 5, you should be able to make hard links. Try something like

$ echo "hello" > "mnt/world"; ln "mnt/world" "mnt/hello"; rm "mnt/world"; cat "mnt/hello"

The tests after "inode_link/inode_unlink" are all, in one way or another, stress tests for the driver. Each of them relies on the core functionality that you implemented; some can fail if you didn't handle certain edge cases correctly. If you fail one of these tests, go back and check the logic in your code to make sure you didn't miss anything.

Honors supplement. Honors students (section 001) should choose ONE of the following two exercises; students in section 002 are free (even encouraged) to do the exercises, but there will be no credit given.

Exercise 6 (Honors). The file system is likely to be corrupted if it gets interrupted in the middle of an operation (for example, by a crash or a reboot). Implement soft updates or journalling to make the file system crash-resilient and demonstrate some situation where the old file system would get corrupted, but yours doesn't.

Exercise 7 (Honors). Currently, our file system allocates one block (4096 bytes) per inode. However, each struct inode only takes up 98 bytes. If we were clever with the file system design, we could store 4096/98 = 41 inodes (rounding down) in every block. Modify the file system so that inodes are stored more compactly on disk. You may want to make the file system more like a traditional UNIX file system by splitting up the disk into inode and data regions, so that it is easier to reference inodes by an index (generally called an "inum", for "inode number") into the inode region.
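
As a starting point, the index arithmetic might look like the sketch below. INODE_REGION_START and the choice to pack inodes at 98-byte granularity are hypothetical, not lab code. Note that real file systems usually round the on-disk inode size up to a power of two (ext2 uses 128 bytes) so that inodes never straddle block boundaries and the division stays cheap:

#include <stdint.h>

#define BLKSIZE            4096
#define INODE_SIZE         98                      /* sizeof(struct inode) */
#define INODES_PER_BLOCK   (BLKSIZE / INODE_SIZE)  /* = 41 */
#define INODE_REGION_START 2    /* hypothetical: first block after the
                                   superblock and (here, one) bitmap block */

/* Hypothetical mapping from an inode number to its location on disk. */
static uint32_t inode_blockno(uint32_t inum)
{
    return INODE_REGION_START + inum / INODES_PER_BLOCK;
}

static uint32_t inode_blockoff(uint32_t inum)
{
    return (inum % INODES_PER_BLOCK) * INODE_SIZE;
}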

Questions

Answer the following questions in answers.txt.

  1. How long approximately did it take you to do this lab?
  2. Do you feel like you gained an understanding of how to build a file system in this lab? Please suggest improvements.

Handing in the lab

If you are working in a pair, only one of the team members should submit.

If you have not done so yet, commit your code.

$ git status  # see what files you changed
$ git diff    # examine your changes
$ git commit -am "My solutions to lab6"
Then, follow these steps:

After a successful submission, you should receive email confirmation (at your NYU email address).

If you submit multiple times, we will take the latest submission and count slack hours accordingly.

This completes the lab.

Acknowledgement: the diagram explaining FUSE is adapted from the diagram displayed on FUSE's homepage.


Last updated: Mon May 04 11:24:46 -0400 2015