Class 8
CS 372H 10 February 2011

On the board
------------
(One handout)

1. Last time

2. Threads
    --Intro
    --User-level threading
    --Kernel-level threading

---------------------------------------------------------------------------

1. Last time

--replacement policies

--just to connect virtual memory and RAM to our abstract examples from
  last time:
    --the S1,S2,S3 are physical pages
    --the A,B,C,D in the virtual memory context are (process_id, VPN)
      pairs, representing the *virtual* page that happens to live in a
      given physical page

--note that caching doesn't always save the day: there may simply be too
  much demand on memory
    --see notes from last time about ways of handling this case

--we asked how the kernel could figure out how many pages the process was
  using; one answer is page fault interposition.

--tie up a loose end: Note that many machines, x86 included, maintain 4
  bits per page table entry:

    --*use*: Set when page referenced; cleared by an algorithm like CLOCK
      (the bit is called "Accessed" on x86)

    --*modified*: Set when page modified; cleared when page written to
      disk (the bit is called "Dirty" on x86)

    --*valid*: Program can reference this page without getting a page
      fault. Set if page is in memory?
        [no. it is "only if", not "if". *valid*=1 implies page in
        physical memory. but page in physical memory does not imply
        *valid*=1; in other words, *valid*=0 does not imply page is not
        in physical memory.]

    --*read-only*: program can read page, but not modify it. Set if page
      is truly read-only?
        [no. similar case to above, but slightly confusing because the
        bit is called "writable". if a page's bits are such that it
        appears to be read-only, it may or may not be because it is
        truly "read only". but if a page is truly read-only, it better
        have its bits set to be read-only.]

    Do we actually need Modified and Use bits in the page tables set by
    the hardware?

    --answer: no.

    --how could we simulate them?

        --for the Modified [x86: Dirty] bit, just mark all pages
          read-only.
          Then if a write happens, the OS gets a page fault and can set
          the bit itself. Then the OS should mark the page writable so
          that this page fault doesn't happen again.

        --for the Use [x86: Accessed] bit, just mark all pages as not
          present (even if they are present). Then if a reference
          happens, the OS gets a page fault, and can set the bit, after
          which point the OS should mark the page present (i.e., set the
          PRESENT bit).

2. Threads

--How many people have programmed with threads before? (It's okay if you
  haven't; in fact, it's possibly better.....)

A. Introduction

[In class, we went over some motivation, but it was quite muddy and
discussed too many new ideas at once. Here, in the notes, I'm
simplifying the motivation and introduction. Later, we'll circle back
and dig deeply into those ideas. They concern when and whether threads
are truly needed.]

--*threads* are a very natural way to do multiple tasks that operate on
  the same memory state. there are two fundamental motivations for
  threads, though not every motivation applies to every instance:

    (1) desire to have a single process take advantage of multiple CPUs

        (*) --> but we'll see that whether the process can in fact take
        advantage of multiple CPUs depends on the implementation of
        threads

    (2) often very natural to structure some computation (or task or job
        or whatever) as multiple units of control that see the same
        memory

        (*) --> but we'll see that this motivation depends on the
        computation itself

--abstraction/illusion: sequential set of instructions that executes
  within the address space of a process

    (i) a thread *is* a set of registers (including a PC/IP) and a
        stack.

    (ii) multiple threads within the same process share the same memory.
         (they can even read and write each other's stacks, but if there
         are no bugs, that should not happen. generally the memory that
         they both look at is heap memory or statically initialized
         memory.)

        --another way to put this: a thread does not have its own page
          directory.
          so on the x86, two threads share the same value of %cr3.

    (iii) multiple threads within the same process are executing at once

        (*) --> but we'll see that this only actually happens sometimes

[Note for your studying: if you truly understand why each of the three
counterpoints marked "(*) --> but" above is true, then you have a good
handle on the true motivations for threads and on what problems threads
are solving.]

--thread API

    tid thread_create(void (*fn)(void*), void* arg);
        --create a new thread, run fn(arg)

    void thread_exit();
        --destroy current thread

    void thread_join(tid thread);
        --wait for thread to exit

    and lots of support for synchronization (which we'll see in upcoming
    classes)

--example uses:

    --EXAMPLE #1:

        int main(int argc, char** argv) {
            thread_create(stage1_processing, NULL);
            thread_create(stage2_processing, NULL);
        }

        void stage1_processing(void* arg) {
            while (1) {
                do_some_CPU_intensive_things();
                /* when done, enqueue to some task list */
            }
        }

        void stage2_processing(void* arg) {
            while (1) {
                /* dequeue a task from some task list */
                /* do some processing */
                /* print some output to terminal */
            }
        }

        above, threading is serving to overlap computation (the
        CPU-intensive things) and I/O (the printing to the terminal).
        while the second thread sleeps waiting for the data to go to the
        terminal, the first thread can do CPU-intensive things.

    --EXAMPLE #2: threaded web server services clients simultaneously:

        for (;;) {
            fd = accept_client();
            thread_create(service_client, &fd);
        }

        void service_client(void* arg) {
            int* fd_ptr = (int*)arg;
            int fd = *fd_ptr;
            while (client_request_not_read_in) {
                read(fd, ....);    /* [+] */
            }
            do_work_for_client();
            while (response_to_client_not_fully_written_out) {
                write(fd, ...);
            }
            thread_exit();
        }

        the point of the above example is that all of the work for a
        single client is encapsulated. imagine if all of that work had
        to happen within a single thread of control; it could be done,
        but it would not be as convenient.
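        As an aside: the thread_create/thread_join API above maps
        closely onto POSIX threads, which is what you'd use to write
        this pattern on Unix today. The sketch below (not from the
        lecture; the fd values are made up) shows the per-client
        pattern. One detail worth noticing: passing &fd from the accept
        loop is racy, because accept_client() may overwrite fd before
        the new thread reads it, so this sketch passes each thread its
        own heap copy.

        ```c
        #include <pthread.h>
        #include <stdio.h>
        #include <stdlib.h>

        /* stub standing in for the real request-handling loop above */
        void *service_client(void *arg) {
            int fd = *(int *)arg;   /* copy the descriptor, then free the box */
            free(arg);
            printf("servicing client on fd %d\n", fd);
            return NULL;            /* returning plays the role of thread_exit() */
        }

        int main(void) {
            pthread_t tids[3];
            for (int i = 0; i < 3; i++) {
                /* pretend accept_client() returned descriptor 10+i; a
                   heap copy per thread avoids the race of handing every
                   thread the address of one reused stack slot */
                int *fdp = malloc(sizeof *fdp);
                *fdp = 10 + i;
                pthread_create(&tids[i], NULL, service_client, fdp);
            }
            for (int i = 0; i < 3; i++)
                pthread_join(tids[i], NULL);   /* thread_join(tid) analogue */
            return 0;
        }
        ```

        (the three output lines can appear in any order, which is itself
        a preview of the non-determinism that threading introduces.)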
        Note that, to the thread, the read() and write() look to be
        *blocking*. That means that the thread only continues past the
        read() or write() if there is data for it, or if the output
        channel can accommodate data, respectively. However, to the
        module that *implements* threading, the read() and write() are
        non-blocking (we define these terms below).

--we will see other uses in the coming weeks

--the implementation of thread_create, thread_exit, etc. can be done at
  many different layers in the system:

    --user space (here, there's a library or a thread run-time, and the
      kernel does not know that the process is multithreaded)
    --kernel
    --Java virtual machine
    --Flash player
    --etc.

--relationship to labs

    --you are now in the middle of implementing processes (labs 3 and 4)

    --in lab T, you will work with threads. you can imagine implementing
      threads inside a JOS user process (known as an environment in the
      context of JOS), or you can imagine JOS providing the facility.
      lab T, however, will execute on Unix.

--So let's examine threads for a bit.....

    two common models:

        * user-level threading
        * kernel-level threading

    in both cases we will look at:

        * thread control blocks (TCBs; analogy is with PCBs)
        * dispatch/swtch()
        * the level of true concurrency

B. User-level threading

--kernel is totally ignorant of user-level threads

--thread_create() allocates a new stack
    --do we need memory space for registers?

--keep a queue of runnable threads

--run-time system:
    --provides a layer above system calls: if they would block, switch,
      and run a different thread
    --does scheduling:
        --thread is running
        --save thread state (to TCB)
        --choose new thread to run
        --load its state (from TCB)
        --new thread is running

--when do the above steps happen? Two options:

    1. Only when a thread calls yield() or would block on I/O
        --This is called *cooperative multithreading* or *non-preemptive
          multithreading*.
        --Upside: Makes it pretty easy to avoid errors from concurrency
        --Downside: Harder to program because now the threads have to be
          good about yielding, and you might have forgotten to yield
          inside a CPU-bound task.

    2. What if we wanted to make user-level threads switch
       non-deterministically?

        --deliver a periodic timer interrupt or signal to a thread
          scheduler [setitimer()]. When it gets its interrupt, swap out
          the thread.
        --makes it more complex to program with user-level threads
        --in practice, systems aren't usually built this way, but
          sometimes it is what you want (e.g., if you're simulating some
          OS-like thing inside a process, and you want to simulate the
          non-determinism that arises from hardware timer interrupts).

--Before continuing, we need to clarify *blocking* versus *nonblocking*
  I/O calls. [This was something that I muddied in lecture. However,
  understanding this is important to understanding the implementation of
  user-level threading.]

    --Blocking means that the entity making the call (the thread in this
      case) does not progress past the I/O call (often a read() or
      write()) unless there is data for the thread (or, in the case of a
      write, unless the output channel can accommodate the data)

    --Nonblocking means that if the call *would* block, the call instead
      returns immediately with an error, and the thread keeps going.

    --(This idea also pertains to read/write system calls exposed by the
      kernel for the use of a process.)

    --Usually, the *thread* is supposed to see the call as blocking.
      However, there is a subtlety that is important: the other side of
      that call (e.g., the run-time that created the thread abstraction)
      makes a corresponding system call in *non-blocking* mode. That is
      because in this scenario of user-level threads, if the run-time
      *did* block, it wouldn't be able to run another thread.

--As an aside, note that the relationship between the run-time and the
  thread is very similar to the relationship between the kernel and a
  process.
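    To make "would block" concrete: on Unix, a process puts a file
    descriptor in non-blocking mode with fcntl(), after which a read()
    that would have slept instead fails immediately with errno set to
    EAGAIN. A minimal standalone sketch (not from the lecture; it uses a
    pipe rather than a network socket, but the mechanism is the same):

    ```c
    #include <errno.h>
    #include <fcntl.h>
    #include <stdio.h>
    #include <unistd.h>

    int main(void) {
        int fds[2];
        if (pipe(fds) < 0)
            return 1;

        /* put the read end in non-blocking mode */
        int flags = fcntl(fds[0], F_GETFL);
        fcntl(fds[0], F_SETFL, flags | O_NONBLOCK);

        char buf[16];
        /* nothing has been written yet, so a blocking read() would
           sleep; a non-blocking read() instead fails immediately */
        ssize_t n = read(fds[0], buf, sizeof buf);
        if (n == -1 && (errno == EAGAIN || errno == EWOULDBLOCK))
            printf("read would block\n");

        write(fds[1], "hi", 2);
        n = read(fds[0], buf, sizeof buf);   /* now there is data */
        printf("read %zd bytes\n", n);
        return 0;
    }
    ```

    where this program prints "read would block" and moves on, a
    user-level threading run-time would instead yield() to another
    thread and retry the read later.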
  When a process makes a blocking I/O call (most of you have done this
  at some point in your life -- pretty much whenever you called read()
  to get the data in some file), the kernel puts the process to sleep
  until the data arrives from the disk. But just as the run-time issues
  the I/O syscall to the kernel in non-blocking mode, the kernel issues
  the I/O request to the disk in non-blocking mode. The reason is that
  if the kernel went to sleep every time it waited on data from the
  disk, then the kernel wouldn't be able to run other processes. Put
  differently, the abstraction of "sleeping until there is data
  available" is an abstraction presented to the higher layer, and the
  lower layer implements that abstraction by simply not running the
  higher layer until the data is available.

--To return to our multi-threaded Web server example from above:

    --Recall that the thread calls read() to get data from the remote
      web browser

    --Let's assume that the Web server is using user-level threading.
      Then, the read() in the Web server example (marked with "[+]") is
      actually a "fake" call implemented by the threading run-time. The
      run-time makes the true read() syscall (exposed by the kernel) in
      non-blocking mode.

        (*) --> subtlety/exception: read/write syscalls for disk I/O
        cannot be issued in non-blocking mode, but you can ignore this
        point for now. we'll come back to it

    --If the kernel has no data for the run-time, the run-time makes the
      calling thread yield() and schedules another thread, one that
      itself had previously not been running.

    --When the run-time is idle, or on a timer, it checks which
      connections have new data, and switches to one of them

--Let's look at how the above process is implemented, focusing on the
  register/EIP/stack switching. We will further focus on the case of
  *cooperative* user-level multithreading.

    Basic idea: swtch() called at "sane" moments, in response to a
    function call from a thread.
    That function is usually yield(), i.e., the call graph usually looks
    like this:

        fake_read()
            if read would block
                yield()
                    swtch()

    and the pseudocode looks something like this:

        int fake_read(int fd, char* buf, int num) {
            int nread = -1;
            while (nread == -1) {
                /* this is a non-blocking read() syscall */
                nread = read(fd, buf, num);
                if (nread == -1) { /* read would block */
                    yield();
                }
            }
            return nread;
        }

        void yield() {
            tid next = pick_next_thread(); /* get a runnable thread */
            tid current = get_current_thread();
            swtch(current, next);
        }

--to repeat, what "would block" means:
    --in the read direction, it means that there's no data to read
    --in the write direction, it means that the output buffers are full,
      so the write cannot happen yet

--how is swtch() implemented?
    --see handout.....
    --[draw picture of the two stacks]
    --make sure you understand what is going on

--How to switch threads in a non-cooperative context?

    In a non-cooperative context, a thread could be switched out at any
    moment, so its state is not neatly arranged on the stack, per the
    call graph. but in that case, the OS would have put some of the
    thread's registers in a trap frame, and the run-time can yank those
    registers, save them (and the other registers) in the TCB or on the
    thread's regular stack, and then restore them later.

    Said differently, thread switching by the user-level run-time looks
    a lot like process switching by the kernel.

Notes/questions:

    --In the kernel's PCB, only one set of registers is stored.....
    --QUESTION: where are the other registers for the other threads?

Disadvantages to user-level threads:

    --Can we imagine having two user-level threads truly executing at
      once, that is, on two different processors? (Answer: no. why?)

    --What if the OS handles page faults for the process? (then a page
      fault in one thread blocks all threads)
        --(not a huge issue in practice)

    --Similarly, if a thread needs to go to disk, then that actually
      blocks *all* threads (since the kernel won't allow the run-time to
      make a non-blocking read() call to the disk). So what do we do
      about this?
        --extend the API
        --live with it
        --use elaborate hacks with memory-mapped files (e.g., files are
          all memory-mapped, and the runtime asks to handle its own page
          faults, if the OS allows it)

C. Kernel-level threading

--next time