Class 8
CS 372H
12 February 2010 (by video)

Outline
-------

1. Last time

2. Threads
    --Intro
    --User-level threads
    --Kernel threads
    --Scheduling threads

---------------------------------------------------------------------------

1. Last time

    --replacement policies

    --point was:

        --optimal is known as OPT or MIN (textbook asserts but doesn't
          prove optimality...notes from last time contain a proof)

        --LRU is usually a good approximation to optimal

        --implementing LRU in hardware or at the OS/hardware interface
          is a pain

        --so implement CLOCK or NTH CHANCE ... decent approximations to
          LRU, which is in turn a good approximation to OPT

            *assuming that the past is a good predictor of the future*

    --note that caching doesn't always save the day: there may simply be
      too much demand on memory

        --see notes from last time about ways of handling this case

2. Threads

A. Introduction

    --Recall what processes were all about: a way to isolate some
      computations. give the process the illusion that it is executing
      sequentially

    --But the process abstraction isn't always enough......

        might want to have a process that takes advantage of multiple
        CPUs (why not rely on the OS to schedule different processes on
        different CPUs?)

        some computations are naturally structured as being done in
        parallel. Examples:

        --producer/consumer situations. shows up everywhere: get
          messages from the network, and each message causes the process
          to execute a query on a database.

            --could structure this as a single process that reads from
              the network and the disk at once and does everything
              together

            --potentially cleaner way to do it: one thread reads from
              the network and classifies requests. another thread
              consumes from the queues and answers the requests.

        --Web servers

            --want a pool of different threads to handle requests from
              the network

        --I/O intensive sub-tasks mixed with CPU intensive sub-tasks

        --CPU intensive sub-tasks mixed with other CPU intensive
          sub-tasks

        --counter-argument: if you're always I/O bound, avoid threads
          (and their accompanying errors) and just program in
          event-driven style. Very old debate. Threading is winning
          because event-driven code can't really take advantage of
          multiple CPUs.

    --*threads* are an abstraction that represents a sequential set of
      instructions but that executes within the address space of a
      process. a thread can see the same memory that the process's other
      threads can. more specifically, a thread is a set of registers
      (including a program counter) and a stack, but *not* its own page
      directory.

        [draw picture comparing single-threaded process to
        multi-threaded process]

    --abstraction/illusion: multiple threads are executing at once

        but we'll see that this only actually happens sometimes

    --NOTE: In class we talked about processes first, and now we are
      talking about threads, but in the labs, you will first work with
      threads (in lab T) and then implement processes in JOS, which are
      known as environments.

    classification:

                                    # address spaces
                                    one                 many
        # threads per
        addr space        one       MS-DOS              traditional Unix
                                    Palm OS

                          many      embedded systems,   VMS, Mach, NT,
                                    Pilot               Solaris, HP-UX, ...

        (Pilot was the OS on the first personal computer ever built --
        the Alto. the idea was that there was no need for protection if
        there was only one user.)

    --NOTE: lots of ways to structure computations:

        --event-driven
        --threaded
        --processes
        --different computers

        [we'll come back to this point later on.]

        threads are a very natural way to do multiple tasks that operate
        on the same memory state. (a small example follows.)
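    To make the shared-address-space point concrete, here is a minimal
    sketch using POSIX threads as a stand-in for the generic thread API
    introduced next (pthread_create/pthread_join are the real-world
    analogues of thread_create/thread_join). Both threads name the same
    global variable, which two separate processes could not do without
    explicitly setting up shared memory:

        #include <pthread.h>
        #include <stdio.h>

        /* both threads run in one address space, so the global
           'shared' names the same memory for each of them */
        static int shared = 0;

        static void *child(void *arg) {
            shared = 42;            /* write through the shared space */
            return NULL;
        }

        int main(void) {
            pthread_t t;
            pthread_create(&t, NULL, child, NULL);
            pthread_join(t, NULL);  /* wait for the child to exit */
            printf("main sees shared = %d\n", shared); /* prints 42 */
            return 0;
        }

    A second process forked from the first would get its own copy of
    'shared'; the thread sees the very same memory because it shares the
    process's page directory.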
--thread API

    tid thread_create(void (*fn)(void*), void* arg);
        --create a new thread, run fn(arg)

    void thread_exit();
        --destroy the current thread

    void thread_join(tid thread);
        --wait for the given thread to exit

    and lots of support for synchronization (which we'll see perhaps
    later today and in upcoming classes)

--example use:

    --threaded web server services clients simultaneously:

        for (;;) {
            fd = accept_client();
            thread_create(service_client, &fd);
        }

    --we will see other uses....

--So let's examine threads for a bit..... two common models:

    * user-level threads
    * kernel threads

  in both cases we will look at:

    * thread control blocks
    * dispatch/switch()
    * the level of true concurrency

B. User-level threads

    --kernel is totally ignorant of user-level threads

    --thread_create() allocates a new stack

        --do we need memory space for registers?

    --keep a queue of runnable threads

    --run-time system:

        --wraps system calls: if a call would block, switch and run a
          different thread instead

        --does scheduling:

            --thread is running
            --save thread state (to TCB)
            --Choose new thread to run
            --Load its state (from TCB)
            --new thread is running

    --when do the above steps happen? Two options:

        1. Only when a thread calls yield() or blocks on I/O

            --This is called *cooperative multithreading* or
              *non-preemptive multithreading*.

            --Upside: Makes it pretty easy to avoid errors from
              concurrency

            --Downside: Harder to program because now the threads have
              to be good about yielding, and you might have forgotten to
              yield inside a CPU-bound task.

        2. What if we wanted to make user-level threads switch
           non-deterministically?

            --deliver a periodic timer interrupt or signal to a thread
              scheduler [setitimer()]. When the scheduler gets its
              interrupt, swap out the thread.

            --makes it way more complex to program with user-level
              threads

            --in practice, systems aren't usually built this way, but
              sometimes it is what you want (e.g., if you're simulating
              some OS-like thing inside a process, and you want
              non-determinism)

    --Multi-threaded web server example:

        --Thread calls read() to get data from a remote web browser

        --"fake" user-level read call makes the read() syscall in
          non-blocking mode

        --No data? schedule another thread

        --When idle or on a timer, check which connections have new
          data, and switch() to one of them

    --How to switch threads in the cooperative context? see handout.....

        [draw picture of the two stacks]

        basic idea: switch() is called at "sane" moments, in response to
        a function call from a thread. That function is usually yield(),
        i.e., the call graph usually looks like this:

            read_wrapper()
                check whether read would block
                if read would block:
                    yield()
                        switch()

        make sure you understand what is going on and how switch()
        works..... (a minimal sketch appears at the end of this
        section.)

    --What if we are in a non-cooperative context? then a thread could
      be switched out at any moment, so its state is not neatly arranged
      on the stack, per the call graph

        but in that case, the OS would have put the thread's registers
        in a trap frame, and the run-time can yank the thread's
        registers, save them in the TCB or on the thread's regular
        stack, and then restore them later (i.e., thread switching by
        the user-level run-time looks a lot like process switching by
        the kernel).

    Notes/questions:

        --In the kernel's PCB, only one set of registers is stored.....

        --QUESTION: where are the other registers for the other threads?
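    Here is a minimal sketch of cooperative user-level threading,
    assuming the POSIX ucontext calls (getcontext/makecontext/
    swapcontext, available on Linux) as a stand-in for the handout's
    hand-written switch(); the names tcb, yield(), and worker() are
    mine, not from any class library. It also suggests an answer to the
    QUESTION above: when a thread is not running, its registers sit in
    its ucontext_t, which plays the role of the TCB.

        #include <stdio.h>
        #include <ucontext.h>

        #define NTHREADS   2
        #define STACK_SIZE (64 * 1024)

        static ucontext_t main_ctx, tcb[NTHREADS]; /* saved registers */
        static char stacks[NTHREADS][STACK_SIZE];  /* one stack each  */
        static int current;

        /* cooperative switch: save this thread's registers into its
           TCB and load the next runnable thread's registers */
        static void yield(void) {
            int prev = current;
            current = (current + 1) % NTHREADS;
            swapcontext(&tcb[prev], &tcb[current]);
        }

        static void worker(void) {
            for (int i = 0; i < 3; i++) {
                printf("thread %d, step %d\n", current, i);
                yield();            /* voluntarily give up the CPU */
            }
        }   /* returning resumes uc_link, i.e., main */

        int main(void) {
            for (int i = 0; i < NTHREADS; i++) {
                getcontext(&tcb[i]);             /* init the context */
                tcb[i].uc_stack.ss_sp = stacks[i];
                tcb[i].uc_stack.ss_size = STACK_SIZE;
                tcb[i].uc_link = &main_ctx;      /* run after return */
                makecontext(&tcb[i], worker, 0);
            }
            swapcontext(&main_ctx, &tcb[0]);     /* start thread 0 */
            return 0;
        }

    Notice that the kernel sees only one process here: all of this
    switching is invisible to it, which is exactly why a single blocking
    syscall would stall every thread (the next point).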
    Disadvantages of user-level threads:

        --Can we imagine having two user-level threads truly executing
          at once, that is, on two different processors? (Answer: no.)

        --Related question: what happens if a user-level thread executes
          a blocking system call, like read(fd, ....) to a disk?

            --answer: *all* threads block because, to the kernel, the
              process is blocked

            --This is why threading libraries typically wrap system
              calls: so that when a thread calls read(), the library
              turns it into a non-blocking version of read().

            --Unfortunately, disk calls in traditional Unix are always
              blocking, so we either need to:

                --extend the API
                --live with this
                --use elaborate hacks with memory-mapped files (e.g.,
                  files are all memory-mapped, and the run-time asks to
                  handle its own page faults, if the OS allows it)

            --What if the OS handles page faults for the process? (then
              a page fault in one thread blocks all threads)

C. Kernel threads

    --Kernel maintains TCBs

        --looks a lot like a PCB

        --[Draw picture]

    --thread_create() becomes a syscall

    --when do thread switches happen?

        --with kernel-level threads, they can happen at any point

    --basic game plan for dispatch/switch:

        --thread is running
        --switch to the kernel
        --save thread state (to TCB)
        --Choose new thread to run
        --Load its state (from TCB)
        --new thread is running

    --Can two kernel-level threads execute on two different processors?
      (Answer: yes.)

    --Disadvantages of kernel threads:

        --every thread operation (create, exit, join, synchronize, etc.)
          goes through the kernel

            --> 10x-30x slower than user-level threads

        --heavier-weight memory requirements (each thread gets a stack
          in user space *and* within the kernel. compare to user-level
          threads: each thread gets a stack in user space, and there's
          one stack within the kernel that corresponds to the process.)

    --Old debates about user-level threads vs. kernel threads. The
      "Scheduler Activations" paper, by Anderson et al. [ACM
      Transactions on Computer Systems 10, 1 (February 1992), pp.
      53-79], proposes an abstraction that is a hybrid of the two.

    --Some people think that threads, i.e., concurrent applications,
      shouldn't be used at all (because of the many bugs and difficult
      cases that come up, as we'll discuss). However, that position is
      becoming increasingly untenable, given multicore computing.

        --The fundamental reason is this: if you have a
          computation-intensive job that wants to take advantage of all
          of the hardware resources of a machine, you either need to
          (a) structure the job as different processes; or (b) use
          kernel threads. There is no other way, given mainstream OS
          abstractions, to take advantage of a machine's parallelism.
          (a) winds up being inconvenient (in order to share data, the
          processes either have to separately set up shared memory
          regions, or else pass messages). So people use (b). (a sketch
          using kernel threads follows.)
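    As a concrete instance of (b), here is a sketch assuming POSIX
    threads, which on Linux are kernel threads (one kernel TCB per
    thread, created via a syscall); the workload is made up for
    illustration. The point is that the two CPU-bound workers can truly
    run on two processors at once, which no user-level package can
    achieve:

        #include <pthread.h>
        #include <stdio.h>

        /* each pthread is backed by a kernel thread (1:1 on Linux),
           so these CPU-bound workers can run on two CPUs at once */
        static void *worker(void *arg) {
            long id = (long)arg, sum = 0;
            for (long i = 0; i < 100000000; i++)
                sum += i;                  /* CPU-intensive sub-task */
            printf("worker %ld done (sum=%ld)\n", id, sum);
            return NULL;
        }

        int main(void) {
            pthread_t t[2];
            for (long i = 0; i < 2; i++)
                pthread_create(&t[i], NULL, worker, (void *)i);
            for (int i = 0; i < 2; i++)
                pthread_join(t[i], NULL);  /* cf. thread_join(tid) */
            return 0;
        }

    Compile with gcc -pthread; on a multicore machine, timing the run
    should show the two workers overlapping rather than taking twice as
    long as one.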
D. Scheduling threads

    --Dispatcher can choose:

        --to run each thread to completion

        --to time-slice in big chunks

        --to time-slice so that each thread executes only one
          instruction at a time

    --Programs must work in all cases, for all interleavings

    --So how can you know whether your concurrent program works?
      Whether *all* interleavings work?

        1. Enumerate and test all possibilities? (Impossible.)

        2. Instead, maintain *invariants* on program state; structure
           the program carefully to maintain these invariants

    --General strategy for dealing with concurrency:

        --use *atomic actions* [meaning the action is indivisible,
          regardless of how things are interleaved] to....

        --....build higher-level abstractions....

            --example: mutexes

        --....that provide invariants we can reason about....

            --example: only one thread of control is modifying a linked
              list at once (see the sketch below)

    --This is our transition to the general topic of concurrency, which
      will occupy us for the next few classes.
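    As a preview of where we're headed, here is a sketch of that
    linked-list example using a POSIX mutex (mutexes proper come in the
    next classes; list_push() is a made-up helper). The lock turns the
    two-pointer update into one atomic action, so no interleaving can
    observe a half-linked node:

        #include <pthread.h>
        #include <stdlib.h>

        /* invariant: 'head' always points at a well-formed list */
        struct node { int val; struct node *next; };

        static struct node *head;
        static pthread_mutex_t list_lock = PTHREAD_MUTEX_INITIALIZER;

        void list_push(int val) {
            struct node *n = malloc(sizeof *n);
            n->val = val;
            pthread_mutex_lock(&list_lock);   /* begin atomic action */
            n->next = head;
            head = n;                         /* invariant restored  */
            pthread_mutex_unlock(&list_lock); /* end atomic action   */
        }

    Without the lock, two concurrent pushes could both read the old head
    and one insertion would be lost; with it, every interleaving
    preserves the invariant.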