Class 8
CS 372H 10 February 2011

On the board
------------
(One handout)

1. Last time

2. Threads
    --Intro
    --User-level threading
    --Kernel-level threading

---------------------------------------------------------------------------

1. Last time

--replacement policies

--just to connect virtual memory and RAM to our abstract examples from
  last time:
    --the S1,S2,S3 are physical pages
    --the A,B,C,D in the virtual memory context are (process_id, VPN)
      pairs, representing the *virtual* page that happens to live in a
      given physical page

--note that caching doesn't always save the day: there may simply be too
  much demand on memory
    --see notes from last time about ways of handling this case

--we asked how the kernel could figure out how many pages the process was
  using; one answer is page fault interposition.

--tie up a loose end: Note that many machines, x86 included, maintain 4
  bits per page table entry:

    --*use*: Set when page referenced; cleared by an algorithm like CLOCK
      (the bit is called "Accessed" on x86)

    --*modified*: Set when page modified; cleared when page written to
      disk (the bit is called "Dirty" on x86)

    --*valid*: Program can reference this page without getting a page
      fault. Set if page is in memory?
        [no. it is "only if", not "if". *valid*=1 implies page in
        physical memory. but page in physical memory does not imply
        *valid*=1; in other words, *valid*=0 does not imply page is not
        in physical memory.]

    --*read-only*: program can read page, but not modify it. Set if page
      is truly read-only?
        [no. similar case to above, but slightly confusing because the
        bit is called "writable". if a page's bits are such that it
        appears to be read-only, it may or may not be because it is
        truly "read only". but if a page is truly read-only, it better
        have its bits set to be read-only.]

    Do we actually need Modified and Use bits in the page tables set by
    the hardware?

    --answer: no.

    --how could we simulate them?

        --for the Modified [x86: Dirty] bit, just mark all pages
          read-only.
          Then if a write happens, the OS gets a page fault and can set
          the bit itself. Then the OS should mark the page writable so
          that this page fault doesn't happen again.

        --for the Use [x86: Accessed] bit, just mark all pages as not
          present (even if they are present). Then if a reference
          happens, the OS gets a page fault, and can set the bit, after
          which point the OS should mark the page present (i.e., set the
          PRESENT bit).

2. Threads

--How many people have programmed with threads before? (It's okay if you
  haven't; in fact, it's possibly better.....)

A. Introduction

[In class, we went over some motivation, but it was quite muddy and
discussed too many new ideas at once. Here, in the notes, I'm
simplifying the motivation and introduction. Later, we'll circle back
and dig deeply into those ideas. They concern when and whether threads
are truly needed.]

--*threads* are a very natural way to do multiple tasks that operate on
  the same memory state. there are two fundamental motivations for
  threads, though not every motivation applies to every instance:

    (1) desire to have a single process take advantage of multiple CPUs

        (*) --> but we'll see that whether the process can in fact take
        advantage of multiple CPUs depends on the implementation of
        threads

    (2) often very natural to structure some computation (or task or job
        or whatever) as multiple units of control that see the same
        memory

        (*) --> but we'll see that this motivation depends on the
        computation itself

--abstraction/illusion: sequential set of instructions that executes
  within the address space of a process

    (i) a thread *is* a set of registers (including a PC/IP) and a
        stack.

    (ii) multiple threads within the same process share the same memory.
         (they can even read and write each other's stacks, but if there
         are no bugs, that should not happen. generally the memory that
         they both look at is heap memory or statically initialized
         memory.)

        --another way to put this: a thread does not have its own page
          directory.
          so on the x86, two threads share the same value of %cr3.

    (iii) multiple threads within the same process are executing at once

        (*) --> but we'll see that this only actually happens sometimes

[Note for your studying: if you truly understand why each of the three
counterpoints marked "(*) --> but" above is true, then you have a good
handle on the true motivations for threads and on what problems threads
are solving.]

--thread API

    tid thread_create(void (*fn)(void*), void* arg);
        --create a new thread, run fn(arg)

    void thread_exit();
        --destroy current thread

    void thread_join(tid thread);
        --wait for thread to exit

    and lots of support for synchronization (which we'll see in upcoming
    classes)

--example uses:

    --EXAMPLE #1:

        int main(int argc, char** argv) {
            thread_create(stage1_processing, NULL);
            thread_create(stage2_processing, NULL);
        }

        void stage1_processing(void* arg) {
            while (1) {
                do_some_CPU_intensive_things();
                /* when done, enqueue to some task list */
            }
        }

        void stage2_processing(void* arg) {
            while (1) {
                /* dequeue a task from some task list */
                /* do some processing */
                /* print some output to terminal */
            }
        }

        above, threading is serving to overlap computation (the
        CPU-intensive things) and I/O (the printing to the terminal).
        while the second thread sleeps waiting for the data to go to the
        terminal, the first thread can do CPU-intensive things.

    --EXAMPLE #2: threaded web server services clients simultaneously:

        for (;;) {
            fd = accept_client();
            thread_create(service_client, &fd);
        }

        void service_client(void* arg) {
            int* fd_ptr = (int*)arg;
            int fd = *fd_ptr;
            while (client_request_not_read_in) {
                read(fd, ....);    /* [+] */
            }
            do_work_for_client();
            while (response_to_client_not_fully_written_out) {
                write(fd, ...);
            }
            thread_exit();
        }

        the point of the above example is that all of the work for a
        single client is encapsulated. imagine if all of that work had
        to happen within a single thread of control; it could be done,
        but it would not be as convenient.
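        As an aside: the thread_create/thread_join API above maps
        closely onto POSIX threads, which is what you'd use to write
        this pattern on Unix today. The sketch below (not from the
        lecture; the fd values are made up) shows the per-client
        pattern. One detail worth noticing: passing &fd from the accept
        loop is racy, because accept_client() may overwrite fd before
        the new thread reads it, so this sketch passes each thread its
        own heap copy.

        ```c
        #include <pthread.h>
        #include <stdio.h>
        #include <stdlib.h>

        /* stub standing in for the real request-handling loop above */
        void *service_client(void *arg) {
            int fd = *(int *)arg;   /* copy the descriptor, then free the box */
            free(arg);
            printf("servicing client on fd %d\n", fd);
            return NULL;            /* returning plays the role of thread_exit() */
        }

        int main(void) {
            pthread_t tids[3];
            for (int i = 0; i < 3; i++) {
                /* pretend accept_client() returned descriptor 10+i; a
                   heap copy per thread avoids the race of handing every
                   thread the address of one reused stack slot */
                int *fdp = malloc(sizeof *fdp);
                *fdp = 10 + i;
                pthread_create(&tids[i], NULL, service_client, fdp);
            }
            for (int i = 0; i < 3; i++)
                pthread_join(tids[i], NULL);   /* thread_join(tid) analogue */
            return 0;
        }
        ```

        (the three output lines can appear in any order, which is itself
        a preview of the non-determinism that threading introduces.)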
        Note that, to the thread, the read() and write() look to be
        *blocking*. That means that the thread only continues past the
        read() or write() if there is data for it, or if the output
        channel can accommodate data, respectively. However, to the
        module that *implements* threading, the read() and write() are
        non-blocking (we define these terms below).

--we will see other uses in the coming weeks

--the implementation of thread_create, thread_exit, etc. can be done at
  many different layers in the system:

    --user space (here, there's a library or a thread run-time, and the
      kernel does not know that the process is multithreaded)
    --kernel
    --Java virtual machine
    --Flash player
    --etc.

--relationship to labs

    --you are now in the middle of implementing processes (labs 3 and 4)

    --in lab T, you will work with threads. you can imagine implementing
      threads inside a JOS user process (known as an environment in the
      context of JOS), or you can imagine JOS providing the facility.
      lab T, however, will execute on Unix.

--So let's examine threads for a bit.....

    two common models:

        * user-level threading
        * kernel-level threading

    in both cases we will look at:

        * thread control blocks (TCBs; analogy is with PCBs)
        * dispatch/swtch()
        * the level of true concurrency

B. User-level threading

--kernel is totally ignorant of user-level threads

--thread_create() allocates a new stack
    --do we need memory space for registers?

--keep a queue of runnable threads

--run-time system:
    --provides a layer above system calls: if they would block, switch,
      and run a different thread
    --does scheduling:
        --thread is running
        --save thread state (to TCB)
        --choose new thread to run
        --load its state (from TCB)
        --new thread is running

--when do the above steps happen? Two options:

    1. Only when a thread calls yield() or would block on I/O
        --This is called *cooperative multithreading* or *non-preemptive
          multithreading*.
        --Upside: Makes it pretty easy to avoid errors from concurrency
        --Downside: Harder to program because now the threads have to be
          good about yielding, and you might have forgotten to yield
          inside a CPU-bound task.

    2. What if we wanted to make user-level threads switch
       non-deterministically?

        --deliver a periodic timer interrupt or signal to a thread
          scheduler [setitimer()]. When it gets its interrupt, swap out
          the thread.
        --makes it more complex to program with user-level threads
        --in practice, systems aren't usually built this way, but
          sometimes it is what you want (e.g., if you're simulating some
          OS-like thing inside a process, and you want to simulate the
          non-determinism that arises from hardware timer interrupts).

--Before continuing, we need to clarify *blocking* versus *nonblocking*
  I/O calls. [This was something that I muddied in lecture. However,
  understanding this is important to understanding the implementation of
  user-level threading.]

    --Blocking means that the entity making the call (the thread in this
      case) does not progress past the I/O call (often a read() or
      write()) unless there is data for the thread (or, in the case of a
      write, unless the output channel can accommodate the data)

    --Nonblocking means that if the call *would* block, the call instead
      returns immediately with an error, and the thread keeps going.

    --(This idea also pertains to read/write system calls exposed by the
      kernel for the use of a process.)

    --Usually, the *thread* is supposed to see the call as blocking.
      However, there is a subtlety that is important: the other side of
      that call (e.g., the run-time that created the thread abstraction)
      makes a corresponding system call in *non-blocking* mode. That is
      because in this scenario of user-level threads, if the run-time
      *did* block, it wouldn't be able to run another thread.

--As an aside, note that the relationship between the run-time and the
  thread is very similar to the relationship between the kernel and a
  process.
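    To make "would block" concrete: on Unix, a process puts a file
    descriptor in non-blocking mode with fcntl(), after which a read()
    that would have slept instead fails immediately with errno set to
    EAGAIN. A minimal standalone sketch (not from the lecture; it uses a
    pipe rather than a network socket, but the mechanism is the same):

    ```c
    #include <errno.h>
    #include <fcntl.h>
    #include <stdio.h>
    #include <unistd.h>

    int main(void) {
        int fds[2];
        if (pipe(fds) < 0)
            return 1;

        /* put the read end in non-blocking mode */
        int flags = fcntl(fds[0], F_GETFL);
        fcntl(fds[0], F_SETFL, flags | O_NONBLOCK);

        char buf[16];
        /* nothing has been written yet, so a blocking read() would
           sleep; a non-blocking read() instead fails immediately */
        ssize_t n = read(fds[0], buf, sizeof buf);
        if (n == -1 && (errno == EAGAIN || errno == EWOULDBLOCK))
            printf("read would block\n");

        write(fds[1], "hi", 2);
        n = read(fds[0], buf, sizeof buf);   /* now there is data */
        printf("read %zd bytes\n", n);
        return 0;
    }
    ```

    where this program prints "read would block" and moves on, a
    user-level threading run-time would instead yield() to another
    thread and retry the read later.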
  When a process makes a blocking I/O call (most of you have done this
  at some point in your life -- pretty much whenever you called read()
  to get the data in some file), the kernel puts the process to sleep
  until the data arrives from the disk. But just as the run-time issues
  the I/O syscall to the kernel in non-blocking mode, the kernel issues
  the I/O request to the disk in non-blocking mode. The reason is that
  if the kernel went to sleep every time it waited on data from the
  disk, then the kernel wouldn't be able to run other processes. Put
  differently, the abstraction of "sleeping until there is data
  available" is an abstraction presented to the higher layer, and the
  lower layer implements that abstraction by simply not running the
  higher layer until the data is available.

--To return to our multi-threaded Web server example from above:

    --Recall that the thread calls read() to get data from the remote
      web browser

    --Let's assume that the Web server is using user-level threading.
      Then, the read() in the Web server example (marked with "[+]") is
      actually a "fake" call implemented by the threading run-time. The
      run-time makes the true read() syscall (exposed by the kernel) in
      non-blocking mode.

        (*) --> subtlety/exception: read/write syscalls for disk I/O
        cannot be issued in non-blocking mode, but you can ignore this
        point for now. we'll come back to it

    --If the kernel has no data for the run-time, the run-time makes the
      calling thread yield() and schedules another thread, one that
      itself had previously not been running.

    --When the run-time is idle, or on a timer, it checks which
      connections have new data, and switches to one of them

--Let's look at how the above process is implemented, focusing on the
  register/EIP/stack switching. We will further focus on the case of
  *cooperative* user-level multithreading.

    Basic idea: swtch() called at "sane" moments, in response to a
    function call from a thread.
    That function is usually yield(), i.e., the call graph usually looks
    like this:

        fake_read()
            if read would block
                yield()
                    swtch()

    and the pseudocode looks something like this:

        int fake_read(int fd, char* buf, int num) {
            int nread = -1;
            while (nread == -1) {
                /* this is a non-blocking read() syscall */
                nread = read(fd, buf, num);
                if (nread == -1) { /* read would block */
                    yield();
                }
            }
            return nread;
        }

        void yield() {
            tid next = pick_next_thread(); /* get a runnable thread */
            tid current = get_current_thread();
            swtch(current, next);
        }

--to repeat, what "would block" means:
    --in the read direction, it means that there's no data to read
    --in the write direction, it means that the output buffers are full,
      so the write cannot happen yet

--how is swtch() implemented?
    --see handout.....
    --[draw picture of the two stacks]
    --make sure you understand what is going on

--How to switch threads in a non-cooperative context?

    In a non-cooperative context, a thread could be switched out at any
    moment, so its state is not neatly arranged on the stack, per the
    call graph. but in that case, the OS would have put some of the
    thread's registers in a trap frame, and the run-time can yank those
    registers, save them (and the other registers) in the TCB or on the
    thread's regular stack, and then restore them later.

    Said differently, thread switching by the user-level run-time looks
    a lot like process switching by the kernel.

Notes/questions:

    --In the kernel's PCB, only one set of registers is stored.....
    --QUESTION: where are the other registers for the other threads?

Disadvantages to user-level threads:

    --Can we imagine having two user-level threads truly executing at
      once, that is, on two different processors? (Answer: no. why?)

    --What if the OS handles page faults for the process? (then a page
      fault in one thread blocks all threads)
        --(not a huge issue in practice)

    --Similarly, if a thread needs to go to disk, then that actually
      blocks *all* threads (since the kernel won't allow the run-time to
      make a non-blocking read() call to the disk). So what do we do
      about this?
        --extend the API
        --live with it
        --use elaborate hacks with memory-mapped files (e.g., files are
          all memory-mapped, and the runtime asks to handle its own page
          faults, if the OS allows it)

C. Kernel-level threading

--next time