Class 15
CS 372H 09 March 2010

On the board
------------
1. Last time
   --exokernel
   --microkernels vs monolithic kernels
2. Liedtke paper
3. Therac-25

---------------------------------------------------------------------------

1. Last time

--microkernels vs monolithic kernels vs exokernels
--More about the trade-off:
    --monolithic makes it easier to have sub-systems cooperate (such as
      the pager and the file system; these modules can read each other's
      data structures)
    --monolithic creates lots of strange interactions, which leads to
      complexity and bugs
    --message passing is another way to handle concurrency. every
      process has its own memory region, and the kernel's job is to
      shuttle messages back and forth.

2. Liedtke paper

L3: commercially available microkernel. evolved into L4, which is still
in use.

--Remember, everything works via message passing
--Therefore message passing, i.e., IPC, needs to be fast

list:
    background
    principles
    interface
    optimizations
    discussion

A. Background

--What's an RPC? What's an IPC?
    --IPC: message from thread (or process) A to thread (or process) B
    --RPC: a round trip of IPCs (there and back)
--What's the minimum needed to do an IPC?
    --See Table 3, back page: 127 cycles
    --The big cost: our friends "int" and "iret"
    --Why are these expensive?
        --pipeline flushed
        --registers dumped on stack
    --TLB misses in the wake of context switches
    --Why are 5 TLB misses needed?
        (i) B's thread control block
        (ii) loading %cr3 flushes the TLB, so kernel text causes a miss
             on the next kernel instruction after "lcr3"
        (iii,iv) iret accesses both the *kernel* stack and *user* text
             -- two pages
        (v) B's user code looks at the message
--How do you think this trend has progressed since the paper?
    --ANSWER:
        1. Worse now. Faster processors are optimized for straight-line
           code
        2. Traps/exceptions flush a deeper pipeline, and cache misses
           cost more cycles
--Actual IPC time of optimized L3: 5 usec
--Is that expensive? Compared to what?
    --accessing a disk?
        (milliseconds to access a disk, so no problem)
    --network interrupts when packets arrive? well, what if you wanted
      to handle 50,000 packets/second? two IPCs/packet = 100,000
      IPCs/second. if the processor can only do 200,000 IPCs/second,
      then IPCs would take up 100,000/200,000 = 50% of the CPU

B. Principles

-- *IPC performance is the master*
-- Plus a bunch of other things that emphasize IPC performance:
    --All design decisions require a *performance discussion*
    --If something performs poorly, look for new techniques
    --*Synergistic effects* have to be taken into consideration
      [What does this mean? That a lot of little things might add up to
      a big gain, or a big loss if two changes interact poorly. Need to
      test each combination of features?!]
    --The design has to *cover all levels* from architecture down to
      coding
    --The design has to be made on a *concrete basis*
-- Up until this point, a bunch of principles that argue that you
   should do endless IPC optimization!
    --How do we know when to stop?
    --How do we know when we can't optimize further?
    --Answer: One of the nicer principles in L3: "The design has to aim
      at a concrete performance goal."
        -- Without this, you'd get lost optimizing things that don't
           matter
        -- Take the minimum IPC time (172 cycles), multiply by 2 --
           350 cycles = 7 usec (at 50 MHz)
        -- set *T* = 5 usec
        -- The minimum null RPC is already at 69% of T!
        -- System calls + address space switches = 60% of T
        -- L3 achieves 250 cycles = 5 usec
-- Basic approach: Design the microkernel for a specific CPU

C. Interface

old:

    send (threadID, send-message, timeout);    /* nonblocking */
    receive (receive-message, timeout);        /* nonblocking */

    if A sends to B:

    A: send();
       receive();

    B: while (1) {
           select();
           receive(&requestbuf);
           replybuf = process();
           send(replybuf);
       }

new:

    call (threadID, send-message, receive-message, timeout);
    reply_and_receive_next (reply-message, receive-message, timeout);

now:

    A: call(threadID, send-buf, receive-buf, timeout);

    B: receive(&requestbuf);
       do {
           replybuf = process(requestbuf);
           reply_and_receive_next(replybuf, &requestbuf, timeout);
       } while (1);

D. Optimizations

(i) new system call: 2 system calls per RPC, instead of 4.

(ii) complex messages: send one message instead of a bunch

(iii) direct transfer with memory mapping

    what's going on here?

    naive solution: two copies: A --> kernel --> B

    okay, so why not share user-level pages between A and B, and have
    the sender copy into a shared buffer? well, then the receiver might
    need write access to signal when it's done processing.
    problem:
        --security issue: information can flow back from B to A
    other problems with shared buffers in this context:
        --receiver checks message legality, then the message changes
          (if the receiver copies the message first, then we're back
          where we started)
        --with many clients, a server could run out of VA space
        --somehow need to coordinate first
        --not app-friendly. why? [have to copy data anyway; can't
          generate the data directly into the buffer, etc.]

    Liedtke's approach: one copy: A --> remapped B.
        --The kernel does the copy inside A
        --How to do this maximally cheaply?
        --ANSWER: Copy two PDEs (8 MB) from B's address space into the
          kernel range of A's pgdir. Then execute the copy in A's
          kernel space.
        --Literally copy the entries?
            --No! copy the entry *except* that the PTE_U bit needs to
              be cleared, because only the kernel should be using this
              window in A
        --Why two PDEs? The maximum message size is 4 MB, so the copy
          is guaranteed to work regardless of how B aligned the
          message buffer
        --Why not just copy PTEs?
            --Would be much more expensive

    --What does it mean for the TLB to be "window clean"? Why do we
      care? Why can't we just invalidate the mappings?
        --Means the TLB contains no mappings within the communication
          window
        --We care because mapping is cheap (copy a PDE), but
          invalidation is not: the x86 only lets you invalidate one
          page at a time, or the whole TLB
        --Why isn't it enough to invalidate the two pages?
            --trick question. it's not two pages. it's two PDEs
              --> 8 MB.
        --We need to invalidate because the same kernel virtual
          address range may refer to multiple physical pages (the
          kernel's window is in the same virtual place in every
          process)
        --Does TLB invalidation of the communication window turn out
          to be a problem? Not usually, because we have to load %cr3
          during IPC anyway (unless the address space doesn't change)

(iv) Thread control block (TCB)

    the tcb contains basic info about a thread
        --registers, links for various doubly-linked lists, pgdir,
          uid, ...
        --commonly accessed fields packed together on the same cache
          line

    [Draw picture of array, with kernel stack inside TCB]

    The kernel stack is on the same page as the tcb. Why?

    a. Minimizes TLB misses (since accessing the kernel stack will
       bring in the tcb)
        --consider the alternative
        --NOTE: in Table 3, switching stacks doesn't cause a TLB miss.
          the reason is that B's TCB was accessed earlier in Table 3.
    b. Very efficient access to the current TCB -- just mask off the
       lower 12 bits of %esp

    Another nice thing: can access *any* TCB efficiently, given the
    thread id. why?
        --the actual thread number sits inside the 32-bit thread id in
          a very particular place -- shifted left by b bits, where
          tcb size = 2^b:

              [ ...... | thr_num | <-- b bits --> ]

          so masking the id yields a ready-made byte offset into the
          TCB array
        --doing it this way replaces an {"and", "multiply", "add"}
          with an {"and", "add"}!
        --Note that the thread ID here is like the JOS env ID (has a
          number that serves as an index, a generation, etc.)

(v) Lazy scheduling

    conventional approach to scheduling:

        A sends a message to B:
            Move A from the ready queue to a waiting queue
            Move B from the waiting queue to the ready queue

        This requires 58 cycles, including 4 TLB misses.
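Stepping back to (iii): the PDE-copy trick for the communication window can be sketched in C. This is only a sketch under assumptions -- JOS-style 32-bit page directories, a hypothetical `map_comm_window` helper, and `PDX`/`PTE_U` as in JOS; L3's actual code differs.

```c
#include <stdint.h>

#define PTE_U  0x004u                 /* user-accessible bit, JOS-style */
#define PDX(va) (((uint32_t)(va)) >> 22)   /* page-directory index of va */

/* Map an 8 MB communication window: copy the two PDEs covering B's
 * message buffer into a fixed kernel-only slot of A's page directory.
 * Two PDEs guarantee room for a 4 MB message at any alignment. */
void map_comm_window(uint32_t *a_pgdir, const uint32_t *b_pgdir,
                     uint32_t b_buf_va, uint32_t window_va)
{
    for (int i = 0; i < 2; i++) {
        uint32_t pde = b_pgdir[PDX(b_buf_va) + i];
        /* clear PTE_U: only the kernel may use the window in A */
        a_pgdir[PDX(window_va) + i] = pde & ~PTE_U;
    }
    /* caller must ensure the TLB is "window clean" before relying on
     * these mappings (usually free: %cr3 is reloaded during IPC) */
}
```

Note the single invariant that makes this safe: user code in A can never see the window, because the copied entries lack PTE_U.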
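Stepping back to (iv): the two TCB-addressing tricks (mask %esp for the current TCB; and+add for any TCB) can be sketched in C. The constants here are assumptions for illustration -- a 4 KB TCB-plus-kernel-stack page and a hypothetical bit layout for the thread id -- not L3's exact layout.

```c
#include <stdint.h>

#define TCB_BITS  12u                     /* assumed: tcb size = 2^12 = 4 KB */
#define TCB_SIZE  (1u << TCB_BITS)
#define TCB_BASE  0xC0000000u             /* hypothetical TCB array base */

/* (b) Current TCB: kernel stack and TCB share a page, so masking off
 * the low 12 bits of the stack pointer lands on the TCB. */
static inline uint32_t tcb_of_esp(uint32_t esp)
{
    return esp & ~(TCB_SIZE - 1);
}

/* Any TCB from a thread id: the thread number is stored pre-shifted by
 * TCB_BITS inside the id (assumed field: bits 12..25), so an "and"
 * plus an "add" suffice -- no multiply. */
#define TID_INDEX_MASK 0x03FFF000u
static inline uint32_t tcb_of_tid(uint32_t tid)
{
    return TCB_BASE + (tid & TID_INDEX_MASK);
}
```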
        What are the TLB misses? The queues use doubly linked lists
        [go over best implementation]. The most efficient approach
        would be to insert A at B's old position in the list, so the
        previous and next elements in each list must be touched.

    lazy scheduling:

        Insight: After A blocks, *don't take it off the ready queue
        yet!* It will probably get right back on very quickly.

        The ready queue must contain all ready threads, EXCEPT
        POSSIBLY THE CURRENT ONE
            Might contain other threads that aren't actually ready,
            though
        Each wakeup queue contains AT LEAST all threads waiting in
        that queue
            Again, it might contain other threads, too

        The scheduler removes inappropriate queue entries when
        scanning a queue

        Why does this help performance?
            There are only three situations in which a thread gives up
            the CPU but stays ready: the "send" syscall (as opposed to
            call), preemption, and hardware interrupts
            [these are the only cases when the thread needs to be put
            on the ready list]
            So very often we can IPC into a thread without putting it
            on the ready list
            The "ipc : lazy queue update" ratio can reach 50:1 with
            high ipc rates

(vi) Segment register optimization

    --Loading segment registers is slow -- have to access the GDT,
      etc.
    --But the common case is that users don't change their segment
      registers
    --Observation: It's faster to check a segment register than to
      load it
        So just check that the segment registers are okay
        Only load them if user code changed them

(vii) Various other tricks

    --Short messages passed through registers
    --Minimize TLB misses by putting things on the same page
    --Put commonly used data on the same cache lines
    --Other coding tricks: short offsets, avoid jumps, etc.

E. Discussion

--Great performance numbers! Much better than other microkernels
  (Fig 7, 8)
--Too bad microbenchmark performance might not matter
--Too bad, too, that hardware evolution has made ipc inherently more
  expensive
--What do you think of the theme of the paper?
    --Liedtke was fighting a losing battle against CPU makers:
      hardware evolution keeps making IPC inherently more expensive
    [But a very nice series of design decisions (or hacks).]
--Is fast IPC something that computer architects should design
  hardware to take into account?

------------------------------------------------------------------------

Admin notes

--review session was Monday
--notes from the review session will be posted
--remember to check announcements every 24 hours (or subscribe via
  RSS)
--am having office hours on Wed and will do further review then
--ground rules for exam
    --75 minute exam
    --at 70 minutes, you have to stay seated; do not get up and
      distract your classmates.
    --you must hand your exam to me (we are not going to collect
      them). the purpose of this is so everyone gets the same amount
      of time.
    --at 78 minutes, I will walk out of the room and won't accept any
      exams when I leave
    --thus you must hand in your exam at time x minutes, where:
      x <= 70 OR 75 <= x < 78
--bring ONE two-sided sheet of notes; formatting requirements are
  listed on the Web page
--bring your ID

------------------------------------------------------------------------

3. Therac-25

A. Mechanics
B. What went wrong?
C. What could/should they have done?

A. Mechanics

[draw picture of this thing]

dual-mode machine (actually, triple mode, given the disasters)

    intended settings:

                            beam       beam      beam modifier
                            energy     current   (given by TT position)
    ----------------------------------------------------------------
    for electron therapy    5-25 MeV   low       magnets
    for X-ray therapy       25 MeV     high      flattener
      (photon mode)                    (100x)
    for field light mode    0          0         none

    (b/c of the flattener, more current is needed in X-ray mode)

What can go wrong?

    (a) if the beam has high current, but the turntable has the
        magnets in place, not the flattener, it is a disaster: the
        patient gets hit with a high-current electron beam

    (b) another way to kill a patient is to turn the beam on with the
        turntable in the field-light position

So what's going on?
    (Multiple modes, and mixing them up is very, very bad)

B. What actually went wrong?

    --two software problems
    --a bunch of non-technical problems

    (i) software problem #1:

    [this is our best guess; it's actually hard to know for sure,
    given the way that the paper is written.]

    --three threads:
        --keyboard
        --turntable
        --general parameter setting
    --see handout for the pseudocode
    --now, if the operator sets a consistent set of parameters for x
      (X-ray (photon) mode), realizes that the doctor ordered
      something different, and then edits very quickly to e (electron)
      mode, then what happens?
        --if the re-editing takes less than 8 seconds, the general
          parameter setting thread never sees that the editing
          happened, because it's busy doing something else. when it
          returns, it misses the setup signal
        --now the turntable is in the 'e' position (magnets)
        --but the beam is a high-intensity beam, because 'Treat' never
          saw the request to go to electron mode
        --each thread, and the operator, thinks everything is okay
        --operator presses BEAM ON --> patient mortally injured
    --so why doesn't the computer check the set-up for consistency
      before turning on the beam? [all it does is check that there's
      no more input processing.]
      alternatives:
        --double-check with the operator
        --end-to-end consistency check in software
        --hardware interlocks
      [probably want all of the above]

    (ii) software problem #2:

    how it's supposed to work:
        --operator sets up parameters on the screen
        --operator moves the turntable to field-light mode and
          visually checks that the patient is properly positioned
        --operator hits "set" to store the parameters
        --at this point, the class3 "interlock" is supposed to tell
          the software to check and perhaps modify the turntable
          position
        --operator presses "beam on"

    how they implemented this:
        --see pseudocode on handout

    but it doesn't always work out that way. why?
        --because this boolean flag is implemented as a counter
        --(why implemented as a counter? the PDP-11 had an Increment
          Byte instruction that added 1 ("inc A"). this increment
          presumably took a bit less code space than materializing the
          constant 1 in an instruction like "A = 1".)
        --so what goes wrong? the counter is one byte, so every 256th
          increment it wraps around to 0, and on those passes the
          turntable check is skipped
        --operator presses "beam on", and a beam is delivered in the
          field-light position, with no scanning magnets or flattener
          --> patient injured or killed

    (iii) Lots of larger issues here too

    --***No end-to-end consistency checks***. What you actually want
      is:
        --right before turning the beam on, the software checks that
          the parameters line up
        --hardware that won't turn the beam on if the parameters are
          inconsistent
        --then double-check that by using a radiation "phantom"
    --too easy to say 'go'; errors reported by number; no
      documentation
    --false alarms (operators learn the following response: "it'll
      probably work the next time")
    --unnecessarily complex and poor code
    --weird software reuse: wrote their own OS ... but used code from
      a different machine
    --measuring devices that report _underdoses_ when they are
      ridiculously saturated
    --no real quality control, unit tests, etc.
    --no error documentation, no documentation on software design
    --no follow-through on the Therac-20's blown fuses
    --the company lied; didn't tell users about each other's failures
    --the company assumed software wasn't the problem

C. What could/should they have done?

    --Address the stuff above
    --You might be thinking, "So many things went wrong. There was no
      single cause of failure. Does that mean no single design change
      could have contributed to success?"
    --Answer: no! do end-to-end consistency checks! that single change
      would have prevented these errors!

D. What happened in the disasters reported by the NYT?

    --Hard to know for sure
    --Looks like: the software lost the treatment plan, and it
      defaulted to "all leaves open". The analog of the field-light
      position.

    What could/should have been done?
    --a good rule is: "software should have sensible defaults". it
      looks like this rule was violated here.
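Stepping back to software problem #2: the counter-rollover bug can be sketched in C. The names (`class3`, `set_up_test`, the commented-out collimator check) are hypothetical stand-ins -- the real code was PDP-11 assembly -- but the one-byte-wraparound behavior is the documented failure.

```c
#include <stdint.h>

/* Class3 was meant to be a boolean "check the turntable" flag, but it
 * was implemented as a one-byte counter, incremented ("inc A" on the
 * PDP-11) on every pass through the setup loop. */
static uint8_t class3 = 0;      /* hypothetical name for the flag */
static int checks_skipped = 0;  /* instrumentation for this sketch */

static void set_up_test(void)
{
    class3++;                   /* every 256th pass this wraps to 0 ... */
    if (class3 != 0) {
        /* check_collimator_position();  -- turntable verified */
    } else {
        checks_skipped++;       /* ... and the safety check is skipped */
    }
}
```

If the operator happens to press "set" on a pass where the counter has just wrapped to zero, the turntable position is never verified. The fix was correspondingly small: store a constant into the flag instead of incrementing it.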
    --in a system like this, there should be hardware interlocks (for
      example: no turning on the beam unless the leaves are closed)

E. Amateur ethics/philosophy

    (i) Philosophical/ethical question: you have a 999/1000 chance of
        being cured by this machine. 1/1000 times it will cause you to
        die a gruesome death. do you pick it? most people would.
        --> then, what *should* the FDA do?

    (ii) should people have to be licensed to write software?
         (food for thought)