Class 20
CS 372H 05 April 2012

On the board
------------

1. Last time

2. Scheduling, continued
        --disciplines, continued
        --lessons and conclusions

3. I/O

4. Livelock paper
        --context and problem
        --solution
        --reflection

---------------------------------------------------------------------------

1. Last time

--LFS: a bit about crash recovery and segment cleaning

--clarify the metric of "write cost":

    --they evaluate their approaches to segment cleaning in terms of a
      metric that they call "write cost". the goal is to choose which
      segments to clean such that the write cost stays low.

    --write cost is the cost to write a new byte of data, including the
      overhead of cleaning.

    --it is expressed as a multiple of the cost to write if there were no
      overhead and data could be written at the disk's full transfer
      bandwidth
        --a write cost of 1.0 would be perfect
        --a write cost of 10.0 means that only 1/10th of the disk's
          bandwidth goes to writing new data; the rest goes to cleaning,
          seek latency, or rotational latency. [they then ignore seek and
          rotational latency; presumably those are mentioned because they
          also tell us how _other_ systems perform on this metric]

    --write cost = (total # of bytes moved to and from disk) /
                   (# of those bytes that represent new data)

    --in steady state, if they read in N segments with average
      utilization u, then they read N segments' worth of data and write
      back N*u of live data; this frees up space for N*(1-u) of new data.
      so:

        total bytes moved:             N + N*u + N*(1-u) = 2N
        bytes that represent new data: N*(1-u)

        write cost = (N + N*u + N*(1-u)) / (N*(1-u)) = 2/(1-u)

    --this metric ignores the cost of normal reads (it counts only the
      reads that happen as part of segment cleaning).

    --in simulation, the space-time (cost-benefit) policy [choose the
      segment that maximizes (1-u')*age/(1+u')] performs the best

    --so that's what they implement in the real system. they don't
      implement multiple policies in the real system, so we can't know
      whether their approach is "best" on their workload.

--correct SRTCF, example 2

2. Scheduling disciplines

A. [last time] FCFS/FIFO

B. [last time] Round robin

C. [last time, in part] SJF (shortest job first)

    --STCF: shortest time to completion first
        --schedule the job whose next CPU burst is the shortest

    --SRTCF: shortest remaining time to completion first
        --preemptive version of STCF: if a job arrives whose time to
          completion is shorter than the remaining time of the current
          job, immediately preempt the CPU and give it to the new job

    --idea:
        --get short jobs out of the system
        --big (positive) effect on short jobs, small (negative) effect on
          large jobs
        --result: minimizes waiting time (can prove this)
            --that is, it minimizes the average waiting time for a given
              set of processes

    --example 1: [see notes from last time]

    --example 2: 3 jobs

        A, B: both CPU bound, each runs for a week
        C: I/O bound, loop: 1 ms of CPU, then 10 ms of disk I/O

        by itself, C uses 90% of the disk
        by itself, A or B uses 100% of the CPU

        what happens if we use FIFO?
            --if A or B gets in, it keeps the CPU; C (and the disk) can
              sit nearly idle for two weeks

        what about RR with a 100 ms time slice?
            --only get ~5% disk utilization (C does one 10 ms disk I/O
              per ~210 ms round)

        what about RR with a 1 ms time slice?
            --get nearly 90% disk utilization
            --but lots of preemptions

        with SRTCF:
            --no needless preemptions
            --get high disk utilization
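    --to make the policy concrete, here is a minimal sketch of SRTCF's
      selection rule in C (not from any real kernel; the struct and field
      names are invented, and a real system would have to *estimate* the
      remaining time, e.g., with the EWMA discussed below):

        struct job {
            int         runnable;        /* 1 if ready to run */
            long        est_remaining;   /* estimated remaining CPU demand (usec) */
            struct job *next;
        };

        /* return the runnable job with the smallest estimated remaining time;
         * call this on every scheduling decision, including when a new job
         * arrives (that re-check is what makes the policy preemptive) */
        struct job *
        srtcf_pick(struct job *all)
        {
            struct job *best = 0;
            for (struct job *p = all; p != 0; p = p->next)
                if (p->runnable &&
                    (best == 0 || p->est_remaining < best->est_remaining))
                    best = p;
            return best;                 /* 0 if nothing is runnable */
        }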
    --SRTCF advantages:
        --optimal response time (minimum waiting time)
        --low overhead

    --disadvantages:
        --it's not hard to get unfairness or starvation (long-running
          jobs may starve)
        --does not optimize turnaround time (only waiting time)
        --** requires predicting the future **

            so it's useful as a yardstick for measuring other policies
            (a good way to do CS design and development: figure out what
            the absolute best answer is, then figure out how to
            approximate it)

            however, we can attempt to estimate the future based on the
            past (another thing that people do when designing systems):

            --an exponentially weighted average is a good idea here:
                --t_n: length of the process's nth CPU burst
                --\tau_{n+1}: estimate for the (n+1)th burst
                --choose \alpha, 0 < \alpha <= 1
                --set \tau_{n+1} = \alpha * t_n + (1-\alpha) * \tau_n
                --this is called an exponentially weighted moving average
                  (EWMA)
                --it reacts to changes, but smoothly

            upshot: favor jobs that have been using the CPU the least;
            that ought to approximate SRTCF

D. Priority schemes

    --priority scheme: give every process a number (set by the
      administrator), and give the CPU to the process with the highest
      priority (which might be the lowest or the highest number,
      depending on the scheme)

    --can be done preemptively or non-preemptively

    --generally a bad idea because of starvation of low-priority tasks

    --here's an extreme example:
        --say H runs at high priority, L at low priority
        --H tries to acquire a lock (held by L), fails, and spins
        --L never runs, so H spins forever

    --but note: SJF is priority scheduling where the priority is the
      predicted length of the next CPU burst

    --a solution to this starvation is to increase a process's priority
      as it waits

E. Multilevel feedback queues

    [first used in CTSS; also used in FreeBSD. Linux up until 2.6.23 did
    something roughly similar.]

    two ideas:

    --*multiple queues, each with a different priority*. the OS does RR
      at each non-empty priority level before moving on to the next
      priority level. 32 levels, for example.

    --feedback: a process's priority changes, for example based on how
      little or how much CPU it has been using

    the result is to favor interactive jobs that use less CPU, but
    without starving all of the other jobs

    a process's priority might be set like this:
        --decreases whenever a timer interrupt finds the process running
        --increases while the process is runnable but not running

    advantages:
        --approximates SRTCF

    disadvantages:
        --gameable: a user can put in meaningless I/O to keep a job's
          priority high
        --can't donate priority
        --not very flexible
        --not good for real-time and multimedia

F. Real time

    --examples: cars, video playback, robots on assembly lines

    --soft real time: miss a deadline and the CD will sound funny

    --hard real time: miss a deadline and the plane will crash

    --many strategies; there is a long literature here. basically, as
      long as \sum (CPU_needed/period) <= 1, earliest-deadline-first
      (EDF) works.
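    --to make that admission test concrete, here is a toy version in C
      (the task representation is invented; real real-time systems also
      account for overheads, jitter, and deadlines shorter than periods):

        struct rt_task {
            double cpu_needed;   /* worst-case CPU time needed per period (ms) */
            double period;       /* period, treated here as the deadline (ms) */
        };

        /* returns 1 if earliest-deadline-first can meet every deadline,
         * i.e., if the total utilization sum(CPU_needed_i / period_i) <= 1 */
        int
        edf_feasible(const struct rt_task *tasks, int n)
        {
            double u = 0.0;
            for (int i = 0; i < n; i++)
                u += tasks[i].cpu_needed / tasks[i].period;
            return u <= 1.0;
        }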
G. What Linux does [skip in class; it's in the text]

    --before Linux 2.4:
        --O(n) operations for O(n) processes
        --no processor affinity (that is, no notion of which CPU a
          process wants to, or usually will, run on), which is bad for
          caches on multiprocessor systems
        --global run-queue lock (a coarse-grained lock)

    --Linux 2.4 to 2.6.23

        goal: O(1) for all operations [see section 10.3.4 in the text]

        approach:

        --140 priority levels

            0-99: for "real-time" tasks
                --some are FIFO (not preemptible)
                --some are round-robin (preemptible by the clock or by a
                  higher-priority process)
                [none is true "real time" because there are no guarantees
                or deadlines]

            100-139: for "timesharing" tasks
                --for user tasks (level depends on nice value and
                  behavior)
                --the kernel keeps a per-process 4-entry "load
                  estimator": how much CPU the process consumed in each
                  of the last 4 seconds
                --it adjusts priority by +/- 5 based on behavior

        --each CPU has a run queue
            --each run queue is implemented as two arrays:
                    active array
                    expired array
              each array has 140 entries, one per priority level; each
              entry is a list of tasks
            --avoids a global lock and helps with affinity
            --a separate load balancer can move tasks between CPUs

        --scheduling algorithm: run the highest-priority task in the
          active array; after a task uses its quantum, move it to the
          expired array; swap the expired/active pointers when the active
          array is empty
            --a bitmap caches the empty/non-empty state of each list

    --post Linux 2.6.23: a rough reinvention of ideas from stride
      scheduling (below)
        --motivation: the above approach is just a fancy multilevel
          feedback queue with an efficient implementation, so it inherits
          the disadvantages of multilevel feedback queues

H. Lottery and stride scheduling

    [citation: C. A. Waldspurger and W. E. Weihl. Lottery Scheduling:
    Flexible Proportional-Share Resource Management. Proc. USENIX
    Symposium on Operating Systems Design and Implementation, November
    1994.
    http://www.usenix.org/publications/library/proceedings/osdi/full_papers/waldspurger.pdf]

    --lottery scheduling: issue lottery tickets to processes
        --let p_i have t_i tickets
        --let T be the total number of tickets, T = \sum t_i
        --p_i's chance of winning the next quantum is t_i / T
        --note that lottery tickets are not used up; they are more like
          season tickets

    --this controls the long-term average proportion of the CPU that each
      process receives

    --can also group processes hierarchically for control
        --subdivide lottery tickets
        --can model tickets as a currency, so there can be an exchange
          rate between currencies (in principle even between real money
          and lottery tickets)

    --lots of nice features
        --deals with starvation (hold one ticket --> will make progress)
        --don't have to worry that adding one high-priority job will
          starve all the others
        --adding/deleting jobs affects all jobs proportionally (T gets
          bigger or smaller)
        --can transfer tickets between processes: highly useful if a
          client is waiting for a server; the client can donate its
          tickets to the server so the server can run.
            --note the difference between donating tickets and donating
              priority: with donated tickets, the recipient amasses more
              and more until it runs; with donated priority, there is no
              difference between one process donating and 1000 processes
              donating

    --many other details
        --compensation tickets for processes that don't use their whole
          quantum: a process that uses only a fraction f of its quantum
          has its tickets inflated by 1/f until it next gets the CPU

    --disadvantages
        --latency is unpredictable
        --expected error is somewhat high
            --for those comfortable with probability: the number of
              quanta a process wins is binomially distributed, with
              variance n*p*(1-p), so the standard deviation is
              proportional to \sqrt(n), where p is the fraction of
              tickets the process owns and n is the number of quanta
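    --for concreteness, a minimal sketch of the lottery draw in C (names
      are made up; a real kernel would use a better random-number source
      and track T incrementally):

        #include <stdlib.h>

        struct proc {
            int          tickets;
            struct proc *next;
        };

        /* pick the next process to run: draw a winning ticket in [0, T)
         * and walk the list until the running ticket count passes it */
        struct proc *
        lottery_pick(struct proc *all, int total_tickets)
        {
            int winner = rand() % total_tickets;
            for (struct proc *p = all; p != 0; p = p->next) {
                winner -= p->tickets;
                if (winner < 0)
                    return p;    /* this process holds the winning ticket */
            }
            return 0;            /* unreachable if total_tickets = sum of tickets */
        }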
    --in reaction to these disadvantages, Waldspurger and Weihl proposed
      *Stride Scheduling*

    [citations:

    C. A. Waldspurger and W. E. Weihl. Stride Scheduling: Deterministic
    Proportional-Share Resource Management. Technical Memorandum
    MIT/LCS/TM-528, MIT Laboratory for Computer Science, June 1995.
    http://www.psg.lcs.mit.edu/papers/stride-tm528.ps

    Carl A. Waldspurger. Lottery and Stride Scheduling: Flexible
    Proportional-Share Resource Management. Ph.D. dissertation,
    Massachusetts Institute of Technology, September 1995. Also appears
    as Technical Report MIT/LCS/TR-667.
    http://waldspurger.org/carl/papers/phd-mit-tr667.pdf]

    --the current Linux scheduler (post 2.6.23), called CFS ("Completely
      Fair Scheduler"), roughly reinvented these ideas

    --stride scheduling is basically a deterministic version of lottery
      scheduling: less randomness --> less expected error
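    --the core mechanism is easy to sketch in C (names invented; this
      ignores dynamic ticket changes, sleeping processes, and the other
      details in the paper):

        #define STRIDE1 (1 << 20)        /* large constant; stride = STRIDE1 / tickets */

        struct sproc {
            int           tickets;
            unsigned long stride;        /* inversely proportional to tickets */
            unsigned long pass;          /* advances by stride each time the proc runs */
            struct sproc *next;
        };

        /* always run the process with the smallest pass value, then charge it
         * one stride; over time, CPU received is proportional to tickets, with
         * none of the lottery's randomness */
        struct sproc *
        stride_pick(struct sproc *all)
        {
            struct sproc *best = 0;
            for (struct sproc *p = all; p != 0; p = p->next)
                if (best == 0 || p->pass < best->pass)
                    best = p;
            if (best != 0)
                best->pass += best->stride;
            return best;
        }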
3. Scheduling lessons and conclusions

--Scheduling comes up all over the place: m requests share n resources
    --disk arm: which read/write request to do next?
    --memory: which process to take a physical page from?

--This topic was popular in the days of time sharing, when there was a
  shortage of resources all around, but many scheduling problems become
  uninteresting when you can just buy a faster CPU or a faster network.

    --Exception 1: web sites and large-scale networks often cannot be
      made fast enough to handle peak demand (flash crowds, attacks), so
      scheduling becomes important again. For example, we may want to
      prioritize paying customers, or address denial-of-service attacks.

    --Exception 2: some scheduling decisions have non-linear effects on
      overall system behavior, not just different performance for
      different users. For example, the livelock scenario, which we
      discuss below.

    --Exception 3: real-time systems:
        soft real time: miss a deadline and the CD or MPEG decode will skip
        hard real time: miss a deadline and the plane will crash

      Plus, at some level, every system with a human at the other end is
      a real-time system. If a Web server delays too long, the user gives
      up. So the ultimate effect of the system may in fact depend on
      scheduling!

--In principle, scheduling decisions shouldn't affect a program's results
    --This is good, because it's rare to be able to calculate the best
      schedule
    --So instead, we build the kernel so that it's correct to do a
      context switch and restore at any time, and then *any* schedule
      will get the right answer for the program
    --This is a case of a concept that comes up a fair bit in computer
      systems: the policy/mechanism split. In this case, the *mechanism*
      allows the OS to switch at any time, while the *policy* determines
      when to switch in order to meet whatever goals the scheduling
      designer has in mind

    [[--In my view, the notion of "policy/mechanism split" is way
       overused in computer systems, for two reasons:
        --when someone says they separated policy from mechanism in some
          system, usually what's going on is that they separated the hard
          problem from the easy problem and solved the easy problem; or
        --it's simply not the case that the two are separate. *every*
          mechanism encodes a range of possible policies, and by choice
          of mechanism you are usually constraining what policies are
          possible. That point is obvious but tends to be overlooked when
          people advertise that they've "fully separated policy from
          mechanism".]]

--But there are cases when the schedule *can* affect correctness
    --multimedia: delay too long, and the result looks or sounds wrong
    --Web server: delay too long, and users give up

--Three lessons (besides the policy/mechanism split):

    (i) Know your goals; write them down.

    (ii) Compare against optimal, even if optimal can't be built.
        --It's a useful benchmark. Don't waste your time improving
          something if it's already at 99% of optimal.
        --It provides helpful insight. (For example, we know from the
          fact that SJF is optimal that it's impossible to be optimal
          and fair, so don't spend time looking for an optimal algorithm
          that is also fair.)

    (iii) There are actually many different schedulers in the system, and
    they interact:
        --mutexes, etc. are implicitly making scheduling decisions
        --interrupts: likewise (by invoking handlers)
        --disk: the disk scheduler doesn't know to favor one process's
          I/O above another's
        --network: same thing: how does the network code know which
          process's packets to favor? (it doesn't)

        --example of multiple interacting schedulers: you can optimize
          the CPU's scheduler and still find it does nothing (e.g., if
          you're getting interrupted 200,000 times per second, only the
          interrupt handler is going to get the CPU, so you need to solve
          that problem before you worry about how the main CPU scheduler
          allocates the CPU to jobs)

        --Basically, the _existence_ of interrupts is bad for scheduling.

4. I/O

    * architecture
    * communicating with devices
    * device drivers

A. Architecture

    [draw logical picture of CPU/memory/crossbar]

    --the CPU accesses physical memory over a bus

    --devices access memory over the I/O bus

    --devices can appear to be a region of memory
        --recall the 640KB-1MB region, from early classes
        --and the hole in memory for PCI

    [draw PC architecture picture]

    [draw picture of the I/O bus]

B. Communicating with a device

    how do the host and the device communicate? answer:
        --memory-mapped device registers
        --device memory
        --special I/O instructions
        --DMA

    # --skip in class-- (have covered before)
    #
    # (a) Memory-mapped device registers
    #
    #     --Certain _physical_ addresses correspond to device registers
    #
    #     --Load/store gets status/sends instructions -- not real memory
    #
    # (b) Device memory -- the device may have memory that the OS can
    #     write to directly, on the other side of the I/O bus
    #
    # (c) Special I/O instructions
    #
    #     --Some CPUs (e.g., x86) have special I/O instructions
    #
    #     --Like load & store, but asserts a special I/O pin on the CPU
    #
    #     --The OS can allow user-mode access to I/O ports with finer
    #       granularity than a page

    (d) DMA -- place instructions to the card in main memory
        --Typically then need to "poke" the card by writing to a register
        --Overlaps unrelated computation with moving data over the
          (typically slower-than-memory) I/O bus

        how it works (roughly):

            [buffer descriptor list] --> [ buf ] --> [ buf ] ....

        the card knows where to find the descriptor list; it can then
        access the buffers with DMA

        (i) example: network interface card

                                |  I/O bus
                  -----------------------------------
                  [ bus interface | link interface ] --> network link

            --the link interface talks to the wire/fiber/antenna
                --typically does framing and the link-layer CRC
            --FIFO queues on the card provide a small amount of buffering
            --the bus interface logic uses DMA to move packets to and
              from buffers in main memory

        (ii) example: IDE disk read with DMA

            [draw picture]

C. Device drivers

    * entry points
    * synchronization
        --polling
        --interrupts

    --a device driver provides several entry points to the kernel
        --example: Reset, ioctl, output, read, write, *interrupt*

    --when you write a driver, you are implementing this interface, and
      also calling functions that the kernel itself exposes

    --purpose of a driver: abstract nasty hardware so that the kernel
      doesn't have to understand all of the details. the kernel just
      knows that it has a device that exposes calls like "read" and
      "write", and that the device can interrupt the kernel.
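    --concretely, the "entry points" are often just a table of function
      pointers that the kernel calls. a hypothetical sketch in C (real
      interfaces, e.g., Linux's file_operations or BSD's cdevsw, have
      more fields):

        struct dev_ops {
            void (*reset)(void *dev);
            int  (*read)(void *dev, void *buf, int len);
            int  (*write)(void *dev, const void *buf, int len);
            int  (*ioctl)(void *dev, int cmd, void *arg);
            void (*interrupt)(void *dev);   /* called from the interrupt path */
        };

        /* the kernel stays device-agnostic: it just calls through the table */
        int
        kernel_dev_read(struct dev_ops *ops, void *dev, void *buf, int len)
        {
            return ops->read(dev, buf, len);
        }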
    --How should the driver synchronize with the device?

        examples:
            --need to know when transmit buffers are free or packets
              arrive
            --need to know when a disk request is complete

        [note: the device doesn't care a huge amount about which of the
        following two options is in effect: interrupts are an abstraction
        that sits between the device and the CPU. the question here is
        about the logic in the driver and the interrupt controller.]

    --Approach 1: **Polling**
        --Sent a packet? Loop, asking the card when the buffer is free.
        --Waiting to receive? Keep asking the card whether it has a
          packet.
        --Disk I/O? Keep looping until the disk-ready bit is set.
        --What are the disadvantages of polling?
            (Trade-off between wasting CPU cycles [can't do anything else
            while polling] and high latency [if the poll is scheduled for
            the future but, say, the packet or disk block has already
            arrived])

    --Approach 2: **Interrupt-driven**
        --ask the card to interrupt the CPU on events
        --the interrupt handler runs at high priority
        --it asks the card what happened (transmit buffer free, new
          packet arrived)
        --this is what most general-purpose OSes do. Nevertheless...

    --...it's important to understand the following; you'll probably run
      into this issue if you build systems that need to run at high
      speed:

        --interrupts are actually bad at high data arrival rates.
          classically this issue comes up with network cards:
            --packets can arrive faster than the OS can process them
            --interrupts are very expensive (context switch)
            --interrupt handlers have high priority
            --in the worst case, the system can spend 100% of its time in
              the interrupt handler and never make any progress. this
              phenomenon is known as *receive livelock*.

        --the best thing to do is: start with interrupts. if you need
          high performance and interrupts are slowing you down, use
          polling. if you then notice that polling is chewing up too many
          CPU cycles, move to adaptive switching between interrupts and
          polling.

        --interrupts are great for disk requests, though.

5. Livelock paper

    ASK: what's the authors' thesis?
        [draw analogy to the real world: coffee counter, email, etc.]

    ASK: what can go wrong?
        [draw picture of throughput collapse. draw picture of what we
        want to happen.]

    ASK: what is the relation to OS scheduling?
        (answer: none; that's the problem.)

A. Context and problem

    The problem

    --devices generate interrupts, which cause work for the processor

    --that work is high priority

    --later work associated with the packet-reception event is lower
      priority (the further into the system the packet gets, the lower
      the priority associated with processing it; this is backwards).

    --example: see figure 6-2
        - Receive processing appends packets to ipintrq, the IP input
          queue
        - IP forwarding layer (runs at IPL softnet (i.e., as a soft
          interrupt) or in a kernel thread, depending on the OS)
        - Device output queue
        - Transmit processing takes packets from the output queue

    --result: if too many packets are arriving, the later work never gets
      done

    --What does the kernel do on input overload? Queues fill up, and
      packets get dropped.

        When are packets dropped? When should they be dropped?

        They are usually dropped at ipintrq. They should be dropped
        earlier. This causes throughput to go down as offered load goes
        up!

    --Who cares? If the system is overloaded, aren't we in trouble
      anyway? Who cares how the system does if there's too much load?
        (answer: the system should continue to give good service to the
        requests that it can handle.)

    Why was this paper written when it was?
        (answer: I/O interrupt rates got ahead of their engineered zone.
        it became possible to interrupt the CPU more times per second
        with events that each cost more than the time between
        interrupts.)
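    --to make the figure 6-2 path concrete, here is a cartoon of the
      interrupt half in C (all names are invented and the helper routines
      are assumed; the point is that the drop check happens only after
      the receive work has already been done):

        struct pkt;                                /* a received packet */
        struct pkt *nic_next_packet(void);         /* next DMA'd packet, or 0 */
        int  ipintrq_full(void);
        void ipintrq_append(struct pkt *);
        void pkt_drop(struct pkt *);
        void schedule_softint(void);               /* kicks the lower-priority ip_input() */

        /* runs at high (interrupt) priority */
        void
        rx_interrupt(void)
        {
            struct pkt *p;
            while ((p = nic_next_packet()) != 0) {
                if (ipintrq_full())
                    pkt_drop(p);                   /* work already invested -- wasted */
                else
                    ipintrq_append(p);
            }
            schedule_softint();                    /* forwarding happens later, at lower priority */
        }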
    ASK: is the work still relevant?

        Yes:
            Network speeds are growing as fast as or faster than CPU
            speeds.
            Interrupts and device access are getting comparatively more
            expensive.

    More details on the problem

    Look at figure 6-1:
        --Why do the black dots (no screend) go up? (more packets to
          forward)
        --What determines how high the peak is? (when the CPU saturates)
        --Why do they go down? (wasting CPU time on discarded packets)
        --What happens when we add screend? (need more CPU time per
          packet)

    Do dedicated routers suffer the same problem?
        Usually not, because routing is usually on a fast path entirely
        in the line cards.
        But the Slammer worm sometimes caused problems:

            "because one of the routers in this network was starved for
            memory, it didn't run the Cisco Express Forwarding (CEF)
            algorithm... this made the problem worse because without CEF,
            the router must take time to create a "route cache" for each
            destination in order to forward packets at high speed. Since
            the worm generates random destination addresses, the router
            ended up spending most of its time creating these route cache
            entries, and it ran out of memory to boot."
            - http://www.onlamp.com/pub/a/onlamp/2003/01/28/msworm.html

    Is it possible to generate livelock just from disk interrupts?
        (Answer: no, because if the system is starved for CPU, it won't
        issue more disk requests.)
        Key point: flow control/back pressure avoids livelock-inducing
        overload.

    Why do receive interrupts have high priority? Is this good?
        Old network cards could not buffer many packets, and we didn't
        want to lose a burst.
        Today's cards can, so this is not really needed any more.

    Why not just tell the scheduler to give interrupts lower priority?
        Interrupts are non-preemptible.

    Why not completely process (i.e., forward) each packet in the
    interrupt handler?
        Other parts of the kernel don't expect to run at high interrupt
        level.
        E.g., some packet-processing code might call sleep (e.g., in the
        memory allocator).
        We still might want an output queue [to handle packet processing
        in batches].

    BIG PICTURE: we've surrendered all control over scheduling these
    various actions to the CPU's interrupt mechanism. And it doesn't give
    us the control we need to enforce our desired policy: output has
    precedence over input.

B. Solution

    What about using polling exclusively instead of interrupts?
        +: solves the overload problem / gets scheduling under control
        -: killer latency or wasted CPU
        (note that a strict real-time OS might use only polling, if
        device interrupts are not as important as some other task)

    What's the paper's solution?
        Do most packet processing in a "top half" kernel thread.
        When a receive interrupt arrives, poke the thread and disable the
        interrupt.
        The thread processes packets fully and puts them on the output
        queue.
        The thread checks for more work; if there is none, it re-enables
        interrupts.
        Eliminate the IP input queue (so packets are dropped before work
        is invested in them).

        [an alternate way to say it: to avoid livelock, they:
            --turn off interrupts at the right time
            --ensure that they're doing the "right" work (poll across
              queues in the proper priority)
            --make the higher-level processing non-preemptible
        but they also want good throughput and latency. for that, they:
            --re-enable interrupts
            --take advantage of buffering
            --eliminate pointless queues]

    Why does this work? What happens when packets arrive slowly? What
    happens when packets arrive too fast?

    Figure 6-3: why does "Polling (no quota)" work badly?
        (Answer: it's still doing too much receive work, starving the
        other work.)

    Figure 6-3: why does it do even worse than the unmodified kernel?
        (Answer: because now packets are dropped at the device output
        queue, so even more work is invested in each dropped packet!)
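    A rough rendering of the structure described above, in C (names are
    invented; the real system polls multiple sources round-robin and has
    more machinery):

        void rx_disable_intr(void);
        void rx_enable_intr(void);
        void wakeup_poll_thread(void);
        void sleep_until_woken(void);
        int  rx_queue_empty(void);                 /* 1 if the card has been drained */
        int  process_one_packet(void);             /* full processing, through the output queue */

        void
        rx_interrupt(void)                         /* now does almost nothing */
        {
            rx_disable_intr();                     /* no more receive interrupts for now */
            wakeup_poll_thread();
        }

        void
        poll_thread(void)
        {
            for (;;) {
                int quota = 5;                     /* per-round packet quota (a tunable) */
                while (quota-- > 0 && process_one_packet())
                    ;
                if (rx_queue_empty()) {            /* card drained: go back to interrupts */
                    rx_enable_intr();
                    sleep_until_woken();
                }
                /* otherwise keep polling; other sources get their turn between rounds */
            }
        }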
    Figure 6-4: why does "Polling, no feedback" behave badly?
        (Answer: there's a queue in front of screend. We can still give
        100% of the CPU to the input thread and 0% to screend.)

    Why does "Polling w/ feedback" behave well?
        The input thread yields when the queue to screend fills.

    [can skip] What if screend hangs? What about other consumers of
    packets? E.g., can you ssh to the machine to fix screend?
        Fortunately, screend is typically the only application.
        Also, they re-enable input after a timeout.

    BIG PICTURE: the polling loop gives us more control -- but it only
    knows about the device and IP packet processing, not about other
    activities on the host.

    Why won't this work for CPU-bound processes?
        (answer: there's no queue to give feedback; the process is CPU
        bound.)

    What is the solution?
        Every 10 msec, clear a counter.
        Track CPU usage in the receive thread; after it has used some
        fraction, disable receive processing.

    BIG PICTURE: this is a hack. We don't really want the network
    interface scheduling the CPU. That is backwards.

    Wait, do we really need quotas if we have queue feedback?
        (answer: no.)

    What if there is no CPU-bound process? Is that a waste of resources?
        No: re-enable receive interrupts in the idle thread.

    # [skip] Why aren't the numbers exactly what you would expect?
    # (25%, 50%, 75%, ...)
    #     Other background processes may use CPU
    #     Time to process interrupts is not charged against the quotas
    # [skip] Why is there a small dip in most of the curves?
    #     More interrupts happen at the lower packet rate, and they are
    #     not charged against the quota

C. Reflection

    --High-level points of this paper:

        Don't do scheduling reactively and in an ad-hoc fashion (which is
        the default). Be smart about when interrupts are taken and when
        devices are scheduled.

        Don't spend time on new work before completing existing work. Or:
        give new work lower priority than partially-completed work.

        Corollary: if you might discard work, do it as early as possible.

    --Ironically, their work has a number of hacks. Do we really need
      explicit scheduling of the CPU by the network interface? That seems
      backwards.
        (answer: their solution is perhaps not elegant, but it works for
        their experiments. Ideally, there would be a coherent scheduling
        policy that takes account of everything together. Not their
        fault, though; they are constrained by the system.)

    --If we apply their fixes, does the phenomenon totally go away? E.g.,
      for a web server, NFS server threads that wait for disk, etc.
        Can the network device throw away packets without slowing down
        the host?
        Problem: we want to drop packets for applications with big
        queues, but it requires work to determine which application a
        packet belongs to.
        Possible solution: have the network interface hardware sort
        packets.

    --Livelock can happen in the context of distributed systems as well.

    [Summary of the changes they made:
        --interrupts only to initiate polling
        --fair polling among event sources, using quotas
        --allow other important tasks to run, using feedback from full
          queues and CPU quotas
        --drop packets early]
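    To make the last two throttles concrete, a sketch in C of the checks
    the receive thread might do between packets (names and numbers are
    invented; this is not the paper's actual code):

        #define INTERVAL_USEC 10000        /* usage counter cleared every 10 msec */
        #define RECV_SHARE    0.5          /* fraction of the CPU the receive path may use */

        int  screend_queue_full(void);             /* queue feedback from downstream */
        long recv_cpu_used_this_interval(void);    /* usec of CPU used by the receive thread */

        /* the receive thread calls this between packets; stopping on either
         * condition is what lets screend and ordinary processes keep running */
        int
        keep_polling(void)
        {
            if (screend_queue_full())
                return 0;                          /* back-pressure from a full queue */
            if (recv_cpu_used_this_interval() > RECV_SHARE * INTERVAL_USEC)
                return 0;                          /* over the CPU quota for this interval */
            return 1;
        }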