Class 21
CS 372H 8 April 2010

On the board
------------

0. last time
    --journaling (purpose of log: crash-recovery)
    --LFS (purpose of log: mainly performance)

1. transactions
    --Atomicity (all-or-nothing atomicity)
    --Isolation (before-or-after atomicity)

2. RPC, client/server systems

3. NFS (mostly next time)

---------------------------------------------------------------------------

1. Transactions

    consider a system with: (log) + (complex on-disk structures)

    what kind of systems are we talking about? well, these ideas apply
    to file systems, which may have complex on-disk structures. but they
    apply far more widely, and in fact many of these ideas were
    developed in the context of databases. one confusing detail is that
    databases often request a raw block interface to the disk, thereby
    bypassing the file system. so one way to think about this mini-unit
    is that the "system" identified above could be a file system (often
    running inside the kernel) or else a database (often running in user
    space)

    want to build *transactions*:
    --a way of grouping a bunch of operations. but what's a transaction?
    --intuition:

        begin_transaction();
        deposit_account(acct 2, $30);
        withdraw_account(acct 1, $30);
        end_transaction();

      probably okay to do neither deposit nor withdrawal, definitely
      okay to do both. not okay to do one or the other.
    --most of you will run into this idea if you do database programming
    --but it's bigger than DBs
      --arguably, LFS is using transactions
      --could even imagine having the kernel export a transactional
        interface:

            sys_begin_transaction();
            syscall1();
            syscall2();
            sys_end_transaction();

        --there is research in this department that proposes exactly
          this
        --very nice for, say, software install: wrap the entire
          install() program in a transaction.
          so all-or-nothing

    --[aside: could imagine having the hardware export a transactional
      interface to the OS:

            begin_transaction();
            write_mem();
            write_mem();
            read_mem();
            write_mem();
            end_transaction();

      --this is called *transactional memory*. lots of research on this
        in the last 10 years.
      --cleaner way of handling concurrency than using locking
      --but nasty interactions with I/O devices, since you need locks
        for I/O devices (can't roll back once you've emitted output or
        taken input)]

    --okay, back to DBs:
      --basically, a bunch of tables implemented with complex on-disk
        structures
      --want to be able to make sensible modifications to those
        structures
      --can crash at any point
      --we're only going to scratch the surface of this material. a
        course on DBs will give you far more info.

    --classically, transactions have been defined to have four
      properties, ACID:

        A: atomicity
        C: consistency
        I: isolation
        D: durability

      --we'll focus on the "A" and the "I"
      --"C" is not hard to provide from our perspective: just don't do
        anything dumb inside the transaction (gross oversimplification)
      --"D" is also not too hard, if we're logging our changes
      --what the heck is the difference between "A" and "I"?
        --A: atomicity. think of it as "all-or-nothing atomicity"
        --I: isolation. think of it as "before-or-after atomicity"
        --A means: "if there's a crash, it looks to everyone after the
          crash as if the transaction either fully completed or didn't
          start."
        --I is a response to concurrency. it means: "the transactions
          should appear as if they executed in serial order."
      --briefly going to describe how to provide these two types of
        atomicity

  A. Atomicity ("all-or-nothing atomicity")

    assume no concurrency for now.....

    our challenge is that a crash can happen at any point, but we want
    the state to always look consistent

    the log is the authoritative copy. it helps get the on-disk
    structures to a consistent state after a crash.
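Before getting into logs, the all-or-nothing contract from the intuition above can be pinned down in a short runnable sketch. Everything here (the `Bank` class, `run_transaction`, the helper functions) is illustrative, not a real API; the point is only that the group of operations installs entirely or not at all.

```python
# Illustrative sketch of all-or-nothing grouping (not a real API).

class Bank:
    def __init__(self):
        self.balances = {1: 100, 2: 0}

    def run_transaction(self, ops):
        # Stage every operation against a scratch copy; install the copy
        # only if all of them succeed.
        scratch = dict(self.balances)
        try:
            for op in ops:
                op(scratch)
        except Exception:
            return False              # abort: no trace left behind
        self.balances = scratch       # commit: all changes at once
        return True

def deposit(acct, amount):
    return lambda b: b.update({acct: b[acct] + amount})

def withdraw(acct, amount):
    def op(b):
        if b[acct] < amount:
            raise ValueError("insufficient funds")
        b[acct] -= amount
    return op
```

With $100 in account 1, depositing $30 into account 2 and withdrawing $30 from account 1 both apply; if the withdrawal would overdraw the account, neither applies, matching "okay to do both, okay to do neither, not okay to do one."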
    log record types:

        BEGIN_TRANS(trans_id)
        END_TRANS(trans_id)
        CHANGE(trans_id, "redo action", "undo action")
        OUTCOME(trans_id, COMMIT|ABORT)

    entries:

        SEQ #:
        TYPE: [BEGIN/END/CHANGE/ABORT/COMMIT]
        TRANS ID:
        PREV_LSN:
        REDO action:
        UNDO action:

    [DRAW PICTURE OF THE SOFTWARE STRUCTURE:

        APPLICATION OR USER
          (what is the interface to the transaction system?)
        ----------------------------------
        TRANSACTION SYSTEM
          (what is the interface to the next layer down?)
        ----------------------------------
        DISK = LOG + CELL/STABLE/NV STORAGE]

    example: application does:

        BEGIN_TRANS
        CHANGE_STREET
        CHANGE_ZIPCODE
        END_TRANS

      or:

        BEGIN_TRANS
        DEBIT_ACCOUNT 123, $100
        CREDIT_ACCOUNT 456, $100
        END_TRANS

    [SHOW HOW LOTS OF TRANSACTIONS ARE INTERMINGLED IN THE LOG.]

    --why do aborts happen?
      --say, because of illegal values at the commit point
      --(in the isolation discussion, other reasons will surface)

    --why have a separate END record (instead of using OUTCOME)? (will
      see below)

    --why is BEGIN_TRANS != the first CHANGE record? (makes it easier to
      explain, and may make recovery simpler, but there is no
      fundamental reason.)

    --concept: commit point: the point at which there's no turning back.

      --actions always look like this:

          --first step
            ....        [can back out, leaving no trace]
          --commit point
            ....        [completion is inevitable]
          --last step

      --what's the commit point when buying a house? when buying a pair
        of shoes? when getting married?

      --what's the commit point here? (when the OUTCOME(COMMIT) record
        is in the log. so, better log the commit record *on disk* before
        you tell the user of the transaction that it committed! get that
        wrong, and the user of the transaction would proceed on a false
        premise, namely that the so-called committed action will be
        visible after a crash.)
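The record format and the commit-point rule above can be sketched in a few lines. This is an illustrative in-memory model, not a real logging API: `LOG` stands in for the on-disk log, and `flush()` is a placeholder for forcing the log tail to stable storage.

```python
# Illustrative sketch of the log-record format and the commit point.

LOG = []   # in-memory stand-in for the on-disk log

def append_record(rtype, trans_id, redo=None, undo=None):
    # Fields mirror the entries listed above; prev_lsn is -1 for the
    # first record (a sentinel).
    rec = {"seq": len(LOG), "type": rtype, "trans_id": trans_id,
           "prev_lsn": len(LOG) - 1, "redo": redo, "undo": undo}
    LOG.append(rec)
    return rec

def flush():
    pass   # placeholder: a real system forces the log tail to disk here

def commit(trans_id):
    # Commit point: the OUTCOME(COMMIT) record must be on disk *before*
    # we report success to the user of the transaction.
    append_record("OUTCOME_COMMIT", trans_id)
    flush()
    return "committed"   # only now is it safe to say so
```

A debit/credit transaction would log BEGIN, then two CHANGE records whose redo/undo actions are blind writes (absolute new and old cell values), then call `commit`; the "committed" reply is returned only after the OUTCOME record is (notionally) flushed.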
    --note: maintain some invariants:

      --always, always, always log the change before modifying the
        non-volatile cell storage, also known as stable storage (this is
        what we mean by "write-ahead log")

      --END means that the changes have been installed

      --[*] observe: no required ordering between the commit record and
        installation (can install at any time for a long-running
        transaction, or can write the commit record and very lazily
        update the non-volatile storage)

        --wait, what is the required ordering? (just that the CHANGE
          record has to be logged before the stable storage is changed)

      --***NOTE: below we are going to make the unrealistic assumption
        that, if the disk blocks that correspond to stable storage are
        cached, then that cache is write-through
        --this is different from whatever in-memory structures the
          database (or, more generally, the transactional system) uses.
          those in-memory structures don't matter for the purposes of
          this discussion

    --make sure it's okay if you crash during recovery
      --the procedure needs to be idempotent
      --how? (answer: records are expressed as "blind writes": for
        example, "put 3 in cell 5", rather than something like
        "increment the value of cell 5". in other words, records
        shouldn't make reference to the previous value of the modified
        cell.)

    --now say a crash happens. a simple recovery protocol goes like
      this:

      --scan backward, looking for "losers" (actions that don't have an
        END record)
        --if you encounter a CHANGE record that corresponds to a losing
          transaction, then apply its UNDO

      --then scan forward, starting at the beginning of the log
        --if you encounter a CHANGE record that corresponds to a
          committed transaction (which you learned about from the
          backward scan) for which there is no END record, then REDO the
          action

      --subtle detail: for all losing transactions, log END_TRANS
        --why? (because if we didn't, then future recoveries would
          actually undo the actions, which might be wrong if future
          actions made updates to those variables)

    [NOTE: the above is not typo'ed.
    the reason that it's okay to UNDO a committed action in the backward
    scan is that the forward scan will REDO the committed actions. this
    is actually exactly what we want: the state of storage is unwound in
    LIFO order to make it as though the losing transactions never
    happened. then, the ones that actually did commit get applied in
    order.]

    --observe: recovery is complex. it involves both undo and redo. what
      if we wanted only one of these two types of logging?

      (1) say we wanted only undo logging. what requirement would we
          have to place on the application?

          --perform all installs *before* logging the OUTCOME record.
          --in that case, there is never a need to redo.
          --recovery just consists of:
            --undoing all of the actions for which there is no OUTCOME
              record; and
            --logging an OUTCOME(ABORT) record

          this scheme is called *undo logging* or *rollback recovery*

      (2) say we wanted only redo logging. what requirement would we
          have to place on the application?

          --perform installs only *after* logging the OUTCOME record
          --in that case, there is never a need to undo
          --recovery just consists of scanning the log and applying all
            of the redo actions for actions that:
            --committed; but
            --do not have an END record

          this scheme is called *redo logging* or *roll-forward
          recovery*

    --checkpoints make recovery faster
      --but they introduce complexity
    --a non-write-through cache of disk pages makes things even more
      complex

  B. Isolation ("before-or-after atomicity")

    --easiest approach: one giant lock. only one transaction active at a
      time. so everything really is serialized
      --advantage: easy to reason about
      --disadvantage: no concurrency

    --next approach: fine-grained locks (e.g., one per cell, or per
      table, or whatever), and acquire all needed locks at
      begin_transaction and release all of them at end_transaction
      --advantage: easy to reason about. would work great if it could be
        implemented
      --disadvantage: requires the transaction to know all of its needed
        locks in advance

    --actual approach: two-phase locking.
      gradually acquire locks as you need them (phase 1), and then
      release all of them together at the commit point (phase 2)

      --your intuition from the concurrency unit will tell you that this
        creates a problem....namely, deadlock. we'll come back to that
        in a second.

      --why does this actually preserve a serial ordering? here's an
        informal argument:

        --consider the lock point, that is, the point in time at which
          the transaction owns all of the locks it will ever acquire.
        --consider any lock that is acquired. from that point to the
          lock point, the application always sees the same data
        --so regard the application as having done all of its reads and
          writes instantly at the lock point
        --the lock points create the needed serialization. here's why,
          informally. regard the transactions as having taken place in
          the order given by their lock points. okay, but how do we know
          that the lock points serialize? answer: each lock point takes
          place at an instant, and any lock points with intersecting
          lock sets must be serialized with respect to each other as a
          result of the mutual exclusion given by the locks.

      --to fix deadlock, several possibilities:
        --one of them is to remove the no-preempt condition: have the
          transaction manager abort transactions (roll them back) after
          a timeout

2. RPC, client/server systems

    --what the heck is RPC?
    --client/server systems
    --potential of RPC: fantastic way to build distributed systems
      --the RPC system takes care of all the distributed/network issues
    --how well does all of this work?

----------------------------------------------------------------------

admin notes:

    --lab 6 out
      --due two weeks from tomorrow
      --start on time: meaning now or next week
    --don't fall behind on the networking material, 'cause you need to
      understand some aspects of networking to see what's going on in
      the lab

----------------------------------------------------------------------
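Looking back at section A: the undo-then-redo recovery protocol can be made concrete with a short runnable sketch. The tuple record format here is illustrative, not from a real system: ("CHANGE", tid, redo, undo), ("COMMIT", tid), ("END", tid), where redo/undo actions are blind writes of the form (cell, value). Blind writes are what make re-running recovery after a crash-during-recovery idempotent.

```python
# Illustrative sketch of the recovery protocol from section A.

def recover(log, storage):
    committed = {r[1] for r in log if r[0] == "COMMIT"}
    ended = {r[1] for r in log if r[0] == "END"}

    # Backward scan: undo every CHANGE without an END record, unwinding
    # storage in LIFO order. Committed-but-uninstalled changes get
    # undone too; the forward scan reapplies them.
    for r in reversed(log):
        if r[0] == "CHANGE" and r[1] not in ended:
            cell, old = r[3]          # undo action: blind write of old value
            storage[cell] = old

    # Forward scan: redo the CHANGEs of committed transactions that have
    # no END record.
    for r in log:
        if r[0] == "CHANGE" and r[1] in committed and r[1] not in ended:
            cell, new = r[2]          # redo action: blind write of new value
            storage[cell] = new

    # Log END for the losers so a future recovery won't undo them again.
    losers = {r[1] for r in log
              if r[0] == "CHANGE" and r[1] not in committed
              and r[1] not in ended}
    for tid in sorted(losers):
        log.append(("END", tid))
```

For example, if transaction 1 committed a write of 10 to cell x (no END yet) and losing transaction 2 then installed 99 into x before the crash, recovery unwinds both writes and then redoes transaction 1's, leaving x = 10; running `recover` a second time leaves the same state.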