Monday 5:00 - 7, Warren Weaver 109.
Prof Dennis Shasha
Instructors: Dennis Shasha (email@example.com)
with Alberto Lerner (firstname.lastname@example.org)
Office hours: Monday 4-4:45, Warren Weaver
room 422 also after class.
Guy Lichtman (email@example.com) -- he will be familiar with the
course material so you can ask him questions. He is the primay source
of knowledge about software we will use.
He is also the class list administrator.
His software web page is
The goal of this course is
to study the design and data manipulation of
large data applications.
The course will be addressed to students who have only
an elementary familiarity with databases, but want to learn
how to use such systems effectively and correctly.
Virtually every interesting computer application involves
large data. Database management systems have grown up as generic
pieces of software that allow the input, modification, and querying
of large data.
The chief amenities offered by such systems are a high level query
language (SQL, XPath, XQuery), indexes to make data access fast,
an optimizer that chooses how to process queries,
and the illusion of transactional atomicity (concurrency control
The job of a database designer is to understand an application,
decide on the table design (which attributes go into which tables),
decide on a set of indexes, and build queries.
Doing this well entails learning design theory (normal forms),
the various kinds of indexes, and SQL.
We will spend the bulk of this course on these issues though
we will touch on data mining, data warehouses, capacity planning,
transaction processing, semantic extensions to SQL for finance, and XML.
The syllabus follows the chapter in the book
Database Systems Concepts, fifth edition.
You can use any recent edition however.
Introduction to the issues
Relational model -- algebras
and the advantage of having a single data type -- relation
(intuitively, a table where the order of rows doesn't matter).
Every operation takes a relation and produces one.
See notes on lecture 1
SQL -- starts simple, but can get quite complex.
It's important to think in terms of bulk as opposed
to scalar operators.
When used to query and manipulate data, SQL is a "Data Manipulation Language"
See notes on lecture 2
See notes on lecture 3
We also discuss SQL as a language for setting up databases, i.e.
as a "Data Definition Language" (DDL):
Entity relationship-based design
See notes on lecture 4
Normalization theory -- the theory behind database design
See notes on lecture 6
File structures and indexing
See notes on
See notes on
Related to file
structures and query processing is how to configure your system.
Here is an approximation to an n-server
See notes on
Data warehousing and data mining
Also find some notes from
the data warehousing portion
of the database tuning notes
In class, we will discuss
If we have time, we will discuss information gain and decision tree building
Andrew Moore's notes.
See notes on
Personalized search using Bigtable:
Bigtable is a system for storing
and managing very large amounts of structured data.
Unlike a traditional database
system, the row/column space can be sparse. Row keys and values are
arbitrary strings, and the system allows each row/column cell to store
not just a single value but a set of values with associated
timestamps, simplifying analyses that examine how values have changed
over time. Data in a single table is internally broken at arbitrary
row boundaries to form contiguous regions of data called tablets.
Other fun topics (inserted
at various times in the semester): array databases, more data mining and so on.
You can find the lecture about array query languages
Sam Madden in his class at MIT on the subject suggested the
pertaining to this topic.
You may work in groups of two on the homeworks.
Please sign both of your names.
and in-class exam 1/2 (closed book but for which
you may bring in one 8.5 inch by 11 inch sheet of paper
on which you may write or type anything you like on either side of the sheet).
The in-class exam will be given
on April 24, 2006.
For the last class on May 1, you will be invited to put your
feet up on the table and relax.
Note that you can work in teams of 1 or 2 for all homeworks.
Exam topics follow what we covered in the lectures.
Nothing optional will be included (though I encourage
you to read that). Typical questions: given
an English description, find the SQL; canonical covers
and database table design; index selection; capacity
planning; application design (given an English
description, find the table design and select
appropriate indexes); XML; query optimization basics;
Homeworks must be submitted electronically
at 4:30 PM on the day they are due
to the teaching assistant or in class
if by hard copy.
There are no late homeworks.
If you miss a homework, then I need a doctor's note or a note
from your employer (with contact information).
Your scores from the other homeworks will be averaged in to give
you a grade for the missing one.
If you miss the final exam, then again I need a note from your employer
or doctor (again
with contact information).
Please note that I may verify these notes.
I'm sorry to be hard-nosed about this, but I'm giving you the homeworks
as of the first day of class so you should make these deadlines.
Notes and Other Information
Register for the
class list server mailing system
The notes come largely from the book
Database System Concepts
by Silberschatz, Korth, and Sudarshan
fourth or fifth edition.
Formally, this book is optional, so you can get it
from an online bookseller as you wish.
There are also six copies on reserve in Bobst (not in the Courant
library but in Bobst).
Here is a lecture on
by Andrew Moore and colleagues.
Here is a lecture
on time series analysis
by my colleagues and me.
We will also discuss, if we have time
several related topics found at the end of these lecture notes
(i) high performance replicated
database systems without the use of two phase commit, (ii) case studies from
Wall Street, (iii) a self-tuning technique that improves on LRU,
and (iv) a data structure for data warehouses.
Here are some notes about
good and bad uses of stored procedures
Here is Joe Conron's nice paper on indexes
(from when he was a master's student).
Here are notes about
materialized views in Oracle.
Finally, here are the rules about