Database Systems
G22.2433.01
Spring 2006
Monday 5:00 - 7, Warren Weaver 109.
Prof Dennis Shasha
Instructors: Dennis Shasha (shasha@cs.nyu.edu)
with Alberto Lerner (lerner@cs.nyu.edu)
Office hours: Monday 4-4:45, Warren Weaver
room 422 also after class.
Teaching Assistants:
Guy Lichtman (lichtman@cs.nyu.edu) -- he will be familiar with the
course material so you can ask him questions. He is the primay source
of knowledge about software we will use.
He is also the class list administrator.
His software web page is
here
OVERVIEW
The goal of this course is
to study the design and data manipulation of
large data applications.
The course will be addressed to students who have only
an elementary familiarity with databases, but want to learn
how to use such systems effectively and correctly.
Virtually every interesting computer application involves
large data. Database management systems have grown up as generic
pieces of software that allow the input, modification, and querying
of large data.
The chief amenities offered by such systems are a high level query
language (SQL, XPath, XQuery), indexes to make data access fast,
an optimizer that chooses how to process queries,
and the illusion of transactional atomicity (concurrency control
and recovery).
The job of a database designer is to understand an application,
decide on the table design (which attributes go into which tables),
decide on a set of indexes, and build queries.
Doing this well entails learning design theory (normal forms),
the various kinds of indexes, and SQL.
We will spend the bulk of this course on these issues though
we will touch on data mining, data warehouses, capacity planning,
transaction processing, semantic extensions to SQL for finance, and XML.
Syllabus
The syllabus follows the chapter in the book
Database Systems Concepts, fifth edition.
You can use any recent edition however.
-
Introduction to the issues
Find slides
here.
-
Relational model -- algebras
and the advantage of having a single data type -- relation
(intuitively, a table where the order of rows doesn't matter).
Every operation takes a relation and produces one.
Find slides
here.
See notes on lecture 1
here
-
SQL -- starts simple, but can get quite complex.
It's important to think in terms of bulk as opposed
to scalar operators.
When used to query and manipulate data, SQL is a "Data Manipulation Language"
(DML).
Find slides
here.
See notes on lecture 2
here
See notes on lecture 3
here
We also discuss SQL as a language for setting up databases, i.e.
as a "Data Definition Language" (DDL):
Find slides
here.
-
Entity relationship-based design
Find slides
here.
See notes on lecture 4
here
-
Normalization theory -- the theory behind database design
Find slides
here.
See notes on lecture 6
here
-
XML, Xquery
Find slides
here.
-
File structures and indexing
Find slides
here.
See notes on
this lecture
-
Query processing
Find slides
here.
See notes on
this lecture
Related to file
structures and query processing is how to configure your system.
Here is an approximation to an n-server
capacity planner.
-
Query optimization
Find slides
here.
See notes on
this lecture
-
Data warehousing and data mining
Find slides
here.
Also find some notes from
the data warehousing portion
of the database tuning notes
In class, we will discuss
clustering.
If we have time, we will discuss information gain and decision tree building
based on
Andrew Moore's notes.
See notes on
this lecture
-
Personalized search using Bigtable:
Bigtable is a system for storing
and managing very large amounts of structured data.
Unlike a traditional database
system, the row/column space can be sparse. Row keys and values are
arbitrary strings, and the system allows each row/column cell to store
not just a single value but a set of values with associated
timestamps, simplifying analyses that examine how values have changed
over time. Data in a single table is internally broken at arbitrary
row boundaries to form contiguous regions of data called tablets.
-
Other fun topics (inserted
at various times in the semester): array databases, more data mining and so on.
You can find the lecture about array query languages
here
Sam Madden in his class at MIT on the subject suggested the
following questions
pertaining to this topic.
Homeworks
You may work in groups of two on the homeworks.
Please sign both of your names.
Grading
Homeworks 1/2
and in-class exam 1/2 (closed book but for which
you may bring in one 8.5 inch by 11 inch sheet of paper
on which you may write or type anything you like on either side of the sheet).
The in-class exam will be given
on April 24, 2006.
For the last class on May 1, you will be invited to put your
feet up on the table and relax.
Note that you can work in teams of 1 or 2 for all homeworks.
Exam topics follow what we covered in the lectures.
Nothing optional will be included (though I encourage
you to read that). Typical questions: given
an English description, find the SQL; canonical covers
and database table design; index selection; capacity
planning; application design (given an English
description, find the table design and select
appropriate indexes); XML; query optimization basics;
etc.
Late policy:
Homeworks must be submitted electronically
at 4:30 PM on the day they are due
to the teaching assistant or in class
if by hard copy.
There are no late homeworks.
If you miss a homework, then I need a doctor's note or a note
from your employer (with contact information).
Your scores from the other homeworks will be averaged in to give
you a grade for the missing one.
If you miss the final exam, then again I need a note from your employer
or doctor (again
with contact information).
Please note that I may verify these notes.
I'm sorry to be hard-nosed about this, but I'm giving you the homeworks
as of the first day of class so you should make these deadlines.
Notes and Other Information
Register for the
class list server mailing system
here.
The notes come largely from the book
Database System Concepts
by Silberschatz, Korth, and Sudarshan
fourth or fifth edition.
Formally, this book is optional, so you can get it
from an online bookseller as you wish.
There are also six copies on reserve in Bobst (not in the Courant
library but in Bobst).
Here is a lecture on
biosurveillance
by Andrew Moore and colleagues.
Here is a lecture
on time series analysis
by my colleagues and me.
We will also discuss, if we have time
several related topics found at the end of these lecture notes
(i) high performance replicated
database systems without the use of two phase commit, (ii) case studies from
Wall Street, (iii) a self-tuning technique that improves on LRU,
and (iv) a data structure for data warehouses.
Here are some notes about
good and bad uses of stored procedures
Here is Joe Conron's nice paper on indexes
(from when he was a master's student).
Here are notes about
materialized views in Oracle.
Finally, here are the rules about
academic honesty.