Normal order: data structures, indexes, query processing, query optimization Start chapter 12 with 2-3 trees. Index -- data structure + (possibly) organization of the records + (possibly) some rule about the pointers. dense = pointer from index to record sparse = pointer from index to page containing one or more records sparse index has a shorter (less height) data structure. dense index = sometimes you can answer the query entirely within the index "covering the query". Can have a sparse index only when the data is organized based on the key of the index. For white pages of the phone book, this is lastname,first name. When we organize data based on some concatenation of attributes A1, A2, ... Ak, then it is said to be "clustered" on those attributes. It also, of course, clustered on any prefix of those attributes. "organized based on" = sorted if the data structure is a B tree or same hash value if data structure is a hash structure. index -- one of several data structures (hash, b tree, r tree, ...) + clustered/non-clustered (organized based on some concatenation of attributes) + sparse/dense if it's clustered then it can be sparse or dense if non-clustered then it must be dense. covering index for a select X from R where Y = ... should have the order Y, X because the Y values will be encountered first as the process descends the index. So, the process will be able to search a small subtree to get all matching X values. covering (and ordered) for this query would be (Y,X). covering and unordered for this query would be (X,Y). The latter can still give a performance improvement because the query can be answered by looking at the entire index rather than at the entire table. So better than a scan. Let's say I have an employee table emp(id, salary, name, department) Consider a query ... where department = :x Choices: 1. no index 2. non-clustered index on department alone 3. clustered index on department 4. covering dense index on department, other attributes for the query. mysql: create index indexname on tablename (Att1, Att2) homework 3 question 4c: consider the following query select R.C from R, S,T where R.A = S.B and S.D = T.A This might have the ordering join R with S and then S with T. This ordering might change if you added a predicate like select R.C from R, S,T where R.A = S.B and S.D = T.A and T.G = 5 Start chapter 13 with joins p. 22. select R.C from R, S where R.A = S.B Suppose you have an index on S.B. In the nested loop, you can try the following for r in R look up r.A in the index for S.B if present then output r.C end for What is the time complexity: |R|* log(|S|). This is much better than |R| * |S|. What if we had had an index on R.A but not on S.B? for s in S look up s.B in the index for R.A if present then find the record(s) for that value and output the C value(s) end if end for Which index on R would "cover" this join? Answer: (A,C) F = { AB --> C CD --> B E --> F F --> A A --> G A --> D G --> B } Is that equivalent to G = { A --> C CD --> B E --> F F --> A A --> G A --> D G --> B } notice that G is at least as strong as F. i.e. if G is true then surely F is true. Why? Because if A --> C holds then AB --> C must hold. Why? Because A --> C implies that if two rows agree on their A values then they agree on their C values. So, surely if they agree on their A and B values, they agree on their C values. The challenge therefore is to find out whether F is as strong as G. This is not much of a challenge for any but the first functional dependency. Why? Because they are equal. So, all we need to do is to prove that A --> C follows from the functional dependencies F. To prove that, I will use attribute closures. So, I will ask whether in F, A+ includes C. Because the attributes in A+ are all attributes functionally dependent on A. A+ with respect to set F = {A}, {A, G, D} then we get {A, G, D, B} because of G --> B and the fact that G is a subset of {A, G, D}. From {A, G, D, B}, we get C. Notice: we have A --> G and G --> B. This implies A --> B but we still want to keep the original ones because there could be cases where A --> B holds but A --> G and G --> B do not. A G B 1 1 1 1 2 1 ========= A comment from Alberto on indexes + query optimization: Incidentally, I wouldn't be suprised if a question arose about how to optimize queries, now considering these three axis together: available indices, join order, and join algorithms. A quick answer is that indices would be considered in the base of the dynamic programming induction. That is, one can scan a table or traverse the latter through its index. The choice of index wouldn't alter the size of the intermediate results -- and thus the join order, if that's the cost function in question. But the presense of an index can, for instance, make a nested-loop join be less unatractive, or to allow a merge join to be a possibility. The choice of which index to pick is then not a fuction of cost and may not became clear until after several induction steps (ie, after the join of i of the n tables were considered). But on "normal" dynamic programmig, you don't compute a result in an induction step if that is not going to be used by the following step. This is where the "interesting orders" concept cames in. But I guess that's for the advanced database class. ;-)