A Self-Organizing Database System - a Different Approach to Query Optimization

Candidate: Piatetsky-Shapiro,Gregory Ilya


A Self-Organizing Database System (SODS) monitors queries asked, finds a good (or optimal) database structure for those queries, and suggests or does the reorganization. In this thesis we describe a prototype SODS for single-file relational queries and give an integrated analysis of its major design problems: (1) estimation of the number of records satisfying a condition (i.e., condition selectivity); (2) query optimization; (3) storing information about a set of queries; (4) optimal selection of secondary indices. We present new results for each of those problems. Some of this research was implemented in FASTSCAN, a commercial query system. We present a new method for accurate estimation of the number of records satisfying a condition field rel constant, where rel is one of =, , (LESSTHEQ), (GREATERTHEQ). We also examine estimates for more complicated conditions. We present elementary operations (such as UNION, INTERSECT) on pointer and record streams. We show how to use the query parse tree to construct a query evaluation method (EM) from those operations. Then we give an algorithm for selecting the optimal EM, based on converting the query to conjunctive normal form. We examine ways to compress information about a set of queries by combining information for similar queries. We derive a compression scheme which allows a correct and fast computation of the cost of the average query under any index set. We combine all previous results in analyzing the NP-hard problem of optimal index selection. We present two algorithms for it. The first one always finds the optimal answer and runs fast on real-size problems despite its exponential worst-case complexity. The second one (a Greedy method) runs much faster, yet finds the optimal answer very frequently. We analyze the Maximum Cover problem (also NP-hard), a simplification of the optimal index selection. We prove that the Greedy method is an epsilon-approximate algorithm: its answer is always > 63% of the optimal answer.