* k-means algorithm pick k centroid candidates randomly find the points closest to each centroid candidate until you your candidate centroids don't change in each cluster pick a better candidate centroid if there is one every point finds the nearest centroid Note that both steps reduce the average distance from each point to the nearest centroid if there is any change. * Notice that even with enough points can get bizarre answers. e.g. cluster on left two clusters on right. If initial points are two in the left cluster and one between the right clusters, that will be stable. No good however. Different random starting configuration. * How to choose k? Silhouette method. Imagine that you have three evident clusters, but k = 2. Then you're going to get something bizarre. So you want to increase k, but to what extent? Increasing k will always decrease the distance to centroid. Silhouette method. Graph decrease in total distance against number of centroids. When you start to level out, stop. * Types of learning. Supervised = have training set and usually some objective. For example, patient gets better would be an objective. Look at people who get better and people who don't and see what might be the cause of the difference. Decision tree building is one example. Unsupervised = no training set. Clustering and friends. Goal: group the data. decision support: make queries. data mining: let data speak to you. For data warehousing: go to dbtune p. 115, review applications, and bitmaps. Aggregate (strategic) targeting: Aggregates flow up from a wide selection of data, and then Targeted decisions flow down Examples: Riding the wave of clothing fads Tracking delays for frequent-flyer customers Broad Aggregate queries over ranges of values, e.g., find the total sales by region and quarter. Deep Queries that require precise individualized information, e.g., which frequent flyers have been delayed several times in the last month? Dynamic (vs. Static) Queries that require up-to-date information, e.g. which nodes have the highest traffic now? Bit vectors: sweet spot is many rows but few values of an attribute. Dayofweek has only 7 values, but the sales table may have millions of rows. Sex, state, age decade, professional category. bit map for day of week. bit map will consist of in this case 7 bit vectors, one for each day. bit vector for monday will have a length equal to the number of rows. Call it bitmonday. bitmonday[i] = 1 if row i has "Monday" in its dayofweek field. Else 0. Every bit vector has cardinality (length) equal to the number of rows and a 1 in a position if that row has the data represented by the bit vector. Suppose we have a query select * from sales where dayofweek = 'Monday' and agedecade = '20s' and state = 'California' and .... Intersection is just bit and. Why was it invented? and salary > 60000 Can create bit vectors for bit positions. Bit vector for each bit position in the representation of salary. sampling when it can be used. -- broad queries. How to do it. Then start with decision trees. A occurs with probability 1/2 B with 1/4 C with 1/8 D with 1/8 "entropy" = 1/2*1 (for A) + 1/4 * 2 + 1/8 * 3 + 1/8 *3. Intuitively the negative of the log of probability is the number of bits you want. entropy is the number of bits you need on the average per character (in general, message). Y = whether or not you like Gladiator X = major H(Y|X = math); population of 4 people (the math majors). Two like gladiator and two do not. (- 1/2 * log 1/2) + (-1/2 * log 1/2) = 1 (resulting entropy). ^liking ^ not liking (0 * log 0) + (-1 * log 1) = 0 ^liking ^ not liking