Data Structures Lecture 7: More on Lists; Hash Tables. 2/20

DJW sections 3.1, 3.5 (stacks); 5.1, 5.3, 5.6 (queues); 6.7 (lists) 2/20.

Returning a value

The following class definition does not compile. Why not?
public class FindSqrt {
    public static int findSqrt(int n) {
        if (n <= 0) return 0;
        for (int i = 1; i <= n; i++) {
            if (i*i >= n) return i;
        }
    }
}









Answer: Java cannot be sure that this returns a value; as far as the Java compiler is concerned, the loop might execute to completion without the condition ever being satisfied. You have to add an additional return statement at the end. You know that this will never be executed, but it reassures the compiler.
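
One way the fix might look is sketched below. The class name FindSqrtFixed and the choice to return n in the extra statement are assumptions made for illustration; any int value would satisfy the compiler.

```java
class FindSqrtFixed {
    // Returns the smallest i >= 1 with i*i >= n, or 0 if n <= 0.
    public static int findSqrt(int n) {
        if (n <= 0) return 0;
        for (int i = 1; i <= n; i++) {
            if (i * i >= n) return i;
        }
        // Not reached for n >= 1 (i == n already satisfies i*i >= n,
        // ignoring int overflow), but without this line the compiler
        // rejects the method because it might end without returning.
        return n;
    }
}
```

For example, findSqrt(10) returns 4, since 3*3 = 9 is less than 10 and 4*4 = 16 is not.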

Generic Ordered Lists

GOrderedList.java

Stacks and Queues

Stacks and queues are lists with restrictions on the forms of access.

Stacks

Stacks obey "Last In, First Out" (LIFO) discipline. In most applications, there is an additional restriction that you can only examine the top element of the stack. There are therefore three main methods (besides the constructor):

ArrayStack.java
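
As a rough illustration of the LIFO discipline, here is a minimal array-based stack of ints. It is a sketch only: the class name ArrayStackSketch and the method names push, pop, peek and isEmpty are assumptions, and may not match what ArrayStack.java actually uses.

```java
import java.util.NoSuchElementException;

class ArrayStackSketch {
    private final int[] elements;
    private int size = 0;   // number of items; the top is elements[size - 1]

    public ArrayStackSketch(int capacity) {
        elements = new int[capacity];
    }

    public boolean isEmpty() {
        return size == 0;
    }

    // Put x on top of the stack.
    public void push(int x) {
        if (size == elements.length) throw new IllegalStateException("stack is full");
        elements[size++] = x;
    }

    // Remove and return the top element.
    public int pop() {
        if (isEmpty()) throw new NoSuchElementException("stack is empty");
        return elements[--size];
    }

    // Return the top element without removing it.
    public int peek() {
        if (isEmpty()) throw new NoSuchElementException("stack is empty");
        return elements[size - 1];
    }
}
```

Pushing 1, 2, 3 and then popping three times yields 3, 2, 1: the last item in is the first out.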

FIFO Queues: List implementation

A FIFO queue observes First In First Out (FIFO) ordering; items come off the queue in the same order in which they went on. The main methods are:

FIFO queues are implemented in two ways. The first is a linked list, with pointers to front and back.

FIFOQueue.java
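
The linked-list version can be sketched roughly as follows. This is an illustration under assumptions: the class name LinkedQueueSketch and the method names add, remove and isEmpty are invented here and need not match FIFOQueue.java.

```java
import java.util.NoSuchElementException;

class LinkedQueueSketch<T> {
    private static class Node<E> {
        E value;
        Node<E> next;
        Node(E value) { this.value = value; }
    }

    private Node<T> front;  // next item to come off; null if the queue is empty
    private Node<T> back;   // most recently added item; null if the queue is empty

    public boolean isEmpty() {
        return front == null;
    }

    // Add x at the back of the queue.
    public void add(T x) {
        Node<T> n = new Node<>(x);
        if (back == null) {
            front = n;          // queue was empty
        } else {
            back.next = n;
        }
        back = n;
    }

    // Remove and return the item at the front of the queue.
    public T remove() {
        if (front == null) throw new NoSuchElementException("queue is empty");
        T v = front.value;
        front = front.next;
        if (front == null) back = null;  // queue became empty
        return v;
    }
}
```

Keeping a pointer to the back as well as the front is what makes both operations constant time; with only a front pointer, adding would require walking the whole list.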

FIFO Queues: Circular array implementation

The second implementation is as a circular array. The queue consists of an array, elements, together with a start index and an end index. Conceptually the array "wraps around", so that when you reach the end, you go back to index 0. In our implementation, end is the index of the first empty slot. (You can never actually fill the array, because a completely full array would have start equal to end, and there would be no way to distinguish that from the empty queue.)

CircularArray.java
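
The wrap-around arithmetic can be sketched as follows, using the convention described above that end is the index of the first empty slot. The class and method names are assumptions for illustration and are not taken from CircularArray.java.

```java
import java.util.NoSuchElementException;

class CircularQueueSketch {
    private final int[] elements;
    private int start = 0;  // index of the front item
    private int end = 0;    // index of the first empty slot

    // An array of size arraySize can hold at most arraySize - 1 items.
    public CircularQueueSketch(int arraySize) {
        elements = new int[arraySize];
    }

    public boolean isEmpty() {
        return start == end;
    }

    // Full when advancing end would make it collide with start.
    public boolean isFull() {
        return (end + 1) % elements.length == start;
    }

    public void add(int x) {
        if (isFull()) throw new IllegalStateException("queue is full");
        elements[end] = x;
        end = (end + 1) % elements.length;   // wrap around to 0 at the end
    }

    public int remove() {
        if (isEmpty()) throw new NoSuchElementException("queue is empty");
        int x = elements[start];
        start = (start + 1) % elements.length;
        return x;
    }
}
```

With an array of size 4, the queue is full after three adds; removing one item and adding another then places the new item in slot 3 and wraps end back to 0.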

Hash Tables

A hash table is a data structure that allows you to associate a value with a key, and then look up the value associated with the key in constant time, with very high probability.

The Java HashMap library class

Sample code (German-English dictionary): TestLibraryHash.java.

The generic library class HashMap<K,V> is a hash table with keys of class K and values of class V. In this example, G2E is constructed as an object of class HashMap<String,String>.

Important methods:

We'll come back to more implementation details later.
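
For concreteness, here is a small sketch of a German-English dictionary built on java.util.HashMap, using the library methods put, get and containsKey. The class name and the three entries are made up for illustration and are not taken from TestLibraryHash.java.

```java
import java.util.HashMap;

class LibraryHashSketch {
    // Build a tiny German-English dictionary (illustrative entries only).
    public static HashMap<String, String> buildDictionary() {
        HashMap<String, String> g2e = new HashMap<>();
        g2e.put("Hund", "dog");     // put(key, value) associates value with key
        g2e.put("Katze", "cat");
        g2e.put("Haus", "house");
        return g2e;
    }

    public static void main(String[] args) {
        HashMap<String, String> g2e = buildDictionary();
        System.out.println(g2e.get("Hund"));          // prints "dog"
        System.out.println(g2e.get("Baum"));          // prints "null": key not present
        System.out.println(g2e.containsKey("Katze")); // prints "true"
    }
}
```

Note that get returns null for a key that is not in the table, so containsKey is the safer test when null might also be stored as a value.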

Implementation

MyHashTable.java.
This is a hash table with keys of class String and values of type Person.

The hash table is an array of a specified size, which is supposed to be larger than the maximum number of keys you plan to store. Each element of the array is a linked list of nodes (in this case, we've used a singly linked list with no header). Each node has three data fields: key, value, and next.

Method MyHash(S) maps string S to an index in the hash table.

The method add(Key,PP) :

(You will improve this in the problem set.)

The method get(Key) :
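
The structure described above (an array of singly linked lists with no header, each node holding key, value and next) might be sketched as follows. This is not the code of MyHashTable.java: the class name, the minimal Person class, the use of String.hashCode as a stand-in hash function, and the choice to add new nodes at the front of the list without checking for an existing key are all assumptions.

```java
class ChainedTableSketch {
    // Hypothetical stand-in for the course's Person class.
    public static class Person {
        public final String name;
        public Person(String name) { this.name = name; }
    }

    private static class Node {
        final String key;
        final Person value;
        final Node next;
        Node(String key, Person value, Node next) {
            this.key = key;
            this.value = value;
            this.next = next;
        }
    }

    private final Node[] table;  // table[i] is the first node of list i, or null

    public ChainedTableSketch(int capacity) {
        table = new Node[capacity];
    }

    // Stand-in hash function (assumption): library hashCode reduced to 0..capacity-1.
    private int myHash(String s) {
        return Math.floorMod(s.hashCode(), table.length);
    }

    // Add at the front of the list for the key's index; no duplicate check.
    public void add(String key, Person p) {
        int i = myHash(key);
        table[i] = new Node(key, p, table[i]);
    }

    // Walk the list at the key's index; return null if the key is absent.
    public Person get(String key) {
        for (Node n = table[myHash(key)]; n != null; n = n.next) {
            if (n.key.equals(key)) return n.value;
        }
        return null;
    }
}
```

Even with a capacity of 1, where every key collides, get still returns the right value; it simply has to walk a longer list.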

Terminology: Two keys collide in a hash table if the hash function maps them to the same index. The capacity of the hash table is the size of the array. The load factor is the number of keys stored in the hash table divided by the capacity. The size should be chosen so that the load factor is less than 1. For instance, if we want to implement a German-English dictionary with 50,000 German words, we need a hash table with capacity larger than 50,000.

Since the number of keys in the hash table is less than the capacity of the hash table, assuming that the keys are evenly distributed across indices, there will be few collisions, and most of the linked lists will be of length 1. A few will be of length 2; a very few will be of length 3, and so on. The probability that there is any linked list that is very much longer than the load factor is very small.

In the library class HashMap, the system automatically doubles the capacity of the hash table when the number of keys exceeds the capacity times a threshold load factor (0.75 by default), similar to what we saw with StringBuffers.

(There are other implementations of hash tables that don't use linked lists, described in Weiss section 5.4. For our purposes, these are unimportant.)

Example

This picture shows the final state of the hash table constructed in TestMyHashTable.java.

Choosing the hash function

The hash function maps the key to an index in the hash table. It is critical to avoid collisions as far as possible. Suppose that we are implementing a German-English dictionary with 50,000 words and we are using a hash table of capacity 75,000, giving a load factor of 0.67, which is very reasonable.
  1. There certainly is no function that maps every different String to a different index, since there are only 75,000 different indices, and there are infinitely many strings.
  2. It would be difficult to find a hash function that maps every different German word to a different index, i.e. with no collisions. (However, when the set of keys is fixed, as here, it is sometimes worth putting substantial effort into finding a hash function with few collisions.)
  3. If the German words of length N were a random selection of strings of length N then you could use essentially any hash function that distributes the strings of length N evenly over the indices, and with high probability that would be a good hash function.
  4. But German words are not random strings; they have a lot of patterns. For example: Some letters are common, some are rare. Some sequences of letters are impossible. German uses a lot of compound words, so many words are parts of other words.
  5. What is important is that these patterns don't somehow cause a lot of German words to hash to the same value. For instance, you don't want a hash function that only looks at the first four letters of the word because many words begin with the same first four letters. You want the hash function to "make hash" of any pattern in the set; hence the name.
  6. However, the hash function has to be computable quickly; otherwise you lose the advantage of the hash table.
  7. The hash function for Strings given in MyHashTable.java, which is pretty much the same as the one in Weiss, is pretty good. It views the string as essentially a numeral in base 37 (or base 43, if the hash table size is close to a multiple of 37) and then reduces that number mod the table size. This hash function is a little slower than ideal, particularly for German words, which are often long.
  8. The choice of hash function depends on the kinds of patterns that naturally occur in actual collections. A good hash function for images, for example, may be quite different from a good hash function for strings.
  9. Obviously, the hash function does have to be repeatable; you could get good distribution if you incorporated a random number, but then you could never find the key again.
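
The base-37 scheme mentioned in the list above can be sketched as follows. This is an illustration in the spirit of that description, not a copy of the function in MyHashTable.java; the class and method names are assumptions.

```java
class StringHashSketch {
    // Treat the string as a numeral in base 37 whose "digits" are the
    // character codes, then reduce it to an index in 0..tableSize-1.
    public static int hash(String key, int tableSize) {
        int hashVal = 0;
        for (int i = 0; i < key.length(); i++) {
            // Horner's rule; int overflow simply wraps around, which is harmless here.
            hashVal = 37 * hashVal + key.charAt(i);
        }
        hashVal %= tableSize;
        if (hashVal < 0) hashVal += tableSize;  // % can be negative after overflow
        return hashVal;
    }
}
```

For example, with a table size of 1000, "ab" hashes to (37*97 + 98) mod 1000 = 3687 mod 1000 = 687. Every character contributes to the result, so words that share a prefix do not automatically collide, and the function is repeatable because it involves no randomness. The cost is one multiplication and one addition per character, which is why long words take longer to hash.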

The Java library hash function

The Java library provides a method hashCode for classes that are expected to be used as keys in hash tables, e.g. String, Integer and so on. This maps the value to a 32-bit integer. Reducing this modulo the hash table size (taking care that the result is non-negative, since hashCode may return a negative number) gives a good hash function.

If the hash table size L = 2^k, which it always is in the HashMap class, then reducing mod L is the same as taking the k lowest-order bits, which is the same as doing a bitwise AND with L-1. That is the explanation of the code in the method goodMod.
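
The equivalence between reducing mod a power of two and masking with L-1 can be checked with a short sketch. The names PowerOfTwoModSketch and maskMod are hypothetical; this is not the course's goodMod method.

```java
class PowerOfTwoModSketch {
    // Reduce h to an index in 0..tableSize-1, assuming tableSize is a power of 2.
    // tableSize - 1 has exactly the k low-order bits set, so the AND keeps
    // just those bits of h.
    public static int maskMod(int h, int tableSize) {
        return h & (tableSize - 1);
    }

    public static void main(String[] args) {
        int h = "Hund".hashCode();
        // Both lines print the same index for a table of size 16.
        System.out.println(maskMod(h, 16));
        System.out.println(Math.floorMod(h, 16));
    }
}
```

The mask also handles negative hash codes correctly: maskMod(-7, 16) is 9, the same as Math.floorMod(-7, 16), whereas the expression -7 % 16 evaluates to -7 in Java and would not be a valid array index.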