Lecture 8: Hash Tables 2/25

Reading: DJW section 10.6. However, note that the hash tables in DJW are somewhat different from ours. In their implementation, the hash table is an array of key-value pairs rather than an array of headers of linked lists. This is called closed hashing (also known as open addressing). The technique for dealing with collisions becomes more complicated in closed hashing.

A hash table is a data structure that allows you to associate a value with a key, and then look up the value associated with the key --- with very high probability in constant time.

The Java HashMap library class

Sample code (German-English dictionary): TestLibraryHash.java.

The generic library class HashMap<K,V> is a hash table with keys of class K and values of class V. In this example, G2E is constructed as an object of class HashMap<String,String>.

Important methods:
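A minimal sketch of the methods most often used with HashMap (put, get, containsKey, remove, size), written as a tiny German-English dictionary. The class name G2EDemo and the sample words are assumptions made for illustration; this is not the code in TestLibraryHash.java.

```java
import java.util.HashMap;

// Illustrative sketch only; not the course's TestLibraryHash.java.
class G2EDemo {
    // Build a tiny German-English dictionary.
    static HashMap<String, String> buildDictionary() {
        HashMap<String, String> g2e = new HashMap<>();
        g2e.put("Hund", "dog");     // associate a value with a key
        g2e.put("Katze", "cat");
        g2e.put("Haus", "house");
        g2e.put("Haus", "home");    // put on an existing key replaces the old value
        return g2e;
    }

    public static void main(String[] args) {
        HashMap<String, String> g2e = buildDictionary();
        System.out.println(g2e.get("Hund"));           // value stored under "Hund"
        System.out.println(g2e.get("Pferd"));          // null, since the key is absent
        System.out.println(g2e.containsKey("Katze"));  // membership test on keys
        g2e.remove("Katze");                           // delete a key and its value
        System.out.println(g2e.size());                // number of keys now stored
    }
}
```

Note that get returns null for a missing key, so containsKey is the safer test when null could also be a stored value.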

We'll come back to more implementation details later.

Implementation

MyHashTable.java.
This is a hash table with keys of class String and values of type Person.

The hash table is an array of a specified size, which is supposed to be larger than the maximum number of keys you plan to store. Each element of the array is a linked list of nodes (in this case, we've used a singly linked list with no header). Each node has three data fields: key, value, and next.

Method MyHash(S) maps string S to an index in the hash table.

The method add(Key,PP) :

The method get(Key) :
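The following is a rough sketch of how add and get can work on an array of singly linked lists, as described above. It is not the code in MyHashTable.java: the class name ChainedTable is hypothetical, values are Strings rather than Person objects so that the example is self-contained, the library hashCode stands in for MyHash, and overwriting the value of an existing key is an assumption about the intended behavior.

```java
// A sketch of a hash table with linked-list buckets; not the actual MyHashTable.java.
class ChainedTable {
    // One node of a singly linked list: key, value, and next.
    private static class Node {
        final String key;
        String value;
        Node next;

        Node(String key, String value, Node next) {
            this.key = key;
            this.value = value;
            this.next = next;
        }
    }

    private final Node[] table;  // table[i] is the first node of list i, or null

    ChainedTable(int capacity) {
        table = new Node[capacity];
    }

    // Stand-in for MyHash(S): maps a string to an index in [0, capacity).
    private int indexOf(String key) {
        return Math.floorMod(key.hashCode(), table.length);
    }

    // Store value under key; if key is already present, overwrite its value.
    void add(String key, String value) {
        int i = indexOf(key);
        for (Node n = table[i]; n != null; n = n.next) {
            if (n.key.equals(key)) {
                n.value = value;
                return;
            }
        }
        table[i] = new Node(key, value, table[i]);  // insert at the front of list i
    }

    // Return the value stored under key, or null if key is absent.
    String get(String key) {
        for (Node n = table[indexOf(key)]; n != null; n = n.next) {
            if (n.key.equals(key)) {
                return n.value;
            }
        }
        return null;
    }
}
```

Both methods first compute the index and then walk only the one list at that index, which is why the cost depends on the length of that list rather than on the total number of keys.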

Terminology: Two keys collide in a hash table if the hash function maps them to the same index. The capacity of the hash table is the size of the array. The load factor is the number of keys stored in the hash table divided by the capacity. The capacity should be chosen so that the load factor is less than 1. For instance, if we want to implement a German-English dictionary with 50,000 German words, we need a hash table whose capacity is larger than 50,000.

Since the number of keys in the hash table is less than the capacity of the hash table, assuming that the keys are evenly distributed across indices, there will be few collisions, and most of the linked lists will be of length 1. A few will be of length 2; a very few will be of length 3, and so on. The probability that there is any linked list that is very much longer than the load factor is very small.
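One way to get a feel for this claim is to simulate it. The sketch below assumes an idealized hash function that sends each key to a uniformly random index, which real keys and real hash functions only approximate; the class and method names are made up for this illustration. It counts how many lists end up with each length.

```java
import java.util.Random;

// Simulation under the assumption of perfectly uniform hashing.
class ChainLengthDemo {
    // Place numKeys keys into capacity buckets uniformly at random and return
    // hist, where hist[len] is the number of buckets holding exactly len keys.
    static int[] chainLengthHistogram(int numKeys, int capacity, long seed) {
        Random rng = new Random(seed);
        int[] bucketSize = new int[capacity];
        for (int k = 0; k < numKeys; k++) {
            bucketSize[rng.nextInt(capacity)]++;
        }
        int longest = 0;
        for (int s : bucketSize) {
            longest = Math.max(longest, s);
        }
        int[] hist = new int[longest + 1];
        for (int s : bucketSize) {
            hist[s]++;
        }
        return hist;
    }

    public static void main(String[] args) {
        // 50,000 keys in a table of capacity 75,000.
        int[] hist = chainLengthHistogram(50_000, 75_000, 1L);
        for (int len = 0; len < hist.length; len++) {
            System.out.println("lists of length " + len + ": " + hist[len]);
        }
    }
}
```

Running main prints one line per list length; the counts should fall off quickly as the length grows.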

In the library class HashMap, the system automatically doubles the size of the hash table when the load factor exceeds a threshold (0.75 by default), similar to what we saw with StringBuffers.

(There are other implementations of hash tables that don't use linked lists, described in DJW ????. For our purposes, these are unimportant.)

Example

This picture shows the final state of the hash table constructed in TestMyHashTable.java.

Choosing the hash function

The hash function maps the key to an index in the hash table. It is critical to avoid collisions as far as possible. Suppose that we are implementing a German-English dictionary with 50,000 words and we are using a hash table of capacity 75,000. Then we have a load factor of 0.67, which is very reasonable.
  1. There certainly is no function that maps every different String to a different index, since there are only 75,000 different indices, and there are infinitely many strings.
  2. It would be difficult to find a hash function that maps every different German word to a different index, i.e., one with no collisions. (However, when the set of keys is fixed, as here, it is sometimes worth putting substantial effort into finding a hash function with few collisions.)
  3. If the German words of length N were a random selection of strings of length N then you could use essentially any hash function that distributes the strings of length N evenly over the indices, and with high probability that would be a good hash function.
  4. But German words are not random strings; they have a lot of patterns. For example: Some letters are common, some are rare. Some sequences of letters are impossible. German uses a lot of compound words, so many words are parts of other words.
  5. What is important is that these patterns don't somehow cause a lot of German words to hash to the same value. For instance, you don't want a hash function that only looks at the first four letters of the word because many words begin with the same first four letters. You want the hash function to ``make hash'' of any pattern in the set; hence the name.
  6. However, the hash function has to be computable quickly; otherwise you lose the advantage of the hash table.
  7. The hash function for Strings given in MyHashTable.java is pretty good. It views the string as essentially a numeral in base 37 (or base 43, if the hash table size is close to a multiple of 37) and then reduces that number mod the table size. This hash function is a little slower than ideal, particularly for German words, which are often long.
  8. The choice of hash function corresponds to the kinds of patterns that naturally occur in actual collections. A good hash function for images, for example, may be quite different from a good hash function for strings.
  9. Obviously, the hash function does have to be repeatable; you could get good distribution if you incorporated a random number, but then you could never find the key again.
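The base-37 idea from item 7 can be sketched as follows. This is a guess at the general shape, not the function in MyHashTable.java: the class name Base37Hash is hypothetical, each character is used by its numeric char code, and the switch to base 43 is left out.

```java
// Sketch of a "string as a base-37 numeral" hash; details are assumptions.
class Base37Hash {
    // Treat s as a numeral in base 37 and reduce it mod tableSize.
    // Reducing after every digit (Horner's rule) keeps the running value
    // below tableSize, so the arithmetic never overflows a long.
    static int hash(String s, int tableSize) {
        long h = 0;
        for (int i = 0; i < s.length(); i++) {
            h = (h * 37 + s.charAt(i)) % tableSize;
        }
        return (int) h;
    }

    public static void main(String[] args) {
        System.out.println(hash("Haus", 75_000));
        System.out.println(hash("Hausaufgabe", 75_000));
    }
}
```

Every character contributes to the result, so words sharing a long prefix still tend to land at different indices. The loop also shows where the cost comes from: one multiplication and one remainder per character.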

The Java library hash function

The Java library provides a method hashCode for classes that are expected to be used as keys in hash tables, e.g. String, Integer, and so on. This maps the value to a 32-bit integer. Reducing this modulo the hash table size gives a good hash function, provided you make sure the result is non-negative, since hashCode can return a negative number.

If the hash table size is L = 2^k (which it always is in the HashMap class), then reducing mod L is the same as taking the k lowest-order bits, which is the same as doing a bitwise AND with L-1. That is the explanation of the code in the method goodMod.
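A small sketch of the bit trick, assuming a capacity that is a power of two. The names PowerOfTwoIndex and indexFor are made up here and are not the course's goodMod method. The comparison is against Math.floorMod rather than the % operator, because % in Java returns a negative result for a negative hash code, whereas the AND is never negative.

```java
// Sketch only: index computation for a table whose capacity is 2^k.
class PowerOfTwoIndex {
    // Keep the k lowest-order bits of hashCode, where capacity == 2^k.
    static int indexFor(int hashCode, int capacity) {
        return hashCode & (capacity - 1);
    }

    public static void main(String[] args) {
        int capacity = 1 << 4;  // 16 = 2^4, so capacity - 1 is binary 1111
        int[] samples = { "Haus".hashCode(), 37, -7, Integer.MIN_VALUE };
        for (int h : samples) {
            // The two columns agree for every h, including negative values.
            System.out.println(indexFor(h, capacity) + " " + Math.floorMod(h, capacity));
        }
    }
}
```

For example, -7 & 15 is 9, which matches Math.floorMod(-7, 16), while -7 % 16 is -7 and would not be a valid array index.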

equals() and hashCode() for complex data structures:

Java provides an equals(X) method and a hashCode() method for an arbitrary object. However, the default is that both these methods are based on the address in memory of the object.

Example: TestEqualLists1.java

Sometimes this is what you want, but often it is not. For example, you would like two linked lists to be considered equal if they have the identical sequence of elements, and you would like this sense of ``equals'' to be used by all the functions, including library functions, that call the equals method. Likewise, you might like to use a list like [1, 5, 8] as a key, and then look it up with a different list [1, 5, 8], without it having to be the same actual object.

The solution is to override the equals(X) and the hashCode() methods: Example using lists of ints: TestEqualLists2.java

Your new equals(X) and hashCode() methods must satisfy the following constraints; otherwise the functions that call these (lots of things, in the case of equals(); hash tables in the case of hashCode()) will fail in strange and unpredictable ways.

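The method equals(X) must be an equivalence relation. That is:
  1. Reflexive: X.equals(X) must return true.
  2. Symmetric: if X.equals(Y) returns true, then Y.equals(X) must return true.
  3. Transitive: if X.equals(Y) and Y.equals(Z) both return true, then X.equals(Z) must return true.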

The method hashCode() must be compatible with equals(X). That is, if X.equals(Y) returns true, then X.hashCode() and Y.hashCode() must be equal.

If you do want equality in the sense of ``the identical object'', then you can always use X == Y.
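As an illustration of overriding both methods consistently, here is a sketch of a small class that wraps a sequence of ints. The class IntList and the demo class are hypothetical and are not the code in TestEqualLists2.java; the sketch uses an array and the library helpers Arrays.equals and Arrays.hashCode rather than a hand-written linked list.

```java
import java.util.Arrays;
import java.util.HashMap;

// Hypothetical wrapper around a sequence of ints; not TestEqualLists2.java.
final class IntList {
    private final int[] items;

    IntList(int... items) {
        this.items = items.clone();  // defensive copy so a key cannot change later
    }

    @Override
    public boolean equals(Object other) {
        if (this == other) return true;                 // same object
        if (!(other instanceof IntList)) return false;  // also rejects null
        return Arrays.equals(items, ((IntList) other).items);
    }

    @Override
    public int hashCode() {
        // Depends only on the contents, so equal lists get equal hash codes.
        return Arrays.hashCode(items);
    }
}

class IntListDemo {
    public static void main(String[] args) {
        HashMap<IntList, String> table = new HashMap<>();
        table.put(new IntList(1, 5, 8), "found");
        // A different object with the same contents finds the entry.
        System.out.println(table.get(new IntList(1, 5, 8)));
        // == still compares object identity.
        System.out.println(new IntList(1, 5, 8) == new IntList(1, 5, 8));
    }
}
```

If hashCode were left at its default while equals was overridden, the lookup in main would usually miss, because the two equal lists would be sent to different indices.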

For complex data structures or mathematical entities, the question of what it means for two things to be the same, and how you compute that, can be a deep and difficult one.