(Updated January 14, 2025)
Table of contents
-
Overview
Hashing
Hash Table
Hash Function
Implementation
Collision Resolution
Load Factor
Hash Map
Improving Hashes
Complete Examples
Exercises
Overview
We have seen arrays and their power. They can store vast knowledge, perform complex operations like sorting, and be relatively fast using tools like binary search.
The next step after that, from an OOP perspective, was the Vector
(and ArrayList
!) with many operations bolted onto the data itself.
Then, we branched out to the linked linear list. This novel concept allowed us to overcome some array limitations, like managing the array contents, which complicates adding and removing data while maintaining array continuity.
However, both arrays and linked lists still suffer from scalability issues. The more data we put into them, the slower they can become without additional steps. In other words, unless we keep sorting the array, we cannot use the O(log2n) binary search, and we are stuck with the O(n) linear search.
The same is true for the linked list. Sorting that data structure would be mind-numbing, so we are left with the O(n)
Hashing
Hashing is a pretty old concept authored by H. P. Luhn at IBM in 1953. The original idea was to find new ways of rapidly searching large piles of data to determine if they contained a specific value. The access speed approached O(1) (constant) by storing the data in memory using clever calculations. Remember that sequential search is O(n) and binary search is O(log2n). Sorting arrays helped make things faster, but linked lists ruined that with their disconnected nature of construction.
Of course, this was not the only value to hashing. It has become the basis of how we can validate something as authentic or take two huge items and determine if they are identical. You may have heard of MD5, SHA1, SHA256, etc. These are all forms of hashing by creating a reproducible and unique value that can be used to validate the contents of something as genuine.
These specific hashing algorithms were designed in such a way as to (potentially) never have a collision. In other words, no two distinct files should ever produce the same hash. This is quite a mathematical challenge and MD5 has already been proven broken, SHA1 has been shown to be vulnerable.
So collisions are bad, right?
Well, it depends. Collisions are bad if you are looking for something that provides reasonable assurance that an entity’s contents are genuine and unaltered or if your goals involve cryptography. Otherwise, they are perfectly fine and welcome if your requirements are not that strict. We can handle them.
Hash Table
A hash table is a dictionary where keys are mapped to array positions. This is done by taking the key and applying a mathematical hash function. The function is always designed to return the same hash value for the given data, which is then used as an index into the array or massaged further into a value that can represent a bucket.
The array is a collection of buckets. The buckets are exactly what you think – containers capable of holding things—plural, meaning multiple. Again, we do not generally care if collisions occur, but it would be best to keep their number low.
Hash Function
So, how does hashing work? First, imagine a mathematical function that can take data as input and produce a numeric value. Then, imagine an array as a line of buckets. We can always find a specific bucket since we know the array’s length. This is part of the overall algorithm – no matter what happens in the math, the final result will always be in the range 0
to n-1
, inclusive, for an array of size n
.
How does this happen? Well, we take the hash value modulo the number of buckets.
Word is "PICKLE". Number of buckets is 181. "PICKLE" --> [Hash Function] --> 37682376 % 181 --> 167 So, "PICKLE" lives in bucket 167.
The math is not the critical piece here (yet). We send the word to the hash function, producing an unsigned value. Then we took that value modulo the number of buckets, and we have the bucket that word will live in!
We could send the word and the number of buckets to the hash function if we wanted.
("PICKLE", 181) --> [Hash Function] --> 167
The hash value is the bucket! It is up to you, the programmer, to decide what is best for implementation.
Here is a version of a hashing function based on Kernighan and Ritchie’s C Programming Language 2nd edition (q.v.):
public static int hashValue(String s) {
int h = 0;
int x, l = s.length();
for ( x = 0; x < l; x++ )
h = (h << 5) - h + s.charAt(x); // h * 31 + c
if ( h < 0 )
h = -h;
return h;
}
A simple hash function for strings.
Now, in the hashValue()
method presented, make no assumptions about the data and focus on the application of the procedure:
- Start by setting the hash value,
h
, to zero. - For each
char
in the string, multiplyh
by31
and add the nextchar
value. (This version uses a bit rotation, q.v.) - When we're done, if
h
is negative, make it positive. - Return the hash value.
How does the hash value become negative? Remember that a large enough calculation will eventually exceed the space provided and overflow. The long
type could be chosen to try reducing the collision space due to overflows.
Is it proper to ensure the value is not negative? It was a design choice in this case. This would not be an issue in C implementations since we could use unsigned storage.
Now we will see how this works:
public class HashingAround {
/**
* Hash a string
*
* @param s string to be hashed
* @return hash value
*/
public static int hashValue(String s) {
int h = 0;
int x, l = s.length();
for ( x = 0; x < l; x++ )
h = (h << 5) - h + s.charAt(x);
if ( h < 0 )
h = -h;
return h;
}
public static void main(String[] args) {
final int BUCKETS = 181;
String[] words = {"XYLOPHONE", "POTATO", "BOOTPRINT", "RING", "Ring", "ring", "COMMIT"};
for ( String w : words ){
int hv = hashValue(w);
int b = hv % BUCKETS;
System.out.println(w + " has a hash value of " + hv + " and exists in bucket " + b + ".");
}
}
}
Example 1: Testing the hash function.
The output is shown below:
XYLOPHONE has a hash value of 1417705354 and exists in bucket 48. POTATO has a hash value of 1929109465 and exists in bucket 62. BOOTPRINT has hash value of 789745371 and exists in bucket 17. RING has a hash value of 2515504 and exists in bucket 147. Ring has a hash value of 2547280 and exists in bucket 67. ring has a hash value of 3500592 and exists in bucket 52. COMMIT has a hash value of 1993481527 and exists in bucket 17.
There are specific takeaways from the output:
- The case of the letters in the word matters. Note the differences in RING, Ring and ring.
- Collisions are difficult to "predict." Would you have guessed that BOOTPRINT and COMMIT would be in the same bucket?
- How does the collision rate change with the number of buckets?
Even K&R indicates that this is not a perfect hash model but is easy to remember. It is suitable for many needs, is much better than the first edition, and is far from the only solution available. Additional hashing models are presented in the Improving Hashes section below.
Also, the number of buckets is prime. This is an attempt to reduce the collision rate by choosing a number less likely to be a factor.
Implementation
Up to this point, the only solid code we have seen is a hash function and some hashed words. Now, we will begin the process of storing data in the array!
We will initially assume a collision rate of 0 and use a method known as open addressing. This means we use the resulting value as the index in the array. Consider the following:
static class HashTable {
private String[] buckets;
private int numBuckets;
private int count = 0;
public HashTable(int b) {
numBuckets = b;
buckets = new String[numBuckets];
}
// The KnR_2nd() method is a hash function described later.
public void put(String s) {
int b = KnR_2nd(s, numBuckets);
buckets[b] = s;
System.out.println(s + " added to bucket " + b);
count++;
}
public boolean get(String s) {
int b = KnR_2nd(s, numBuckets);
if (buckets[b] != null)
return true;
return false;
}
public void remove(String s) {
int b = KnR_2nd(s, numBuckets);
buckets[b] = null;
}
public String toString() {
String s = "Hash Table: size " + count + ", buckets " + numBuckets + ":\n";
for (int x = 0; x < numBuckets; x++)
s += "bucket[" + x + "] -> " + buckets[x] + "\n";
return s;
}
}
A very primitive hash table implementation.
There are many assumptions in this code. Not the least of which is that every value will map to a unique location. Hence, the simplicity of the implementation. The operation put()
adds a value to the table, get()
returns whether the value is in the table, and remove()
is responsible for deleting the value if found.
Consider some code that uses this hash table:
public static void main(String[] args) {
final int BUCKETS = 11;
String[] words = {"192.168.10.2", "192.168.2.10", "192.168.1.3","192.168.3.1",
"192.168.1.6", "192.168.6.1"};
HashTable ht = new HashTable(BUCKETS);
for ( String w : words )
ht.put(w);
System.out.println(ht);
}
Put some IP addresses in the table.
This code will store some IP addresses in the hash table. A sample run produces the following:
192.168.10.2 added to bucket 4 192.168.2.10 added to bucket 2 192.168.1.3 added to bucket 10 192.168.3.1 added to bucket 5 192.168.1.6 added to bucket 2 192.168.6.1 added to bucket 6 Hash Table: size 6, buckets 11: bucket[0] -> null bucket[1] -> null bucket[2] -> 192.168.1.6 bucket[3] -> null bucket[4] -> 192.168.10.2 bucket[5] -> 192.168.3.1 bucket[6] -> 192.168.6.1 bucket[7] -> null bucket[8] -> null bucket[9] -> null bucket[10] -> 192.168.1.3
As you can see, we managed to get one collision. We managed 5 out of 6 values getting mapped. Another sample run with the same data but a different hash function produces:
192.168.10.2 added to bucket 6 192.168.2.10 added to bucket 6 192.168.1.3 added to bucket 3 192.168.3.1 added to bucket 3 192.168.1.6 added to bucket 6 192.168.6.1 added to bucket 6 Hash Table: size 6, buckets 11: bucket[0] -> null bucket[1] -> null bucket[2] -> null bucket[3] -> 192.168.3.1 bucket[4] -> null bucket[5] -> null bucket[6] -> 192.168.6.1 bucket[7] -> null bucket[8] -> null bucket[9] -> null bucket[10] -> null
Well, that could have gone better. We were hoping the table would contain 6 values, but we ended up with only 2 due to even more collisions and no collision management.
Now, we will consider some basic ways to address this.
Collision Resolution Schemes
Collision resolution can be achieved by computing another location within the table (open addressing), maintaining a list (chaining), or using a unique form designed by the application writer.
Open Addressing
In the open-addressing model, there are several ways to address collisions. These typically involve finding another location in the hash table array. We will only look at linear probing and double-hashing as possible solutions. A drawback of open addressing is clustering. This happens when multiple collisions look for new places in the hash table. This can rapidly move our performance from O(1) to O(n).
Linear Probing involves performing a linear search from the collision point to find the next empty slot. Since we can check for null
, a straightforward modification allows us to address this rapidly.
public void put(String s) {
int b = KnR_1st(s, numBuckets);
System.out.print(s + " hashed to bucket " + b + ". ");
while (b < numBuckets && buckets[b] != null)
b++;
if (b >= numBuckets)
throw new ArrayIndexOutOfBoundsException();
buckets[b] = s;
System.out.println(s + " added to bucket " + b);
count++;
}
public boolean get(String s) {
int b = KnR_1st(s, numBuckets);
while (b < numBuckets && buckets[b] != null) {
if ( ! buckets[b].equals(s))
return true;
b++;
}
return false;
}
public void remove(String s) {
int b = KnR_1st(s, numBuckets);
while (b < numBuckets && buckets[b] != null) {
if ( buckets[b].equals(s)) {
buckets[b] = null;
count--;
return;
}
b++;
}
}
Code changes that apply linear probing to find the next free slot in the event of a collision.
The above code change results in the following:
192.168.10.2 hashed to bucket 6. 192.168.10.2 added to bucket 6 192.168.2.10 hashed to bucket 6. 192.168.2.10 added to bucket 7 192.168.1.3 hashed to bucket 3. 192.168.1.3 added to bucket 3 192.168.3.1 hashed to bucket 3. 192.168.3.1 added to bucket 4 192.168.1.6 hashed to bucket 6. 192.168.1.6 added to bucket 8 192.168.6.1 hashed to bucket 6. 192.168.6.1 added to bucket 9 Hash Table: size 6, buckets 11: bucket[0] -> null bucket[1] -> null bucket[2] -> null bucket[3] -> 192.168.1.3 bucket[4] -> 192.168.3.1 bucket[5] -> null bucket[6] -> 192.168.10.2 bucket[7] -> 192.168.2.10 bucket[8] -> 192.168.1.6 bucket[9] -> 192.168.6.1 bucket[10] -> null
As you can see in the debug statements, the probe found a new place for the data after detecting the collision. Multiple entries landed in bucket 6, and then we had to walk the array to find the next free slot. This is an example of a high collision rate (or poorly chosen hash function) leading to a less efficient implementation.
Double-Hashing involves a secondary hashing algorithm to find another empty slot. Keep in mind that the second hash may also collide. Combined with linear probing, this may help reduce clustering. The problem with this model is that the put()
, get()
, and remove()
methods become more complex.
Chaining
Chaining allows adding a data structure to the mix—the linked list. Adding a list to the bucket will enable us to walk the chain within the same bucket rather than using linear probing to find another bucket and possibly running out of buckets to check!
Direct Chaining
Direct chaining means that each bucket is the head of a linked list. The goal is to be as fast as possible. When we add a value to the bucket, we take an approach similar to pushing onto a stack. So, the new node always goes to the head of the chain.
Our HashTable
class can be modified like the following:
static class HashTable {
private Node[] buckets;
private int numBuckets;
private int count = 0;
static boolean DEBUG = false;
private static class Node {
String data;
Node next;
public Node(String s) {
data = s;
next = null;
}
}
public HashTable(int b) {
numBuckets = b;
buckets = new Node[numBuckets];
}
/* ... */
}
Code to augment the hash table by encapsulating the data into nodes and then creating primitive lists through an array of nodes.
Separate Chaining
Separate chaining involves a slight modification to the direct chaining model. In this case, we have a value and a chain rather than the bucket being the head of the linked list.
static class HashTable {
private Bucket[] buckets;
private int numBuckets;
private int count = 0;
static boolean DEBUG = false;
private static class Bucket {
String value;
Node chain;
}
private static class Node {
String data;
Node next;
public Node(String s) {
data = s;
next = null;
}
}
public HashTable(int b) {
numBuckets = b;
buckets = new Bucket[numBuckets];
}
/* ... */
}
Abstracted further to have a Bucket
be a single value plus a chain.
The takeaway on separate chaining is the additional testing needed to see if we have filled value
before creating a new node to place on the head of the chain
.
Load Factor
The whole point of the hash table is to provide a relatively fast data structure for storing and searching data. The goal is to try and achieve O(1) time complexity (see Big-O Notation). Remember that arrays and linked lists have linear time complexity or O(n), and linear does not scale well.
Hash tables can be implemented poorly: horrible hashing algorithm, too few buckets, too much data, improper hash application.
Or they can be implemented very well but degrade over time because the data set has grown, but the number of buckets remains constant.
The load factor is the ratio of data to buckets. If you have 45 items and 75 buckets, the load factor is .60. Many implementations try to keep this at or below .75.
What do you do when the load factor is exceeded? You build another hash table with more (often double) buckets and move the data to the new one, re-hashing all the data as you go - because the data will likely be in a different bucket.
From the perspective of the hash table itself, this can all happen behind the scenes. We can add code to the put()
method to inspect the current load factor. This is why we have kept track of the number of data values in the count
instance variable.
Here is a view of the changes we will make to the code:
static class HashTable {
private Node[] buckets;
private int numBuckets;
private int count = 0;
private boolean REHASH = false;
static boolean DEBUG = false;
/* ... */
public void put(String s) {
if ( !REHASH && ((double)count / numBuckets) > .75)
rehash();
int b = KnR_2nd(s, numBuckets);
Node n = new Node(s);
if (get(s))
return;
n.next = buckets[b];
buckets[b] = n;
System.out.println(s + " added to bucket " + b);
count++;
}
/* ... */
private void rehash() {
REHASH = true;
System.out.println(this); // show the current state before the rehashing begins.
Node[] oldBuckets = buckets;
numBuckets *= 2;
buckets = new Node[numBuckets];
count = 0;
for ( int x = 0; x < oldBuckets.length; x++) {
Node p = oldBuckets[x];
while (p != null ) {
put(p.data);
p = p.next;
}
}
REHASH = false;
}
/* ... */
}
An implementation of rehash()
.
We have added the REHASH variable as minimal protection against accidentally starting a new rehash. Once we set that flag, we will not check the load factor until the flag has cleared.
Hash Map
There is also the concept of a hash map. The hash map concept allows recording (key, value) pairs. The application uses the key to record the data to get that value back with the known key. This is a little different yet similar to the hash table. The key is still used to select a place within the table, but the key is also uniquely identifying the value.
In the hash table model, the key is typically also the value. The real key is the hash value produced by the hash function, which determines how to map the key into the data structure.
You are using a hash table when you have single values to be stored and rapidly sought and a hash map when you have grouped components such as key-value pairs.
Improving Hashes
There are so many ways to improve hashing algorithms. The mathematics of which is well beyond the scope of this introductory piece.
This next bit of code provides several hashing schemes simultaneously. They are all viable solutions for string and numeric hashing.
public class SeriouslyHashingAround {
/* K&R 1st edition Hash */
public static int KnR_1st(String s) {
int h = 0;
int x, l = s.length();
for (x = 0; x < l; x++)
h = h + s.charAt(x);
if (h < 0)
h = -h;
return h;
}
/* K&R 2nd edition Hash */
public static int KnR_2nd(String s) {
int h = 0;
int x, l = s.length();
for (x = 0; x < l; x++)
h = (h << 5) - h + s.charAt(x); // h * 31 + c
if (h < 0)
h = -h;
return h;
}
/* sdbm Hash - Historically from Sleepycat's Berkeley DB. */
public static int sdbm(String s) {
int h = 0;
int x, l = s.length();
for (x = 0; x < l; x++)
h = (h << 6) + (h << 16) - h + s.charAt(x); // h * 65599 + c
if (h < 0)
h = -h;
return h;
}
/* Knuth's Multiplicative Hash - good for numeric key distribution. */
public static int Knuth_mult(int i) {
// 1001 1110 0011 0111 0111 1001 1011 0001
return i * 0x9E3779B1;
}
/* Dan Bernstein's djb2 Hash starting at 5381 with 33 multiplier. */
public static int djb2_Bernstein(String s) {
int h = 5381;
int x, l = s.length();
h = 0;
for (x = 0; x < l; x++)
h = ((h << 5) + h) + s.charAt(x); // h * 33 + c
if (h < 0)
h = -h;
return h;
}
public static void main(String[] args) {
final int BUCKETS = 181;
String[] words = {"XYLOPHONE", "POTATO", "BOOTPRINT", "RING", "Ring", "ring", "COMMIT"};
for (String w : words) {
// change the hash function to see how things move around.
int h = sdbm(w);
int b = h % BUCKETS;
System.out.println(w + " has hash value of " + h + " and exists in bucket " + b + ".");
}
}
}
Example 2: Many hash functions to experiment with.
In Example 2, many different hash functions exist. Change the function inside the for
loop to see a different distribution. Be aware the Knuth_mult()
is for numerical values.
Complete Examples
All of the code from this chapter is now presented as complete examples, so you can experiment with many of the concepts.
This first example demonstrates what can happen when you use different hashing methods. Remember that when switching from KnR_2nd()
to KnR_1st()
, there are collisions, and data is lost since there is no collision handling.
public class HashingNoCollisionCheck {
/* Hash wrapper */
public static int hash(String s, int b) {
return KnR_1st(s, b);
}
/* K&R 1st edition Hash */
public static int KnR_1st(String s, int b) {
int h = 0;
int x, l = s.length();
for (x = 0; x < l; x++)
h = h + s.charAt(x);
if (h < 0)
h = -h;
return h % b;
}
/* K&R 2nd edition Hash */
public static int KnR_2nd(String s, int b) {
int h = 0;
int x, l = s.length();
for (x = 0; x < l; x++)
h = (h << 5) - h + s.charAt(x); // h * 31 + c
if (h < 0)
h = -h;
return h % b;
}
static class HashTable {
private String[] buckets;
private int numBuckets;
private int count = 0;
public HashTable(int b) {
numBuckets = b;
buckets = new String[numBuckets];
}
/**
* Add a word to the has table.
*
* @param s string to be placed.
*/
public void put(String s) {
int b = hash(s, numBuckets);
buckets[b] = s;
System.out.println(s + " added to bucket " + b);
count++;
}
/**
* Check a word exists.
*
* @param s string to check
* @return if available
*/
public boolean get(String s) {
int b = hash(s, numBuckets);
if (buckets[b] != null)
return true;
return false;
}
/**
* Remove a word.
*
* @param s word to be removed
*/
public void remove(String s) {
int b = hash(s, numBuckets);
buckets[b] = null;
}
/**
* Convert to a string.
*
* @return string representation of the table.
*/
public String toString() {
String s = "Hash Table: size " + count + ", buckets " + numBuckets + ":\n";
for (int x = 0; x < numBuckets; x++)
s += "bucket[" + x + "] -> " + buckets[x] + "\n";
return s;
}
}
public static void main(String[] args) {
final int BUCKETS = 11;
String[] words = {"192.168.10.2", "192.168.2.10", "192.168.1.3", "192.168.3.1",
"192.168.1.6", "192.168.6.1"};
HashTable ht = new HashTable(BUCKETS);
for (String w : words)
ht.put(w);
System.out.println(ht);
}
}
Example 3: Hashing with no collision check.
The following example has additional code for simple linear probing to resolve collisions. This throws an exception if we exceed the length of the table.
public class LinearProbing {
/* Hash wrapper */
public static int hash(String s, int b) {
return KnR_1st(s, b);
}
/* K&R 1st edition Hash */
public static int KnR_1st(String s, int b) {
int h = 0;
int x, l = s.length();
for (x = 0; x < l; x++)
h = h + s.charAt(x);
if (h < 0)
h = -h;
return h % b;
}
/* K&R 2nd edition Hash */
public static int KnR_2nd(String s, int b) {
int h = 0;
int x, l = s.length();
for (x = 0; x < l; x++)
h = (h << 5) - h + s.charAt(x); // h * 31 + c
if (h < 0)
h = -h;
return h % b;
}
static class HashTable {
private String[] buckets;
private int numBuckets;
private int count = 0;
public HashTable(int b) {
numBuckets = b;
buckets = new String[numBuckets];
}
public void put(String s) {
int b = hash(s, numBuckets);
System.out.print(s + " hashed to bucket " + b + ". ");
while (b < numBuckets && buckets[b] != null)
b++;
if (b >= numBuckets)
throw new ArrayIndexOutOfBoundsException();
buckets[b] = s;
System.out.println(s + " added to bucket " + b);
count++;
}
public boolean get(String s) {
int b = hash(s, numBuckets);
while (b < numBuckets && buckets[b] != null) {
if (!buckets[b].equals(s))
return true;
b++;
}
return false;
}
public void remove(String s) {
int b = hash(s, numBuckets);
while (b < numBuckets && buckets[b] != null) {
if (buckets[b].equals(s)) {
buckets[b] = null;
count--;
return;
}
b++;
}
}
public String toString() {
String s = "Hash Table: size " + count + ", buckets " + numBuckets + ":\n";
for (int x = 0; x < numBuckets; x++)
s += "bucket[" + x + "] -> " + buckets[x] + "\n";
return s;
}
}
public static void main(String[] args) {
final int BUCKETS = 11;
String[] words = {"192.168.10.2", "192.168.2.10", "192.168.1.3", "192.168.3.1",
"192.168.1.6", "192.168.6.1"};
HashTable ht = new HashTable(BUCKETS);
for (String w : words)
ht.put(w);
System.out.println(ht);
}
}
Example 4: Hash table with linear probing to resolve collisions.
Now, we have code that implements direct chaining.
public class DirectChaining {
/* Hash wrapper */
public static int hash(String s, int b) {
return KnR_2nd(s, b);
}
/* K&R 1st edition Hash */
public static int KnR_1st(String s, int b) {
int h = 0;
int x, l = s.length();
for (x = 0; x < l; x++)
h = h + s.charAt(x);
if (h < 0)
h = -h;
return h % b;
}
/* K&R 2nd edition Hash */
public static int KnR_2nd(String s, int b) {
int h = 0;
int x, l = s.length();
for (x = 0; x < l; x++)
h = (h << 5) - h + s.charAt(x); // h * 31 + c
if (h < 0)
h = -h;
return h % b;
}
static class HashTable {
private Node[] buckets;
private int numBuckets;
private int count = 0;
static boolean DEBUG = false;
private static class Node {
String data;
Node next;
public Node(String s) {
data = s;
next = null;
}
}
public HashTable(int b) {
numBuckets = b;
buckets = new Node[numBuckets];
}
public void put(String s) {
int b = hash(s, numBuckets);
Node n = new Node(s);
if (get(s))
return;
n.next = buckets[b];
buckets[b] = n;
System.out.println(s + " added to bucket " + b);
count++;
}
public boolean get(String s) {
int b = hash(s, numBuckets);
Node p = buckets[b];
while (p != null) {
if (p.data.equals(s))
return true;
p = p.next;
}
return false;
}
public void remove(String s) {
int b = hash(s, numBuckets);
Node p = buckets[b];
// empty
if (p == null)
return;
// head of chain
if (p.data.equals(s))
buckets[b] = p.next;
else {
// seek the value
Node cur = p.next, back = p;
while (cur != null) {
if (cur.data.equals(s)) {
back.next = cur.next;
break;
}
back = cur;
cur = cur.next;
}
}
count--;
}
private static String dumpList(Node l) {
Node p = l;
String s = "";
while (p != null) {
s = s + p.data + " ";
p = p.next;
}
return s;
}
public String toString() {
String s = "Hash Table: size " + count + ", buckets " + numBuckets + ":\n";
for (int x = 0; x < numBuckets; x++)
s += "bucket[" + x + "] -> " + dumpList(buckets[x]) + "\n";
return s;
}
}
public static void main(String[] args) {
final int BUCKETS = 11;
String[] words = {"192.168.10.2", "192.168.2.10", "192.168.1.3", "192.168.3.1",
"192.168.1.6", "192.168.6.1"};
HashTable ht = new HashTable(BUCKETS);
for (String w : words)
ht.put(w);
System.out.println(ht);
}
}
Example 5: Hashing using direct chaining.
Note that the direct chaining with the poor distribution KnR_1st()
helps not to lose data. However, we've dumped all the data into just two buckets. A better hash function will be needed.
192.168.10.2 added to bucket 6 192.168.2.10 added to bucket 6 192.168.1.3 added to bucket 3 192.168.3.1 added to bucket 3 192.168.1.6 added to bucket 6 192.168.6.1 added to bucket 6 Hash Table: size 6, buckets 11: bucket[0] -> bucket[1] -> bucket[2] -> bucket[3] -> 192.168.3.1 192.168.1.3 bucket[4] -> bucket[5] -> bucket[6] -> 192.168.6.1 192.168.1.6 192.168.2.10 192.168.10.2 bucket[7] -> bucket[8] -> bucket[9] -> bucket[10] ->
Finally, direct chaining with an implemented load factor detection and redistribution.
public class DirectChainLoadFactor {
/* Hash wrapper */
public static int hash(String s, int b) {
return KnR_2nd(s, b);
}
/* K&R 1st edition Hash */
public static int KnR_1st(String s, int b) {
int h = 0;
int x, l = s.length();
for (x = 0; x < l; x++)
h = h + s.charAt(x);
if (h < 0)
h = -h;
return h % b;
}
/* K&R 2nd edition Hash */
public static int KnR_2nd(String s, int b) {
int h = 0;
int x, l = s.length();
for (x = 0; x < l; x++)
h = (h << 5) - h + s.charAt(x); // h * 31 + c
if (h < 0)
h = -h;
return h % b;
}
static class HashTable {
private Node[] buckets;
private int numBuckets;
private int count = 0;
private boolean REHASH = false;
static boolean DEBUG = false;
private static class Node {
String data;
Node next;
public Node(String s) {
data = s;
next = null;
}
}
public HashTable(int b) {
numBuckets = b;
buckets = new Node[numBuckets];
}
public void put(String s) {
if (!REHASH && ((double) count / numBuckets) > .75)
rehash();
int b = hash(s, numBuckets);
Node n = new Node(s);
if (get(s))
return;
n.next = buckets[b];
buckets[b] = n;
System.out.println(s + " added to bucket " + b);
count++;
}
public boolean get(String s) {
int b = hash(s, numBuckets);
Node p = buckets[b];
while (p != null) {
if (p.data.equals(s))
return true;
p = p.next;
}
return false;
}
public void remove(String s) {
int b = hash(s, numBuckets);
Node p = buckets[b];
// empty
if (p == null)
return;
// head of chain
if (p.data.equals(s))
buckets[b] = p.next;
else {
// seek the value
Node cur = p.next, back = p;
while (cur != null) {
if (cur.data.equals(s)) {
back.next = cur.next;
break;
}
back = cur;
cur = cur.next;
}
}
count--;
}
private void rehash() {
REHASH = true;
System.out.println(this); // show the current state before the rehashing begins.
Node[] oldBuckets = buckets;
numBuckets *= 2;
buckets = new Node[numBuckets];
count = 0;
for (int x = 0; x < oldBuckets.length; x++) {
Node p = oldBuckets[x];
while (p != null) {
put(p.data);
p = p.next;
}
}
REHASH = false;
}
private static String dumpList(Node l) {
Node p = l;
String s = "";
while (p != null) {
s = s + p.data + " ";
p = p.next;
}
return s;
}
public String toString() {
String s = "Hash Table: size " + count + ", buckets " + numBuckets + ":\n";
for (int x = 0; x < numBuckets; x++)
s += "bucket[" + x + "] -> " + dumpList(buckets[x]) + "\n";
return s;
}
}
public static void main(String[] args) {
final int BUCKETS = 8;
String[] words = {"192.168.10.2", "192.168.2.10", "192.168.1.3", "192.168.3.1",
"192.168.1.6", "192.168.6.1", "192.168.1.2", "192.168.2.1"};
HashTable ht = new HashTable(BUCKETS);
for (String w : words)
ht.put(w);
System.out.println(ht);
}
}
Example 6: Hashing using direct chaining with rehash on high load factor.
Exercises
- (Beginner) Generate some random strings and a set of buckets such that you remain below the .75 load factor. Measure the time it takes with each hashing method to determine the fastest hash, not the most efficient.
- (Intermediate) Examine the differences in hash methods in
SeriouslyHashingAround.java
above. Determine which method is most effective with word or IP address data. How does the spread of items look? What happens when the data set is increased? Buckets increased? Examine the data and see if you can predict when the style of data, hashing model, and number of buckets will change. - (Advanced) Perform time trials of linear probing and direct chaining. Experiment with different data sizes and hash methods. Determine if liner probing or direct chaining has the upper hand in any specific combinations.