A Hash Table in C#

A hash table (or hash map) is a data structure that represents a collection of key-value pairs. This allows client code to refer to elements stored in the hash table by name instead of just by integer index. More generally, this description applies to what’s called an associative array (named because keys are associated with values). A hash table in particular is a means by which an associative array can be made very efficient, using some clever tricks.

A hash table’s structure looks something like this:hashtable

On the left, we see an array of pointers to lists (more generally ‘buckets,’ because we could use another data structure here). When we add a new item to the hash table, we put its key through a hash function. The hash function returns the integer index of the bucket into which we place the item. When we want to look up the value associated with a key, or determine whether that key exists, we take the hash again, then search the bucket and see if we can’t find the item with the appropriate key.

All that hashing and bucketing sounds complicated, but it saves quite a bit of time. Looking up an item in a naive associative array would mean starting at the beginning and going through every key until we found the one we were after. With a hash table, we only have to search as many items as are in our key’s bucket.

Now, this should make the hash table faster, but if we choose our hash function poorly, we can still end up with something very slow. For instance, if our hash function disregards the input and returns 7, we’ll have a working hash table in which all the keys are stored in bucket 7, and we’re back to the performance of the naive associative array. A good hash function should distribute items evenly across the buckets, but a pathological choice of keys can still result in slow behavior.

We also have to make sure that the hash function chooses only those integers we have available as indices. An easy way to do that is by reducing the result of an arbitrary hash function modulo n, where n is number of buckets. If our number of buckets is a power of 2, we can also just use the least significant bits of the result of the hash to achieve the same result more quickly.

Ideally, we never have two items stored in the same bucket. When an item is added to an occupied bucket, we call this a ‘hash collision.’ If our hash table stores a fixed set of items, we can find a hash function which gives us no collisions, for a very fast lookup. If we have to store unknown, arbitrary values, we try to choose a function which results in relatively few collisions.

As the number of items stored in the table grows, the odds of a hash collision increase surprisingly fast (see the birthday paradox). We can mitigate this by expanding the array of available indices as the hash table fills. Expanding the array is an expensive operation, because we have to re-hash the contents of the table – we may either re-hash everything at once, or maintain the table as it was while simultaneously allocating another, larger hash table and determining when and how to move the old entries to the new. My example will not expand the hash table.

Here’s the code:

I also added a method to delete elements from the linked list by value, and return the deleted item. The existing Delete() method got a return value, too, just for parity:

And the client code to test it:

Which gives the expected output: