Monthly Archives: November 2013

Data Structures Retrospective

The hash table was the last data structure on the list that inspired this project (linked lists, binary trees, tries, stacks, queues, array lists, hash tables). Next up is algorithms; breadth-first, depth-first and binary search; merge and quicksort; and various tree algorithms like find and insert.

During this project, I’ve become much more comfortable with generics in C# and Java, and re-familiarized myself with Visual Studio and Eclipse. I like Studio better, which I expected based on my more recent familiarity, but it’s also a much faster environment. Because my C# IDE is more comfortable than my Java IDE, I’m coming to prefer C# over Java, even though Java was my first language. This might mean that if I’m looking for a big grey corporate job I’ll prioritize C#.

I hadn’t done much Python before, and I still haven’t. I like the brevity. I don’t like not having autocomplete.

PHP made me more money than the others, back when programming was my primary job duty instead of something we shoehorned into my title to make me happy. But I never got past using Notepad++ instead of an IDE, and that hurt a little bit here. When I have to actually think, it’s really helpful to have the interpreter and the breakpoints and even the autocomplete – the print statements that I use to debug PHP are workable, but it’s like using a hand drill instead of a power tool. Not only does it slow me down, it breaks my flow. I’ll need to look at a heavier stack for PHP development in the future.

A Hash Table in C#

A hash table (or hash map) is a data structure that represents a collection of key-value pairs. This allows client code to refer to elements stored in the hash table by name instead of just by integer index. More generally, this description applies to what’s called an associative array (named because keys are associated with values). A hash table in particular is a means by which an associative array can be made very efficient, using some clever tricks.

A hash table’s structure looks something like this:hashtable

On the left, we see an array of pointers to lists (more generally ‘buckets,’ because we could use another data structure here). When we add a new item to the hash table, we put its key through a hash function. The hash function returns the integer index of the bucket into which we place the item. When we want to look up the value associated with a key, or determine whether that key exists, we take the hash again, then search the bucket and see if we can’t find the item with the appropriate key.

All that hashing and bucketing sounds complicated, but it saves quite a bit of time. Looking up an item in a naive associative array would mean starting at the beginning and going through every key until we found the one we were after. With a hash table, we only have to search as many items as are in our key’s bucket.

Now, this should make the hash table faster, but if we choose our hash function poorly, we can still end up with something very slow. For instance, if our hash function disregards the input and returns 7, we’ll have a working hash table in which all the keys are stored in bucket 7, and we’re back to the performance of the naive associative array. A good hash function should distribute items evenly across the buckets, but a pathological choice of keys can still result in slow behavior.

We also have to make sure that the hash function chooses only those integers we have available as indices. An easy way to do that is by reducing the result of an arbitrary hash function modulo n, where n is number of buckets. If our number of buckets is a power of 2, we can also just use the least significant bits of the result of the hash to achieve the same result more quickly.

Ideally, we never have two items stored in the same bucket. When an item is added to an occupied bucket, we call this a ‘hash collision.’ If our hash table stores a fixed set of items, we can find a hash function which gives us no collisions, for a very fast lookup. If we have to store unknown, arbitrary values, we try to choose a function which results in relatively few collisions.

As the number of items stored in the table grows, the odds of a hash collision increase surprisingly fast (see the birthday paradox). We can mitigate this by expanding the array of available indices as the hash table fills. Expanding the array is an expensive operation, because we have to re-hash the contents of the table – we may either re-hash everything at once, or maintain the table as it was while simultaneously allocating another, larger hash table and determining when and how to move the old entries to the new. My example will not expand the hash table.

Here’s the code:

I also added a method to delete elements from the linked list by value, and return the deleted item. The existing Delete() method got a return value, too, just for parity:

And the client code to test it:

Which gives the expected output:

An Array List in Java

An array list, also called a dynamic array, is a variable-sized data structure to which elements can be added and removed. Like a linked list, it can serve as the underlying data structure behind a stack or a queue.

The simplest way to build a dynamic array is to store the data in a traditional array, which is copied to a new, larger array when it gets full. This means that most insertions are O(1), but that inserting when the underlying array is full becomes an O(N) operation. A real implementation might also de-allocate some of the memory if we remove elements below a certain threshold.

ArrayList

Here’s the code for an array list which allows insertions only at the end and expands by a factor of 2 when it gets full:

Some code to demonstrate the functionality:

And the output:

A Queue in C#

A queue (pronounced as one would the letter ‘Q,’ although the grad student who taught my Data Structures course was fond of saying ‘kway-way’ to remind us of the spelling) is analogous to a stack, but our access to the data is at the element first added, not the element most recently added. It’s just like the line at McDonald’s or at the bank or a queue in Britain, where imitating data structures is a national pastime. Instead of being named ‘push’ and ‘pop,’ the operations are called ‘enqueue’ and ‘dequeue.’

queue

 

Queues are a very ‘fair’ means of determining priorities. As already mentioned, we use them to determine who is served next at a restaurant. A printer uses a queue to determine which job to handle next, in case multiple jobs are submitted while another is being handled.

Like a stack, a queue is implemented atop the scaffold of another data structure, like a list or an array. Since I did the linked list in C#, and we’re back around to C# in this example, I’ve built my toy queue on my own toy linked list, rather than using .Net’s built-in linked list:

As you can see, my earlier laziness in not fleshing out the linked list’s optional functionality made matters a bit more complicated. If the linked list tracked its last element and offered the ability to delete from either end, the dequeue method could be a bit simpler. As it stands, we have to walk through the entire queue every time we want to dequeue anything. It’s as if our maître d’ or bank teller starts at the end of the line, asking each customer “Was anyone here before you?” and determining whom to serve that way.

We also add another little test to our DataStructureTesters project:

Which gives the expected output:

A Stack in Python

A stack is an ordered collection which offers two operations: ‘push,’ adding an item to the stack, and ‘pop,’ to remove and return an item. ‘Pop’ will return the most-recently-pushed item which has not been previously popped. The physical analog is a stack of plates like the one you might find in your kitchen cabinets: supposing you are only strong enough to handle one plate at a time, you’ll always add (‘push’) a plate onto the top of the stack, or remove (‘pop’) the last plate you pushed.

stack

Those are the only two operations you get. It can be counterintuitive, but limiting what you’re able to do with the collection like this is pretty useful. Your web browser two stacks for your ‘back’ and ‘forward’ buttons, popping a page or URL from the from the ‘back’ stack and pushing the current page onto the ‘forward’ stack when you hit the ‘back’ button (or vice-versa). Your computer’s memory contains a ‘call stack’ for every application, which keeps track of the subroutine that application is currently executing. This allows us to call subroutines from other subroutines and return to the point where we left off.

Internally, stacks are usually built on top of lists, either array lists or linked lists. In fact, my linked list from the original post is already a stack, but with the ‘push’ and ‘pop’ operations named ‘insert’ and ‘delete.’ The difference is that a full-featured linked list would probably have some functionality to insert and delete at the tail or perhaps at an arbitrary position in the list, whereas a stack shouldn’t (if it does, it’s not a stack any more).

Since we’ve already implemented a linked list, I’ll take Python’s built-in list structure for granted, and just implement the stack on top of it:

This gives the expected output:

Now, even the casual observer will note that I made this a little more difficult than it could’ve been. Python’s lists already define a “pop,” but using that would’ve been silly.

In a real setting, we’d probably want to implement a ‘peek’ method, which would allow us to get the item atop the stack without popping it. With this toy example, we’d have to pop the item, use it, and push it back onto the stack. We might also want an isEmpty method, or a more graceful behavior when we try to pop from an empty stack (right now, we get a “list index out of range” error).