Hashing

A data object called a symbol table is required to be defined and implemented in many applications, such as compiler/assembler writing. A symbol table is nothing but a set of pairs (name, value), where value represents a collection of attributes associated with the name, and the collection of attributes depends on the program element identified by the name.
 
For example, if a name x is used to identify an array in a program, then the attributes associated with x are the number of dimensions, lower bound and upper bound of each dimension, and element type. Therefore, a symbol table can be thought of as a linear list of pairs (name, value), and we can use a list data object for realizing a symbol table.
 
A symbol table is referred to or accessed frequently for adding a name, or for storing or retrieving the attributes of a name.
 
Therefore, accessing efficiency is a prime concern when designing a symbol table. The most common method of implementing a symbol table is to use a hash table.
 
Hashing is a method of directly computing the index of the table by using a suitable mathematical function called a hash function.
 
Note The hash function operates on the name to be stored in the symbol table, or whose attributes are to be retrieved from the symbol table.
 

If h is a hash function and x is a name, then h(x) gives the index of the table where x, along with its attributes, can be stored. If x is already stored in the table, then h(x) gives the index of the table where it is stored, in order to retrieve the attributes of x from the table.
 
There are various methods of defining a hash function. One is the division method. In this method, we take the sum of the values of the characters, divide it by the size of the table, and take the remainder. This gives us an integer value lying in the range of 0 to (n−1), if the size of the table is n.
 
Another method is the mid-square method. In this method, the identifier is first squared and then the appropriate number of bits from the middle of the square is used as the hash value. Since the middle bits of the square usually depend on all the characters in the identifier, it is expected that different identifiers will result in different values. The number of middle bits that we select depends on the table size. Therefore, if r is the number of middle bits that we are using to form the hash value, then the table size will be 2r. So when we use this method, the table size is required to be a power of 2.
 
A third method is folding, in which the identifier is partitioned into several parts, all but the last part being of the same length. These parts are then added together to obtain the hash value.
 
To store the name or to add attributes of the name, we compute the hash value of the name, and place the name or attributes, as the case may be, at that place in the table whose index is the hash value of the name.
 
To retrieve the attribute values of the name kept in the symbol table, we apply the hash function of the name to that index of the table where we get the attributes of the name. So we find that no comparisons are required to be done; the time required for the retrieval is independent of the table size. The retrieval is possible in a constant amount of time, which will be the time taken for computing the hash function.
 
Therefore a hash table seems to be the best for realization of the symbol table, but there is one problem associated with the hashing, and that is collision.
 
Hash collision occurs when the two identifiers are mapped into the same hash value. This happens because a hash function defines a mapping from a set of valid identifiers to the set of those integers that are used as indices of the table.
 
Therefore we see that the domain of the mapping defined by the hash function is much larger than the range of the mapping, and hence the mapping is of a many-to-one nature. Therefore, when we implement a hash table, a suitable collision-handling mechanism is to be provided, which will be activated when there is a collision.
 
Collision handling involves finding an alternative location for one of the two colliding symbols. For example, if x and y are the different identifiers and h(x = h(y), x and y are the colliding symbols. If x is encountered before y, then the ith entry of the table will be used for accommodating the symbol x, but later on when y comes, there is a hash collision. Therefore we have to find a suitable alternative location either for x or y. This means we can either accommodate y in that location, or we can move x to that location and place y in the ith location of the table.
 
Various methods are available to obtain an alternative location to handle the collision. They differ from each other in the way in which a search is made for an alternative location. The following are commonly used collision-handling techniques:
 

Linear Probing or Linear Open Addressing

In this method, if for an identifier x, h(x) = i, and if the ith location is already occupied, we search for a location close to the ith location by doing a linear search, starting from the (i+1)th location to accommodate x. This means we start from the (i+1)th location and do the linear search until we get an empty location; once we get an empty location we accommodate x there.
 

Rehashing

In rehashing we find an alternative empty location by modifying the hash function and applying the modified hash function to the colliding symbol. For example, if x is the symbol and h(x) = i, and if the ith location is already occupied, then we modify the hash function h to h1, and find out h1(x), if h1(x) = j. If the jth location is empty, then we accommodate x in the jth location. Otherwise, we once again modify h1 to some h2 and repeat the process until the collision is handled. Once the collision is handled, we revert to the original hash function before considering the next symbol.
 

Overflow chaining

Overflow chaining is a method of implementing a hash table in which the collisions are handled automatically. In this method, we use two tables: a symbol table to accommodate identifiers and their attributes, and a hash table, which is an array of pointers pointing to symbol table entries. Each symbol table entry is made of three fields: the first for holding the identifier, the second for holding the attributes, and the third for holding the link or pointer that can be made to point to any symbol table entry. The insertions into the symbol table are done as follows:
 
If x is the symbol to be inserted, it will be added to the next available entry of the symbol table. The hash value of x is then computed. If h(x) = i, then the ith hash table pointer is made to point to the symbol table entry in which x is stored, if the ith hash table pointer does not point to any symbol table entry. If the ith hash table pointer is already pointing to some symbol table entry, then the link field of the symbol table entry containing x is made to point to that symbol table entry to which the ith hash table pointer is pointing to, and the ith hash table pointer is made to point to the symbol entry containing x. This is equivalent to building a linked list on the ith index of the hash table. The retrieval of attributes is done as follows:
 
If x is a symbol, then we obtain h(x), use this value as the index of the hash table, and traverse the list built on this index to get that entry which contains x. A typical hash table implemented using this technique is shown here.
 
The symbols to b stored are x1,y1,z1,x2,y2,z2. The hash function that we use is h(symbol) = (value of first letter of the symbol) mod n, where n is the size of table.
 
if h(x1) = i
 
h(y1) = j
h(z1) = k
 
then
 
h(x2) = i
h(y2) = j
h(z2) = k
 
Therefore, the contents of the symbol table will be the one shown in following Figure.