Huffman Coding

trieAs I have mentioned in a previous blog, it is very important to reduce as much as possible distractions (i.e., email, phone, texting) while at work and to dedicate a percentage (i.e., 10% to 15%) of your daily workday to learn something. On occasions I decide to learn a new technology (e.g., Spring). Sometimes I decide to spend time polishing on topics I know (e.g., Java) while sometimes I need to refresh concepts (e.g., Huffman Codes and Tries) that I learned in school but seldom use at work.

On a daily basis, I try to solve a problem / challenge on HackerRank. The if_the_only_challenges have nothing to do with work so they make me think outside of the box. I fully believe that many technical problems could better be solved by some upfront thinking and design that by just diving into them with the reduce toolset one uses every day. We are all familiar with the saying “When the only tool in your toolkit is a hammer, all your problems look like nails”.

This morning I solved a challenge on HackerRank. A description of an implementation of a trie applied to Huffman coding was described. The challenge required to traverse the tree using a sting of ones and zeroes (i.e., “1001011”) and extract the associated string (i.e., “ABACA”). The original trie only had four nodes corresponding to letters A, B and C.

You could solve the challenge and mode to the next one, but I believe that one challenge a day is enough. The rest of the available time (10% to 15% of the workday) should be spent on understanding the background of the challenge. In this case it was Tries and Huffman codes.

huffman_encodingA trie can be implemented as a type of binary search tree.  In a binary search tree the values are in the nodes. A binary tree starts with the root node which would contain the value for the first entry and two null nodes; one named left and the other right. In Java one could represent a node using the following class:

public class Node {

// **** elements / fields ****

public int    data;

Node          left;

Node          right;

// **** constructor ****

public Node (int data) {

this.data     = data;

this.left     = null;

this.right    = null;

}

public Node() {

this.left     = null;

this.right    = null;

}

// **** methods ****

public int getData () {

return this.data;

}

}

Notwithstanding node balancing and rebalancing, new values are inserted to the left of the root is they are smaller than it. Values greater than the root are inserted to the right. For example; having n in the range [1, 100] and a root of 19, if the next element is an 87, then 87 > 19 so the new element would be inserted in a sub tree to the right of the root.encoding_tree

In a Trie, of interest are the edges of the graph. For example, assuming that the left child is reached with a 0 and the right child with a 1, one could traverse the trie until a node leaf (a child without children (left == right == null)) is reached. At that point, the value of interest would be kept in the contents of the child node.

The following class could be used as a node in a Trie used in a Huffman algorithm:

public class Node {

char   element = 0x00;      // character for a leaf node

int    weight;              // weight of the subtree rooted at this node

Node   left;                // reference to the left subtree

Node   right;                     // reference to the right subtree

String code = “”;           // code of this node from the root

/*

* constructor for an empty node

*/

public Node() {

}

/*

* constructor for specified weight and character

*/

public Node(int weight, char element) {

this.weight   = weight;

this.element  = element;

}

}

Huffman codes are typically used for compression. A byte (8 bits) is used to represent ASCII (and extended ASCII) characters in the range of (0 – 255). In a nutshell one needs to produce a distribution of the occurrence of each character in a language (e.g., English) or a specified message (“Hello world”). The trie for the entire English language would contain between 128 to 256 nodes (one for each character). The next step is to populate a trie with the values in the distribution.

Each character in ASCII would have a unique path from the root of the tree to the leaf node holding the character. The edges which are labeled with a 0 or a 1 are used to uniquely associate each ASCII character with a pattern of 1s and 0s. The idea is that the most commonly used characters (e.g., ‘r’, ‘s’, ‘t’, ‘l’, ‘n’ and ‘e’) would have representations using less than eight bits (one byte).

frequency_tableTo uncompress the message, the trie is traversed from the root following the sequence of 1s and 0s down to a leaf node. The leaf node contains the actual character. The process is repeated until the entire message is reproduced.

Of course, in actual practice there are many other concerns (i.e., incorrect bit) that could derail the expansion of a sequence.

In conclusion, if you decide to practice with sites like HackerRank, make sure you spend the additional time learning more and experimenting on your own with the subject of the challenge. It is possible, but highly unlikely, that one would find the exact same condition with a problem at work. Expand your toolkit past the hammer.

Regards;

John

John.canessa@gmail.com

Leave a Reply

Your email address will not be published. Required fields are marked *