I am always interested in learning new and refreshing on known algorithms. I looked at a problem in HackerRank which at this time is not relevant. The problem involves looking for sub strings in a string. There are many algorithms that could be used, but as the saying goes, there are many ways to skin a cat, but I always say that one and only one is the best.
I am not going to discuss the some possible approaches to solve the problem in this post. I will take a look at Tries which might be close. Will build a node for the trie, a trie and then will attempt to figure out if by modifying the node and possibly the trie, I can solve the problem. If we are not able to solve the challenge (I do not expect to solve it with a base Trie), will cover a better approach in the next post.
On a separate note, a month or so ago I was vacationing in Italy with family and friends. On a couple occasions I had an Italian dessert named cassata. Coming from an Italian family, I enjoyed many times this desert. For some reason or another we have not had the opportunity to make it at home.
Earlier today I was going over new videos on YouTube. I just saw one that uses panettone (another Italian dessert which is made for the holidays) as the base for the cassata. Discussed it with my wife over breakfast and we decided to get all the ingredients for the weekend so we can make our own cassata. If you are interested you can find the YouTube video here. The title for the video is “CASSATA DI PANETTONE ricetta veloce senza cottura – Tutti a Tavola”. The video is in Italian but you can probably get it translated or with captions in English.
Let’s get to the subject for this post. Let’s start by looking at what is a Trie and why would we use one. A trie is a type of tree. A Trie is based on nodes which have three basic fields. One holds a character the second is used to refer to its children and the last a flag that indicates that a word is complete. If you just have a list of words in a dictionary you can find them by specifying the word and if it was properly spelled, locate it. When you use a dictionary, you are able to add characters and move up and down until you find the correct word which in some cases you might not know how to spell it. A trie helps one mimic the operation / algorithm of looking up words in a dictionary.
A trie starts with an empty node. The character field in the root is not used. The children field can be implemented with different data structures that can be used to map a character to a node. In our example we have inserted two words: ‘cat’ and ‘car’. The children field must hold space to reference to the node for the ‘c’ character (not the actual character). For this we could use an array of 26 nodes if we are only going to use lowercase characters in the English dictionary. The reference to the node holding ‘a’ would be in the first slot, the reference for the node (if any) holding the ‘b’ character in the second slot, and so forth. In our case our array would be holding a reference for the node holding the ‘c’ character in slot three. This would work fine because if we are looking for a node holding the ‘c’ character we can define the index as: ‘c’ – ‘a’ == 2.
We could also use some type of map to map characters to nodes. In Java we could use a HashMap<Character, TrieNode> in which the key is of type Character and the value is of type TrieNode. I decided to name the class for the node for the trie TrieNode. You could as well have used other names (i.e., Node).
Note that the first node holds the ‘c’ and we flag it as false because the word we are inserting is “cat”. The process repeats to insert the node for the letter ‘a’. One more time for the node holding the letter ‘t’. Note that we have added a node without a character in which we have set the end of word flag to true.
I have read and watched videos that show the extra node. Not sure why it is there. I would simply set the flag to true in the node for letter ‘t’ to indicate that the word “cat” is complete. The same holds for the word “car”. Note that after the “ca” we need to add a new node to hold the letter ‘r’. So we add a new child in the node for the letter ‘r’ which references the node holding the letter ‘r’.
My question is, if we are using a lowercase alphabet, we have 26 slots for the characters. Which character represents the node we added at the end of each complete word? I figured we could use a ‘\000’ (null character) so we actually need an array of 27 entries. Where do we put the reference for the blank node? To keep things consistent we could place it in position 26 so we would not have to change our mapping.
I spent the time implementing such approach. Besides added complexity I did not find any benefits. If you do please let me know. On a side note, the few videos that show the additional node do not seem to include an implementation. I wonder why.
Moving on, we will not use an extra node or an additional special character to reference the leaf (empty) node. We will just set the flag to true for the characters that indicate an end of word.
In this diagram we have eliminated the node that indicates the end of a word. We have also added a couple more words to our trie: “done” and “do”. When we inserted the word “done”, we set all flags to false with the exception in the last node which is for the character ‘e’. Later when we insert the word “do” we find that there are nodes for characters ‘d’ and ‘o’. The only thing we need to do is to set the flag from false to true in the node for letter ‘o’. As you can see, we no longer require special containers to hold a value for the empty nodes.
6 cat car done do trie try deleteWords <<< list: [car, cat, do, done, trie, try] deleteWords >>> word to delete [single <Enter> to exit]: checkWord <<< list: [car, cat, do, done, trie, try] checkWord <<< ['-' remove last char, '=' clear word or '*' exit] checkWord >>> SINGLE character followed by <Enter>: c checkWord <<< list: [car, cat, do, done, trie, try] checkWord <<< ['-' remove last char, '=' clear word or '*' exit] checkWord >>> SINGLE character followed by <Enter>: ca checkWord <<< list: [car, cat, do, done, trie, try] checkWord <<< ['-' remove last char, '=' clear word or '*' exit] checkWord >>> SINGLE character followed by <Enter>: car checkWord <<< string ==>car<== is a complete word checkWord <<< list: [car, cat, do, done, trie, try] checkWord <<< ['-' remove last char, '=' clear word or '*' exit] checkWord >>> SINGLE character followed by <Enter>: car= checkWord <<< list: [car, cat, do, done, trie, try] checkWord <<< ['-' remove last char, '=' clear word or '*' exit] checkWord >>> SINGLE character followed by <Enter>: c checkWord <<< list: [car, cat, do, done, trie, try] checkWord <<< ['-' remove last char, '=' clear word or '*' exit] checkWord >>> SINGLE character followed by <Enter>: ca checkWord <<< list: [car, cat, do, done, trie, try] checkWord <<< ['-' remove last char, '=' clear word or '*' exit] checkWord >>> SINGLE character followed by <Enter>: cat checkWord <<< string ==>cat<== is a complete word checkWord <<< list: [car, cat, do, done, trie, try] checkWord <<< ['-' remove last char, '=' clear word or '*' exit] checkWord >>> SINGLE character followed by <Enter>: cat- checkWord <<< list: [car, cat, do, done, trie, try] checkWord <<< ['-' remove last char, '=' clear word or '*' exit] checkWord >>> SINGLE character followed by <Enter>: car checkWord <<< string ==>car<== is a complete word checkWord <<< list: [car, cat, do, done, trie, try] checkWord <<< ['-' remove last char, '=' clear word or '*' exit] checkWord >>> SINGLE character followed by <Enter>: cart checkWord <<< string ==>cart<== not found !!! checkWord <<< list: [car, cat, do, done, trie, try] checkWord <<< ['-' remove last char, '=' clear word or '*' exit] checkWord >>> SINGLE character followed by <Enter>: car*
I started first with some code that I used to experiment with the Trie. Later I modified it to get closer to what I thought would be needed for the HackerRank problem Determining DNA Health which I will attempt to address in a future post. This will become clearer after we take a look at the actual code in the main() function in the Solution.
We enter one character at a time and determine that the word “car” is in our dictionary (trie).
We erase “car” and enter “cat”. The word “cat” is also in our trie.
We then erase the ‘t’ and enter “rt” to form “cart” which is not in our dictionary.
24 look car card loot baba cards baby trie carton tried cat lock tries cake john cake babe try cot aba carmen cats cots cart deleteWords <<< list: [aba, baba, babe, baby, cake, car, card, cards, carmen, cart, carton, cat, cats, cot, cots, john, lock, look, loot, trie, tried, tries, try] deleteWords >>> word to delete [single <Enter> to exit]: checkWord <<< list: [aba, baba, babe, baby, cake, car, card, cards, carmen, cart, carton, cat, cats, cot, cots, john, lock, look, loot, trie, tried, tries, try] checkWord <<< ['-' remove last char, '=' clear word or '*' exit] checkWord >>> SINGLE character followed by <Enter>: *
In this example we insert 24 words. They all appear to go into the trie or not? Note that in the 24 words the word “cake” is repeated. The software will ignore the second attempt to insert the word. This is achieved by first searching for the word and returning if the word is found in the trie. The final count of words in our trie is 23.
/* * Test scaffolding. */ public static void main(String[] args) throws IOException, InterruptedException { // // ???? start reverse string test ???? // Stack<Character> stack = new Stack<Character>(); // String word = "car"; // System.out.print("main <<< word: " + word + " reverse: "); // reverse(word, stack); // // ???? end reverse string test ???? // // ???? start factorial test ???? // System.out.print("main <<< f: "); // int f = sc.nextInt(); // System.out.println("main <<< factorial: " + factorial(f)); // System.exit(-1); // // ???? end factorial test ???? // // ???? start factorial test ???? // System.out.print("main <<< s: "); // int s = sc.nextInt(); // System.out.println("main <<< sum: " + sum(s)); // System.exit(-1); // // ???? end factorial test ???? // **** list of words that may be used in a trie **** ArrayList<String> al = new ArrayList<String>(); // **** get the total number of words to read **** int n = sc.nextInt(); sc.nextLine(); // **** read the words into a string **** String string = sc.nextLine(); // **** split the words **** String[] strings = string.split(" "); // **** add the words into the array list **** for (int i = 0; i < n; i++) { al.add(strings[i]); } // **** check if we did NOT read all the words into the array list **** if (n != al.size()) { System.err.println("main <<< UNEXPECTED n: " + n + " != al.size: " + al.size()); System.exit(-1); } // **** check word as you type one character at a time**** checkWord(al); // **** close scanner **** sc.close(); }
The code starts with a few lines commented out. I was going to remove them but for the time being will leave them in. Please disregard them. They just call a few test methods that I used to refresh and think about recursion.
The code of interest starts by declaring an array list which we will populate with our dictionary of words. The count of words is read and we enter a loop reading the actual words. Note that in the second example we ill ready the word “cake” twice.
After a simple sanity check, we make a call to checkWord() and pass in the array list with our dictionary. In production code we should use a better name than “al” for the array list (e.g., dictionary).
/* * Check word using Trie. */ static void checkWord(ArrayList<String> al) { // **** create Trie with the specified list of words **** Trie trie = new Trie(al); // **** delete words from the trie **** deleteWords(trie); // deleteWordsRec(trie); // **** **** boolean done = false; String string = ""; TrieNode lastNode = trie.getRoot(); // **** loop until done **** while (!done) { // **** generate and display list of trie words **** List<String> list = new ArrayList<String>(); trie.genList(trie.getRoot(), list); System.out.println("checkWord <<< list: " + list.toString()); // **** prompt for the next character **** System.out.println("checkWord <<< ['-' remove last char, '=' clear word or '*' exit]"); System.out.print("checkWord >>> SINGLE character followed by <Enter>: " + string); // **** get input from the user **** String str = sc.nextLine(); // **** check if no characters were entered **** if (str.length() == 0) continue; // **** check if user entered: '*' (exit) **** if (str.charAt(0) == '*') { done = true; continue; } // **** check if user entered '=' (clear current word) **** if (str.charAt(0) == '=') { string = ""; lastNode = trie.getRoot(); continue; } // **** check if user entered '-' (clear last character) **** if ((string.length() != 0) && (str.charAt(0) == '-')) { string = string.substring(0, string.length() - 1); lastNode = lastNode.prev; if (lastNode.end) { System.out.println("checkWord <<< string ==>" + string + "<== is a complete word"); } continue; } // **** extract first character only (just in case) **** char ch = str.charAt(0); // **** look for the character in the last node **** TrieNode node = lastNode.children.get(ch); // **** character NOT found **** if (node == null) { System.out.println("checkWord <<< string ==>" + string + ch + "<== not found !!!"); } // **** character found **** else { // **** update the last node in the trie **** lastNode = node; // **** append character to string **** string += ch; // **** check if we have a complete word **** if (lastNode.end) { System.out.println("checkWord <<< string ==>" + string + "<== is a complete word"); } } } }
The function starts by populating the trie with the array list.
I have currently commented out the deleteWordRec() function. The deleteWordsRec() is work in progress.
6 cat car done do trie try deleteWords <<< list: [car, cat, do, done, trie, try] deleteWords >>> word to delete [single <Enter> to exit]: trie deleteWords <<< list: [car, cat, do, done, try] deleteWords >>> word to delete [single <Enter> to exit]: try deleteWords <<< list: [car, cat, do, done] deleteWords >>> word to delete [single <Enter> to exit]: checkWord <<< list: [car, cat, do, done] checkWord <<< ['-' remove last char, '=' clear word or '*' exit] checkWord >>> SINGLE character followed by <Enter>:
Since in our second diagram we did not use the words ”trie” or “try” we remove them from our trie. The set of prompts illustrate the current contents of the trie. Initially all six words are in. When done we end up with only four words.
Back on the code, after deleting some words from our trie, we enter a loop used to process a character at a time. The reason for this is to test if we can verify if the next character is valid or not without having to start at the root.
The code gets the input character (must be followed by the <Enter> key) and checks if it represents a command. If not it checks if the character is a child in the last node. If it is, the last node is updated and the process repeats. If the last character is not a child, it displays an error message.
import java.util.TreeMap; /* * Node for our Trie. */ public class TrieNode { // **** members **** public char ch; TreeMap<Character, TrieNode> children; boolean end; String word; TrieNode prev; /* * Constructor with character. */ public TrieNode(char ch) { this.ch = ch; this.children = new TreeMap<Character, TrieNode>(); this.end = false; this.prev = null; this.word = ""; } /* * Constructor without character. */ public TrieNode() { this.children = new TreeMap<Character, TrieNode>(); this.end = false; this.prev = null; this.word = ""; } /* * Gets the child node associated with this character. */ public TrieNode getChild(char ch) { if (this.children != null) { return this.children.get(ch); } else { return null; } } /* * Add child to children. */ public void addChild(TrieNode child) { children.put(child.ch, child); } /* * Returns a string of the specified node. */ public String toString() { // **** for efficiency **** StringBuilder sb = new StringBuilder(); // **** **** sb.append("ch: " + this.ch); sb.append(" word ==>" + this.word + "<=="); sb.append(" endOfWord: " + this.end); // **** return a string **** return sb.toString(); } }
The TrieNode class shows the expected members. I decided not to use an array of TrieNode objects. Instead I went with a TreeMap. The reason is to simplify the display of contents in alphabetical order. Feel free to change the type and experiment with the code.
Note that we have two additional members, word which holds the string of characters from the root to the current node and prev which is used as a reference for the previous node. As you can imagine we can make use of both fields to optimize different operations in the trie.
We have two constructors. One creates an empty node while the other creates a node with a character.
The getChild() method is used to get the child node associated with a character.
The addChild() method can be used to add a child to the specified node.
Finally, the toString() method returns a string with some of the fields in the TrieNode.
import java.util.ArrayList; import java.util.Iterator; import java.util.List; import java.util.Map; import java.util.Set; import java.util.Stack; import java.util.Map.Entry; /* * Trie class. */ public class Trie { // **** root for our trie **** private TrieNode root; // **** **** private Stack<String> tempStack = new Stack<String>(); /* * Constructor empty Trie. */ public Trie() { root = new TrieNode(); tempStack = new Stack<String>(); } /* * Constructor using an array list. */ public Trie(ArrayList<String> al) { // **** **** root = new TrieNode(); tempStack = new Stack<String>(); // **** populate the Trie **** for (int i = 0; i < al.size(); i++) { insert(al.get(i)); } } /* * Insert single (complete) word into Trie. */ public void insert(String str) { // **** check if this word is a duplicate **** if (search(str)) { return; } // **** used in the loop **** TrieNode current = root; StringBuilder sb = new StringBuilder(); // **** traverse the characters in the string **** for (int i = 0; i < str.length(); i++) { // **** get the current character **** char ch = str.charAt(i); // **** add character to string builder **** sb.append(ch); // **** get the child node for this character **** TrieNode node = current.children.get(ch); // **** the character is NOT in the Trie **** if (node == null) { // **** create node for this character **** node = new TrieNode(); // **** set the node **** node.ch = ch; node.prev = current; node.word = sb.toString(); // **** **** current.children.put(ch, node); } // **** update current for next pass **** current = node; } // **** set end of word **** current.end = true; } /* * Search for the specified string in the Trie. Returns true is full word is * found. Returns false if full word is NOT found. */ public boolean search(String str) { // **** start at the root of the trie **** TrieNode current = root; // **** traverse the characters in the string looking them up in the trie **** for (int i = 0; i < str.length(); i++) { // **** get the current character **** char ch = str.charAt(i); // **** get the child node for this character **** TrieNode node = current.children.get(ch); // **** check if this characters was NOT found **** if (node == null) return false; // **** **** current = node; } // **** return if this string is a full word or not **** return current.end; } /* * Return the root node of the Trie. */ public TrieNode getRoot() { return this.root; } /* * Display the contents of the trie. This is a recursive call. */ public void display(TrieNode current, String str) { // **** check if end of word **** if (current.end) System.out.println(str); // **** get the set of child entries **** Set<Entry<Character, TrieNode>> set = current.children.entrySet(); // **** iterate through the entries **** Iterator<Entry<Character, TrieNode>> it = set.iterator(); while (it.hasNext()) { // **** get the next entry **** Entry<Character, TrieNode> entry = it.next(); // **** get the character for this entry **** char ch = entry.getKey(); // **** append the character to the string **** if (ch != '\000') str += ch; // **** process the next entry **** display(entry.getValue(), str); } } /* * Print the contents of the Trie. */ public void printTrie() { // **** to load the stack **** printTrie(getRoot(), ""); // **** dump stack **** while (!tempStack.isEmpty()) System.out.println(tempStack.remove(tempStack.size() - 1)); } /* * Print contents of the Trie. Recursive call. */ public void printTrie(TrieNode node, String s) { // **** **** String strSoFar = s; // **** **** if (node.ch != '\000') strSoFar += String.valueOf(node.ch); // **** **** if (node.end) { // **** make copy of this string (to print in ascending order) **** tempStack.push(strSoFar); // **** **** return; } else { Stack<TrieNode> stack = new Stack<TrieNode>(); Iterator<TrieNode> itr = node.children.values().iterator(); // **** **** while (itr.hasNext()) { stack.add(itr.next()); } // **** display words in ascending order **** while (!stack.empty()) { TrieNode t = stack.pop(); printTrie(t, strSoFar); } } } /* * Get all words with the specified prefix starting at the specified node. This * is a recursive call. Should be a private method. */ public List<String> getAll(String prefix, TrieNode node) { // **** new list **** List<String> al = new ArrayList<>(); // **** add this prefix to the list **** if (node.end) { al.add(prefix); } // **** loop once per character **** for (Map.Entry<Character, TrieNode> child : node.children.entrySet()) { // **** add character to the prefix **** List<String> subSuffix = getAll(prefix + child.getKey(), child.getValue()); // **** add subfix **** al.addAll(subSuffix); } // **** **** return al; } /* * Return a list with all words that start with the specified prefix. If the * prefix is ""; then all words in the trie are returned. */ public List<String> returnAllChildren(String prefix) { // **** **** List<String> al = new ArrayList<>(); // **** start at the root node of the trie **** TrieNode current = root; // **** **** for (int i = 0; i < prefix.length(); i++) { // **** get the current character from the prefix **** char ch = prefix.charAt(i); // **** get the child node for this character **** TrieNode node = current.children.get(ch); // **** check if there is no such character (prefix) **** if (node == null) { return al; } // **** to continue loop with this node **** current = node; } // **** get all words starting at this node with the specified prefix **** return getAll(prefix, current); } /* * Return the node of the last character in the specified string. */ public TrieNode getLastNode(String str) { // **** perform sanity checks **** if (str.length() == 0) return null; // **** current node **** TrieNode current = root; // **** traverse the string **** for (int i = 0; i < str.length(); i++) { // **** get the current char **** char ch = str.charAt(i); // **** get trie node associated with this character **** TrieNode child = current.children.get(ch); // **** something is wrong **** if (child == null) return null; // **** to continue in loop **** current = child; } // **** return node associated with last character of the string **** return current; } /* * Dump contents of trie. This is a recursive call. */ public void dump(TrieNode node, String str) { // **** end condition **** if (node == null) return; // **** print complete word **** if (node.end) { System.out.println(str); } // **** get set of children **** Set<Entry<Character, TrieNode>> set = node.children.entrySet(); // **** loop for all children **** for (Entry<Character, TrieNode> entry : set) { char ch = entry.getKey(); dump(entry.getValue(), str + ch); } } /* * Traverse the trie collecting all strings from end nodes. */ public void genList(TrieNode root, List<String> list) { // **** **** Set<Entry<Character, TrieNode>> set = root.children.entrySet(); // **** **** if (set == null) { return; } // **** **** for (Entry<Character, TrieNode> entry : set) { // **** add word to the list **** if (entry.getValue().end) { list.add(entry.getValue().word); } // **** **** genList(entry.getValue(), list); } } /* * Delete the specified (complete) word from the trie. Returns without message * if word is not found. */ public void delete(String word) { // **** start at the root of the trie **** TrieNode current = this.getRoot(); // **** get to the last node for this word **** for (char ch : word.toCharArray()) { // **** get the child node for the current character in current node **** TrieNode child = current.children.get(ch); // **** check if child not found **** if (child == null) { return; } // **** get ready for the next character **** current = child; } // **** check if this node is NOT an end node **** if (!current.end) { return; } // **** this is no longer going to be a word **** current.end = false; // **** loop deleting nodes (as needed) **** TrieNode prev = current.prev; while (current != root) { // **** reference previous node **** prev = current.prev; // *** get number of children **** int numChildren = current.children.size(); // **** node part of multiple words **** if (numChildren != 0) { break; } // **** character to delete **** char ch = current.ch; // **** remove character from previous node **** prev.children.remove(ch); // **** set current to our previous node **** current = prev; // **** check if this is an end of word **** if (current.end) break; } // **** delete character from root node (if needed) **** if (current == root) { char ch = word.charAt(0); current.children.remove(ch); } } }
The root node is defined in the class.
The following constructors are used to create an empty trie or one that is populated with an array list.
The insert() method is used to insert a complete word into the trie. Note that we make used of the search() method to determine if the complete word is already in the trie.
We then enter a loop that traverses the characters in the string one at a time.
We get the current character from the string and look up the character if it is a child of the current node. If it is not we allocate a new node and populate it. One way or the other we update the current node to get it ready to process the next loop.
When we are done with the loop we set the flag to true to indicate we have a complete word.
The search() method looks for the specified word in the trie. If a complete word is not found the method returns false; otherwise it returns true.
The class contains other methods. Some of them are useful and operational. I believe some of them are experimental and they may not work. If you have trouble making one work please let me know. I will modify the code so the method in question operates properly.
As I mentioned earlier, I was experimenting with Tries to solve a specific problem. In a near future post I will discuss and attempt the solution using this Trie implementation. One way or the other I will attempt to solve the problem.
The entire code for this project can be found in my GitHub repository.
If you have comments or questions regarding this, or any other post in this blog, or if you would like for me to serve of assistance with any phase in the SDLC (Software Development Life Cycle) of a project associated with a product or service, please do not hesitate and leave me a note below. If you prefer, send me a message using the following address: john.canessa@gmail.com. All messages will remain private.
Keep on reading and experimenting. It is the best way to learn, refresh your knowledge and enhance your developer toolset!
John
Follow me on Twitter: @john_canessa