Re-Space

UPDATE – The format for the code for the class TrieNode has been addressed. Sorry for the inconvenience.

It is Friday and my youngest and family will be arriving this afternoon for a weekend long visit. My wife and I are getting ready for their arrival later today. We all enjoy good food so we have an extensive menu planned which includes fugazza, pasta carbonara with lots of pancetta, beef roast with a homemade tomato sauce, limoncello cake and different alcoholic and non-alcoholic beverages.

Let’s get to the main subject of this post. I tackled problem 17.13 from the Cracking the Coding Interview book. The problem is name Re-Space.

Apparently a user accidentally was able to remove all cases and punctuation marks from a lengthy document. The idea is to restore / add spaces. We can use a dictionary. The issue is that some words may not be in it. For such words the book suggests flagging them with underlines (‘_’).

It seems that this problem calls for the use of a Trie. The issue might be that in an interview typically lasting about an hour (so in practice 45 minutes) there does not seem to be much time to develop code for a Trie and put it to work to attempt to solve a problem.

As usual I start by making some drawings on paper or a white board. Once I have a reasonable path defined I move to start coding. I like to use the Test Driven Development (TDD) approach. I will have more to say on TDD in a few paragraphs.

I have not made use of Tries much on software development projects. I have a reasonable understanding and have developed some code for them. Apparently in the Contacts with Trie post in this blog from about three years ago, I solved a Hacker Rank problem that required the use of Tries. The code in that post (and many other older ones) is not properly formatted. The reason is that I changed the software that I use to format code. I believe I have gone back to some old post and updated the format, but I only have done it when someone had a question regarding that specific post.

Earlier this week I was exchanging some messages with a friend and software developer regarding TDD. One of his comments regarding a post in which I mentioned the use of TDD was the lack of unit tests. My response was what do unit testing has to do with TDD?

I am not going to spend much time in this post to explain my position. I will dedicate a separate post in the future to TDD. It seems that most people make an immediate association of TDD with unit testing. In actuality the concept of TDD has nothing to do with unit testing. The idea is to start developing software with a blank slate. Then develop the code that will be able to use the feature once it is designed, implemented and tested. As the developer cycles and observes the results, makes changes to the design and implementation and test them, the result will be a much better and solid product.

For example, let’s assume that I want to develop a new API call for a specific product. I first attempt to make the call with some arguments that may or may not be included in the released version which will fail because the server knows nothing about it. I then start with the implementation and continue testing (e.g., manually invoking the API using curl) as needed. In some passes I might add or remove parameters. Perhaps edit the parameters to make the interface better. Perhaps I find something (an issue) with the server when I make my call that has something to do with how the server operates. I would not change the scope of my task, but surely would issue a bug report so the issue would be addressed in the future.

The idea of unit testing is a way to show management that the software was tested. In most cases the tests are so simple that they are of little use. I do know for a fact that Boeing uses Agile and unit testing for their software development. The issue that was discovered after passing all those tests and delivery of the product (the actual airplane) when two planes crashed taking the lives of hundreds of people was not discovered or addressed by unit or system tests. Think about the essence and not the associated procedures of TDD.

The photo shows a mangled string with the text “jesslookedjustliketimherbrother”. This sample came from the book. The results showed that “jess” and “tim” where not in the dictionary.

I drew a diagram of a Trie loaded with the words from a very short and simple dictionary. A Trieis made of a tree. Each node has some information. Given that we are using lower case characters, our nodes should have a mechanism to support 26 characters each. We could use an array, a list or a hash map among others. They all have pros and cons.

Note that that when we encounter the end of a word, we flag it with a square. In this case we do not have words that may continue. All our words are separate in our example. In practice we may have “amazon”, “amazonian”, “amazon prime” and “amazing”. Note that all these words start with the prefix “amaz” which is not a word and continue. Also note that “amazonian” is a continuation of the word “amazon”. In the examples I used to test my solution I tried more complicated words in order to verify that all was well. Hopefully now you can better understand the essence of TDD.

On the top right hand I wrote some initial code to use as the first pass of a solution. We need the mangled string and a Trie holding the words for the dictionary.

We need to somehow traverse the mangle words, determine which are valid and print them followed by a space. We also need to display the words we did not find and flag that fact. I decided to use UPPER CASE characters instead of underlining. The solution in the book does not seem to return / print underlined characters. As a matter of fact I did not see a solution that would underline the characters.

At this point I decided to switch to my Visual Studio Code IDE and start implementing some Trie code so I would be able to use it to solve the actual problem. I am not aware of any interview that would provide code for a Trie so you could use it. It is possible but since the purpose of this exercise is to learn, let’s start by writing some Java code to implement a Trie.

/*
 * Node for Tries.
 */
class TrieNode {

	// ***** members ****
	char                                data;
	public HashMap<Character, TrieNode> children; // could use a LinkedList or array
	TrieNode                            parent;
	public boolean                      isEnd;

	// **** constructor ****
	public TrieNode(char c) {
		this.data = c;
		this.children = new HashMap<Character, TrieNode>();
		this.isEnd = false;
	}

	// **** constructor ****
	public TrieNode() {
		this.children = new HashMap<Character, TrieNode>();
	}

	// **** get the child node associated with this character ****
	public TrieNode getChild(char ch) {

		// // ???? ????
		// System.out.println("getChild <<< ch: " + ch);

		// **** ****
		if (children != null) {
			return children.get(ch);
		}

		// **** ****
		return null;
	}

	// **** ****
	protected List<String> getWords() {

		// **** ****
		List<String> list = new ArrayList<String>();

		// **** add this word to the list ****
		if (isEnd) {

			// // ???? ????
			// System.out.println("getWords <<< isEnd: " + isEnd);

			list.add(toString());

			// // ???? ????
			// System.out.println("getWords <<< list: " + list.toString());
		}

		// **** ****
		if (children != null) {

			// **** iterate through the children ****
			Iterator<Entry<Character, TrieNode>> it = children.entrySet().iterator();
			while (it.hasNext()) {

				// **** ****
				Entry<Character, TrieNode> pair = (Map.Entry<Character, TrieNode>) it.next();

				// // ???? ????
				// System.out.println("getWords <<< ch: " + pair.getKey());

				// **** ****
				TrieNode child = (TrieNode) pair.getValue();
				list.addAll(child.getWords());

				// // ???? ????
				// System.out.println("getWords <<< list: " + list.toString());
			}
		}

		// **** ****
		return list;
	}

	// **** ****
	public String toString() {
		if (parent == null) {
			return "";
		} else {
			return parent.toString() + new String(new char[] { data });
		}
	}
}

Given that the Trie is implemented with a tree, we first should implement a class for a TrieNode.

The members show a holder for a character (not a string), a pointer to a parent, a flag to indicate that if one traverses the tree from the root to this node, the resulting word is valid e.g., “amazon”, but one could continue down the Trie and find “amazonian” which would also be a valid word. Note that we decided to use a hash map. The pro is that if we have few words, we would not waste empty space, the con is the inner complexity of the hash map class implementation and that it is not a good candidate to iterate. Perhaps a hash list or array list would be a better / different choice.

We have a couple constructors and a few methods. Note that all methods are not used in the final version of the code. By using the TDD I experimented during development with some approaches which I abandoned. I have written hundreds of libraries and classes for different commercial products that my company used to develop. In such case all the methods / functions may not be used by a particular product but may be used by others. In such case it is good to leave them around in a library. On the other hand, in production code it is better to remove foreign code so the product is easy to follow and understand.

/*
 * Trie class implementation.
 */
class Trie {

	// **** trie root ****
	private TrieNode root;

	// **** constructor ****
	public Trie() {
		root = new TrieNode('*');
	}

	// **** constructor using an array list ****
	public Trie(ArrayList<String> list) {

		// **** ****
		root = new TrieNode();

		// **** ****
		for (String word : list)
			insert(word);
	}

	// **** constructor using an array of strings ****
	public Trie(String[] arr) {

		// **** ****
		root = new TrieNode();

		// **** ****
		for (String word : arr)
			insert(word);
	}

	// **** insert string into trie ****
	public void insert(String str) {

		// **** ****
		if (search(str) == true) {

			// // ???? ????
			// System.out.println("insert <<< str ==>" + str + "<== FOUND");

			// **** nothing else to do ****
			return;
		}

		// ???? ????
		System.out.println("insert <<< str ==>" + str + "<==");

		// **** ****
		TrieNode current = root;
		TrieNode pre;

		// **** ****
		for (char ch : str.toCharArray()) {

			// // ???? ????
			// System.out.println("insert <<< ch: " + ch);

			// **** ****
			pre = current;
			TrieNode child = current.getChild(ch);

			// **** ****
			if (child != null) {

				// // ???? ????
				// System.out.println("insert <<< child != null");

				current = child;
				child.parent = pre;
			} else {

				// // ???? ????
				// System.out.println("insert <<< child == null");

				current.children.put(ch, new TrieNode(ch));
				current = current.getChild(ch);
				current.parent = pre;
			}
		}

		// **** ****
		current.isEnd = true;
	}

	// **** search for string in the trie ****
	public boolean search(String str) {

		// **** start at the root of the trie ****
		TrieNode current = root;

		// **** look for the string ****
		for (char ch : str.toCharArray()) {
			if (current.getChild(ch) == null)
				return false;
			else
				current = current.getChild(ch);
		}

		// **** check if the string is a full word ****
		if (current.isEnd == true)
			return true;
		else
			return false;
	}

	// **** auto complete ****
	public List<String> autoComplete(String prefix) {

		// ???? ????
		System.out.println("autoComplete <<< prefix ==>" + prefix + "<==");

		// **** ****
		TrieNode lastNode = root;

		// **** ****
		for (int i = 0; i < prefix.length(); i++) {

			// // ???? ????
			// System.out.println("autoComplete <<< prefix.charAt(" + i + "): " +
			// prefix.charAt(i));

			// **** ****
			lastNode = lastNode.getChild(prefix.charAt(i));
			if (lastNode == null)
				return new ArrayList<String>();
		}

		// **** ****
		return lastNode.getWords();
	}

	// **** determine if trie contains this prefix ****
	public boolean contains(String prefix, boolean exact) {

		// **** start with the root of the ****
		TrieNode lastNode = getRoot();

		// **** ****
		for (int i = 0; i < prefix.length(); i++) {

			// **** ****
			lastNode = lastNode.getChild(prefix.charAt(i));

			// **** ****
			if (lastNode == null)
				return false;
		}

		// **** ****
		return !exact || lastNode.isEnd;
	}

	// **** determine if this trie contains this prefix ****
	public boolean contains(String prefix) {
		return contains(prefix, false);
	}

	// **** return the root node of the trie ****
	public TrieNode getRoot() {
		return root;
	}

	// **** add this string to the trie ****
	public void add(String str) {
		add(getRoot(), str);
	}

	// **** add this string to trie recursively ****
	private void add(TrieNode root, String str) {

		// **** check for end condition ****
		if (str.length() == 0)
			return;

		// **** get the next character from this string****
		char ch = str.charAt(0);

		// **** get the child associated with this character ****
		TrieNode child = root.getChild(ch);

		// **** add new trie node (if needed) ****
		if (child == null) {
			TrieNode prev = root;
			child = new TrieNode(ch);
			child.parent = prev;
			root.children.put(ch, child);
		}

		// **** check if this is the last character in the string ****
		if (str.length() == 1)
			child.isEnd = true;
		else
			add(child, str.substring(1));
	}
}

The class represents the root with a node. The first constructor implements an empty Trie. I added a ‘*’ so I could tell in the debugger that I was dealing with the root. Keep in mind that our requirements call for the use of lower case characters so this is not an issue. Besides, the root is not used to store a character. The starting characters are in the children nodes which will grow up to 26 entries if using a dictionary with tens of thousands of words.

The following methods are used to populate Tries.

The search() method allows us to search for a full word / string in a Trie. Note that after processing the word / string there is a check to determine if the last character is the end of a word. This method would fail (return false) if se search for the word “amaz” but would succeed if we search for “amazon” and “amazonian”.

The autoComplete() method is one of the reasons to implement and use a Trie. Think about a drop down menu in which a user is entering a word. After each letter is entered, the UI may call autoComplete() to be able to display possible choices. Most IDEs make use of autocomplete features when selecting methods for classes.

The contains() methods may be used to determine if a prefix is or is not a match in a Trie.

The add() methods are used to add a string to a Trie.

public static void main(String[] args) {

		// // **** build array list to populate trie ****
		// ArrayList<String> al = new ArrayList<String>();
		// al.add("amazon");
		// al.add("amazon prime");
		// al.add("amazing");
		// al.add("amazing spider man");
		// al.add("amazed");
		//
		// al.add("alibaba");
		// al.add("ali express");
		// al.add("ebay");
		// al.add("walmart");
		//
		// // **** instantiate new trie ****
		// Trie t = new Trie(al);

		// // **** build array of strings to populate trie ****
		// String[] arr = {"amazon", "amazon prime", "amazing", "amazing spider man",
		// "amazed",
		// "alibaba", "ali express", "ebay", "walmart"};
		//
		// // **** instantiate new trie ****
		// Trie t = new Trie(arr);

		// **** instantiate new trie ****
		Trie t = new Trie();

		// **** add strings to the trie ****
		t.insert("amazon");
		t.insert("amazon prime");
		t.insert("amazing");
		t.insert("amazing spider man");
		t.insert("amazed");

		t.insert("soda");
		t.insert("cake");
		t.insert("soda");
		t.insert("soda pop");

		// **** add more string(s) to the trie recursively ****
		t.add("coca cola");
		t.add("coconut");
		t.add("cacao");
		t.add("cat");
		t.add("catnip");

		// ???? ????
		System.out.println();

		// **** auto complete using this prefix ****
		// List<String> a = t.autoComplete("amaz");
		// List<String> a = t.autoComplete("wall");
		List<String> a = t.autoComplete("c");

		// ???? ????
		System.out.println("\nmain <<< a.size: " + a.size());

		// **** display auto complete strings ****
		for (int i = 0; i < a.size(); i++)
			System.out.println("main <<< a[" + i + "]: " + a.get(i));
		System.out.println();

		// **** trie contains specified prefix ****
		System.out.println("main <<<       contains(\"a\", true): " + t.contains("a", true));
		System.out.println("main <<<      contains(\"a\", false): " + t.contains("a", false) + "\n");

		System.out.println("main <<<    contains(\"amaz\", true): " + t.contains("amaz", true));
		System.out.println("main <<<   contains(\"amaz\", false): " + t.contains("amaz", false) + "\n");

		System.out.println("main <<<          contains(\"amaz\"): " + t.contains("amaz"));
		System.out.println("main <<<          contains(\"amaz\"): " + t.contains("amaz") + "\n");

		System.out.println("main <<<  contains(\"amazon\", true): " + t.contains("amazon", true));
		System.out.println("main <<< contains(\"amazon\", false): " + t.contains("amazon", false) + "\n");

		System.out.println("main <<<        contains(\"amazon\"): " + t.contains("amazon"));
		System.out.println("main <<<        contains(\"amazon\"): " + t.contains("amazon") + "\n");

		System.out.println("main <<<           contains(\"abc\"): " + t.contains("abc"));
		System.out.println("main <<<           contains(\"abc\"): " + t.contains("abc") + "\n");

		// **** mangled string ****
		String mangled = "jesslookedjustliketimherbrother";
		// String mangled = "thisismikesfavoritefood";
		// String mangled = "thisisawesome";
		// String mangled = "doyouliketodrinkcoke";

		// **** words for our dictionary (implemented as a trie) ****
		String[] words = { "just", "this", "favorite", "awesome", "like", "brother", "looked", "food", "is", "her", "to",
				"drink", "do", "you", "soda" };

		// **** instantiate and populate dictionary (a trie) ****
		Trie dict = new Trie(words);
		System.out.println();

		// **** recover the string ****
		String unmangled = recover(mangled, dict);
		System.out.println("main <<< unmangled ==>" + unmangled + "<==");
	}

The main() method is used to test the Trie and lastly to invoke the recover() method used to solve the problem at hand .

The commented out code instantiates an array list and adds words. The words are then added to a Trie when it is instantiated.

The next set of code commented out creates an array of strings and instantiates a Trie which is populated with the words from the string array.

The following code which is not commented out instantiates an empty Trie. We then use the insert() and the add() method to populate the Trie which will be eventually used as a dictionary.

The autoComplete() function is called with the prefix “c”.

We then experiment with the contains() method calling it with different strings with and without the flag used to request and exact match. The exact match implies that the prefix matches of not matches an exact word in the Trie.

Now we get to the problem at hand. We create a mangled string. Some other passes are commented out. We create an array of strings with words and the Trie populated with the words in the array. We then call the recover() method which generates a string which is displayed.

(base) PS C:\Users\John\workspace6\ReSpace> cd 'c:\Users\John\workspace6\ReSpace'; & 
'C:\Users\John\.vscode\extensions\vscjava.vscode-java-debug-0.22.0\scripts\launcher.bat' 'C:\ProgramData\Oracle\Java\jdk1.8.0_221\bin\java' '-Dfile.encoding=UTF-8' '-cp' 
'C:\Users\John\workspace6\ReSpace\bin' 'Solution' 
insert <<< str ==>amazon<==
insert <<< str ==>amazon prime<==
insert <<< str ==>amazing<==
insert <<< str ==>amazing spider man<==
insert <<< str ==>amazed<==
insert <<< str ==>soda<==
insert <<< str ==>cake<==
insert <<< str ==>soda pop<==

autoComplete <<< prefix ==>c<==

main <<< a.size: 6
main <<< a[0]: cacao
main <<< a[1]: cat
main <<< a[2]: catnip
main <<< a[3]: cake
main <<< a[4]: coca cola
main <<< a[5]: coconut

main <<<       contains("a", true): false
main <<<      contains("a", false): true

main <<<    contains("amaz", true): false
main <<<   contains("amaz", false): true

main <<<          contains("amaz"): true
main <<<          contains("amaz"): true

main <<<  contains("amazon", true): true
main <<< contains("amazon", false): true

main <<<        contains("amazon"): true
main <<<        contains("amazon"): true

main <<<           contains("abc"): false
main <<<           contains("abc"): false

insert <<< str ==>just<==
insert <<< str ==>this<==
insert <<< str ==>favorite<==
insert <<< str ==>awesome<==
insert <<< str ==>like<==
insert <<< str ==>brother<==
insert <<< str ==>looked<==
insert <<< str ==>food<==
insert <<< str ==>is<==
insert <<< str ==>her<==
insert <<< str ==>to<==
insert <<< str ==>drink<==
insert <<< str ==>do<==
insert <<< str ==>you<==
insert <<< str ==>soda<==

recover <<< mang ==>jesslookedjustliketimherbrother<==
main <<< unmangled ==>JESS looked just like TIM her brother<== (base) PS C:\Users\John\workspace6\ReSpace> 

The screen capture of the terminal in the Visual Studio Code IDE seems to capture the correct / expected output of our test program. The un mangled string displays the words “JESS” and “TIM” in uppercase to denote that such words were not found in the dictionary. The rest of the words are in lowercase and properly spaced.

	// **** recover mangled string ****
	static String recover(String mang, Trie dict) {

		// **** ****
		StringBuilder sb = new StringBuilder();

		StringBuilder badSB = new StringBuilder();

		// ???? ????
		System.out.println("recover <<< mang ==>" + mang + "<==");

		// **** traverse the mangled string ****
		String str = "";
		for (int i = 0; i < mang.length(); i++) {

			// **** append next character to string ****
			str += mang.substring(i, i + 1);

			// **** look up this string in the dictionary ****
			boolean found = dict.contains(str, false);

			// **** check if this string was NOT found ****
			if (!found) {

				// **** append character(s) to bad string ****
				badSB.append(str);

				// **** clear string ****
				str = "";
			} else {

				// **** check if this is a complete word ****
				boolean complete = dict.contains(str, true);

				// **** check if complete word in dictionary ****
				if (complete) {

					// **** convert to upper case ****
					String upperCase = badSB.toString().toUpperCase();

					// **** append upper case string with or without leading space ****
					if (upperCase.length() != 0) {
						if (sb.length() != 0)
							sb.append(" ");
						sb.append(upperCase);
					}

					// **** clear the string ****
					badSB.setLength(0);

					// **** append string to string builder with or without leading space ****
					if (sb.length() != 0)
						sb.append(" ");
					sb.append(str);

					// **** clear string ****
					str = "";
				}
			}
		}

		// **** append missing bad word (if needed) ****
		if (badSB.length() != 0) {

			// **** append a space (if needed) ****
			if (sb.length() != 0)
				sb.append(" ");

			// **** append the bad word ****
			sb.append(badSB.toString().toUpperCase());
		}

		// **** return the unmangled string ****
		return sb.toString();
	}

The recover() method takes the mangled string and the dictionary expressed as a Trie. The code is implemented as a loop in which characters are looked up in the Trie and assigned to a bad word string which is displayed in UPPERCASE or to a word which is displayed in lowercase.

The last few lines are used to add a missing / bad word to the output string.

I want to make a point. When designing classes for Tries and NodeTrie it is important to place the node related methods in the NodeTrie class and the Trie related methods in the Trie class. In general, when writing code quickly operations intended for the Trie class or for a Tree class may end in the TrieNode or TreeNode class. This is understandable because one is under pressure to solve the problem in a very short time. In production it is important to segregate the proper methods to the proper class.

The entire code for this project can be found in my GitHub repository.

I would like to note that I purposely started to implement the code for this solution using the Eclipse IDE. I then switched to the Visual Studio Code IDE which is able to open the folder holding the Eclipse project without a problem. I then continued implementation and debugging of the code. I was able to go back and forth between IDEs using the same code base. I would say it is very impressive. Microsoft has put a lot of effort to make tools that work well with open source code!

If you have comments or questions regarding this or any other post in this blog, or if you would like me to help with any phase in the SDLC (Software Development Life Cycle) of a product or service, please do not hesitate and leave me a note below. Requests for help will remain private.

Keep on reading and experimenting. It is the best way to learn and refresh your knowledge!

John

Follow me on Twitter:  @john_canessa

One thought on “Re-Space”

Leave a Reply

Your email address will not be published.

This site uses Akismet to reduce spam. Learn how your comment data is processed.