Parse Text

UPDATE – April 09, 2018When I started this post, I was thinking in several follow ups in order to try different approaches and be able to continue to improve on previous passes by adding code or starting from scratch when a new idea came up. I was interested in showing a normal progression that the reader would encounter when developing software. In the days that passed, I decided to limit the subject to a single post. Please let me know if you encounter an issue or would like for me to expand on this entry.

Sometime last week I attempted to solve a simple online challenge. The challenge dealt with parsing a string of text and then obtaining information from the string. There are probably thousands of variants to the challenge. I am not going to cover the exact challenge in this post. I will make my own. Will start simple and will get more complex each time we add a new obstacle.

What I wanted to do with the online challenge was to solve it using the C (not C++ or C#) programming language. Of course that was not a good choice but I wanted to verify that you need to use the right tool for the job. I switched from C to C++ and Java a few times to see if I could get a hint of the data types of the two arguments to the main function / method. This will become clear when we get to the actual code.

In C the main call was something like this:

BoundArrayOfStrings Parse(BoundArrayOfChars string, BoundArrayOfStrings specialWords)

In Java it was something like this:

ArrayList<Something> Parse(String text, array<String> specialWords)

I did not take notes so I cannot recall exactly what the Java method returned.

I was using the Chrome browser. In C is a variable is declared of type BoundArrayOfStrings; which must be a structure, then one would expect that the elements (probably two) would have shown with intellisenseas soon as the ‘.’ Was typed after the variable bas:

BoundArrayOfStrings 	bas;
bas.

It did not. Perhaps something was going on with Chrome and the session. In retrospect I should have logged out or even restarted my machine. Such is life; I did not.

Allow me to show the following source file written in Java:

import java.util.ArrayList;
import java.util.HashMap;
import java.util.Iterator;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Map.Entry;
import java.util.Scanner;
import java.util.TreeMap;

public class Solution {
	/*
	 * constants
	 */
	final static int	COUNT	= 3;
	
	/*
	 * 
	 */
	public static void main(String[] args) {
		
		// **** open scanner ****
		Scanner sc = new Scanner(System.in);
		
		// **** prompt and read the text to parse ****
		String text = "";
		text = "Hi John; welcome to the test. John this is a test string to test generating and operating on a histogram.";
		System.out.print(">>> text: ");
		String buffer = sc.nextLine();
		if (!buffer.isEmpty()) {
			text = buffer;
		}
		System.out.println("<<< text ==>" + text + "<==\n");
		
		// **** close scanner ****
		sc.close();

		// **** split text into an array ****
		String[] words = text.split("[ ;,.?]");
		System.out.println("<<< words:");
		for (String word : words) {
			System.out.println("<<< word ==>" + word + "<==");
		}
		System.out.println();
		
		// **** create and populate a histogram (words in lower case) **** 
		Map<String, Integer> hist = new TreeMap<String, Integer>();
		int count = 0;
		for (String word : words) {
			if (word.isEmpty()) {
				continue;
			}
			word = word.toLowerCase();
			
			if (hist.containsKey(word)) {
				count = hist.get(word);
				count++;
			} else {
				count = 1;
			}
			hist.put(word, count);
		}
		
		// **** display histogram ****
		System.out.println("<<< hist:");
		for (Map.Entry<String, Integer> entry : hist.entrySet()) {
			String w 	= entry.getKey();
			int c 		= entry.getValue();
			System.out.println("<<< w ==>" + w + "<== count: " + c);	
		}
		System.out.println();
		
		// **** sort histogram by descending value ****
		TreeMapSort sortMap 			= new TreeMapSort();
		Map<String, Integer> sortedHist = sortMap.descSortByValue(hist);
		
		// **** display sorted histogram ****
		System.out.println("<<< desc sortedHist:");
		for (Map.Entry<String, Integer> entry : sortedHist.entrySet()) {
			String w 	= entry.getKey();
			int c 		= entry.getValue();
			System.out.println("<<< w ==>" + w + "<== count: " + c);	
		}
		System.out.println();
		
		// **** put in a string and display the top COUNT words ****
		String topWords = "";
		int i = 0;
		for (Map.Entry<String, Integer> entry : sortedHist.entrySet()) {
			topWords += entry.getKey();
			i++;
			if (i < COUNT) {
				topWords += ", ";
			} else {
				break;
			}
		}
		System.out.println("<<< COUNT: " + COUNT + " topWords ==>" + topWords + "<==\n");
		
		// **** sort histogram by ascending value ****
		sortedHist = sortMap.ascSortByValue(hist);
		
		// **** display sorted histogram ****
		System.out.println("<<< asc sortedHist:");
		for (Map.Entry<String, Integer> entry : sortedHist.entrySet()) {
			String w 	= entry.getKey();
			int c 		= entry.getValue();
			System.out.println("<<< w ==>" + w + "<== count: " + c);	
		}
		System.out.println();

		// **** traverse histogram forwards using Iterator ****
		Iterator<Entry<String, Integer>> iter = sortedHist.entrySet().iterator();
		while (iter.hasNext()) {
			Map.Entry<String, Integer> entry = iter.next();
			System.out.println("iter <<< w ==>" + entry.getKey() + "<== c: " + entry.getValue());
		}
		System.out.println();
		
		// **** traverse histogram backwards ****
		ArrayList<Map.Entry<String, Integer>> al = new ArrayList<Map.Entry<String, Integer>>();
		iter = sortedHist.entrySet().iterator();
		while (iter.hasNext()) {
			al.add(iter.next());
		}
		for (i = al.size() - 1; i >= 0; i--) {
			Map.Entry<String, Integer> entry = al.get(i);
			System.out.println("array list <<< w ==>" + entry.getKey() + "<== c: " + entry.getValue());
		}
		System.out.println();
		
		// **** use a TreeMap ****
		TreeMap<String, Integer> tm = new TreeMap<String, Integer>();
		tm.putAll(sortedHist);
		for (Map.Entry<String, Integer> entry : tm.entrySet()) {
			System.out.println("tm <<< tm ==>" + entry.getKey() + "<== c: " + entry.getValue());
		}
		System.out.println();

		// **** use a LinkedHashMap ****
		LinkedHashMap<String, Integer> lhm = new LinkedHashMap<String, Integer>();
		lhm.putAll(sortedHist);
		for (Map.Entry<String, Integer> entry : lhm.entrySet()) {
			System.out.println("lhm <<< w ==>" + entry.getKey() + "<== c: " + entry.getValue());
		}
		System.out.println();
		
		// **** swap keys and values ****
		HashMap<Integer, String> newMap = new HashMap<Integer, String>();
		System.out.println("newMap <<< sortedHist.size: " + sortedHist.size());
		for (Map.Entry<String, Integer> entry : sortedHist.entrySet()) {
			newMap.put(entry.getValue(), entry.getKey());
		}
		System.out.println("newMap <<<     newMap.size: " + newMap.size());
		for (Map.Entry<Integer, String> entry : newMap.entrySet()) {
			System.out.println("newMap <<< c: " + entry.getKey() + " w ==>" + entry.getValue() + "<==");
		}
		System.out.println();
	}
}

We open a scanner to read in a text string and a set of words. The code should not crash with or without on or the other argument. As you should be able to tell there are four combinations. The code was tested with all combinations.

I always like to start writing code with the test cases. Seems that things go better for several reasons which we will not cover at this time, when one starts writing test code first. Note that the code which invokes the main function / method is going to require some type of scaffolding to run on. When you go to a challenge site (e.g., HackerRank) the scaffolding is most of the times done. That time one spends thinking on how to provide the data to the function / method / class also helps thinking on the actual target code. Of course, the idea behind challenges is to solve them as fast as possible. I am not sure I have ever worked on such environment. What all the environments required was quality and reliable software. By visiting the code periodically, one can continue to optimize and enhance it in such a way that the code may be around and in service for many years.

The second prompt in the main function is to read a list of words which will have different special uses as we write different methods down the line.

Finally we invoke a class that will host the method that parses the string and returns a set of words with associated counts. That code is followed by a couple lines used to display the contents of the array list.

The last line closes the scanner. We all know that there is no need to close the scanner. I always like to free all the resources used. That way I am always in a state of mind to allocate and free objects and resources as needed.

Following is a screen capture of the program parsing a string:

>>> text: 
<<< text ==>Hi John; welcome to the test. John this is a test string to test generating and operating on a histogram.<==

<<< words:
<<< word ==>Hi<==
<<< word ==>John<==
<<< word ==><==
<<< word ==>welcome<==
<<< word ==>to<==
<<< word ==>the<==
<<< word ==>test<==
<<< word ==><==
<<< word ==>John<==
<<< word ==>this<==
<<< word ==>is<==
<<< word ==>a<==
<<< word ==>test<==
<<< word ==>string<==
<<< word ==>to<==
<<< word ==>test<==
<<< word ==>generating<==
<<< word ==>and<==
<<< word ==>operating<==
<<< word ==>on<==
<<< word ==>a<==
<<< word ==>histogram<==

<<< hist:
<<< w ==>a<== count: 2
<<< w ==>and<== count: 1
<<< w ==>generating<== count: 1
<<< w ==>hi<== count: 1
<<< w ==>histogram<== count: 1
<<< w ==>is<== count: 1
<<< w ==>john<== count: 2
<<< w ==>on<== count: 1
<<< w ==>operating<== count: 1
<<< w ==>string<== count: 1
<<< w ==>test<== count: 3
<<< w ==>the<== count: 1
<<< w ==>this<== count: 1
<<< w ==>to<== count: 2
<<< w ==>welcome<== count: 1

<<< desc sortedHist:
<<< w ==>test<== count: 3
<<< w ==>a<== count: 2
<<< w ==>john<== count: 2
<<< w ==>to<== count: 2
<<< w ==>and<== count: 1
<<< w ==>generating<== count: 1
<<< w ==>hi<== count: 1
<<< w ==>histogram<== count: 1
<<< w ==>is<== count: 1
<<< w ==>on<== count: 1
<<< w ==>operating<== count: 1
<<< w ==>string<== count: 1
<<< w ==>the<== count: 1
<<< w ==>this<== count: 1
<<< w ==>welcome<== count: 1

<<< COUNT: 3 topWords ==>test, a, john<==

<<< asc sortedHist:
<<< w ==>and<== count: 1
<<< w ==>generating<== count: 1
<<< w ==>hi<== count: 1
<<< w ==>histogram<== count: 1
<<< w ==>is<== count: 1
<<< w ==>on<== count: 1
<<< w ==>operating<== count: 1
<<< w ==>string<== count: 1
<<< w ==>the<== count: 1
<<< w ==>this<== count: 1
<<< w ==>welcome<== count: 1
<<< w ==>a<== count: 2
<<< w ==>john<== count: 2
<<< w ==>to<== count: 2
<<< w ==>test<== count: 3

iter <<< w ==>and<== c: 1
iter <<< w ==>generating<== c: 1
iter <<< w ==>hi<== c: 1
iter <<< w ==>histogram<== c: 1
iter <<< w ==>is<== c: 1
iter <<< w ==>on<== c: 1
iter <<< w ==>operating<== c: 1
iter <<< w ==>string<== c: 1
iter <<< w ==>the<== c: 1
iter <<< w ==>this<== c: 1
iter <<< w ==>welcome<== c: 1
iter <<< w ==>a<== c: 2
iter <<< w ==>john<== c: 2
iter <<< w ==>to<== c: 2
iter <<< w ==>test<== c: 3

array list <<< w ==>test<== c: 3
array list <<< w ==>to<== c: 2
array list <<< w ==>john<== c: 2
array list <<< w ==>a<== c: 2
array list <<< w ==>welcome<== c: 1
array list <<< w ==>this<== c: 1
array list <<< w ==>the<== c: 1
array list <<< w ==>string<== c: 1
array list <<< w ==>operating<== c: 1
array list <<< w ==>on<== c: 1
array list <<< w ==>is<== c: 1
array list <<< w ==>histogram<== c: 1
array list <<< w ==>hi<== c: 1
array list <<< w ==>generating<== c: 1
array list <<< w ==>and<== c: 1

tm <<< tm ==>a<== c: 2
tm <<< tm ==>and<== c: 1
tm <<< tm ==>generating<== c: 1
tm <<< tm ==>hi<== c: 1
tm <<< tm ==>histogram<== c: 1
tm <<< tm ==>is<== c: 1
tm <<< tm ==>john<== c: 2
tm <<< tm ==>on<== c: 1
tm <<< tm ==>operating<== c: 1
tm <<< tm ==>string<== c: 1
tm <<< tm ==>test<== c: 3
tm <<< tm ==>the<== c: 1
tm <<< tm ==>this<== c: 1
tm <<< tm ==>to<== c: 2
tm <<< tm ==>welcome<== c: 1

lhm <<< w ==>and<== c: 1
lhm <<< w ==>generating<== c: 1
lhm <<< w ==>hi<== c: 1
lhm <<< w ==>histogram<== c: 1
lhm <<< w ==>is<== c: 1
lhm <<< w ==>on<== c: 1
lhm <<< w ==>operating<== c: 1
lhm <<< w ==>string<== c: 1
lhm <<< w ==>the<== c: 1
lhm <<< w ==>this<== c: 1
lhm <<< w ==>welcome<== c: 1
lhm <<< w ==>a<== c: 2
lhm <<< w ==>john<== c: 2
lhm <<< w ==>to<== c: 2
lhm <<< w ==>test<== c: 3

newMap <<< sortedHist.size: 15
newMap <<<     newMap.size: 3
newMap <<< c: 1 w ==>welcome<==
newMap <<< c: 2 w ==>to<==
newMap <<< c: 3 w ==>test<==

In order to process the Map we could use the following class:

/*
import java.util.Comparator;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.TreeMap;
*/
import java.util.*;

/*
 * 
 */
public class TreeMapSort {

	/*
	 * constructor
	 */
	public TreeMapSort() {
	}
	
	/*
	 * sort Map based on values (first implementation)
	 */
	public <K, V extends Comparable<V>> Map<K, V> sortByValues(final Map<K, V> map) {
		Comparator<K> valueComparator = new Comparator<K>() {
		public int compare(K k1, K k2) {
			int compare = map.get(k1).compareTo(map.get(k2));
			if (compare == 0) 
				return 1;
			else 
				return compare;
			}
		};
	
	/*
	 * 
	 */
	Map<K, V> sortedByValues = new TreeMap<K, V>(valueComparator);
	sortedByValues.putAll(map);
	
	// **** ****
	return sortedByValues;
	}
	
	/*
	 * sort Map in ascending order based on values (second implementation)
	 */
	public Map<String, Integer> ascSortByValue(Map<String, Integer> unsortedMap) {
		
		// **** 1. create a List of Map (in alphabetical order of key) ****
		List<Map.Entry<String, Integer>> list = new LinkedList<Map.Entry<String, Integer>>(unsortedMap.entrySet());
		
        // **** 2. sort list with Collections.sort() - provide custom Comparator ****
		Collections.sort(list, new Comparator<Map.Entry<String, Integer>>() {
			public int compare( Map.Entry<String, Integer> o1,
								Map.Entry<String, Integer> o2) {
				return (o1.getValue()).compareTo(o2.getValue());
			}
		});
		
		// **** 3. traverse sorted list and put it into a new insertion order Map LinkedHashMap ****
		Map<String, Integer> sortedMap = new LinkedHashMap<String, Integer>();
		for (Map.Entry<String, Integer> entry : list) {
			sortedMap.put(entry.getKey(), entry.getValue());
		}

		// **** returned the sorted map ****
		return sortedMap;
	}
	
	/*
	 * sort Map in descending order based on values (second implementation)
	 */
	public Map<String, Integer> descSortByValue(Map<String, Integer> unsortedMap) {
		
		// **** 1. create a List of Map (in alphabetical order of key) ****
		List<Map.Entry<String, Integer>> list = new LinkedList<Map.Entry<String, Integer>>(unsortedMap.entrySet());
		
        // **** 2. sort list with Collections.sort() - provide custom Comparator ****
		Collections.sort(list, new Comparator<Map.Entry<String, Integer>>() {
			public int compare( Map.Entry<String, Integer> o1,
								Map.Entry<String, Integer> o2) {
				return (o2.getValue()).compareTo(o1.getValue());
			}
		});
		
		// **** 3. traverse sorted list and put it into a new insertion order Map LinkedHashMap ****
		Map<String, Integer> sortedMap = new LinkedHashMap<String, Integer>();
		for (Map.Entry<String, Integer> entry : list) {
			sortedMap.put(entry.getKey(), entry.getValue());
		}

		// **** returned the sorted map ****
		return sortedMap;
	}
}

I apologize once again for not doing a better job explaining the progression, but have been busy with work.

If you have comments or questions on this or any other post in this blog, please leave me a message. I try to spend a couple hours every day learning, experimenting and posting in this blog.

John

john.canessa@gmail.com

@john_canessa

Leave a Reply

Your email address will not be published. Required fields are marked *

This site uses Akismet to reduce spam. Learn how your comment data is processed.