More than a List of Words

When indexing text based word frequency / relevance which may be applicable for web searches, one of the procedures used is to create a term frequency (tf) array followed by an inverse document frequency (idf) one. You can read more about this here.

In a previous post I experimented with some text in order to build hashmaps with the words of sentences (to keep things in perspective for a blog post). In that post I used a string that I copied from a course I took some years ago. The sting was already preprocessed. The text had already been stripped off punctuation marks.

Today I was experimenting with MongoDB using Java and decided to look up this project. I found it in one of my Eclipse workspaces and decided to add some preprocessing and polishing the operations. If you are working with Big Data or AI you would probably be working with Python. On my next post I will touch on some Python.

For sanity I will make some comments on the data followed by the output of the program and ending with the actual code. That way I only insert one piece of text from the Eclipse console and the Java code from the Eclipse IDE.

The original pre processed text was obtained from Alice’s Adventures in Wonderland, by Lewis Carroll. For this post I navigated here (http://sabian.org/alice_in_wonderland1.php) and decided on a paragraph that contained some punctuation marks. I chose:

“Well!” thought Alice to herself “After such a fall as this, I shall think nothing of tumbling down-stairs! How brave they’ll all think me at home! Why, I wouldn’t say anything about it, even if I fell off the top of the house!” (which was very likely true.)

For our purpose we need to remove all punctuation marks and convert the text to lower case. As usual, you need to start with a reasonable / well thought approach and then fine tune it. If there are a lot of tweaks in your initial approach, then you should consider an alternative one.

Let’s start with the Eclipse console output for this program:

<<<            width: 40 characters
<<<      text.length: 258 characters
<<< textArray.length: 48 words

<<< "Well!" thought Alice to herself "After such a fall 
<<< as this, I shall think nothing of tumbling down-stairs! 
<<< How brave they'll all think me at home! Why, I wouldn't 
<<< say anything about it, even if I fell off the top of 
<<< the house!" (which was very likely true.) 

<<<            width: 40 characters
<<<      text.length: 258 characters
<<< textArray.length: 48 words

<<< "well!" thought alice to herself "after such a fall 
<<< as this, i shall think nothing of tumbling down-stairs! 
<<< how brave they'll all think me at home! why, i wouldn't 
<<< say anything about it, even if i fell off the top of 
<<< the house!" (which was very likely true.) 

<<<            width: 40 characters
<<<      text.length: 258 characters
<<< textArray.length: 63 words

<<<  well   thought alice to herself  after such a fall as 
<<< this  i shall think nothing of tumbling down stairs 
<<<  how brave they ll all think me at home  why  i wouldn 
<<< t say anything about it  even if i fell off the top of 
<<< the house    which was very likely true 

<<<            width: 40 characters
<<<      text.length: 244 characters
<<< textArray.length: 51 words

<<< well thought alice to herself after such a fall as 
<<< this i shall think nothing of tumbling down stairs 
<<< how brave they ll all think me at home why i wouldn 
<<< t say anything about it even if i fell off the top of 
<<< the house which was very likely true 

<<< split words.length: 51

words <<< word ==>well<==
words <<< word ==>thought<==
words <<< word ==>alice<==
words <<< word ==>to<==
words <<< word ==>herself<==
words <<< word ==>after<==
words <<< word ==>such<==
words <<< word ==>a<==
words <<< word ==>fall<==
words <<< word ==>as<==
words <<< word ==>this<==
words <<< word ==>i<==
words <<< word ==>shall<==
words <<< word ==>think<==
words <<< word ==>nothing<==
words <<< word ==>of<==
words <<< word ==>tumbling<==
words <<< word ==>down<==
words <<< word ==>stairs<==
words <<< word ==>how<==
words <<< word ==>brave<==
words <<< word ==>they<==
words <<< word ==>ll<==
words <<< word ==>all<==
words <<< word ==>think<==
words <<< word ==>me<==
words <<< word ==>at<==
words <<< word ==>home<==
words <<< word ==>why<==
words <<< word ==>i<==
words <<< word ==>wouldn<==
words <<< word ==>t<==
words <<< word ==>say<==
words <<< word ==>anything<==
words <<< word ==>about<==
words <<< word ==>it<==
words <<< word ==>even<==
words <<< word ==>if<==
words <<< word ==>i<==
words <<< word ==>fell<==
words <<< word ==>off<==
words <<< word ==>the<==
words <<< word ==>top<==
words <<< word ==>of<==
words <<< word ==>the<==
words <<< word ==>house<==
words <<< word ==>which<==
words <<< word ==>was<==
words <<< word ==>very<==
words <<< word ==>likely<==
words <<< word ==>true<==

<<< sorted words: 51

sorted <<< word ==>a<==
sorted <<< word ==>about<==
sorted <<< word ==>after<==
sorted <<< word ==>alice<==
sorted <<< word ==>all<==
sorted <<< word ==>anything<==
sorted <<< word ==>as<==
sorted <<< word ==>at<==
sorted <<< word ==>brave<==
sorted <<< word ==>down<==
sorted <<< word ==>even<==
sorted <<< word ==>fall<==
sorted <<< word ==>fell<==
sorted <<< word ==>herself<==
sorted <<< word ==>home<==
sorted <<< word ==>house<==
sorted <<< word ==>how<==
sorted <<< word ==>i<==
sorted <<< word ==>i<==
sorted <<< word ==>i<==
sorted <<< word ==>if<==
sorted <<< word ==>it<==
sorted <<< word ==>likely<==
sorted <<< word ==>ll<==
sorted <<< word ==>me<==
sorted <<< word ==>nothing<==
sorted <<< word ==>of<==
sorted <<< word ==>of<==
sorted <<< word ==>off<==
sorted <<< word ==>say<==
sorted <<< word ==>shall<==
sorted <<< word ==>stairs<==
sorted <<< word ==>such<==
sorted <<< word ==>t<==
sorted <<< word ==>the<==
sorted <<< word ==>the<==
sorted <<< word ==>they<==
sorted <<< word ==>think<==
sorted <<< word ==>think<==
sorted <<< word ==>this<==
sorted <<< word ==>thought<==
sorted <<< word ==>to<==
sorted <<< word ==>top<==
sorted <<< word ==>true<==
sorted <<< word ==>tumbling<==
sorted <<< word ==>very<==
sorted <<< word ==>was<==
sorted <<< word ==>well<==
sorted <<< word ==>which<==
sorted <<< word ==>why<==
sorted <<< word ==>wouldn<==

<<< hist.size: 46

histo <<< key ==>a<== value: 1
histo <<< key ==>about<== value: 1
histo <<< key ==>after<== value: 1
histo <<< key ==>alice<== value: 1
histo <<< key ==>all<== value: 1
histo <<< key ==>anything<== value: 1
histo <<< key ==>as<== value: 1
histo <<< key ==>at<== value: 1
histo <<< key ==>brave<== value: 1
histo <<< key ==>down<== value: 1
histo <<< key ==>even<== value: 1
histo <<< key ==>fall<== value: 1
histo <<< key ==>fell<== value: 1
histo <<< key ==>herself<== value: 1
histo <<< key ==>home<== value: 1
histo <<< key ==>house<== value: 1
histo <<< key ==>how<== value: 1
histo <<< key ==>i<== value: 3
histo <<< key ==>if<== value: 1
histo <<< key ==>it<== value: 1
histo <<< key ==>likely<== value: 1
histo <<< key ==>ll<== value: 1
histo <<< key ==>me<== value: 1
histo <<< key ==>nothing<== value: 1
histo <<< key ==>of<== value: 2
histo <<< key ==>off<== value: 1
histo <<< key ==>say<== value: 1
histo <<< key ==>shall<== value: 1
histo <<< key ==>stairs<== value: 1
histo <<< key ==>such<== value: 1
histo <<< key ==>t<== value: 1
histo <<< key ==>the<== value: 2
histo <<< key ==>they<== value: 1
histo <<< key ==>think<== value: 2
histo <<< key ==>this<== value: 1
histo <<< key ==>thought<== value: 1
histo <<< key ==>to<== value: 1
histo <<< key ==>top<== value: 1
histo <<< key ==>true<== value: 1
histo <<< key ==>tumbling<== value: 1
histo <<< key ==>very<== value: 1
histo <<< key ==>was<== value: 1
histo <<< key ==>well<== value: 1
histo <<< key ==>which<== value: 1
histo <<< key ==>why<== value: 1
histo <<< key ==>wouldn<== value: 1

sorted <<< key sorted ==>a<== value: 1
sorted <<< key sorted ==>about<== value: 1
sorted <<< key sorted ==>after<== value: 1
sorted <<< key sorted ==>alice<== value: 1
sorted <<< key sorted ==>all<== value: 1
sorted <<< key sorted ==>anything<== value: 1
sorted <<< key sorted ==>as<== value: 1
sorted <<< key sorted ==>at<== value: 1
sorted <<< key sorted ==>brave<== value: 1
sorted <<< key sorted ==>down<== value: 1
sorted <<< key sorted ==>even<== value: 1
sorted <<< key sorted ==>fall<== value: 1
sorted <<< key sorted ==>fell<== value: 1
sorted <<< key sorted ==>herself<== value: 1
sorted <<< key sorted ==>home<== value: 1
sorted <<< key sorted ==>house<== value: 1
sorted <<< key sorted ==>how<== value: 1
sorted <<< key sorted ==>if<== value: 1
sorted <<< key sorted ==>it<== value: 1
sorted <<< key sorted ==>likely<== value: 1
sorted <<< key sorted ==>ll<== value: 1
sorted <<< key sorted ==>me<== value: 1
sorted <<< key sorted ==>nothing<== value: 1
sorted <<< key sorted ==>off<== value: 1
sorted <<< key sorted ==>say<== value: 1
sorted <<< key sorted ==>shall<== value: 1
sorted <<< key sorted ==>stairs<== value: 1
sorted <<< key sorted ==>such<== value: 1
sorted <<< key sorted ==>t<== value: 1
sorted <<< key sorted ==>they<== value: 1
sorted <<< key sorted ==>this<== value: 1
sorted <<< key sorted ==>thought<== value: 1
sorted <<< key sorted ==>to<== value: 1
sorted <<< key sorted ==>top<== value: 1
sorted <<< key sorted ==>true<== value: 1
sorted <<< key sorted ==>tumbling<== value: 1
sorted <<< key sorted ==>very<== value: 1
sorted <<< key sorted ==>was<== value: 1
sorted <<< key sorted ==>well<== value: 1
sorted <<< key sorted ==>which<== value: 1
sorted <<< key sorted ==>why<== value: 1
sorted <<< key sorted ==>wouldn<== value: 1
sorted <<< key sorted ==>of<== value: 2
sorted <<< key sorted ==>the<== value: 2
sorted <<< key sorted ==>think<== value: 2
sorted <<< key sorted ==>i<== value: 3

set <<< iter.next: a=1
set <<< iter.next: about=1
set <<< iter.next: after=1
set <<< iter.next: alice=1
set <<< iter.next: all=1
set <<< iter.next: anything=1
set <<< iter.next: as=1
set <<< iter.next: at=1
set <<< iter.next: brave=1
set <<< iter.next: down=1
set <<< iter.next: even=1
set <<< iter.next: fall=1
set <<< iter.next: fell=1
set <<< iter.next: herself=1
set <<< iter.next: home=1
set <<< iter.next: house=1
set <<< iter.next: how=1
set <<< iter.next: if=1
set <<< iter.next: it=1
set <<< iter.next: likely=1
set <<< iter.next: ll=1
set <<< iter.next: me=1
set <<< iter.next: nothing=1
set <<< iter.next: off=1
set <<< iter.next: say=1
set <<< iter.next: shall=1
set <<< iter.next: stairs=1
set <<< iter.next: such=1
set <<< iter.next: t=1
set <<< iter.next: they=1
set <<< iter.next: this=1
set <<< iter.next: thought=1
set <<< iter.next: to=1
set <<< iter.next: top=1
set <<< iter.next: true=1
set <<< iter.next: tumbling=1
set <<< iter.next: very=1
set <<< iter.next: was=1
set <<< iter.next: well=1
set <<< iter.next: which=1
set <<< iter.next: why=1
set <<< iter.next: wouldn=1
set <<< iter.next: of=2
set <<< iter.next: the=2
set <<< iter.next: think=2
set <<< iter.next: i=3

We start by displaying some information from the method / function that displays the text that we will be processing. Of course, if this would be an actual project, we would have to process the entire book; not just a single paragraph.

After the information for the paragraph and the width we wish to use for display the text, the actual initial text is displayed. You can see why I chose this paragraph. It contains several punctuation marks (e.g., “!,-() etc).

The next part displays the text in lowercase. The idea behind this is to treat different versions of the same word (e.g., She, she, SHE, etc.) as none. As expected, the number of words in the paragraph has not changed.

The next step is to replace all punctuation marks with spaces. We could have used an empty string but in some cases some of the resulting text might not be actual words. For example “down-stairs” was converted to “down stairs” which is reasonable. If we would have encountered “down-stairs-dungeon” we would have generated “downstairsdungeon” which is not a valid word.

As expected the number of words went up from 48 to 63. You can easily find and count them and verify that all is well so far.

The next step which may or may not be taken is to remove the excess of white spaces. If we do not do it now, we could postpone the task and include it in a later step while we build an array of words by just skipping entries with a space “ “. This is one of the items one would consider in production. One approach might be more efficient than the other. In our case efficiency is not that important because we are just processing one paragraph and it takes a fraction of a second. In production one might be dealing with processing hundreds or thousands of paragraphs per second.

Now that we have the text in lower case and without punctuation marks we will start processing it to build our set of sorted words and associate frequencies. You will note that “t” is not really a word, or “wouldn”. In production at this point you might replace / delete words or just use additional software that would remove some words based on the language. Such step falls beyond the scope of this post. That said; I will probably go over such mechanism in a future Natural Language Processing (NLP) post.

For our next step we will build an array of words with one word per entry. In our example it seems that we end up with an array of 51 words.

The next step is to sort the words in ascending alphabetical order. In practice this step is not required, but for a human is nice and reassuring that all is well so far. Not only that, but you can easily locate words that have been repeated. We can see that “I” among others, has been repeated 3 times.

Now we are ready to build a histogram. A histogram is a data structure that holds unique words and associate with each word is the number of times that word appears in our sample paragraph. As we expected the word “I” has an associated count of 3 and “think” a value of 2. So far things seem to be going well.

The next step is also something that I did to better observe the values of each key. The keys / words are in alphabetical order but they are also sorted by value frequency. As expected very few words are used more than once and the words that are appear towards the bottom of the display.

The next step was just done for fun. At the time of the initial post I wanted to put the results in a set. That is the reason we end displaying a set with our data.

Now for the Java code:

import java.util.Arrays;
import java.util.Comparator;
import java.util.Iterator;
import java.util.Map;
import java.util.Map.Entry;
import java.util.Set;
import java.util.TreeMap;

public class Solution {

	/**
	 * sort TreeMap based on values (not keys)
	 */
	public static <K, V extends Comparable<V>> Map<K, V> sortByValues(final Map<K, V> map) {
		
		// **** ****
		Comparator<K> valueComparator = new Comparator<K>() {
			public int compare(K k1, K k2) {
				int compare = map.get(k1).compareTo(map.get(k2));
				if (compare == 0) 
					return 1;
				else 
					return compare;
	      		}
			};
	 
		// **** ****
		Map<K, V> sortedByValues = new TreeMap<K, V>(valueComparator);
	    sortedByValues.putAll(map);
	    
	    // **** ****
	    return sortedByValues;
	}	

	/**
	 * Display text within a specified number of characters per line.
	 */
	public static void displayText(String text, int width) {
		System.out.println("<<<            width: " + width + " characters");
		System.out.println("<<<      text.length: " + text.length() + " characters");
		
		// **** put words in an array ****
		String[] textArray = text.split(" ");
		System.out.println("<<< textArray.length: " + textArray.length + " words"); // **** for the looks **** System.out.println(); // **** display the complete text one line at a time **** int lineLen = 0; String lineText = ""; for (String word : textArray) { // **** check if we should print the current line **** if (lineLen >= width) {
	    		System.out.println("<<< " + lineText);
	    		lineText = "";
	    		lineLen = 0;
	    	}
	    	    	
	    	// **** append the next word to the current line ****
	    	lineText += word + " ";
	    	
	    	// **** ****
	    	lineLen += word.length();
	    }

	    // **** print the last line (if needed) ****
	    if (lineLen != 0) {
    		System.out.println("<<< " + lineText);	    	
	    }
	    
	    // **** for the looks ****
	    System.out.println();
	}
	
	/**
	 * Main entry point for this program.
	 */
	public static void main(String[] args) {
		
		final int	TEXT_WIDTH	= 40;
		
//		String text = "Either the well was very deep, or she fell very slowly, for she had plenty of time as she went down to look about her, and to wonder what would happen next. First, she tried to look down and make out what she was coming to, but it was too dark to see anything: then, she looked at the sides of the well, and noticed that they were filled with cupboards and book-shelves: here and there were maps and pictures hung on pegs. She took a jar down off one of the shelves as she passed: it was labelled “Orange Marmalade”, but to her great disappointment it was empty: she did not like to drop the jar, for fear of killing somebody underneath, so managed to put it into one of the cupbards as she fell past it.";
		String text = "\"Well!\" thought Alice to herself \"After such a fall as this, I shall think nothing of tumbling down-stairs! How brave they'll all think me at home! Why, I wouldn't say anything about it, even if I fell off the top of the house!\" (which was very likely true.)";
		
		// **** display original text ****
		displayText(text, TEXT_WIDTH);

		// **** convert text to lower case ****
		text = text.toLowerCase();
		
		// **** display text in lower case ****
		displayText(text, TEXT_WIDTH);

		// **** replace punctuation marks with a " " ****
		text = text.replaceAll("[^a-zA-Z\\s]", " ");
		
		// **** display text without punctuation marks ****
		displayText(text, TEXT_WIDTH);

		// **** replace multiple spaces with a single one ****
		text = text.replaceAll("\\s\\s+", " ");
		text = text.trim();
		
		// **** display text without multiple spaces ****
		displayText(text, TEXT_WIDTH);

		// **** split the text into an array ****
		String[] words = text.split(" ");
		System.out.println("<<< split words.length: " + words.length + "\n");
		for (String word : words) {
			System.out.println("words <<< word ==>" + word + "<==");
		}
		System.out.println();

		// **** sort and display the words ****
		System.out.println("<<< sorted words: " + words.length + "\n");
		Arrays.sort(words);
		for (String word : words) {
			System.out.println("sorted <<< word ==>" + word + "<==");
		}
		System.out.println();		

		// **** generate a histogram of words and display it ****
		int count = 0;
		TreeMap<String, Integer> hist = new TreeMap<String, Integer>();
		for (String w : words) {
			if (hist.containsKey(w)) {
				count = hist.get(w);
				hist.replace(w, ++count);
			} else {
				hist.put(w, 1);
			}
		}
		System.out.println("<<< hist.size: " + hist.size() + "\n");

		for (Map.Entry<String, Integer> entry : hist.entrySet()) {
			System.out.println("histo <<< key ==>" + entry.getKey() + "<== value: " + entry.getValue());
		}
		System.out.println();

		// **** sort the histogram by values and display  it****
		Map<String, Integer> sortedMap = sortByValues(hist);
		for (Map.Entry<String, Integer> entry : sortedMap.entrySet()) {
			System.out.println("sorted <<< key sorted ==>" + entry.getKey() + "<== value: " + entry.getValue());
		}
		System.out.println();
		
		// **** convert the map to a set and display it ****
		Set<Entry<String, Integer>> set = sortedMap.entrySet();
		Iterator<Entry<String, Integer>> iter = set.iterator();
		while (iter.hasNext()) {
			System.out.println("set <<< iter.next: " + iter.next());
		}
		System.out.println();
	}
}

Hope you enjoyed reading this post. The main reason I experiment with concepts and then generate a blog post is to make sure that what I learn / observe is correct.

If you have comments / questions regarding this or any other post in this blog or if you need my services to help with software related issues / concerns please leave me a note on the bottom of this post or send me a private message (john.canessa@gmail.com).

Enjoy;

John

Follow me on Twitter:  @john_canessa

Leave a Reply

Your email address will not be published. Required fields are marked *

This site uses Akismet to reduce spam. Learn how your comment data is processed.