Top 10 Words in a Text File

how-to-find-highest-repeating-word-from-a-file-in-javaThis entry describes the implementation of an algorithm used to display the top ten (the number is just a constant in the solution) words found in a text file. The solution is implemented in Java.

As is typical, a challenge is described by some requirements. The requirements for this challenge follows:

“Given a text file holding some text (a few thousand words) generate the list of the top ten words found in the text file”.

The approach that I took is to first read the contents of the text file line by line. Since I just pulled a text document from the www, unwanted characters were removed and the remaining converted to lower case after splitting the line into words. The new words are inserted into a hash map. For duplicate words the previous count is read, incremented by one and the new value is associated with the corresponding key (word).  This produces a histogram of words and associated frequencies.word_histogram

The next step is to display the words with the top 10 highest frequencies. Following is a screen capture from the Eclipse IDE for this solution:

main <<< map.size: 7

topList <<< ==>day<== (1)

topList <<< ==>going<== (1)

topList <<< ==>hello<== (1)

topList <<< ==>how<== (1)

topList <<< ==>is<== (1)

topList <<< ==>world<== (1)

topList <<< ==>your<== (1)

The file only contains seven words, none of which appears to repeat. This is a good test to verify that fewer words than requested (e.g., 10) were found and that words with duplicate frequencies (e/g., 1) are displayed.

After we have the histogram, we call the topList() method which creates a new collection of Histogram elements. After the new collection is populated, we sort it using a custom Comparator. This allows us to sort the frequencies in an ascending order (as needed) and continue to associate the words with their frequencies.

The final part of the topList() method prints the requested number of elements. That was just a bonus which was not specified by the requirements. This way, we can change the program from displaying top 10 to top 100 in case we need to consider additional unit testing.

Java code for this solution follows:

package john.canessa.top.ten;

import java.io.BufferedReader;

import java.io.File;

import java.io.FileNotFoundException;

import java.io.FileReader;

import java.io.IOException;

import java.util.ArrayList;

import java.util.Collections;

import java.util.Comparator;

import java.util.TreeMap;

class Histogram {

public String word;

public int    freq;

}

public class Solution {

/*

* generate and display top words

*/

static void topList(TreeMap<String, Integer>map, int top) {

// **** create collection ****

ArrayList<Histogram> histogram = new ArrayList<Histogram>();

for (String w : map.keySet()) {

Histogram h = new Histogram();

h.word = w;

h.freq = map.get(w);

histogram.add(h);

}

// **** sort collection ****

Collections.sort(histogram, new Comparator<Histogram>() {

@Override

public int compare(Histogram h1, Histogram h2) {

return h2.freq – h1.freq;

}

});

// **** display top elements in collection ****

int t  = 0;

for (Histogram h : histogram) {

System.out.println(“topList <<< ==>” + h.word + “<== (” + h.freq + “)”);

t++;

if (t >= top) {

break;

}

}

}

/*

* test code

*/

public static void main(String[] args) throws IOException {

//            final String fileName                    = “c:\\temp\\crossbow.txt”;

final String fileName                    = “c:\\temp\\small.txt”;

final int top                            = 10;

TreeMap<String, Integer> map      = new TreeMap<String, Integer>();

// **** check if file is not available ****

File inFile = new File(fileName);

if (!inFile.isDirectory() && !inFile.exists()) {

throw new FileNotFoundException(“main <<< fileName ==>” + fileName + “<==”);

}

// **** build map ****

FileReader in        = new FileReader(fileName);

BufferedReader br    = new BufferedReader(in);

while (br.ready()) {

// **** read a line ****

String line   = br.readLine();

// **** split the line into words ****

String words[]       = line.split(” “);

// **** remove characters of no interest and convert the remaining ones to lower case ****

for (String w : words) {

w = w.replaceAll(“[^a-zA-Z’]”, “”).toLowerCase();

if (w.equals(“”)) {

continue;

}

// **** insert new words into the map or update their frequency ****

if (!map.containsKey(w)) {

map.put(w, 1);

} else {

int value = map.get(w);

map.replace(w, ++value);

}

}

}

br.close();

in.close();

// **** display count and map contents ****

System.out.println(“main <<< map.size: ” + map.size());

//            for (String w : map.keySet()) {

//                   System.out.println(“main <<< w ==>” + w + “<== freq: ” + map.get(w));

//            }

// **** find and display top words ****

topList(map, top);

}

}

Following is the console output when we use a larger text file and set the count to top 20:

main <<< map.size: 1566

topList <<< ==>the<== (255)

topList <<< ==>to<== (163)

topList <<< ==>of<== (131)

topList <<< ==>and<== (107)

topList <<< ==>a<== (101)

topList <<< ==>that<== (71)

topList <<< ==>in<== (67)

topList <<< ==>it<== (60)

topList <<< ==>you<== (53)

topList <<< ==>is<== (51)

topList <<< ==>for<== (43)

topList <<< ==>be<== (39)

topList <<< ==>not<== (34)

topList <<< ==>if<== (33)

topList <<< ==>this<== (33)

topList <<< ==>as<== (29)

topList <<< ==>can<== (28)

topList <<< ==>or<== (28)

topList <<< ==>your<== (28)

topList <<< ==>but<== (27)

There are many changes that could be implemented. We could place the insertion of words in a separate method. That could facilitate unit testing.

elegant_codePlease keep in mind that some people are only interested in the solution. I am more interested in the process used to achieve a correct, elegant and efficient solution. Addressing challenges like this one just give a glimpse to the full capabilities of a software developer.

If you have comments or questions, please do not hesitate and send me a message. I will reply and keep the comments in private if you so desire.

John

john.canessa@gmail.com

Leave a Reply

Your email address will not be published. Required fields are marked *