Autocomplete

trie_diagramYesterday evening I was talking with a software manager about an algorithm to autocomplete words. The idea was to build a data structure that could help (hint) users with the available words that would match what has been entered so far. For example, if someone would be looking for my phone number in a company directory application and would have entered “can” then the autocomplete feature could display “cane” and “canessa”.

My first answer was some type of array / string array could be used. In practice, I always look up what I need to do using a web browser. If I need further information then I tend to look up the subject in books (call me old fashioned). What should have come up in my mind or in a web browser was a Trie. Tries are a familiar data structure which I have used on several occasions. As a matter of fact, I used the concept; with a different name; in a post “Huffman Coding” a few weeks ago.autocomplete

At the time we moved to other topics. I mentioned that I would not let it go until I research autocomplete in more detail. This is why I wrote this blog entry.

The idea for the main program is simple. Read into a Trie the contents of a dictionary. The dictionary I used contains 354,937 English words. After the Trie is ready, the program asks for a word. If the word is found then a message is displayed. If the word is not found, but there are hints, the first letter of the possible hints is displayed.

The Java code for main follows:

import java.io.File;

import java.io.FileInputStream;

import java.io.FileNotFoundException;

import java.io.IOException;

import java.util.Iterator;

import java.util.Map;

import java.util.Scanner;

public class Solution {

public static void main(String[] args) {

final String dictionary    = “c:\\temp\\dictionary.txt”;

Trie                 trie          = new Trie();

FileInputStream in                = null;

String                     word          = “”;

// **** ****

File file = new File(dictionary);

try {

in = new FileInputStream(file);

} catch (FileNotFoundException e) {

e.printStackTrace();

System.exit(-1);

}

Scanner sc = new Scanner(in);

// **** read the dictionary into the trie ****

System.out.println(“<<< loading dictionary …”);

while (sc.hasNext()) {

word = sc.nextLine();

trie.insert(word);

}

System.out.println(“<<< done loading dictionary.”);

// **** close input stream ****

try {

in.close();

} catch (IOException e) {

e.printStackTrace();

System.exit(-1);

}

// **** close the scanner ****

sc.close();

// **** open a scanner ****

sc = new Scanner(System.in);

// **** exercise the trie ****

do     {

System.out.print(“>>> word (-1 to exit): “);

word = sc.next();

// **** check if we should look this word up ****

if (!word.equals(“-1”)) {

// **** look up full word ****

if (trie.search(word)) {

System.out.println(“<<< word ==>” + word + “<== found”);

} else {

System.out.println(“<<< word ==>” + word + “<== NOT found”);

}

// **** look up prefix ****

if (trie.startsWith(word)) {

System.out.println(“<<< there are words that start with word ==>” + word + “<==”);

TrieNode hints = trie.searchNode(word);

System.out.print(“<<< hints: “);

Iterator it = hints.children.entrySet().iterator();

while (it.hasNext()) {

Map.Entry pair = (Map.Entry)it.next();

System.out.print(pair.getKey() + ” “);

}

System.out.println();

} else {

System.out.println(“<<< there are NO words that start with word ==>” + word + “<==”);

}

}

} while (!word.equals(“-1”));

// **** indicate we are done ****

System.out.println(“<<< bye bye!!!”);

}

}

Let’s assume one enters the word “mango”. The dictionary contains the following words that could autocomplete:

mango

mangoes

mangold

mangolds

mangona

mangonel

mangonels

mangonism

mangonization

mangonize

mangoro

mangos

mangosteen

mangour

Following is output from the program (call me lazy by not displaying the full words):

>>> word (-1 to exit): mango

<<< word ==>mango<== found

<<< there are words that start with word ==>mango<==

<<< hints: r s e u l n

Following is the implementation of the supporting classes:

import java.util.HashMap;

public class Trie {

// **** fields ****

private TrieNode     root;

// **** constructor ****

public Trie() {

root = new TrieNode();

}

// **** methods ****

public void insert(String word) {

HashMap<Character, TrieNode> children = root.children;

for (int i = 0; i < word.length(); i++) {

TrieNode      t = null;

char          c = word.charAt(i);

if (children.containsKey(c)){

t = children.get(c);

} else {

t = new TrieNode(c);

children.put(c, t);

}

children = t.children;

// **** set leaf node ****

if (i == (word.length() – 1))

t.isLeaf = true;

}

}

public boolean search(String word) {

TrieNode t = searchNode(word);

if(t != null && t.isLeaf)

return true;

else

return false;

}

public TrieNode searchNode(String str){

HashMap<Character, TrieNode>     children      = root.children;

TrieNode                  t             = null;

for (int i = 0; i < str.length(); i++) {

char c = str.charAt(i);

if (children.containsKey(c)){

t = children.get(c);

children = t.children;

} else {

return null;

}

}

return t;

}

public boolean startsWith(String prefix) {

if (searchNode(prefix) == null) {

return false;

} else {

return true;

}

}

}

import java.util.HashMap;

public class TrieNode {

// **** fields ****

char                       c;

HashMap<Character, TrieNode>      children = new HashMap<Character, TrieNode>();

boolean                           isLeaf;

// **** constructors ****

public TrieNode() {

}

public TrieNode (char c) {

this.c = c;

}

}

That said, there are other data structures that could be used to implement the autocomplete feature using arrays, ternary search trees, segment trees, … It is just a matter to figure out the parameters (e.g., number of entries, available memory, expected lookup time, …) and tailor the solutions to meet the application / system needs.

If you have a comment or questions, please do not hesitate and send me a message.

John

John.canessa@gmail.com

Leave a Reply

Your email address will not be published. Required fields are marked *