Rabin-Karp Algorithm

It is a cooler summer Saturday in the Twin Cities of Minneapolis and St. Paul. My wife and I are planning to go out for lunch. As soon as she is ready we should leave.

In the recent posts “MongoDB Text Search” and “Find Damaged – Part II” for one reason or another I have been dealing with full text searches. The MongoDB post covers how to use text searches in the database. A software engineer was in the process of experimenting with MongoDB on the cloud (Atlas) and was interested in performing text searches. The need was associated with a work related request.

The Find Damaged post sequence attempts to find damaged patterns in an extremely long string of text. In this post we will experiment searching for words in a string. When the text search task comes up, depending in your environment, you may already have the search mechanism in your environment. For example, in MongoDB one can use several programming languages and search for text in one or multiple fields in a collection.

In this post we will implement an algorithm to search for text on a string. The algorithm we will implement is known as the Rabin-Karp algorithm. To read more about it you can do it here.

The algorithm uses a special hashing to search for a specified pattern (word). If you try a brute force approach, you would have to start with the first letter in the text and check if the pattern matches. You will then move to the second character in the text and check if the pattern matches. You will repeat the process until the pattern does not match in the remaining characters in the text.

The Rabin-Karp algorithm is somewhat similar, but it uses hashes to quickly match the pattern. It then updates the hash and moves the window one character towards the end. If you copy the code and run it, you will see that is performs quite well. Of course you can always optimize somewhat the algorithm when you have a specific set of requirements in which text search is just one of them.

Let’s look at the output of the program.

main >>>    text: This is text to search for the test
checkAndLoadText <<< text ==>This is text to search for the test<==
checkAndLoadText <<< regex ==>^(c|C):/.*\.txt$<==
checkAndLoadText <<< no match main >>> pattern: search
main <<< pattern ==>search<==
main <<<       N: 35 M: 6
search <<< D: 256 Q: 257
search <<< mscMult: 256  pow(D, M - 1) % Q: 128
search <<< patternHash: 8 textHash: 103
search <<< match found at i: 16
main <<<   count: 1

We are prompted for a string which we will use to search words in. In our example we enter “This is text to search for the test”. We then display the entered string.

Note that a regular expression is displayed by a method named checkAndLoadText(). The idea is to be able to search for text in longer strings without having to type them in. This method checks if a file name has been specified in the text. If so the text in the file is loaded and used as text for the search. I looked on the web for the text for “Alice in Wonderland” and copied it into a file. If I want to test using a small string I just enter it, but if I wish to search with a longer string I just specify a text file. In this case we just used a string.

We then specify a pattern. In this example we use the word “search”. The algorithm runs and it continues to display some debugging information which can also help to understand what is going on. Finally, as expected, we found the pattern once. Note that the word “search” was found at position 16 in the specified text.

	/*
	 * Test scaffolding
	 */
	public static void main(String[] args) throws IOException {

		// **** open the scanner ****
		Scanner sc = new Scanner(System.in);
		
		// **** prompt for the text ****
		System.out.print("main >>>    text: ");
		String text = sc.nextLine();
		
//		// ???? ????
//		text = "c:/temp/alice_full.txt";
		
		// **** check for a file name and if needed replace text with the contents of the specified file ****
		String regex = "^(c|C):/.*\\.txt$";
		text = checkAndLoadText(text, regex);

//		// ???? ????
//		text = "testing: this is the test text for this test";
		
//		// ???? display the text and numbers to display the index ????
//		System.out.println("main <<< text ==>" + text + "<==");
//		System.out.print("main <<<             ");
//		for (int i = 0; i < text.length(); i++) // System.out.print(i % 10); // System.out.println(); // **** prompt for the pattern **** System.out.print("main >>> pattern: ");
		String pattern = sc.nextLine();
		
//		// ???? ????
//		pattern = "test";
		
		System.out.println("main <<< pattern ==>" + pattern + "<==");
		
		// **** close the scanner ****
		sc.close();
		
		// **** ****
		N = text.length();
		M = pattern.length();
		
		// ???? ????
		System.out.println("main <<<       N: " + N + " M: " + M);
		
		// **** check if the text is shorter than the pattern ****
		if (N < M) {
			System.err.println("main <<< UNEXPECTED N: " + N + " < M: " + M);
			System.exit(-1);
		}
		
		// **** search for the pattern in the text ****
		int count = search(pattern, text);
		System.out.println("main <<<   count: " + count);
	}

If we look at our test scaffolding we open a scanner to get user input. I have several lines commented out. They can be used to the numbers under the entered text so when the results return we can easily locate the starts of the pattern. Also, instead of typing on each run the text and pattern, I hard coded some strings to run tests faster.

We want to make sure that the pattern is smaller than the text. There is a check for it. After all is et and done, we call the search() method which returns the count of hominy times the pattern was found in the specified text. In the last example we found the “search” pattern once.

	/*
	 * Search pattern in text using Rabin-Karp algorithm.
	 */
	static int search(String pattern, String text) {
		
		// ****  ****
		int count		= 0;			// count of matches
		int	mscMult		= 1;			// most significant character multiplier
		int	textHash	= 0;			// hash for the text
		int	patternHash	= 0;			// hash for the pattern
		
		// ???? ????
		System.out.println("search <<< D: " + D + " Q: " + Q);
		
		// **** compute mscMult = pow(D, M - 1) % Q ****
		for (int i = 0; i < (M - 1); i++)
			mscMult = (mscMult * D) % Q;
		
		// ???? ????
		System.out.println("search <<< mscMult: " + mscMult + "  pow(D, M - 1) % Q: " + (int)Math.pow((double)D, (double)(M - 1)) % Q);
		
		// **** calculate the hash of the pattern and the first text window ****
		for (int i = 0; i < M; i++) {
			patternHash = (D * patternHash + pattern.charAt(i)) % Q;
			textHash 	= (D * textHash    + text.charAt(i)   ) % Q;
		}
		
		// ???? ????
		System.out.println("search <<< patternHash: " + patternHash + " textHash: " + textHash);
		
		// **** slide the pattern over the text one character at a time ****
		for (int i = 0; i <= (N - M); i++) {
			
			// **** ****
			int j = 0;
			
			// **** check if hashes match ****
			if (patternHash == textHash) {
				
				// **** check if characters match ****
				for (j = 0; j < M; j++) {
					if (text.charAt(i + j) != pattern.charAt(j))
						break;
				}
				
				// **** check if all characters match (pattern found) ****
				if (j == M) {
					
					// ???? ????
					System.out.println("search <<< match found at i: " + i);
					
					// **** increment the count of matches ****
					count++;
				}
			}
			
			// **** compute the hash for the NEXT window of text
			//		remove leading digit and add trailing one ****
			if (i < (N - M)) {
				
				// **** compute the text hash ****
				textHash = (D * (textHash - text.charAt(i) * mscMult) + text.charAt(i + M)) % Q;
				
				// **** ****
				if (textHash < 0)
					textHash += Q;
			}
			
		}
				
		// **** count of matches ****
		return count;
	}

The search() method implements the Rabin-Karp algorithm. It starts by displaying some values.

	// **** constants ****
	static final int 	Q	= 257;		// prime number: 101, 131, 257
	static final int	D	= 256;		// number of characters in alphabet (128 or 256)
	
	// **** global variables ****
	static int 			N	= 0;		// length of text
	static int 			M	= 0;		// length of pattern

We first compute the value for the mscMult variable. This value is the constant we will use to multiply the highest (first) character in the pattern. This will become obvious when we see how the hashes work in the algorithm.

We then compute the hash for the pattern and the hash for the first M characters in the text. M contains the number of characters in the search pattern.

Now we are ready to traverse the text and find if the pattern matches as the window moves from the first character in the text to the last possible position.

Since we have computed the value for the first M characters in the pattern we check if the hashes match. We need to check each character because the hash is not unique. We can improve the hash by using a larger prime number.

If the hashes match, we check if each character in the pattern has a corresponding match in the text. If we find a match we increase the counter. In the example on the first pass there was no match.

We now need to slide the window one character to the right. We do so by removing the contribution of the first character and adding the contribution to the hash from the next character on the right. Now you can see how the mscMult variable is used. This is the most important step in the algorithm. Note that the textHash value may end with a negative number. If so we just add Q to make the value positive.

	/*
	 * Check if the text represents a file name.
	 * If so, replace the text with the contents of the file.
	 */
	static String checkAndLoadText(String text, String regex) throws IOException {
		
		// ???? ????
		System.out.println("checkAndLoadText <<< text ==>" + text + "<==");
		System.out.println("checkAndLoadText <<< regex ==>" + regex + "<==");
		
		// **** ****
		final Pattern pattern = Pattern.compile(regex);
		
		// **** ****
		Matcher m = pattern.matcher(text);
			
		// **** ****
		if (m.find()) {
			
			// ???? ????
			System.out.println("checkAndLoadText <<< match found!!!");
			
			// **** ****
			text = new String(Files.readAllBytes(Paths.get(text)));
		} else {
			
			// ???? ????
			System.out.println("checkAndLoadText <<< no match");
		}

		// **** ****
		return text;
	}

This method is used to check if the specified test string represents a file name. We use a regular expression to extract the path to a text file in my Temp folder (e.g., “c:/temp/alice_full.txt”). If the specified file is available we read in the contents and return them to the caller. This is a simple way to go from a few words to thousands.

main >>>    text: c:/temp/alice_full.txt
checkAndLoadText <<< text ==>c:/temp/alice_full.txt<==
checkAndLoadText <<< regex ==>^(c|C):/.*\.txt$<==
checkAndLoadText <<< match found!!! main >>> pattern: rabbit
main <<< pattern ==>rabbit<==
main <<<       N: 145864 M: 6
search <<< D: 256 Q: 257
search <<< mscMult: 256  pow(D, M - 1) % Q: 128
search <<< patternHash: 251 textHash: 5
search <<< match found at i: 1145
search <<< match found at i: 1337
search <<< match found at i: 1486
search <<< match found at i: 33910
search <<< match found at i: 36254
search <<< match found at i: 36301
main <<<   count: 6

In this last example we specify a file and the software reads it in. The pattern is the word “rabbit”. As you can see the software finds a count of six (6) instances of the specified pattern. A debug statement indicates the position of the pattern in the specified test.

The entire code for this project and the alice_full.txt file can be found in my GitHub repository.

By experimenting with this algorithm, it seems that it may or may not apply to the “Find Damaged” posts. I have run out of time for this week. I will get back to the Find Damaged set of posts next weekend.

If you have comments or questions regarding this or any other post in this blog, or if you would like me to help with any phase in the SDLC (Software Development Life Cycle) of a product or service, please do not hesitate and leave me a note below. Requests for help will remain private.

Keep on reading and experimenting. It is the best way to learn and refresh your knowledge!

John

Follow me on Twitter:  @john_canessa

Leave a Reply

Your email address will not be published. Required fields are marked *

This site uses Akismet to reduce spam. Learn how your comment data is processed.