Rabin-Karp Algorithm – Revisited

In this post we will revisit the implementation of a string-search algorithm developed by Richard Karp and Michael Rabin.

We visited this algorithm in this blog a few years ago. My motivation is a book that I am currently reading. As soon as I am done reading and experimenting with some more advanced algorithms, I will generate several posts associated with the book.

In the meantime, let’s refresh what the Rabin-Karp algorithm is used for and go over an implementation using the Java programming language.

To follow a routine, I generated the following problem description:

static public boolean search(String s, String p) {}

We are given two strings. One string contains some text and we are required to search it for a pattern specified in a second string.

The signature for the function of interest follows:

choco cake choco waffer vanilla cake strawberry jam
choco cake
main <<< s ==>choco cake choco waffer vanilla cake strawberry jam<==
main <<< p ==>choco cake<==
search <<< i: 9 wHash: 960 pHash: 960 sub ==>choco cake<==
main <<< found


choco waffer vanilla cake strawberry jam choco cake
choco cake
main <<< s ==>choco waffer vanilla cake strawberry jam choco cake<==
main <<< p ==>choco cake<==
search <<< i: 23 wHash: 960 pHash: 960 sub ==>anilla cak<==
search <<< i: 50 wHash: 960 pHash: 960 sub ==>choco cake<==
main <<< found


choco waffer vanilla cake choco cake strawberry jam
choco cake
main <<< s ==>choco waffer vanilla cake choco cake strawberry jam<==
main <<< p ==>choco cake<==
search <<< i: 23 wHash: 960 pHash: 960 sub ==>anilla cak<==
search <<< i: 30 wHash: 960 pHash: 960 sub ==>cake choco<==
search <<< i: 35 wHash: 960 pHash: 960 sub ==>choco cake<==
main <<< found


choco cake choco waffer vanilla cake strawberry jam
lemon cake
main <<< s ==>choco cake choco waffer vanilla cake strawberry jam<==
main <<< p ==>lemon cake<==
main <<< NOT found

The string s will hold the text in which we will search for the text specified in the p string.

The following are some examples for us to test if our solution works:

choco cake choco waffer vanilla cake strawberry jam
choco cake
main <<< s ==>choco cake choco waffer vanilla cake strawberry jam<==
main <<< p ==>choco cake<==
search <<< i: 9 wHash: 960 pHash: 960 sub ==>choco cake<==
main <<< found


choco waffer vanilla cake strawberry jam choco cake
choco cake
main <<< s ==>choco waffer vanilla cake strawberry jam choco cake<==
main <<< p ==>choco cake<==
search <<< i: 23 wHash: 960 pHash: 960 sub ==>anilla cak<==
search <<< i: 50 wHash: 960 pHash: 960 sub ==>choco cake<==
main <<< found


choco waffer vanilla cake choco cake strawberry jam
choco cake
main <<< s ==>choco waffer vanilla cake choco cake strawberry jam<==
main <<< p ==>choco cake<==
search <<< i: 23 wHash: 960 pHash: 960 sub ==>anilla cak<==
search <<< i: 30 wHash: 960 pHash: 960 sub ==>cake choco<==
search <<< i: 35 wHash: 960 pHash: 960 sub ==>choco cake<==
main <<< found


choco cake choco waffer vanilla cake strawberry jam
lemon cake
main <<< s ==>choco cake choco waffer vanilla cake strawberry jam<==
main <<< p ==>lemon cake<==
main <<< NOT found

I placed the pattern of interest at different positions in the string s. Note that to be somewhat complete, in the last statement we are looking for a pattern not in the string.

We could use a set of two nested loops with some additional logic and traverse the string s looking for the pattern in the string p. That would be an O(n^2) solution. Perhaps we could use the Rabin-Karp algorithm and do it in o(n) where n is the length of the string s.

Before we start with the solution we should understand how the Rabin-Karp algorithm works. You can find a description here (https://en.wikipedia.org/wiki/Rabin%E2%80%93Karp_algorithm).

In a nutshell we will use two hashes. One will be for the pattern and the other for a window over the string s with the length of the pattern. The hash over the string s will be a window that we will shift from left to right. On each pass we will update the hash for the window by removing the hash for the leftmost character on the window and adding the hash for the next character to the right of the window.

We need to keep in mind that we could have false positives. To avoid them, when the hashes match we will check if the current substring in the window matches the string in the pattern p.

That was a mouthful.

    /**
     * Test scaffold
     * @throws IOException
     */
    public static void main(String[] args) throws IOException {

        // **** open buffered reader ****
        BufferedReader br = new BufferedReader(new InputStreamReader(System.in));

        // **** read string to search ****
        String s = br.readLine();

        // **** read pattern to search in string ****
        String p = br.readLine();

        // **** close buffered reader ****
        br.close();

        // ???? ????
        System.out.println("main <<< s ==>" + s + "<==");
        System.out.println("main <<< p ==>" + p + "<==");

        // **** search in string s for pattern p ****
        if (search(s, p))
            System.out.println("main <<< found");
        else
            System.out.println("main <<< NOT found");
    }

Following is the code for the test scaffold. It is written in Java. Not that it matters but I used VSCode on a Windows 11 machine. You can use the editor of your choice.

The test code open a buffer reader, reads the two input strings, closes the buffer reader and calls the function of interest. Like to keep things simple (KISS rule).


    /**
     * Search for pattern p in string s.
     * @return
     * true if pattern is found
     * false if pattern is not found
     */
    static public boolean search(String s, String p) {

        // **** sanity check(s) ****
        if (s == null || p == null) return false;
        if (s.length() == 0 || p.length() == 0) return false;
        if (s.length() < p.length()) return false;

        // **** initialization ****
        boolean found   = false;

        int pHash       = 0;
        int wHash       = 0;
        int pLen        = p.length();
        int sLen        = s.length();

        // **** generate p hash ****
        for (int i = 0; i < pLen; i++)
            pHash += (int)p.charAt(i);

        // **** generate the initial window hash ****
        for (int i = 0; i < pLen; i++)
            wHash += (int)s.charAt(i);

        // **** loop searching for the pattern ****
        for (int i = pLen - 1; i < sLen && !found; i++) {

            // **** hashes match ****
            if (wHash == pHash) {

                // ???? ????
                System.out.println("search <<< i: " + i + " wHash: " + wHash + " pHash: " + pHash +
                                    " sub ==>" + s.substring(i - pLen + 1, i + 1) + "<==");

                // **** check if pattern was found ****
                if (s.substring(i - pLen + 1, i + 1).equals(p)) {
                    found = true;
                    continue;
                }

            }

            // **** update window hash; if needed ****
            if (i < sLen - 1) {
                wHash -= s.charAt(i - pLen + 1);
                wHash += s.charAt(i + 1);
            }
        }

        // **** pattern found or not ****
        return found;
    }

We start our implementation of the function of interest by checking for some conditions for which we can return an instant answer.

A set of declarations follow. In production I would have appended comments to each of the lines to clarify what the variables are used for. Based on the earlier description of the algorithm we should be able to follow without additional comments at this time.

We then generate the hash value for the string holding the pattern. This pattern will not change during the search.

Following is the generation of the initial hash for the window that we will move over the string s.

We are ready to enter the loop.

Note that the loop traverses the string from left to right once. This makes it an O(n) algorithm where n is the length of string s.

We check if the hashes match. If they do we need to check if the current substring in the window over the string s matches the pattern in string p.

Go back to the test cases. In the second and third examples, we ran into one and two collisions in which the hashes matched yet the strings were not identical.

The way I checked if the substrings match is using a function available in multiple programming languages. The proper way to check the match would be to traverse the pattern and the sub string in the window comparing characters until there is a mismatch or returning true if a match was found.

I would make this change, but it is getting late in the day and will leave it as a bonus point if you are interested.

Finally, we need to update the window hash for the next position. For this we remove the value contributed by the first character in the window and add the value for the next character towards the right side of the window. Note that we need to check if we can update the window hash. If the pattern is not found, we would be past the end of string s and our code would throw an exception.

After we exit the loop, the found variable would hold true if the pattern was found or false if it was not found.

Hope you enjoyed solving this problem as much as I did. The entire code for this project can be found in my GitHub repository named JohnCanessa/Rabin-Karp.

Please note that the code here presented might not be the best possible solution. After solving a problem, you should take a look at the best accepted solutions provided by the different websites (i.e., HackerRank, LeetCode to name a few). The idea is to learn concepts, not to memorize solutions.

If you have comments or questions regarding this, or any other post in this blog, please do not hesitate and leave me a note below. I will reply as soon as possible.

Keep on reading and experimenting. It is one of the best ways to learn, become proficient, refresh your knowledge, and enhance your developer / engineering skills.

Enjoy,

John

 

Leave a Reply

Your email address will not be published.

This site uses Akismet to reduce spam. Learn how your comment data is processed.