Algorithms and Data Structures for Massive Datasets

Algorithms and Data Structures for Massive Data Sets

I just finished reading and to some extent experimenting with most of the concepts presented in the book by Dzejla Medjedovic, Emin Tahirovic with illustrations by Ines Dedovic © 2022 Manning Publications ISBN: 9781617298035.

I signed up on Amazon to get the book as soon as it was released. As with most work, the first edition has a few typos. In addition, the book could have had more examples. It seems that as the book progresses the number of examples diminishes. Overall a nice book!

In my opinion, the authors attempt to take the reader from what hashes are, how they are typically used, and how they can be used to reduce the amount of space used to produce some results that in general tend to take considerably more space.

Towards the last chapters many data structures and algorithms are mentioned and briefly described. In particular I am interested in LSM trees. I spent some time reading additional articles and in the next revision of this post I should have a nice example properly documented.

Rabin-Karp Algorithm

The Rabin–Karp algorithm algorithm is a string-searching algorithm that uses hashing to find an exact match of a pattern string in a text. It uses a rolling hash to quickly filter out positions of the text that cannot match the pattern, and then checks for a match at the remaining positions.

The authors are trying to let us down the path that in some cases one can come up with an algorithm that works most of the time, but when it seems to work, to make sure, we need to perform some additional work. That said, an algorithm that in general would work on O(n^2) can be transformed to one that works in O(n).

choco cake choco waffer vanilla cake strawberry jam
choco cake
main <<< s ==>choco cake choco waffer vanilla cake strawberry jam<==
main <<< p ==>choco cake<==
search <<< i: 9 wHash: 960 pHash: 960 sub ==>choco cake<==
main <<< found


choco waffer vanilla cake strawberry jam choco cake
choco cake
main <<< s ==>choco waffer vanilla cake strawberry jam choco cake<==
main <<< p ==>choco cake<==
search <<< i: 23 wHash: 960 pHash: 960 sub ==>anilla cak<==
search <<< i: 50 wHash: 960 pHash: 960 sub ==>choco cake<==
main <<< found


choco waffer vanilla cake choco cake strawberry jam
choco cake
main <<< s ==>choco waffer vanilla cake choco cake strawberry jam<==
main <<< p ==>choco cake<==
search <<< i: 23 wHash: 960 pHash: 960 sub ==>anilla cak<==
search <<< i: 30 wHash: 960 pHash: 960 sub ==>cake choco<==
search <<< i: 35 wHash: 960 pHash: 960 sub ==>choco cake<==
main <<< found


choco cake choco waffer vanilla cake strawberry jam
lemon cake
main <<< s ==>choco cake choco waffer vanilla cake strawberry jam<==
main <<< p ==>lemon cake<==
main <<< NOT found

In the examples we are provided with a string with a set of words, and we are looking for a pattern. Note in the output that in some cases we get a false positive, but after a check, we skip the possible match and keep on looking until we find the pattern or we determine that the pattern is not there.

	/**
     * Test scaffold
     * @throws IOException
     */
    public static void main(String[] args) throws IOException {

        // **** open buffered reader ****
        BufferedReader br = new BufferedReader(new InputStreamReader(System.in));

        // **** read string to search ****
        System.out.print("main <<< s: ");
        String s = br.readLine();                   // text to search in

        // **** read pattern to search in string ****
        System.out.print("main <<< p: ");
        String p = br.readLine();                   // pattern to search for

        // **** close buffered reader ****
        br.close();

        // ???? ????
        System.out.println("main <<< s ==>" + s + "<==");
        System.out.println("main <<< p ==>" + p + "<==");

        // **** search in string s for pattern p ****
        if (search(s, p))
            System.out.println("main <<< found :o)");
        else
            System.out.println("main <<< NOT found :o(");
    }

This is the test scaffold to check the operation of the algorithm.

    /**
     * Search for pattern p in string s.
     * The string to search in is s.
     * The pattern to search for is in p.
     * 
     * Execution: O(n)
     * Space: O(1)
     * 
     * @return
     * true if pattern is found
     * false if pattern is not found
     */
    static public boolean search(String s, String p) {

        // **** sanity check(s) ****
        if (s == null || p == null) return false;
        if (s.length() == 0 || p.length() == 0) return false;
        if (s.length() < p.length()) return false;

        // **** initialization ****
        boolean found   = false;                    // pattern found flag

        int pHash       = 0;                        // pattern hash
        int wHash       = 0;                        // window in string hash
        int pLen        = p.length();               // pattern length (in bytes)
        int sLen        = s.length();               // string length (in bytes)

        // **** generate hash for p ****
        for (int i = 0; i < pLen; i++)
            pHash += (int)p.charAt(i);

        // **** generate initial window hash
        //      (the width of the window is the size of the pattern) ****
        for (int i = 0; i < pLen; i++)
            wHash += (int)s.charAt(i);

        // **** loop searching for the hash of the pattern in the window hash of the string - O(n) ****
        for (int i = pLen - 1; i < sLen && !found; i++) {

            // **** check if hashes match ****
            if (wHash == pHash) {

                // ???? ????
                System.out.println("search <<< i: " + i + " wHash: " + wHash + " pHash: " + pHash +
                                    " sub ==>" + s.substring(i - pLen + 1, i + 1) + "<==");

                // **** check if pattern was found ****
                if (s.substring(i - pLen + 1, i + 1).equals(p)) {
                    found = true;
                    continue;
                }
            }

            // **** update window hash; if needed - O(1) ****
            if (i < sLen - 1) {

                // **** remove left character component of window hash ****
                wHash -= s.charAt(i - pLen + 1);

                // **** add right character component to window hash ****
                wHash += s.charAt(i + 1);
            }
        }

        // **** pattern found or not ****
        return found;
    }

This function performs the search of the patterns on the specified string.

We generate the hash for the pattern and the first set of letters in the string. Note that we now have two patterns of the same length which we will move one character at a time from left to right.

To keep the search pattern up to date, we remove the hash from the first letter and add the hash of the next letter.

If the hashes match, the condition might be for a false positive. We need to then compare the current string with the pattern. If they match, we found a match.

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;


/**
 * Rabin–Karp algorithm is a string-searching algorithm that uses 
 * hashing to find an exact match of a pattern in a text string.
 * https://en.wikipedia.org/wiki/Rabin%E2%80%93Karp_algorithm
 */
class RabinKarp {
}

This is the wrapper for the Java code.

Dictionaries

As we all know, dictionaries are typically implemented using a key-value pair. The key is hashed and the value is placed in the bucket associated with a key.

One of the first things that need to be considered when using a dictionary are collisions. If more than one key maps to the same bucket, we encounter a collision which in most cases needs to be addressed.

There are two main approaches to solve collisions. Linear search in the hash table and implementation of buckets with data structures that are able to handle multiple values (e.g., lists and binary trees).

main <<< pb: {James=12345, Peter=12359, Maria=12376, Jane=12346, Emanuel=12347}

main <<< name: James phone: 12345
main <<< name: Peter phone: 12359
main <<< name: Maria phone: 12376
main <<< name: Jane phone: 12346
main <<< name: Emanuel phone: 12347

main <<< name: James phone: 12345
main <<< name: Peter phone: 12359
main <<< name: Maria phone: 12376
main <<< name: Jane phone: 12346
main <<< name: Emanuel phone: 12347

main <<< pb2: {Emanuel=12347, James=12345, Jane=12346, Maria=12376, Peter=12359}

main <<< name: Emanuel phone: 12347
main <<< name: James phone: 12345
main <<< name: Jane phone: 12346
main <<< name: Maria phone: 12376
main <<< name: Peter phone: 12359

main <<< name: Emanuel phone: 12347
main <<< name: James phone: 12345
main <<< name: Jane phone: 12346
main <<< name: Maria phone: 12376
main <<< name: Peter phone: 12359

In this case we are implementing an office phonebook. The key is the first name of the person and the value is the office extension.

Please note that the two examples do not deal with collisions.

It seems that there are two implementations of the phonebook. In the first the names appear to be kept in the order they were entered. In the second, the names are kept in ascending alphabetical order. I guess the second approach would be a better fit for the task.

import java.util.HashMap;
import java.util.Iterator;
import java.util.Map;
import java.util.TreeMap;
import java.util.Map.Entry;


/**
 * Experiment with dictionaries.
 * In the first case it is implemented with a hashmap.
 * In the second example it is implemented with a binary tree.
 */
public class ADictionary {
}

This code contains the equivalent of #include in C/C++. The definition of the empty class follows. We will be generating and testing our code as we populate the class. The code for this section as well as the code for all examples are in GitHub. Will leave a list of repositories at the end of the post.

    /**
     * Test scaffold.
     * @param args
     */
    public static void main(String[] args) {
        
        // **** first phonebook ****
        HashMap<String, Integer> pb = new HashMap<>();

        // **** populate phonebook with some random entries ****
        pb.put("James", 12345);
        pb.put("Jane", 12346);
        pb.put("Emanuel", 12347);
        pb.put("Peter", 12359);
        pb.put("Maria", 12376);

        // ???? ????
        System.out.println("main <<< pb: " + pb.toString() + '\n');

        // **** display name and phone (use entry set) ****
        for (Map.Entry<String, Integer> entry : pb.entrySet())
            System.out.println("main <<< name: " + entry.getKey() + " phone: " + entry.getValue());
        System.out.println();

        // **** display name and phone (use iterator) ****
        Iterator<Entry<String, Integer>> it = pb.entrySet().iterator();
        while (it.hasNext()) {
            Map.Entry<String, Integer> entry = it.next();
            System.out.println("main <<< name: " + entry.getKey() + " phone: " + entry.getValue());
        }
        System.out.println();


        // **** second phonebook ****
        TreeMap<String, Integer> pb2 = new TreeMap<>();

        // **** populate phonebook with same entries ****
        pb2.put("James", 12345);
        pb2.put("Jane", 12346);
        pb2.put("Emanuel", 12347);
        pb2.put("Peter", 12359);
        pb2.put("Maria", 12376);

        // ???? ????
        System.out.println("main <<< pb2: " + pb2.toString() + '\n');

        // **** display name and phone (use entry set) ****
        for (Map.Entry<String, Integer> entry : pb2.entrySet())
            System.out.println("main <<< name: " + entry.getKey() + " phone: " + entry.getValue());
        System.out.println();

        // **** display name and phone (use iterator) ****
        Iterator<Entry<String, Integer>> it2 = pb2.entrySet().iterator();
        while (it2.hasNext()) {
            Map.Entry<String, Integer> entry = it2.next();
            System.out.println("main <<< name: " + entry.getKey() + " phone: " + entry.getValue());
        }
        System.out.println();
    }

Our test scaffold declares a hashmap which is implemented as simply as possible. Depending on the number of entries, due to expansion of the data structure, the order of names might change.

We display our phone book by just creating a string. In the second attempt we create an iterator and traverse the contents of our first phonebook implementation.

For our second implementation, we just change the type of hashmap. We use one that is not only backed up by a hash table but also by a balanced binary tree.

The traversal operations are similar, but on the second implementation the keys are in ascending alphabetical order.

Majority Element

To solve this problem, we will use the Boyer–Moore majority vote algorithm which is an algorithm for finding the majority of a sequence of elements using linear time and constant space. Given that the number of elements in the set is N, for the algorithm to find the majority element it must be present N / 2 + 1; otherwise the algorithm may or may not work.

main >>> list of int: 3,2,3
main <<< lst: [3, 2, 3]
main <<< majority0: 3
main <<< majority1: 3


main >>> list of int: 2,2,1,1,1,2,2
main <<< lst: [2, 2, 1, 1, 1, 2, 2]
main <<< majority0: 2
main <<< majority1: 2


main >>> list of int: 3,5,1,2,2,3,4,3,3,3,3
main <<< lst: [3, 5, 1, 2, 2, 3, 4, 3, 3, 3, 3]
main <<< majority0: 3
main <<< majority1: 3


main >>> list of int: 1,2,2,3,3,3,3,3,3,3,3,3,3,4,4,4,4,4
main <<< lst: [1, 2, 2, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 4, 4, 4, 4, 4]
main <<< majority0: 3
main <<< majority1: 3


main >>> list of int: 4,5,5,4,6,4,4
main <<< lst: [4, 5, 5, 4, 6, 4, 4]
main <<< majority0: 4
main <<< majority1: 4


main >>> list of int: 3,3,4,4,4,5
main <<< lst: [3, 3, 4, 4, 4, 5]
main <<< majority0: 5
main <<< majority1: 3

For each test case we are provided with a comma separated list of integers. It seems that our test code places the integers in a list. Two different implementations are called. Note that in the last test we would expect 4 as an answer, Note that 4 is not the majority because N = 6 so N / 2 + 1 = 6 / 2 + 1 = 4 and number 4 only occurs 3 times. 

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;


/**
 * 
 */
class MajorityElementAlt {


    /**
     * Return the majority element from the specified list.
     * @param lst
     * @return
     */
    public static int majority0(List<Integer> lst) {

        // **** sanity check(s) ****
        if (lst.size() == 0) return Integer.MIN_VALUE;
        if (lst.size() == 1) return lst.get(0);

        // **** initialization ****
        int index = 0;
        int count = 1;

        // **** ****
        for (int i = 0; i < lst.size(); i++) {

            // **** ****
            if (lst.get(i) == lst.get(index))
                count += 1;
            else
                count -= 1;

            // *** *****
            if (count == 0) {
                index = i;
                count = 1;
            }
        }

        // **** ****
        return lst.get(index);
    }


    /**
     * Return the majority element from the specified list.
     * This is the original Boyer-Moore implementation.
     * @param lst
     * @return
     */
    public static int majority1(List<Integer> lst) {

        // **** sanity check(s) ****
        if (lst.size() == 0) return Integer.MIN_VALUE;
        if (lst.size() == 1) return lst.get(0);

        // **** initialization ****
        int i           = 1;
        boolean removed = false;

        // **** ****
        while (i < lst.size()) {

            // ???? ????
            // System.out.println("majority1 <<< i: " + i + " (" + lst.get(i - 1) + "," + lst.get(i) + ")");

            // **** if different; remove both elements ****
            if (lst.get(i - 1) != lst.get(i)) {

                // **** *****
                lst.remove(i);
                lst.remove(i - 1);

                // **** flag we removed a pair ****
                removed = true;

                // ??? ????
                // System.out.println("majority1 <<< removed: " + removed);
            }

            // **** if equal; increment index by 1 ****
            else {
                i++;
            }

            // ??? ????
            // System.out.println("majority1 <<< lst: " + lst.toString());
        }

        // ???? ????
        // System.out.println("majority1 <<<   i: " + i + " lst.size: " + lst.size());
        // System.out.println("majority1 <<< lst: " + lst.toString());

        // **** ****
        return lst.size() == 0 ? Integer.MIN_VALUE : lst.get(0);
    }


    /**
     * Test scaffold.
     * @param args
     * @throws IOException
     */
    public static void main(String[] args) throws IOException {

        // **** prompt for a comma separated list of ints ****
        System.out.print("main >>> list of int: ");

        // **** ****
        BufferedReader br = new BufferedReader(new InputStreamReader(System.in));

        // **** read input data and populate list of integers ****
        List<Integer> lst = Arrays.stream(br.readLine().trim().split(","))
                                        .map(Integer::parseInt)
                                        .collect(Collectors.toList());

        // **** ****
        br.close();

        // ???? ????
        System.out.println("main <<< lst: " + lst.toString());

        // **** return the majority element in the specified list ****
        System.out.println("main <<< majority0: " + majority0(lst));

        // **** return the majority element in the specified list ****
        System.out.println("main <<< majority1: " + majority1(lst));
    }
}

The code for the two implementations are followed by the test scaffold.

The first implementation follows the original algorithm. The second implementation was suggested in the book. If you get a chance take a look at figure 4.1 which provides a more visual way to solve the problem. And yes, the illustrations seem to belong in a Dr. Seuss book.

MurmurHash and Bloom Filters

MurmurHash is a non-cryptographic hash function suitable for general hash-based lookup. It was created by Austin Appleby and is currently hosted on GitHub along with its test suite named ‘SMHasher’. The name comes from two basic operations, multiply (MU) and rotate (R), used in its inner loop. Unlike cryptographic hash functions (i.e., MD5, SHA), it is not specifically designed to be difficult to reverse by an adversary, making it unsuitable for cryptographic purposes.

main <<< MurmurHash3 Java project !!!

main <<<      sizeof(Byte): 1
main <<< sizeof(Character): 2
main <<<     sizeof(Short): 2
main <<<   sizeof(Integer): 4
main <<<     sizeof(Float): 4
main <<<      sizeof(Long): 8
main <<<    sizeof(Double): 8

main <<< hash32() hash32: -1985658254
main <<< hash32() hash32: -1985658254
main <<< hash32() hash32: -1985658254

main <<< hash64() hash64: -7184396032096721004
main <<< hash64() hash64: -7184396032096721004
main <<< hash64() shortData: 1234 hash64: 1773849671424857867
main <<< hash64() intData: 1234 hash64: 6622910425223981010
main <<< hash64() longData: 1234 hash64: 8023949850968868787

main <<< hash128() hash128: -4087839859224168750 -7952636393196069105
main <<< hash128() hash128: -4087839859224168750 -7952636393196069105
main <<< hash128() hash128: -4087839859224168750 -7952636393196069105

main <<< bs1.length: 0
main <<<   bs1.size: 64
main <<<        bs1: {}

main <<< bs2.length: 0
main <<<   bs2.size: 64
main <<<        bs2: {}

main <<< bs1.length: 5
main <<<   bs1.size: 64
main <<<        bs1: {0, 1, 2, 4}

main <<< bs2.length: 64
main <<<   bs2.size: 64
main <<<        bs2: {1, 2, 3, 4, 5, 6, 63}

main <<< bs2.length: 5
main <<<   bs2.size: 64
main <<<        bs2: {1, 2, 4}

main <<< bf:  n: 10 f: 0.01 m: 95 k: 6
bitArray: {}
insert <<< item ==>1<==
insert <<< indices: 46, 40, 20, 38, 93, 47
insert <<< bf:  n: 10 f: 0.01 m: 95 k: 6
bitArray: {20, 38, 40, 46, 47, 93}

insert <<< item ==>2<==
insert <<< indices: 51, 37, 75, 15, 88, 48
insert <<< bf:  n: 10 f: 0.01 m: 95 k: 6
bitArray: {15, 20, 37, 38, 40, 46, 47, 48, 51, 75, 88, 93}

insert <<< item ==>42<==
insert <<< indices: 82, 9, 49, 8, 34, 10
insert <<< bf:  n: 10 f: 0.01 m: 95 k: 6
bitArray: {8, 9, 10, 15, 20, 34, 37, 38, 40, 46, 47, 48, 49, 51, 75, 82, 88, 93}

lookup <<< item ==>1<==
lookup <<< indices: 46, 40, 20, 38, 93, 47
main <<< item ==>1<== true

lookup <<< item ==>2<==
lookup <<< indices: 51, 37, 75, 15, 88, 48
main <<< item ==>2<== true

lookup <<< item ==>42<==
lookup <<< indices: 82, 9, 49, 8, 34, 10
main <<< item ==>42<== true

lookup <<< item ==>43<==
lookup <<< indices: 70
main <<< item ==>43<== false

Before working with a library that is new to me, I like to understand what to expect from the methods offered by a class. In the following code we will take a brief look in order to get familiar with the library.

Yes, in Java there is no sizeof() operator so to make it clear of the sizes we are dealing with we have a simple table created by displaying the sizes of the different types.

Our test code which I changed several times to experiment with, shows all the currently available methods that have not been deprecated in the class.

Once we are done with the murmur class we experiment with the BitSet class. This class implements the equivalent of 1-bit booleans which may be used to save space.

Finally we have some code to test a bloom filter implemented using the hash methods in the murmur library.

A Bloom filter is a space-efficient probabilistic data structure that is used to test whether an element is a member of a set. False positive matches are possible, but false negatives are not – in other words, a query returns either “possibly in set” or “definitely not in set”. Elements can be added to the set, but not removed (though this can be addressed with the counting Bloom filter variant); the more items added, the larger the probability of false positives.

 

    /**
     * Test scaffold.
     * @param args
     */
    public static void main( String[] args )
    {
        
        // **** initialization ****
        long[] hash128;
        long hash64;

        int hash32  = 0;
        int offset  = 0;
        int length  = 0;
        int seed    = 0;


        // **** experiment with murmur library ****
        System.out.println( "main <<< MurmurHash3 Java project !!!\n" );

        // **** display type and size ****
        System.out.println("main <<<      sizeof(Byte): " + Byte.BYTES);
        System.out.println("main <<< sizeof(Character): " + Character.BYTES);
        System.out.println("main <<<     sizeof(Short): " + Short.BYTES);
        System.out.println("main <<<   sizeof(Integer): " + Integer.BYTES);

        System.out.println("main <<<     sizeof(Float): " + Float.BYTES);
        System.out.println("main <<<      sizeof(Long): " + Long.BYTES);
        System.out.println("main <<<    sizeof(Double): " + Double.BYTES);


        // **** string to experiment with hashes ****
        String str = "Hello";
        byte[] data = str.getBytes();

        // ???? ????
        System.out.println();

        // **** generate 32-bit hash ****
        hash32 = MurmurHash3.hash32(str);
        System.out.println("main <<< hash32() hash32: " + hash32);

        // **** generate 32-bit hash ****
        hash32 = MurmurHash3.hash32(data);
        System.out.println("main <<< hash32() hash32: " + hash32);

        // **** generate 32-bit hash ****
        offset  = 0;
        length  = data.length;
        seed    = MurmurHash3.DEFAULT_SEED;
        hash32  = MurmurHash3.hash32(data, offset, length, seed);
        System.out.println("main <<< hash32() hash32: " + hash32);

        // ???? ????
        System.out.println();

        // **** generate 64-bit hash ****
        hash64 = MurmurHash3.hash64(data);
        System.out.println("main <<< hash64() hash64: " + hash64);

        // **** generate 64-bit hash ****
        offset  = 0;
        length  = data.length;
        seed    = MurmurHash3.DEFAULT_SEED;
        hash64  = MurmurHash3.hash64(data, offset, length, seed);
        System.out.println("main <<< hash64() hash64: " + hash64);

        // **** generate 64-bit hash ****
        short shortData = 1234;
        hash64 = MurmurHash3.hash64(shortData);
        System.out.println("main <<< hash64() shortData: " + shortData + " hash64: " + hash64);

        // **** generate 64-bit hash ****
        int intData = 1234;
        hash64 = MurmurHash3.hash64(intData);
        System.out.println("main <<< hash64() intData: " + intData + " hash64: " + hash64);

        // **** generate 64-bit hash ****
        long longData = 1234;
        hash64 = MurmurHash3.hash64(longData);
        System.out.println("main <<< hash64() longData: " + longData + " hash64: " + hash64);
        
        // ???? ????
        System.out.println();

        // **** generate 128-bit hash ****
        hash128 = MurmurHash3.hash128(str);
        System.out.println("main <<< hash128() hash128: " + hash128[0] + " " + hash128[1]);

        // **** generate 128-bit hash ****
        hash128 = MurmurHash3.hash128(data);
        System.out.println("main <<< hash128() hash128: " + hash128[0] + " " + hash128[1]);

        // **** generate 128-bit hash ****
        offset  = 0;
        length  = data.length;
        seed    = MurmurHash3.DEFAULT_SEED;
        hash128 = MurmurHash3.hash128(data, offset, length, seed);
        System.out.println("main <<< hash128() hash128: " + hash128[0] + " " + hash128[1]);
        System.out.println();


        // **** experiment with bit sets ****
        BitSet bs1 = new BitSet();
        BitSet bs2 = new BitSet(6);

        // ???? ????
        System.out.println("main <<< bs1.length: " + bs1.length());
        System.out.println("main <<<   bs1.size: " + bs1.size());
        System.out.println("main <<<        bs1: " + bs1.toString());
        System.out.println();

        System.out.println("main <<< bs2.length: " + bs2.length());
        System.out.println("main <<<   bs2.size: " + bs2.size());
        System.out.println("main <<<        bs2: " + bs2.toString());
        System.out.println();

        // **** ****
        bs1.set(0);
        bs1.set(1);
        bs1.set(2);
        bs1.set(4);

        // ???? ????
        System.out.println("main <<< bs1.length: " + bs1.length());
        System.out.println("main <<<   bs1.size: " + bs1.size());
        System.out.println("main <<<        bs1: " + bs1.toString());
        System.out.println();

        // **** set bits (random order) ****
        bs2.set(4);
        bs2.set(6);
        bs2.set(5);
        bs2.set(1);
        bs2.set(2);
        bs2.set(3);

        // ???? ????
        // bs2.set(7);
        // bs2.set(65);
        bs2.set(63);
        //bs2.set(0);

        // ???? ????
        System.out.println("main <<< bs2.length: " + bs2.length());
        System.out.println("main <<<   bs2.size: " + bs2.size());
        System.out.println("main <<<        bs2: " + bs2.toString());
        System.out.println();

        // **** AND operation ****
        bs2.and(bs1);

        // ???? ????
        System.out.println("main <<< bs2.length: " + bs2.length());
        System.out.println("main <<<   bs2.size: " + bs2.size());
        System.out.println("main <<<        bs2: " + bs2.toString());
        System.out.println();


        // **** bloom filter using murmur library ****
        int n           = 10;
        double f        = 0.01;
        BloomFilter bf  = new BloomFilter(n, f);

        // ???? ????
        System.out.println("main <<< bf: " + bf.toString());

        // **** insert this item ****
        String item = "1";
        bf.insert(item);

        // ???? ????
        System.out.println();

        // **** insert this item ****
        item = "2";
        bf.insert(item);

        // ???? ????
        System.out.println();

        // **** insert this item ****
        item = "42";
        bf.insert(item);

        // ???? ????
        System.out.println();

        // **** lookup this item ****
        item = "1";
        System.out.println("main <<< item ==>" + item + "<== " + bf.lookup(item) + "\n");

        // **** lookup this item ****
        item = "2";
        System.out.println("main <<< item ==>" + item + "<== " + bf.lookup(item) + "\n");

        // **** lookup this item ****
        item = "42";
        System.out.println("main <<< item ==>" + item + "<== " + bf.lookup(item) + "\n");

        // **** lookup this item ****
        item = "43";
        System.out.println("main <<< item ==>" + item + "<== " + bf.lookup(item) + "\n");
    }

In our test case we insert three strings and look up four. The first three lookups indicate that the strings are in the set. The last lookup indicates that the last string is not in the set.

<?xml version="1.0" encoding="UTF-8"?>

<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
  xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
  <modelVersion>4.0.0</modelVersion>

  <groupId>com.canessa</groupId>
  <artifactId>murmur</artifactId>
  <version>1.0-SNAPSHOT</version>

  <name>murmur</name>
  <!-- FIXME change it to the project's website -->
  <url>http://www.example.com</url>

  <properties>
    <project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
    <maven.compiler.source>1.7</maven.compiler.source>
    <maven.compiler.target>1.7</maven.compiler.target>
  </properties>

  <dependencies>

    <dependency>
      <groupId>junit</groupId>
      <artifactId>junit</artifactId>
      <version>4.11</version>
      <scope>test</scope>
    </dependency>

    <!-- Thanks for using https://jar-download.com -->
    <dependency>
      <groupId>commons-codec</groupId>
      <artifactId>commons-codec</artifactId>
      <version>1.13</version>
    </dependency>

  </dependencies>

  <build>
    <pluginManagement><!-- lock down plugins versions to avoid using Maven defaults (may be moved to parent pom) -->
      <plugins>
        <!-- clean lifecycle, see https://maven.apache.org/ref/current/maven-core/lifecycles.html#clean_Lifecycle -->
        <plugin>
          <artifactId>maven-clean-plugin</artifactId>
          <version>3.1.0</version>
        </plugin>
        <!-- default lifecycle, jar packaging: see https://maven.apache.org/ref/current/maven-core/default-bindings.html#Plugin_bindings_for_jar_packaging -->
        <plugin>
          <artifactId>maven-resources-plugin</artifactId>
          <version>3.0.2</version>
        </plugin>
        <plugin>
          <artifactId>maven-compiler-plugin</artifactId>
          <version>3.8.0</version>
        </plugin>
        <plugin>
          <artifactId>maven-surefire-plugin</artifactId>
          <version>2.22.1</version>
        </plugin>
        <plugin>
          <artifactId>maven-jar-plugin</artifactId>
          <version>3.0.2</version>
        </plugin>
        <plugin>
          <artifactId>maven-install-plugin</artifactId>
          <version>2.5.2</version>
        </plugin>
        <plugin>
          <artifactId>maven-deploy-plugin</artifactId>
          <version>2.8.2</version>
        </plugin>
        <!-- site lifecycle, see https://maven.apache.org/ref/current/maven-core/lifecycles.html#site_Lifecycle -->
        <plugin>
          <artifactId>maven-site-plugin</artifactId>
          <version>3.7.1</version>
        </plugin>
        <plugin>
          <artifactId>maven-project-info-reports-plugin</artifactId>
          <version>3.0.0</version>
        </plugin>
      </plugins>
    </pluginManagement>
  </build>
</project>

In a Maven project in Java one needs to specify dependencies so they are made available to the application. Keep in mind that bringing down code may pose a security risk.

package com.canessa;

import java.lang.Math;
import java.util.BitSet;

// **** ****
import org.apache.commons.codec.digest.MurmurHash3;

/**
 * A Bloom filter is a space-efficient probabilistic data structure, 
 * created by Burton Howard Bloom in 1970, 
 * that is used to test whether an element is a member of a set.
 * 
 * False positive matches are possible, 
 * but false negatives are not 
 * – in other words, a query returns either "possibly in set" or "definitely not in set". 
 * Elements can be added to the set, but not removed; 
 * the more items added, the larger the probability of false positives.
 * 
 */
public class BloomFilter {

    // **** class member(s) ****
    public int      k;                      // number of hash functions
    public int      m;                      // number of bits in the bloom filter
    public int      n;                      // number of elements to insert
    public double   f;                      // false positive rate

    public BitSet   bitArray = null;       // stores matching hashes for a lookup


    /**
     * Constructor
     */
    public BloomFilter(int n, double f) {
        this.n = n;
        this.f = f;

        this.m = calculateM();
        this.k = calculateK();

        this.bitArray = new BitSet(this.m);
        this.bitArray.set(0, this.bitArray.size(), false);
    }


    /**
     * Calculate m
     * @return
     */
    public int calculateM() {
        return (int)(- Math.log(this.f) * this.n / (Math.log(2) * Math.log(2)));
    }


    /**
     * Calculate k
     * @return
     */
    public int calculateK() {
        return (int)(this.m * Math.log(2) / this.n);
    }


    /**
     * Insert string into bloom filter.
     * @param item
     */
    public void insert(String item) {

        // ???? ????
        System.out.println("insert <<< item ==>" + item + "<==");

        // **** ****
        byte[] data = item.getBytes();
        int offset  = 0;
        int length  = data.length;

        // ???? ????
        System.out.print("insert <<< indices: ");

        // **** ****
        for (int i = 0; i < this.k; i++) {

            // **** compute index in bloom filter ****
            int index  = (int)Math.abs(MurmurHash3.hash64(data, offset, length, i) % this.m);

            // ???? ????
            System.out.print(index);
            if (i < this.k - 1) System.out.print(", ");
            else System.out.println();

            // **** set index to true ****
            this.bitArray.set(index);
        }

        // ???? ????
        System.out.println("insert <<< bf: " + this.toString());
    }


    /**
     * Look up string in bloom filter
     * @param str
     * @return
     */
    public boolean lookup(String item) {

        // ???? ????
        System.out.println("lookup <<< item ==>" + item + "<==");

        // **** ****
        byte[] data = item.getBytes();
        int offset  = 0;
        int length  = data.length;
   
        // ???? ????
        System.out.print("lookup <<< indices: ");

        // **** ****
        for (int i = 0; i < this.k; i++) {

            // **** compute index in bloom filter ****
            int index  = (int)Math.abs(MurmurHash3.hash64(data, offset, length, i) % this.m);

            // ???? ????
            System.out.print(index);
            // if (i < this.k - 1) System.out.print(", ");
            // else System.out.println();

            // **** ****
            if (this.bitArray.get(index) == false) {

                // ???? ????
                System.out.println();

                // **** ****
                return false;
            }

            // ???? ????
            if (i < this.k - 1) System.out.print(", ");
            else System.out.println();
        }

        // **** str found in bloom filter ****
        return true;
    }


    /**
     * Return a string representation of this object.
     */
    @Override
    public String toString() {
        StringBuilder sb = new StringBuilder();

        sb.append(" n: " + n + " f: " + f + " m: " + m + " k: " + k + "\n");
        sb.append("bitArray: " + bitArray.toString());

        return sb.toString();
    }
}

This code defines in a single file the bloom filter class.

Note that all the class members are public. In a production scenario they should be made private and one should provide the necessary public getters and setters.

Our class requires values for four members which are defined with associated comments. The book contains a Python implementation of a Bloom filter.

Besides the constructor, we have the insert() method used to insert a string into the Bloom filter and the lookup() method used to check if a string is in the filter.

In addition, since we should experiment and understand as much as possible how the Bloom filter works, we have a method that returns a string representation of the contents of the filter.

Consistent Hashing

This is a good example of hashing and how to do it in order to allow changes when removing servers and having to reroute requests to other nodes with the minimum amount of work.

**** Resize Hash Table (to be completed) ****

hashTable[ 0]: null
hashTable[ 1]: null
hashTable[ 2]: null
hashTable[ 3]: null
hashTable[ 4]: null
hashTable[ 5]: null
hashTable[ 6]: null
hashTable[ 7]: null
hashTable[ 8]: null
hashTable[ 9]: null
hashTable[10]: null
hashTable[11]: null
hashTable[12]: null
hashTable[13]: null
hashTable[14]: null
hashTable[15]: null
hashTable[16]: null
hashTable[17]: null
hashTable[18]: null
hashTable[19]: null
hashTable[20]: null
hashTable[21]: null
hashTable[22]: null
hashTable[23]: null
hashTable[24]: null
hashTable[25]: null
hashTable[26]: null
hashTable[27]: null
hashTable[28]: null
hashTable[29]: null
hashTable[30]: null
hashTable[31]: null

**** Consistent Hashing ****

<<< Adding a head node 12...
<<< Adding a node 18. Its prev is 12, and its next is 12
<<< Adding a resource: 24...
<<< Adding a resource: 21...
<<< Adding a resource: 16...
<<< Adding a resource: 23...
<<< Adding a resource: 2...
<<< Adding a resource: 29...
<<< Adding a resource: 28...
<<< Adding a resource: 7...
<<< Adding a resource: 10...
<<< *****
<<< Printing hashring in clockwise order:
<<< Node: 12, Resources: 24 21 23 2 29 28 7 10  
<<< Node: 18, Resources: 16  
<<< *****
<<< Adding a node 5. Its prev is 18, and its next is 12
<<< Moving resource 24 from 12 to 5
<<< Moving resource 21 from 12 to 5
<<< Moving resource 23 from 12 to 5
<<< Moving resource 2 from 12 to 5
<<< Moving resource 29 from 12 to 5
<<< Moving resource 28 from 12 to 5
<<< Adding a node 27. Its prev is 18, and its next is 5
<<< Moving resource 24 from 5 to 27
<<< Moving resource 21 from 5 to 27
<<< Moving resource 23 from 5 to 27
<<< Adding a node 30. Its prev is 27, and its next is 5
<<< Moving resource 29 from 5 to 30
<<< Moving resource 28 from 5 to 30
<<< *****
<<< Printing hashring in clockwise order:
<<< Node: 5, Resources: 2
<<< Node: 12, Resources: 7 10
<<< Node: 18, Resources: 16
<<< Node: 27, Resources: 24 21 23
<<< Node: 30, Resources: 29 28
<<< *****
<<< Removing node: 12:
<<< Moving resource 7 from 12 to 18
<<< Moving resource 10 from 12 to 18
<<< *****
<<< Printing hashring in clockwise order:
<<< Node: 5, Resources: 2
<<< Node: 18, Resources: 16 7 10
<<< Node: 27, Resources: 24 21 23
<<< Node: 30, Resources: 29 28
<<< *****

Please disregard the first part of the test in which I will be creating a hash table, populating it, and then changing the size, which affects most (never generalize) entries.

The next part of the test code covers the example in section 3.3 A simple Implementation in the book which is in Python.

Consistent hashing is a special kind of hashing technique such that when a hash table is resized, only n / m keys need to be remapped on average where n is the number of keys and m is the number of slots. In contrast, in most traditional hash tables, a change in the number of array slots causes nearly all keys to be remapped because the mapping between the keys and the slots is defined by a modular operation.

In the first set of operations we start with an empty hashring. We then add two nodes which handle requests (servers) and a set of resources (requests). After the nodes and resources are added we display the contents of the hashring.

Note that the ring is displayed in clockwise order starting with the lowest node (in this case 12).

Next we add a mix of nodes and resources. The resulting ring is then displayed.

Our last test is to remove node 12 from the hashring. Only two resources move. The contents of the hashring are then displayed.

I liked implementing the hashring and experimenting with it.

package com.hashing;

import java.util.LinkedHashMap;


/**
 * Hashring node.
 */
public class Node {
    
    // **** class members ****
    public int hashValue                            = 0;        // hash value
    public LinkedHashMap<Integer, String> resources = null;     // hashmap holding resources to process
    public Node next                                = null;     // next node
    public Node prev                                = null;     // previous node


    /**
     * Constructor
     * @param hashValue
     */
    public Node(int hashValue) {

        // **** sanity check(s) ****

        // **** ****
        this.hashValue  = hashValue;
        this.resources  = new LinkedHashMap<Integer, String>();
        this.next       = null;
        this.prev       = null;
    }


    // **** setter(s) ****


    // **** getter(s) ****


}

The hashring Node class contains the members with descriptions.

The constructor illustrates how to create a node. Once again, no sanity checks have been implemented and the class members are made public in order to avoid generating setters and getters. In production we should probably make all members private and implement the getters and setters.

package com.hashing;

/**
 * Consistent hashing is a special kind of hashing technique such that when a hash table is resized, 
 * only n/m keys need to be remapped on average where n is the number of keys and m is the number of slots. 
 * In contrast, in most traditional hash tables, 
 * a change in the number of array slots causes nearly all keys to be remapped 
 * because the mapping between the keys and the slots is defined by a modular operation.
 */
public class App 
{
    public static void main( String[] args )
    {

        // **** initialization ****
        HashRing hr     = null;                 // hashring
        HashTable ht    = null;                 // hash table (to be completed)
        int k           = 0;                    // number of hash values (2 ^ k)
        int n           = 0;                    // number of hash values in hash table (to be completed)

        // **** section message ****
        System.out.println( "**** Resize Hash Table (to be completed) ****\n");

        // **** create hash table with 16 hash values ****
        n   = 32;
        ht  = new HashTable(n);

        // ???? display hash table ????
        ht.printHashTable();


        // **** section message ****
        System.out.println( "\n**** Consistent Hashing ****\n");

        // **** create a hashring with 2^5 == 32 hash values ****
        k   = 5;
        hr  = new HashRing(k);

        // **** add some nodes to the hashring ****
        hr.addNode(12);
        hr.addNode(18);

        // **** add some resources to the hashring ****
        hr.addResource(24);
        hr.addResource(21);
        hr.addResource(16);
        hr.addResource(23);

        hr.addResource(2);
        hr.addResource(29);
        hr.addResource(28);
        hr.addResource(7);

        hr.addResource(10);

        // **** print the hashring ****
        hr.printHashRing();

        // **** add some nodes to the hashring ****
        hr.addNode(5);
        hr.addNode(27);
        hr.addNode(30);

        // **** print the hashring ****
        hr.printHashRing();

        // **** remove a node from the hashring ****
        hr.removeNode(12);

        // **** print the hashring ****
        hr.printHashRing();
    }
}

This is the implementation of our test code. The sequence of operations matches the one presented in the book.

package com.hashing;

import java.util.ArrayList;
import java.util.Map.Entry;


/**
 * This class implements a hashring.
 */
public class HashRing {

    
    // **** class members ****
    public Node head    = null;         // pointer to head node
    public int k        = 0;            // number of nodes = 2 ^ k
    public int min      = 0;            // min number of nodes
    public int max      = 0;            // max number of nodes


    /**
     * Constructor
     * @param k
     */
    public HashRing(int k) {

        // **** sanity check(s) ****

        // **** initialization ****
        this.head   = null;
        this.k      = k;
        this.min    = 0;
        this.max    = (int) Math.pow(2.0, (double)k);
    }

    // **** setter(s) ****


    // **** getter(s) ****


    /**
     * Legal range of resource and node hash values allowed 
     * in the hashring. 
     * @param hashValue
     * @return
     */
    public boolean legalRange(int hashValue) {
        if (this.min <= hashValue && hashValue <= this.max)
            return true;
        else
            return false;
    }


    /**
     * Determine the distance to the closes node for a particular resource.
     * @param a
     * @param b
     * @return
     */
    public int distance(int a, int b) {
        if (a == b)
            return 0;
        else if (a < b)
            return b - a;
        else 
            return (int)Math.pow(2.0, this.k) + (b - a);
    }


    /**
     * Lookup of the appropiate node given a hash value of the resource.
     * @param hashVal
     * @return
     */
    public Node lookupNode(int hashValue) {

        // **** ****
        if (this.legalRange(hashValue)) {

            // **** head node ****
            Node temp = this.head;

            // **** ****
            if (temp == null) {
                return null;
            } else {
    
                // **** ****
                while (distance(temp.hashValue, hashValue) > distance(temp.next.hashValue, hashValue))
                    temp = temp.next;

                // **** ****
                if (temp.hashValue == hashValue) 
                    return temp;
                else 
                    return temp.next;
            }
        }

        // **** node not found ****
        return null;
    }


    /**
     * Move nresource(s) from the orig to the dest node.
     * @param dest
     * @param orig
     * @param deleteTrue
     */
    public void moveResources(Node dest, Node orig, boolean  deleteTrue) {

        // **** initialization ****
        ArrayList<Integer> deleteList   = new ArrayList<>();
        int i                           = 0;
        String j                        = null;
        
        // **** iterate in order over the orig node resources ****
        for (Entry<Integer, String> entry : orig.resources.entrySet()) {

            // **** for ease of use ****
            i = entry.getKey();
            j = entry.getValue();

            // **** ****
            if (distance(i, dest.hashValue) < distance(i, orig.hashValue) || deleteTrue) {

                // **** ****
                dest.resources.put(i, j);

                // **** ****
                deleteList.add(i);

                // **** log activity ****
                System.out.println( "<<< Moving resource " + i + " from " + orig.hashValue + " to " + dest.hashValue);
            }
        }

        // **** delete the reassigned resource(s) from orig ****
        while (!deleteList.isEmpty()) {

            // **** ****
            i = deleteList.remove(0);

            // **** remove this resource from orig ****
            orig.resources.remove(i);
        }
    }


    /**
     * Add node to the hashring.
     * @param hashValue
     */
    public void addNode(int hashValue) {

        // **** ****
        if (legalRange(hashValue)) {

            // **** ****
            Node newNode = new Node(hashValue); 

            // **** empty hashring ****
            if (this.head == null) {
                newNode.next    = newNode;
                newNode.prev    = newNode;
                this.head       = newNode;

                // **** ****
                System.out.println("<<< Adding a head node " + newNode.hashValue + "...");
            } 
            
            // **** the hashring is not empty ****
            else {

                // **** successor ****
                Node temp           = lookupNode(hashValue);

                newNode.next        = temp;
                newNode.prev        = temp.prev;
                newNode.prev.next   = newNode;
                newNode.next.prev   = newNode;

                // **** log activity ****
                System.out.println( "<<< Adding a node " + newNode.hashValue + 
                                    ". Its prev is " + newNode.prev.hashValue +
                                    ", and its next is " + newNode.next.hashValue);

                // **** move resources to next node (do not delete them) ****
                moveResources(newNode, newNode.next, false);

                // **** head pointer changes ****
                if (hashValue < head.hashValue)
                    head = newNode;
            }
        }
    }


    /**
     * Add resource to the hashring.
     * @param hashValueResource
     */
    public void addResource(int hashValueResource) {

        // **** check if in range (circular queue) ****
        if (legalRange(hashValueResource)) {

            // **** log activity ****
            System.out.println("<<< Adding a resource: " + hashValueResource + "...");

            // **** access the target node ****
            Node targetNode = lookupNode(hashValueResource);

            // **** no target node ****
            if (targetNode != null) {
                int key         = hashValueResource;
                String value    = "dummy resource value of " + hashValueResource;
                targetNode.resources.put(key, value);
            }

            // **** log activity ****
            else {
                System.err.println("<<< Cannot add a resource to an empty hashring!!!");
            }
        }
    }


    /**
     * Remove node from hashring.
     * @param hashValue
     */
    public Node removeNode(int hashValue) {

        // **** look up the node using the hash value ****
        Node temp = lookupNode(hashValue);

        // **** found the node ****
        if (temp.hashValue == hashValue) {

            // **** log activity ****
            System.out.println("<<< Removing node: " + hashValue + ": ");

            // **** ****
            moveResources(temp.next, temp, true);
            temp.prev.next = temp.next;
            temp.next.prev = temp.prev;

            // **** ****
            if (head.hashValue == hashValue) {
                head = temp.next;
                if (head == head.next)
                    head = null;
            }

            // **** return next node ****
            return temp.next;
        }

        // **** no such node ****
        else {
            System.out.println("<<< Nothing to remove");
        }

        // **** nothing to return ****
        return null;
    }


    /**
     * Print the contents (nodes and resources) of the specified hashring.
     */
    public void printHashRing() {

        // **** log activity ****
        System.out.println("<<< *****");
        System.out.print("<<< Printing hashring in clockwise order:");

        // **** for starters ****
        Node temp = head;

        // ***** empty hashring ****
        if (head == null) {
            System.out.print("<<< Empty hashring!!!");
        }

        // **** nodes in hashring ****
        else {

            // **** traverse the hashring ****
            while (true) {

                // **** log activity ****
                System.out.print("\n<<< Node: " + temp.hashValue + ", ");
                System.out.print("Resources: ");

                // **** no resources assigned to this node ****
                if (temp.resources.size() == 0) {
                    System.out.print("empty ");
                }

                // **** display assigned resources ****
                else {
                    for (Entry<Integer, String> entry : temp.resources.entrySet())
                        System.out.print(entry.getKey() + " ");
                }

                // **** get to the next node ****
                temp = temp.next;

                // **** log activity ****
                System.out.print(" ");

                // **** done traversing the hashring ****
                if (temp == head)
                    break;
            }
        }

        // **** log activity ****
        System.out.println("\n<<< *****");
    }
   
}

This is the implementation of the Hashring class with its members and methods. Each member contains a brief description that enhances the method name.

With the code and diagrams we should be able to get a good understanding on the operation of the hashring.

Quotient Filter (in progress)

This section will be included in the next version of this post.

Count-min Sketch (CMS)

The Count-min sketch (CMS) is a probabilistic data structure that serves as a frequency table of events in a stream of data. It uses hash functions to map events to frequencies, but unlike a hash table uses only sub-linear space, at the expense of overcounting some events due to collisions. 

main <<< **** Count-min Sketch (CMS) ****
update <<< a: H
update <<< index: 1, 6, 4, 9
main <<< cms: 
d: 4 w: 10 cmsTbl:
0, 1, 0, 0, 0, 0, 0, 0, 0, 0
0, 0, 0, 0, 0, 0, 1, 0, 0, 0
0, 0, 0, 0, 1, 0, 0, 0, 0, 0
0, 0, 0, 0, 0, 0, 0, 0, 0, 1

update <<< a: O
update <<< index: 2, 0, 2, 9
main <<< cms: 
d: 4 w: 10 cmsTbl:
0, 1, 1, 0, 0, 0, 0, 0, 0, 0
1, 0, 0, 0, 0, 0, 1, 0, 0, 0
0, 0, 1, 0, 1, 0, 0, 0, 0, 0
0, 0, 0, 0, 0, 0, 0, 0, 0, 2

update <<< a: T
update <<< index: 6, 8, 3, 4
main <<< cms: 
d: 4 w: 10 cmsTbl:
0, 1, 1, 0, 0, 0, 1, 0, 0, 0
1, 0, 0, 0, 0, 0, 1, 0, 1, 0
0, 0, 1, 1, 1, 0, 0, 0, 0, 0
0, 0, 0, 0, 1, 0, 0, 0, 0, 2

update <<< a:  
update <<< index: 9, 4, 3, 1
main <<< cms: 
d: 4 w: 10 cmsTbl:
0, 1, 1, 0, 0, 0, 1, 0, 0, 1
1, 0, 0, 0, 1, 0, 1, 0, 1, 0
0, 0, 1, 2, 1, 0, 0, 0, 0, 0
0, 1, 0, 0, 1, 0, 0, 0, 0, 2

update <<< a: P
update <<< index: 5, 9, 0, 5
main <<< cms: 
d: 4 w: 10 cmsTbl:
0, 1, 1, 0, 0, 1, 1, 0, 0, 1
1, 0, 0, 0, 1, 0, 1, 0, 1, 1
1, 0, 1, 2, 1, 0, 0, 0, 0, 0
0, 1, 0, 0, 1, 1, 0, 0, 0, 2

update <<< a: E
update <<< index: 1, 7, 6, 2
main <<< cms: 
d: 4 w: 10 cmsTbl:
0, 2, 1, 0, 0, 1, 1, 0, 0, 1
1, 0, 0, 0, 1, 0, 1, 1, 1, 1
1, 0, 1, 2, 1, 0, 1, 0, 0, 0
0, 1, 1, 0, 1, 1, 0, 0, 0, 2

update <<< a: P
update <<< index: 5, 9, 0, 5
main <<< cms: 
d: 4 w: 10 cmsTbl:
0, 2, 1, 0, 0, 2, 1, 0, 0, 1
1, 0, 0, 0, 1, 0, 1, 1, 1, 2
2, 0, 1, 2, 1, 0, 1, 0, 0, 0
0, 1, 1, 0, 1, 2, 0, 0, 0, 2

update <<< a: P
update <<< index: 5, 9, 0, 5
main <<< cms:
d: 4 w: 10 cmsTbl:
0, 2, 1, 0, 0, 3, 1, 0, 0, 1
1, 0, 0, 0, 1, 0, 1, 1, 1, 3
3, 0, 1, 2, 1, 0, 1, 0, 0, 0
0, 1, 1, 0, 1, 3, 0, 0, 0, 2

update <<< a: E
update <<< index: 1, 7, 6, 2
main <<< cms:
d: 4 w: 10 cmsTbl:
0, 3, 1, 0, 0, 3, 1, 0, 0, 1
1, 0, 0, 0, 1, 0, 1, 2, 1, 3
3, 0, 1, 2, 1, 0, 2, 0, 0, 0
0, 1, 2, 0, 1, 3, 0, 0, 0, 2

update <<< a: R
update <<< index: 8, 5, 7, 9
main <<< cms:
d: 4 w: 10 cmsTbl:
0, 3, 1, 0, 0, 3, 1, 0, 1, 1
1, 0, 0, 0, 1, 1, 1, 2, 1, 3
3, 0, 1, 2, 1, 0, 2, 1, 0, 0
0, 1, 2, 0, 1, 3, 0, 0, 0, 3

update <<< a: R
update <<< index: 8, 5, 7, 9
main <<< cms: 
d: 4 w: 10 cmsTbl:
0, 3, 1, 0, 0, 3, 1, 0, 2, 1
1, 0, 0, 0, 1, 2, 1, 2, 1, 3
3, 0, 1, 2, 1, 0, 2, 2, 0, 0
0, 1, 2, 0, 1, 3, 0, 0, 0, 4

update <<< a: O
update <<< index: 2, 0, 2, 9
main <<< cms:
d: 4 w: 10 cmsTbl:
0, 3, 2, 0, 0, 3, 1, 0, 2, 1
2, 0, 0, 0, 1, 2, 1, 2, 1, 3
3, 0, 2, 2, 1, 0, 2, 2, 0, 0
0, 1, 2, 0, 1, 3, 0, 0, 0, 5

update <<< a: N
update <<< index: 8, 4, 6, 0
main <<< cms:
d: 4 w: 10 cmsTbl:
0, 3, 2, 0, 0, 3, 1, 0, 3, 1
2, 0, 0, 0, 2, 2, 1, 2, 1, 3
3, 0, 2, 2, 1, 0, 3, 2, 0, 0
1, 1, 2, 0, 1, 3, 0, 0, 0, 5

update <<< a: I
update <<< index: 2, 9, 8, 7
main <<< cms:
d: 4 w: 10 cmsTbl:
0, 3, 3, 0, 0, 3, 1, 0, 3, 1
2, 0, 0, 0, 2, 2, 1, 2, 1, 4
3, 0, 2, 2, 1, 0, 3, 2, 1, 0
1, 1, 2, 0, 1, 3, 0, 1, 0, 5

update <<< a:
update <<< index: 9, 4, 3, 1
main <<< cms:
d: 4 w: 10 cmsTbl:
0, 3, 3, 0, 0, 3, 1, 0, 3, 2
2, 0, 0, 0, 3, 2, 1, 2, 1, 4
3, 0, 2, 3, 1, 0, 3, 2, 1, 0
1, 2, 2, 0, 1, 3, 0, 1, 0, 5

update <<< a: P
update <<< index: 5, 9, 0, 5
main <<< cms:
d: 4 w: 10 cmsTbl:
0, 3, 3, 0, 0, 4, 1, 0, 3, 2
2, 0, 0, 0, 3, 2, 1, 2, 1, 5
4, 0, 2, 3, 1, 0, 3, 2, 1, 0
1, 2, 2, 0, 1, 4, 0, 1, 0, 5

update <<< a: I
update <<< index: 2, 9, 8, 7
main <<< cms:
d: 4 w: 10 cmsTbl:
0, 3, 4, 0, 0, 4, 1, 0, 3, 2
2, 0, 0, 0, 3, 2, 1, 2, 1, 6
4, 0, 2, 3, 1, 0, 3, 2, 2, 0
1, 2, 2, 0, 1, 4, 0, 2, 0, 5

update <<< a: Z
update <<< index: 3, 3, 3, 4
main <<< cms:
d: 4 w: 10 cmsTbl:
0, 3, 4, 1, 0, 4, 1, 0, 3, 2
2, 0, 0, 1, 3, 2, 1, 2, 1, 6
4, 0, 2, 4, 1, 0, 3, 2, 2, 0
1, 2, 2, 0, 2, 4, 0, 2, 0, 5

update <<< a: Z
update <<< index: 3, 3, 3, 4
main <<< cms:
d: 4 w: 10 cmsTbl:
0, 3, 4, 2, 0, 4, 1, 0, 3, 2
2, 0, 0, 2, 3, 2, 1, 2, 1, 6
4, 0, 2, 5, 1, 0, 3, 2, 2, 0
1, 2, 2, 0, 3, 4, 0, 2, 0, 5

update <<< a: A
update <<< index: 7, 8, 4, 9
main <<< cms:
d: 4 w: 10 cmsTbl:
0, 3, 4, 2, 0, 4, 1, 1, 3, 2
2, 0, 0, 2, 3, 2, 1, 2, 2, 6
4, 0, 2, 5, 2, 0, 3, 2, 2, 0
1, 2, 2, 0, 3, 4, 0, 2, 0, 6

update <<< a:  
update <<< index: 9, 4, 3, 1
main <<< cms:
d: 4 w: 10 cmsTbl:
0, 3, 4, 2, 0, 4, 1, 1, 3, 3
2, 0, 0, 2, 4, 2, 1, 2, 2, 6
4, 0, 2, 6, 2, 0, 3, 2, 2, 0
1, 3, 2, 0, 3, 4, 0, 2, 0, 6

update <<< a: W
update <<< index: 2, 0, 8, 5
main <<< cms:
d: 4 w: 10 cmsTbl:
0, 3, 5, 2, 0, 4, 1, 1, 3, 3
3, 0, 0, 2, 4, 2, 1, 2, 2, 6
4, 0, 2, 6, 2, 0, 3, 2, 3, 0
1, 3, 2, 0, 3, 5, 0, 2, 0, 6

update <<< a: I
update <<< index: 2, 9, 8, 7
main <<< cms:
d: 4 w: 10 cmsTbl:
0, 3, 6, 2, 0, 4, 1, 1, 3, 3
3, 0, 0, 2, 4, 2, 1, 2, 2, 7
4, 0, 2, 6, 2, 0, 3, 2, 4, 0
1, 3, 2, 0, 3, 5, 0, 3, 0, 6

update <<< a: T
update <<< index: 6, 8, 3, 4
main <<< cms:
d: 4 w: 10 cmsTbl:
0, 3, 6, 2, 0, 4, 2, 1, 3, 3
3, 0, 0, 2, 4, 2, 1, 2, 3, 7
4, 0, 2, 7, 2, 0, 3, 2, 4, 0
1, 3, 2, 0, 4, 5, 0, 3, 0, 6

update <<< a: H
update <<< index: 1, 6, 4, 9
main <<< cms:
d: 4 w: 10 cmsTbl:
0, 4, 6, 2, 0, 4, 2, 1, 3, 3
3, 0, 0, 2, 4, 2, 2, 2, 3, 7
4, 0, 2, 7, 3, 0, 3, 2, 4, 0
1, 3, 2, 0, 4, 5, 0, 3, 0, 7

update <<< a:
update <<< index: 9, 4, 3, 1
main <<< cms:
d: 4 w: 10 cmsTbl:
0, 4, 6, 2, 0, 4, 2, 1, 3, 4
3, 0, 0, 2, 5, 2, 2, 2, 3, 7
4, 0, 2, 8, 3, 0, 3, 2, 4, 0
1, 4, 2, 0, 4, 5, 0, 3, 0, 7

update <<< a: A
update <<< index: 7, 8, 4, 9
main <<< cms:
d: 4 w: 10 cmsTbl:
0, 4, 6, 2, 0, 4, 2, 2, 3, 4
3, 0, 0, 2, 5, 2, 2, 2, 4, 7
4, 0, 2, 8, 4, 0, 3, 2, 4, 0
1, 4, 2, 0, 4, 5, 0, 3, 0, 8

update <<< a:
update <<< index: 9, 4, 3, 1
main <<< cms:
d: 4 w: 10 cmsTbl:
0, 4, 6, 2, 0, 4, 2, 2, 3, 5
3, 0, 0, 2, 6, 2, 2, 2, 4, 7
4, 0, 2, 9, 4, 0, 3, 2, 4, 0
1, 5, 2, 0, 4, 5, 0, 3, 0, 8

update <<< a: C
update <<< index: 7, 1, 5, 6
main <<< cms:
d: 4 w: 10 cmsTbl:
0, 4, 6, 2, 0, 4, 2, 3, 3, 5
3, 1, 0, 2, 6, 2, 2, 2, 4, 7
4, 0, 2, 9, 4, 1, 3, 2, 4, 0
1, 5, 2, 0, 4, 5, 1, 3, 0, 8

update <<< a: O
update <<< index: 2, 0, 2, 9
main <<< cms:
d: 4 w: 10 cmsTbl:
0, 4, 7, 2, 0, 4, 2, 3, 3, 5
4, 1, 0, 2, 6, 2, 2, 2, 4, 7
4, 0, 3, 9, 4, 1, 3, 2, 4, 0
1, 5, 2, 0, 4, 5, 1, 3, 0, 9

update <<< a: L
update <<< index: 1, 4, 1, 2
main <<< cms:
d: 4 w: 10 cmsTbl:
0, 5, 7, 2, 0, 4, 2, 3, 3, 5
4, 1, 0, 2, 7, 2, 2, 2, 4, 7
4, 1, 3, 9, 4, 1, 3, 2, 4, 0
1, 5, 3, 0, 4, 5, 1, 3, 0, 9

update <<< a: D
update <<< index: 1, 4, 4, 0
main <<< cms:
d: 4 w: 10 cmsTbl:
0, 6, 7, 2, 0, 4, 2, 3, 3, 5
4, 1, 0, 2, 8, 2, 2, 2, 4, 7
4, 1, 3, 9, 5, 1, 3, 2, 4, 0
2, 5, 3, 0, 4, 5, 1, 3, 0, 9

update <<< a:
update <<< index: 9, 4, 3, 1
main <<< cms:
d: 4 w: 10 cmsTbl:
0, 6, 7, 2, 0, 4, 2, 3, 3, 6
4, 1, 0, 2, 9, 2, 2, 2, 4, 7
4, 1, 3, 10, 5, 1, 3, 2, 4, 0
2, 6, 3, 0, 4, 5, 1, 3, 0, 9

update <<< a: P
update <<< index: 5, 9, 0, 5
main <<< cms:
d: 4 w: 10 cmsTbl:
0, 6, 7, 2, 0, 5, 2, 3, 3, 6
4, 1, 0, 2, 9, 2, 2, 2, 4, 8
5, 1, 3, 10, 5, 1, 3, 2, 4, 0
2, 6, 3, 0, 4, 6, 1, 3, 0, 9

update <<< a: E
update <<< index: 1, 7, 6, 2
main <<< cms:
d: 4 w: 10 cmsTbl:
0, 7, 7, 2, 0, 5, 2, 3, 3, 6
4, 1, 0, 2, 9, 2, 2, 3, 4, 8
5, 1, 3, 10, 5, 1, 4, 2, 4, 0
2, 6, 4, 0, 4, 6, 1, 3, 0, 9

update <<< a: R
update <<< index: 8, 5, 7, 9
main <<< cms:
d: 4 w: 10 cmsTbl:
0, 7, 7, 2, 0, 5, 2, 3, 4, 6
4, 1, 0, 2, 9, 3, 2, 3, 4, 8
5, 1, 3, 10, 5, 1, 4, 3, 4, 0
2, 6, 4, 0, 4, 6, 1, 3, 0, 10

update <<< a: O
update <<< index: 2, 0, 2, 9
main <<< cms:
d: 4 w: 10 cmsTbl:
0, 7, 8, 2, 0, 5, 2, 3, 4, 6
5, 1, 0, 2, 9, 3, 2, 3, 4, 8
5, 1, 4, 10, 5, 1, 4, 3, 4, 0
2, 6, 4, 0, 4, 6, 1, 3, 0, 11

update <<< a: N
update <<< index: 8, 4, 6, 0
main <<< cms:
d: 4 w: 10 cmsTbl:
0, 7, 8, 2, 0, 5, 2, 3, 5, 6
5, 1, 0, 2, 10, 3, 2, 3, 4, 8
5, 1, 4, 10, 5, 1, 5, 3, 4, 0
3, 6, 4, 0, 4, 6, 1, 3, 0, 11

update <<< a: I
update <<< index: 2, 9, 8, 7
main <<< cms:
d: 4 w: 10 cmsTbl:
0, 7, 9, 2, 0, 5, 2, 3, 5, 6
5, 1, 0, 2, 10, 3, 2, 3, 4, 9
5, 1, 4, 10, 5, 1, 5, 3, 5, 0
3, 6, 4, 0, 4, 6, 1, 4, 0, 11

main <<< hist: { =6, A=2, C=1, D=1, E=3, H=2, I=4, L=1, N=2, O=4, P=5, R=3, T=2, W=1, Z=2}        
main <<< hist:
  (6) ******
A (2) **
C (1) *
D (1) *
E (3) ***
H (2) **
I (4) ****
L (1) *
N (2) **
O (4) ****
P (5) *****
R (3) ***
T (2) **
W (1) *
Z (2) **

main <<< ch:   actual: 6 estimate: 6
main <<< ch: A actual: 2 estimate: 3 OVERESTIMATE!!!
main <<< ch: C actual: 1 estimate: 1
main <<< ch: D actual: 1 estimate: 3 OVERESTIMATE!!!
main <<< ch: E actual: 3 estimate: 3
main <<< ch: H actual: 2 estimate: 2
main <<< ch: I actual: 4 estimate: 4
main <<< ch: L actual: 1 estimate: 1
main <<< ch: N actual: 2 estimate: 3 OVERESTIMATE!!!
main <<< ch: O actual: 4 estimate: 4
main <<< ch: P actual: 5 estimate: 5
main <<< ch: R actual: 3 estimate: 3
main <<< ch: T actual: 2 estimate: 2
main <<< ch: W actual: 1 estimate: 5 OVERESTIMATE!!!
main <<< ch: Z actual: 2 estimate: 2

In this example it seems we are adding letters from a string into a CMS data structure. After each operation the cms data structure is displayed. Sorry I left in a long string which contains UPPER case characters and spaces.

After the updates are completed, it seems that on the side we have been collecting data in a histogram. The contents of the data and a graphical representation is made.

Finally we traverse the letters we have used to update the cms data structure and attempt to predict their counts. Note that in this case the cms data structure was not properly configured so we get some overestimates due to collisions. The reason is similar to what happens on Bloom filters when they are not properly configured (size and number of hash functions).

package com.canessa;

import java.util.Map;
import java.util.TreeMap;


/**
 * Test scaffold.
 */
public class App {


    /**
     * Populate histogram with characters from the specified string.
     * @param str
     * @return
     */
    public static TreeMap<Character, Integer> histogram(String str) {

        // **** initiialization ****
        TreeMap<Character, Integer> hist = new TreeMap<>();

        // **** sanity check(s) ****
        if (str.length() <= 0) return hist;

        // **** process each character in the string ****
        for (int i = 0; i < str.length(); i++) {

            // **** for ease of use ****
            char ch = str.charAt(i);

            // **** put this character if not present ****
            Integer count = hist.putIfAbsent(ch, 1);

            // **** update count if needed ****
            if (count != null)
                hist.put(ch, count + 1);
        }

        // **** return histogram ****
        return hist;
    }


    /**
     * Generate a string representing the speficified histogram.
     * @return
     */
    static String histToString(TreeMap<Character, Integer> hist) {
        
        // **** sanity check(s) ****
        if (hist == null) return null;

        // **** initialization ****
        StringBuilder sb = new StringBuilder();

        // **** traverse the entries in the tree map ****
        for (Map.Entry<Character, Integer> entry : hist.entrySet() ) {

            // **** for ease of use ****
            char ch     = entry.getKey();
            int count   = entry.getValue();

            // **** to generate string of stars (*) ****
            StringBuilder stars = new StringBuilder();

            // **** generate string of stars ****
            for (int i = 0; i < count; i++)
                stars.append('*');

            // **** generate line for histogram ****
            sb.append(ch + " (" + count + ") " + stars + '\n');
        }

        // **** return a string representation of the histogram ****
        return sb.toString();
    }


    /**
     * Test scaffold.
     * @param args
     */
    public static void main( String[] args )
    {

        // ***** welcome message ****
        System.out.println( "main <<< **** Count-min Sketch (CMS) ****" );

        // **** initialization ****
        int d       = 4;                        // number of hash tables (rows)
        int w       = 10;                       // number of columns

        // String str  = "ABCABACLBA";          // input stream of characters
        String str  = "HOT PEPPERRONI PIZZA WITH A COLD PERONI";

        // **** instantiate cms ****
        CountMinSketch cms = new CountMinSketch(d, w);


        // **** traverse the string processing all characters ****
        for (int i = 0; i < str.length(); i++) {

            // **** get current character from the input string ****
            char ch = str.charAt(i);

            // **** update the CMS with the hash values associated with this character ****
            cms.update(ch);

            // ???? ????
            System.out.println("main <<< cms: " + cms.toString());
        }


        // **** generate histogram (simple because we have a limited number of characters in the string) ****
        TreeMap<Character, Integer> hist = histogram(str);

        // ???? ????
        System.out.println("main <<< hist: " + hist.toString());

        // **** ****
        System.out.println("main <<< hist:\n" + histToString(hist));


        // **** traverse the histogram processing all characters ****
        for (Map.Entry<Character, Integer> entry : hist.entrySet() ) {

            // **** for ease of use ****
            char ch     = entry.getKey();
            int count   = entry.getValue();

            // **** estimate cms ****
            int estimate = cms.estimate(ch);

            // **** display the estimate for this character ****
            System.out.print("main <<< ch: " + ch + " actual: " + count + " estimate: " + estimate);
            if (count != estimate) System.out.println(" OVERESTIMATE!!!");
            else System.out.println();
        }
    }
}

We start with the test scaffold.

The histogram() function is used to populate our histogram. This is a utility function.

The histToString() function generates a string representation of the histogram.

The first loop in the test scaffold updates the cms one character at a time.

The next few lines are used to generate and display the histogram. This is done so we can see how many times each letter is present in the input string.

In the last loop we traverse the characters in our histogram and call the estimate() method in the cms to calculate the estimate for each character. Note that since in this example the parameters passed to the constructor are off for the size and contents of the characters. I started with a short string and kept on adding words until several OVERESTIMATES showed up.

package com.canessa;

// **** ****
import org.apache.commons.codec.digest.MurmurHash3;


/**
 * Implements a simple count-min sketch object.
 */
public class CountMinSketch {


    // **** class members ****
    public int d            = 0;            // number of hash functions (rows)
    public int w            = 0;            // values generated by the hash functions [1 : w]
    public int[][] cmsTbl   = null;         // cms table


    /**
     * Constructor.
     */
    public CountMinSketch(int d, int w) {
        this.d      = d;
        this.w      = w;
        this.cmsTbl = new int[d][w];
    }


    /**
     * Convert integer to byte array.
     */
    private static final byte[] intToByteArray(int value) {
        return new byte[] {
                            (byte)(value >>> 24),
                            (byte)(value >>> 16),
                            (byte)(value >>> 8),
                            (byte)value};
    }


    /**
     * This method adds another instance of the item to the dataset.
     */
    public void update(char a) {

        // ???? ????
        System.out.println("update <<< a: " + a);

        // ???? ????
        System.out.print("update <<< index: ");

        // **** update table ****
        for (int j = 0; j < this.d; j++) {

            // **** convert input to byte array ****
            byte[] data = intToByteArray(a);
            int len = data.length;

            // **** compute index into cms table ****
            int seed = j;
            int index = (int)(Math.abs(MurmurHash3.hash64(data, 0, len, seed)) % this.w);

            // ???? ????
            System.out.print(index);
            if (j < this.d - 1) System.out.print(", ");

            // **** update cms table entry ****
            cmsTbl[j][index] += 1;
        }

        // ???? ????
        System.out.println();
    }


    /**
     * This method returns the approximate frequency of the querried item.
     * Overestimate happens ONLY if there was a collision in each row.
     * @return
     */
    public int estimate(char a) {

        // **** initialization ****
        byte[] data = intToByteArray(a);
        int min     = Integer.MAX_VALUE;

        // // ???? ????
        // System.out.print("estimate <<< values: ");

        // **** ****
        for (int j = 0; j < this.d; j++) {

            // **** ****
            int seed = j;
            int index = (int)(Math.abs(MurmurHash3.hash64(data, 0, 4, seed)) % this.w);

            // // ???? ????
            // System.out.print(index);
            // if (j < this.d - 1)
            //     System.out.print(", ");

            // **** ****
            if (this.cmsTbl[j][index] < min)
                min = this.cmsTbl[j][index];
        }

        // // ???? ????
        // System.out.println();

        // **** return estimate ****
        return min;
    }


    /**
     * Returns a string representation of this object.
     */
    @Override
    public String toString() {
        StringBuilder sb = new StringBuilder();
        sb.append("\nd: " + this.d + " w: " + this.w + " cmsTbl:\n");

        for (int r = 0; r < this.d; r++) {
            for (int c = 0; c < this.w; c++) {

                // **** append cmsTbl value ****
                sb.append(this.cmsTbl[r]);

                // **** append , as a separator ****
                if (c < this.w - 1) sb.append(", ");
            }

            // **** end of row ****
            sb.append('\n');
        }
        return sb.toString();
    }
}

This is the implementation of the CountMinSketch class. At this point I have little to add to it. I just have to say that lines preceded with // ???? ???? comments are for testing only.

DAM Binary Search (in progress)

This section will be included in the next version of this post.

B-Tree

A B-tree is a self-balancing tree data structure that maintains sorted data and allows searches, sequential access, insertions, and deletions in logarithmic time. The B-tree generalizes the binary search tree, allowing for nodes with more than two children. Unlike other self-balancing binary search trees, the B-tree is well suited for storage systems that read and write relatively large blocks of data, such as databases and file systems.

A B+ tree is an m-ary tree with a variable but often large number of children per node. A B+ tree consists of a root, internal nodes and leaves. The root may be either a leaf or a node with two or more children.

A B+ tree can be viewed as a B-tree in which each node contains only keys (not key–value pairs), and to which an additional level is added at the bottom with linked leaves.

The primary value of a B+ tree is in storing data for efficient retrieval in a block-oriented storage context — in particular, filesystems. This is primarily because unlike binary search trees, B+ trees have very high fanout (number of pointers to child nodes in a node, typically on the order of 100 or more), which reduces the number of I/O operations required to find an element in the tree.

In this section we will just cover a B-Tree. The book briefly describes several types of B-trees.

main <<< bt: 15 17 28 30 32 48 55 57 60 66 76 88 95 96 99 101 102 103 123
main <<<  k: 6 NOT present
main <<<  k: 15 present

Our test scaffold displays the entries made in a random order into the B-tree. A check is done for an existing and a non-existing element.

#include <boost/lockfree/stack.hpp>
#include <iostream>

using namespace std;


/// <summary>
/// 
/// </summary>
class BTreeNode
{

    // **** ****
private:
    int*        keys;                   // array of keys
    int         t;                      // minimum degree (defines the range for number of keys)
    BTreeNode** C;                      // array of child pointers
    int         n;                      // current number of keys
    bool        leaf;                   // true when node is leaf; otherwise false

    // **** ****
public:

    /// <summary>
    /// Constructor.
    /// </summary>
    /// <param name="_t"></param>
    /// <param name="_leaf"></param>
    BTreeNode(int _t, bool _leaf);


    /// <summary>
    /// Utility function to insert a new key in the subtree rooted with
    /// this node. It is assumed that the node must be non-full when this
    /// function is called.
    /// </summary>
    /// <param name="k"></param>
    void insertNonFull(int k);


    /// <summary>
    /// Utility function to split the child y of this node. i is index of y in
    // child array C[].  The Child y must be full when this function is called
    /// </summary>
    /// <param name="i"></param>
    /// <param name="y"></param>
    void splitChild(int i, BTreeNode* y);


    /// <summary>
    /// Function used to traverse all nodes in a subtree rooted with this node.
    /// </summary>
    void traverse();


    /// <summary>
    /// Function to search a key in the subtree rooted with this node.
    /// </summary>
    /// <param name="k"></param>
    /// <returns>NULL if k is not present</returns>
    BTreeNode* search(int k);


    /// <summary>
    /// Make BTree friend of this so that we can access private members of this
    /// class in BTree functions.
    /// </summary>
    friend class BTree;
};


/// <summary>
/// A B-tree is a self-balancing tree data structure that maintains sorted data and allows 
/// searches, sequential access, insertions, and deletions in logarithmic time.
/// The B-tree generalizes the binary search tree, allowing for nodes with more than two children.
/// </summary>
class BTree
{

private:
    BTreeNode*  root;
    int         t;

    // **** methods ****
public:

    /// <summary>
    /// Constructor
    /// </summary>
    /// <param name="_t"></param>
    BTree(int _t)
    {
        root    = NULL; 
        t       = _t;
    }


    // **** getter(s) ****
    BTreeNode* getRoot()
    {
        return this->root;
    }


    /// <summary>
    /// Traverse the B-Tree.
    /// </summary>
    void traverse()
    {
        if (root != NULL)
        {

            // **** ****
            root->traverse();

            // **** for the looks ****
            cout << endl;
        }
    }


    /// <summary>
    /// Search for a key in this B-Tree
    /// </summary>
    /// <param name="k"></param>
    /// <returns></returns>
    BTreeNode* search(int k)
    {
        return (root == NULL) ? NULL : root->search(k);
    }


    /// <summary>
    /// Insert a new key in this B-Tree.
    /// </summary>
    /// <param name="k"></param>
    void insert(int k);
};


/// <summary>
/// Constructor for BTreeNode class.
/// </summary>
/// <param name="t1"></param>
/// <param name="leaf1"></param>
BTreeNode::BTreeNode(int t1, bool leaf1)
{

    // **** copy the given minimum degree and leaf property ****
    t       = t1;
    leaf    = leaf1;

    // **** allocate memory for maximum number of possible keys and child pointers ****
    keys    = new int[2 * t - 1];
    C       = new BTreeNode * [2 * t];

    // **** initialize the number of keys as 0 ****
    n       = 0;
}


/// <summary>
/// Used to traverse all nodes in a subtree rooted with this node.
/// </summary>
void BTreeNode::traverse()
{

    // **** there are n keys and n + 1 children,
    //      traverse through n keys and first n children ****
    int i;

    // **** ****
    for (i = 0; i < n; i++)
    {
        // **** if this is not leaf, then before printing key[i],
        //      traverse the subtree rooted with child C[i] ****
        if (leaf == false)
            C[i]->traverse();

        // **** display key ****
        //cout << " " << keys[i];
        cout << keys[i] << " ";

    }

    // **** print the subtree rooted with last child ****
    if (leaf == false)
        C[i]->traverse();
}


/// <summary>
/// Used to search key k in subtree rooted with this node.
/// </summary>
/// <param name="k"></param>
/// <returns></returns>
BTreeNode* BTreeNode::search(int k)
{

    // **** find the first key greater than or equal to k ****
    int i = 0;
    while (i < n && k > keys[i])
        i++;

    // **** if the found key is equal to k, return this node ****
    if (keys[i] == k)
        return this;

    // **** if key is not found, this is a leaf node ****
    if (leaf == true)
        return NULL;

    // **** go to the appropriate child ****
    return C[i]->search(k);
}


/// <summary>
/// Used to inserts a new key in this B-Tree.
/// </summary>
/// <param name="k"></param>
void BTree::insert(int k)
{

    // **** if tree is empty ****
    if (root == NULL)
    {
        // **** allocate memory for root ****
        root = new BTreeNode(t, true);
        root->keys[0] = k;      // insert key
        root->n = 1;            // update number of keys in root
    }

    // **** tree is not empty ****
    else
    {

        // **** if root is full, then tree grows in height ****
        if (root->n == 2 * t - 1)
        {

            // **** allocate memory for new root ****
            BTreeNode* s = new BTreeNode(t, false);

            // **** make old root as child of new root ****
            s->C[0] = root;

            // **** split the old root and move 1 key to the new root ****
            s->splitChild(0, root);

            // **** new root has two children now
            //      decide which of the two children is going to have new key
            int i = 0;
            if (s->keys[0] < k)
                i++;
            s->C[i]->insertNonFull(k);

            // **** change root ****
            root = s;
        }

        // **** root not full, call insertNonFull for root
        else
        {
            root->insertNonFull(k);
        }
    }
}


/// <summary>
/// Utility function to insert a new key in this node.
// The assumption is, the node must be non-full when this
// function is called.
/// </summary>
/// <param name="k"></param>
void BTreeNode::insertNonFull(int k)
{

    // **** initialize index as index of rightmost element ****
    int i = n - 1;

    // **** this is a leaf node ****
    if (leaf == true)
    {

        // a) find the location of new key to be inserted
        // b) move all greater keys to one place ahead
        while (i >= 0 && keys[i] > k)
        {

            // **** ****
            keys[i + 1] = keys[i];

            // **** ****
            i--;
        }

        // **** insert the new key at the found location ****
        keys[i + 1] = k;
        n           = n + 1;
    }

    // **** this node is not leaf ****
    else 
    {
        // **** find the child which is going to have the new key ****
        while (i >= 0 && keys[i] > k)
            i--;

        // **** determine if the found child is full ****
        if (C[i + 1]->n == 2 * t - 1)
        {

            // **** if the child is full, split it ****
            splitChild(i + 1, C[i + 1]);

            // **** after split, the middle key of C[i] goes up and
            //      C[i] is splitted into two.  See which of the two
            //      is going to have the new key ****
            if (keys[i + 1] < k)
                i++;
        }
        C[i + 1]->insertNonFull(k);
    }
}


/// <summary>
/// Utility function to split the child y of this node.
/// Note that y must be full when this function is called.
/// </summary>
/// <param name="i"></param>
/// <param name="y"></param>
void BTreeNode::splitChild(int i, BTreeNode* y)
{

    // **** create a new node which is going to store (t-1) keys of y ****
    BTreeNode* z 	= new BTreeNode(y->t, y->leaf);
    z->n 			= t - 1;

    // **** copy the last (t-1) keys of y to z ****
    for (int j = 0; j < t - 1; j++)
        z->keys[j] = y->keys[j + t];

    // **** copy the last t children of y to z ****
    if (y->leaf == false)
    {
        for (int j = 0; j < t; j++)
            z->C[j] = y->C[j + t];
    }

    // **** reduce the number of keys in y ****
    y->n = t - 1;

    // **** since this node is going to have a new child,
    //      create space of new child ****
    for (int j = n; j >= i + 1; j--)
        C[j + 1] = C[j];

    // **** link new child to this node ****
    C[i + 1] = z;

    // **** a key of y will move to this node. Find the location of
    //      new key and move all greater keys one space ahead ****
    for (int j = n - 1; j >= i; j--)
        keys[j + 1] = keys[j];

    // **** copy the middle key of y to this node ****
    keys[i] = y->keys[t - 1];

    // **** increment count of keys in this node ****
    n = n + 1;
}


/// <summary>
/// Test scaffold.
/// </summary>
/// <returns></returns>
int main()
{

    // **** initialization ****
    int h{4};                           // minimum degree / height of B-Tree
    int k{0};                           // key to search for in B-Tree

    // *** instantiate B-Tree ****
    BTree bt(h);

    // **** insert nodes into the B-Tree ****
    //bt.insert(10);
    //bt.insert(20);
    //bt.insert(5);
    //bt.insert(6);

    //bt.insert(12);
    //bt.insert(30);
    //bt.insert(7);
    //bt.insert(17);


    // **** ****
    //bt.insert(45);
    //bt.insert(5);
    //bt.insert(31);
    //bt.insert(8);
    //bt.insert(15);
    //bt.insert(40);
    //bt.insert(20);
    //bt.insert(42);
    //bt.insert(51);
    //bt.insert(1);
    //bt.insert(28);
    //bt.insert(36);
    //bt.insert(37);
    //bt.insert(38);
    //bt.insert(23);


    // **** ****
    bt.insert(28);
    bt.insert(15);
    bt.insert(17);

    bt.insert(30);
    bt.insert(95);
    bt.insert(102);
    bt.insert(60);

    bt.insert(32);
    bt.insert(48);
    bt.insert(55);
    bt.insert(57);

    bt.insert(123);
    bt.insert(103);

    bt.insert(101);
    bt.insert(99);
    bt.insert(96);

    bt.insert(66);
    bt.insert(76);
    bt.insert(88);


    // **** traverse the B-Tree ****
    cout << "main <<< bt: ";
    bt.traverse();

    // **** search for this key in the B-Tree ****
    k = 6;
    (bt.search(k) != NULL) ? cout << "main <<<  k: " << k << " present\n" : cout << "main <<<  k: " << k << " NOT present\n";

    // **** search for this key in the B-Tree ****
    k = 15;
    (bt.search(k) != NULL) ? cout << "main <<<  k: " << k << " present\n" : cout << "main <<<  k: " << k << " NOT present\n";

    // **** all went well ****
    return 0;
}

The BTreeNode class is shown.

The BTree class follows.

The last section contains the main() which implements the test scaffold.

After instantiating the BTree class, some inserts have been commented out. They were used for experimentation.

A set of values are inserted into the B-Tree.

We then search for value 6 followed by value 15. In the current state, the results of these operations were displayed earlier.

LSM Tree (in progress)

The book covers this subject without code. It mentions that B-Trees were used in earlier implementations but the approach has changed by merging log files at different levels.

Conclusion

Enjoyed reading the book and have learned a lot from it. Will complete the missing sections.

Enjoy,

John

Leave a Reply

Your email address will not be published. Required fields are marked *

This site uses Akismet to reduce spam. Learn how your comment data is processed.