Vector Model and Similarity Search

Have you ever wondered how computers search for text and similar images?

For example, if you use Windows, open a File Explorer window. From top to bottom the windows has the title bar, the menu bar, the tool bar. Under the toolbar there are two text fields. The one on the left displays the full path to the current folder / directory. The one on the right displays “Search <current_folder>” e.g., “Algorithms”. I have enabled in my computer “Index Properties and File Contents”. By default when you search, Windows will only search the file names and properties; not the contents of the file. Depending on your usage, you might need to index some or all the files in all folders in your computer. In my case, I perform searches in all types of documents. If you mostly use the Office Suite, you might enable search only on folders holding your *.docx files. The reason for this is that the mechanism uses additional disk and memory to operate.

Now that we have defined a context for the topic, let’s discuss the flavor of how the text search algorithms work. I have no idea of the actual implementation used by Windows.

To start let’s define a simple case with three (3) documents, each of which contains a very reduce set of words. You could think of them as being names of articles on-line.

My first attempt was to look at names of newspapers in Minneapolis and St. Paul, MN (I live in Minnesota). I came up with the following list:

Star Tribune
St. Paul Pioneer Press
Duluth News-Tribune
St. Cloud Times
The Minnesota Daily
The Free Press
Faribault Daily News
Owatonna People’s Press
West Central Tribune
The Brainerd Dispatch
Willmar West Central Tribune

For the purpose of this post, I decided to edit the names in order to be able to get only three newspapers. The list of fictitious newspapers follows:

d1: saint paul tribune
d2: saint paul post
d3: willmar west tribune

Note that the names of the newspapers are in lower case.

With our articles let’s now create the term frequency (tf) matrix.

	paul	post	saint	tribune	west	willmar
d1	1	    0	    1	    1	    0	    0
d2	1	    1	    1	    0	    0	    0
d3	0	    0	    0	    1	    1	    1

On the top row of the frequency matrix we have all the words in all our documents in alphabetical order. Having the words in alphabetic order might make it easier for future operations, especially when considering that the English language has about 172,000 words. Depending on the field(s) of interest, most technical words would probably not be included in the regular English dictionary. This implies that one may easily end with more than 200,000 columns.

The first column contains the list of documents of interest. In our simple example we have three, but in practice you might be searching millions of documents at a time.

Our next step is to compute the Inverse Document Frequency (IDF) for our example. The following shows the actual computations:

paul		2		        log2(3/2) = 0.584   <====
post		1		        log2(3/1) = 1.584
saint		2		        log2(3/2) = 0.584   <====
tribune		2		        log2(3/2) = 0.584
west		1		        log2(3/1) = 1.584
willmar		1		        log2(3/1) = 1.584

Why do we build the IDF? The IDF is a penalty factor used to diminish the importance of simple words that may be repeated in a document without adding any value to a search. For example “the” or “is” may be found in a document dozens of times, but in most cases, it does not add value to a search. I say in most cases, because if you are searching for a rock band e.g., “The Beatles”, “The The” then the word “the” would be of interest.

A typical question that often arises is why use log2() instead of log10() or ln() (which is equivalent to loge() where e = )? The answer is that it does not matter as illustrated by the following:

log2(x) = ln(x) / ln(2)

log2(x) = log10(x) / log10(2)

ln() = loge()

where e	= 2.718281828...

In this case, the result of using a different base for the logarithms is an IDF value multiplied by a constant. Since the same constant is applied to all values, feel free to use the algorithm base you are most comfortable with. In practice you will not be performing the operations using a calculator, but probably will be doing this in python using a standard library (i.e., NumPy) function / method.

Now let’s build the tf-idf (term frequency – inverse document frequency) matrix. To do this we start with the tf matrix and replace the values with the IDF and add the Length for each document. I personally would like to see a different name for the Length column; which would convey a better description e.g., Distance.

Computing the Length per document is done using the following operations:

Length(d1)  sqrt(0.584^2 + 0.584^2 + 0.584^2) = 1.011
Length(d2)  sqrt(0.584^2 + 1.584^2 + 0.584^2) = 1.786
Length(d3)  sqrt(0.584^2 + 1.584^2 + 1.584^2) = 2.316

Now let’s plug in the Length values to the tf-idf matrix:

	paul	post	saint	tribune	west	willmar Length
d1	0.584   0	    0.584	0.584   0	    0       1.011
d2	0.584   1.584   0.584	0	    0	    0       1.786
d3	0	    0	    0	    0.584   1.584   1.584   2.316

We are now ready to perform a search in the vector space. We need a query. For example and simplicity purpose let’s use the following query:

q:  saint saint paul

In actuality one would not repeat a term in the search, but for this example we will.

The max frequency of term “saint” is 2 and max frequency of term “paul”  is 1.

Our next step is to compute the Inverse Document Frequency (IDF) for our query.

saint   2               log2(3/2) = 0.584
paul    1               log2(3/1) = 1.584

The query vector would be:

  paul              post    saint               tribune west    willmar
Q[(1/2)*0.584=0.292 0       (2/2)*0.584=0.584   0       0       0]
Q[0.292             0       0.584               0       0       0]

We then compute the Length:

length(q): sqrt(0.292^2 + 0.584^2) = 0.652

Finally we check the similarity with each of the three documents:

similarity = cos(theta) = A . B / ||A|| ||B||

                   "saint"		  "paul"
                   from dx & Q	  from dx & Q	  length(dx) * length(Q)
cosSim(d1,q)	= (0.584*0.584 +  0.584*0.292) / (1.011      * 0.652)		= 0.776
cosSim(d2,q)	= (0.584*0.584 +  0.584*0.292) / (1.786      * 0.652)       = 0.292
cosSim(d3,q)	= (0.000*0.584 +  0.000*0.292) / (2.316      * 0.652)       = 0.000

The results seem to be reasonable. The first document would be the most similar to the query followed by the second. The third document should not be included in the results because the query has no common terms.

I always like to check results in order to make sure that the concepts have been properly applied. In this case we could create the three sample documents and then use Apache Lucene ( to compare results.

We will create three short and simple CSV files using oocalc by entering the following commands:

$ cd Downloads/big-data-2/vector/data
$ oocalc

LibreOffice Calc starts.

Enter the data for the first document one word per column. In the first column enter without the double quotes “Saint”. In the second column enter “Paul”. Finally in the third column enter “Tribune”.

On the menu bard select File -> Save As… The”Save” window is displayed.

In the “Name:” field enter doc1.csv

Using the Places and Name panes, navigate to:  Downloads/big-data-2/vector/my_data

In the File type -> Text CSV

Then click on the <Save> button.

On the “Confirm File Format” window click on the <Use Text CSV Format> button to save the data in Unicode (UTF-8) format.

On the “Export Text File” window leave the field options to the defaults and click on the <OK> button.

Close the spreadsheet and repeat the previous steps to create two additional documents with names doc2.csv and doc3.csv. For the second document enter “Saint”, “Paul”, and “Post”. For the third document enter “Willmar”, “West” and “Tribune”.

To make sure all is well you could enter the following command:

$ more doc*.csv

Let’s go up one folder and take a look at what is available using the following commands:

$ cd ..
$ ls -l
total 32
drwx------ 3 cloudera cloudera 4096 Aug 10 08:13 data
-rw------- 1 cloudera cloudera 7031 Apr 25  2016 LuceneQuery.class
-rw------- 1 cloudera cloudera 6863 Apr 25  2016 LuceneTFIDF.class
drwxrwxr-x 2 cloudera cloudera 4096 Aug 10 08:13 my_data
-rwx------ 1 cloudera cloudera  166 Apr 25  2016
-rwx------ 1 cloudera cloudera  165 Apr 25  2016

Let’s take a look at a document search by entering the following commands:

$ ./ my_data
Index Location:my_data/index
Skipping (not csv/htm/html/xml/txt) : write.lock
Indexed : my_data/doc2.csv
Indexed : my_data/doc1.csv
Indexed : my_data/doc3.csv

3 new documents added.
Enter query for Lucene (q=quit): 
saint saint paul
Displaying 2 results.
1) my_data/doc2.csv score :0.8660254
2) my_data/doc1.csv score :0.8660254
Enter query for Lucene (q=quit): 

For a query we used the same text “saint saint paul” as in the example we ran earlier in this post. Note that the scores for the first two documents are equal and that the third document was not included.

Finally we can compute the TF-IDF values as illustrated by:

$ ./ my_data
Index Location:my_data/index
Skipping (not csv,htm,html,xml,txt : write.lock
Indexed : my_data/doc2.csv
Indexed : my_data/doc1.csv
Indexed : my_data/doc3.csv
3 new documents added.
Enter a term to calculate TF-IDF (q=quit): 
Doc # 0: my_data/doc2.csv   TF-IDF = 1.0
Doc # 1: my_data/doc1.csv   TF-IDF = 1.0
Enter a term to calculate TF-IDF (q=quit): 
Doc # 0: my_data/doc2.csv   TF-IDF = 1.0
Doc # 1: my_data/doc1.csv   TF-IDF = 1.0
Enter a term to calculate TF-IDF (q=quit): 
Doc # 2: my_data/doc3.csv   TF-IDF = 1.4054651260375977
Enter a term to calculate TF-IDF (q=quit): 
Doc # 1: my_data/doc1.csv   TF-IDF = 1.0
Doc # 2: my_data/doc3.csv   TF-IDF = 1.0
Enter a term to calculate TF-IDF (q=quit): 

Of course I conducted additional experiments editing the existing documents and using different queries. Make sure you understand the results. If something does not look right run the computations manually. Please note that the numbers might not be the same, but the results MUST be in the same order.

The search for similar images is similar to text but images are expressed as histograms with pixel values in ranges. For color images with three channels, Red, Green and Blue, one could use ranges 32 values apart and end up with three challes each with values in the ranges of 0-31, 32-63, … , 224-255.

If you have questions regarding images please drop me a message at the end of this post. I will follow up with a second post that deals with RGB images.

Remember that “you do not know what you cannot explain”.


Follow me on Twitter: @john_canessa

Leave a Reply

Your email address will not be published. Required fields are marked *

This site uses Akismet to reduce spam. Learn how your comment data is processed.