More than a List of Words

When indexing text based on word frequency and relevance, as is done for web searches, one of the procedures used is to build a term frequency (tf) array followed by an inverse document frequency (idf) array. You can read more about this here.
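To make the tf and idf steps concrete, here is a minimal sketch in Python. It assumes whitespace tokenization, lowercase terms, and the plain unsmoothed weighting tf × log(N / df), which is only one of several common variants. The function name `tf_idf` and the sample sentences are made up for illustration.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Return one {term: weight} dict per document (hypothetical helper)."""
    # Assumption: splitting on whitespace is good enough for this sketch.
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)

    # Document frequency: in how many documents does each term appear?
    df = Counter(term for tokens in tokenized for term in set(tokens))

    weights = []
    for tokens in tokenized:
        counts = Counter(tokens)
        total = len(tokens)
        weights.append({
            # tf is the relative frequency of the term in this document;
            # idf is log(N / df), which is 0 for terms found in every document.
            term: (count / total) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return weights


docs = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "the cat chased the dog",
]
weights = tf_idf(docs)
```

With these sample sentences, "the" appears in every document, so its weight drops to zero, while a word such as "mat" that appears in only one document gets the highest weight in that document.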

In a previous post I experimented with some text in order to build hashmaps from the words in sentences (to keep things in perspective for a blog post). In that post I used a string that I copied from a course I took some years ago. The string was already preprocessed: the punctuation marks had already been stripped out.

Simple Problems in Python

Last week I was reading a post on Medium “First Steps in Data Science with Python NumPy” by Kshitij Bajracharya.

What caught my attention was his opening statement: “I’ve read that the best way to learn something is to blog about it”. I believe Kshitij hit the nail on the head. The reason I agree is that I have long been a believer in “If you can’t explain it simply, you don’t understand it well enough”, a quote attributed to Albert Einstein.

Vector Model and Similarity Search

Have you ever wondered how computers search for text and similar images?

For example, if you use Windows, open a File Explorer window. From top to bottom the window has the title bar, the menu bar, and the toolbar. Under the toolbar there are two text fields. The one on the left displays the full path to the current folder / directory. The one on the right displays “Search <current_folder>”, e.g., “Search Algorithms”. On my computer I have enabled “Index Properties and File Contents”. By default, when you search, Windows will only search the file names and properties, not the contents of the files. Depending on your usage, you might need to index some or all of the files in all folders on your computer. In my case, I perform searches in all types of documents. If you mostly use the Office Suite, you might enable content search only on the folders holding your *.docx files. The reason to be selective is that the indexing mechanism uses additional disk space and memory to operate.