More than a List of Words

When indexing text based word frequency / relevance which may be applicable for web searches, one of the procedures used is to create a term frequency (tf) array followed by an inverse document frequency (idf) one. You can read more about this here.

In a previous post I experimented with some text in order to build hashmaps with the words of sentences (to keep things in perspective for a blog post). In that post I used a string that I copied from a course I took some years ago. The sting was already preprocessed. The text had already been stripped off punctuation marks. Continue reading “More than a List of Words”

Vector Model and Similarity Search

Have you ever wondered how computers search for text and similar images?

For example, if you use Windows, open a File Explorer window. From top to bottom the windows has the title bar, the menu bar, the tool bar. Under the toolbar there are two text fields. The one on the left displays the full path to the current folder / directory. The one on the right displays “Search <current_folder>” e.g., “Algorithms”. I have enabled in my computer “Index Properties and File Contents”. By default when you search, Windows will only search the file names and properties; not the contents of the file. Depending on your usage, you might need to index some or all the files in all folders in your computer. In my case, I perform searches in all types of documents. If you mostly use the Office Suite, you might enable search only on folders holding your *.docx files. The reason for this is that the mechanism uses additional disk and memory to operate. Continue reading “Vector Model and Similarity Search”

Radix Sort

Earlier this week I ran into a description of Radix Sort. This sorting algorithm has been around for a few centuries (yes; that is not a typo). The algorithm dates back to 1887 to the work of Herman Hollerith (and yes; he was the inventor of the Hollerith Card Code for punched cards used in the past century).

This sorting algorithm is not the fastest, it requires additional space, but has been around for a long time. When you read about it, seems like it should not work; but it does. Continue reading “Radix Sort”

Code Complexity

Code complexity is a subject that is taught early on in Computer Science curricula. Not sure if early is the reason why many software developers tend to forget what it is and how to apply it.

Last week a group of developers were talking about the complexity of algorithms and Big O Notation came up. As a matter of fact when considering complexity there are:

Theta notation
Big O notation
Omega notation

Continue reading “Code Complexity”

Two Dimensional Array

A few days ago a group of software engineers were discussing how the order in which the numbers of rows versus columns in a two dimensional array affect performance. That is; if an array has more rows than columns as opposed to more columns than rows, the time it takes to traverse the array will be affected. Continue reading “Two Dimensional Array”

Parse Text

UPDATE – April 09, 2018When I started this post, I was thinking in several follow ups in order to try different approaches and be able to continue to improve on previous passes by adding code or starting from scratch when a new idea came up. I was interested in showing a normal progression that the reader would encounter when developing software. In the days that passed, I decided to limit the subject to a single post. Please let me know if you encounter an issue or would like for me to expand on this entry.

Sometime last week I attempted to solve a simple online challenge. The challenge dealt with parsing a string of text and then obtaining information from the string. There are probably thousands of variants to the challenge. I am not going to cover the exact challenge in this post. I will make my own. Will start simple and will get more complex each time we add a new obstacle. Continue reading “Parse Text”

Managing Binary Trees – Linux

This post deals with an API for binary trees in Linux. The API consists of the following functions:

Function Brief Description
tdelete() Deletes an item from a tree.
tdestroy() Deletes the entire tree.
tfind() Finds an item and if not found returns NULL.
tsearch() Search a binary tree for an item.
twalk() Performs a depth-first, left to right tree traversal.

In the past I have used and implemented, using different programming languages, several classes, methods and functions to deal with binary trees. Binary trees are used to keep data in sorted order. This allows for quicker search times. This particular implementation comes with the Linux operating system. You can read more about it by typing on a Linux console:  man twalk Continue reading “Managing Binary Trees – Linux”

Log Files

About a couple decades ago I came up with the idea for a storage server. This occurred while experimenting and working with a Hierarchical Storage Manager (HSM) product that worked on a Sun Solaris system and used tapes in an automated library to manage the contents of the files. At that time the idea of a Content Addressable Storage (CAS) was several years away. Continue reading “Log Files”

New, Duplicate or in Segment – Part 4

Ditto. This post covers the continuation of the original proposed (by me) challenge to HackerRank. Seems that I made the mistake to generate the post New, Duplicate or in Segment – Part 2 which describes this challenge. As you can verify I did not suggested a solution. Apparently that was enough to disqualify this entry. Next time I come up with a proposed challenge will not post a thing about it. Continue reading “New, Duplicate or in Segment – Part 4”