Regular Expressions

A regular expression is a sequence of characters that defines a search pattern. String searching algorithms typically use this pattern for “find” / “match” or “find and replace” operations on strings.

I recall when I was working for a Fortune 500 company and the department replaced its DEC VT100 displays with ones that provided direct support for the Emacs text editor. It has been a while and I am not able to recall the name of the manufacturer. At the time the development team switched to UNIX. Tools such as awk, Emacs, and sed became very popular due to their power and the ease with which they could be chained together (piped). At that time software developers had to learn the basics of regular expressions.

Today regular expressions are available on most software platforms (e.g., Linux, .NET) and in most programming languages (e.g., C++, C#, Python). I have a copy of Regular Expressions Pocket Reference by Tony Stubblebine in my office. Now and then I run into situations in which regular expressions allow me to get something done in a few statements instead of hundreds of lines of code.

The following text needs to be parsed in order to extract different components.


<address value = "HelloWorld" name = "Name1" prop = "DrinkCoke" /address>

I will cover more on this requirement in the next post in this blog.

Following is a screen capture of a console showing how to extract some of the components in the previous text:

main str ==><address value = "HelloWorld" name = "Name1" prop = "DrinkCoke" /address><==

main <<< start tag ==><address<==

main <<< end tag ==>/address><==

main <<< attribute ==>value = "HelloWorld"<==
main <<< attribute ==>name = "Name1"<==
main <<< attribute ==>prop = "DrinkCoke"<==

main <<< attribute name ==>value<==
main <<< attribute name ==>name<==
main <<< attribute name ==>prop<==

main <<< attribute value ==>"HelloWorld"<==
main <<< attribute value ==>"Name1"<==
main <<< attribute value ==>"DrinkCoke"<==

main <<< str ==>     trim this string     <==
main <<< str ==>trim this string<==

main <<< str ==>     trim this string     <==
main <<< str ==>trim this string     <==
main <<< str ==>trim this string<==

main <<< str ==>     trim this string     <==
main <<< str ==>trim this string<==

The first few lines extract most of the information of interest in the text. The last few lines illustrate how to trim spaces from the front and back of a text string. The string class in C++ does not appear to have a string.trim() method. As always, I like to start by experimenting with different approaches to get a task done. In one of the passes, the attribute names were preceded and followed by spaces, which motivated me to add the last part of the code. The first attempt uses a couple of string methods. The last two use regular expressions.

Following is the C++ code developed using the Microsoft Visual Studio Enterprise 2017 edition IDE:

#include <iostream>

#include <string>
#include <regex>

using namespace std;

/*
*/
int main() {

	// **** <tag# value = "HelloWorld" name = "Name1" prop = "DrinkCoke" /tag#> ****

	//string str = "<tag1 value = \"HelloWorld\" name = \"Name1\" prop = \"DrinkCoke\" /tag1>";
	string str = "<address value = \"HelloWorld\" name = \"Name1\" prop = \"DrinkCoke\" /address>";
	//string str = "</tag2>";
	cout << "main str ==>" << str << "<==" << endl << endl;

	// **** start tag ****

	//regex rgxStartTag("^<tag\\d+");
	regex rgxStartTag("^<([a-zA-Z0-9_]+)");
	if (regex_search(str, rgxStartTag)) {
		regex_iterator<string::iterator> current(str.begin(), str.end(), rgxStartTag);
		regex_iterator<string::iterator> end;
		while (current != end) {
			cout << "main <<< start tag ==>" << current->str() << "<==" << endl;
			current++;
		}
	}
	cout << endl;

	// **** end tag ****

	//regex rgxEndTag("/tag\\d+>$");
	regex rgxEndTag("/([a-zA-Z0-9_]+)>");
	if (regex_search(str, rgxEndTag)) {
		regex_iterator<string::iterator> current(str.begin(), str.end(), rgxEndTag);
		regex_iterator<string::iterator> end;
		while (current != end) {
			cout << "main <<< end tag ==>" << current->str() << "<==" << endl;
			current++;
		}
	}
	cout << endl;

	// **** attribute ****

	regex rgxAttribute("\\w+\\s=\\s\"\\w+\"");
	if (regex_search(str, rgxAttribute)) {
		regex_iterator<string::iterator> current(str.begin(), str.end(), rgxAttribute);
		regex_iterator<string::iterator> end;
		while (current != end) {
			cout << "main <<< attribute ==>" << current->str() << "<==" << endl;
			current++;
		}
	}
	cout << endl;

	// **** attribute name (look-around assertions) ****

	regex rgxAttributeName("\\w+(?=\\s=)");
	if (regex_search(str, rgxAttributeName)) {
		regex_iterator<string::iterator> current(str.begin(), str.end(), rgxAttributeName);
		regex_iterator<string::iterator> end;
		while (current != end) {
			cout << "main <<< attribute name ==>" << current->str() << "<==" << endl;
			current++;
		}
	}
	cout << endl;

	// **** attribute value ****

	regex rgxAttributeValue("\"([^\"]*)\"");
	if (regex_search(str, rgxAttributeValue)) {
		regex_iterator<string::iterator> current(str.begin(), str.end(), rgxAttributeValue);
		regex_iterator<string::iterator> end;
		while (current != end) {
			cout << "main <<< attribute value ==>" << current->str() << "<==" << endl;
			current++;
		}
	}
	cout << endl;

	// **** trim spaces using string methods ****

	str = "     trim this string     ";
	cout << "main <<< str ==>" << str << "<==" << endl;
	while (str.length() && str.at(0) == ' ') {
		str = str.substr(1);
	}
	while (str.length() && str.at(str.length() - 1) == ' ') {
		str = str.substr(0, str.length() - 1);
	}
	cout << "main <<< str ==>" << str << "<==" << endl << endl;

	// **** trim spaces using two regular expressions (leading, then trailing) ****

	str = "     trim this string     ";
	cout << "main <<< str ==>" << str << "<==" << endl;
	regex rgxLeadSpace("^ +");
	str = regex_replace(str, rgxLeadSpace, "");
	cout << "main <<< str ==>" << str << "<==" << endl;
	regex rgxTrailSpace(" +$");
	str = regex_replace(str, rgxTrailSpace, "");
	cout << "main <<< str ==>" << str << "<==" << endl << endl;

	// **** trim whitespace using a single regular expression ****

	str = "     trim this string     ";
	cout << "main <<< str ==>" << str << "<==" << endl;
	str = regex_replace(str, regex("^\\s+|\\s+$"), "");
	cout << "main <<< str ==>" << str << "<==" << endl;

	// **** ****

	return 0;
}

If you have any comments or suggestions please let me know. Regular expressions are extremely powerful and there are many ways of achieving similar results.

Enjoy;

John

Follow me on Twitter:  @john_canessa
