Hope your day is going well. As you might know, I enjoy spending Saturday and Sunday mornings reading and experimenting. To me that is the best way to learn. I spend about eight hours per weekend learning.
Earlier this year I purchased the book Getting Started with Natural Language Processing by Ekaterina Kochmar, published by Manning. A few years back I took an online course on machine learning that touched on some of the topics covered in this book.
On a separate note, I will use the VSCode IDE and the GitHub Copilot extension to generate code for this post. At this point I need to disclose that I am a Microsoft employee. That said, I have been using VSCode for several years; what is new is that I recently installed GitHub Copilot. To learn more, take a few minutes to read the Getting started with GitHub Copilot article.
There might be better ways to work with Python and IDEs. One popular approach is to use a Jupyter Notebook. The main advantage of a notebook is that the code and the data travel together in a single package.
In this post I will develop the Python code using the VSCode IDE and will run it in the Anaconda environment. I have been using Anaconda for a few years.
When I start a new Python project I like to update Anaconda, as shown here:
# **** If you want a stable set of packages that have been tested for interoperability **** (base) C:\Users\johnc>conda update conda Retrieving notices: ...working... done Collecting package metadata (current_repodata.json): done Solving environment: / The environment is inconsistent, please check the package plan carefully The following packages are causing the inconsistency: - defaults/win-64::anaconda==custom=py38_1 - defaults/win-64::anaconda-navigator==2.0.4=py38_0 - defaults/win-64::astropy==4.3.1=py38hc7d831d_0 - defaults/win-64::bokeh==2.3.3=py38haa95532_0 - defaults/noarch::dask==2021.8.1=pyhd3eb1b0_0 - defaults/win-64::imagecodecs==2021.3.31=py38h5da4933_0 - defaults/noarch::imageio==2.9.0=pyhd3eb1b0_0 - conda-forge/noarch::ipympl==0.7.0=pyhd8ed1ab_0 - defaults/win-64::lcms2==2.12=h83e58a3_0 - defaults/win-64::libtiff==4.2.0=hd0e1b90_0 - defaults/win-64::matplotlib==3.4.2=py38haa95532_0 - defaults/win-64::matplotlib-base==3.4.2=py38h49ac443_0 - defaults/win-64::openjpeg==2.4.0=h4fc8c34_0 - defaults/win-64::pillow==8.3.1=py38h4fa10fc_0 - defaults/win-64::scikit-image==0.18.1=py38hf11a4ad_0 - defaults/noarch::seaborn==0.11.2=pyhd3eb1b0_0 - defaults/noarch::tifffile==2021.4.8=pyhd3eb1b0_2 - defaults/win-64::_anaconda_depends==2020.07=py38_0 done ## Package Plan ## environment location: C:\Users\johnc\anaconda3 added / updated specs: - conda The following packages will be downloaded: package | build ---------------------------|----------------- anaconda-client-1.11.1 | py38haa95532_0 154 KB astroid-2.14.2 | py38haa95532_0 395 KB boltons-23.0.0 | py38haa95532_0 421 KB comtypes-1.1.14 | py38haa95532_0 271 KB conda-23.3.1 | py38haa95532_0 972 KB conda-repo-cli-1.0.41 | py38haa95532_0 142 KB cryptography-39.0.1 | py38h21b164f_0 1.0 MB curl-7.88.1 | h2bbff1b_0 147 KB cython-0.29.33 | py38hd77b12b_0 1.9 MB flit-core-3.8.0 | py38haa95532_0 85 KB fsspec-2023.3.0 | py38haa95532_0 234 KB future-0.18.3 | py38haa95532_0 704 KB giflib-5.2.1 | h8cc25b3_3 88 KB importlib-metadata-6.0.0 | py38haa95532_0 39 KB importlib_metadata-6.0.0 | hd3eb1b0_0 8 KB ipython-8.10.0 | py38haa95532_0 1.1 MB jaraco.classes-3.2.1 | pyhd3eb1b0_0 9 KB jpeg-9e | h2bbff1b_1 320 KB jsonpatch-1.32 | pyhd3eb1b0_0 15 KB jsonpointer-2.1 | pyhd3eb1b0_0 9 KB jsonschema-4.17.3 | py38haa95532_0 155 KB jupyter_client-8.1.0 | py38haa95532_0 196 KB jupyter_console-6.6.3 | py38haa95532_0 63 KB jupyter_core-5.3.0 | py38haa95532_0 107 KB jupyterlab_server-2.21.0 | py38haa95532_0 82 KB jupyterlab_widgets-3.0.5 | py38haa95532_0 179 KB jxrlib-1.1 | he774522_2 337 KB keyring-23.13.1 | py38haa95532_0 83 KB libarchive-3.6.2 | h2033e3e_1 1.8 MB libcurl-7.88.1 | h86230a5_0 328 KB libdeflate-1.17 | h2bbff1b_0 151 KB libpng-1.6.39 | h8cc25b3_0 369 KB libwebp-base-1.2.4 | h2bbff1b_1 304 KB libxml2-2.10.3 | h0ad7f3c_0 2.9 MB libxslt-1.1.37 | h2bbff1b_0 448 KB lxml-4.9.2 | py38h2bbff1b_0 1.1 MB nbclassic-0.5.3 | py38haa95532_0 6.0 MB networkx-2.8.4 | py38haa95532_1 2.6 MB notebook-6.5.3 | py38haa95532_0 555 KB openssl-1.1.1t | h2bbff1b_0 5.5 MB packaging-23.0 | py38haa95532_0 69 KB pandas-1.5.3 | py38hf11a4ad_0 10.5 MB pandoc-2.12 | haa95532_3 14.6 MB pip-23.0.1 | py38haa95532_0 2.7 MB pkginfo-1.9.6 | py38haa95532_0 69 KB pycurl-7.45.2 | py38hcd4344a_0 132 KB pylint-2.16.2 | py38haa95532_0 756 KB pyopenssl-23.0.0 | py38haa95532_0 97 KB pytoolconfig-1.2.5 | py38haa95532_1 32 KB pywinpty-2.0.10 | py38h5da7b33_0 229 KB requests-2.28.1 | py38haa95532_1 98 KB rope-1.7.0 | py38haa95532_0 438 KB scikit-learn-1.2.2 | py38hd77b12b_0 6.5 MB scipy-1.10.0 | 
py38h321e85e_1 18.7 MB sqlite-3.41.1 | h2bbff1b_0 897 KB statsmodels-0.13.5 | py38h080aedc_1 9.7 MB tbb-2021.8.0 | h59b6b97_0 149 KB tqdm-4.65.0 | py38hd4e2768_0 149 KB urllib3-1.26.15 | py38haa95532_0 194 KB werkzeug-2.2.3 | py38haa95532_0 341 KB wheel-0.38.4 | py38haa95532_0 83 KB xlwings-0.29.1 | py38haa95532_0 1.2 MB zstandard-0.19.0 | py38h2bbff1b_0 340 KB zstd-1.5.4 | hd43e919_0 683 KB ------------------------------------------------------------ Total: 99.8 MB The following NEW packages will be INSTALLED: boltons pkgs/main/win-64::boltons-23.0.0-py38haa95532_0 jaraco.classes pkgs/main/noarch::jaraco.classes-3.2.1-pyhd3eb1b0_0 jsonpatch pkgs/main/noarch::jsonpatch-1.32-pyhd3eb1b0_0 jsonpointer pkgs/main/noarch::jsonpointer-2.1-pyhd3eb1b0_0 jxrlib pkgs/main/win-64::jxrlib-1.1-he774522_2 libwebp-base pkgs/main/win-64::libwebp-base-1.2.4-h2bbff1b_1 pytoolconfig pkgs/main/win-64::pytoolconfig-1.2.5-py38haa95532_1 The following packages will be UPDATED: anaconda-client 1.11.0-py38haa95532_0 --> 1.11.1-py38haa95532_0 astroid 2.11.7-py38haa95532_0 --> 2.14.2-py38haa95532_0 comtypes 1.1.10-py38haa95532_1002 --> 1.1.14-py38haa95532_0 conda 23.1.0-py38haa95532_0 --> 23.3.1-py38haa95532_0 conda-repo-cli 1.0.27-py38haa95532_0 --> 1.0.41-py38haa95532_0 cryptography 38.0.4-py38h21b164f_0 --> 39.0.1-py38h21b164f_0 curl 7.87.0-h2bbff1b_0 --> 7.88.1-h2bbff1b_0 cython 0.29.32-py38hd77b12b_0 --> 0.29.33-py38hd77b12b_0 flit-core pkgs/main/noarch::flit-core-3.6.0-pyh~ --> pkgs/main/win-64::flit-core-3.8.0-py38haa95532_0 fsspec 2022.11.0-py38haa95532_0 --> 2023.3.0-py38haa95532_0 future 0.18.2-py38_1 --> 0.18.3-py38haa95532_0 giflib 5.2.1-h8cc25b3_1 --> 5.2.1-h8cc25b3_3 importlib-metadata 4.11.3-py38haa95532_0 --> 6.0.0-py38haa95532_0 importlib_metadata 4.11.3-hd3eb1b0_0 --> 6.0.0-hd3eb1b0_0 ipython 8.8.0-py38haa95532_0 --> 8.10.0-py38haa95532_0 jpeg 9e-h2bbff1b_0 --> 9e-h2bbff1b_1 jsonschema 4.16.0-py38haa95532_0 --> 4.17.3-py38haa95532_0 jupyter_client 7.4.8-py38haa95532_0 --> 8.1.0-py38haa95532_0 jupyter_console 6.4.4-py38haa95532_0 --> 6.6.3-py38haa95532_0 jupyter_core 5.1.1-py38haa95532_0 --> 5.3.0-py38haa95532_0 jupyterlab_server 2.16.5-py38haa95532_0 --> 2.21.0-py38haa95532_0 jupyterlab_widgets pkgs/main/noarch::jupyterlab_widgets-~ --> pkgs/main/win-64::jupyterlab_widgets-3.0.5-py38haa95532_0 keyring 23.4.0-py38haa95532_0 --> 23.13.1-py38haa95532_0 libarchive 3.6.2-hebabd0d_0 --> 3.6.2-h2033e3e_1 libcurl 7.87.0-h86230a5_0 --> 7.88.1-h86230a5_0 libdeflate 1.8-h2bbff1b_5 --> 1.17-h2bbff1b_0 libpng 1.6.37-h2a8f88b_0 --> 1.6.39-h8cc25b3_0 libxml2 2.9.14-h0ad7f3c_0 --> 2.10.3-h0ad7f3c_0 libxslt 1.1.35-h2bbff1b_0 --> 1.1.37-h2bbff1b_0 lxml 4.9.1-py38h1985fb9_0 --> 4.9.2-py38h2bbff1b_0 nbclassic 0.4.8-py38haa95532_0 --> 0.5.3-py38haa95532_0 networkx 2.8.4-py38haa95532_0 --> 2.8.4-py38haa95532_1 notebook 6.5.2-py38haa95532_0 --> 6.5.3-py38haa95532_0 openssl 1.1.1s-h2bbff1b_0 --> 1.1.1t-h2bbff1b_0 packaging 22.0-py38haa95532_0 --> 23.0-py38haa95532_0 pandas 1.5.2-py38hf11a4ad_0 --> 1.5.3-py38hf11a4ad_0 pandoc 2.12-haa95532_1 --> 2.12-haa95532_3 pip 22.3.1-py38haa95532_0 --> 23.0.1-py38haa95532_0 pkginfo 1.8.3-py38haa95532_0 --> 1.9.6-py38haa95532_0 pycurl 7.45.1-py38hcd4344a_0 --> 7.45.2-py38hcd4344a_0 pylint 2.14.5-py38haa95532_0 --> 2.16.2-py38haa95532_0 pyopenssl pkgs/main/noarch::pyopenssl-22.0.0-py~ --> pkgs/main/win-64::pyopenssl-23.0.0-py38haa95532_0 pywinpty 2.0.2-py38h5da7b33_0 --> 2.0.10-py38h5da7b33_0 requests 2.28.1-py38haa95532_0 --> 2.28.1-py38haa95532_1 rope 
pkgs/main/noarch::rope-0.22.0-pyhd3eb~ --> pkgs/main/win-64::rope-1.7.0-py38haa95532_0 scikit-learn 1.2.0-py38hd77b12b_0 --> 1.2.2-py38hd77b12b_0 scipy 1.10.0-py38h321e85e_0 --> 1.10.0-py38h321e85e_1 sqlite 3.40.1-h2bbff1b_0 --> 3.41.1-h2bbff1b_0 statsmodels 0.13.5-py38h080aedc_0 --> 0.13.5-py38h080aedc_1 tbb 2021.6.0-h59b6b97_1 --> 2021.8.0-h59b6b97_0 tqdm 4.64.1-py38haa95532_0 --> 4.65.0-py38hd4e2768_0 urllib3 1.26.14-py38haa95532_0 --> 1.26.15-py38haa95532_0 werkzeug 2.2.2-py38haa95532_0 --> 2.2.3-py38haa95532_0 wheel pkgs/main/noarch::wheel-0.37.1-pyhd3e~ --> pkgs/main/win-64::wheel-0.38.4-py38haa95532_0 xlwings 0.27.15-py38haa95532_0 --> 0.29.1-py38haa95532_0 zstandard 0.18.0-py38h2bbff1b_0 --> 0.19.0-py38h2bbff1b_0 zstd 1.5.2-h19a0ad4_0 --> 1.5.4-hd43e919_0 Proceed ([y]/n)? y <=== proceed Downloading and Extracting Packages Preparing transaction: done Verifying transaction: done Executing transaction: - Windows 64-bit packages of scikit-learn can be accelerated using scikit-learn-intelex. More details are available here: https://intel.github.io/scikit-learn-intelex For example: $ conda install scikit-learn-intelex $ python -m sklearnex my_application.py done (base) C:\Users\johnc>
Once all has been updated, let’s warm up with some simple code.
# **** using the Anaconda command prompt ****
(base) C:\Users\johnc>python
Python 3.8.16 (default, Jan 17 2023, 22:25:28) [MSC v.1916 64 bit (AMD64)] :: Anaconda, Inc. on win32
Type "help", "copyright", "credits" or "license" for more information.

# **** ****
>>> print("Hello world");
Hello world

# **** ****
>>> print("My name is Bond, James Bond");
My name is Bond, James Bond

# **** ****
>>> print("Hello world"); print("My name is Bond, James Bond");
Hello world
My name is Bond, James Bond

# **** ****
>>> print("Hello World");\
... print("My name is Bond, James Bond");
Hello World
My name is Bond, James Bond

# **** run the Python code in the ex1.py file ****
(base) C:\Users\johnc>python c:\temp\ex1.py
first line
second line
The first few lines are just a warm-up to switch my mindset from a compiled language to a scripting language. It always takes me a few hours to get back up to speed after not using Python for a couple of months.
In the last command we invoke, from the Anaconda prompt, the Python interpreter on the specified file. The script prints two lines.
Now that we have verified that we can execute a Python file, we can use the VSCode IDE to write some code and execute it periodically.
# **** run the ex1.py file ****
(base) C:\Users\johnc>python c:\temp\ex1.py
first line
second line

# **** edit ex1.py with VSCode ****
(base) C:\Users\johnc>code c:\temp\ex1.py

# **** run the updated file ****
(base) C:\Users\johnc>python c:\temp\ex1.py
first line
second line
third and last line

# **** after a second update ... (note the line terminators) ****
(base) C:\Users\johnc>python c:\temp\ex1.py
first line.
second line.
third and last line!

# **** close VSCode ****

# **** type the contents of the ex1.py file ****
(base) C:\Users\johnc>type c:\temp\ex1.py
print("first line.");
print("second line.");
print("third and last line!");
We start by verifying that we can still access the ex1.py file by executing the Python script.
We then invoke the VSCode IDE and open the c:\temp\ex1.py file. Once the IDE opens, we add a third line to the script. We save the update and execute the code. We can now see three lines.
We go back to the IDE and update the three lines by adding line terminators. We save our changes and run the script one more time. The updated lines are displayed.
We then close the VSCode IDE and we are done with this file.
# **** start the Python interpreter from the Anaconda prompt ****
(base) C:\Users\johnc>python
Python 3.8.16 (default, Jan 17 2023, 22:25:28) [MSC v.1916 64 bit (AMD64)] :: Anaconda, Inc. on win32
Type "help", "copyright", "credits" or "license" for more information.

# **** define the text string ****
>>> text = "Define which data represents each class for the machine learning algorithm"

# **** print the contents of text ****
>>> print(text)
Define which data represents each class for the machine learning algorithm

# **** split the words in text ****
>>> text.split(" ")
['Define', 'which', 'data', 'represents', 'each', 'class', 'for', 'the', 'machine', 'learning', 'algorithm']
>>>
From an Anaconda prompt we start the Python interpreter so we can execute Python statements interactively. This is a great way to experiment and get individual commands to work just the way we want.
We define a variable named text and assign some text to it. To verify that all is well, we print the contents of text. All seems to be well so far.
In the last line we split the text using a space as a delimiter, and the words are displayed. One of the goals of the next post in this blog will be to develop a spam filter, and one of the tasks there is to split text into words. In this case we just split simple words and did not consider that the first word 'Define' is capitalized. In order not to differentiate 'Define' from 'define', we should convert all words to lowercase before continuing to process the text. We will find out more about this shortly.
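As a quick illustration of that idea, here is a small sketch of my own (it is not code from the book) that lowercases the text before splitting it, so 'Define' and 'define' end up as the same word:

# **** same sentence used in the interpreter session above ****
text = "Define which data represents each class for the machine learning algorithm"

# **** splitting as-is keeps 'Define' capitalized ****
print(text.split(" "))

# **** lowercasing first makes the words case-insensitive ****
print(text.lower().split(" "))

The next script takes this a step further and also removes some punctuation from the text.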
# **** input text ****
text = 'Define which data represents "ham" class and which data represents "spam" class for the machine learning algorithm.'

# **** split includes words: '"ham"', '"spam"', and 'algorithm.' ****
print('text:', text.split())

# **** define list of delimiters ****
delimiters = [' ', '.', '"']

# **** print delimiters ****
print('delimiters: ', delimiters)

# **** define variable words to keep list of processed words ****
words = []

# **** define variable word to keep current word ****
word = ''

# **** loop through each character in text ****
for c in text:

    # **** if character is in delimiters list ****
    if c in delimiters:

        # **** add current word in lowercase to list of words (if not blank) ****
        if word != '':
            words.append(word.lower())

            # **** print last word in words list ****
            print('word:', words[-1])

        # **** reset current word ****
        word = ''

    # **** if character is not in delimiters list ****
    else:

        # **** add character to current word ****
        word += c

# **** print list of words ****
print('words:', words)
This script starts by defining some text. Note that the text includes words like Define, “ham”, “spam” and algorithm.
We specify a set of delimiters in an attempt to eliminate punctuation marks from the text.
A list named words is defined to collect the processed words, and a variable named word holds the current word as our algorithm traverses the input text.
The loop traverses the text one character at a time. If the current character is a delimiter and the current word is not blank, we append the word to words; otherwise if the character is not a delimiter we add the character to the word.
After all is said and done, we display the list of words.
# **** run python script ****
(base) C:\Users\johnc>python c:\temp\split2.py
text: ['Define', 'which', 'data', 'represents', '"ham"', 'class', 'and', 'which', 'data', 'represents', '"spam"', 'class', 'for', 'the', 'machine', 'learning', 'algorithm.']
delimiters: [' ', '.', '"']
word: define
word: which
word: data
word: represents
word: ham
word: class
word: and
word: which
word: data
word: represents
word: spam
word: class
word: for
word: the
word: machine
word: learning
word: algorithm
words: ['define', 'which', 'data', 'represents', 'ham', 'class', 'and', 'which', 'data', 'represents', 'spam', 'class', 'for', 'the', 'machine', 'learning', 'algorithm']

(base) C:\Users\johnc>
The run of the split2.py script is shown. Given that the input text is quite simple, the list of words looks acceptable at this time.
Let’s now take a look at the output of script c:\temp\split3.py which follows:
The first three lines indicate that the NLTK punkt tokenizer data is being downloaded or, as in this run, that it is already up to date.
Once that is done, the script prints the result of a simple split() call on the text, followed by the list of delimiters and then the individual words extracted by the character-by-character loop.
The complete list of lowercase words is then displayed.
The list of words is then cleared and shown to be empty. The text is split one more time, and the results are displayed. Note that the number of words produced in the first pass differs from the number of tokens in the second pass. This seems to indicate that there are many ways to split text, and some are better than others; see the short sketch after the following listing.
# **** run python script ****
(base) C:\Users\johnc>python c:\temp\split3.py
[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\johnc\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
text: ['Define', 'which', 'data', 'represents', '"ham"', 'class', 'and', 'which', 'data', 'represents', '"spam"', 'class', 'for', 'the', 'machine', 'learning', 'algorithm.']
delimiters: [' ', '.', '"']
word: define
word: which
word: data
word: represents
word: ham
word: class
word: and
word: which
word: data
word: represents
word: spam
word: class
word: for
word: the
word: machine
word: learning
word: algorithm
words: ['define', 'which', 'data', 'represents', 'ham', 'class', 'and', 'which', 'data', 'represents', 'spam', 'class', 'for', 'the', 'machine', 'learning', 'algorithm']
words: []
words: ['Define', 'which', 'data', 'represents', '``', 'ham', "''", 'class', 'and', 'which', 'data', 'represents', '``', 'spam', "''", 'class', 'for', 'the', 'machine', 'learning', 'algorithm', '.']

(base) C:\Users\johnc>
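To make that point concrete, here is a small sketch of my own (it is not part of split2.py or split3.py) that splits the same sentence in three different ways; the regular expression calls are just one of several possible alternatives:

import re   # regular expressions from the Python standard library

# **** same input text used by split2.py and split3.py ****
text = 'Define which data represents "ham" class and which data represents "spam" class for the machine learning algorithm.'

# **** 1. plain split() keeps the quotes and the trailing period attached to the words ****
print('split():', text.split())

# **** 2. keeping only runs of word characters drops the punctuation entirely ****
print('regex:  ', re.findall(r"[A-Za-z0-9]+", text.lower()))

# **** 3. keeping punctuation as separate tokens (a rough approximation of a tokenizer) ****
print('tokens: ', re.findall(r"[A-Za-z0-9]+|[.\"]", text))

Each approach produces a different list, which is exactly the behavior we observe between the two passes in the split3.py output above.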
In the next post in this blog we will get into the steps needed to create a working spam filter.
import nltk                        # Import the Natural Language Toolkit
from nltk import word_tokenize     # Import the word tokenizer

nltk.download('punkt')             # Download the Punkt tokenizer

# **** define get_words function ****
def get_words(text):

    # **** split text into words ****
    words = word_tokenize(text)

    # **** return list of words ****
    return words

# **** input text ****
text = 'Define which data represents "ham" class and which data represents "spam" class for the machine learning algorithm.'

# **** split includes words: '"ham"', '"spam"', and 'algorithm.' ****
print('text:', text.split())

# **** define list of delimiters ****
delimiters = [' ', '.', '"']

# **** print delimiters ****
print('delimiters: ', delimiters)

# **** define variable words to keep list of processed words ****
words = []

# **** define variable word to keep current word ****
word = ''

# **** loop through each character in text ****
for c in text:

    # **** if character is in delimiters list ****
    if c in delimiters:

        # **** add current word in lowercase to list of words (if not blank) ****
        if word != '':
            words.append(word.lower())

            # **** print last word in words list ****
            print('word:', words[-1])

        # **** reset current word ****
        word = ''

    # **** if character is not in delimiters list ****
    else:

        # **** add character to current word ****
        word += c

# **** print list of words ****
print('words:', words)

# **** clear list of words ****
words.clear()

# **** verify list of words has been cleared ****
print('words:', words)

# **** tokenize text ****
words = get_words(text)

# **** display list of tokenized words ****
print('words:', words)
The split3.py script that generated the output shown above starts by importing the Natural Language Toolkit (NLTK) software that we will use to experiment in this piece of code.
We then declare a get_words() function, which uses the word_tokenize() function to split the text into words.
We then assign some input text.
In the first split attempt we just use the split() method and display the results. Note that the results contain the words 'Define', '"ham"', '"spam"', and 'algorithm.'.
We then define a list of delimiters to split the text. The delimiters are then printed.
We define a list to keep the words and a variable to hold the current word, which is built up as we parse each character of the text.
The code that follows is the same we previously used to extract the words from our text variable. The list of words is then printed. Note that our code now produces the words: ‘define’, ‘ham’, ‘spam’, and ‘algorithm’.
The list of words is cleared and displayed.
We now use the get_words() function, which relies on word_tokenize() from the NLTK toolkit. The results are then displayed.
The resulting list is similar to, yet different from, the previous one. This comparison was done to help us understand the need for tokenization as a step in splitting the input text for our spam-filtering task.
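As a possible next step (this is my own sketch, not code from the book, and it assumes that lowercase, punctuation-free tokens are what the spam filter will eventually want), we could lowercase the NLTK tokens and drop the ones that are only punctuation:

import string                     # provides the list of punctuation characters
from nltk import word_tokenize    # NLTK tokenizer (requires the punkt data downloaded earlier)

# **** same input text used earlier in this post ****
text = 'Define which data represents "ham" class and which data represents "spam" class for the machine learning algorithm.'

# **** tokenize, lowercase, and drop tokens that are only punctuation (including the `` and '' quote tokens) ****
tokens = [t.lower() for t in word_tokenize(text)
          if t not in string.punctuation and t not in ('``', "''")]

# **** display the cleaned-up list of tokens ****
print(tokens)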
I have to say that it was quite interesting writing comments in our code and having GitHub Copilot process them to generate code. By editing the comments, one is able to guide the underlying software toward the desired output. I believe this will be an important step when using this tool.
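For example, during this experiment I would type a comment like the first line below and let Copilot propose a body for it. The snippet is only an illustration of the kind of suggestion it produces (the exact code it generates may differ), and it reuses the words list from the last script:

# **** count how many times each word appears in the words list ****
counts = {}
for w in words:
    counts[w] = counts.get(w, 0) + 1
print('counts:', counts)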
Hope you enjoyed the post. I will put the associated code from this post in my SplitText GitHub repository.
Enjoy,
John