BeautifulSoup

The past weekend was kind of cold in the Twin Cities area of Minneapolis and St. Paul. It seems like winter is somewhat ahead of schedule.

In the past few months I have been spending time learning and experimenting with machine learning (ML) and Big Data. Machine learning requires a lot of properly cleaned samples; this is one more case where garbage in implies garbage out. That said, the first step is to collect data. Data can come from different sources, e.g., databases, files, public repositories, the Internet, etc. In general one can collect data from the Internet using two main approaches: web scraping and an API. I will cover both of these approaches in the following posts.

In this post I will start from scratch, attempting to collect some data from my website using a technique called web scraping. You can scrape web sites by writing all the code yourself, or you can use a library dedicated to that purpose. One of the best libraries is Beautiful Soup, a Python library designed to extract the data one seeks from HTML documents.
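
Beautiful Soup, the lxml parser it can use, and the requests library are the only third-party packages used in this post. If they are not already part of your Python installation, something along the following lines should bring them in (use whichever of pip or conda matches your setup):

$ pip install beautifulsoup4 lxml requests

## OR, with Anaconda ##

$ conda install beautifulsoup4 lxml requests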

In order to learn a tool I like to watch one or more videos, try one or more tutorials, and then go on my own. Attempting to explain in writing what I am doing seems to help me understand the new concepts and come up with the best possible approach.

I ran into Python Tutorial: Web Scraping with BeautifulSoup and Requests by Corey Schafer. After completing the tutorial, I moved on to scraping my web site. Note that I have not included scraping my website in this post. I hope you enjoy and learn something from this post.

I started by looking at different videos on the subject. I started some and hit some issues. Depending on the age of a video and the version of the libraries in use, if things do not work as expected I tend to spend a few minutes on it and then move on to a new video or tutorial. In this case I ended up with the tutorial by Corey. I started using my main Windows 10 computer and ran into an issue attempting to print the HTML contents of a couple web sites, so I decided to move to one of my Linux machines. The reason was that everything seemed to work well in the tutorial, and Corey was using a Macintosh computer. macOS is derived from UNIX, not Linux, but Linux resembles UNIX and behaves in many ways very similarly. It seems to me that Linux is a much better fit than Windows when working with Python.

I already had Python and Anaconda installed on my CentOS 7 system. So far I have been using gvim, Python and IPython. I decided that it was time to install a better Python editor and settled on Sublime Text. In addition, when scraping a web site or just accessing data from other sources, a commonly used format is CSV. To view CSV files on Linux, and very often on Windows, I use LibreOffice.

Following are my notes on downloading and installing LibreOffice on my Linux server:

$ wget http://download.documentfoundation.org/libreoffice/stable/6.1.2/rpm/x86_64/LibreOffice_6.1.2_Linux_x86-64_rpm.tar.gz

$ su -
## OR ##
$ sudo -i

# tar -xvf LibreOffice_6.1.2*
# cd LibreOffice_6.1.2*
# yum localinstall RPMS/*.rpm

The package, which includes Calc for spreadsheets, was installed and made accessible from:

Applications -> Programming -> LibreOffice 6.1 Calc

Following are my notes on installing Sublime Text:

$ sudo rpm -v --import https://download.sublimetext.com/sublimehq-rpm-pub.gpg
$ sudo yum-config-manager --add-repo https://download.sublimetext.com/rpm/stable/x86_64/sublime-text.repo
$ sudo yum install sublime-text

The editor was installed and made accessible from:

Applications -> Programming -> Sublime Text

For starters I followed the tutorial and all worked well. I am not sure if I will cover scraping my own web site at this point in time. The concepts are similar because I also use WordPress for my blog.

I copied the Python code back to my Windows machine because that is where I have everything set up for editing and posting to WordPress. I have considered moving it all to Linux, but so far it has not been an issue. Now that I am concentrating on Python and Java, I will probably move it all to my CentOS machine.

In an attempt to avoid issues, I updated conda and then all installed packages using an administrator console on my Windows 10 machine:

C:\WINDOWS\system32> conda update conda
Fetching package metadata ...........
Solving package specifications: .

Package plan for installation in environment C:\Program Files\Anaconda3:

The following NEW packages will be INSTALLED:

    vc:    14-0

The following packages will be UPDATED:

    conda: 4.3.6-py35_0 --> 4.3.30-py35hec795fb_0

Proceed ([y]/n)? y

vc-14-0.tar.bz 100% |###############################| Time: 0:00:00  70.82 kB/s
conda-4.3.30-p 100% |###############################| Time: 0:00:00   3.30 MB/s

C:\WINDOWS\system32>

C:\WINDOWS\system32> conda update --all
Solving environment: done

## Package Plan ##

  environment location: C:\Program Files\Anaconda3


The following packages will be downloaded:

    package                    |            build
    ---------------------------|-----------------
    matplotlib-3.0.0           |   py35hd159220_0         6.8 MB
    pillow-5.2.0               |   py35h08bbbbd_0         653 KB
    pandoc-2.2.3.2             |                0        21.0 MB
    freetype-2.9.1             |       ha9979f8_1         470 KB
    nbconvert-5.3.1            |           py35_0         424 KB
    ------------------------------------------------------------
                                           Total:        29.3 MB

The following packages will be UPDATED:

    freetype:   2.8-h51f8f2c_1       --> 2.9.1-ha9979f8_1
    matplotlib: 2.2.2-py35h153e9ff_1 --> 3.0.0-py35hd159220_0
    pandoc:     1.19.2.1-hb2460c7_1  --> 2.2.3.2-0
    pillow:     5.1.0-py35h0738816_0 --> 5.2.0-py35h08bbbbd_0

The following packages will be DOWNGRADED:

    nbconvert:  5.4.0-py35_1         --> 5.3.1-py35_0

Proceed ([y]/n)? y


Downloading and Extracting Packages
matplotlib-3.0.0     | 6.8 MB    | ########################################################################## | 100%
pillow-5.2.0         | 653 KB    | ########################################################################## | 100%
pandoc-2.2.3.2       | 21.0 MB   | ########################################################################## | 100%
freetype-2.9.1       | 470 KB    | ########################################################################## | 100%
nbconvert-5.3.1      | 424 KB    | ########################################################################## | 100%
Preparing transaction: done
Verifying transaction: done
Executing transaction: done

C:\WINDOWS\system32>

Recently I read a post on KDnuggets regarding Python editors, "What is the Best Python IDE for Data Science" by Saurabh Hooda. The post touched on a set of IDEs that work well with Python. By the way, Visual Studio has a Python plugin which I have been using for over a year; it seems to work well, yet it was not included in the post. As you can tell from this blog, I was also experimenting with Sublime Text, which was not included in the list either. I guess there are many IDEs for Python given how popular the language has become in the past few years.

For this post I watched the video by Corey Schafer a couple of times. It is very well done. The video provides a quick introduction to BeautifulSoup and gives advice on the incremental steps to get from an idea to actual code that extracts what the user wants/needs and writes it to a CSV file. CSV files are common when collecting/exchanging data. They can be viewed on all platforms with tools like Microsoft Excel. At the beginning of this post I included my notes on installing LibreOffice Calc, which I installed in a virtual machine that I run in Oracle VM VirtualBox.
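
As a quick illustration of the CSV format, the following sketch (the file name and field values are made up for this example) writes a header row and one data row using Python's csv module; the resulting file opens directly in Excel or in LibreOffice Calc:

# **** write a small sample CSV file ****
import csv

with open('sample.csv', 'w', newline='') as csv_file:
    csv_writer = csv.writer(csv_file)
    csv_writer.writerow(['headline', 'summary'])
    csv_writer.writerow(['Equal Stacks - Headline',
                         'This is a summary of Equal Stacks'])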

I went back and forth between Linux and Windows for the code in this post. I ended up moving it to Windows and using Spyder. For some reason, Spyder, the IDE distributed with Anaconda, does not seem to start from a standard command prompt on Windows; I had to invoke it from an administrator command prompt.

From a plain Command Prompt:

C:\Documents\_John_CANESSA\Technical Development\BeautifulSoup>spyder
Traceback (most recent call last):
  File "C:\Program Files\Anaconda3\Scripts\spyder-script.py", line 6, in <module>
    from spyder.app.start import main
  File "C:\Program Files\Anaconda3\lib\site-packages\spyder\app\start.py", line 37, in <module>
    from spyder.config.base import (get_conf_path, running_in_mac_app,
  File "C:\Program Files\Anaconda3\lib\site-packages\spyder\config\base.py", line 261, in <module>
    LANG_FILE = get_conf_path('langconfig')
  File "C:\Program Files\Anaconda3\lib\site-packages\spyder\config\base.py", line 173, in get_conf_path
    os.mkdir(conf_dir)
PermissionError: [WinError 5] Access is denied: 'C:\\WINDOWS\\system32\\config\\systemprofile\\.spyder-py3'

From an Administrator Command Prompt:

C:\Documents\_John_CANESSA\Technical Development\BeautifulSoup>spyder

Go figure.

The goal of the exercise is to scrape the headlines, summaries and YouTube links for the videos listed on Corey's web site. I will show you what I did, which happens to be quite similar to what Corey explained.

If you watch the video, there are two separate exercises. The first provides an introduction on how to use BeautifulSoup with a local HTML file. The idea is that the requests Python library will later allow us to get an HTML document from the web, which we will then parse. Before we add that step, we can just parse a local HTML file.
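
For reference, once we move to the second exercise the only change on this front is that, instead of opening a local file, we fetch the page over HTTP with requests and hand the returned text to BeautifulSoup, along these lines:

# **** fetch the page and pass the HTML text to BeautifulSoup ****
from bs4 import BeautifulSoup
import requests

source = requests.get('http://coreyms.com').text
soup = BeautifulSoup(source, 'lxml')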

The contents of the HTML file follow:

<!doctype html>
<html class="no-js" lang="">

    <head>
        <title>Sample Test Website</title>
        <meta charset="utf-8">
        <link rel="stylesheet" href="css/normalize.css">
        <link rel="stylesheet" href="css/main.css">
    </head>

    <body>

        <h1 id='site_title'>John's Test Website</h1>

        <hr>

        <!-- first article -->
        <div class="article">
            <h2><a href="http://www.johncanessa.com/2018/10/30/equal-stacks/">Equal Stacks - Headline</a></h2>
            <p>This is a summary of Equal Stacks</p>
        </div>

        <hr>

        <!-- second article -->
        <div class="article">
            <h2><a href="http://www.johncanessa.com/2018/10/29/transform-strings/">Transform Strings - Headline</a></h2>
            <p>This is a summary of Transform Strings</p>
        </div>

        <hr>

        <!-- footer -->
        <div class='footer'>
            <p>Footer Information</p>
        </div>

        <script src="js/vendor/modernizr-3.5.0.min.js"></script>
        <script src="js/plugins.js"></script>
        <script src="js/main.js"></script>
    </body>

</html>

The two entries in the page are from my blog.

The following Python code can be used to parse the HTML file:

# **** imports ****
from bs4 import BeautifulSoup
import requests
import csv

# **** indicate we are starting ****
print("\r\nentering vvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvv")

# **** open local HTML file and pass it to BeautifulSoup ****
with open('simple.html') as html_file:
    soup = BeautifulSoup(html_file, 'lxml')

# **** print the prettified contents of the HTML file ****
print("soup: %s\n" % soup.prettify())

# **** first title tag on the page ****
match = soup.title.text
print("match: %s" % match)

# **** first div tag on the page ****
match = soup.div
print("match: \n%s" % match)

# **** find the first div with class 'article' ****
article = soup.find('div', class_='article')
print("article: \n%s\n" % article)

# **** headline from the first article ****
headline = article.h2.a.text
print("headline: %s" % headline)

# **** summary from the first article ****
summary = article.p.text
print("summary: %s" % summary)
print()

# **** loop over all articles ****
for article in soup.find_all('div', class_='article'):
    headline = article.h2.a.text
    summary  = article.p.text
    print("headline: %s" % headline)
    print(" summary: %s" % summary)
    print()

# **** done with the first pass ****
#exit()

The following is the edited output generated while parsing the local HTML file (I did not feel like commenting out the second part of the code; sorry).

soup: <!DOCTYPE html>
<html class="no-js" lang="">
 <head>
  <title>
   Sample Test Website
  </title>
  <meta charset="utf-8"/>
  <link href="css/normalize.css" rel="stylesheet"/>
  <link href="css/main.css" rel="stylesheet"/>
 </head>
 <body>
  <h1 id="site_title">
   John's Test Website
  </h1>
  <hr/>
  <!-- first article -->
  <div class="article">
   <h2>
    <a href="http://www.johncanessa.com/2018/10/30/equal-stacks/">
     Equal Stacks - Headline
    </a>
   </h2>
   <p>
    This is a summary of Equal Stacks
   </p>
  </div>
  <hr/>
  <!-- second article -->
  <div class="article">
   <h2>
    <a href="http://www.johncanessa.com/2018/10/29/transform-strings/">
     Transform Strings - Headline
    </a>
   </h2>
   <p>
    This is a summary of Transform Strings
   </p>
  </div>
  <hr/>
  <!-- footer -->
  <div class="footer">
   <p>
    Footer Information
   </p>
  </div>
  <script src="js/vendor/modernizr-3.5.0.min.js">
  </script>
  <script src="js/plugins.js">
  </script>
  <script src="js/main.js">
  </script>
 </body>
</html>

match: Sample Test Website
match: 
<div class="article">
            <h2><a href="http://www.johncanessa.com/2018/10/30/equal-stacks/">Equal Stacks - Headline</a></h2>
            <p>This is a summary of Equal Stacks</p>
        </div>

article: 
<div class="article">
            <h2><a href="http://www.johncanessa.com/2018/10/30/equal-stacks/">Equal Stacks - Headline</a></h2>
            <p>This is a summary of Equal Stacks</p>
        </div>

headline: Equal Stacks - Headline
summary: This is a summary of Equal Stacks

headline: Equal Stacks - Headline
 summary: This is a summary of Equal Stacks

headline: Transform Strings - Headline
 summary: This is a summary of Transform Strings

We open the file and print the parsed version returned by BeautifulSoup.

We then print the contents of the <title> tag.

We then select the first <div> in the file. It seems like what we need is inside <div> tags. Next we find the first <div> with class='article' in the document and display it. Note how we are moving from the entire HTML document to a tag of interest. The idea is that after we extract what we need from the first <div>, we can loop over all such tags, knowing that we are able to parse individual ones. This is a much better approach than writing all the code at once and then starting to debug it. I personally like to start with a blank sheet and slowly add to it while testing, until a robust and well-featured piece of code emerges. This is in the spirit of Test Driven Development (TDD).

The next step is to extract the headline and the summary from the article. Such values are then printed on the console.

Now we use the previous code to build a loop over each article in the page. Each iteration returns an article which we already know how to parse. Done!

Now let's move to the second part, which parses the actual web site and collects three items instead of the two illustrated in the first section. The approach is similar. Please note that the part that deals with the CSV file was added only after the scraping of the desired parts, running in a loop, was working.
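
I did not include a listing of the second part in this post, but based on the output below, and following Corey's tutorial closely, the code should look roughly like the following sketch. The CSV file name (cms_scrape.csv) and the exception handling around the video link are assumptions on my part:

# **** imports ****
from bs4 import BeautifulSoup
import requests
import csv

# **** get the HTML for Corey's web site and parse it ****
source = requests.get('http://coreyms.com').text
soup = BeautifulSoup(source, 'lxml')

# **** open the CSV file and write the header row ****
csv_file = open('cms_scrape.csv', 'w', newline='')
csv_writer = csv.writer(csv_file)
csv_writer.writerow(['headline', 'summary', 'yt_link'])

# **** loop over all articles in the page ****
for article in soup.find_all('article'):

    # **** headline and summary for this article ****
    headline = article.h2.a.text
    summary = article.find('div', class_='entry-content').p.text

    # **** YouTube link for this article ****
    try:
        vid_src = article.find('iframe', class_='youtube-player')['src']
        vid_id = vid_src.split('/')[4]
        vid_id = vid_id.split('?')[0]
        yt_link = 'https://youtube.com/watch?v=' + vid_id
    except Exception:
        yt_link = None

    # **** display and save the fields for this article ****
    print("headline: %s" % headline)
    print(" summary: %s" % summary)
    print(" yt_link: %s" % yt_link)
    print()
    csv_writer.writerow([headline, summary, yt_link])

# **** close the CSV file ****
csv_file.close()

The try/except block is there because not every post on the page necessarily embeds a YouTube video.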

The output for the second part follows:

article: 
<article class="post-1531 post type-post status-publish format-standard has-post-thumbnail category-development category-python tag-computer-science tag-iterable tag-iterator tag-programming tag-programming-terms tag-video entry" itemscope="" itemtype="http://schema.org/CreativeWork">
 <header class="entry-header">
  <h2 class="entry-title" itemprop="headline">
   <a href="http://coreyms.com/development/python/python-coding-problem-creating-your-own-iterators" rel="bookmark">
    Python Coding Problem: Creating Your Own Iterators
   </a>
  </h2>
  <p class="entry-meta">
   <time class="entry-time" datetime="2018-10-24T12:57:06+00:00" itemprop="datePublished">
    October 24, 2018
   </time>
   by
   <span class="entry-author" itemprop="author" itemscope="" itemtype="http://schema.org/Person">
    <a class="entry-author-link" href="http://coreyms.com/author/coreymschafer" itemprop="url" rel="author">
     <span class="entry-author-name" itemprop="name">
      Corey Schafer
     </span>
    </a>
   </span>
   <span class="entry-comments-link">
    <a href="http://coreyms.com/development/python/python-coding-problem-creating-your-own-iterators#respond">
     <span class="dsq-postid" data-dsqidentifier="1531 http://coreyms.com/?p=1531">
      Leave a Comment
     </span>
    </a>
   </span>
  </p>
 </header>
 <div class="entry-content" itemprop="text">
  <p>
   In this Python Coding Problem, we will be creating our own iterators from scratch. First, we will create an iterator using a class. Then we will create an iterator with the same functionality using a generator. If you haven’t watched the tutorial video on Iterators and Iterables then I would suggest watching that first. With that said, let’s get started…
  </p>
  <p>
   <span class="embed-youtube" style="text-align:center; display: block;">
    <iframe allowfullscreen="true" class="youtube-player" height="360" src="https://www.youtube.com/embed/C3Z9lJXI6Qw?version=3&amp;rel=1&amp;fs=1&amp;autohide=2&amp;showsearch=0&amp;showinfo=1&amp;iv_load_policy=1&amp;wmode=transparent" style="border:0;" type="text/html" width="640">
    </iframe>
   </span>
  </p>
 </div>
 <footer class="entry-footer">
  <p class="entry-meta">
   <span class="entry-categories">
    Filed Under:
    <a href="http://coreyms.com/category/development" rel="category tag">
     Development
    </a>
    ,
    <a href="http://coreyms.com/category/development/python" rel="category tag">
     Python
    </a>
   </span>
   <span class="entry-tags">
    Tagged With:
    <a href="http://coreyms.com/tag/computer-science" rel="tag">
     Computer Science
    </a>
    ,
    <a href="http://coreyms.com/tag/iterable" rel="tag">
     iterable
    </a>
    ,
    <a href="http://coreyms.com/tag/iterator" rel="tag">
     iterator
    </a>
    ,
    <a href="http://coreyms.com/tag/programming" rel="tag">
     Programming
    </a>
    ,
    <a href="http://coreyms.com/tag/programming-terms" rel="tag">
     Programming Terms
    </a>
    ,
    <a href="http://coreyms.com/tag/video" rel="tag">
     Video
    </a>
   </span>
  </p>
 </footer>
</article>


headline: Python Coding Problem: Creating Your Own Iterators
 summary: In this Python Coding Problem, we will be creating our own iterators from scratch. First, we will create an iterator using a class. Then we will create an iterator with the same functionality using a generator. If you haven’t watched the tutorial video on Iterators and Iterables then I would suggest watching that first. With that said, let’s get started…
vid_src: https://www.youtube.com/embed/C3Z9lJXI6Qw?version=3&rel=1&fs=1&autohide=2&showsearch=0&showinfo=1&iv_load_policy=1&wmode=transparent
vid_id: C3Z9lJXI6Qw
yt_link: https://youtube.com/watch?v=C3Z9lJXI6Qw

headline: Python Coding Problem: Creating Your Own Iterators
 summary: In this Python Coding Problem, we will be creating our own iterators from scratch. First, we will create an iterator using a class. Then we will create an iterator with the same functionality using a generator. If you haven’t watched the tutorial video on Iterators and Iterables then I would suggest watching that first. With that said, let’s get started…
 yt_link: https://youtube.com/watch?v=C3Z9lJXI6Qw

headline: Python Coding Problem: Creating Your Own Iterators
 summary: In this Python Coding Problem, we will be creating our own iterators from scratch. First, we will create an iterator using a class. Then we will create an iterator with the same functionality using a generator. If you haven’t watched the tutorial video on Iterators and Iterables then I would suggest watching that first. With that said, let’s get started…
 yt_link: https://youtube.com/watch?v=C3Z9lJXI6Qw


I hope you enjoyed this post; take a look at Corey's blog. Attempting to explain something is a good way to verify that you understand the topic at hand.

Happy software development;

John

Follow me on Twitter: @john_canessa
