Oct 16, 2020

Sentiment Analysis By Retrieving Data From Web Page

In this post we retrieve data from web pages and perform the following tasks:

Steps to perform:

  • Retrieving Text from Static Website

  • Beautiful Soup

  • Using Newspaper3K to handle text cleanup

  • Several Web Examples

  • Processing Local Text File

  • Basic WordCloud with WordCloud

  • Readability with Textatistic

  • Sentiment Analysis with TextBlob

Be able to:

  • Download text from (some) web pages and prep for text analysis.

  • Clean up the text with Beautiful Soup, if possible.

  • Learn to use a library like newspaper3k's Article to extract articles from most news sites and blogs, including key metadata.

  • Practice manipulating speech transcript data from Rev.com.

  • Perform sentiment analysis and plot sentence-level subjectivity and polarity data with matplotlib and plotly express.

Wrangling Text from Web-pages is Hard!

  • Each web-site stores data differently, so you need to be a sleuth.

  • Most modern sites no longer store the text as part of the page.

  • Static web pages are hard to find.

  • You could spend a semester just on retrieving data from web-pages or other APIs.

  • Many web-pages have restrictions on what you can retrieve. (See robots.txt before making heavy use of a web-page; a quick sketch follows this list.)

  • Most book examples will use a static, locally stored text file as input.

  • Some newer tools (e.g., newspaper3k's Article) can make it "easier" to retrieve properly formatted pages.
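Checking a site's robots.txt only takes a couple of lines. A minimal sketch with requests (the rev.com URL is just an example; substitute the site you plan to scrape):

import requests

resp = requests.get('https://www.rev.com/robots.txt')
print(resp.text)    # lists the paths crawlers are asked to avoid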

Install from Command or Terminal Prompt (not Jupyter Notebook)

TextBlob Module

  • conda install -c conda-forge textblob

  • python -m textblob.download_corpora

  • Note: Windows users may need to run as administrator
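A quick way to confirm the install worked, run in a notebook cell (the sample sentence is arbitrary):

from textblob import TextBlob

blob = TextBlob('I love natural language processing!')
print(blob.sentiment)    # a Sentiment(polarity=..., subjectivity=...) namedtuple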

wordcloud Module

  • conda install -c conda-forge wordcloud

Install from Command or Terminal Prompt (continued)

Newspaper3k

  • https://github.com/codelucas/newspaper

  • Reliable text scraping

  • pip3 install newspaper3k

Textatistic Module

  • Not required for our assignments but good for practice and examples

  • pip install textatistic

  • Note: Windows users may need to run as administrator

  • Some students have reported needing to install VS Code to get Textatistic to work (ymmv)
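A quick install check (attribute names follow the Textatistic documentation; the sample text is arbitrary):

from textatistic import Textatistic

scores = Textatistic('This is a short sample sentence. Here is another one.')
print(scores.flesch_score)          # Flesch Reading Ease score
print(scores.fleschkincaid_score)   # Flesch-Kincaid grade level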

Example transcript used later in this post: https://www.rev.com/blog/transcripts/donald-trump-white-house-rally-speech-transcript-october-10-first-event-since-covid-diagnosis

Importing All Related Libraries

import requests                         # import from web
from bs4 import BeautifulSoup           # clean up text
from wordcloud import WordCloud         # create word clouds
from textblob import TextBlob           # basic NLP, install first
from textatistic import Textatistic     # readability, install first
from newspaper import Article           # article extraction, install newspaper3k first

from pathlib import Path    # for quick import of text file for NLP

import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from plotly import express as px

# Magics
%config InlineBackend.figure_format = 'retina'
%matplotlib inline

Example 1: Extraction Blocked or Not Allowed

## 403 Forbidden Error, extraction blocked / not allowed
url = 'https://www.americanrhetoric.com/speeches/mlkihaveadream.htm'

response = requests.get(url)   # retrieve the webpage
response.content               # show content from the retrieved page
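One hedged workaround, which may or may not be permitted by a given site (check its robots.txt and terms of use first): some servers return 403 to the default requests client but respond to a browser-like User-Agent header. Continuing from the cell above:

print(response.status_code)    # 403 for this page

headers = {'User-Agent': 'Mozilla/5.0'}    # illustrative browser-like header
response = requests.get(url, headers=headers)
print(response.status_code)    # may now be 200 -- or still blocked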

Example 2: Static, Predominantly Text-based Web-Page

## Static web page - JFK speech re: moon
url = 'https://er.jsc.nasa.gov/seh/ricetalk.htm'

response = requests.get(url)
response.content       # notice the moderate amount of HTML code

soup = BeautifulSoup(response.content, 'html5lib')
text = soup.get_text(strip=True)   # text without tags
text   # BeautifulSoup has done a decent job removing the HTML from this page
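The topic list above promises a basic word cloud, and this cleaned-up speech text is a good place to try one. A minimal sketch (the size and colormap settings are just illustrative choices):

wordcloud = WordCloud(width=800, height=400, colormap='viridis').generate(text)

plt.figure(figsize=(10, 5))
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis('off')
plt.show()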

Example 3: Sometimes it is easier to copy and paste to a file

## Somewhat hidden text
url = 'https://www.whitehouse.gov/briefings-statements/remarks-president-trump-2020-united-states-military-academy-west-point-graduation-ceremony/'

response = requests.get(url)
response.content            # UGLY! significant amount of code -- where's the text?

## Soup doesn't help much in this case
soup = BeautifulSoup(response.content, 'html5lib')
text = soup.get_text(strip=True)
text

Let's try Article

Steps:

  1. Install newspaper3k via pip (only do this once per machine)

  2. Import Article from newspaper (once per notebook)

  3. Create an article object and set it to the URL of the web-page (required once per web-page)

  4. Download (required after creating the article object)

  5. Parse the downloaded object (required once per download; separates the data into text, authors, title, date, etc.)

Now you are ready for other tasks (view text, check authors and publication date; perform NLP tasks).

https://newspaper.readthedocs.io/en/latest/

Try it!

Try the newspaper Article code on your own link to a site that is likely to have all of the attributes.

# Change the url to your own; comment out all urls but one.
# Note: the full article may not appear if it is behind a paywall.
url = 'https://hbr.org/2020/04/bringing-an-analytics-mindset-to-the-pandemic'
# url = 'https://www.wsj.com/articles/ceos-increasingly-see-sustainability-as-path-to-profitability-11602535250'
# url = 'https://www.cnn.com/2020/10/13/health/us-coronavirus-tuesday/index.html'

article = Article(url)
article.download()
article.parse()

print("Title: ", article.title)  
 
print("Authors: ", article.authors)  
 
print("Publication Date: ", article.publish_date)  
 
print("First Image:", article.top_image)  
 
print("Video Links:", article.movies)

print("Title: ", article.title)
 
print()
 
print(article.text)

article.nlp()   # newspaper3k's built-in NLP: extracts keywords and a summary

print("KeyWords: ", article.keywords)   # list of keywords extracted from the article
print()
print("Summary: ", article.summary)     # auto-generated summary of the article text

Processing a Transcript with Newspaper3k

  • We can leverage Article to retrieve the text of transcribed speeches, though we may need to process the data a bit to prepare it for analysis.

  • Most transcripts include speaker names, time stamps and other information.

Speech Transcript

  • These examples are specifically for the transcript site https://www.rev.com/blog/transcripts

  • Modifications will likely be needed for other speech sources

# Set the url

# From rev.com
url = 'https://www.rev.com/blog/transcripts/donald-trump-mosinee-wi-rally-speech-transcript-september-17'
event = '-mosinee-2020'   # this will be part of the file name for a text file we create

# url = 'https://www.rev.com/blog/transcripts/ruth-bader-ginsburg-stanford-rathbun-lecture-transcript-2017'
# event = '-stanfordlecture-2017'

# Minimum code needed to get to the text of the speech
article = Article(url)
article.download()
article.parse()
print(article.text)

# write the text to a file

text = article.text   # the parsed speech text from above

with open('speech.txt', 'w') as f:
    f.write(text)

with open('speech.txt', 'r') as f:
    for cnt, line in enumerate(f):
        print(f'Line {cnt}: {line}')

# Custom processing for the rev site:
# line 0 = speaker and time
# line 2 = what the speaker said
# lines 1 and 3 = blanks
# create four lists of the components of the speech

with open('speech.txt', 'r') as f:
    speech = f.readlines()

tmp = []
speaker = []
time = []
words = []

for cnt, line in enumerate(speech):
    if cnt % 2 == 0:
        tmp.append(line.rstrip())   # temp list of just the text lines 0, 2, ...

for i in range(0, len(tmp), 2):
    speaker.append(tmp[i].split(': ')[0])   # split the speaker line into 2 parts
    time.append(tmp[i].split(': ')[1])      # time stamp of the segment
    words.append(tmp[i+1])                  # words from the speaker

# find unique speaker names for a later filter
set(speaker)

spkr = 'Donald Trump'
file = spkr.split()[-1] + event + '.txt'   # last name + event label
file

# use write instead of writelines since we don't want the entire list
# remember to add a newline after each segment

with open(file, 'w') as f:
    for i in range(0, len(speaker)):
        if speaker[i] == spkr:
            f.write(words[i] + '\n')

# Confirm a good file
text = Path(file).read_text()
text
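With a clean single-speaker file in hand, we can do the sentiment analysis promised at the top: score every sentence's polarity and subjectivity with TextBlob and plot them with plotly express. A minimal sketch (the DataFrame column names and plot settings are my own choices):

# sentence-level sentiment: polarity in [-1, 1], subjectivity in [0, 1]
blob = TextBlob(text)

df = pd.DataFrame({
    'sentence': [str(s) for s in blob.sentences],
    'polarity': [s.sentiment.polarity for s in blob.sentences],
    'subjectivity': [s.sentiment.subjectivity for s in blob.sentences]})
df['sentence_num'] = df.index    # order the sentences appear in the speech

# interactive scatter; hover to see the underlying sentence
fig = px.scatter(df, x='sentence_num', y='polarity',
                 color='subjectivity', hover_data=['sentence'],
                 title=f'Sentence-level sentiment: {spkr}')
fig.show()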

Web scraping with Beautiful Soup is widely used in machine learning and data science to extract useful data from the web and then run machine learning tasks on it, such as sentiment analysis in NLP.

If need any help related to this then contact us at: contact@codersarts.com