
Spam or Ham?

Updated: Mar 25, 2021


In today's world, we come across suspicious messages, fake prompts, and infected emails all the time. These messages are called spam. Hackers send them to maliciously attack our systems, and without proper awareness anyone can fall prey to these kinds of attacks. Through spam, hackers aim to extract private information such as credit card credentials. Spam detection systems came into existence to prevent this kind of attack.

The image above gives an overview of spam filtering: plenty of emails arrive every day, some go to spam and the rest stay in our primary inbox (unless further categories are defined). The blue box in the middle is the machine learning model. How does it decide which mail is spam and which one is not?


Environment Setup:

The project is set up in an Anaconda environment using a Jupyter notebook.


Dependencies/Libraries Required:

  • pandas

  • sklearn

  • pickle

  • nltk

  • matplotlib

  • wordcloud

  • seaborn

Table of Contents:


1. Loading Data: The data used in this article was gathered from Kaggle. The dataset consists of more than five thousand spam/ham messages.


2. Visualizing Data: Once we have the data, it is important to explore it and check its features. This can be done using word clouds, which form images from the most frequently occurring tokens.


3. Text Cleaning:

  • In any text mining problem, text cleaning is the first step, where we remove those words from the document which may not contribute to the information we want to extract. Messages may contain a lot of undesirable characters like punctuation marks, stop words, digits, etc., which are not helpful in detecting spam. Typical preprocessing includes the following steps:

  • Removal of stop words – Stop words like “and”, “the”, “of”, etc. are very common in all English sentences and are not very meaningful in deciding spam or legitimate status, so these words are removed from the messages.

  • Lemmatization – This is the process of grouping together the different inflected forms of a word so they can be analyzed as a single item. For example, “include”, “includes”, and “included” would all be represented as “include”. The context of the sentence is also preserved in lemmatization, as opposed to stemming (another common text mining technique, which does not consider the meaning of the sentence).

  • We still need to remove non-words like punctuation marks and special characters from the documents. There are several ways to do this. Here, we will remove such words after creating a dictionary, which is a very convenient method: once you have a dictionary, each such word only needs to be removed once. (A minimal sketch of these cleaning steps appears right after this outline.)

4. Train-Test Split: The data needs to be split into training and test sets in order to evaluate how well the classifier works.


5. Training the Model & Predictions: In order to make predictions on data using a classifier, the classifier first needs to be trained on the training data. Once the model is trained, we can make predictions on new data that the model has not seen.


6. Evaluation: The classifier can be evaluated using different evaluation measures such as the confusion matrix, F1-score, accuracy score, etc.


7. Visualize Results: Finally, we visualize the obtained evaluation measures.
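As a side note, here is a minimal sketch of the cleaning steps described in the Text Cleaning item above, using NLTK's stop word list and WordNet lemmatizer. The helper function clean_text is our own illustration and is not part of the article's pipeline:

import re
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

nltk.download('stopwords')
nltk.download('wordnet')

stop_words = set(stopwords.words('english'))
lemmatizer = WordNetLemmatizer()

def clean_text(text):
    # Keep only letters, lowercase everything, and split on whitespace
    tokens = re.sub(r'[^a-zA-Z]', ' ', text).lower().split()
    # Drop stop words and lemmatize the remaining tokens
    return ' '.join(lemmatizer.lemmatize(t) for t in tokens if t not in stop_words)

print(clean_text("These messages include lots of included offers!!!"))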


Importing the Libraries:

%matplotlib inline

# Data handling and visualization
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sn
from wordcloud import WordCloud

# Text processing and model persistence
import re
import nltk
import pickle

# Modelling and evaluation
from sklearn import metrics, model_selection, preprocessing, svm
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report, f1_score, accuracy_score

In this step, we import all the required libraries: pandas (for preprocessing), seaborn and matplotlib (for visualization), nltk (for text processing), wordcloud (for word clouds), and scikit-learn (for modelling and evaluation).


Loading The Data:

# Path to the Kaggle spam dataset (assumed filename; adjust to your local file)
loc = 'spam.csv'
raw_data = pd.read_csv(loc, engine='python')

# Keep only the label column (v1) and the message text column (v2)
data = pd.DataFrame()
data['target'] = raw_data['v1']
data['text'] = raw_data['v2']
data.head()

In this step, we read the raw data and keep only the label and text columns.


In the next step, let's see how many ham and spam messages are present.

ham = [i for i in data['target'] if i == 'ham']
spam = [i for i in data['target'] if i == 'spam']
len(ham),len(spam)

Output: (4825, 747)

So we have 4,825 ham messages and 747 spam messages in our dataset.
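As an aside, the same counts can be obtained more concisely with pandas (just an alternative, not part of the article's code):

# Count how many messages carry each label
data['target'].value_counts()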


Convert the Categorical Column into a Numerical One:


In this step, we need to convert the categorical column (the target column) into a numerical one, i.e. ham becomes 0 and spam becomes 1.

data['target'] = data['target'].map({'ham': 0, 'spam': 1})
data.head()

Wordcloud:

In this step, word clouds are built on the text column.

  1. Word cloud for spam words.

spam_words = ' '.join(list(data[data['target'] == 1]['text']))
spam_wc = WordCloud(width=800, height=500, random_state=21, max_font_size=110).generate(spam_words)
plt.figure(figsize=(10, 7))
plt.imshow(spam_wc, interpolation="bilinear")
plt.axis('off')
plt.show()

In this section, the word cloud is built on the text column for spam messages.

In the first step, all spam messages are joined into a single string. Then a word cloud with width 800, height 500, and a maximum font size of 110 is generated and plotted in a figure of width 10 and height 7, using bilinear interpolation.


2. Word cloud for Ham emails:

ham_words = ' '.join(list(data[data['target'] == 0]['text']))
ham_wc = WordCloud(width=800, height=500, random_state=21, max_font_size=110).generate(ham_words)
plt.figure(figsize=(10, 7))
plt.imshow(ham_wc, interpolation="bilinear")
plt.axis('off')
plt.show()

Vectorize the Text:


In the first part of this series, we explored the most basic type of word vectorizer, the Bag of Words Model, which will not work very well for our Spam or Ham classifier due to its simplicity.

Instead, we will use the TF-IDF vectorizer (Term Frequency – Inverse Document Frequency), a similar embedding technique that takes into account the importance of each term to each document.

While most vectorizers have their unique advantages, it is not always clear which one to use. In our case, the TF-IDF vectorizer was chosen for its simplicity and efficiency in vectorizing documents such as text messages.

TF-IDF vectorizes documents by calculating a TF-IDF statistic between the document and each term in the vocabulary. The document vector is constructed by using each statistic as an element in the vector.
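For reference, the classic TF-IDF weight of a term t in a document d, for a corpus of N documents, is (scikit-learn's TfidfVectorizer uses a smoothed variant of the same idea):

tf-idf(t, d) = tf(t, d) · log(N / df(t))

where tf(t, d) is the number of times t occurs in d and df(t) is the number of documents in which t appears. Terms that appear in nearly every document get a weight close to zero, while terms that are frequent in one document but rare in the corpus get a high weight.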

After settling on TF-IDF, we must decide the granularity of our vectorizer. Rather than assigning each whole word as its own term by hand, the vectorizer relies on a tokenizer, which splits documents into tokens (each token becoming its own term) based on white space and special characters.

# Remove very short words (1-4 characters) from the text column
data.replace(r'\b\w{1,4}\b', '', regex=True, inplace=True)

# Build a document-term matrix of word counts
vectorizer = CountVectorizer()
vectorizer.fit(data['text'])
vec = vectorizer.transform(data['text'])   # sparse matrix: one row per message
data.head()
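Note that the snippet above builds plain word-count vectors with CountVectorizer rather than TF-IDF weights. A minimal sketch of the TF-IDF alternative described earlier, using scikit-learn's TfidfVectorizer, would look like this (the rest of the pipeline stays the same, with vec_tfidf passed on in place of vec):

from sklearn.feature_extraction.text import TfidfVectorizer

# TF-IDF down-weights terms that appear in almost every message
tfidf = TfidfVectorizer(stop_words='english')   # optionally drop English stop words
vec_tfidf = tfidf.fit_transform(data['text'])   # sparse TF-IDF document-term matrix
vec_tfidf.shape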

Train Test Splitting:


So we have split our dataset into training and testing parts with a 70:30 ratio.

# 70% of the data for training, 30% for testing
Train_X, Test_X, Train_Y, Test_Y = model_selection.train_test_split(vec, data['target'], test_size=0.3)
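Optionally (not in the article's code), random_state and stratify can be passed so the split is reproducible and keeps the spam/ham ratio similar in both parts:

Train_X, Test_X, Train_Y, Test_Y = model_selection.train_test_split(
    vec, data['target'], test_size=0.3, random_state=42, stratify=data['target'])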


Training Model & Predictions:


The next step is to select the type of classifier to use. Typically we would choose several candidate classifiers and evaluate them against the test set to see which one works best. To keep things simple, we can assume that a Support Vector Machine works well enough.


The objective of the SVM is to find a hyperplane that separates the two classes with the largest possible margin, while allowing some misclassified points at a cost. In the standard soft-margin formulation, the SVM minimizes

(1/2)·‖w‖² + C · Σᵢ ξᵢ

subject to yᵢ(w·xᵢ + b) ≥ 1 − ξᵢ and ξᵢ ≥ 0, where the ξᵢ are slack variables for points that violate the margin.


The C term acts as a regularization parameter in this objective function.

A larger value of C typically results in a hyperplane with a smaller margin, as it puts more emphasis on classifying every training point correctly rather than on the margin width. Parameters such as this can be tuned via grid search (see the sketch after the training code below).

# Linear-kernel SVM; the degree and gamma arguments are ignored for a linear kernel
SVM = svm.SVC(C=1.0, kernel='linear', degree=3, gamma='auto')
SVM.fit(Train_X, Train_Y)
predictions_SVM = SVM.predict(Test_X)
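As mentioned above, C (and the kernel) can be tuned with a grid search. Here is a minimal sketch using scikit-learn's GridSearchCV; the parameter grid below is only an illustrative assumption, not the article's tuned values:

from sklearn.model_selection import GridSearchCV

# Candidate values for C and the kernel (illustrative choices)
param_grid = {'C': [0.1, 1, 10], 'kernel': ['linear', 'rbf']}

grid = GridSearchCV(svm.SVC(gamma='auto'), param_grid, cv=5, scoring='f1')
grid.fit(Train_X, Train_Y)
grid.best_params_, grid.best_score_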


Evaluation:

print("SVM Accuracy Score -> ",accuracy_score(predictions_SVM, Test_Y)*100)
print(classification_report(Test_Y,predictions_SVM))
print(f1_score(Test_Y,predictions_SVM, average='weighted'))

So we get a very good accuracy of about 97% on this dataset.


Visualize The Results Through Confusion Matrix:


A confusion matrix is used to describe the performance of a classification model. With spam as the positive class:

  • True positives (TP): cases where the classifier predicted spam and the message really was spam.

  • True negatives (TN): cases where the classifier predicted ham and the message really was ham.

  • False positives (FP) (Type I error): the classifier predicted spam, but the message was actually ham.

  • False negatives (FN) (Type II error): the classifier predicted ham, but the message was actually spam.

(A short sketch of how to read these four values out of the matrix follows the code below.)

cm=metrics.confusion_matrix(Test_Y,predictions_SVM)
plt.matshow(cm)
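Because this is a binary problem, the four counts can be read directly from the 2×2 matrix; the precision and recall below are shown only as an illustration of how the counts are used:

# Unpack the 2x2 confusion matrix: rows are true labels, columns are predictions
tn, fp, fn, tp = cm.ravel()

precision = tp / (tp + fp)   # of the messages flagged as spam, how many really were spam
recall = tp / (tp + fn)      # of the real spam messages, how many were caught
print(tn, fp, fn, tp, precision, recall)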














Create a heatmap of the confusion matrix:

plt.figure(figsize = (10,7))
ax= plt.subplot()
ax.set_title('Confusion Matrix'); 
sn.heatmap(cm, annot=True,ax = ax)
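As an optional refinement (not in the article's code), the heatmap axes can be labelled so it is easier to read; rows correspond to the true labels and columns to the predicted labels:

plt.figure(figsize=(10, 7))
ax = plt.subplot()
ax.set_title('Confusion Matrix')
sn.heatmap(cm, annot=True, fmt='d',
           xticklabels=['ham', 'spam'], yticklabels=['ham', 'spam'], ax=ax)
ax.set_xlabel('Predicted label')
ax.set_ylabel('True label')
plt.show()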











# Collect the true labels and the predictions in one DataFrame
df = pd.DataFrame(Test_Y)
df['pred'] = predictions_SVM

# Count how many test messages were predicted as ham (0) and spam (1)
ham = len([i for i in df['pred'] if i == 0])
spam = len([i for i in df['pred'] if i == 1])

In the lines of code above, we create a pred column that stores the predictions and count how many test messages are predicted as ham and spam.

Let's plot the spam and ham emails.

plt.title('Spam/ham distribution')
cat = ['spam', 'ham']
freq = [spam,ham]
plt.ylabel('frequency')
plt.bar(cat,freq,color= ['blue','green'])
plt.show()

So in this manner, we have built the spam detection model.



Thank You For Reading!!

Happy Learning


Note:

If you need an implementation of any of the topics mentioned above or assignment help on any of their variants, feel free to contact us at contact@codersarts.com.

