
# Email Spam Detection with Machine Learning

### Import the libraries

In [1]:
import numpy as np
import pandas as pd
import nltk
from nltk.corpus import stopwords
import string


#### Load the data and print the first 5 rows

In [3]:
df = pd.read_csv("data.csv")
df.head(5)

Out[3]:
text spam
0 Subject: naturally irresistible your corporate... 1
1 Subject: the stock trading gunslinger fanny i... 1
2 Subject: unbelievable new homes made easy im ... 1
3 Subject: 4 color printing special request add... 1
4 Subject: do not have money , get software cds ... 1

#### Now let’s explore the data and get the number of rows and columns

In [4]:
df.shape

Out[4]:
(5728, 2)

#### To get the column names in the data set

In [5]:
df.columns

Out[5]:
Index(['text', 'spam'], dtype='object')

### To check for duplicates and remove them

In [6]:
df.drop_duplicates(inplace=True)
print(df.shape)

(5695, 2)


### To see the number of missing values in each column

In [7]:
print(df.isnull().sum())

text    0
spam    0
dtype: int64


### Stop words in natural language processing are very common words (such as "the", "is" and "and") that carry little information for classification

In [8]:
# download the stopwords package
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to
[nltk_data]     /home/webtunix/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!

Out[8]:
True

### Now create a function to clean the text and return the tokens. The text is cleaned by first removing punctuation and then removing the stop words.

In [9]:
def process(text):
    # remove punctuation characters
    nopunc = [char for char in text if char not in string.punctuation]
    nopunc = ''.join(nopunc)

    # drop stop words (compared in lowercase)
    clean = [word for word in nopunc.split() if word.lower() not in stopwords.words('english')]
    return clean

# to show the tokenization
df['text'].head().apply(process)

Out[9]:
0    [Subject, naturally, irresistible, corporate, ...
1    [Subject, stock, trading, gunslinger, fanny, m...
2    [Subject, unbelievable, new, homes, made, easy...
3    [Subject, 4, color, printing, special, request...
4    [Subject, money, get, software, cds, software,...
Name: text, dtype: object
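
To make the two cleaning steps concrete without the NLTK download, here is a minimal sketch that uses a small hand-written stop word set. `TOY_STOP_WORDS` and `clean_tokens` are hypothetical names used only for illustration, and the real NLTK English list is much longer.

```python
import string

# Assumption: a tiny stand-in for stopwords.words('english'), for illustration only.
TOY_STOP_WORDS = {"your", "the", "do", "not", "have", "is", "a", "to"}

def clean_tokens(text, stop_words=TOY_STOP_WORDS):
    """Strip punctuation, then drop tokens whose lowercase form is a stop word."""
    no_punct = "".join(ch for ch in text if ch not in string.punctuation)
    return [word for word in no_punct.split() if word.lower() not in stop_words]

tokens = clean_tokens("Subject: do not have money , get software cds !")
print(tokens)  # ['Subject', 'money', 'get', 'software', 'cds']
```

Keeping the stop words in a set makes each membership check cheap, which matters when the same check runs for every word of several thousand emails.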

### Now convert the text into a matrix of token counts

In [10]:
from sklearn.feature_extraction.text import CountVectorizer
message = CountVectorizer(analyzer=process).fit_transform(df['text'])
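
A matrix of token counts has one row per email and one column per distinct token. The sketch below builds such a matrix by hand in plain Python to show the idea. It is not scikit-learn's implementation (which returns a sparse matrix), and `count_matrix` and the sample token lists are hypothetical.

```python
from collections import Counter

def count_matrix(token_lists):
    """Build a dense count matrix: one row per document, one column per vocabulary term."""
    vocabulary = sorted({tok for tokens in token_lists for tok in tokens})
    column = {term: i for i, term in enumerate(vocabulary)}
    rows = []
    for tokens in token_lists:
        row = [0] * len(vocabulary)
        for term, n in Counter(tokens).items():
            row[column[term]] = n
        rows.append(row)
    return vocabulary, rows

# Assumption: three already-tokenized toy documents.
docs = [["free", "money", "free"], ["meeting", "money"], ["free", "offer"]]
vocabulary, matrix = count_matrix(docs)
print(vocabulary)  # ['free', 'meeting', 'money', 'offer']
print(matrix)      # [[2, 0, 1, 0], [0, 1, 1, 0], [1, 0, 0, 1]]
```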


### Now we need to split the data into training and testing sets. The model is trained on the training set, and the held-out test set is used later to check whether the predictions match the actual values.

In [11]:
#split the data into 80% training and 20% testing
from sklearn.model_selection import train_test_split
xtrain, xtest, ytrain, ytest = train_test_split(message, df['spam'], test_size=0.20, random_state=0)
# To see the shape of the data
print(message.shape)

(5695, 37229)
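
The sizes of the two sets follow from the row count and `test_size`. The short calculation below assumes that the test count is rounded up and the training set takes the remaining rows. The result agrees with the support totals in the evaluation reports further down.

```python
import math

n_samples = 5695   # rows left after removing duplicates
test_size = 0.20   # fraction passed to train_test_split

# Assumption: the test count is rounded up and the training set takes the rest.
n_test = math.ceil(test_size * n_samples)
n_train = n_samples - n_test
print(n_train, n_test)  # 4556 1139
```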


### Now we need to create and train the Multinomial Naive Bayes classifier, which is suitable for classification with discrete features such as word counts

In [12]:
# create and train the Naive Bayes Classifier
from sklearn.naive_bayes import MultinomialNB
classifier = MultinomialNB().fit(xtrain, ytrain)
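
To show what the classifier learns from the counts, here is a plain-Python sketch of multinomial Naive Bayes with additive smoothing. It is an illustration under simplifying assumptions (dense lists, a toy four-term vocabulary), not scikit-learn's implementation, and `train_multinomial_nb` and `predict` are hypothetical helper names.

```python
import math

def train_multinomial_nb(X, y, alpha=1.0):
    """Estimate log priors and smoothed log term probabilities per class."""
    classes = sorted(set(y))
    n_features = len(X[0])
    log_prior, log_likelihood = {}, {}
    for c in classes:
        rows = [x for x, label in zip(X, y) if label == c]
        log_prior[c] = math.log(len(rows) / len(X))
        # Total count of each term over all documents of class c.
        term_counts = [sum(row[j] for row in rows) for j in range(n_features)]
        total = sum(term_counts) + alpha * n_features
        log_likelihood[c] = [math.log((n + alpha) / total) for n in term_counts]
    return classes, log_prior, log_likelihood

def predict(model, x):
    """Return the class with the highest log posterior score for count vector x."""
    classes, log_prior, log_likelihood = model
    def score(c):
        return log_prior[c] + sum(n * lp for n, lp in zip(x, log_likelihood[c]))
    return max(classes, key=score)

# Columns: counts of the hypothetical terms "free", "meeting", "money", "offer".
X = [[2, 0, 1, 0], [1, 0, 0, 1], [0, 1, 1, 0], [0, 2, 0, 0]]
y = [1, 1, 0, 0]  # 1 = spam, 0 = not spam
model = train_multinomial_nb(X, y)
print(predict(model, [1, 0, 0, 1]))  # 1
print(predict(model, [0, 1, 0, 0]))  # 0
```

The smoothing term `alpha` keeps a term that never appeared in one class from forcing that class's probability to zero.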

##### To see the classifier's predictions and the actual values on the training set
In [13]:
print(classifier.predict(xtrain))
print(ytrain.values)

[0 0 0 ... 0 0 0]
[0 0 0 ... 0 0 0]

##### Now let’s see how well our model performed by evaluating the Naive Bayes classifier with the classification report, confusion matrix and accuracy score
In [14]:
# Evaluating the model on the training data set
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score
pred = classifier.predict(xtrain)
print(classification_report(ytrain, pred))
print()
print("Confusion Matrix: \n", confusion_matrix(ytrain, pred))
print("Accuracy: \n", accuracy_score(ytrain, pred))

              precision    recall  f1-score   support

           0       1.00      1.00      1.00      3457
           1       0.99      1.00      0.99      1099

    accuracy                           1.00      4556
   macro avg       0.99      1.00      1.00      4556
weighted avg       1.00      1.00      1.00      4556

Confusion Matrix:
[[3445   12]
[   1 1098]]
Accuracy:
0.9971466198419666
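
The reported figures can be recomputed from the confusion matrix, whose rows are the actual classes and whose columns are the predicted classes. The sketch below does this for the spam class (label 1). `binary_metrics` is a hypothetical helper written for this illustration.

```python
def binary_metrics(cm):
    """Compute accuracy, and precision/recall for class 1, from [[tn, fp], [fn, tp]]."""
    (tn, fp), (fn, tp) = cm
    accuracy = (tn + tp) / (tn + fp + fn + tp)
    precision = tp / (tp + fp)   # of the emails predicted spam, how many were spam
    recall = tp / (tp + fn)      # of the actual spam emails, how many were caught
    return accuracy, precision, recall

train_cm = [[3445, 12], [1, 1098]]
accuracy, precision, recall = binary_metrics(train_cm)
print(f"accuracy={accuracy:.4f} precision={precision:.2f} recall={recall:.2f}")
# accuracy=0.9971 precision=0.99 recall=1.00
```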

###### The model is 99.71% accurate on the training data, but that is data it has already seen. Let’s test the model on the test data set (xtest & ytest) by printing the predicted values and the actual values to see if the model can accurately classify the email text.
In [15]:
#print the predictions
print(classifier.predict(xtest))
#print the actual values
print(ytest.values)

[1 0 0 ... 0 0 0]
[1 0 0 ... 0 0 0]

###### Now let’s evaluate the model on the test data set
In [16]:
# Evaluating the model on the test data set
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score
pred = classifier.predict(xtest)
print(classification_report(ytest, pred))
print()
print("Confusion Matrix: \n", confusion_matrix(ytest, pred))
print("Accuracy: \n", accuracy_score(ytest, pred))

              precision    recall  f1-score   support

           0       1.00      0.99      0.99       870
           1       0.97      1.00      0.98       269

    accuracy                           0.99      1139
   macro avg       0.98      0.99      0.99      1139
weighted avg       0.99      0.99      0.99      1139

Confusion Matrix:
[[862   8]
[  1 268]]
Accuracy:
0.9920983318700615
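
The test confusion matrix also shows which kind of mistake the model makes. The short calculation below splits the errors into legitimate emails flagged as spam and spam emails that were missed. The variable names are chosen for this illustration only.

```python
test_cm = [[862, 8], [1, 268]]  # rows: actual 0/1, columns: predicted 0/1
(tn, fp), (fn, tp) = test_cm

total = tn + fp + fn + tp
errors = fp + fn
false_positive_rate = fp / (fp + tn)   # share of legitimate emails flagged as spam
false_negative_rate = fn / (fn + tp)   # share of spam emails that slipped through

print(errors, total)                   # 9 1139
print(f"{false_positive_rate:.4f}")    # 0.0092
print(f"{false_negative_rate:.4f}")    # 0.0037
```

On this test set the model misclassifies 9 of 1139 emails, and 8 of those are legitimate emails marked as spam.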