Email Spam Detection with Machine Learning

In this data science project, I will show you how to detect email spam with Python, using natural language processing (NLP) to prepare the text and a machine learning classifier to label it.

The program predicts whether an email is spam (1) or not spam (0).

Import the libraries :

In [1]:
import numpy as np
import pandas as pd
import nltk
from nltk.corpus import stopwords
import string

Load the data and print the first 5 rows :

In [3]:
df = pd.read_csv("data.csv")
df.head()
Out[3]:
text spam
0 Subject: naturally irresistible your corporate... 1
1 Subject: the stock trading gunslinger fanny i... 1
2 Subject: unbelievable new homes made easy im ... 1
3 Subject: 4 color printing special request add... 1
4 Subject: do not have money , get software cds ... 1

Now let’s explore the data and get the number of rows & columns :

In [4]:
df.shape
Out[4]:
(5728, 2)

To get the column names in the data set :

In [5]:
df.columns
Out[5]:
Index(['text', 'spam'], dtype='object')

To check for duplicates and remove them :

In [6]:
df.drop_duplicates(inplace=True)
print(df.shape)
(5695, 2)
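
The cell above removes duplicates without first reporting how many there were. A minimal sketch of counting duplicate rows with `duplicated()` before dropping them is shown below; the small DataFrame `toy` is made up for illustration and is not taken from data.csv.

```python
import pandas as pd

# Toy data for illustration only; the real data set is loaded from data.csv.
toy = pd.DataFrame({
    "text": ["Subject: win money", "Subject: meeting at 10", "Subject: win money"],
    "spam": [1, 0, 1],
})

# duplicated() flags every repeat of an earlier row, so the sum is the
# number of rows that drop_duplicates() will remove.
n_duplicates = int(toy.duplicated().sum())
print(n_duplicates)

deduped = toy.drop_duplicates()
print(deduped.shape)
```

On the full data set, the shapes before and after (5728 and 5695 rows) imply that 33 duplicate rows were dropped.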

To see the number of missing values in each column :

In [7]:
print(df.isnull().sum())
text    0
spam    0
dtype: int64

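Now download the stop words.

In natural language processing, stop words are very common words (such as "the", "is" and "a") that carry little information for classification and are usually removed.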

In [8]:
# download the stopwords package
nltk.download("stopwords")
[nltk_data] Downloading package stopwords to
[nltk_data]     /home/webtunix/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
Out[8]:
True

Now create a function to clean the text and return the tokens. The cleaning is done in two steps: first remove the punctuation, then remove the stop words.

In [9]:
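# build the stop-word set once instead of reloading the list for every word
stop_words = set(stopwords.words('english'))

def process(text):
    # remove punctuation characters
    nopunc = ''.join(char for char in text if char not in string.punctuation)
    # split on whitespace and drop stop words (case-insensitive)
    clean = [word for word in nopunc.split() if word.lower() not in stop_words]
    return clean

# to show the tokenization
df['text'].head().apply(process)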
Out[9]:
0    [Subject, naturally, irresistible, corporate, ...
1    [Subject, stock, trading, gunslinger, fanny, m...
2    [Subject, unbelievable, new, homes, made, easy...
3    [Subject, 4, color, printing, special, request...
4    [Subject, money, get, software, cds, software,...
Name: text, dtype: object
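
To make the two cleaning steps easier to follow, here is a small sketch that runs them on a single string. It assumes a tiny hand-written stop-word set (`TOY_STOP_WORDS`, hypothetical) so that it runs without downloading anything; the tutorial itself uses NLTK's much larger English list.

```python
import string

# Hypothetical, tiny stop-word set used only for this illustration;
# the tutorial itself uses NLTK's English stop-word list.
TOY_STOP_WORDS = {"the", "is", "a", "to", "your", "do", "not", "have"}

def process_demo(text):
    # Step 1: drop every punctuation character.
    no_punct = "".join(ch for ch in text if ch not in string.punctuation)
    # Step 2: split on whitespace and keep words that are not stop words.
    return [w for w in no_punct.split() if w.lower() not in TOY_STOP_WORDS]

tokens = process_demo("Subject: do not have money, get software!")
print(tokens)
```

The word "Subject" survives because it is not a stop word, which is why every token list in the output above starts with it.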

Now convert the text into a matrix of token counts :

In [10]:
from sklearn.feature_extraction.text import CountVectorizer
message = CountVectorizer(analyzer=process).fit_transform(df['text'])
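
A matrix of token counts has one row per email and one column per distinct token. The sketch below shows this on two made-up documents, using the default analyzer of CountVectorizer rather than the `process` function, purely to keep the example small.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Two toy documents for illustration only.
docs = ["free money free", "meeting money"]

vectorizer = CountVectorizer()            # default word-level analyzer
counts = vectorizer.fit_transform(docs)   # sparse matrix: rows = documents, columns = tokens

# vocabulary_ maps each token to its column index (columns are sorted alphabetically).
print(vectorizer.vocabulary_)
print(counts.toarray())
```

Here the columns are "free", "meeting" and "money", so the first row records two occurrences of "free" and one of "money". In the tutorial, passing `analyzer=process` makes CountVectorizer use the cleaning function to produce the tokens instead.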

Now we need to split the data into training and testing sets. The model is trained on the training set, and the held-out testing set is used later to check whether its predictions match the actual values.

In [11]:
#split the data into 80% training and 20% testing
from sklearn.model_selection import train_test_split
xtrain, xtest, ytrain, ytest = train_test_split(message, df['spam'], test_size=0.20, random_state=0)
# To see the shape of the data
print(message.shape)
(5695, 37229)
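
One caveat: in cell 10 the vectorizer is fitted on all rows before the split, so the vocabulary also reflects the test emails. A common alternative is to split the raw text first and fit CountVectorizer on the training portion only. The sketch below illustrates that alternative on made-up toy emails; it is not the approach used in this tutorial, and the `stratify` argument is an extra assumption that keeps the class ratio equal in both parts.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split

# Toy emails and labels for illustration only (1 = spam, 0 = not spam).
texts = ["win free money", "free prize now", "claim free cash", "cheap money offer",
         "meeting at noon", "project status update", "lunch on friday", "see attached report"]
labels = [1, 1, 1, 1, 0, 0, 0, 0]

# stratify keeps the spam / not-spam ratio the same in both parts.
text_train, text_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.25, random_state=0, stratify=labels)

vectorizer = CountVectorizer()
x_train = vectorizer.fit_transform(text_train)  # learn the vocabulary from training text only
x_test = vectorizer.transform(text_test)        # reuse that vocabulary; unseen words are ignored

print(x_train.shape, x_test.shape)
```

Both matrices end up with the same number of columns, because the test text is mapped onto the vocabulary learned from the training text.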

Now we need to create and train the Multinomial Naive Bayes classifier, which is suitable for classification with discrete features such as word counts.

In [12]:
# create and train the Naive Bayes Classifier
from sklearn.naive_bayes import MultinomialNB
classifier = MultinomialNB().fit(xtrain, ytrain)
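
To see what the trained classifier does with a new message, here is a minimal sketch on four made-up emails. The names `train_texts`, `clf` and `new_counts` are illustrative; the key point is that a new message must be transformed with the same fitted vectorizer that produced the training matrix.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Toy training data for illustration only (1 = spam, 0 = not spam).
train_texts = ["win free money now", "free prize claim now",
               "meeting agenda for monday", "project report attached"]
train_labels = [1, 1, 0, 0]

vectorizer = CountVectorizer()
clf = MultinomialNB().fit(vectorizer.fit_transform(train_texts), train_labels)

# Transform (not fit_transform) the new message so it uses the training vocabulary.
new_counts = vectorizer.transform(["claim your free money"])
prediction = clf.predict(new_counts)[0]
spam_probability = clf.predict_proba(new_counts)[0][1]
print(prediction, round(spam_probability, 3))
```

The word "your" never appeared in the toy training text, so it is ignored; the remaining words were only seen in spam, which pushes the prediction towards class 1.
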
To see the classifier’s predictions and the actual values on the training data set :
In [13]:
print(classifier.predict(xtrain))
print(ytrain.values)
[0 0 0 ... 0 0 0]
[0 0 0 ... 0 0 0]
Now let’s see how well the model performed on the training data by printing the classification report, confusion matrix and accuracy score.
In [14]:
# Evaluating the model on the training data set
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score
pred = classifier.predict(xtrain)
print(classification_report(ytrain, pred))
print()
print("Confusion Matrix: \n", confusion_matrix(ytrain, pred))
print("Accuracy: \n", accuracy_score(ytrain, pred))
              precision    recall  f1-score   support

           0       1.00      1.00      1.00      3457
           1       0.99      1.00      0.99      1099

    accuracy                           1.00      4556
   macro avg       0.99      1.00      1.00      4556
weighted avg       1.00      1.00      1.00      4556


Confusion Matrix: 
 [[3445   12]
 [   1 1098]]
Accuracy: 
 0.9971466198419666
The model is 99.71% accurate on the training data, but that is data it has already seen, so this figure is likely optimistic. Let’s test the model on the test data set (xtest & ytest) by printing the predicted and actual values to see whether it classifies unseen email text correctly.
In [15]:
#print the predictions
print(classifier.predict(xtest))
#print the actual values
print(ytest.values)
[1 0 0 ... 0 0 0]
[1 0 0 ... 0 0 0]
Now let’s evaluate the model on the test data set :
In [16]:
# Evaluating the model on the test data set
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score
pred = classifier.predict(xtest)
print(classification_report(ytest, pred))
print()
print("Confusion Matrix: \n", confusion_matrix(ytest, pred))
print("Accuracy: \n", accuracy_score(ytest, pred))
              precision    recall  f1-score   support

           0       1.00      0.99      0.99       870
           1       0.97      1.00      0.98       269

    accuracy                           0.99      1139
   macro avg       0.98      0.99      0.99      1139
weighted avg       0.99      0.99      0.99      1139


Confusion Matrix: 
 [[862   8]
 [  1 268]]
Accuracy: 
 0.9920983318700615
The classifier correctly labelled the emails in the test data as spam or not spam with 99.21% accuracy.
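
The numbers in the report can be reproduced by hand from the test-set confusion matrix. The sketch below does that arithmetic, assuming scikit-learn's convention that rows are actual classes and columns are predicted classes.

```python
# Test-set confusion matrix reported above: rows = actual, columns = predicted.
tn, fp = 862, 8    # actual not-spam (0): correctly kept, wrongly flagged as spam
fn, tp = 1, 268    # actual spam (1): missed, correctly flagged

total = tn + fp + fn + tp
accuracy = (tn + tp) / total
precision_spam = tp / (tp + fp)   # of emails flagged as spam, the share that really are spam
recall_spam = tp / (tp + fn)      # of real spam emails, the share that were caught

print(round(accuracy, 4), round(precision_spam, 2), round(recall_spam, 2))
```

So out of 1139 test emails, 8 legitimate emails were flagged as spam and 1 spam email was missed, which is where the 0.97 precision and 1.00 recall for class 1 come from.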
