
Spam Data Analysis Using Python

Data analysis involves cleaning and processing data into a useful form to support better business decisions. Let's walk through a spam data analysis with Python on a dataset of text messages labelled ham or spam, laying the groundwork for telling whether a given text is spam or not.

Spam Data Analysis with Python

Importing the libraries and loading the dataset.
In [27]:
import pandas as pd
import numpy as np
# a plain pd.read_csv("spam.csv") is left out; the file is read with the 'latin' (Latin-1) encoding
df = pd.read_csv("spam.csv", encoding='latin')
print(df)
        v1                                                 v2 Unnamed: 2  \
0      ham  Go until jurong point, crazy.. Available only ...        NaN   
1      ham                      Ok lar... Joking wif u oni...        NaN   
2     spam  Free entry in 2 a wkly comp to win FA Cup fina...        NaN   
3      ham  U dun say so early hor... U c already then say...        NaN   
4      ham  Nah I don't think he goes to usf, he lives aro...        NaN   
...    ...                                                ...        ...   
5567  spam  This is the 2nd time we have tried 2 contact u...        NaN   
5568   ham              Will Ì_ b going to esplanade fr home?        NaN   
5569   ham  Pity, * was in mood for that. So...any other s...        NaN   
5570   ham  The guy did some bitching but I acted like i'd...        NaN   
5571   ham                         Rofl. Its true to its name        NaN   

     Unnamed: 3 Unnamed: 4  
0           NaN        NaN  
1           NaN        NaN  
2           NaN        NaN  
3           NaN        NaN  
4           NaN        NaN  
...         ...        ...  
5567        NaN        NaN  
5568        NaN        NaN  
5569        NaN        NaN  
5570        NaN        NaN  
5571        NaN        NaN  

[5572 rows x 5 columns]
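
In the rows printed above, only v1 (the label) and v2 (the message) hold values; the three Unnamed columns show NaN. As a minimal sketch, assuming the same column names, the load can be restricted to those two columns with the usecols parameter of read_csv. The helper name load_messages, the new column names label and text, and the in-memory sample bytes are illustrative assumptions, not part of the original notebook.

```python
import io
import pandas as pd

def load_messages(source):
    """Read only the label and message columns and give them clearer names.

    `source` may be a file path such as "spam.csv" or a binary file-like object.
    """
    df = pd.read_csv(source, encoding="latin-1", usecols=["v1", "v2"])
    return df.rename(columns={"v1": "label", "v2": "text"})

# Tiny in-memory stand-in for spam.csv (made-up rows, same column names as above).
sample = (
    "v1,v2,Unnamed: 2,Unnamed: 3,Unnamed: 4\n"
    "ham,Ok lar... Joking wif u oni...,,,\n"
    "spam,Free entry in 2 a wkly comp,,,\n"
).encode("latin-1")

messages = load_messages(io.BytesIO(sample))
print(messages)
```

With the real file, the call would be load_messages("spam.csv"); the rest of this walkthrough keeps the original names v1 and v2.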

Inspecting the columns with info()

In [2]:
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5572 entries, 0 to 5571
Data columns (total 5 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   v1          5572 non-null   object
 1   v2          5572 non-null   object
 2   Unnamed: 2  50 non-null     object
 3   Unnamed: 3  12 non-null     object
 4   Unnamed: 4  6 non-null      object
dtypes: object(5)
memory usage: 217.8+ KB

Previewing the first 30 rows with head()

In [31]:
df.head(30)
Out[31]:
v1 v2 Unnamed: 2 Unnamed: 3 Unnamed: 4
0 ham Go until jurong point, crazy.. Available only ... NaN NaN NaN
1 ham Ok lar... Joking wif u oni... NaN NaN NaN
2 spam Free entry in 2 a wkly comp to win FA Cup fina... NaN NaN NaN
3 ham U dun say so early hor... U c already then say... NaN NaN NaN
4 ham Nah I don't think he goes to usf, he lives aro... NaN NaN NaN
5 spam FreeMsg Hey there darling it's been 3 week's n... NaN NaN NaN
6 ham Even my brother is not like to speak with me. ... NaN NaN NaN
7 ham As per your request 'Melle Melle (Oru Minnamin... NaN NaN NaN
8 spam WINNER!! As a valued network customer you have... NaN NaN NaN
9 spam Had your mobile 11 months or more? U R entitle... NaN NaN NaN
10 ham I'm gonna be home soon and i don't want to tal... NaN NaN NaN
11 spam SIX chances to win CASH! From 100 to 20,000 po... NaN NaN NaN
12 spam URGENT! You have won a 1 week FREE membership ... NaN NaN NaN
13 ham I've been searching for the right words to tha... NaN NaN NaN
14 ham I HAVE A DATE ON SUNDAY WITH WILL!! NaN NaN NaN
15 spam XXXMobileMovieClub: To use your credit, click ... NaN NaN NaN
16 ham Oh k...i'm watching here:) NaN NaN NaN
17 ham Eh u remember how 2 spell his name... Yes i di... NaN NaN NaN
18 ham Fine if thatåÕs the way u feel. ThatåÕs the wa... NaN NaN NaN
19 spam England v Macedonia - dont miss the goals/team... NaN NaN NaN
20 ham Is that seriously how you spell his name? NaN NaN NaN
21 ham I‰Û÷m going to try for 2 months ha ha only joking NaN NaN NaN
22 ham So Ì_ pay first lar... Then when is da stock c... NaN NaN NaN
23 ham Aft i finish my lunch then i go str down lor. ... NaN NaN NaN
24 ham Ffffffffff. Alright no way I can meet up with ... NaN NaN NaN
25 ham Just forced myself to eat a slice. I'm really ... NaN NaN NaN
26 ham Lol your always so convincing. NaN NaN NaN
27 ham Did you catch the bus ? Are you frying an egg ... NaN NaN NaN
28 ham I'm back &amp; we're packing the car now, I'll... NaN NaN NaN
29 ham Ahhh. Work. I vaguely remember that! What does... NaN NaN NaN

Checking the dimensions with shape

In [4]:
df.shape
Out[4]:
(5572, 5)

Finding the duplicate rows in the dataset

In [5]:
dupl=df[df.duplicated()]
print("printing all duplicate values\n")
dupl
printing all duplicate values

Out[5]:
v1 v2 Unnamed: 2 Unnamed: 3 Unnamed: 4
102 ham As per your request 'Melle Melle (Oru Minnamin... NaN NaN NaN
153 ham As per your request 'Melle Melle (Oru Minnamin... NaN NaN NaN
206 ham As I entered my cabin my PA said, '' Happy B'd... NaN NaN NaN
222 ham Sorry, I'll call later NaN NaN NaN
325 ham No calls..messages..missed calls NaN NaN NaN
... ... ... ... ... ...
5524 spam You are awarded a SiPix Digital Camera! call 0... NaN NaN NaN
5535 ham I know you are thinkin malaria. But relax, chi... NaN NaN NaN
5539 ham Just sleeping..and surfing NaN NaN NaN
5553 ham Hahaha..use your brain dear NaN NaN NaN
5558 ham Sorry, I'll call later NaN NaN NaN

403 rows × 5 columns
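
By default, duplicated() flags only the second and later copies of a row, so the first occurrence of each repeated message is not part of the 403 rows listed above. A small sketch on made-up rows shows the difference the keep parameter makes (the toy DataFrame is an assumption for illustration):

```python
import pandas as pd

toy = pd.DataFrame({
    "v1": ["ham", "ham", "spam", "ham"],
    "v2": ["Sorry, I'll call later", "Ok", "WINNER!!", "Sorry, I'll call later"],
})

later_copies = toy[toy.duplicated()]            # default keep='first'
all_copies = toy[toy.duplicated(keep=False)]    # every row that has a twin

print(later_copies.index.tolist())  # [3]
print(all_copies.index.tolist())    # [0, 3]
```

Using keep=False is handy for reviewing the repeated messages side by side before deleting anything.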

Deleting the duplicate rows

In [6]:
#delete all duplicate data
df.drop_duplicates(inplace=True)
# Again check if any duplicate data left

dupl1 = df[df.duplicated()]
dupl1
Out[6]:
v1 v2 Unnamed: 2 Unnamed: 3 Unnamed: 4

Summary statistics of the data

In [7]:
df.describe()
Out[7]:
v1 v2 Unnamed: 2 Unnamed: 3 Unnamed: 4
count 5169 5169 43 10 5
unique 2 5169 43 10 5
top ham Oops sorry. Just to check that you don't mind ... this wont even start........ Datz confidence.." i wil tolerat.bcs ur my someone..... But Never comfort me with a lie\" gud ni8 and swe...
freq 4516 1 1 1 1

Counting and dropping null values

In [8]:
df.isnull().sum()
Out[8]:
v1               0
v2               0
Unnamed: 2    5126
Unnamed: 3    5159
Unnamed: 4    5164
dtype: int64
In [14]:
# drop every row that has a null in any column; the result goes to df1 and df itself is unchanged
df1 = df.dropna()
print("after drop all null values\n")
df1
after drop all null values

Out[14]:
v1 v2 Unnamed: 2 Unnamed: 3 Unnamed: 4
281 ham \Wen u miss someone the person is definitely special for u..... B... why to miss them just Keep-in-touch\" gdeve.."
1038 ham Edison has rightly said, \A fool can ask more ... GN GE GNT:-)"
2255 ham I just lov this line: \Hurt me with the truth I don't mind i wil tolerat.bcs ur my someone..... But Never comfort me with a lie\" gud ni8 and swe...
3525 ham \HEY BABE! FAR 2 SPUN-OUT 2 SPK AT DA MO... DE... HAD A COOL NYTHO TX 4 FONIN HON CALL 2MWEN IM BK FRMCLOUD 9! J X\""
4668 ham When I was born, GOD said, \Oh No! Another IDI... GOD said \"OH No! COMPETITION\". Who knew one day these two will become FREINDS FOREVER!"
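
Because dropna() with no arguments drops a row as soon as any column is null, only the few rows whose text spilled into all three Unnamed columns survive. If the aim is instead to discard the mostly empty columns and keep every message, one option is to drop columns rather than rows. A minimal sketch on made-up rows; the helper name drop_sparse_columns and the 50% threshold are assumptions for illustration:

```python
import numpy as np
import pandas as pd

def drop_sparse_columns(frame, max_null_fraction=0.5):
    """Return a copy without the columns whose share of nulls exceeds the threshold."""
    null_fraction = frame.isnull().mean()          # per-column share of nulls
    keep = null_fraction[null_fraction <= max_null_fraction].index
    return frame[keep].copy()

toy = pd.DataFrame({
    "v1": ["ham", "spam", "ham", "ham"],
    "v2": ["Ok lar", "WINNER!!", "call later", "gud ni8"],
    "Unnamed: 2": [np.nan, np.nan, np.nan, "extra text"],
})

trimmed = drop_sparse_columns(toy)
print(trimmed.columns.tolist())  # ['v1', 'v2']
print(len(trimmed))              # 4
```

All four toy rows remain, and only the sparse column is gone.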

Converting the text to upper case

In [15]:
# cleaning the data: convert every message to upper case
df['v2'] = df['v2'].str.upper()
df['v2']
Out[15]:
0       GO UNTIL JURONG POINT, CRAZY.. AVAILABLE ONLY ...
1                           OK LAR... JOKING WIF U ONI...
2       FREE ENTRY IN 2 A WKLY COMP TO WIN FA CUP FINA...
3       U DUN SAY SO EARLY HOR... U C ALREADY THEN SAY...
4       NAH I DON'T THINK HE GOES TO USF, HE LIVES ARO...
                              ...                        
5567    THIS IS THE 2ND TIME WE HAVE TRIED 2 CONTACT U...
5568                WILL Ì_ B GOING TO ESPLANADE FR HOME?
5569    PITY, * WAS IN MOOD FOR THAT. SO...ANY OTHER S...
5570    THE GUY DID SOME BITCHING BUT I ACTED LIKE I'D...
5571                           ROFL. ITS TRUE TO ITS NAME
Name: v2, Length: 5169, dtype: object

Removing words that contain digits

In [17]:
import re

# raw string so that \w and \d reach the regex engine unchanged
df['v2'] = df['v2'].apply(lambda x: re.sub(r'\w*\d\w*', '', x))
df['v2']
Out[17]:
0       GO UNTIL JURONG POINT, CRAZY.. AVAILABLE ONLY ...
1                           OK LAR... JOKING WIF U ONI...
2       FREE ENTRY IN  A WKLY COMP TO WIN FA CUP FINAL...
3       U DUN SAY SO EARLY HOR... U C ALREADY THEN SAY...
4       NAH I DON'T THINK HE GOES TO USF, HE LIVES ARO...
                              ...                        
5567    THIS IS THE  TIME WE HAVE TRIED  CONTACT U. U ...
5568                WILL Ì_ B GOING TO ESPLANADE FR HOME?
5569    PITY, * WAS IN MOOD FOR THAT. SO...ANY OTHER S...
5570    THE GUY DID SOME BITCHING BUT I ACTED LIKE I'D...
5571                           ROFL. ITS TRUE TO ITS NAME
Name: v2, Length: 5169, dtype: object
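
The pattern \w*\d\w* matches any run of word characters that contains at least one digit, so it removes standalone numbers as well as tokens such as 2ND. A small sketch on a single string (the sample text is modelled on the messages above; the names DIGIT_WORD and remove_digit_words are hypothetical):

```python
import re

DIGIT_WORD = re.compile(r"\w*\d\w*")  # a word-character run containing a digit

def remove_digit_words(text):
    """Delete every token that contains a digit; surrounding spaces are left as they are."""
    return DIGIT_WORD.sub("", text)

print(repr(remove_digit_words("THIS IS THE 2ND TIME WE HAVE TRIED 2 CONTACT U")))
# 'THIS IS THE  TIME WE HAVE TRIED  CONTACT U'
```

Each removed token leaves its neighbouring spaces behind, which is why doubled spaces appear in the output.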

Collapsing extra spaces

In [18]:
df['v2'] = df['v2'].apply(lambda x: re.sub(' +',' ', x))
df['v2']
Out[18]:
0       GO UNTIL JURONG POINT, CRAZY.. AVAILABLE ONLY ...
1                           OK LAR... JOKING WIF U ONI...
2       FREE ENTRY IN A WKLY COMP TO WIN FA CUP FINAL ...
3       U DUN SAY SO EARLY HOR... U C ALREADY THEN SAY...
4       NAH I DON'T THINK HE GOES TO USF, HE LIVES ARO...
                              ...                        
5567    THIS IS THE TIME WE HAVE TRIED CONTACT U. U HA...
5568                WILL Ì_ B GOING TO ESPLANADE FR HOME?
5569    PITY, * WAS IN MOOD FOR THAT. SO...ANY OTHER S...
5570    THE GUY DID SOME BITCHING BUT I ACTED LIKE I'D...
5571                           ROFL. ITS TRUE TO ITS NAME
Name: v2, Length: 5169, dtype: object
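
The three cleaning steps (upper-casing, removing digit words, collapsing spaces) can also be chained with pandas string methods instead of apply. This is a sketch that assumes a Series of message strings; the function name clean_messages is hypothetical, and the final strip() is an addition that the notebook itself does not perform:

```python
import pandas as pd

def clean_messages(messages):
    """Upper-case, drop tokens containing digits, collapse runs of spaces, trim the ends."""
    return (
        messages.str.upper()
        .str.replace(r"\w*\d\w*", "", regex=True)
        .str.replace(r" +", " ", regex=True)
        .str.strip()
    )

raw = pd.Series(["Free entry in 2 a wkly comp", "Had your mobile 11 months or more?"])
print(clean_messages(raw).tolist())
# ['FREE ENTRY IN A WKLY COMP', 'HAD YOUR MOBILE MONTHS OR MORE?']
```

Keeping the steps in one function makes it easier to apply exactly the same cleaning to new messages later.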

Printing the first ten cleaned messages

In [19]:
for index,text in enumerate(df['v2'][0:10]):
    print('After cleaning data, Review %d:\n'%(index+1), text)
After cleaning data, Review 1:
 GO UNTIL JURONG POINT, CRAZY.. AVAILABLE ONLY IN BUGIS N GREAT WORLD LA E BUFFET... CINE THERE GOT AMORE WAT...
After cleaning data, Review 2:
 OK LAR... JOKING WIF U ONI...
After cleaning data, Review 3:
 FREE ENTRY IN A WKLY COMP TO WIN FA CUP FINAL TKTS MAY . TEXT FA TO TO RECEIVE ENTRY QUESTION(STD TXT RATE)T&C'S APPLY 'S
After cleaning data, Review 4:
 U DUN SAY SO EARLY HOR... U C ALREADY THEN SAY...
After cleaning data, Review 5:
 NAH I DON'T THINK HE GOES TO USF, HE LIVES AROUND HERE THOUGH
After cleaning data, Review 6:
 FREEMSG HEY THERE DARLING IT'S BEEN WEEK'S NOW AND NO WORD BACK! I'D LIKE SOME FUN YOU UP FOR IT STILL? TB OK! XXX STD CHGS TO SEND, ţ. TO RCV
After cleaning data, Review 7:
 EVEN MY BROTHER IS NOT LIKE TO SPEAK WITH ME. THEY TREAT ME LIKE AIDS PATENT.
After cleaning data, Review 8:
 AS PER YOUR REQUEST 'MELLE MELLE (ORU MINNAMINUNGINTE NURUNGU VETTAM)' HAS BEEN SET AS YOUR CALLERTUNE FOR ALL CALLERS. PRESS * TO COPY YOUR FRIENDS CALLERTUNE
After cleaning data, Review 9:
 WINNER!! AS A VALUED NETWORK CUSTOMER YOU HAVE BEEN SELECTED TO RECEIVEA ţ PRIZE REWARD! TO CLAIM CALL . CLAIM CODE . VALID HOURS ONLY.
After cleaning data, Review 10:
 HAD YOUR MOBILE MONTHS OR MORE? U R ENTITLED TO UPDATE TO THE LATEST COLOUR MOBILES WITH CAMERA FOR FREE! CALL THE MOBILE UPDATE CO FREE ON 
In [16]:
!pip3 install wordcloud

Visualizing the ham messages as a word cloud

In [23]:
from wordcloud import WordCloud
import matplotlib.pyplot as plt
df_ham = df[df.v1 == 'ham']
text_ham = " ".join(text for text in df_ham['v2'])
ham_cloud = WordCloud(collocations = False, background_color = 'white').generate(text_ham)
plt.imshow(ham_cloud, interpolation='bilinear')
plt.axis("off")
plt.show()
Wordcloud Visualization of Ham Data

Visualizing the spam messages as a word cloud.

In [21]:
from wordcloud import WordCloud
import matplotlib.pyplot as plt
df_spam = df[df.v1 == 'spam']
text_spam = " ".join(text for text in df_spam['v2'])
spam_cloud = WordCloud(collocations = False, background_color = 'white').generate(text_spam)
plt.imshow(spam_cloud, interpolation='bilinear')
plt.axis("off")
plt.show()
Wordcloud Visualization of Spam Data
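
A word cloud sizes words by how often they occur, and the same counts can be inspected directly. A minimal sketch using collections.Counter on made-up messages; the helper name top_words and the simple letters-and-apostrophes tokenizer are assumptions, and WordCloud's own tokenization and stop-word handling differ:

```python
import re
from collections import Counter

def top_words(messages, n=3):
    """Count tokens made of letters and apostrophes and return the n most common."""
    counts = Counter()
    for message in messages:
        counts.update(re.findall(r"[A-Z']+", message.upper()))
    return counts.most_common(n)

spam_like = ["FREE ENTRY TO WIN", "WIN CASH FREE", "CALL FREE NOW"]
print(top_words(spam_like, n=2))
# [('FREE', 3), ('WIN', 2)]
```

Run on the real data, top_words(df_spam['v2']) and top_words(df_ham['v2']) would give a numeric counterpart to the two clouds.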
In [24]:
!pip3 install seaborn

Encoding the text labels with LabelEncoder

In [24]:
from sklearn.preprocessing import LabelEncoder
le=LabelEncoder()
df["v1"]=le.fit_transform(df["v1"])
df
Out[24]:
v1 v2 Unnamed: 2 Unnamed: 3 Unnamed: 4
0 0 GO UNTIL JURONG POINT, CRAZY.. AVAILABLE ONLY ... NaN NaN NaN
1 0 OK LAR... JOKING WIF U ONI... NaN NaN NaN
2 1 FREE ENTRY IN A WKLY COMP TO WIN FA CUP FINAL ... NaN NaN NaN
3 0 U DUN SAY SO EARLY HOR... U C ALREADY THEN SAY... NaN NaN NaN
4 0 NAH I DON'T THINK HE GOES TO USF, HE LIVES ARO... NaN NaN NaN
... ... ... ... ... ...
5567 1 THIS IS THE TIME WE HAVE TRIED CONTACT U. U HA... NaN NaN NaN
5568 0 WILL Ì_ B GOING TO ESPLANADE FR HOME? NaN NaN NaN
5569 0 PITY, * WAS IN MOOD FOR THAT. SO...ANY OTHER S... NaN NaN NaN
5570 0 THE GUY DID SOME BITCHING BUT I ACTED LIKE I'D... NaN NaN NaN
5571 0 ROFL. ITS TRUE TO ITS NAME NaN NaN NaN

5169 rows × 5 columns
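
LabelEncoder assigns integers to the classes in sorted order, so ham becomes 0 and spam becomes 1, as the table above shows. An explicit mapping gives the same codes and keeps the encoding visible in the code. A sketch with pandas only; the dictionary name LABEL_TO_INT and the toy labels are assumptions:

```python
import pandas as pd

LABEL_TO_INT = {"ham": 0, "spam": 1}   # same codes as in the table above

labels = pd.Series(["ham", "ham", "spam", "ham"])
encoded = labels.map(LABEL_TO_INT)

print(encoded.tolist())  # [0, 0, 1, 0]
```

Any label missing from the dictionary would become NaN, which makes unexpected values easy to spot.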

Plotting a pie chart of ham and spam

In [35]:
# x and y are ignored when plotting a Series, so they are omitted here
df['v1'].value_counts().plot(kind='pie', autopct='%1.1f%%', title="PieChart of ham/spam values")
Out[35]:
<AxesSubplot:title={'center':'PieChart of ham/spam values'}, ylabel='v1'>
Ham Spam Pie Chart

Plotting a bar chart of the ham and spam counts

In [36]:
# after label encoding, 0 is ham and 1 is spam
counts = df['v1'].value_counts()
counts.plot(kind='bar', color=['blue','red'], title='Count values')
# write each count at the top of its bar
for index, value in enumerate(counts):
    plt.text(index, value, str(value))
Ham Spam Bar Chart
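
The percentages drawn on the pie chart can also be computed directly. Using the counts reported by describe() earlier (5169 rows after deduplication, 4516 of them ham), here is a sketch with value_counts(normalize=True); the helper name class_shares is hypothetical:

```python
import pandas as pd

def class_shares(labels):
    """Return the percentage of each class, rounded to one decimal place."""
    return (labels.value_counts(normalize=True) * 100).round(1)

# Rebuild a label column from the counts reported by describe(): 4516 ham out of 5169 rows.
labels = pd.Series(["ham"] * 4516 + ["spam"] * (5169 - 4516))
shares = class_shares(labels)
print(shares["ham"], shares["spam"])  # 87.4 12.6
```

With roughly seven ham messages for every spam message, the classes are clearly imbalanced, which is worth keeping in mind when a classifier is evaluated on this data.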

Conclusion:
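Detecting spam is important for securing communication. Accurate spam detection remains a hard problem, and many detection methods are available. In this walkthrough we prepared a dataset for spam detection with machine learning: we removed the duplicates, cleaned the message text, encoded the labels, and compared ham and spam visually.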