Spam Data Analysis Using Python
Data analysis involves cleaning and processing raw data into a useful form that supports better business decisions. Let's walk through a spam data analysis in Python on a dataset of text messages labeled as spam or ham (not spam), as groundwork for telling whether a given text is spam or not.
import pandas as pd
import numpy as np

# The default-encoding read is left commented out; the file is read as Latin-1 instead.
# df = pd.read_csv("spam.csv")
df = pd.read_csv("spam.csv", encoding='latin')
print(df)
        v1                                                 v2 Unnamed: 2  \
0      ham  Go until jurong point, crazy.. Available only ...        NaN
1      ham                      Ok lar... Joking wif u oni...        NaN
2     spam  Free entry in 2 a wkly comp to win FA Cup fina...        NaN
3      ham  U dun say so early hor... U c already then say...        NaN
4      ham  Nah I don't think he goes to usf, he lives aro...        NaN
...    ...                                                ...        ...
5567  spam  This is the 2nd time we have tried 2 contact u...        NaN
5568   ham              Will Ì_ b going to esplanade fr home?        NaN
5569   ham  Pity, * was in mood for that. So...any other s...        NaN
5570   ham  The guy did some bitching but I acted like i'd...        NaN
5571   ham                         Rofl. Its true to its name        NaN

     Unnamed: 3 Unnamed: 4
0           NaN        NaN
1           NaN        NaN
2           NaN        NaN
3           NaN        NaN
4           NaN        NaN
...         ...        ...
5567        NaN        NaN
5568        NaN        NaN
5569        NaN        NaN
5570        NaN        NaN
5571        NaN        NaN

[5572 rows x 5 columns]
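The commented-out `pd.read_csv("spam.csv")` call suggests that the default UTF-8 decoding did not work for this file, although the notebook does not say so explicitly. The sketch below illustrates, on a hypothetical byte string rather than the real file, how a byte that is invalid as UTF-8 can still be decoded with the `latin` codec (a Python alias for Latin-1).

```python
# Hypothetical bytes: 0xA3 is the pound sign in Latin-1 but not valid UTF-8 on its own.
raw = b"\xa3100 prize"

try:
    raw.decode("utf-8")
    utf8_ok = True
except UnicodeDecodeError:
    # A lone 0xA3 byte cannot start a UTF-8 sequence.
    utf8_ok = False

# Latin-1 maps every byte to a character, so decoding never fails.
text = raw.decode("latin")
print(utf8_ok, text)  # False £100 prize
```

Because Latin-1 accepts any byte, a successful read does not guarantee that every character is displayed as the sender intended.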
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5572 entries, 0 to 5571
Data columns (total 5 columns):
 #   Column      Non-Null Count  Dtype
---  ------      --------------  -----
 0   v1          5572 non-null   object
 1   v2          5572 non-null   object
 2   Unnamed: 2  50 non-null     object
 3   Unnamed: 3  12 non-null     object
 4   Unnamed: 4  6 non-null      object
dtypes: object(5)
memory usage: 217.8+ KB
df.head(30)
| | v1 | v2 | Unnamed: 2 | Unnamed: 3 | Unnamed: 4 |
---|---|---|---|---|---|
0 | ham | Go until jurong point, crazy.. Available only ... | NaN | NaN | NaN |
1 | ham | Ok lar... Joking wif u oni... | NaN | NaN | NaN |
2 | spam | Free entry in 2 a wkly comp to win FA Cup fina... | NaN | NaN | NaN |
3 | ham | U dun say so early hor... U c already then say... | NaN | NaN | NaN |
4 | ham | Nah I don't think he goes to usf, he lives aro... | NaN | NaN | NaN |
5 | spam | FreeMsg Hey there darling it's been 3 week's n... | NaN | NaN | NaN |
6 | ham | Even my brother is not like to speak with me. ... | NaN | NaN | NaN |
7 | ham | As per your request 'Melle Melle (Oru Minnamin... | NaN | NaN | NaN |
8 | spam | WINNER!! As a valued network customer you have... | NaN | NaN | NaN |
9 | spam | Had your mobile 11 months or more? U R entitle... | NaN | NaN | NaN |
10 | ham | I'm gonna be home soon and i don't want to tal... | NaN | NaN | NaN |
11 | spam | SIX chances to win CASH! From 100 to 20,000 po... | NaN | NaN | NaN |
12 | spam | URGENT! You have won a 1 week FREE membership ... | NaN | NaN | NaN |
13 | ham | I've been searching for the right words to tha... | NaN | NaN | NaN |
14 | ham | I HAVE A DATE ON SUNDAY WITH WILL!! | NaN | NaN | NaN |
15 | spam | XXXMobileMovieClub: To use your credit, click ... | NaN | NaN | NaN |
16 | ham | Oh k...i'm watching here:) | NaN | NaN | NaN |
17 | ham | Eh u remember how 2 spell his name... Yes i di... | NaN | NaN | NaN |
18 | ham | Fine if thatåÕs the way u feel. ThatåÕs the wa... | NaN | NaN | NaN |
19 | spam | England v Macedonia - dont miss the goals/team... | NaN | NaN | NaN |
20 | ham | Is that seriously how you spell his name? | NaN | NaN | NaN |
21 | ham | IÛ÷m going to try for 2 months ha ha only joking | NaN | NaN | NaN |
22 | ham | So Ì_ pay first lar... Then when is da stock c... | NaN | NaN | NaN |
23 | ham | Aft i finish my lunch then i go str down lor. ... | NaN | NaN | NaN |
24 | ham | Ffffffffff. Alright no way I can meet up with ... | NaN | NaN | NaN |
25 | ham | Just forced myself to eat a slice. I'm really ... | NaN | NaN | NaN |
26 | ham | Lol your always so convincing. | NaN | NaN | NaN |
27 | ham | Did you catch the bus ? Are you frying an egg ... | NaN | NaN | NaN |
28 | ham | I'm back & we're packing the car now, I'll... | NaN | NaN | NaN |
29 | ham | Ahhh. Work. I vaguely remember that! What does... | NaN | NaN | NaN |
df.shape
(5572, 5)
# Rows that exactly repeat an earlier row
dupl = df[df.duplicated()]
print("printing all duplicate values\n")
dupl
printing all duplicate values
| | v1 | v2 | Unnamed: 2 | Unnamed: 3 | Unnamed: 4 |
---|---|---|---|---|---|
102 | ham | As per your request 'Melle Melle (Oru Minnamin... | NaN | NaN | NaN |
153 | ham | As per your request 'Melle Melle (Oru Minnamin... | NaN | NaN | NaN |
206 | ham | As I entered my cabin my PA said, '' Happy B'd... | NaN | NaN | NaN |
222 | ham | Sorry, I'll call later | NaN | NaN | NaN |
325 | ham | No calls..messages..missed calls | NaN | NaN | NaN |
... | ... | ... | ... | ... | ... |
5524 | spam | You are awarded a SiPix Digital Camera! call 0... | NaN | NaN | NaN |
5535 | ham | I know you are thinkin malaria. But relax, chi... | NaN | NaN | NaN |
5539 | ham | Just sleeping..and surfing | NaN | NaN | NaN |
5553 | ham | Hahaha..use your brain dear | NaN | NaN | NaN |
5558 | ham | Sorry, I'll call later | NaN | NaN | NaN |
403 rows × 5 columns
# Delete all duplicate rows in place
df.drop_duplicates(inplace=True)

# Check again whether any duplicate rows are left
dupl1 = df[df.duplicated()]
dupl1
| | v1 | v2 | Unnamed: 2 | Unnamed: 3 | Unnamed: 4 |
---|---|---|---|---|---|
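`DataFrame.duplicated()` flags a row only when every column matches an earlier row, and by default the first occurrence is not flagged; `drop_duplicates()` follows the same rule. A minimal sketch on a made-up three-row frame (the toy rows are illustrative and are not read from spam.csv):

```python
import pandas as pd

# Toy frame (hypothetical data) in which the last row repeats the first.
toy = pd.DataFrame({
    "v1": ["ham", "spam", "ham"],
    "v2": ["Sorry, I'll call later", "WINNER!!", "Sorry, I'll call later"],
})

# Default keep='first': only the later copy is flagged.
flags = toy.duplicated()
print(flags.tolist())  # [False, False, True]

# keep=False flags every member of a duplicate group.
all_flags = toy.duplicated(keep=False)
print(all_flags.tolist())  # [True, False, True]

# Dropping duplicates keeps one copy of each distinct row.
deduped = toy.drop_duplicates()
print(len(deduped))  # 2
```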
df.describe()
| | v1 | v2 | Unnamed: 2 | Unnamed: 3 | Unnamed: 4 |
---|---|---|---|---|---|
count | 5169 | 5169 | 43 | 10 | 5 |
unique | 2 | 5169 | 43 | 10 | 5 |
top | ham | Oops sorry. Just to check that you don't mind ... | this wont even start........ Datz confidence.." | i wil tolerat.bcs ur my someone..... But | Never comfort me with a lie\" gud ni8 and swe... |
freq | 4516 | 1 | 1 | 1 | 1 |
df.isnull().sum()
v1               0
v2               0
Unnamed: 2    5126
Unnamed: 3    5159
Unnamed: 4    5164
dtype: int64
# Drop every row that contains at least one null value (result kept in df1; df is unchanged)
df1 = df.dropna()
print("after drop all null values\n")
df1
after drop all null values
| | v1 | v2 | Unnamed: 2 | Unnamed: 3 | Unnamed: 4 |
---|---|---|---|---|---|
281 | ham | \Wen u miss someone | the person is definitely special for u..... B... | why to miss them | just Keep-in-touch\" gdeve.." |
1038 | ham | Edison has rightly said, \A fool can ask more ... | GN | GE | GNT:-)" |
2255 | ham | I just lov this line: \Hurt me with the truth | I don't mind | i wil tolerat.bcs ur my someone..... But | Never comfort me with a lie\" gud ni8 and swe... |
3525 | ham | \HEY BABE! FAR 2 SPUN-OUT 2 SPK AT DA MO... DE... | HAD A COOL NYTHO | TX 4 FONIN HON | CALL 2MWEN IM BK FRMCLOUD 9! J X\"" |
4668 | ham | When I was born, GOD said, \Oh No! Another IDI... | GOD said | \"OH No! COMPETITION\". Who knew | one day these two will become FREINDS FOREVER!" |
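Because `dropna()` with no arguments removes every row that has at least one missing value, only the five rows with all three `Unnamed` columns filled survive. If the goal were instead to keep every labeled message, one option (an assumption, not something the notebook does) is to restrict the null check to the label and text columns with `subset`. The toy frame below is hypothetical:

```python
import numpy as np
import pandas as pd

# Toy frame (hypothetical data): only the last row has the extra column filled.
toy = pd.DataFrame({
    "v1": ["ham", "spam", "ham"],
    "v2": ["Ok lar...", "WINNER!!", "GN"],
    "Unnamed: 2": [np.nan, np.nan, "GE"],
})

# No arguments: any NaN in any column drops the row.
strict = toy.dropna()

# subset: only NaN values in v1 or v2 cause a row to be dropped.
lenient = toy.dropna(subset=["v1", "v2"])

print(len(strict), len(lenient))  # 1 3
```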
# Cleaning the data: convert every message to upper case
df['v2'] = df['v2'].apply(lambda x: x.upper())
df['v2']
0       GO UNTIL JURONG POINT, CRAZY.. AVAILABLE ONLY ...
1                           OK LAR... JOKING WIF U ONI...
2       FREE ENTRY IN 2 A WKLY COMP TO WIN FA CUP FINA...
3       U DUN SAY SO EARLY HOR... U C ALREADY THEN SAY...
4       NAH I DON'T THINK HE GOES TO USF, HE LIVES ARO...
                              ...
5567    THIS IS THE 2ND TIME WE HAVE TRIED 2 CONTACT U...
5568                WILL Ì_ B GOING TO ESPLANADE FR HOME?
5569    PITY, * WAS IN MOOD FOR THAT. SO...ANY OTHER S...
5570    THE GUY DID SOME BITCHING BUT I ACTED LIKE I'D...
5571                           ROFL. ITS TRUE TO ITS NAME
Name: v2, Length: 5169, dtype: object
import re

# Remove every token that contains a digit (raw string avoids invalid-escape warnings)
df['v2'] = df['v2'].apply(lambda x: re.sub(r'\w*\d\w*', '', x))
df['v2']
0       GO UNTIL JURONG POINT, CRAZY.. AVAILABLE ONLY ...
1                           OK LAR... JOKING WIF U ONI...
2       FREE ENTRY IN A WKLY COMP TO WIN FA CUP FINAL...
3       U DUN SAY SO EARLY HOR... U C ALREADY THEN SAY...
4       NAH I DON'T THINK HE GOES TO USF, HE LIVES ARO...
                              ...
5567    THIS IS THE TIME WE HAVE TRIED CONTACT U. U ...
5568                WILL Ì_ B GOING TO ESPLANADE FR HOME?
5569    PITY, * WAS IN MOOD FOR THAT. SO...ANY OTHER S...
5570    THE GUY DID SOME BITCHING BUT I ACTED LIKE I'D...
5571                           ROFL. ITS TRUE TO ITS NAME
Name: v2, Length: 5169, dtype: object
# Collapse runs of spaces left behind by the previous step into a single space
df['v2'] = df['v2'].apply(lambda x: re.sub(' +', ' ', x))
df['v2']
0       GO UNTIL JURONG POINT, CRAZY.. AVAILABLE ONLY ...
1                           OK LAR... JOKING WIF U ONI...
2       FREE ENTRY IN A WKLY COMP TO WIN FA CUP FINAL ...
3       U DUN SAY SO EARLY HOR... U C ALREADY THEN SAY...
4       NAH I DON'T THINK HE GOES TO USF, HE LIVES ARO...
                              ...
5567    THIS IS THE TIME WE HAVE TRIED CONTACT U. U HA...
5568                WILL Ì_ B GOING TO ESPLANADE FR HOME?
5569    PITY, * WAS IN MOOD FOR THAT. SO...ANY OTHER S...
5570    THE GUY DID SOME BITCHING BUT I ACTED LIKE I'D...
5571                           ROFL. ITS TRUE TO ITS NAME
Name: v2, Length: 5169, dtype: object
# Print the first ten cleaned messages
for index, text in enumerate(df['v2'][0:10]):
    print('After cleaning data, Review %d:\n' % (index + 1), text)
After cleaning data, Review 1:
 GO UNTIL JURONG POINT, CRAZY.. AVAILABLE ONLY IN BUGIS N GREAT WORLD LA E BUFFET... CINE THERE GOT AMORE WAT...
After cleaning data, Review 2:
 OK LAR... JOKING WIF U ONI...
After cleaning data, Review 3:
 FREE ENTRY IN A WKLY COMP TO WIN FA CUP FINAL TKTS MAY . TEXT FA TO TO RECEIVE ENTRY QUESTION(STD TXT RATE)T&C'S APPLY 'S
After cleaning data, Review 4:
 U DUN SAY SO EARLY HOR... U C ALREADY THEN SAY...
After cleaning data, Review 5:
 NAH I DON'T THINK HE GOES TO USF, HE LIVES AROUND HERE THOUGH
After cleaning data, Review 6:
 FREEMSG HEY THERE DARLING IT'S BEEN WEEK'S NOW AND NO WORD BACK! I'D LIKE SOME FUN YOU UP FOR IT STILL? TB OK! XXX STD CHGS TO SEND, ţ. TO RCV
After cleaning data, Review 7:
 EVEN MY BROTHER IS NOT LIKE TO SPEAK WITH ME. THEY TREAT ME LIKE AIDS PATENT.
After cleaning data, Review 8:
 AS PER YOUR REQUEST 'MELLE MELLE (ORU MINNAMINUNGINTE NURUNGU VETTAM)' HAS BEEN SET AS YOUR CALLERTUNE FOR ALL CALLERS. PRESS * TO COPY YOUR FRIENDS CALLERTUNE
After cleaning data, Review 9:
 WINNER!! AS A VALUED NETWORK CUSTOMER YOU HAVE BEEN SELECTED TO RECEIVEA ţ PRIZE REWARD! TO CLAIM CALL . CLAIM CODE . VALID HOURS ONLY.
After cleaning data, Review 10:
 HAD YOUR MOBILE MONTHS OR MORE? U R ENTITLED TO UPDATE TO THE LATEST COLOUR MOBILES WITH CAMERA FOR FREE! CALL THE MOBILE UPDATE CO FREE ON
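The three cleaning steps (upper-casing, removing tokens that contain a digit, collapsing repeated spaces) can also be bundled into one function and tried on a single string. This is a minimal sketch; `clean_text` is a hypothetical helper name, not something defined in the notebook.

```python
import re

def clean_text(text):
    """Apply the three cleaning steps to one message."""
    text = text.upper()                    # step 1: upper case
    text = re.sub(r"\w*\d\w*", "", text)   # step 2: drop any token containing a digit
    text = re.sub(r" +", " ", text)        # step 3: collapse runs of spaces
    return text

sample = "Free entry in 2 a wkly comp"
print(clean_text(sample))  # FREE ENTRY IN A WKLY COMP
```

Note that the second pattern removes the whole token, not just the digit, so a word such as "2nd" disappears entirely; this matches the cleaned output shown for row 5567.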
!pip3 install wordcloud
Successfully installed cycler-0.10.0 kiwisolver-1.3.1 matplotlib-3.3.4 numpy-1.19.5 pillow-8.3.1 pyparsing-2.4.7 python-dateutil-2.8.2 six-1.16.0 wordcloud-1.8.1
from wordcloud import WordCloud
import matplotlib.pyplot as plt

# Word cloud of all ham messages
df_ham = df[df.v1 == 'ham']
text_ham = " ".join(text for text in df_ham['v2'])
ham_cloud = WordCloud(collocations=False, background_color='white').generate(text_ham)
plt.imshow(ham_cloud, interpolation='bilinear')
plt.axis("off")
plt.show()
from wordcloud import WordCloud
import matplotlib.pyplot as plt

# Word cloud of all spam messages
df_spam = df[df.v1 == 'spam']
text_spam = " ".join(text for text in df_spam['v2'])
spam_cloud = WordCloud(collocations=False, background_color='white').generate(text_spam)
plt.imshow(spam_cloud, interpolation='bilinear')
plt.axis("off")
plt.show()
!pip3 install seaborn
Successfully installed cycler-0.10.0 kiwisolver-1.3.1 matplotlib-3.3.4 numpy-1.19.5 pandas-1.1.5 pillow-8.3.1 pyparsing-2.4.7 python-dateutil-2.8.2 pytz-2021.1 scipy-1.5.4 seaborn-0.11.1 six-1.16.0
from sklearn.preprocessing import LabelEncoder

# Encode the labels as integers: ham -> 0, spam -> 1
le = LabelEncoder()
df["v1"] = le.fit_transform(df["v1"])
df
| | v1 | v2 | Unnamed: 2 | Unnamed: 3 | Unnamed: 4 |
---|---|---|---|---|---|
0 | 0 | GO UNTIL JURONG POINT, CRAZY.. AVAILABLE ONLY ... | NaN | NaN | NaN |
1 | 0 | OK LAR... JOKING WIF U ONI... | NaN | NaN | NaN |
2 | 1 | FREE ENTRY IN A WKLY COMP TO WIN FA CUP FINAL ... | NaN | NaN | NaN |
3 | 0 | U DUN SAY SO EARLY HOR... U C ALREADY THEN SAY... | NaN | NaN | NaN |
4 | 0 | NAH I DON'T THINK HE GOES TO USF, HE LIVES ARO... | NaN | NaN | NaN |
... | ... | ... | ... | ... | ... |
5567 | 1 | THIS IS THE TIME WE HAVE TRIED CONTACT U. U HA... | NaN | NaN | NaN |
5568 | 0 | WILL Ì_ B GOING TO ESPLANADE FR HOME? | NaN | NaN | NaN |
5569 | 0 | PITY, * WAS IN MOOD FOR THAT. SO...ANY OTHER S... | NaN | NaN | NaN |
5570 | 0 | THE GUY DID SOME BITCHING BUT I ACTED LIKE I'D... | NaN | NaN | NaN |
5571 | 0 | ROFL. ITS TRUE TO ITS NAME | NaN | NaN | NaN |
5169 rows × 5 columns
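`LabelEncoder` assigns integer codes to the sorted unique labels, which is why `ham` becomes 0 and `spam` becomes 1 in the table above. The pure-Python sketch below imitates that ordering on a hand-written label list; `encode_labels` is a hypothetical helper for illustration and is not part of scikit-learn.

```python
def encode_labels(labels):
    """Map each distinct label to its position in sorted order."""
    classes = sorted(set(labels))
    mapping = {label: code for code, label in enumerate(classes)}
    return [mapping[label] for label in labels], mapping

# Hand-written labels (illustrative only)
codes, mapping = encode_labels(["ham", "ham", "spam", "ham"])
print(mapping)  # {'ham': 0, 'spam': 1}
print(codes)    # [0, 0, 1, 0]
```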
# Pie chart of the class shares (0 = ham, 1 = spam); x/y are not needed when plotting a Series
df['v1'].value_counts().plot(kind='pie', autopct='%1.1f%%', title="PieChart of ham/spam values")
<AxesSubplot:title={'center':'PieChart of ham/spam values'}, ylabel='v1'>
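The `autopct='%1.1f%%'` argument formats each wedge's share of the total to one decimal place. Using the counts reported by `df.describe()` earlier (5169 rows after deduplication, 4516 of them `ham`), the shares can be checked by hand. The variable names below are illustrative.

```python
# Counts taken from the df.describe() output: 5169 rows, 4516 labeled ham
total = 5169
ham = 4516
spam = total - ham

# Same format string as the autopct argument
ham_label = '%1.1f%%' % (100 * ham / total)
spam_label = '%1.1f%%' % (100 * spam / total)
print(spam, ham_label, spam_label)  # 653 87.4% 12.6%
```

The dataset is therefore heavily imbalanced, with roughly seven ham messages for every spam message.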
# Bar chart of the class counts, with each bar annotated by its value
counts = df['v1'].value_counts()
counts.plot(kind='bar', color=['blue', 'red'], title='Count values')
for index, value in enumerate(counts):
    plt.text(index, value, str(value))