Fake News Detection Using Machine Learning
Project Objective: In today's world it has become difficult to tell whether a news item that appears in front of us is real or fake, because of misleading elements in society. Machine learning can help us detect whether a given piece of news is genuine. If such items go unchecked, they may contain false or exaggerated claims, end up being amplified by recommendation algorithms, and trap users in a filter bubble.
The Passive Aggressive Classifier belongs to the family of online learning algorithms in machine learning: you train the model incrementally by feeding it instances sequentially, individually or in small groups called mini-batches. Simply put, it remains passive for correct predictions and responds aggressively (updates its weights) for incorrect predictions. Now let's see how to implement the Passive Aggressive Classifier using the Python programming language.
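To make the online-learning idea concrete, here is a minimal sketch (using made-up random features and labels, not the news data) that feeds a PassiveAggressiveClassifier mini-batches via partial_fit; on each batch the model only corrects itself where its predictions fall short.

#A toy illustration of online learning: train on mini-batches with partial_fit
#The random features and labels below are made up for illustration only
import numpy as np
from sklearn.linear_model import PassiveAggressiveClassifier

rng=np.random.RandomState(0)
X=rng.rand(100,10)                 #100 toy samples with 10 features each
y=rng.randint(0,2,100)             #toy binary labels (0=fake, 1=real)

toy_pac=PassiveAggressiveClassifier(max_iter=50)
for start in range(0,100,20):      #feed the data in mini-batches of 20
    toy_pac.partial_fit(X[start:start+20], y[start:start+20], classes=[0,1])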
Start this task by importing the necessary Python libraries:
import numpy as np
import pandas as pd
import itertools
Now we read the CSV file news.csv. We will use this dataset to try to predict whether a given news item is real or fake. It contains 3 columns, i.e. id, title, and label (which indicates whether the news is fake or real), and 20800 rows, i.e. the number of entries.
#Read the data
df=pd.read_csv('news.csv')
#Get shape and head
df.shape
df.info()
df.head()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20800 entries, 0 to 20799
Data columns (total 3 columns):
id       20800 non-null int64
title    20242 non-null object
label    20800 non-null object
dtypes: int64(1), object(2)
memory usage: 487.6+ KB
|   | id | title | label |
|---|---|---|---|
| 0 | 0 | House Dem Aide: We Didn’t Even See Comey’s Let... | 1 |
| 1 | 1 | FLYNN: Hillary Clinton, Big Woman on Campus - ... | 0 |
| 2 | 2 | Why the Truth Might Get You Fired | 1 |
| 3 | 3 | 15 Civilians Killed In Single US Airstrike Hav... | 1 |
| 4 | 4 | Iranian woman jailed for fictional unpublished... | 1 |
If the value of label is 1, the news is real; if the value of label is 0, the news is fake.
#Get the labels
labels=df.label
labels.head()
0    1
1    0
2    1
3    1
4    1
Name: label, dtype: object
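Note from the df.info() output above that the title column has only 20242 non-null entries out of 20800; the split below simply coerces the missing titles to the string 'nan' via astype('U'). An alternative, sketched here as an optional step that is not part of the original pipeline, is to drop those rows first:

#Optional: drop the rows whose title is missing instead of keeping them as 'nan'
clean_df=df.dropna(subset=['title'])
clean_labels=clean_df.label
print(clean_df.shape)   #should be (20242, 3) for this dataset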
from sklearn.model_selection import train_test_split
#Split the dataset
x_train,x_test,y_train,y_test=train_test_split(df['title'].values.astype('U'), labels, test_size=0.2)
Initialize a TfidfVectorizer with stop words from the English language and a maximum document frequency of 0.7 (terms with a higher document frequency will be discarded). Stop words are the most common words in a language that are to be filtered out before processing the natural language data. And a TfidfVectorizer turns a collection of raw documents into a matrix of TF-IDF features.
TF (Term Frequency): The number of times a word appears in a document is its term frequency. A higher value means the term appears more often in that document, so the document is a good match when the term is part of the search terms.
IDF (Inverse Document Frequency): Words that occur many times in a document, but also appear in many other documents, may be irrelevant. IDF is a measure of how significant a term is across the entire corpus.
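To make TF-IDF concrete before applying it to the real data, here is a small sketch on a made-up three-sentence corpus; it shows the vectorizer learning a vocabulary (with English stop words removed) and producing one row of TF-IDF weights per document.

#Toy example: turn three made-up sentences into a TF-IDF matrix
from sklearn.feature_extraction.text import TfidfVectorizer

toy_docs=["the president signed the bill today",
          "the senate rejected the bill",
          "aliens signed a secret treaty with the president"]

toy_vectorizer=TfidfVectorizer(stop_words='english', max_df=0.7)
toy_matrix=toy_vectorizer.fit_transform(toy_docs)

print(sorted(toy_vectorizer.vocabulary_))   #terms kept after stop-word removal
print(toy_matrix.toarray())                 #one row of TF-IDF weights per sentence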
from sklearn.feature_extraction.text import TfidfVectorizer
#Initialize a TfidfVectorizer
tfidf_vectorizer=TfidfVectorizer(stop_words='english', max_df=0.7)
#Fit and transform train set, transform test set
tfidf_train=tfidf_vectorizer.fit_transform(x_train)
tfidf_test=tfidf_vectorizer.transform(x_test)
Next, we'll initialize a PassiveAggressiveClassifier and fit it on tfidf_train and y_train. Then, we'll predict on the TF-IDF features of the test set and calculate the accuracy with accuracy_score() from sklearn.metrics.
from sklearn.linear_model import PassiveAggressiveClassifier
from sklearn.metrics import accuracy_score
#Initialize a PassiveAggressiveClassifier
pac=PassiveAggressiveClassifier(max_iter=50)
pac.fit(tfidf_train,y_train)
PassiveAggressiveClassifier(C=1.0, average=False, class_weight=None,
                            early_stopping=False, fit_intercept=True,
                            loss='hinge', max_iter=50, n_iter_no_change=5,
                            n_jobs=None, random_state=None, shuffle=True,
                            tol=0.001, validation_fraction=0.1, verbose=0,
                            warm_start=False)
We get an accuracy of approximately 92% with this model. Finally, let's build a confusion matrix to gain insight into the number of true and false positives and negatives.
#Predict on the test set and calculate accuracy
y_pred=pac.predict(tfidf_test)
score=accuracy_score(y_test,y_pred)
print('Accuracy:',score)
from sklearn.metrics import confusion_matrix
#Build confusion matrix
data=confusion_matrix(y_test,y_pred)
Accuracy: 0.9225961538461539
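The code above stores the confusion matrix in data but never displays it. Here is a small sketch of printing it with labelled rows and columns; the fake/real naming follows the label encoding described earlier (0 = fake, 1 = real), and rows and columns appear in sorted label order.

#Display the confusion matrix with labelled rows (actual) and columns (predicted)
print(pd.DataFrame(data,
                   index=['actual 0 (fake)','actual 1 (real)'],
                   columns=['predicted 0 (fake)','predicted 1 (real)']))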
Now we have y_pred, the values predicted by our model, and y_test, the actual values. Let us compare them and see how well our model did. As you can see from the output below, our basic model did pretty well; a short sketch after the table counts the misclassified rows.
comparison=pd.DataFrame({'Actual':y_test, 'Predicted':y_pred})
comparison
|   | Actual | Predicted |
|---|---|---|
| 19643 | 0 | 0 |
| 713 | 1 | 1 |
| 17459 | 1 | 1 |
| 7145 | 0 | 0 |
| 19021 | 1 | 1 |
| 6132 | 1 | 1 |
| 15590 | 0 | 1 |
| 549 | 0 | 1 |
| 2239 | 0 | 0 |
| 5496 | 0 | 0 |
| 4576 | 0 | 0 |
| 19009 | 1 | 1 |
| 4358 | 1 | 1 |
| 14958 | 0 | 0 |
| 2252 | 1 | 1 |
| 11183 | 0 | 0 |
| 20275 | 1 | 1 |
| 19753 | 0 | 0 |
| 9275 | 1 | 1 |
| 189 | 1 | 1 |
| 17820 | 0 | 0 |
| 7875 | 0 | 0 |
| 13868 | 0 | 0 |
| 20623 | 0 | 0 |
| 11065 | 0 | 0 |
| 1997 | 0 | 0 |
| 1629 | 0 | 0 |
| 10534 | 1 | 1 |
| 2180 | 1 | 1 |
| 689 | 1 | 1 |
| ... | ... | ... |
| 3943 | 1 | 1 |
| 9561 | 0 | 0 |
| 14727 | 0 | 0 |
| 17368 | 0 | 0 |
| 4736 | 1 | 1 |
| 19724 | 0 | 0 |
| 12463 | 1 | 0 |
| 17524 | 1 | 1 |
| 9893 | 1 | 0 |
| 11086 | 1 | 1 |
| 2043 | 1 | 1 |
| 3390 | 1 | 1 |
| 7176 | 0 | 0 |
| 3931 | 0 | 0 |
| 19954 | 0 | 0 |
| 18146 | 0 | 0 |
| 8108 | 0 | 0 |
| 9441 | 1 | 1 |
| 16158 | 0 | 1 |
| 19078 | 0 | 0 |
| 12413 | 1 | 1 |
| 4224 | 0 | 0 |
| 9240 | 0 | 0 |
| 1878 | 1 | 1 |
| 19288 | 1 | 1 |
| 17272 | 1 | 1 |
| 6029 | 0 | 0 |
| 1103 | 1 | 1 |
| 4612 | 0 | 0 |
| 19297 | 0 | 0 |

4160 rows × 2 columns
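As a quick sanity check, we can count how many of the 4160 test rows were misclassified; the exact number depends on the random train/test split.

#Count the rows where the prediction disagrees with the actual label
mismatches=(comparison['Actual']!=comparison['Predicted']).sum()
print('Misclassified rows:', mismatches, 'out of', len(comparison))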
We learned to detect fake news with Python. We took a political news dataset, implemented a TfidfVectorizer, initialized a PassiveAggressiveClassifier, and fit our model. We ended up obtaining an accuracy of approximately 92%.