Principal Component Analysis
Principal Component Analysis (PCA) is a technique that standardizes the data and reduces a large set of variables to a smaller set that still contains most of the information in the original data.

In the content below we apply some preparatory operations to the dataset and then encode it according to the needs of PCA. Before performing PCA we standardize the dataset, then apply PCA to it, and finally plot the dataset using the PCA results.

PCA (Principal Component Analysis) is used to represent a multivariate data table as a smaller set of variables in order to observe trends, clusters, and outliers.
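As a quick illustration (a minimal sketch on made-up numbers, separate from the census data used below), two highly correlated columns can be compressed into a single principal component with almost no loss of information:

import numpy as np
from sklearn.decomposition import PCA

# two correlated features: the second is roughly twice the first, plus a little noise
rng = np.random.RandomState(0)
x1 = rng.normal(size=50)
x2 = 2 * x1 + rng.normal(scale=0.1, size=50)
toy = np.column_stack([x1, x2])

# a single component captures almost all of the variance of the two columns
toy_pca = PCA(n_components=1)
toy_scores = toy_pca.fit_transform(toy)
print(toy_pca.explained_variance_ratio_)  # close to 1.0
print(toy_scores.shape)                   # (50, 1): two columns reduced to one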
We use the head command to display the first five rows of the dataset. The row index starts from 0 because that is the default in pandas.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

df = pd.read_csv("2014-aps-employee-census-5-point-dataset.csv")
print(df.head())
                                AS    q1                q2@                   q6@            q18a  ...
0  Large (1,001 or more employees)  Male     40 to 54 years  Trainee/Graduate/APS           Agree  ...
1  Large (1,001 or more employees)  Male     Under 40 years  Trainee/Graduate/APS           Agree  ...
2  Large (1,001 or more employees)  Male     Under 40 years  Trainee/Graduate/APS           Agree  ...
3  Large (1,001 or more employees)                           Trainee/Graduate/APS  Strongly agree  ...
4  Large (1,001 or more employees)  Male  55 years or older  Trainee/Graduate/APS           Agree  ...

[5 rows x 225 columns]
These methods show the shape of the dataset, how many values each column contains, how many unique values it has, which value occurs most often and with what frequency, and which values appear at the top of the dataset.
print("shape of dataset:",df.shape) print("\n Dataset description:\n",df.describe())
shape of dataset: (99392, 225)

 Dataset description:
                                      AS      q1             q2@                   q6@   q18a  ...
count                              99392   99392           99392                 99392  99392  ...
unique                                 3       4               4                     4      6  ...
top      Large (1,001 or more employees)  Female  40 to 54 years  Trainee/Graduate/APS  Agree  ...
freq                               86884   56250           42924                 67630  55703  ...

[4 rows x 225 columns]
Here we show the names of the columns present in the dataset. With the help of the unique command we then show which unique values are present in a particular column.
print("Columns name:",df.columns) print("\nUnique values in column 'AS':",np.unique(df['AS']))
Columns name: Index(['AS', 'q1', 'q2@', 'q6@', 'q18a', 'q18b', 'q18c', 'q18d', 'q18e',
       'q18f',
       ...
       'q79d', 'q79e', 'q80', 'q81a.1', 'q81b.1', 'q81c.1', 'q81d.1',
       'q81e.1', 'q81f.1', 'q81g.1'],
      dtype='object', length=225)

Unique values in column 'AS': ['Large (1,001 or more employees)'
 'Medium (251 to 1,000 employees)' 'Small (Less than 250 employees)']
print(np.unique(df['q2@']))
[' ' '40 to 54 years' '55 years or older' 'Under 40 years']
With the help of the isnull method we find out how many null values are present in each column of our dataset.
df.isnull().sum()
AS        0
q1        0
q2@       0
q6@       0
q18a      0
         ..
q81c.1    0
q81d.1    0
q81e.1    0
q81f.1    0
q81g.1    0
Length: 225, dtype: int64
print(df.info())
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 99392 entries, 0 to 99391
Columns: 225 entries, AS to q81g.1
dtypes: object(225)
memory usage: 170.6+ MB
None
With the help of this command we find the frequency of each value present in a particular column of the dataset.
df["AS"].value_counts()
Large (1,001 or more employees)    86884
Medium (251 to 1,000 employees)     8884
Small (Less than 250 employees)     3624
Name: AS, dtype: int64
We clean the data with the help of dropna, which removes every row that contains a null value. Since isnull reported zero missing values above (the blanks are stored as empty strings rather than NaN), no rows are actually dropped and the row count stays at 99392.
df = df.dropna()
print(df)
                                    AS      q1                q2@                   q6@  ...  q81f.1  q81g.1
0      Large (1,001 or more employees)    Male     40 to 54 years  Trainee/Graduate/APS  ...
1      Large (1,001 or more employees)    Male     Under 40 years  Trainee/Graduate/APS  ...
2      Large (1,001 or more employees)    Male     Under 40 years  Trainee/Graduate/APS  ...
3      Large (1,001 or more employees)                             Trainee/Graduate/APS  ...
4      Large (1,001 or more employees)    Male  55 years or older  Trainee/Graduate/APS  ...
...                                ...     ...                ...                   ...  ...     ...     ...
99387  Medium (251 to 1,000 employees)  Female     40 to 54 years  Trainee/Graduate/APS  ...
99388  Medium (251 to 1,000 employees)    Male  55 years or older  Trainee/Graduate/APS  ...
99389  Medium (251 to 1,000 employees)    Male  55 years or older  Trainee/Graduate/APS  ...
99390  Medium (251 to 1,000 employees)    Male  55 years or older  Trainee/Graduate/APS  ...
99391  Medium (251 to 1,000 employees)  Female  55 years or older  Trainee/Graduate/APS  ...

[99392 rows x 225 columns]
Label encoder: it converts categorical values into numeric values. LabelEncoder encodes labels with a value between 0 and n_classes - 1, where n_classes is the number of distinct labels. Since df.info() showed that all 225 columns have the object dtype, every column has to be encoded before PCA can be applied.
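For instance (a minimal sketch on made-up category labels, not columns from this dataset), LabelEncoder maps each distinct string to an integer:

from sklearn.preprocessing import LabelEncoder

sizes = ["Small", "Large", "Medium", "Large", "Small"]
enc = LabelEncoder()
print(enc.fit_transform(sizes))  # [2 0 1 0 2] -- classes are numbered in sorted order
print(enc.classes_)              # ['Large' 'Medium' 'Small']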
from sklearn.preprocessing import LabelEncoder

coll = df.columns
num_coll = len(coll)
print("Total columns in dataset:", num_coll)
Total columns in dataset: 225
With the help of a multi-column label encoder we convert the large amount of categorical data into numerical data across multiple columns of the dataset at once.
class MultiColumnLabelEncoder:
    def __init__(self, columns=None):
        # columns: list of column names to encode; None means encode every column
        self.columns = columns

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        output = X.copy()
        if self.columns is not None:
            for col in self.columns:
                output[col] = LabelEncoder().fit_transform(output[col])
        else:
            # no column list given: encode every column in the frame
            for colname, col in output.items():
                output[colname] = LabelEncoder().fit_transform(col)
        return output

    def fit_transform(self, X, y=None):
        return self.fit(X, y).transform(X)


encode_df = MultiColumnLabelEncoder(columns=coll).fit_transform(df)
Here we split the dataset into training and test sets so that, before performing the PCA operations, the data has already been divided for training and testing.
X = encode_df.iloc[:, 0:100].values
y = encode_df.iloc[:, 100:].values

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
Standardization means rescaling the data so that columns with large values do not dominate columns with small values, which would otherwise bias the result. Each value is replaced by its z-score: the value minus the column mean, divided by the column standard deviation.
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
scaler.fit(encode_df)
scaled_data = scaler.transform(encode_df)
print(scaled_data)
[[-0.35512512  1.08217509 -0.86199097 ... -0.04027998 -0.07838787 -0.05381374]
 [-0.35512512  1.08217509  1.16639595 ... -0.04027998 -0.07838787 -0.05381374]
 [-0.35512512  1.08217509  1.16639595 ... -0.04027998 -0.07838787 -0.05381374]
 ...
 [ 1.83286125  1.08217509  0.15220249 ... -0.04027998 -0.07838787 -0.05381374]
 [ 1.83286125  1.08217509  0.15220249 ... -0.04027998 -0.07838787 -0.05381374]
 [ 1.83286125 -0.78784032  0.15220249 ... -0.04027998 -0.07838787 -0.05381374]]
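As a sanity check (a minimal sketch; it assumes no column of encode_df is constant, since StandardScaler treats zero-variance columns specially), the same numbers can be reproduced by hand with the z-score formula:

# z-score by hand: (value - column mean) / column standard deviation
manual = (encode_df - encode_df.mean()) / encode_df.std(ddof=0)  # ddof=0 matches StandardScaler
print(np.allclose(scaled_data, manual.values))  # expected: True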
PCA is used to represent the multivariate data table as a smaller set of variables (principal components) in order to observe trends, clusters, and outliers. Here we keep the first 100 components and print the fraction of variance each one explains.
from sklearn.decomposition import PCA

pca = PCA(n_components=100)
pca.fit(scaled_data)
X_pca = pca.transform(scaled_data)

explained_variance = pca.explained_variance_ratio_
print("PCA variance ratio:\n ", explained_variance)
PCA variance ratio:
 [0.17174386 0.05400343 0.04285115 0.03413928 0.02813929 0.02192731
  0.01637397 0.01541781 0.01355456 0.01194664 0.01162375 0.01075132
  0.00993616 0.00955814 0.00926424 0.00907456 0.00868036 0.00862437
  0.00857389 0.00801754 0.00693549 0.00690604 0.00672259 0.00628069
  0.00578766 0.00570613 0.00564489 0.00542409 0.00536122 0.00515135
  0.00510966 0.00508185 0.00487065 0.00480339 0.00469909 0.00466639
  0.00457426 0.00455165 0.00451562 0.00446899 0.00440665 0.004346
  0.00432109 0.00428343 0.00423037 0.00415517 0.00411799 0.00409009
  0.00406203 0.0040447  0.00392605 0.00388362 0.00386713 0.00380245
  0.00373605 0.00370093 0.00364178 0.00359485 0.00351624 0.00348304
  0.00343211 0.00340265 0.00334397 0.00331924 0.00330868 0.0032896
  0.00326933 0.00322748 0.00319879 0.00318379 0.00310807 0.00308975
  0.00306869 0.00301716 0.00298357 0.00298094 0.00294148 0.00289031
  0.00284248 0.00279262 0.00276843 0.00272978 0.00271601 0.00268968
  0.00268609 0.00265651 0.00260367 0.00260165 0.00256548 0.00253245
  0.00251968 0.00246755 0.00245905 0.00244337 0.00242123 0.00239962
  0.00235046 0.00234205 0.00231747 0.00231347]
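One common way to read these ratios (a sketch added for illustration, not part of the original analysis) is a scree plot of the variance explained by each component:

# scree plot: explained variance ratio per principal component
plt.figure(figsize=(10, 5))
plt.plot(range(1, len(explained_variance) + 1), explained_variance, marker='o')
plt.xlabel('Principal component')
plt.ylabel('Explained variance ratio')
plt.title('Scree plot')
plt.show()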
print(X_pca)
[[-1.59152612  0.55564691 -0.70495291 ...  0.77915177  1.08884638  0.78325715]
 [-2.57328202  1.4814192  -1.7046726  ...  0.17181242 -0.03189908 -1.2864775 ]
 [-3.97356152  0.91220517  5.74592169 ... -0.28159834  0.96923543 -0.27102818]
 ...
 [13.99032327  1.43328238 -1.05774149 ... -0.68599246  0.178349   -0.07620442]
 [13.25045555  0.87868275 -0.68725882 ... -1.1036088  -0.38240104  0.46014404]
 [ 8.81093543  5.91687096  4.69821008 ... -1.16038284 -1.09898136 -0.7022699 ]]
print("X_pca shape:",X_pca.shape) print("Standard scaler shape:",scaled_data.shape)
X_pca shape: (99392, 100) Standard scaler shape: (99392, 225)
It returns the cumulative sum of the variance ratios explained by each dimension, i.e. the total fraction of variance captured by the first k components.
pca.explained_variance_ratio_.cumsum()
array([0.17174386, 0.22574729, 0.26859844, 0.30273772, 0.33087701,
       0.35280432, 0.36917829, 0.38459611, 0.39815067, 0.41009731,
       0.42172106, 0.43247237, 0.44240853, 0.45196667, 0.46123091,
       0.47030547, 0.47898583, 0.4876102 , 0.49618408, 0.50420163,
       0.51113711, 0.51804315, 0.52476574, 0.53104643, 0.53683408,
       0.54254021, 0.5481851 , 0.55360919, 0.5589704 , 0.56412175,
       0.56923141, 0.57431326, 0.57918391, 0.5839873 , 0.5886864 ,
       0.59335279, 0.59792705, 0.6024787 , 0.60699431, 0.6114633 ,
       0.61586995, 0.62021594, 0.62453703, 0.62882047, 0.63305084,
       0.63720601, 0.641324  , 0.64541409, 0.64947612, 0.65352082,
       0.65744687, 0.66133049, 0.66519762, 0.66900008, 0.67273612,
       0.67643706, 0.68007884, 0.68367369, 0.68718993, 0.69067297,
       0.69410508, 0.69750774, 0.7008517 , 0.70417095, 0.70747963,
       0.71076923, 0.71403856, 0.71726604, 0.72046482, 0.72364861,
       0.72675669, 0.72984644, 0.73291513, 0.73593228, 0.73891585,
       0.74189679, 0.74483826, 0.74772857, 0.75057106, 0.75336368,
       0.75613211, 0.75886189, 0.7615779 , 0.76426758, 0.76695367,
       0.76961018, 0.77221385, 0.7748155 , 0.77738099, 0.77991344,
       0.78243312, 0.78490068, 0.78735973, 0.7898031 , 0.79222433,
       0.79462395, 0.79697441, 0.79931646, 0.80163393, 0.80394741])
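This running total can be used to decide how many components to keep. A minimal sketch (the 80% threshold is only an illustrative choice):

cum_var = pca.explained_variance_ratio_.cumsum()

# smallest number of components whose cumulative explained variance reaches 80%
n_components_80 = np.argmax(cum_var >= 0.80) + 1
print(n_components_80)  # 98 for the ratios printed above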
print(pca.components_)
[[-4.99442165e-06 -5.25975613e-03 -2.13945274e-03 ... -5.64910048e-03 -8.38891747e-03 -4.89141221e-03]
 [ 2.78183433e-03  1.45460517e-02 -1.00834039e-02 ...  2.00351858e-02  2.42798487e-02  9.76354200e-03]
 [ 1.74241577e-02  6.91796896e-04  8.58904232e-03 ... -7.00410370e-03 -1.64223006e-02 -1.08194920e-02]
 ...
 [ 4.73681228e-02  1.63870006e-02 -4.82537985e-02 ...  1.81208672e-02  2.04451962e-01  8.95866809e-03]
 [ 2.75000546e-02 -1.15534197e-02  1.62820191e-02 ... -8.16662086e-03 -5.62814675e-02 -9.40465919e-03]
 [-9.25218962e-03 -7.08683049e-03 -2.61927038e-02 ...  3.14174560e-02  2.06966563e-01  4.55041143e-02]]
principal_Df = pd.DataFrame(data=X_pca)
print("Principal components (n columns):\n", principal_Df.tail())
Principal components (n columns):
               0         1         2         3         4         5         6  ...        97        98        99
99387  13.813840 -0.899497 -0.583618  1.036175 -4.244118  3.485645 -2.109940  ...  1.037823 -0.604213  0.730468
99388  10.334663  5.139781  1.020396  0.804072 -2.202810  3.686753  0.766625  ...  0.691985 -0.224734 -0.467560
99389  13.990323  1.433283 -1.057741  1.627129 -3.713760  3.764278 -1.170729  ...  0.110084 -1.239035  0.690112
99390  13.250456  0.878683 -0.687258  1.424517 -2.945365  3.944510 -1.687646  ...  0.768032 -0.509641  0.776732
99391   8.810935  5.916871  4.698210 -0.963108 -4.863728  1.608598 -0.110916  ...  1.612783 -0.488076 -0.678922

[5 rows x 100 columns]
Here we plot a scatter graph of the first two principal components. The points are coloured by the encoded 'AS' column, so the three agency-size categories appear in three different colours.
plt.figure(figsize=(15, 10))
plt.scatter(X_pca[:, 0], X_pca[:, 1], c=encode_df['AS'])
plt.xlabel('PC1 "First Principal Component"')
plt.ylabel('PC2 "Second Principal Component"')
plt.colorbar()
plt.show()