Prediction Of Employee Salary
Project Objective: Let's assume the HR team of a company uses a model to determine what salary to offer a new employee. For our project, suppose a candidate has applied for the role of Regional Manager and has already worked as a Regional Manager for 2 years. Based on the data provided by the candidate's previous company (Position_Salaries.csv), he falls between level 6 and level 7, so let's say he is at level 6.5. We want to build a model that predicts what salary we should offer the new employee, given the salary data from his previous company.
First, we import the necessary libraries (numpy, matplotlib and pandas) for this model.
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
We need to predict the salary for an employee at level 6.5, so we do not need the first column, "Position". Here X is our independent variable, "Level", and y is the dependent variable, "Salary".
dataset = pd.read_csv('Position_Salaries.csv')
print(dataset)  # Show all the data in the Position_Salaries.csv file
X = dataset.iloc[:, 1:-1].values  # all rows; columns from index 1 up to, but not including, index 2 (keeps X two-dimensional)
print("level", X)
y = dataset.iloc[:, -1].values  # all rows; only the last column (index 2)
print("salary", y)
            Position  Level   Salary
0   Business Analyst      1    45000
1  Junior Consultant      2    50000
2  Senior Consultant      3    60000
3            Manager      4    80000
4    Country Manager      5   110000
5     Region Manager      6   150000
6            Partner      7   200000
7     Senior Partner      8   300000
8            C-level      9   500000
9                CEO     10  1000000
level [[ 1]
 [ 2]
 [ 3]
 [ 4]
 [ 5]
 [ 6]
 [ 7]
 [ 8]
 [ 9]
 [10]]
salary [ 45000 50000 60000 80000 110000 150000 200000 300000 500000 1000000]
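If the CSV file is not at hand, the same table can be rebuilt in code from the values printed above. This is only a convenience sketch: the numbers and column names are copied from that printed output, and building the DataFrame by hand is an assumption of this example, not part of the original workflow.

```python
import pandas as pd

# Values copied from the printed contents of Position_Salaries.csv
dataset = pd.DataFrame({
    "Position": [
        "Business Analyst", "Junior Consultant", "Senior Consultant",
        "Manager", "Country Manager", "Region Manager",
        "Partner", "Senior Partner", "C-level", "CEO",
    ],
    "Level": list(range(1, 11)),
    "Salary": [45000, 50000, 60000, 80000, 110000,
               150000, 200000, 300000, 500000, 1000000],
})

X = dataset.iloc[:, 1:-1].values  # 2-D array of levels, one column
y = dataset.iloc[:, -1].values    # 1-D array of salaries
print(X.shape, y.shape)
```

scikit-learn expects X to be two-dimensional (samples by features), which is why the slice 1:-1 is used rather than a single column index.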
First we will build a simple linear regression model to see what prediction it makes and then compare it to the prediction made by the Polynomial Regression to see which is more accurate.
We will be using the LinearRegression class from the sklearn.linear_model module. We create an object of the LinearRegression class and call its fit method, passing X and y.
# Training the Linear Regression model on the whole dataset
from sklearn.linear_model import LinearRegression
lin_reg = LinearRegression()
lin_reg.fit(X, y)
LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None, normalize=False)
# Visualising the Linear Regression results
plt.scatter(X, y, color = 'red')
plt.plot(X, lin_reg.predict(X), color = 'blue')
plt.title('Truth or Bluff (Linear Regression)')
plt.xlabel('Position Level')
plt.ylabel('Salary')
plt.show()
If we look at the graph, we can see that a person at level 6.5 would be offered a salary of around $300k, and that there are large gaps between the fitted line (blue) and the actual values (red dots). We will confirm this in the next step by getting the salary prediction from the linear regression model.
lin_reg.predict([[6.5]])
array([330378.78787879])
We can see that the prediction is way off: it predicts $330k, which is well above even the level 7 salary of $200k. Now let's check the prediction we get from Polynomial Regression.
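One way to put a number on how far off the straight line is would be to look at its residuals and its R² score on the training data. The sketch below is an illustration under assumptions: it rebuilds X and y from the printed dataset instead of reading the CSV, and the names pred_65, residuals and r2 are introduced here only for the example.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Levels and salaries copied from the printed dataset
X = np.arange(1, 11).reshape(-1, 1)
y = np.array([45000, 50000, 60000, 80000, 110000,
              150000, 200000, 300000, 500000, 1000000])

lin_reg = LinearRegression().fit(X, y)

# Prediction for the candidate at level 6.5
pred_65 = lin_reg.predict([[6.5]])[0]

# Actual salary minus fitted salary at each level
residuals = y - lin_reg.predict(X)

# R^2 on the same data the model was trained on
r2 = lin_reg.score(X, y)

print("prediction at 6.5:", round(pred_65, 2))
print("residuals at levels 6 and 7:", residuals[5], residuals[6])
print("training R^2:", round(r2, 3))
```

Because the salaries grow much faster than linearly at the top levels, the line overshoots in the middle of the range, which is where level 6.5 sits.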
We will be using the PolynomialFeatures class from the sklearn.preprocessing module for this purpose. When we create an object of this class, we have to pass the degree parameter. Let's begin by choosing a degree of 4 so that the curve can follow the data more closely. Then we call the fit_transform method to transform the matrix X into X_poly, which holds the powers of Level up to the chosen degree.
from sklearn.preprocessing import PolynomialFeatures
poly_reg = PolynomialFeatures(degree=4)
X_poly = poly_reg.fit_transform(X)
Now we will create a new linear regression object called lin_reg_2 and fit it on X_poly instead of X.
lin_reg_2 = LinearRegression()
lin_reg_2.fit(X_poly, y)
LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None, normalize=False)
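To see what the transformation actually produces, it can help to expand a couple of values by hand. The following is a small sketch; the sample levels 2 and 6.5 and the variable names sample and expanded are chosen here purely for illustration.

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

poly_reg = PolynomialFeatures(degree=4)

# Two example levels, shaped as a column like X
sample = np.array([[2.0], [6.5]])

# Each row becomes [1, x, x^2, x^3, x^4]
expanded = poly_reg.fit_transform(sample)
print(expanded)
```

Each level is turned into a row of its powers from 0 to 4, so the second LinearRegression is still a linear model, only fitted on these extra columns instead of on Level alone.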
Let's plot the graph to look at the results for Polynomial Regression.
plt.scatter(X, y, color="red")
plt.plot(X, lin_reg_2.predict(poly_reg.transform(X)))
plt.title("Poly Regression Degree 4")
plt.xlabel("Level")
plt.ylabel("Salary")
plt.show()
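Plotting only at the ten integer levels joins the fitted values with straight segments, so the curve between levels (including 6.5) is not really shown. A possible refinement, sketched here under the assumption that X and y are rebuilt from the printed dataset, is to evaluate the model on a finer grid; X_grid and y_grid are names introduced for this example.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures

# Levels and salaries copied from the printed dataset
X = np.arange(1, 11).reshape(-1, 1)
y = np.array([45000, 50000, 60000, 80000, 110000,
              150000, 200000, 300000, 500000, 1000000])

poly_reg = PolynomialFeatures(degree=4)
lin_reg_2 = LinearRegression().fit(poly_reg.fit_transform(X), y)

# 91 evenly spaced levels from 1 to 10, i.e. a step of 0.1
X_grid = np.linspace(X.min(), X.max(), 91).reshape(-1, 1)
y_grid = lin_reg_2.predict(poly_reg.transform(X_grid))

plt.scatter(X, y, color="red")
plt.plot(X_grid, y_grid, color="blue")
plt.title("Poly Regression Degree 4 (fine grid)")
plt.xlabel("Level")
plt.ylabel("Salary")
plt.show()
```

Note that the grid is passed through poly_reg.transform, not fit_transform, since the transformer has already been fitted on X.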
If we look at the graph, we can see that the curve now follows the data points much more closely, and that at level 6.5 it sits roughly between the level 6 and level 7 salaries. We will confirm the exact value in the next step.
lin_reg_2.predict(poly_reg.transform([[6.5]]))
array([158862.45265158])
We get a prediction of $158k, which looks reasonable based on our dataset, since it lies between the level 6 salary ($150k) and the level 7 salary ($200k).
So in this case, Linear Regression gave us a prediction of $330k and Polynomial Regression gave us a prediction of $158k, which shows that Polynomial Regression is more reasonable here.
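The comparison can also be run in one pass over several degrees. The sketch below is not part of the original walkthrough: it uses scikit-learn's make_pipeline to chain PolynomialFeatures and LinearRegression, rebuilds X and y from the printed dataset, and stores the results in a dictionary named preds that exists only for this example. Since all ten rows are used for training, it compares fits on the training data only.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Levels and salaries copied from the printed dataset
X = np.arange(1, 11).reshape(-1, 1)
y = np.array([45000, 50000, 60000, 80000, 110000,
              150000, 200000, 300000, 500000, 1000000])

preds = {}
for degree in (1, 2, 3, 4):
    # degree=1 is equivalent to the plain linear model
    model = make_pipeline(PolynomialFeatures(degree=degree), LinearRegression())
    model.fit(X, y)
    preds[degree] = model.predict([[6.5]])[0]
    print(f"degree {degree}: predicted salary at level 6.5 = {preds[degree]:,.0f}")
```

Degrees 1 and 4 should reproduce the two predictions discussed above; the intermediate degrees show how the estimate moves as the curve is allowed to bend more.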