Predict Startup Profit
The goal of this competition is to predict the profit of a startup from the data provided: Research and Development Spend (R&D Spend), Administration Spend, Marketing Spend, and State. We use multiple linear regression in this model because we have to predict profit (the dependent variable) from multiple fields (independent variables) rather than one field, as we did in Simple Linear Regression. This model can help people who want to invest in a startup company by analysing the profit of the company.
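In its general form, the multiple linear regression model expresses profit as a weighted sum of the independent variables (the coefficient names b0..b6 below are just notation for illustration, not from the dataset):

```
Profit = b0 + b1*(R&D Spend) + b2*(Administration) + b3*(Marketing Spend)
            + b4*D_California + b5*D_Florida + b6*D_NewYork
```

Here D_California, D_Florida, and D_NewYork are 0/1 dummy variables encoding the State column, and the regression learns the coefficients b0 through b6 from the training data.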
First, we import the necessary libraries (NumPy, Matplotlib, and pandas) for this model.
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
Now we read the CSV file 50_Startups_with_states.csv. It contains data about 50 startups and has 5 columns: "R&D Spend", "Administration", "Marketing Spend", "State", and "Profit". The first three columns indicate how much each startup spends on research and development, on administration costs, and on marketing. The State column indicates which state the startup is based in, and the last column states the profit made by the startup.
dataset = pd.read_csv('50_Startups_with_states.csv')
print(dataset)
    R&D Spend  Administration  Marketing Spend       State     Profit
0   165349.20       136897.80        471784.10    New York  192261.83
1   162597.70       151377.59        443898.53  California  191792.06
2   153441.51       101145.55        407934.54     Florida  191050.39
3   144372.41       118671.85        383199.62    New York  182901.99
4   142107.34        91391.77        366168.42     Florida  166187.94
5   131876.90        99814.71        362861.36    New York  156991.12
6   134615.46       147198.87        127716.82  California  156122.51
7   130298.13       145530.06        323876.68     Florida  155752.60
8   120542.52       148718.95        311613.29    New York  152211.77
9   123334.88       108679.17        304981.62  California  149759.96
10  101913.08       110594.11        229160.95     Florida  146121.95
11  100671.96        91790.61        249744.55  California  144259.40
12   93863.75       127320.38        249839.44     Florida  141585.52
13   91992.39       135495.07        252664.93  California  134307.35
14  119943.24       156547.42        256512.92     Florida  132602.65
15  114523.61       122616.84        261776.23    New York  129917.04
16   78013.11       121597.55        264346.06  California  126992.93
17   94657.16       145077.58        282574.31    New York  125370.37
18   91749.16       114175.79        294919.57     Florida  124266.90
19   86419.70       153514.11             0.00    New York  122776.86
20   76253.86       113867.30        298664.47  California  118474.03
21   78389.47       153773.43        299737.29    New York  111313.02
22   73994.56       122782.75        303319.26     Florida  110352.25
23   67532.53       105751.03        304768.73     Florida  108733.99
24   77044.01        99281.34        140574.81    New York  108552.04
25   64664.71       139553.16        137962.62  California  107404.34
26   75328.87       144135.98        134050.07     Florida  105733.54
27   72107.60       127864.55        353183.81    New York  105008.31
28   66051.52       182645.56        118148.20     Florida  103282.38
29   65605.48       153032.06        107138.38    New York  101004.64
30   61994.48       115641.28         91131.24     Florida   99937.59
31   61136.38       152701.92         88218.23    New York   97483.56
32   63408.86       129219.61         46085.25  California   97427.84
33   55493.95       103057.49        214634.81     Florida   96778.92
34   46426.07       157693.92        210797.67  California   96712.80
35   46014.02        85047.44        205517.64    New York   96479.51
36   28663.76       127056.21        201126.82     Florida   90708.19
37   44069.95        51283.14        197029.42  California   89949.14
38   20229.59        65947.93        185265.10    New York   81229.06
39   38558.51        82982.09        174999.30  California   81005.76
40   28754.33       118546.05        172795.67  California   78239.91
41   27892.92        84710.77        164470.71     Florida   77798.83
42   23640.93        96189.63        148001.11  California   71498.49
43   15505.73       127382.30         35534.17    New York   69758.98
44   22177.74       154806.14         28334.72  California   65200.33
45    1000.23       124153.04          1903.93    New York   64926.08
46    1315.46       115816.21        297114.46     Florida   49490.75
47       0.00       135426.92             0.00  California   42559.73
48     542.05        51743.15             0.00    New York   35673.41
49       0.00       116983.80         45173.06  California   14681.40
Here X contains all the independent variables and y is the dependent variable ("Profit").
# Separating the independent and dependent variables
X = dataset.iloc[:, :-1].values  # take all rows and all columns except the last one
y = dataset.iloc[:, -1].values   # take all rows and only the last column
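The same iloc-style slicing can be seen on a tiny NumPy array (the numbers below are hypothetical, just to illustrate what the slices select):

```python
import numpy as np

# Toy 3x4 array standing in for the dataset's values.
data = np.array([[1.0, 2.0, 3.0, 10.0],
                 [4.0, 5.0, 6.0, 20.0],
                 [7.0, 8.0, 9.0, 30.0]])

X = data[:, :-1]  # all rows, every column except the last
y = data[:, -1]   # all rows, only the last column
print(X.shape, y.shape)  # (3, 3) (3,)
```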
We use OneHotEncoder to convert the State column's string data into numeric values in the form of dummy variables. As you can see, the State column (at index 3) is converted into 0/1 form. If the column contains the value New York it is represented as [0.0 0.0 1.0]; the same applies for California: [1.0 0.0 0.0] and Florida: [0.0 1.0 0.0].
# Encoding categorical data
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder
ct = ColumnTransformer(transformers=[('encoder', OneHotEncoder(), [3])], remainder='passthrough')
X = np.array(ct.fit_transform(X))
print(X)
[[0.0 0.0 1.0 165349.2 136897.8 471784.1]
 [1.0 0.0 0.0 162597.7 151377.59 443898.53]
 [0.0 1.0 0.0 153441.51 101145.55 407934.54]
 [0.0 0.0 1.0 144372.41 118671.85 383199.62]
 [0.0 1.0 0.0 142107.34 91391.77 366168.42]
 [0.0 0.0 1.0 131876.9 99814.71 362861.36]
 [1.0 0.0 0.0 134615.46 147198.87 127716.82]
 [0.0 1.0 0.0 130298.13 145530.06 323876.68]
 [0.0 0.0 1.0 120542.52 148718.95 311613.29]
 [1.0 0.0 0.0 123334.88 108679.17 304981.62]
 [0.0 1.0 0.0 101913.08 110594.11 229160.95]
 [1.0 0.0 0.0 100671.96 91790.61 249744.55]
 [0.0 1.0 0.0 93863.75 127320.38 249839.44]
 [1.0 0.0 0.0 91992.39 135495.07 252664.93]
 [0.0 1.0 0.0 119943.24 156547.42 256512.92]
 [0.0 0.0 1.0 114523.61 122616.84 261776.23]
 [1.0 0.0 0.0 78013.11 121597.55 264346.06]
 [0.0 0.0 1.0 94657.16 145077.58 282574.31]
 [0.0 1.0 0.0 91749.16 114175.79 294919.57]
 [0.0 0.0 1.0 86419.7 153514.11 0.0]
 [1.0 0.0 0.0 76253.86 113867.3 298664.47]
 [0.0 0.0 1.0 78389.47 153773.43 299737.29]
 [0.0 1.0 0.0 73994.56 122782.75 303319.26]
 [0.0 1.0 0.0 67532.53 105751.03 304768.73]
 [0.0 0.0 1.0 77044.01 99281.34 140574.81]
 [1.0 0.0 0.0 64664.71 139553.16 137962.62]
 [0.0 1.0 0.0 75328.87 144135.98 134050.07]
 [0.0 0.0 1.0 72107.6 127864.55 353183.81]
 [0.0 1.0 0.0 66051.52 182645.56 118148.2]
 [0.0 0.0 1.0 65605.48 153032.06 107138.38]
 [0.0 1.0 0.0 61994.48 115641.28 91131.24]
 [0.0 0.0 1.0 61136.38 152701.92 88218.23]
 [1.0 0.0 0.0 63408.86 129219.61 46085.25]
 [0.0 1.0 0.0 55493.95 103057.49 214634.81]
 [1.0 0.0 0.0 46426.07 157693.92 210797.67]
 [0.0 0.0 1.0 46014.02 85047.44 205517.64]
 [0.0 1.0 0.0 28663.76 127056.21 201126.82]
 [1.0 0.0 0.0 44069.95 51283.14 197029.42]
 [0.0 0.0 1.0 20229.59 65947.93 185265.1]
 [1.0 0.0 0.0 38558.51 82982.09 174999.3]
 [1.0 0.0 0.0 28754.33 118546.05 172795.67]
 [0.0 1.0 0.0 27892.92 84710.77 164470.71]
 [1.0 0.0 0.0 23640.93 96189.63 148001.11]
 [0.0 0.0 1.0 15505.73 127382.3 35534.17]
 [1.0 0.0 0.0 22177.74 154806.14 28334.72]
 [0.0 0.0 1.0 1000.23 124153.04 1903.93]
 [0.0 1.0 0.0 1315.46 115816.21 297114.46]
 [1.0 0.0 0.0 0.0 135426.92 0.0]
 [0.0 0.0 1.0 542.05 51743.15 0.0]
 [1.0 0.0 0.0 0.0 116983.8 45173.06]]
Next we have to split the dataset into training and test sets. We will use the training set to train the model and then check the performance of the model on the test set.
For this we will use the train_test_split function from sklearn.model_selection. We are providing a test_size of 0.2, which means the test set will contain 10 observations and the training set will contain 40 observations. The random_state=0 is required only if you want to compare your results with mine.
# Splitting the dataset into the Training set and Test set
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 0)
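The 80/20 arithmetic behind the split can be sketched by hand (a toy shuffle-split with a fixed seed, not sklearn's actual implementation):

```python
import random

# 50 observations, 20% held out for testing.
n_samples = 50
test_size = 0.2

indices = list(range(n_samples))
random.Random(0).shuffle(indices)    # fixed seed so the split is reproducible
n_test = int(n_samples * test_size)  # 10 observations
test_idx, train_idx = indices[:n_test], indices[n_test:]
print(len(train_idx), len(test_idx))  # 40 10
```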
This is a very simple step. We will be using the LinearRegression class from sklearn.linear_model. First we create an object of the LinearRegression class, then call its fit method, passing in X_train and y_train.
# Training the Multiple Linear Regression model on the Training set
from sklearn.linear_model import LinearRegression
regressor = LinearRegression()
regressor.fit(X_train, y_train)
LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None, normalize=False)
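Under the hood, fit solves an ordinary least-squares problem. A minimal NumPy sketch on toy data (not the startup dataset) shows the idea:

```python
import numpy as np

# Toy data generated from y = 2*x1 + 3*x2 + 1, so the fit is exact.
X_toy = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
y_toy = 2.0 * X_toy[:, 0] + 3.0 * X_toy[:, 1] + 1.0

# Append a column of ones for the intercept, then solve least squares.
A = np.hstack([X_toy, np.ones((len(X_toy), 1))])
coef, *_ = np.linalg.lstsq(A, y_toy, rcond=None)
print(np.round(coef, 2))  # coefficients close to [2, 3, 1]
```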
Using the regressor we trained in the previous step, we will now predict the results of the test set and compare the predicted values with the actual values. We use np.set_printoptions(precision=2) to show only two decimal places for the floating-point values.
# Predicting the Test set results
y_pred = regressor.predict(X_test)
np.set_printoptions(precision=2)
Now we have y_pred, which holds the predicted values from our model, and y_test, which holds the actual values. Let us compare them and see how well our model did. As you can see from the output below, our basic model did pretty well.
print(np.concatenate((y_pred.reshape(len(y_pred), 1), y_test.reshape(len(y_test), 1)), 1))
[[103015.2  103282.38]
 [132582.28 144259.4 ]
 [132447.74 146121.95]
 [ 71976.1   77798.83]
 [178537.48 191050.39]
 [116161.24 105008.31]
 [ 67851.69  81229.06]
 [ 98791.73  97483.56]
 [113969.44 110352.25]
 [167921.07 166187.94]]
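One way to put a number on "pretty well" is the R-squared score, which compares the model's errors against simply predicting the mean. A minimal implementation (sklearn provides the same via sklearn.metrics.r2_score):

```python
# R-squared: 1 - SS_res / SS_tot, where SS_res is the sum of squared
# prediction errors and SS_tot is the total variance around the mean.
def r2(y_true, y_pred):
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot

# Perfect predictions score 1.0; always predicting the mean scores 0.0.
print(r2([1.0, 2.0, 3.0], [1.0, 2.0, 3.0]))  # 1.0
print(r2([1.0, 2.0, 3.0], [2.0, 2.0, 2.0]))  # 0.0
```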
Here you can see the multiple linear regression graph, which shows an approximately straight line when the predicted values (y_pred) are plotted against the actual values (y_test) of the Profit column.
# Plotting predicted profit against actual profit
plt.plot(y_test, y_pred)
plt.show()