Product Demand Forecasting in Python
Product demand forecasting is the process of applying predictive analysis to historical data in order to estimate future demand for a product or service. It helps a business estimate future sales and revenue, reduce risk, and make more efficient financial decisions. In this article, we will find out which properties of a product influence its sales.
```python
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
```
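The next snippet assumes the raw train and test sets have already been loaded. A minimal sketch, assuming the files are named `Train.csv` and `Test.csv` (adjust the paths to your own copies):

```python
# Hypothetical file names -- replace with the paths to your raw train/test CSVs
train = pd.read_csv("Train.csv")
test = pd.read_csv("Test.csv")
```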
```python
# Tag each row with its origin, then stack the two sets for joint preprocessing
train['source'] = 'train'
test['source'] = 'test'
data = pd.concat([train, test], ignore_index=True)
print(train.shape, test.shape, data.shape)
```
```
(8523, 13) (5681, 12) (14204, 13)
```
(Older pandas versions emit a `FutureWarning` here because the two frames' columns are not aligned in the same order; passing `sort=False` to `pd.concat` opts into the future behavior and silences the warning.)
```python
data.head()
```
| | Item_Fat_Content | Item_Identifier | Item_MRP | Item_Outlet_Sales | Item_Type | Item_Visibility | Item_Weight | Outlet_Establishment_Year | Outlet_Identifier | Outlet_Location_Type | Outlet_Size | Outlet_Type | source |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Low Fat | FDA15 | 249.8092 | 3735.1380 | Dairy | 0.016047 | 9.30 | 1999 | OUT049 | Tier 1 | Medium | Supermarket Type1 | train |
| 1 | Regular | DRC01 | 48.2692 | 443.4228 | Soft Drinks | 0.019278 | 5.92 | 2009 | OUT018 | Tier 3 | Medium | Supermarket Type2 | train |
| 2 | Low Fat | FDN15 | 141.6180 | 2097.2700 | Meat | 0.016760 | 17.50 | 1999 | OUT049 | Tier 1 | Medium | Supermarket Type1 | train |
| 3 | Regular | FDX07 | 182.0950 | 732.3800 | Fruits and Vegetables | 0.000000 | 19.20 | 1998 | OUT010 | Tier 3 | NaN | Grocery Store | train |
| 4 | Low Fat | NCD19 | 53.8614 | 994.7052 | Household | 0.000000 | 8.93 | 1987 | OUT013 | Tier 3 | High | Supermarket Type1 | train |
```python
data.describe()
```
| | Item_MRP | Item_Outlet_Sales | Item_Visibility | Item_Weight | Outlet_Establishment_Year |
|---|---|---|---|---|---|
| count | 14204.000000 | 8523.000000 | 14204.000000 | 11765.000000 | 14204.000000 |
| mean | 141.004977 | 2181.288914 | 0.065953 | 12.792854 | 1997.830681 |
| std | 62.086938 | 1706.499616 | 0.051459 | 4.652502 | 8.371664 |
| min | 31.290000 | 33.290000 | 0.000000 | 4.555000 | 1985.000000 |
| 25% | 94.012000 | 834.247400 | 0.027036 | 8.710000 | 1987.000000 |
| 50% | 142.247000 | 1794.331000 | 0.054021 | 12.600000 | 1999.000000 |
| 75% | 185.855600 | 3101.296400 | 0.094037 | 16.750000 | 2004.000000 |
| max | 266.888400 | 13086.964800 | 0.328391 | 21.350000 | 2009.000000 |
This step typically involves imputing missing values and treating outliers. Outlier treatment matters a great deal for linear regression techniques, but tree-based algorithms such as random forests are largely robust to outliers.
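Only imputation is performed below; if you do want to treat outliers, a common recipe is capping values at the 1.5×IQR fences. A sketch (illustrative, not part of this article's pipeline), using `Item_Visibility` as the example column:

```python
# Cap a numeric column at the 1.5*IQR fences (winsorization) -- illustrative only
q1, q3 = data['Item_Visibility'].quantile([0.25, 0.75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
data['Item_Visibility'] = data['Item_Visibility'].clip(lower, upper)
```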
```python
# Count missing values per column (equivalent to data.isnull().sum())
data.apply(lambda x: sum(x.isnull()))
```
```
Item_Fat_Content                0
Item_Identifier                 0
Item_MRP                        0
Item_Outlet_Sales            5681
Item_Type                       0
Item_Visibility                 0
Item_Weight                  2439
Outlet_Establishment_Year       0
Outlet_Identifier               0
Outlet_Location_Type            0
Outlet_Size                     0
Outlet_Type                     0
source                          0
dtype: int64
```
Two features have genuinely missing values: Item_Weight and Outlet_Size. (Item_Outlet_Sales is missing only for the test rows, since it is the target we are predicting.) Let's impute Item_Weight with the overall mean weight; a per-item average is a more granular alternative, sketched after the code below. This can be done as:
```python
# Item_Outlet_Sales is the target; it is missing only for the test rows.
# Filling it keeps the combined frame complete -- it is dropped from test later.
data.Item_Outlet_Sales = data.Item_Outlet_Sales.fillna(data.Item_Outlet_Sales.mean())

# Impute missing item weights with the overall mean weight
data.Item_Weight = data.Item_Weight.fillna(data.Item_Weight.mean())
```
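The per-item alternative, imputing each item's weight with that item's own average, could look like this (a sketch, with a global fallback for items whose weight is missing everywhere):

```python
# Per-item mean imputation -- an alternative to the overall mean used above
item_avg_weight = data.groupby('Item_Identifier')['Item_Weight'].transform('mean')
data.Item_Weight = data.Item_Weight.fillna(item_avg_weight)
# Items with no recorded weight anywhere still need the global fallback
data.Item_Weight = data.Item_Weight.fillna(data.Item_Weight.mean())
```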
```python
data['Outlet_Size'].value_counts()
```
```
Medium    4655
Small     3980
High      1553
Name: Outlet_Size, dtype: int64
```
```python
# 'Medium' is the most frequent size, so fill missing values with the mode
data.Outlet_Size = data.Outlet_Size.fillna('Medium')
```
```python
data.apply(lambda x: sum(x.isnull()))
```
```
Item_Fat_Content             0
Item_Identifier              0
Item_MRP                     0
Item_Outlet_Sales            0
Item_Type                    0
Item_Visibility              0
Item_Weight                  0
Outlet_Establishment_Year    0
Outlet_Identifier            0
Outlet_Location_Type         0
Outlet_Size                  0
Outlet_Type                  0
source                       0
dtype: int64
```
```python
data.info()
```
```
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 14204 entries, 0 to 14203
Data columns (total 13 columns):
Item_Fat_Content             14204 non-null object
Item_Identifier              14204 non-null object
Item_MRP                     14204 non-null float64
Item_Outlet_Sales            14204 non-null float64
Item_Type                    14204 non-null object
Item_Visibility              14204 non-null float64
Item_Weight                  14204 non-null float64
Outlet_Establishment_Year    14204 non-null int64
Outlet_Identifier            14204 non-null object
Outlet_Location_Type         14204 non-null object
Outlet_Size                  14204 non-null object
Outlet_Type                  14204 non-null object
source                       14204 non-null object
dtypes: float64(4), int64(1), object(8)
memory usage: 1.4+ MB
```
```python
# Inspect identifier frequencies (exploratory; the result is not stored)
data['Item_Identifier'].value_counts()

# The first two characters of Item_Identifier encode a broad category:
# FD = Food, DR = Drinks, NC = Non-Consumable
data['Item_Type_Combined'] = data['Item_Identifier'].apply(lambda x: x[0:2])
data['Item_Type_Combined'] = data['Item_Type_Combined'].map({'FD': 'Food',
                                                             'NC': 'Non-Consumable',
                                                             'DR': 'Drinks'})
data['Item_Type_Combined'].value_counts()
```
```
Food              10201
Non-Consumable     2686
Drinks             1317
Name: Item_Type_Combined, dtype: int64
```
Since scikit-learn models accept only numerical inputs, all categories of the nominal variables must be converted to numeric codes. I also wanted to use the outlet as a feature, so I created a new variable 'Outlet' as a copy of Outlet_Identifier and label-encoded it. Outlet_Identifier itself is left untouched because it is required in the submission file.
```python
# Import the encoder (pd.get_dummies handles the one-hot step later)
from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
# New 'Outlet' feature: a label-encoded copy of Outlet_Identifier
data['Outlet'] = le.fit_transform(data['Outlet_Identifier'])

var_mod = ['Item_Fat_Content', 'Outlet_Location_Type', 'Outlet_Size',
           'Item_Type_Combined', 'Outlet_Type', 'Outlet']
le = LabelEncoder()
for i in var_mod:
    data[i] = le.fit_transform(data[i])
```
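As an aside, `LabelEncoder` is primarily intended for target labels; for feature columns, scikit-learn also offers `OrdinalEncoder`, which encodes several columns in one call. An equivalent alternative to the loop above (a sketch, not an extra step to run on top of it):

```python
from sklearn.preprocessing import OrdinalEncoder

# Encodes all listed feature columns at once; replaces the LabelEncoder loop
oe = OrdinalEncoder()
data[var_mod] = oe.fit_transform(data[var_mod])
```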
One-hot encoding refers to creating dummy variables, one for each category of a categorical variable. For example, if Item_Fat_Content had the three categories 'Low Fat', 'Regular' and 'Non-Edible', one-hot encoding would remove the variable and generate three new binary variables, each 1 if its category is present and 0 otherwise. (In this dataset the raw column actually contains five distinct values, which is why five Item_Fat_Content dummies appear below.) This can be done with the `get_dummies` function of pandas.
```python
data = pd.get_dummies(data, columns=['Item_Fat_Content', 'Outlet_Location_Type',
                                     'Outlet_Size', 'Outlet_Type',
                                     'Item_Type_Combined', 'Outlet'])
```
```python
data.head()
```
| | Item_Identifier | Item_MRP | Item_Outlet_Sales | Item_Type | Item_Visibility | Item_Weight | Outlet_Establishment_Year | Outlet_Identifier | source | Item_Fat_Content_0 | ... | Outlet_0 | Outlet_1 | Outlet_2 | Outlet_3 | Outlet_4 | Outlet_5 | Outlet_6 | Outlet_7 | Outlet_8 | Outlet_9 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | FDA15 | 249.8092 | 3735.1380 | Dairy | 0.016047 | 9.30 | 1999 | OUT049 | train | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
| 1 | DRC01 | 48.2692 | 443.4228 | Soft Drinks | 0.019278 | 5.92 | 2009 | OUT018 | train | 0 | ... | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2 | FDN15 | 141.6180 | 2097.2700 | Meat | 0.016760 | 17.50 | 1999 | OUT049 | train | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
| 3 | FDX07 | 182.0950 | 732.3800 | Fruits and Vegetables | 0.000000 | 19.20 | 1998 | OUT010 | train | 0 | ... | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | NCD19 | 53.8614 | 994.7052 | Household | 0.000000 | 8.93 | 1987 | OUT013 | train | 0 | ... | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |

5 rows × 37 columns
```python
data.dtypes
```
```
Item_Identifier               object
Item_MRP                     float64
Item_Outlet_Sales            float64
Item_Type                     object
Item_Visibility              float64
Item_Weight                  float64
Outlet_Establishment_Year      int64
Outlet_Identifier             object
source                        object
Item_Fat_Content_0             uint8
Item_Fat_Content_1             uint8
Item_Fat_Content_2             uint8
Item_Fat_Content_3             uint8
Item_Fat_Content_4             uint8
Outlet_Location_Type_0         uint8
Outlet_Location_Type_1         uint8
Outlet_Location_Type_2         uint8
Outlet_Size_0                  uint8
Outlet_Size_1                  uint8
Outlet_Size_2                  uint8
Outlet_Type_0                  uint8
Outlet_Type_1                  uint8
Outlet_Type_2                  uint8
Outlet_Type_3                  uint8
Item_Type_Combined_0           uint8
Item_Type_Combined_1           uint8
Item_Type_Combined_2           uint8
Outlet_0                       uint8
Outlet_1                       uint8
Outlet_2                       uint8
Outlet_3                       uint8
Outlet_4                       uint8
Outlet_5                       uint8
Outlet_6                       uint8
Outlet_7                       uint8
Outlet_8                       uint8
Outlet_9                       uint8
dtype: object
```
The final step is to split the combined frame back into train and test data sets. It's generally a good idea to export both modified data sets so they can be reused across sessions. This can be achieved using the following code:
```python
import warnings
warnings.filterwarnings('ignore')

# Item_Type has been superseded by Item_Type_Combined; drop it along with the year
data.drop(['Item_Type', 'Outlet_Establishment_Year'], axis=1, inplace=True)
```
```python
# Split back into train and test; .copy() avoids SettingWithCopyWarning
train = data.loc[data['source'] == "train"].copy()
test = data.loc[data['source'] == "test"].copy()

# The test set has no target, and neither set needs the helper column
test.drop(['Item_Outlet_Sales', 'source'], axis=1, inplace=True)
train.drop(['source'], axis=1, inplace=True)

train.to_csv("train_modified.csv", index=False)
test.to_csv("test_modified.csv", index=False)
```
Let's start with a baseline model. A baseline requires no predictive modelling; it is simply an informed guess. For instance, in this case, we can predict every sale as the overall average sales. A sketch of that baseline follows; after it we reload the modified data sets:
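A minimal sketch of the mean-sales baseline, using the `train`/`test` frames from the previous step (the output file name is illustrative):

```python
# Predict the training-set mean for every test row -- no model involved
mean_sales = train['Item_Outlet_Sales'].mean()
baseline = test[['Item_Identifier', 'Outlet_Identifier']].copy()
baseline['Item_Outlet_Sales'] = mean_sales
baseline.to_csv('baseline_submission.csv', index=False)
```

Any real model should beat this baseline.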
```python
train2 = pd.read_csv("train_modified.csv")
test2 = pd.read_csv("test_modified.csv")
```
```python
train2.head()
```
| | Item_Identifier | Item_MRP | Item_Outlet_Sales | Item_Visibility | Item_Weight | Outlet_Identifier | Item_Fat_Content_0 | Item_Fat_Content_1 | Item_Fat_Content_2 | Item_Fat_Content_3 | ... | Outlet_0 | Outlet_1 | Outlet_2 | Outlet_3 | Outlet_4 | Outlet_5 | Outlet_6 | Outlet_7 | Outlet_8 | Outlet_9 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | FDA15 | 249.8092 | 3735.1380 | 0.016047 | 9.30 | OUT049 | 0 | 1 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
| 1 | DRC01 | 48.2692 | 443.4228 | 0.019278 | 5.92 | OUT018 | 0 | 0 | 1 | 0 | ... | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2 | FDN15 | 141.6180 | 2097.2700 | 0.016760 | 17.50 | OUT049 | 0 | 1 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
| 3 | FDX07 | 182.0950 | 732.3800 | 0.000000 | 19.20 | OUT010 | 0 | 0 | 1 | 0 | ... | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | NCD19 | 53.8614 | 994.7052 | 0.000000 | 8.93 | OUT013 | 0 | 1 | 0 | 0 | ... | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |

5 rows × 34 columns
```python
# Identifier columns are kept out of the feature matrix
X_train = train2.drop(['Item_Outlet_Sales', 'Outlet_Identifier', 'Item_Identifier'], axis=1)
y_train = train2.Item_Outlet_Sales
X_test = test2.drop(['Outlet_Identifier', 'Item_Identifier'], axis=1)
```
```python
X_train.head()
```
| | Item_MRP | Item_Visibility | Item_Weight | Item_Fat_Content_0 | Item_Fat_Content_1 | Item_Fat_Content_2 | Item_Fat_Content_3 | Item_Fat_Content_4 | Outlet_Location_Type_0 | Outlet_Location_Type_1 | ... | Outlet_0 | Outlet_1 | Outlet_2 | Outlet_3 | Outlet_4 | Outlet_5 | Outlet_6 | Outlet_7 | Outlet_8 | Outlet_9 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 249.8092 | 0.016047 | 9.30 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
| 1 | 48.2692 | 0.019278 | 5.92 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2 | 141.6180 | 0.016760 | 17.50 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
| 3 | 182.0950 | 0.000000 | 19.20 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | ... | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | 53.8614 | 0.000000 | 8.93 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |

5 rows × 31 columns
```python
y_train.head()
```
```
0    3735.1380
1     443.4228
2    2097.2700
3     732.3800
4     994.7052
Name: Item_Outlet_Sales, dtype: float64
```
```python
import warnings
warnings.filterwarnings('ignore')

# Regression metrics and cross-validation utilities
from sklearn.metrics import r2_score, mean_squared_error
from sklearn.model_selection import cross_val_score
from sklearn import metrics
```
```python
from sklearn.ensemble import RandomForestRegressor

# Shallow trees with large leaves to limit overfitting
regressor = RandomForestRegressor(n_estimators=100, max_depth=6,
                                  min_samples_leaf=50, n_jobs=4)
regressor.fit(X_train, y_train)
```
```
RandomForestRegressor(bootstrap=True, ccp_alpha=0.0, criterion='mse',
                      max_depth=6, max_features='auto', max_leaf_nodes=None,
                      max_samples=None, min_impurity_decrease=0.0,
                      min_impurity_split=None, min_samples_leaf=50,
                      min_samples_split=2, min_weight_fraction_leaf=0.0,
                      n_estimators=100, n_jobs=4, oob_score=False,
                      random_state=None, verbose=0, warm_start=False)
```
```python
y_pred = regressor.predict(X_test)
y_pred
```
```
array([1644.58503874, 1365.02660944,  607.55946827, ...,
       1967.37127117, 3679.8358505 , 1318.82000327])
```
```python
# For a regressor, .score() returns R^2 on the given data,
# not classification accuracy
rf_accuracy = round(regressor.score(X_train, y_train), 2)
rf_accuracy
```
```
0.61
```
```python
# The same R^2, computed explicitly on the training predictions
r2_score(y_train, regressor.predict(X_train))
```
```
0.6127340118877953
```
```python
import warnings
warnings.filterwarnings('ignore')

# cross_val_score uses R^2 by default for regressors, so these values
# are sqrt(R^2) per fold -- NOT RMSE (see the note below for a CV RMSE)
cv_score = cross_val_score(regressor, X_train, y_train, cv=5)
print(np.sqrt(np.abs(cv_score)))
```
```
[0.78025888 0.76655097 0.75912421 0.78069719 0.77997061]
```
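To obtain a cross-validated RMSE instead of sqrt(R²), pass an error-based scorer; a short sketch:

```python
# Negated MSE per fold; negate back and take the square root for RMSE
neg_mse = cross_val_score(regressor, X_train, y_train, cv=5,
                          scoring='neg_mean_squared_error')
print(np.sqrt(-neg_mse))
```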
print("RMSE : %.4g" % np.sqrt(metrics.mean_squared_error(y_train, regressor.predict(X_train))))
```
RMSE : 1062
```
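One standard way to tune the forest for better results, not shown in the original walkthrough, is a small grid search over depth and leaf size. A sketch with an illustrative parameter grid:

```python
from sklearn.model_selection import GridSearchCV

# Illustrative grid -- widen or refine it based on the CV results
param_grid = {'max_depth': [4, 6, 8],
              'min_samples_leaf': [20, 50, 100]}
grid = GridSearchCV(RandomForestRegressor(n_estimators=100, n_jobs=4),
                    param_grid, cv=5, scoring='neg_mean_squared_error')
grid.fit(X_train, y_train)
print(grid.best_params_, np.sqrt(-grid.best_score_))  # best settings and CV RMSE
```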
Since I'll be making many models, instead of repeating the code again and again, it is handy to define a generic function that takes an algorithm and the data as input, fits the model, performs cross-validation, and generates a submission file; a sketch of such a helper follows.
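A minimal sketch of that helper; the function name `modelfit`, its signature, and the CV scoring choice are illustrative, not part of the original article:

```python
def modelfit(alg, dtrain, dtest, predictors, target, id_cols, filename):
    """Fit alg, report train and cross-validated RMSE, and write a submission."""
    alg.fit(dtrain[predictors], dtrain[target])

    # Training RMSE (optimistic) and 5-fold cross-validated RMSE
    train_pred = alg.predict(dtrain[predictors])
    print("RMSE : %.4g" % np.sqrt(metrics.mean_squared_error(dtrain[target], train_pred)))
    cv = cross_val_score(alg, dtrain[predictors], dtrain[target],
                         cv=5, scoring='neg_mean_squared_error')
    print("CV RMSE : %.4g" % np.mean(np.sqrt(-cv)))

    # Predict on the test set and write the submission file
    sub = dtest[id_cols].copy()
    sub[target] = alg.predict(dtest[predictors])
    sub.to_csv(filename, index=False)
```

For the single random forest above, the submission can also be assembled directly: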
```python
submission = pd.DataFrame({'Item_Identifier': test2['Item_Identifier'],
                           'Outlet_Identifier': test2['Outlet_Identifier'],
                           'Item_Outlet_Sales': y_pred},
                          columns=['Item_Identifier', 'Outlet_Identifier', 'Item_Outlet_Sales'])
submission.to_csv('submission3.csv', index=False)
```
We started by forming hypotheses about the data without looking at it. Then we moved on to data exploration, where we found some nuances in the data that required remediation. Next, we performed data cleaning and feature engineering: we imputed missing values, fixed other irregularities, created new features, and made the data model-friendly with one-hot encoding. Finally, we built a random forest model, got a glimpse of how to tune it for better results, and reached an R² of about 0.61 on the training data for the sales predictions.