Wine Quality Prediction using Machine Learning
According to experts, wine quality is judged by its smell, flavor, and color, but most of us are not wine experts. This is where Machine Learning comes in. In this article, we will focus on predicting wine quality on the basis of given features. Quality checks matter as well: every industry needs to prove product quality to promote its products.
First, we import the necessary libraries for this model. NumPy will be used for numerical computations, pandas will be used to work with file formats like CSV, XLS, etc., and sklearn (scikit-learn) will be used to import our classifier for prediction. from sklearn.model_selection import train_test_split is used to split our dataset into training and testing data. from sklearn import preprocessing is used to preprocess the data before fitting it into the predictor, standardizing it into a form that Machine Learning algorithms work with easily. from sklearn import tree is used to import our decision tree classifier, which we will be using for prediction.
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn import preprocessing
from sklearn import tree
Now we read the CSV file named winequality-red.csv, which has the columns fixed acidity, volatile acidity, citric acid, residual sugar, chlorides, free sulfur dioxide, total sulfur dioxide, density, pH, sulphates, alcohol, and quality.
dataset_url = 'winequality-red.csv'
data = pd.read_csv(dataset_url, sep=';')
data.head()
data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1599 entries, 0 to 1598
Data columns (total 12 columns):
fixed acidity           1599 non-null float64
volatile acidity        1599 non-null float64
citric acid             1599 non-null float64
residual sugar          1599 non-null float64
chlorides               1599 non-null float64
free sulfur dioxide     1599 non-null float64
total sulfur dioxide    1599 non-null float64
density                 1599 non-null float64
pH                      1599 non-null float64
sulphates               1599 non-null float64
alcohol                 1599 non-null float64
quality                 1599 non-null int64
dtypes: float64(11), int64(1)
memory usage: 150.0 KB
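Before separating features and labels, it can help to glance at how the quality scores are distributed across the dataset; a quick, optional check:

# count how many wines carry each quality score
print(data['quality'].value_counts().sort_index())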
In every Machine Learning program there are two things: features and labels. Features are the part of a dataset used to predict the label, and labels, in turn, are mapped to features. After the model has been trained, we give it features so that it can predict the labels. So, if we analyse this dataset: since we have to predict the wine quality, the attribute quality becomes our label and the rest of the attributes become the features. We store quality in y, the common symbol used to represent labels in Machine Learning, then drop quality and store the remaining features in X, again the common symbol for features in ML.
y = data.quality
X = data.drop('quality', axis=1)
Next we split our dataset into test and train data; we will use the train data to train our model for predicting wine quality. We use the train_test_split() function that we imported from sklearn to split the data. Notice we pass test_size=0.2 to make the test data 20% of the original data; the remaining 80% is used for training.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
print(X_train.head())
      fixed acidity  volatile acidity  citric acid  residual sugar  chlorides  \
104             7.2             0.490         0.24             2.2      0.070
1273            7.5             0.580         0.20             2.0      0.073
253             7.7             0.775         0.42             1.9      0.092
944             8.3             0.300         0.49             3.8      0.090
358            11.9             0.430         0.66             3.1      0.109

      free sulfur dioxide  total sulfur dioxide  density    pH  sulphates  \
104                   5.0                  36.0  0.99600  3.33       0.48
1273                 34.0                  44.0  0.99494  3.10       0.43
253                   8.0                  86.0  0.99590  3.23       0.59
944                  11.0                  24.0  0.99498  3.27       0.64
358                  10.0                  23.0  1.00000  3.15       0.85

      alcohol
104       9.4
1273      9.3
253       9.5
944      12.1
358      10.4
Next comes data normalization, a part of pre-processing in which each feature is rescaled, here standardized to zero mean and unit variance. These are values that a Machine Learning algorithm can work with easily.
X_train_scaled = preprocessing.scale(X_train)
X_train_scaled
array([[-0.63399594, -0.21765447, -0.14303419, ...,  0.11677241, -1.06775661, -0.96550033],
       [-0.46219916,  0.28178151, -0.35003926, ..., -1.35247219, -1.37012138, -1.06017513],
       [-0.34766797,  1.36389278,  0.78848861, ..., -0.52202959, -0.40255412, -0.87082553],
       ...,
       [-0.86305831,  0.44826016, -1.17805953, ...,  0.69169421, -0.34208117, -0.87082553],
       [ 0.45405034, -0.16216158,  0.16747341, ...,  0.05289221,  0.32312132,  0.07592243],
       [-0.63399594, -0.93906198,  0.99549368, ...,  0.56393381,  1.16974266,  0.54929641]])
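A side note on preprocessing.scale(): it transforms only the array you hand it. If you were using a model that is sensitive to feature scale, the usual pattern is to fit a scaler on the training data and apply that same transformation to the test data; a minimal sketch using StandardScaler:

from sklearn.preprocessing import StandardScaler

# fit the scaler on the training features only, then reuse it on the
# test features, so no information from the test set leaks into training
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)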
Now the values of all the training attributes are standardized, and that is exactly what we were aiming for. (A decision tree is actually insensitive to feature scaling, so below we fit it on the unscaled X_train; the scaling above is shown as a general pre-processing step.) Time has now come for the most exciting step: training our algorithm so that it can predict the wine quality. We do so by creating a DecisionTreeClassifier() and using fit() to train it.
clf = tree.DecisionTreeClassifier()
clf.fit(X_train, y_train)
DecisionTreeClassifier(ccp_alpha=0.0, class_weight=None, criterion='gini',
                       max_depth=None, max_features=None, max_leaf_nodes=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, presort='deprecated',
                       random_state=None, splitter='best')
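The printout above is just the classifier echoing its default hyperparameters. With the defaults, the tree grows until every leaf is pure, which can overfit a small dataset; a minimal sketch capping the depth (max_depth=5 is an arbitrary, untuned value):

# a shallower tree often generalizes better on small tabular data;
# max_depth=5 here is illustrative, not tuned
clf_small = tree.DecisionTreeClassifier(max_depth=5)
clf_small.fit(X_train, y_train)
print(clf_small.score(X_test, y_test))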
Next we measure how well the model does on the test data using score(). This score can change from run to run depending on the size of your dataset and on the shuffling that happens when the data is divided into test and train, but you can usually expect a result within about ±5 percentage points of your first one.
confidence = clf.score(X_test, y_test)
print("\nThe confidence score:\n")
print(confidence)
The confidence score:

0.6125
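If you want the score to be repeatable between runs, you can pin the sources of randomness. A minimal sketch, using fresh variable names so the split used in the rest of this article is not disturbed (42 is an arbitrary seed):

from sklearn.model_selection import train_test_split
from sklearn import tree

# fixing random_state pins both the shuffle and the tree's tie-breaking,
# so the score comes out identical on every run
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=42)
clf_fixed = tree.DecisionTreeClassifier(random_state=42)
clf_fixed.fit(X_tr, y_tr)
print(clf_fixed.score(X_te, y_te))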
Our predicted values are stored in y_pred, but there are far too many of them to compare against the expected labels we stored in y_test all at once.
y_pred = clf.predict(X_test)
So we will just take the first five entries of both, print them, and compare them. We convert y_pred from a NumPy array to a list so that we can compare with ease. Then we print the first five elements of that list using a for loop. Finally, we print the first five values we were expecting, which are stored in y_test, using the head() function.
# converting the numpy array to a list
x = np.array(y_pred).tolist()

# printing first 5 predictions
print("\nThe prediction:\n")
for i in range(0, 5):
    print(x[i])

# printing first five expectations
print("\nThe expectation:\n")
print(y_test.head())
The prediction:

5
7
4
5
5

The expectation:

1522    5
875     7
747     5
401     6
1254    5
Name: quality, dtype: int64
Most of the values in the prediction match the expectations. On these five examples our predictor was wrong twice, predicting 4 where a 5 was expected and 5 where a 6 was expected, which works out to 60% accuracy on this tiny sample. Over the full test set the accuracy is the 0.6125 (61.25%) we measured above. Overall our predictor performs quite well: with six possible quality scores to choose from, an accuracy of about 61% is well above chance and a solid result for an untuned decision tree.
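If you want the accuracy over the whole test set rather than five hand-checked rows, scikit-learn's metrics module does the counting for you; a short sketch:

from sklearn.metrics import accuracy_score, classification_report

# fraction of test wines whose quality was predicted exactly right
# (this matches the number clf.score() reported above)
print(accuracy_score(y_test, y_pred))

# per-class precision and recall, useful because the quality
# classes in this dataset are imbalanced
print(classification_report(y_test, y_pred))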