Understanding Random Forest Algorithm
Random Forest is a flexible, easy-to-use machine learning algorithm that often produces good results even without hyper-parameter tuning. Instead of relying on a single decision tree, the random forest takes the prediction from each tree and outputs the class that receives the majority of the votes.
Random forest is a supervised learning algorithm. The "forest" is an ensemble of decision trees, usually trained with the "bagging" method. The general idea of bagging is that combining several learning models improves the overall result. Random forest builds multiple decision trees and merges them together to get a more accurate and stable prediction.
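To make the bagging idea concrete, here is a minimal sketch that builds a small forest by hand. It is an illustration only, not how scikit-learn implements RandomForestClassifier internally. The synthetic data from make_classification, the choice of 25 trees, and all variable names are assumptions made for this example.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary classification data (an assumption, not the article's dataset).
X, y = make_classification(n_samples=300, n_features=6, random_state=0)

rng = np.random.default_rng(0)
n_trees = 25  # odd, so a 0/1 majority vote can never tie
trees = []
for i in range(n_trees):
    # Bagging: each tree is trained on a bootstrap sample (rows drawn with replacement).
    idx = rng.integers(0, len(X), size=len(X))
    tree = DecisionTreeClassifier(max_features="sqrt", random_state=i)
    tree.fit(X[idx], y[idx])
    trees.append(tree)

# One row of predictions per tree, then a majority vote per sample.
all_votes = np.array([tree.predict(X) for tree in trees])  # shape: (n_trees, n_samples)
majority = (all_votes.mean(axis=0) > 0.5).astype(int)

print(majority[:10])
```

Each tree sees a slightly different version of the data, so their individual mistakes differ, and the vote smooths them out.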
A large number of relatively uncorrelated models (trees) operating as a committee will typically outperform any of the individual constituent models.
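The committee effect can be checked with a little arithmetic. The sketch below assumes an idealized case that real forests only approximate: every model is correct with the same probability and the models are fully independent. The function name majority_vote_accuracy and the value 0.6 are made up for this illustration.

```python
from math import comb


def majority_vote_accuracy(n_models: int, p_correct: float) -> float:
    """Probability that a strict majority of n independent models is correct."""
    needed = n_models // 2 + 1  # smallest strict majority
    return sum(
        comb(n_models, k) * p_correct**k * (1 - p_correct) ** (n_models - k)
        for k in range(needed, n_models + 1)
    )


# Each individual model is right only 60% of the time (an assumed value).
for n in (1, 3, 11, 101):
    print(n, round(majority_vote_accuracy(n, 0.6), 3))
```

With three such models the vote is already right 64.8% of the time, and the figure keeps rising as models are added. Trees in a real forest are correlated to some degree, so the gain in practice is smaller than this idealized calculation suggests.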
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
A CSV (Comma-Separated Values) file is a plain text file that uses a specific structure to arrange tabular data. Here is the link to download the CSV file ( )
df = pd.read_csv('Social_Network_Ads.csv')
x = df.iloc[:, 2:4].values
y = df.iloc[:, 4].values
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.3)
output = RandomForestClassifier(n_estimators=100, oob_score=True, n_jobs=1, max_features="sqrt", random_state=0)
output.fit(x_train, y_train)
y_pred = output.predict(x_test)
print(y_pred)
[0 1 1 1 0 0 0 0 1 1 0 1 0 0 0 1 1 1 1 0 1 0 1 0 1 1 1 0 0 0 1 1 0 0 0 0 0 0 1 0 0 0 1 0 1 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 1 0 1 1 1 0 0 0 1 0 1 0 0 0 0 0 0 1 1 0 0 1 0 0 0 1 0 0 1 0 1 0 1 0 0 1 0 0 0 1 0 0 0 1 1 0 0 0 0 1 1 0 1 1 1 1 0 0 0 1]
n_estimators: This is the number of trees you want to build before taking the majority vote or the average of the predictions. A higher number of trees generally gives better and more stable performance, with diminishing returns, but makes training and prediction slower.
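To see the effect of the number of trees (the n_estimators argument), a rough sketch is to fit a few forests of different sizes and compare their test accuracy. The synthetic data and the sizes 1, 10 and 100 are assumptions. The exact scores depend on the data and the seed, so treat the printed numbers as indicative only.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (an assumption; swap in your own features and labels).
X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

scores = {}
for n_trees in (1, 10, 100):
    model = RandomForestClassifier(n_estimators=n_trees, random_state=0)
    model.fit(X_train, y_train)
    # estimators_ holds the fitted trees, one per requested estimator.
    assert len(model.estimators_) == n_trees
    scores[n_trees] = model.score(X_test, y_test)  # mean accuracy on the test set

print(scores)
```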
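max_features: This is the maximum number of features Random Forest is allowed to consider when looking for the best split in an individual tree. With "sqrt", it considers the square root of the total number of features.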
n_jobs: This parameter tells the engine how many processors it is allowed to use. A value of "-1" means it can use all available processors, whereas a value of "1" means it can only use one processor.
random_state: This parameter makes a solution easy to replicate. A fixed value of random_state will always produce the same results when given the same parameters and training data.
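Reproducibility with a fixed seed is easy to verify. The sketch below uses synthetic data and a hypothetical helper, fit_and_predict, to train the same forest twice with one seed and once with another.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data (an assumption, used only to keep the example self-contained).
X, y = make_classification(n_samples=300, n_features=8, random_state=0)


def fit_and_predict(seed):
    """Hypothetical helper: train a forest with the given seed, return class probabilities."""
    model = RandomForestClassifier(n_estimators=50, random_state=seed)
    return model.fit(X, y).predict_proba(X)


same_a = fit_and_predict(seed=0)
same_b = fit_and_predict(seed=0)
other = fit_and_predict(seed=1)

print(np.array_equal(same_a, same_b))  # same seed, same data: identical probabilities
print(np.array_equal(same_a, other))   # a different seed will usually give different values
```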
oob_score: This is a validation method built into random forest, known as out-of-bag scoring. It tags every observation used in the different trees. Then, for every observation, it finds the majority vote using only the trees that did not use that observation for training.
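In scikit-learn, the out-of-bag estimate becomes available after fitting when the classifier is created with oob_score=True. The following sketch reads it from the fitted model. The synthetic data is an assumption.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data (an assumption, not the article's dataset).
X, y = make_classification(n_samples=400, n_features=8, random_state=0)

model = RandomForestClassifier(n_estimators=100, oob_score=True, random_state=0)
model.fit(X, y)

# Accuracy estimated for each observation using only the trees that never saw it.
print(model.oob_score_)
# Per-observation class probabilities computed from those out-of-bag trees.
print(model.oob_decision_function_[:3])
```

Because each observation is scored only by trees that did not train on it, this gives a validation-style estimate without setting aside a separate hold-out set.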
Random forest is a great algorithm to train early in the model development process, to see how it performs.
The algorithm is also a great choice for anyone who needs to develop a model quickly. On top of that, it provides a pretty good indicator of the importance it assigns to your features.
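As a sketch of how those importances can be inspected in scikit-learn, the fitted classifier exposes a feature_importances_ array. The synthetic data below places its two informative features first, and the names feature_0 to feature_5 are hypothetical labels made up for this example.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data: with shuffle=False the 2 informative features come first,
# followed by 4 pure-noise features (all of this is an assumption).
X, y = make_classification(
    n_samples=400,
    n_features=6,
    n_informative=2,
    n_redundant=0,
    shuffle=False,
    random_state=0,
)
feature_names = [f"feature_{i}" for i in range(X.shape[1])]  # hypothetical names

model = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

# Importances are normalized to sum to 1; sort them from most to least important.
ranked = sorted(
    zip(feature_names, model.feature_importances_),
    key=lambda pair: pair[1],
    reverse=True,
)
for name, importance in ranked:
    print(f"{name}: {importance:.3f}")
```

Features with importances close to zero are candidates for removal, although it is worth confirming that on held-out data before dropping anything.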
Random forests are also very hard to beat performance-wise. Of course, you can probably always find a model that performs better, such as a neural network, but these usually take more time to develop. On top of that, random forests can handle a lot of different feature types, like binary, categorical and numerical.