
Understanding Random Forest Algorithm ¶

What is the Random Forest algorithm? ¶

Random Forest is a flexible, easy-to-use machine learning algorithm that gives good results most of the time, even without hyper-parameter tuning. Instead of relying on one decision tree, a random forest takes the prediction from each tree and outputs the class that receives the majority of votes.

How does the Random Forest algorithm work? ¶

Random forest is a supervised learning algorithm. The "forest" is an ensemble of decision trees, usually trained with the “bagging” method: each tree is fit on a random sample of the training data drawn with replacement. The general idea of bagging is that combining several learning models improves the overall result. Random forest builds multiple decision trees and merges their predictions to get a more accurate and stable prediction.
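
To make the bagging idea concrete, here is a minimal hand-rolled sketch, not the library's internal implementation. It assumes synthetic data from `make_classification` and an arbitrary choice of 25 trees; each tree is trained on a bootstrap sample and the ensemble prediction is a simple majority vote.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary-classification data (an assumption, for illustration only).
X, y = make_classification(n_samples=300, n_features=6, random_state=0)

rng = np.random.default_rng(0)
trees = []
for i in range(25):  # odd number of trees, so a binary vote cannot tie
    # Bootstrap sample: draw as many rows as the dataset has, with replacement.
    idx = rng.integers(0, len(X), size=len(X))
    tree = DecisionTreeClassifier(max_features="sqrt", random_state=i)
    trees.append(tree.fit(X[idx], y[idx]))

# Every tree votes on every row; the majority class is the ensemble prediction.
votes = np.array([t.predict(X) for t in trees])  # shape: (n_trees, n_samples)
majority = (votes.mean(axis=0) > 0.5).astype(int)
print(majority[:10])
```

Note that scikit-learn's `RandomForestClassifier` combines trees by averaging their predicted class probabilities rather than counting hard votes, so this sketch illustrates the idea and is not a drop-in replacement.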

A large number of relatively uncorrelated models (trees) operating as a committee will outperform any of the individual constituent models.
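
The committee effect can be checked with a little arithmetic. The sketch below assumes an idealized case that real trees only approximate: every voter is right with the same probability `p` and the voters are fully independent. The helper name `majority_correct_prob` is hypothetical.

```python
from math import comb

def majority_correct_prob(n_voters: int, p: float) -> float:
    """Probability that more than half of n independent voters are right."""
    need = n_voters // 2 + 1  # smallest number of correct votes that wins
    return sum(
        comb(n_voters, k) * p**k * (1 - p) ** (n_voters - k)
        for k in range(need, n_voters + 1)
    )

# Each voter alone is right 60% of the time; watch the committee improve.
for n in (1, 11, 101):
    print(n, round(majority_correct_prob(n, 0.6), 3))
```

The gain shrinks as the voters become correlated, which is why random forests randomize both the training rows and the features considered by each tree.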

Random Forest Algorithm ¶

Step-wise representation of how the random forest algorithm works ¶

1. Importing Different Libraries ¶

In [1]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier

2. Read data from csv file ¶

A CSV (Comma Separated Values) file is a type of plain text file that uses specific structuring to arrange tabular data. Here is the link to download csv file ( )

In [3]:
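
A minimal sketch of this step, under stated assumptions: the file name and column names are not given here, so the example reads a small in-memory stand-in with invented columns. With a real download, the call would be `pd.read_csv` with the path to that file. The five-column layout is chosen only to match the later steps, which use the columns at positions 2 and 3 as features and position 4 as the target.

```python
import io
import pandas as pd

# Hypothetical stand-in for the downloaded CSV; column names are invented.
csv_text = """id,group,feature_a,feature_b,label
1,a,19,19000,0
2,b,35,20000,0
3,a,47,25000,1
4,b,52,90000,1
"""

# pd.read_csv accepts a file path or any file-like object.
df = pd.read_csv(io.StringIO(csv_text))
print(df.head())
print(df.shape)
```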

3. After reading the CSV file, split the data into input features and target ¶

In [10]:
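# Input features: columns at positions 2 and 3 (the slice 2:4 excludes position 4)
x = df.iloc[:, 2:4].values
# Target: column at position 4
y = df.iloc[:, 4].values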

4. Split the data into training sets (x_train, y_train) and testing sets (x_test, y_test), with a test size of 0.3 ¶

In [5]:
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.3)

5. Fit the Random Forest Classifier on the training data with chosen hyperparameters and predict the output ¶

In [6]:
output = RandomForestClassifier(n_estimators=100, oob_score=True, n_jobs=1, max_features="sqrt",
                                random_state=0)
output.fit(x_train, y_train)

y_pred = output.predict(x_test)
print(y_pred)
[0 1 1 1 0 0 0 0 1 1 0 1 0 0 0 1 1 1 1 0 1 0 1 0 1 1 1 0 0 0 1 1 0 0 0 0 0
0 1 0 0 0 1 0 1 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 1 0 1 1 1 0 0 0 1 0 1 0 0 0
0 0 0 1 1 0 0 1 0 0 0 1 0 0 1 0 1 0 1 0 0 1 0 0 0 1 0 0 0 1 1 0 0 0 0 1 1
0 1 1 1 1 0 0 0 1]
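
A raw array of predictions is hard to judge by eye, so a natural follow-up is to compare it with the held-out labels. The sketch below is self-contained and assumes synthetic two-feature data from `make_classification` in place of the CSV; `accuracy_score` and `confusion_matrix` come from `sklearn.metrics`.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the tutorial's data (an assumption).
X, y = make_classification(n_samples=400, n_features=2, n_informative=2,
                           n_redundant=0, random_state=0)
x_train, x_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

model = RandomForestClassifier(n_estimators=100, random_state=0)
model.fit(x_train, y_train)
y_pred = model.predict(x_test)

# Fraction of test rows predicted correctly.
acc = accuracy_score(y_test, y_pred)
# Rows are true classes, columns are predicted classes.
cm = confusion_matrix(y_test, y_pred)
print(acc)
print(cm)
```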

Hyperparameters used:¶

*n_estimators : ¶

This is the number of trees you want to build before taking the majority vote or the average of predictions. A higher number of trees generally gives more stable performance, with diminishing returns, but makes your code slower.
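
One way to see this trade-off is to fit forests of different sizes and compare test accuracy. This is a sketch on synthetic data (an assumption), and the tree counts 5, 50 and 200 are arbitrary; the exact scores depend on the data and the seed.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic data, for illustration only.
X, y = make_classification(n_samples=500, n_features=10, random_state=0)
x_train, x_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

scores = {}
for n_trees in (5, 50, 200):
    forest = RandomForestClassifier(n_estimators=n_trees, random_state=0)
    forest.fit(x_train, y_train)
    # estimators_ holds the fitted trees, one per requested estimator.
    scores[n_trees] = forest.score(x_test, y_test)
    print(n_trees, len(forest.estimators_), round(scores[n_trees], 3))
```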

*max_features:¶

This is the maximum number of features Random Forest considers when looking for the best split at each node of a tree. With "sqrt", that is the square root of the total number of features.
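
The resolved value can be inspected on a fitted tree. The sketch below assumes synthetic data with 16 features, so that "sqrt" should resolve to 4; `max_features_` is the attribute each fitted scikit-learn decision tree uses to record the inferred number.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# 16 features, chosen so the square root is a whole number (an assumption).
X, y = make_classification(n_samples=200, n_features=16, n_informative=4,
                           random_state=0)

forest = RandomForestClassifier(n_estimators=10, max_features="sqrt",
                                random_state=0).fit(X, y)

# Number of features each tree examines per split.
per_split = forest.estimators_[0].max_features_
print(per_split)
```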

*n_jobs :¶

This parameter tells the engine how many processors it is allowed to use. A value of “-1” means it may use all available processors, whereas a value of “1” means it can use only one processor.

*random_state : ¶

This parameter makes a result easy to reproduce. A fixed value of random_state will always produce the same results when given the same parameters and training data.
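
A quick reproducibility check, sketched on synthetic data (an assumption): two forests built with the same `random_state`, parameters and training data should give identical predictions.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data, for illustration only.
X, y = make_classification(n_samples=300, n_features=8, random_state=0)

def fit_predict(seed: int) -> np.ndarray:
    """Train a small forest with the given seed and predict on the same rows."""
    forest = RandomForestClassifier(n_estimators=20, random_state=seed)
    return forest.fit(X, y).predict(X)

pred_a = fit_predict(0)
pred_b = fit_predict(0)  # same seed, same data, same parameters
same = np.array_equal(pred_a, pred_b)
print(same)
```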

*oob_score :¶

This is a validation method. Each tree is trained on a bootstrap sample, so some observations are left out of that tree's training data. For every observation, the forest aggregates the predictions of only those trees that did not use that observation during training, and the accuracy of these predictions is the out-of-bag score.
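
With `oob_score=True`, scikit-learn exposes the result as the `oob_score_` attribute after fitting, alongside the per-row class probabilities in `oob_decision_function_`. A minimal sketch, assuming synthetic data:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data, for illustration only.
X, y = make_classification(n_samples=300, n_features=8, random_state=0)

# Bootstrap sampling (the default) is required for out-of-bag scoring.
forest = RandomForestClassifier(n_estimators=100, oob_score=True,
                                random_state=0).fit(X, y)

# Accuracy estimated only from trees that never saw each row.
print(forest.oob_score_)
# One row of class probabilities per training observation.
print(forest.oob_decision_function_.shape)
```

Because no separate validation split is needed, this estimate is useful when data is scarce.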

What do we need in order for our random forest to make accurate predictions? ¶

1. We need features that have at least some predictive power.
2. The trees of the forest and more importantly their predictions need to be uncorrelated (or at least have low correlations with each other).

Conclusion¶

Random forest is a great algorithm to train early in the model development process, to see how it performs.

The algorithm is also a great choice for anyone who needs to develop a model quickly. On top of that, it provides a pretty good indicator of the importance it assigns to your features.
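
In scikit-learn, these importances are available on a fitted forest as `feature_importances_`. The sketch below assumes synthetic data and invented feature names; the scores are impurity-based and sum to 1.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# shuffle=False keeps the two informative features in the first two columns.
X, y = make_classification(n_samples=400, n_features=6, n_informative=2,
                           n_redundant=0, shuffle=False, random_state=0)
names = [f"feature_{i}" for i in range(X.shape[1])]  # hypothetical names

forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)
importances = forest.feature_importances_

# Print features from most to least important.
for name, score in sorted(zip(names, importances), key=lambda p: -p[1]):
    print(f"{name}: {score:.3f}")
```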

Random forests are also very hard to beat performance-wise. Of course, you can probably always find a model that performs better, such as a neural network, but these usually take more time to develop. Random forests can also handle a lot of different feature types, like binary, categorical and numerical.