How the random trees classification and regression algorithm works

Random trees is a decision-tree-based supervised machine learning method used by the Train Using AutoML tool. A single decision tree is overly sensitive to its training data. In this method, many decision trees are created and used together for prediction. Each tree generates its own prediction, which contributes to a majority vote (for classification) or an average (for regression) that produces the final prediction. The final prediction is therefore based not on a single tree but on the entire forest of decision trees. Using the entire forest helps avoid overfitting the model to the training dataset, as does using both a random subset of the training data and a random subset of the explanatory variables in each tree that constitutes the forest.
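The following is a minimal sketch of this voting behavior using scikit-learn's RandomForestClassifier and synthetic data (an assumption for illustration; the Train Using AutoML tool does not expose this API directly, and scikit-learn aggregates averaged class probabilities rather than a strict hard vote, though the two typically agree):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic two-class training data stands in for a real dataset.
X, y = make_classification(n_samples=200, n_features=8, random_state=0)
forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

sample = X[:1]
# Each tree in the forest casts its own prediction for the sample...
votes = [tree.predict(sample)[0] for tree in forest.estimators_]
# ...and the final prediction is the majority class across all trees.
majority = np.bincount(np.array(votes).astype(int)).argmax()
print(majority, forest.predict(sample)[0])  # the hard vote typically matches
```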

Bootstrapping is used to create a random subset of the training data for each tree. The subset is the same size as the original training data because the rows are selected randomly with replacement. This makes the model less sensitive to the original training data. The random selection of explanatory variables reduces the correlation between trees, which in turn reduces variance; this lower variance makes random trees more effective than a single decision tree. The combination of bootstrapping and aggregating the results is called bagging. To test the accuracy of each tree, the subset of data that was not selected for it (the out-of-bag data) is used. The method iterates over different settings to find the forest with the lowest out-of-bag error.
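The sketch below illustrates bootstrapping and an out-of-bag (OOB) accuracy check for one tree, assuming NumPy and scikit-learn are available (an illustration of the general technique, not the tool's internals):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=200, n_features=8, random_state=0)
rng = np.random.default_rng(0)
n = len(X)

# Draw n row indices with replacement: the bootstrap sample is the same
# size as the original data, but some rows repeat and others are left out.
boot = rng.integers(0, n, size=n)
oob = np.setdiff1d(np.arange(n), boot)  # rows never drawn: out-of-bag

# Each tree also considers only a random subset of explanatory variables
# at each split; max_features="sqrt" mimics that de-correlating step.
tree = DecisionTreeClassifier(max_features="sqrt", random_state=0)
tree.fit(X[boot], y[boot])

# The held-out OOB rows provide an accuracy estimate for this tree.
oob_accuracy = tree.score(X[oob], y[oob])
print(f"OOB rows: {len(oob)} of {n}; OOB accuracy: {oob_accuracy:.2f}")
```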

The example below shows the first five decision trees of a random trees model that classifies flowers based on their sepal and petal width and length.

Decision trees of random trees model example
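A similar example can be reproduced with scikit-learn's bundled Iris flower data and export_text (an assumption for illustration; the figure above comes from the tool itself):

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import export_text

iris = load_iris()  # sepal/petal length and width for three species
forest = RandomForestClassifier(n_estimators=100, random_state=0)
forest.fit(iris.data, iris.target)

# Print the structure of the first five trees in the forest,
# truncated to depth 2 for readability.
for i, tree in enumerate(forest.estimators_[:5]):
    print(f"Tree {i + 1}:")
    print(export_text(tree, feature_names=iris.feature_names, max_depth=2))
```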

Additional resources

Liaw, Andy, and Matthew Wiener. "Classification and regression by randomForest." R news 2, no. 3 (2002): 18-22.

Understanding Random Forest
