The Data Science Lab

Decision Tree Regression from Scratch Using C#

double trainAcc = dt.Accuracy(trainX, trainY, 0.10);
Console.WriteLine("Train data accuracy = " +
  trainAcc.ToString("F4"));

double testAcc = dt.Accuracy(testX, testY, 0.10);
Console.WriteLine("Test data accuracy = " +
  testAcc.ToString("F4"));

The demo program scores 0.9000 accuracy on the training data (9 out of 10 correct) but only 0.2000 accuracy on the test data (1 out of 5 correct). Here, a prediction is counted as correct if it's within 10 percent of the true target income -- that's the 0.10 argument passed to the Accuracy() method. The poor test accuracy illustrates one of the biggest weaknesses of decision trees -- they often overfit severely on the training data and then predict poorly on new data. That said, every problem scenario is different, and many decision tree models predict very well.
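The Accuracy() method isn't presented in this article, but its logic is simple. Here is a minimal sketch of one way to implement it, assuming that a prediction is counted as correct when it's within pctClose of the true target value (the demo's actual implementation may differ):

// minimal sketch of an accuracy method for regression;
// a prediction counts as correct if it is within
// pctClose (e.g., 0.10 = 10%) of the true target value
public double Accuracy(double[][] dataX,
  double[] dataY, double pctClose)
{
  int numCorrect = 0; int numWrong = 0;
  for (int i = 0; i < dataX.Length; ++i)
  {
    double predY = this.Predict(dataX[i], verbose: false);
    if (Math.Abs(predY - dataY[i]) <
      Math.Abs(pctClose * dataY[i]))
      ++numCorrect;
    else
      ++numWrong;
  }
  return (numCorrect * 1.0) / (numCorrect + numWrong);
}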

Using the Decision Tree Model
After the decision tree model has been evaluated, the demo concludes by using the model to predict the income of a previously unseen person who is male, age 34, lives in Oklahoma, and is a political moderate:

double[] x = new double[] { 0, 0.34, 2, 1 };  // male, 34, Oklahoma, moderate
double predY = dt.Predict(x, verbose: true);
// Console.WriteLine("Predicted income = " +
// predY.ToString("F5"));  // 0.28600
Console.WriteLine("End demo ");
Console.ReadLine();  // keep shell open

Predictions must be made using the same encoding and normalization that were used when building the tree model. The verbose parameter of the Predict() method displays the logic that's used when computing the prediction. Interpretability is the biggest advantage of using decision tree models. When the verbose parameter is true, the Predict() method also displays a pseudo-code summary of the prediction rule: "IF (*) AND (column 1 < 0.36) AND (column 0 == 0) THEN Predicted Y = 0.28600." In words, "if anything (the root node condition), and age is less than 36, and sex is male, then the predicted income is $28,600."
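The tree traversal performed by Predict() is conceptually simple. Here is a minimal sketch using a hypothetical Node class; the demo's actual internal representation may differ, and this sketch handles only less-than splits, while the demo also emits equality comparisons for categorical columns such as sex:

// hypothetical node structure; not necessarily the
// demo's actual internal representation
public class Node
{
  public int column;        // which predictor to test
  public double splitValue; // threshold for the test
  public Node left;         // null for a leaf node
  public Node right;
  public double predictedY; // used only by leaf nodes
}

// minimal sketch of prediction by tree traversal;
// handles only less-than splits for simplicity
public double Predict(double[] x, bool verbose)
{
  Node curr = this.root;  // assumed root field
  while (curr.left != null)  // until a leaf is reached
  {
    if (verbose)
      Console.WriteLine("IF (column " + curr.column +
        " < " + curr.splitValue.ToString("F2") + ")");
    if (x[curr.column] < curr.splitValue)
      curr = curr.left;
    else
      curr = curr.right;
  }
  return curr.predictedY;  // leaf holds the prediction
}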

In a non-demo scenario with a realistic amount of training data, the decision tree prediction rule often repeats variables. For example, you might see something like, "IF column 1 less-than 0.55 AND column 3 not-equal 2 AND column 1 less-than 0.48 AND . . " This wonkiness is characteristic of most decision trees. It is possible to post-process decision tree prediction rules to remove unnecessary duplication of predictor variables, but the demo program, and most machine learning library implementations, do not do so by default.
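If rule readability matters for your scenario, one simple post-processing idea is to keep only the tightest threshold for each column. The Tighten() method below is hypothetical, not part of the demo program:

// hypothetical post-processing: for repeated less-than
// conditions on the same column, keep only the smallest
// (tightest) threshold
// requires using System.Collections.Generic
static Dictionary<int, double> Tighten(
  List<(int col, double thresh)> lessThanConds)
{
  var tightest = new Dictionary<int, double>();
  foreach (var c in lessThanConds)
    if (!tightest.ContainsKey(c.col) ||
        c.thresh < tightest[c.col])
      tightest[c.col] = c.thresh;
  return tightest;
}
// example: (col 1, 0.55) and (col 1, 0.48)
// collapse to the single condition (col 1, 0.48)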


Wrapping Up
Implementing decision tree regression from scratch is not simple, but by doing so you can modify the system to handle unusual problem scenarios, and you can easily integrate the prediction system into other systems.

The decision tree regression program presented in this article can be used as-is as a template for most regression problems. But you can also modify the demo program in many ways. One common modification is to pass column names to the DecisionTree object so that the prediction rule pseudo-code is easier to read. For example, instead of "IF column 1 less-than 0.36 AND column 0 is-equal 0 THEN . . " you could emit "IF age less-than 0.36 AND sex is-equal 0 THEN . . "
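Here is one minimal way to support column names. The colNames field and the constructor signature shown are hypothetical; the demo's actual constructor may differ:

// hypothetical modification: store column names so that
// verbose rule output is easier to read
public class DecisionTree
{
  private string[] colNames;  // e.g., "sex", "age", ...

  public DecisionTree(int maxDepth, string[] colNames)
  {
    this.colNames = colNames;
    // . . . existing tree initialization here . . .
  }

  private string ColName(int j)  // used when emitting rules
  {
    if (this.colNames == null)
      return "column " + j;  // fall back to column index
    return this.colNames[j];
  }
}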

There are several variations of decision trees that you might consider implementing. A technique called bootstrap aggregation (with the somewhat misleading shortcut name "bagging") creates multiple decision trees where each uses a different randomly selected subset of the rows of the training data, and then the results of the trees are combined to produce a single prediction.
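Here is a minimal sketch of the bagging idea for a single prediction. The DecisionTree constructor and the Train() method shown are assumptions for illustration, not necessarily the demo's exact API:

// minimal sketch of bootstrap aggregation ("bagging");
// the DecisionTree constructor and Train() method are
// assumed for illustration
public static double BaggedPredict(double[][] trainX,
  double[] trainY, double[] x, int numTrees, Random rnd)
{
  int n = trainX.Length;
  double sum = 0.0;
  for (int t = 0; t < numTrees; ++t)
  {
    // sample n training rows with replacement
    double[][] sampX = new double[n][];
    double[] sampY = new double[n];
    for (int i = 0; i < n; ++i)
    {
      int r = rnd.Next(0, n);
      sampX[i] = trainX[r];
      sampY[i] = trainY[r];
    }
    DecisionTree dt = new DecisionTree(6);  // assumed ctor
    dt.Train(sampX, sampY);  // assumed training method
    sum += dt.Predict(x, verbose: false);
  }
  return sum / numTrees;  // average of tree predictions
}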

A random forest creates multiple decision trees, each using a different randomly selected subset of predictor variables/columns, and then the results are combined.
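A key building block for a random forest is selecting a random subset of column indices for each tree. Here is a minimal sketch of one way to do that; each tree would then be trained using only the selected columns:

// minimal sketch: pick numPick distinct column indices
// at random using a partial Fisher-Yates shuffle
static int[] RandomColumns(int numCols, int numPick,
  Random rnd)
{
  int[] indices = new int[numCols];
  for (int j = 0; j < numCols; ++j)
    indices[j] = j;
  for (int j = 0; j < numPick; ++j)
  {
    int r = rnd.Next(j, numCols);  // swap in a tail item
    int tmp = indices[r];
    indices[r] = indices[j];
    indices[j] = tmp;
  }
  int[] result = new int[numPick];
  Array.Copy(indices, result, numPick);
  return result;
}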

In boosted trees ("boosting"), you create multiple decision trees, as with a random forest, except that each new tree is constructed based on how well the previously built trees perform. The algorithm is fairly complex. A specific type of boosting is named AdaBoost ("adaptive boosting"), and because it's so common, its name is often used as a synonym for boosting in general.

Gradient boosting is a generalization of regular boosting. A common, specific type of gradient boosting is named XGBoost ("extreme gradient boosting"). The implementation of XGBoost is very complex -- so much so that only one open source version, developed over the course of several years, is used in practice.
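The core idea behind gradient boosting for regression with squared-error loss is easier to see than a full XGBoost implementation: each new tree is trained to predict the residuals (errors) of the ensemble built so far. Here is a minimal sketch; the DecisionTree constructor, the Train() method, and the learning rate usage are assumptions for illustration:

// minimal sketch of gradient boosting for regression
// with squared-error loss; the DecisionTree constructor
// and Train() method are assumed for illustration
// requires using System.Collections.Generic
public static List<DecisionTree> BoostedTrain(
  double[][] trainX, double[] trainY,
  int numTrees, double learnRate)
{
  int n = trainX.Length;
  double[] residuals = new double[n];
  double[] currPreds = new double[n];  // all 0.0 to start
  List<DecisionTree> trees = new List<DecisionTree>();
  for (int t = 0; t < numTrees; ++t)
  {
    for (int i = 0; i < n; ++i)
      residuals[i] = trainY[i] - currPreds[i];
    DecisionTree dt = new DecisionTree(3);  // shallow tree
    dt.Train(trainX, residuals);  // fit the residuals
    trees.Add(dt);
    for (int i = 0; i < n; ++i)
      currPreds[i] += learnRate *
        dt.Predict(trainX[i], verbose: false);
  }
  return trees;  // predict: sum learnRate * Predict(x)
}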


About the Author

Dr. James McCaffrey works for Microsoft Research in Redmond, Wash. He has worked on several Microsoft products including Azure and Bing. James can be reached at [email protected].
