The Data Science Lab

Multi-Class Classification Using LightGBM

Creating and Training the LightGBM Model
The demo program creates and trains a LightGBM multi-class classifier using these statements:

  # 2. create and train model
  print("Creating and training LGBM multi-class model ")
  params = {
    # 'objective': 'multiclass',  # not needed
    'boosting_type': 'gbdt',  # default
    'num_leaves': 31,  # default
    'max_depth': -1,  # default (unlimited) 
    'n_estimators': 50,  # default = 100
    'learning_rate': 0.05,  # default = 0.10
    'min_data_in_leaf': 5,  # default = 20
    'random_state': 0,
    'verbosity': -1  # only fatal. default = 1 error, warn
  }
  model = lgbm.LGBMClassifier(**params)
  model.fit(train_x, train_y)

The classifier object is named model and is instantiated by setting up its parameters as a Python Dictionary collection named params. The main challenge when using LightGBM is wading through the dozens of parameters. The LGBMClassifier class/object has 19 parameters (num_leaves, max_depth and so on) and behind the scenes there are 57 Learning Control Parameters (min_data_in_leaf, bagging_fraction and so on), for a total of 76 parameters to deal with.

Documentation for the parameters can be found in the LightGBM documentation, on the Parameters page and the Python API reference page.

Because the number of parameters is too large to explore exhaustively, in practice you rely on the default values and then experiment with the handful of parameters that are most likely to produce a good model. Based on my experience, the three most important parameters to explore and modify are n_estimators, min_data_in_leaf and learning_rate.
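
For example, one crude but effective way to explore those three parameters is a small manual grid search. The following sketch assumes the demo's train_x, train_y, test_x and test_y arrays and an accuracy() helper function (described later in this article); the specific candidate values are arbitrary:

  # hypothetical manual search over the three key parameters
  best_acc = 0.0
  best_combo = None
  for n_est in [50, 100, 200]:
    for min_leaf in [2, 5, 10, 20]:
      for lr in [0.01, 0.05, 0.10]:
        m = lgbm.LGBMClassifier(n_estimators=n_est,
          min_data_in_leaf=min_leaf, learning_rate=lr,
          random_state=0, verbosity=-1)
        m.fit(train_x, train_y)
        acc = accuracy(m, test_x, test_y)  # assumed helper
        if acc > best_acc:
          best_acc = acc
          best_combo = (n_est, min_leaf, lr)
  print("best accuracy = %0.4f using %s" % (best_acc, str(best_combo)))

In a non-demo scenario you would normally evaluate candidate parameter values on a separate validation set, or use cross-validation, rather than tuning against the test data.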

A LightGBM classifier is made up of n_estimators (default value is 100) relatively small decision trees that are called weak learners, or sometimes base learners. The weak trees are constructed sequentially, where each tree uses the gradients of the error from the previous tree. If the value of n_estimators is too small, then there aren't enough weak learners to create a model that predicts well (underfitting). If the value of n_estimators is too large, then the model will overfit the training data and predict poorly on new, previously unseen data items.


The num_leaves parameter controls the overall size of the weak learner trees by limiting the number of terminal (leaf) nodes in each tree. The default value of 31 is just under 2^5 = 32, the number of leaf nodes in a balanced binary tree that has five levels of splits. An unbalanced tree might have more levels. Weak learners that are too small might underfit; weak learners that are too large might overfit.

The max_depth parameter controls the number of levels that each weak learner has. The default value is -1, which means that there is no explicit limit. In most cases, the num_leaves parameter will prevent the depth of the weak learners from becoming too large.

The min_data_in_leaf parameter controls the size of the leaf nodes in the weak learners. The default value of 20 means that each leaf node must have at least 20 associated data items. For a relatively small set of training data, the default greatly reduces the number of leaf nodes. For the demo with 200 training items, there would be a maximum of 200 / 20 = 10 leaf nodes, which would likely underfit the model and lead to poor prediction accuracy. The demo modifies the value of min_data_in_leaf from 20 to 5, which gives much better results.

To recap, the n_estimators parameter controls the overall number of weak tree learners. The key parameters that control the size and shape of the weak learners are num_leaves, max_depth and min_data_in_leaf. Based on my experience, I typically experiment with n_estimators (the default value of 100 is often too large for small datasets) and min_data_in_leaf (the default of 20 is often too large for small datasets). I usually leave num_leaves and max_depth at their default values of 31 and -1 (unlimited) respectively, unless the model just doesn't predict well.

The demo modifies the learning_rate parameter from the default value of 0.10 to 0.05. The learning rate controls how much each weak learner tree changes from the previous learner. The effect of changing the learning_rate can vary quite a bit depending on the size and shape of the weak learners, but as a rule of thumb, smaller values work better for smaller datasets.

The demo modifies the value of the random_state parameter from its default value of None (Python's version of null) to 0. The None value means that results are not reproducible due to the random initialization component of the training process. Any value other than None will give (mostly) reproducible results, subject to multi-threading issues.

The demo modifies the value of the verbosity parameter from its default value of 1 to -1. The default value of 1 prints warning messages, regular error messages and fatal error messages. The demo value of -1 prints only fatal error messages. I did this only to keep the output small so I could take a screenshot. In a non-demo scenario you should leave the verbosity value at 1 in most situations.

After setting up the parameter values in a Dictionary collection, they are passed to the LGBMClassifier constructor using the Python ** syntax, which unpacks the dictionary entries into named parameter arguments. Parameter values can be passed directly, for example model = lgbm.LGBMClassifier(n_estimators=50, learning_rate=0.05, and so on), but because there are so many parameters, this approach is rarely used.

The model is trained using the fit() method. This step is almost too easy because all the work is done when setting up the parameters.

Evaluating the Model
It's possible to evaluate a trained LightGBM multi-class model in several ways. The most basic approach is to compute prediction accuracy (the number of correct predictions divided by the total number of predictions) on the training and test data. The demo program defines an accuracy() function where the key statements are:

preds = model.predict(data_x)  # all predicted values
n_correct = np.sum(preds == data_y)
result = n_correct / len(data_x)
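
Wrapped into a complete function, the accuracy computation might look like this (a sketch consistent with the key statements above; the demo's exact implementation may differ slightly):

def accuracy(model, data_x, data_y):
  # fraction of correct class predictions, computed set-style
  preds = model.predict(data_x)  # all predicted classes
  n_correct = np.sum(preds == data_y)
  return n_correct / len(data_x)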

The output of the simple accuracy() function is:

Accuracy on training data = 0.9750
Accuracy on test data = 0.8250

This set-based approach is fast, but iterating through each data item instead allows you to see exactly which items are incorrectly predicted.

The demo defines a show_accuracy() function that gives more detailed information. The output of the show_accuracy() function is:

Accuracy on test data:
Overall accuracy =   0.8250
class 0 :  ct =  11  correct =   8  wrong =   3  acc =  0.7273
class 1 :  ct =  14  correct =  13  wrong =   1  acc =  0.9286
class 2 :  ct =  15  correct =  12  wrong =   3  acc =  0.8000
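
The demo's show_accuracy() implementation isn't listed here, but a function that produces this kind of per-class breakdown can be sketched roughly like so (an illustration, not the demo's exact code):

def show_accuracy(model, data_x, data_y, n_classes):
  # per-class counts of correct and wrong predictions
  preds = model.predict(data_x)
  overall = np.sum(preds == data_y) / len(data_x)
  print("Overall accuracy = %8.4f" % overall)
  for c in range(n_classes):
    idxs = np.where(data_y == c)[0]  # items whose true class is c
    ct = len(idxs)
    correct = np.sum(preds[idxs] == c)
    wrong = ct - correct
    acc = correct / ct if ct > 0 else 0.0
    print("class %d :  ct = %3d  correct = %3d  wrong = %3d  acc = %7.4f" % \
      (c, ct, correct, wrong, acc))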

For multi-class classification problems, it's more or less standard practice to compute and display a confusion matrix that shows where incorrect predictions have been made. The demo defines a confusion_matrix_multi() function that computes a confusion matrix and a show_confusion() function that displays the matrix. The demo output is:

Confusion matrix for test data:
actual     0:  8  2  1
actual     1:  1 13  0
actual     2:  2  1 12
------------
predicted      0  1  2

The entries off the main diagonal are incorrect predictions. For example, the 2 value at row [0] and column [1] indicates that there were 2 data items that have true political leaning = 0 (conservative) but were predicted to be class 1 (moderate).
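
The confusion_matrix_multi() function can be sketched as follows (an illustration of the core computation; the demo's show_confusion() function adds the "actual" and "predicted" labels when displaying the matrix):

def confusion_matrix_multi(model, data_x, data_y, n_classes):
  # cm[actual][predicted] = count of data items
  cm = np.zeros((n_classes, n_classes), dtype=np.int64)
  preds = model.predict(data_x)
  for i in range(len(data_x)):
    cm[int(data_y[i])][int(preds[i])] += 1
  return cm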

The LightGBM Python API is integrated with the scikit-learn machine learning package. The scikit-learn package is included in the Anaconda distribution and so you can directly call built-in scikit-learn modules. For example, you can get a confusion matrix using these statements:

from sklearn.metrics import confusion_matrix
pred_y = model.predict(test_x)  # all predicteds
cm = confusion_matrix(test_y, pred_y)
print(cm)

Or you can get detailed evaluation (a bit too detailed in my opinion) like so:

from sklearn.metrics import classification_report
pred_y = model.predict(test_x)  # all predicteds
report = classification_report(test_y, pred_y,
  labels=[0, 1, 2])
print(report)

The ability to use built-in scikit-learn library modules is powerful but has a bit of a learning curve if you're not familiar with the library.

Using and Saving the LightGBM Model
Using a trained LightGBM classifier is simple, subject to two minor syntax details. Example calling code is:

x = np.array([[0, 35, 2, 55000.00]], dtype=np.float64)  # 2D
pred = model.predict(x)
print("Predicted politics = " + str(pred[0]))

Notice that the input x values have double square brackets to make the input a 2D matrix, which the predict() method requires. Alternatively, you can declare a 1D vector and then reshape it to 2D:

x = np.array([0, 35, 2, 55000.00], dtype=np.float64)  # 1D
x = x.reshape(1, -1)  # 1 row, n cols 2D
pred = model.predict(x)

The return value from the predict() method is an array rather than a scalar value. So, when the input is a single data item and you want just the single predicted class, you can access the class at index [0] like so:

pred = model.predict(x)
print("Predicted politics = " + str(pred[0]))

Alternatively:

pred = model.predict(x)  # array
pred = pred[0]           # scalar
print("Predicted politics = " + str(pred))

The demo program saves the trained LightGBM model in binary format using the Python pickle library. (In ordinary English, the word "pickle" means to preserve.) The calling code is:

import pickle
print("Saving model ")
pth = ".\\Models\\politics_model.pkl"
with open(pth, "wb") as f:
  pickle.dump(model, f)

The code assumes the existence of a subdirectory named Models. The "wb" argument means "write to file as binary." The "pkl" extension is common, but any extension name can be used.

A LightGBM model saved using pickle can be loaded into memory from another program and used like so:

pth = ".\\Models\\politics_model.pkl"
with open(pth, "rb") as f:
  model2 = pickle.load(f)
x = np.array([[0, 35, 2, 55000.00]], dtype=np.float64)
pred = model2.predict(x)

There are other ways to save a trained LightGBM model, but the pickle approach is the easiest and the most common.
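
For example, because the demo model is a scikit-learn style wrapper object, the underlying Booster can be saved in LightGBM's native text format, or the whole wrapper object can be saved with the joblib library. A brief sketch (the file names are arbitrary):

# save just the underlying Booster in LightGBM native text format
model.booster_.save_model(".\\Models\\politics_model.txt")

# or save the entire wrapper object using the joblib library
import joblib
joblib.dump(model, ".\\Models\\politics_model.joblib")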

Wrapping Up
The LightGBM system was inspired by the XGBoost (extreme gradient boosting) system, which in turn was inspired by earlier tree boosting algorithms. The "boosting" term of the LightGBM name refers to the technique of combining several weak learners into one strong learning model. The "gradient" term refers to the technique of using the Calculus gradient of the error of a weak learner to construct the next weak learner in the model sequence. The "machine" term is an old way to indicate that a system is a machine learning one rather than a classical statistics one.

Arguably, the two most powerful techniques for multi-class classification on non-trivial datasets are neural networks and tree boosting. In some recent multi-class classification challenges, LightGBM entries have dominated the contest leader board. This may be due, in part, to the fact that LightGBM can be used out-of-the-box, which leaves a lot of time for hyperparameter fine-tuning. Using a neural network classifier requires significantly more background knowledge and effort.


About the Author

Dr. James McCaffrey works for Microsoft Research in Redmond, Wash. He has worked on several Microsoft products including Azure and Bing. James can be reached at [email protected].
