The Data Science Lab
Binary Classification Using a scikit Neural Network
When working with scikit, you'll spend most of your time reading the documentation and trying to figure out what each parameter does. The MLPClassifier class is especially complex because many of the parameters interact with each other.
Your first parameter decision is the solver to use for training the network. Your choices are 'adam', 'sgd', or 'lbfgs'. I recommend 'sgd' for most problems, even though 'adam' is the default. The 'adam' solver is essentially a sophisticated version of 'sgd'. The 'lbfgs' solver works in a completely different way from 'adam' and 'sgd'.
Your next parameter decision is the number of hidden layers and the number of processing nodes in each layer. The demo uses two hidden layers with 10 nodes each. More layers and more nodes are not always better, so you must experiment. The default is one hidden layer with 100 nodes.
Your next decision is hidden node activation. Your choices are 'identity', 'logistic', 'tanh' and 'relu'. If you use the 'sgd' solver, I suggest using 'tanh' activation. If you use the 'adam' solver, I suggest using 'relu' activation. The 'identity' and 'logistic' hidden node activations are rarely used.
Your next set of decisions is related to the learning rate used during training. The demo uses the 'constant' rate type. Alternatives are 'invscaling' and 'adaptive'. These are very complicated and I don't recommend using them. If you use a 'constant' learning rate type, you specify that rate using the learning_rate_init parameter. This value often has a huge effect on the performance of the resulting neural network model. Typical values to experiment with are 0.001, 0.01, 0.05 and 0.10.
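One way to run that experiment is a simple loop over candidate rates. The following is a minimal sketch, not the demo's code: the synthetic data, the max_iter value and the random_state seed are all assumptions chosen only to make the example self-contained.

```python
import warnings
import numpy as np
from sklearn.exceptions import ConvergenceWarning
from sklearn.neural_network import MLPClassifier

# Synthetic stand-in data (an assumption): 200 items, 8 predictors in [0, 1].
rng = np.random.default_rng(1)
train_x = rng.random((200, 8))
train_y = (train_x[:, 0] + train_x[:, 4] > 1.0).astype(int)

results = {}
for lr in [0.001, 0.01, 0.05, 0.10]:
    net = MLPClassifier(hidden_layer_sizes=(10, 10), activation='tanh',
                        solver='sgd', learning_rate='constant',
                        learning_rate_init=lr, batch_size=10,
                        max_iter=200, random_state=1)
    with warnings.catch_warnings():
        # Small learning rates may not converge within max_iter.
        warnings.simplefilter("ignore", category=ConvergenceWarning)
        net.fit(train_x, train_y)
    results[lr] = net.score(train_x, train_y)
    print("lrn rate = %0.3f  accuracy on train = %0.4f" % (lr, results[lr]))
```

In a real experiment you would compare accuracy on held-out data rather than on the training data.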
Your next decision is the batch_size parameter. The demo uses 10. I recommend that your batch size evenly divides the number of training items so that all batches of training data have the same size. Because the demo has 200 training items, each pass through the training data will use 200 / 10 = 20 batches of 10 data items each.
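The batch arithmetic is easy to check before training. This short sketch uses the demo's 200 training items and batch size of 10. The variable names are just for illustration.

```python
n_train = 200     # number of training items in the demo
batch_size = 10   # the demo's batch_size value

# Number of full batches per pass, and items left over for a partial batch.
batches_per_epoch, leftover = divmod(n_train, batch_size)
print("Full batches per pass = %d" % batches_per_epoch)
print("Items left over = %d" % leftover)
```

A leftover of zero means every batch has the same size. A batch size of 16, for example, would give 12 full batches plus 8 leftover items.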
Your next parameter decision is whether or not to use nesterovs_momentum. The default value is True, but I recommend setting it to False. Momentum is an old technique that was designed primarily to speed up training. But in my opinion the advantage gained by using momentum is usually outweighed by having to experiment with yet another parameter value, the momentum parameter. Note that nesterovs_momentum applies only to the 'sgd' solver, and only when the momentum parameter is greater than zero.
Your next parameter decision is the alpha value. The alpha parameter controls what is called L2 regularization. Regularization shrinks the weights and biases of a neural network to prevent them from becoming huge, because huge values tend to cause model overfitting. Overfitting means the model predicts well on the training data, but when presented with new, previously unseen test data, the model predicts poorly. The default value of alpha is 0.0001, but I recommend setting alpha to zero and only experimenting with alpha values if significant overfitting occurs.
Your next parameter decision is max_iter to set the maximum number of training iterations. This is strictly a matter of trial and error. The demo sets the verbose parameter to False, but setting it to True will allow you to monitor training and determine a good value for the max_iter parameter (when the loss value stops changing much).
Your last parameter decisions are n_iter_no_change and tol. The n_iter_no_change parameter specifies that training should stop after a given number of consecutive iterations in which no improvement (decrease in the error/loss value) has been made. The tol ("tolerance") parameter specifies how large a decrease must be to count as an improvement.
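As an alternative to watching verbose output, a fitted MLPClassifier that used the 'sgd' or 'adam' solver exposes the n_iter_, loss_ and loss_curve_ attributes, which show how many iterations actually ran and how the loss changed. The sketch below is an illustration under assumptions: the data is synthetic, and the max_iter, n_iter_no_change and tol values are not the demo's.

```python
import numpy as np
from sklearn.neural_network import MLPClassifier

# Synthetic stand-in data (an assumption), not the demo's data.
rng = np.random.default_rng(0)
train_x = rng.random((200, 8))
train_y = (train_x[:, 0] > 0.5).astype(int)

net = MLPClassifier(hidden_layer_sizes=(10, 10), activation='tanh',
                    solver='sgd', learning_rate_init=0.01, batch_size=10,
                    max_iter=500, n_iter_no_change=50, tol=1.0e-4,
                    random_state=0)
net.fit(train_x, train_y)

# n_iter_ is the number of iterations run; loss_curve_ has one loss per iteration.
print("Iterations actually run = %d" % net.n_iter_)
for i in range(0, net.n_iter_, 100):
    print("iteration %4d  loss = %0.4f" % (i + 1, net.loss_curve_[i]))
print("final loss = %0.4f" % net.loss_)
```

If n_iter_ is less than max_iter, training stopped early because of the n_iter_no_change and tol settings.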
To recap, the MLPClassifier has a large number of interacting parameters. There are essentially an infinite number of combinations of parameter values, so you must experiment using trial and error. With each neural network example you encounter, your intuition will grow, and you'll be able to zero in on good parameter values more quickly. This is the reason that machine learning with neural networks is sometimes said to be part art and part science.
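Putting the decisions together, the constructor call might look like the sketch below. The values for hidden_layer_sizes, batch_size and the 'constant' learning rate type come from the text, and the solver, activation, alpha and nesterovs_momentum values follow the recommendations above. The remaining values (learning_rate_init, max_iter, tol, n_iter_no_change, random_state) are assumptions for illustration, not the demo's actual settings.

```python
from sklearn.neural_network import MLPClassifier

params = {
    'hidden_layer_sizes': (10, 10),   # two hidden layers, 10 nodes each
    'activation': 'tanh',             # suggested pairing with 'sgd'
    'solver': 'sgd',
    'alpha': 0.0,                     # no L2 regularization to start
    'batch_size': 10,
    'learning_rate': 'constant',
    'learning_rate_init': 0.01,       # assumed value
    'nesterovs_momentum': False,
    'max_iter': 500,                  # assumed value
    'tol': 1.0e-4,                    # assumed value (the scikit default)
    'n_iter_no_change': 50,           # assumed value
    'random_state': 1,                # assumed seed, for reproducibility
    'verbose': False,
}
net = MLPClassifier(**params)
print(net)
```

Keeping the values in a dictionary makes it easy to display them later, for example with params['batch_size'].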
Training the Neural Network
After the neural network has been prepared, training is easy:
# 3. train
print("Training with bat sz = " + \
str(params['batch_size']) + " lrn rate = " + \
str(params['learning_rate_init']) + " ")
print("Stop if no change " + \
str(params['n_iter_no_change']) + " iterations ")
net.fit(train_x, train_y)
print("Done ")
The backslash character is used for Python line continuation. The fit() method requires a matrix of predictor values and a vector of target labels. There are no optional parameters for fit() so you don't have much to think about -- all the decisions are made when selecting the constructor parameters.
Evaluating the Trained Model
The demo computes the accuracy of the trained model like so:
# 4. evaluate model
acc_train = net.score(train_x, train_y)
print("Accuracy on train = %0.4f " % acc_train)
acc_test = net.score(test_x, test_y)
print("Accuracy on test = %0.4f " % acc_test)
The score() function computes a simple accuracy, which is just the number of correct predictions divided by the total number of predictions. However, for classification problems you usually also want to know the accuracy of the model for each class label. The easiest way to do this is to use the scikit confusion matrix:
from sklearn.metrics import confusion_matrix
y_predicteds = net.predict(test_x)
cm = confusion_matrix(test_y, y_predicteds)
print("\nConfusion matrix: \n")
# print(cm) # raw
show_confusion(cm) # custom formatted
For the demo program, the result of displaying a raw confusion matrix is:
[[ 19 7 ]
[ 0 14 ]]
The raw confusion matrix is a bit difficult to interpret so I usually write a program-defined helper function named show_confusion() to add formatting labels. The output of show_confusion() is:
actual 0: 19 7
actual 1: 0 14
------------
predicted 0 1
The code for show_confusion is in Listing 1. A good model should have roughly similar accuracy values for all class labels. If any class label has a very low accuracy, you need to investigate.
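The per-class accuracy values can be computed directly from the confusion matrix. This minimal sketch hard-codes the demo's raw matrix and assumes the scikit layout, where rows are actual classes and columns are predicted classes.

```python
import numpy as np

# The demo's raw confusion matrix: rows are actual, columns are predicted.
cm = np.array([[19, 7],
               [0, 14]])

# Correct predictions for each actual class divided by that class's item count.
per_class_acc = cm.diagonal() / cm.sum(axis=1)
overall_acc = cm.trace() / cm.sum()

for label, acc in enumerate(per_class_acc):
    print("Accuracy for actual class %d = %0.4f" % (label, acc))
print("Overall accuracy = %0.4f" % overall_acc)
```

For this matrix, class 0 items are predicted correctly 19 out of 26 times, and class 1 items 14 out of 14 times.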
For binary classification problems, it's standard practice to compute additional measures of accuracy: precision, recall and F1 score. The demo does so using these statements:
from sklearn.metrics import precision_score
from sklearn.metrics import recall_score
from sklearn.metrics import f1_score
y_predicteds = net.predict(test_x)
precision = precision_score(test_y, y_predicteds)
print("Precision on test = %0.4f " % precision)
recall = recall_score(test_y, y_predicteds)
print("Recall on test = %0.4f " % recall)
f1 = f1_score(test_y, y_predicteds)
print("F1 score on test = %0.4f " % f1)
It's easy to overthink precision and recall. It's best to interpret them as additional accuracy metrics and you should only be concerned when you see a very low value. Precision and recall are somewhat ambiguous because assigning class 0 and class 1 to outcomes is usually arbitrary. The F1 score is just the harmonic mean of precision and recall.
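To see what the three metrics measure, you can compute them by hand from the confusion matrix counts shown earlier. This sketch assumes class 1 is the positive class, which is the scikit default for labels 0 and 1.

```python
# Counts from the confusion matrix shown earlier, with class 1 as positive.
tn, fp = 19, 7    # actual class 0: predicted 0, predicted 1
fn, tp = 0, 14    # actual class 1: predicted 0, predicted 1

precision = tp / (tp + fp)   # of items predicted as class 1, fraction correct
recall = tp / (tp + fn)      # of actual class 1 items, fraction found
f1 = 2 * precision * recall / (precision + recall)   # harmonic mean

print("Precision = %0.4f " % precision)
print("Recall = %0.4f " % recall)
print("F1 score = %0.4f " % f1)
```

For these counts, precision is 14 / 21 = 0.6667, recall is 14 / 14 = 1.0000, and the F1 score is 0.8000.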
Using the Trained Model
The demo uses the trained model like so:
# 5. use model
print("Setting age = 30 Oklahoma $40,000 moderate ")
X = np.array([[0.30, 0,0,1, 0.4000, 0,1,0]],
dtype=np.float32)
probs = net.predict_proba(X)
print("Prediction pseudo-probs: ")
print(probs)
Because the neural network model was trained using normalized and encoded data, the X-input must be normalized and encoded in the same way. Notice the double square brackets on the X-input. The predict_proba() method expects a matrix rather than a vector. The result of the predict_proba() method ("probability array") is a matrix with one row per input item, where each row holds pseudo-probabilities that sum to 1. If the class-to-predict is ordinal encoded, the index of the largest value in a row corresponds to the predicted class.
The demo concludes by predicting the sex directly by using the predict() method:
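Picking the predicted class from the pseudo-probabilities can be sketched with the NumPy argmax() function. The probability values here are hypothetical, not the demo's actual output.

```python
import numpy as np

# Hypothetical pseudo-probabilities for one input item (not the demo's output).
probs = np.array([[0.35, 0.65]])

# Index of the largest value in each row.
pred_idx = np.argmax(probs, axis=1)
print("Row sums: " + str(probs.sum(axis=1)))
print("Predicted class index: " + str(pred_idx))
```

With a trained model, indexing the model's classes_ attribute with that index maps it back to the class label.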
sex = net.predict(X)
print("Predicted class: ")
print(sex) # a vector with a single value
if sex[0] == 0: print("male")
elif sex[0] == 1: print("female")
The return result is an array with one value rather than a scalar value because the predict() method accepts a matrix of predictor values instead of a single vector of values.
Saving the Trained Model
The demo doesn't save the trained model. The most common way to save a trained scikit neural network model is to use the Python pickle library (in English, to "pickle" means to preserve). For example:
import pickle
print("Saving binary classifier model ")
path = ".\\Models\\gender_nn_model.pkl"
with open(path, "wb") as f:
  pickle.dump(net, f)
This code assumes there is a directory named Models. The saved model could be loaded and used from another program like so:
# predict sex for unknown person
# age = 40, Nebraska, $54,000 conservative
X = np.array([[0.40, 0,1,0, 0.5400, 1,0,0]],
dtype=np.float32)
with open(path, 'rb') as f:
  loaded_model = pickle.load(f)
pa = loaded_model.predict_proba(X)
print(pa) # pseudo-probabilities
There are several other ways to save and load a trained scikit model, but using the pickle library is simplest.
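One of those other ways is the joblib library, which is installed along with scikit. The sketch below is a round trip under assumptions: the small model and synthetic data are stand-ins for the demo's, and the file is written to a temporary directory rather than a Models directory.

```python
import os
import tempfile
import joblib
import numpy as np
from sklearn.neural_network import MLPClassifier

# Small stand-in model trained on synthetic data (an assumption).
rng = np.random.default_rng(0)
train_x = rng.random((40, 8))
train_y = (train_x[:, 0] > 0.5).astype(int)
net = MLPClassifier(hidden_layer_sizes=(10, 10), max_iter=50, random_state=0)
net.fit(train_x, train_y)   # may warn that training has not converged

# Save to a temporary directory, then load the model back.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "gender_nn_model.joblib")
    joblib.dump(net, path)
    loaded_model = joblib.load(path)

# The loaded model should give the same pseudo-probabilities as the original.
same = np.array_equal(net.predict_proba(train_x),
                      loaded_model.predict_proba(train_x))
print("Loaded model matches original: " + str(same))
```

As with pickle, you should only load a saved model file that comes from a source you trust.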
Wrapping Up
When using the scikit library for binary classification, the main alternative to the MLPClassifier neural network module is the scikit DecisionTreeClassifier module. Decision trees are useful for relatively small datasets that have a relatively simple underlying structure, and when the trained model must be easily interpretable. Neural networks are useful for large datasets with complex structures, but neural models are not easy to interpret. Because the scikit library is so easy to use, it's common to try both approaches and optionally combine the results.
About the Author
Dr. James McCaffrey works for Microsoft Research in Redmond, Wash. He has worked on several Microsoft products including Azure and Bing. James can be reached at [email protected].