The Data Science Lab
Binary Classification Using a scikit Neural Network
Machine learning with neural networks is sometimes said to be part art and part science. Dr. James McCaffrey of Microsoft Research teaches both with a full-code, step-by-step tutorial.
A binary classification problem is one where the goal is to predict the value of a variable where there are exactly two discrete possibilities. For example, you might want to predict the sex of a person (male = 0, female = 1) based on their age, state where they live, income and political leaning (conservative, moderate, liberal). Note that when there are three or more possible values to predict (for example, predict political leaning), the problem is called multi-class classification, which typically uses different algorithms than binary classification.
Arguably the most powerful binary classification technique is a neural network model. There are several tools and code libraries that you can use to create a neural network classifier. The scikit-learn library (also called scikit or sklearn) is based on the Python language and is one of the most popular.
A good way to see where this article is headed is to take a look at the screenshot in Figure 1. The demo program loads a 200-item set of training data and a 40-item set of test data into memory. Next, the demo creates and trains a neural network model using the MLPClassifier module ("multi-layer perceptron," an old term for a neural network) from the scikit library.
After training, the model is applied to the training data and the test data. The model scores 93 percent accuracy (186 out of 200 correct) on the training data, and 82.50 percent accuracy (33 out of 40 correct) on the test data.
The demo concludes by predicting the sex of a person who is age 30, from Oklahoma, makes $40,000 per year and is a political moderate. The prediction is [[0.9708 0.0292]]. These are pseudo-probabilities, and because the value at index [0] is larger, the predicted sex is class 0 = male.
This article assumes you have intermediate or better skill with a C-family programming language such as Python, but doesn't assume you know much about neural networks or the scikit library. The complete source code for the demo program is presented in this article and the accompanying file download. The source code and training and test data are also available online.
Installing the scikit Library
There are several ways to install the scikit library. I recommend installing the Anaconda Python distribution. Anaconda contains the scikit library, a core Python engine, plus more than 500 libraries that are (mostly) compatible with one another. I used Anaconda3-2022.10, which contains Python 3.9.13 and scikit version 1.0.2. The demo code runs on Windows 10 or 11.
Briefly, Anaconda is installed using a Windows self-extracting executable file. The setup process is mostly straightforward and takes about 15 minutes if you follow step-by-step installation instructions; instructions written for other versions of Anaconda can be easily adapted for Anaconda3-2022.10.
More up-to-date versions of Anaconda, Python and the scikit library are available. But because the Python ecosystem has hundreds of libraries, installing the most recent versions increases the risk of library incompatibilities -- a major headache when working with Python.
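If you're not sure which versions are installed on your machine, a quick check will tell you. This is a minimal sketch; the version strings in the comments are just the ones used by the demo:

import sys
import numpy as np
import sklearn

print(sys.version)          # Python version, e.g. 3.9.13
print(np.__version__)       # NumPy version
print(sklearn.__version__)  # scikit version, e.g. 1.0.2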
The Data
The data is artificial. There are 200 training items and 40 test items. The structure of the data looks like:
1 0.24 1 0 0 0.2950 0 0 1
0 0.39 0 0 1 0.5120 0 1 0
1 0.63 0 1 0 0.7580 1 0 0
0 0.36 1 0 0 0.4450 0 1 0
1 0.27 0 1 0 0.2860 0 0 1
. . .
The tab-delimited fields are sex (0 = male, 1 = female), age (divided by 100), state (Michigan = 100, Nebraska = 010, Oklahoma = 001), income (divided by $100,000) and political leaning (conservative = 100, moderate = 010, liberal = 001). For scikit neural network classification, the numeric predictors should all be normalized to approximately the same range -- typically 0.0 to 1.0 or -1.0 to +1.0 -- because normalizing prevents predictors with large magnitudes from overwhelming those with small magnitudes.
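The normalization used here is simple divide-by-constant. A minimal sketch, using hypothetical raw values that correspond to the first line of data shown above:

raw_age = 24           # years
raw_income = 29500.00  # dollars per year

norm_age = raw_age / 100.0           # 0.24
norm_income = raw_income / 100000.0  # 0.2950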
For categorical predictor variables, I recommend one-hot encoding. For example, if there were five states instead of just three, the states would be encoded as 10000, 01000, 00100, 00010, 00001. For binary predictor variables, such as is_citizen, you can encode using either zero-one encoding or minus-one-plus-one encoding. In spite of decades of research, there are some topics, such as binary predictor encoding, that are not well understood.
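A minimal sketch of one-hot encoding the state predictor by hand, using a lookup table that matches the encoding described above:

state_codes = {
  'Michigan' : [1,0,0],
  'Nebraska' : [0,1,0],
  'Oklahoma' : [0,0,1] }

print(state_codes['Oklahoma'])  # [0, 0, 1]

For datasets with many categorical columns, the scikit sklearn.preprocessing.OneHotEncoder class can generate the encoding automatically.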
The Demo Program
The complete demo program is presented in Listing 1. Notepad is my preferred code editor, but most of my colleagues use one of the many excellent code editors that are available for Python. I indent my Python programs using two spaces rather than the more common four spaces.
The program imports the NumPy library, which contains numeric array functionality, and the MLPClassifier module, which contains neural network functionality. Notice the name of the root scikit module is sklearn rather than scikit.
import numpy as np
from sklearn.neural_network import MLPClassifier
import warnings
warnings.filterwarnings('ignore') # early-stop warnings
The demo specifies that no Python warnings should be displayed. I do this to keep the output tidy, but in a non-demo scenario you definitely want to see warning messages.
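If you want to suppress only the messages generated during training rather than all warnings, a more targeted sketch is to filter on the scikit ConvergenceWarning category:

import warnings
from sklearn.exceptions import ConvergenceWarning
warnings.filterwarnings('ignore', category=ConvergenceWarning)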
Listing 1: Complete Demo Program
# people_gender_nn_sckit.py
# predict sex (0 = male, 1 = female)
# from age, state, income, politics
# Anaconda3-2022.10 Python 3.9.13 scikit 1.0.2
# Windows 10/11
import numpy as np
from sklearn.neural_network import MLPClassifier
import warnings
warnings.filterwarnings('ignore') # early-stop warnings
# ---------------------------------------------------------
def show_confusion(cm):
  dim = len(cm)
  mx = np.max(cm)             # largest count in cm
  wid = len(str(mx)) + 1      # width to print
  fmt = "%" + str(wid) + "d"  # like "%3d"
  for i in range(dim):
    print("actual ", end="")
    print("%3d:" % i, end="")
    for j in range(dim):
      print(fmt % cm[i][j], end="")
    print("")
  print("------------")
  print("predicted ", end="")
  for j in range(dim):
    print(fmt % j, end="")
  print("")

# ---------------------------------------------------------

def main():
  # 0. get ready
  print("\nBegin scikit neural network binary example ")
  print("Predict sex from age, State, income, politics ")
  np.random.seed(1)
  np.set_printoptions(precision=4, suppress=True)

  # 1. load data
  print("\nLoading data into memory ")
  train_file = ".\\Data\\people_train.txt"
  train_xy = np.loadtxt(train_file, usecols=range(0,9),
    delimiter="\t", comments="#", dtype=np.float32)
  train_x = train_xy[:,1:9]
  train_y = train_xy[:,0].astype(np.int64)

  # load, two calls to loadtxt() technique
  test_file = ".\\Data\\people_test.txt"
  test_x = np.loadtxt(test_file, usecols=range(1,9),
    delimiter="\t", comments="#", dtype=np.float32)
  test_y = np.loadtxt(test_file, usecols=0,
    delimiter="\t", comments="#", dtype=np.int64)

  print("\nTraining data:")
  print(train_x[0:4])
  print(". . . \n")
  print(train_y[0:4])
  print(". . . ")

  # ---------------------------------------------------------

  # 2. create network
  # MLPClassifier(hidden_layer_sizes=(100,),
  #   activation='relu', *, solver='adam', alpha=0.0001,
  #   batch_size='auto', learning_rate='constant',
  #   learning_rate_init=0.001, power_t=0.5, max_iter=200,
  #   shuffle=True, random_state=None, tol=0.0001,
  #   verbose=False, warm_start=False, momentum=0.9,
  #   nesterovs_momentum=True, early_stopping=False,
  #   validation_fraction=0.1, beta_1=0.9, beta_2=0.999,
  #   epsilon=1e-08, n_iter_no_change=10, max_fun=15000)

  params = { 'hidden_layer_sizes' : [10,10],
    'activation' : 'tanh',
    'solver' : 'sgd',
    'alpha' : 0.001,
    'batch_size' : 10,
    'random_state' : 0,
    'tol' : 0.0001,
    'nesterovs_momentum' : False,
    'learning_rate' : 'constant',
    'learning_rate_init' : 0.01,
    'max_iter' : 500,
    'shuffle' : True,
    'n_iter_no_change' : 50,
    'verbose' : False }

  print("\nCreating 8-(10-10)-1 tanh neural network ")
  net = MLPClassifier(**params)

  # ---------------------------------------------------------

  # 3. train
  print("\nTraining with bat sz = " + \
    str(params['batch_size']) + " lrn rate = " + \
    str(params['learning_rate_init']) + " ")
  print("Stop if no change " + \
    str(params['n_iter_no_change']) + " iterations ")
  net.fit(train_x, train_y)
  print("Done ")

  # ---------------------------------------------------------

  # 4. evaluate model
  acc_train = net.score(train_x, train_y)
  print("\nAccuracy on train = %0.4f " % acc_train)
  acc_test = net.score(test_x, test_y)
  print("Accuracy on test = %0.4f " % acc_test)

  from sklearn.metrics import confusion_matrix
  y_predicteds = net.predict(test_x)
  cm = confusion_matrix(test_y, y_predicteds)
  print("\nConfusion matrix: \n")
  # print(cm)  # raw
  show_confusion(cm)  # custom formatted

  from sklearn.metrics import precision_score
  from sklearn.metrics import recall_score
  from sklearn.metrics import f1_score
  precision = precision_score(test_y, y_predicteds)
  print("\nPrecision on test = %0.4f " % precision)
  recall = recall_score(test_y, y_predicteds)
  print("Recall on test = %0.4f " % recall)
  f1 = f1_score(test_y, y_predicteds)
  print("F1 score on test = %0.4f " % f1)

  # ---------------------------------------------------------

  # 5. use model
  print("\nSetting age = 30 Oklahoma $40,000 moderate ")
  X = np.array([[0.30, 0,0,1, 0.4000, 0,1,0]],
    dtype=np.float32)
  probs = net.predict_proba(X)
  print("\nPrediction pseudo-probs: ")
  print(probs)
  sex = net.predict(X)
  print("\nPredicted class: ")
  print(sex)  # a vector with a single value
  if sex[0] == 0: print("male")
  elif sex[0] == 1: print("female")

  # ---------------------------------------------------------

  # 6. TODO: save model using pickle

  print("\nEnd scikit binary neural network demo ")

if __name__ == "__main__":
  main()
All the program logic is contained in a main() function. The demo begins by setting the NumPy random seed:
def main():
  # 0. get ready
  print("Begin scikit neural network binary example ")
  print("Predict sex from age, State, income, politics ")
  np.random.seed(1)
  np.set_printoptions(precision=4, suppress=True)
  . . .
Technically, setting the random seed value isn't necessary, but doing so helps you to get reproducible results in most situations. The set_printoptions() function formats NumPy arrays to four decimals without using scientific notation.
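To see the effect of set_printoptions(), here is a tiny sketch using made-up values:

np.set_printoptions(precision=4, suppress=True)
arr = np.array([0.123456789, 0.00001])
print(arr)  # [0.1235 0.    ] -- four decimals, no scientific notation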
Loading the Training and Test Data
The demo program loads the training data into memory using these statements:
# 1. load data
print("Loading data into memory ")
train_file = ".\\Data\\people_train.txt"
train_xy = np.loadtxt(train_file, usecols=range(0,9),
  delimiter="\t", comments="#", dtype=np.float32)
train_x = train_xy[:,1:9]
train_y = train_xy[:,0].astype(np.int64)
This code assumes the data files are stored in a directory named Data. There are many ways to load data into memory. I prefer using the NumPy library loadtxt() function, but a common alternative is the Pandas library read_csv() function.
The code reads all 200 lines of training data (columns 0 to 8 inclusive) into a matrix named train_xy and then splits the data into a matrix of predictor values and a vector of target sex values. The colon syntax means "all rows." The target labels are converted from type float32 to int64.
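For comparison, the Pandas alternative mentioned above looks roughly like this. This is a sketch that assumes the same tab-delimited file format with "#" comment lines:

import pandas as pd

df = pd.read_csv(train_file, sep="\t", comment="#",
  header=None, dtype=np.float32)
train_xy = df.values  # DataFrame to NumPy matrix
train_x = train_xy[:,1:9]
train_y = train_xy[:,0].astype(np.int64)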
The 40-item test data is read into memory using an alternate technique that calls loadtxt() twice:
test_file = ".\\Data\\people_test.txt"
test_x = np.loadtxt(test_file, usecols=range(1,9),
  delimiter="\t", comments="#", dtype=np.float32)
test_y = np.loadtxt(test_file, usecols=0,
  delimiter="\t", comments="#", dtype=np.int64)
The demo program prints the first four training predictor items and the first four target sex values:
print("Training data:")
print(train_x[0:4])
print(". . . ")
print(train_y[0:4])
print(". . . ")
In a non-demo scenario you might want to display all the training data and all the test data to verify the data has been read properly.
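One way to do that, sketched here, is to temporarily raise the NumPy display threshold so that no rows are elided:

import sys
np.set_printoptions(threshold=sys.maxsize)  # show every row
print(train_x)
print(train_y)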
Creating the Neural Network Model
Creating the binary classification neural network model is simultaneously simple and complicated. First, the demo program sets up the network parameters in a Python Dictionary object like so:
# 2. create network
params = { 'hidden_layer_sizes' : [10,10],
  'activation' : 'tanh', 'solver' : 'sgd',
  'alpha' : 0.001, 'batch_size' : 10,
  'random_state' : 0, 'tol' : 0.0001,
  'nesterovs_momentum' : False,
  'learning_rate' : 'constant',
  'learning_rate_init' : 0.01,
  'max_iter' : 500, 'shuffle' : True,
  'n_iter_no_change' : 50, 'verbose' : False }
After the parameters are set, they are fed to a neural network constructor:
print("Creating 8-(10-10)-1 tanh neural network ")
net = MLPClassifier(**params)
The ** syntax means to unpack the Dictionary values and pass them to the constructor. Like many scikit models, the MLPClassifier class has a lot of parameters and default values. The signature is:
MLPClassifier(hidden_layer_sizes=(100,),
  activation='relu', *, solver='adam', alpha=0.0001,
  batch_size='auto', learning_rate='constant',
  learning_rate_init=0.001, power_t=0.5, max_iter=200,
  shuffle=True, random_state=None, tol=0.0001,
  verbose=False, warm_start=False, momentum=0.9,
  nesterovs_momentum=True, early_stopping=False,
  validation_fraction=0.1, beta_1=0.9, beta_2=0.999,
  epsilon=1e-08, n_iter_no_change=10, max_fun=15000)
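To make the ** unpacking concrete, a short sketch: calling the constructor with **params is equivalent to passing each name-value pair in the Dictionary as a named argument.

# these two statements create identical networks
net = MLPClassifier(**params)
net = MLPClassifier(hidden_layer_sizes=[10,10],
  activation='tanh', solver='sgd', alpha=0.001,
  batch_size=10, random_state=0, tol=0.0001,
  nesterovs_momentum=False, learning_rate='constant',
  learning_rate_init=0.01, max_iter=500, shuffle=True,
  n_iter_no_change=50, verbose=False)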