Binary Classification Using New PyTorch Best Practices, Part 2: Training, Accuracy, Predictions -- Visual Studio Magazine

Binary Classification Using New PyTorch Best Practices, Part 2: Training, Accuracy, Predictions

Dr. James McCaffrey of Microsoft Research explains how to train a network, compute its accuracy, use it to make predictions and save it for use by other programs.

By James McCaffrey
10/14/2022

This is the second of two articles that explain how to create and use a PyTorch binary classifier. A good way to see where this article is headed is to examine the screenshot of a demo program in Figure 1.

The demo program predicts the gender (male, female) of a person. The first article in the series explained how to prepare the training and test data, and how to define the neural network classifier. This article explains how to train the network, compute the accuracy of the trained network, use the network to make predictions and save the network for use by other programs.

A good way to see where this article is headed is to take a look at the screenshot of a demo program in Figure 1. The demo begins by loading a 200-item file of training data and a 40-item set of test data. Each tab-delimited line represents a person. The fields are gender (male = 0, female = 1), age, state of residence, annual income and politics type. The goal is to predict gender from age, state, income and politics type.

Figure 1: Binary Classification Using PyTorch Demo Run — **[Click on image for larger view.]** *Figure 1:* Binary Classification Using PyTorch Demo Run

After the training data is loaded into memory, the demo creates an 8-(10-10)-1 neural network. This means there are eight input nodes, two hidden neural layers with 10 nodes each and one output node.

The demo prepares to train the network by setting a batch size of 10, stochastic gradient descent (SGD) optimization with a learning rate of 0.01, and maximum training epochs of 500 passes through the training data. The meaning of these values and how they are determined will be explained shortly.

The demo program monitors training by computing and displaying loss values. The loss values slowly decrease which indicates that training is probably succeeding. The magnitude of the loss values isn't directly interpretable; the important thing is that the loss decreases.

After 500 training epochs, the demo program computes the accuracy of the trained model on the training data as 82.50 percent (165 out of 200 correct). The model accuracy on the test data is 85 percent (34 out of 40 correct). For binary classification models, in addition to accuracy, it's standard practice to compute additional metrics: precision, recall and F1 score.

After evaluating the trained network, the demo saves the trained model to file so that it can be used without having to retrain the network from scratch. There are two main ways to save a PyTorch model. The demo uses the save-state approach.

After saving the model, the demo predicts the gender for a person who is 30 years old, from Oklahoma, who makes $40,000 annually and is a political moderate. The raw prediction is 0.3193. This value is a pseudo-probability where values less than 0.5 indicate class 0 (male) and values greater than 0.5 indicate class 1 (female). Therefore the prediction is male.

This article assumes you have a basic familiarity with Python and intermediate or better experience with a C-family language but does not assume you know much about PyTorch or neural networks. The complete demo program source code and data can be found here.

Overall Program Structure
The overall structure of the demo program is presented in Listing 1. The demo program is named people_gender.py. The program imports the NumPy (numerical Python) library and assigns it an alias of np. The program imports PyTorch and assigns it an alias of T. Most PyTorch programs do not use the T alias but my work colleagues and I often do so to save space. The demo program indents using two spaces rather than the more common four spaces, again to save space.

Listing 1: Overall Program Structure

# people_gender.py
# binary classification
# PyTorch 1.12.1-CPU Anaconda3-2020.02  Python 3.7.6
# Windows 10/11 

import numpy as np
import torch as T
device = T.device('cpu')

class PeopleDataset(T.utils.data.Dataset): . . .

class Net(T.nn.Module): . . .

def metrics(model, ds, thresh=0.5): . . .

def main():
  # 0. get started
  print("People gender using PyTorch ")
  T.manual_seed(1)
  np.random.seed(1)
  
  # 1. create Dataset objects
  # 2. create network
  # 3. train model
  # 4. evaluate model accuracy
  # 5. save model (state_dict approach)
  # 6. make a prediction
  
  print("End People binary classification demo ")

if __name__ == "__main__":
  main()

The demo program places all the control logic in a main() function. Some of my colleagues prefer to implement a program-defined train() function to handle the code that performs the training.

The demo program begins by setting the seed values for the NumPy random number generator and the PyTorch generator. Setting seed values is helpful so that demo runs are mostly reproducible. However, when working with complex neural networks such as Transformer networks, exact reproducibility cannot always be guaranteed because of separate threads of execution.

Preparing to Train the Network
Training a neural network is the process of finding values for the weights and biases so that the network produces output that matches the training data. Most of the demo program code is associated with training the network. The terms network and model are often used interchangeably. In some development environments, network is used to refer to a neural network before it has been trained, and model is used to refer to a network after it has been trained.

The normalized and encoded training data looks like:

 1   0.24   1   0   0   0.2950   0  0  1
 0   0.39   0   0   1   0.5120   0  1  0
 1   0.63   0   1   0   0.7580   1  0  0
 0   0.36   1   0   0   0.4450   0  1  0
. . .

The fields are gender (0 = male, 1 = female), age (divided by 100), state (Michigan = 100, Nebraqska = 010, Oklahoma = 001), income (divided by 100,000) and political leaning (conservative = 100, moderate = 010, liberal = 001).

In the main() function, the training and test data are loaded into memory as Dataset objects, and then the training Dataset is passed to a DataLoader object:

  # 1. create Dataset and DataLoader objects
  print("Creating People train and test Datasets ")
  train_file = ".\\Data\\people_train.txt"
  test_file = ".\\Data\\people_test.txt"
  train_ds = PeopleDataset(train_file)  # 200 rows
  test_ds = PeopleDataset(test_file)    # 40 rows
  bat_size = 10
  train_ldr = T.utils.data.DataLoader(train_ds,
    batch_size=bat_size, shuffle=True)

Unlike Dataset objects that must be defined for each specific binary classification problem, DataLoader objects are ready to use as-is. The batch size of 10 is a hyperparameter. The special case when batch size is set to 1 is sometimes called online training.

Although not necessary, it's generally a good idea to set a batch size that evenly divides the total number of training items so that all batches of training data have the same size. In the demo, with a batch size of 10 and 200 training items, each batch will have 20 items. When the batch size doesn't evenly divide the number of training items, the last batch will be smaller than all the others. The DataLoader class has an optional drop_last parameter with a default value of False. If set to True, the DataLoader will ignore last batches that are smaller.

It's very important to explicitly set the shuffle parameter to True. The default value is False. When shuffle is set to True, the training data will be served up in a random order which is what you want during training. If shuffle is set to False, the training data is served up sequentially. This almost always results in failed training because the updates to the network weights and biases oscillate, and no progress is made.

Creating the Network
The demo program creates the neural network like so:

  # 2. create neural network
  print("Creating 8-(10-10)-1 binary NN classifier ")
  net = Net().to(device)
  net.train()

The neural network is instantiated using normal Python syntax but with .to(device) appended to explicitly place storage in either "cpu" or "cuda" memory. Recall that device is a global-scope value set to "cpu" in the demo.

The network is set into training mode with the somewhat misleading statement net.train(). PyTorch neural networks can be in one of two modes, train() or eval(). The network should be in train() mode during training and eval() mode at all other times.

The train() vs. eval() mode is often confusing for people who are new to PyTorch in part because in many situations it doesn't matter what mode the network is in. Briefly, if a neural network uses dropout or batch normalization, then you get different results when computing output values depending on whether the network is in train() or eval() mode. But if a network doesn't use dropout or batch normalization, you get the same results for train() and eval() mode.

Because the demo network doesn't use dropout or batch normalization, it's not necessary to switch between train() and eval() mode. However, in my opinion it's good practice to always explicitly set a network to train() mode during training and eval() mode at all other times. By default, a network is in train() mode.

The statement net.train() is rather misleading because it suggests that some sort of training is going on. If I had been the person who implemented the train() method, I would have named it set_train_mode() instead. Also, the train() method operates by reference and so the statement net.train() modifies the net object. If you are a fan of functional programming, you can write net = net.train() instead.

Training the Network
The code that trains the network is presented in Listing 2. Training a neural network involves two nested loops. The outer loop iterates a fixed number of epochs (with a possible short-circuit exit). An epoch is one complete pass through the training data. The inner loop iterates through all training data items.

Listing 2: Training the Network

  # 3. train network
  lrn_rate = 0.01
  loss_func = T.nn.BCELoss()  # binary cross entropy
  optimizer = T.optim.SGD(net.parameters(),
    lr=lrn_rate)
  max_epochs = 500
  ep_log_interval = 100

  print("Loss function: " + str(loss_func))
  print("Optimizer: " + str(optimizer.__class__.__name__))
  print("Learn rate: " + "%0.3f" % lrn_rate)
  print("Batch size: " + str(bat_size))
  print("Max epochs: " + str(max_epochs))

  print("Starting training")
  for epoch in range(0, max_epochs):
    epoch_loss = 0.0            # for one full epoch
    for (batch_idx, batch) in enumerate(train_ldr):
      X = batch[0]             # [bs,8]  inputs
      Y = batch[1]             # [bs,1]  targets
      oupt = net(X)            # [bs,1]  computeds 

      loss_val = loss_func(oupt, Y)   # a tensor
      epoch_loss += loss_val.item()   # accumulate
      optimizer.zero_grad()  # reset all gradients
      loss_val.backward()    # compute new gradients
      optimizer.step()       # update all weights

    if epoch % ep_log_interval == 0:
      print("epoch = %4d   loss = %8.4f" % \
        (epoch, epoch_loss))
  print("Done ")

The five statements that prepare training are:

  lrn_rate = 0.01
  loss_func = T.nn.BCELoss()  # binary cross entropy
  optimizer = T.optim.SGD(net.parameters(),
    lr=lrn_rate)
  max_epochs = 500
  ep_log_interval = 100

The number of epochs to train is a hyperparameter that must be determined by trial and error. The ep_log_interval specifies how often to display progress messages.

The loss function is set to BCELoss(), which assumes that the output nodes have sigmoid() activation applied. There is a strong coupling between loss function and output node activation. In the early days of neural networks, MSELoss() was often used (mean squared error), but BCELoss() is now far more common.

The demo uses stochastic gradient descent optimization (SGD) with a fixed learning rate of 0.01 that controls how much weights and biases change on each update. PyTorch supports 13 different optimization algorithms. The two most common are SGD and Adam (adaptive moment estimation). SGD often works reasonably well for simple networks, including binary classifiers. Adam often works better than SGD for deep neural networks.

PyTorch beginners sometimes fall into a trap of trying to learn everything about every optimization algorithm. Most of my experienced colleagues use just two or three algorithms and adjust the learning rate. My recommendation is to use SGD and Adam and try other algorithms only when those two fail.

It's important to monitor training progress because training failure is the norm rather than the exception. There are several ways to monitor training progress. The demo program uses the simplest approach which is to accumulate the total loss for one epoch, and then display that accumulated loss value every so often (ep_log_interval = 100 in the demo).

The inner training loop is where all the work is done:

  for (batch_idx, batch) in enumerate(train_ldr):
    X = batch[0]  # inputs
    Y = batch[1]  # correct class/label/politics
    optimizer.zero_grad()
    oupt = net(X)
    loss_val = loss_func(oupt, Y)  # a tensor
    epoch_loss += loss_val.item()  # accumulate
    loss_val.backward()
    optimizer.step()

The enumerate() function returns the current batch index (0 through 19) and a batch of input values (age, state, income, politics) with associated correct target values (0 or 1). Using enumerate() is optional and you can skip getting the batch index by writing "for batch in train_ldr" instead.

Get Code Download

Printable Format

comments powered by Disqus

Featured

VS Code Insiders Get Copilot Chat AI-Enhanced Extensions in Latest Update

Developers using the "Insiders" build of the latest Visual Studio Code update, version 1.90, can now enjoy enhanced chat AI functionality in extensions.
Regression Using LightGBM

Dr. James McCaffrey of Microsoft Research presents a full-code, step-by-step tutorial on this powerful machine learning technique used to predict a single numeric value.
Microsoft Teases Discounts for August Dev Conference at Redmond HQ

Microsoft is offering a special discount for Visual Studio Professional and Enterprise subscribers wishing to attend a developer conference being held in August at the company's Redmond, Wash., headquarters.
Building Planet-Scale .NET Apps with Azure Cosmos DB

Azure Cosmos DB is a fully managed distributed database that can be transparently replicated across regions while remaining highly performant and seamlessly scaling according to needs, making it great for applications of any scale.
Java Devs in VS Code Can Now Ask Copilot for Syntax Rewrites

Count among the many emerging abilities of GitHub Copilot new functionality for rewriting your Java syntax in Visual Studio Code.