Binary Classification Using PyTorch: Training -- Visual Studio Magazine

Binary Classification Using PyTorch: Training

Dr. James McCaffrey of Microsoft Research continues his examination of creating a PyTorch neural network binary classifier through six steps, here addressing step No. 4: training the network.

By James McCaffrey
11/04/2020

The goal of a binary classification problem is to predict an output value that can be one of just two possible discrete values, such as "male" or "female." This article is the third in a series of four articles that present a complete end-to-end production-quality example of binary classification using a PyTorch neural network. The example problem is to predict if a banknote (think euro or dollar bill) is authentic or a forgery based on four predictor variables extracted from a digital image of the banknote.

The process of creating a PyTorch neural network binary classifier consists of six steps:

Prepare the training and test data
Implement a Dataset object to serve up the data
Design and implement a neural network
Write code to train the network
Write code to evaluate the model (the trained network)
Write code to save and use the model to make predictions for new, previously unseen data

Each of the six steps is fairly complicated, and the six steps are tightly coupled which adds to the challenge. This article covers the fourth step.

A good way to see where this series of articles is headed is to take a look at the screenshot of the demo program in Figure 1. The demo begins by creating Dataset and DataLoader objects which have been designed to work with the well-known Banknote Authentication data. Next, the demo creates a 4-(8-8)-1 deep neural network. Then the demo prepares training by setting up a loss function (binary cross entropy), a training optimizer function (stochastic gradient descent), and parameters for training (learning rate and max epochs).

Figure 1: Banknote Binary Classification in Action — **[Click on image for larger view.]** *Figure 1:* Banknote Binary Classification in Action

The demo trains the neural network for 100 epochs using batches of 10 items at a time. An epoch is one complete pass through the training data. For example, if there were 2,000 training data items and training was performed using batches of 50 items at a time, one epoch would consist processing 40 batches of data. During training, the demo computes and displays a measure of the current error. Because error slowly decreases, training is succeeding.

After training the network, the demo program computes the classification accuracy of the model on the training data (99.09 percent correct) and on the test data (99.27 percent correct). Because the two accuracy values are similar, it is likely that model overfitting has not occurred. After evaluating the trained model, the demo program saves the model using the state dictionary approach, which is the most common of three standard techniques.

The demo concludes by using the trained model to make a prediction. The four normalized input predictor values are (0.22, 0.09, -0.28, 0.16). The computed output value is 0.277069 which is less than 0.5 and therefore the prediction is class 0, which in turn means authentic banknote.

This article assumes you have an intermediate or better familiarity with a C-family programming language, preferably Python, but doesn't assume you know very much about PyTorch. The complete source code for the demo program, and the two data files used, are available in the download that accompanies this article. All normal error checking code has been omitted to keep the main ideas as clear as possible.

To run the demo program, you must have Python and PyTorch installed on your machine. The demo programs were developed on Windows 10 using the Anaconda 2020.02 64-bit distribution (which contains Python 3.7.6) and PyTorch version 1.6.0 for CPU installed via pip. You can find detailed step-by-step installation instructions for this configuration in my blog post.

The Banknote Authentication Data
The raw Banknote Authentication data looks like:

3.6216, 8.6661, -2.8073, -0.44699, 0
4.5459, 8.1674, -2.4586, -1.46210, 0
. . .
-2.5419, -0.65804, 2.6842, 1.1952, 1

The raw data can be found online. The goal is to predict the value in the fifth column (0 = authentic banknote, 1 = forged banknote) using the four predictor values. There are a total of 1,372 data items. The raw data was prepared in the following way. First, all four raw numeric predictor values were normalized by dividing by 20 so they're all between -1.0 and +1.0. Next, 1-based ID values from 1 to 1372 were added so that items can be tracked. Next, a utility program split the data into a training data file with 1,097 randomly selected items (80 percent of the 1,372 items) and a test data file with 275 items (the other 20 percent).

After the structure of the training and test files was established, I coded a PyTorch Dataset class to read data into memory and serve the data up in batches using a PyTorch DataLoader object. A Dataset class definition for the normalized and ID-augmented Banknote Authentication is shown in Listing 1.

Listing 1: A Dataset Class for the Banknote Data

class BanknoteDataset(T.utils.data.Dataset):
  def __init__(self, src_file, num_rows=None):
    all_data = np.loadtxt(src_file, max_rows=num_rows,
      usecols=range(1,6), delimiter="\t", skiprows=0,
      dtype=np.float32)  # strip IDs off
    self.x_data = T.tensor(all_data[:,0:4],
      dtype=T.float32).to(device)
    self.y_data = T.tensor(all_data[:,4],
      dtype=T.float32).reshape(-1,1).to(device)

  def __len__(self):
    return len(self.x_data)

  def __getitem__(self, idx):
    preds = self.x_data[idx,:]  # idx rows, all 4 cols
    lbl = self.y_data[idx,:]    # idx rows, the 1 col
    sample = { 'predictors' : preds, 'target' : lbl }
    return sample

Preparing data and defining a PyTorch Dataset is not trivial. You can find the article that explains how to create Dataset objects and use them with DataLoader objects here in The Data Science Lab.

The Neural Network Architecture
In the previous article in this series, I described how to design and implement a neural network for binary classification using the Banknote Authentication data. One possible definition is presented in Listing 2. The code defines a 4-(8-8)-1 neural network.

Listing 2: A Neural Network for the Banknote Data

class Net(T.nn.Module):
  def __init__(self):
    super(Net, self).__init__()
    self.hid1 = T.nn.Linear(4, 8)  # 4-(8-8)-1
    self.hid2 = T.nn.Linear(8, 8)
    self.oupt = T.nn.Linear(8, 1)

    T.nn.init.xavier_uniform_(self.hid1.weight) 
    T.nn.init.zeros_(self.hid1.bias)
    T.nn.init.xavier_uniform_(self.hid2.weight) 
    T.nn.init.zeros_(self.hid2.bias)
    T.nn.init.xavier_uniform_(self.oupt.weight)  
    T.nn.init.zeros_(self.oupt.bias)

  def forward(self, x):
    z = T.tanh(self.hid1(x)) 
    z = T.tanh(self.hid2(z))
    z = T.sigmoid(self.oupt(z)) 
    return z

If you are new to PyTorch, the number of design decisions for a neural network can seem overwhelming. But with every program you write, you learn which design decisions are important and which don't affect the final prediction model very much, and the pieces of the puzzle quickly fall into place.

The Overall Program Structure
The overall structure of the PyTorch binary classification program, with a few minor edits to save space, is shown in Listing 3. I indent my Python programs using two spaces rather than the more common four spaces as a matter of personal preference.

Listing 3: The Structure of the Demo Program

# banknote_bnn.py
# PyTorch 1.6.0-CPU Anaconda3-2020.02
# Python 3.7.6 Windows 10 

import numpy as np
import torch as T
device = T.device("cpu")

# IDs 0001 to 1372 added
# data has been k=20 normalized (all four columns)
# ID  variance  skewness  kurtosis  entropy  class
# [0]    [1]      [2]       [3]       [4]     [5]
#  (0 = authentic, 1 = forgery)  # verified
# train: 1097 items (80%), test: 275 item (20%)

class BanknoteDataset(T.utils.data.Dataset):
  def __init__(self, src_file, num_rows=None): . . .
  def __len__(self): . . .
  def __getitem__(self, idx): . . .

# ----------------------------------------------------

def accuracy(model, ds): . . .

# ----------------------------------------------------

class Net(T.nn.Module):
  def __init__(self): . . .
  def forward(self, x): . . .

# ----------------------------------------------------

def main():
  # 0. get started
  print("Banknote authentication using PyTorch ")

  T.manual_seed(1)
  np.random.seed(1)

  # 1. create Dataset and DataLoader objects
  # 2. create neural network
  # 3. train network
  # 4. evaluate model
  # 5. save model
  # 6. make a prediction 

  print("End Banknote demo ")

if __name__== "__main__":
  main()

It's important to document the versions of Python and PyTorch being used because both systems are under continuous development. Dealing with versioning incompatibilities is a significant headache when working with PyTorch and is something you should not underestimate.

I like to use "T" as the top-level alias for the torch package. Most of my colleagues don't use a top-level alias and spell out "torch" dozens of times per program. Also, I use the full form of sub-packages rather than supplying aliases such as "import torch.nn.functional as functional." In my opinion, using the full form is easier to understand and less error-prone than using many aliases.

The demo program defines a program-scope CPU device object. I usually develop my PyTorch programs on a desktop CPU machine. After I get that version working, converting to a CUDA GPU system only requires changing the global device object to T.device("cuda") plus a minor amount of debugging.

The demo program defines just one helper method, accuracy(). All of the rest of the program control logic is contained in a single main() function. It is possible to define other helper functions such as train_net(), evaluate_model(), and save_model(), but in my opinion this modularization approach unexpectedly makes the program more difficult to understand rather than easier to understand.

Training the Neural Network
The details of training a neural network with PyTorch are complicated so buckle up. In very high level pseudo-code, the process to train a neural network looks like:

loop max_epochs times loop until all batches processed read a batch of training data (inputs, targets) compute outputs using the inputs compute error between outputs and targets use error to update weights and biases end-loop (all batches) end-loop (all epochs)

The difficult part of training is the "use error to update weights and biases" step. PyTorch does most of the hard work for you. It's not easy to understand neural network training without seeing a working program. The program shown in Listing 4 demonstrates how to train a network for binary classification. The screenshot in Figure 2 shows the output from the test program.

Listing 4: Training a Neural Network

# test_training.py

import numpy as np
import torch as T
device = T.device("cpu")

class BanknoteDataset(T.utils.data.Dataset): . . .
  # see Listing 1

class Net(T.nn.Module): . . .
  # see Listing 2

print("Begin test of training ")
  
T.manual_seed(1)
np.random.seed(1)
train_file = ".\\Data\\banknote_k20_train.txt"
train_ds = BanknoteDataset(train_file)  # all rows

bat_size = 10
train_ldr = T.utils.data.DataLoader(train_ds,
  batch_size=bat_size, shuffle=True)

net = Net().to(device)
net = net.train()  # set training mode

lrn_rate = 0.01
loss_obj = T.nn.BCELoss()  # binary cross entropy
optimizer = T.optim.SGD(net.parameters(),
  lr=lrn_rate)

for epoch in range(0, 100):
  epoch_loss = 0.0  # sum of avg loss/item/batch

  for (batch_idx, batch) in enumerate(train_ldr):
    X = batch['predictors']  # [10,4]  inputs
    Y = batch['target']      # [10,1]  targets
    optimizer.zero_grad()
    oupt = net(X)            # [10,1]  computeds 

    loss_val = loss_obj(oupt, Y)   # a tensor
    epoch_loss += loss_val.item()  # accumulate
    loss_val.backward()  # compute all gradients
    optimizer.step()     # update all wts, biases

  if epoch % 10 == 0:  
    print("epoch = %4d   loss = %0.4f" % \
      (epoch, epoch_loss))

print("Done ")

The training demo program begins execution with:

T.manual_seed(1)
np.random.seed(1)
train_file = ".\\Data\\banknote_k20_train.txt"
train_ds = BanknoteDataset(train_file)  # all rows

The global PyTorch and NumPy random number generator seeds are set so that results will be reproducible. Unfortunately, due to multiple threads of execution, in some cases your results will not be reproducible even if you set the seed values. The demo assumes that the training data is located in a subdirectory named Data. The BanknoteDataset object reads all 1,097 training data items into memory. If your training data size is very large you can read just part of the data into memory using the num_rows parameter.

Figure 2: Testing the Training Code — **[Click on image for larger view.]** *Figure 2:* Testing the Training Code

The demo program prepares training with these statements:

Get Code Download

Printable Format

comments powered by Disqus

Featured

VS Code Insiders Get Copilot Chat AI-Enhanced Extensions in Latest Update

Developers using the "Insiders" build of the latest Visual Studio Code update, version 1.90, can now enjoy enhanced chat AI functionality in extensions.
Regression Using LightGBM

Dr. James McCaffrey of Microsoft Research presents a full-code, step-by-step tutorial on this powerful machine learning technique used to predict a single numeric value.
Microsoft Teases Discounts for August Dev Conference at Redmond HQ

Microsoft is offering a special discount for Visual Studio Professional and Enterprise subscribers wishing to attend a developer conference being held in August at the company's Redmond, Wash., headquarters.
Building Planet-Scale .NET Apps with Azure Cosmos DB

Azure Cosmos DB is a fully managed distributed database that can be transparently replicated across regions while remaining highly performant and seamlessly scaling according to needs, making it great for applications of any scale.
Java Devs in VS Code Can Now Ask Copilot for Syntax Rewrites

Count among the many emerging abilities of GitHub Copilot new functionality for rewriting your Java syntax in Visual Studio Code.