The Data Science Lab
Multi-Class Classification Using PyTorch: Preparing Data
Dr. James McCaffrey of Microsoft Research kicks off a four-part series on multi-class classification, a technique for predicting a value that can be one of three or more possible discrete values.
The goal of a multi-class classification problem is to predict a value that can be one of three or more possible discrete values, such as "red," "yellow" or "green" for a traffic signal. This article is the first in a series of four articles that present a complete end-to-end production-quality example of multi-class classification using a PyTorch neural network. The example problem is to predict a college student's major ("finance," "geology" or "history") from their sex, number of units completed, home state and score on an admission test.
The process of creating a PyTorch neural network multi-class classifier consists of six steps:
- Prepare the training and test data
- Implement a Dataset object to serve up the data
- Design and implement a neural network
- Write code to train the network
- Write code to evaluate the model (the trained network)
- Write code to save and use the model to make predictions for new, previously unseen data
Each of the six steps is fairly complicated, and the steps are tightly coupled, which adds to the difficulty. This article covers the first two steps.
A good way to see where this series of articles is headed is to take a look at the screenshot of the demo program in Figure 1. The demo begins by creating Dataset and DataLoader objects that have been designed to work with the student data. Next, the demo creates a 6-(10-10)-3 deep neural network. The demo prepares for training by setting up a loss function (cross entropy), a training optimizer function (stochastic gradient descent) and parameters for training (learning rate and max epochs).
The demo trains the neural network for 1,000 epochs in batches of 10 items. An epoch is one complete pass through the training data. The training data has 200 items and the test data has 40 items. Therefore, one training epoch consists of processing 20 batches of 10 training items.
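The batch arithmetic above can be checked with a few lines of Python. This is only a sketch: the variable names are illustrative and are not taken from the demo program, and the use of math.ceil() assumes that a final partial batch, if there were one, would still be processed.

```python
import math

n_train = 200      # number of training items (from the article)
batch_size = 10    # items per batch (from the article)
max_epochs = 1000  # number of training epochs (from the article)

# One epoch is one full pass through the training data.
batches_per_epoch = math.ceil(n_train / batch_size)

# Total number of batches processed over the whole training run.
total_batches = batches_per_epoch * max_epochs

print(batches_per_epoch)
print(total_batches)
```

With these values, each epoch processes 20 batches and the full run processes 20,000 batches.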
During training, the demo computes and displays a measure of the current error (also called loss) every 100 epochs. Because error slowly decreases, it appears that training is succeeding. Behind the scenes, the demo program saves checkpoint information after every 100 epochs so that if the training machine crashes, training can be resumed without having to start from the beginning.
After training the network, the demo program computes the classification accuracy of the model on the training data (163 out of 200 correct = 81.50 percent) and on the test data (31 out of 40 correct = 77.50 percent). Because the two accuracy values are similar, it is likely that model overfitting has not occurred. After evaluating the trained model, the demo program saves the model using the state dictionary approach, which is the most common of three standard techniques.
The demo concludes by using the trained model to make a prediction. The raw input is (sex = "M", units = 30.5, state = "oklahoma", score = 543). The raw input is normalized and encoded as (sex = -1, units = 0.305, state = 0, 0, 1, score = 0.543). The computed output vector is [0.7104, 0.2849, 0.0047]. These values represent the pseudo-probabilities of student majors "finance," "geology" and "history" respectively. Because the probability associated with "finance" is the largest, the predicted major is "finance."
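The last step, mapping the output vector to a class label, amounts to finding the index of the largest pseudo-probability. The sketch below uses plain Python and the output values quoted above; the function name predicted_major() is hypothetical and is not part of the demo program.

```python
# Index position matches the ordinal encoding of the majors.
majors = ["finance", "geology", "history"]

# Output vector quoted in the article for the Oklahoma student.
output = [0.7104, 0.2849, 0.0047]

def predicted_major(probs, labels):
  # Return the label whose pseudo-probability is largest.
  idx = max(range(len(probs)), key=lambda i: probs[i])
  return labels[idx]

result = predicted_major(output, majors)
print(result)
```

The three pseudo-probabilities sum to 1.0, which is why they can loosely be interpreted as probabilities.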
This article assumes you have an intermediate or better familiarity with a C-family programming language, preferably Python, but doesn't assume you know very much about PyTorch. The complete source code for the demo program, and the two data files used, are available in the download that accompanies this article. All normal error checking code has been omitted to keep the main ideas as clear as possible.
To run the demo program, you must have Python and PyTorch installed on your machine. The demo programs were developed on Windows 10 using the Anaconda 2020.02 64-bit distribution (which contains Python 3.7.6) and PyTorch version 1.7.0 for CPU installed via pip. You can find detailed step-by-step installation instructions for this configuration in my blog post here.
The Student Data
The raw Student data is synthetic and was generated programmatically. There are a total of 240 data items, divided into a 200-item training dataset and a 40-item test dataset. The raw data looks like:
M 39.5 oklahoma 512 geology
F 27.5 nebraska 286 history
M 22.0 maryland 335 finance
F 50.0 nebraska 565 geology
. . .
M 59.5 oklahoma 694 history
Each line of tab-delimited data represents a hypothetical student at a hypothetical college. The first four values on each line are the predictors (often called features in machine learning terminology), and the fifth value is the dependent value to predict (often called the class or the label).
The first value on each line is the student's sex ("M" = male, "F" = female). The second value is the number of units completed by the student so far. The third value is the student's home state. For simplicity, there are just three states: "maryland," "nebraska" and "oklahoma." The fourth value is the student's test score on some sort of admission exam. The fifth value is the student's major. For simplicity there are just three majors to predict: "finance," "geology" and "history."
When using a PyTorch neural network, categorical predictor data must be encoded into a numeric form, and numeric predictor data should be normalized. For multi-class classification, the dependent value should be ordinal encoded.
The raw Student data was prepared in the following way. The sex values were encoded as "M" = -1 and "F" = +1. The units-completed values were normalized by dividing by 100. The student home state values were one-hot encoded as "maryland" = (1, 0, 0), "nebraska" = (0, 1, 0), "oklahoma" = (0, 0, 1). The test scores were normalized by dividing by 1000. The dependent values-to-predict, student majors, were ordinal encoded as "finance" = 0, "geology" = 1, "history" = 2.
Because the synthetic Student data is mixed numeric and categorical and has multiple dimensions, it's not possible to easily display the data in a graph. But you can get a good idea of what the data is like by examining the graph in Figure 2. It shows a 100-item subset of the raw data, with just the units-completed and test score predictor variables. Notice that the data is not linearly separable, so simple classification techniques such as multi-class logistic regression, decision trees and non-kernel multi-class support vector machines would likely create poor prediction models.
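The encoding and normalization rules can be sketched as a small function that converts one raw tab-delimited line into its numeric form. This is an illustration of the rules only, not the preprocessing code used to create the demo data files; the names encode_line(), SEX_CODES, STATE_CODES and MAJOR_CODES are hypothetical.

```python
# Lookup tables that mirror the encoding rules described in the article.
SEX_CODES = {"M": -1, "F": 1}
STATE_CODES = {
  "maryland": (1, 0, 0),
  "nebraska": (0, 1, 0),
  "oklahoma": (0, 0, 1),
}
MAJOR_CODES = {"finance": 0, "geology": 1, "history": 2}

def encode_line(line):
  # Split one raw line into its five tab-delimited fields.
  sex, units, state, score, major = line.strip().split("\t")
  row = [SEX_CODES[sex], float(units) / 100.0]  # sex, normalized units
  row.extend(STATE_CODES[state])                # one-hot home state
  row.append(float(score) / 1000.0)             # normalized test score
  row.append(MAJOR_CODES[major])                # ordinal-encoded major
  return row

encoded = encode_line("M\t39.5\toklahoma\t512\tgeology")
print(encoded)
```

The first raw data line shown earlier encodes to seven values: six predictors followed by the class label, which matches the first commented example row in the program listings.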
In a non-demo scenario, data preparation can be very time-consuming. It's not uncommon for data preparation to take 80 percent or even more of the total time and effort required to create a prediction model. The demo system presented in this article performs all data preparation as a preprocessing step. An alternative approach is to programmatically perform data normalization and encoding on the fly.
The Overall Program Structure
The overall structure of the demo PyTorch multi-class classification program, with a few minor edits to save space, is shown in Listing 1. I indent my Python programs using two spaces rather than the more common four spaces.
Listing 1: The Structure of the Demo Program
# student_major.py
# PyTorch 1.7.0-CPU Anaconda3-2020.02
# Python 3.7.6 Windows 10

import numpy as np
import time
import torch as T
device = T.device("cpu")

class StudentDataset(T.utils.data.Dataset):
  # sex units   state   test_score major
  # -1  0.395   0 0 1   0.5120     1
  #  1  0.275   0 1 0   0.2860     2
  # -1  0.220   1 0 0   0.3350     0
  # sex: -1 = male, +1 = female
  # state: maryland, nebraska, oklahoma
  # major: finance, geology, history

  def __init__(self, src_file, n_rows=None): . . .
  def __len__(self): . . .
  def __getitem__(self, idx): . . .

# ----------------------------------------------------

def accuracy(model, ds): . . .

# ----------------------------------------------------

class Net(T.nn.Module):
  def __init__(self): . . .
  def forward(self, x): . . .

# ----------------------------------------------------

def main():
  # 0. get started
  print("Begin predict student major ")
  np.random.seed(1)
  T.manual_seed(1)

  # 1. create Dataset and DataLoader objects
  # 2. create neural network
  # 3. train network
  # 4. evaluate model
  # 5. save model
  # 6. make a prediction
  print("End predict student major demo ")

if __name__ == "__main__":
  main()
It's important to document the versions of Python and PyTorch being used because both systems are under continuous development. Dealing with versioning incompatibilities is a significant headache when working with PyTorch and is something you should not underestimate. The demo program imports the Python time module to timestamp saved checkpoints.
I prefer to use "T" as the top-level alias for the torch package. Most of my colleagues don't use a top-level alias and spell out "torch" dozens of times per program. Also, I use the full form of sub-packages rather than supplying aliases such as "import torch.nn.functional as functional". In my opinion, using the full form is easier to understand and less error-prone than using many aliases.
The demo program defines a program-scope CPU device object. I usually develop my PyTorch programs on a desktop CPU machine. After I get that version working, converting to a CUDA GPU system only requires changing the global device object to T.device("cuda") plus a minor amount of debugging.
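As an alternative to editing the device string by hand, the device can be selected at run time. The sketch below is not what the demo program does (the demo hard-codes the CPU device); it assumes only the standard T.cuda.is_available() function and falls back to the CPU when no CUDA GPU is present.

```python
import torch as T

# Use a CUDA GPU if one is available, otherwise fall back to the CPU.
# (Assumption: the demo itself simply hard-codes T.device("cpu").)
device = T.device("cuda" if T.cuda.is_available() else "cpu")

# One encoded input item: sex, units, state (one-hot), test score.
x = T.tensor([[-1.0, 0.305, 0.0, 0.0, 1.0, 0.543]],
  dtype=T.float32).to(device)

print(device.type)
print(x.device.type)
```

Tensors and the network must live on the same device, which is why the demo calls .to(device) on the data it loads.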
The demo program defines just one helper function, accuracy(). All of the rest of the program control logic is contained in a main() function. It is possible to define other helper functions such as train_net(), evaluate_model() and save_model(), but in my opinion this modularization approach makes the program more difficult to understand rather than easier.
Defining a Student Dataset Class
Serving up batches of data for training a network and evaluating the accuracy of a trained model is a bit trickier than you might expect if you're new to PyTorch. In the early days of PyTorch, the most common approach was to write completely custom code. You can still write one-off code for loading data, but now the most common approach is to implement a Dataset and DataLoader. Briefly, a Dataset object loads all training or test data into memory, and a DataLoader object serves up the data in batches.
You can think of a PyTorch Dataset as an interface that must be implemented. At a minimum, you must define an __init__() method which reads data from file into memory, a __len__() method which returns the total number of items in the source data, and a __getitem__() method which returns the data item at a given index. The DataLoader calls __getitem__() repeatedly and assembles the returned items into a batch. There are many design alternatives and no two Dataset class definitions will be the same.
A DataLoader object is instantiated by passing in a Dataset object. The DataLoader object can be iterated, serving up one batch of data at a time. Unlike the Dataset which must be implemented, a DataLoader is ready to use as-is.
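The relationship between the two objects can be sketched with a tiny in-memory Dataset. The class TinyDataset is a hypothetical stand-in that holds dummy values rather than the Student data, and the batch size and shuffle settings are assumptions chosen for illustration.

```python
import torch as T

class TinyDataset(T.utils.data.Dataset):
  # Hypothetical stand-in: 6 predictor values per item and
  # a class label in {0, 1, 2}. All values are dummy data.
  def __init__(self, n_items):
    self.x_data = T.zeros((n_items, 6), dtype=T.float32)
    self.y_data = T.arange(n_items, dtype=T.int64) % 3

  def __len__(self):
    return len(self.x_data)

  def __getitem__(self, idx):
    # Return one item as a dictionary of predictors and target.
    return {"predictors": self.x_data[idx],
      "targets": self.y_data[idx]}

ds = TinyDataset(25)
ldr = T.utils.data.DataLoader(ds, batch_size=10, shuffle=True)

# Iterating the DataLoader yields one collated batch at a time.
shapes = [tuple(batch["predictors"].shape) for batch in ldr]
print(len(ldr))
print(shapes)
```

With 25 items and a batch size of 10, the loader serves two full batches followed by one partial batch of five items, because the last partial batch is kept by default.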
The definition of class StudentDataset is shown in Listing 2. In most cases, the structures of the training and test data files are the same and you can use a single Dataset definition for both files. If the structures of your files are different, then you'd have to define two different Dataset classes, or parameterize the Dataset definition.
Listing 2: Class StudentDataset Definition
class StudentDataset(T.utils.data.Dataset):
  # sex units   state   test_score major
  # -1  0.395   0 0 1   0.5120     1
  #  1  0.275   0 1 0   0.2860     2
  # -1  0.220   1 0 0   0.3350     0
  # sex: -1 = male, +1 = female
  # state: maryland, nebraska, oklahoma
  # major: finance, geology, history

  def __init__(self, src_file, n_rows=None):
    all_xy = np.loadtxt(src_file, max_rows=n_rows,
      usecols=[0,1,2,3,4,5,6], delimiter="\t",
      skiprows=0, comments="#", dtype=np.float32)

    n = len(all_xy)
    tmp_x = all_xy[0:n,0:6]  # all rows, cols [0,6)
    tmp_y = all_xy[0:n,6]    # 1-D required

    self.x_data = \
      T.tensor(tmp_x, dtype=T.float32).to(device)
    self.y_data = \
      T.tensor(tmp_y, dtype=T.int64).to(device)

  def __len__(self):
    return len(self.x_data)

  def __getitem__(self, idx):
    preds = self.x_data[idx]
    trgts = self.y_data[idx]
    sample = {
      'predictors' : preds,
      'targets' : trgts
    }
    return sample
The __init__() method begins by reading all relevant data from file into memory using the NumPy loadtxt() function:
all_xy = np.loadtxt(src_file, max_rows=n_rows,
  usecols=[0,1,2,3,4,5,6], delimiter="\t",
  skiprows=0, comments="#", dtype=np.float32)
The synthetic Student data contains both predictor values and labels-to-predict values in the same file, so both can be read at the same time. A slightly less efficient alternative is to read the predictor values with one call to loadtxt() and then read the values-to-predict with a second call.
Python has dozens of ways to read a text file into memory, but using loadtxt() is the technique I prefer. Some of my colleagues favor using the NumPy genfromtxt() or fromfile() functions, or the Pandas read_csv() function. The data is read into a NumPy matrix as float32 values.
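To see what loadtxt() produces, the call can be tried on a small in-memory sample instead of a file. The sketch below assumes the encoded, tab-delimited layout described earlier and uses io.StringIO as a stand-in for the data file; the two data rows are the example rows shown in the Listing comments.

```python
import io
import numpy as np

# Two encoded rows (tab-delimited) preceded by a comment line.
# io.StringIO stands in for the data file on disk.
text = (
  "# sex units state(3) score major\n"
  "-1\t0.395\t0\t0\t1\t0.5120\t1\n"
  "1\t0.275\t0\t1\t0\t0.2860\t2\n"
)

all_xy = np.loadtxt(io.StringIO(text),
  usecols=[0,1,2,3,4,5,6], delimiter="\t",
  comments="#", dtype=np.float32)

x = all_xy[:, 0:6]                  # predictors, columns [0,6)
y = all_xy[:, 6].astype(np.int64)   # labels as 1-D integer array

print(all_xy.shape)
print(x.shape)
print(y)
```

The comment line is skipped because it starts with the "#" character, so the result is a 2-by-7 matrix of float32 values that is then split into a 2-by-6 predictor matrix and a 1-D array holding the two class labels.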