Autoencoder Anomaly Detection Using PyTorch -- Visual Studio Magazine

Autoencoder Anomaly Detection Using PyTorch

Dr. James McCaffrey of Microsoft Research provides full code and step-by-step examples of anomaly detection, used to find items in a dataset that are different from the majority for tasks like detecting credit card fraud.

By James McCaffrey
04/13/2021

Anomaly detection is the process of finding items in a dataset that are different in some way from the majority of the items. For example, you could examine a dataset of credit card transactions to find anomalous items that might indicate a fraudulent transaction. This article explains how to use a PyTorch neural autoencoder to find anomalies in a dataset.

A good way to see where this article is headed is to take a look at the screenshot of a demo program in Figure 1. The demo analyzes a dataset of 3,823 images of handwritten digits where each image is 8 by 8 pixels. The demo program presented in this article uses image data, but the autoencoder anomaly detection technique can work with any type of data.

The demo begins by creating a Dataset object that stores the images in memory. Next, the demo creates a 65-32-8-32-65 neural autoencoder. An autoencoder learns to predict its input. Therefore, the autoencoder input and output both have 65 values -- 64 pixel grayscale values (0 to 16) plus a label (0 to 9). Notice that the demo program analyzes both the predictors (pixel values) and the dataset labels (digits). Depending upon your particular anomaly detection scenario, you might not include the labels.

The demo sets up training parameters for the batch size (10), number of epochs to train (100), loss function (mean squared error), optimization algorithm (stochastic gradient descent) and learning rate (0.005). After training the autoencoder, the demo scans the dataset and computes the reconstruction error for each data item. The data item that has the largest error is item [486] with error = 0.1352. The demo concludes by displaying that anomalous item, which is a "7" digit.

Figure 1: Autoencoder Anomaly Detection in Action — **[Click on image for larger view.]** *Figure 1:* Autoencoder Anomaly Detection in Action

This article assumes you have an intermediate or better familiarity with a C-family programming language, preferably Python, but doesn't assume you know very much about PyTorch. The complete source code for the demo program is presented in this article. The source code is also available in the accompanying file download. All normal error checking code has been omitted to keep the main ideas as clear as possible.

To run the demo program, you must have Python and PyTorch installed on your machine. The demo programs were developed on Windows 10 using the Anaconda 2020.02 64-bit distribution (which contains Python 3.7.6) and PyTorch version 1.8.0 for CPU installed via pip. Installation is not trivial. You can find detailed step-by-step installation instructions for this configuration in my blog post.

The UCI Digits Dataset
The UCI Digits dataset can be found here. There is a 3,823-item file named optdigits.tra (intended for training) and a 1,797-item file named optdigits.tes (for testing). I downloaded the files and renamed them to optdigits_train_3823.txt and optdigits_test_1797.txt. Each file is a simple, comma-delimited text file. Each line represents an 8 by 8 handwritten digit from "0" to "9."

The UCI Digits dataset resembles the well-known MNIST dataset. MNIST has 60,000 training and 10,000 test image. Each image is 28 by 28 = 784 pixels, and the source MNIST files are stored in a proprietary binary format. The UCI digits dataset is much easier to work with.

The UCI Digits data looks like:

0,1,6,16,12, . . . 1,0,0,13,0
2,7,8,11,15, . . . 16,0,7,4,1
. . .

The first 64 values on each line are the image pixel values. Each pixel is a grayscale value between 0 and 16. The last value on each line is the digit/label. There are about 380 of each digit in the training file and about 180 of each digit in the test file, but the digits are not evenly distributed. The counts of each "0" though "9" digit are:

optdigits_train_3823.txt:
[376, 389, 380, 389, 387, 376, 377, 387, 380, 382]

optdigits_test_1797.txt:
[178, 182, 177, 183, 181, 182, 181, 179, 174, 180]

The 10 images in Figure 2 are representative digits. With only 64 pixels, each image is quite crude when displayed visually.

Figure 2: Examples of UCI Digits Data Displayed Visually — **[Click on image for larger view.]** *Figure 2:* Examples of UCI Digits Data Displayed Visually

The demo program defines a PyTorch Dataset class to load the data in memory. See Listing 1.

Listing 1: A Dataset Class for the UCI Digits Data

import torch as T
import numpy as np

class UCI_Digits_Dataset(T.utils.data.Dataset):
  # 8,12,0,16, . . 15,7
  # 64 pixel values [0-16], digit [0-9]

  def __init__(self, src_file, n_rows=None):
    all_xy = np.loadtxt(src_file, max_rows=n_rows,
      usecols=range(0,65), delimiter=",", comments="#",
      dtype=np.float32)
    self.xy_data = T.tensor(all_xy, dtype=T.float32).to(device) 
    self.xy_data[:, 0:64] /= 16.0   # normalize pixels
    self.xy_data[:, 64] /= 9.0      # normalize digit/label

  def __len__(self):
    return len(self.xy_data)

  def __getitem__(self, idx):
    xy = self.xy_data[idx]
    return xy

The class loads a file of UCI digits data into memory as a two-dimensional array using the NumPy loadtxt() function. Alternatives loading functions include the NumPy genfromtxt() or fromfile() functions, or the Pandas read_csv() function. After converting the NumPy array to a PyTorch tensor array, the pixel values in columns [0] to [63] are normalized by dividing by 16, and the label values in column [64] are normalized by dividing by 9. The resulting pixel and label values are all between 0.0 and 1.0.

In most scenarios, the __getitem__() method returns a Python tuple with predictors and labels. But for an autoencoder, each data item acts as both the input and the target to predict.

The Dataset can be used with code like this:

fn = ".\\Data\\optdigits_train_3823.txt"
my_ds = UCI_Digits_Dataset(fn)
my_ldr = T.utils.data.DataLoader(my_ds, \
  batch_size=10, shuffle=True)
for (b_ix, batch) in enumerate(my_ldr):
  # b_ix is the batch index
  # batch item has 65 values between 0 and 1
. . .

The Dataset object is passed to a built-in PyTorch DataLoader object. The DataLoader object serves up the data in batches of a specified size, in a random order on each pass through the Dataset.

The design pattern presented here will work for most autoencoder anomaly detection scenarios. If your raw data contains a categorical variable, such as "color" with possible values "red", "blue" or "green", you can one-hot encode the data: "red" = (1, 0, 0), "blue" = (0, 1, 0), "green" = (0, 0, 1). If your source data is too large to load into memory, you'll have to write a custom data loader that buffers the data. I describe how to create streaming data loaders in a previous article; you can find it here .

Autoencoders
An autoencoder is a neural network that predicts its own input. The diagram in Figure 3 shows the architecture of the 65-32-8-32-65 autoencoder used in the demo program. An input image x, with 65 values between 0 and 1 is fed to the autoencoder. A neural layer transforms the 65-values tensor down to 32 values. The next layer produces a core tensor with 8 values. The core 8 values generate 32 values, which in turn generate 65 values. The size of the first and last layers of an autoencoder are determined by the problem data, but the number of interior hidden layers, and the number of nodes in each hidden layer, are hyperparameters that must be determined by trial and error guided by experience.

Figure 3: Autoencoder Architecture for the UCI Digits Dataset — **[Click on image for larger view.]** *Figure 3:* Autoencoder Architecture for the UCI Digits Dataset

The idea is that the first part of the autoencoder finds the fundamental information contained in the input image, stripping away noise and random error. The second part of the autoencoder generates a cleaned version of the input. The first part of an autoencoder is called the encoder component, and the second part is called the decoder.

To use an autoencoder for anomaly detection, you compare the reconstructed version of an image with its source input. If the reconstructed version of an image differs greatly from its input, the image is anomalous in some way.

The definition of the demo program autoencoder is presented in Listing 2. There are many design alternatives. The __init__() method defines four fully-connected ("fc") layers. You might want to parameterize __init__() to accept the layer sizes instead of hard-coding them as the demo does. The class uses default initialization for weights and biases. Weight and bias initialization is a surprisingly complex topic. You might want to explicitly initialize weights using the T.nn.init.uniform_() function. I sometimes get significantly better results using explicit weight initialization.

Listing 2: Autoencoder Definition for UCI Digits Dataset

class Autoencoder(T.nn.Module):  # 65-32-8-32-65
  def __init__(self):
    super(Autoencoder, self).__init__()
    self.fc1 = T.nn.Linear(65, 32)
    self.fc2 = T.nn.Linear(32, 8)
    self.fc3 = T.nn.Linear(8, 32)
    self.fc4 = T.nn.Linear(32, 65)

  def encode(self, x):  # 65-32-8
    z = T.tanh(self.fc1(x))
    z = T.tanh(self.fc2(z))  # latent in [-1,+1]
    return z  

  def decode(self, x):  # 8-32-65
    z = T.tanh(self.fc3(x))
    z = T.sigmoid(self.fc4(z))  # [0.0, 1.0]
    return z
    
  def forward(self, x):  # 65-32-8-32-65
    z = self.encode(x) 
    z = self.decode(z) 
    return z  # in [0.0, 1.0]

The Autoencoder defines explicit encode() and decode() methods, and then defines the forward() method using encode() and decode(). Because an autoencoder for anomaly detection often doesn't directly use the values in the interior core layer, it's possible to eliminate encode() and decode() and define the forward() method directly:

def forward(self, x):
  z = T.tanh(self.fc1(x))
  z = T.tanh(self.fc2(z))  # latent in [-1,+1]
  z = T.tanh(self.fc3(z))
  z = T.sigmoid(self.fc4(z))  # [0.0, 1.0]
  return z

Using this approach, the first part of forward() acts as the encoder component and the second part acts as the decoder.

The demo program uses tanh() activation on all layers except the final output layer, where sigmoid() is used because the output values must be in range [0.0, 1.0] to match the input values. Many of the autoencoder examples I see online use relu() activation for interior layers. The relu() function was designed for use with very deep neural architectures. For autoencoders, which are usually relatively shallow, I often, but not always, get better results with tanh() activation.

The Overall Program Structure
The overall structure of the PyTorch autoencoder anomaly detection demo program, with a few minor edits to save space, is shown in Listing 3. I prefer to indent my Python programs using two spaces rather than the more common four spaces.

Listing 3: The Structure of the Autoencoder Anomaly Program

# uci_digits_auto_anomaly.py

# autoencoder reconstruction error anomaly detection
# uses an encoder-decoder architecture
# PyTorch 1.8.0-CPU Anaconda3-2020.02  Python 3.7.6
# Windows 10 

import numpy as np
import matplotlib.pyplot as plt
import torch as T

device = T.device("cpu") 

# -----------------------------------------------------------

class UCI_Digits_Dataset(T.utils.data.Dataset): . . .

class Autoencoder(T.nn.Module): . . .
    
def display_digit(ds, idx, save=False): . . .

def train(ae, ds, bs, me, le, lr): . . .

def make_err_list(model, ds): . . .

# -----------------------------------------------------------

def main():
  # 0. get started
  # 1. create Dataset object
  # 2. create autoencoder net
  # 3. train autoencoder model
  # 4. compute and store reconstruction errors
  # 5. show most anomalous data item

  print("End autoencoder anomaly detection demo ")

# -----------------------------------------------------------

if __name__ == "__main__":
  main()

It's important to document the versions of Python and PyTorch being used because both systems are under continuous development. Dealing with versioning incompatibilities is a significant headache when working with PyTorch and is something you should not underestimate.

I prefer to use "T" as the top-level alias for the torch package. Most of my colleagues don't use a top-level alias and spell out "torch" many times per program. Also, I use the full form of submodules rather than supplying aliases such as "import torch.nn.functional as functional." In my opinion, using the full form is easier to understand and less error-prone than using many aliases.

The demo program defines a program-scope CPU device object. I usually develop my PyTorch programs on a desktop CPU machine. After I get that version working, converting to a CUDA GPU system only requires changing the global device object to T.device("cuda") plus a minor amount of debugging.

The demo program defines three helper methods: display_digit(), train() and make_err_list(). All of the rest of the program control logic is contained in a main() function. Using helper functions makes the code a bit more difficult to understand, but allows you to manage and modify the code more easily.

Get Code Download

Printable Format

comments powered by Disqus

Featured

VS Code Insiders Get Copilot Chat AI-Enhanced Extensions in Latest Update

Developers using the "Insiders" build of the latest Visual Studio Code update, version 1.90, can now enjoy enhanced chat AI functionality in extensions.
Regression Using LightGBM

Dr. James McCaffrey of Microsoft Research presents a full-code, step-by-step tutorial on this powerful machine learning technique used to predict a single numeric value.
Microsoft Teases Discounts for August Dev Conference at Redmond HQ

Microsoft is offering a special discount for Visual Studio Professional and Enterprise subscribers wishing to attend a developer conference being held in August at the company's Redmond, Wash., headquarters.
Building Planet-Scale .NET Apps with Azure Cosmos DB

Azure Cosmos DB is a fully managed distributed database that can be transparently replicated across regions while remaining highly performant and seamlessly scaling according to needs, making it great for applications of any scale.
Java Devs in VS Code Can Now Ask Copilot for Syntax Rewrites

Count among the many emerging abilities of GitHub Copilot new functionality for rewriting your Java syntax in Visual Studio Code.