The Data Science Lab

Data Anomaly Detection Using a Neural Autoencoder with C#

The number of input nodes and output nodes, 9 in the case of the demo, is entirely determined by the normalized source data. The number of hidden nodes, 6 in the demo, is a hyperparameter that must be determined by trial and error. If too few hidden nodes are used, the autoencoder doesn't have enough power to model the source data well. If too many hidden nodes are used, the autoencoder will essentially memorize the source data -- overfitting the data, and the model won't find anomalies.

The seed value is used to initialize the autoencoder weights and biases to small random values. Different seed values can give significantly different results, but you shouldn't try to fine tune the model by adjusting the seed parameter.

The autoencoder is trained using these statements:

int maxEpochs = 1000;
double lrnRate = 0.01;
int batSize = 10;
Console.WriteLine("Starting training ");
// the data serves as both the input and the target output
nn.Train(dataX, dataX, lrnRate, batSize, maxEpochs);
Console.WriteLine("Done ");

Behind the scenes, the demo system trains the autoencoder using a clever algorithm called back-propagation. The maxEpochs, lrnRate, and batSize parameters are hyperparameters that must be determined by trial and error. If too few training epochs are used, the model will underfit the source data. Too many training epochs will overfit the data.

The learning rate controls how much the weights and biases change on each update during training. A very small learning rate will slowly but surely approach optimal weight and bias values, but training could be too slow. A large learning rate will quickly converge to a solution but could skip over optimal weight and bias values.

The batch size specifies how many data items to group together during training. A batch size of 1 is sometimes called "online training," and a batch size equal to the number of data items (240 in the demo) is sometimes called "full batch training," but neither extreme is commonly used. In practice, it's a good idea to specify a batch size that evenly divides the number of data items so that all batches are the same size.

To recap, to use the demo program as a template, after you normalize and encode your source data, the number of input and output nodes is determined by the data. You must experiment with the number of hidden nodes, the maxEpochs, lrnRate, and batSize parameters. You don't have to modify the underlying methods.

Analyzing the Data
After the neural autoencoder has been trained, the model is called on a representative input item to verify that the output makes sense:

Console.WriteLine("Predicting output for male" +
  " 39 Oklahoma $51,200 moderate ");
// male, 39, Oklahoma, $51,200, moderate -- normalized and one-hot encoded
double[] X = new double[] { 0, 0.39, 0, 0, 1,
  0.51200, 0, 1, 0 };
double[] y = nn.ComputeOutput(X);
Console.WriteLine("Predicted output: ");
Utils.VecShow(y, 5, 9, true);

The trained model should return a vector that is close to, but not exactly the same as, the input vector. In this example, the input is (0, 0.39, 0, 0, 1, 0.51200, 0, 1, 0). The computed output is (0.00390, 0.39768, -0.00035, -0.00252, 1.00286, 0.50118, -0.00225, 1.00290, -0.00065). The sex, State, and political leaning variables are predicted very closely. The age and income variables are not predicted perfectly, but they're close to the input values.

The demo calls an Analyze() method like so:

Console.WriteLine("Analyzing data ");
nn.Analyze(dataX, rawFileArray);

In very high-level pseudo-code, the Analyze() method is:

initialize maxError = 0
loop each normalized/encoded data item
  feed data item to autoencoder, fetch output
  compute Euclidean distance between input and output
  if distance > maxError then
    maxError = distance
    most anomalous idx = curr idx
  end-if
end-loop

The demo uses Euclidean distance between vectors as a measure of error. For example, if vector v1 = (3.0, 5.0, 6.0) and vector v2 = (2.0, 8.0, 6.0), the Euclidean distance is sqrt( (3.0 - 2.0)^2 + (5.0 - 8.0)^2 + (6.0 - 6.0)^2 ) = sqrt(1.0 + 9.0 + 0.0) = sqrt(10.0) = 3.16.

A common variation of ordinary Euclidean distance is to divide the distance by the number of elements in the vector (also known as the dimension of the vector) to make it easier to compare systems that have datasets with different vector dimensions.

Euclidean distance heavily punishes deviations due to the squaring operation. An alternative is to use the sum of the absolute values of the differences between vector elements. For the two vectors above, that would be abs(3.0 - 2.0) + abs(5.0 - 8.0) + abs(6.0 - 6.0) = 1.0 + 3.0 + 0.0 = 4.0. You should have no trouble modifying the Analyze() method to use different error metrics.
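
To make the two error metrics concrete, here is a minimal sketch of both, written as standalone helper functions. The names EucDistance and AbsDistance are illustrative and are not part of the demo program.

// minimal sketch of two reconstruction error metrics; the names
// EucDistance and AbsDistance are illustrative, not from the demo
static double EucDistance(double[] v1, double[] v2)
{
  double sum = 0.0;
  for (int i = 0; i < v1.Length; ++i)
    sum += (v1[i] - v2[i]) * (v1[i] - v2[i]);
  return Math.Sqrt(sum);  // divide by v1.Length for the per-element variation
}

static double AbsDistance(double[] v1, double[] v2)
{
  double sum = 0.0;
  for (int i = 0; i < v1.Length; ++i)
    sum += Math.Abs(v1[i] - v2[i]);
  return sum;
}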

The Analyze() method accepts, as parameters, the normalized dataset in order to compute error, and the raw data in order to show the anomalous data in a friendly format.
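
Putting the pieces together, an implementation along the lines of the pseudo-code might look like the following sketch. It assumes the ComputeOutput() method shown earlier and the EucDistance() helper above; the actual demo implementation may differ in its details.

// sketch of Analyze() -- assumes ComputeOutput() and the EucDistance()
// helper above; details may differ from the actual demo program
public void Analyze(double[][] dataX, string[] rawFileArray)
{
  double maxError = 0.0;
  int maxIdx = 0;
  for (int i = 0; i < dataX.Length; ++i)
  {
    double[] output = this.ComputeOutput(dataX[i]);  // reconstructed item
    double err = EucDistance(dataX[i], output);      // reconstruction error
    if (err > maxError)
    {
      maxError = err;
      maxIdx = i;
    }
  }
  Console.WriteLine("Most anomalous data item (index " + maxIdx + "):");
  Console.WriteLine(rawFileArray[maxIdx]);  // show in friendly raw form
  Console.WriteLine("Reconstruction error: " + maxError.ToString("F4"));
}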

The demo program finds the single most anomalous data item. Another approach is to save the reconstruction errors for all data items and sort the errors from largest to smallest. This gives you the n most anomalous data items instead of just the single most anomalous one.
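
One way to implement that variation, sketched here under the same assumptions as the Analyze() code above rather than taken from the demo, is to store each item's error in an array and then sort the item indices by error:

// sketch of a top-n variation: compute every reconstruction error,
// then order item indices from largest error to smallest
int n = dataX.Length;
double[] errors = new double[n];
int[] indices = new int[n];
for (int i = 0; i < n; ++i)
{
  double[] output = nn.ComputeOutput(dataX[i]);
  errors[i] = EucDistance(dataX[i], output);
  indices[i] = i;
}
Array.Sort(errors, indices);   // ascending by error; indices follow
Array.Reverse(indices);        // largest-error items first
for (int k = 0; k < 5; ++k)    // show the 5 most anomalous items
  Console.WriteLine(rawFileArray[indices[k]]);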

Wrapping Up
The neural autoencoder anomaly detection technique presented in this article is just one of many ways to look for data anomalies. The technique assumes you are working with tabular data, such as log files. Image data, time series data, and natural language data all require more specialized techniques, and there are also specialized techniques for specific problem domains, such as fraud detection. That said, applying a neural autoencoder anomaly detection system to tabular data is typically the best way to start.

A limitation of the autoencoder architecture presented in this article is that it only has a single hidden layer. Neural autoencoders with multiple hidden layers are called deep autoencoders. Implementing a deep autoencoder is possible but requires a lot of effort. A result from the Universal Approximation Theorem (sometimes called the Cybenko Theorem) states, loosely speaking, that a neural network with a single hidden layer and enough hidden nodes can approximate any function that can be approximated by a deep autoencoder.

About the Author

Dr. James McCaffrey works for Microsoft Research in Redmond, Wash. He has worked on several Microsoft products including Azure and Bing. James can be reached at [email protected].
