The Data Science Lab
The t-SNE Data Visualization Technique from Scratch Using C#
The MatShow() call is an overload; its parameters mean display using 4 decimals, width 12, and show row indices. At this point, I manually copied the reduced data values from the running shell and placed them into an Excel spreadsheet to make the graph shown in Figure 2. In a non-demo scenario with a large dataset, I could have saved the reduced data as a comma-delimited text file using the MatSave() function defined in the demo program. Instead of a manual approach, I could have used one of many ways to programmatically transfer data to an Excel spreadsheet, or to programmatically generate a graph in a WinForms application.
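The demo's actual MatSave() function isn't shown in this excerpt, so its exact signature is an assumption, but a helper of that kind can be sketched in just a few lines. The sketch below writes a matrix to a comma-delimited text file using a fixed number of decimals and the invariant culture (so the decimal separator is always a period, regardless of machine locale):

```csharp
using System;
using System.Globalization;
using System.IO;

static class MatUtils
{
  // Sketch of a MatSave()-style helper; the demo program's actual
  // signature and formatting details may differ.
  public static void MatSave(double[][] m, string fn, int dec)
  {
    string fmt = "F" + dec;  // fixed-point format, e.g. "F4"
    using (StreamWriter sw = new StreamWriter(fn))
    {
      for (int i = 0; i < m.Length; ++i)
        sw.WriteLine(string.Join(",",
          Array.ConvertAll(m[i],
            v => v.ToString(fmt, CultureInfo.InvariantCulture))));
    }
  }
}
```

A file saved this way can be opened directly in Excel or read back with a corresponding MatLoad()-style function that splits each line on the comma character.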
Initialization
A non-trivial task when implementing t-SNE using raw C# is generating Gaussian-distributed random values with mean = 0 and standard deviation = 1. Gaussian values are used to initialize the n-by-2 result matrix. The demo program implements a private-scope Gaussian class that is nested inside the main TSNE class. The code is shown in Listing 3.
Listing 3: The Gaussian Class
class Gaussian
{
  private Random rnd;
  private double mean;
  private double sd;

  public Gaussian(double mean, double sd, int seed)
  {
    this.rnd = new Random(seed);
    this.mean = mean;
    this.sd = sd;
  }

  public double NextGaussian()
  {
    double u1 = this.rnd.NextDouble();
    double u2 = 1.0 - this.rnd.NextDouble();  // in (0, 1] so Log() is defined
    double left = Math.Cos(2.0 * Math.PI * u1);
    double right = Math.Sqrt(-2.0 * Math.Log(u2));
    double z = left * right;
    return this.mean + (z * this.sd);
  }
} // Gaussian
The Gaussian constructor accepts a mean, a standard deviation, and a seed value for reproducibility. The NextGaussian() method returns a single Gaussian-distributed value. The Gaussian class implements a simplified version of the clever Box-Muller algorithm. Notice that the argument to Math.Log() is computed as 1.0 minus NextDouble() so that it falls in the interval (0, 1], avoiding a possible Log(0).
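Putting the class to work, initializing the n-by-2 result matrix looks something like the sketch below. It uses the Gaussian class from Listing 3; the variable names and the value of n are illustrative assumptions, not the demo's exact code:

```csharp
using System;

// Initialize the n-by-2 t-SNE result matrix with Gaussian values
// (mean = 0, sd = 1), using the Gaussian class from Listing 3.
Gaussian g = new Gaussian(0.0, 1.0, seed: 1);
int n = 12;  // number of data items in the source data
double[][] result = new double[n][];
for (int i = 0; i < n; ++i)
{
  result[i] = new double[2];
  for (int j = 0; j < 2; ++j)
    result[i][j] = g.NextGaussian();
}
```

Because the constructor accepts a seed, the same initial result matrix is produced on every run, which makes t-SNE experiments reproducible. Some t-SNE implementations additionally scale the initial values by a small factor such as 0.0001, but that refinement is optional.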
Wrapping Up
Using t-SNE is just one of several ways to visualize data that has three or more columns. Such systems are often called dimensionality reduction techniques. Principal component analysis (PCA) is similar in some respects to t-SNE. PCA doesn't require any hyperparameters, but PCA visualization results tend to be either very good or quite poor.
Another dimensionality reduction technique to visualize data is to use a neural network autoencoder with a central hidden layer that has two nodes. Neural autoencoders require many hyperparameters (number of hidden layers, hidden and output activation functions, learning rate, batch size, maximum epochs) and so a good visualization is more difficult to achieve. But a neural autoencoder can theoretically produce the best visualization.
Although t-SNE is designed to generate reduced two-dimensional data for use in a graph, the technique can generate higher-dimensional data. But this use of t-SNE is relatively rare because PCA and autoencoders are usually a better approach.
The t-SNE technique is computationally expensive and can be very slow when applied to source data that has many (more than 100) columns. For such source data, a common trick is to apply PCA to reduce the data to roughly 50 columns, and then apply t-SNE to reduce to two columns.
About the Author
Dr. James McCaffrey works for Microsoft Research in Redmond, Wash. He has worked on several Microsoft products including Azure and Bing. James can be reached at [email protected].