The Data Science Lab
The t-SNE Data Visualization Technique from Scratch Using C#
The MatShow() call is an overload; its parameters mean display using 4 decimals, width 12, and show row indices. At this point, I manually copied the reduced data values from the running shell and placed them into an Excel spreadsheet to make the graph shown in Figure 2. In a non-demo scenario with a large dataset, I could have saved the reduced data as a comma-delimited text file using the MatSave() function defined in the demo program. Instead of a manual approach, I could have used one of many ways to programmatically transfer data to an Excel spreadsheet, or to programmatically generate a graph in a WinForms application.
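The demo's actual MatSave() function isn't shown in this excerpt, so its exact signature is an assumption, but a helper of that kind can be sketched in just a few lines. The sketch below writes a matrix to a comma-delimited text file using a fixed number of decimals and the invariant culture (so the decimal separator is always a period, regardless of machine locale):

```csharp
using System;
using System.Globalization;
using System.IO;

static class MatUtils
{
  // Sketch of a MatSave()-style helper; the demo program's actual
  // signature and formatting details may differ.
  public static void MatSave(double[][] m, string fn, int dec)
  {
    string fmt = "F" + dec;  // fixed-point format, e.g. "F4"
    using (StreamWriter sw = new StreamWriter(fn))
    {
      for (int i = 0; i < m.Length; ++i)
        sw.WriteLine(string.Join(",",
          Array.ConvertAll(m[i],
            v => v.ToString(fmt, CultureInfo.InvariantCulture))));
    }
  }
}
```

A file saved this way can be opened directly in Excel or read back with a corresponding MatLoad()-style function that splits each line on the comma character.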
Initialization
A non-trivial task when implementing t-SNE using raw C# is generating Gaussian-distributed random values with mean = 0 and standard deviation = 1. Gaussian values are used to initialize the n-by-2 result matrix. The demo program implements a private-scope Gaussian class that is nested inside the main TSNE class. The code is shown in Listing 3.
Listing 3: The Gaussian Class
class Gaussian
{
  private Random rnd;
  private double mean;
  private double sd;

  public Gaussian(double mean, double sd, int seed)
  {
    this.rnd = new Random(seed);
    this.mean = mean;
    this.sd = sd;
  }

  public double NextGaussian()
  {
    double u1 = this.rnd.NextDouble();
    double u2 = 1.0 - this.rnd.NextDouble();  // in (0, 1] so Log() is defined
    double left = Math.Cos(2.0 * Math.PI * u1);
    double right = Math.Sqrt(-2.0 * Math.Log(u2));
    double z = left * right;
    return this.mean + (z * this.sd);
  }
} // Gaussian
The Gaussian constructor accepts a mean, a standard deviation, and a seed value for reproducibility. The NextGaussian() method returns a single Gaussian-distributed value. The Gaussian class implements a simplified version of the clever Box-Muller algorithm. Notice that the argument to Math.Log() is computed as 1.0 minus NextDouble() so that it falls in the interval (0, 1], avoiding a possible Log(0).
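Putting the class to work, initializing the n-by-2 result matrix looks something like the sketch below. It uses the Gaussian class from Listing 3; the variable names and the value of n are illustrative assumptions, not the demo's exact code:

```csharp
using System;

// Initialize the n-by-2 t-SNE result matrix with Gaussian values
// (mean = 0, sd = 1), using the Gaussian class from Listing 3.
Gaussian g = new Gaussian(0.0, 1.0, seed: 1);
int n = 12;  // number of data items in the source data
double[][] result = new double[n][];
for (int i = 0; i < n; ++i)
{
  result[i] = new double[2];
  for (int j = 0; j < 2; ++j)
    result[i][j] = g.NextGaussian();
}
```

Because the constructor accepts a seed, the same initial result matrix is produced on every run, which makes t-SNE experiments reproducible. Some t-SNE implementations additionally scale the initial values by a small factor such as 0.0001, but that refinement is optional.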
Wrapping Up
Using t-SNE is just one of several ways to visualize data that has three or more columns. Such systems are often called dimensionality reduction techniques. Principal component analysis (PCA) is similar in some respects to t-SNE. PCA doesn't require any hyperparameters, but PCA visualization results tend to be either very good or quite poor.
Another dimensionality reduction technique to visualize data is to use a neural network autoencoder with a central hidden layer that has two nodes. Neural autoencoders require many hyperparameters (number of hidden layers, hidden and output activation functions, learning rate, batch size, maximum epochs) and so a good visualization is more difficult to achieve. But a neural autoencoder can theoretically produce the best visualization.
Although t-SNE is designed to generate reduced two-dimensional data for use in a graph, the technique can generate higher-dimensional data. But this use of t-SNE is relatively rare because PCA and autoencoders are usually a better approach.
The t-SNE technique is computationally expensive and can be very slow when applied to source data that has many (more than 100) columns. For such source data, a common trick is to apply PCA to reduce the data to roughly 50 columns, and then apply t-SNE to reduce to two columns.
About the Author
Dr. James McCaffrey works for Microsoft Research in Redmond, Wash. He has worked on several Microsoft products including Azure and Bing. James can be reached at [email protected].