The Data Science Lab
Data Clustering Using a Self-Organizing Map (SOM) with C#
The demo program begins by loading the 12-item Penguin subset into memory using these statements:
string fn = "..\\..\\..\\Data\\penguin_12.txt";
double[][] X = ClusterSOM.MatLoad(fn,
new int[] { 1, 2, 3, 4 }, ',', "#");
Console.WriteLine("\nX: ");
ClusterSOM.MatShow(X, 1, 8, true);
The arguments to MatLoad() mean to load columns [1] to [4] of the data file, where items are comma-separated, and a "#" at the beginning of a line indicates a comment. The species values in column [0] are not loaded into memory.
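The demo's MatLoad() is program-defined C#, but the column-selection logic is language-agnostic. Here is a quick Python sketch of the same idea (the function name is illustrative, not part of the demo program):

```python
def mat_load(fn, usecols, delim, comment):
    # Load selected columns of a delimited text file into a list of
    # float rows, skipping blank lines and lines that start with the
    # comment token -- mirrors the behavior described for MatLoad().
    rows = []
    with open(fn) as f:
        for line in f:
            line = line.strip()
            if line == "" or line.startswith(comment):
                continue
            toks = line.split(delim)
            rows.append([float(toks[j]) for j in usecols])
    return rows
```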
SOM clustering works with labeled or unlabeled data. In some scenarios with labeled data, you may want to load the labels too. For example:
int[] y = ClusterSOM.VecLoad(fn, 0, "#"); // col [0]
Console.WriteLine("y labels: ");
ClusterSOM.VecShow(y, 3);
The data is z-score standardized and displayed using these statements:
double[][] stdX = ClusterSOM.MatStandard(X);
Console.WriteLine("Standardized data: ");
ClusterSOM.MatShow(stdX, 4, 9, true);
The arguments passed to the program-defined MatShow() function mean display using 4 decimals, using a width of 9 for each element, and show row index values.
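Z-score standardization subtracts each column's mean from every value in the column, then divides by the column's standard deviation, so each standardized column has mean 0.0 and standard deviation 1.0. A short Python sketch of the computation (not the demo's C# MatStandard(); the name is illustrative):

```python
import math

def mat_standard(X):
    # z-score standardize each column: (x - mean) / std dev
    n, dim = len(X), len(X[0])
    result = [[0.0] * dim for _ in range(n)]
    for j in range(dim):
        col = [X[i][j] for i in range(n)]
        mean = sum(col) / n
        sd = math.sqrt(sum((v - mean) ** 2 for v in col) / n)  # population sd
        for i in range(n):
            result[i][j] = (X[i][j] - mean) / sd
    return result
```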
Clustering the Data
The SOM object is created, and the data is clustered, like so:
int k = 3;
double lrnRateMax = 0.50;
int stepsMax = 1000;
ClusterSOM som = new ClusterSOM(stdX, k, seed: 0);
som.Cluster(lrnRateMax, stepsMax);
The parameter values for k, lrnRateMax and stepsMax are explicit hyperparameters that must be determined by trial and error. The seed value is an implicit hyperparameter. Different values for seed will give different results, but you shouldn't try to fine-tune SOM clustering using the seed value.
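The roles of lrnRateMax and stepsMax are easiest to see in the training loop itself: on each step, a randomly selected data item pulls its best matching map node, and nearby nodes, toward itself, and both the learning rate and the neighborhood radius decay linearly toward zero as step approaches stepsMax. A simplified Python sketch of a standard 1-by-k SOM training loop (this is the usual algorithm, not necessarily the demo's exact C# implementation):

```python
import random

def som_cluster(X, k, lrn_rate_max, steps_max, seed=0):
    # Train a 1-by-k self-organizing map.
    rnd = random.Random(seed)
    n, dim = len(X), len(X[0])
    # initialize each map node vector to a randomly chosen data item
    map_vecs = [list(X[rnd.randrange(n)]) for _ in range(k)]
    for step in range(steps_max):
        pct_left = 1.0 - step / steps_max   # decays from 1.0 to 0.0
        lrn_rate = pct_left * lrn_rate_max
        radius = int(pct_left * k)          # shrinking neighborhood
        x = X[rnd.randrange(n)]             # random data item
        # best matching unit = closest map node
        bmu = min(range(k), key=lambda j:
                  sum((x[d] - map_vecs[j][d]) ** 2 for d in range(dim)))
        # pull the BMU and its neighbors toward the data item
        for j in range(k):
            if abs(j - bmu) <= radius:
                for d in range(dim):
                    map_vecs[j][d] += lrn_rate * (x[d] - map_vecs[j][d])
    return map_vecs
```

Because each update moves a node partway toward a data item, the map node vectors always remain inside the range of the data.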
The SOM map data structure is displayed using these statements:
Console.WriteLine("SOM map nodes: ");
for (int kk = 0; kk < k; ++kk)
{
  Console.Write("k = " + kk + ": ");
  ClusterSOM.VecShow(som.map[0][kk], 4, 9);
}
The SOM mapping data structure is displayed using these statements:
Console.WriteLine("SOM mapping: ");
for (int kk = 0; kk < k; ++kk)
{
  Console.Write("k = " + kk + ": ");
  ClusterSOM.ListShow(som.mapping[0][kk]);
}
The SOM map is a 1-by-k matrix of vectors, so the first index value is always [0]. Similarly, the SOM mapping is a 1-by-k matrix of List objects, which are displayed using a program-defined ListShow() method. An alternative design is to implement dedicated public class methods with names such as ShowMap() and ShowMapping().
Inspecting the SOM Clustering Results
One of the advantages of using SOM clustering is that data items assigned to adjacent clusters are similar. The demo verifies this characteristic using these statements:
double[][] betweenDists = som.GetBetweenNodeDists();
Console.WriteLine("Between map node distances: ");
ClusterSOM.MatShow(betweenDists, 2, 6, true);
The result is:
Between map node distances:
[0] 0.00 3.29 3.99
[1] 3.29 0.00 1.84
[2] 3.99 1.84 0.00
This means the smallest Euclidean distance between map cell/node representative vectors is 1.84, between clusters 1 and 2. The largest distance between cells is 3.99 between clusters 0 and 2.
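The between-node distance matrix is just the pairwise Euclidean distance between the k map node vectors. A short Python sketch of that computation (the demo's GetBetweenNodeDists() is program-defined; this assumes plain Euclidean distance, which the symmetric result above is consistent with):

```python
import math

def between_node_dists(map_vecs):
    # pairwise Euclidean distances between map node vectors
    k = len(map_vecs)
    dists = [[0.0] * k for _ in range(k)]
    for i in range(k):
        for j in range(k):
            dists[i][j] = math.sqrt(
                sum((map_vecs[i][t] - map_vecs[j][t]) ** 2
                    for t in range(len(map_vecs[i]))))
    return dists
```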
The demo program concludes by displaying the clustering results in the usual format:
Console.WriteLine("clustering: ");
int[] clustering = som.GetClustering();
ClusterSOM.VecShow(clustering, wid: 3);
The result is [2, 2, 2, 2, 1, 1, 1, 1, 0, 0, 0, 0]. The clustering array contains the same information as the mapping object, just stored in an integer array rather than an array of List objects.
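A clustering array in this format can be produced directly by assigning each data item the index of its closest map node. A Python sketch of the idea (the demo's GetClustering() may instead read the assignments out of the mapping Lists; the function name here is illustrative):

```python
def get_clustering(X, map_vecs):
    # assign each data item the index of its closest map node
    clustering = []
    for x in X:
        best = min(range(len(map_vecs)),
                   key=lambda j: sum((x[d] - map_vecs[j][d]) ** 2
                                     for d in range(len(x))))
        clustering.append(best)
    return clustering
```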
Wrapping Up
In practice, self-organizing map clustering is often used to validate k-means clustering. A relatively unexplored idea is to perform SOM clustering using an r-by-c map instead of a 1-by-k map. When using a two-dimensional map, SOM clustering produces cluster IDs that have a (row, column) format instead of a simple integer cluster ID format.
SOM clustering can be easily extended to create a data anomaly detection system. After clustering, each data item is compared to its associated cluster map vector. Data items that are farthest from their associated map cell vector are anomalous.
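The anomaly detection idea sketched above can be expressed in a few lines of Python (function name illustrative; this is not part of the demo program):

```python
import math

def most_anomalous(X, map_vecs, clustering):
    # distance from each item to its assigned map node vector;
    # the item with the largest distance is the most anomalous
    dists = []
    for i, x in enumerate(X):
        m = map_vecs[clustering[i]]
        dists.append(math.sqrt(sum((x[t] - m[t]) ** 2
                                   for t in range(len(x)))))
    worst = max(range(len(X)), key=lambda i: dists[i])
    return worst, dists[worst]
```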
A major weakness of most clustering techniques, including SOM clustering, is that only strictly numeric data can be clustered. Suppose you have data that has a Color column with possible values red, blue, green. If you use ordinal encoding where red = 1, blue = 2, green = 3, the distance between red and green is greater than the distance between red and blue, even though no such ordering exists among the colors. If you use one-hot encoding where red = (1 0 0), blue = (0 1 0), green = (0 0 1), the distance between any two distinct colors is always the same (sqrt(2) for Euclidean distance), and this is true regardless of how many possible color values there are.
An unpublished technique that appears promising is to use what I call "one-over-n-hot" encoding. For the three-color example, the encoding is red = (0.33, 0, 0), blue = (0, 0.33, 0), green = (0, 0, 0.33). If there are four colors, the encoding values are (0.25, 0, 0, 0), (0, 0.25, 0, 0), (0, 0, 0.25, 0), (0, 0, 0, 0.25). In other words, for a categorical variable with n possible values, you use modified one-hot encoding where the 1 values are replaced by 1/n.
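The one-over-n-hot scheme is easy to implement. A minimal Python sketch (function name illustrative):

```python
def one_over_n_hot(index, n):
    # encode categorical value number `index` (0-based) among n
    # possible values as a one-hot vector with 1 replaced by 1/n
    vec = [0.0] * n
    vec[index] = 1.0 / n
    return vec
```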
For categorical data that is inherently ordered, you can use equal-interval encoding. For example, if you have a Height column with possible values short, medium and tall, then you can encode short = 0.25, medium = 0.50 and tall = 0.75.
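Equal-interval encoding generalizes to n ordered values by mapping value i (0-based) to (i + 1) / (n + 1), which reproduces the short/medium/tall example above. A minimal Python sketch (function name illustrative):

```python
def equal_interval_encode(index, n):
    # ordered categorical value number `index` (0-based) among n
    # values -> (index + 1) / (n + 1), giving equally spaced values
    # strictly between 0 and 1
    return (index + 1) / (n + 1)
```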
About the Author
Dr. James McCaffrey works for Microsoft Research in Redmond, Wash. He has worked on several Microsoft products including Azure and Bing. James can be reached at [email protected].