The Data Science Lab
Clustering Mixed Categorical and Numeric Data Using k-Means with C#
The KMeans class uses Euclidean distance between data items. An alternative is Manhattan distance, which uses the sum of the absolute values of the differences between data item elements instead of the square root of the sum of squared differences. Because the differences aren't squared, no single element can dominate the distance value, and so it is sometimes said that, in theory, Manhattan distance may be preferable for data that has very high dimensionality. I'm mildly skeptical of this claim, and Euclidean distance seems to work fine in practice.
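To make the difference concrete, here is a minimal sketch of the two distance functions, assuming the data items are stored as double arrays. The EucDistance() and ManDistance() names are illustrative; the demo KMeans class has its own internal distance method.

// Euclidean distance: square root of the sum of squared differences
static double EucDistance(double[] x, double[] y)
{
  double sum = 0.0;
  for (int i = 0; i < x.Length; ++i)
    sum += (x[i] - y[i]) * (x[i] - y[i]);
  return Math.Sqrt(sum);
}

// Manhattan distance: sum of the absolute differences
static double ManDistance(double[] x, double[] y)
{
  double sum = 0.0;
  for (int i = 0; i < x.Length; ++i)
    sum += Math.Abs(x[i] - y[i]);
  return sum;
}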
The KMeans class clustering algorithm uses two loops. In pseudo-code:
initialize best_wcss = a very large value
loop exactly trials times
  wcss, clustering, means = cluster using at most maxIter iterations
  if wcss < best_wcss then
    best_wcss = wcss
    best_clustering = clustering
    best_means = means
  end_if
end_loop
return best_clustering
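In C#, the outer restart loop looks roughly like the sketch below. The ClusterOnce() and ComputeWcss() helper names, and the means field, are illustrative placeholders rather than the demo's actual internal methods, but the control flow matches the pseudo-code above.

double bestWcss = double.MaxValue;
int[] bestClustering = null;
double[][] bestMeans = null;

for (int trial = 0; trial < this.trials; ++trial)
{
  // one run: random initialization, then at most maxIter update passes
  int[] clustering = this.ClusterOnce(this.maxIter);  // hypothetical helper
  double wcss = this.ComputeWcss(clustering);         // hypothetical helper

  if (wcss < bestWcss)  // keep the best (lowest WCSS) result seen so far
  {
    bestWcss = wcss;
    bestClustering = (int[])clustering.Clone();
    bestMeans = this.means;  // hypothetical field updated by ClusterOnce()
  }
}
return bestClustering;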
The KMeans constructor sets the outer loop number of trials to N * 5, where N is the number of data items being clustered. For the demo data this is 240 * 5 = 1200 trials. The N * 5 value has worked well in practice, but I can imagine scenarios where you might need a larger or smaller value, or possibly a fixed value for doing performance evaluations. For example, the scikit-learn Python KMeans module uses a small fixed default for the number of restarts (its n_init parameter) rather than a value that scales with N. Because the trials variable is a public class member, you can set it directly, like km.trials = 250, before calling the Cluster() method.
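For example, assuming a constructor that accepts the prepared data matrix and the number of clusters (the demo constructor signature may differ), overriding the default looks like:

KMeans km = new KMeans(X, 3);  // assumed constructor: (data, k)
km.trials = 250;               // override the default N * 5 restarts
int[] clustering = km.Cluster();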
The maximum number of iterations used in a single clustering operation is set to N * 2. The k-means algorithm usually converges quickly to a result where the clustering no longer changes -- about 15 iterations for the demo data -- so it's unlikely that you'll need to modify the maxIter value.
Wrapping Up
As is often the case with machine learning, when using the k-means clustering technique presented in this article, the most time-consuming part of the process is preparing the raw data by normalizing and encoding it. For relatively small datasets (typically fewer than N = 1000 items and fewer than 30 columns), I usually just drop the raw data into an Excel spreadsheet and process the data there. For larger datasets it's usually worth the effort to implement a hard-coded specialized function to do the work, like the NormAndEncode() helper function in the demo program.
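As a rough illustration, a hard-coded helper for a dataset with one numeric column (say, age) and one categorical column (say, color) might look like the sketch below. The column layout, the min-max constants and the one-hot encoding are assumptions for illustration only; the demo's NormAndEncode() function and the article's actual encoding scheme may differ.

// illustrative sketch only -- the demo NormAndEncode() and the
// article's actual encoding scheme may differ
static double[][] NormAndEncode(string[][] raw)
{
  // assumes column 0 is numeric (age) and column 1 is categorical
  // (color) with possible values "red", "green", "blue"
  double ageMin = 18.0; double ageMax = 68.0;  // made-up constants
  int n = raw.Length;
  double[][] result = new double[n][];
  for (int i = 0; i < n; ++i)
  {
    result[i] = new double[4];  // 1 normalized numeric + 3 one-hot
    double age = double.Parse(raw[i][0]);
    result[i][0] = (age - ageMin) / (ageMax - ageMin);  // min-max norm
    if (raw[i][1] == "red") result[i][1] = 1.0;
    else if (raw[i][1] == "green") result[i][2] = 1.0;
    else if (raw[i][1] == "blue") result[i][3] = 1.0;
  }
  return result;
}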
The technique presented in this article is all about encoding and normalizing any dataset so that it can be clustered using standard k-means. For example, I passed the encoded and normalized data to the Python scikit-learn KMeans module and got results identical to those from the C# KMeans implementation presented in this article.
The technique presented in this article to cluster mixed categorical and numeric data can also be used to cluster strictly categorical data using k-means.
The demo uses k = 3 clusters, which is more or less arbitrary. To find a good value for k, you can cluster using k values from 1 to 10 and track the resulting WCSS values. The WCSS values typically decrease quickly but begin to level out at some point. The k value at that point is often a good choice. This is called the "elbow" technique.
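A minimal sketch of the elbow approach, assuming the constructor shown earlier and assuming the KMeans class exposes the final WCSS value as a public wcss member (the demo class may expose it differently), is:

for (int k = 1; k <= 10; ++k)
{
  KMeans km = new KMeans(X, k);  // assumed constructor: (data, k)
  int[] clustering = km.Cluster();
  Console.WriteLine("k = " + k +
    "  WCSS = " + km.wcss.ToString("F4"));  // assumes public wcss member
}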
An alternative for clustering mixed categorical and numeric data is to use an older technique called k-prototypes clustering. One disadvantage of k-prototypes clustering is that it's not widely implemented, unlike k-means clustering, which is available in virtually every machine learning library. Another disadvantage of k-prototypes clustering is that it's relatively crude because it uses a dissimilarity function that simply counts the number of mismatched ordinal-encoded categorical values between two items. In a set of limited experiments, the clustering technique presented in this article beat or matched k-prototypes clustering results on all datasets.
About the Author
Dr. James McCaffrey works for Microsoft Research in Redmond, Wash. He has worked on several Microsoft products including Azure and Bing. James can be reached at [email protected].