The Data Science Lab
Principal Component Analysis from Scratch Using Singular Value Decomposition with C#
The MatExtractCols() function pulls out the columns of the eigenvectors matrix in the order given by the idxs array, which holds the indices of the eigenvalues sorted from largest to smallest. The now-sorted eigenvectors are converted from columns to rows using the MatTranspose() function, only because that makes them a bit easier to read. The eigenvectors will have to be converted back to columns later. The point is that when you're looking at PCA documentation or working with PCA code, it's important to keep track of whether the eigenvectors are stored as columns (usually) or as rows (sometimes).
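The implementation of MatExtractCols() isn't shown here, but based on how the function is called, a minimal sketch might look like this (the parameter names and exact signature are assumptions):

static double[][] MatExtractCols(double[][] m, int[] idxs)
{
  // copy the columns of m in the order specified by idxs
  int nRows = m.Length;
  int nCols = idxs.Length;
  double[][] result = new double[nRows][];
  for (int i = 0; i < nRows; ++i)
    result[i] = new double[nCols];
  for (int j = 0; j < nCols; ++j)
    for (int i = 0; i < nRows; ++i)
      result[i][j] = m[i][idxs[j]];
  return result;
}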
Computing the Variance Explained
The variance explained by each of the four eigenvalue-eigenvector pairs is computed from the sorted eigenvalues:
Console.WriteLine("Computing variance explained: ");
double[] varExplained = VarExplained(eigenVals);
VecShow(varExplained, 4, 9);
For the demo data, the sum of the sorted eigenvalues is 3.0020 + 1.0158 + 0.3904 + 0.0918 = 4.5000. The variance explained values are:
3.0020 / 4.5000 = 0.6671
1.0158 / 4.5000 = 0.2257
0.3904 / 4.5000 = 0.0868
0.0918 / 4.5000 = 0.0204
------
1.0000
Loosely speaking, this means that the first eigenvector captures 66.71 percent of the information in the source data. The first two eigenvectors capture 0.6671 + 0.2257 = 89.28 percent of the information. The first three eigenvectors capture 0.6671 + 0.2257 + 0.0868 = 97.96 percent of the information. All four eigenvectors capture 100 percent of the information.
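The VarExplained() function isn't listed here, but a minimal implementation consistent with the arithmetic above could look like this sketch (the demo's actual version may differ in details):

static double[] VarExplained(double[] eigenVals)
{
  // each eigenvalue divided by the sum of all eigenvalues
  double sum = 0.0;
  for (int i = 0; i < eigenVals.Length; ++i)
    sum += eigenVals[i];
  double[] result = new double[eigenVals.Length];
  for (int i = 0; i < eigenVals.Length; ++i)
    result[i] = eigenVals[i] / sum;  // e.g., 3.0020 / 4.5000 = 0.6671
  return result;
}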
Computing the Transformed Data
After computing and sorting the eigenvalues and eigenvectors from the normalized data via SVD, the next step in PCA is to use the sorted eigenvectors to compute the transformed data:
Console.WriteLine("Computing transformed data (all components): ");
double[][] transformed = MatProduct(stdX, MatTranspose(eigenVecs));
Console.WriteLine("Transformed data: ");
MatShow(transformed, 4, 9, true);
The transformed data is the matrix product of the normalized data and the eigenvectors. Because the eigenvectors were converted to rows for easier viewing, they must be passed as columns using the MatTranspose() function.
The transformed data is:
[0] 1.8846 0.6205 0.6592 -0.3440
[1] 1.3345 0.9543 0.3440 -0.2363
[2] 1.4610 0.5893 0.0603 0.7158
[3] 0.8447 -0.3894 -0.4378 0.0696
[4] 0.3924 -1.4465 0.0569 -0.0469
[5] 0.5486 -1.5813 -0.4221 -0.0953
[6] -1.6896 1.1743 -0.6539 -0.0640
[7] -3.1366 -0.3829 1.1156 0.1179
[8] -1.6394 0.4616 -0.7222 -0.1168
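The MatProduct() and MatTranspose() helpers are standard matrix routines. For reference, minimal sketches consistent with how they're called might look like this (the demo's versions may vary):

static double[][] MatProduct(double[][] a, double[][] b)
{
  // standard matrix multiplication: result = a * b
  int aRows = a.Length; int aCols = a[0].Length;
  int bCols = b[0].Length;
  if (aCols != b.Length)
    throw new Exception("Non-conformable matrices");
  double[][] result = new double[aRows][];
  for (int i = 0; i < aRows; ++i)
    result[i] = new double[bCols];  // cells init to 0.0
  for (int i = 0; i < aRows; ++i)
    for (int j = 0; j < bCols; ++j)
      for (int k = 0; k < aCols; ++k)
        result[i][j] += a[i][k] * b[k][j];
  return result;
}

static double[][] MatTranspose(double[][] m)
{
  // swap rows and columns
  int nRows = m.Length; int nCols = m[0].Length;
  double[][] result = new double[nCols][];
  for (int i = 0; i < nCols; ++i)
    result[i] = new double[nRows];
  for (int i = 0; i < nRows; ++i)
    for (int j = 0; j < nCols; ++j)
      result[j][i] = m[i][j];
  return result;
}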
The demo program concludes by reconstructing the original source data from the transformed data. First the normalized data is reconstructed like so:
Console.WriteLine("Reconstructing original data: ");
double[][] reconstructed = MatProduct(transformed, eigenVecs);
Notice that reconstruction uses the eigenvectors as rows rather than as columns. Once again, the columns-vs-rows eigenvector storage issue can be tricky when working with PCA. Next, the original source data is recovered by multiplying by the standard deviations that were used to normalize the data, and then adding the means that were used:
// un-normalize: multiply each column by its std dev, then add its mean
for (int i = 0; i < reconstructed.Length; ++i)
  for (int j = 0; j < reconstructed[0].Length; ++j)
    reconstructed[i][j] = (reconstructed[i][j] * stds[j]) + means[j];
MatShow(reconstructed, 4, 9, 3);
Reconstructing the original source data from the transformed data is a good way to check that nothing has gone wrong with the PCA.
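One simple way to implement the check is to compare the reconstructed matrix against the original source data, cell by cell, within a small tolerance. A possible sketch, where sourceX is a hypothetical matrix holding the original data values:

static bool MatAreEqual(double[][] a, double[][] b, double eps)
{
  // true if every pair of corresponding cells differs by at most eps
  for (int i = 0; i < a.Length; ++i)
    for (int j = 0; j < a[0].Length; ++j)
      if (Math.Abs(a[i][j] - b[i][j]) > eps)
        return false;
  return true;
}

// usage -- sourceX is hypothetical, not part of the demo as shown:
// bool ok = MatAreEqual(sourceX, reconstructed, 1.0e-6);
// Console.WriteLine(ok ? "reconstruction OK" : "reconstruction failed");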
Wrapping Up
Principal component analysis can be used in several ways. One of the most common is to create a graph of data that has more than two columns. For the Penguin data with four columns, there's no easy way to create a graph. But by applying PCA and then using the first transformed column as the x-value and the second transformed column as the y-value, a graph reveals that species Gentoo items are quite different from the somewhat similar Adelie and Chinstrap items.
Another way to use PCA is for anomaly detection. The idea is to use the first n columns of the transformed data, where the n columns capture roughly 60 to 80 percent of the variability, to approximately reconstruct the original source data. Data items that don't reconstruct accurately are anomalous in some way and should be investigated.
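As a sketch of that idea, assuming eigenVecs holds the sorted eigenvectors as rows (as in the demo), the per-item reconstruction error using only the first n components might be computed like this (the function name and details are illustrative, not part of the demo program):

static double[] ReconError(double[][] stdX, double[][] eigenVecs, int n)
{
  // keep only the first n eigenvector rows
  double[][] topVecs = new double[n][];
  for (int i = 0; i < n; ++i)
    topVecs[i] = eigenVecs[i];

  double[][] reduced = MatProduct(stdX, MatTranspose(topVecs)); // project
  double[][] approx = MatProduct(reduced, topVecs);  // reconstruct

  double[] errors = new double[stdX.Length];
  for (int i = 0; i < stdX.Length; ++i)
  {
    double sum = 0.0;
    for (int j = 0; j < stdX[0].Length; ++j)
      sum += (stdX[i][j] - approx[i][j]) * (stdX[i][j] - approx[i][j]);
    errors[i] = Math.Sqrt(sum);  // larger error = more anomalous item
  }
  return errors;
}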
About the Author
Dr. James McCaffrey works for Microsoft Research in Redmond, Wash. He has worked on several Microsoft products including Azure and Bing. James can be reached at [email protected].