The Data Science Lab
Kernel Ridge Regression Using C#
The model is trained like so:
Console.WriteLine("Training model ");
krr.Train(trainX, trainY);
Console.WriteLine("Done ");
// Console.WriteLine("Model weights: ");
// Utils.VecShow(krr.wts, 4, 9, true);
Behind the scenes, the Train() method uses the algorithm explained in the previous section. Recall that training produces one weight for each training item, so the demo data would generate 40 weights. After the model has been trained, it is evaluated:
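The training algorithm itself was covered in the previous section and isn't repeated here. For orientation only, a minimal sketch of a Train() method that solves w = inverse(K(X', X') + alpha * I) * y might look like the code below. The Utils.MatInverse() and Utils.MatVecProduct() helpers, and the alpha field name, are assumptions for the sketch, not necessarily the demo's actual names.

// sketch only: solve w = inverse(K + alpha*I) * y
public void Train(double[][] trainX, double[] trainY)
{
  this.trainX = trainX;  // Predict() needs the training data
  int N = trainX.Length;

  // kernel matrix K(X', X') where K[i][j] = RBF(x_i, x_j)
  double[][] K = Utils.MatCreate(N, N);
  for (int i = 0; i < N; ++i)
    for (int j = 0; j < N; ++j)
      K[i][j] = this.Rbf(trainX[i], trainX[j]);

  // regularize: K + alpha * I
  for (int i = 0; i < N; ++i)
    K[i][i] += this.alpha;

  // weights = inverse(K + alpha*I) * y
  // Utils.MatInverse() and Utils.MatVecProduct() are assumed helpers
  double[][] Kinv = Utils.MatInverse(K);
  this.wts = Utils.MatVecProduct(Kinv, trainY);
}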
Console.WriteLine("Computing model accuracy" +
" (within 0.10) ");
double trainAcc = Accuracy(krr, trainX, trainY, 0.10);
double testAcc = Accuracy(krr, testX, testY, 0.10);
Console.WriteLine("Train acc = " +
trainAcc.ToString("F4"));
Console.WriteLine("Test acc = " +
testAcc.ToString("F4"));
Because there's no inherent definition of regression accuracy, it's necessary to implement a program-defined function. How close a predicted value must be to the true value in order to be considered a correct prediction will vary from problem to problem.
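The Accuracy() function isn't listed here, but based on how it's called, a plausible implementation looks like the following sketch. It assumes the kernel ridge regression class is named KRR and that "within 0.10" means within 10 percent of the true target value; both are assumptions.

// sketch: a prediction is correct if it's within pctClose
// (e.g., 0.10 = 10 percent) of the true target value
static double Accuracy(KRR model, double[][] dataX,
  double[] dataY, double pctClose)
{
  int numCorrect = 0; int numWrong = 0;
  for (int i = 0; i < dataX.Length; ++i) {
    double predY = model.Predict(dataX[i]);
    double actualY = dataY[i];
    if (Math.Abs(predY - actualY) <
      Math.Abs(pctClose * actualY))
      ++numCorrect;
    else
      ++numWrong;
  }
  return (numCorrect * 1.0) / (numCorrect + numWrong);
}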
The demo computes the root mean squared error of the model on the training and test data like so:
double trainRMSE = RootMSE(krr, trainX, trainY);
double testRMSE = RootMSE(krr, testX, testY);
Console.WriteLine("Train RMSE = " +
trainRMSE.ToString("F4"));
Console.WriteLine("Test RMSE = " +
testRMSE.ToString("F4"));
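The RootMSE() function isn't shown either. Under the same assumptions as the Accuracy() sketch above, a straightforward implementation might be:

// sketch: root mean squared error of the model on a dataset
static double RootMSE(KRR model, double[][] dataX,
  double[] dataY)
{
  double sum = 0.0;
  int n = dataX.Length;
  for (int i = 0; i < n; ++i) {
    double predY = model.Predict(dataX[i]);
    double err = predY - dataY[i];
    sum += err * err;
  }
  return Math.Sqrt(sum / n);
}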
The demo concludes by using the trained model to predict for X = (0.5, -0.5, 0.5, -0.5, 0.5) using these statements:
Console.WriteLine("Predicting for x = " +
"[0.5, -0.5, 0.5, -0.5, 0.5] ");
double[] x =
new double[] { 0.5, -0.5, 0.5, -0.5, 0.5 };
double y = krr.Predict(x);
Console.WriteLine("Predicted y = " + y.ToString("F4"));
The Predict() method expects a single vector of predictor values. A common design alternative is to implement Predict() to accept multiple inputs as an array of vectors, which is equivalent to a matrix.
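For example, a multi-input version could be written as a thin wrapper over the single-vector version. This is just a sketch of one possible design:

// sketch: predict for several input vectors at once
public double[] Predict(double[][] X)
{
  double[] result = new double[X.Length];
  for (int i = 0; i < X.Length; ++i)
    result[i] = this.Predict(X[i]);  // single-vector version
  return result;
}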
The Predict Method
The Predict() method is simple. It computes and returns the weighted sum of the RBF kernel values between the input vector x and each training item, as described previously.
public double Predict(double[] x)
{
  int N = this.trainX.Length;
  double sum = 0.0;
  for (int i = 0; i < N; ++i) {
    double[] xx = this.trainX[i];
    double k = this.Rbf(x, xx);
    sum += this.wts[i] * k;
  }
  return sum;
}
Instead of using an explicit for-loop, an alternative design is to use matrix operations:
double[][] K = Utils.MatCreate(N, 1);  // K(x, X')
for (int i = 0; i < N; ++i)
  K[i][0] = this.Rbf(x, this.trainX[i]);
double[][] y = Utils.MatProduct(
  Utils.VecToMat(this.wts, 1, N), K);  // w * K
return y[0][0];
The matrix operations approach matches the underlying math equations but is a bit messier in my opinion.
Finding Good KRR Parameters
The key to a good KRR model is finding good values for the RBF gamma parameter and the alpha regularization parameter. The demo uses a simple grid search, expressed in pseudo-code:
set gammas = [0.1, 0.2, 0.3, 0.5, 1.0, 1.5, 2.0]
set alphas = [0.0001, 0.001, 0.01, 0.05, 0.1, 0.5]
for-each g in gammas
for-each a in alphas
create KRR(g, a) and train model
compute accuracy on train and test data
compute RMSE on train and test data
display g, a, accuracies, RMSEs
end-alphas
end-gammas
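A C# realization of the grid search might look like the sketch below. It assumes the KRR class has a constructor that accepts gamma and alpha (as in the pseudo-code) and reuses the Accuracy() and RootMSE() functions described earlier; none of this is taken verbatim from the demo program.

// sketch: grid search over candidate gamma and alpha values
double[] gammas = new double[] { 0.1, 0.2, 0.3, 0.5, 1.0, 1.5, 2.0 };
double[] alphas = new double[] { 0.0001, 0.001, 0.01, 0.05, 0.1, 0.5 };
foreach (double g in gammas) {
  foreach (double a in alphas) {
    KRR krr = new KRR(g, a);  // assumed constructor signature
    krr.Train(trainX, trainY);
    double trainAcc = Accuracy(krr, trainX, trainY, 0.10);
    double testAcc = Accuracy(krr, testX, testY, 0.10);
    double trainRMSE = RootMSE(krr, trainX, trainY);
    double testRMSE = RootMSE(krr, testX, testY);
    Console.WriteLine("gamma = " + g.ToString("F4") +
      "  alpha = " + a.ToString("F4") +
      "  train acc = " + trainAcc.ToString("F4") +
      "  test acc = " + testAcc.ToString("F4") +
      "  train RMSE = " + trainRMSE.ToString("F4") +
      "  test RMSE = " + testRMSE.ToString("F4"));
  }
}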
The idea is to find values of gamma and alpha that balance training and test accuracy (a very coarse metric of quality) and RMSE (a very fine-grained metric of quality). Instead of using a grid search, it is possible to programmatically compute values of gamma and alpha that minimize some program-defined loss function. This approach is a lot of work -- roughly equivalent to the entire KRR system itself -- and doesn't always give good results.
Kernel ridge regression computes the inverse of a matrix that has size n x n where n is the number of training data items. Therefore, KRR doesn't scale well to very large datasets. In such situations, it's still possible to use KRR by estimating the model weights using stochastic gradient descent.
The demo program uses a standard general-purpose algorithm called Crout's decomposition to invert the K(X', X') matrix. Because K(X', X') is a square, symmetric, positive definite matrix, it's possible to use a specialized, slightly more efficient algorithm called Cholesky decomposition. However, the difference of a few milliseconds isn't important in most machine learning scenarios.
Wrapping Up
The kernel ridge regression system presented in this article uses a hard-coded radial basis function as the kernel function. There are about a dozen other common kernel functions, such as the polynomial, Laplacian, and sigmoid kernels. You can experiment with these kernel functions, but there are no good rules of thumb for picking one based on the nature of your data.
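For example, a polynomial kernel could be swapped in for the RBF. The following is a minimal sketch, not part of the demo program; the gamma, coef0, and degree parameters are illustrative.

// sketch: a polynomial kernel as an alternative to the RBF kernel
private double PolyKernel(double[] v1, double[] v2,
  double gamma, double coef0, int degree)
{
  double dot = 0.0;  // v1 (dot) v2
  for (int i = 0; i < v1.Length; ++i)
    dot += v1[i] * v2[i];
  return Math.Pow(gamma * dot + coef0, degree);
}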
The demo program uses predictor variables that are strictly numeric. KRR can handle categorical data by using one-hot encoding. For example, if you had a predictor variable color with possible values red, blue, green, you could encode red = (1, 0, 0), blue = (0, 1, 0), green = (0, 0, 1).
Kernel ridge regression is a relatively common machine learning technique, so it's sometimes called just kernel regression. However, there is an older, relatively rare, simpler related technique that is also called kernel regression, and so the two techniques are sometimes confused.
A technique that is very closely related to kernel ridge regression is Gaussian process regression (GPR). GPR starts with what appear to be different math assumptions than KRR, but it ends up with the same prediction equation. However, GPR makes more restrictive assumptions, which allow it to produce both a predicted y value and a standard deviation measure of the confidence of that prediction.
About the Author
Dr. James McCaffrey works for Microsoft Research in Redmond, Wash. He has worked on several Microsoft products including Azure and Bing. James can be reached at [email protected].