The Data Science Lab
Gaussian Process Regression from Scratch Using C#
Although the demo program is long, the majority of the code is in the helper Utils class. The demo program begins by loading the synthetic training data into memory:
Console.WriteLine("Loading data from file ");
string trainFile =
"..\\..\\..\\Data\\synthetic_train_200.txt";
double[][] trainX = Utils.MatLoad(trainFile,
new int[] { 0, 1, 2, 3, 4, 5 }, ',', "#");
double[] trainY = Utils.MatToVec(Utils.MatLoad(trainFile,
new int[] { 6 }, ',', "#"));
The code assumes the data is located in a subdirectory named Data. The call to the MatLoad() utility function specifies that the data is comma delimited and uses the '#' character to identify lines that are comments. The test data is loaded into memory in the same way:
string testFile =
"..\\..\\..\\Data\\synthetic_test_40.txt";
double[][] testX = Utils.MatLoad(testFile,
new int[] { 0, 1, 2, 3, 4, 5 }, ',', "#");
double[] testY = Utils.MatToVec(Utils.MatLoad(testFile,
new int[] { 6 }, ',', "#"));
Console.WriteLine("Done ");
Next, the demo displays the first four training predictor values and target values:
Console.WriteLine("First four X predictors: ");
for (int i = 0; i < 4; ++i)
Utils.VecShow(trainX[i], 4, 8);
Console.WriteLine("\nFirst four y targets: ");
for (int i = 0; i < 4; ++i)
Console.WriteLine(trainY[i].ToString("F4").PadLeft(8));
In a non-demo scenario you'd probably want to display all training and test data to make sure it was loaded correctly.
Creating and Training the GPR Model
After loading the training data, the demo uses a grid search to find good values for the RBF theta and length scale parameters, and the alpha/noise parameter. In pseudo-code:
set candidate alpha, theta, scale values
for-each alpha value
for-each theta value
for-each length scale value
create/train GPR model using alpha, theta, scale
evaluate model accuracy and RMSE
end-for
end-for
end-for
Finding good hyperparameter values is somewhat subjective. The idea is to balance accuracy, which is a crude metric but ultimately the one you're usually interested in, against root mean squared error (RMSE), which is a granular metric but one that can be misleading because many small errors and a few large errors can produce the same RMSE.
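The grid search pseudo-code above might be implemented in C# along the following lines. This is a sketch rather than the demo's actual tuning code: it assumes the demo's GPR class and its Accuracy() method, and the RootMSE() helper shown here is hypothetical.

```csharp
// sketch of the hyperparameter grid search described above;
// assumes the demo's GPR class and Accuracy() method;
// the RootMSE() helper is hypothetical
double[] alphas = new double[] { 0.01, 0.02, 0.05 };
double[] thetas = new double[] { 0.25, 0.50, 1.00 };
double[] lenScales = new double[] { 1.0, 2.0, 4.0 };

foreach (double alpha in alphas)
  foreach (double theta in thetas)
    foreach (double lenScale in lenScales)
    {
      GPR model = new GPR(theta, lenScale, alpha,
        trainX, trainY);  // create and train
      double acc = model.Accuracy(trainX, trainY, 0.10);
      double rmse = model.RootMSE(trainX, trainY);
      Console.WriteLine("alpha = " + alpha.ToString("F2") +
        "  theta = " + theta.ToString("F2") +
        "  lenScale = " + lenScale.ToString("F2") +
        "  acc = " + acc.ToString("F4") +
        "  RMSE = " + rmse.ToString("F4"));
    }
```

Because the number of candidate combinations grows multiplicatively, it's common to start with a coarse grid and then refine the search around the best combination found.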
After determining good model parameter values, the GPR model is simultaneously created and trained:
Console.WriteLine("Creating and training GPR model ");
double theta = 0.50; // "constant kernel"
double lenScale = 2.0; // RBF parameter
double alpha = 0.01; // noise
Console.WriteLine("Setting theta = " +
theta.ToString("F2") + ", lenScale = " +
lenScale.ToString("F2") + ", alpha = " +
alpha.ToString("F2"));
GPR model = new GPR(theta, lenScale, alpha,
trainX, trainY); // create and train
The GPR constructor uses the training data to create the inverse of the K(X', X') covariance matrix for use by the Predict() method.
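Each entry of the covariance matrix is computed by applying the kernel function to a pair of input vectors. A minimal, self-contained sketch of that computation follows, assuming the common scaled-RBF composite form k(x1, x2) = theta * exp(-||x1 - x2||^2 / (2 * s^2)); the exact composite kernel used by the demo may differ in detail.

```csharp
using System;

class KernelDemo
{
  // composite kernel: constant factor (theta) times an RBF
  // with length scale s:
  //   k(x1, x2) = theta * exp( -||x1 - x2||^2 / (2 * s^2) )
  static double Kernel(double[] x1, double[] x2,
    double theta, double lenScale)
  {
    double sumSq = 0.0;  // squared Euclidean distance
    for (int i = 0; i < x1.Length; ++i)
      sumSq += (x1[i] - x2[i]) * (x1[i] - x2[i]);
    return theta * Math.Exp(-sumSq / (2.0 * lenScale * lenScale));
  }

  static void Main()
  {
    double[] a = new double[] { 0.1, 0.2, 0.3 };
    double[] b = new double[] { 0.1, 0.2, 0.3 };
    // identical vectors: the kernel value equals theta
    Console.WriteLine(
      Kernel(a, b, 0.50, 2.0).ToString("F4"));  // 0.5000
  }
}
```

Notice that identical input vectors yield a kernel value of theta, and the value decays toward zero as the vectors move apart, so the covariance matrix encodes similarity between training items.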
GPR models are different from many other regression techniques because there is no explicit training. For example, a neural network regression model uses training data to create what is essentially a very complicated prediction equation, after which the training data is no longer needed. GPR needs all of the training data to make a prediction; therefore, the training data is passed to the GPR constructor. An alternative API design is to use an explicit Train() method that stores the training data and computes the inverse of the covariance matrix for later use by the Predict() method:
// alternative API design
GPR model = new GPR(theta: 0.5, lenScale: 2.0, noise: 0.01);
model.Train(trainX, trainY); // stores data, computes inverse
This approach is used by several machine learning libraries, notably scikit-learn, to provide an API that is consistent across all regression techniques in the library.
Evaluating and Using the GPR Model
After the GPR model is created and implicitly trained, the demo evaluates the model using these statements:
Console.WriteLine("Evaluate model acc" +
" (within 0.10 of true)");
double trainAcc = model.Accuracy(trainX, trainY, 0.10);
double testAcc = model.Accuracy(testX, testY, 0.10);
Console.WriteLine("Train acc = " +
trainAcc.ToString("F4"));
Console.WriteLine("Test acc = " +
testAcc.ToString("F4"));
Because there's no inherent definition of regression accuracy, it's necessary to implement a program-defined method. How close a predicted value must be to the true value in order to count as a correct prediction varies from problem to problem.
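A program-defined accuracy method might look like the following self-contained sketch. It assumes "within 0.10 of true" means within 10 percent of the true target value, which is a common convention but an assumption here; the demo's Accuracy() method may use absolute rather than relative closeness.

```csharp
using System;

class AccuracyDemo
{
  // fraction of predictions within pctClose of the true value;
  // assumed convention: |pred - actual| < pctClose * |actual|
  static double Accuracy(double[] preds, double[] actuals,
    double pctClose)
  {
    int numCorrect = 0;
    for (int i = 0; i < preds.Length; ++i)
      if (Math.Abs(preds[i] - actuals[i]) <
          pctClose * Math.Abs(actuals[i]))
        ++numCorrect;
    return (numCorrect * 1.0) / preds.Length;
  }

  static void Main()
  {
    double[] preds = new double[] { 0.50, 0.30, 0.90, 0.10 };
    double[] actuals = new double[] { 0.52, 0.40, 0.88, 0.10 };
    // three of the four predictions are within 10 percent
    Console.WriteLine(
      Accuracy(preds, actuals, 0.10).ToString("F4"));  // 0.7500
  }
}
```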
The demo concludes by using the trained model to predict for x = (0.1, 0.2, 0.3, 0.4, 0.5, 0.6) using these statements:
Console.WriteLine("Predicting for (0.1, 0.2, 0.3," +
" 0.4, 0.5, 0.6) ");
double[][] X = Utils.MatCreate(1, 6);
X[0] = new double[] { 0.1, 0.2, 0.3, 0.4, 0.5, 0.6 };
double[][] results = model.Predict(X);
Console.WriteLine("Predicted y value: ");
double[] means = results[0];
Utils.VecShow(means, 4, 8);
Console.WriteLine("Predicted std: ");
double[] stds = results[1];
Utils.VecShow(stds, 4, 8);
Recall that the Predict() method expects a matrix of input values, not just a single vector. Therefore, the demo creates an input matrix with one row and six columns and assigns the input vector to that single row at index [0].
The return value from Predict() is an array with two cells. The cell at index [0] holds an array of predicted y values, one for each row of the input matrix (just one row in this case). The cell at index [1] holds an array of standard deviations, one for each prediction.
You should interpret the standard deviations of the Predict() method conservatively. Gaussian process regression makes several assumptions about the Gaussian distribution of the source data, which may not be completely correct. For example, if you have a predicted y with value 0.50 and a standard deviation of 0.06, it's usually better to just report these values rather than use the standard deviation to construct a classical statistics confidence interval along the lines of, "predicted y = 0.50 with a 95% confidence interval of plus or minus 1.96 * 0.06."
Wrapping Up
Implementing Gaussian process regression from scratch requires significant effort. But doing so allows you to customize the system for unusual scenarios, makes it easy to integrate the prediction code with other systems, and removes a significant external dependency.
The GPR system presented in this article uses a hard-coded composite radial basis function as the kernel function. There are about a dozen other common kernel functions such as polynomial, Laplacian and sigmoid. You can experiment with these kernel functions but there are no good rules of thumb for picking a function based on the nature of your data.
The demo program uses predictor variables that are strictly numeric. GPR can handle categorical data by using one-hot encoding. For example, if you had a predictor variable color with possible values red, blue, green, you could encode red = (1, 0, 0), blue = (0, 1, 0), green = (0, 0, 1).
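The one-hot encoding idea can be sketched with a small self-contained helper. The OneHot() method and its parameter names are illustrative, not part of the demo program.

```csharp
using System;

class OneHotDemo
{
  // one-hot encode a categorical value given an ordered
  // list of possible values, e.g. red = (1, 0, 0)
  static double[] OneHot(string value, string[] possibleValues)
  {
    double[] result = new double[possibleValues.Length];
    for (int i = 0; i < possibleValues.Length; ++i)
      if (possibleValues[i] == value)
        result[i] = 1.0;
    return result;
  }

  static void Main()
  {
    string[] colors = new string[] { "red", "blue", "green" };
    double[] encoded = OneHot("blue", colors);
    Console.WriteLine(string.Join(", ", encoded));  // 0, 1, 0
  }
}
```

After encoding, the resulting numeric columns are treated like any other predictor columns when the kernel function computes distances between input vectors.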
A technique closely related to Gaussian process regression is kernel ridge regression (KRR), which I explain in last month's article, "Kernel Ridge Regression Using C#." KRR starts with what appear to be different math assumptions than GPR, but ends up with the same prediction equation. However, KRR makes less restrictive statistical assumptions, and as a result KRR cannot produce a prediction standard deviation -- it produces just a point estimate.
About the Author
Dr. James McCaffrey works for Microsoft Research in Redmond, Wash. He has worked on several Microsoft products including Azure and Bing. James can be reached at [email protected].