The Data Science Lab
Gaussian Process Regression Using the scikit Library
The demo program prints the first four predictor items and the first four target values in the training data:
print("First four X data: ")
print(train_X[0:4][:])
print(". . .")
print("First four targets: ")
print(train_y[0:4])
print(". . .")
In a non-demo scenario you might want to display all the training data and all the test data to verify the data has been read properly.
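One simple way to do that, assuming the data is held in the NumPy arrays train_X, train_y, test_X and test_y used by the demo, is to loosen the NumPy print options and print the full arrays:

# display all data to verify it was read correctly (sketch)
np.set_printoptions(precision=4, suppress=True)
print(train_X); print(train_y)
print(test_X); print(test_y)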
Creating and Training the Gaussian Process Regression Model
Creating the Gaussian process regression model is simultaneously simple and complicated. The code is relatively short:
# 2. create and train GPR model
print("Creating GPR model with RBF(1.0) kernel ")
krnl = RBF(length_scale=1.0, length_scale_bounds="fixed")
gpr_model = GaussianProcessRegressor(kernel=krnl,
  normalize_y=False, random_state=1, alpha=0.001)
print("Done ")
An RBF kernel with length scale = 1.0 is specified. The length_scale_bounds="fixed" argument instructs the module to not try to programmatically optimize the length scale parameter that has just been set to 1.0. In my experience, programmatically optimizing kernel parameter(s) usually leads to a worse model than fixing the parameter value(s).
The demo sets the GaussianProcessRegressor constructor random_state parameter to 1, but this has no effect because kernel parameter optimization is not being used. So, the random_state parameter could be dropped.
The alpha parameter controls regularization, which is an attempt to limit model overfitting when a model predicts training data well but predicts test data poorly. GPR is highly susceptible to model overfitting, so a good value for alpha is critical. Larger values of alpha reduce overfitting, but a too-large value can create a model with very poor accuracy.
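For comparison, if you do want scikit to optimize the length scale, a minimal sketch is to supply explicit search bounds instead of "fixed" and, optionally, allow several optimizer restarts. The bounds shown here are just illustrative:

# alternative: let scikit optimize the RBF length scale (sketch)
krnl = RBF(length_scale=1.0, length_scale_bounds=(0.01, 100.0))
gpr_model = GaussianProcessRegressor(kernel=krnl,
  normalize_y=False, random_state=1, alpha=0.001,
  n_restarts_optimizer=5)  # random_state now matters
gpr_model.fit(train_X, train_y)
print(gpr_model.kernel_)  # displays the optimized length scale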
After the GPR model has been created, training is almost too simple:
print("Training model ")
gpr_model.fit(train_X, train_y)
print("Done ")
All of the decisions are made with the kernel and constructor parameters, so there's not much to think about. I usually explore GPR hyperparameters using a grid search. First, I define a root mean squared error function like so:
def root_mse(model, data_X, data_y):
  # square root of the average squared prediction error
  preds = model.predict(data_X)
  ss = np.sum((preds - data_y)**2)
  return np.sqrt(ss / len(data_y))
Then I set up candidate hyperparameter values, for example:
scales = [0.1, 0.5, 1.0, 1.5, 2.0]
alphas = [0.0, 0.0001, 0.001, 0.01, 0.1]
And then I examine the different combinations of hyperparameter values:
for s in scales:
  for a in alphas:
    krnl = RBF(length_scale=s, length_scale_bounds="fixed")
    gpr_model = GaussianProcessRegressor(kernel=krnl,
      normalize_y=False, random_state=1, alpha=a)
    gpr_model.fit(train_X, train_y)
    train_acc = accuracy(gpr_model, train_X, train_y, 0.10)
    train_rmse = root_mse(gpr_model, train_X, train_y)
    print("scale = %0.1f alpha = %0.4f | acc = %0.4f \
rmse = %0.4f " % (s, a, train_acc, train_rmse))
The idea is to find a set of hyperparameter values that balance root mean squared error -- which is a fine-grained measure of model goodness -- with accuracy, which is coarse but ultimately what you're interested in.
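One way to automate that judgment, as a rough sketch, is to keep the combination with the best training accuracy and use RMSE only to break ties:

# track the best (scale, alpha) combination (sketch)
best_acc = -1.0; best_rmse = float("inf")
best_s = best_a = None
for s in scales:
  for a in alphas:
    krnl = RBF(length_scale=s, length_scale_bounds="fixed")
    model = GaussianProcessRegressor(kernel=krnl,
      normalize_y=False, alpha=a).fit(train_X, train_y)
    acc = accuracy(model, train_X, train_y, 0.10)
    rmse = root_mse(model, train_X, train_y)
    if acc > best_acc or (acc == best_acc and rmse < best_rmse):
      best_acc, best_rmse, best_s, best_a = acc, rmse, s, a
print("best scale = %0.1f  best alpha = %0.4f " % (best_s, best_a))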
Evaluating the Trained Model
The demo computes the accuracy of the trained model like so:
# 3. compute model accuracy
print("Computing accuracy (within 0.10 of true) ")
acc_train = accuracy(gpr_model, train_X, train_y, 0.10)
print("Accuracy on train data = %0.4f " % acc_train)
acc_test = accuracy(gpr_model, test_X, test_y, 0.10)
print("Accuracy on test data = %0.4f " % acc_test)
For most scikit classifiers, the built-in score() function computes a simple accuracy, which is just the number of correct predictions divided by the total number of predictions. But for a GaussianProcessRegressor regression model, you must define what a correct prediction is and write a program-defined custom accuracy function. The accuracy() function in Listing 1 defines a correct prediction as one that is within a specified percentage of the true target value. The key statement is:
if np.abs(pred - y) < np.abs(pct_close * y):
  n_correct += 1
else:
  n_wrong += 1
The pred variable is the predicted value and y is the known correct target value. In words, "if the difference between the predicted and actual values is less than x percent of the actual target, then the prediction is correct."
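If you don't have Listing 1 in front of you, a minimal version of such an accuracy() function, consistent with the key statement above (not necessarily identical to the demo code), looks like this:

def accuracy(model, data_X, data_y, pct_close):
  # percentage of predictions within pct_close of the true target
  n_correct = 0; n_wrong = 0
  preds = model.predict(data_X)
  for pred, y in zip(preds, data_y):
    if np.abs(pred - y) < np.abs(pct_close * y):
      n_correct += 1
    else:
      n_wrong += 1
  return n_correct / (n_correct + n_wrong)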
Using and Saving the Trained Model
The demo uses the trained model like so:
# 4. use model to predict
x = np.array([[-0.5, 0.5, -0.5, 0.5, -0.5]],
  dtype=np.float64)
print("Predicting for x = ")
print(x)
(y_pred, std) = gpr_model.predict(x, return_std=True)
print("Predicted y = %0.4f " % y_pred)
print("std = %0.4f " % std)
Because the GPR model was trained using normalized data, the X-input must be normalized in the same way; in other words, all predictors must be between -1 and +1. Notice the double square brackets on the X-input. The predict() method expects a matrix rather than a vector.
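If the raw predictor values were not already between -1 and +1, you would have to apply the same normalization that was used on the training data before calling predict(). A minimal sketch, using hypothetical minimum and maximum values for each raw predictor column, is:

# normalize a raw input to [-1, +1] (hypothetical min/max values)
raw = np.array([[2.5, 70.0, 1.2, 30.0, 0.8]], dtype=np.float64)
col_min = np.array([0.0, 50.0, 0.0, 10.0, 0.0])  # training minimums
col_max = np.array([5.0, 90.0, 4.0, 50.0, 2.0])  # training maximums
x = -1.0 + 2.0 * (raw - col_min) / (col_max - col_min)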
A significant characteristic of a GPR model is that it predicts a value but can also return a standard deviation, which is a measure of uncertainty. For the demo data, the predicted value for the dummy inputs is 0.0993 and the standard deviation is 0.0517. Therefore, the prediction could be stated as, "0.0993 plus or minus one standard deviation of 0.0517."
In most situations you will want to save the trained regression model so that it can be used by other programs. There are several ways to save a scikit model but using the pickle library ("pickle" means to preserve in ordinary English) is the simplest and most common. The demo code is:
# 5. save model
print("Saving trained GPR model ")
fn = ".\\Models\\gpr_model.pkl"
with open(fn, 'wb') as f:
  pickle.dump(gpr_model, f)
This code assumes there is a subdirectory named Models. The saved model can be loaded into memory and used like this:
# load and use model
path = ".\\Models\\gpr_model.pkl"
with open(path, 'rb') as f:
  loaded_model = pickle.load(f)
X = (set some values for X)
(y_pred, std) = loaded_model.predict(X, return_std=True)
A pickle file uses a proprietary binary format. An alternative is to write a custom save() function that saves model information as plain text. This is useful if a trained model is going to be consumed by a non-Python program.
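There is no built-in text export for a GaussianProcessRegressor, but a minimal sketch is to write out just the fitted pieces needed for prediction: the stored training inputs, the dual coefficients and the kernel description, which scikit exposes as the X_train_, alpha_ and kernel_ attributes. The save_gpr_as_text() name and file layout here are just illustrative:

def save_gpr_as_text(model, path):
  # save the minimum information needed to reconstruct predictions
  np.savetxt(path + "_X_train.txt", model.X_train_, fmt="%0.6f")
  np.savetxt(path + "_alpha.txt", model.alpha_, fmt="%0.6f")
  with open(path + "_kernel.txt", "w") as f:
    f.write(str(model.kernel_))  # e.g. "RBF(length_scale=1)"

save_gpr_as_text(gpr_model, ".\\Models\\gpr_model")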
Wrapping Up
Many of my colleagues shy away from using Gaussian process regression, mostly because the underlying mathematics are very complex. But when GPR works, it often works very well.
GPR is well-suited for relatively small datasets where the data is strictly numeric. Behind the scenes, GPR computes the inverse of an n-by-n matrix, where n is the number of training items, so training cost grows rapidly as the dataset gets larger. As mentioned earlier, GPR can handle categorical predictor variables by using one-hot encoding.
A regression technique that is closely related to Gaussian process regression is kernel ridge regression (KRR). KRR and GPR have different mathematical assumptions, but surprisingly end up with the same prediction equation. The GPR assumptions are more restrictive and so GPR can generate a metric (standard deviation / variance) of prediction uncertainty, but KRR can only produce a point prediction.
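For reference, a roughly equivalent KRR model can be set up in scikit with a few lines. Note that scikit's KernelRidge class expresses the RBF kernel with a gamma parameter rather than a length scale, where gamma = 1 / (2 * length_scale^2). This sketch assumes the same training data and alpha value used earlier:

# kernel ridge regression with an equivalent RBF kernel (sketch)
from sklearn.kernel_ridge import KernelRidge

length_scale = 1.0
krr_model = KernelRidge(kernel="rbf",
  gamma=1.0 / (2 * length_scale**2), alpha=0.001)
krr_model.fit(train_X, train_y)
y_pred = krr_model.predict(x)  # point prediction only, no std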
About the Author
Dr. James McCaffrey works for Microsoft Research in Redmond, Wash. He has worked on several Microsoft products including Azure and Bing. James can be reached at [email protected].