The Data Science Lab
Gaussian Naive Bayes Classification Using the scikit Library
Dr. James McCaffrey of Microsoft Research says the main advantage of using Gaussian naive Bayes classification compared to other techniques like decision trees or neural networks is that you don't have to fine-tune model parameters.
Gaussian naive Bayes classification is a classical machine learning technique that can be used to predict a discrete value when the predictor variables are all numeric. For example, you might want to predict a person's political leaning (conservative, moderate, liberal) from their age, annual income and bank account balance.
There are several other variations of naive Bayes (NB) classification including Categorical NB, Bernoulli NB, and Multinomial NB. The different NB variations are used for different types of predictor data. In Categorical NB the predictor values are strings like "red," "blue" and "green." In Bernoulli NB the predictor values are Boolean/binary like "male" and "female." In Multinomial NB the predictor values are integer counts.
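For reference, each of these variants is exposed as a separate class in scikit's sklearn.naive_bayes module. A minimal sketch of the imports:

# each naive Bayes variant is a separate scikit class
from sklearn.naive_bayes import GaussianNB     # numeric predictors
from sklearn.naive_bayes import CategoricalNB  # categorical predictors (ordinal-encoded)
from sklearn.naive_bayes import BernoulliNB    # Boolean/binary predictors
from sklearn.naive_bayes import MultinomialNB  # integer count predictors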
This article explains Gaussian NB classification by presenting an end-to-end demo. The goal of the demo is to predict the species of a wheat seed (Kama = 0, Rosa = 1, Canadian = 2) from seven numeric predictor variables: seed length, width, perimeter and so on.
There are several tools and code libraries that you can use to perform Gaussian NB classification. The scikit-learn library (also called scikit or sklearn) is based on the Python language and is one of the most popular machine learning libraries.
A good way to see where this article is headed is to take a look at the screenshot in Figure 1. The demo program begins by loading a 180-item set of training data and a 30-item set of test data into memory. The data is normalized so that all predictor values are between 0.0 and 1.0. The demo creates a Gaussian NB prediction model. The model scores 93.33 percent accuracy on the training data (168 out of 180 correct) and 66.67 percent accuracy on the test data (20 out of 30 correct).
The demo computes and displays a confusion matrix for the test data that shows how the prediction model performs for each of the three class labels / species. The model predicts wheat species Canadian quite well (9 out of 10 correct), but predicts species Rosa poorly (only 5 out of 10 correct).
The demo concludes by predicting the species of a dummy wheat seed that has all predictor values set to 0.2. The predicted pseudo-probabilities for each species are [0.0033, 0.0000, 0.9967]. Because the pseudo-probability at index [2] is the largest, the predicted species is 2 = Canadian.
This article assumes you have intermediate or better skill with a C-family programming language such as Python or C#, but doesn't assume you know much about Gaussian naive Bayes classification or the scikit library. The complete source code for the demo program is presented in this article. The source code is also available in the accompanying file download and is also online.
Installing the scikit Library
There are several ways to install the scikit library. I recommend installing the Anaconda Python distribution, which contains scikit and the core Python engine, plus more than 500 libraries that are (mostly) compatible with one another. I used Anaconda3-2022.10, which contains Python 3.9.13 and scikit version 1.0.2. The code presented here works on earlier versions of Python and scikit too. The demo code runs on Windows 10 or 11.
Briefly, Anaconda is installed using a Windows self-extracting executable file. The setup process is mostly straightforward and takes about 15 minutes when you follow step-by-step instructions. The instructions can be easily adapted for Anaconda3-2022.10 with Python 3.9.13.
There are more up-to-date versions of Anaconda / Python / scikit library available. However, because the Python ecosystem has hundreds of libraries, if you install the most recent versions of these libraries, you run a greater risk of library incompatibilities -- a standard headache when working with Python.
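If you already have a Python environment and don't want the full Anaconda distribution, the scikit library can also be installed by itself using pip. For example, to pin the version used in this article:

pip install scikit-learn==1.0.2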
The Data
The Wheat Seeds dataset has 210 items. The raw data looks like:
15.26 14.84 0.871 5.763 3.312 2.221 5.22 1
14.88 14.57 0.8811 5.554 3.333 1.018 4.956 1
. . .
17.63 15.98 0.8673 6.191 3.561 4.076 6.06 2
16.84 15.67 0.8623 5.998 3.484 4.675 5.877 2
. . .
11.84 13.21 0.8521 5.175 2.836 3.598 5.044 3
12.3 13.34 0.8684 5.243 2.974 5.637 5.063 3
Each line of data represents a wheat seed. The first seven values on each line are the predictor values: seed area, perimeter, compactness, length, width, asymmetry, groove. The last value on each line is the species: Kama = 1, Rosa = 2, Canadian = 3.
It's not necessary to normalize predictor values when using Gaussian NB classification, but normalization doesn't hurt and normalized data can be used by other techniques, such as a neural network, that need normalization. So I normalized the data.
I divided each predictor column by a constant: (25, 20, 1, 10, 10, 10, 10) respectively. The resulting predictors are all between 0.0 and 1.0. I also re-coded the target class labels from 1-based to 0-based. The resulting 210-item normalized and encoded data looks like:
0.6104 0.7420 0.8710 0.5763 0.3312 0.2221 0.5220 0
0.5952 0.7285 0.8811 0.5554 0.3333 0.1018 0.4956 0
. . .
0.7052 0.7990 0.8673 0.6191 0.3561 0.4076 0.6060 1
0.6736 0.7835 0.8623 0.5998 0.3484 0.4675 0.5877 1
. . .
0.5048 0.6835 0.8481 0.5410 0.2911 0.3306 0.5231 2
0.5104 0.6690 0.8964 0.5073 0.3155 0.2828 0.4830 2
I split the 210-item normalized data into a 180-item training set and a 30-item test set. I used the first 60 items of each target class for training and the last 10 items of each target class for testing.
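The normalization and split are easy to reproduce in code. The following is a minimal sketch, assuming the raw tab-delimited 210-item data is in a file named wheat_raw.txt (a hypothetical name):

import numpy as np

# load the raw 210-item data; species is in the last column
raw = np.loadtxt("wheat_raw.txt", delimiter="\t", dtype=np.float32)

divisors = np.array([25, 20, 1, 10, 10, 10, 10], dtype=np.float32)
x = raw[:, 0:7] / divisors          # divide each predictor column by its constant
y = raw[:, 7].astype(np.int64) - 1  # re-code species from 1-based to 0-based

# first 60 of each class for training, last 10 of each class for testing
train_rows, test_rows = [], []
for c in range(3):
  idxs = np.where(y == c)[0]  # each class has 70 items
  train_rows.extend(idxs[:60])
  test_rows.extend(idxs[60:])
x_train, y_train = x[train_rows], y[train_rows]
x_test, y_test = x[test_rows], y[test_rows]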
Understanding How Gaussian Naive Bayes Classification Works
Understanding how Gaussian naive Bayes classification works is best explained by example. Suppose, as in the demo program, the goal is to predict wheat seed species from seven predictor variables, and the item-to-predict has normalized values [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7].
Gaussian NB examines each predictor variable separately. Suppose in the training data that the average value in column 1 (seed area) for class 0 is 0.6, the average value for class 1 is 0.8 and the average value for class 2 is 0.3. Based on this limited information looking at just the first predictor, you'd guess that the unknown item is class 2 because the unknown predictor value of 0.1 is closest to the average value for class 2 of 0.3.
Now you could examine each of the other six predictors. Suppose they suggest that the class of the unknown item is 1, 2, 2, 0, 2, 2 respectively. At this point, five of the seven predictor values suggest that the class of the unknown item is class 2, one predictor suggests class 0, and one predictor suggests class 1. You could use a majority-rule approach and conclude that the unknown item is class 2. However, a majority-rule approach doesn't take into account the strength of each of the seven guesses. Gaussian naive Bayes uses Bayesian mathematics to combine the evidence values to produce a final prediction in the form of pseudo-probabilities.
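To make the idea concrete, here is a from-scratch sketch for a single predictor value. The per-class means, variances, and priors are made-up values for illustration only; scikit's GaussianNB estimates them from the training data:

import numpy as np

def gaussian_pdf(x, mean, var):
  # density of x under a Gaussian distribution with the given mean and variance
  return np.exp(-(x - mean)**2 / (2 * var)) / np.sqrt(2 * np.pi * var)

x = 0.1                                   # first predictor of the item-to-predict
means = np.array([0.6, 0.8, 0.3])         # made-up per-class means for column 1
variances = np.array([0.02, 0.02, 0.02])  # made-up per-class variances
priors = np.array([1/3, 1/3, 1/3])        # equal priors (60 of each class in training)

evidence = priors * gaussian_pdf(x, means, variances)
pseudo_probs = evidence / np.sum(evidence)  # normalize so the values sum to 1
print(pseudo_probs)  # largest value identifies the most likely class

With all seven predictors, the evidence for each class is the product of the seven per-column densities times the class prior -- this product is where the naive independence assumption enters -- and the products are normalized the same way.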
Gaussian NB is called "naive" (meaning unsophisticated) because each predictor variable is analyzed independently, not taking into account interactions between variables. The technique is called Gaussian because it assumes that each predictor variable is Gaussian (bell-shaped) distributed. The name "Bayes" refers to Thomas Bayes (1701-1761), a founder of probability theory.
The Demo Program
The complete demo program is presented in Listing 1. I am a proud user of Notepad as my preferred code editor, but most of my colleagues use something more sophisticated. I indent my Python program using two spaces rather than the more common four spaces.
The program imports the NumPy library, which contains numeric array functionality. The GaussianNB module has the key code for performing Gaussian naive Bayes classification. Notice the name of the root scikit module is sklearn rather than scikit.
Listing 1: Complete Gaussian Naive Bayes Demo Program
# wheat_gnb.py
# Gaussian NB on the Wheat Seeds dataset
# Anaconda3-2022.10 Python 3.9.13
# scikit 1.0.2 Windows 10/11
import numpy as np
from sklearn.naive_bayes import GaussianNB
# ---------------------------------------------------------
def show_confusion(cm):
  dim = len(cm)
  mx = np.max(cm)             # largest count in cm
  wid = len(str(mx)) + 1      # width to print
  fmt = "%" + str(wid) + "d"  # like "%3d"
  for i in range(dim):
    print("actual ", end="")
    print("%3d:" % i, end="")
    for j in range(dim):
      print(fmt % cm[i][j], end="")
    print("")
  print("------------")
  print("predicted ", end="")
  for j in range(dim):
    print(fmt % j, end="")
  print("")
# ---------------------------------------------------------
def main():
  # 0. prepare
  print("\nBegin scikit Gaussian naive Bayes demo ")
  print("Predict wheat species (0,1,2) from seven numerics ")
  np.random.seed(1)
  np.set_printoptions(precision=4, suppress=True)

  # 1. load data
  print("\nLoading train and test data ")
  train_file = ".\\Data\\wheat_train_k.txt"
  x_train = np.loadtxt(train_file, usecols=[0,1,2,3,4,5,6],
    delimiter="\t", comments="#", dtype=np.float32)
  y_train = np.loadtxt(train_file, usecols=7,
    delimiter="\t", comments="#", dtype=np.int64)

  test_file = ".\\Data\\wheat_test_k.txt"
  x_test = np.loadtxt(test_file, usecols=[0,1,2,3,4,5,6],
    delimiter="\t", comments="#", dtype=np.float32)
  y_test = np.loadtxt(test_file, usecols=7,
    delimiter="\t", comments="#", dtype=np.int64)
  print("Done ")

  print("\nData: ")
  print(x_train[0:4][:])
  print(". . .")
  print("\nActual species: ")
  print(y_train[0:4])
  print(". . .")

  # 2. create and train model
  # GaussianNB(*, priors=None, var_smoothing=1e-09)
  print("\nCreating Gaussian naive Bayes classifier ")
  model = GaussianNB()
  model.fit(x_train, y_train)
  print("Done ")

  # 3. evaluate model
  acc_train = model.score(x_train, y_train)
  print("\nAccuracy on train data = %0.4f " % acc_train)
  acc_test = model.score(x_test, y_test)
  print("Accuracy on test data = %0.4f " % acc_test)

  # 3b. confusion matrix
  from sklearn.metrics import confusion_matrix
  y_predicteds = model.predict(x_test)
  cm = confusion_matrix(y_test, y_predicteds)
  print("\nConfusion matrix for test data: ")
  show_confusion(cm)

  # 4. use model
  print("\nPredicting species all 0.2 predictors: ")
  X = np.array([[0.2, 0.2, 0.2, 0.2, 0.2, 0.2, 0.2]],
    dtype=np.float32)
  print(X)
  probs = model.predict_proba(X)
  print("\nPrediction probs: ")
  print(probs)
  predicted = model.predict(X)
  print("\nPredicted class: ")
  print(predicted)

  # 5. TODO: save model using pickle

  print("\nEnd demo ")

if __name__ == "__main__":
  main()
The demo begins by setting the NumPy random seed:
def main():
  # 0. prepare
  print("Begin scikit Gaussian naive Bayes demo ")
  print("Predict wheat species (0,1,2) from seven numerics ")
  np.random.seed(1)
  np.set_printoptions(precision=4, suppress=True)
. . .
Technically, setting the random seed value isn't necessary, but doing so allows you to get reproducible results in most situations.
Loading the Training and Test Data
The demo program loads the training data into memory using these statements:
# 1. load data
print("Loading train and test data ")
train_file = ".\\Data\\wheat_train_k.txt"
x_train = np.loadtxt(train_file, usecols=[0,1,2,3,4,5,6],
  delimiter="\t", comments="#", dtype=np.float32)
y_train = np.loadtxt(train_file, usecols=7,
  delimiter="\t", comments="#", dtype=np.int64)
This code assumes the data files are stored in a directory named Data. There are many ways to load data into memory. I prefer using the NumPy library loadtxt() function, but common alternatives are the NumPy genfromtxt() function and the Pandas library read_csv() function.
The demo reads the 180-item training predictors and the target class labels using two calls to the loadtxt() function. An alternative technique is to read all the data into a single matrix and then split the data into a matrix of predictor values and a vector of target labels.
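A minimal sketch of that single-matrix alternative, using the same loadtxt() arguments as the demo:

# alternative: load everything, then slice into predictors and labels
xy = np.loadtxt(train_file, usecols=range(0,8),
  delimiter="\t", comments="#", dtype=np.float32)
x_train = xy[:, 0:7]                 # predictor columns
y_train = xy[:, 7].astype(np.int64)  # target label column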
The demo loads the 30-item test data similarly:
test_file = ".\\Data\\wheat_test_k.txt"
x_test = np.loadtxt(test_file, usecols=[0,1,2,3,4,5,6],
  delimiter="\t", comments="#", dtype=np.float32)
y_test = np.loadtxt(test_file, usecols=7,
  delimiter="\t", comments="#", dtype=np.int64)
print("Done ")
The demo program shows the first four training items and their associated target labels:
print("Data: ")
print(x_train[0:4][:])
print(". . .")
print("Actual species: ")
print(y_train[0:4])
print(". . .")
In a non-demo scenario, you might want to display all of the training data.
Creating and Training the Model
Creating and training the Gaussian naive Bayes classification model is simple:
# 2. create and train model
print("Creating Gaussian naive Bayes classifier ")
model = GaussianNB()
model.fit(x_train, y_train)
print("Done ")
Unlike many scikit models, the GaussianNB class has relatively few (just two) parameters. The constructor signature is: