The Data Science Lab
Naive Bayes Classification Using the scikit Library
Dr. James McCaffrey of Microsoft Research shows how to predict a person's sex based on their job type, eye color and country of residence.
Naive Bayes classification is a classical machine learning technique to predict a discrete value. For example, you might want to predict the sex of a person (female or male) based on their job type, eye color and country of residence. In addition to binary classification, naive Bayes can also be used for multi-class classification, for example, predicting job type (actuary, barista, chemist, dentist) from eye color, country and sex.
Naive Bayes classification is especially well suited to problems where the predictor variables are all categorical (strings). And, compared to neural network classifiers, naive Bayes classifiers can work well with small training datasets.
There are several tools and code libraries that you can use to perform naive Bayes classification. The scikit-learn library (also called scikit or sklearn) is based on the Python language and is one of the most popular machine learning libraries.
A good way to see where this article is headed is to take a look at the screenshot in Figure 1. The demo program begins by loading a synthetic 20-item set of training data into memory. The goal is to predict the sex of a person (female = 0, male = 1) from job type, eye color and country. The demo echoes the predictor values and the target class labels. A naive Bayes classifier is created and then used to make predictions for the 20 data items.
The accuracy of the trained model is 80 percent (16 out of 20 correct). The demo displays a confusion matrix for the model predictions:
actual = 0  [11  1]
actual = 1  [ 3  5]
predicted:    0  1
The model correctly predicted 11 of the 12 class 0 (female) data items and incorrectly predicted one. The model correctly predicted five of the eight class 1 (male) data items and incorrectly predicted three.
The demo concludes by predicting the sex of a new, previously unseen data item of (dentist, hazel, Italy). The demo displays the prediction in the form of a vector of pseudo-probabilities: [0.33, 0.67]. Because the larger pseudo-probability is at index [1], the prediction is class 1 = male.
This article assumes you have intermediate or better skill with a C-family programming language such as Python or C#, but doesn't assume you know much about naive Bayes classification or the scikit library. The complete source code for the demo program is presented in this article. The source code is also available in the accompanying file download and is also available online.
Installing the scikit Library
There are several ways to install the scikit library. I recommend installing the Anaconda Python distribution. Anaconda contains a core Python engine plus more than 500 libraries that are (mostly) compatible with each other. I used Anaconda3-2020.02, which contains Python 3.7.6 and scikit version 0.22.1. The demo code runs on Windows 10 or 11.
Briefly, Anaconda is installed using a Windows self-extracting executable file. The setup process is mostly straightforward and takes about 15 minutes. Detailed step-by-step installation instructions are available online.
More up-to-date versions of Anaconda, Python and the scikit library are available. But because the Python ecosystem has hundreds of libraries, if you install the most recent versions of these libraries, you run a greater risk of library incompatibilities -- a major headache when working with Python.
The Data
The 20-item raw source data is shown in Listing 1. Notice that all values are strings. If source data contains a predictor column where the values are numeric, then those values should be converted to strings by bucketing them. For example, if the raw source data had a person's age column with values like 24 and 37, then you could bucket the age values along the lines of "young" = ages 18 through 29, "middle" = ages 30 through 59 and "old" = ages 60 through 99.
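To illustrate, a minimal bucketing helper might look like the following sketch. The function name and the exact age boundaries are illustrative assumptions, not part of the demo program:

# a minimal sketch of bucketing a numeric age column;
# the function name and boundary values are assumptions
def bucket_age(age):
  if 18 <= age <= 29:
    return "young"
  elif 30 <= age <= 59:
    return "middle"
  elif 60 <= age <= 99:
    return "old"
  else:
    return "unknown"

print(bucket_age(24))  # young
print(bucket_age(37))  # middle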
Listing 1: The Raw Source Data
actuary green korea F
barista green italy M
dentist hazel japan M
dentist green italy F
chemist hazel japan M
actuary green japan F
actuary hazel japan M
chemist green italy F
chemist green italy F
dentist green japan F
barista hazel japan M
dentist green japan F
dentist green japan F
chemist green italy F
dentist green japan M
dentist hazel japan M
chemist green korea F
barista green japan F
actuary hazel italy F
actuary green italy M
When working with naive Bayes, the data should be integer-encoded as shown in Listing 2. Integer-encoding is sometimes called ordinal encoding or label encoding.
Listing 2: The Integer/Ordinal Encoded Data
# job_eye_country_sex.txt
# actuary=0, barista=1, chemist=2, dentist=3
# green=0, hazel=1
# italy=0, japan=1, korea=2
# female=0, male=1
#
0 0 2 0
1 0 0 1
3 1 1 1
3 0 0 0
2 1 1 1
0 0 1 0
0 1 1 1
2 0 0 0
2 0 0 0
3 0 1 0
1 1 1 1
3 0 1 0
3 0 1 0
2 0 0 0
3 0 1 1
3 1 1 1
2 0 2 0
1 0 1 0
0 1 0 0
0 0 0 1
The "#" character indicates a comment line. The demo data is tab-separated and saved as job_eye_country_sex.txt, so if you copy-paste from this article you'll need to replace the spaces with tab characters or modify the demo code that loads the data into memory. Notice that the values in each column are encoded based on alphabetical order. This is standard procedure when working with naive Bayes but is not required.
Encoding the data from strings to integers is simple but time-consuming. The data can be encoded manually, for example by dropping the string data into an Excel spreadsheet and then applying find-replace operations.
It is also possible to programmatically encode string data using the scikit OrdinalEncoder class or by using a program-defined function. These two approaches will be explained shortly.
Understanding How Naive Bayes Classification Works
Naive Bayes classification is best explained by example. Suppose, as in the demo program, the goal is to predict the sex of a person who is a dentist, has hazel-colored eyes and lives in Italy.
If you look at just the dentists in the job column, three of the seven dentists are male and four of the seven are female. So you'd (weakly) guess the person is female. Next, if you look at just the hazel values in the eye color column, five of the six people are male and just one of the six is female. So based on eye color alone you'd strongly guess male. And then, if you look at just the Italy values in the country column, two of the seven people are male and five of the seven are female. So you'd guess the person is female.
If the frequencies are loosely interpreted as pseudo-probabilities, then:
Job: P(female) = 0.57 P(male) = 0.43
Eye: P(female) = 0.17 P(male) = 0.83
Cty: P(female) = 0.71 P(male) = 0.29
Therefore the job type and country predictors suggest the (dentist, hazel, Italy) person is female, but the eye color predictor strongly suggests the person is male. A simple way to produce a single prediction is to use a majority-rule vote. However, this approach isn't very good because different predictor distributions should be weighted differently. For example, suppose there is a height column with values "short," "medium" and "tall." If most of the data items are "short" or "medium," then a data item with height value of "tall" contains more information and should receive more weight.
The naive Bayes technique combines the frequencies in each predictor column in a way that takes relative frequencies into account. The technique is called "naive" (meaning unsophisticated) because each predictor column is analyzed independently, not taking into account interactions between columns. The name "Bayes" refers to Thomas Bayes (1701-1761), a founder of probability theory.
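To make the idea concrete, the following minimal sketch computes the demo prediction for (dentist, hazel, Italy) from scratch, using Laplace smoothing with alpha = 1 in the same way the scikit CategoricalNB class does. All counts come from the 20-item demo data; the variable names are illustrative:

# from-scratch sketch of the naive Bayes computation for
# (dentist, hazel, Italy) with Laplace smoothing (alpha = 1);
# all counts are taken from the 20-item demo data
alpha = 1.0

# class priors: 12 of the 20 items are female, 8 are male
p_f = 12.0 / 20
p_m = 8.0 / 20

# smoothed conditionals:
#   (count + alpha) / (class count + alpha * number of categories)
# job has 4 categories, eye color has 2, country has 3
p_dentist_f = (4 + alpha) / (12 + alpha * 4)  # 4 of 12 females are dentists
p_dentist_m = (3 + alpha) / (8 + alpha * 4)   # 3 of 8 males are dentists
p_hazel_f = (1 + alpha) / (12 + alpha * 2)    # 1 of 12 females has hazel eyes
p_hazel_m = (5 + alpha) / (8 + alpha * 2)     # 5 of 8 males have hazel eyes
p_italy_f = (5 + alpha) / (12 + alpha * 3)    # 5 of 12 females live in Italy
p_italy_m = (2 + alpha) / (8 + alpha * 3)     # 2 of 8 males live in Italy

evid_f = p_f * p_dentist_f * p_hazel_f * p_italy_f  # 0.0107
evid_m = p_m * p_dentist_m * p_hazel_m * p_italy_m  # 0.0218

total = evid_f + evid_m
print(evid_f / total)  # 0.33
print(evid_m / total)  # 0.67

The normalized results match the [0.33, 0.67] pseudo-probabilities displayed by the demo program.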
The Demo Program
The complete demo program is presented in Listing 3. I am a proud user of Notepad as my preferred code editor, but most of my colleagues use a more sophisticated programming environment. I indent my Python program using two spaces rather than the more common four spaces.
The program imports the NumPy library, which contains numeric array functionality. The CategoricalNB module has the key code for performing naive Bayes classification. Notice the name of the root scikit module is sklearn rather than scikit.
Listing 3: Complete Naive Bayes Demo Program
# naive_bayes.py
# Anaconda3-2020.02  Python 3.7.6
# scikit 0.22.1
# Windows 10/11

import numpy as np
from sklearn.naive_bayes import CategoricalNB

# ---------------------------------------------------------

def main():
  # 0. prepare
  print("\nBegin scikit naive Bayes demo ")
  print("Predict sex (F = 0, M = 1) from job, eye, country ")
  np.random.seed(1)

  # actuary  green  korea  F
  # barista  green  italy  M
  # dentist  hazel  japan  M
  # . . .
  # actuary = 0, barista = 1, chemist = 2, dentist = 3
  # green = 0, hazel = 1
  # italy = 0, japan = 1, korea = 2

  # 1. load data
  print("\nLoading train data ")
  train_file = ".\\Data\\job_eye_country_sex.txt"
  X = np.loadtxt(train_file, usecols=range(0,3),
    delimiter="\t", comments="#", dtype=np.int64)
  y = np.loadtxt(train_file, usecols=3,
    delimiter="\t", comments="#", dtype=np.int64)
  print("Done ")

  print("\nDiscretized features: ")
  print(X)
  print("\nActual classes: ")
  print(y)

  # 2. create and train model
  print("\nCreating naive Bayes classifier ")
  model = CategoricalNB(alpha=1)
  model.fit(X, y)
  print("Done ")
  pred_classes = model.predict(X)

  # 3. evaluate model
  print("\nPredicted classes: ")
  print(pred_classes)
  acc_train = model.score(X, y)
  print("\nAccuracy on train data = %0.4f " % acc_train)

  # 3b. confusion matrix
  from sklearn.metrics import confusion_matrix
  y_predicteds = model.predict(X)
  cm = confusion_matrix(y, y_predicteds)  # actual, pred
  print("\nConfusion matrix raw: ")
  print(cm)

  # 4. use model
  # dentist, hazel, Italy = [3,1,0]
  print("\nPredicting class for dentist, hazel, Italy ")
  probs = model.predict_proba([[3,1,0]])
  print("\nPrediction probs: ")
  print(probs)
  predicted = model.predict([[3,1,0]])
  print("\nPredicted class: ")
  print(predicted)

  # 5. TODO: save model using pickle

  print("\nEnd demo ")

if __name__ == "__main__":
  main()
The demo begins by setting the NumPy random seed:
def main():
  # 0. prepare
  print("Begin scikit naive Bayes demo ")
  print("Predict sex (F=0, M=1) from job, eye, country ")
  np.random.seed(1)
  . . .
Technically, setting the random seed value isn't necessary, but doing so allows you to get reproducible results in many situations.
Loading the Training and Test Data
The demo program loads the training data into memory using these statements:
# 1. load data
print("Loading train data ")
train_file = ".\\Data\\job_eye_country_sex.txt"
X = np.loadtxt(train_file, usecols=range(0,3),
  delimiter="\t", comments="#", dtype=np.int64)
y = np.loadtxt(train_file, usecols=3,
  delimiter="\t", comments="#", dtype=np.int64)
print("Done ")
This code assumes the data files are stored in a directory named Data. There are many ways to load data into memory. I prefer using the NumPy library loadtxt() function, but common alternatives are the NumPy genfromtxt() function and the Pandas library read_csv() function.
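For example, a Pandas-based loader might look like the following minimal sketch. The sep, comment and header arguments reflect the tab-separated, header-less demo file; this sketch is an alternative I've assumed, not part of the demo program:

# a minimal sketch of loading the demo data with Pandas
# instead of NumPy loadtxt()
import pandas as pd

df = pd.read_csv(".\\Data\\job_eye_country_sex.txt",
  sep="\t", comment="#", header=None)
X = df.iloc[:, 0:3].values  # predictors as a NumPy array
y = df.iloc[:, 3].values    # class labels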
The demo reads the predictors and the target class labels using two calls to the loadtxt() function. Because the demo data has predictors and labels in the same file, an alternative is to read both using one call to loadtxt() and then extract like so:
XY = np.loadtxt(train_file, usecols=range(0,4),
  delimiter="\t", comments="#", dtype=np.int64)
X = XY[:,0:3]
y = XY[:,3]
The colon syntax means "all rows." The demo program does not have any test data, but test data would be read into memory in the same way as the training data.
The demo program prints the 20 encoded predictor items and the 20 target gender values:
print("Discretized features: ")
print(X)
print("Actual classes: ")
print(y)
In a non-demo scenario with a lot of training data, you might want to display just part of the data.
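For example, standard NumPy slicing can display just the first few rows; this is a minimal illustrative sketch:

print(X[0:4, :])  # first four predictor rows only
print(y[0:4])     # first four class labels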
Programmatically Converting Raw String Data to Integers
The demo program assumes the existence of manually encoded integer/ordinal data. One way to programmatically encode raw string data for use by a scikit naive Bayes classifier is to use the OrdinalEncoder class. Suppose the raw data is stored in a text file named job_eye_country_sex_raw.txt and looks like:
actuary green korea F
barista green italy M
dentist hazel japan M
. . .
To programmatically encode the strings to integer values you could write code like:
from sklearn.preprocessing import OrdinalEncoder

train_file = ".\\Data\\job_eye_country_sex_raw.txt"
raw = np.genfromtxt(train_file, usecols=range(0,4),
  delimiter="\t", dtype=str)

enc = OrdinalEncoder(dtype=np.int64)  # instantiate encoder
enc.fit(raw)                  # scan data
encoded = enc.transform(raw)  # encode the data
X = encoded[:,0:3]
y = encoded[:,3]
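The other approach mentioned earlier is a program-defined encoding function. A minimal sketch, using the same alphabetical mappings as the demo data, might look like this. The function name and the dictionary-based design are illustrative assumptions:

# a program-defined encoder sketch; the function name and
# the explicit mapping dictionaries are assumptions
def encode_line(line):
  jobs = {"actuary":0, "barista":1, "chemist":2, "dentist":3}
  eyes = {"green":0, "hazel":1}
  ctys = {"italy":0, "japan":1, "korea":2}
  sexs = {"F":0, "M":1}
  tokens = line.strip().split("\t")
  return [jobs[tokens[0]], eyes[tokens[1]],
    ctys[tokens[2]], sexs[tokens[3]]]

A raw line such as "dentist	hazel	italy	M" would map to [3, 1, 0, 1].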