Synonyms

Bayes classifier

Definition

In classification, the objective is to build a classifier that takes an unlabeled example and assigns it to a class. Bayesian classification does this by modeling the probabilistic relationships between the attribute set and the class variable. Based on the modeled relationships, it estimates the class membership probability of the unseen example.

Historical Background

The foundation of Bayesian classification goes back to the Reverend Thomas Bayes himself [2]. The origin of Bayesian belief networks can be traced back to [15]. In 1965, Good [4] combined the independence assumption with Bayes' formula to define the naïve Bayes classifier. Duda and Hart [14] introduced the basic notion of Bayesian classification and the naïve Bayes representation of the joint distribution. The modern treatment and development of Bayesian belief networks is attributed to Pearl [8]. Heckerman [13] later reformulated these results and defined probabilistic similarity networks, which demonstrated the practicality of Bayesian classification in complex diagnostic problems.

Foundations

Bayesian classification is based on Bayes' theorem, which provides the basis for probabilistic learning that accommodates prior knowledge and takes the observed data into account.

Let X be a data sample whose class label is unknown. Suppose H is a hypothesis that X belongs to class Y. The goal is to estimate the probability that hypothesis H is true given the observed data sample X, that is, P(Y|X).
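By Bayes' theorem, this posterior probability can be expressed in terms of the class-conditional probability P(X|Y), the prior P(Y), and the evidence P(X):

$$ P\left(Y\mid X\right)=\frac{P\left(X\mid Y\right)P(Y)}{P(X)} $$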

Consider the example of a dataset with attributes Home Owner, Marital Status, and Annual Income, as shown in Fig. 1. The class label Credit Risk is Low for customers who have never defaulted on their payments and High for those who have previously defaulted.

Bayesian Classification, Fig. 1: Dataset example

Assume that a new record arrives with the following attribute set: X = (Home Owner = Yes, Marital Status = Married, Annual Income = High). To determine the credit risk of this record, note that the Bayes classifier combines the predictions of all alternative hypotheses to determine the most probable classification of a new instance. In the example, this involves computing P(High|X) and P(Low|X) and determining whether P(High|X) > P(Low|X).

However, estimating these probabilities directly is difficult, since it requires a very large training set that covers every possible combination of the class label and attribute values. Instead, Bayes' theorem is applied, which yields the following equations:

$$ \begin{aligned} P\left(\mathrm{High}\mid X\right) &= P\left(X\mid \mathrm{High}\right)P\left(\mathrm{High}\right)/P\left(X\right)\\ P\left(\mathrm{Low}\mid X\right) &= P\left(X\mid \mathrm{Low}\right)P\left(\mathrm{Low}\right)/P\left(X\right) \end{aligned} $$

P(High), P(Low), and P(X) can be estimated from the given dataset and prior knowledge. To estimate the class-conditional probabilities P(X|High) and P(X|Low), there are two implementations: the naïve Bayes classifier and the Bayesian belief network.

In the naïve Bayes classifier [13], the attributes are assumed to be conditionally independent given the class label. In other words, for an n-attribute set X = (X1, X2, …, Xn), the class-conditional probability factors as P(X|Y) = ∏i P(Xi|Y), so the posterior probability can be estimated as follows, where α = 1/P(X) is a normalizing constant:

$$ P\left(Y\mid X\right)=\alpha\, P(Y)\prod_i P\left(X_i\mid Y\right) $$

In the example,

$$ \begin{aligned} P\left(X\mid \mathrm{Low}\right) &= P\left(\mathrm{Home\ Owner}=\mathrm{Yes}\mid \mathrm{Credit\ Risk}=\mathrm{Low}\right)\\ &\quad\times P\left(\mathrm{Marital\ Status}=\mathrm{Married}\mid \mathrm{Credit\ Risk}=\mathrm{Low}\right)\\ &\quad\times P\left(\mathrm{Annual\ Income}=\mathrm{High}\mid \mathrm{Credit\ Risk}=\mathrm{Low}\right)\\ &= 3/4\times 2/4\times 4/4 = 3/8\\ P\left(X\mid \mathrm{High}\right) &= P\left(\mathrm{Home\ Owner}=\mathrm{Yes}\mid \mathrm{Credit\ Risk}=\mathrm{High}\right)\\ &\quad\times P\left(\mathrm{Marital\ Status}=\mathrm{Married}\mid \mathrm{Credit\ Risk}=\mathrm{High}\right)\\ &\quad\times P\left(\mathrm{Annual\ Income}=\mathrm{High}\mid \mathrm{Credit\ Risk}=\mathrm{High}\right) = 0 \end{aligned} $$

Putting them together,

$$ \begin{aligned} P\left(\mathrm{High}\mid X\right) &= P\left(X\mid \mathrm{High}\right)P\left(\mathrm{High}\right)/P\left(X\right) = 0\\ P\left(\mathrm{Low}\mid X\right) &= P\left(X\mid \mathrm{Low}\right)P\left(\mathrm{Low}\right)/P\left(X\right) > 0 \end{aligned} $$

Since P(Low|X) > P(High|X), X is classified as having Credit Risk = Low.

In other words,

$$ \mathrm{Classify}(X) = \underset{y}{\arg\max}\; P\left(Y=y\right)\prod_i P\left(X_i\mid Y=y\right) $$
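To make the decision rule concrete, the following is a minimal Python sketch of a categorical naïve Bayes classifier. The eight-record training set is hypothetical (Fig. 1 is not reproduced here), but its counts are chosen to reproduce the fractions used in the worked example above.

```python
from collections import Counter, defaultdict

# Hypothetical eight-record training set (NOT the dataset of Fig. 1);
# the counts reproduce the fractions in the worked example:
# P(Owner=Yes|Low)=3/4, P(Married|Low)=2/4, P(Income=High|Low)=4/4,
# and P(Income=High|High)=0.
# Each record: ((home_owner, marital_status, annual_income), credit_risk).
train = [
    (("Yes", "Married",  "High"), "Low"),
    (("Yes", "Married",  "High"), "Low"),
    (("Yes", "Single",   "High"), "Low"),
    (("No",  "Single",   "High"), "Low"),
    (("No",  "Single",   "Low"),  "High"),
    (("Yes", "Married",  "Low"),  "High"),
    (("No",  "Divorced", "Low"),  "High"),
    (("Yes", "Single",   "Low"),  "High"),
]

def fit(records):
    """Estimate the prior P(y) and likelihoods P(x_i | y) by counting."""
    class_counts = Counter(y for _, y in records)
    value_counts = defaultdict(Counter)   # (attribute index, y) -> counts
    for x, y in records:
        for i, v in enumerate(x):
            value_counts[(i, y)][v] += 1
    prior = {y: c / len(records) for y, c in class_counts.items()}
    def likelihood(i, v, y):
        return value_counts[(i, y)][v] / class_counts[y]
    return prior, likelihood

def classify(x, prior, likelihood):
    """Return argmax_y P(y) * prod_i P(x_i | y); the 1/P(X) factor is
    dropped because it does not affect the argmax."""
    scores = {}
    for y in prior:
        p = prior[y]
        for i, v in enumerate(x):
            p *= likelihood(i, v, y)
        scores[y] = p
    return max(scores, key=scores.get), scores

prior, likelihood = fit(train)
print(classify(("Yes", "Married", "High"), prior, likelihood))
# ('Low', {'Low': 0.1875, 'High': 0.0}) -- X is classified as Low risk
```

In practice, a smoothed estimate (e.g., Laplace smoothing) would replace the raw relative frequencies, so that a single unseen attribute value does not force the entire product to zero, as happens for the High class here.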

In general, naïve Bayes classifiers are robust to isolated noise points and irrelevant attributes. However, the presence of correlated attributes can degrade their performance, since correlation violates the conditional independence assumption. Fortunately, Domingos and Pazzani [3] showed that even when the independence assumption is violated, the naïve Bayes classifier can still be optimal in some situations. This has led to widespread use of naïve Bayes classifiers in many applications. Jaeger [9] further clarified which concepts can be recognized by naïve Bayes classifiers and the theoretical limits on learning these concepts from data.

There are many extensions to the naïve Bayes classifier that impose limited dependencies among the feature/attribute nodes, such as tree-augmented naïve Bayes [6] and forest-augmented naïve Bayes [11].

The Bayesian belief network [10] overcomes the rigidity imposed by this assumption by allowing the dependence relationships among a set of attributes to be modeled as a directed acyclic graph. Associated with each node in the graph is a conditional probability table. Note that a node in a Bayesian network is conditionally independent of its non-descendants given its parents.
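In general, the network encodes a factorization of the joint distribution: each variable depends directly only on its parents, so for variables X1, …, Xn (including the class variable),

$$ P\left(X_1,\dots,X_n\right)=\prod_i P\left(X_i\mid \mathrm{Parents}\left(X_i\right)\right) $$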

Returning to the running example, suppose the probabilistic relationships among Home Owner, Marital Status, Annual Income, and Credit Risk are as shown in Fig. 2. Associated with each node is the conditional probability table relating the node to its parent node(s).

Bayesian Classification, Fig. 2: Bayesian belief network of the credit risks dataset

In the Bayesian belief network, the probabilities are estimated as follows, where each summation runs over the values of the corresponding variable that are not fixed by the evidence X:

$$ P\left(\mathit{Risk}\mid X\right) = \alpha \sum_{I} P\left(\mathit{Risk}\mid \mathit{Income}\right) \sum_{O} P\left(\mathit{Owner}\right) \sum_{S} P\left(\mathit{Status}\right) P\left(\mathit{Income}\mid \mathit{Owner},\mathit{Status}\right) $$

In the above example,

$$ \begin{aligned} P\left(\mathrm{Low}\mid X\right) &= P\left(\mathrm{Low}\mid \mathrm{Income}=\mathrm{High}\right)\\ &\quad\times P\left(\mathrm{Income}=\mathrm{High}\mid \mathrm{Owner}=\mathrm{Yes}, \mathrm{Status}=\mathrm{Married}\right)\\ &= 1\times 1 = 1 \end{aligned} $$

Alternatively, by recognizing that, given Annual Income, Credit Risk is conditionally independent of Home Owner and Marital Status, then

$$ P\left(\mathrm{Low}\mid X\right)=P\left(\mathrm{Low}\mid \mathrm{Income}=\mathrm{High}\right)=1 $$

This example illustrates that the classification problem in Bayesian networks is a special case of belief updating for any node (target class) Y in the network, given evidence X.
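The following sketch illustrates this kind of belief updating by enumeration. The network structure (Home Owner and Marital Status as parents of Annual Income, which is the sole parent of Credit Risk) follows the equations above; since Fig. 2 is not reproduced here, all probability tables are hypothetical except P(Risk | Income = High), which the worked example fixes to be deterministic.

```python
# Belief updating by enumeration for the assumed network of Fig. 2:
# Owner -> Income <- Status, Income -> Risk.
P_owner  = {"Yes": 0.6, "No": 0.4}             # hypothetical
P_status = {"Married": 0.5, "Single": 0.5}     # hypothetical
P_income = {  # P(Income | Owner, Status), hypothetical
    ("Yes", "Married"): {"High": 1.0, "Low": 0.0},
    ("Yes", "Single"):  {"High": 0.7, "Low": 0.3},
    ("No",  "Married"): {"High": 0.5, "Low": 0.5},
    ("No",  "Single"):  {"High": 0.2, "Low": 0.8},
}
P_risk = {  # P(Risk | Income); deterministic for High income, as in the text
    "High": {"Low": 1.0, "High": 0.0},
    "Low":  {"Low": 0.2, "High": 0.8},
}

def posterior_risk(evidence):
    """P(Risk | evidence), summing out the variables not in the evidence."""
    scores = {"Low": 0.0, "High": 0.0}
    for o in P_owner:
        for s in P_status:
            for i in P_income[(o, s)]:
                # skip joint configurations inconsistent with the evidence
                if any(evidence.get(k) not in (None, v)
                       for k, v in (("Owner", o), ("Status", s), ("Income", i))):
                    continue
                w = P_owner[o] * P_status[s] * P_income[(o, s)][i]
                for r in scores:
                    scores[r] += w * P_risk[i][r]
    z = sum(scores.values())   # normalization: alpha = 1/P(evidence)
    return {r: p / z for r, p in scores.items()}

print(posterior_risk({"Owner": "Yes", "Status": "Married", "Income": "High"}))
# {'Low': 1.0, 'High': 0.0} -- matches P(Low | X) = 1 in the text
```

Calling `posterior_risk({"Income": "High"})` returns the same answer, which mirrors the conditional-independence shortcut of the previous paragraph.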

While Bayesian belief networks provide an approach to capturing dependencies among variables using a graphical model, constructing the network can be time consuming and costly. Substantial research has addressed, and continues to address, inference in Bayesian networks as well as their automated construction by learning from data. Much progress has been made, so applying Bayesian networks for classification is no longer as time consuming and costly as before, and the approach is gaining headway in mainstream applications.

Key Applications

Bayesian classification techniques have been applied in many domains. A few of the more common applications of Bayesian classifiers are mentioned here.

Text Document Classification

Text classification refers to the assignment of text documents to predefined categories so as to improve the efficiency and effectiveness of text retrieval. Typically, the text documents are preprocessed and keywords are selected. Based on the selected keywords of the documents, probabilistic classifiers are built. Dumais et al. [5] show that the naïve Bayes classifier yields surprisingly good classifications for text documents.
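As a sketch of this pipeline, the fragment below trains a multinomial naïve Bayes classifier on a bag-of-words representation using scikit-learn (a library not cited in this entry); the documents and category labels are invented for illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Hypothetical toy corpus with two predefined categories.
docs = [
    "the striker scored a late goal",
    "the keeper saved a penalty",
    "shares fell as markets opened",
    "the central bank raised interest rates",
]
labels = ["sports", "sports", "finance", "finance"]

# Bag-of-words counts stand in for the keyword-selection step.
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(docs)

clf = MultinomialNB()   # applies Laplace smoothing (alpha=1.0) by default
clf.fit(X, labels)

print(clf.predict(vectorizer.transform(["goal scored in the match"])))
# expected: ['sports']
```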

Image Pattern Recognition

In image pattern recognition, a set of elementary or low-level image features is selected to describe characteristics of the objects. Data extracted based on this feature set are used to train Bayesian classifiers for subsequent object recognition. Aggarwal et al. [1] presented a comparative study of three paradigms for object recognition: Bayesian statistics, neural networks, and expert systems.

Medical Diagnostic and Decision Support Systems

Large amounts of medical data are available for analysis. Knowledge derived from analyzing these data can be used to assist physicians in subsequent diagnoses. In this area, naïve Bayes classifiers have performed exceptionally well. Kononenko et al. [12] showed that the naïve Bayes classifier outperformed other classification algorithms on five out of eight medical diagnostic problems.

Email Spam Filtering

With the growing problem of junk email, it is desirable to have automatic email spam filters that eliminate unwanted messages from a user's mail stream. Bayesian classifiers that take domain-specific features into consideration when classifying emails are now accurate enough for real-world use.

Data Sets

http://archive.ics.uci.edu/beta/datasets.html

http://spamassassin.apache.org/publiccorpus/

URL to Code

More recent lists of Bayesian network software can be found at:

Kevin Murphy’s website:

http://www.cs.ubc.ca/~murphyk/Bayes/bnsoft.html

Google directory:

http://directory.google.com/Top/Computers/Artificial_Intelligence/Belief_Networks/Software/

Specialized naïve Bayes classification software:

jBNC – a Java toolkit for variants of naïve Bayes classifiers, with a WEKA interface

Cross-References