Synonyms

Bayes classifier

Definition

In classification, the objective is to build a classifier that takes an unlabeled example and assigns it to a class. Bayesian classification does this by modeling the probabilistic relationships between the attribute set and the class variable. Based on the modeled relationships, it estimates the class membership probability of the unseen example.

Historical Background

The foundation of Bayesian classification goes back to the Reverend Thomas Bayes himself [2]. The origin of Bayesian belief networks can be traced back to [15]. In 1965, Good [4] combined the independence assumption with Bayes' formula to define the naïve Bayes classifier. Duda and Hart [14] introduced the basic notion of Bayesian classification and the naïve Bayes representation of the joint distribution. The modern treatment and development of Bayesian belief networks is attributed to Pearl [8]. Heckerman [13] later reformulated these results and defined probabilistic similarity networks, which demonstrated the practicality of Bayesian classification in complex diagnostic problems.

Foundations

Bayesian classification is based on Bayes' theorem, which provides the basis for probabilistic learning that accommodates prior knowledge and takes the observed data into account.

Let X be a data sample whose class label is unknown. Suppose H is a hypothesis that X belongs to class Y. The goal is to estimate the probability that hypothesis H is true given the observed data sample X, that is, P(Y|X).
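By Bayes' theorem, this posterior probability can be expressed in terms of the class-conditional probability P(X|Y), the prior P(Y), and the evidence P(X):

$$ P\left(Y\mid X\right)=\frac{P\left(X\mid Y\right)P(Y)}{P(X)} $$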

Consider the example of a dataset with attributes Home Owner, Marital Status, and Annual Income, as shown in Fig. 1. The class label Credit Risk is Low for customers who have never defaulted on their payments and High for those who have previously defaulted.

Bayesian Classification, Fig. 1: Dataset example

Assume that a new record arrives with the following attribute set: X = (Home Owner = Yes, Marital Status = Married, Annual Income = High). To determine the credit risk of this record, note that the Bayes classifier combines the predictions of all alternative hypotheses to determine the most probable classification of a new instance. In the example, this involves computing P(High|X) and P(Low|X) and determining whether P(High|X) > P(Low|X).

However, estimating these probabilities directly is difficult, since it requires a very large training set that covers every possible combination of the class label and attribute values. Instead, Bayes' theorem is applied, which yields the following equations:

$$ \begin{aligned} P\left(\mathrm{High}\mid X\right) &= P\left(X\mid \mathrm{High}\right)P\left(\mathrm{High}\right)/P\left(X\right)\\ P\left(\mathrm{Low}\mid X\right) &= P\left(X\mid \mathrm{Low}\right)P\left(\mathrm{Low}\right)/P\left(X\right) \end{aligned} $$

P(High), P(Low), and P(X) can be estimated from the given dataset and prior knowledge. To estimate the class-conditional probabilities P(X|High) and P(X|Low), there are two implementations: the naïve Bayes classifier and the Bayesian belief network.

In the naïve Bayes classifier [13], the attributes are assumed to be conditionally independent given the class label. In other words, for an n-attribute set X = (X1, X2, …, Xn), the class-conditional probability factors as P(X|Y) = ∏i P(Xi|Y), so the posterior probability can be estimated as follows, where α = 1/P(X) is a normalizing constant:

$$ P\left(Y\mid X\right)=\alpha\, P(Y)\prod_i P\left(X_i\mid Y\right) $$

In the example,

$$ \begin{aligned} P\left(X\mid \mathrm{Low}\right) &= P\left(\mathrm{Home\ Owner}=\mathrm{Yes}\mid \mathrm{Credit\ Risk}=\mathrm{Low}\right)\\ &\quad\times P\left(\mathrm{Marital\ Status}=\mathrm{Married}\mid \mathrm{Credit\ Risk}=\mathrm{Low}\right)\\ &\quad\times P\left(\mathrm{Annual\ Income}=\mathrm{High}\mid \mathrm{Credit\ Risk}=\mathrm{Low}\right)\\ &= 3/4\times 2/4\times 4/4 = 3/8\\ P\left(X\mid \mathrm{High}\right) &= P\left(\mathrm{Home\ Owner}=\mathrm{Yes}\mid \mathrm{Credit\ Risk}=\mathrm{High}\right)\\ &\quad\times P\left(\mathrm{Marital\ Status}=\mathrm{Married}\mid \mathrm{Credit\ Risk}=\mathrm{High}\right)\\ &\quad\times P\left(\mathrm{Annual\ Income}=\mathrm{High}\mid \mathrm{Credit\ Risk}=\mathrm{High}\right) = 0 \end{aligned} $$

Putting them together,

$$ \begin{aligned} P\left(\mathrm{High}\mid X\right) &= P\left(X\mid \mathrm{High}\right)P\left(\mathrm{High}\right)/P\left(X\right) = 0\\ P\left(\mathrm{Low}\mid X\right) &= P\left(X\mid \mathrm{Low}\right)P\left(\mathrm{Low}\right)/P\left(X\right) > 0 \end{aligned} $$

Since P(Low|X) > P(High|X), X is classified as having Credit Risk = Low.

In other words,

$$ \mathrm{Classify}(X) = \underset{y}{\arg\max}\; P\left(Y=y\right)\prod_i P\left(X_i\mid Y=y\right) $$
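To make the decision rule concrete, the following is a minimal Python sketch of a categorical naïve Bayes classifier. The eight-record training set is hypothetical (Fig. 1 is not reproduced here), but its counts are chosen to reproduce the fractions used in the worked example above.

```python
from collections import Counter, defaultdict

# Hypothetical eight-record training set (NOT the dataset of Fig. 1);
# the counts reproduce the fractions in the worked example:
# P(Owner=Yes|Low)=3/4, P(Married|Low)=2/4, P(Income=High|Low)=4/4,
# and P(Income=High|High)=0.
# Each record: ((home_owner, marital_status, annual_income), credit_risk).
train = [
    (("Yes", "Married",  "High"), "Low"),
    (("Yes", "Married",  "High"), "Low"),
    (("Yes", "Single",   "High"), "Low"),
    (("No",  "Single",   "High"), "Low"),
    (("No",  "Single",   "Low"),  "High"),
    (("Yes", "Married",  "Low"),  "High"),
    (("No",  "Divorced", "Low"),  "High"),
    (("Yes", "Single",   "Low"),  "High"),
]

def fit(records):
    """Estimate the prior P(y) and likelihoods P(x_i | y) by counting."""
    class_counts = Counter(y for _, y in records)
    value_counts = defaultdict(Counter)   # (attribute index, y) -> counts
    for x, y in records:
        for i, v in enumerate(x):
            value_counts[(i, y)][v] += 1
    prior = {y: c / len(records) for y, c in class_counts.items()}
    def likelihood(i, v, y):
        return value_counts[(i, y)][v] / class_counts[y]
    return prior, likelihood

def classify(x, prior, likelihood):
    """Return argmax_y P(y) * prod_i P(x_i | y); the 1/P(X) factor is
    dropped because it does not affect the argmax."""
    scores = {}
    for y in prior:
        p = prior[y]
        for i, v in enumerate(x):
            p *= likelihood(i, v, y)
        scores[y] = p
    return max(scores, key=scores.get), scores

prior, likelihood = fit(train)
print(classify(("Yes", "Married", "High"), prior, likelihood))
# ('Low', {'Low': 0.1875, 'High': 0.0}) -- X is classified as Low risk
```

In practice, a smoothed estimate (e.g., Laplace smoothing) would replace the raw relative frequencies, so that a single unseen attribute value does not force the entire product to zero, as happens for the High class here.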

In general, naïve Bayes classifiers are robust to isolated noise points and irrelevant attributes. However, the presence of correlated attributes can degrade their performance, since correlation violates the conditional independence assumption. Fortunately, Domingos and Pazzani [3] showed that even when the independence assumption is violated, the naïve Bayes classifier can still be optimal in some situations. This has led to widespread use of naïve Bayes classifiers in many applications. Jaeger [9] further clarified which concepts can be recognized by naïve Bayes classifiers and the theoretical limits on learning these concepts from data.

There are many extensions to the naïve Bayes classifier that impose limited dependencies among the feature/attribute nodes, such as tree-augmented naïve Bayes [6] and forest-augmented naïve Bayes [11].

The Bayesian belief network [10] overcomes the rigidity imposed by this assumption by allowing the dependence relationships among a set of attributes to be modeled as a directed acyclic graph. Associated with each node in the graph is a conditional probability table. Note that a node in a Bayesian network is conditionally independent of its non-descendants given its parents.
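In general, the network encodes a factorization of the joint distribution: each variable depends directly only on its parents, so for variables X1, …, Xn (including the class variable),

$$ P\left(X_1,\dots,X_n\right)=\prod_i P\left(X_i\mid \mathrm{Parents}\left(X_i\right)\right) $$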

Returning to the running example, suppose the probabilistic relationships among Home Owner, Marital Status, Annual Income, and Credit Risk are as shown in Fig. 2. Associated with each node is the conditional probability table relating the node to its parent node(s).

Bayesian Classification, Fig. 2: Bayesian belief network of the credit risks dataset

In the Bayesian belief network, the probabilities are estimated as follows, where each summation runs over the values of the corresponding variable that are not fixed by the evidence X:

$$ P\left(\mathit{Risk}\mid X\right) = \alpha \sum_{I} P\left(\mathit{Risk}\mid \mathit{Income}\right) \sum_{O} P\left(\mathit{Owner}\right) \sum_{S} P\left(\mathit{Status}\right) P\left(\mathit{Income}\mid \mathit{Owner},\mathit{Status}\right) $$

In the above example,

$$ \begin{aligned} P\left(\mathrm{Low}\mid X\right) &= P\left(\mathrm{Low}\mid \mathrm{Income}=\mathrm{High}\right)\\ &\quad\times P\left(\mathrm{Income}=\mathrm{High}\mid \mathrm{Owner}=\mathrm{Yes}, \mathrm{Status}=\mathrm{Married}\right)\\ &= 1\times 1 = 1 \end{aligned} $$

Alternatively, by recognizing that, given Annual Income, Credit Risk is conditionally independent of Home Owner and Marital Status, then

$$ P\left(\mathrm{Low}\mid X\right)=P\left(\mathrm{Low}\mid \mathrm{Income}=\mathrm{High}\right)=1 $$

This example illustrates that the classification problem in Bayesian networks is a special case of belief updating for any node (target class) Y in the network, given evidence X.
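The following sketch illustrates this kind of belief updating by enumeration. The network structure (Home Owner and Marital Status as parents of Annual Income, which is the sole parent of Credit Risk) follows the equations above; since Fig. 2 is not reproduced here, all probability tables are hypothetical except P(Risk | Income = High), which the worked example fixes to be deterministic.

```python
# Belief updating by enumeration for the assumed network of Fig. 2:
# Owner -> Income <- Status, Income -> Risk.
P_owner  = {"Yes": 0.6, "No": 0.4}             # hypothetical
P_status = {"Married": 0.5, "Single": 0.5}     # hypothetical
P_income = {  # P(Income | Owner, Status), hypothetical
    ("Yes", "Married"): {"High": 1.0, "Low": 0.0},
    ("Yes", "Single"):  {"High": 0.7, "Low": 0.3},
    ("No",  "Married"): {"High": 0.5, "Low": 0.5},
    ("No",  "Single"):  {"High": 0.2, "Low": 0.8},
}
P_risk = {  # P(Risk | Income); deterministic for High income, as in the text
    "High": {"Low": 1.0, "High": 0.0},
    "Low":  {"Low": 0.2, "High": 0.8},
}

def posterior_risk(evidence):
    """P(Risk | evidence), summing out the variables not in the evidence."""
    scores = {"Low": 0.0, "High": 0.0}
    for o in P_owner:
        for s in P_status:
            for i in P_income[(o, s)]:
                # skip joint configurations inconsistent with the evidence
                if any(evidence.get(k) not in (None, v)
                       for k, v in (("Owner", o), ("Status", s), ("Income", i))):
                    continue
                w = P_owner[o] * P_status[s] * P_income[(o, s)][i]
                for r in scores:
                    scores[r] += w * P_risk[i][r]
    z = sum(scores.values())   # normalization: alpha = 1/P(evidence)
    return {r: p / z for r, p in scores.items()}

print(posterior_risk({"Owner": "Yes", "Status": "Married", "Income": "High"}))
# {'Low': 1.0, 'High': 0.0} -- matches P(Low | X) = 1 in the text
```

Calling `posterior_risk({"Income": "High"})` returns the same answer, which mirrors the conditional-independence shortcut of the previous paragraph.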

While Bayesian belief networks provide an approach to capturing dependencies among variables using a graphical model, constructing the network can be time consuming and costly. Substantial research has addressed, and continues to address, inference in Bayesian networks as well as their automated construction by learning from data. Much progress has been made, so applying Bayesian networks for classification is no longer as time consuming and costly as before, and the approach is gaining headway in mainstream applications.

Key Applications

Bayesian classification techniques have been applied in many domains. A few of the more common applications of Bayesian classifiers are mentioned here.

Text Document Classification

Text classification refers to the assignment of text documents to predefined categories so as to improve the efficiency and effectiveness of text retrieval. Typically, the text documents are preprocessed and keywords are selected. Based on the selected keywords of the documents, probabilistic classifiers are built. Dumais et al. [5] show that the naïve Bayes classifier yields surprisingly good classifications for text documents.
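As a sketch of this pipeline, the fragment below trains a multinomial naïve Bayes classifier on a bag-of-words representation using scikit-learn (a library not cited in this entry); the documents and category labels are invented for illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Hypothetical toy corpus with two predefined categories.
docs = [
    "the striker scored a late goal",
    "the keeper saved a penalty",
    "shares fell as markets opened",
    "the central bank raised interest rates",
]
labels = ["sports", "sports", "finance", "finance"]

# Bag-of-words counts stand in for the keyword-selection step.
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(docs)

clf = MultinomialNB()   # applies Laplace smoothing (alpha=1.0) by default
clf.fit(X, labels)

print(clf.predict(vectorizer.transform(["goal scored in the match"])))
# expected: ['sports']
```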

Image Pattern Recognition

In image pattern recognition, a set of elementary or low-level image features is selected to describe characteristics of the objects. Data extracted based on this feature set are used to train Bayesian classifiers for subsequent object recognition. Aggarwal et al. [1] presented a comparative study of three paradigms for object recognition: Bayesian statistics, neural networks, and expert systems.

Medical Diagnostic and Decision Support Systems

Large amounts of medical data are available for analysis. Knowledge derived from analyzing these data can be used to assist physicians in subsequent diagnoses. In this area, naïve Bayes classifiers have performed exceptionally well. Kononenko et al. [12] showed that the naïve Bayes classifier outperformed other classification algorithms on five out of eight medical diagnostic problems.

Email Spam Filtering

With the growing problem of junk email, it is desirable to have automatic email spam filters that eliminate unwanted messages from a user's mail stream. Bayesian classifiers that take domain-specific features into consideration when classifying emails are now accurate enough for real-world use.

Data Sets

http://archive.ics.uci.edu/beta/datasets.html

http://spamassassin.apache.org/publiccorpus/

URL to Code

More recent lists of Bayesian network software can be found at:

Kevin Murphy’s website:

http://www.cs.ubc.ca/~murphyk/Bayes/bnsoft.html

Google directory:

http://directory.google.com/Top/Computers/Artificial_Intelligence/Belief_Networks/Software/

Specialized naïve Bayes classification software:

jBNC – a Java toolkit for variants of naïve Bayes classifiers, with a WEKA interface

Cross-References