Abstract
To solve Maximum Mutual Information (MMI) and Maximum Likelihood (ML) problems for tests, estimations, and mixture models, we can obtain a new iterative algorithm from the Semantic Mutual Information (SMI) and the R(G) function proposed by Chenguang Lu (1993), where the R(G) function is an extension of the information rate distortion function R(D), G is the lower limit of the SMI, and R(G) is the minimum R for given G. This paper focuses on mixture models. The SMI is defined as the average log normalized likelihood. The likelihood function is produced from the truth function and the prior by semantic Bayesian inference. A group of truth functions constitutes a semantic channel. By letting the semantic channel and the Shannon channel mutually match and iterate, we can obtain the Shannon channel that maximizes the MMI and the average log likelihood. Therefore, this iterative algorithm is called the Channels’ Matching algorithm, or the CM algorithm. It is proved that the relative entropy between the sampling distribution and the predicted distribution may be equal to R − G. Hence, solving a maximum likelihood mixture model only requires minimizing R − G, without needing Jensen’s inequality. The convergence can be intuitively explained and proved by the R(G) function. Two iterative examples of mixture models (demonstrated in an Excel file) show that the computation required by the CM algorithm is simple. In most cases, the number of iterations for convergence (to relative entropy < 0.001 bit) is about 5. The CM algorithm is similar to the EM algorithm; however, the CM algorithm has better convergence and more potential applications.
1 Introduction
To obtain maximum likelihood mixture models, the EM algorithm [1] and the Newton method [2] are often used, and there have been many papers on applying or improving the EM algorithm. Lu proposed the semantic information measure (SIM) and the R(G) function in 1993 [3,4,5]. The R(G) function is an extension of Shannon’s information rate distortion function R(D) [6, 7]; R(G) denotes the minimum R for a given SIM G. Using the SIM and the R(G) function, we can obtain a new iterative algorithm, i.e., the Channels’ Matching algorithm (or the CM algorithm). The CM algorithm proposed in this paper is seemingly similar to, yet essentially different from, the EM algorithm (see Footnote 1).
In this study, we use the sampling distribution instead of the sampling sequence. Assume that the sampling distribution is P(X) and the distribution predicted by the mixture model is Q(X). The goal is to minimize the relative entropy or Kullback-Leibler (KL) divergence H(Q||P) [8, 9]. With the semantic information method, we can prove H(Q||P) = R(G) − G. Then, by maximizing G and modifying R alternately, we can minimize H(Q||P).
We first introduce the semantic channel, semantic information measure, and R(G) function in a way that is as compatible with the likelihood method as possible. Then we discuss how the CM algorithm is applied to mixture models. Finally, we compare the CM algorithm with the EM algorithm to show the advantages of the CM algorithm.
2 Semantic Channel, Semantic Information Measure, and the R(G) Function
2.1 From the Shannon Channel to the Semantic Channel
First, we introduce the Shannon channel.
Let X be a discrete random variable representing a fact with alphabet A = {x_1, x_2, …, x_m}, and let Y be a discrete random variable representing a message with alphabet B = {y_1, y_2, …, y_n}. A Shannon channel is composed of a group of transition probability functions [6]: P(y_j|X), j = 1, 2, …, n.
In terms of hypothesis testing, X is a sample point and Y is a hypothesis or a model label. We need a sample sequence or a sampling distribution P(X|.) to test a hypothesis, i.e., to see how accurate it is.
Let Θ be a random variable for a predictive model, and let θ_j be a value taken by Θ when Y = y_j. The semantic meaning of a predicate y_j(X) is defined by θ_j or its (fuzzy) truth function T(θ_j|X) ∈ [0, 1]. Because T(θ_j|X) is constructed with some parameters, we may also treat θ_j as a set of model parameters. We can also state that T(θ_j|X) is defined by a normalized likelihood, i.e., T(θ_j|X) = kP(θ_j|X)/P(θ_j) = kP(X|θ_j)/P(X), where k is a coefficient that makes the maximum of T(θ_j|X) equal 1. The θ_j can also be regarded as a fuzzy set, and T(θ_j|X) can be considered a membership function of a fuzzy set as proposed by Zadeh [10].
In contrast to the popular likelihood method, the above method uses sub-models θ_1, θ_2, …, θ_n instead of one model θ or Θ. The P(X|θ_j) here is equivalent to P(X|y_j, θ) in the popular likelihood method. A sample used to test y_j is also a sub-sample or a conditional sample. These changes make the new method more flexible and more compatible with Shannon’s information theory.
A semantic channel is composed of a group of truth value functions or membership functions: T(θ_j|X), j = 1, 2, …, n.
Similar to P(y_j|X), T(θ_j|X) can also be used for Bayesian prediction to produce a likelihood function [4]:

\( P(X|\theta_j) = \frac{P(X)T(\theta_j|X)}{T(\theta_j)}, \quad T(\theta_j) = \sum_i P(x_i)T(\theta_j|x_i), \)

where T(θ_j) is called the logical probability of y_j. The author now knows that this formula was proposed by Thomas as early as 1981 [11]. We call this prediction the semantic Bayesian prediction. If T(θ_j|X) ∝ P(y_j|X), then the semantic Bayesian prediction is equivalent to the Bayesian prediction.
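As a minimal numeric sketch (the prior distribution and the Gaussian-shaped truth function below are hypothetical, not taken from the paper’s examples), the semantic Bayesian prediction can be computed as follows:

```python
import numpy as np

# Hypothetical discrete prior P(X) over five values of X
X = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
P_X = np.array([0.1, 0.2, 0.4, 0.2, 0.1])

# A Gaussian-shaped truth function T(theta_j|X); its maximum is 1 at X = 2
T = np.exp(-(X - 2.0) ** 2 / 2.0)

# Logical probability of y_j: T(theta_j) = sum_i P(x_i) T(theta_j|x_i)
T_theta = float(np.sum(P_X * T))

# Semantic Bayesian prediction: P(X|theta_j) = P(X) T(theta_j|X) / T(theta_j)
P_X_given_theta = P_X * T / T_theta

print(P_X_given_theta.round(4))
```

Note that the prediction is automatically normalized because T(θ_j) is exactly the normalizing constant ∑_i P(x_i)T(θ_j|x_i).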
2.2 Semantic Information Measure and the Optimization of the Semantic Channel
The semantic information conveyed by y_j about x_i is defined by the normalized likelihood as [3]:

\( I(x_i;\theta_j) = \log\frac{T(\theta_j|x_i)}{T(\theta_j)} = \log\frac{P(x_i|\theta_j)}{P(x_i)}, \)

where the semantic Bayesian inference is used; it is assumed that the prior likelihood function P(X|Θ) is equal to the prior probability distribution P(X).
After averaging I(x_i;θ_j), we obtain the semantic (or generalized) KL information:

\( I(X;\theta_j) = \sum_i P(x_i|y_j)\log\frac{P(x_i|\theta_j)}{P(x_i)}. \)

The statistical probability P(x_i|y_j), i = 1, 2, …, on the left of the log above represents a sampling distribution used to test the hypothesis y_j or model θ_j. Assume that we choose y_j according to an observed condition Z ∈ C. If y_j = f(Z|Z ∈ C_j), where C_j is a subset of C, then P(X|y_j) = P(X|C_j).
After averaging I(X;θ_j), we obtain the semantic (or generalized) mutual information:

\( I(X;\Theta) = \sum_j P(y_j)\sum_i P(x_i|y_j)\log\frac{P(x_i|\theta_j)}{P(x_i)} = H(X) - H(X|\Theta), \)

where H(X) is the Shannon entropy of X and H(X|Θ) is the generalized posterior entropy of X. Each of them has a coding meaning [4, 5].
Optimizing a semantic channel is equivalent to optimizing a predictive model Θ. For given y_j = f(Z|Z ∈ C_j), optimizing θ_j is equivalent to optimizing T(θ_j|X) by

\( T^*(\theta_j|X) = \mathop{\arg\max}_{T(\theta_j|X)} I(X;\theta_j). \)

It is easy to prove that I(X;θ_j) reaches its maximum when P(X|θ_j) = P(X|y_j), or

\( \frac{T(\theta_j|X)}{T(\theta_j)} = \frac{P(X|y_j)}{P(X)}. \)

Setting the maximum of T(θ_j|X) to 1, we obtain

\( T^*(\theta_j|X) = \frac{P(X|y_j)/P(X)}{P(x_j^*|y_j)/P(x_j^*)}, \)

where \( x_j^* \) makes \( P(x_j^*|y_j)/P(x_j^*) \) the maximum of P(X|y_j)/P(X).
2.3 Relationship Between Semantic Mutual Information and Likelihood
Assume that the size of the sample used to test y_j is N_j and that the sample points come from independent and identically distributed random variables. Among these points, the number of occurrences of x_i is N_ij. When N_j is large enough, P(x_i|y_j) = N_ij/N_j. Hence, the log normalized likelihood is

\( \log\prod_i \left[\frac{P(x_i|\theta_j)}{P(x_i)}\right]^{N_{ij}} = N_j\sum_i P(x_i|y_j)\log\frac{P(x_i|\theta_j)}{P(x_i)} = N_j I(X;\theta_j). \)
After averaging the above likelihood over the different y_j, j = 1, 2, …, n, we have the average log normalized likelihood:

\( \frac{1}{N}\log\prod_j\prod_i \left[\frac{P(x_i|\theta_j)}{P(x_i)}\right]^{N_{ij}} = \sum_j P(y_j)\sum_i P(x_i|y_j)\log\frac{P(x_i|\theta_j)}{P(x_i)} = I(X;\Theta), \)

where N = N_1 + N_2 + ··· + N_n and P(y_j) = N_j/N. This shows that the ML criterion is equivalent to the minimum generalized posterior entropy criterion and the Maximum Semantic Information (MSI) criterion. When P(X|θ_j) = P(X|y_j) for all j, the semantic mutual information I(X;Θ) is equal to the Shannon mutual information I(X;Y), which is thus a special case of I(X;Θ).
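The equivalence between the average log normalized likelihood computed from counts N_ij and the semantic mutual information computed from distributions can be checked numerically; the counts and likelihood functions below are hypothetical:

```python
import numpy as np

# Hypothetical counts N_ij (rows: x_i, columns: y_j) and hypothetical
# likelihood functions P(x_i|theta_j); each column of the latter sums to 1.
counts = np.array([[40.0,  5.0],
                   [30.0, 10.0],
                   [20.0, 25.0],
                   [10.0, 60.0]])
N_j = counts.sum(axis=0)
N_total = counts.sum()
P_X = counts.sum(axis=1) / N_total        # sampling distribution P(X)
P_Y = N_j / N_total                       # P(y_j) = N_j / N
P_X_given_Y = counts / N_j                # P(x_i|y_j) = N_ij / N_j

P_X_given_theta = np.array([[0.45, 0.05],
                            [0.30, 0.15],
                            [0.15, 0.25],
                            [0.10, 0.55]])

log_ratio = np.log2(P_X_given_theta / P_X[:, None])

# Average log normalized likelihood computed directly from the counts:
avg_llr = float(np.sum(counts * log_ratio) / N_total)

# Semantic mutual information I(X;Theta) computed from the distributions:
I_sem = float(np.sum(P_Y * np.sum(P_X_given_Y * log_ratio, axis=0)))

print(round(avg_llr, 6), round(I_sem, 6))   # identical values
```

The two quantities coincide exactly because N_ij = N·P(y_j)·P(x_i|y_j), so averaging counts and averaging distributions are the same operation.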
2.4 The Matching Function R(G) of R and G
The R(G) function is an extension of the rate distortion function R(D) [7]. In the R(D) function, R is the information rate and D is the upper limit of the distortion. The R(D) function means that, for given D, R = R(D) is the minimum of the Shannon mutual information I(X;Y).
Let the distortion function d_ij be replaced with I_ij = I(x_i;y_j) = log[T(θ_j|x_i)/T(θ_j)] = log[P(x_i|θ_j)/P(x_i)], and let G be the lower limit of the semantic mutual information I(X;Θ). The information rate for given G and P(X) is defined as

\( R(G) = \min_{P(Y|X):\, I(X;\Theta)\ge G} I(X;Y). \)

Following the derivation of R(D) [12], we can obtain [3]

\( P(y_j|x_i) = \frac{P(y_j)m_{ij}^s}{\lambda_i}, \quad G(s) = \sum_i\sum_j P(x_i)P(y_j|x_i)\log m_{ij}, \quad R(s) = sG(s) - \sum_i P(x_i)\log\lambda_i, \)

where m_ij = T(θ_j|x_i)/T(θ_j) = P(x_i|θ_j)/P(x_i) is the normalized likelihood and \( \lambda_i = \sum_j P(y_j)m_{ij}^s \). We may also use m_ij = P(x_i|θ_j), which results in the same \( m_{ij}^s/\lambda_i \). The shape of an R(G) function is a bowl-like curve, as shown in Fig. 1.
The R(G) function is different from the R(D) function: for a given R, there are a maximum value G+ and a minimum value G−. The G− is negative, which means that to cause a certain information loss |G−| to enemies, we still need a certain amount of objective information R.
In rate distortion theory, dR/dD = s with s ≤ 0. It is easy to prove that dR/dG = s also holds here, where s may be negative or positive. Increasing s raises the model’s prediction precision. If s changes from a positive value s_1 to −s_1, then R(−s_1) = R(s_1) and G changes from G+ to G− (see Fig. 1).
When s = 1, λ_i = 1 and R = G, which means that the semantic channel matches the Shannon channel and the semantic mutual information is equal to the Shannon mutual information. When s = 0, R = 0 and G < 0. In Fig. 1, c = G(s = 0).
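The parametric solution P(y_j|x_i) = P(y_j)m_ij^s/λ_i can be explored numerically. In the sketch below (a hypothetical discretized mixture in which the guessed components happen to equal the true ones), P(Y) is iterated to a fixed point for each s, Blahut-Arimoto style; in this matched case, s = 1 indeed gives R = G:

```python
import numpy as np

# Hypothetical discretized setting: grid for X, true two-component mixture,
# and components that exactly match the true ones (the matched case).
x = np.linspace(-5.0, 15.0, 201)

def gauss(c, d):
    p = np.exp(-(x - c) ** 2 / (2.0 * d ** 2))
    return p / p.sum()          # normalize over the discrete grid

P_X = 0.7 * gauss(2.0, 1.5) + 0.3 * gauss(8.0, 1.5)
P_X_theta = np.stack([gauss(2.0, 1.5), gauss(8.0, 1.5)], axis=1)
m = P_X_theta / P_X[:, None]    # normalized likelihood m_ij

def R_G_point(s, iters=500):
    """One point (G(s), R(s)) on the R(G) curve, in bits."""
    P_Y = np.array([0.5, 0.5])
    for _ in range(iters):
        lam = (m ** s) @ P_Y                    # lambda_i = sum_j P(y_j) m_ij^s
        P_Y = P_X @ (P_Y * (m ** s) / lam[:, None])
    lam = (m ** s) @ P_Y
    chan = P_Y * (m ** s) / lam[:, None]        # P(y_j|x_i)
    G = float(np.sum(P_X[:, None] * chan * np.log2(m)))
    R = float(s * G - np.sum(P_X * np.log2(lam)))
    return G, R

for s in (0.5, 1.0, 2.0):
    G, R = R_G_point(s)
    print(s, round(G, 4), round(R, 4))
```

The formula R(s) = sG(s) − ∑_i P(x_i)log λ_i follows directly by substituting the channel P(y_j|x_i) = P(y_j)m_ij^s/λ_i into the Shannon mutual information.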
3 The CM Algorithm for Mixture Models
3.1 Explaining the Iterative Process by the R(G) Function
Assume that a sampling distribution P(X) is produced by a conditional probability P*(X|Y) whose components are some parametric functions, such as Gaussian distributions. We only know that the number of mixture components is n, without knowing P(Y). We need to solve for P(Y) and the model (or parameters) Θ so that the predicted probability distribution of X, denoted by Q(X), is as close to the sampling distribution P(X) as possible, i.e., the relative entropy or Kullback-Leibler divergence H(Q||P) is as small as possible. Fig. 2 shows the convergent processes of two examples.
We use P*(Y) and P*(X|Y) to denote the P(Y) and P(X|Y) that produce the sampling distribution P(X), and use P*(Y|X) and R* = I*(X;Y) to denote the corresponding Shannon channel and Shannon mutual information. When Q(X) = P(X), there should be P(X|Θ) = P*(X|Y) and G* = R*.
For mixture models, when we let the Shannon channel match the semantic channel (in the Left-steps), we do not maximize I(X;Θ); instead, we seek a P(X|Θ) that accords with P*(X|Y) as well as possible (the purpose of Left-step a in Fig. 2) and a P(Y) that accords with P*(Y) as well as possible (the purpose of Left-step b in Fig. 2). That means we seek an R that is as close to R* as possible; meanwhile, I(X;Θ) may decrease. In popular EM algorithms, however, the objective function, such as P(X^N, Y|Θ), is required to keep increasing, without decreasing, in both steps.
With the CM algorithm, only after the optimal model is obtained, if we need to choose Y according to X (for decision or classification), do we seek the Shannon channel P(Y|X) that conveys the MMI R_max(G_max) (see Left-step c in Fig. 2).
Assume that P(X) is produced by P*(X|Y) with Gaussian components. Then the likelihood functions are

\( P(X|\theta_j) = \frac{1}{\sqrt{2\pi}d_j}\exp\left[-\frac{(X-c_j)^2}{2d_j^2}\right], \quad j = 1, 2, \ldots, n. \)

If n = 2, then the parameters are c_1, c_2, d_1, and d_2. At the beginning of the iteration, we may set P(y_j) = 1/n for all j. We begin iterating from Left-step a.
Left-step a:
Construct the Shannon channel by

\( P(y_j|X) = \frac{P(y_j)P(X|\theta_j)}{Q(X)}, \quad Q(X) = \sum_k P(y_k)P(X|\theta_k). \quad (12) \)

This formula has already been used in the EM algorithm [1]; it was also used in the derivation of the R(D) function [12]. Hence the semantic mutual information is

\( I(X;\Theta) = \sum_i\sum_j P(x_i)P(y_j|x_i)\log\frac{P(x_i|\theta_j)}{P(x_i)}. \quad (13) \)
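A sketch of Left-step a on a discretized X; the grid, the guessed Gaussian components, and the sampling distribution below are all hypothetical:

```python
import numpy as np

# Hypothetical discretized setting: a grid for X, two guessed Gaussian
# components P(x_i|theta_j), and a sampling distribution P(X).
x = np.linspace(-5.0, 15.0, 201)

def gauss(c, d):
    p = np.exp(-(x - c) ** 2 / (2.0 * d ** 2))
    return p / p.sum()          # normalize over the discrete grid

P_X_theta = np.stack([gauss(3.0, 2.0), gauss(7.0, 2.0)], axis=1)
P_Y = np.array([0.5, 0.5])      # initial P(Y) = 1/n
P_X = 0.7 * gauss(2.0, 1.5) + 0.3 * gauss(8.0, 1.5)

# Left-step a: Q(x_i) = sum_j P(y_j) P(x_i|theta_j), then
# P(y_j|x_i) = P(y_j) P(x_i|theta_j) / Q(x_i)
Q = P_X_theta @ P_Y
P_Y_given_X = P_Y * P_X_theta / Q[:, None]

# Semantic mutual information I(X;Theta) in bits
I_sem = float(np.sum(P_X[:, None] * P_Y_given_X
                     * np.log2(P_X_theta / P_X[:, None])))
print(round(I_sem, 3))
```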
Left-step b:
Use the following equation repeatedly to obtain a new P(Y) until the iteration converges:

\( P(y_j) \Leftarrow \sum_i P(x_i)P(y_j|x_i). \)

The convergent P(Y) is denoted by P^{+1}(Y). This iteration is needed because P(Y|X) from Eq. (12) is not yet a matched Shannon channel, so that ∑_i P(x_i)P(y_j|x_i) ≠ P(y_j). The above iteration makes P^{+1}(Y) match P(X) and P(X|Θ) better. This iteration has been used by some authors, such as in [13].
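Left-step b can be sketched as a fixed-point iteration on P(Y); the discretized grid, components, and sampling distribution below are hypothetical:

```python
import numpy as np

# Hypothetical discretized setting (grid, guessed components, sampling P(X)).
x = np.linspace(-5.0, 15.0, 201)

def gauss(c, d):
    p = np.exp(-(x - c) ** 2 / (2.0 * d ** 2))
    return p / p.sum()

P_X_theta = np.stack([gauss(3.0, 2.0), gauss(7.0, 2.0)], axis=1)
P_X = 0.7 * gauss(2.0, 1.5) + 0.3 * gauss(8.0, 1.5)

# Left-step b: repeat P(y_j) <= sum_i P(x_i) P(y_j|x_i), recomputing the
# channel from the new P(Y) each time, until P(Y) reaches a fixed point.
P_Y = np.array([0.5, 0.5])
for _ in range(1000):
    P_Y_given_X = P_Y * P_X_theta / (P_X_theta @ P_Y)[:, None]
    P_Y_new = P_X @ P_Y_given_X
    if np.max(np.abs(P_Y_new - P_Y)) < 1e-10:
        P_Y = P_Y_new
        break
    P_Y = P_Y_new

print(P_Y.round(3))   # the converged P+1(Y)
```

Because the sampling distribution here puts most of its mass near the lower component, the converged weight of that component exceeds 1/2.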
When n = 2, we should avoid choosing c_1 and c_2 such that both are larger than, or both less than, the mean of X; otherwise P(y_1) or P(y_2) will become 0 and can never become positive again.
If H(Q||P) is less than a small number, such as 0.001 bit, then end the iteration; otherwise, go to the Right-step.
Right-step:
Optimize the parameters in the likelihood function P(X|Θ) on the right of the log in Eq. (13) to maximize I(X;Θ). Then go to Left-step a.
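Putting the steps together, the whole CM iteration can be sketched on a hypothetical discretized two-component Gaussian mixture (the grid, true mixture, and guessed parameters are all assumptions of this example; for Gaussian components, the Right-step maximization has the closed-form weighted-moment update):

```python
import numpy as np

# End-to-end sketch of the CM iteration on a hypothetical discretized
# two-component Gaussian mixture.
x = np.linspace(-5.0, 15.0, 401)

def gauss(c, d):
    p = np.exp(-(x - c) ** 2 / (2.0 * d ** 2))
    return p / p.sum()

def kl_bits(p, q):
    return float(np.sum(p * np.log2(p / q)))   # relative entropy in bits

P_X = 0.7 * gauss(2.0, 1.5) + 0.3 * gauss(8.0, 1.5)   # sampling distribution
c = np.array([3.0, 7.0])                               # guessed means
d = np.array([2.0, 2.0])                               # guessed std devs
P_Y = np.array([0.5, 0.5])                             # initial P(Y) = 1/n

for step in range(50):
    P_X_theta = np.stack([gauss(c[0], d[0]), gauss(c[1], d[1])], axis=1)
    # Left-step a: construct the Shannon channel from the semantic channel
    P_Y_given_X = P_Y * P_X_theta / (P_X_theta @ P_Y)[:, None]
    # Left-step b: iterate P(Y) to a fixed point
    for _ in range(1000):
        P_Y_new = P_X @ P_Y_given_X
        done = np.max(np.abs(P_Y_new - P_Y)) < 1e-12
        P_Y = P_Y_new
        P_Y_given_X = P_Y * P_X_theta / (P_X_theta @ P_Y)[:, None]
        if done:
            break
    if kl_bits(P_X, P_X_theta @ P_Y) < 0.001:          # H(Q||P) < 0.001 bit
        break
    # Right-step: maximize I(X;Theta) over c_j, d_j; for Gaussian components
    # this is the closed-form weighted-moment update
    w = P_X[:, None] * P_Y_given_X                     # joint P(x_i, y_j)
    wj = w.sum(axis=0)
    c = (w * x[:, None]).sum(axis=0) / wj
    d = np.sqrt((w * (x[:, None] - c) ** 2).sum(axis=0) / wj)

print(step + 1, c.round(2), d.round(2), P_Y.round(3))
```

With this well-separated mixture, the loop stops after only a handful of outer iterations, consistent with the paper’s observation that about 5 iterations usually suffice.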
3.2 Using Two Examples to Show the Iterative Processes
3.2.1 Example 1 for R < R*
Table 1 lists the real parameters that produce the sample distribution P(X) and the guessed parameters that are used to produce Q(X). The convergence process from the starting (G, R) to (G*, R*) is shown by the iterative locus for R < R* in Fig. 2. The convergence speed and the changes of R and G are shown in Fig. 3. The iterative results are shown in Table 1.
Analyses:
In this iterative process, we always have R < R* and G < G*. After each step, R and G increase a little, so that G approaches G* gradually. This process seems to tell us that each of the Right-step, Left-step a, and Left-step b increases G, and hence that maximizing G minimizes H(Q||P), which is our goal. Yet this impression is wrong: Left-step a and Left-step b do not necessarily increase G, and there are many counterexamples. Fortunately, the iterations for these counterexamples still converge. Let us see Example 2 as a counterexample.
3.2.2 Example 2 for R > R*
Table 2 shows the parameters and iterative results for R > R*. The iterative process is shown in Fig. 4.
Analyses:
G is neither monotonically increasing nor monotonically decreasing: it increases in all Right-steps and decreases in all Left-steps. This example is a challenge to all authors who prove that the standard EM algorithm or a variant EM algorithm converges. If G is not monotonically increasing, it must be difficult or impossible to prove that log P(X^N, Y|Θ) or another likelihood is monotonically increasing or non-decreasing in all steps. For example, in Example 2, Q* = −NH*(X, Y) = −6.031N. After the first optimization of the parameters, Q = −6.011N > Q*. If we continue to maximize Q, Q cannot move back down toward Q*.
We also used some other true models P*(X|Y) and P*(Y) to test the CM algorithm. In most cases, the number of iterations is close to 5. In rare cases where R and G are much bigger than G*, such as R ≈ G > 2G*, the iterative convergence is slow. In these cases, where log P(X^N, Y|Θ) is also much bigger than log P*(X^N, Y), the EM algorithm confronts a similar problem. Because of these cases, the convergence proof of the EM algorithm is challenged.
3.3 The Convergence Proof of the CM Algorithm
Proof.
To prove that the CM algorithm converges, we need to prove that H(Q||P) is decreasing or non-increasing in every step.
Consider the Right-step.
Assume that the Shannon mutual information conveyed by Y about Q(X) is R Q , and that about P(X) is R. Then we have
According to Eqs. (13) and (15), we have

\( H(Q||P) = R_Q - G. \quad (17) \)
Because of this equation, we do not need Jensen’s inequality that the EM algorithm needs.
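The identity can be checked numerically. In this sketch (a hypothetical discretized mixture), both mutual-information terms are averaged over the sampling distribution P(X), which is an assumption about the intended convention:

```python
import numpy as np

# Numeric check of H(Q||P) = R_Q - G for the channel of Left-step a.
x = np.linspace(-5.0, 15.0, 201)

def gauss(c, d):
    p = np.exp(-(x - c) ** 2 / (2.0 * d ** 2))
    return p / p.sum()

P_X = 0.7 * gauss(2.0, 1.5) + 0.3 * gauss(8.0, 1.5)       # sampling P(X)
P_X_theta = np.stack([gauss(3.0, 2.0), gauss(7.0, 2.0)], axis=1)
P_Y = np.array([0.4, 0.6])                                # arbitrary weights

Q = P_X_theta @ P_Y                                       # predicted Q(X)
chan = P_Y * P_X_theta / Q[:, None]                       # P(y_j|x_i)

G = float(np.sum(P_X[:, None] * chan * np.log2(P_X_theta / P_X[:, None])))
R_Q = float(np.sum(P_X[:, None] * chan * np.log2(chan / P_Y)))
H_QP = float(np.sum(P_X * np.log2(P_X / Q)))              # relative entropy

print(round(R_Q - G, 6), round(H_QP, 6))   # the two values coincide
```

Under this convention the identity is purely algebraic: inside the double sum, log[P(y_j|x_i)/P(y_j)] − log[P(x_i|θ_j)/P(x_i)] = log[P(x_i)/Q(x_i)], and summing out j leaves exactly the relative entropy.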
In the Right-steps, the Shannon channel and R_Q do not change, and G is maximized. Therefore, H(Q||P) decreases, and its decrement is equal to the increment of G.
Consider Left-step a.
After this step, Q(X) becomes Q^{+1}(X) = ∑_j P(y_j)P^{+1}(X|θ_j). Since Q^{+1}(X) is produced by a better likelihood function and the same P(Y), Q^{+1}(X) should be closer to P(X) than Q(X), i.e., H(Q^{+1}||P) < H(Q||P). (A more rigorous mathematical proof of this conclusion is still needed.)
Consider Left-step b.
The iteration for P^{+1}(Y) moves (G, R) onto the R(G) function curve determined by P(X) and the P(X|θ_j) (for all j) that form a semantic channel. This conclusion can be obtained from the derivation processes of the R(D) function [12] and the R(G) function [3]; a similar iteration is used for P(Y|X) and P(Y) in deriving the R(D) function. Because R(G) is the minimum R for given G, H(Q||P) = R_Q − G = R − G becomes smaller.
Because H(Q||P) becomes smaller after every step, the iteration converges. Q.E.D.
3.4 The Decision Function with the ML Criterion
After we obtain the optimized P(X|Θ), we need to select Y (to make a decision or classification) according to X. The parameter s in the R(G) function (see Eq. (11)) reminds us that we may use the following Shannon channel:

\( P(y_j|x_i) = \frac{P(y_j)\left[P(x_i|\theta_j)\right]^s}{\sum_k P(y_k)\left[P(x_i|\theta_k)\right]^s}, \)

whose components are fuzzy decision functions. When s → +∞, the fuzzy decision becomes a crisp decision. Different from Maximum A Posteriori (MAP) estimation, the above decision function still adheres to the ML criterion or the MSI criterion. Left-step c in Fig. 2 shows that (G, R) moves to (G_max, R_max) as s increases.
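A sketch of the fuzzy decision channel for two hypothetical Gaussian components, showing how increasing s sharpens the decision:

```python
import numpy as np

# Fuzzy decision channel P(y_j|x_i) proportional to P(y_j)[P(x_i|theta_j)]^s
# for two hypothetical Gaussian components.
x = np.linspace(-5.0, 15.0, 201)

def gauss(c, d):
    p = np.exp(-(x - c) ** 2 / (2.0 * d ** 2))
    return p / p.sum()

P_X_theta = np.stack([gauss(2.0, 1.5), gauss(8.0, 1.5)], axis=1)
P_Y = np.array([0.7, 0.3])

def decision_channel(s):
    m = P_Y * P_X_theta ** s
    return m / m.sum(axis=1, keepdims=True)

soft = decision_channel(1.0)     # fuzzy decision
crisp = decision_channel(20.0)   # nearly crisp decision

i = int(np.argmin(np.abs(x - 5.5)))   # a point between the two centers
print(soft[i].round(3), crisp[i].round(3))
```

At a point between the two component centers, the s = 1 channel stays fuzzy while the s = 20 channel assigns nearly all probability to the nearer component.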
3.5 Comparing the CM Algorithm and the EM Algorithm
In the EM algorithm [1, 14], the log likelihood of a mixture model satisfies log P(X^N|Θ) ≥ L = Q − H. If we move P(Y) or P(Y|Θ) from Q into H, then Q becomes −NH(X|Θ) and H becomes −NR_Q. If we add NH(X) to both sides of the inequality, we have H(Q||P) ≤ R_Q − G, which is similar to Eq. (17). It is easy to prove a similar relation in which the generalized entropy H(Y) = −∑_j P^{+1}(y_j) log P(y_j) appears. In brief, we may think of the M-step as merging Left-step b and the Right-step of the CM algorithm into one step.
In the EM algorithm, if we first optimize P(Y) (not for maximum Q) and then optimize P(X|Y, Θ), the M-step becomes equivalent to the corresponding steps of the CM algorithm.
There are also other improved EM algorithms [13, 15,16,17] with some advantages. However, none of these algorithms ensures that R converges to R* and that R − G converges to 0, as the CM algorithm does.
The convergence rationale of the CM algorithm is clearer than that of the EM algorithm (see the analyses of Example 2 for R > R*). According to [7, 15,16,17], the CM algorithm is, at least in most cases, faster than the various EM algorithms.
The CM algorithm can also be used to achieve the maximum mutual information and maximum likelihood of tests and estimations. There are more detailed discussions about the CM algorithm (see Footnote 2).
4 Conclusions
Lu’s semantic information measure can combine Shannon’s information theory with the likelihood method, so that the semantic mutual information is the average log normalized likelihood. By letting the semantic channel and the Shannon channel mutually match and iterate, we can achieve the mixture model with minimum relative entropy. The iterative convergence can be intuitively explained and proved with the R(G) function. Two iterative examples and mathematical analyses show that, at least in most cases, the CM algorithm has higher efficiency and a clearer convergence rationale than the popular EM algorithm.
Notes
- 1.
Excel files demonstrating the iterative processes for tests, estimations, and mixture models can be downloaded from http://survivor99.com/lcg/CM-iteration.zip.
- 2.
References
Dempster, A.P., Laird, N.M., Rubin, D.B.: Maximum likelihood from incomplete data via the EM algorithm. J. Roy. Stat. Soc. B 39, 1–38 (1977)
Kok, M., Dahlin, J.B., Schön, T.B., Wills, A.: Newton-based maximum likelihood estimation in nonlinear state space models. IFAC-PapersOnLine 48, 398–403 (2015)
Lu, C.: A Generalized Information Theory. China Science and Technology University Press, Hefei (1993). (in Chinese)
Lu, C.: Coding meanings of generalized entropy and generalized mutual information. J. China Inst. Commun. 15, 37–44 (1994). (in Chinese)
Lu, C.: A generalization of Shannon’s information theory. Int. J. Gen. Syst. 28, 453–490 (1999)
Shannon, C.E.: A mathematical theory of communication. Bell Syst. Tech. J. 27, 379–429, 623–656 (1948)
Shannon, C.E.: Coding theorems for a discrete source with a fidelity criterion. IRE Nat. Conv. Rec. 4, 142–163 (1959)
Kullback, S., Leibler, R.: On information and sufficiency. Ann. Math. Stat. 22, 79–86 (1951)
Akaike, H.: A new look at the statistical model identification. IEEE Trans. Autom. Control 19, 716–723 (1974)
Zadeh, L.A.: Fuzzy sets. Inf. Control 8, 338–353 (1965)
Thomas, S.F.: Possibilistic uncertainty and statistical inference. ORSA/TIMS Meeting, Houston, Texas (1981)
Zhou, J.: Fundamentals of Information Theory. People’s Posts & Telecom Press, Beijing (1983). (in Chinese)
Byrne, C.L.: The EM algorithm: theory, applications and related methods. https://www.researchgate.net/profile/Charles_Byrne
Wu, C.F.J.: On the convergence properties of the EM algorithm. Ann. Stat. 11, 95–103 (1983)
Neal, R., Hinton, G.: A view of the EM algorithm that justifies incremental, sparse, and other variants. In: Jordan, M.I. (ed.) Learning in Graphical Models, pp. 355–368. MIT Press, Cambridge (1999)
Huang, W.H., Chen, Y.G.: The multiset EM algorithm. Stat. Probab. Lett. 126, 41–48 (2017)
Springer, T., Urban, K.: Comparison of the EM algorithm and alternatives. Numer. Algorithms 67, 335–364 (2014)
Acknowledgment
The author thanks Professor Peizhuang Wang for his long term supports and encouragements.
© 2017 IFIP International Federation for Information Processing
Lu, C. (2017). Channels’ Matching Algorithm for Mixture Models. In: Shi, Z., Goertzel, B., Feng, J. (eds) Intelligence Science I. ICIS 2017. IFIP Advances in Information and Communication Technology, vol 510. Springer, Cham. https://doi.org/10.1007/978-3-319-68121-4_35
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-68120-7
Online ISBN: 978-3-319-68121-4