1 Introduction

To obtain maximum likelihood mixture models, the EM algorithm [1] and the Newton method [2] are often used, and there have been many papers on applying or improving the EM algorithm. Lu proposed the semantic information measure (SIM) and the R(G) function in 1993 [3,4,5]. The R(G) function is an extension of Shannon's rate-distortion function R(D) [6, 7]; R(G) denotes the minimum R for a given SIM G. Using the SIM and the R(G) function, we can obtain a new iterative algorithm, the Channels' Matching algorithm (the CM algorithm). Compared with the EM algorithm, the CM algorithm proposed in this paper is seemingly similar yet essentially different (Footnote 1).

In this study, we use the sampling distribution instead of the sampling sequence. Assume the sampling distribution is P(X) and the distribution predicted by the mixture model is Q(X). The goal is to minimize the relative entropy, or Kullback-Leibler (KL) divergence, H(Q||P) [8, 9]. With the semantic information method, we can prove that H(Q||P) = R(G) − G. Then, by maximizing G and modifying R alternately, we can minimize H(Q||P).

We first introduce the semantic channel, the semantic information measure, and the R(G) function in a way that is as compatible with the likelihood method as possible. Then we discuss how the CM algorithm is applied to mixture models. Finally, we compare the CM algorithm with the EM algorithm to show its advantages.

2 Semantic Channel, Semantic Information Measure, and the R(G) Function

2.1 From the Shannon Channel to the Semantic Channel

First, we introduce the Shannon channel.

Let X be a discrete random variable representing a fact with alphabet A = {x_1, x_2, …, x_m}, and let Y be a discrete random variable representing a message with alphabet B = {y_1, y_2, …, y_n}. A Shannon channel is composed of a group of transition probability functions [6]: P(y_j|X), j = 1, 2, …, n.

In terms of hypothesis-testing, X is a sample point and Y is a hypothesis or a model label. We need a sample sequence or sampling distribution P(X|.) to test a hypothesis to see how accurate it is.

Let Θ be a random variable for a predictive model, and let θ_j be a value taken by Θ when Y = y_j. The semantic meaning of a predicate y_j(X) is defined by θ_j or its (fuzzy) truth function T(θ_j|X) ∈ [0,1]. Because T(θ_j|X) is constructed with some parameters, we may also treat θ_j as a set of model parameters. We can also state that T(θ_j|X) is defined by a normalized likelihood, i.e., T(θ_j|X) = k P(θ_j|X)/P(θ_j) = k P(X|θ_j)/P(X), where k is a coefficient that makes the maximum of T(θ_j|X) equal 1. The set θ_j can also be regarded as a fuzzy set, and T(θ_j|X) can be considered its membership function in the sense proposed by Zadeh [10].

In contrast to the popular likelihood method, the above method uses sub-models θ_1, θ_2, …, θ_n instead of one model θ or Θ. Here P(X|θ_j) is equivalent to P(X|y_j, θ) in the popular likelihood method. A sample used to test y_j is also a sub-sample or a conditional sample. These changes make the new method more flexible and more compatible with Shannon's information theory.

A semantic channel is composed of a group of truth functions or membership functions: T(θ_j|X), j = 1, 2, …, n.

Similar to P(y_j|X), T(θ_j|X) can also be used for Bayesian prediction to produce the likelihood function [4]:

$$ P(X|\theta_{j} ) = P(X)T(\theta_{j} |X)/T(\theta_{j} ),\quad T(\theta_{j} ) = \sum\limits_{i} {P(x_{i} )T(\theta_{j} |x_{i} )} $$
(1)

where T(θ_j) is called the logical probability of y_j. The author now knows that this formula was proposed by Thomas as early as 1981 [11]. We call this prediction the semantic Bayesian prediction. If T(θ_j|X) ∝ P(y_j|X), the semantic Bayesian prediction is equivalent to the Bayesian prediction.
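As a numerical illustration of Eq. (1), the following minimal sketch (in Python, with hypothetical values for the prior P(X) and the truth function T(θ_j|X)) computes the logical probability T(θ_j) and the semantic Bayesian prediction P(X|θ_j); the array names are ours, not the paper's.

```python
import numpy as np

P_X = np.array([0.1, 0.2, 0.4, 0.3])        # prior P(x_i) over a 4-point alphabet (hypothetical)
T_theta_j = np.array([0.0, 0.3, 1.0, 0.6])  # assumed truth function T(theta_j|x_i)

T_j = np.sum(P_X * T_theta_j)               # logical probability T(theta_j)
P_X_given_theta_j = P_X * T_theta_j / T_j   # semantic Bayesian prediction, Eq. (1)

print(T_j, P_X_given_theta_j.sum())         # the prediction sums to 1 by construction
```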

2.2 Semantic Information Measure and the Optimization of the Semantic Channel

The semantic information conveyed by y_j about x_i is defined by the normalized likelihood as [3]:

$$ I(x_{i} ;\theta_{j} ) = \log \frac{{P(x_{i} |\theta_{j} )}}{{P(x_{i} )}}{ = }\log \frac{{T(\theta_{j} |x_{i} )}}{{T(\theta_{j} )}} $$
(2)

where the semantic Bayesian inference is used; it is assumed that the prior likelihood function P(X|Θ) is equal to the prior probability distribution P(X).

After averaging I(x_i; θ_j), we obtain the semantic (or generalized) KL information:

$$ I(X;\theta_{j} ) = \sum\limits_{i} {P(x_{i} |y_{j} )} \log \frac{{P(x_{i} |\theta_{j} )}}{{P(x_{i} )}} = \sum\limits_{i} {P(x_{i} |y_{j} )} \log \frac{{T(\theta_{j} |x_{i} )}}{{T(\theta_{j} )}} $$
(3)

The statistical probability P(x_i|y_j), i = 1, 2, …, on the left of the log above represents a sampling distribution used to test the hypothesis y_j or model θ_j. Assume we choose y_j according to an observed condition Z ∈ C. If y_j = f(Z|Z ∈ C_j), where C_j is a subset of C, then P(X|y_j) = P(X|C_j).

After averaging I(X; θ_j), we obtain the semantic (or generalized) mutual information:

$$ \begin{aligned} I(X;\varTheta ) & = \sum\limits_{j} {P(y_{j} )} \sum\limits_{i} {P(x_{i} |y_{j} )} \log \frac{{P(x_{i} |\theta_{j} )}}{{P(x_{i} )}} \\ & = \sum\limits_{j} {\sum\limits_{i} {P(x_{i} )P(y_{j} |x_{i} )} } \log \frac{{T(\theta_{j} |x_{i} )}}{{T(\theta_{j} )}} = H(X) - H(X|\varTheta ) \\ H(X|\varTheta ) & = - \sum\limits_{j} {\sum\limits_{i} {P(x_{i} ,y_{j} )} } \log P(x_{i} |\theta_{j} ) \\ \end{aligned} $$
(4)

where H(X) is the Shannon entropy of X and H(X|Θ) is the generalized posterior entropy of X. Each of them has a coding meaning [4, 5].
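The quantities in Eqs. (2)–(4) can be evaluated directly for a toy discrete source. In the sketch below, the Shannon channel and the truth functions are hypothetical, and information is measured in bits (base-2 logarithm):

```python
import numpy as np

P_X = np.array([0.25, 0.25, 0.25, 0.25])         # P(x_i)
P_Y_given_X = np.array([[0.9, 0.6, 0.2, 0.1],    # P(y_1|x_i)  (hypothetical Shannon channel)
                        [0.1, 0.4, 0.8, 0.9]])   # P(y_2|x_i)
T = np.array([[1.0, 0.7, 0.2, 0.1],              # T(theta_1|x_i) (hypothetical semantic channel)
              [0.1, 0.3, 0.9, 1.0]])             # T(theta_2|x_i)

P_Y = P_Y_given_X @ P_X                          # P(y_j)
P_X_given_Y = P_Y_given_X * P_X / P_Y[:, None]   # P(x_i|y_j)
T_j = T @ P_X                                    # logical probabilities T(theta_j)

I_X_theta = np.sum(P_X_given_Y * np.log2(T / T_j[:, None]), axis=1)  # Eq. (3), one value per y_j
I_X_Theta = np.sum(P_Y * I_X_theta)                                  # Eq. (4)
print(I_X_theta, I_X_Theta)
```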

Optimizing a semantic channel is equivalent to optimizing a predictive model Θ. For given y_j = f(Z|Z ∈ C_j), optimizing θ_j is equivalent to optimizing T(θ_j|X) by

$$ T^{*}(\theta_{j} |X) = \mathop {\arg }\limits_{{T(\theta_{j} |X)}} \hbox{max} \,I(X;\theta_{j} ) $$
(5)

It is easy to prove that when P(X|θ_j) = P(X|y_j), or

$$ \frac{{T(\theta_{j} |X)}}{{T(\theta_{j} )}} = \frac{{P(y_{j} |X)}}{{P(y_{j} )}},\quad {\text{or}}\quad T(\theta_{j} |X)\, \propto \,P(y_{j} |X) $$
(6)

I(X; θ_j) reaches its maximum. Setting the maximum of T(θ_j|X) to 1, we obtain

$$ T^{*}(\theta_{j} |X) = P(y_{j} |X)/P(y_{j} |x_{j}^{*}) = [P\left( {X|y_{j} } \right)/P(X)]/[P\left( {x_{j}^{*} |y_{j} } \right)/P\left( {x_{j}^{*} } \right)] $$
(7)

In this equation, \( x_{j}^{*} \) makes \( P(x_{j}^{*} |y_{j} )/P(x_{j}^{*} ) \) the maximum of P(X|y_j)/P(X).
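A minimal sketch of Eq. (7), assuming a hypothetical transition probability function P(y_j|X): the optimized truth function is the ratio P(y_j|X)/P(y_j) normalized so that its maximum is 1.

```python
import numpy as np

P_X = np.array([0.25, 0.25, 0.25, 0.25])
P_yj_given_X = np.array([0.9, 0.6, 0.2, 0.1])      # P(y_j|x_i) for one label y_j (hypothetical)

ratio = P_yj_given_X / np.sum(P_X * P_yj_given_X)  # P(y_j|X)/P(y_j) = P(X|y_j)/P(X)
T_star = ratio / ratio.max()                       # Eq. (7): normalize so that max T* = 1
print(T_star)
```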

2.3 Relationship Between Semantic Mutual Information and Likelihood

Assume that the size of the sample used to test y_j is N_j and that the sample points come from independent and identically distributed random variables. Among these points, the number of x_i is N_{ji}. As N_j approaches infinity, P(x_i|y_j) = N_{ji}/N_j. Hence the log normalized likelihood is:

$$ \log \prod\limits_{i} {\left[ {\frac{{P(x_{i} |\theta_{j} )}}{{P(x_{i} )}}} \right]^{{N_{ji} }} \,} { = }\,N_{j} \sum\limits_{i} {P(x_{i} |y_{j} )} \log \frac{{P(x_{i} |\theta_{j} )}}{{P(x_{i} )}}\,{ = }\,N_{j} I(X;\theta_{j} ) $$
(8)

After averaging the above log normalized likelihood over the different y_j, j = 1, 2, …, n, we have the average log normalized likelihood:

$$ \begin{aligned} \sum\limits_{j} {\frac{{N_{j} }}{N}} \log \prod\limits_{i} {\left[ {\frac{{P(x_{i} |\theta_{j} )}}{{P(x_{i} )}}} \right]^{{N_{ji} }} } & { = }\,\sum\limits_{j} {P(y_{j} )} \sum\limits_{i} {P(x_{i} |y_{j} )} \log \frac{{P(x_{i} |\theta_{j} )}}{{P(x_{i} )}} \\ & = \,I(X;\varTheta ){ = }H(X )- H(X|\varTheta ) \\ \end{aligned} $$
(9)

where N = N_1 + N_2 + ··· + N_n. This shows that the maximum likelihood (ML) criterion is equivalent to the minimum generalized posterior entropy criterion and the Maximum Semantic Information (MSI) criterion. When P(X|θ_j) = P(X|y_j) for all j, the semantic mutual information I(X; Θ) is equal to the Shannon mutual information I(X;Y), which is thus a special case of I(X; Θ).
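Equation (8) can be checked numerically. In the sketch below, the sub-sample of size N_j is assumed to have relative frequencies exactly equal to P(x_i|y_j); all distributions are hypothetical.

```python
import numpy as np

N_j = 1000
P_X = np.array([0.1, 0.2, 0.4, 0.3])                   # prior P(x_i)
P_X_given_yj = np.array([0.05, 0.15, 0.50, 0.30])      # sampling distribution P(x_i|y_j)
P_X_given_thetaj = np.array([0.06, 0.14, 0.52, 0.28])  # model prediction P(x_i|theta_j)

N_ji = N_j * P_X_given_yj                              # counts of each x_i in the sub-sample
log_norm_likelihood = np.sum(N_ji * np.log2(P_X_given_thetaj / P_X))
I_X_thetaj = np.sum(P_X_given_yj * np.log2(P_X_given_thetaj / P_X))
print(np.isclose(log_norm_likelihood, N_j * I_X_thetaj))   # True, as in Eq. (8)
```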

2.4 The Matching Function R(G) of R and G

The R(G) function is an extension of the rate-distortion function R(D) [7]. In the R(D) function, R is the information rate and D is the upper limit of the distortion. The R(D) function gives, for a given D, the minimum R = R(D) of the Shannon mutual information I(X;Y).

Let the distortion function d_ij be replaced with I_ij = I(x_i; y_j) = log[T(θ_j|x_i)/T(θ_j)] = log[P(x_i|θ_j)/P(x_i)], and let G be the lower limit of the semantic mutual information I(X; Θ). The information rate for given G and P(X) is defined as

$$ R(G) = \mathop {\hbox{min} }\limits_{{P(Y|X):I(X;\varTheta ) \ge G}} I(X;Y) $$
(10)

Following the derivation of R(D) [12], we can obtain [3]

$$ \begin{aligned} G(s) & = \sum\limits_{i} {\sum\limits_{j} {I_{ij} P(x_{i} )P(y_{j} )2^{{sI_{ij} }} /\lambda_{i} } } = \sum\limits_{i} {\sum\limits_{j} {I_{ij} P(x_{i} )P(y_{j} )m_{ij}^{s} /\lambda_{i} } } \\ R(s) & = sG(s) - \sum\limits_{i} {P(x_{i} )\log \lambda_{i} } \\ \end{aligned} $$
(11)

where m_ij = T(θ_j|x_i)/T(θ_j) = P(x_i|θ_j)/P(x_i) is the normalized likelihood and \( \lambda_{i} = \sum_{j} P(y_{j} )m_{ij}^{s} \). We may also use m_ij = P(x_i|θ_j), which results in the same \( m_{ij}^{s} /\lambda_{i} \). The shape of an R(G) function is a bowl-like curve, as shown in Fig. 1.

Fig. 1. The R(G) function of a binary source. When the slope s = 1, G = R, and the information efficiency G/R reaches its maximum of 1.
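The parametric solution (11) can be traced numerically. The following sketch assumes a hypothetical binary source and truth functions and prints the curve (G(s), R(s)) for several values of s:

```python
import numpy as np

P_X = np.array([0.5, 0.5])              # binary source (hypothetical)
P_Y = np.array([0.5, 0.5])
T = np.array([[1.0, 0.2],               # T(theta_j|x_i) (hypothetical truth functions)
              [0.2, 1.0]])
T_j = T @ P_X
I_ij = np.log2(T.T / T_j)               # I_ij = log T(theta_j|x_i)/T(theta_j), indexed (i, j)
m_ij = 2.0 ** I_ij                      # normalized likelihood m_ij

for s in np.linspace(-2, 2, 9):
    lam = m_ij ** s @ P_Y                                              # lambda_i
    G = np.sum(P_X[:, None] * P_Y * I_ij * m_ij ** s / lam[:, None])  # G(s), Eq. (11)
    R = s * G - np.sum(P_X * np.log2(lam))                            # R(s), Eq. (11)
    print(f"s={s:+.1f}  G={G:.3f}  R={R:.3f}")
```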

The R(G) function differs from the R(D) function. For a given R, there is a maximum value G+ and a minimum value G−; the latter is negative and means that to bring a certain information loss |G−| to enemies, we also need a certain amount of objective information R.

In rate-distortion theory, dR/dD = s (s ≤ 0). It is easy to prove that dR/dG = s also holds, where s may be less than or greater than 0. Increasing s raises the model's prediction precision. If s changes from a positive s_1 to −s_1, then R(−s_1) = R(s_1) and G changes from G+ to G− (see Fig. 1).

When s = 1, λ_i = 1 and R = G, which means that the semantic channel matches the Shannon channel and the semantic mutual information is equal to the Shannon mutual information. When s = 0, R = 0 and G < 0. In Fig. 1, c = G(s = 0).

3 The CM Algorithm for Mixture Models

3.1 Explaining the Iterative Process by the R(G) Function

Assume a sampling distribution P(X) is produced by conditional probabilities P*(X|Y) of some functional form, such as Gaussian distributions. We only know the number of mixture components n, without knowing P(Y). We need to solve for P(Y) and the model (or parameters) Θ so that the predicted probability distribution of X, denoted by Q(X), is as close to the sampling distribution P(X) as possible, i.e., the relative entropy or Kullback-Leibler divergence H(Q||P) is as small as possible. Figure 2 shows the convergent processes of two examples.

Fig. 2. Illustrating the CM algorithm for mixture models. There are two iterative examples: one for R > R* and another for R < R*. Left-step a and Left-step b make R close to R*, whereas the Right-step increases G so that (G, R) approaches the line R = G.

We use P*(Y) and P*(X|Y) to denote the P(Y) and P(X|Y) that are used to produce the sampling distribution P(X), and P*(Y|X) and R* = I*(X;Y) to denote the corresponding Shannon channel and Shannon mutual information. When Q(X) = P(X), we should have P(X|Θ) = P*(X|Y) and G* = R*.

For mixture models, when we let the Shannon channel match the semantic channel (in the Left-steps), we do not maximize I(X;Θ); instead, we seek a P(X|Θ) that accords with P*(X|Y) as closely as possible (Left-step a in Fig. 2 serves this purpose) and a P(Y) that accords with P*(Y) as closely as possible (Left-step b in Fig. 2 serves this purpose). That means we seek an R that is as close to R* as possible; meanwhile, I(X;Θ) may decrease. In contrast, in popular EM algorithms the objective function, such as P(X^N, Y|Θ), is required to keep increasing, without decreasing, in both steps.

With the CM algorithm, only after the optimal model is obtained, if we need to choose Y according to X (for decision or classification), do we seek the Shannon channel P(Y|X) that conveys the maximum mutual information (MMI) R_max(G_max) (see Left-step c in Fig. 2).

Assume that P(X) is produced by P*(X|Y) with Gaussian distributions. Then the likelihood functions are

$$ P\left( {X|\theta_{j} } \right) = k_{j} \exp \left[ { - \left( {X - c_{j} } \right)^{2} /\left( {2d_{j} } \right)^{2} } \right],\,j = 1,2, \ldots ,n $$

If n = 2, the parameters are c_1, c_2, d_1, and d_2. At the beginning of the iteration, we may set P(Y) = 1/n. We begin iterating from Left-step a.
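To make the following steps concrete, here is a minimal setup sketch. The alphabet of X is discretized on a grid; the "real" parameters that produce P(X) and the guessed starting parameters are hypothetical, and the Gaussian form uses the standard 2d_j² denominator with k_j absorbed by normalization over the grid. The names introduced here (x, gauss, P_X, P_Y, params, P_X_given_theta) are reused in the later sketches.

```python
import numpy as np

x = np.linspace(0, 100, 101)                     # discretized alphabet of X (hypothetical grid)

def gauss(c, d):
    """Gaussian-shaped likelihood P(X|theta_j), normalized over the grid."""
    p = np.exp(-(x - c) ** 2 / (2 * d ** 2))
    return p / p.sum()

# Hypothetical "real" model, used only to produce the sampling distribution P(X)
P_star_Y = np.array([0.7, 0.3])
P_X = P_star_Y[0] * gauss(35.0, 8.0) + P_star_Y[1] * gauss(65.0, 8.0)

# Guessed starting model: P(Y) = 1/n and guessed component parameters (c_j, d_j)
n = 2
P_Y = np.full(n, 1.0 / n)
params = [(30.0, 10.0), (70.0, 10.0)]
P_X_given_theta = np.array([gauss(c, d) for c, d in params])
```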

Left-step a:

Construct the Shannon channel by

$$ \begin{array}{*{20}l} {P(y_{j} |X) = P(y_{j} )P(X|\theta_{j} )/Q(X)} \hfill \\ {Q(X) = \sum\limits_{j} {P(y_{j} )P(X|\theta_{j} )} } \hfill \\ \end{array} \begin{array}{*{20}c} {,\quad j = 1,2, \ldots ,n} \\ \end{array} $$
(12)

This formula has already been used in the EM algorithm [1]. It was also used in the derivation of the R(D) function [12]. The semantic mutual information is then

$$ G = I(X;\varTheta ) = \sum\limits_{i} {\sum\limits_{j} {P(x_{i} )} \frac{{P(x_{i} |\theta_{j} )}}{{Q(x_{i} )}}P(y_{j} )\log \frac{{P(x_{i} |\theta_{j} )}}{{P(x_{i} )}}} $$
(13)
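A sketch of Left-step a under the setup above (P_X, P_Y, and P_X_given_theta are the hypothetical arrays defined there): it builds the Shannon channel of Eq. (12) and evaluates G = I(X;Θ) by Eq. (13), in bits.

```python
import numpy as np

def left_step_a(P_X, P_Y, P_X_given_theta):
    """Eqs. (12)-(13): build the Shannon channel P(Y|X) and compute G = I(X;Theta)."""
    Q_X = P_Y @ P_X_given_theta                          # Q(x_i) = sum_j P(y_j) P(x_i|theta_j)
    P_Y_given_X = P_Y[:, None] * P_X_given_theta / Q_X   # P(y_j|x_i), Eq. (12)
    G = np.sum(P_X * P_Y_given_X * np.log2(P_X_given_theta / P_X))   # Eq. (13)
    return Q_X, P_Y_given_X, G

Q_X, P_Y_given_X, G = left_step_a(P_X, P_Y, P_X_given_theta)
```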

Left-step b:

Use the following equation repeatedly to obtain a new P(Y) until the iteration converges.

$$ P(y_{j} ) \Leftarrow \sum\limits_{i} {P(x_{i} )} P(y_{j} |x_{i} ) = \sum\limits_{i} {P(x_{i} )} \frac{{P(x_{i} |\theta_{j} )}}{{\sum\limits_{k} {P(y_{k} )P(x_{i} |\theta_{k} )} }}P(y_{j} ),\,j = 1,2, \ldots ,n $$
(14)

The convergent P(Y) is denoted by P^{+1}(Y). This iteration is needed because P(Y|X) from Eq. (12) is an incompetent Shannon channel, so that ∑_i P(x_i)P(y_j|x_i) ≠ P(y_j). The above iteration makes P^{+1}(Y) match P(X) and P(X|Θ) better. This iteration has been used by other authors, such as in [13].

When n = 2, we should avoid choosing c_1 and c_2 such that both are larger or both are smaller than the mean of X; otherwise P(y_1) or P(y_2) will become 0 and cannot become larger than 0 later.

If H(Q||P) is less than a small threshold, such as 0.001 bit, end the iteration; otherwise go to the Right-step.
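A sketch of Left-step b and the stopping test, continuing the names from the earlier sketches. Equation (14) is repeated until P(Y) reaches a fixed point P^{+1}(Y); the relative entropy H(Q||P) is computed as ∑_i P(x_i) log[P(x_i)/Q(x_i)], the form under which the identity H(Q||P) = R_Q − G of Eq. (17) holds exactly.

```python
import numpy as np

def left_step_b(P_X, P_Y, P_X_given_theta, tol=1e-8, max_iter=1000):
    """Repeat Eq. (14) until P(Y) converges; return the fixed point P^{+1}(Y)."""
    for _ in range(max_iter):
        Q_X = P_Y @ P_X_given_theta                  # Q(x_i) = sum_k P(y_k) P(x_i|theta_k)
        P_Y_new = np.array([np.sum(P_X * P_Y[j] * P_X_given_theta[j] / Q_X)
                            for j in range(len(P_Y))])
        if np.max(np.abs(P_Y_new - P_Y)) < tol:
            return P_Y_new
        P_Y = P_Y_new
    return P_Y

def H_Q_P(P_X, Q_X):
    """The paper's H(Q||P), computed as sum_i P(x_i) log2[P(x_i)/Q(x_i)] (in bits)."""
    return np.sum(P_X * np.log2(P_X / Q_X))

P_Y = left_step_b(P_X, P_Y, P_X_given_theta)         # P^{+1}(Y)
if H_Q_P(P_X, P_Y @ P_X_given_theta) < 0.001:        # assumed threshold of 0.001 bit
    print("converged")
```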

Right-step:

Optimize the parameters in the likelihood function P(X|Θ) on the right of the log in Eq. (13) to maximize I(X;Θ). Then go to Left-step a.
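A sketch of the Right-step, continuing the earlier names. With the Shannon channel P(Y|X) from Left-step a held fixed, the component parameters (c_j, d_j) are adjusted to increase G; the crude coordinate search below is only for illustration, and any standard numerical optimizer could replace it.

```python
import numpy as np

def right_step(P_X, P_Y_given_X, params, step=0.5):
    """Greedy coordinate search over (c_j, d_j) that increases G = I(X;Theta)."""
    def G_of(ps):
        P_X_given_theta = np.array([gauss(c, d) for c, d in ps])  # gauss() from the setup sketch
        return np.sum(P_X * P_Y_given_X * np.log2(P_X_given_theta / P_X))
    best, best_G = list(params), G_of(params)
    improved = True
    while improved:
        improved = False
        for j in range(len(best)):
            for dc, dd in [(step, 0), (-step, 0), (0, step), (0, -step)]:
                cand = list(best)
                cand[j] = (best[j][0] + dc, max(best[j][1] + dd, 1e-3))
                cand_G = G_of(cand)
                if cand_G > best_G:
                    best, best_G, improved = cand, cand_G, True
    return best

params = right_step(P_X, P_Y_given_X, params)
P_X_given_theta = np.array([gauss(c, d) for c, d in params])   # then go back to Left-step a
```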

3.2 Using Two Examples to Show the Iterative Processes

3.2.1 Example 1 for R < R*

Table 1 lists the real parameters that produce the sample distribution P(X) and the guessed parameters that are used to produce Q(X). The convergence process from the starting (G, R) to (G*, R*) is shown by the iterative locus for R < R* in Fig. 2. The convergence speed and the changes of R and G are shown in Fig. 3. The iterative results are shown in Table 1.

Table 1. Real and guessed model parameters and iterative results of Example 1 (R < R*)

Fig. 3. The iterative process for R < R*. Rq denotes R_Q in Eq. (15). H(Q||P) = R_Q − G decreases in all steps. G is monotonically increasing. R is also monotonically increasing except in the first Left-step b. G and R gradually approach G* = R*, so that H(Q||P) = R_Q − G approaches 0.

Analyses:

In this iterative process, we always have R < R* and G < G*. After each step, R and G increase a little, so that G gradually approaches G*. This process seems to tell us that each of the Right-step, Left-step a, and Left-step b can increase G, and hence that maximizing G can minimize H(Q||P), which is our goal. Yet this is wrong. Left-step a and Left-step b do not necessarily increase G; there are many counterexamples. Fortunately, the iterations for these counterexamples still converge. Let us consider Example 2 as a counterexample.

3.2.2 Example 2 for R > R*

Table 2 shows the parameters and iterative results for R > R*. The iterative process is shown in Fig. 4.

Table 2. Real and guessed model parameters and iterative results of Example 2 (R > R*)

Fig. 4. The iterative process for R > R*. Rq denotes R_Q in Eq. (15). H(Q||P) = R_Q − G decreases in all steps. R is monotonically decreasing. G increases more or less in all Right-steps and decreases in all Left-steps. G and R gradually approach G* = R*, so that H(Q||P) = R_Q − G approaches 0.

Analyses:

G is neither monotonically increasing nor monotonically decreasing: it increases in all Right-steps and decreases in all Left-steps. This example is a challenge to all authors who prove that the standard EM algorithm or a variant of it converges. If G is not monotonically increasing, it must be difficult or impossible to prove that log P(X^N, Y|Θ) or another likelihood is monotonically increasing or non-decreasing in all steps. For example, in Example 2, Q* = −N H*(X, Y) = −6.031 N. After the first optimization of the parameters, Q = −6.011 N > Q*. If we continue to maximize Q, Q cannot approach the smaller Q*.

We also used some other true models P*(X|Y) and P*(Y) to test the CM algorithm. In most cases, the number of iterations is close to 5. In rare cases where R and G are much bigger than G*, such as R ≈ G > 2G*, the iterative convergence is slow. In these cases, where log P(X^N, Y|Θ) is also much bigger than log P*(X^N, Y), the EM algorithm confronts a similar problem. Because of these cases, the convergence proof of the EM algorithm is challenged.

3.3 The Convergence Proof of the CM Algorithm

Proof.

To prove that the CM algorithm converges, we need to prove that H(Q||P) is decreasing or non-increasing in every step.

Consider the Right-step.

Assume that the Shannon mutual information conveyed by Y about Q(X) is R_Q, and that conveyed about P(X) is R. Then we have

$$ R_{Q} = I_{Q} (X;Y) = \sum\limits_{i} {\sum\limits_{j} {P(x_{i} )} \frac{{P(x_{i} |\theta_{j} )}}{{Q(x_{i} )}}P(y_{j} )\log \frac{{P(x_{i} |\theta_{j} )}}{{Q(x_{i} )}}} $$
(15)
$$ \begin{aligned} & R = I(X;Y) = \sum\limits_{i} {\sum\limits_{j} {P(x_{i} )} \frac{{P(x_{i} |\theta_{j} )}}{{Q(x_{i} )}}P(y_{j} )\log \frac{{P(y_{j} |x_{i} )}}{{P^{ + 1} (y_{j} )}}} \\ & \quad = R_{Q} - H(Y||Y^{ + 1} ) \\ & H(Y||Y^{ + 1} ) = \sum\limits_{j} {P^{{{ + }1}} (y_{j} )\log [P^{{{ + }1}} (y_{j} )/P(y_{j} )} ] \\ \end{aligned} $$
(16)

According to Eqs. (13) and (15), we have

$$ H(Q||P) = R_{Q} - G = R + H(Y||Y^{ + 1} ) - G $$
(17)

Because of this equation, we do not need Jensen's inequality, which the EM algorithm needs.

In the Right-step, the Shannon channel and R_Q do not change, and G is maximized. Therefore H(Q||P) decreases, and its decrement is equal to the increment of G.
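The identity in Eq. (17) can also be checked numerically with the earlier sketches (left_step_a and H_Q_P are the functions defined above): computing R_Q by Eq. (15) and G by Eq. (13) for any current model reproduces H(Q||P) = R_Q − G up to floating-point error.

```python
import numpy as np

def R_Q_of(P_X, P_Y, P_X_given_theta):
    """Eq. (15): Shannon mutual information conveyed by Y about Q(X)."""
    Q_X = P_Y @ P_X_given_theta
    P_Y_given_X = P_Y[:, None] * P_X_given_theta / Q_X
    return np.sum(P_X * P_Y_given_X * np.log2(P_X_given_theta / Q_X))

Q_X, P_Y_given_X, G = left_step_a(P_X, P_Y, P_X_given_theta)
print(np.isclose(R_Q_of(P_X, P_Y, P_X_given_theta) - G,
                 H_Q_P(P_X, Q_X)))                   # True: H(Q||P) = R_Q - G, Eq. (17)
```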

Consider Left-step a.

After this step, Q(X) becomes Q^{+1}(X) = ∑_j P(y_j) P^{+1}(X|θ_j). Since Q^{+1}(X) is produced by a better likelihood function and the same P(Y), Q^{+1}(X) should be closer to P(X) than Q(X), i.e., H(Q^{+1}||P) < H(Q||P). (A more rigorous mathematical proof of this conclusion is still needed.)

Consider Left-step b.

The iteration for P^{+1}(Y) moves (G, R) onto the R(G) curve determined by P(X) and P(X|θ_j) (for all j), which form a semantic channel. This conclusion can be obtained from the derivation processes of the R(D) function [12] and the R(G) function [3]; a similar iteration is used for P(Y|X) and P(Y) in deriving the R(D) function. Because R(G) is the minimum R for a given G, H(Q||P) = R_Q − G = R − G becomes smaller.

Because H(Q||P) becomes smaller after every step, the iteration converges. Q.E.D.

3.4 The Decision Function with the ML Criterion

After we obtain the optimized P(X|Θ), we need to select Y (to make a decision or classification) according to X. The parameter s in the R(G) function (see Eq. (11)) reminds us that we may use the following Shannon channel

$$ \begin{array}{*{20}l} {P(y_{j} |X) = P(y_{j} )[P(X|\theta_{j} )]^{s} /Q(X)} \hfill \\ {Q(X) = \sum\limits_{j} {P(y_{j} )[P} (X|\theta_{j} )]^{s} } \hfill \\ \end{array} \begin{array}{*{20}c} {,j = 1,2, \ldots ,n} \\ \end{array} $$
(18)

which gives fuzzy decision functions. When s → +∞, the fuzzy decision becomes a crisp decision. Unlike maximum a posteriori (MAP) estimation, the above decision function still adheres to the ML criterion or MSI criterion. Left-step c in Fig. 2 shows that (G, R) moves to (G_max, R_max) as s increases.
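A sketch of the decision functions in Eq. (18), reusing P(Y) and the component likelihoods from the earlier sketches (after convergence, the optimized values would be used). Increasing s sharpens P(Y|X) toward a crisp arg-max decision.

```python
import numpy as np

def decision_channel(P_Y, P_X_given_theta, s=1.0):
    """Eq. (18): fuzzy decision functions P(y_j|X) with exponent s."""
    powered = P_X_given_theta ** s
    Q_X = P_Y @ powered
    return P_Y[:, None] * powered / Q_X

# As s grows, each column of P(Y|X) approaches a one-hot (crisp) decision.
for s in (1, 4, 16):
    print(s, decision_channel(P_Y, P_X_given_theta, s)[:, 50])  # weights at the middle grid point
```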

3.5 Comparing the CM Algorithm and the EM Algorithm

In the EM algorithm [1, 14], the likelihood of a mixture model is expressed as log P(X^N) ≥ L = Q − H. If we move P(Y) or P(Y|Θ) from Q into H, then Q becomes −N H(X|Θ) and H becomes −N R_Q. If we add N H(X) to both sides of the inequality, we obtain H(Q||P) ≤ R_Q − G, which is similar to Eq. (17). It is easy to prove

$$ Q = NG - NH(X) - NH(Y) $$
(19)

where H(Y) = −∑_j P^{+1}(y_j) log P(y_j) is a generalized entropy. We may think of the M-step as merging Left-step b and the Right-step of the CM algorithm into one step. In brief,

$$ \begin{array}{*{20}c} {{\text{The E-step of EM}} = {\text{the Left-step a of CM}}} \\ {{\text{The M-step of EM}} \approx {\text{the Left-step b}} + {\text{the Right-step of CM}}} \\ \end{array} $$

In the EM algorithm, if we first optimize P(Y) (not for a maximum Q) and then optimize P(X|Y, Θ), the M-step will be equivalent to the corresponding steps of the CM algorithm.

There are also other improved EM algorithms [13, 15,16,17] with some advantages. However, none of these algorithms ensures that R converges to R* and R − G converges to 0, as the CM algorithm does.

The reason for the convergence of the CM algorithm is seemingly clearer than that of the EM algorithm (see the analyses of Example 2 for R > R*). According to [7, 15,16,17], the CM algorithm is faster, at least in most cases, than the various EM algorithms.

The CM algorithm can also be used to achieve maximum mutual information and maximum likelihood for tests and estimations. There are more detailed discussions about the CM algorithm (Footnote 2).

4 Conclusions

Lu’s semantic information measure can combine Shannon information theory with the likelihood method so that the semantic mutual information is the average log normalized likelihood. By letting the semantic channel and the Shannon channel match each other and iterating, we can obtain the mixture model with minimum relative entropy. The iterative convergence can be intuitively explained and proved with the R(G) function. Two iterative examples and mathematical analyses show that the CM algorithm has higher efficiency, at least in most cases, and clearer convergence reasons than the popular EM algorithm.