A maximum-likelihood and moment-matching density estimator for crowd-sourcing label prediction

Kim, Minyoung

doi:10.1007/s10489-017-0985-1

Published: 13 July 2017

Volume 48, pages 381–389, (2018)

Applied Intelligence

Abstract

We address the parameter estimation problem for probability density models with latent variables. The expectation-maximization (EM) algorithm has traditionally been the standard tool for this problem. However, it suffers from bad local maxima, and the quality of the estimator is sensitive to the initial model choice. Recently, an alternative density estimator has been proposed that matches the model-averaged moments to the sample-averaged moments. This moment-matching estimator is typically used as the initial iterate for the EM algorithm for further refinement. However, there is no guarantee that the EM-refined estimator still yields moments close to the sample-averaged ones. Motivated by this issue, we propose a novel estimator that combines the merits of both approaches: we maximize the likelihood, but use the moment discrepancy score as a regularizer that prevents the model-averaged moments from straying away from those estimated from data. On several crowd-sourcing label prediction problems, we demonstrate that the proposed approach yields more accurate density estimates than the existing estimators.

Notes

Here h is marginalized out, namely $P_{\theta }(x) = {\sum }_{h} P_{\theta }(x,h)$.
As the feature vector ϕ(z), we take the one used by the MM estimator, namely ϕ(z) comprising $x_{1}$, $x_{1} \otimes x_{2}$, $x_{1} \otimes x_{3}$, and $x_{1} \otimes x_{2} \otimes x_{j}$ for $j=3,\dots ,m$, where $x_{j}$ is the K-dimensional one-hot vector for $z_{j}$. This feature representation has been shown to identify the model parameters via inverse mapping [3].
We report results up to m = 25 since m larger than 25 resulted in almost perfect prediction for most estimators (e.g., error rates below 1%).
http://www.daviddlewis.com/resources/testcollections/reuters21578/.
http://www.cs.cmu.edu/afs/cs.cmu.edu/project/theo-20/www/data/.

References

1. Anandkumar A, Foster DP, Hsu D, Kakade SM, Liu YK (2015) A spectral algorithm for latent Dirichlet allocation. Algorithmica 72(1):193–214
2. Anandkumar A, Ge R, Hsu D, Kakade SM, Telgarsky M (2014) Tensor decompositions for learning latent variable models. J Mach Learn Res 15:2773–2832
3. Anandkumar A, Hsu D, Kakade SM (2012) A method of moments for mixture models and hidden Markov models. In: 25th annual conference on learning theory
4. Belkin M, Sinha K (2015) Polynomial learning of distribution families. SIAM J Comput 44(4):889–911
5. Bishop C (2007) Pattern recognition and machine learning. Springer, Berlin
6. Dalvi N, Dasgupta A, Kumar R, Rastogi V (2013) Aggregating crowdsourced binary ratings. In: Proceedings of world wide web conference
7. Dawid AP, Skene AM (1979) Maximum likelihood estimation of observer error-rates using the EM algorithm. J R Stat Soc Ser C 20–28
8. Debole F, Sebastiani F (2003) Supervised term weighting for automated text categorization. In: Proceedings of the ACM symposium on Applied computing
9. Dempster AP, Laird NM, Rubin DB (1977) Maximum likelihood from incomplete data via the EM algorithm. J R Stat Soc Ser B 39(1):1–38
10. Deng J, Dong W, Socher R, Li LJ, Li K, Fei-Fei L (2009) Imagenet: a large-scale hierarchical image database. In: IEEE international conference on computer vision and pattern recognition
11. Deng ZH, Tang SW, Yang DQ, Li MZLY, Xie KQ (2004) A comparative study on feature weight in text categorization. Advanced Web Technologies and Applications. Lect Notes Comput Sci 3007:588–597
12. Diamond S, Boyd S (2016) Cvxpy: a python-embedded modeling language for convex optimization. J Mach Learn Res 17(83):1–5
13. Ghosh A, Kale S, McAfee P (2011) Who moderates the moderators? Crowdsourcing abuse detection in user-generated content. In: Proceedings of the ACM conference on electronic commerce
14. Hsu D, Kakade SM (2013) Learning mixtures of spherical Gaussians: moment methods and spectral decompositions. In: Proceedings of the 4th conference on innovations in theoretical computer science
15. Liu Q, Peng J, Ihler AT (2012) Variational inference for crowdsourcing. In: Advances in neural information processing systems
16. Lofberg J (2004) YALMIP: a toolbox for modeling and optimization in MATLAB. In: Proceedings of the IEEE international symposium on computed aided control systems design
17. Moitra A, Valiant G (2010) Settling the polynomial learnability of mixtures of Gaussians. In: 51st annual IEEE symposium on foundations of computer science
18. Raghunathan A, Frostig R, Duchi J, Liang P (2016) Estimation from indirect supervision with linear moments. In: International conference on machine learning
19. Raykar VC, Yu S, Zhao LH, Valadez GH, Florin C, Bogoni L, Moy L (2010) Learning from crowds. J Mach Learn Res 11:1297–1322
20. Sarkar P, Siddiqi SM, Gordon GJ (2007) A latent space approach to dynamic embedding of co-occurrence data. In: Proceedings of the 11th international conference on artificial intelligence and statistics
21. Snow R, O’Connor B, Jurafsky D, Ng AY (2008) Cheap and fast? But is it good?: Evaluating non-expert annotations for natural language tasks. In: Proceedings of the conference on empirical methods in natural language processing
22. Sorensen DC (1982) Newton’s method with a model trust region modification. SIAM J Numer Anal 19(2):409–426
23. Wang Y, Xie B, Song L (2016) Isotonic Hawkes processes. In: International conference on machine learning
24. Xiang Yuan Y (2015) Recent advances in trust region algorithms. Math Program 151(1):249–281
25. Zhang Y, Chen X, Zhou D, Jordan MI (2014) Spectral methods meet EM: a provably optimal algorithm for crowdsourcing. In: Advances in neural information processing systems
26. Zhou D, Liu Q, Platt JC, Meek C (2014) Aggregating ordinal labels from crowds by minimax conditional entropy. In: International conference on machine learning
27. Zhou D, Platt JC, Basu S, Mao Y (2012) Learning from the wisdom of crowds by minimax entropy. In: Advances in neural information processing systems

Acknowledgements

This work is supported by the National Research Foundation of Korea (NRF-2016R1A1A1A05921948).

Author information

Authors and Affiliations

Department of Electronics & IT Media Engineering, Seoul National University of Science & Technology, Seoul, 139-743, Korea
Minyoung Kim

Corresponding author

Correspondence to Minyoung Kim.

Ethics declarations

Conflict of interests

The authors have no conflict of interest.

Consent for Publication

This research does not involve human participants nor animals. Consent to submit this manuscript has been received tacitly from the authors’ institution, Seoul National University of Science & Technology.

Appendix: Moment matching estimation for Dawid-Skene models

This appendix provides the detailed derivation for the moment matching estimation based on the theorems in [3, 25].

We assume that the true model parameters satisfy: i) $w_{y} > 0$ for all $y=1,\dots ,K$, and ii) rank($\mu_{j}$) = K (i.e., full rank) for all $j=1,\dots ,m$. As described in Section 2, we consider moments of three types of features: one-hot vectors, their pairwise products, and their triple (tensor) products. For convenience of exposition, we fix the index set {1, 2, 3}, although it can be replaced with any subset of cardinality three from $\{1,\dots ,m\}$. That is, the moments are defined as: $M_{1}:=\mathbb {E}[x_{1}]$, $M_{12}:= \mathbb {E}[x_{1} \otimes x_{2}]$, and $M_{123}:= \mathbb {E}[x_{1} \otimes x_{2} \otimes x_{3}]$, where all expectations are with respect to $P_{\theta}(\cdot)$ (dropped from the notation for simplicity). They are a (K × 1) vector, a (K × K) matrix, and a (K × K × K) tensor, respectively. Writing $(\mu_{j})_{y} := \mathbb{E}[x_{j}|y]$ for the y-th column of $\mu_{j}$, we have the following analytic formulas for the moments:

$$\begin{array}{@{}rcl@{}} M_{1} &=& \mathbb{E}_{P(y)}[\mathbb{E}[x_{1}|y]] = \mathbb{E}_{P(y)}[(\mu_{1})_{y}]\\ &=& \sum\limits_{y=1}^{K} w_{y} \cdot (\mu_{1})_{y} = \mu_{1} \cdot w. \end{array} $$

(21)

$$\begin{array}{@{}rcl@{}} M_{12} &=& \mathbb{E}_{P(y)}[\mathbb{E}[x_{1} \otimes x_{2}|y]]\\ &=& \mathbb{E}_{P(y)}[\mathbb{E}[x_{1}|y] \otimes \mathbb{E}[x_{2}|y]] \end{array} $$

(22)

$$\begin{array}{@{}rcl@{}} &=& \mathbb{E}_{P(y)}[(\mu_{1})_{y} \otimes (\mu_{2})_{y}]\\ &=& \sum\limits_{y=1}^{K} w_{y} \cdot (\mu_{1})_{y} \otimes (\mu_{2})_{y} \end{array} $$

(23)

$$\begin{array}{@{}rcl@{}} &=& \mu_{1} \cdot \text{diag}(w) \cdot \mu_{2}^{\top}, \end{array} $$

(24)

where diag(w) is the (K × K) diagonal matrix with entries of w in the diagonal.
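The closed forms for the first- and second-order moments can be checked numerically. The following is a minimal sketch, assuming that $(\mu_{j})_{y}$ denotes the y-th column of $\mu_{j}$ (the convention under which (24) holds); the helper `random_confusion`, the value K = 3, and the random parameters are hypothetical choices made only for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
K = 3  # number of label classes (illustrative choice)

def random_confusion(rng, K):
    """Hypothetical helper: column-stochastic K x K matrix, column y = E[x_j | y]."""
    R = rng.random((K, K))
    R /= R.sum(axis=0, keepdims=True)
    # Mixing with the identity keeps columns summing to 1 and makes the
    # matrix diagonally dominant, hence full rank.
    return 0.6 * np.eye(K) + 0.4 * R

w = np.array([0.5, 0.3, 0.2])  # prior label probabilities, all w_y > 0
mu1 = random_confusion(rng, K)
mu2 = random_confusion(rng, K)

# Explicit sums over the hidden label y, as in (21) and (23).
M1_sum = sum(w[y] * mu1[:, y] for y in range(K))
M12_sum = sum(w[y] * np.outer(mu1[:, y], mu2[:, y]) for y in range(K))

# Matrix forms: M1 = mu1 w and M12 = mu1 diag(w) mu2^T, as in (24).
M1 = mu1 @ w
M12 = mu1 @ np.diag(w) @ mu2.T

assert np.allclose(M1, M1_sum)
assert np.allclose(M12, M12_sum)
```

Because each column of a confusion matrix is a probability vector, both `M1` and `M12` sum to one, which is a quick sanity check on empirical moments as well.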

$$\begin{array}{@{}rcl@{}} M_{123} &=& \mathbb{E}_{P(y)}[\mathbb{E}[x_{1} \otimes x_{2} \otimes x_{3}|y]]\\ &=& \mathbb{E}_{P(y)}[\mathbb{E}[x_{1}|y] \otimes \mathbb{E}[x_{2}|y] \otimes \mathbb{E}[x_{3}|y]] \end{array} $$

(25)

$$\begin{array}{@{}rcl@{}} &=& \mathbb{E}_{P(y)}[(\mu_{1})_{y} \otimes (\mu_{2})_{y} \otimes (\mu_{3})_{y}] \end{array} $$

(26)

$$\begin{array}{@{}rcl@{}} &=& {\sum}_{y=1}^{K} w_{y} \cdot (\mu_{1})_{y} \otimes (\mu_{2})_{y} \otimes (\mu_{3})_{y}. \end{array} $$

(27)

The second equalities in (22) and (25) use the conditional independence assumed in the model (i.e., $P(z|y) = {\prod }_{j=1}^{m} P(z_{j}|y)$).

For the (K × K × K) tensor $M_{123}$, we often use its (K × K) projection along the third mode onto an arbitrary vector $\eta \in \mathbb {R}^{K}$, denoted as follows:

$$\begin{array}{@{}rcl@{}} M_{123}(\eta) &:=& \mathbb{E}[(x_{1} \otimes x_{2}) \cdot (x_{3}^{\top} \eta)] \end{array} $$

(28)

$$\begin{array}{@{}rcl@{}} &=& \mu_{1} \cdot (\text{diag}(w) \cdot \text{diag}(\mu_{3}^{\top} \cdot \eta)) \cdot \mu_{2}^{\top}. \end{array} $$

(29)
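The third-order moment (27) and its projection (28)–(29) map directly onto tensor contractions. The sketch below is illustrative only: it assumes $(\mu_{j})_{y}$ is the y-th column of $\mu_{j}$, and the helper `random_confusion` and all parameter values are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(1)
K = 3  # number of label classes (illustrative choice)

def random_confusion(rng, K):
    """Hypothetical helper: column-stochastic, full-rank K x K matrix."""
    R = rng.random((K, K))
    R /= R.sum(axis=0, keepdims=True)
    return 0.6 * np.eye(K) + 0.4 * R

w = np.array([0.5, 0.3, 0.2])
mu1, mu2, mu3 = (random_confusion(rng, K) for _ in range(3))

# (27): M123[a, b, c] = sum_y w[y] * mu1[a, y] * mu2[b, y] * mu3[c, y]
M123 = np.einsum("y,ay,by,cy->abc", w, mu1, mu2, mu3)

eta = rng.standard_normal(K)

# (28): contract the third mode of the tensor with eta.
M123_eta = np.einsum("abc,c->ab", M123, eta)

# (29): mu1 diag(w) diag(mu3^T eta) mu2^T
closed_form = mu1 @ np.diag(w) @ np.diag(mu3.T @ eta) @ mu2.T

assert np.allclose(M123_eta, closed_form)
```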

To find the inverse mapping (i.e., to determine μ and w from the observed sample moments), we first pick three (K × K) matrices $U_{k}$ (for k = 1, 2, 3) such that $U_{k}^{\top } \mu _{k}$ is invertible. One valid choice is to let $U_{k}$ be the matrix of the left singular vectors of $\mu_{k}$. Although the true $\mu_{k}$ is unknown at this point, one can use the moments instead: for instance, $U_{1}$ can be taken as the left singular vectors of $M_{12}$, since the column spaces of $\mu_{1}$ and $M_{12}$ coincide by (24) and the non-singularity of $\text {diag}(w) \cdot \mu _{2}^{\top }$.

Now we have the following three lemmas.

Lemma 1

$U_{1}^{\top } M_{12} U_{2}$ is invertible.

Proof

$U_{1}^{\top } M_{12} U_{2} = (U_{1}^{\top } \mu _{1}) \cdot \text {diag}(w) \cdot (U_{2}^{\top } \mu _{2})^{\top }$, which is a product of non-singular terms. □
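The choice of projection matrices and Lemma 1 can be illustrated with a singular value decomposition of $M_{12}$. This is a sketch under assumptions: taking $U_{2}$ as the right singular vectors of $M_{12}$ is an assumption made here (they span the column space of $M_{12}^{\top}$, which coincides with that of $\mu_{2}$), and the helper `random_confusion` and parameter values are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(2)
K = 3  # number of label classes (illustrative choice)

def random_confusion(rng, K):
    """Hypothetical helper: column-stochastic, full-rank K x K matrix."""
    R = rng.random((K, K))
    R /= R.sum(axis=0, keepdims=True)
    return 0.6 * np.eye(K) + 0.4 * R

w = np.array([0.5, 0.3, 0.2])
mu1 = random_confusion(rng, K)
mu2 = random_confusion(rng, K)
M12 = mu1 @ np.diag(w) @ mu2.T

# U1: left singular vectors of M12; U2: right singular vectors (assumption).
U1, s, Vt = np.linalg.svd(M12)
U2 = Vt.T

# With this choice U1^T M12 U2 is the diagonal matrix of singular values,
# so it is invertible exactly when every singular value is positive.
core = U1.T @ M12 @ U2
assert np.allclose(core, np.diag(s))
assert s.min() > 1e-8

# The defining requirement: U_k^T mu_k is invertible.
assert np.linalg.matrix_rank(U1.T @ mu1) == K
assert np.linalg.matrix_rank(U2.T @ mu2) == K
```

In the square, full-rank setting assumed here any orthogonal $U_{k}$ satisfies the invertibility requirement, so the check is mainly a sanity test; the singular values `s` indicate how well conditioned the subsequent inversion is.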

Lemma 2

Let $B_{123}(\eta ) := (U_{1}^{\top } M_{123}(\eta ) U_{2}) \cdot (U_{1}^{\top }M_{12} U_{2})^{-1}$. Then $B_{123}(\eta ) = (U_{1}^{\top } \mu _{1}) \cdot \text {diag}(\mu _{3}^{\top } \eta ) \cdot (U_{1}^{\top } \mu _{1})^{-1}$.

Proof

Using (24) and (29),

$$\begin{array}{@{}rcl@{}} &&(U_{1}^{\top} M_{123}(\eta) U_{2}) \cdot (U_{1}^{\top} M_{12} U_{2})^{-1}\\ &=& (U_{1}^{\top} \mu_{1}) \cdot (\text{diag}(w) \cdot \text{diag}(\mu_{3}^{\top} \eta)) \cdot (U_{2}^{\top} \mu_{2})^{\top} \cdot \\ && ((U_{1}^{\top} \mu_{1}) \cdot \text{diag}(w) \cdot (U_{2}^{\top} \mu_{2})^{\top} )^{-1} \end{array} $$

(30)

$$\begin{array}{@{}rcl@{}} &=& (U_{1}^{\top} \mu_{1}) \cdot \text{diag}(\mu_{3}^{\top} \eta) \cdot (U_{1}^{\top} \mu_{1})^{-1}. \end{array} $$

(31)

□

Lemma 3

$\eta^{\top} (\mu_{3})_{j}$ for $j=1,\dots ,K$ are the eigenvalues of $B_{123}(\eta)$.

Proof

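This follows immediately from (31), which expresses $B_{123}(\eta)$ as a similarity transform of the diagonal matrix $\text{diag}(\mu_{3}^{\top} \eta)$, i.e., in the standard eigendecomposition form. □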
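Lemmas 2 and 3 can be verified numerically on population moments. The sketch below is illustrative only; the helper `random_confusion`, the SVD-based choice of $U_{1}$ and $U_{2}$, and all parameter values are assumptions rather than part of the original derivation.

```python
import numpy as np

rng = np.random.default_rng(3)
K = 3  # number of label classes (illustrative choice)

def random_confusion(rng, K):
    """Hypothetical helper: column-stochastic, full-rank K x K matrix."""
    R = rng.random((K, K))
    R /= R.sum(axis=0, keepdims=True)
    return 0.6 * np.eye(K) + 0.4 * R

w = np.array([0.5, 0.3, 0.2])
mu1, mu2, mu3 = (random_confusion(rng, K) for _ in range(3))

M12 = mu1 @ np.diag(w) @ mu2.T
M123 = np.einsum("y,ay,by,cy->abc", w, mu1, mu2, mu3)

U1, _, Vt = np.linalg.svd(M12)
U2 = Vt.T

eta = rng.standard_normal(K)
M123_eta = np.einsum("abc,c->ab", M123, eta)

# Definition of B123(eta) from Lemma 2.
B = (U1.T @ M123_eta @ U2) @ np.linalg.inv(U1.T @ M12 @ U2)

# Lemma 2: B equals A diag(mu3^T eta) A^{-1} with A = U1^T mu1.
A = U1.T @ mu1
assert np.allclose(B, A @ np.diag(mu3.T @ eta) @ np.linalg.inv(A))

# Lemma 3: the eigenvalues of B are the entries of mu3^T eta (in some order).
eigvals = np.sort(np.linalg.eigvals(B).real)
assert np.allclose(eigvals, np.sort(mu3.T @ eta))
```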

Note that one can easily compute the sample estimate of $B_{123}(\eta)$ for any vector η using the empirical moments $M_{12}$ and $M_{123}$. Lemma 3 implies that the eigenvalues of $B_{123}(\eta)$ give us partial information about the true model parameter $\mu_{3}$. One simple recipe to retrieve $\mu_{3}$ is as follows. We choose η to be the i-th column vector $(U_{3})_{i}$ of $U_{3}$ (for $i=1,\dots ,K$), and let $[\lambda _{i,1}, \dots , \lambda _{i,K}]^{\top }$ be the eigenvalues of $B_{123}((U_{3})_{i})$. Let L be the (K × K) matrix whose (i, j) entry is $L_{i,j} = \lambda_{i,j}$. By Lemma 3, $\lambda_{i,j} = (U_{3})_{i}^{\top} (\mu_{3})_{j}$, so $L = U_{3}^{\top } \mu _{3}$, and we get $\mu_{3} = U_{3}^{-\top} L$. The other $\mu_{j}$’s can be recovered in a similar manner.

During this process, to be more rigorous, one has to deal with the remaining issue of eigenvalue ordering. However, this can be handled easily by ordering/matching the eigenvectors that are shared among different recovery indices (details can be found in [3, 25]). Finally, the prior label multinomial parameter vector w can be identified using (21). That is, with the empirical estimate M ₁, we have:

$$ w = \mu_{1}^{\dagger} M_{1}, $$

(32)

where A ^† is the Moore-Penrose pseudo-inverse of A.
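The whole recovery recipe can be sketched end to end on population moments. This is a minimal illustration under stated assumptions, not the reference implementation of [3, 25]: η is taken to be the columns of $U_{3}$ so that $L = U_{3}^{\top}\mu_{3}$; the eigenvalue ordering is fixed by reusing the eigenvectors of a single $B_{123}(\eta)$, which all $B_{123}(\cdot)$ share; $\mu_{1}$ is read off from those eigenvectors by rescaling columns to sum to one, a shortcut assumed here for brevity. The helper names and parameter values are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(4)
K = 3  # number of label classes (illustrative choice)

def random_confusion(rng, K):
    """Hypothetical helper: column-stochastic, full-rank K x K matrix."""
    R = rng.random((K, K))
    R /= R.sum(axis=0, keepdims=True)
    return 0.6 * np.eye(K) + 0.4 * R

w = np.array([0.5, 0.3, 0.2])
mu1, mu2, mu3 = (random_confusion(rng, K) for _ in range(3))

# Population moments; in practice these are empirical averages.
M1 = mu1 @ w
M12 = mu1 @ np.diag(w) @ mu2.T
M123 = np.einsum("y,ay,by,cy->abc", w, mu1, mu2, mu3)

U1, _, Vt = np.linalg.svd(M12)
U2 = Vt.T
U3, _ = np.linalg.qr(rng.standard_normal((K, K)))  # arbitrary invertible choice
core_inv = np.linalg.inv(U1.T @ M12 @ U2)

def B123(eta):
    """(U1^T M123(eta) U2) (U1^T M12 U2)^{-1}."""
    return (U1.T @ np.einsum("abc,c->ab", M123, eta) @ U2) @ core_inv

# Every B123(eta) shares the eigenvectors U1^T mu1 (up to scale and order),
# so one decomposition fixes a consistent eigenvalue ordering.
_, V = np.linalg.eig(B123(rng.standard_normal(K)))
V = V.real
V_inv = np.linalg.inv(V)

# Row i of L holds the eigenvalues of B123(i-th column of U3).
L = np.stack([np.diag(V_inv @ B123(U3[:, i]) @ V) for i in range(K)])
mu3_hat = np.linalg.solve(U3.T, L)  # mu3 = U3^{-T} L

# Columns of U1 V are proportional to columns of mu1; rescale to sum to 1.
mu1_hat = U1 @ V
mu1_hat /= mu1_hat.sum(axis=0, keepdims=True)

# (32): w = pinv(mu1) M1
w_hat = np.linalg.pinv(mu1_hat) @ M1

# Parameters are identified up to a relabeling of the hidden classes.
perm = [int(np.argmin(np.abs(w_hat - wy))) for wy in w]
assert np.allclose(w_hat[perm], w)
assert np.allclose(mu1_hat[:, perm], mu1)
assert np.allclose(mu3_hat[:, perm], mu3)
```

With empirical rather than exact moments the same steps apply, but the recovered matrices are only approximate and may need projection back onto valid probability vectors.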

About this article

Cite this article

Kim, M. A maximum-likelihood and moment-matching density estimator for crowd-sourcing label prediction. Appl Intell 48, 381–389 (2018). https://doi.org/10.1007/s10489-017-0985-1

Issue Date: February 2018