Correlated topographic analysis: estimating an ordering of correlated components
Abstract
This paper describes a novel method, which we call correlated topographic analysis (CTA), to estimate non-Gaussian components and their ordering (topography). The method is inspired by a central motivation of recent variants of independent component analysis (ICA), namely, to make use of the residual statistical dependency which ICA cannot remove. We assume that components nearby in the topographic arrangement have both linear and energy correlations, while faraway components are statistically independent. We use these dependencies to fix the ordering of the components. We start by proposing a generative model for the components. Then, we derive an approximation of the likelihood based on the model. Furthermore, since gradient methods tend to get stuck in local optima, we propose a three-step optimization method which dramatically improves topographic estimation. Using simulated data, we show that CTA estimates an ordering of the components and generalizes a previous method in terms of topography estimation. Finally, to demonstrate that CTA is widely applicable, we learn topographic representations for three kinds of real data: natural images, outputs of simulated complex cells, and text data.
Keywords
Independent component analysis · Topographic representation · Natural image statistics · Higher-order features · Natural language processing

1 Introduction
However, real data often do not follow the assumptions made in ICA. For instance, the components in s may not be statistically independent. When such components are estimated with ICA, statistical dependencies between the estimates can be observed, in violation of the independence assumption. For natural images, for example, the conditional variance of an estimated component s_i may depend on the value of another component s_j: as s_j increases, the conditional variance of s_i grows. This means that the conditional distribution of s_i becomes wider as s_j increases, which gives the conditional histogram a characteristic bowtie-like shape (Simoncelli 1999; Karklin and Lewicki 2005). An alternative formulation of this dependency is energy correlation, \(\text{cov}(s_{i}^{2},s_{j}^{2})>0\): both \(s_{i}^{2}\) and \(s_{j}^{2}\) tend to be co-active (Hyvärinen et al. 2009).
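This energy-correlation structure is easy to reproduce with a toy scale mixture. The sketch below (illustrative only; the shared exponential variance variable is an arbitrary choice, not the paper's model) produces two components with near-zero linear correlation but clearly positive energy correlation:

```python
import numpy as np

rng = np.random.default_rng(0)
T = 100_000

# A shared variance variable induces energy correlation without linear
# correlation: s_i = sigma * z_i with independent Gaussian z_i and a common
# positive sigma (the exponential distribution here is an arbitrary choice).
sigma = rng.exponential(1.0, size=T)
z = rng.standard_normal((2, T))
s = sigma * z  # both rows share the same sigma in each sample

lin_corr = np.corrcoef(s[0], s[1])[0, 1]           # close to 0
energy_corr = np.corrcoef(s[0]**2, s[1]**2)[0, 1]  # clearly positive

print(f"linear correlation: {lin_corr:.3f}, energy correlation: {energy_corr:.3f}")
```

Plotting the conditional histogram of s[0] given s[1] for such samples produces exactly the bowtie shape described above.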
Therefore, it seems important to relax the independence assumption. Topographic ICA (TICA) is based on this idea (Hyvärinen et al. 2001). The key point of TICA is to arrange the components on a one- or two-dimensional grid or lattice, and to allow nearby components to have energy correlations, while faraway components are assumed statistically independent. Thus, energy correlations define the proximity of the components and can be used to fix their ordering. Osindero et al. (2006) proposed a related method whose results for natural image data were similar to those obtained with TICA, although their representations were overcomplete, in contrast to those in TICA. Karklin and Lewicki (2005) proposed a hierarchical model where the second layer learns variance components. Further related work includes tree-like modeling of the dependencies of the components (Bach and Jordan 2003; Zoran and Weiss 2009).
The components in TICA are constrained to be linearly uncorrelated. However, uncorrelated components are not always optimal. In fact, both linear and energy correlations can be observed in many practical situations. Consider the outputs of two collinearly aligned Gabor-like filters. Since natural images often contain long edges, their outputs have both linear and energy correlations (Coen-Cagli et al. 2012). Such linear correlations give the conditional histogram of the outputs a tilted bowtie-like shape. Coherent sources in MEG or EEG data can be linearly correlated too, due to neural interactions (Gómez-Herrero et al. 2008). As we will see later, another example occurs in the analysis of text data.
In this paper, we propose a new statistical method which we call correlated topographic analysis (CTA). In CTA, topographically nearby components have linear and energy correlations, and those dependencies are used to fix the ordering as in TICA. Since CTA is sensitive to both kinds of correlations, only one kind (linear or energy) needs to exist in the data. CTA thus generalizes TICA for topography estimation.
In addition to proposing the statistical model of CTA, we propose an optimization method that performs better than standard procedures in terms of local optima. This method dramatically improves topography estimation, and we verify its performance on simulated as well as real data.
This paper is organized as follows. Section 2 motivates the estimation of topographic representations and presents the new statistical method, CTA. CTA is introduced as a special case of a more general framework which also includes ICA and TICA. In Sect. 3, we use simulated data to verify identifiability of the linear mixing model in (1) for sources with various dependency structures, and compare the performances of ICA, TICA and CTA. In Sect. 4, CTA is applied to three kinds of real data: natural images, outputs of simulated complex cells, and text data. The applicability to such a wide range of data sets suggests that CTA may be broadly useful. Connections to previous work are discussed in Sect. 5. Section 6 concludes the paper.
2 Correlated topographic analysis
We start by motivating the estimation of topographic representations. Then, we introduce a generative model for the sources s in order to model ICA, TICA and CTA in a unified way, and describe the basic properties of the components in CTA. We then derive an approximation of the likelihood for CTA and propose a method for its optimization.
2.1 Motivation for estimating topographic representations
The foremost motivation for estimating topographic representations is visualization. Plotting the components in their topographic arrangement makes it easy to see the interrelationships between components. This is particularly true if the topographic grid is two-dimensional and can thus be plotted on the plane.
A second motivation is that the topography learned from natural inputs such as natural images, natural sound, or text, might model cortical representations in the brain. This is based on the hypothesis that in order to minimize wiring length, neurons which interact with each other should be close to each other, see e.g. Hyvärinen et al. (2009). Minimizing wiring seems to be important to keep the volume of the brain manageable, and possibly to speed up computation as well.
An example is the computation of complex cell outputs from simple cell outputs in primary visual cortex (V1). Simple cells are sensitive to an oriented bar or edge at a certain location in visual space, while complex cells are otherwise similar but invariant to the local sinusoidal phase of visual stimuli. Computationally, such a conversion can be achieved by pooling the squares of the outputs of simple cells which have similar orientation and spatial location, but different phases. A topographic representation in which simple cells are arranged as observed in V1 could minimize the wiring needed in such pooling, because the pooling is done over nearby cells. Such a minimum-wiring topography was found to emerge from natural images using TICA (Hyvärinen et al. 2001).
Related to minimum wiring, the topography may also enable a simple definition of new, higher-order features. Summation of the features in a topographic neighborhood (possibly after a nonlinearity such as squaring) may in general lead to interesting new features, just as in the case of simple cell pooling explained above.
2.2 The generative model
 1.
If z is multivariate Gaussian with mean 0 and the elements in σ are positive random variables, which is what we assume in the following, the components in s are super-Gaussian, i.e., sparse (Hyvärinen et al. 2001).
 2.
By introducing linear correlations in z and/or energy correlations in σ, the components in s will have linear and/or energy correlations. This point will be made more precise in the following.
 Case 1

If all the elements in z and σ are statistically independent, then s is a vector with independent sparse sources, and (2) gives the source model of ICA.
 Case 2

If all the elements in z are uncorrelated, but the squares of nearby elements in σ are correlated, then s is a vector formed by sparse sources with energy correlations (and no linear correlations) within a certain neighborhood, and thus (2) gives the source model of TICA.
 Case 3

If nearby elements in z are correlated, but all the elements in σ are statistically independent, then s is a sparse source vector whose elements have linear correlations (and zero or weak energy correlations) within a certain neighborhood.
 Case 4

If nearby elements in z and the squares of nearby elements in σ are correlated, then s is a sparse source vector whose elements have linear and energy correlations within a certain neighborhood, and (2) gives the source model of CTA.
Dependencies of pairs of nearby elements in σ and z for the four cases of sources, and the corresponding source models:

Case 1: \(\text{cov}(\sigma_{i}^{2},\sigma_{j}^{2})=0\) and cov(z_i, z_j) = 0; source model: ICA
Case 2: \(\text{cov}(\sigma_{i}^{2},\sigma_{j}^{2})\neq0\) and cov(z_i, z_j) = 0; source model: TICA
Case 3: \(\text{cov}(\sigma_{i}^{2},\sigma_{j}^{2})=0\) and cov(z_i, z_j) ≠ 0; source model: not explicitly considered
Case 4: \(\text{cov}(\sigma_{i}^{2},\sigma_{j}^{2})\neq0\) and cov(z_i, z_j) ≠ 0; source model: CTA
In the following, we concentrate on Case 4 (both energy and linear correlations). We do not explicitly consider Case 3 (linear correlations only), but we will show below with simulations that CTA identifies its sources and estimates the ordering of the components as well. This is natural since the model in Case 4 uses both linear and energy correlations to model topography, while Case 3 uses linear ones only.
2.3 Basic properties of the model
 The mean values of all the components are zero.
Nearby components, s_i and s_j, are correlated if and only if z_i and z_j are linearly correlated. From the property (3) and the independence of σ and z, this is proven by
\[\text{cov}(s_{i},s_{j})=E\{\sigma_{i}z_{i}\,\sigma_{j}z_{j}\}=E\{\sigma_{i}\sigma_{j}\}E\{z_{i}z_{j}\}=E\{\sigma_{i}\sigma_{j}\}\,\text{cov}(z_{i},z_{j}).\]
Thus, cov(s_i, s_j) is the same as cov(z_i, z_j) up to the positive multiplicative factor E{σ_i σ_j}. The linear correlation coefficient of the components has an upper bound (Appendix A).
The energy correlation for s_i and s_j can be computed as
\[\text{cov}(s_{i}^{2},s_{j}^{2})=\text{cov}(\sigma_{i}^{2},\sigma_{j}^{2})E\{z_{i}^{2}\}E\{z_{j}^{2}\}+2E\{\sigma_{i}^{2}\sigma_{j}^{2}\}E\{z_{i}z_{j}\}^{2}, \qquad (5)\]
where we used the formula, valid for Gaussian variables with zero means, \(E\{z_{i}^{2}z_{j}^{2}\}=E\{z_{i}^{2}\}E\{z_{j}^{2}\}+2E\{z_{i}z_{j}\}^{2}\), which is proven by Isserlis' theorem (Isserlis 1918; Michalowicz et al. 2009). From (5), the energy correlation is caused by the energy correlation for σ and the squared linear correlation for z. Thus, to prove that \(\text{cov}(s_{i}^{2},s_{j}^{2})>0\), it is enough to prove that \(\text{cov}(\sigma_{i}^{2},\sigma_{j}^{2})>0\). In the literature of TICA (Hyvärinen et al. 2001), \(\text{cov}(\sigma_{i}^{2},\sigma_{j}^{2})\) is conjectured to be positive when each σ_i takes the form
\[\sigma_{i}=\phi_{i}\biggl(\sum_{k\in N(i)}u_{k}\biggr), \qquad (6)\]
where N(i) is an index set determining a certain neighborhood, ϕ_i(⋅) denotes a monotonic nonlinear function, and u_i is a positive random variable. We follow this conjecture. The energy correlation coefficient of the components also has an upper bound (Appendix A).
The same analysis has been done for the TICA generative model (Case 2) in Hyvärinen et al. (2001). In the model, the sources are linearly uncorrelated, and, regarding energy correlation, only the first term in (5) is nonzero because the elements in z are statistically independent. Thus, compared to TICA, in CTA, there exist linear correlations and the energy correlations are stronger as well.
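The Gaussian fourth-moment identity used above is easy to check numerically. This sketch (with an arbitrary correlation of 0.4) compares both sides of E{z_i²z_j²} = E{z_i²}E{z_j²} + 2E{z_i z_j}² by Monte Carlo:

```python
import numpy as np

rng = np.random.default_rng(1)
T = 2_000_000

# Zero-mean Gaussian pair with an arbitrary correlation of 0.4.
C = np.array([[1.0, 0.4], [0.4, 1.0]])
z = rng.multivariate_normal(np.zeros(2), C, size=T).T

# Both sides of the identity E{z_i^2 z_j^2} = E{z_i^2} E{z_j^2} + 2 E{z_i z_j}^2.
lhs = np.mean(z[0]**2 * z[1]**2)
rhs = np.mean(z[0]**2) * np.mean(z[1]**2) + 2 * np.mean(z[0] * z[1])**2

print(lhs, rhs)  # both close to the exact value 1 + 2 * 0.4**2 = 1.32
```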
2.4 Probability distribution and its approximation
2.5 Accuracy of the approximation
The two approximations (12) and (13) were used to derive (16). To analyze the implications of these approximations, we compared (16) with the generative model in (2) in terms of correlation and sparsity of the sources.
For the comparison, we sampled from (2) using d=2 sources and the fixed values a _{ i }=b _{ i }=1 for different values of λ _{ i }=λ. We sampled from (16), with d=2, using slice sampling.^{1} For both models, we drew 10^{6} samples to compute the correlation coefficient between the two sources, the correlation coefficient between their squared values, and their kurtosis.
For the generative model, we found that the (excess) kurtosis of the sources was independent of λ, with a value around 3.4. For the approximation, we obtained a value around 2.1. This means that both the original model and the approximation yield sparse sources.
To conclude, we confirmed that the approximation has qualitatively similar properties to the generative model for λ close to −1. Its limitations are that, for λ close to −1, the sources are more strongly energy correlated but less sparse than in the original generative model.
2.6 Objective function and its optimization
The final output of the algorithm is W^{(3)}. Step 1 corresponds to performing ICA, and Step 2 gives the optimal order and signs of the ICA components in the sense of the objective function J_2. In Step 3, W^{(2)} is used as the initial value of W. Steps 1 and 2 can therefore be interpreted as a way to find a good initial value.
We now briefly describe the runtime cost of the optimization. When the data are high-dimensional, most of the time is spent in the dynamic programming part (Algorithm 2). The computation of (23) requires T additions, and it is repeated 4(d−i+1)(d−i) times to build the ith table (24). The total cost of the additions is therefore approximately \(O(4T\sum_{i=2}^{d-1}(d-i+1)(d-i))=O(Td^{3})\). Thus, as the dimension of the data increases, more computational time is needed. But, as we will see below, this algorithm dramatically improves results in terms of topography estimation.
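The addition count above can be tabulated directly. This small sketch evaluates the sum (the factor T is constant across dimensions and omitted) and shows the ratio to d³ settling toward a constant, confirming the cubic scaling:

```python
# Count the additions performed while building the dynamic-programming tables
# and compare with the d**3 scaling stated in the text (T factors out).
def dp_additions(d):
    # 4 * (d - i + 1) * (d - i) additions for the i-th table, i = 2, ..., d-1
    return 4 * sum((d - i + 1) * (d - i) for i in range(2, d))

for d in (20, 40, 80):
    print(d, dp_additions(d), round(dp_additions(d) / d**3, 3))
```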
3 Identifying simulated sources
In this section, we investigate how the objective function in (17) can be used to estimate the model (1) with sources generated according to the four cases outlined in the previous section, and compare the performances of ICA, TICA and CTA.
3.1 Methods
We generated sources s according to the four cases of (2). We sampled z from a Gaussian distribution with mean 0 and covariance matrix C: in Case 3 and Case 4, all diagonal elements are 1, the (i,i+1)th element c_{i,i+1} (= c_{i+1,i}) is 0.4 with a ring-like boundary, and the other elements are 0. In Case 1 and Case 2, C is the identity matrix. For σ, each element in Case 2 and Case 4 is generated as σ_i = r_{i−1} + r_i + r_{i+1}, where r_i is sampled from the exponential distribution with mean 1. In Case 1 and Case 3, σ_i = r_i. After generating s, the mean and variance of each component s_i are standardized to zero and one, respectively. The dimension of s and the number of samples are d=20 and T=30,000, respectively.
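A minimal sketch of this source-generation recipe for Case 4 (seeds and helper names are ours) could look as follows:

```python
import numpy as np

rng = np.random.default_rng(2)
d, T = 20, 30_000

# z: unit-variance Gaussian with correlation 0.4 between ring neighbours.
C = np.eye(d)
for i in range(d):
    j = (i + 1) % d  # ring-like boundary
    C[i, j] = C[j, i] = 0.4
z = rng.multivariate_normal(np.zeros(d), C, size=T).T

# sigma: sum of three neighbouring exponential variables (Case 2 / Case 4 recipe).
r = rng.exponential(1.0, size=(d, T))
sigma = np.roll(r, 1, axis=0) + r + np.roll(r, -1, axis=0)

s = sigma * z
s = (s - s.mean(axis=1, keepdims=True)) / s.std(axis=1, keepdims=True)

# Neighbouring sources show both linear and energy correlations.
print(np.corrcoef(s[0], s[1])[0, 1], np.corrcoef(s[0]**2, s[1]**2)[0, 1])
```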
For the generated sources of Case 3, we verified that the energy correlation was very weak: over 100 source sets, the mean and standard deviation of the energy correlation coefficient between \(s_{1}^{2}\) and \(s_{2}^{2}\) were 0.0192 and 0.0102, respectively.
Then, the data x was generated from the model (1) where the elements of A were randomly sampled from the standard normal distribution. The preprocessing consisted of whitening based on PCA.
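The PCA whitening step can be sketched as follows, with hypothetical Laplacian sources standing in for the generated ones; after the transform, the sample covariance of the data is the identity matrix:

```python
import numpy as np

rng = np.random.default_rng(3)
d, T = 20, 30_000

# Hypothetical mixed data standing in for x = A s in the simulation setup.
A = rng.standard_normal((d, d))
x = A @ rng.laplace(size=(d, T))
x = x - x.mean(axis=1, keepdims=True)

# PCA whitening: V = E D^{-1/2} E^T from the eigendecomposition of cov(x).
eigval, E = np.linalg.eigh(np.cov(x))
V = E @ np.diag(eigval ** -0.5) @ E.T
xw = V @ x

print(np.abs(np.cov(xw) - np.eye(d)).max())  # ~0: whitened covariance is I
```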
We visualize the estimation results by showing the performance matrix P=WA. If the estimation of the ordering is correct, P should be close to a diagonal matrix, or a circularly shifted diagonal matrix because of the ringlike boundary.
3.2 Results
We first show the effectiveness of our three-step optimization method in optimizing J. Then, we show the results of the comparison between ICA, TICA and CTA.
3.2.1 Escaping from local maxima
Figure 3(c) shows the result when the three-step optimization method is applied to our example. The performance matrix is close to a shifted identity matrix, and the value of the objective function equals the one in Fig. 3(b). This means that the estimation is performed correctly. Furthermore, note that the signs of the diagonal elements of P in Fig. 3(c) all agree, which means that CTA also resolves the sign indeterminacy of ICA. This is impossible for TICA because its objective function (27) is insensitive to the signs of the components.
3.2.2 Comparison to ICA and TICA
Next, we perform 100 trials for each of the four cases of sources, and compare the performance of the three methods.
Performance matrices for one of the 100 trials are presented in Fig. 6(a). CTA shows the best topography estimation for sources from Case 2 to Case 4. TICA cannot estimate the topography for Case 3. Regarding AI (Fig. 6(b)), CTA is not as good as ICA and TICA in Case 1 and Case 2, presumably because CTA forces the estimated components to be correlated even when they are not. For Case 3 and Case 4, CTA performs comparably to or better than ICA and TICA. Regarding TI (Fig. 6(c)), only CTA can estimate the ordering of the components in all three topographic cases (Case 2, Case 3 and Case 4); again, TICA cannot estimate the topography for Case 3. We conclude that CTA shows the best performance among the three methods and generalizes TICA for topography estimation. The identifiability of CTA is weaker for sources with no linear correlations, but it is at least as good as that of ICA or TICA for sources with linear ones.
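The acronym AI presumably refers to the Amari index of the performance matrix P = WA (Amari et al. 1996). A common variant of the index, whose normalization may differ from the one used in the paper, can be sketched as:

```python
import numpy as np

def amari_index(P):
    # 0 for a scaled permutation matrix; larger means worse identification.
    P = np.abs(P)
    d = P.shape[0]
    row = (P / P.max(axis=1, keepdims=True)).sum(axis=1) - 1
    col = (P / P.max(axis=0, keepdims=True)).sum(axis=0) - 1
    return (row.sum() + col.sum()) / (2 * d * (d - 1))

perm = np.eye(5)[[2, 0, 4, 1, 3]]  # a perfect recovery up to permutation
print(amari_index(perm))           # 0.0
print(amari_index(np.random.default_rng(4).random((5, 5))))  # clearly larger
```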
4 Application to real data
In this section, CTA is applied to three kinds of real data: natural images, outputs of simulated complex cells in V1, and text data.
4.1 Natural images
Here, we apply CTA to natural image patches.
4.1.1 Methods
The data x(t) are 20 by 20 image patches which are extracted from natural images.^{3} The total number of patches is 100,000. As preprocessing, the DC component of each patch is removed, and whitening and dimensionality reduction are performed by PCA. We retain 252 dimensions.
4.1.2 Results
4.2 Simulated complex cells
Next, CTA is applied to the outputs of simulated complex cells in V1 when stimulated with natural images. ICA and its related methods have been applied to this kind of data before (Hoyer and Hyvärinen 2002; Hyvärinen et al. 2005). Our purpose here is to investigate what kind of topography emerges for the learned higherorder basis vectors.
4.2.1 Methods
4.2.2 Results
4.3 Text data
Our final application of CTA is to text data. Previously, ICA has been applied to similar data. Kolenda et al. (2000) analyzed a set of documents and showed that ICA found more easily interpretable structures than the more traditional latent semantic analysis (LSA). Honkela et al. (2010) analyzed word contexts in text corpora, where ICA gave more distinct features reflecting linguistic categories than LSA. Here, we apply CTA to this kind of context-word data. The purpose is to see what kind of interrelationships CTA identifies between the latent categories.
4.3.1 Methods
We constructed the context-word data as in Honkela et al. (2010). First, the most frequent T=200,000 words were collected from 51,126 Wikipedia articles written in English; these are called "collected words" in what follows. Next, we listed the context words occurring within the two words before or the two words after each collected word, and kept the most frequent 1,000 of them. For each pair of collected word and context word, we computed the joint frequency, and organized the values into a matrix Y of size 1,000 by 200,000. Finally, we obtained the context-word matrix X=(x(1),x(2),…,x(T)) by transforming each element of Y as x_i(t)=log(y_{i,t}+1).
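A toy sketch of this construction (the corpus and vocabulary sizes here are illustrative stand-ins, not the actual Wikipedia data):

```python
import numpy as np
from collections import Counter

# Illustrative corpus; the real data uses 51,126 Wikipedia articles,
# 200,000 collected words and 1,000 context words.
corpus = "the cat sat on the mat the dog sat on the rug".split()
collected = sorted(set(corpus))  # stand-in for the collected words
contexts = collected             # stand-in for the context words

# Joint frequencies over a window of two words before and two words after.
counts = Counter()
for t, w in enumerate(corpus):
    for u in corpus[max(0, t - 2):t] + corpus[t + 1:t + 3]:
        counts[(u, w)] += 1

Y = np.zeros((len(contexts), len(collected)))
for (u, w), c in counts.items():
    Y[contexts.index(u), collected.index(w)] = c

X = np.log(Y + 1)  # x_i(t) = log(y_{i,t} + 1)
print(X.shape)
```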
As preprocessing, we made the mean of each row of X zero and standardized its variance to one. Then, the data was whitened by PCA, and its dimension was reduced from 1,000 to 60. Unlike in the experiments on natural images and the outputs of complex cells, we assume here a one-dimensional topography and estimate W as described in Sect. 2.6. After the estimation, the context-word data can be represented as X=AS, where S is a 60 by 200,000 category-word matrix. Note that in the context of the text data, we call the rows of S "categories".
To quantify whether the words in each category are similar to each other and to those in adjacent categories, we computed a similarity metric between two words using WordNet (Miller 1995; Fellbaum 1998) and the Natural Language Toolkit (NLTK) (Bird et al. 2009). WordNet is a large lexical database in which words are assigned to sets of synonyms (synsets), each expressing a lexicalized concept (Miller 1995). Since WordNet contains a network of synsets, one can compute the similarity between two words based on simple measures, e.g., the distance between synsets. For the computation of the similarity, we first picked the top 40 words in each category, i.e., the words with the largest s_i(t). Then, we computed similarities between all possible combinations of words within categories and between adjacent ones. Words not assigned to any synset were omitted from this analysis, as were categories in which none of the top 40 words had synsets.^{6} To compute the similarity, we used the algorithm path_similarity in NLTK, which is based on the shortest path. When a word had more than one synset, we computed similarities for all possible combinations of synsets and selected the maximum value. As a baseline, we computed similarities for 1,600 pairs of words randomly selected from the 200,000 "collected words".
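The path-based similarity can be illustrated without the WordNet database: NLTK's path_similarity scores two synsets as 1/(1 + length of the shortest path between them). The sketch below applies that formula to a tiny hand-made graph whose "synsets" and edges are invented for illustration:

```python
from collections import deque

# Invented miniature "synset" graph; the real computation walks the WordNet
# hypernym network through NLTK.
edges = {
    "animal": ["dog", "cat"],
    "dog": ["animal"], "cat": ["animal"],
    "artifact": ["chair"], "chair": ["artifact"],
}

def shortest_path(a, b):
    # Breadth-first search for the number of edges between two nodes.
    seen, queue = {a}, deque([(a, 0)])
    while queue:
        node, dist = queue.popleft()
        if node == b:
            return dist
        for n in edges.get(node, []):
            if n not in seen:
                seen.add(n)
                queue.append((n, dist + 1))
    return None  # disconnected: no similarity defined

def path_similarity(a, b):
    dist = shortest_path(a, b)
    return None if dist is None else 1 / (1 + dist)

print(path_similarity("dog", "cat"))  # two edges via "animal": 1/3
```

Disconnected pairs, like words whose synsets share no path, get no similarity value, which is why some category pairs had to be omitted from the analysis.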
4.3.2 Results
Two examples of a topographic ordering between three categories. Denoting the kth row of the matrix S by S^k, the ten words with the largest absolute values in each row S^k are shown
Example 1: Number                   Example 2: Media

S^7       S^8         S^9           S^27        S^28        S^29
hours     few         6             published   marvel      band's
week      several     3             reports     comic       album
month     already     4             report      comics      pop
weeks     various     13            review      fantasy     albums
days      have        16            articles    batman      solo
months    numerous    8             detailed    xmen        band
day       frequently  21            newspaper   manga       rock
hour      two         23            journal     fiction     songs
year      eight       11            fiction     spiderman   blues
summer    many        32            books       superman    punk
Another two examples of a topographic ordering between three categories. “undergrad.” in S ^{38} is an abbreviation of “undergraduate”
Example 3: Job titles and Names        Example 4: Politics and Education

S^31        S^32      S^33           S^36         S^37            S^38
actress     actor     minister       minister     constitution    graduate
singer      scott     deputy         politician   constitutional  education
lord        smith     prime          government   parliament      sciences
actor       jr        ali            parliament   courts          engineering
songwriter  haward    appointed      poet         federal         undergrad.
governor    allen     elected        troops       court           medical
musician    lee       ahmed          election     senate          faculty
chairman    johnson   pierre         citizens     legislative     institute
secretary   anthony   mohammad       actress      law             school
naval       wilson    singh          party        supreme         science
We performed the same experiment and analysis for TICA. The results of the best run are also shown in Fig. 13. Figure 13(a) shows that CTA and TICA have almost the same curves for similarity values within categories. However, for CTA, the curve for the similarities between adjacent categories is typically higher than for TICA (Fig. 13(b)). We performed a one-sided t-test on the data in each of the two curves of Fig. 13(b). The null hypothesis is that μ^{CTA} is less than or equal to μ^{TICA}, where μ^{CTA} denotes the mean of the points forming the CTA curve in Fig. 13(b) and μ^{TICA} the mean for the TICA curve. Note that we did not test whether the CTA curve itself is higher than the TICA one, because the points in Fig. 13 are sorted only for visualization, and thus there is no particular order relationship between the points of the two curves. For Fig. 13(b1) and (b2), the p-values are 0.045 and 0.162, respectively. Thus, in the best result for CTA, the difference is statistically significant at the 0.05 level (Fig. 13(b1)). Even in the worst case, the performance of CTA seems intuitively better, although the difference is not statistically significant (Fig. 13(b2)). Therefore, we conclude that CTA identifies a better topography for text data as well.
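A one-sided test of this kind can be run as follows; the similarity values here are synthetic stand-ins for the actual curves, and the Welch (unequal-variance) variant is our choice rather than necessarily the paper's:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(5)

# Synthetic stand-ins for the two curves of adjacent-category similarities.
cta = rng.normal(0.25, 0.05, size=40)
tica = rng.normal(0.22, 0.05, size=40)

# One-sided Welch test of H1: mean(CTA) > mean(TICA).
t, p = stats.ttest_ind(cta, tica, alternative="greater", equal_var=False)
print(f"t = {t:.2f}, one-sided p = {p:.4f}")
```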
5 Discussion
First, we summarize the connections between CTA and TICA. Then, we discuss the connection to other related work.
5.1 Connection to topographic independent component analysis
Section 2 showed that TICA and CTA are closely connected. We see their source models as special instances of the generative model (2), or of the distribution in (15). The distribution (16) which we used to define the CTA objective function is obtained from (15) by fixing the parameters a _{ i }=1, b _{ i }=1 and ϱ _{ i }=−1 for all i. Ideally, we would estimate all these parameters. This is however difficult because we do not know the analytical expression of the partition function in (15). Therefore, we had to leave this challenge to future work. A possible approach is to use score matching (Hyvärinen 2006) or noisecontrastive estimation (Gutmann and Hyvärinen 2012).
The foremost difference between CTA and TICA is the additional assumption of linear correlations in the source vector s. The sensitivity to linear correlation improved the topography estimates on artificial as well as real data, as discussed in Sect. 3.2.2 and Sect. 4. A drawback of this sensitivity is that the identifiability of CTA becomes worse than that of ICA or TICA when the sources have no linear correlations (Fig. 6). To fix this drawback, we should estimate the amount of linear correlation. This could be achieved by estimating the ϱ_i, which, as mentioned above, is a topic we leave to future work.
5.2 Connection to other related work and possible application
Structured sparsity is a concept related to topographic analysis. Mairal et al. (2011) applied dictionary learning with structured sparsity to natural images, and the results were similar to TICA. The main difference is that they did not learn linearly correlated components as CTA does. As discussed above, incorporating linear correlation can be advantageous for topography estimation.
For natural images, Osindero et al. (2006) proposed another energybased model which has an objective very similar to TICA, and produces similar results on natural images. Again, the difference to our method is that linear correlations between components are not explicitly modeled. Their model allows for overcomplete bases, which by necessity introduces some linear correlations. But it seems that their model still tries to minimize linear correlations instead of explicitly allowing them.
For the outputs of complex cells, Hoyer and Hyvärinen (2002) first discovered long contours by applying a non-negative sparse coding method to the data. Hyvärinen et al. (2005) applied ICA to the outputs with multiple frequency bands and found long broadband contours. Compared with our results, the main difference is the topography of the estimated features: in Fig. 9(a), similar features are close to each other, while they are randomly organized in the work cited above. The reason is that the previously used methods assume that the components are statistically independent. In addition, the end-stopping behavior that emerges for CTA was not seen in previous work.
For text data, Honkela et al. (2010) applied ICA to the same kind of word data and learned categories similar to ours. Since Honkela and colleagues used ICA, however, no relationships between the categories were obtained. In contrast, our method estimates a topographic representation in which nearby categories include semantically similar words.
We have focused on learning data representations in this paper. CTA might also be useful for engineering applications. Recently, Kavukcuoglu et al. (2009) proposed an architecture for image recognition that creates a new feature through a topographic map learned by a method similar to TICA. We would expect CTA to be equally applicable in such tasks, with its additional sensitivity to linear correlations possibly being an advantage. However, such a study is beyond the scope of this paper, and we leave it to future work.
6 Conclusion
We proposed correlated topographic analysis (CTA), an extension of ICA that estimates the ordering (topography) of correlated components. In the proposed method, nearby components s_i are allowed to have linear and energy correlations, while faraway components are as statistically independent as possible. In previous work, only higher-order correlations were introduced. Our method generalizes those methods: if either linear or energy correlations are present in the components, CTA can estimate the topography. In addition, since optimization by gradient methods tends to get stuck in local maxima, we proposed a three-step optimization method which dramatically improved topography estimation.
Besides validating the properties of CTA using artificial data, we applied CTA to three kinds of real data sets: natural images, outputs of simulated complex cells, and text data. For natural images, similar basis vectors were close to each other, and we found that most basis vectors represented edges, not bars. In the experiment using the outputs of simulated complex cells, new kinds of higherorder features emerged and, moreover, similar features were systematically organized on the lattice. Finally, we showed for text data that CTA identifies semantic categories and orders them so that adjacent categories are connected by the semantics of the words which they represent.
Footnotes
 1.
We used MATLAB's slicesample.m with a burn-in period of 50,000 samples.
 2.
J _{1} is insensitive to any change of the order and signs of the components. Therefore, we computed only J _{2} instead of J.
 3.
The natural images here were taken from the software package associated with the book (Hyvärinen et al. 2009), available at http://www.naturalimagestatistics.net.
 4.
The contournet MATLAB package is used to compute the outputs of complex cells and available at http://www.cs.helsinki.fi/u/phoyer/software.html.
 5.
For TICA extended to a two-dimensional lattice, we maximized the objective function using only the conjugate gradient method (Rasmussen 2006), and did not optimize the order of the components, because the functional form of the objective function in TICA differs from that in CTA and the optimization method described in Appendix D could therefore not be applied.
 6.
For example, there were no synsets in the categories consisting of numbers, such as “the late 1900’s” and “the late 1800’s” in Fig. 12(c) and (d).
 7.
Some categories with fewer than 30 similarity values were also omitted because the algorithm could not define the similarity for some pairs of synsets.
Notes
Acknowledgements
The authors are very grateful to Timo Honkela and Jaakko J. Väyrynen for providing us with the text data, and wish to thank Shunji Satoh and Junichiro Hirayama for their helpful discussion. H. Sasaki was supported by JSPS KAKENHI Grant Number 23⋅7556 (Grant-in-Aid for JSPS Fellows). H. Shouno was partially supported by Grant-in-Aid for Scientific Research (C) 21500214 and for Scientific Research on Innovative Areas 21103008, MEXT, Japan. A. Hyvärinen and M.U. Gutmann were supported by the Academy of Finland (CoE in Algorithmic Data Analysis, CoE in Inverse Problems Research, CoE in Computational Inference Research, and Computational Science Program).
References
Amari, S., Cichocki, A., & Yang, H. H. (1996). A new learning algorithm for blind signal separation. In Advances in neural information processing systems (Vol. 8, pp. 757–763).
Andrews, D. F., & Mallows, C. L. (1974). Scale mixtures of normal distributions. Journal of the Royal Statistical Society, Series B (Methodological), 36(1), 99–102.
Bach, F. R., & Jordan, M. I. (2003). Beyond independent components: trees and clusters. Journal of Machine Learning Research, 4, 1205–1233.
Bell, A. J., & Sejnowski, T. J. (1997). The "independent components" of natural scenes are edge filters. Vision Research, 37(23), 3327–3338.
Bellman, R. E. (1957). Dynamic programming. Princeton: Princeton University Press.
Bellman, R. E., & Dreyfus, S. E. (1962). Applied dynamic programming. Princeton: Princeton University Press.
Bird, S., Klein, E., & Loper, E. (2009). Natural language processing with Python. Sebastopol: O'Reilly Media.
Coen-Cagli, R., Dayan, P., & Schwartz, O. (2012). Cortical surround interactions and perceptual salience via natural scene statistics. PLoS Computational Biology, 8(3), e1002405.
Comon, P. (1994). Independent component analysis, a new concept? Signal Processing, 36(3), 287–314.
Fellbaum, C. (1998). WordNet: an electronic lexical database. Cambridge: MIT Press.
Gómez-Herrero, G., Atienza, M., Egiazarian, K., & Cantero, J. L. (2008). Measuring directional coupling between EEG sources. NeuroImage, 43(3), 497–508.
Gutmann, M. U., & Hyvärinen, A. (2012). Noise-contrastive estimation of unnormalized statistical models, with applications to natural image statistics. Journal of Machine Learning Research, 13, 307–361.
Held, M., & Karp, R. M. (1962). A dynamic programming approach to sequencing problems. Journal of the Society for Industrial and Applied Mathematics, 10(1), 196–210.
Honkela, T., Hyvärinen, A., & Väyrynen, J. J. (2010). WordICA: emergence of linguistic representations for words by independent component analysis. Natural Language Engineering, 16(3), 277–308.
 Hoyer, P. O., & Hyvärinen, A. (2002). A multilayer sparse coding network learns contour coding from natural images. Vision Research, 42(12), 1593–1605. CrossRefGoogle Scholar
 Hyvärinen, A. (2006). Estimation of nonnormalized statistical models by score matching. Journal of Machine Learning Research, 6, 695–708. Google Scholar
 Hyvärinen, A., & Hoyer, P. O. (2001). A twolayer sparse coding model learns simple and complex cell receptive fields and topography from natural images. Vision Research, 41(18), 2413–2423. CrossRefGoogle Scholar
 Hyvärinen, A., & Oja, E. (2000). Independent component analysis: algorithms and applications. Neural Networks, 13(4–5), 411–430. CrossRefGoogle Scholar
 Hyvärinen, A., Hoyer, P. O., & Inki, M. (2001). Topographic independent component analysis. Neural Computation, 13(7), 1527–1558. zbMATHCrossRefGoogle Scholar
 Hyvärinen, A., Gutmann, M., & Hoyer, P. O. (2005). Statistical model of natural stimuli predicts edgelike pooling of spatial frequency channels in V2. BMC Neuroscience, 6, 12. CrossRefGoogle Scholar
 Hyvärinen, A., Hurri, J., & Hoyer, P. O. (2009). Natural image statistics: a probabilistic approach to early computational vision. Berlin: Springer. Google Scholar
 Isserlis, L. (1918). On a formula for the productmoment coefficient of any order of a normal frequency distribution in any number of variables. Biometrika, 12(1/2), 134–139. CrossRefGoogle Scholar
 Karklin, Y., & Lewicki, M. S. (2005). A hierarchical Bayesian model for learning nonlinear statistical regularities in nonstationary natural signals. Neural Computation, 17(2), 397–423. zbMATHCrossRefGoogle Scholar
 Kavukcuoglu, K., Ranzato, M. A., Fergus, R., & LeCun, Y. (2009). Learning invariant features through topographic filter maps. In IEEE conference on computer vision and pattern recognition, 2009. CVPR 2009 (pp. 1605–1612). New York: IEEE. CrossRefGoogle Scholar
 Kolenda, T., Hansen, L. K., & Sigurdsson, S. (2000). Independent components in text. In Advances in independent component analysis (pp. 229–250). Berlin: Springer. Google Scholar
 Mairal, J., Jenatton, R., Obozinski, G., & Bach, F. (2011). Convex and network flow optimization for structured sparsity. Journal of Machine Learning Research, 12, 2681–2720. MathSciNetGoogle Scholar
 Michalowicz, J. V., Nichols, J. M., Bucholtz, F., & Olson, C. C. (2009). An Isserlis’ theorem for mixed Gaussian variables: application to the autobispectral density. Journal of Statistical Physics, 136(1), 89–102. MathSciNetzbMATHCrossRefGoogle Scholar
 Miller, G. A. (1995). Wordnet: a lexical database for English. Communications of the ACM, 38(11), 39–41. CrossRefGoogle Scholar
 Olshausen, B. A., & Field, D. J. (1996). Emergence of simplecell receptive field properties by learning a sparse code for natural images. Nature, 381, 607–609. CrossRefGoogle Scholar
 Osindero, S., Welling, M., & Hinton, G. E. (2006). Topographic product models applied to natural scene statistics. Neural Computation, 18(2), 381–414. MathSciNetzbMATHCrossRefGoogle Scholar
 Rasmussen, C. E. (2006). Conjugate gradient algorithm, version 20060908. Google Scholar
 Simoncelli, E. P. (1999). Modeling the joint statistics of images in the wavelet domain. In Proc. SPIE, 44th annual meeting (Vol. 3813, pp. 188–195). Google Scholar
 Tibshirani, R., Saunders, M., Rosset, S., Zhu, J., & Knight, K. (2005). Sparsity and smoothness via the fused lasso. Journal of the Royal Statistical Society. Series B. Statistical Methodology, 67(1), 91–108. MathSciNetzbMATHCrossRefGoogle Scholar
 Vigário, R., Särelä, J., Jousmäki, V., Hämäläinen, M., & Oja, E. (2000). Independent component approach to the analysis of EEG and MEG recordings. IEEE Transactions on Biomedical Engineering, 47(5), 589–593. CrossRefGoogle Scholar
 Zoran, D., & Weiss, Y. (2009). The “treedependent components” of natural images are edge filters. In Advances in neural information processing systems (Vol. 22, pp. 2340–2348). Google Scholar