SICA: subjectively interesting component analysis
Abstract
The information in high-dimensional datasets is often too complex for human users to perceive directly. Hence, it may be helpful to use dimensionality reduction methods to construct lower-dimensional representations that can be visualized. The natural question that arises is how to construct the most informative low-dimensional representation. We study this question from an information-theoretic perspective and introduce a new method for linear dimensionality reduction. The resulting model that quantifies informativeness also allows us to flexibly account for prior knowledge a user may have about the data. This enables us to provide representations that are subjectively interesting. We call the method Subjectively Interesting Component Analysis (SICA) and expect it to be mainly useful for iterative data mining. SICA is based on a model of a user’s belief state about the data. This belief state is used to search for surprising views. The initial state is chosen by the user (it may be empty, up to the data format) and is updated automatically as the analysis progresses. We study several types of prior beliefs: if a user only knows the scale of the data, SICA yields the same cost function as Principal Component Analysis (PCA), while if a user expects the data to have outliers, we obtain a variant that we term t-PCA. Finally, scientifically more interesting variants are obtained when a user has more complicated beliefs, such as knowledge about similarities between data points. The experiments suggest that SICA enables users to find subjectively more interesting representations.
Keywords
Exploratory data mining · Dimensionality reduction · Information theory · Subjective interestingness · FORSIED
1 Introduction
The amount of information in high-dimensional data makes it impossible to interpret such data directly. However, the data can be analyzed in a controlled manner, by revealing particular perspectives of the data (lower-dimensional data representations), one at a time. This is often done by projecting the data from the original feature space into a lower-dimensional subspace. Hence, such lower-dimensional representations of a dataset are also called data projections, and they are computed by a dimensionality reduction (DR) method.
DR methods are widely used for a number of purposes. The most prominent are data compression, feature construction, regularization in prediction problems, and exploratory data analysis. The most widely known DR technique, Principal Component Analysis (PCA) (Pearson 1901), is used for each of these purposes (Bishop 2006), since it is computationally efficient and, more importantly, because large variance is often associated with structure, while noise often has smaller variance.
Other DR methods include linear methods such as Multidimensional Scaling (Kruskal and Wish 1978), Independent Component Analysis (Hyvärinen et al. 2004) and Canonical Correlations Analysis (Hotelling 1936), and nonlinear techniques such as ISOMAP (Tenenbaum et al. 2000), Locality Preserving Projections (He and Niyogi 2004), and Laplacian-regularized models (Weinberger et al. 2006). The aforementioned methods all have objective score functions whose optimization yields the lower-dimensional representation, and they do not involve human users directly. Hence, we argue that these methods may well be suitable for, e.g., compression or regularization, but not optimal for providing the most insight.
In exploratory data analysis, data is often visualized along the dimensions given by a DR method. Humans are unmatched in spotting visual patterns but inefficient at crunching numbers. Hence, visualizing high-dimensional data in a human-perceivable, computer-generated 2D/3D space can efficiently help users to understand different perspectives of the data (Puolamäki et al. 2010). However, since different human operators have different prior knowledge and interests, they are unlikely to be equally interested in the same aspect of the data. For instance, PCA might be applied to obtain an impression of the spread of the data. But for many users, the structure in the data with the largest variance may not be relevant at all.
To address this issue, Projection Pursuit (PP) (Friedman and Tukey 1974) was proposed, which finds data projections according to a certain interestingness measure (IM), each designed with a specific goal in mind. With the ability to choose between different IMs, PP balances computational efficiency and applicability. However, because there are many analysis tasks and users, a great variety of IMs is required, which has led to an explosion in their number. Hence, unlike for DR used in a specific analysis task or predictive model, it seems conceptually challenging to define a generic quality metric for DR in exploratory data analysis. This is precisely the focus of this paper.
In this paper we present Subjectively Interesting Component Analysis (SICA), a dimensionality reduction method that finds subjectively interesting data projections, i.e., projections that aim to be interesting to a particular user. To do so, SICA relies on quantifying how interesting a data projection is to the user. This quantification is based on information theory and follows the principles of FORSIED (De Bie 2011). Here we discuss the central idea of FORSIED; more detail follows in Sect. 2.
FORSIED is a data mining framework for quantifying the subjective interestingness of patterns. The central idea is that a user’s belief state about the dataset is modelled as a Maximum Entropy (MaxEnt) probability distribution over the space of possible datasets. This probability distribution is called the background distribution and is updated as the analysis progresses, based on user interaction and the patterns in the data provided to the user. One can quantify the probability that a given pattern is present in data that is randomly drawn from the background distribution. Clearly, the smaller this probability, the more surprising the pattern is, and the more information it conveys to the user. More specifically, in FORSIED, the self-information of the pattern, defined as minus the logarithm of that probability, is proposed as a suitable measure of how informative the pattern is given the belief state.
In this paper, we define a pattern syntax called projection patterns for data projections that is compatible with FORSIED. By following FORSIED’s principles, we can quantify the probability of a projection given the user’s belief state. The lower the probability, the more surprising and interesting the pattern is, since surprising information about the data is typically what is truly interesting (Hand et al. 2001). Because this surprisal is evaluated with respect to the belief state, SICA can evaluate the subjective interestingness of projection patterns with respect to a particular user.

we define projection patterns, a pattern syntax for data projections (Sect. 2);

we derive a measure that quantifies the subjective interestingness of projection patterns (Sect. 2);

we propose a method that finds the most subjectively interesting projections in terms of an optimization problem (Sect. 2);

we define three types of prior beliefs a user may hold (Sect. 3);

we demonstrate that with different prior belief types, SICA is able to (approximately/exactly) find the subjectively most interesting patterns. In particular, for some prior belief types, the subjective interestingness can be efficiently optimized by solving an eigenvalue problem (Sect. 3);

we present three case studies and investigate the practical advantages and drawbacks of our method, which show that it can be meaningful to account for available prior knowledge about the data (Sect. 4).
2 Subjectively interesting component analysis
SICA allows one to find data projections that reveal unexpected variation in the data. In this section, we introduce the ingredients needed to achieve this. Namely, we (a) define an interestingness measure (IM) that quantifies the amount of information a projection conveys to a particular user, and (b) based on this IM, find interesting data projections for the user. In Sect. 3, we then develop SICA further for various types of prior beliefs.
2.1 Notation
We use upper case bold face letters to denote matrices, lower case bold face letters for vectors, and normal lower case letters for scalars. We denote a d-dimensional real-valued dataset as \({\hat{{\mathbf {X}}}}\triangleq ({\hat{{\mathbf {x}}}}_1', {\hat{{\mathbf {x}}}}_2', \ldots , {\hat{{\mathbf {x}}}}_n')' \in {\mathbb {R}}^{n\times d}\), and the corresponding random variable as \({\mathbf {X}}\). We will refer to \({\mathbb {R}}^{n\times d}\), the space the data is known to belong to, as the data space. Dimensionality reduction methods search for weight vectors \({\mathbf {w}}\in {\mathbb {R}}^d\) of unit norm (i.e. \({\mathbf {w}}'{\mathbf {w}}=1\)) onto which the data is projected by computing \({\hat{{\mathbf {X}}}}{\mathbf {w}}\). If k vectors are sought, they are stored as columns of a matrix \({\mathbf {W}}\in {\mathbb {R}}^{d\times k}\). We will denote the projections of a dataset \({\hat{{\mathbf {X}}}}\) onto the column vectors of \({\mathbf {W}}\) as \({\hat{\varvec{\Pi }}}_{\mathbf {W}}\in {\mathbb {R}}^{n\times k}\), or formally: \({\hat{\varvec{\Pi }}}_{\mathbf {W}}\triangleq {\hat{{\mathbf {X}}}}{\mathbf {W}}\), and analogously for the random variable counterpart \(\varvec{\Pi }_{\mathbf {W}}\triangleq {\mathbf {X}}{\mathbf {W}}\). We will write \({\mathbf {I}}\) to denote the identity matrix of appropriate dimensions, and \({\mathbf {1}}_{n\times d}\) (or \({\mathbf {1}}\) for short if the dimensions are clear from the context) to denote an n-by-d matrix with all elements \({\mathbf {1}}_{ij} = 1\). We define the matrix interval with lower bound \({\mathbf {B}}\) and upper bound \({\mathbf {C}}\), denoted \({\mathbf {A}}_{n\times m} \in [{\mathbf {B}}_{n\times m}, {\mathbf {C}}_{n\times m}]\), to mean that \(a_{i,j}\in [b_{i,j}, c_{i,j}]\) for every \((i, j) \in \{1,2,\ldots ,n\}\times \{1,2,\ldots ,m\}\).
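For concreteness, the notation above can be illustrated with a short NumPy sketch (illustrative only; the data and weight vectors here are randomly generated):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, k = 100, 5, 2

X_hat = rng.standard_normal((n, d))     # dataset X^ in R^{n x d}

# k weight vectors of unit norm, stored as the columns of W in R^{d x k}
W = rng.standard_normal((d, k))
W /= np.linalg.norm(W, axis=0)          # normalize each column to unit norm

Pi_hat = X_hat @ W                      # projections Pi^_W = X^ W in R^{n x k}

assert Pi_hat.shape == (n, k)
assert np.allclose(np.linalg.norm(W, axis=0), 1.0)
```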
2.2 Subjective interestingness measure for projections
We now derive an IM for SICA following the framework for subjective interestingness measures (FORSIED) (De Bie 2011, 2013). FORSIED is a data mining framework that specifies on an abstract level how to model a user’s belief state about a given dataset, and how to quantify the informativeness of patterns with respect to a particular user. It works as follows: in order to measure the subjective interestingness of projections, SICA needs to maintain a model of the user’s belief state. In addition, SICA should be able to describe data projections in a pattern syntax compatible with FORSIED. We discuss both these issues in turn below.
2.2.1 Modeling the user’s belief state
We formalize a user’s belief state as a probability distribution over the data space (De Bie 2011):
Definition 1
(Background distribution) The background distribution is a distribution over the data space \({\mathbb {R}}^{n\times d}\) that represents the user’s belief state: the probability it assigns to any measurable subset of \({\mathbb {R}}^{n\times d}\) corresponds to the probability that the user would ascribe to the data \({\hat{{\mathbf {X}}}}\) belonging to that subset. The background distribution can be represented by a probability density function \(p_{\mathbf {X}}:{\mathbb {R}}^{n\times d}\rightarrow {\mathbb {R}}^+\).
For brevity, and slightly abusively, we will often refer to the density function \(p_{\mathbf {X}}\) as the background distribution.
Of course, the background distribution is typically not known to the data mining system. Thus, it has to be inferred from limited information provided by the user. De Bie (2013) proposed an intuitive yet mathematically rigorous language a user can employ to express certain beliefs about the data. The language assumes that important characteristics of the data can be quantified by means of statistics \(f: {\mathbb {R}}^{n\times d} \rightarrow {\mathbb {R}}\). Using such statistics, the user can express their beliefs by declaring which value they expect f to have when evaluated on the data. Mathematically, this then becomes a constraint on the background distribution \(p_{\mathbf {X}}\).
Definition 2
Except in degenerate cases, such constraints will not uniquely determine \(p_{\mathbf {X}}\), such that an additional criterion is required to decide which one to use. Amongst those satisfying the prior belief constraints, the distribution with the maximum entropy (MaxEnt) is an attractive choice, given its unbiasedness and robustness. Further, as the resulting distribution belongs to the exponential family, its inference is well understood and often computationally tractable.
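To recall this standard result (stated here in our own notation, not quoting the paper's numbered equations): constraining the expected values of statistics and maximizing entropy yields an exponential-family background distribution.

```latex
% MaxEnt background distribution subject to expectation constraints
% (c_i denotes the value the user expects statistic f_i to take)
\max_{p}\ -\int p(\mathbf{X})\,\log p(\mathbf{X})\,\mathrm{d}\mathbf{X}
\quad\text{s.t.}\quad
\mathbb{E}_{p}\!\left[f_i(\mathbf{X})\right] = c_i \ \ \forall i,
\qquad \int p(\mathbf{X})\,\mathrm{d}\mathbf{X} = 1.
% The optimum is an exponential-family distribution, with Lagrange
% multipliers \lambda_i chosen so that the constraints are satisfied:
p_{\mathbf{X}}(\mathbf{X}) \propto \exp\Big(\sum_i \lambda_i\, f_i(\mathbf{X})\Big).
```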
2.2.2 Projection patterns: a pattern syntax for data projections
In FORSIED,^{1} a pattern is defined as any information that restricts the set of possible values the data may have. For example, if the user is shown a scatter plot of the projections in \({\hat{\varvec{\Pi }}}_{\mathbf {W}}\), the user will from then on know that \({\hat{{\mathbf {X}}}}{\mathbf {W}}\) is equal to \({\hat{\varvec{\Pi }}}_{\mathbf {W}}\) (up to the resolution of the plot), which clearly constrains the set of possible values of the data to a subset of \({\mathbb {R}}^{n\times d}\).
One could thus be tempted to define a projection pattern as a statement of the kind \({\hat{{\mathbf {X}}}}{\mathbf {W}}= {\hat{\varvec{\Pi }}}_{\mathbf {W}}\). This would tell the user that the projections of the data \({\hat{{\mathbf {X}}}}\) onto the columns of \({\mathbf {W}}\) are found to be equal to the columns of \({\hat{\varvec{\Pi }}}_{\mathbf {W}}\).
However, real-valued data projections are often conveyed visually to a user, and in any case with finite accuracy, e.g. by means of a scatter plot. Because human eyes as well as visualization devices (e.g., monitors, projectors, and paper) have finite resolution, the precise value of the projected data can only be determined up to a certain resolution-dependent uncertainty \(2\varDelta {\mathbf {1}}\in {\mathbb {R}}^{n\times k}\). With these considerations^{2}, we formally define the syntax of a projection pattern as follows:
Definition 3
Thus, the projection pattern specifies, up to an accuracy of \(2{\varDelta }\), the value of the projections of the data onto the columns of the projection matrix \({\mathbf {W}}\).
2.2.3 Subjective interestingness of projections
Relying on the background distribution, we can now quantify the subjective interestingness of a projection pattern:
Definition 4
2.3 Searching subjectively interesting projection patterns
Searching for subjectively interesting projection patterns amounts to finding a set of weight vectors \({\mathbf {W}}\in {\mathbb {R}}^{d\times k}\) that yield projections with the largest SI value. The resulting weight vectors \({\mathbf {W}}\) linearly transform the original d features of the data \({\hat{{\mathbf {X}}}}\) into k features. Similar to the definition of the (principal) components in PCA, we refer to those k transformed features as the subjectively interesting components (SICs) of the data \({\hat{{\mathbf {X}}}}\).
It is this problem that we will be solving in Sect. 3 for a number of different types of background distributions.
3 SICA with different types of prior beliefs
In this section, we develop SICA further for three different types of prior beliefs. Each is discussed in a separate subsection. In Sect. 3.4, we discuss how SICA can in principle be used for other prior belief types as well, while also highlighting the difficulties in tackling other prior belief types that may limit the applicability of SICA in practice.
3.1 Scale of the data as prior belief
When the user only has a prior belief about the average variance of a dataset, SICA will aim to find projections with large variances. As we will show here, SICA with such a prior is equivalent to PCA.
Theorem 1
Proof
Therefore, the MaxEnt background distribution is a multivariate normal distribution with independent components, with zero mean and covariance matrix \(\sigma ^2{\mathbf {I}}\), i.e., \({\mathcal {N}}\left( {\mathbf {0}}, \sigma ^2{\mathbf {I}}\right) \). \(\square \)
Subjectively interesting patterns Now we can search for subjectively interesting patterns by solving problem (9). This requires first computing the distribution \(p_{\varvec{\Pi }_{\mathbf {W}}}\) as the marginal of the background distribution (16).
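To make the PCA equivalence concrete, here is a small NumPy sketch (our own illustration, not code from the paper): under the \({\mathcal {N}}({\mathbf {0}}, \sigma^2{\mathbf {I}})\) background distribution, the most surprising unit-norm direction is the one maximizing the empirical variance \({\mathbf {w}}'{\hat{{\mathbf {X}}}}'{\hat{{\mathbf {X}}}}{\mathbf {w}}\), i.e., the top eigenvector of \({\hat{{\mathbf {X}}}}'{\hat{{\mathbf {X}}}}\), which is exactly the first principal component.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.standard_normal((500, 4)) @ np.diag([3.0, 2.0, 1.0, 0.5])
X -= X.mean(axis=0)                    # center, matching the zero-mean prior

# The most surprising unit-norm direction maximizes w' X'X w, i.e. it is
# the top eigenvector of X'X -- the first principal component.
eigvals, eigvecs = np.linalg.eigh(X.T @ X)   # ascending eigenvalues
w_top = eigvecs[:, -1]                       # eigenvector of largest eigenvalue

# sanity check: no random unit vector achieves a larger projected variance
var_top = w_top @ X.T @ X @ w_top
for _ in range(100):
    w = rng.standard_normal(4)
    w /= np.linalg.norm(w)
    assert w @ X.T @ X @ w <= var_top + 1e-9
```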
3.2 t-PCA: magnitude of spread as prior belief
In contrast to believing the data has a certain scale, a user might expect that the data has a certain magnitude of spread. In this subsection, we show that with such a prior expectation, SICA yields an alternative result that turns out to be more robust against outliers.
Theorem 2
Proof
We restate Theorem 2.1 and the derivation of Eq. 2.12 from the paper by Zografos (1999). From these, the proof immediately follows.
Remark 1
Note that for \(\rho , \nu \rightarrow \infty \) with \(\frac{\rho }{\nu } \rightarrow \sigma ^2\), this density function tends to the multivariate normal density function with mean \({\mathbf {0}}\) and covariance \(\sigma ^2 {\mathbf {I}}\). For \(\rho = \nu = 1\) it is a multivariate standard Cauchy distribution, which is so heavy-tailed that its mean is undefined and its second moment is infinite. Thus, this type of prior belief can model the expectation of outliers to varying degrees.
Given the reliance on a multivariate t-distribution as the background distribution, we will refer to this model as t-PCA.
Remark 2
We solve problem (26) approximately. Observe that the orthonormality constraint on \({\mathbf {W}}\) makes problem (26) an optimization problem over a Stiefel manifold (Onishchik 2011). This can be addressed fairly efficiently with standard toolboxes; we use the Manopt toolbox (Boumal et al. 2014) to obtain an approximate solution.
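The flavor of such manifold optimization can be sketched in NumPy as a naive gradient ascent with QR retraction; this is a minimal stand-in for a dedicated solver like Manopt, demonstrated on a toy trace objective rather than the actual t-PCA objective (26):

```python
import numpy as np

def stiefel_ascent(grad_f, d, k, iters=500, step=0.01, seed=0):
    """Naive gradient ascent on the Stiefel manifold {W : W'W = I},
    using a QR-based retraction to restore orthonormality after each step."""
    rng = np.random.default_rng(seed)
    W, _ = np.linalg.qr(rng.standard_normal((d, k)))
    for _ in range(iters):
        G = grad_f(W)
        # project the Euclidean gradient onto the tangent space at W
        G_tan = G - W @ (W.T @ G + G.T @ W) / 2
        W, _ = np.linalg.qr(W + step * G_tan)   # retract back onto the manifold
    return W

# toy objective: maximize trace(W' A W) for a symmetric matrix A;
# the optimum spans the top-k eigenvectors of A (optimal value 5 + 3 = 8)
A = np.diag([5.0, 3.0, 1.0, 0.5])
W = stiefel_ascent(lambda W: 2 * A @ W, d=4, k=2)
assert np.allclose(W.T @ W, np.eye(2), atol=1e-8)
assert np.trace(W.T @ A @ W) > 7.5
```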
Remark 3
The parameter \(\rho \) in constraint (20) can be set freely by the user according to her prior belief. Namely, if the user feels confident about the average squared norm of the data points, a large \(\rho \) should be used, but if the user feels confident only about the order of magnitude of the norms of the data points, a small \(\rho \) should be used. The next example illustrates the effect of different choices of \(\rho \).
Example
As an illustrative example, we compare PCA and SICA on synthetic data. We generated a dataset consisting of two populations with different covariance structures: 1000 data points sampled from \({\mathcal {N}}\left( {\mathbf {0}},\begin{pmatrix} 4 & 0 \\ 0 & 1 \end{pmatrix}\right) \), and 10 ‘outliers’ from \({\mathcal {N}}\left( {\mathbf {0}},\begin{pmatrix} 16 & 12 \\ 12 & 13 \end{pmatrix}\right) \), i.e., \({\hat{{\mathbf {X}}}}\in {\mathbb {R}}^{1010\times 2}\). After sampling, the data is centered. Figure 1a shows the first components resulting from PCA, SICA, and PCA had there been no outliers. The PCA result is determined primarily by the outliers. The right plot (Fig. 1b) shows the components on top of a scatter plot without the 10 outliers, illustrating that SICA is hardly affected by outliers. That is, the lower \(\rho \), the more the user’s belief allows for the existence of outliers, and hence SICA shows the projection with fewer outliers as additional information. By varying the parameter \(\rho \) (\(\rho = 10, 100, 1000\)), the resulting projection interpolates between PCA on the full data and PCA on the data with outliers removed.
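The synthetic setup above can be reproduced in a few lines of NumPy (the random seed and the optimality check are our additions):

```python
import numpy as np

rng = np.random.default_rng(42)

# bulk: 1000 points with diagonal covariance diag(4, 1) -> dominant axis e1
bulk = rng.multivariate_normal([0, 0], [[4, 0], [0, 1]], size=1000)
# 10 'outliers' with a much larger, correlated covariance
outliers = rng.multivariate_normal([0, 0], [[16, 12], [12, 13]], size=10)

X = np.vstack([bulk, outliers])
X -= X.mean(axis=0)                     # center, as in the paper

def first_pc(X):
    _, vecs = np.linalg.eigh(X.T @ X)
    return vecs[:, -1]                  # eigenvector of the largest eigenvalue

pc_all = first_pc(X)                                     # tilted by the outliers
pc_clean = first_pc(X[:1000] - X[:1000].mean(axis=0))    # roughly (±1, 0)

# pc_all maximizes the variance over the full data, including the outliers
assert pc_all @ X.T @ X @ pc_all >= pc_clean @ X.T @ X @ pc_clean - 1e-6
```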
3.3 Pairwise data point similarities as prior beliefs
In SICA, users may specify not only global characteristics of the data, such as the expected magnitude of spread, but also expectations about local characteristics, such as similarities between data points.
Prior belief Assume the user believes that a data point is similar to another point or group of points. She may then want to discover other structure in the data, in addition to the known similarities. Generally speaking, the most interesting or surprising information would be a pattern that contrasts with the known similarities. For example, consider a user interested in social network analysis, and more specifically, in finding social groups that share certain properties. Suppose the user has already studied the network structure to some degree; it would then be more interesting for her to learn about other properties shared by different social groups, ‘other’ meaning properties not aligned with the network structure.
Theorem 3
The proof, provided below, makes clear that the values of \(\lambda _1\) and \(\lambda _2\) depend on the values of b and c in the constraints, and can be found by solving a very simple convex optimization problem:
Proof
Remark 4
Remark 5
In order to determine suitable values for b and c in the prior belief constraints, SICA may assume that the user already has a good understanding of the pointwise similarity (Eq. 28) and scale (Eq. 29) of the data points (or, that the user is not interested in these). Given this assumption, b and c can simply be set equal to the empirical value of these statistics as measured in the data. If the user wishes, she could of course specify values herself that differ from these. More realistically though, she may be able to specify a range of values for the pointwise similarity and scale. The background distribution should then be found as the MaxEnt distribution subject to two box constraints, i.e., four inequality constraints: a lower and an upper bound for the pairwise similarity as well as for the scale measure. Theorem 3 still applies unaltered though: while the four inequality constraints lead to four Lagrange multipliers, only two can be nonzero at the optimum (one for each box constraint), since for each box constraint only the upper or the lower bound can be tight.
The computational complexity of finding an optimal projection \({\mathbf {W}}\) consists of two parts: (1) solving a convex optimization problem to obtain the background distribution. This can be achieved by applying, e.g., a steepest descent method, which uses at most \({\mathcal {O}}(\varepsilon ^{-2})\) steps (until the norm of the gradient is \(\le \varepsilon \)) (Nesterov 2013). Each step has complexity \({\mathcal {O}}(n)\), with n the size of the data. (2) Given the background distribution, finding an optimal projection, the complexity of which is dominated by an eigenvalue decomposition (\({\mathcal {O}}(n^3)\)). Hence, the overall complexity of SICA with a graph prior is \({\mathcal {O}}(\frac{n}{\varepsilon ^2} + n^3)\).
Example
We synthesized a dataset with 100 users, where each user is described by 10 attributes, i.e., \({\hat{{\mathbf {X}}}}\in {\mathbb {R}}^{100\times 10}\). The first attribute is generated from a bimodal Gaussian distribution such that it clearly separates the users into two groups. We assume that people within each community are fully connected. To make the simulation more interesting, we also insert a few connections between the communities. The second attribute value is uniformly drawn from \(\{-1, +1\}\), which could resemble, e.g., people’s sentiment towards a certain topic. The remaining eight attributes are standard Gaussian noise. After sampling, we centered the data.
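A sketch of this data-generating process follows; the group separation of ±5 and the specific bridging edges are our illustrative choices, not taken from the paper:

```python
import numpy as np

rng = np.random.default_rng(7)
n = 100

# attribute 1: bimodal Gaussian -> two clearly separated communities
labels = np.repeat([0, 1], n // 2)
attr1 = rng.normal(loc=np.where(labels == 0, -5.0, 5.0), scale=1.0)

# attribute 2: uniform in {-1, +1}, e.g. sentiment towards a topic
attr2 = rng.choice([-1.0, 1.0], size=n)

# attributes 3..10: standard Gaussian noise
noise = rng.standard_normal((n, 8))

X = np.column_stack([attr1, attr2, noise])
X -= X.mean(axis=0)                       # center after sampling

# adjacency of the prior graph: two cliques plus a few bridging edges
A = (labels[:, None] == labels[None, :]).astype(float)
np.fill_diagonal(A, 0.0)
for i, j in [(0, 99), (1, 98), (2, 97)]:  # hypothetical bridge edges
    A[i, j] = A[j, i] = 1.0
```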
We assume the user has studied the observed connections between all data points. Hence, the graph-encoded prior expectation is chosen as the actual network structure; i.e., the resulting prior graph consists of the two cliques and a few edges in between, see Fig. 2a.
Table 1 Communities data (Sect. 3.3), weights of the first component for PCA and SICA

                     Feature 1  Feature 2  \(\cdots \)
PCA 1st component    −0.998     0.015      \(\cdots \)
SICA 1st component   0.186      0.957      \(\cdots \)
Table 1 lists the weight vectors of the projections. As expected, PCA gives a large weight to the first feature, which has higher variance. However, SICA’s first component is dominated by the second feature. Hence, by incorporating the community structure as prior expectation, SICA finds an alternative structure corresponding to the second feature.
3.4 Discussion: potential and limitations of SICA
Potential of SICA The three instantiations of SICA discussed in this section illustrate SICA’s potential to take into account the prior beliefs of a data analyst, and to find projections that are interesting with respect to them. The three steps needed to instantiate SICA are always the same: (1) express the prior belief in the form of constraints on the expected values of certain specified statistics (i.e., in the form of Eq. (1)) and solve the MaxEnt problem (9) to obtain the background distribution; (2) compute the marginal density function of the background distribution for the data projection onto a projection matrix \({\mathbf {W}}\); and (3) come up with a good optimization strategy. In principle, any data analyst able to express their prior beliefs in the required form can thus benefit from this approach.
Limitations of SICA Yet, each of these steps also implies important limitations of SICA that should be the subject of further work. The result of the first step will always be an exponential family distribution, and hence have an analytical form. However, expressing prior belief types in the required form will often be beyond the capabilities of a data analyst. The second step may also require considerable mathematical expertise. Indeed, it may not be possible to express the marginal distribution in analytical form, so that it may need to be approximated. And even when it can be expressed analytically, deriving it may be nontrivial. Finally, owing to the orthonormality assumption on the projection matrix, general-purpose (Stiefel) manifold optimization solvers are in principle applicable, but they do not provide optimality guarantees.
SICA in practice For these reasons, SICA as a framework is not directly suitable for use by practitioners. Instead, it can be used by researchers to develop specific instantiations of sufficiently broad applicability, which can then be made available to practitioners. Probably the most powerful example of this is the third instantiation (Sect. 3.3). Indeed, it is a very generic prior belief type for which an efficient algorithm exists, and which is relatively easy to use.
4 Experiments
In this section, we present several case studies that demonstrate how SICA may help users to explore various types of real-world data. For every case, we first specify some background knowledge a user might have, and then encode that knowledge using the previously defined expressions. The encoded beliefs are provided to SICA in the form of the background distribution. Then, we analyze the projections computed by SICA and evaluate whether they are indeed interesting with respect to the assumed user prior. Finally, we summarize the runtime of all experiments presented in this section.
Note that the purpose of our experiments is not to demonstrate the superiority of SICA over existing methods for dimensionality reduction. Instead, we aim to investigate whether and to what extent SICA’s results usefully depend on the various prior beliefs, by highlighting information that is complementary to them. Where the answer to this question is positive, SICA is the method of choice, assuming of course that the prior beliefs are well-specified.
4.1 t-PCA on real-world data
Setup We evaluate the use of SICA with a spread prior (t-PCA) on two datasets. The Shuttle^{6} data describes radiator positions in a NASA space shuttle (seven position classes: Rad Flow, Fpv Close, Fpv Open, High, Bypass, Bpv Close, and Bpv Open) and consists of 58000 data points and 9 integer attributes, i.e., \({\hat{{\mathbf {X}}}}\in {\mathbb {R}}^{58000\times 9}\). The 20 Newsgroups^{7} data describes four newsgroups (four classes) and has 16242 points and 100 integer attributes, i.e., \({\hat{{\mathbf {X}}}}\in {\mathbb {R}}^{16242\times 100}\). Both datasets are centered such that each attribute has zero mean.
Both datasets contain complex structures. In particular, the Shuttle dataset has a highly imbalanced cluster structure: one of the classes comprises \(80\%\) of the population. For both datasets, we assume the user has a prior belief only about the order of magnitude of the data, i.e., the user would not be surprised by the presence of outliers. This can be encoded using the spread prior with a small \(\rho \), e.g., \(\rho = 10^{-5}\cdot \left( \frac{1}{n}\sum _{i=1}^{n}\Vert {\hat{{\mathbf {x}}}}_i\Vert _2\right) ^{\frac{1}{2}}\).
For the 20 Newsgroups dataset, SICA’s result is qualitatively similar to PCA’s, although the variance of the SICA projection is slightly lower, arguably in favor of making the more fine-grained variation in the data more apparent. FastICA’s result, however, is qualitatively different: it puts all weight on a single binary attribute, such that its top components project all data points onto just three points.
4.2 Images and lighting, with a graph prior
Setup We now apply SICA to explore image data. The Extended Yale Face Database B^{9} contains frontal images of 38 human subjects under 64 illumination conditions; see Fig. 4 for examples. We ignored the images of seven subjects whose illumination conditions are not fully specified. The input dataset then contains 1684 data points, each described by 1024 real-valued features, i.e., \({\hat{{\mathbf {X}}}}\in {\mathbb {R}}^{1684\times 1024}\). The data is then centered to have zero mean. The task of decomposing images in order to account for a number of prespecified factors has been addressed in the past (e.g., using an N-mode SVD; Vasilescu and Terzopoulos 2002). Here we want to explore how SICA’s weight vectors change according to the prior belief of a specific user.
Let us assume that the user already knows there are different lighting conditions and is not interested in them. We can encode such knowledge by declaring that images (data points) with the same lighting condition are similar to each other. This can be expressed as a pointwise similarity prior. We construct a graph where each image is a node and two nodes are connected by an edge if the corresponding images have the same lighting condition. The resulting prior graph consists of 64 cliques, one for each lighting condition.
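The construction of such a clique-structured prior graph, and its Laplacian, can be sketched as follows (with much smaller, illustrative sizes than the actual 64 lighting conditions):

```python
import numpy as np

# lighting-condition label for each image (illustrative sizes, not the
# actual Yale B metadata): 4 conditions with 5 images each
conditions = np.repeat(np.arange(4), 5)

# two images are connected iff they share a lighting condition -> cliques
A = (conditions[:, None] == conditions[None, :]).astype(float)
np.fill_diagonal(A, 0.0)

L = np.diag(A.sum(axis=1)) - A        # graph Laplacian of the prior graph

# each clique (connected component) contributes one zero eigenvalue of L
assert np.linalg.matrix_rank(L) == len(conditions) - 4
```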
Conversely, since SICA with the stated prior beliefs should yield a projection that highlights information complementary to the lighting conditions, one can expect the top SICs to perform worse than the top PCs in separating the different lighting conditions. To evaluate this, instead of classifying subjects, we used kNN to classify the illumination conditions, using the same PCs and SICs as before, i.e., the SICs obtained after explicitly telling SICA we were not interested in lighting variation. Figure 7b shows that PCA indeed gives better 3NN classification accuracy than SICA. The result obtained with an SVM (Fig. 8b) confirms this with another classifier.
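The evaluation protocol can be mimicked with a plain k-nearest-neighbour classifier; this is a self-contained sketch on stand-in data, whereas the real experiment uses the PC/SIC projections of the face images:

```python
import numpy as np

def knn_predict(Z_train, y_train, Z_test, k=3):
    """Plain k-nearest-neighbour classification by majority vote, as used
    to check how well the top components separate the lighting conditions."""
    d2 = ((Z_test[:, None, :] - Z_train[None, :, :]) ** 2).sum(-1)
    nn = np.argsort(d2, axis=1)[:, :k]       # indices of the k nearest points
    votes = y_train[nn]
    return np.array([np.bincount(v).argmax() for v in votes])

# stand-in data: two well-separated 'lighting conditions' in a 2-D
# component space (purely illustrative)
rng = np.random.default_rng(3)
Z = np.vstack([rng.standard_normal((50, 2)) + [5, 0],
               rng.standard_normal((50, 2)) - [5, 0]])
y = np.array([0] * 50 + [1] * 50)

pred = knn_predict(Z, y, Z, k=3)
accuracy = (pred == y).mean()
assert accuracy > 0.95    # well-separated classes -> near-perfect 3-NN accuracy
```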
4.3 Spatial socioeconomy, with a graph prior
Now we use SICA to analyze a socioeconomic dataset. The German socioeconomic data (Boley et al. 2013) was compiled from the database of the German Federal Statistical Office. The dataset consists of socioeconomic records of 412 administrative districts in Germany. The data features used in this case study fall into two groups: election vote counts and age demographics. We additionally coded for each district the geographic coordinates of the district center and which districts share a border with each other.
4.3.1 Vote attribute group
Setup Let us assume a user is interested in exploring the voting behavior of different districts in Germany. The (real-valued) data attributes about the 2009 German elections cover the percentage of votes for the five largest political parties^{10}: CDU/CSU, SPD, FDP, GREEN, and LEFT. Thus, we have a dataset \({\hat{{\mathbf {X}}}}\in {\mathbb {R}}^{412\times 5}\). We centered the data attribute-wise by subtracting from each attribute its mean.
Results The projection onto the first PC (Fig. 9a) shows smooth variation across the map. Districts in western Germany and Bavaria (south) receive high scores (red circles) and districts in East Germany (Brandenburg and Thuringia) receive low scores (dark blue circles). Table 2 additionally shows the weight vectors of the top PC and SIC. The PC is dominated by the difference between CDU/CSU and Left. This is expected, because this is indeed the primary division in the elections: East Germany votes more for Left, while in Bavaria, CSU is very popular.
However, SICA highlights a different pattern: the competition between CDU/CSU and SPD is more local. Although there is still considerable global variation (in this case between the south and the north), we also observe that the Ruhr area (Dortmund and its surroundings) is similar to East Germany in that the social democrats are preferred over the Christian parties. Arguably, the districts where this happens are those with a large working-class population, like the Ruhr area. Perhaps they vote more for parties that put more emphasis on the interests of the less wealthy part of the population.
Table 2 German socioeconomic data, vote attributes (Sect. 4.3.1): weights of the top PCA and SICA components

           CDU/CSU   SPD     FDP    GREEN   Left
PCA 1st    0.53     -0.13    0.22    0.13   -0.80
SICA 1st   0.72     -0.65    0.10   -0.09   -0.19
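To make the table concrete: a district's score on a component is the inner product of its (attribute-wise centered) vote-share vector with the component's weight vector. A sketch using the Table 2 weights; the three districts' vote shares are made up for illustration.

```python
import numpy as np

# Weight vectors from Table 2 (attributes: CDU/CSU, SPD, FDP, GREEN, Left)
w_pca  = np.array([0.53, -0.13, 0.22, 0.13, -0.80])
w_sica = np.array([0.72, -0.65, 0.10, -0.09, -0.19])

def component_scores(X, w):
    """Score of each district: inner product of its centered vote-share
    vector with a component's weight vector (centering as in Sect. 4.3.1)."""
    Xc = X - X.mean(axis=0)   # attribute-wise centering
    return Xc @ w

# hypothetical vote percentages for three districts
X = np.array([[40.0, 25.0, 10.0, 10.0, 15.0],
              [30.0, 35.0, 10.0, 15.0, 10.0],
              [35.0, 20.0, 15.0, 10.0, 20.0]])
scores = component_scores(X, w_sica)
```

Districts with a high score lean toward CDU/CSU relative to SPD, which is the contrast the first SIC emphasizes.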
4.3.2 Demographic attribute group
Table 3 German socioeconomic data, age demographics (Sect. 4.3.2): weights of the top PCA and SICA components

        Elderly   Old     Midage   Young   Child
PCA     -0.61    -0.42    0.43     0.09    0.51
SICA    -0.62    -0.32    0.69     0.19    0.06
Results Projection onto the top PC (Fig. 11a) confirms the user’s prior expectations: there is a substantial difference between East and West Germany. In the visualization, high projection values (red) appear mostly in East Germany, while low values (blue) appear mostly in the rest of Germany. Looking at the weights of the top PC (Table 3), we find that the projection assigns large negative weights to people above 44 (Old and Elderly) and large positive weights to the younger population (age < 45). This confirms that the demographic composition of East Germany indeed deviates from that of the rest of the country.
SICA results in a different projection (Fig. 11b), even though the difference is more subtle than in the analysis of the voting behavior. Although SICA also assigns large negative scores to East Germany, presumably because relatively many elderly people live there, it additionally highlights the large cities, e.g., Munich, Cologne, Frankfurt, Hamburg, Kiel, and Trier. Besides showing a smooth geographic East–West trend, SICA thus also highlights districts whose demographic status deviates from that of their surrounding districts. Indeed, from the weight vector (Table 3) we see that these districts are found by weighing the number of middle-aged people against the number of elderly. We know that many middle-aged (24–44) working people live in large cities, and, according to a report by the Berlin Institute for Population and Development, “large cities generally have fewer children, since they offer families too little room for development”. Indeed, we find that families live in the neighboring districts, highlighting a perhaps less expected local contrast.
To investigate this more quantitatively, we applied an SVM to classify the eastern versus non-eastern districts using the projected demographic attributes. Figure 12a shows that the top two PCA components result in a slightly smaller loss than SICA’s, indicating that the top PCs and SICs both correspond to the division between eastern and non-eastern districts. The similarity matrix of the PCA and SICA components (Fig. 12b) also shows that the first and second components of the two methods are very similar. However, in the visualization, the districts scored highest by SICA (Fig. 11b) highlight the large cities more than the PCA result (Fig. 11a) does, and the highlighted cities stand out more from their surrounding areas.
4.4 Runtime
Table 4 summarizes the runtime of PCA and SICA in all experiments presented in this paper. In all cases, SICA takes more time to compute the projections. For the first three columns (the tPCA cases), we used the solver offered by Manopt to perform gradient descent over the Stiefel manifold. We tried ten random starts in all three cases and picked the projection with the best objective value; these ten random starts already yielded stable local optima in all three cases. Note that tPCA scales gracefully as the data size increases from the synthetic dataset (\(1010\times 2\)) to Shuttle (\(58000\times 9\)) and 20NewsGroup (\(16242\times 100\)).
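The multi-start scheme can be sketched as follows. A random point on the Stiefel manifold (an orthonormal \(d\times k\) frame) can be drawn by QR-factorizing a Gaussian matrix; Manopt's actual gradient-descent solver is not reproduced here, so the "optimizer" below is a placeholder that merely returns its starting point, and all names are ours.

```python
import numpy as np

def random_stiefel(d, k, rng):
    """Draw a random orthonormal d x k frame (a point on the Stiefel
    manifold) via QR factorization of a Gaussian matrix."""
    Q, R = np.linalg.qr(rng.standard_normal((d, k)))
    # fix column signs so the distribution does not depend on QR conventions
    return Q * np.sign(np.diag(R))

def best_of_restarts(objective, d, k, n_starts=10, seed=0):
    """Multi-start pattern: evaluate several random Stiefel starting points
    and keep the best.  In the paper's experiments, each start would first
    be refined by Manopt's gradient descent before comparison."""
    rng = np.random.default_rng(seed)
    starts = [random_stiefel(d, k, rng) for _ in range(n_starts)]
    return max(starts, key=objective)
```

With the actual tPCA objective plugged in, ten such restarts were enough to reach stable local optima in the experiments above.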
Table 4 Runtime (in seconds) of SICA and PCA for all experiments (Sect. 4.4)

        Synthetic outlier   Shuttle   20NewsGroup   Synthetic community   Socioeco. (age)   Socioeco. (vote)   Face image
SICA    0.12                1.75      8.07          0.03                  0.06              0.04               2.26
PCA     < 0.01              0.08      0.25          0.01                  < 0.01            < 0.01             0.56
5 Related work
SICA is linear, unsupervised, and subjective. Dimensionality reduction (DR) methods, as the name indicates, aim to find lower-dimensional representations of high-dimensional data. Here, “dimension” refers to the number of features used to describe the data. Finding a lower-dimensional representation boils down to either selecting a subset of the original features or transforming the feature space into another (lower-dimensional) space. We mainly discuss the line of work on feature transformation (extraction), since it is more closely related to ours.
Supervised vs. unsupervised DR methods are often designed with a certain goal: to obtain lower-dimensional representations with some specific property. For example, Principal Component Analysis (PCA) (Pearson 1901; Jolliffe 2002) is often used to compute a representation of a dataset in which the data variance is preserved, whereas Canonical Correlation Analysis (CCA) (Hotelling 1936) aims to find pairs of directions in two feature spaces along which the corresponding two datasets are maximally correlated. While PCA and CCA achieve their goals in an unsupervised manner, Linear Discriminant Analysis (LDA) (Fisher 1936) extracts discriminative features in a supervised manner, according to given class labels. The new features provided by DR methods can be used not only for later classification or prediction, but also to explore structure in the data, e.g., with the Self-Organizing Map (SOM) (Kohonen 1998) for exploratory data analysis. In order to meet different analysis goals under a unified framework, Projection Pursuit (PP) (Friedman and Tukey 1974) was proposed to locate different projections according to some predefined “interestingness index”. In contrast to these works, we seek data projections that are interesting to a particular user. SICA therefore aims to provide a generic interestingness measure that does not explicitly depend on the context of the data or on a specific analysis task.
Linear vs. nonlinear Orthogonally, when approaching these goals, DR methods assume the relationship between the original data and its lower-dimensional representation to be either linear or nonlinear. The aforementioned methods (PCA, CCA, and LDA) compute new data representations via a linear transformation. Classical Multidimensional Scaling (Kruskal and Wish 1978) also finds a linear transformation, one that preserves the distances between the data points. We refer the reader to the survey by Cunningham and Ghahramani (2015) and the references therein for a comprehensive review of linear DR techniques. In reality, however, high-dimensional data often obeys certain constraints; the data then lies on a low-dimensional (nonlinear) manifold embedded in the original feature space. Nonlinear dimensionality reduction methods like SOM approximate such a manifold by a set of linked nodes. Building upon Multidimensional Scaling, ISOMAP (Tenenbaum et al. 2000) seeks to preserve the intrinsic geometry of the data by first encoding neighborhood relations as a weighted graph. This inspired later spectral methods (Von Luxburg 2007; Ng et al. 2002) as well as various manifold learning approaches (Belkin and Niyogi 2003; He and Niyogi 2004; Weinberger et al. 2006) that solve an eigenproblem in order to discover the intrinsic manifold structure of the data, using the eigendecomposition to preserve its local properties. Note that with a graph prior, SICA computes linear projections in a spectral-method-like manner (Sect. 3.3). However, the nonlinear DR methods mentioned above are interested in the eigenvectors corresponding to the k smallest eigenvalues of the Laplacian, as these provide insight into the local structure of the underlying graph, while SICA identifies mappings that target non-smoothness with respect to the user’s beliefs about the data, while maximizing the variance of the data in the resulting subspace. Interestingly, the resulting optimization problem is not simply the opposite of that of existing approaches.
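To make the contrast concrete, here is a minimal numpy sketch (ours, not from any of the cited papers) of what those spectral methods compute: an embedding from the eigenvectors of the graph Laplacian with the smallest non-trivial eigenvalues. Such an embedding varies smoothly over the graph, precisely the kind of structure that SICA's graph prior treats as already known.

```python
import numpy as np

def laplacian_embedding(A, k):
    """Spectral embedding in the style of Laplacian eigenmaps: rows of the
    returned matrix are coordinates taken from the eigenvectors of the
    graph Laplacian L = D - A with the k smallest non-trivial eigenvalues
    (the constant eigenvector for eigenvalue 0 is skipped)."""
    D = np.diag(A.sum(axis=1))
    L = D - A
    vals, vecs = np.linalg.eigh(L)   # eigenvalues in ascending order
    return vecs[:, 1:k + 1]

# two triangles joined by a single edge: the 1-D embedding (the Fiedler
# vector) separates the two cliques
A = np.zeros((6, 6))
for i, j in [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]:
    A[i, j] = A[j, i] = 1.0
Y = laplacian_embedding(A, 1)
```

SICA with a graph prior solves a related but different eigenproblem: it looks for directions that are surprising given this smooth structure, rather than directions that reproduce it.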
Objective vs. subjective The aforementioned methods are mainly “objective” in the sense that the user is not explicitly considered. A notable exception is the work on User Intent Modeling for Information Discovery (Ruotsalo et al. 2015), where an explicit relevance model is built to help a user find information relevant to her search. Their tool also computes a 2D embedding of the search results, accounting for their user- and session-specific relevance. However, they do not introduce a new, theoretically well-motivated method to find a low-dimensional subspace that accounts for background knowledge or intent; that is also not the focus of their work, which is rather the identification of relevant results. Other techniques in exploratory data analysis take the user’s knowledge into account to determine interesting projections. For instance, Brown et al. (2012) suggest an interactive process in which the user provides feedback by moving incorrectly positioned data points to locations that reflect their understanding. In a similar manner, Paurat and Gärtner (2013) make use of semi-supervised least-squares projections, but allow the user to select and rearrange some of the embedded data points. In the work by Iwata et al. (2013), the authors use active learning to select candidate data points for the user to relocate in order to achieve the desired visualization. All of these methods, guided by the user, interactively present different aspects of the data. Finally, the work by Weinberger and Saul (2009) requires the practitioner to provide auxiliary information, e.g., a similarity graph, that identifies target neighbors for each data point and is then used to constrain their optimization problem. There, the prior knowledge is the structure one wants to preserve, whereas in SICA it is the structure the projections should contrast with.
To the best of our knowledge, SICA is the first subjective DR method, finding lower-dimensional data representations that are as interesting as possible to a particular user. Hence, SICA adds another layer to the family of dimensionality reduction methods.
6 Conclusion
In exploratory data analysis, structures in the data often have different value for different tasks and data analysts. To address this, the Projection Pursuit literature has introduced numerous projection indices that quantify the interestingness of a projection in various ways. However, it remains conceptually challenging to define a generic quality metric for exploratory data analysis tasks. As an attempt in this direction, we presented SICA, a new linear dimensionality reduction approach that explicitly embraces the subjective nature of interestingness. In this paper, we showed how a model of a user’s belief state can be used to derive a subjective interestingness measure for DR. This interestingness measure is then used to search for subjectively interesting projections of the data. Results from several case studies show that it can be meaningful to account for available prior knowledge about the data.
Avenues for further work include incorporating multiple prior expectations simultaneously (e.g., defining multiple (disjoint) groups of similar nodes using the graph prior), to enable more flexible iterative analysis; this involves solving a MaxEnt optimization problem subject to multiple constraints. We also plan to study how to improve the interpretability of the projections, e.g., by finding projections with sparse weight vectors. In terms of visualization, an interesting future direction is to investigate how the SICA result is affected by dropping the assumption that the resolution is the same across all dimensions. Although that is already possible, one question is how a user could conveniently input such expectations into the system. Another open question is to what extent SICA can be applied to nonlinear dimensionality reduction. Finally, alternative types of prior expectations are also worth examining.
Footnotes
 1.
As well as in the only other framework for interactive data mining, CORAND (Lijffijt et al. 2014). By a framework for interactive data mining we mean a generic method that can be used to design specific data mining methods that take into account results previously shown to the user or other prior knowledge about the data. Such a framework would specify certain aspects of the method while other aspects are left open and only a guideline is provided on how to fill in that part. E.g., FORSIED specifies to define the background model as a MaxEnt distribution and the objective to maximize is the Subjective Interestingness. CORAND mandates another objective score (to maximize the p value of the data) and the form of the background distribution is left open; it may be anything. As far as we know, there are no other works published with a similar spirit.
 2.
To simplify our notation, we assume the resolution parameter to be the same across all dimensions. It is an interesting direction to further develop SICA for resolutions that vary across dimensions.
 3.
In FORSIED, the subjective interestingness of a pattern is generally defined by a trade-off between the information content (i.e., the negative log-probability) of the pattern and the descriptional complexity (i.e., the amount of effort needed to assimilate the pattern). Here we assume all projections of the same dataset have the same descriptional complexity; as a result, the descriptional complexity can be omitted from the definition of the SI.
 4.
 5.
In this paper, by performing PCA, we mean that the data \({\mathbf {X}}\) is first centered (\({\mathbf {X}}_c = {\mathbf {X}} - \frac{1}{n}{\mathbf {1}}_{n\times 1}{\mathbf {1}}_{n\times 1}'{\mathbf {X}}\)), after which the eigenvectors of the matrix \({\mathbf {X}}_c'{\mathbf {X}}_c\) are computed and sorted in descending order of the absolute values of their eigenvalues. After sorting, the eigenvectors with the largest absolute eigenvalues correspond to the top principal components.
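The procedure in this footnote can be sketched in a few lines of numpy (the function name is ours; we assume the rows of X are data points):

```python
import numpy as np

def pca_components(X):
    """PCA as described in the footnote: center X, eigendecompose the
    centered Gram matrix, and sort the eigenvectors by descending
    absolute eigenvalue."""
    n = X.shape[0]
    Xc = X - np.ones((n, n)) @ X / n      # X_c = X - (1/n) 1 1' X
    vals, vecs = np.linalg.eigh(Xc.T @ Xc)
    order = np.argsort(-np.abs(vals))     # descending |eigenvalue|
    return vecs[:, order]                 # columns are the principal components
```

Since \({\mathbf {X}}_c'{\mathbf {X}}_c\) is positive semi-definite, sorting by absolute eigenvalue coincides with sorting by eigenvalue here.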
 6.
https://archive.ics.uci.edu/ml/datasets/Statlog+(Shuttle), retrieved November 18, 2016.
 7.
http://cs.nyu.edu/~roweis/data.html, retrieved November 18, 2016.
 8.
In the experiment we used the FastICA package for MATLAB. The package can be downloaded from https://research.ics.aalto.fi/ica/fastica/
 9.
This data is available as a preprocessed Matlab file at http://www.cad.zju.edu.cn/home/dengcai/Data/FaceData.html. The original dataset is described in Georghiades et al. (2001), Lee et al. (2005).
 10.
https://en.wikipedia.org/wiki/List_of_political_parties_in_Germany, retrieved November 18, 2016.
 11.
https://en.wikipedia.org/wiki/New_states_of_Germany#Demographic_development, retrieved November 18, 2016.
 12.
Acknowledgements
We thank the anonymous reviewers for their constructive and insightful comments. We are grateful to Petteri Kaski for discussions about the complexity of tPCA.
References
 Belkin M, Niyogi P (2003) Laplacian eigenmaps for dimensionality reduction and data representation. Neural Comput 15(6):1373–1396
 Bishop CM (2006) Pattern recognition and machine learning. Springer, Berlin
 Boley M, Mampaey M, Kang B, Tokmakov P, Wrobel S (2013) One click mining: interactive local pattern discovery through implicit preference and performance learning. In: Proceedings of the ACM SIGKDD workshop on interactive data exploration and analytics, ACM, New York, NY, USA, pp 27–35
 Boumal N, Mishra B, Absil PA, Sepulchre R (2014) Manopt, a Matlab toolbox for optimization on manifolds. J Mach Learn Res 15(1):1455–1459. http://www.manopt.org
 Brown ET, Liu J, Brodley CE, Chang R (2012) Dis-function: learning distance functions interactively. In: IEEE VAST, IEEE, Seattle, WA, USA, pp 83–92
 Cunningham JP, Ghahramani Z (2015) Linear dimensionality reduction: survey, insights, and generalizations. J Mach Learn Res 16:2859–2900
 De Bie T (2011) An information theoretic framework for data mining. In: Proceedings of the 17th ACM SIGKDD international conference on knowledge discovery and data mining, ACM, New York, NY, USA, pp 564–572
 De Bie T (2013) Subjective interestingness in exploratory data mining. In: International symposium on intelligent data analysis, Springer, Berlin, Heidelberg, pp 19–31
 De Bie T, Lijffijt J, Santos-Rodriguez R, Kang B (2016) Informative data projections: a framework and two examples. In: European symposium on artificial neural networks, computational intelligence and machine learning. www.i6doc.com
 Fisher RA (1936) The use of multiple measurements in taxonomic problems. Ann Eugen 7(2):179–188
 Friedman JH, Tukey JW (1974) A projection pursuit algorithm for exploratory data analysis. IEEE Trans Comput 100(9):881–890
 Georghiades AS, Belhumeur PN, Kriegman DJ (2001) From few to many: illumination cone models for face recognition under variable lighting and pose. IEEE Trans Pattern Anal Mach Intell 23(6):643–660
 Gupta AK, Nagar DK (1999) Matrix variate distributions. CRC Press, Boca Raton
 Hand DJ, Mannila H, Smyth P (2001) Principles of data mining. MIT Press, Cambridge
 He X, Niyogi P (2004) Locality preserving projections. In: Advances in neural information processing systems, pp 153–160
 Hotelling H (1936) Relations between two sets of variates. Biometrika 28(3/4):321–377
 Hyvärinen A (1999) Fast and robust fixed-point algorithms for independent component analysis. IEEE Trans Neural Netw 10(3):626–634
 Hyvärinen A, Karhunen J, Oja E (2004) Independent component analysis. Wiley, New York
 Iwata T, Houlsby N, Ghahramani Z (2013) Active learning for interactive visualization. In: Proceedings of the sixteenth international conference on artificial intelligence and statistics, Proceedings of machine learning research, vol 31, pp 342–350
 Jolliffe I (2002) Principal component analysis. Wiley Online Library
 Kang B, Lijffijt J, Santos-Rodríguez R, De Bie T (2016) Subjectively interesting component analysis: data projections that contrast with prior expectations. In: Proceedings of the 22nd ACM SIGKDD international conference on knowledge discovery and data mining, ACM, New York, NY, USA, pp 1615–1624
 Kohonen T (1998) The self-organizing map. Neurocomputing 21(1):1–6
 Kokiopoulou E, Chen J, Saad Y (2011) Trace optimization and eigenproblems in dimension reduction methods. Numer Linear Algebra Appl 18(3):565–602
 Kotz S, Nadarajah S (2004) Multivariate t-distributions and their applications. Cambridge University Press, Cambridge
 Kruskal JB, Wish M (1978) Multidimensional scaling. Sage, Thousand Oaks
 Lee KC, Ho J, Kriegman DJ (2005) Acquiring linear subspaces for face recognition under variable lighting. IEEE Trans Pattern Anal Mach Intell 27(5):684–698
 Lijffijt J, Papapetrou P, Puolamäki K (2014) A statistical significance testing approach to mining the most informative set of patterns. Data Min Knowl Discov 28(1):238–263
 Nesterov Y (2013) Introductory lectures on convex optimization: a basic course. Springer, Berlin
 Ng AY, Jordan MI, Weiss Y (2002) On spectral clustering: analysis and an algorithm. In: Advances in neural information processing systems, pp 849–856
 Onishchik (2011) Stiefel manifold. Encyclopedia of mathematics. http://www.encyclopediaofmath.org/index.php?title=Stiefel_manifold&oldid=12028. Accessed 21 June 2017
 Paurat D, Gärtner T (2013) InVis: a tool for interactive visual data analysis. In: Machine learning and knowledge discovery in databases: European conference, ECML PKDD, Springer, Berlin, Heidelberg, pp 672–676
 Pearson K (1901) On lines and planes of closest fit to systems of points in space. Philos Mag 2(11):559–572
 Puolamäki K, Papapetrou P, Lijffijt J (2010) Visually controllable data mining methods. In: IEEE international conference on data mining workshops, IEEE, pp 409–417
 Ruotsalo T, Jacucci G, Myllymäki P, Kaski S (2015) Interactive intent modeling: information discovery beyond search. Commun ACM 58(1):86–92
 Tenenbaum JB, De Silva V, Langford JC (2000) A global geometric framework for nonlinear dimensionality reduction. Science 290(5500):2319–2323
 Vasilescu MAO, Terzopoulos D (2002) Multilinear analysis of image ensembles: TensorFaces. In: Proceedings of the 7th European conference on computer vision, Springer, Berlin, Heidelberg, pp 447–460
 Von Luxburg U (2007) A tutorial on spectral clustering. Stat Comput 17(4):395–416
 Weinberger KQ, Saul LK (2009) Distance metric learning for large margin nearest neighbor classification. J Mach Learn Res 10:207–244
 Weinberger KQ, Sha F, Zhu Q, Saul LK (2006) Graph Laplacian regularization for large-scale semidefinite programming. In: Advances in neural information processing systems, pp 1489–1496
 Zografos K (1999) On maximum entropy characterization of Pearson’s type II and VII multivariate distributions. J Multivar Anal 71(1):67–75
Copyright information
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.