Clustering with missing features: a penalized dissimilarity measure based approach
Abstract
Many real-world clustering problems are plagued by incomplete data characterized by missing or absent features for some or all of the data instances. Traditional clustering methods cannot be directly applied to such data without preprocessing by imputation or marginalization techniques. In this article, we overcome this drawback by utilizing a penalized dissimilarity measure which we refer to as the feature weighted penalty based dissimilarity (FWPD). Using the FWPD measure, we modify the traditional k-means clustering algorithm and the standard hierarchical agglomerative clustering algorithms so as to make them directly applicable to datasets with missing features. We present time complexity analyses for these new techniques and also undertake a detailed theoretical analysis showing that the new FWPD based k-means algorithm converges to a local optimum within a finite number of iterations. We also present a detailed method for simulating random as well as feature dependent missingness. We report extensive experiments on various benchmark datasets for different types of missingness showing that the proposed clustering techniques generally achieve better results than some of the most well-known imputation methods which are commonly used to handle such incomplete data. We append a possible extension of the proposed dissimilarity measure to the case of absent features (where the unobserved features are known to be undefined).
Keywords
Missing features; Penalized dissimilarity measure; k-means; Hierarchical agglomerative clustering; Absent features

1 Introduction
In data analytics, clustering is a fundamental technique concerned with partitioning a given dataset into useful groups (called clusters) according to the relative similarity among the data instances. Clustering algorithms attempt to partition a set of data instances (characterized by some features) into different clusters such that the member instances of any given cluster are akin to each other and are different from the members of the other clusters. The greater the similarity within a group and the dissimilarity between groups, the better is the clustering obtained by a suitable algorithm.
Clustering techniques are of extensive use and are hence being constantly investigated in statistics, machine learning, and pattern recognition. Clustering algorithms find applications in various fields such as economics, marketing, electronic design, space research, etc. For example, clustering has been used to group related documents for web browsing (Broder et al. 1997; Haveliwala et al. 2000), by banks to cluster the previous transactions of clients to identify suspicious (possibly fraudulent) behaviour (Sabau 2012), for formulating effective marketing strategies by clustering customers with similar behaviour (Chaturvedi et al. 1997), in earthquake studies for identifying dangerous zones based on previous epicentre locations (Weatherill and Burton 2009; Shelly et al. 2009; Lei 2010), and so on. However, when we analyze such real-world data, we may encounter incomplete data where some features of some of the data instances are missing. For example, web documents may have some expired hyper-links. Such missingness may be due to a variety of reasons, such as data input errors, inaccurate measurement, equipment malfunction or limitations, and measurement noise or data corruption. This is known as unstructured missingness (Chan and Dunn 1972; Rubin 1976). Alternatively, not all the features may be defined for all the data instances in the dataset. This is termed structural missingness or absence of features (Chechik et al. 2008). For example, credit-card details may not be defined for non-credit card clients of a bank.
Missing features have always been a challenge for researchers because traditional learning methods (which assume all data instances to be fully observed, i.e. all the features are observed) cannot be directly applied to such incomplete data without suitable preprocessing. When the rate of missingness is low, the data instances with missing values may be ignored. This approach is known as marginalization. Marginalization cannot be applied to data having a sizable number of missing values, as it may lead to the loss of a large amount of information. Therefore, sophisticated methods are required to fill in the vacancies in the data, so that traditional learning methods can be applied subsequently. This approach of filling in the missing values is called imputation. However, inferences drawn from data having a large fraction of missing values may be severely warped, despite the use of such sophisticated imputation methods (Acuña and Rodriguez 2004).
1.1 Literature
The initial models for feature missingness are due to Rubin (1976); Little and Rubin (1987). They proposed a three-fold classification of missing data mechanisms, viz. Missing Completely At Random (MCAR), Missing At Random (MAR), and Missing Not At Random (MNAR). MCAR refers to the case where missingness is entirely haphazard, i.e. the likelihood of a feature being unobserved for a certain data instance depends neither on the observed nor on the unobserved characteristics of the instance. For example, in an annual income survey, a citizen may be unable to participate due to unrelated reasons such as traffic or schedule problems. MAR alludes to the cases where the missingness is conditional on the observed features of an instance, but is independent of the unobserved features. Suppose college-goers are less likely to report their income than office-goers; but whether a college-goer will report his or her income is independent of the actual income. MNAR is characterized by the dependence of the missingness on the unobserved features. For example, people who earn less are less likely to report their incomes in the annual income survey. Datta et al. (2016b) further classified MNAR into two sub-types, namely MNAR-I when the missingness only depends on the unobserved features and MNAR-II when the missingness is governed by both observed as well as unobserved features. Schafer and Graham (2002) and Zhang et al. (2012) have observed that MCAR is a special case of MAR and that MNAR can also be converted to MAR by appending a sufficient number of additional features. Therefore, most learning techniques are based on the validity of the MAR assumption.
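To make the distinction between these mechanisms concrete, the following minimal sketch (our own illustration, not the simulation procedure discussed later in this paper; the function names are hypothetical) generates an MCAR mask, in which every entry is equally likely to be masked, and an MAR mask, in which missingness of one feature is conditioned on the observed value of another:

```python
import numpy as np

def mcar_mask(X, rate, rng=None):
    """MCAR: every entry is masked independently with the same probability,
    regardless of observed or unobserved values."""
    rng = np.random.default_rng(rng)
    return rng.random(X.shape) < rate

def mar_mask(X, cond_col, target_col, threshold, rate, rng=None):
    """MAR: the target feature is masked with probability `rate`, but only
    for rows whose *observed* conditioning feature exceeds `threshold`."""
    rng = np.random.default_rng(rng)
    mask = np.zeros(X.shape, dtype=bool)
    eligible = X[:, cond_col] > threshold
    mask[:, target_col] = eligible & (rng.random(X.shape[0]) < rate)
    return mask
```

An MNAR-I mask would condition on the value being masked itself; MNAR-II would condition on both.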
A lot of research on the problem of learning with missing or absent features has been conducted over the past few decades, mostly focussing on imputation methods. Several works such as Little and Rubin (1987) and Schafer (1997) provide elaborate theories and analyses of missing data. Common imputation methods (Donders et al. 2006) involve filling the missing features of data instances with zeros [Zero Imputation (ZI)], or the means of the corresponding features over the entire dataset [Mean Imputation (MI)]. Class Mean Imputation or Concept Mean Imputation (CMI) is a slight modification of MI that involves filling the missing features with the average of all observations having the same label as the instance being filled. Yet another common imputation method is k-Nearest Neighbor Imputation (kNNI) (Dixon 1979), where the missing features of a data instance are filled in by the means of the corresponding features over its k-Nearest Neighbors (kNN), computed on the observed subspace. Grzymala-Busse and Hu (2001) suggested various novel imputation schemes, such as treating missing attribute values as special values. Rubin (1987) proposed a technique called Multiple Imputation (MtI) to model the uncertainty inherent in imputation. In MtI, the missing values are imputed by a typically small (e.g. 5–10) number of simulated versions, depending on the percentage of missing data (Chen 2013; Horton and Lipsitz 2001). Some more sophisticated imputation techniques have been developed, especially by the bioinformatics community, to impute missing values by exploiting the correlations within the data. A prominent example is the Singular Value Decomposition based Imputation (SVDI) technique (Troyanskaya et al. 2001), which performs regression-based estimation of the missing values using the k most significant eigenvectors of the dataset. Other examples include Least Squares Imputation (LSI) (Bo et al. 2004), Non-Negative LSI (NNLSI), and Collateral Missing Value Estimation (CMVE) (Sehgal et al. 2005). Model-based methods are related to, yet distinct from, imputation techniques. These methods attempt to model the distributions of the missing values instead of filling them in (Dempster and Rubin 1983; Ahmad and Tresp 1993; Wang and Rao 2002a, b).
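As a minimal illustration of two of the simpler schemes above (the helper names and toy data are ours, not from the cited works), MI and kNNI can be sketched as:

```python
import numpy as np

def mean_impute(X):
    """MI: fill each NaN with the mean of its column over observed entries."""
    X = X.copy()
    col_means = np.nanmean(X, axis=0)
    idx = np.where(np.isnan(X))
    X[idx] = np.take(col_means, idx[1])
    return X

def knn_impute(X, k=2):
    """kNNI: fill the NaNs of each incomplete row with the mean of the k
    nearest rows, with distances computed on the commonly observed subspace."""
    X_filled = X.copy()
    for i in range(X.shape[0]):
        miss = np.isnan(X[i])
        if not miss.any():
            continue
        dists = []
        for j in range(X.shape[0]):
            # candidate neighbors must observe the features missing from row i
            if j == i or np.isnan(X[j][miss]).any():
                continue
            common = ~np.isnan(X[i]) & ~np.isnan(X[j])
            if common.any():
                dists.append((np.linalg.norm((X[i] - X[j])[common]), j))
        dists.sort()
        neighbors = [j for _, j in dists[:k]]
        X_filled[i, miss] = X[neighbors][:, miss].mean(axis=0)
    return X_filled
```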
However, most of these techniques assume the pattern of missingness to be MCAR or MAR because this allows the use of simpler models of missingness (Heitjan and Basu 1996). Such simple models are not likely to perform well in case of MNAR as the pattern of missingness also holds information. Hence, other methods have to be developed to tackle incomplete data due to MNAR (Marlin 2008). Moreover, imputation may often lead to the introduction of noise and uncertainty in the data (Dempster and Rubin 1983; Little and Rubin 1987; Barceló 2008; Myrtveit et al. 2001).
In light of the observations made in the preceding paragraph, some learning methods avoid the inexact methods of imputation (as well as marginalization) altogether while dealing with missingness. A common paradigm is random subspace learning, where an ensemble of learners is trained on projections of the data in random subspaces and an inference is drawn based on the consensus among the ensemble (Krause and Polikar 2003; Juszczak and Duin 2004; Nanni et al. 2012). Chechik et al. (2008) used the geometrical insight of max-margin classification to formulate an objective function which was optimized to directly classify the incomplete data. This was extended to the max-margin regression case for software effort prediction with absent features in Zhang et al. (2012). Wagstaff (2004); Wagstaff and Laidler (2005) suggested a k-means algorithm with Soft Constraints (KSC) where soft constraints determined by fully observed objects are introduced to facilitate the grouping of instances with missing features. Himmelspach and Conrad (2010) provided a good review of partitional clustering techniques for incomplete datasets, which mentions some other techniques that do not make use of imputation.
The idea to modify the distance between the data instances to directly tackle missingness (without having to resort to imputation) was first put forth by Dixon (1979). The Partial Distance Strategy (PDS) proposed in Dixon (1979) scales up the observed distance, i.e. the distance between two data instances in their common observed subspace (the subspace consisting of the observed features common to both data instances) by the ratio of the total number of features (observed as well as unobserved) and the number of common observed features between them to obtain an estimate of their distance in the fully observed space. Hathaway and Bezdek (2001) used the PDS to extend the Fuzzy C-Means (FCM) clustering algorithm to cases with missing features. Furthermore, Millán-Giraldo et al. (2010) and Porro-Muñoz et al. (2013) generalized the idea of the PDS by proposing to scale the observed distance by factors other than the fraction of observed features. However, neither the PDS nor its extensions can always provide a good estimate of the actual distance as the observed distance between two instances may be unrelated to the distance between them in the unobserved subspace.
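A sketch of the PDS, under the convention (used, e.g., by Hathaway and Bezdek 2001) of scaling the squared observed distance by the ratio of the total number of features to the number of common observed features (the function name is ours):

```python
import numpy as np

def partial_distance(xi, xj):
    """Partial Distance Strategy (Dixon 1979): Euclidean distance on the
    commonly observed subspace, with the squared distance scaled up by
    m / (number of commonly observed features)."""
    xi, xj = np.asarray(xi, float), np.asarray(xj, float)
    common = ~np.isnan(xi) & ~np.isnan(xj)
    if not common.any():
        raise ValueError("no commonly observed features")
    sq = np.sum((xi[common] - xj[common]) ** 2)
    return np.sqrt(len(xi) / common.sum() * sq)
```

For fully observed points this reduces to the ordinary Euclidean distance; as the common observed subspace shrinks, the estimate relies on ever fewer features, which is precisely the weakness noted above.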
1.2 Motivation

The imputation and distance-based approaches surveyed above suffer from the following drawbacks:
- 1.
ZI works well only for missing values in the vicinity of the origin and is also origin dependent;
- 2.
MI works well only when the missing value is near the observed mean of the missing feature;
- 3.
kNNI is reliant on the assumption that neighbors have similar features, but suffers from the drawbacks that missingness may give rise to erroneous neighbor selection and that the estimates are restricted to the range of observed values of the feature in question;
- 4.
PDS suffers from the assumption that the common observed distances reflect the unobserved distances; and
- 5.
None of these methods differentiate between identical incomplete points, i.e. two distinct points with identical observed values, such as \(\widetilde{{\mathbf {x}}}_1\) and \(\widetilde{{\mathbf {x}}}'_1\), are not differentiated between.
1.3 Contribution
The FWPD measure is a penalized dissimilarity measure (PDM) used in Datta et al. (2016b) for kNN classification of datasets with missing features.^{1} The FWPD between two data instances is a weighted sum of two terms; the first term being the observed distance between the instances and the second being a penalty term. The penalty term is a sum of the penalties corresponding to each of the features which are missing from at least one of the data instances; each penalty being directly proportional to the probability of its corresponding feature being observed. Such a weighting scheme imposes greater penalty if a feature which is observed for a large fraction of the data is missing for a particular instance. On the other hand, if the missing feature is unobserved for a large fraction of the data, then a smaller penalty is imposed.
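A minimal sketch of this computation (our own illustration; it assumes the Euclidean observed distance and takes the penalty as the sum of the observation counts \(w_l\) over the features missing from at least one of the two points, normalized by the total count — the exact definitions are formalized in Sect. 2):

```python
import numpy as np

def fwpd(xi, xj, w, d_max, alpha=0.5):
    """Sketch of the FWPD: a convex combination of the normalized observed
    distance and the feature weighted penalty (FWP), where w[l] counts the
    instances observing feature l."""
    xi, xj = np.asarray(xi, float), np.asarray(xj, float)
    w = np.asarray(w, float)
    common = ~np.isnan(xi) & ~np.isnan(xj)          # commonly observed subspace
    d_obs = np.sqrt(np.sum((xi[common] - xj[common]) ** 2))
    penalty = w[~common].sum() / w.sum()            # FWP over missing features
    return (1 - alpha) * d_obs / d_max + alpha * penalty
```

Note that two fully observed points incur zero penalty, while two points with disjoint observed subspaces incur the maximum penalty of 1 even though their observed distance is vacuously 0, which is exactly how the FWPD differentiates between incomplete points that imputation-free distances like the PDS cannot separate.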
- 1.
In the current article, we formulate the k-means clustering problem for datasets with missing features based on the proposed FWPD and develop an algorithm to solve the new formulation.
- 2.
We prove that the proposed algorithm is guaranteed to converge to a locally optimal solution of the modified k-means optimization problem formulated with the FWPD measure.
- 3.
We also propose Single Linkage, Average Linkage, and Complete Linkage based HAC methods for datasets plagued by missingness, based on the proposed FWPD.
- 4.
We provide an extensive discussion on the properties of the FWPD measure. The said discussion is more thorough compared to that of Datta et al. (2016b).
- 5.
We further provide a detailed algorithm for simulating the four types of missingness enumerated in Datta et al. (2016b), namely MCAR, MAR, MNAR-I (missingness only depends on the unobserved features) and MNAR-II (missingness depends on both observed as well as unobserved features).
- 6.
Moreover, since this work presents an alternative to imputation and can be useful in scenarios where imputation is not practical (such as structural missingness), we append an extension of the proposed FWPD to the case of absent features (where the absent features are known to be undefined or non-existent). We also show that the FWPD becomes a semi-metric in the case of structural missingness.
1.4 Organization
The rest of this paper is organized in the following way. In Sect. 2, we elaborate on the properties of the FWPD measure. The next section (Sect. 3) presents a formulation of the k-means clustering problem which is directly applicable to datasets with missing features, based on the FWPD discussed in Sect. 2. This section also puts forth an algorithm to solve the optimization problem posed by this new formulation. The subsequent section (Sect. 4) covers the HAC algorithm formulated using FWPD to be directly applicable to incomplete datasets. Experimental results (based on the missingness simulating mechanism discussed in the same section) are presented in Sect. 5. Relevant conclusions are drawn in Sect. 6. Subsequently, “Appendix A” deals with the extension of the proposed FWPD to the case of absent features (structural missingness).
2 Feature weighted penalty based dissimilarity measure for datasets with missing features
Table 1 Some important notations used in Sect. 2 and beyond
Notation | Meaning |
---|---|
X | Dataset with incomplete data points |
n | Number of data points in X |
\({\mathbf {x}}_i\) | A data point in X |
\(x_{i,l}\) | l-th feature of \({\mathbf {x}}_i\) |
S | Set of all features in X |
m | Number of features in S, i.e. |S| |
\(\gamma \) | General notation for a set of features in S |
\(\gamma _{{\mathbf {x}}_i}\) | Set of features observed for point \({\mathbf {x}}_i\) |
\(\gamma _{obs}\) | Set of features observed for all instances in X |
\(\gamma _{miss}\) | Set of features which are unobserved for some point in X |
\(d_{\gamma }({\mathbf {x}}_i,{\mathbf {x}}_j)\) | Distance between points \({\mathbf {x}}_i\) and \({\mathbf {x}}_j\) in the subspace defined by the features in \(\gamma \) |
\(d({\mathbf {x}}_i,{\mathbf {x}}_j)\) | Observed distance between points \({\mathbf {x}}_i\) and \({\mathbf {x}}_j\) |
\(d_E({\mathbf {x}}_i,{\mathbf {x}}_j)\) | Euclidean distance between fully observed points \({\mathbf {x}}_i\) and \({\mathbf {x}}_j\) |
\(w_l\) | Number of instances in X having observed values for the l-th feature |
\(p({\mathbf {x}}_i,{\mathbf {x}}_j)\) | Feature Weighted Penalty (FWP) between \({\mathbf {x}}_i\) and \({\mathbf {x}}_j\) |
\(p_{\gamma }\) | FWP corresponding to the subspace defined by \(\gamma \) |
\(\delta ({\mathbf {x}}_i,{\mathbf {x}}_j)\) | Feature Weighted Penalty based Dissimilarity (FWPD) between \({\mathbf {x}}_i\) and \({\mathbf {x}}_j\) |
\(d_{max}\) | Maximum observed distance between any two data points in X |
\(\alpha \) | Coefficient of relative importance between observed distance and FWP for FWPD |
\(\rho _{i,j,k}\) | \(p({\mathbf {x}}_i,{\mathbf {x}}_j) + p({\mathbf {x}}_j,{\mathbf {x}}_k) - p({\mathbf {x}}_k,{\mathbf {x}}_i)\) for some \({\mathbf {x}}_i, {\mathbf {x}}_j, {\mathbf {x}}_k \in X\) |
\(\phi \) | The empty set |
Definition 1
Definition 2
Definition 3
Then, the definition of the proposed FWPD follows.
Definition 4
2.1 Properties of the proposed FWPD
In this subsection, we discuss some important properties of the proposed FWPD measure. The following theorem establishes several of these properties, while the subsequent discussion concerns the triangle inequality in the context of FWPD.
Theorem 1
- 1.
\(\delta ({\mathbf {x}}_i,{\mathbf {x}}_i) \le \delta ({\mathbf {x}}_i,{\mathbf {x}}_j)\) \(\forall \) \({\mathbf {x}}_i,{\mathbf {x}}_j \in X\),
- 2.
\(\delta ({\mathbf {x}}_i,{\mathbf {x}}_i) \ge 0\) \(\forall \) \({\mathbf {x}}_i \in X\),
- 3.
\(\delta ({\mathbf {x}}_i,{\mathbf {x}}_i) = 0\) iff \(\gamma _{{\mathbf {x}}_i}=S\), and
- 4.
\(\delta ({\mathbf {x}}_i,{\mathbf {x}}_j)=\delta ({\mathbf {x}}_j,{\mathbf {x}}_i)\) \(\forall \) \({\mathbf {x}}_i,{\mathbf {x}}_j \in X\).
Proof
- 1.
From Eqs. (1) and (3), it follows that
$$\begin{aligned} \delta ({\mathbf {x}}_i,{\mathbf {x}}_i) = \alpha \times p({\mathbf {x}}_i,{\mathbf {x}}_i). \end{aligned}$$(4)
It also follows from Eq. (2) that \(p({\mathbf {x}}_i,{\mathbf {x}}_i) \le p({\mathbf {x}}_i,{\mathbf {x}}_j)\) \(\forall \) \({\mathbf {x}}_i,{\mathbf {x}}_j \in X\). Therefore, \(\delta ({\mathbf {x}}_i,{\mathbf {x}}_i) \le \alpha \times p({\mathbf {x}}_i,{\mathbf {x}}_j)\). Since \(\alpha \le 1\), we have \(\alpha \times p({\mathbf {x}}_i,{\mathbf {x}}_j) \le p({\mathbf {x}}_i,{\mathbf {x}}_j)\). Now, it follows from Eq. (3) that \(p({\mathbf {x}}_i,{\mathbf {x}}_j) \le \delta ({\mathbf {x}}_i,{\mathbf {x}}_j)\). Hence, we get \(\delta ({\mathbf {x}}_i,{\mathbf {x}}_i) \le \delta ({\mathbf {x}}_i,{\mathbf {x}}_j)\) \(\forall \) \({\mathbf {x}}_i,{\mathbf {x}}_j \in X\).
- 2.
It can be seen from Eq. (3) that \(\delta ({\mathbf {x}}_i,{\mathbf {x}}_i) = \alpha \times p({\mathbf {x}}_i,{\mathbf {x}}_i)\). Moreover, it follows from Eq. (2) that \(p({\mathbf {x}}_i,{\mathbf {x}}_i) \ge 0\). Hence, \(\delta ({\mathbf {x}}_i,{\mathbf {x}}_i) \ge 0\) \(\forall \) \({\mathbf {x}}_i \in X\).
- 3.
It is easy to see from Eq. (2) that \(p({\mathbf {x}}_i,{\mathbf {x}}_i)=0\) iff \(\gamma _{{\mathbf {x}}_i}=S\). Hence, it directly follows from Eq. (4) that \(\delta ({\mathbf {x}}_i,{\mathbf {x}}_i) = 0\) iff \(\gamma _{{\mathbf {x}}_i}=S\).
- 4.
From Eq. (3) we have
$$\begin{aligned} \begin{aligned}&\delta ({\mathbf {x}}_i,{\mathbf {x}}_j)=(1-\alpha )\times \frac{d({\mathbf {x}}_i,{\mathbf {x}}_j)}{d_{max}} + \alpha \times p({\mathbf {x}}_i,{\mathbf {x}}_j),\\ \text {and }&\delta ({\mathbf {x}}_j,{\mathbf {x}}_i)=(1-\alpha )\times \frac{d({\mathbf {x}}_j,{\mathbf {x}}_i)}{d_{max}} + \alpha \times p({\mathbf {x}}_j,{\mathbf {x}}_i). \end{aligned} \end{aligned}$$
However, \(d({\mathbf {x}}_i,{\mathbf {x}}_j)=d({\mathbf {x}}_j,{\mathbf {x}}_i)\) and \(p({\mathbf {x}}_i,{\mathbf {x}}_j)=p({\mathbf {x}}_j,{\mathbf {x}}_i)\) \(\forall \) \({\mathbf {x}}_i,{\mathbf {x}}_j \in X\) (by definition). Therefore, it can be easily seen that \(\delta ({\mathbf {x}}_i,{\mathbf {x}}_j)=\delta ({\mathbf {x}}_j,{\mathbf {x}}_i)\) \(\forall \) \({\mathbf {x}}_i,{\mathbf {x}}_j \in X\). \(\square \)
The triangle inequality is an important criterion which lends some useful properties to the space induced by a dissimilarity measure. Therefore, the conditions under which FWPD satisfies the said criterion are investigated below. However, it should be stressed that the satisfaction of the said criterion is not essential for the functioning of the clustering techniques proposed in the subsequent text.
Definition 5
The following three lemmas deal with the conditions under which Inequality (5) holds.
Lemma 1
For any three data instances \({\mathbf {x}}_i, {\mathbf {x}}_j, {\mathbf {x}}_k \in X\) let \(\rho _{i,j,k} = p({\mathbf {x}}_i,{\mathbf {x}}_j) + p({\mathbf {x}}_j,{\mathbf {x}}_k) - p({\mathbf {x}}_k,{\mathbf {x}}_i)\). Then \(\rho _{i,j,k} \ge 0\) \(\forall \) \({\mathbf {x}}_i, {\mathbf {x}}_j, {\mathbf {x}}_k \in X\).
Proof
Lemma 2
For any three data points \({\mathbf {x}}_i, {\mathbf {x}}_j, {\mathbf {x}}_k \in X\), Inequality (5) is satisfied when \((\gamma _{{\mathbf {x}}_i} \bigcap \gamma _{{\mathbf {x}}_j})=(\gamma _{{\mathbf {x}}_j} \bigcap \gamma _{{\mathbf {x}}_k})=(\gamma _{{\mathbf {x}}_k} \bigcap \gamma _{{\mathbf {x}}_i})\).
Proof
Lemma 3
If \(|\gamma _{{\mathbf {x}}_i} \bigcap \gamma _{{\mathbf {x}}_j}| \rightarrow 0\), \(|\gamma _{{\mathbf {x}}_j} \bigcap \gamma _{{\mathbf {x}}_k}| \rightarrow 0\) and \(|\gamma _{{\mathbf {x}}_k} \bigcap \gamma _{{\mathbf {x}}_i}| \rightarrow 0\), then Inequality (8) tends to be satisfied.
Proof
When \(|\gamma _{{\mathbf {x}}_i} \bigcap \gamma _{{\mathbf {x}}_j}| \rightarrow 0\), \(|\gamma _{{\mathbf {x}}_j} \bigcap \gamma _{{\mathbf {x}}_k}| \rightarrow 0\) and \(|\gamma _{{\mathbf {x}}_k} \bigcap \gamma _{{\mathbf {x}}_i}| \rightarrow 0\), then LHS \(\rightarrow \alpha ^{+}\) and RHS \(\rightarrow 0\) for the Inequality (8). As \(\alpha \in (0,1]\), Inequality (8) tends to be satisfied. \(\square \)
The following lemma deals with the value of the parameter \(\alpha \in (0,1]\) for which a relaxed form of the triangle inequality is satisfied for any three data instances in a dataset X.
Lemma 4
Proof
- 1.
If \({\mathbf {x}}_i\), \({\mathbf {x}}_j\), and \({\mathbf {x}}_k\) are all fully observed, then Inequality (5) holds. Now, since \(\epsilon \ge 0\), therefore \(\delta ({\mathbf {x}}_k,{\mathbf {x}}_i) \ge \delta ({\mathbf {x}}_k,{\mathbf {x}}_i) - {\epsilon }^2\). This implies \(\delta ({\mathbf {x}}_i,{\mathbf {x}}_j) + \delta ({\mathbf {x}}_j,{\mathbf {x}}_k) \ge \delta ({\mathbf {x}}_k,{\mathbf {x}}_i) \ge \delta ({\mathbf {x}}_k,{\mathbf {x}}_i) - {\epsilon }^2\). Hence, Inequality (9) must hold.
- 2.
If \((\gamma _{{\mathbf {x}}_i} \bigcap \gamma _{{\mathbf {x}}_j} \bigcap \gamma _{{\mathbf {x}}_k}) \ne S\) i.e. at least one of the data instances is not fully observed, and \(\rho _{i,j,k} = 0\), then \((\gamma _{{\mathbf {x}}_i} \bigcup \gamma _{{\mathbf {x}}_k})\backslash \gamma _{{\mathbf {x}}_j} = \phi \), \((\gamma _{{\mathbf {x}}_i} \bigcap \gamma _{{\mathbf {x}}_k})\backslash \gamma _{{\mathbf {x}}_j} = \phi \), \(S \backslash (\gamma _{{\mathbf {x}}_i} \bigcup \gamma _{{\mathbf {x}}_j} \bigcup \gamma _{{\mathbf {x}}_k}) = \phi \), and \(\gamma _{{\mathbf {x}}_j} \backslash (\gamma _{{\mathbf {x}}_i} \bigcup \gamma _{{\mathbf {x}}_k}) = \phi \). This implies that \(\gamma _{{\mathbf {x}}_j} = S\), and \(\gamma _{{\mathbf {x}}_k} \bigcup \gamma _{{\mathbf {x}}_i} = \gamma _{{\mathbf {x}}_j}\). Moreover, since \(\rho _{i,j,k} = 0\), we have \(\delta ({\mathbf {x}}_i,{\mathbf {x}}_j) + \delta ({\mathbf {x}}_j,{\mathbf {x}}_k) - \delta ({\mathbf {x}}_k,{\mathbf {x}}_i) = d({\mathbf {x}}_i,{\mathbf {x}}_j) + d({\mathbf {x}}_j,{\mathbf {x}}_k) - d({\mathbf {x}}_k,{\mathbf {x}}_i)\). Now, \(\gamma _{{\mathbf {x}}_i} \bigcap \gamma _{{\mathbf {x}}_k} \subseteq \gamma _{{\mathbf {x}}_i}\), \(\gamma _{{\mathbf {x}}_i} \bigcap \gamma _{{\mathbf {x}}_k} \subseteq \gamma _{{\mathbf {x}}_k}\) and \(\gamma _{{\mathbf {x}}_i} \bigcap \gamma _{{\mathbf {x}}_k} \subseteq \gamma _{{\mathbf {x}}_j}\) as \(\gamma _{{\mathbf {x}}_k} \bigcup \gamma _{{\mathbf {x}}_i} = \gamma _{{\mathbf {x}}_j} = S\). Therefore, \(d({\mathbf {x}}_i,{\mathbf {x}}_j) + d({\mathbf {x}}_j,{\mathbf {x}}_k) - d({\mathbf {x}}_k,{\mathbf {x}}_i) \ge d_{\gamma _{{\mathbf {x}}_i} \bigcap \gamma _{{\mathbf {x}}_k}}({\mathbf {x}}_i,{\mathbf {x}}_j) + d_{\gamma _{{\mathbf {x}}_i} \bigcap \gamma _{{\mathbf {x}}_k}}({\mathbf {x}}_j,{\mathbf {x}}_k) - d_{\gamma _{{\mathbf {x}}_i} \bigcap \gamma _{{\mathbf {x}}_k}}({\mathbf {x}}_k,{\mathbf {x}}_i)\). 
Now, by the triangle inequality in subspace \(\gamma _{{\mathbf {x}}_i} \bigcap \gamma _{{\mathbf {x}}_k}\), \(d_{\gamma _{{\mathbf {x}}_i} \bigcap \gamma _{{\mathbf {x}}_k}}({\mathbf {x}}_i,{\mathbf {x}}_j) + d_{\gamma _{{\mathbf {x}}_i} \bigcap \gamma _{{\mathbf {x}}_k}}({\mathbf {x}}_j,{\mathbf {x}}_k) - d_{\gamma _{{\mathbf {x}}_i} \bigcap \gamma _{{\mathbf {x}}_k}}({\mathbf {x}}_k,{\mathbf {x}}_i) \ge 0\). Hence, \(\delta ({\mathbf {x}}_i,{\mathbf {x}}_j) + \delta ({\mathbf {x}}_j,{\mathbf {x}}_k) - \delta ({\mathbf {x}}_k,{\mathbf {x}}_i) \ge 0\), i.e. Inequalities (5) and (9) are satisfied.
- 3.
If \((\gamma _{{\mathbf {x}}_i} \bigcap \gamma _{{\mathbf {x}}_j} \bigcap \gamma _{{\mathbf {x}}_k}) \ne S\) and \(\rho _{i,j,k} \ne 0\), as \(\alpha \ge (1-\epsilon )\), LHS of Inequality (8) \(\ge (1-\epsilon ) \times (p_{(\gamma _{{\mathbf {x}}_i} \bigcup \gamma _{{\mathbf {x}}_k})\backslash \gamma _{{\mathbf {x}}_j}} + p_{(\gamma _{{\mathbf {x}}_i} \bigcap \gamma _{{\mathbf {x}}_k})\backslash \gamma _{{\mathbf {x}}_j}} + p_{\gamma _{{\mathbf {x}}_j} \backslash (\gamma _{{\mathbf {x}}_i} \bigcup \gamma _{{\mathbf {x}}_k})} + p_{S \backslash (\gamma _{{\mathbf {x}}_i} \bigcup \gamma _{{\mathbf {x}}_j} \bigcup \gamma _{{\mathbf {x}}_k})})\). Since \(\epsilon \le {\mathscr {P}}\), we further get that LHS \(\ge (1-\epsilon )\epsilon \). Moreover, as \(\frac{1}{d_{max}}(d({\mathbf {x}}_k,{\mathbf {x}}_i) - (d({\mathbf {x}}_i,{\mathbf {x}}_j) + d({\mathbf {x}}_j,{\mathbf {x}}_k))) \le 1\), we get RHS of Inequality (8) \(\le \epsilon \). Therefore, LHS - RHS \(\ge (1-\epsilon )\epsilon - \epsilon = -{\epsilon }^2\). Now, as Inequality (8) is obtained from Inequality (5) after some algebraic manipulation, it must hold that (LHS - RHS) of Inequality (8) = (LHS - RHS) of Inequality (5). Hence, we get \(\delta ({\mathbf {x}}_i,{\mathbf {x}}_j) + \delta ({\mathbf {x}}_j,{\mathbf {x}}_k) - \delta ({\mathbf {x}}_k,{\mathbf {x}}_i) \ge -{\epsilon }^2\) which can be simplified to obtain Inequality (9). This completes the proof.
Let us now elucidate the proposed FWP (and consequently the proposed FWPD measure) by using the following example.
Example 1
Let \(X \subset \mathbb {R}^3\) be a dataset consisting of \(n=5\) data points, each having three features (\(S=\{1,2,3\}\)), some of which (marked by ’*’) are unobserved. The dataset is presented below (along with the feature observation counts and the observed feature sets for each of the instances).
Data Point | \(x_{i,1}\) | \(x_{i,2}\) | \(x_{i,3}\) | \(\gamma _{{\mathbf {x}}_i}\) |
---|---|---|---|---|
\({\mathbf {x}}_1\) | * | 3 | 2 | \(\{2,3\}\) |
\({\mathbf {x}}_2\) | 1.2 | * | 4 | \(\{1,3\}\) |
\({\mathbf {x}}_3\) | * | 0 | 0.5 | \(\{2,3\}\) |
\({\mathbf {x}}_4\) | 2.1 | 3 | 1 | \(\{1,2,3\}\) |
\({\mathbf {x}}_5\) | − 2 | * | * | \(\{1\}\) |
Obs. Count | \(w_1=3\) | \(w_2=3\) | \(w_3=4\) | – |
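The observation counts and some FWP values for this example can be verified with a short sketch (NaN marks the unobserved '*' entries; normalizing the penalty by \(\sum _{l \in S} w_l\) is our reading of the penalty definition in Eq. (2)):

```python
import numpy as np

# Example 1 dataset; np.nan marks the unobserved ('*') entries.
X = np.array([
    [np.nan, 3.0, 2.0],
    [1.2, np.nan, 4.0],
    [np.nan, 0.0, 0.5],
    [2.1, 3.0, 1.0],
    [-2.0, np.nan, np.nan],
])

w = (~np.isnan(X)).sum(axis=0)  # per-feature observation counts w_l

def fwp(xi, xj, w):
    """FWP: sum of w_l over the features missing from at least one of the
    two points, normalized by the total observation count."""
    common = ~np.isnan(xi) & ~np.isnan(xj)
    return w[~common].sum() / w.sum()
```

For instance, features 1 and 2 are each missing from one of \({\mathbf {x}}_1\) and \({\mathbf {x}}_2\), giving a penalty of \((w_1+w_2)/(w_1+w_2+w_3)=6/10=0.6\), while the fully observed \({\mathbf {x}}_4\) incurs zero penalty against itself.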
Table 2 Some important notations used in Sect. 3 and beyond
Notation | Counter-part in k-means-FWPD iteration t | Meaning |
---|---|---|
k | – | Number of clusters for k-means |
\(C_j\) | \(C^t_j\) | j-th cluster for k-means |
\(u_{i,j}\) | \(u^t_{i,j}\) | Membership of the data point \({\mathbf {x}}_i\) in the cluster \(C_j\) |
U | \(U^t\) | \(n \times k\) matrix of cluster memberships |
\({\mathscr {U}}\) | – | Set of all possible U values |
\({\mathbf {z}}_j\) | \({\mathbf {z}}^t_j\) | Centroid of cluster \(C_j\) |
\(z_{j,l}\) | \(z^t_{j,l}\) | l-th feature of the cluster centroid \({\mathbf {z}}_j\) |
Z | \(Z^t\) | Set of cluster centroids |
\({\mathscr {Z}}\) | – | Set of all possible Z values |
f(U, Z) | \(f(U^t,Z^t)\) | k-means objective function defined on \({\mathscr {U}} \times {\mathscr {Z}}\) |
\(X_l\) | – | Set of all \({\mathbf {x}}_i \in X\) having observed values for feature l |
\(U^*\) | – | Final cluster memberships found by k-means-FWPD |
\(Z^*\) | – | Final cluster centroids found by k-means-FWPD |
T | – | The convergent iteration of k-means-FWPD |
– | \(\tau \) | Any iteration preceding the current iteration t |
\({\mathscr {F}}(Z)\) | – | Set of feasible membership matrices for Z |
\({\mathscr {F}}(U)\) | – | Set of feasible centroid sets for U |
\({\mathscr {S}}(U)\) | – | Set of super-feasible centroids sets for U |
\(({\tilde{U}},{\tilde{Z}})\) | – | A partial optimal solution of the k-means-FWPD problem |
D | – | A feasible direction of movement for \(U^*\) |
\({\mathscr {O}}\) | – | Big O notation |
3 k-means clustering for datasets with missing features using the proposed FWPD
This section presents a reformulation of the k-means clustering problem for datasets with missing features, using the FWPD measure proposed in Sect. 2. The important notations used in this section (and beyond) are summarized in Table 2. The k-means problem (a term coined by MacQueen (1967)) deals with the partitioning of a set of n data instances into \(k(< n)\) clusters so as to minimize the sum of within-cluster dissimilarities. The standard heuristic algorithm to solve the k-means problem, referred to as the k-means algorithm, was first proposed by Lloyd in 1957 (Lloyd 1982), and rediscovered by Forgy (1965). Starting with random assignments of each of the data instances to one of the k clusters, the k-means algorithm functions by iteratively recalculating the k cluster centroids and reassigning the data instances to the nearest cluster (the cluster corresponding to the nearest cluster centroid), in an alternating manner. Selim and Ismail (1984) showed that the k-means algorithm converges to a local optimum of the non-convex optimization problem posed by the k-means problem, when the dissimilarity used is the Euclidean distance between data points.
where \(U=[u_{i,j}]\) is the \(n \times k\) real matrix of memberships, \(d_{max}\) denotes the maximum observed distance between any two data points \({\mathbf {x}}_i,{\mathbf {x}}_j \in X\), \(\gamma _{{\mathbf {z}}_j}\) denotes the set of observed features for \({\mathbf {z}}_j\) \((j \in \{1, 2, \cdots , k\})\), \(C_{j}\) denotes the j-th cluster (corresponding to the centroid \({\mathbf {z}}_j\)), \(Z=\{{\mathbf {z}}_1, \cdots , {\mathbf {z}}_k\}\), and it is said that \({\mathbf {x}}_i \in C_{j}\) when \(u_{i,j}=1\).
3.1 The k-means-FWPD algorithm
- 1.
Start with a random initial set of cluster assignments U such that \(\sum _{j=1}^{k} u_{i,j}=1\). Set \(t=1\) and specify the maximum number of iterations MaxIter.
- 2.
For each cluster \(C_{j}^{t}\) \((j = 1, 2, \cdots , k)\), calculate the observed features of the cluster centroid \({\mathbf {z}}_{j}^{t}\). The value for the l-th feature of a centroid \({\mathbf {z}}_{j}^{t}\) should be the average of the corresponding feature values for all the data instances in the cluster \(C_{j}^{t}\) having observed values for the l-th feature. If none of the data instances in \(C_{j}^{t}\) have observed values for the feature in question, then the value \(z_{j,l}^{t-1}\) of the feature from the previous iteration should be retained. Therefore, the feature values are calculated as follows:
$$\begin{aligned} z_{j,l}^{t}=\left\{ \begin{array}{ll} \left( \underset{{\mathbf {x}}_i \in X_l}{\sum }\; u_{i,j}^{t} \times x_{i,l}\right) \bigg / \left( \underset{{\mathbf {x}}_i \in X_l}{\sum }\; u_{i,j}^{t}\right) \text { }, &{} \forall \text { } l \in \mathop \bigcup \nolimits _{{\mathbf {x}}_i \in C_{j}^{t}} \gamma _{{\mathbf {x}}_i},\\ z_{j,l}^{t-1}, &{} \forall \text { } l \in \gamma _{{\mathbf {z}}_j^{t-1}} \backslash \mathop \bigcup \nolimits _{{\mathbf {x}}_i \in C_{j}^{t}} \gamma _{{\mathbf {x}}_i},\\ \end{array} \right. \end{aligned}$$(11)
where \(X_l\) denotes the set of all \({\mathbf {x}}_i \in X\) having observed values for the feature l.
- 3.
Assign each data point \({\mathbf {x}}_i\) \((i=1, 2, \cdots , n)\) to the cluster corresponding to its nearest (in terms of FWPD) centroid, i.e.$$\begin{aligned} u_{i,j}^{t+1} = \left\{ \begin{array}{ll} 1, &{} \text{ if } {\mathbf {z}}_{j}^{t}=\mathop {\mathrm{arg\, min}}\limits _{{\mathbf {z}} \in Z^t} \; \delta ({\mathbf {x}}_i,{\mathbf {z}}),\\ 0, &{} \text{ otherwise }.\\ \end{array} \right. \end{aligned}$$
Set \(t=t+1\). If \(U^{t}=U^{t-1}\) or \(t = MaxIter\), then go to Step 4; otherwise go to Step 2.
- 4.
Calculate the final cluster centroid set \(Z^*\) as$$\begin{aligned} z_{j,l}^{*}=\frac{{\mathop \sum \limits }_{{\mathbf {x}}_i \in X_l}\; u_{i,j}^{t+1} \times x_{i,l}}{{\mathop \sum \limits }_{{\mathbf {x}}_i \in X_l}\; u_{i,j}^{t+1}} \text { } \forall \text { } l \in \bigcup _{{\mathbf {x}}_i \in C_{j}^{t+1}} \gamma _{{\mathbf {x}}_i}. \end{aligned}$$(12)
Set \(U^* = U^{t+1}\).
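Steps 1–3 above can be sketched in Python as follows. This is a minimal illustration and not the authors' implementation: the dissimilarity \(\delta \) is assumed to take the observed-distance-plus-weighted-penalty composition of FWPD defined earlier in the paper (with feature weights \(\nu _l\) proportional to the number of instances observing feature l), missing values are encoded as None, and the final recomputation of Step 4 is omitted for brevity.

```python
import math
import random

def fwpd(x, z, d_max, nu, alpha):
    """Assumed FWPD form: (1-alpha) * (observed distance / d_max) + alpha *
    (weighted penalty on features observed in only one of the two vectors),
    with feature weights nu[l]; None marks an unobserved feature."""
    m = len(x)
    common = [l for l in range(m) if x[l] is not None and z[l] is not None]
    d = math.sqrt(sum((x[l] - z[l]) ** 2 for l in common))
    single = [l for l in range(m) if (x[l] is None) != (z[l] is None)]
    p = sum(nu[l] for l in single) / sum(nu)
    return (1 - alpha) * d / d_max + alpha * p

def kmeans_fwpd(X, k, alpha=0.25, max_iter=500, seed=None):
    rng = random.Random(seed)
    n, m = len(X), len(X[0])
    nu = [sum(x[l] is not None for x in X) for l in range(m)]
    # d_max: maximum observed distance between any two data points
    d_max = 1e-12
    for i in range(n):
        for j in range(i + 1, n):
            common = [l for l in range(m)
                      if X[i][l] is not None and X[j][l] is not None]
            d_max = max(d_max, math.sqrt(sum((X[i][l] - X[j][l]) ** 2
                                             for l in common)))
    labels = [rng.randrange(k) for _ in range(n)]      # Step 1
    Z = [[None] * m for _ in range(k)]
    for _ in range(max_iter):
        for j in range(k):                             # Step 2, Eq. (11)
            members = [X[i] for i in range(n) if labels[i] == j]
            for l in range(m):
                vals = [x[l] for x in members if x[l] is not None]
                if vals:
                    Z[j][l] = sum(vals) / len(vals)
                # else: retain the previous value of Z[j][l]
        new = [min(range(k), key=lambda j: fwpd(x, Z[j], d_max, nu, alpha))
               for x in X]                             # Step 3
        if new == labels:                              # convergence check
            break
        labels = new
    return labels, Z
```

Note that, as in Eq. (11), a centroid feature is only updated when at least one cluster member observes that feature; otherwise the value from the previous iteration is retained.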
Remark 1
Each iteration of the traditional k-means algorithm is known to result in a decrease in the value of the objective function f (Selim and Ismail 1984) (Fig. 2a). For the k-means-FWPD algorithm, however, the \(Z^t\) calculations in some of the iterations may result in a finite increase in f, as shown in Fig. 2b. We show in Theorem 3 that only a finite number of such increments can occur during a given run of the algorithm, which ensures ultimate convergence. Moreover, the final feasible, locally optimal solution is obtained by Step 4 (denoted by the dotted line), which does not result in any further change to the objective function value.
3.2 Notions of feasibility in problem P
Let \({\mathscr {U}}\) and \({\mathscr {Z}}\) respectively denote the sets of all possible U and Z. Unlike the traditional k-means problem, the entire \({\mathscr {U}} \times {\mathscr {Z}}\) space is not feasible for the Problem P. There exists a set of feasible U for a given Z. Similarly, there exist sets of feasible and super-feasible Z (a super-set of the set of feasible Z) for a given U. In this subsection, we formally define these notions.
Definition 6
Definition 7
Definition 8
Remark 2
The k-means-FWPD problem differs from traditional k-means in that not all \(U \in {\mathscr {U}}\) are feasible for a given Z. Additionally, for a given U, there exists a set \({\mathscr {S}}(U)\) of super-feasible Z, with \({\mathscr {F}}(U)\), a subset of \({\mathscr {S}}(U)\), being the set of feasible Z. The traversal of the k-means-FWPD algorithm is illustrated in Fig. 3, where the grey solid straight lines denote the set of feasible Z for the current \(U^t\) while the rest of the super-feasible region is marked by the corresponding grey dotted straight line. Furthermore, the grey jagged lines denote the feasible set of U for the current \(Z^t\). Starting with a random \(U^1 \in {\mathscr {U}}\) (Step 1), the algorithm finds \(Z^1 \in {\mathscr {S}}(U^1)\) (Step 2), \(U^2 \in {\mathscr {F}}(Z^1)\) (Step 3), and \(Z^2 \in {\mathscr {F}}(U^2)\) (Step 2). However, it subsequently finds \(U^3 \not \in {\mathscr {F}}(Z^2)\) (Step 3), necessitating a feasibility adjustment (see Sect. 3.4) while calculating \(Z^3\) (Step 2). Subsequently, the algorithm converges to \((U^5,Z^4)\). For the convergent \((U^{T+1},Z^T)\), \(U^{T+1} \in {\mathscr {F}}(Z^T)\) but it is possible that \(Z^T \in {\mathscr {S}}(U^{T+1})\backslash {\mathscr {F}}(U^{T+1})\) (as seen in the case of Fig. 3). However, the final \((U^*,Z^*)\) (obtained by the dotted black line transition denoting Step 4) is feasible in both respects and is shown (in Theorem 5) to be locally optimal in the corresponding feasible region.
3.3 Partial optimal solutions
This subsection deals with the concept of partial optimal solutions of the problem P, to one of which the k-means-FWPD algorithm is shown to converge (prior to Step 4). The following definition formally presents the concept of a partial optimal solution.
Definition 9
Lemma 5
Given a \(U^t\), the centroid matrix \(Z^t\) calculated using Eq. (11) is an optimal solution of the Problem P1.
Proof
Lemma 6
For a given \(Z^t\), problem P2 is solved if \(u^{t+1}_{i,j}=1\) and \(u^{t+1}_{i,j^{'}}=0\) \(\forall \) \(i \in \{1, \cdots , n\}\) when \(\delta ({\mathbf {x}}_i,{\mathbf {z}}^t_j) \le \delta ({\mathbf {x}}_i,{\mathbf {z}}^t_{j^{'}})\), for all \(j^{'} \ne j\).
Proof
It is clear that the contribution of \({\mathbf {x}}_i\) to the total objective function is \(\delta ({\mathbf {x}}_i,{\mathbf {z}}^t_j)\) when \(u^{t+1}_{i,j}=1\) and \(u^{t+1}_{i,j^{'}}=0\) \(\forall \) \(j^{'} \ne j\). Since any alternative solution is an extreme point of \({\mathscr {U}}\) (Selim and Ismail 1984), it must satisfy (10c). Therefore, the contribution of \({\mathbf {x}}_i\) to the objective function for an alternative solution will be some \(\delta ({\mathbf {x}}_i,{\mathbf {z}}^t_{j^{'}}) \ge \delta ({\mathbf {x}}_i,{\mathbf {z}}^t_j)\). Hence, the contribution of \({\mathbf {x}}_i\) is minimized by assigning \(u^{t+1}_{i,j}=1\) and \(u^{t+1}_{i,j^{'}}=0\) \(\forall \) \(j^{'} \ne j\). This argument holds true for all \({\mathbf {x}}_i \in X\), i.e. \(\forall \) \(i \in \{1, \cdots , n\}\). This completes the proof. \(\square \)
Theorem 2
The k-means-FWPD algorithm finds a partial optimal solution of P.
Proof
Let T denote the terminal iteration. Since Step 2 and Step 3 of the k-means-FWPD algorithm respectively solve P1 and P2, the algorithm terminates only when the obtained iterate \((U^{T+1},Z^{T})\) solves both P1 and P2. Therefore, \(f(U^{T+1},Z^{T}) \le f(U^{T+1},Z) \text { } \forall Z \in {\mathscr {S}}(U^{T+1})\). Since Step 2 ensures that \(Z^{T} \in {\mathscr {S}}(U^{T})\) and \(U^{T+1} = U^{T}\), we must have \(Z^{T} \in {\mathscr {S}}(U^{T+1})\). Moreover, \(f(U^{T+1},Z^{T}) \le f(U,Z^T)\) \(\forall U \in {\mathscr {U}}\) which implies \(f(U^{T+1},Z^{T}) \le f(U,Z^T) \text { } \forall U \in {\mathscr {F}}(Z^T)\). Now, Step 2 ensures that \(\gamma _{{\mathbf {z}}^{T}_j} \supseteq \bigcup _{{\mathbf {x}}_{i} \in C^T_{j}} \gamma _{{\mathbf {x}}_i} \text { } \forall \text { } j \in \{1, 2, \cdots , k\}\). Since we must have \(U^{T+1} = U^{T}\) for convergence to occur, it follows that \(\gamma _{{\mathbf {z}}^{T}_j} \supseteq \bigcup _{{\mathbf {x}}_{i} \in C^{T+1}_{j}} \gamma _{{\mathbf {x}}_i} \text { } \forall \text { } j \in \{1, 2, \cdots , k\}\), hence \(u^{T+1}_{i,j} = 1\) implies \(\gamma _{{\mathbf {z}}^T_j} \supseteq \gamma _{{\mathbf {x}}_i}\). Therefore, \(U^{T+1} \in {\mathscr {F}}(Z^T)\). Consequently, the terminal iterate of Step 3 of the k-means-FWPD algorithm must be a partial optimal solution of P. \(\square \)
3.4 Feasibility adjustments
The number of observed features of a cluster centroid may increase over the iterations in order to maintain feasibility w.r.t. constraint (10d). We therefore introduce the concept of feasibility adjustment, the consequences of which are discussed in this subsection.
Definition 10
A feasibility adjustment for cluster j (\(j \in \{1,2,\cdots ,k\}\)) is said to occur in iteration t if \(\gamma _{{\mathbf {z}}^t_j} \supset \gamma _{{\mathbf {z}}^{t-1}_j}\) (equivalently, \(\gamma _{{\mathbf {z}}^t_j} \backslash \gamma _{{\mathbf {z}}^{t-1}_j} \ne \phi \), since by Eq. (11) a centroid retains all features observed in the previous iteration), i.e. if the centroid \({\mathbf {z}}^t_j\) acquires an observed value for at least one feature which was unobserved for its counterpart \({\mathbf {z}}^{t-1}_j\) in the previous iteration.
The following lemma shows that feasibility adjustment can only occur for a cluster as a result of the addition of a new data point previously unassigned to it.
Lemma 7
Feasibility adjustment occurs for a cluster \(C_j\) in iteration t iff at least one data point \({\mathbf {x}}_i\), such that \(\gamma _{{\mathbf {x}}_i} \backslash \gamma _{{\mathbf {z}}^{\tau }_j} \ne \phi \) \(\forall \tau < t\), which was previously unassigned to \(C_j\) (i.e. \(u^{\tau }_{i,j} = 0\) \(\forall \tau < t\)) is assigned to it in iteration t.
Proof
Due to Eq. (11), all features defined for \({\mathbf {z}}^{t-1}_j\) are also retained for \({\mathbf {z}}^t_j\). Therefore, for \(\gamma _{{\mathbf {z}}^t_j} \backslash \gamma _{{\mathbf {z}}^{t-1}_j} \ne \phi \) there must exist some \({\mathbf {x}}_i\) such that \(u^t_{i,j} = 1\), \(u^{t-1}_{i,j} = 0\), and \(\gamma _{{\mathbf {x}}_i} \backslash \gamma _{{\mathbf {z}}^{t-1}_j} \ne \phi \). Since the set of defined features for any cluster centroid is a monotonically growing set, we have \(\gamma _{{\mathbf {x}}_i} \backslash \gamma _{{\mathbf {z}}^{\tau }_j} \ne \phi \) \(\forall \tau < t\). It then follows from constraint (10d) that \(u^{\tau }_{i,j} = 0\) \(\forall \tau < t\). Now, to prove the converse, let us assume the existence of some \({\mathbf {x}}_i\) such that \(\gamma _{{\mathbf {x}}_i} \backslash \gamma _{{\mathbf {z}}^{\tau }_j} \ne \phi \) \(\forall \tau < t\) and \(u^{\tau }_{i,j} = 0\) \(\forall \tau < t\). Since \(\gamma _{{\mathbf {x}}_i} \backslash \gamma _{{\mathbf {z}}^{t-1}_j} \ne \phi \) and \(\gamma _{{\mathbf {z}}^t_j} \supseteq \gamma _{{\mathbf {x}}_i} \bigcup \gamma _{{\mathbf {z}}^{t-1}_j}\), it follows that \(\gamma _{{\mathbf {z}}^t_j} \backslash \gamma _{{\mathbf {z}}^{t-1}_j} \ne \phi \). \(\square \)
The following theorem deals with the consequences of the feasibility adjustment phenomenon.
Theorem 3
For a finite number of iterations during a single run of the k-means-FWPD algorithm, there may be a finite increment in the objective function f, due to the occurrence of feasibility adjustments.
Proof
It follows from Lemma 5 that \(f(U^t,Z^t) \le f(U^t,Z)\) \(\forall Z \in {\mathscr {S}}(U^t)\). If there is no feasibility adjustment in iteration t, \({\mathscr {S}}(U^{t-1}) = {\mathscr {S}}(U^t)\). Hence, \(f(U^t,Z^t) \le f(U^t,Z^{t-1})\). However, if a feasibility adjustment occurs in iteration t, then \(\gamma _{{\mathbf {z}}^t_j} \supset \gamma _{{\mathbf {z}}^{t-1}_j}\) for at least one \(j \in \{1,2,\cdots ,k\}\). Hence, \(Z^{t-1} \in {\mathscr {Z}} \backslash {\mathscr {S}}(U^t)\) and we may have \(f(U^t,Z^t) > f(U^t,Z^{t-1})\). Since both \(f(U^t,Z^t)\) and \(f(U^t,Z^{t-1})\) are finite, \((f(U^t,Z^t) - f(U^t,Z^{t-1}))\) must also be finite. Now, the maximum number of feasibility adjustments occurs in the worst-case scenario where each data point, having a unique set of observed features (which are unobserved for all other data points), traverses all the clusters before convergence. Therefore, the maximum number of possible feasibility adjustments during a single run of the k-means-FWPD algorithm is \(n(k-1)\), which is finite. \(\square \)
3.5 Convergence of the k-means-FWPD algorithm
We now show that the k-means-FWPD algorithm converges to a partial optimal solution within a finite number of iterations. The following lemma and the subsequent theorem are concerned with this.
Lemma 8
Starting with a given iterate \((U^t,Z^t)\), the k-means-FWPD algorithm either reaches convergence or encounters a feasibility adjustment, within a finite number of iterations.
Proof
Let us first note that there are a finite number of extreme points of \({\mathscr {U}}\). Next, we observe that an extreme point of \({\mathscr {U}}\) is visited at most once by the algorithm before either convergence or the next feasibility adjustment. Suppose this is not true, and let \(U^{t_1}=U^{t_2}\) for distinct iterations \(t_1\) and \(t_2\) \((t_1 \ge t, t_1 < t_2)\) of the algorithm. Applying Step 2 of the algorithm, we get \(Z^{t_1}\) and \(Z^{t_2}\) as optimal centroid sets for \(U^{t_1}\) and \(U^{t_2}\), respectively. Then, \(f(U^{t_1},Z^{t_1}) = f(U^{t_2},Z^{t_2})\) since \(U^{t_1}=U^{t_2}\). However, it is clear from Lemmas 5, 6 and Theorem 3 that f strictly decreases subsequent to the iterate \((U^t,Z^t)\) and prior to either the next feasibility adjustment (in which case the value of f may increase) or convergence (in which case f remains unchanged), which is a contradiction. Hence, \(U^{t_1} \ne U^{t_2}\). Therefore, it is clear from the above argument that the k-means-FWPD algorithm either converges or encounters a feasibility adjustment within a finite number of iterations. \(\square \)
Theorem 4
The k-means-FWPD algorithm converges to a partial optimal solution within a finite number of iterations.
Proof
It follows from Lemma 8 that the first feasibility adjustment is encountered within a finite number of iterations of initialization and that each subsequent feasibility adjustment occurs within a finite number of iterations of the previous one. Moreover, we know from Theorem 3 that there can only be a finite number of feasibility adjustments during a single run of the algorithm. Therefore, the final feasibility adjustment must occur within a finite number of iterations. It then follows from Lemma 8 that the algorithm converges within a finite number of subsequent iterations. Hence, the k-means-FWPD algorithm must converge within a finite number of iterations. \(\square \)
3.6 Local optimality of the final solution
In this subsection, we establish the local optimality of the final solution obtained in Step 4 of the k-means-FWPD algorithm, subsequent to convergence in Step 3.
Lemma 9
\(Z^*\) is the unique optimal feasible cluster centroid set for \(U^*\), i.e. \(Z^* \in {\mathscr {F}}(U^*)\) and \(f(U^*,Z^*) \le f(U^*,Z) \text { } \forall Z \in {\mathscr {F}}(U^*)\).
Proof
Lemma 10
If \(Z^*\) is the unique optimal feasible cluster centroid set for \(U^*\), then \(f(U^*,Z^*) \le f(U,Z^*) \text { } \forall U \in {\mathscr {F}}(Z^*)\).
Proof
We know from Theorem 2 that \(f(U^{*},Z^{T}) \le f(U,Z^T) \text { } \forall U \in {\mathscr {F}}(Z^T)\). Now, \(\gamma _{{\mathbf {z}}^*_j} \subseteq \gamma _{{\mathbf {z}}^T_j} \text { } \forall j \in \{1,2,\cdots ,k\}\). Therefore, \({\mathscr {F}}(Z^*) \subseteq {\mathscr {F}}(Z^T)\) must hold. It therefore follows that \(f(U^*,Z^*) \le f(U,Z^*) \text { } \forall U \in {\mathscr {F}}(Z^*)\). \(\square \)
Now, the following theorem shows that the final solution obtained by Step 4 of the k-means-FWPD algorithm is locally optimal.
Theorem 5
The final solution \((U^*,Z^*)\) obtained by Step 4 of the k-means-FWPD algorithm is a locally optimal solution of P.
Proof
3.7 Time complexity of the k-means-FWPD algorithm
- 1.
Centroid Calculation: As a maximum of m features of each centroid must be calculated, the complexity of centroid calculation is at most \({\mathscr {O}}(kmn)\).
- 2.
Distance Calculation: As each distance calculation involves at most m features, the observed distance calculation between n data instances and k cluster centroids is at most \({\mathscr {O}}(kmn)\).
- 3.
Penalty Calculation: The penalty calculation between a data point and a cluster centroid involves at most m summations. Hence, penalty calculation over all possible pairings is at most \({\mathscr {O}}(kmn)\).
- 4.
Cluster Assignment: The assignment of n data points to k clusters consists of the comparisons of the dissimilarities of each point with the k centroids, which is \({\mathscr {O}}(nk)\). Hence, each iteration of the k-means-FWPD algorithm has an overall time complexity of \({\mathscr {O}}(kmn)\).
4 Hierarchical agglomerative clustering for datasets with missing features using the proposed FWPD
Some important notations used in Sect. 4 and beyond
Notation | Meaning |
---|---|
\(B^t\) | Set of hierarchical clusters obtained in iteration t of HAC-FWPD |
\(\beta ^t_i\) | i-th hierarchical cluster in \(B^t\) |
\(Q^t\) | Matrix of dissimilarities between the hierarchical clusters in \(B^t\) |
\(q^t(i,j)\) | (i, j)-th element of \(Q^t\) |
\(q^t_{min}\) | Smallest non-zero value in \(Q^t\) |
M | List of locations in \(Q^t\) having value \(q^t_{min}\) |
G | New hierarchical cluster formed by merging two of the closest hierarchical clusters in \(B^t\) |
\(i_G\) | Location of G in the set \(B^{t+1}\) |
\(L(G,\beta )\) | Linkage between two hierarchical clusters G and \(\beta \) |
- 1.
Single Linkage with FWPD (SL-FWPD): The SL between two clusters \(C_i\) and \(C_j\) is \(\min \{\delta ({\mathbf {x}}_i,{\mathbf {x}}_j):{\mathbf {x}}_i \in C_i, {\mathbf {x}}_j \in C_j\}\).
- 2.
Complete Linkage with FWPD (CL-FWPD): The CL between two clusters \(C_i\) and \(C_j\) is \(\max \{\delta ({\mathbf {x}}_i,{\mathbf {x}}_j):{\mathbf {x}}_i \in C_i, {\mathbf {x}}_j \in C_j\}\).
- 3.
Average Linkage with FWPD (AL-FWPD): The AL between two clusters \(C_i\) and \(C_j\) is \(\frac{1}{|C_i| \times |C_j|} \underset{{\mathbf {x}}_i \in C_i}{\sum } \underset{{\mathbf {x}}_j \in C_j}{\sum } \delta ({\mathbf {x}}_i,{\mathbf {x}}_j)\), where \(|C_i|\) and \(|C_j|\) respectively denote the number of instances in the clusters \(C_i\) and \(C_j\).
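The three linkages can be read directly off a precomputed pairwise FWPD matrix. A small stdlib helper (function and parameter names are ours, for illustration):

```python
def linkage(D, A, B, kind="single"):
    """Linkage between clusters A and B (lists of point indices), given a
    precomputed pairwise FWPD matrix D (list of lists)."""
    dists = [D[i][j] for i in A for j in B]
    if kind == "single":                 # SL-FWPD: minimum dissimilarity
        return min(dists)
    if kind == "complete":               # CL-FWPD: maximum dissimilarity
        return max(dists)
    return sum(dists) / len(dists)       # AL-FWPD: average dissimilarity
```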
4.1 The HAC-FWPD algorithm
- 1.
Set \(B^0 = X\). Compute pairwise dissimilarities \(\delta ({\mathbf {x}}_i,{\mathbf {x}}_j)\), \(\forall \) \({\mathbf {x}}_i,{\mathbf {x}}_j \in X\) and construct the dissimilarity matrix \(Q^0\) so that \(q^0(i,j) = \delta ({\mathbf {x}}_i,{\mathbf {x}}_j)\). Set \(t=0\).
- 2.
Search \(Q^t\) to identify the set \(M = \{(i_1,j_1), (i_2,j_2), \cdots , (i_k,j_k)\}\) containing all the pairs of indexes such that \(q^t(i_r,j_r) = q^t_{min}\) \(\forall \) \(r \in \{1, 2, \cdots , k\}\), \(q^t_{min}\) being the smallest non-zero element in \(Q^t\).
- 3.
Merge the elements corresponding to any one pair in M, say \(\beta ^t_{i_r}\) and \(\beta ^t_{j_r}\) corresponding to the pair \((i_r,j_r)\), into a single group \(G = \{\beta ^t_{i_r}, \beta ^t_{j_r}\}\). Construct \(B^{t+1}\) by removing \(\beta ^t_{i_r}\) and \(\beta ^t_{j_r}\) from \(B^t\) and inserting G.
- 4.
Define \(Q^{t+1}\) on \(B^{t+1} \times B^{t+1}\) as \(q^{t+1}(i,j) = q^t(i,j)\) \(\forall \) \(i, j \text { such that } \beta ^t_i, \beta ^t_j \ne G\) and \(q^{t+1}(i,i_G) = q^{t+1}(i_G,i) = L(G,\beta ^t_i)\), where \(i_G\) denotes the location of G in \(B^{t+1}\) and$$\begin{aligned} L(G,\beta ) = \left\{ \begin{array}{ll} \underset{{\mathbf {x}}_i \in G,{\mathbf {x}}_j \in \beta }{\min } \delta ({\mathbf {x}}_i,{\mathbf {x}}_j) &{}\text{ for } \text{ SL-FWPD },\\ \underset{{\mathbf {x}}_i \in G,{\mathbf {x}}_j \in \beta }{\max } \delta ({\mathbf {x}}_i,{\mathbf {x}}_j) &{}\text{ for } \text{ CL-FWPD },\\ \frac{1}{|G| \times |\beta |} \underset{{\mathbf {x}}_i \in G}{\sum } \underset{{\mathbf {x}}_j \in \beta }{\sum } \delta ({\mathbf {x}}_i,{\mathbf {x}}_j) &{}\text{ for } \text{ AL-FWPD }.\\ \end{array}\right. \end{aligned}$$
Set \(t=t+1\).
- 5.
Repeat Steps 2-4 until \(B^t\) contains a single element.
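Steps 1–5 above can be sketched as follows. For simplicity this sketch recomputes each linkage from the initial pairwise FWPD matrix \(Q^0\) instead of updating \(Q^{t+1}\) incrementally as in Step 4; since SL, CL and AL are all defined over the original point pairs, the result is the same, only costlier. Function names are ours.

```python
def hac_fwpd(D, kind="single"):
    """Agglomerative clustering from a precomputed pairwise FWPD matrix D
    (list of lists); returns the merge history as (cluster, cluster,
    linkage-value) triples."""
    clusters = [[i] for i in range(len(D))]        # Step 1: singletons
    merges = []
    while len(clusters) > 1:                       # Step 5: until one cluster
        best = None
        for a in range(len(clusters)):             # Step 2: closest pair
            for b in range(a + 1, len(clusters)):
                dists = [D[i][j] for i in clusters[a] for j in clusters[b]]
                d = {"single": min(dists), "complete": max(dists),
                     "average": sum(dists) / len(dists)}[kind]
                if best is None or d < best[0]:
                    best = (d, a, b)
        d, a, b = best
        merges.append((clusters[a], clusters[b], d))   # Step 3: merge
        merged = clusters[a] + clusters[b]
        clusters = [c for i, c in enumerate(clusters)
                    if i not in (a, b)] + [merged]
    return merges
```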
4.2 Time complexity of the HAC-FWPD algorithm
- 1.
Distance Calculation: As each distance calculation involves at most m features, the calculation of all pairwise observed distances among n data instances is at most \({\mathscr {O}}(n^{2}m)\).
- 2.
Penalty Calculation: The penalty calculation between a pair of data points involves at most m summations. Hence, penalty calculation over all possible pairings is at most \({\mathscr {O}}(n^{2}m)\).
- 3.
Cluster Merging: Two clusters are merged in each of the \(n-1\) steps of the algorithm, and each merge (including the search of \(Q^t\) and the update to \(Q^{t+1}\)) has a time complexity of at most \({\mathscr {O}}(n^2)\). Hence, the overall time complexity of the HAC-FWPD algorithm is \({\mathscr {O}}(n^{2}m + n^{3})\).
5 Experimental results
In this section, we report the results of several experiments carried out to validate the merit of the proposed k-means-FWPD and HAC-FWPD clustering algorithms. In the following subsections, we describe the experimental setup used to validate the proposed techniques. The results of the experiments for the k-means-FWPD algorithm and the HAC-FWPD algorithm are presented thereafter.
5.1 Experiment setup
The Adjusted Rand Index (ARI) (Hubert and Arabie 1985) is a popular validity index used to judge the merit of clustering algorithms. When the true class labels are known, ARI provides a measure of the similarity between the cluster partition obtained by a clustering technique and the true class labels. Therefore, a high value of ARI is taken to indicate a better clustering. However, the class labels may not always be in keeping with the natural cluster structure of the dataset; in such cases, good clusterings are likely to achieve lower values of this index than possibly erroneous partitions (which are more akin to the class labels). The purpose of our experiments, though, is to find out how close the clusterings obtained by the proposed methods (and the contending techniques) are to those obtained by the standard algorithms (the k-means algorithm and the HAC algorithm); the proposed methods (and their contenders) are run on the datasets with missingness, while the standard methods are run on the corresponding fully observed datasets. Hence, the clusterings obtained by the standard algorithms are used as the ground truths against which the ARI values of the proposed methods (and their contenders) are calculated. The performances of ZI, MI, kNNI (with \(k \in \{3,5,10,20\}\)) and SVDI (using the most significant 10% of the eigenvectors) are compared with those of the proposed methods. The variant of MI used in these experiments differs from the traditional technique in that we impute with the average of the class-wise averages, instead of the overall average; this is done to minimize the effect of any severe class imbalance in the datasets. We also conduct Wilcoxon's signed rank test (Wilcoxon 1945) to evaluate the statistical significance of the observed results.
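For reference, ARI can be computed directly from the contingency counts of two labelings. A minimal stdlib implementation of the Hubert and Arabie (1985) formula (the function name is ours; sklearn's `adjusted_rand_score` computes the same quantity):

```python
from math import comb
from collections import Counter

def adjusted_rand_index(a, b):
    """ARI between two labelings of the same n >= 2 items."""
    n = len(a)
    # contingency counts: pairs of (label in a, label in b)
    sum_ij = sum(comb(c, 2) for c in Counter(zip(a, b)).values())
    sum_a = sum(comb(c, 2) for c in Counter(a).values())
    sum_b = sum(comb(c, 2) for c in Counter(b).values())
    expected = sum_a * sum_b / comb(n, 2)     # expected index under chance
    max_index = (sum_a + sum_b) / 2
    if max_index == expected:                 # both partitions trivial
        return 1.0
    return (sum_ij - expected) / (max_index - expected)
```

ARI is invariant to label permutations, which is what makes it suitable for comparing a clustering against ground-truth partitions.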
The performance of k-means depends on the initial cluster assignment. Therefore, to ensure fairness, we use the same set of random initial cluster assignments for the standard k-means algorithm on the fully observed dataset as well as for the proposed k-means-FWPD method (and its contenders). The maximum number of iterations of the k-means variants is set to \(MaxIter = 500\). Results are recorded in terms of average ARI values over 50 different runs on each dataset. The number of clusters is taken to be the same as the number of classes.
Details of the 20 real-world datasets
Dataset | #Instances | #Features | #Classes | Repository |
---|---|---|---|---|
Chronic kidney | 800 | 24 | 2 | UCI |
Colon | 62 | 2000 | 2 | JGD |
\(\hbox {GSAD}^{*}\) \(1^{\dagger }\) | 445 | 128 | 6 | UCI |
Glass | 214 | 9 | 6 | UCI |
Iris | 150 | 4 | 3 | UCI |
Isolet \(5^{\dagger }\) | 1559 | 617 | 26 | UCI |
Landsat | 6435 | 36 | 6 | UCI |
Leaf | 340 | 15 | 36 | UCI |
Libras | 360 | 90 | 15 | UCI |
Lung | 181 | 12,533 | 2 | JGD |
Lung Cancer | 27 | 56 | 3 | UCI |
Lymphoma | 62 | 4026 | 3 | JGD |
Pendigits | 10,992 | 16 | 10 | UCI |
Prostate | 102 | 6033 | 2 | JGD |
Seeds | 210 | 7 | 3 | UCI |
\(\hbox {Sensorless}^{\dagger }\) | 6000 | 48 | 11 | UCI |
Sonar | 208 | 60 | 2 | UCI |
\(\hbox {Theorem proving}^{\dagger }\) | 3059 | 51 | 6 | UCI |
Vehicle | 94 | 18 | 4 | UCI |
Vowel context | 990 | 14 | 11 | UCI |
5.1.1 Datasets
We take 20 real-world datasets from the University of California at Irvine (UCI) repository (Dheeru and Karra Taniskidou 2017) and the Jin Genomics Dataset (JGD) repository (Jin 2017). Each feature of each dataset is normalized so as to have zero mean and unit standard deviation. The details of these 20 datasets are listed in Table 4.
5.1.2 Simulating missingness mechanisms
- 1.
Specify the number of entries MissCount to be removed from the dataset. Select the missingness mechanism as one out of MCAR, MAR, MNAR-I or MNAR-II.
- 2.
If the mechanism is MCAR, go directly to Step 5. If the mechanism is MAR or MNAR-II, select a random subset \(\gamma _{miss} \subset S\) containing half of the features in S (i.e. \(|\gamma _{miss}| = \frac{m}{2}\) if |S| is even or \(\frac{m+1}{2}\) if |S| is odd). If the mechanism is MNAR-I, set \(\gamma _{miss} = S\). Identify \(\gamma _{obs} = S \backslash \gamma _{miss}\).
- 3.
If the mechanism is MAR or MNAR-II, for each feature \(l \in \gamma _{miss}\), randomly select a feature \(l_c \in \gamma _{obs}\) on which the missingness of feature l may depend.
- 4.
For each feature \(l \in \gamma _{miss}\) randomly choose a type of missingness \(MissType_l\) as one out of CENTRAL, INTERMEDIATE or EXTREMAL.
- 5.
Randomly select a non-missing entry \(x_{i,l}\) from the data matrix. If the mechanism is MCAR, mark the entry as missing and decrement \(MissCount = MissCount - 1\) and go to Step 11.
- 6.
If the mechanism is MAR, set \(\lambda = x_{i,l_c}\), \(\mu = \mu _{l_c}\) and \(\sigma = \sigma _{l_c}\), where \(\mu _{l_c}\) and \(\sigma _{l_c}\) are the mean and standard deviation of the \(l_c\)-th feature over the dataset. If the mechanism is MNAR-I, set \(\lambda = x_{i,l}\), \(\mu = \mu _{l}\) and \(\sigma = \sigma _{l}\). If the mechanism is MNAR-II, randomly set either \(\lambda = x_{i,l}\), \(\mu = \mu _{l}\) and \(\sigma = \sigma _{l}\) or \(\lambda = x_{i,l_c}\), \(\mu = \mu _{l_c}\) and \(\sigma = \sigma _{l_c}\).
- 7.
Calculate \(z = \frac{|\lambda - \mu |}{\sigma }\).
- 8.
If \(MissType_l = \text {CENTRAL}\), set \(\mu _z = 0\). If \(MissType_l = \text {INTERMEDIATE}\), set \(\mu _z = 1\). If \(MissType_l = \text {EXTREMAL}\), set \(\mu _z = 2\). Set \(\sigma _z = 0.35\).
- 9.
Calculate \(pval = \frac{1}{\sigma _z \sqrt{2 \pi }} \exp \left( - \frac{(z - \mu _z)^2}{2 \sigma _z^2}\right) \), i.e. the density of a Gaussian with mean \(\mu _z\) and standard deviation \(\sigma _z\) evaluated at z.
- 10.
Randomly generate a value qval in the interval [0, 1]. If \(pval \ge qval\), then mark the entry \(x_{i,l}\) as missing and decrement \(MissCount = MissCount - 1\).
- 11.
If \(MissCount > 0\), then go to Step 5.
For our experiments, we set \(MissCount = \frac{nm}{4}\) so as to remove 25% of the feature values from each dataset. Thus, an average of \(\frac{m}{4}\) features are missing from each data instance.
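The steps above can be sketched in Python for the MCAR and MNAR-I mechanisms (MAR and MNAR-II differ only in that a randomly linked feature \(l_c\) drives the acceptance probability instead of the feature itself). This is an illustrative sketch, not the authors' code: it assumes initially complete data and a MissCount well below nm, and uses the Gaussian acceptance of Steps 7–10 with the standard normal-density normalization.

```python
import math
import random

def simulate_missingness(X, miss_count, mechanism="MCAR",
                         miss_type="CENTRAL", seed=None):
    """Mark miss_count entries of X (list of lists, initially complete)
    as None under the MCAR or MNAR-I mechanism."""
    rng = random.Random(seed)
    n, m = len(X), len(X[0])
    X = [row[:] for row in X]                     # leave the input intact
    # per-feature mean and std over the (still complete) data
    mu = [sum(row[l] for row in X) / n for l in range(m)]
    sd = [math.sqrt(sum((row[l] - mu[l]) ** 2 for row in X) / n) or 1.0
          for l in range(m)]
    mu_z = {"CENTRAL": 0.0, "INTERMEDIATE": 1.0, "EXTREMAL": 2.0}[miss_type]
    sigma_z = 0.35
    while miss_count > 0:
        i, l = rng.randrange(n), rng.randrange(m)          # Step 5
        if X[i][l] is None:
            continue
        if mechanism == "MCAR":
            X[i][l] = None
            miss_count -= 1
            continue
        z = abs(X[i][l] - mu[l]) / sd[l]                   # Steps 6-7 (MNAR-I)
        pval = (math.exp(-(z - mu_z) ** 2 / (2 * sigma_z ** 2))
                / (sigma_z * math.sqrt(2 * math.pi)))      # Steps 8-9
        if pval >= rng.random():                           # Step 10
            X[i][l] = None
            miss_count -= 1
    return X
```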
5.1.3 Selecting the parameter \(\alpha \)
In order to conduct experiments using the FWPD measure, we need to select a value of the parameter \(\alpha \). Proper selection of \(\alpha \) may help to boost the performance of the proposed k-means-FWPD and HAC-FWPD algorithms. Therefore, in this subsection, we study the effect of \(\alpha \) on the performance of FWPD. Experiments are conducted using \(\alpha \in \{0.1, 0.25, 0.5, 0.75, 0.9\}\) on the datasets listed in Table 4, using the experimental setup detailed above. A summary of the results of this study, in terms of average ARI values, is given in Table 5.
Summary of results for different choices of \(\alpha \) in terms of average ARI values
Clustering | Type of | \(\alpha \) | ||||
---|---|---|---|---|---|---|
Algorithm | Missingness | 0.1 | 0.25 | 0.5 | 0.75 | 0.9 |
k-means-FWPD | MCAR | 0.682 | 0.712 | 0.664 | 0.691 | 0.683 |
MAR | 0.738 | 0.730 | 0.723 | 0.729 | 0.711 | |
MNAR-I | 0.649 | 0.676 | 0.675 | 0.613 | 0.666 | |
MNAR-II | 0.711 | 0.718 | 0.689 | 0.665 | 0.678 | |
Overall | 0.695 | 0.709 | 0.688 | 0.675 | 0.685 | |
HAC-FWPD | MCAR | 0.665 | 0.709 | 0.389 | 0.073 | 0.017 |
MAR | 0.740 | 0.724 | 0.441 | 0.210 | 0.094 | |
MNAR-I | 0.720 | 0.721 | 0.458 | 0.158 | 0.036 | |
MNAR-II | 0.708 | 0.716 | 0.443 | 0.140 | 0.025 | |
Overall | 0.709 | 0.718 | 0.433 | 0.145 | 0.043 |
Means and standard deviations of ARI values for the k-means-FWPD algorithm against MCAR
Dataset | \(\begin{array}{ll} \text {k-means-}\\ \text {-FWPD}\\ \end{array}\) | ZI | MI | SVDI | kNNI | \(\begin{array}{cc} \text {Best } k\\ \text {(for kNNI)}\\ \end{array}\) |
---|---|---|---|---|---|---|
Chronic Kidney | \(0.807\pm 0.002\) | \(0.813\pm 0.005\) | \(0.229\pm 0.013\) | \(0.763\pm 0.005\) | \(\mathbf{0.815 }\pm 0.003\) | 5 |
Colon | \(\mathbf{0.681 }\pm 0.341\) | \(0.656\pm 0.323\) | \(0.659\pm 0.322\) | \(0.662\pm 0.314\) | \(0.656\pm 0.323\) | 3 |
GSAD 1 | \(\mathbf{0.798 }\pm 0.236\) | \(0.625\pm 0.199\) | \(0.722\pm 0.172\) | \(0.552\pm 0.203\) | \(0.711\pm 0.195\) | 3 |
Glass | \(0.488\pm 0.097\) | \(0.466\pm 0.119\) | \(0.131\pm 0.066\) | \(0.417\pm 0.114\) | \(\mathbf{0.505 }\pm 0.134\) | 5 |
Iris | \(\mathbf{0.799 }\pm 0.119\) | \(0.672\pm 0.113\) | \(0.116\pm 0.083\) | \(0.732\pm 0.159\) | \(0.758\pm 0.157\) | 3 |
Isolet 5 | \(\mathbf{0.679 }\pm 0.117\) | \(0.623\pm 0.093\) | \(0.625\pm 0.097\) | \(0.626\pm 0.105\) | \(0.614\pm 0.072\) | 3 |
Landsat | \(\mathbf{0.937 }\pm 0.001\) | \(0.807\pm 0.001\) | \(0.798\pm 0.001\) | \(0.838\pm 0.104\) | \(0.937\pm 0.010\) | 5 |
Leaf | \(0.455\pm 0.014\) | \(0.328\pm 0.010\) | \(0.339\pm 0.019\) | \(0.354\pm 0.037\) | \(\mathbf{0.465 }\pm 0.029\) | 5 |
Libras | \(\mathbf{0.656 }\pm 0.070\) | \(0.642\pm 0.069\) | \(0.103\pm 0.019\) | \(0.619\pm 0.067\) | \(0.625\pm 0.077\) | 20 |
Lung | \(\mathbf{0.731 }\pm 0.341\) | \(0.718\pm 0.261\) | \(0.659\pm 0.341\) | \(0.694\pm 0.318\) | \(0.718\pm 0.261\) | 3 |
Lung Cancer | \(\mathbf{0.542 }\pm 0.249\) | \(0.541\pm 0.202\) | \(0.537\pm 0.214\) | \(0.529\pm 0.192\) | \(0.525\pm 0.189\) | 5 |
Lymphoma | \(\mathbf{0.755 }\pm 0.167\) | \(0.743\pm 0.175\) | \(0.700\pm 0.165\) | \(0.733\pm 0.175\) | \(0.743\pm 0.175\) | 3 |
Pendigits | \(0.729\pm 0.089\) | \(0.659\pm 0.082\) | \(0.083\pm 0.013\) | \(0.604\pm 0.063\) | \(\mathbf{0.832 }\pm 0.105\) | 3 |
Prostate | \(\mathbf{0.961 }\pm 0.025\) | \(0.944\pm 0.043\) | \(0.944\pm 0.043\) | \(0.946\pm 0.041\) | \(0.944\pm 0.043\) | 3 |
Seeds | \(\mathbf{0.866 }\pm 0.030\) | \(0.735\pm 0.021\) | \(0.242\pm 0.039\) | \(0.745\pm 0.041\) | \(0.865\pm 0.025\) | 5 |
Sensorless | \(\mathbf{0.765 }\pm 0.031\) | \(0.684\pm 0.028\) | \(0.687\pm 0.021\) | \(0.719\pm 0.060\) | \(0.726\pm 0.051\) | 20 |
Sonar | \(\mathbf{0.697 }\pm 0.195\) | \(0.681\pm 0.187\) | \(0.672\pm 0.188\) | \(0.434\pm 0.234\) | \(0.656\pm 0.162\) | 5 |
Theorem proving | \(\mathbf{0.714 }\pm 0.197\) | \(0.671\pm 0.229\) | \(0.661\pm 0.188\) | \(0.565\pm 0.139\) | \(0.672\pm 0.197\) | 20 |
Vehicle | \(0.715\pm 0.143\) | \(0.674\pm 0.139\) | \(0.114\pm 0.060\) | \(0.646\pm 0.105\) | \(\mathbf{0.723 }\pm 0.134\) | 10 |
Vowel context | \(0.458\pm 0.031\) | \(0.366\pm 0.028\) | \(0.360\pm 0.029\) | \(0.352\pm 0.022\) | \(\mathbf{0.461 }\pm 0.060\) | 3 |
Average ranks | 1.38 | 3.33 | 4.20 | 3.65 | 2.45 | |
Signed rank hypotheses (p values) | \(H_1 (0.00)\) | \(H_1 (0.00)\) | \(H_1 (0.00)\) | \(H_1 (0.03)\) |
Means and standard deviations of ARI values for the k-means-FWPD algorithm against MAR
Dataset | \(\begin{array}{ll} \text {k-means-}\\ \text {-FWPD}\\ \end{array}\) | ZI | MI | SVDI | kNNI | \(\begin{array}{cc} \text {Best } k\\ \text {(for kNNI)}\\ \end{array}\) |
---|---|---|---|---|---|---|
Chronic kidney | \(0.793\pm 0.006\) | \(0.792\pm 0.004\) | \(0.803\pm 0.003\) | \(0.787\pm 0.008\) | \(\mathbf{0.838 }\pm 0.011\) | 20 |
Colon | \(0.812\pm 0.177\) | \(0.826\pm 0.149\) | \(\mathbf{0.827 }\pm 0.157\) | \(0.796\pm 0.202\) | \(0.826\pm 0.149\) | 3 |
GSAD 1 | \(\mathbf{0.801 }\pm 0.163\) | \(0.737\pm 0.187\) | \(0.673\pm 0.179\) | \(0.719\pm 0.165\) | \(0.760\pm 0.144\) | 3 |
Glass | \(\mathbf{0.617 }\pm 0.125\) | \(0.455\pm 0.168\) | \(0.565\pm 0.128\) | \(0.411\pm 0.171\) | \(0.479\pm 0.152\) | 5 |
Iris | \(0.776\pm 0.185\) | \(0.776\pm 0.185\) | \(\mathbf{0.851 }\pm 0.144\) | \(0.776\pm 0.163\) | \(0.764\pm 0.176\) | 5 |
Isolet 5 | \(\mathbf{0.729 }\pm 0.072\) | \(0.713\pm 0.051\) | \(0.691\pm 0.046\) | \(0.704\pm 0.027\) | \(0.713\pm 0.051\) | 3 |
Landsat | \(\mathbf{0.940 }\pm 0.002\) | \(0.828\pm 0.127\) | \(0.850\pm 0.123\) | \(0.773\pm 0.150\) | \(0.899\pm 0.058\) | 3 |
Leaf | \(0.510\pm 0.022\) | \(0.392\pm 0.042\) | \(0.440\pm 0.046\) | \(0.501\pm 0.021\) | \(\mathbf{0.532 }\pm 0.046\) | 3 |
Libras | \(0.731\pm 0.076\) | \(0.700\pm 0.077\) | \(\mathbf{0.778 }\pm 0.065\) | \(0.697\pm 0.082\) | \(0.675\pm 0.071\) | 3 |
Lung | \(\mathbf{0.754 }\pm 0.131\) | \(0.711\pm 0.230\) | \(0.717\pm 0.232\) | \(0.625\pm 0.312\) | \(0.711\pm 0.230\) | 3 |
Lung cancer | \(\mathbf{0.606 }\pm 0.223\) | \(0.526\pm 0.224\) | \(0.476\pm 0.219\) | \(0.509\pm 0.202\) | \(0.493\pm 0.231\) | 3 |
Lymphoma | \(0.790\pm 0.164\) | \(0.883\pm 0.109\) | \(\mathbf{0.885 }\pm 0.102\) | \(0.875\pm 0.120\) | \(0.883\pm 0.109\) | 3 |
Pendigits | \(0.717\pm 0.068\) | \(0.494\pm 0.072\) | \(0.852\pm 0.062\) | \(0.705\pm 0.067\) | \(\mathbf{0.903 }\pm 0.065\) | 5 |
Prostate | \(\mathbf{0.990 }\pm 0.021\) | \(0.984\pm 0.036\) | \(0.986\pm 0.036\) | \(0.918\pm 0.017\) | \(0.984\pm 0.036\) | 3 |
Seeds | \(0.785\pm 0.026\) | \(0.755\pm 0.025\) | \(0.774\pm 0.027\) | \(0.752\pm 0.025\) | \(\mathbf{0.834 }\pm 0.033\) | 3 |
Sensorless | \(\mathbf{0.759 }\pm 0.038\) | \(0.663\pm 0.089\) | \(0.629\pm 0.145\) | \(0.662\pm 0.132\) | \(0.685\pm 0.087\) | 3 |
Sonar | \(\mathbf{0.620 }\pm 0.289\) | \(0.599\pm 0.326\) | \(0.598\pm 0.325\) | \(0.524\pm 0.315\) | \(0.574\pm 0.359\) | 10 |
Theorem proving | \(\mathbf{0.672 }\pm 0.179\) | \(0.636\pm 0.165\) | \(0.617\pm 0.182\) | \(0.618\pm 0.189\) | \(0.649\pm 0.145\) | 10 |
Vehicle | \(\mathbf{0.699 }\pm 0.142\) | \(0.545\pm 0.155\) | \(0.644\pm 0.147\) | \(0.566\pm 0.154\) | \(0.537\pm 0.149\) | 3 |
Vowel context | \(0.497\pm 0.044\) | \(0.461\pm 0.072\) | \(0.436\pm 0.064\) | \(0.408\pm 0.054\) | \(\mathbf{0.577 }\pm 0.043\) | 3 |
Average ranks | 1.85 | 3.33 | 2.90 | 4.25 | 2.67 | |
Signed rank hypotheses (p values) | \(H_1 (0.00)\) | \(H_0 (0.13)\) | \(H_1 (0.00)\) | \(H_0 (0.37)\) |
Means and standard deviations of ARI values for the k-means-FWPD algorithm against MNAR-I
Dataset | \(\begin{array}{ll} \text {k-means-}\\ \text {-FWPD}\\ \end{array}\) | ZI | MI | SVDI | kNNI | \(\begin{array}{cc} \text {Best } k\\ \text {(for kNNI)}\\ \end{array}\) |
---|---|---|---|---|---|---|
Chronic kidney | \(\mathbf{0.729 }\pm 0.011\) | \(0.714\pm 0.010\) | \(0.399\pm 0.051\) | \(0.599\pm 0.005\) | \(0.616\pm 0.022\) | 3 |
Colon | \(0.789\pm 0.214\) | \(0.781\pm 0.202\) | \(0.770\pm 0.205\) | \(\mathbf{0.801 }\pm 0.147\) | \(0.781\pm 0.202\) | 3 |
GSAD 1 | \(0.791\pm 0.112\) | \(\mathbf{0.799 }\pm 0.097\) | \(0.790\pm 0.110\) | \(0.689\pm 0.175\) | \(\mathbf{0.799 }\pm 0.097\) | 3 |
Glass | \(\mathbf{0.439 }\pm 0.101\) | \(0.391\pm 0.117\) | \(0.152\pm 0.048\) | \(0.388\pm 0.097\) | \(0.438\pm 0.105\) | 10 |
Iris | \(0.662\pm 0.073\) | \(0.709\pm 0.144\) | \(0.137\pm 0.077\) | \(\mathbf{0.739 }\pm 0.132\) | \(0.658\pm 0.168\) | 5 |
Isolet 5 | \(\mathbf{0.708 }\pm 0.098\) | \(0.680\pm 0.103\) | \(0.680\pm 0.082\) | \(0.663\pm 0.067\) | \(0.680\pm 0.103\) | 3 |
Landsat | \(\mathbf{0.869 }\pm 0.048\) | \(0.701\pm 0.149\) | \(0.712\pm 0.159\) | \(0.858\pm 0.058\) | \(0.813\pm 0.001\) | 10 |
Leaf | \(0.493\pm 0.052\) | \(0.416\pm 0.029\) | \(0.403\pm 0.020\) | \(0.439\pm 0.023\) | \(\mathbf{0.522 }\pm 0.040\) | 3 |
Libras | \(\mathbf{0.717 }\pm 0.083\) | \(0.667\pm 0.076\) | \(0.378\pm 0.058\) | \(0.638\pm 0.070\) | \(0.656\pm 0.067\) | 3 |
Lung | \(\mathbf{0.636 }\pm 0.201\) | \(0.592\pm 0.199\) | \(0.606\pm 0.210\) | \(0.578\pm 0.192\) | \(0.592\pm 0.199\) | 3 |
Lung cancer | \(\mathbf{0.529 }\pm 0.235\) | \(0.497\pm 0.184\) | \(0.459\pm 0.129\) | \(0.457\pm 0.217\) | \(0.497\pm 0.184\) | 3 |
Lymphoma | \(0.796\pm 0.118\) | \(\mathbf{0.798 }\pm 0.133\) | \(0.764\pm 0.130\) | \(0.764\pm 0.129\) | \(\mathbf{0.798 }\pm 0.133\) | 3 |
Pendigits | \(0.666\pm 0.079\) | \(0.635\pm 0.067\) | \(0.135\pm 0.025\) | \(0.619\pm 0.054\) | \(\mathbf{0.756 }\pm 0.093\) | 5 |
Prostate | \(\mathbf{0.975 }\pm 0.032\) | \(0.958\pm 0.075\) | \(0.962\pm 0.075\) | \(0.910\pm 0.061\) | \(0.958\pm 0.075\) | 3 |
Seeds | \(0.776\pm 0.019\) | \(0.705\pm 0.044\) | \(0.298\pm 0.065\) | \(0.725\pm 0.042\) | \(\mathbf{0.819 }\pm 0.052\) | 20 |
Sensorless | \(\mathbf{0.693 }\pm 0.041\) | \(0.638\pm 0.036\) | \(0.636\pm 0.039\) | \(0.610\pm 0.069\) | \(0.593\pm 0.051\) | 10 |
Sonar | \(\mathbf{0.600 }\pm 0.297\) | \(0.537\pm 0.287\) | \(0.546\pm 0.292\) | \(0.326\pm 0.282\) | \(0.537\pm 0.287\) | 3 |
Theorem proving | \(\mathbf{0.540 }\pm 0.211\) | \(0.399\pm 0.196\) | \(0.388\pm 0.178\) | \(0.513\pm 0.224\) | \(0.465\pm 0.195\) | 3 |
Vehicle | \(\mathbf{0.639 }\pm 0.141\) | \(0.551\pm 0.123\) | \(0.298\pm 0.078\) | \(0.613\pm 0.103\) | \(0.536\pm 0.107\) | 3 |
Vowel context | \(0.473\pm 0.044\) | \(0.412\pm 0.049\) | \(0.435\pm 0.056\) | \(0.383\pm 0.043\) | \(\mathbf{0.512 }\pm 0.036\) | 3 |
Average ranks | 1.55 | 3.02 | 4.08 | 3.67 | 2.67 | |
Signed rank hypotheses (p values) | \(H_1 (0.00)\) | \(H_1 (0.00)\) | \(H_1 (0.00)\) | \(H_1 (0.05)\)
Means and standard deviations of ARI values for the k-means-FWPD algorithm against MNAR-II
Dataset | \(\begin{array}{ll} \text {k-means-}\\ \text {-FWPD}\\ \end{array}\) | ZI | MI | SVDI | kNNI | \(\begin{array}{cc} \text {Best } k\\ \text {(for kNNI)}\\ \end{array}\) |
---|---|---|---|---|---|---|
Chronic kidney | \(0.751\pm 0.014\) | \(0.659\pm 0.032\) | \(0.744\pm 0.015\) | \(0.636\pm 0.037\) | \(\mathbf{0.770 }\pm 0.017\) | 3 |
Colon | \(\mathbf{0.804 }\pm 0.210\) | \(0.797\pm 0.226\) | \(0.798\pm 0.220\) | \(0.781\pm 0.208\) | \(0.797\pm 0.226\) | 3 |
GSAD 1 | \(\mathbf{0.731 }\pm 0.198\) | \(0.665\pm 0.242\) | \(0.712\pm 0.212\) | \(0.689\pm 0.215\) | \(0.665\pm 0.242\) | 3 |
Glass | \(\mathbf{0.530 }\pm 0.089\) | \(0.423\pm 0.101\) | \(0.413\pm 0.099\) | \(0.395\pm 0.109\) | \(0.451\pm 0.101\) | 5 |
Iris | \(\mathbf{0.773 }\pm 0.165\) | \(0.718\pm 0.172\) | \(0.635\pm 0.189\) | \(0.702\pm 0.174\) | \(0.756\pm 0.175\) | 10 |
Isolet 5 | \(\mathbf{0.789 }\pm 0.061\) | \(0.765\pm 0.076\) | \(0.747\pm 0.056\) | \(0.728\pm 0.063\) | \(0.765\pm 0.076\) | 3 |
Landsat | \(\mathbf{0.892 }\pm 0.083\) | \(0.871\pm 0.084\) | \(0.868\pm 0.083\) | \(0.794\pm 0.130\) | \(0.836\pm 0.083\) | 3 |
Leaf | \(\mathbf{0.476 }\pm 0.021\) | \(0.385\pm 0.036\) | \(0.381\pm 0.031\) | \(0.389\pm 0.033\) | \(0.454\pm 0.028\) | 3 |
Libras | \(\mathbf{0.698 }\pm 0.079\) | \(0.675\pm 0.078\) | \(0.681\pm 0.077\) | \(0.648\pm 0.080\) | \(0.669\pm 0.081\) | 3 |
Lung | \(0.686\pm 0.220\) | \(0.649\pm 0.226\) | \(\mathbf{0.707 }\pm 0.089\) | \(0.674\pm 0.216\) | \(0.649\pm 0.226\) | 3 |
Lung cancer | \(\mathbf{0.641 }\pm 0.200\) | \(0.640\pm 0.283\) | \(0.567\pm 0.243\) | \(0.568\pm 0.190\) | \(0.640\pm 0.283\) | 3 |
Lymphoma | \(\mathbf{0.856 }\pm 0.115\) | \(0.824\pm 0.136\) | \(0.842\pm 0.123\) | \(0.818\pm 0.153\) | \(0.824\pm 0.136\) | 3 |
Pendigits | \(0.608\pm 0.082\) | \(0.589\pm 0.095\) | \(0.561\pm 0.097\) | \(0.557\pm 0.096\) | \(\mathbf{0.825 }\pm 0.083\) | 5 |
Prostate | \(\mathbf{0.984 }\pm 0.032\) | \(\mathbf{0.984 }\pm 0.032\) | \(\mathbf{0.984 }\pm 0.032\) | \(0.969\pm 0.053\) | \(\mathbf{0.984 }\pm 0.032\) | 3 |
Seeds | \(\mathbf{0.884 }\pm 0.028\) | \(0.738\pm 0.039\) | \(0.772\pm 0.038\) | \(0.771\pm 0.038\) | \(0.831\pm 0.029\) | 10 |
Sensorless | \(\mathbf{0.747 }\pm 0.041\) | \(0.667\pm 0.130\) | \(0.727\pm 0.030\) | \(0.704\pm 0.056\) | \(0.698\pm 0.068\) | 3 |
Sonar | \(\mathbf{0.704 }\pm 0.227\) | \(0.662\pm 0.235\) | \(0.658\pm 0.221\) | \(0.314\pm 0.152\) | \(0.662\pm 0.235\) | 3 |
Theorem proving | \(\mathbf{0.640 }\pm 0.141\) | \(0.600\pm 0.105\) | \(0.605\pm 0.077\) | \(0.610\pm 0.175\) | \(0.593\pm 0.106\) | 3 |
Vehicle | \(0.677\pm 0.145\) | \(\mathbf{0.734 }\pm 0.132\) | \(0.571\pm 0.167\) | \(0.635\pm 0.153\) | \(0.712\pm 0.141\) | 3 |
Vowel context | \(0.478\pm 0.048\) | \(0.404\pm 0.041\) | \(0.396\pm 0.064\) | \(0.345\pm 0.032\) | \(\mathbf{0.511 }\pm 0.056\) | 3 |
Average ranks | 1.38 | 3.30 | 3.27 | 4.25 | 2.80 | |
Signed rank hypotheses (p values) | \(H_1 (0.00)\) | \(H_1 (0.00)\) | \(H_1 (0.00)\) | \(H_1 (0.03)\) |
5.2 Experiments with the k-means-FWPD algorithm
We compare the proposed k-means-FWPD algorithm to the standard k-means algorithm run on the datasets obtained after performing ZI, MI, SVDI and kNNI. All runs of k-means-FWPD were found to converge within the stipulated budget of \(MaxIter = 500\). The results of the experiments are listed as means and standard deviations of the obtained ARI values in Tables 6, 7, 8 and 9. Only the best results for kNNI are reported, along with the corresponding best k values. The statistical significance of the listed results is summarized at the bottom of each table in terms of average ranks as well as signed rank test hypotheses and p values (\(H_0\) signifying that the ARI values achieved by the proposed method and the contending method originate from identical distributions with the same median; \(H_1\) implying that they originate from different distributions).
We know from Theorem 3 that the maximum number of feasibility adjustments that can occur during a single run of k-means-FWPD is \(n(k-1)\). This raises the question of whether one should choose \(MaxIter \ge n(k-1)\). However, k-means-FWPD was observed to converge within the stipulated \(MaxIter = 500\) iterations even for datasets such as Isolet 5, Pendigits, and Sensorless, which have relatively large values of \(n(k-1)\). This indicates that the number of feasibility adjustments occurring during a run is much lower in practice. We therefore conclude that it is not necessary to set \(MaxIter \ge n(k-1)\) for practical problems.
It is seen from Tables 6, 7, 8 and 9 that the k-means-FWPD algorithm performs best overall, as indicated by its consistently lowest average ranks for all types of missingness. The proposed method also performs best on the majority of datasets for every kind of missingness. kNNI is seen to be the second best performer overall (being statistically comparable to k-means-FWPD in the case of MAR). It is also interesting to observe that the performance of MI improves in the cases of MAR and MNAR-II, indicating that MI tends to be useful for partitional clustering when the missingness depends on the observed features. Moreover, SVDI is generally observed to perform poorly irrespective of the type of missingness, implying that the linear model assumed by SVDI is unable to preserve the convexity of the clusters (which is essential for good performance in partitional clustering).
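The average ranks reported at the bottom of each table can be computed as follows; this is an illustrative sketch (not the authors' code), using hypothetical ARI values, where rank 1 goes to the best method on each dataset and ties share the mean of their ranks.

```python
# Illustrative sketch: per-dataset ranking of methods by ARI (higher is
# better) and the average ranks reported at the bottom of each table.

def average_ranks(scores):
    """scores: one list per dataset, each holding one ARI value per method."""
    n_methods = len(scores[0])
    totals = [0.0] * n_methods
    for row in scores:
        order = sorted(range(n_methods), key=lambda j: -row[j])
        ranks = [0.0] * n_methods
        i = 0
        while i < n_methods:
            # Group ties and assign each member the mean rank of the group.
            j = i
            while j + 1 < n_methods and row[order[j + 1]] == row[order[i]]:
                j += 1
            mean_rank = (i + j) / 2 + 1
            for k in range(i, j + 1):
                ranks[order[k]] = mean_rank
            i = j + 1
        for m in range(n_methods):
            totals[m] += ranks[m]
    return [t / len(scores) for t in totals]

# Hypothetical ARI values: 3 datasets x 3 methods.
print(average_ranks([[0.8, 0.7, 0.6], [0.5, 0.9, 0.5], [0.7, 0.7, 0.4]]))
```

The signed rank hypotheses in the tables are then obtained by applying the Wilcoxon signed-rank test to the paired per-dataset ARI values of the proposed method and each contender.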
5.3 Experiments with the HAC-FWPD algorithm
The experimental setup described in Sect. 5.1 is also used to compare the HAC-FWPD algorithm (with AL-FWPD as the proximity measure) to the standard HAC algorithm (with AL as the proximity measure) in conjunction with ZI, MI, SVDI and kNNI. Results are reported as means and standard deviations of the ARI values obtained over the 20 independent runs. AL is preferred here over SL and CL as it is observed to generally achieve higher ARI values. The results of the experiments are listed in Tables 10, 11, 12 and 13. The statistical significance of the listed results is likewise summarized at the bottom of the respective tables in terms of average ranks as well as signed rank test hypotheses and p values (\(H_0\) signifying that the ARI values achieved by the proposed method and the contending method originate from identical distributions with the same median; \(H_1\) implying that they originate from different distributions).
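Average-linkage HAC operates directly on a precomputed dissimilarity matrix, which is what allows a penalized measure such as FWPD to be plugged in without imputation. A minimal sketch (not the authors' implementation) of average-linkage agglomeration on such a matrix:

```python
# Illustrative sketch: hierarchical agglomerative clustering with average
# linkage (AL) on a precomputed dissimilarity matrix, e.g. one built from
# a penalized dissimilarity measure.

def hac_average_linkage(dist, n_clusters):
    """dist: symmetric matrix (list of lists); returns a list of clusters,
    each a list of original point indices."""
    clusters = [[i] for i in range(len(dist))]
    while len(clusters) > n_clusters:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                # Average linkage: mean pairwise dissimilarity between clusters.
                d = sum(dist[i][j] for i in clusters[a] for j in clusters[b])
                d /= len(clusters[a]) * len(clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return clusters

# Four points forming two tight pairs that are far from each other.
D = [[0, 1, 9, 9],
     [1, 0, 9, 9],
     [9, 9, 0, 1],
     [9, 9, 1, 0]]
print(hac_average_linkage(D, 2))  # two clusters of two points each
```

In practice a production implementation would update the linkage matrix incrementally (e.g. via the Lance–Williams recurrence) rather than recomputing all pairwise averages at every merge.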
Means and standard deviations of ARI values for the HAC-FWPD algorithm against MCAR
Dataset | \(\begin{array}{ll} \text {HAC-}\\ \text {-FWPD}\\ \end{array}\) | ZI | MI | SVDI | kNNI | \(\begin{array}{cc} \text {Best } k\\ \text {(for kNNI)}\\ \end{array}\) |
---|---|---|---|---|---|---|
Chronic kidney | \(\mathbf{1.000 }\pm 0.000\) | \(0.967\pm 0.031\) | \(0.933\pm 0.033\) | \(0.933\pm 0.033\) | \(0.000\pm 0.000\) | 3 |
Colon | \(\mathbf{0.690 }\pm 0.240\) | \(0.469\pm 0.145\) | \(0.286\pm 0.174\) | \(0.380\pm 0.304\) | \(0.000\pm 0.000\) | 3 |
GSAD 1 | \(\mathbf{0.454 }\pm 0.309\) | \(0.367\pm 0.088\) | \(0.271\pm 0.112\) | \(0.311\pm 0.087\) | \(0.022\pm 0.018\) | 3 |
Glass | \(\mathbf{0.737 }\pm 0.081\) | \(0.671\pm 0.089\) | \(0.680\pm 0.089\) | \(0.638\pm 0.090\) | \(0.033\pm 0.032\) | 3 |
Iris | \(0.885\pm 0.072\) | \(\mathbf{0.922 }\pm 0.047\) | \(0.831\pm 0.053\) | \(0.917\pm 0.049\) | \(0.559\pm 0.129\) | 20 |
Isolet 5 | \(\mathbf{0.855 }\pm 0.111\) | \(0.044\pm 0.003\) | \(0.046\pm 0.003\) | \(0.081\pm 0.037\) | \(0.064\pm 0.003\) | 3 |
Landsat | \(\mathbf{0.712 }\pm 0.098\) | \(0.228\pm 0.034\) | \(0.254\pm 0.012\) | \(0.217\pm 0.033\) | \(0.300\pm 0.018\) | 10 |
Leaf | \(\mathbf{0.497 }\pm 0.046\) | \(0.200\pm 0.016\) | \(0.221\pm 0.017\) | \(0.290\pm 0.077\) | \(0.140\pm 0.011\) | 3 |
Libras | \(\mathbf{0.845 }\pm 0.054\) | \(0.276\pm 0.031\) | \(0.298\pm 0.033\) | \(0.381\pm 0.030\) | \(0.156\pm 0.050\) | 10 |
Lung | \(\mathbf{1.000 }\pm 0.000\) | \(\mathbf{1.000 }\pm 0.000\) | \(\mathbf{1.000 }\pm 0.000\) | \(\mathbf{1.000 }\pm 0.000\) | \(0.001\pm 0.002\) | 3 |
Lung cancer | \(\mathbf{0.458 }\pm 0.193\) | \(0.408\pm 0.223\) | \(0.356\pm 0.229\) | \(0.436\pm 0.335\) | \(0.034\pm 0.035\) | 3 |
Lymphoma | \(\mathbf{0.885 }\pm 0.058\) | \(0.718\pm 0.373\) | \(0.335\pm 0.498\) | \(0.713\pm 0.372\) | \(0.547\pm 0.297\) | 3 |
Pendigits | \(\mathbf{0.712 }\pm 0.082\) | \(0.242\pm 0.194\) | \(0.228\pm 0.224\) | \(0.252\pm 0.260\) | \(0.365\pm 0.147\) | 3 |
Prostate | \(\mathbf{1.000 }\pm 0.000\) | \(\mathbf{1.000 }\pm 0.000\) | \(\mathbf{1.000 }\pm 0.000\) | \(\mathbf{1.000 }\pm 0.000\) | \(0.001\pm 0.001\) | 3 |
Seeds | \(0.534\pm 0.173\) | \(0.332\pm 0.046\) | \(0.317\pm 0.055\) | \(0.436\pm 0.127\) | \(\mathbf{0.563 }\pm 0.110\) | 10 |
Sensorless | \(\mathbf{0.416 }\pm 0.303\) | \(0.196\pm 0.024\) | \(0.203\pm 0.017\) | \(0.249\pm 0.102\) | \(0.005\pm 0.008\) | 3 |
Sonar | \(\mathbf{0.440 }\pm 0.419\) | \(0.329\pm 0.473\) | \(0.128\pm 0.298\) | \(0.261\pm 0.365\) | \(0.001\pm 0.000\) | 3 |
Theorem proving | \(\mathbf{0.802 }\pm 0.085\) | \(0.691\pm 0.088\) | \(0.691\pm 0.088\) | \(0.654\pm 0.082\) | \(0.002\pm 0.001\) | 5 |
Vehicle | \(\mathbf{0.807 }\pm 0.108\) | \(0.315\pm 0.295\) | \(0.315\pm 0.295\) | \(0.645\pm 0.232\) | \(0.084\pm 0.008\) | 5 |
Vowel context | \(\mathbf{0.453 }\pm 0.081\) | \(0.248\pm 0.042\) | \(0.211\pm 0.066\) | \(0.194\pm 0.029\) | \(0.101\pm 0.019\) | 3 |
Average ranks | 1.30 | 2.95 | 3.52 | 2.88 | 4.35 | |
Signed rank hypotheses (p values) | \(H_1 (0.00)\) | \(H_1 (0.00)\) | \(H_1 (0.00)\) | \(H_1 (0.00)\) |
Means and standard deviations of ARI values for the HAC-FWPD algorithm against MAR
Dataset | \(\begin{array}{ll} \text {HAC-}\\ \text {-FWPD}\\ \end{array}\) | ZI | MI | SVDI | kNNI | \(\begin{array}{cc} \text {Best } k\\ \text {(for kNNI)}\\ \end{array}\) |
---|---|---|---|---|---|---|
Chronic kidney | \(\mathbf{0.799 }\pm 0.394\) | \(0.398\pm 0.494\) | \(0.398\pm 0.494\) | \(0.398\pm 0.494\) | \(0.003\pm 0.001\) | 3 |
Colon | \(\mathbf{1.000 }\pm 0.000\) | \(0.463\pm 0.516\) | \(0.463\pm 0.516\) | \(0.601\pm 0.423\) | \(0.016\pm 0.002\) | 3 |
GSAD 1 | \(\mathbf{0.619 }\pm 0.230\) | \(0.359\pm 0.115\) | \(0.419\pm 0.227\) | \(0.346\pm 0.150\) | \(0.007\pm 0.005\) | 3 |
Glass | \(\mathbf{0.650 }\pm 0.188\) | \(0.617\pm 0.124\) | \(0.603\pm 0.129\) | \(0.590\pm 0.184\) | \(0.057\pm 0.082\) | 20 |
Iris | \(\mathbf{0.949 }\pm 0.051\) | \(0.893\pm 0.083\) | \(0.893\pm 0.083\) | \(0.854\pm 0.146\) | \(0.587\pm 0.068\) | 10 |
Isolet 5 | \(\mathbf{0.725 }\pm 0.186\) | \(0.491\pm 0.011\) | \(0.491\pm 0.011\) | \(0.464\pm 0.076\) | \(0.076\pm 0.005\) | 3 |
Landsat | \(\mathbf{0.734 }\pm 0.044\) | \(0.638\pm 0.323\) | \(0.601\pm 0.301\) | \(0.721\pm 0.142\) | \(0.162\pm 0.113\) | 3 |
Leaf | \(0.463\pm 0.107\) | \(0.420\pm 0.099\) | \(0.415\pm 0.058\) | \(\mathbf{0.471 }\pm 0.053\) | \(0.154\pm 0.013\) | 3 |
Libras | \(\mathbf{0.864 }\pm 0.098\) | \(0.855\pm 0.048\) | \(0.815\pm 0.081\) | \(0.813\pm 0.064\) | \(0.469\pm 0.012\) | 3 |
Lung | \(\mathbf{1.000 }\pm 0.000\) | \(\mathbf{1.000 }\pm 0.000\) | \(\mathbf{1.000 }\pm 0.000\) | \(\mathbf{1.000 }\pm 0.000\) | \(0.008\pm 0.024\) | 3 |
Lung cancer | \(\mathbf{0.709 }\pm 0.270\) | \(0.522\pm 0.438\) | \(0.538\pm 0.422\) | \(0.443\pm 0.358\) | \(0.036\pm 0.019\) | 3 |
Lymphoma | \(\mathbf{1.000 }\pm 0.000\) | \(0.903\pm 0.089\) | \(0.772\pm 0.132\) | \(0.890\pm 0.104\) | \(0.788\pm 0.000\) | 3 |
Pendigits | \(\mathbf{0.493 }\pm 0.138\) | \(0.351\pm 0.124\) | \(0.292\pm 0.076\) | \(0.483\pm 0.103\) | \(0.407\pm 0.035\) | 3 |
Prostate | \(\mathbf{1.000 }\pm 0.000\) | \(\mathbf{1.000 }\pm 0.000\) | \(\mathbf{1.000 }\pm 0.000\) | \(\mathbf{1.000 }\pm 0.000\) | \(0.001\pm 0.000\) | 3 |
Seeds | \(\mathbf{0.564 }\pm 0.131\) | \(0.487\pm 0.016\) | \(0.444\pm 0.103\) | \(0.535\pm 0.224\) | \(0.556\pm 0.135\) | 3 |
Sensorless | \(\mathbf{0.439 }\pm 0.288\) | \(0.276\pm 0.163\) | \(0.174\pm 0.053\) | \(0.256\pm 0.136\) | \(0.000\pm 0.000\) | 10 |
Sonar | \(\mathbf{0.396 }\pm 0.551\) | \(0.005\pm 0.000\) | \(0.005\pm 0.000\) | \(0.094\pm 0.222\) | \(0.001\pm 0.000\) | 3 |
Theorem proving | \(\mathbf{0.725 }\pm 0.102\) | \(0.677\pm 0.031\) | \(0.685\pm 0.107\) | \(0.641\pm 0.121\) | \(0.001\pm 0.006\) | 3 |
Vehicle | \(\mathbf{0.827 }\pm 0.123\) | \(0.431\pm 0.279\) | \(0.431\pm 0.278\) | \(0.825\pm 0.105\) | \(0.075\pm 0.044\) | 3 |
Vowel context | \(\mathbf{0.517 }\pm 0.127\) | \(0.451\pm 0.095\) | \(0.445\pm 0.126\) | \(0.401\pm 0.207\) | \(0.104\pm 0.024\) | 3 |
Average ranks | 1.20 | 2.83 | 3.27 | 3.00 | 4.70 | |
Signed rank hypotheses (p values) | \(H_1 (0.00)\) | \(H_1 (0.00)\) | \(H_1 (0.00)\) | \(H_1 (0.00)\) |
Means and standard deviations of ARI values for the HAC-FWPD algorithm against MNAR-I
Dataset | \(\begin{array}{ll} \text {HAC-}\\ \text {-FWPD}\\ \end{array}\) | ZI | MI | SVDI | kNNI | \(\begin{array}{cc} \text {Best } k\\ \text {(for kNNI)}\\ \end{array}\) |
---|---|---|---|---|---|---|
Chronic kidney | \(\mathbf{1.000 }\pm 0.000\) | \(\mathbf{1.000 }\pm 0.000\) | \(\mathbf{1.000 }\pm 0.000\) | \(\mathbf{1.000 }\pm 0.000\) | \(0.002\pm 0.000\) | 3 |
Colon | \(\mathbf{0.926 }\pm 0.166\) | \(0.025\pm 0.000\) | \(0.025\pm 0.000\) | \(0.025\pm 0.000\) | \(0.014\pm 0.000\) | 3 |
GSAD 1 | \(\mathbf{0.473 }\pm 0.237\) | \(0.326\pm 0.085\) | \(0.343\pm 0.098\) | \(0.271\pm 0.112\) | \(0.005\pm 0.003\) | 3 |
Glass | \(0.736\pm 0.135\) | \(0.697\pm 0.119\) | \(\mathbf{0.738 }\pm 0.133\) | \(0.614\pm 0.145\) | \(0.018\pm 0.006\) | 10 |
Iris | \(0.852\pm 0.142\) | \(0.527\pm 0.437\) | \(0.543\pm 0.456\) | \(\mathbf{0.881 }\pm 0.180\) | \(0.540\pm 0.026\) | 20 |
Isolet 5 | \(\mathbf{0.586 }\pm 0.085\) | \(0.326\pm 0.207\) | \(0.401\pm 0.169\) | \(0.223\pm 0.153\) | \(0.048\pm 0.033\) | 3 |
Landsat | \(\mathbf{0.786 }\pm 0.085\) | \(0.420\pm 0.298\) | \(0.443\pm 0.375\) | \(0.765\pm 0.086\) | \(0.072\pm 0.004\) | 5 |
Leaf | \(\mathbf{0.514 }\pm 0.043\) | \(0.345\pm 0.065\) | \(0.267\pm 0.080\) | \(0.437\pm 0.054\) | \(0.110\pm 0.035\) | 5 |
Libras | \(\mathbf{0.843 }\pm 0.109\) | \(0.750\pm 0.101\) | \(0.750\pm 0.101\) | \(0.782\pm 0.087\) | \(0.419\pm 0.042\) | 3 |
Lung | \(\mathbf{1.000 }\pm 0.000\) | \(\mathbf{1.000 }\pm 0.000\) | \(\mathbf{1.000 }\pm 0.000\) | \(\mathbf{1.000 }\pm 0.000\) | \(0.008\pm 0.024\) | 3 |
Lung cancer | \(\mathbf{0.651 }\pm 0.316\) | \(0.516\pm 0.269\) | \(0.516\pm 0.269\) | \(0.561\pm 0.336\) | \(0.020\pm 0.010\) | 3 |
Lymphoma | \(\mathbf{0.942 }\pm 0.130\) | \(0.861\pm 0.000\) | \(0.861\pm 0.000\) | \(0.861\pm 0.000\) | \(0.651\pm 0.000\) | 3 |
Pendigits | \(\mathbf{0.641 }\pm 0.172\) | \(0.405\pm 0.255\) | \(0.341\pm 0.238\) | \(0.399\pm 0.139\) | \(0.416\pm 0.043\) | 5 |
Prostate | \(\mathbf{1.000 }\pm 0.000\) | \(\mathbf{1.000 }\pm 0.000\) | \(\mathbf{1.000 }\pm 0.000\) | \(\mathbf{1.000 }\pm 0.000\) | \(0.001\pm 0.000\) | 3 |
Seeds | \(\mathbf{0.584 }\pm 0.154\) | \(0.429\pm 0.141\) | \(0.479\pm 0.047\) | \(0.421\pm 0.084\) | \(0.555\pm 0.082\) | 3 |
Sensorless | \(\mathbf{0.300 }\pm 0.298\) | \(0.225\pm 0.025\) | \(0.217\pm 0.027\) | \(0.216\pm 0.009\) | \(0.000\pm 0.000\) | 3 |
Sonar | \(\mathbf{0.598 }\pm 0.550\) | \(0.196\pm 0.449\) | \(0.196\pm 0.449\) | \(0.329\pm 0.473\) | \(0.001\pm 0.000\) | 3 |
Theorem proving | \(\mathbf{0.775 }\pm 0.121\) | \(0.762\pm 0.068\) | \(0.731\pm 0.025\) | \(0.742\pm 0.031\) | \(0.002\pm 0.000\) | 3 |
Vehicle | \(\mathbf{0.797 }\pm 0.121\) | \(0.321\pm 0.194\) | \(0.522\pm 0.264\) | \(0.699\pm 0.290\) | \(0.068\pm 0.040\) | 3 |
Vowel context | \(\mathbf{0.447 }\pm 0.168\) | \(0.300\pm 0.073\) | \(0.282\pm 0.068\) | \(0.306\pm 0.114\) | \(0.095\pm 0.029\) | 3 |
Average ranks | 1.33 | 3.15 | 3.05 | 2.83 | 4.65 | |
Signed rank hypotheses (p values) | \(H_1 (0.00)\) | \(H_1 (0.00)\) | \(H_1 (0.00)\) | \(H_1 (0.00)\) |
Means and standard deviations of ARI values for the HAC-FWPD algorithm against MNAR-II
Dataset | \(\begin{array}{ll} \text {HAC-}\\ \text {-FWPD}\\ \end{array}\) | ZI | MI | SVDI | kNNI | \(\begin{array}{cc} \text {Best } k\\ \text {(for kNNI)}\\ \end{array}\) |
---|---|---|---|---|---|---|
Chronic kidney | \(\mathbf{1.000 }\pm 0.000\) | \(\mathbf{1.000 }\pm 0.000\) | \(\mathbf{1.000 }\pm 0.000\) | \(\mathbf{1.000 }\pm 0.000\) | \(0.002\pm 0.000\) | 3 |
Colon | \(\mathbf{1.000 }\pm 0.000\) | \(0.147\pm 0.384\) | \(0.147\pm 0.384\) | \(0.147\pm 0.384\) | \(0.013\pm 0.004\) | 3 |
GSAD 1 | \(\mathbf{0.429 }\pm 0.190\) | \(0.353\pm 0.011\) | \(0.356\pm 0.021\) | \(0.353\pm 0.017\) | \(0.009\pm 0.001\) | 3 |
Glass | \(0.717\pm 0.125\) | \(\mathbf{0.733 }\pm 0.103\) | \(0.696\pm 0.113\) | \(0.690\pm 0.172\) | \(0.020\pm 0.008\) | 20 |
Iris | \(0.885\pm 0.079\) | \(0.718\pm 0.394\) | \(0.718\pm 0.394\) | \(]\pm 0.024\) | \(0.552\pm 0.012\) | 10 |
Isolet 5 | \(\mathbf{0.711 }\pm 0.047\) | \(0.296\pm 0.214\) | \(0.479\pm 0.007\) | \(0.184\pm 0.174\) | \(0.043\pm 0.034\) | 3 |
Landsat | \(\mathbf{0.763 }\pm 0.048\) | \(0.229\pm 0.004\) | \(0.229\pm 0.004\) | \(0.651\pm 0.278\) | \(0.082\pm 0.003\) | 3 |
Leaf | \(\mathbf{0.477 }\pm 0.036\) | \(0.255\pm 0.109\) | \(0.277\pm 0.081\) | \(0.342\pm 0.124\) | \(0.110\pm 0.044\) | 3 |
Libras | \(\mathbf{0.817 }\pm 0.053\) | \(0.742\pm 0.057\) | \(0.758\pm 0.065\) | \(0.780\pm 0.097\) | \(0.392\pm 0.040\) | 3 |
Lung | \(\mathbf{1.000 }\pm 0.000\) | \(\mathbf{1.000 }\pm 0.000\) | \(\mathbf{1.000 }\pm 0.000\) | \(\mathbf{1.000 }\pm 0.000\) | \(0.009\pm 0.019\) | 3 |
Lung cancer | \(\mathbf{0.527 }\pm 0.158\) | \(0.515\pm 0.194\) | \(0.515\pm 0.194\) | \(0.503\pm 0.216\) | \(0.035\pm 0.021\) | 3 |
Lymphoma | \(\mathbf{0.925 }\pm 0.122\) | \(0.916\pm 0.076\) | \(0.876\pm 0.034\) | \(0.876\pm 0.034\) | \(0.706\pm 0.075\) | 3 |
Pendigits | \(\mathbf{0.485 }\pm 0.132\) | \(0.219\pm 0.053\) | \(0.300\pm 0.117\) | \(0.427\pm 0.108\) | \(0.387\pm 0.030\) | 20 |
Prostate | \(\mathbf{1.000 }\pm 0.000\) | \(\mathbf{1.000 }\pm 0.000\) | \(\mathbf{1.000 }\pm 0.000\) | \(\mathbf{1.000 }\pm 0.000\) | \(0.001\pm 0.000\) | 3 |
Seeds | \(\mathbf{0.581 }\pm 0.154\) | \(0.398\pm 0.079\) | \(0.319\pm 0.149\) | \(0.491\pm 0.149\) | \(0.580\pm 0.125\) | 3 |
Sensorless | \(\mathbf{0.475 }\pm 0.378\) | \(0.204\pm 0.061\) | \(0.210\pm 0.044\) | \(0.214\pm 0.045\) | \(0.000\pm 0.000\) | 3 |
Sonar | \(\mathbf{0.295 }\pm 0.449\) | \(0.196\pm 0.450\) | \(0.196\pm 0.450\) | \(0.261\pm 0.436\) | \(0.001\pm 0.000\) | 3 |
Theorem proving | \(\mathbf{0.885 }\pm 0.078\) | \(0.681\pm 0.057\) | \(0.681\pm 0.057\) | \(0.711\pm 0.059\) | \(0.001\pm 0.002\) | 3 |
Vehicle | \(\mathbf{0.821 }\pm 0.072\) | \(0.518\pm 0.289\) | \(0.664\pm 0.251\) | \(0.700\pm 0.278\) | \(0.041\pm 0.037\) | 3 |
Vowel context | \(\mathbf{0.533 }\pm 0.154\) | \(0.377\pm 0.232\) | \(0.430\pm 0.109\) | \(0.425\pm 0.109\) | \(0.103\pm 0.024\) | 3 |
Average ranks | 1.33 | 3.33 | 3.00 | 2.60 | 4.75 | |
Signed rank hypotheses (p values) | \(H_1 (0.00)\) | \(H_1 (0.00)\) | \(H_1 (0.00)\) | \(H_1 (0.00)\) |
6 Conclusions
In this paper, we propose to use the FWPD measure as a viable alternative to imputation and marginalization approaches for handling the problem of missing features in data clustering. The proposed measure attempts to estimate the original distances between the data points by adding a penalty term to those pairwise distances which cannot be calculated over the entire feature space due to missing features. Therefore, unlike existing methods for handling missing features, FWPD is able to distinguish between distinct data points that look identical due to missing features. At the same time, FWPD ensures that the dissimilarity of any data instance from itself is never greater than its dissimilarity from any other point in the dataset. Intuitively, these properties of FWPD should help to better model the original data space, which may in turn lead to better clustering performance on incomplete data.
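The idea can be sketched as follows. This is a simplified illustration, not the exact definition from the paper: the observed-feature distance and the feature-frequency weighting used here are plausible stand-ins, and `alpha` plays the role of the trade-off parameter discussed below.

```python
# Illustrative sketch of a feature weighted penalized dissimilarity for
# vectors with missing entries (None). The observed part is a distance
# over the commonly observed features; the penalty is the (weighted)
# share of features unobserved for at least one of the two points.
import math

def fwpd_sketch(x, y, feature_weights, alpha=0.2):
    common = [l for l in range(len(x)) if x[l] is not None and y[l] is not None]
    # Distance restricted to the commonly observed features.
    d_obs = math.sqrt(sum((x[l] - y[l]) ** 2 for l in common))
    # Penalty: total weight of features missing in either point,
    # normalized by the total feature weight.
    w_total = sum(feature_weights)
    w_missing = sum(w for l, w in enumerate(feature_weights) if l not in common)
    penalty = w_missing / w_total
    return (1 - alpha) * d_obs + alpha * penalty

# Two distinct points that look identical on their commonly observed
# features still receive a nonzero dissimilarity from the penalty term.
x = [1.0, None, 3.0]
y = [1.0, 2.0, None]
print(fwpd_sketch(x, y, feature_weights=[3, 2, 2]))  # observed part is 0
```

Note that the penalty also makes the self-dissimilarity of an incomplete point nonzero, but never larger than its dissimilarity from any other point, in line with the property stated above.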
We therefore use the proposed FWPD measure to put forth the k-means-FWPD and HAC-FWPD clustering algorithms, which are directly applicable to datasets with missing features. We conduct extensive experiments on the new techniques using various benchmark datasets and find the new approach to produce generally better results (for both partitional and hierarchical clustering) compared to some of the popular imputation methods commonly used to handle the missing feature problem. In fact, it is observed from the experiments that the performance of the imputation schemes varies with the type of missingness and/or the clustering algorithm being used (for example, kNNI is useful for k-means clustering but not for HAC; SVDI is useful for HAC but not for k-means; MI is effective when the missingness depends on the observed features). The proposed approach, on the other hand, exhibits good performance across all types of missingness as well as both partitional and hierarchical clustering paradigms. The experimental results attest to the ability of FWPD to better model the original data space, compared to existing methods.
However, it must be stressed that the performance of all these methods, including the FWPD based ones, can vary depending on the structure of the dataset concerned, the choice of the proximity measure used (for HAC), and the pattern and extent of missingness plaguing the data. Fortunately, the \(\alpha \) parameter embedded in FWPD can be varied in accordance with the extent of missingness to achieve the desired results. The results in Sect. 5.1.3 indicate that it may be useful to choose a high value of \(\alpha \) when a large fraction of the features are unobserved, and a smaller value when only a few of the features are missing. However, in the presence of a sizable amount of missingness and the absence of ground-truths to validate the merit of the achieved clusterings, it is safest to choose a value of \(\alpha \) proportional to the percentage of missing features, restricted to the range [0.1, 0.25]. We also present an appendix dealing with an extension of the FWPD measure to problems with absent features and show that this modified form of FWPD is a semi-metric.
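One simple reading of this guideline (with proportionality constant 1, which is an assumption for illustration) is:

```python
# Illustrative sketch of the suggested alpha heuristic: set alpha to the
# fraction of missing feature values, clipped to the recommended range
# [0.1, 0.25]. The identity mapping before clipping is an assumption.

def choose_alpha(missing_fraction, lo=0.1, hi=0.25):
    return min(hi, max(lo, missing_fraction))

print(choose_alpha(0.05), choose_alpha(0.18), choose_alpha(0.6))
```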
An obvious follow-up to this work is the application of the proposed PDM variant to practical clustering problems characterized by large fractions of unobserved data, which arise in various fields such as economics, psychiatry, and web mining. Studies can be undertaken to better understand the effects that the choice of \(\alpha \) has on the clustering results. Another rewarding topic of research is the investigation of the abilities of the FWPD variant for absent features (see "Appendix A") by conducting proper experiments on benchmark applications characterized by this rare form of missingness (structural missingness).
Footnotes
- 1.
- 2. Source codes are available at https://github.com/Shounak-D/Clustering-Missing-Features.
Acknowledgements
We would like to thank Debaleena Misra and Sayak Nag, formerly of the Department of Instrumentation and Electronics Engineering, Jadavpur University, Kolkata, India, for their extensive help with the computer implementations of the different techniques used in our experiments.
References
- Acuña, E., & Rodriguez, C. (2004). The treatment of missing values and its effect on classifier accuracy. In D. Banks, F. R. McMorris, P. Arabie, & W. Gaul (Eds.), Classification, clustering, and data mining applications, studies in classification, data analysis, and knowledge organisation (pp. 639–647). Berlin, Heidelberg: Springer.CrossRefGoogle Scholar
- Ahmad, S., & Tresp, V. (1993). Some solutions to the missing feature problem in vision. In S. Hanson, J. Cowan, & C. Giles (Eds.), Advances in neural information processing systems 5 (pp. 393–400). Los Altos, CA: Morgan-Kaufmann.Google Scholar
- Barceló, C. (2008). The impact of alternative imputation methods on the measurement of income and wealth: Evidence from the spanish survey of household finances. In Working paper series. Banco de España.Google Scholar
- Bo, T. H., Dysvik, B., & Jonassen, I. (2004). Lsimpute: Accurate estimation of missing values in microarray data with least squares methods. Nucleic Acid Research, 32(3).MathSciNetCrossRefGoogle Scholar
- Broder, A. Z., Glassman, S. C., Manasse, M. S., & Zweig, G. (1997). Syntactic clustering of the web. Computer Networks and ISDN Systems, 29(8–13), 1157–1166.CrossRefGoogle Scholar
- Chan, L. S., & Dunn, O. J. (1972). The treatment of missing values in discriminant analysis-1. The sampling experiment. Journal of the American Statistical Association, 67(338), 473–477.zbMATHGoogle Scholar
- Chaturvedi, A., Carroll, J. D., Green, P. E., & Rotondo, J. A. (1997). A feature-based approach to market segmentation via overlapping k-centroids clustering. Journal of Marketing Research, pp. 370–377.CrossRefGoogle Scholar
- Chechik, G., Heitz, G., Elidan, G., Abbeel, P., & Koller, D. (2008). Max-margin classification of data with absent features. Journal of Machine Learning Research, 9, 1–21.zbMATHGoogle Scholar
- Chen, F. (2013). Missing no more: Using the mcmc procedure to model missing data. In Proceedings of the SAS global forum 2013 conference, pp. 1–23. SAS Institute Inc.Google Scholar
- Datta, S., Bhattacharjee, S., & Das, S. (2016a). Clustering with missing features: A penalized dissimilarity measure based approach. CoRR, arXiv:1604.06602.
- Datta, S., Misra, D., & Das, S. (2016b). A feature weighted penalty based dissimilarity measure for k-nearest neighbor classification with missing features. Pattern Recognition Letters, 80, 231–237.CrossRefGoogle Scholar
- Dempster, A. P., & Rubin, D. B. (1983). Incomplete data in sample surveys, vol. 2, chap. Part I: Introduction, pp. 3–10. New York: Academic Press.Google Scholar
- Dheeru, D., & Taniskidou, E. K. (2017). UCI machine learning repository. Online repository at http://archive.ics.uci.edu/ml.
- Dixon, J. K. (1979). Pattern recognition with partly missing data. IEEE Transactions on Systems, Man and Cybernetics, 9(10), 617–621.CrossRefGoogle Scholar
- Donders, A. R. T., van der Heijden, G. J. M. G., Stijnen, T., & Moons, K. G. M. (2006). Review: A gentle introduction to imputation of missing values. Journal of Clinical Epidemiology, 59(10), 1087–1091.CrossRefGoogle Scholar
- Forgy, E. W. (1965). Cluster analysis of multivariate data: Efficiency versus interpretability of classifications. Biometrics, 21, 768–769.Google Scholar
- Grzymala-Busse, J. W., & Hu, M. (2001). A comparison of several approaches to missing attribute values in data mining. In Rough sets and current trends in computing, pp. 378–385. Berlin: Springer.CrossRefGoogle Scholar
- Hathaway, R. J., & Bezdek, J. C. (2001). Fuzzy c-means clustering of incomplete data. IEEE Transactions on Systems, Man, and Cybernetics: Part B: Cybernetics, 31(5), 735–744.CrossRefGoogle Scholar
- Haveliwala, T., Gionis, A., & Indyk, P. (2000). Scalable techniques for clustering the web. Tech. rep.: Stanford University.Google Scholar
- Heitjan, D. F., & Basu, S. (1996). Distinguishing “missing at random” and “missing completely at random”. The American Statistician, 50(3), 207–213.MathSciNetGoogle Scholar
- Himmelspach, L., & Conrad, S. (2010). Clustering approaches for data with missing values: Comparison and evaluation. In Digital Information Management (ICDIM), 2010 fifth international conference on, pp. 19–28.Google Scholar
- Horton, N. J., & Lipsitz, S. R. (2001). Multiple imputation in practice: Comparison of software packages for regression models with missing variables. The American Statistician, 55(3), 244–254.MathSciNetCrossRefGoogle Scholar
- Hubert, L., & Arabie, P. (1985). Comparing partitions. Journal of Classification, 2(1), 193–218.CrossRefGoogle Scholar
- Jin, J. (2017). Genomics dataset repository. Online Repository at http://www.stat.cmu.edu/~jiashun/Research/software/GenomicsData/.
- Juszczak, P., & Duin, R. P. W. (2004). Combining one-class classifiers to classify missing data. In Multiple classifier systems, pp. 92–101. Berlin: Springer.
- Krause, S., & Polikar, R. (2003). An ensemble of classifiers approach for the missing feature problem. In Proceedings of the international joint conference on neural networks, vol. 1, pp. 553–558. IEEE.
- Lasdon, L. S. (2013). Optimization theory for large systems. Courier Corporation.
- Lei, L. (2010). Identify earthquake hot spots with 3-dimensional density-based clustering analysis. In 2010 IEEE international geoscience and remote sensing symposium (IGARSS), pp. 530–533. IEEE.
- Little, R. J. A., & Rubin, D. B. (1987). Statistical analysis with missing data. New York: Wiley.
- Lloyd, S. P. (1982). Least squares quantization in PCM. IEEE Transactions on Information Theory, 28(2), 129–137.
- MacQueen, J. (1967). Some methods for classification and analysis of multivariate observations. In Proceedings of the fifth Berkeley symposium on mathematical statistics and probability, vol. 1, pp. 281–297. University of California Press.
- Marlin, B. M. (2008). Missing data problems in machine learning. Ph.D. thesis, University of Toronto.
- Millán-Giraldo, M., Duin, R. P., & Sánchez, J. S. (2010). Dissimilarity-based classification of data with missing attributes. In 2010 2nd international workshop on cognitive information processing (CIP), pp. 293–298. IEEE.
- Murtagh, F., & Contreras, P. (2012). Algorithms for hierarchical clustering: An overview. Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery, 2(1), 86–97.
- Myrtveit, I., Stensrud, E., & Olsson, U. H. (2001). Analyzing data sets with missing data: An empirical evaluation of imputation methods and likelihood-based methods. IEEE Transactions on Software Engineering, 27(11), 999–1013.
- Nanni, L., Lumini, A., & Brahnam, S. (2012). A classifier ensemble approach for the missing feature problem. Artificial Intelligence in Medicine, 55(1), 37–50.
- Porro-Muñoz, D., Duin, R. P., & Talavera, I. (2013). Missing values in dissimilarity-based classification of multi-way data. In Iberoamerican congress on pattern recognition, pp. 214–221. Berlin: Springer.
- Rubin, D. B. (1976). Inference and missing data. Biometrika, 63(3), 581–592.
- Rubin, D. B. (1987). Multiple imputation for nonresponse in surveys. London: Wiley.
- Sabau, A. S. (2012). Survey of clustering based financial fraud detection research. Informatica Economica, 16(1), 110.
- Schafer, J. L. (1997). Analysis of incomplete multivariate data. Boca Raton, FL: CRC Press.
- Schafer, J. L., & Graham, J. W. (2002). Missing data: Our view of the state of the art. Psychological Methods, 7(2), 147–177.
- Sehgal, M. S. B., Gondal, I., & Dooley, L. S. (2005). Collateral missing value imputation: A new robust missing value estimation algorithm for microarray data. Bioinformatics, 21(10), 2417–2423.
- Selim, S. Z., & Ismail, M. A. (1984). K-means-type algorithms: A generalized convergence theorem and characterization of local optimality. IEEE Transactions on Pattern Analysis and Machine Intelligence, 6(1), 81–87.
- Shelly, D. R., Ellsworth, W. L., Ryberg, T., Haberland, C., Fuis, G. S., Murphy, J., et al. (2009). Precise location of San Andreas fault tremors near Cholame, California using seismometer clusters: Slip on the deep extension of the fault? Geophysical Research Letters, 36(1).
- Troyanskaya, O., Cantor, M., Sherlock, G., Brown, P., Hastie, T., Tibshirani, R., et al. (2001). Missing value estimation methods for DNA microarrays. Bioinformatics, 17(6), 520–525.
- Wagstaff, K. L. (2004). Clustering with missing values: No imputation required. In Proceedings of the meeting of the International Federation of Classification Societies, pp. 649–658.
- Wagstaff, K. L., & Laidler, V. G. (2005). Making the most of missing values: Object clustering with partial data in astronomy. In Astronomical data analysis software and systems XIV, ASP conference series, pp. 172–176. Astronomical Society of the Pacific.
- Wang, Q., & Rao, J. N. K. (2002a). Empirical likelihood-based inference in linear models with missing data. Scandinavian Journal of Statistics, 29(3), 563–576.
- Wang, Q., & Rao, J. N. K. (2002b). Empirical likelihood-based inference under imputation for missing response data. The Annals of Statistics, 30(3), 896–924.
- Weatherill, G., & Burton, P. W. (2009). Delineation of shallow seismic source zones using k-means cluster analysis, with application to the Aegean region. Geophysical Journal International, 176(2), 565–588.
- Wendel, R. E., & Hurter, A. P., Jr. (1976). Minimization of a non-separable objective function subject to disjoint constraints. Operations Research, 24, 643–657.
- Wilcoxon, F. (1945). Individual comparisons by ranking methods. Biometrics Bulletin, 1(6), 80–83.
- Zhang, W., Yang, Y., & Wang, Q. (2012). A comparative study of absent features and unobserved values in software effort data. International Journal of Software Engineering and Knowledge Engineering, 22(02), 185–202.