A Study of Boolean Matrix Factorization Under Supervised Settings
Abstract
Boolean matrix factorization is a widely used approach in data analysis, applied either to explain data or to preprocess data in supervised settings. In this paper we study factors in supervised settings. We provide experimental evidence that factors are able to explain not only the data as a whole but also the classes in the data.
Keywords
Boolean matrix factorization, Supervised settings, Classification, Quality of factors
1 Introduction
Boolean matrix factorization (BMF) is a powerful tool that is widely used in data mining to describe data. It allows for data explanation by means of factors, i.e. hidden variables that rely on a solid algebraic foundation.
In general, BMF is used in unsupervised settings, where the input data are not labeled, classified or categorized. However, the evaluation of the quality of the generated factors has not received appropriate attention in the scientific literature on BMF. An exception is the pioneering work [4], which provides basic ideas of how the quality of BMF algorithms can be assessed in unsupervised settings. In this paper we evaluate BMF algorithms in supervised settings. To the best of our knowledge, the quality of factors in these settings has not been studied yet.
It was shown that BMF algorithms used as a preprocessing stage [2, 3, 18] or as neurons in a simple (one-layer) artificial neural network [13] can improve classification quality. Other relevant works come from Formal Concept Analysis (FCA) [11], since factors are often formal concepts [6]. In [1, 12, 14] closed sets of attributes, i.e. intents of formal concepts, were studied as basic classifiers (hypotheses) in different voting and inference schemes. In the mentioned studies the whole set of (frequent) factors was used to build classifiers. One may consider factors as the result of selecting only the relevant concepts (hypotheses) w.r.t. coverage or the MDL principle; e.g. in [17] the MDL principle is used to select concepts that are then evaluated in supervised settings. From the FCA perspective, our study can be considered as an evaluation of BMF-optimal concepts (intents or their generators) in supervised settings. By BMF-optimal concepts we mean those that are generated by a BMF algorithm.
Our contribution is twofold. First, we evaluate the ability of factors to explain classes of objects rather than the data as a whole. Second, we propose different models of factor-based classifiers and study their quality.
The paper is organized as follows. Section 2 introduces the notation and the basic notions of BMF. In Sect. 3 we discuss how factors can be used and evaluated in supervised settings. Section 4 provides the results of a comparative study of factor sets generated by different BMF algorithms as well as an evaluation of different models of factor-based ensembles of classifiers. In Sect. 5 we conclude and discuss directions of future work.
2 Preliminaries
In this section we recall the main notions used in this paper. Matrices are denoted by upper-case bold letters. \(\mathbf{I}_{ij}\) denotes the entry of matrix \(\mathbf{I}\) corresponding to the row i and column j. \(\mathbf{I}_{i\_}\) and \(\mathbf{I}_{\_j}\) denote the ith row and jth column of matrix \(\mathbf{I}\), respectively. The set of all \(m \times n \) Boolean matrices is denoted by \(\{0,1\}^{m\times n}\). The number of 1s in Boolean matrix \(\mathbf{I}\) is denoted by \(\Vert \mathbf{I}\Vert \), i.e. \(\Vert \mathbf{I}\Vert = \sum _{i,j} \mathbf{I}_{ij}\).
For matrices \(\mathbf{A} \in \{0,1\}^{m\times n}\) and \(\mathbf{B} \in \{0,1\}^{m\times n}\) we define the following element-wise operations: (i) Boolean sum \(\mathbf{A} \oplus \mathbf{B}\), i.e. the normal matrix sum where \(1+1 = 1\). (ii) Boolean subtraction \(\mathbf{A} \ominus \mathbf{B}\), i.e. the normal matrix subtraction where \(0 - 1 = 0\).
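The two element-wise operations can be illustrated with a minimal sketch in plain Python. The function names `boolean_sum` and `boolean_sub` and the list-of-lists matrix representation are assumptions made for this example only.

```python
def boolean_sum(a, b):
    """Element-wise Boolean sum: ordinary addition where 1 + 1 = 1."""
    return [[x | y for x, y in zip(row_a, row_b)] for row_a, row_b in zip(a, b)]


def boolean_sub(a, b):
    """Element-wise Boolean subtraction: ordinary subtraction where 0 - 1 = 0."""
    return [[x & (1 - y) for x, y in zip(row_a, row_b)] for row_a, row_b in zip(a, b)]


A = [[1, 0, 1],
     [0, 0, 1]]
B = [[1, 1, 0],
     [0, 1, 1]]

S = boolean_sum(A, B)  # [[1, 1, 1], [0, 1, 1]]
D = boolean_sub(A, B)  # [[0, 0, 1], [0, 0, 0]]
```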
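In BMF, a Boolean matrix \(\mathbf{I}\in \{0,1\}^{m\times n}\) is decomposed into the Boolean product \(\mathbf{A}\circ \mathbf{B}\) of matrices \(\mathbf{A} \in \{0,1\}^{m\times k}\) and \(\mathbf{B} \in \{0,1\}^{k\times n}\), where \((\mathbf{A}\circ \mathbf{B})_{ij} = \max _{l=1}^{k} \min (\mathbf{A}_{il}, \mathbf{B}_{lj})\). Under this model, the decomposition of \(\mathbf{I}\) into \(\mathbf{A}\circ \mathbf{B}\) may be interpreted as the discovery of k factors that exactly or approximately explain the data: \(\mathbf{I}_{ij}=1\), i.e. the object i has the attribute j, if and only if there exists a factor l such that l applies to i and j is one of the particular manifestations of l.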
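As a small illustration, the sketch below assumes the usual Boolean matrix product, in which entry (i, j) of the product equals 1 exactly when some factor l has a 1 both in row i of the object-factor matrix and in column j of the factor-attribute matrix. The helper names `boolean_product` and `count_ones` are hypothetical.

```python
def boolean_product(a, b):
    """Boolean product of an m x k matrix a and a k x n matrix b."""
    k, n = len(b), len(b[0])
    return [[int(any(row[l] and b[l][j] for l in range(k))) for j in range(n)]
            for row in a]


def count_ones(matrix):
    """Number of 1s in a Boolean matrix, i.e. ||I||."""
    return sum(sum(row) for row in matrix)


# Two factors (k = 2) over three objects and three attributes.
A = [[1, 0],
     [1, 1],
     [0, 1]]
B = [[1, 1, 0],
     [0, 1, 1]]

I = boolean_product(A, B)  # [[1, 1, 0], [1, 1, 1], [0, 1, 1]]
ones = count_ones(I)       # 7
```

Here object 1 (the middle row) is covered by both factors, so it has all three attributes.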
3 Factors Under Supervised Settings
The quality of factors is most often understood as their ability to explain data [4]. However, many problems need to be solved in supervised settings, where class labels of objects are available.
In supervised settings, a Boolean matrix \(\mathbf{I}\in \{0,1\}^{m\times n}\) corresponds to m objects described by n attributes. A special target attribute refers to the class of an object. More formally, we define a function class that maps row \(\mathbf{I}_{i\_}\) to its class label \(c = class(\mathbf{I}_{i\_}) \in \mathcal {Y}\), where the size of the set \( \mathcal {Y}\) is equal to the number of classes.
3.1 Key Components of Classifiers
Representation and Labeling. For the Boolean matrix factorization \(\mathbf{I}= \mathbf{A}\circ \mathbf{B}\) we define a factor-classifier as a tuple \((f_i, c, sim)\), where \(f_i\) is the ith Boolean factor (represented by the ith column of \(\mathbf{A}\) and the ith row of \(\mathbf{B}\)), c is a class label given by the class function, and sim is a classification strategy (see details below). In our study we assign to c the class label of the majority of objects from column \(\mathbf{A}_{\_i}\). If the majority is not unique, we do not consider the factor as a classifier.
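The labeling rule can be sketched as follows. The function name `factor_label` and the encoding of a column of the object-factor matrix as a 0/1 list are assumptions for illustration; returning `None` stands for "the factor is not used as a classifier".

```python
from collections import Counter


def factor_label(a_column, labels):
    """Majority class among the objects covered by a factor, or None if not unique."""
    counts = Counter(y for covered, y in zip(a_column, labels) if covered)
    if not counts:
        return None  # the factor covers no objects
    ranked = counts.most_common(2)
    if len(ranked) == 2 and ranked[0][1] == ranked[1][1]:
        return None  # tie between the two most frequent classes
    return ranked[0][0]


labels = ["a", "a", "b", "b"]
label_1 = factor_label([1, 1, 1, 0], labels)  # "a": two objects of class a, one of class b
label_2 = factor_label([1, 0, 1, 0], labels)  # None: one object of each class
```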
Strategy of Classification. We focus on two common classification strategies, namely rule-based and similarity-based.
According to the first strategy, object \(g = \mathbf{I}_{j\_}\) (given by an n-dimensional vector) is classified by factor-classifier \((f_i, c, sim)\) if \(\mathbf{B}_{i\_} \cdot g = \mathbf{B}_{i\_}\), i.e. the object g has all attributes of factor \(f_i\), where “\(\cdot \)” denotes the element-wise multiplication.
With the second strategy, the object g is classified by factor-classifier \((f_i, c, sim)\) if \({similarity}(\mathbf{B}_{i\_}, g) > \varepsilon \), i.e. the attributes of factor \(f_i\) are quite similar to the attributes of object g. The similarity can be defined by means of either a distance measure or an asymmetrical operator.
It should be noted that the rule-based classification strategy is a particular case of the similarity-based one, where for \(g = \mathbf{I}_{j\_}\) \(similarity(\mathbf{B}_{i\_}, \mathbf{I}_{j\_}) \equiv \sum _{l=1}^n({\mathbf{B}_{il} \rightarrow \mathbf{I}_{jl}}) \equiv \sum _{l=1}^n(\overline{\mathbf{B}_{il}} \,|\, {\mathbf{I}_{jl}}) = n.\) Operations \(\rightarrow \) and | represent logical implication and logical OR, respectively.
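The relationship between the two strategies can be checked on a toy example. In the sketch below, `rule_match` and `implication_similarity` are hypothetical names; the first implements the element-wise multiplication test and the second counts the attributes for which the implication from the factor to the object holds.

```python
def rule_match(b_row, g):
    """Rule-based test: the element-wise product of b_row and g equals b_row."""
    return all(b * x == b for b, x in zip(b_row, g))


def implication_similarity(b_row, g):
    """Number of attributes l for which (b_row[l] -> g[l]) holds, i.e. (not b) or g."""
    return sum((1 - b) | x for b, x in zip(b_row, g))


b_row = [1, 0, 1, 0]   # attributes of a factor
g_hit = [1, 1, 1, 0]   # has every attribute of the factor
g_miss = [1, 1, 0, 0]  # lacks the third attribute

n = len(b_row)
hit = rule_match(b_row, g_hit), implication_similarity(b_row, g_hit)     # (True, 4)
miss = rule_match(b_row, g_miss), implication_similarity(b_row, g_miss)  # (False, 3)
```

The rule fires exactly when the similarity reaches its maximum value n.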
For the sake of simplicity, we will use \((f_i, c)\) to denote a classifier, because in our experiments we use only the similarity function.
Responses of Classifiers. We say that object g is classified by \((f_i, c, sim)\) if \(sim(\mathbf{B}_{i\_}, g) >\varepsilon \). To assign a class label to g, the responses of classifiers \((f_i, c, sim)\) can be weighted by \(w^g_{(f_i, c, sim)}\), e.g. the precision or accuracy of \(f_i\), or the similarity between \(\mathbf{B}_{i\_}\) and g. We assume that \(f_i\) does not contribute to the final decision on the class of g (the response is 0) if g is not classified by \((f_i, c, sim)\). Again, for the sake of simplicity, we will use \(w^g_{(f_i, c)}\) instead of \(w^g_{(f_i, c, sim)}\).
To compute the class label of an object, the responses of classifiers (weights) are aggregated. We discuss aggregation strategies in Sect. 4.2.
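As an illustration of one possible weight, the sketch below computes the precision of a rule-based factor-classifier, assuming the standard definition: among the objects the classifier fires on, the fraction whose true class equals c. The function name `classifier_precision` and the convention of returning 0.0 when no object is classified are assumptions of this example.

```python
def classifier_precision(b_row, c, objects, labels):
    """Precision of a rule-based factor-classifier (b_row, c) on a labeled dataset."""
    # Labels of the objects that have all attributes of the factor.
    fired = [y for g, y in zip(objects, labels)
             if all(x >= b for b, x in zip(b_row, g))]
    if not fired:
        return 0.0  # assumed convention: no classified objects, zero weight
    return sum(y == c for y in fired) / len(fired)


objects = [[1, 0], [1, 1], [0, 1], [1, 0]]
labels = ["a", "a", "b", "b"]

# The factor with attribute 0 fires on objects 0, 1 and 3; two of them are in class "a".
p = classifier_precision([1, 0], "a", objects, labels)
```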
4 Experimental Evaluation
Table 1. Datasets and their characteristics.
Dataset | Size | Density of \(\mathbf{I}\) | Class distribution |
---|---|---|---|
anneal | \(898\times 66\) | 0.20 | 0.76/0.04/0.11/0.07/0.01 |
breast | \(699\times 14\) | 0.64 | 0.34/0.66 |
hepatitis | \(155\times 50\) | 0.36 | 0.79/0.21 |
horse colic | \(368\times 81\) | 0.21 | 0.63/0.37 |
iris | \(150\times 16\) | 0.25 | 0.33/0.33/0.33 |
led7 | \(3200\times 14\) | 0.50 | 0.11/0.09/0.10 (\(\times 8\) classes) |
mushroom | \(8124\times 88\) | 0.25 | 0.52/0.48 |
nursery | \(1000\times 27\) | 0.30 | 0.32/0.34/0.34 |
page blocks | \(5473\times 39\) | 0.26 | 0.90/0.02/0.01/0.05/0.02 |
pima | \(768\times 36\) | 0.22 | 0.65/0.35 |
wine | \(178\times 65\) | 0.20 | 0.33/0.40/0.27 |
We compare the most common BMF algorithms, namely 8M [9], GreConD [6], GreEss [5], Hyper [19], MDLGreConD [16], NaiveCol [10] and PaNDa\(^+\) [15].
4.1 Factor as Classification Rule
In this section we examine factors as single classifiers. We study (i) the connection between the factor ranks given by unsupervised and supervised quality measures, and which factors are the best w.r.t. the supervised quality measures, and (ii) how well the factors summarize classes.
Connection Between Supervised and Unsupervised Quality Measures. The mentioned BMF algorithms are based on a greedy strategy. The generated factors are ordered w.r.t. their importance, which is estimated by the particular objective of an algorithm. Put differently, the factors generated first are expected to explain the data best. Since some factor sets are very small, we cannot use correlation analysis to examine the dependence between the importance of factors (unsupervised quality measure) and their precision (supervised quality measure). To assess the connection between these measures we count how many factors we need to compute to get the best k factors w.r.t. precision. The smaller the number of factors we need to compute, the stronger the connection between the unsupervised and supervised quality measures.
Figure 1 shows that the lowest values correspond to the MDLGreConD factors. It means that we need to compute only a few factors to get the most precise classifiers. The most important factors w.r.t. the MDLGreConD objective have relatively higher precision than the most important factors generated by other BMF algorithms.
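The counting procedure can be sketched as follows, assuming the precisions are listed in the order in which the algorithm generated the factors. The function name `factors_needed` is hypothetical, and breaking ties in favor of earlier factors is an assumption that the text does not specify.

```python
def factors_needed(precisions, k):
    """Smallest number of first-generated factors that contains the k most precise ones."""
    if not 1 <= k <= len(precisions):
        raise ValueError("k must be between 1 and the number of factors")
    # Rank factor indices by decreasing precision; ties go to earlier factors (assumption).
    ranked = sorted(range(len(precisions)), key=lambda i: (-precisions[i], i))
    return max(ranked[:k]) + 1


# Precisions in generation order; the most precise factor is generated fourth.
late = factors_needed([0.90, 0.60, 0.80, 0.95, 0.50], 1)   # 4
# Here the generation order already agrees with the precision order.
early = factors_needed([0.90, 0.80, 0.70], 2)              # 2
```

A value close to k indicates that the unsupervised ordering already places the most precise factors first.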
In the next section we discuss the ability of factors to explain classes rather than data as a whole, i.e. their ability to distinguish a single class from others.
Table 2. The average values of precision on training/test sets. Best values are highlighted in bold.
Dataset | 8M | GreConD | GreEss | Hyper | MDLGreConD | NaiveCol | PaNDa\(^+\) |
---|---|---|---|---|---|---|---|
anneal | 0.86/0.67 | 0.85/0.66 | 0.86/0.64 | 0.84/0.62 | 0.85/0.75 | 0.84/0.63 | 0.87/0.87 |
breast | 0.88/0.73 | 0.88/0.85 | 0.84/0.84 | 0.93/0.64 | 0.87/0.87 | 0.80/0.80 | 0.85/0.81 |
hepatitis | 0.80/0.64 | 0.81/0.61 | 0.81/0.60 | 0.81/0.68 | 0.83/0.75 | 0.79/0.59 | 0.83/0.55 |
horse colic | 0.70/0.48 | 0.69/0.60 | 0.69/0.60 | 0.72/0.61 | 0.70/0.63 | 0.69/0.59 | 0.80/0.56 |
iris | 0.80/0.75 | 0.80/0.61 | 0.80/0.61 | 0.79/0.67 | 0.92/0.86 | 0.79/0.67 | 0.96/0.53 |
led7 | 0.40/0.44 | 0.33/0.32 | 0.33/0.32 | 0.50/0.19 | 0.37/0.36 | 0.23/0.22 | 0.43/0.42 |
mushroom | 0.82/0.76 | 0.82/0.79 | 0.83/0.79 | 0.85/0.70 | 0.87/0.84 | 0.78/0.75 | 0.81/0.00 |
nursery | 0.45/0.44 | 0.45/0.44 | 0.45/0.44 | 0.45/0.44 | 0.42/0.41 | 0.45/0.44 | 0.58/0.53 |
page blocks | 0.82/0.35 | 0.82/0.46 | 0.84/0.43 | 0.78/0.33 | 0.80/0.51 | 0.83/0.51 | 0.80/0.74 |
pima | 0.70/0.43 | 0.68/0.49 | 0.68/0.48 | 0.69/0.44 | 0.68/0.61 | 0.67/0.45 | 0.77/0.73 |
wine | 0.66/0.40 | 0.69/0.57 | 0.68/0.56 | 0.67/0.49 | 0.84/0.77 | 0.64/0.50 | 0.88/0.66 |
Average | 0.72/0.53 | 0.71/0.58 | 0.71/0.57 | 0.73/0.53 | 0.74/0.67 | 0.68/0.56 | 0.78/0.65 |
The results of the experiments given in Table 2 show that the highest average precision on training sets is achieved by the factors computed by PaNDa\(^+\) (0.78); the MDLGreConD factors also have quite high precision (0.74). The MDLGreConD factors have the most stable quality: on average, their precision on test sets is only 0.07 lower than on training sets.
Moreover, Table 2 reports the precision of factor-classifiers on both training and test data. Precision on training data is quite similar for all algorithms (the best algorithm is PaNDa\(^+\)), while MDLGreConD demonstrates the best average precision on test sets. It should be noted that MDLGreConD has the smallest difference between precision on training and test data, which might indicate its ability to generalize well (i.e. it is less likely to overfit). A similar observation about the quality of factors, but in unsupervised settings, was reported in [4].
4.2 Factors as Ensemble of Classifiers
Modern state-of-the-art classifiers, e.g. Random Forests, Multilayer Networks and Nearest Neighbour classifiers, comprise a set of single classifiers, i.e. the single classifiers form ensembles. In this section we examine a set of factor-classifiers as an ensemble and evaluate its accuracy.
It should be noted that some factor sets are incomplete; in other words, they do not contain factors for several classes. This is caused by unbalanced training sets, where some classes contain only a few objects. Here we examine the datasets where there are enough factors for every class, namely the iris, mushroom, pima and wine datasets. We study rule-based ensembles.
Table 3. The average accuracy of classifier ensembles computed on iris, mushroom, pima and wine datasets. Best values are highlighted in bold.
Dataset | All-votes (Precision) | All-votes (Accuracy) | Best-vote (Precision) | Best-vote (Accuracy) |
---|---|---|---|---|
iris | 0.84/0.82 | 0.84/0.82 | 0.84/0.82 | 0.84/0.82 |
mushroom | 0.93/0.93 | 0.89/0.89 | 0.99/0.99 | 0.88/0.88 |
pima | 0.66/0.66 | 0.67/0.66 | 0.72/0.70 | 0.73/0.73 |
wine | 0.77/0.75 | 0.76/0.75 | 0.79/0.75 | 0.76/0.75 |
Average | 0.80/0.79 | 0.79/0.78 | 0.83/0.81 | 0.80/0.79 |
The results of the experiments, given in Table 3, show that the most accurate ensembles are those based on precision-weighted votes. On the examined datasets, the “best-vote” scheme (where only the response of the best classifier is considered) provides the best results.
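The two voting schemes can be sketched as below. The exact aggregation rules are not spelled out in the text, so this example assumes that "all-votes" sums the weights of all firing classifiers per class and that "best-vote" returns the class of the single firing classifier with the largest weight. The names `covers`, `all_votes` and `best_vote`, and the representation of a classifier as a (factor attributes, class, weight) triple, are hypothetical.

```python
from collections import defaultdict


def covers(b_row, g):
    """Rule-based test: object g has all attributes of the factor."""
    return all(x >= b for b, x in zip(b_row, g))


def all_votes(classifiers, g):
    """Class with the largest total weight over all firing classifiers (assumed scheme)."""
    totals = defaultdict(float)
    for b_row, c, w in classifiers:
        if covers(b_row, g):
            totals[c] += w
    return max(totals, key=totals.get) if totals else None


def best_vote(classifiers, g):
    """Class of the firing classifier with the largest weight (assumed scheme)."""
    fired = [(w, c) for b_row, c, w in classifiers if covers(b_row, g)]
    return max(fired, key=lambda pair: pair[0])[1] if fired else None


# (factor attributes, class label, weight such as precision on the training set)
classifiers = [([1, 0, 0], "a", 0.6), ([0, 1, 0], "a", 0.5), ([0, 0, 1], "b", 0.9)]
g = [1, 1, 1]

by_all = all_votes(classifiers, g)    # "a": total weight 1.1 against 0.9
by_best = best_vote(classifiers, g)   # "b": the single most precise classifier
```

The example shows that the two schemes may disagree on the same object; an object on which no classifier fires is left unlabeled (`None`).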
5 Conclusion
In this paper we examine, in supervised settings, factors computed on unlabeled data. We provide an experimental justification that, in the case of factors, the data explanation problem is closely related to the class explanation problem, i.e. a factor is able to explain the specificity of a particular (sub)class. Based on the results of the supervised factor evaluation we propose several models of factor-based ensembles of classifiers. We show that factor-based classifiers can achieve accuracy comparable to the state-of-the-art ensembles of classifiers.
An important direction of further work is to study factors computed in supervised settings for each class separately rather than for the whole dataset. Incorporating precision or accuracy into a BMF objective might improve the accuracy of the model as well as provide a deeper insight into the class structure.
Acknowledgment
The work of Tatiana Makhalova was supported by the Russian Science Foundation under grant 17-11-01294 and performed at National Research University Higher School of Economics, Moscow, Russia.
References
- 1. Belohlavek, R., Baets, B.D., Outrata, J., Vychodil, V.: Inducing decision trees via concept lattices. Int. J. Gen. Syst. 38(4), 455–467 (2009)
- 2. Belohlavek, R., Grissa, D., Guillaume, S., Nguifo, E.M., Outrata, J.: Boolean factors as a means of clustering of interestingness measures of association rules. Ann. Math. Artif. Intell. 70(1–2), 151–184 (2014)
- 3. Belohlavek, R., Outrata, J., Trnecka, M.: Impact of Boolean factorization as preprocessing methods for classification of Boolean data. Ann. Math. Artif. Intell. 72(1–2), 3–22 (2014)
- 4. Belohlavek, R., Outrata, J., Trnecka, M.: Toward quality assessment of Boolean matrix factorizations. Inf. Sci. 459, 71–85 (2018)
- 5. Belohlavek, R., Trnecka, M.: From-below approximations in Boolean matrix factorization: geometry and new algorithm. J. Comput. Syst. Sci. 81(8), 1678–1697 (2015)
- 6. Belohlavek, R., Vychodil, V.: Discovery of optimal factors in binary data via a novel method of matrix decomposition. J. Comput. Syst. Sci. 76(1), 3–20 (2010)
- 7. Coenen, F.: The LUCS-KDD discretised/normalised ARM and CARM data library (2003). http://www.csc.liv.ac.uk/~frans/KDD/Software/LUCS_KDD_DN
- 8. Dheeru, D., Karra Taniskidou, E.: UCI machine learning repository (2017). http://archive.ics.uci.edu/ml
- 9. Dixon, W.: BMDP statistical software manual to accompany the 7.0 software release, vols. 1–3 (1992)
- 10. Ene, A., Horne, W.G., Milosavljevic, N., Rao, P., Schreiber, R., Tarjan, R.E.: Fast exact and heuristic methods for role minimization problems. In: Ray, I., Li, N. (eds.) 13th ACM Symposium on Access Control Models and Technologies, SACMAT 2008, Estes Park, CO, USA, 11–13 June 2008, Proceedings, pp. 1–10. ACM (2008)
- 11. Ganter, B., Wille, R.: Formal Concept Analysis: Mathematical Foundations. Springer, Heidelberg (1999). https://doi.org/10.1007/978-3-642-59830-2
- 12. Ganter, B., Kuznetsov, S.O.: Hypotheses and version spaces. In: Ganter, B., de Moor, A., Lex, W. (eds.) ICCS-ConceptStruct 2003. LNCS (LNAI), vol. 2746, pp. 83–95. Springer, Heidelberg (2003). https://doi.org/10.1007/978-3-540-45091-7_6
- 13. Kueti, L.T., Tsopzé, N., Mbiethieu, C., Nguifo, E.M., Fotso, L.P.: Using Boolean factors for the construction of an artificial neural networks. Int. J. Gen. Syst. 47(8), 849–868 (2018)
- 14. Kuznetsov, S.O.: Machine learning and formal concept analysis. In: Eklund, P. (ed.) ICFCA 2004. LNCS (LNAI), vol. 2961, pp. 287–312. Springer, Heidelberg (2004). https://doi.org/10.1007/978-3-540-24651-0_25
- 15. Lucchese, C., Orlando, S., Perego, R.: A unifying framework for mining approximate top-k binary patterns. IEEE Trans. Knowl. Data Eng. 26(12), 2900–2913 (2014)
- 16. Makhalova, T., Trnecka, M.: From-below Boolean matrix factorization algorithm based on MDL. arXiv preprint arXiv:1901.09567 (2019)
- 17. Makhalova, T.P., Kuznetsov, S.O., Napoli, A.: A first study on what MDL can do for FCA. In: Ignatov, D.I., Nourine, L. (eds.) Proceedings of the Fourteenth International Conference on Concept Lattices and Their Applications. CEUR Workshop Proceedings, vol. 2123, pp. 25–36 (2018)
- 18. Outrata, J.: Preprocessing input data for machine learning by FCA. In: Kryszkiewicz, M., Obiedkov, S.A. (eds.) Proceedings of the 7th International Conference on Concept Lattices and Their Applications, Sevilla, Spain, 19–21 October 2010. CEUR Workshop Proceedings, vol. 672, pp. 187–198. CEUR-WS.org (2010)
- 19. Xiang, Y., Jin, R., Fuhry, D., Dragan, F.F.: Summarizing transactional databases with overlapped hyperrectangles. Data Min. Knowl. Discov. 23(2), 215–251 (2011)