Abstract
Practical applications of association rule mining often suffer from an overwhelming number of generated rules, many of which are neither interesting nor useful for the application in question. Removing irrelevant features, and rules composed of irrelevant features, can significantly improve overall performance. Many statistical and constraint-based measures are used to discard unnecessary and irrelevant features and rules when the data is vectorial or tabular. In contrast, the use of such measures is limited in the tree-structured data domain, because structural aspects are not easily incorporated. In this chapter, we explore the use of a feature subset selection measure, as well as a number of common statistical interestingness measures, via a recently proposed structure-preserving flat representation for tree-structured data such as XML. Feature subset selection is applied prior to association rule generation. Once the initial set of rules is obtained, irrelevant rules are identified as those composed of attributes not found to be statistically significant for the classification task. The experiments are performed on real-world web access trees and a property management dataset. The results indicate that where the dataset has a more standard structure, a large number of insignificant rules are discarded and accuracy increases. However, where the tree instances vary greatly in structure and in the distribution of labels among nodes, many rules are removed and accuracy increases, but the coverage rate of the rule set is significantly reduced.
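The rule-removal step and the accuracy/coverage trade-off described above can be illustrated with a minimal Python sketch. It is not the chapter's implementation: the `Rule` type, the column names `n1` and `n2` (standing in for columns of a flat representation), the toy data, and the first-matching-rule prediction policy are all assumptions made for illustration, and the set of statistically significant attributes is taken as given rather than computed.

```python
from typing import NamedTuple


class Rule(NamedTuple):
    antecedent: dict  # attribute -> required value (hypothetical flat columns)
    label: str        # class predicted by the rule


def filter_rules(rules, significant):
    """Keep only rules whose antecedent attributes are all significant."""
    return [r for r in rules if set(r.antecedent) <= significant]


def matches(rule, features):
    """True if every attribute/value pair in the antecedent holds."""
    return all(features.get(a) == v for a, v in rule.antecedent.items())


def evaluate(rules, instances):
    """Return (accuracy over covered instances, coverage rate).

    Assumption: an instance is classified by the first rule that matches it.
    """
    covered = correct = 0
    for features, label in instances:
        hit = next((r for r in rules if matches(r, features)), None)
        if hit is None:
            continue  # no rule fires: the instance is not covered
        covered += 1
        correct += hit.label == label
    accuracy = correct / covered if covered else 0.0
    coverage = covered / len(instances) if instances else 0.0
    return accuracy, coverage


# Toy data, invented for illustration only.
rules = [
    Rule({"n1": "a"}, "yes"),
    Rule({"n2": "x"}, "no"),
    Rule({"n1": "b", "n2": "y"}, "no"),
]
instances = [
    ({"n1": "a", "n2": "x"}, "yes"),
    ({"n1": "b", "n2": "x"}, "yes"),
    ({"n1": "b", "n2": "y"}, "no"),
    ({"n1": "a", "n2": "y"}, "yes"),
]

# Assume feature selection judged only n1 to be significant.
filtered = filter_rules(rules, {"n1"})

print(evaluate(rules, instances))     # full rule set
print(evaluate(filtered, instances))  # after removing irrelevant rules
```

On this toy data the filtered rule set is more accurate on the instances it covers but covers fewer of them, which mirrors the trade-off reported for datasets with highly variable tree structure.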
© 2015 Springer-Verlag Berlin Heidelberg
About this chapter
Cite this chapter
Shaharanee, I.N.M., Hadzic, F. (2015). Irrelevant Feature and Rule Removal for Structural Associative Classification Using Structure-Preserving Flat Representation. In: Stańczyk, U., Jain, L. (eds) Feature Selection for Data and Pattern Recognition. Studies in Computational Intelligence, vol 584. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-662-45620-0_10
DOI: https://doi.org/10.1007/978-3-662-45620-0_10
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-662-45619-4
Online ISBN: 978-3-662-45620-0
eBook Packages: Engineering, Engineering (R0)