Abstract
Structured link vector model (SLVM) is a representation proposed for modeling XML documents, which was extended from the conventional vector space model (VSM) by incorporating document structures. In this paper, we describe the classification approach for XML documents based on SLVM in the Document Mining Challenge of INEX 2009, where the closed frequent subtrees as structural units are used for content extraction from the XML document and the Chi-square test is used for feature selection.
Access this chapter
Tax calculation will be finalised at checkout
Purchases are for personal use only
Preview
Unable to display preview. Download preview PDF.
References
Berry, M.: Survey of Text Mining: Clustering, Classification, and Retrieval. Springer, Heidelberg (2003)
Yang, J., Chen, X.: A semi-structured document model for text mining. Journal of Computer Science and Technology 17(5), 603–610 (2002)
Salton, G., McGill, M.J.: Introduction to Modern information Retrieval. McGraw-Hill, New York (1983)
Yang, J., Zhang, F.: XML Document Classification using Extended VSM. In: Fuhr, N., Kamps, J., Lalmas, M., Trotman, A. (eds.) INEX 2007. LNCS, vol. 4862, pp. 234–244. Springer, Heidelberg (2008)
Yang, J., Cheung, W.K., Chen, X.O.: Learning Element Similarity Matrix for Semi-structured Document Analysis. Knowledge and Information Systems 19, 53–78 (2009)
Vapnic, V.: The Nature of Statistical Learning Theory. Springer, New York (1995)
Joachims, T.: Text Categorization with Support Vector Machines: Learning with Many Relevant Features. In: Nédellec, C., Rouveirol, C. (eds.) ECML 1998. LNCS, vol. 1398, pp. 137–142. Springer, Heidelberg (1998)
Chi, Y., Nijssen, S., Muntz, R.R., Kok, J.N.: Frequent Subtree Mining -An Overview. Fundamenta Informaticae (2005)
Chi, Y., Yang, Y., Xia, Y., Muntz, R.R.: CMTreeMiner: Mining Both Closed and Maximal Frequent Subtrees. In: Dai, H., Srikant, R., Zhang, C. (eds.) PAKDD 2004. LNCS (LNAI), vol. 3056, pp. 63–73. Springer, Heidelberg (2004)
Yang, Y., Pedersen, J.O.: A comparative study on feature selection in text categorization. In: Proceedings of the 1998 International Conference on Machine Learning, pp. 412–420 (1997)
Collobert, R., Bengio, S.: SVMTorch: support vector machines for large-scale regression problems. Journal of Machine Learning Research 1, 143–160 (2001)
Author information
Authors and Affiliations
Editor information
Editors and Affiliations
Rights and permissions
Copyright information
© 2010 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Yang, J., Wang, S. (2010). Extended VSM for XML Document Classification Using Frequent Subtrees. In: Geva, S., Kamps, J., Trotman, A. (eds) Focused Retrieval and Evaluation. INEX 2009. Lecture Notes in Computer Science, vol 6203. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-14556-8_44
Download citation
DOI: https://doi.org/10.1007/978-3-642-14556-8_44
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-14555-1
Online ISBN: 978-3-642-14556-8
eBook Packages: Computer ScienceComputer Science (R0)