Abstract
In the traditional Vector Space Model representation of Web documents, pairs of document vectors frequently have zero similarity, which degrades classification quality when relations between Web documents are defined. This paper proposes a novel Web document representation and classification approach based on rough sets. First, the TF*IDF weighting scheme is used to assign weights to each Web document's vector. The weights of terms that do not occur in a Web document are treated as missing information. Rough set theory for incomplete information is then introduced to fill in this missing information and expand the Web document representation. By generating tolerance classes in both the term space and the Web document space, the missing information of a Web document is complemented with the corresponding weights of terms in its tolerance classes, which extends the essential information available for the document. Finally, a Web document classification algorithm is implemented. Experimental results show that classification performance is greatly improved.
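The zero-similarity problem and the tolerance-class remedy can be illustrated with a minimal Python sketch. It is not the authors' algorithm. It assumes that tolerance classes are built only in the term space, from a co-occurrence threshold `theta`, and that a missing weight is filled with a damped mean of the weights of related terms already present in the document. The names `tolerance_classes`, `enrich`, `theta`, and `damping`, as well as the toy corpus, are hypothetical choices made for illustration.

```python
import math
from collections import Counter
from itertools import combinations


def tfidf(docs):
    """Return one {term: tf * log(N / df)} dictionary per document."""
    n = len(docs)
    df = Counter(t for d in docs for t in set(d))
    return [{t: c * math.log(n / df[t]) for t, c in Counter(d).items()}
            for d in docs]


def tolerance_classes(docs, theta):
    """Assumed definition: I(t) = {t} plus every term that co-occurs
    with t in at least `theta` documents."""
    cooc = Counter()
    for d in docs:
        for a, b in combinations(sorted(set(d)), 2):
            cooc[(a, b)] += 1
    classes = {t: {t} for d in docs for t in d}
    for (a, b), count in cooc.items():
        if count >= theta:
            classes[a].add(b)
            classes[b].add(a)
    return classes


def enrich(vec, classes, damping=0.5):
    """Fill missing weights (assumed rule): a term absent from the document
    receives `damping` times the mean weight of the document's own terms
    that belong to the absent term's tolerance class."""
    out = dict(vec)
    for term, cls in classes.items():
        if term in vec:
            continue
        related = [vec[t] for t in cls if t in vec]
        if related:
            out[term] = damping * sum(related) / len(related)
    return out


def cosine(a, b):
    """Cosine similarity between two sparse {term: weight} vectors."""
    dot = sum(w * b[t] for t, w in a.items() if t in b)
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0


# Toy corpus: documents 4 and 5 share no term, but "page" and "document"
# co-occur in two other documents.
docs = [
    ["web", "page", "html"],
    ["web", "document", "html"],
    ["web", "page", "document"],
    ["page", "document", "classification"],
    ["page", "layout"],
    ["document", "classifier"],
]

vectors = tfidf(docs)
classes = tolerance_classes(docs, theta=2)
enriched = [enrich(v, classes) for v in vectors]

before = cosine(vectors[4], vectors[5])
after = cosine(enriched[4], enriched[5])
print(f"similarity before: {before:.3f}, after: {after:.3f}")
```

In this toy corpus the two documents are orthogonal under plain TF*IDF, while the enriched vectors overlap on "page", "document", and "web". The threshold and damping factor control how aggressively information is propagated, and would need tuning on real data.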
Copyright information
© 2007 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Duan, Q., Miao, D., Chen, M. (2007). Web Document Classification Based on Rough Set. In: An, A., Stefanowski, J., Ramanna, S., Butz, C.J., Pedrycz, W., Wang, G. (eds) Rough Sets, Fuzzy Sets, Data Mining and Granular Computing. RSFDGrC 2007. Lecture Notes in Computer Science, vol 4482. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-72530-5_28
DOI: https://doi.org/10.1007/978-3-540-72530-5_28
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-72529-9
Online ISBN: 978-3-540-72530-5
eBook Packages: Computer Science (R0)