Web Document Classification Based on Rough Set

Duan, Qiguo; Miao, Duoqian; Chen, Min

doi:10.1007/978-3-540-72530-5_28

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 4482)

Included in the following conference series:

International Workshop on Rough Sets, Fuzzy Sets, Data Mining, and Granular-Soft Computing

Abstract

In the traditional Vector Space Model representation of Web documents, zero-valued similarity between vectors occurs frequently, which degrades classification quality when relations between Web documents are defined. This paper proposes a novel Web document representation and classification approach based on rough sets. First, the TF*IDF weighting scheme is used to assign weights to each Web document's vector. The weights of terms that do not occur in a Web document are treated as missing information. Rough set theory for incomplete information is then introduced to compensate for this loss and to expand the Web document representation. By generating tolerance classes in both the term space and the Web document space, the missing information of a Web document can be complemented by incorporating the corresponding weights of terms in its tolerance classes, which extends the essential information available for the document. Finally, a Web document classification algorithm is implemented. Experimental results show that classification performance is greatly improved.
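
The zero-similarity problem and the idea of complementing missing weights can be illustrated with a minimal Python sketch. It is not the authors' algorithm: the abstract does not give their formulas, so the co-occurrence threshold `theta`, the `damping` factor, and the helper names `tolerance_classes` and `enrich` are all assumptions made for illustration, and only the term-space tolerance classes are sketched.

```python
import math
from collections import defaultdict


def tfidf(docs):
    """Return one {term: weight} dict per document, using tf * log(N / df)."""
    n = len(docs)
    df = defaultdict(int)
    for doc in docs:
        for term in set(doc):
            df[term] += 1
    weights = []
    for doc in docs:
        tf = defaultdict(int)
        for term in doc:
            tf[term] += 1
        weights.append({t: tf[t] * math.log(n / df[t]) for t in tf})
    return weights


def tolerance_classes(docs, theta):
    """Assumed tolerance relation: a term's class holds the term itself plus
    every term that co-occurs with it in at least `theta` documents."""
    cooc = defaultdict(lambda: defaultdict(int))
    for doc in docs:
        terms = set(doc)
        for a in terms:
            for b in terms:
                if a != b:
                    cooc[a][b] += 1
    vocab = {t for doc in docs for t in doc}
    return {t: {t} | {u for u, c in cooc[t].items() if c >= theta} for t in vocab}


def enrich(doc_weights, classes, damping=0.5):
    """Fill in missing term weights from the tolerance classes of present terms.
    The damped-weight rule is a hypothetical choice, not taken from the paper."""
    enriched = dict(doc_weights)
    for term, w in doc_weights.items():
        for related in classes[term]:
            if related not in doc_weights:
                enriched[related] = max(enriched.get(related, 0.0), damping * w)
    return enriched


def cosine(a, b):
    """Cosine similarity between two sparse {term: weight} vectors."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0


# Toy corpus: documents 2 and 3 share no terms, so their raw similarity is zero.
docs = [
    ["rough", "set", "approximation"],
    ["rough", "set", "approximation"],
    ["rough", "set"],
    ["approximation", "tolerance"],
    ["web", "page"],
]
weights = tfidf(docs)
classes = tolerance_classes(docs, theta=2)

before = cosine(weights[2], weights[3])
after = cosine(enrich(weights[2], classes), enrich(weights[3], classes))
print(before, after)
```

In this toy corpus, "rough", "set", and "approximation" co-occur often enough to fall into each other's tolerance classes, so the two documents that share no terms gain a nonzero similarity after enrichment. Any classifier that relies on vector similarity could then be applied to the enriched vectors.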

Author information

Authors and Affiliations

Department of Computer Science and Technology, Tongji University, Shanghai 201804, China; The Key Laboratory of "Embedded System and Service Computing", Ministry of Education, Shanghai 201804, China
Qiguo Duan, Duoqian Miao & Min Chen

Editor information

Editors and Affiliations

Dept. of Computer Science and Engineering, York University, M3J 1P3, Toronto, Ontario, Canada
Aijun An
Institute of Computing Sciences, Poznań University of Technology, ul. Piotrowo 2, 60–965, Poznań, Poland
Jerzy Stefanowski
Department of Applied Computer Science, University of Winnipeg, R3B 2E9, Winnipeg, Manitoba, Canada
Sheela Ramanna
Department of Computer Science, University of Regina, S4S 0A2, Regina, Saskatchewan, Canada
Cory J. Butz
Department of Electrical and Computer Engineering, University of Alberta, T6G 2V4, Edmonton, Alberta, Canada
Witold Pedrycz
Institute of Computer Science and Technology, Chongqing University of Posts and Telecommunications, 40065, Chongqing, P.R. China
Guoyin Wang

About this paper

Cite this paper

Duan, Q., Miao, D., Chen, M. (2007). Web Document Classification Based on Rough Set. In: An, A., Stefanowski, J., Ramanna, S., Butz, C.J., Pedrycz, W., Wang, G. (eds) Rough Sets, Fuzzy Sets, Data Mining and Granular Computing. RSFDGrC 2007. Lecture Notes in Computer Science, vol 4482. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-72530-5_28

DOI: https://doi.org/10.1007/978-3-540-72530-5_28
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-72529-9
Online ISBN: 978-3-540-72530-5
eBook Packages: Computer Science, Computer Science (R0)