Web Document Classification Based on Rough Set

  • Conference paper
Rough Sets, Fuzzy Sets, Data Mining and Granular Computing (RSFDGrC 2007)

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 4482)

Abstract

In the traditional Vector Space Model representation of Web documents, the similarity between two document vectors is frequently zero, which degrades classification quality when the relation between Web documents is defined. This paper proposes a novel Web document representation and classification approach based on rough sets. First, the TF*IDF weighting scheme is used to assign weights to the components of each Web document's vector. The weights of terms that do not occur in a Web document are treated as missing information. Rough set theory for incomplete information is then introduced to fill in the missing weights and expand the Web document representation. By generating tolerance classes in both the term space and the Web document space, the missing information of a Web document is complemented with the corresponding weights of the terms in its tolerance classes, which extends the essential information available for that document. Finally, a Web document classification algorithm is implemented. Experimental results show that classification performance is greatly improved.
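
The zero-similarity problem and the idea of filling in missing weights from tolerance classes can be illustrated with a small sketch. It is not the authors' algorithm: the abstract does not give the formulas, so the co-occurrence threshold `theta`, the damping factor `alpha`, the averaging rule in `expand`, and the toy documents are all assumptions made for illustration, and only the term-space tolerance classes are shown.

```python
import math
from collections import Counter


def tfidf(docs):
    """Return one {term: weight} dict per document, using tf * log(N / df)."""
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    return [
        {t: tf * math.log(n / df[t]) for t, tf in Counter(doc).items()}
        for doc in docs
    ]


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def tolerance_classes(docs, theta):
    """Assumed tolerance relation: two terms are tolerant if they
    co-occur in at least `theta` documents. Every term is in its own class."""
    cooc = Counter()
    vocab = set()
    for doc in docs:
        terms = set(doc)
        vocab.update(terms)
        for a in terms:
            for b in terms:
                if a != b:
                    cooc[(a, b)] += 1
    classes = {t: {t} for t in vocab}
    for (a, b), count in cooc.items():
        if count >= theta:
            classes[a].add(b)
    return classes


def expand(weights, classes, alpha=0.5):
    """Fill in a missing term's weight from the tolerant terms the document
    does contain. The damped-average rule is a hypothetical choice."""
    expanded = dict(weights)
    for term, tol_class in classes.items():
        if term in weights:
            continue  # weight is known, nothing is missing
        related = [weights[t] for t in tol_class if t in weights]
        if related:
            expanded[term] = alpha * sum(related) / len(related)
    return expanded


# Toy corpus (hypothetical): documents 0 and 1 share no terms.
docs = [
    ["rough", "tolerance"],
    ["set", "approximation"],
    ["rough", "set"],
    ["rough", "set", "web"],
]
vecs = tfidf(docs)
classes = tolerance_classes(docs, theta=2)
expanded = [expand(v, classes) for v in vecs]

print("before expansion:", cosine(vecs[0], vecs[1]))
print("after expansion: ", cosine(expanded[0], expanded[1]))
```

In this toy corpus "rough" and "set" co-occur in two documents, so each falls in the other's tolerance class. Documents 0 and 1 have no term in common and therefore zero cosine similarity under plain TF*IDF, but after expansion each receives a damped weight for the tolerant term it lacks and the similarity becomes positive. The term "web" co-occurs with the others only once, so with `theta=2` it is not propagated.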

Copyright information

© 2007 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Duan, Q., Miao, D., Chen, M. (2007). Web Document Classification Based on Rough Set. In: An, A., Stefanowski, J., Ramanna, S., Butz, C.J., Pedrycz, W., Wang, G. (eds) Rough Sets, Fuzzy Sets, Data Mining and Granular Computing. RSFDGrC 2007. Lecture Notes in Computer Science, vol 4482. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-72530-5_28

  • DOI: https://doi.org/10.1007/978-3-540-72530-5_28

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-72529-9

  • Online ISBN: 978-3-540-72530-5

  • eBook Packages: Computer Science, Computer Science (R0)
