Feature Selection with Rough Sets for Web Page Classification

An, Aijun; Huang, Yanhui; Huang, Xiangji; Cercone, Nick

doi:10.1007/978-3-540-27778-1_1

Part of the book series: Lecture Notes in Computer Science (TRS, volume 3135)

Abstract

Web page classification is the problem of assigning predefined categories to web pages. A central challenge is the high dimensionality of the feature space. We present a feature reduction method based on rough set theory and investigate its effectiveness for web page classification. Our experiments indicate that rough set feature selection can improve predictive performance when the original feature set used to represent web pages is large.
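
The abstract does not spell out the reduction algorithm, so the following is only a minimal sketch of one common rough-set formulation: a greedy, QuickReduct-style search that adds features until the rough-set dependency degree of the selected subset matches that of the full feature set. It is not presented as the authors' method. The function names (`dependency`, `quick_reduct`), the toy term-presence matrix, and the category labels are all hypothetical and chosen for illustration.

```python
from collections import defaultdict

def dependency(rows, labels, features):
    """Rough-set dependency degree of the labels on `features`:
    the fraction of objects whose equivalence class (objects with
    identical values on `features`) carries a single label."""
    seen_labels = defaultdict(set)
    sizes = defaultdict(int)
    for row, label in zip(rows, labels):
        key = tuple(row[f] for f in features)
        seen_labels[key].add(label)
        sizes[key] += 1
    # Positive region: objects in label-consistent equivalence classes.
    positive = sum(sizes[k] for k, labs in seen_labels.items() if len(labs) == 1)
    return positive / len(rows)

def quick_reduct(rows, labels):
    """Greedily add the feature that most increases the dependency
    degree until it equals that of the full feature set."""
    all_features = list(range(len(rows[0])))
    target = dependency(rows, labels, all_features)
    selected = []
    while dependency(rows, labels, selected) < target:
        remaining = [f for f in all_features if f not in selected]
        best = max(remaining,
                   key=lambda f: dependency(rows, labels, selected + [f]))
        selected.append(best)
    return selected

# Hypothetical toy data: each row is a page, each column records whether
# a term occurs (columns: "football", "price", "the", "score").
rows = [
    [1, 0, 1, 1],
    [1, 0, 1, 0],
    [0, 1, 1, 0],
    [0, 1, 1, 1],
]
labels = ["sports", "sports", "shopping", "shopping"]

print(quick_reduct(rows, labels))  # indices of the selected term columns
```

In this toy matrix a single term column already separates the two categories, so the search stops after one feature. The greedy search yields a subset that preserves the dependency degree but is not guaranteed to be a minimal reduct.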

Author information

Authors and Affiliations

York University, Toronto, Ontario, M3J 1P3, Canada
Aijun An
University of Waterloo, Waterloo, Ontario, N2L 3G1, Canada
Yanhui Huang & Xiangji Huang
Dalhousie University, Halifax, Nova Scotia, B3H 1W5, Canada
Nick Cercone

Editor information

Editors and Affiliations

Department of Electrical and Computer Engineering, University of Manitoba, R3T 5V6, Winnipeg, Manitoba, Canada
James F. Peters
Institute of Mathematics, Warsaw University, Banacha 2, 02-097, Warsaw, Poland
Andrzej Skowron
Institut de Recherche en Informatique de Toulouse, France
Didier Dubois
Institute of Computer Science, Polish Academy of Sciences, 01–237, Warsaw, Poland
Jerzy W. Grzymała-Busse
Department of Systems Innovation, Graduate School of Engineering Science, Osaka University, 1-3, Machikaneyama, Toyonaka, 560-8531, Osaka, Japan
Masahiro Inuiguchi
Polish-Japanese Institute of Information Technology, Warsaw, Poland; Department of Mathematics and Computer Science, University of Warmia and Mazury, Olsztyn, Poland
Lech Polkowski

About this paper

Cite this paper

An, A., Huang, Y., Huang, X., Cercone, N. (2004). Feature Selection with Rough Sets for Web Page Classification. In: Peters, J.F., Skowron, A., Dubois, D., Grzymała-Busse, J.W., Inuiguchi, M., Polkowski, L. (eds) Transactions on Rough Sets II. Lecture Notes in Computer Science, vol 3135. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-27778-1_1

DOI: https://doi.org/10.1007/978-3-540-27778-1_1
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-23990-1
Online ISBN: 978-3-540-27778-1
eBook Packages: Computer Science, Computer Science (R0)
