Classifying Web Data in Directory Structures

  • Conference paper

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 3841)

Abstract

Web Directories have emerged as an alternative to search engines for locating information on the Web. Typically, Web Directories rely on human editors who invest significant time and effort in finding important pages on the Web and categorizing them in the Directory. In this paper, we experimentally study the automatic population of a Web Directory through the use of a subject hierarchy. For our study, we constructed a subject hierarchy for the top-level topics offered in Dmoz by leveraging ontological content from available lexical resources. We first describe how we built our subject hierarchy. We then present in detail how the hierarchy can help in the construction of a Directory. We also introduce a ranking formula that sorts the pages listed in every Directory topic according to the pages’ quality, and we experimentally evaluate the efficiency of our approach against other popular methods for creating Directories.

Copyright information

© 2006 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Stamou, S., Ntoulas, A., Krikos, V., Kokosis, P., Christodoulakis, D. (2006). Classifying Web Data in Directory Structures. In: Zhou, X., Li, J., Shen, H.T., Kitsuregawa, M., Zhang, Y. (eds) Frontiers of WWW Research and Development - APWeb 2006. APWeb 2006. Lecture Notes in Computer Science, vol 3841. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11610113_22

  • DOI: https://doi.org/10.1007/11610113_22

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-31142-3

  • Online ISBN: 978-3-540-32437-9

  • eBook Packages: Computer Science, Computer Science (R0)
