Abstract
Web Directories have emerged as an alternative to search engines for locating information on the Web. Typically, Web Directories rely on human editors who invest significant time and effort in finding important pages on the Web and categorizing them in the Directory. In this paper, we experimentally study the automatic population of a Web Directory by means of a subject hierarchy. For our study, we constructed a subject hierarchy for the top-level topics offered in Dmoz by leveraging ontological content from available lexical resources. We first describe how we built our subject hierarchy. We then present in detail how the hierarchy can help in the construction of a Directory. We also introduce a ranking formula that sorts the pages listed under every Directory topic according to their quality, and we experimentally compare the efficiency of our approach with that of other popular methods for creating Directories.
Copyright information
© 2006 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Stamou, S., Ntoulas, A., Krikos, V., Kokosis, P., Christodoulakis, D. (2006). Classifying Web Data in Directory Structures. In: Zhou, X., Li, J., Shen, H.T., Kitsuregawa, M., Zhang, Y. (eds) Frontiers of WWW Research and Development - APWeb 2006. APWeb 2006. Lecture Notes in Computer Science, vol 3841. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11610113_22
DOI: https://doi.org/10.1007/11610113_22
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-31142-3
Online ISBN: 978-3-540-32437-9
eBook Packages: Computer Science (R0)