Mapping Documents onto Web Page Ontology

  • Dunja Mladenić
  • Marko Grobelnik
Part of the Lecture Notes in Computer Science book series (LNCS, volume 3209)


The paper describes an approach to automatically mapping Web pages onto an ontology, using document classification based on the Yahoo! ontology of Web pages. Techniques developed for learning on text data are applied to the hierarchical classification structure (an ontology of Web documents). The high number of features is reduced by taking the hierarchical structure into account and by using feature subset selection developed for the Naive Bayesian classifier. We focus on data sets that have many features and a highly unbalanced class distribution. Documents are represented as word-vectors that include word sequences of up to five consecutive words. Based on the hierarchical structure, the problem is divided into subproblems, each representing one of the categories included in the Yahoo! hierarchy. The resulting model is a set of independent classifiers, each used to predict the probability that a new document is a member of the corresponding category, represented as a node in the hierarchy. Our example problem is automatic document categorization, where we want to identify documents relevant to the selected category. Usually, only about 1%-10% of examples belong to the selected category. Experimental evaluation on real-world data shows that the proposed approach gives good results. Our experimental comparison of eleven feature scoring measures shows that considering data and algorithm characteristics significantly improves performance.
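The document representation and the split into per-category subproblems can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names (`word_sequences`, `to_word_vector`, `binary_labels`), the whitespace tokenization, and the lowercasing are assumptions made for illustration only, and the feature scoring and Naive Bayesian training steps are left out.

```python
from collections import Counter


def word_sequences(text, max_len=5):
    """Return every sequence of 1..max_len consecutive words in the text.

    Tokenization by whitespace and lowercasing are simplifying assumptions.
    """
    words = text.lower().split()
    features = []
    for n in range(1, max_len + 1):
        for i in range(len(words) - n + 1):
            features.append(" ".join(words[i:i + n]))
    return features


def to_word_vector(text, max_len=5):
    """Represent a document as a sparse vector of word-sequence counts."""
    return Counter(word_sequences(text, max_len))


def binary_labels(doc_categories, category):
    """Build the target for one subproblem: 1 if the document belongs to
    the given category, 0 otherwise. One such target per category yields
    a set of independent binary classification problems."""
    return [1 if category in cats else 0 for cats in doc_categories]
```

For example, `to_word_vector("the cat the cat")` counts the two-word sequence `"the cat"` twice, and calling `binary_labels` once per category gives the kind of unbalanced positive/negative split the abstract mentions.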


Keywords: Feature Selection, Information Gain, Feature Subset, Term Frequency, Word Sequence





Copyright information

© Springer-Verlag Berlin Heidelberg 2004

Authors and Affiliations

  • Dunja Mladenić (1, 2)
  • Marko Grobelnik (1)

  1. J. Stefan Institute, Ljubljana, Slovenia
  2. Carnegie Mellon University, Pittsburgh, USA
