Crawling the Web with OntoDir

Picariello, Antonio; Rinaldi, Antonio M.

doi:10.1007/978-3-540-74469-6_71

Part of the book series: Lecture Notes in Computer Science ((LNISA,volume 4653))

Included in the following conference series:

International Conference on Database and Expert Systems Applications

Abstract

Managing the large amount of information on the Internet requires more efficient and effective methods and techniques for mining and representing information. The use of ontologies for knowledge representation has grown rapidly in recent years: a common, formal representation of knowledge allows a more accurate analysis of document content in several contexts. One challenging application is the Web, whose requirements are hard to satisfy, especially in a scenario as complex as the Semantic Web. In this paper we present a methodology for automatic topic annotation of Web pages. We describe an algorithm for word disambiguation that uses a suitable metric for measuring semantic relatedness, and we show a technique that detects the topic of the analyzed document by means of ontologies extracted from a knowledge base. The strategy is implemented in a system in which this information is used to build a topic hierarchy that is created automatically rather than defined a priori. Experimental results are presented and discussed in order to measure the effectiveness of our approach.
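The abstract does not give the relatedness metric or the knowledge base, so the following is only a minimal sketch of the general idea, under stated assumptions: a hypothetical toy taxonomy (PARENT), hypothetical sense labels (SENSES), and a simple path-length relatedness of 1 / (1 + distance) standing in for the paper's metric. Each word takes the sense most related to the senses of the other words in the page, and the topic is taken to be the most frequent parent concept of the chosen senses.

```python
from collections import Counter

# Hypothetical toy taxonomy (child -> parent); not taken from the paper.
PARENT = {
    "bank#institution": "finance",
    "bank#riverside": "geography",
    "loan#debt": "finance",
    "interest#charge": "finance",
    "interest#curiosity": "psychology",
    "finance": "entity",
    "geography": "entity",
    "psychology": "entity",
}

# Hypothetical candidate senses for each word.
SENSES = {
    "bank": ["bank#institution", "bank#riverside"],
    "loan": ["loan#debt"],
    "interest": ["interest#charge", "interest#curiosity"],
}


def ancestors(node):
    """Return the chain from node up to the root, node included."""
    chain = [node]
    while node in PARENT:
        node = PARENT[node]
        chain.append(node)
    return chain


def distance(a, b):
    """Number of edges between a and b through their lowest common ancestor."""
    depth_b = {n: i for i, n in enumerate(ancestors(b))}
    for i, n in enumerate(ancestors(a)):
        if n in depth_b:
            return i + depth_b[n]
    return None  # no common ancestor


def relatedness(a, b):
    """Assumed stand-in metric: shorter paths mean higher relatedness."""
    d = distance(a, b)
    return 0.0 if d is None else 1.0 / (1.0 + d)


def disambiguate(words):
    """Pick, for each word, the sense most related to the other words."""
    chosen = {}
    for w in words:
        others = [o for o in words if o != w]

        def score(sense):
            # Best-matching sense of every other word contributes to the score.
            return sum(
                max(relatedness(sense, s) for s in SENSES[o]) for o in others
            )

        chosen[w] = max(SENSES[w], key=score)
    return chosen


def detect_topic(chosen):
    """Assumed topic rule: most frequent parent concept of the chosen senses."""
    return Counter(PARENT[s] for s in chosen.values()).most_common(1)[0][0]


senses = disambiguate(["bank", "loan", "interest"])
topic = detect_topic(senses)
```

With this toy data, "bank" and "interest" resolve to their financial senses because those lie closer to "loan" in the taxonomy, and the detected topic is "finance". A real system would replace the toy dictionaries with a lexical knowledge base and the path-length score with the metric defined in the paper.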

Author information

Authors and Affiliations

Università di Napoli Federico II, Dipartimento di Informatica e Sistemistica, Via Claudio 21, 80125 Napoli, Italy
Antonio Picariello & Antonio M. Rinaldi

Editor information

Roland Wagner, Norman Revell, Günther Pernul

Cite this paper

Picariello, A., Rinaldi, A.M. (2007). Crawling the Web with OntoDir. In: Wagner, R., Revell, N., Pernul, G. (eds) Database and Expert Systems Applications. DEXA 2007. Lecture Notes in Computer Science, vol 4653. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-74469-6_71

DOI: https://doi.org/10.1007/978-3-540-74469-6_71
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-74467-2
Online ISBN: 978-3-540-74469-6
eBook Packages: Computer Science, Computer Science (R0)