Automatic Generation of Sitemaps Based on Navigation Systems

  • Pasqua Fabiana Lanotte
  • Fabio Fumarola
  • Donato Malerba
  • Michelangelo Ceci
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 10122)

Abstract

In this paper we present a method to automatically discover sitemaps from websites. Given a website, existing automatic solutions extract only a flat list of URLs, which does not show the hierarchical structure of its content. Manual approaches, performed by webmasters, produce deeper sitemaps than automatic methods. However, in many cases, partly because website content naturally evolves, these sitemaps stop reflecting the actual content and soon become unhelpful and confusing for users. We propose a different approach that is both automatic and effective. Our solution combines an algorithm that extracts frequent patterns from the navigation systems contained in a website (e.g. menus, nav-bars, content lists) with a hierarchy extraction algorithm able to discover rich hierarchies that unveil relationships among web pages (e.g. super-topic/sub-topic relationships). Experimental results show that our approach discovers high-quality sitemaps that have a deep hierarchy and are complete in the extracted URLs.
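
As a rough illustration of the first step only, the sketch below counts how many pages share the same contiguous run of links, on the assumption that a menu or nav-bar repeats in the same order across many pages. It is a simplified stand-in, not the sequential pattern mining algorithm used in the paper, and the function name `frequent_link_sequences`, the `min_support` threshold, and the toy `pages` data are all hypothetical.

```python
from collections import Counter


def frequent_link_sequences(pages, min_support, length=2):
    """Return contiguous link sequences of the given length that occur
    in at least `min_support` pages (page-level support)."""
    support = Counter()
    for links in pages.values():
        # Collect each sequence once per page so repeats on a single
        # page do not inflate its support.
        seen = {
            tuple(links[i:i + length])
            for i in range(len(links) - length + 1)
        }
        support.update(seen)
    return {seq: n for seq, n in support.items() if n >= min_support}


# Hypothetical toy site: each page maps to the ordered URLs it links to.
pages = {
    "/": ["/products", "/services", "/about", "/news/1"],
    "/products": ["/products", "/services", "/about", "/products/a"],
    "/about": ["/products", "/services", "/about"],
}

# Sequences shared by all three pages are likely part of a navigation menu.
patterns = frequent_link_sequences(pages, min_support=3)
```

Here the two link pairs that survive the threshold are the ones belonging to the shared menu, while page-specific links such as `/news/1` are filtered out. Turning such patterns into a hierarchy is a separate step that this sketch does not attempt.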

Keywords

Sitemaps, Web mining, Sequential pattern mining, Optimization

Notes

Acknowledgment

This project has received funding from the European Commission through the project MAESTRA - Learning from Massive, Incompletely annotated, and Structured Data (Grant number ICT-2013-612944).

Copyright information

© Springer International Publishing AG 2016

Authors and Affiliations

  • Pasqua Fabiana Lanotte (1)
  • Fabio Fumarola (1)
  • Donato Malerba (1, 2)
  • Michelangelo Ceci (1, 2)

  1. Dipartimento di Informatica, Università degli Studi di Bari Aldo Moro, Bari, Italy
  2. CINI: National Interuniversity Consortium for Informatics, Bari, Italy