Automatic Generation of Sitemaps Based on Navigation Systems
In this paper we present a method to automatically discover sitemaps from websites. Given a website, existing automatic solutions extract only a flat list of URLs that does not show the hierarchical structure of its content. Manual approaches, performed by webmasters, produce deeper sitemaps than automatic methods. However, in many cases, partly because of the natural evolution of a website's content, manually generated sitemaps stop reflecting the actual content and soon become unhelpful and confusing for users. We propose a different approach that is both automatic and effective. Our solution combines an algorithm that extracts frequent patterns from the navigation systems contained in a website (e.g. menus, nav-bars, content lists) with a hierarchy extraction algorithm able to discover rich hierarchies that unveil relationships among web pages (e.g. super-topic/sub-topic relationships). Experimental results show that our approach discovers high-quality sitemaps that have a deep hierarchy and are complete in the extracted URLs.
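To make the two-stage idea concrete, the following is a minimal illustrative sketch, not the algorithm evaluated in the paper. It assumes the navigation lists of each page have already been extracted as tuples of URLs. It replaces sequential pattern mining with simple exact-match support counting, and replaces the hierarchy extraction step with a greedy rule: a frequent list is attached under an already placed page that displays it but is not one of its items. The names `frequent_nav_lists`, `build_sitemap`, `min_support`, and the toy `pages` data are all hypothetical.

```python
from collections import defaultdict


def frequent_nav_lists(pages, min_support):
    """Map each navigation list to the set of pages showing it,
    keeping only lists seen on at least `min_support` pages."""
    occurrences = defaultdict(set)
    for url, nav_lists in pages.items():
        for nav in nav_lists:
            occurrences[tuple(nav)].add(url)
    return {nav: hosts for nav, hosts in occurrences.items()
            if len(hosts) >= min_support}


def build_sitemap(pages, home, min_support=2):
    """Return a child -> parent map rooted at `home` (assumed greedy rule)."""
    frequent = frequent_nav_lists(pages, min_support)
    parent = {home: None}
    # Process the most widely shared lists first (site-wide menus),
    # breaking ties on the list itself for a deterministic order.
    ordered = sorted(frequent.items(), key=lambda kv: (-len(kv[1]), kv[0]))
    for nav, hosts in ordered:
        # Candidate parents: placed pages that show the list
        # without being one of its items.
        candidates = sorted(h for h in hosts if h in parent and h not in nav)
        if not candidates:
            continue
        for url in nav:
            parent.setdefault(url, candidates[0])  # keep first placement
    return parent


# Toy website: a global menu plus two section-level sub-menus.
pages = {
    "/": [("/a", "/b")],
    "/a": [("/a", "/b"), ("/a/1", "/a/2")],
    "/b": [("/a", "/b"), ("/b/1",)],
    "/a/1": [("/a", "/b"), ("/a/1", "/a/2")],
    "/a/2": [("/a", "/b"), ("/a/1", "/a/2")],
    "/b/1": [("/a", "/b"), ("/b/1",)],
}
sitemap = build_sitemap(pages, "/")
```

In this toy input the global menu appears on every page and is attached to the home page, while each sub-menu is attached to the section page that displays it. A real site would need tolerance for partial matches and noisy lists, which this sketch does not attempt.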
Keywords: Sitemaps, Web mining, Sequential pattern mining, Optimization
This project has received funding from the European Commission through the project MAESTRA - Learning from Massive, Incompletely annotated, and Structured Data (Grant number ICT-2013-612944).