Focussed crawling of environmental Web resources based on the combination of multimedia evidence

Multimedia Tools and Applications

Abstract

Focussed crawlers enable the automatic discovery of Web resources about a given topic: they navigate the Web link structure and select the hyperlinks to follow by estimating their relevance to the topic, based on evidence obtained from the pages already downloaded. This work proposes a classifier-guided focussed crawling approach that estimates the relevance of a hyperlink to an unvisited Web resource by combining textual evidence representing its local context, namely the textual content appearing in its vicinity in the parent page, with visual evidence associated with its global context, namely the presence of images relevant to the topic within the parent page. The proposed approach is applied to the discovery of environmental Web resources that provide air quality measurements and forecasts, since such measurements (and particularly the forecasts) are not only provided in textual form but are also commonly encoded as multimedia, mainly in the form of heatmaps. Our evaluation experiments indicate that incorporating visual evidence in the link selection process of the focussed crawler is more effective than using textual features alone, particularly in conjunction with hyperlink exploration strategies that allow for the discovery of highly relevant pages that lie behind apparently irrelevant ones.
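The link selection idea described in the abstract can be sketched as a best-first crawl over a priority queue. This is only a minimal illustration under stated assumptions, not the authors' implementation: the keyword-overlap function stands in for a trained text classifier, the boolean `has_heatmap` flag stands in for the output of a heatmap image classifier, and the weighted sum with `visual_weight` is an assumed combination rule that the abstract does not specify. `TOY_WEB`, `fetch`, and all function names are hypothetical.

```python
import heapq
from dataclasses import dataclass, field


@dataclass(order=True)
class FrontierEntry:
    priority: float                    # negated score: heapq pops the smallest value first
    url: str = field(compare=False)    # excluded from ordering comparisons


def text_score(link_context, topic_terms):
    """Assumed stand-in for a text classifier: share of topic terms found near the link."""
    if not topic_terms:
        return 0.0
    words = set(link_context.lower().split())
    return len(words & topic_terms) / len(topic_terms)


def combined_score(link_context, parent_has_relevant_image, topic_terms, visual_weight=0.5):
    """Assumed combination rule: weighted sum of local textual and global visual evidence."""
    textual = text_score(link_context, topic_terms)
    visual = 1.0 if parent_has_relevant_image else 0.0
    return (1.0 - visual_weight) * textual + visual_weight * visual


def crawl(seed_urls, fetch, topic_terms, max_pages=10):
    """Best-first crawl: always expand the highest-scoring unvisited hyperlink."""
    frontier = [FrontierEntry(-1.0, url) for url in seed_urls]
    heapq.heapify(frontier)
    seen = set(seed_urls)
    visited = []
    while frontier and len(visited) < max_pages:
        entry = heapq.heappop(frontier)
        visited.append(entry.url)
        page = fetch(entry.url)        # parent page: its links and its image evidence
        for url, context in page["links"]:
            if url in seen:
                continue
            seen.add(url)
            score = combined_score(context, page["has_heatmap"], topic_terms)
            heapq.heappush(frontier, FrontierEntry(-score, url))
    return visited


# Hypothetical in-memory "Web" so the sketch runs without network access.
TOY_WEB = {
    "seed": {"has_heatmap": False,
             "links": [("a", "air quality forecast map"), ("b", "contact us")]},
    "a": {"has_heatmap": True, "links": [("c", "more")]},
    "b": {"has_heatmap": False, "links": [("d", "privacy policy")]},
    "c": {"has_heatmap": False, "links": []},
    "d": {"has_heatmap": False, "links": []},
}

if __name__ == "__main__":
    print(crawl(["seed"], TOY_WEB.__getitem__, {"air", "quality", "forecast"}))
```

In this toy graph the link to `c` has uninformative anchor text ("more"), yet it is followed before `b` because its parent page `a` contains a relevant image. Text-only scoring would give both links a score of zero and could not separate them.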


Notes

  1. Personalised Environmental Service Configuration and Delivery Orchestration (http://www.pescado-project.eu/).

  2. Adaptive Hierarchical Density Histogram.

  3. Both datasets are available at: http://mklab.iti.gr/project/heatmaps.

  4. These URLs are different to the ones used for training the classifiers.


Acknowledgments

This work was supported by the MULTISENSOR (contract no. FP7-610411) and HOMER (contract no. FP7-312388) projects, partially funded by the European Commission.

Author information

Corresponding author

Correspondence to Theodora Tsikrika.

About this article

Cite this article

Tsikrika, T., Moumtzidou, A., Vrochidis, S. et al. Focussed crawling of environmental Web resources based on the combination of multimedia evidence. Multimed Tools Appl 75, 1563–1587 (2016). https://doi.org/10.1007/s11042-015-2624-3

  • DOI: https://doi.org/10.1007/s11042-015-2624-3
