GUIDE: an interactive and incremental approach for crawling Web applications

The Journal of Supercomputing

Abstract

The Internet, with its sea of Web applications, is one of the largest data stores for big data analysis. Web crawlers have been used extensively to explore and retrieve the states (pages) of Web applications. Most crawlers let users define a few crawling directives to increase the number of states the crawler can explore. A directive can, for example, assign an input value to a specified input field so that the application performs a specific action and visits particular states. A crawler, however, is supposed to be capable of exploring an unknown application, and for an unknown application the user cannot readily prepare the required directives in advance. This paper proposes an interactive crawling approach, and a crawler called GUIDE, to overcome this issue. Instead of passively receiving directives from the user, GUIDE actively asks the user for directives whenever it finds Web pages containing input fields. In addition, GUIDE offers a hierarchical directive structure that allows the user to define multiple values for the same input field. A case study with three Web applications indicated that (1) interactive directives were very useful for increasing the code coverage of the application being explored, with improvements of up to 10.3–50.5%, and (2) using GUIDE is more efficient than using a traditional crawler: given the same amount of time, GUIDE improved code coverage by up to 11%.
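
The interactive, hierarchical idea described in the abstract can be illustrated with a minimal Python sketch. This is not GUIDE's implementation, which the abstract does not detail. The `Directive` class, the `crawl` function, the toy login application, and the scripted stand-in for the user are all hypothetical names and assumptions introduced here for illustration only.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set


@dataclass
class Directive:
    """One assignment of values to a page's input fields (hypothetical structure)."""
    page: str
    values: Dict[str, str]
    # Directives collected on the page reached by submitting these values.
    children: List["Directive"] = field(default_factory=list)


def crawl(page, fields_of, submit, ask_user, visited):
    """Explore from `page`, asking for directives whenever input fields appear."""
    visited.add(page)
    fields = fields_of(page)
    if not fields:
        return []  # no input fields, so no directive is requested
    directives = []
    # The user may supply several value sets for the same input fields.
    for values in ask_user(page, fields):
        node = Directive(page, values)
        next_page = submit(page, values)
        if next_page is not None and next_page not in visited:
            node.children = crawl(next_page, fields_of, submit, ask_user, visited)
        directives.append(node)
    return directives


# Toy application (an assumption for illustration): a login page that leads
# to a home page on the right password and to an error page otherwise.
FIELDS = {"login": ["user", "password"], "home": [], "error": []}


def fields_of(page):
    return FIELDS[page]


def submit(page, values):
    if page == "login":
        return "home" if values["password"] == "secret" else "error"
    return None


def scripted_user(page, fields):
    # Stands in for the interactive prompt: two value sets for the same fields.
    return [
        {"user": "alice", "password": "secret"},
        {"user": "alice", "password": "wrong"},
    ]


visited: Set[str] = set()
tree = crawl("login", fields_of, submit, scripted_user, visited)
print(sorted(visited))  # ['error', 'home', 'login']
```

In this toy setting, supplying only one value set for the login form would reach only one of the two follow-up pages. Allowing several value sets for the same input field is what lets the crawl reach both.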

References

  1. Ye M, Li G (2017) Internet big data and capital markets: a literature review. Financ Innov 3(1):6

  2. Brin S, Page L (1998) The anatomy of a large-scale hypertextual web search engine. Comput Netw ISDN Syst 30(1–7):107–117

  3. Burner M (1997) Crawling towards eternity: building an archive of the world wide web. Web Tech Mag 2(5):37–40

  4. Ferrucci F, Sarro F, Ronca D, Abrahao S (2011) A crawljax based approach to exploit traditional accessibility evaluation tools for AJAX applications. In: Information Technology and Innovation Trends in Organizations. Springer, pp 255–262

  5. Muñoz FR, Cortes IIS, Villalba LJG (2017) Enlargement of vulnerable web applications for testing. J Supercomput

  6. Park JH, Sung Y, Sharma PK, Jeong Y-S, Yi G (2017) Novel assessment method for accessing private data in social network security services. J Supercomput 73(7):3307–3325

  7. Groeneveld F, Mesbah A, van Deursen A (2010) Automatic invariant detection in dynamic web applications. Technical Report Series TUD-SERG-2010-037

  8. Mesbah A, Prasad MR (2011) Automated cross-browser compatibility testing. In: Proceedings of the 33rd International Conference on Software Engineering. ACM, pp 561–570

  9. Mirshokraie S, Mesbah A (2012) JSART: JavaScript assertion-based regression testing. In: Web Engineering. pp 238–252

  10. Tanida H, Prasad MR, Rajan SP, Fujita M (2011) Automated system testing of dynamic web applications. In: ICSOFT (Selected Papers). Springer, pp 181–196

  11. Mesbah A, van Deursen A, Lenselink S (2012) Crawling ajax-based web applications through dynamic analysis of user interface state changes. ACM Trans Web (TWEB) 6(1):3

  12. Silva CE, Campos JC (2013) Combining static and dynamic analysis for the reverse engineering of web applications. In: Proceedings of the 5th ACM SIGCHI Symposium on Engineering Interactive Computing Systems. ACM, pp 107–112

  13. Olston C, Najork M (2010) Web crawling. Found Trends Inf Retr 4(3):175–246

  14. Choudhary S, Dincturk ME, Mirtaheri SM, Moosavi A, von Bochmann G, Jourdan G-V, Onut IV (2012) Crawling rich internet applications: the state of the art. In: Proceedings of the 2012 Conference of the Center for Advanced Studies on Collaborative Research, CASCON ’12. IBM Corp, Riverton, pp 146–160

  15. Mirtaheri SM, Dinçtürk ME, Hooshmand S, Bochmann GV, Jourdan G-V, Onut IV (2013) A brief history of web crawlers. In: Proceedings of the 2013 Conference of the Center for Advanced Studies on Collaborative Research, CASCON ’13. IBM Corp, Riverton, pp 40–54

  16. van Deursen A, Mesbah A, Nederlof A (2015) Crawl-based analysis of web applications. Sci Comput Program 97(P1):173–180

  17. Fard AM, Mesbah A (2013) Feedback-directed exploration of web applications to derive test models. In: 2013 IEEE 24th International Symposium on Software Reliability Engineering (ISSRE). pp 278–287

  18. Dincturk ME, Choudhary S, von Bochmann G, Jourdan G-V, Onut IV (2012) A statistical approach for efficient crawling of rich internet applications. In: Proceedings of the 12th International Conference on Web Engineering, ICWE’12. Springer, Berlin, pp 362–369

  19. Choudhary S, Dincturk ME, Mirtaheri SM, Jourdan G-V, Bochmann GV, Onut IV (2013) Building rich internet applications models: example of a better strategy. In: Proceedings of the 13th International Conference on Web Engineering, ICWE’13. Springer, Berlin, pp 291–305

  20. Dincturk ME, Jourdan G-V, Bochmann GV, Onut IV (2014) A model-based approach for crawling rich internet applications. ACM Trans Web 8(3):19:1–19:39

  21. Moosavi A, Hooshmand S, Baghbanzadeh S, Jourdan G-V, Bochmann GV, Onut IV (2014) Indexing rich internet applications using components-based crawling. Springer International Publishing, Cham, pp 200–217

  22. Artzi S, Dolby J, Jensen SH, Møller A, Tip F (2011) A framework for automated testing of JavaScript web applications. In: Proceedings of the 33rd International Conference on Software Engineering, ICSE ’11. ACM, New York, pp 571–580

  23. Pellegrino G, Tschürtz C, Bodden E, Rossow C (2015) jÄk: using dynamic analysis to crawl and test modern web applications. Springer International Publishing, Cham, pp 295–316

  24. Chen W-K, Liu C-H, Chen K-M (2017) A Web crawler supporting interactive and incremental user directives. In: Proceedings of the 6th International Conference on Frontier Computing Theory, Technologies, and Applications. pp 105–114

  25. NodeBB (2017) An open-source bulletin board application. https://github.com/NodeBB/. Accessed 1 Dec 2017

  26. KeystoneJS (2017) A node.js CMS and web application framework. https://github.com/keystonejs. Accessed 1 Dec 2017

  27. TimeOff Management (2017) Allow small business to manage employee absences for free. https://github.com/timeoff-management. Accessed 1 Dec 2017

Author information

Correspondence to Chien-Hung Liu.

Cite this article

Liu, CH., Chen, WK. & Sun, CC. GUIDE: an interactive and incremental approach for crawling Web applications. J Supercomput 76, 1562–1584 (2020). https://doi.org/10.1007/s11227-018-2335-4
