GUIDE: an interactive and incremental approach for crawling Web applications

  • Chien-Hung Liu
  • Woei-Kae Chen
  • Chi-Chia Sun


The Internet, with its sea of Web applications, is one of the largest data stores for big data analysis. Web crawlers have been used extensively to explore and retrieve the states (pages) of Web applications. Most crawlers allow the user to define a few crawling directives so as to increase the coverage of states that the crawler can explore. A directive can, for example, assign an input value to a specified input field so that the application is instructed to perform a specific action and visit some special states. Note that a crawler is supposed to be capable of exploring an unknown application. But given an unknown application, how could the user possibly prepare the required directives in advance? This paper proposes an interactive crawling approach and a crawler called GUIDE to overcome this issue. Instead of passively receiving directives from the user, GUIDE actively asks the user for directives when Web pages containing input fields are found. In addition, GUIDE offers a hierarchical directive structure, allowing the user to define multiple values for the same input field. A case study with three Web applications indicated that (1) interactive directives were very useful for increasing the code coverage of the application being explored, with code coverage improvements of 10.3–50.5% achieved, and (2) using GUIDE is more efficient than using a traditional crawler: given the same amount of time, up to 11% of code coverage improvement can be achieved.
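The interactive idea described above can be illustrated with a minimal sketch. It is not GUIDE's implementation: the `Directive` class, the `crawl` function, and the toy application are all hypothetical names introduced here, and the tree layout is only one plausible reading of a "hierarchical directive structure". The sketch assumes that the crawler pauses whenever it reaches a page with input fields, asks the user for one or more sets of input values, and records each set as a node whose children hold the directives for pages reached afterwards.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional, Set, Tuple


@dataclass
class Directive:
    """Hypothetical directive node: input values supplied for one page."""
    page_id: str
    values: Dict[str, str]  # input field name -> value
    children: List["Directive"] = field(default_factory=list)


def crawl(
    start: str,
    fields_of: Callable[[str], List[str]],
    submit: Callable[[str, Dict[str, str]], Optional[str]],
    ask_user: Callable[[str, List[str]], List[Dict[str, str]]],
) -> Tuple[Set[str], List[Directive]]:
    """Explore pages depth-first, asking for directives on demand."""
    visited: Set[str] = set()
    roots: List[Directive] = []

    def explore(page_id: str, siblings: List[Directive]) -> None:
        if page_id in visited:
            return
        visited.add(page_id)
        fields = fields_of(page_id)
        if not fields:
            return  # nothing to fill in, so nothing to ask
        # Ask only now, when the page is actually known. The user may
        # return several value sets for the same input fields.
        for values in ask_user(page_id, fields):
            node = Directive(page_id, values)
            siblings.append(node)
            next_page = submit(page_id, values)
            if next_page is not None:
                explore(next_page, node.children)

    explore(start, roots)
    return visited, roots


# Toy application (an assumption for illustration only).
TOY_FIELDS = {"login": ["user"], "home": [], "error": []}


def toy_fields_of(page_id: str) -> List[str]:
    return TOY_FIELDS.get(page_id, [])


def toy_submit(page_id: str, values: Dict[str, str]) -> Optional[str]:
    if page_id == "login":
        return "home" if values.get("user") == "admin" else "error"
    return None


def scripted_user(page_id: str, fields: List[str]) -> List[Dict[str, str]]:
    # Stands in for the interactive prompt: two values for one field.
    if page_id == "login":
        return [{"user": "admin"}, {"user": "guest"}]
    return []


visited, tree = crawl("login", toy_fields_of, toy_submit, scripted_user)
print(sorted(visited))
```

In this toy run, supplying two values for the same field lets the crawl reach both the page behind a valid login and the error page, which a single predefined value would not.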


Keywords: Big data; Web crawler; Coverage; Interactive crawler; Directives



Copyright information

© Springer Science+Business Media, LLC, part of Springer Nature 2018

Authors and Affiliations

  1. National Taipei University of Technology, Taipei, Taiwan
