Abstract
The Internet, having a sea of Web applications, is one of the largest data stores for big data analysis. To explore and retrieve the states (pages) from Web applications, Web crawlers have been extensively used. Most crawlers allow the users to define a few crawling directives so as to increase the coverage of states that the crawler can explore. A directive can, for example, assign an input value to a specified input field so that the application is instructed to perform a specific action and visit some special states. Note that, a crawler is supposedly capable of exploring an unknown application. But, given an unknown application, how could the user possibly prepare the required directives in advance? This paper proposes an interactive crawling approach and a crawler called GUIDE to overcome this issue. Instead of passively receiving directives from the user, GUIDE actively asks the user for directives when Web pages containing input fields are found. In addition, GUIDE offers a hierarchical directive structure, allowing the user to define multiple values for the same input field. A case study with three Web applications indicated that (1) interactive directives were very useful for increasing the code coverage of the application being explored—up to 10.3–50.5% of code coverage improvement can be achieved, and (2) using GUIDE is more efficient than using a traditional crawler—given the same amount of time, up to 11% of code coverage improvement can be achieved.
Similar content being viewed by others
References
Ye M, Li G (2017) Internet big data and capital markets: a literature review. Financ Innov 3(1):6
Brin S, Page L (1998) The anatomy of a large-scale hypertexual web search engine. Comput Netw ISDN Syst 30(1–7):107–117
Burner M (1997) Crawling towards eternity: building an archive of the world wide web. Web Tech. Mag. 2(5):37–40
Ferrucci F, Sarro F, Ronca D, Abrahao S (2011) A crawljax based approach to exploit traditional accessibility evaluation tools for AJAX applications. In: Information Technology and Innovation Trends in Organizations. Springer, pp 255–262
Muñoz FR, Cortes IIS, Villalba LJG (2017) Enlargement of vulnerable web applications for testing. J Supercomput
Park JH, Sung Y, Sharma PK, Jeong Y-S, Yi G (2017) Novel assessment method for accessing private data in social network security services. J Supercomput 73(7):3307–3325
Groeneveld F, Mesbah A, van Deursen A (2010) Automatic invariant detection in dynamic web applications. Technical Report Series TUD-SERG-2010-037
Mesbah A, Prasad MR (2011) Automated cross-browser compatibility testing. In: Proceedings of the 33rd International Conference on Software Engineering. ACM, pp 561–570
Mirshokraie S, Mesbah A (2012) JSART: Javascript assertion-based regression testing. In: Web Engineering. pp 238–252
Tanida H, Prasad MR, Rajan SP, Fujita M (2011) Automated system testing of dynamic web applications. In: ICSOFT (Selected Papers). Springer, pp 181–196
Mesbah A, van Deursen A, Lenselink S (2012) Crawling ajax-based web applications through dynamic analysis of user interface state changes. ACM Trans Web (TWEB) 6(1):3
Silva CE, Campos JC (2013) Combining static and dynamic analysis for the reverse engineering of web applications. In: Proceedings of the 5th ACM SIGCHI Symposium on Engineering Interactive Computing Systems. ACM, pp 107–112
Olston C, Najork M (2010) Web crawling. Found. Trends Inf. Retr. 4(3):175–246
Choudhary S, Dincturk ME, Mirtaheri SM, Moosavi A, von Bochmann G, Jourdan G-V, Onut IV (2012) Crawling rich internet applications: the state of the art. In: Proceedings of the 2012 Conference of the Center for Advanced Studies on Collaborative Research, CASCON ’12, IBM Corp, Riverton, pp 146–160
Mirtaheri SM, Dinçtürk ME, Hooshmand S, Bochmann GV, Jourdan G-V, Onut IV (2013) A brief history of web crawlers. In: Proceedings of the 2013 Conference of the Center for Advanced Studies on Collaborative Research, CASCON ’13. IBM Corp, Riverton, pp 40–54
van Deursen A, Mesbah A, Nederlof A (2015) Crawl-based analysis of web applications. Sci. Comput. Program. 97(P1):173–180
Fard AM, Mesbah A (2013) Feedback-directed exploration of web applications to derive test models. In: 2013 IEEE 24th International Symposium on Software Reliability Engineering (ISSRE). pp 278–287
Dincturk ME, Choudhary S, von Bochmann G, Jourdan G-V, Onut IV (2012) A statistical approach for efficient crawling of rich internet applications. In: Proceedings of the 12th International Conference on Web Engineering, ICWE’12. Springer, Berlin, pp 362–369
Choudhary S, Dincturk ME, Mirtaheri SM, Jourdan G-V, Bochmann GV, Onut IV (2013) Building rich internet applications models: example of a better strategy. In: Proceedings of the 13th International Conference on Web Engineering, ICWE’13. Springer, Berlin, pp 291–305
Dincturk ME, Jourdan G-V, Bochmann GV, Onut IV (2014) A model-based approach for crawling rich internet applications. ACM Trans. Web 8(3):19:1–19:39
Moosavi A, Hooshmand S, Baghbanzadeh S, Jourdan G-V, Bochmann GV, Onut IV (2014) Indexing rich internet applications using components-based crawling. Springer International Publishing, Cham, pp 200–217
Artzi S, Dolby J, Jensen SH, Møller A, Tip F (2011) A framework for automated testing of javascript web applications. In: Proceedings of the 33rd International Conference on Software Engineering, ICSE ’11. ACM, New York, pp 571–580
Pellegrino G, Tschürtz C, Bodden E, Rossow C (2015) jÄk: using dynamic analysis to crawl and test modern web applications. Springer International Publishing, Cham, pp 295–316
Chen W-K, Liu C-H, Chen K-MA (2017) Web crawler supporting interactive and incremental user directives. In: Proceedings of the 6th International Conference on Frontier Computing Theory, Technologies, and Applications. pp 105–114
Node BB (2017) An open-source bulletin board application. https://github.com/NodeBB/. Accessed 1 Dec 2017
Keystone JS (2017) A node.js CMS and web application framework. https://github.com/keystonejs. Accessed 1 Dec 2017
TimeOff Management (2017) Allow small business to manage employee absences for free. https://github.com/timeoff-management. Accessed 1 Dec 2017
Author information
Authors and Affiliations
Corresponding author
Rights and permissions
About this article
Cite this article
Liu, CH., Chen, WK. & Sun, CC. GUIDE: an interactive and incremental approach for crawling Web applications. J Supercomput 76, 1562–1584 (2020). https://doi.org/10.1007/s11227-018-2335-4
Published:
Issue Date:
DOI: https://doi.org/10.1007/s11227-018-2335-4