GUIDE: an interactive and incremental approach for crawling Web applications

Liu, Chien-Hung; Chen, Woei-Kae; Sun, Chi-Chia

doi:10.1007/s11227-018-2335-4

Published: 28 March 2018

Volume 76, pages 1562–1584, (2020)

The Journal of Supercomputing

Abstract

The Internet, with its sea of Web applications, is one of the largest data stores for big data analysis. Web crawlers are used extensively to explore and retrieve the states (pages) of Web applications. Most crawlers allow users to define a few crawling directives to increase the coverage of states that the crawler can explore. A directive can, for example, assign an input value to a specified input field so that the application is instructed to perform a specific action and visit some special states. A crawler, however, is supposed to be capable of exploring an unknown application, and given an unknown application, how could the user possibly prepare the required directives in advance? This paper proposes an interactive crawling approach and a crawler called GUIDE to overcome this issue. Instead of passively receiving directives from the user, GUIDE actively asks the user for directives when Web pages containing input fields are found. In addition, GUIDE offers a hierarchical directive structure, allowing the user to define multiple values for the same input field. A case study with three Web applications indicated that (1) interactive directives were very useful for increasing the code coverage of the application being explored, with code coverage improvements of 10.3–50.5% achieved, and (2) using GUIDE is more efficient than using a traditional crawler: given the same amount of time, a code coverage improvement of up to 11% can be achieved.
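
The interactive idea in the abstract (ask for directives only when a page with input fields is actually reached, and allow several values for the same field) can be illustrated with a small sketch. This is not GUIDE's implementation: the toy application model `TOY_APP`, the function `crawl`, and the callback `ask_user` are all hypothetical names introduced here for illustration, and the directives are kept as a flat list per state rather than the hierarchical structure the paper describes.

```python
from collections import deque
from typing import Callable, Dict, List, Set, Tuple

Values = Dict[str, str]  # one assignment of values to input fields


def login_submit(values: Values) -> str:
    """Hypothetical transition: the next state depends on the submitted value."""
    if values["user"] == "admin":
        return "admin_home"
    if values["user"] == "alice":
        return "user_home"
    return "login"


# Hypothetical toy application: each state lists its input fields and, if it
# has any, a function mapping submitted values to the next state.
TOY_APP = {
    "login": {"fields": ["user"], "submit": login_submit},
    "admin_home": {"fields": [], "submit": None},
    "user_home": {"fields": [], "submit": None},
}


def crawl(
    app: dict,
    start: str,
    ask_user: Callable[[str, List[str]], List[Values]],
) -> Tuple[Set[str], Dict[str, List[Values]]]:
    """Breadth-first crawl that requests directives on demand."""
    directives: Dict[str, List[Values]] = {}  # state -> value assignments
    visited: Set[str] = set()
    queue = deque([start])
    while queue:
        state = queue.popleft()
        if state in visited:
            continue
        visited.add(state)
        fields = app[state]["fields"]
        if not fields:
            continue  # nothing to fill in on this page
        if state not in directives:
            # Ask only when a page with input fields is actually found.
            directives[state] = ask_user(state, fields)
        for values in directives[state]:
            # Each directive may supply a different value for the same field.
            next_state = app[state]["submit"](values)
            if next_state not in visited:
                queue.append(next_state)
    return visited, directives


def scripted_user(state: str, fields: List[str]) -> List[Values]:
    """Stands in for the human: two values for the same 'user' field."""
    return [{"user": "admin"}, {"user": "alice"}]


visited, directives = crawl(TOY_APP, "login", scripted_user)
print(sorted(visited))
```

In this toy model, a crawl whose callback returns no directives never leaves the `login` state, while the two scripted values for the same field reach both home pages. That is the kind of coverage difference the abstract attributes to interactive directives, though the figures reported there come from the authors' case study, not from a model like this one.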

Author information

Authors and Affiliations

National Taipei University of Technology, Taipei, Taiwan
Chien-Hung Liu, Woei-Kae Chen & Chi-Chia Sun

Corresponding author

Correspondence to Chien-Hung Liu.

About this article

Cite this article

Liu, CH., Chen, WK. & Sun, CC. GUIDE: an interactive and incremental approach for crawling Web applications. J Supercomput 76, 1562–1584 (2020). https://doi.org/10.1007/s11227-018-2335-4

Issue Date: March 2020
