Hidden-Web Induced by Client-Side Scripting: An Empirical Study

Behfarshad, Zahra; Mesbah, Ali

doi:10.1007/978-3-642-39200-9_7

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 7977)

Included in the following conference series:

International Conference on Web Engineering

Abstract

Client-side JavaScript is increasingly used to enhance the functionality, interactivity, and responsiveness of web applications. Through the execution of JavaScript code in browsers, the DOM tree representing a webpage at runtime can be incrementally updated without requiring a URL change. This dynamically updated content is hidden from general search engines. In this paper, we present the first empirical study on measuring and characterizing the hidden-web induced by client-side JavaScript execution. Our study reveals that this type of hidden-web content is prevalent in online web applications today: of the 500 websites we analyzed, 95% contain client-side hidden-web content. On those websites that contain client-side hidden-web content, (1) on average, 62% of the web states are hidden; (2) per hidden state, an average of 19 kilobytes of data is hidden, of which 0.6 kilobytes is textual content; (3) the DIV element is the most common clickable element used (61%) to initiate this type of hidden-web state transition; and (4) on average, 25 minutes are required to dynamically crawl 50 DOM states. Further, our study indicates that there is a correlation between DOM tree size and hidden-web content, but no correlation between the amount of JavaScript code and client-side hidden-web content.

Author information

Authors and Affiliations

University of British Columbia, Vancouver, BC, Canada
Zahra Behfarshad & Ali Mesbah

Editor information

Editors and Affiliations

University of Trento, Via Sommarive 5, 38123, Povo, TN, Italy
Florian Daniel
Department of Computer Science, Aalborg University, Selma Lagerloefs Vej 300, 9220, Aalborg, Denmark
Peter Dolog
Department of Computer Science, City University of Hong Kong, 83 Tat Chee Ave., Kowloon, Hong Kong, China
Qing Li

About this paper

Cite this paper

Behfarshad, Z., Mesbah, A. (2013). Hidden-Web Induced by Client-Side Scripting: An Empirical Study. In: Daniel, F., Dolog, P., Li, Q. (eds) Web Engineering. ICWE 2013. Lecture Notes in Computer Science, vol 7977. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-39200-9_7

DOI: https://doi.org/10.1007/978-3-642-39200-9_7
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-39199-6
Online ISBN: 978-3-642-39200-9
eBook Packages: Computer Science (R0)
