Abstract
A large portion of today’s Web consists of pages populated with information from a myriad of online databases. This part of the Web, known as the deep Web, remains relatively unexplored, and even its major characteristics, such as the number of searchable databases on the Web or the distribution of those databases by subject, are still disputed. In this paper, we revisit a problem of deep Web characterization: how to estimate the total number of online databases on the Web. We propose the Host-IP clustering sampling method to address the drawbacks of existing approaches to deep Web characterization, and we report our findings from a survey of the Russian Web. The obtained estimates, together with the proposed sampling technique, could be useful for further studies of data in the deep Web.
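The abstract does not spell out the estimator, so the following is only a minimal sketch of the general idea, assuming that hosts are grouped into clusters by shared IP address and that a textbook one-stage cluster sampling estimate (sampled total scaled by N/n) is applied. It is not the paper's actual procedure. The function names, the `count_databases` callback, and the example hosts and counts are all hypothetical.

```python
import random
from collections import defaultdict


def cluster_hosts_by_ip(host_to_ip):
    """Group hostnames that resolve to the same IP address (one cluster per IP)."""
    clusters = defaultdict(list)
    for host, ip in host_to_ip.items():
        clusters[ip].append(host)
    return dict(clusters)


def estimate_total_databases(clusters, sample_size, count_databases, rng):
    """Textbook one-stage cluster sampling estimate of a population total.

    A simple random sample of clusters is drawn without replacement, every
    host in each sampled cluster is inspected, and the sampled total is
    scaled by (number of clusters / number of sampled clusters).
    """
    sampled_ips = rng.sample(sorted(clusters), sample_size)
    sampled_total = sum(
        count_databases(host) for ip in sampled_ips for host in clusters[ip]
    )
    return len(clusters) / sample_size * sampled_total


# Hypothetical data: four hosts on three IPs (documentation address range).
host_to_ip = {
    "a.example": "192.0.2.1",
    "b.example": "192.0.2.1",
    "c.example": "192.0.2.2",
    "d.example": "192.0.2.3",
}
# Hypothetical ground truth; in practice each host would be inspected
# for search interfaces to online databases.
db_counts = {"a.example": 1, "b.example": 0, "c.example": 2, "d.example": 0}

clusters = cluster_hosts_by_ip(host_to_ip)
estimate = estimate_total_databases(
    clusters, sample_size=2, count_databases=db_counts.get, rng=random.Random(0)
)
```

Sampling whole clusters rather than individual hosts means that hosts sharing an IP are inspected together. When every cluster is sampled, the estimate reduces to the exact total.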
Copyright information
© 2011 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Shestakov, D. (2011). Sampling the National Deep Web. In: Hameurlain, A., Liddle, S.W., Schewe, KD., Zhou, X. (eds) Database and Expert Systems Applications. DEXA 2011. Lecture Notes in Computer Science, vol 6860. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-23088-2_24
DOI: https://doi.org/10.1007/978-3-642-23088-2_24
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-23087-5
Online ISBN: 978-3-642-23088-2
eBook Packages: Computer Science, Computer Science (R0)