Deep Web adaptive crawling based on minimum executable pattern

Journal of Intelligent Information Systems

Abstract

The key to Deep Web crawling is to submit valid input values to a query form and retrieve Deep Web content efficiently. In the literature, related work focuses only on generic text boxes or on entire query forms, which leads to the problem of “data islands” or to low validity of query submission. This paper proposes the concept of the Minimum Executable Pattern (MEP), a minimal combination of elements in a query form that can conduct a successful query, and then presents an MEP generation method and an MEP-based Deep Web adaptive crawling method. The query form is parsed and partitioned into a set of MEPs, and locally optimal queries are then generated by choosing an MEP from the set and a keyword vector for that MEP. Furthermore, the crawler can decide when to terminate, balancing high coverage of the content against resource consumption. Adopting MEPs is expected to improve the validity of query submission, and adaptive selection among multiple MEPs helps overcome the problem of “data islands”. We present a set of experiments to validate the effectiveness of the proposed method. Experimental results show that our method outperforms state-of-the-art methods in terms of query capability and applicability and that, on average, it achieves good coverage by issuing only a few hundred queries.
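
The crawl loop summarized in the abstract can be pictured with a small sketch. This is not the paper's algorithm: the toy database, the MEP set, the frequency-based query choice, and the `window`/`min_new` termination rule are all assumptions made for illustration. It only shows the general shape of choosing a (MEP, keyword vector) pair at each step, harvesting new candidates from the results, and stopping when recent queries return too little new content.

```python
from collections import Counter

# Toy "hidden" database standing in for a Deep Web source (hypothetical data).
RECORDS = [
    {"id": 1, "author": "liu", "venue": "jiis", "year": "2011"},
    {"id": 2, "author": "liu", "venue": "vldb", "year": "2008"},
    {"id": 3, "author": "wu", "venue": "icde", "year": "2006"},
    {"id": 4, "author": "wu", "venue": "vldb", "year": "2008"},
    {"id": 5, "author": "he", "venue": "sigmod", "year": "2004"},
]

# Assumed MEP set: each MEP is a tuple of form fields that are filled together.
MEPS = [("author",), ("venue", "year")]


def submit(mep, values):
    """Simulate one query: return records matching every field of the MEP."""
    return [r for r in RECORDS if all(r[f] == v for f, v in zip(mep, values))]


def crawl(seed, window=2, min_new=1):
    """Greedy crawl: issue the most frequently observed unissued query next."""
    seen, issued, candidates = {}, set(), Counter()
    recent_gains = []
    query = seed
    while query is not None:
        issued.add(query)
        new = [r for r in submit(*query) if r["id"] not in seen]
        for r in new:
            seen[r["id"]] = r
            # Harvest one candidate keyword vector per MEP from each new record.
            for mep in MEPS:
                candidates[(mep, tuple(r[f] for f in mep))] += 1
        recent_gains.append(len(new))
        # Assumed termination rule: stop when the last `window` queries
        # together returned fewer than `min_new` new records.
        if len(recent_gains) >= window and sum(recent_gains[-window:]) < min_new:
            break
        remaining = [(c, q) for q, c in candidates.items() if q not in issued]
        query = max(remaining, key=lambda t: t[0])[1] if remaining else None
    return seen, issued


seen, issued = crawl((("author",), ("liu",)))
print(sorted(seen), len(issued))
```

Starting from the single-field seed, the sketch reaches records 1 to 4 by switching between the two MEPs, while record 5 shares no value with the others and stays unreachable from this seed, which is one simple way to picture an island of data.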

Acknowledgements

The research was supported in part by the National High-Tech R&D Program of China under Grant No. 2008AA01Z131, the National Science Foundation of China under Grant Nos. 60825202, 60803079, and 60921003, the National Key Technologies R&D Program of China under Grant Nos. 2006BAK11B02 and 2006BAJ07B06, the Program for New Century Excellent Talents in University of China under Grant No. NECT-08-0433, the Doctoral Fund of Ministry of Education of China under Grant No. 20090201110060, and the Cheung Kong Scholar’s Program. The authors are grateful to the anonymous reviewers for their comments, which greatly improved the quality of the paper.

Author information

Corresponding author

Correspondence to Jun Liu.

About this article

Cite this article

Liu, J., Jiang, L., Wu, Z. et al. Deep Web adaptive crawling based on minimum executable pattern. J Intell Inf Syst 36, 197–215 (2011). https://doi.org/10.1007/s10844-010-0124-5

  • DOI: https://doi.org/10.1007/s10844-010-0124-5
