A Framework for Incremental Deep Web Crawler Based on URL Classification

Zhang, Zhixiao; Dong, Guoqing; Peng, Zhaohui; Yan, Zhongmin

doi:10.1007/978-3-642-23982-3_37

Conference paper

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 6988)

Abstract

As the Web grows rapidly, more and more data become available in the Deep Web, but users have to key in a set of keywords in order to access the pages of some web sites. Traditional search engines index and retrieve only Surface Web pages through static URL links, because Deep Web pages are hidden behind forms. However, the Deep Web not only contains far more information than the Surface Web; its information is also more valuable. Because Deep Web pages change rapidly, keeping already crawled Deep Web pages fresh while also crawling new Deep Web pages is a challenge. A framework for an incremental Deep Web crawler based on URL classification is proposed. Corresponding to list pages and leaf pages, the URLs associated with a page are divided into two classes: list URLs and leaf URLs. The framework not only crawls the latest Deep Web pages according to the change frequency of list pages, but also re-crawls the leaf pages that change often.
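The abstract does not give implementation details, so the following is only a minimal sketch of the general idea: classify each URL as a list URL or a leaf URL, then revisit pages sooner when they have been observed to change more often. Everything concrete here is an assumption made for illustration and is not taken from the paper: the query-parameter heuristic in `classify_url`, the `PageStats` record, the changes-per-visit estimate of change frequency, and the `revisit_interval` formula.

```python
import heapq
from dataclasses import dataclass
from urllib.parse import urlparse, parse_qs

# Hypothetical heuristic: query parameters that suggest a result-list page.
LIST_PARAMS = ("page", "q", "keyword")


def classify_url(url: str) -> str:
    """Return "list" for result-list URLs and "leaf" for detail-page URLs."""
    query = parse_qs(urlparse(url).query)
    return "list" if any(p in query for p in LIST_PARAMS) else "leaf"


@dataclass
class PageStats:
    visits: int = 0   # how many times the page was fetched
    changes: int = 0  # how many of those fetches differed from the stored copy

    def change_rate(self) -> float:
        # Unvisited pages get rate 1.0 so that they are crawled promptly.
        return self.changes / self.visits if self.visits else 1.0


def revisit_interval(stats: PageStats, min_hours: float = 1.0,
                     floor: float = 0.05) -> float:
    """Hours until the next visit; pages that change often wait less."""
    return min_hours / max(stats.change_rate(), floor)


def build_schedule(stats_by_url: dict, now: float = 0.0) -> list:
    """Return (due_time, url, url_class) tuples, earliest due first."""
    heap = []
    for url, stats in stats_by_url.items():
        due = now + revisit_interval(stats)
        heapq.heappush(heap, (due, url, classify_url(url)))
    return [heapq.heappop(heap) for _ in range(len(heap))]


if __name__ == "__main__":
    observed = {
        "http://example.com/search?q=books&page=1": PageStats(visits=4, changes=4),
        "http://example.com/item/123": PageStats(visits=10, changes=1),
    }
    for due, url, url_class in build_schedule(observed):
        print(f"{due:6.1f}h  {url_class:4s}  {url}")
```

In this sketch the list page, which changed on every visit, is scheduled before the rarely changing leaf page. A real crawler would replace the parameter heuristic with a classifier suited to the target sites and would likely use a more careful change-frequency estimator.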

Author information

Authors and Affiliations

School of Computer Science and Technology, Shandong University, Jinan, China
Zhixiao Zhang, Guoqing Dong, Zhaohui Peng & Zhongmin Yan
Shandong Dareway Software Co., Ltd., Jinan, China
Guoqing Dong

Editor information

Editors and Affiliations

Department of Computer and Information Science, University of Macau, Av. Padre Tomás Pereira, Taipa, Macau, China
Zhiguo Gong
School of Computer, Shanghai University, 200444, Shanghai, China
Xiangfeng Luo
College of Computer and Software, Taiyuan University of Technology, 030024, Taiyuan, China
Junjie Chen
School of Computer and Information Engineering, Shanghai University of Electric Power, 200090, Shanghai, China
Jingsheng Lei
Department of Business Administration, Caritas Institute of Higher Education, 18 Chui Ling Road, Tseung Kwan O, Hong Kong, China
Fu Lee Wang

About this paper

Cite this paper

Zhang, Z., Dong, G., Peng, Z., Yan, Z. (2011). A Framework for Incremental Deep Web Crawler Based on URL Classification. In: Gong, Z., Luo, X., Chen, J., Lei, J., Wang, F.L. (eds) Web Information Systems and Mining. WISM 2011. Lecture Notes in Computer Science, vol 6988. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-23982-3_37

DOI: https://doi.org/10.1007/978-3-642-23982-3_37
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-23981-6
Online ISBN: 978-3-642-23982-3
eBook Packages: Computer Science, Computer Science (R0)
