Using Structured Tokens to Identify Webpages for Data Extraction

Lin, Ling; Zhou, Lizhu; Guo, Qi; Li, Gang

doi:10.1007/978-3-540-72524-4_27


Conference paper

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 4505)

Abstract

As the web grows, more and more data becomes available through webpages, such as product items served from back-end databases. Data extraction plays an important role in providing efficient access to the data objects contained in these pages. However, identifying suitable webpages to feed into data extraction is a prerequisite and a non-trivial task, so there is an increasing need for methods that can automatically identify target pages on unknown websites. In this paper, we address the problem by exploiting structured-token features of webpage content and applying a decision-tree-based classification algorithm to induce the structure information. Furthermore, a preliminary recognition of data objects is obtained to efficiently initiate the subsequent data extraction. We evaluate our approach on real-world data and achieve promising results.
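The abstract does not define the structured-token features or the tree learner, so the following is only a minimal sketch of the general idea, under stated assumptions: opening-tag names stand in for "structured tokens", their counts are the features, and a depth-one decision tree (a decision stump) stands in for the decision-tree classifier. The names `TagTokenizer`, `token_counts`, `train_stump`, and `predict`, as well as the toy pages, are hypothetical and not taken from the paper.

```python
from collections import Counter
from html.parser import HTMLParser


class TagTokenizer(HTMLParser):
    """Collects opening-tag names as an assumed stand-in for structured tokens."""

    def __init__(self):
        super().__init__()
        self.tokens = []

    def handle_starttag(self, tag, attrs):
        self.tokens.append(tag)


def token_counts(html):
    """Return a Counter mapping each opening-tag name to its frequency."""
    parser = TagTokenizer()
    parser.feed(html)
    parser.close()
    return Counter(parser.tokens)


def train_stump(pages, labels):
    """Fit a depth-one decision tree: predict True when count(tag) >= threshold.

    Tries every (tag, threshold) pair observed in the training pages and
    keeps the first one with the fewest training errors.
    """
    feats = [token_counts(p) for p in pages]
    tags = sorted({t for f in feats for t in f})
    best = None  # (errors, tag, threshold)
    for tag in tags:
        for thr in sorted({f[tag] for f in feats if f[tag] > 0}):
            errors = sum((f[tag] >= thr) != y for f, y in zip(feats, labels))
            if best is None or errors < best[0]:
                best = (errors, tag, thr)
    if best is None:
        raise ValueError("no tag tokens found in the training pages")
    return best[1], best[2]


def predict(stump, html):
    """Classify a page as a target page (True) or not (False)."""
    tag, thr = stump
    return token_counts(html)[tag] >= thr


# Toy training set (invented for illustration): listing pages are targets.
LISTING_A = "<html><body><table>" + "<tr><td>item</td><td>$9</td></tr>" * 5 + "</table></body></html>"
LISTING_B = "<html><body><h1>Shop</h1><table>" + "<tr><td>x</td></tr>" * 3 + "</table></body></html>"
ARTICLE_A = "<html><body><h1>News</h1><p>text</p><p>more text</p></body></html>"
ARTICLE_B = "<html><body><p>hi</p></body></html>"

stump = train_stump(
    [LISTING_A, ARTICLE_A, LISTING_B, ARTICLE_B],
    [True, False, True, False],
)
unseen = "<html><body><table><tr><td>a</td></tr><tr><td>b</td></tr></table></body></html>"
print(stump, predict(stump, unseen))
```

On this toy data the learned rule splits on the presence of a `table` tag. A realistic setup would use richer token features and a full decision-tree learner trained on labelled pages from many sites.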

This work is supported in part by National Natural Science Foundation of China 60520130299.



Author information

Authors and Affiliations

Tsinghua University, Beijing 100084, PRC
Ling Lin, Lizhu Zhou, Qi Guo & Gang Li


Editor information

Guozhu Dong, Xuemin Lin, Wei Wang, Yun Yang, Jeffrey Xu Yu


Cite this paper

Lin, L., Zhou, L., Guo, Q., Li, G. (2007). Using Structured Tokens to Identify Webpages for Data Extraction. In: Dong, G., Lin, X., Wang, W., Yang, Y., Yu, J.X. (eds) Advances in Data and Web Management. APWeb WAIM 2007. Lecture Notes in Computer Science, vol 4505. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-72524-4_27


DOI: https://doi.org/10.1007/978-3-540-72524-4_27
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-72483-4
Online ISBN: 978-3-540-72524-4
eBook Packages: Computer Science, Computer Science (R0)
