Using Structured Tokens to Identify Webpages for Data Extraction

  • Conference paper

Part of the book series: Lecture Notes in Computer Science ((LNISA,volume 4505))

Abstract

As the web grows, more and more data becomes available through webpages, such as product items served from back-end databases. Data extraction plays an important role in providing efficient access to the data objects contained in these pages. However, identifying suitable webpages to feed into data extraction is a prerequisite and a non-trivial task. As a result, there is an increasing need for methods that can automatically identify target pages on unknown websites. In this paper, we solve the problem by exploiting the structured-token features of webpage content and applying a decision-tree-based classification algorithm to induce the structure information. Furthermore, a preliminary recognition of data objects is obtained to efficiently initiate the subsequent data extraction. We evaluate our approach on real-world data and achieve promising results.
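The abstract does not define the structured-token features or the specific decision tree algorithm, so the following is only a minimal sketch of the general idea under stated assumptions: HTML tag counts stand in for structured tokens, and a depth-1 decision tree (a stump chosen by Gini impurity) stands in for the classifier. The names `TagCounter`, `tag_features`, `fit_stump`, and `predict`, the tag list, and the toy pages are all hypothetical and do not come from the paper.

```python
from collections import Counter
from html.parser import HTMLParser


class TagCounter(HTMLParser):
    """Counts opening tags; tag counts are an assumed proxy for structured tokens."""

    def __init__(self):
        super().__init__()
        self.counts = Counter()

    def handle_starttag(self, tag, attrs):
        self.counts[tag] += 1


def tag_features(html, tags=("tr", "td", "li", "a", "p")):
    """Return a fixed-order vector of tag counts for one page."""
    parser = TagCounter()
    parser.feed(html)
    parser.close()
    return [parser.counts[t] for t in tags]


def gini(labels):
    """Gini impurity of a list of 0/1 labels."""
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)
    return 2.0 * p * (1.0 - p)


def fit_stump(X, y):
    """Fit a depth-1 decision tree: the (feature, threshold) split
    with the lowest weighted Gini impurity, plus a majority label per side."""
    best = None
    for j in range(len(X[0])):
        for thr in sorted({row[j] for row in X}):
            left = [label for row, label in zip(X, y) if row[j] <= thr]
            right = [label for row, label in zip(X, y) if row[j] > thr]
            if not left or not right:
                continue  # a split must separate at least one page
            score = (len(left) * gini(left) + len(right) * gini(right)) / len(y)
            if best is None or score < best[0]:
                left_label = int(sum(left) * 2 >= len(left))
                right_label = int(sum(right) * 2 >= len(right))
                best = (score, j, thr, left_label, right_label)
    if best is None:
        raise ValueError("no valid split found")
    return best[1:]


def predict(stump, row):
    """Classify one feature vector with a fitted stump."""
    j, thr, left_label, right_label = stump
    return left_label if row[j] <= thr else right_label


# Toy training pages (invented for illustration): 1 = data-rich target page.
listing = "<table>" + "<tr><td>item</td><td>price</td></tr>" * 5 + "</table>"
article = "<p>text</p>" * 4 + "<a href='#'>more</a>"
small_listing = "<table>" + "<tr><td>x</td></tr>" * 3 + "</table>"

X = [tag_features(listing), tag_features(article), tag_features(small_listing)]
y = [1, 0, 1]
stump = fit_stump(X, y)

new_page = "<table><tr><td>a</td></tr><tr><td>b</td></tr></table>"
print(predict(stump, tag_features(new_page)))
```

On this toy data the stump splits on the count of `tr` tags, which matches the intuition that pages generated from back-end databases tend to repeat row-like structures. A realistic system would need a richer token vocabulary, a full multi-level tree, and a labeled corpus of pages.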

This work is supported in part by National Natural Science Foundation of China 60520130299.

Editor information

Guozhu Dong, Xuemin Lin, Wei Wang, Yun Yang, Jeffrey Xu Yu

Copyright information

© 2007 Springer Berlin Heidelberg

About this paper

Cite this paper

Lin, L., Zhou, L., Guo, Q., Li, G. (2007). Using Structured Tokens to Identify Webpages for Data Extraction. In: Dong, G., Lin, X., Wang, W., Yang, Y., Yu, J.X. (eds) Advances in Data and Web Management. APWeb WAIM 2007. Lecture Notes in Computer Science, vol 4505. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-72524-4_27

  • DOI: https://doi.org/10.1007/978-3-540-72524-4_27

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-72483-4

  • Online ISBN: 978-3-540-72524-4

  • eBook Packages: Computer Science, Computer Science (R0)
