Abstract
As the web grows, more and more data has become available on webpages, such as product items drawn from back-end databases. Data extraction plays an important role in providing efficient access to the data objects contained in these pages. However, identifying suitable webpages to feed into data extraction is a prerequisite and a non-trivial task. As a result, there is an increasing need for methods that can automatically identify the target pages on unknown websites. In this paper, we address the problem by exploiting structured-token features of the webpage content and applying a decision-tree-based classification algorithm to induce the structure information. Furthermore, a preliminary recognition of data objects is obtained to efficiently initiate the subsequent data extraction. We evaluate our approach on real-world data and achieve promising results.
This work is supported in part by the National Natural Science Foundation of China 60520130299.
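The general idea of classifying pages by structural tokens with a decision tree can be illustrated with a minimal sketch. The abstract does not define the structured-token features or name a specific tree algorithm, so everything below is an assumption made for illustration: the features are plain counts of a few HTML tags (the hypothetical `FEATURE_TAGS`), the tree is a tiny Gini-based binary tree written from scratch, and the toy pages produced by `listing` and `article` are invented stand-ins for real training data.

```python
from collections import Counter
from html.parser import HTMLParser


class TagTokenizer(HTMLParser):
    """Collects the names of opening tags as a flat token sequence."""

    def __init__(self):
        super().__init__()
        self.tokens = []

    def handle_starttag(self, tag, attrs):
        self.tokens.append(tag)


# Hypothetical feature set: raw counts of a few structural tags.
FEATURE_TAGS = ("tr", "td", "li", "a", "img", "p")


def features(html):
    """Map a page to a vector of tag counts (an assumed stand-in feature)."""
    parser = TagTokenizer()
    parser.feed(html)
    counts = Counter(parser.tokens)
    return [counts[tag] for tag in FEATURE_TAGS]


def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def build_tree(rows, labels, depth=0, max_depth=3):
    """Grow a small binary tree; leaves are labels, nodes are tuples."""
    majority = Counter(labels).most_common(1)[0][0]
    if depth == max_depth or len(set(labels)) == 1:
        return majority
    best = None
    for f in range(len(rows[0])):
        for threshold in sorted({r[f] for r in rows}):
            left = [i for i, r in enumerate(rows) if r[f] <= threshold]
            right = [i for i, r in enumerate(rows) if r[f] > threshold]
            if not left or not right:
                continue
            # Weighted impurity of the two children for this split.
            score = (len(left) * gini([labels[i] for i in left])
                     + len(right) * gini([labels[i] for i in right])) / len(rows)
            if best is None or score < best[0]:
                best = (score, f, threshold, left, right)
    if best is None:
        return majority
    _, f, threshold, left, right = best
    return (
        f,
        threshold,
        build_tree([rows[i] for i in left], [labels[i] for i in left],
                   depth + 1, max_depth),
        build_tree([rows[i] for i in right], [labels[i] for i in right],
                   depth + 1, max_depth),
    )


def predict(tree, row):
    """Walk from the root to a leaf and return its label."""
    while isinstance(tree, tuple):
        f, threshold, low, high = tree
        tree = low if row[f] <= threshold else high
    return tree


def listing(n):
    """Toy page with n repeated table rows (invented example data)."""
    row = "<tr><td><a href='#'>item</a></td><td>9.99</td></tr>"
    return "<table>" + row * n + "</table>"


def article(n):
    """Toy page with n paragraphs of running text (invented example data)."""
    return "<h1>title</h1>" + "<p>text</p>" * n


pages = [listing(5), listing(8), article(4), article(7)]
labels = ["data", "data", "other", "other"]
tree = build_tree([features(p) for p in pages], labels)
```

A page is then classified with `predict(tree, features(html))`. Writing the tree by hand keeps the sketch dependency-free; in practice an established decision tree implementation and richer structural features would likely replace both pieces.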
Copyright information
© 2007 Springer Berlin Heidelberg
About this paper
Cite this paper
Lin, L., Zhou, L., Guo, Q., Li, G. (2007). Using Structured Tokens to Identify Webpages for Data Extraction. In: Dong, G., Lin, X., Wang, W., Yang, Y., Yu, J.X. (eds) Advances in Data and Web Management. APWeb WAIM 2007. Lecture Notes in Computer Science, vol 4505. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-72524-4_27
DOI: https://doi.org/10.1007/978-3-540-72524-4_27
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-72483-4
Online ISBN: 978-3-540-72524-4
eBook Packages: Computer Science, Computer Science (R0)