An Effective Method Supporting Data Extraction and Schema Recognition on Deep Web

  • Conference paper
Progress in WWW Research and Development (APWeb 2008)

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 4976)

Abstract

With the rapid development of the Internet, data sources on the deep web store a large amount of high-quality structured data, which calls for structured data extraction methods. However, existing methods focus on data rather than structure, and some of them are difficult to maintain. To resolve these problems, this paper proposes a complete and effective method supporting both data extraction and schema recognition. To extract data, a novel clustering-based algorithm is adopted, which remains effective when faced with complex data and excessive noise. A simple extraction rule model is defined to address the maintenance problem. In addition, the method performs in-depth mining for result schema recognition. Finally, experiments show satisfactory results.

This research is supported by the National Natural Science Foundation of China under Grant Nos. 60673139 and 60573090.

Editor information

Yanchun Zhang, Ge Yu, Elisa Bertino, Guandong Xu

Copyright information

© 2008 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Liu, W., Shen, D., Nie, T. (2008). An Effective Method Supporting Data Extraction and Schema Recognition on Deep Web. In: Zhang, Y., Yu, G., Bertino, E., Xu, G. (eds) Progress in WWW Research and Development. APWeb 2008. Lecture Notes in Computer Science, vol 4976. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-78849-2_42

  • DOI: https://doi.org/10.1007/978-3-540-78849-2_42

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-78848-5

  • Online ISBN: 978-3-540-78849-2

  • eBook Packages: Computer Science, Computer Science (R0)
