An Effective Method Supporting Data Extraction and Schema Recognition on Deep Web

Liu, Wei; Shen, Derong; Nie, Tiezheng

doi:10.1007/978-3-540-78849-2_42

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 4976)

Included in the following conference series: Asia-Pacific Web Conference

Abstract

With the rapid development of the Internet, data sources on the deep web store large amounts of high-quality structured data, which calls for methods to extract structured data. However, existing methods focus on the data rather than its structure, and some of them are difficult to maintain. To resolve these problems, this paper proposes a complete and effective method that supports both data extraction and schema recognition. For data extraction, a novel clustering-based algorithm is adopted, which remains effective when faced with complex data and excessive noise. A simple extraction rule model is defined to address the maintenance problem. In addition, the method performs deep mining for result schema recognition. Finally, experiments show satisfactory results.

This research is supported by the National Natural Science Foundation of China under Grants No. 60673139 and No. 60573090.

Author information

Authors and Affiliations

Department of Computer, Northeastern University, Shenyang, 110004, China
Wei Liu, Derong Shen & Tiezheng Nie

Editor information

Yanchun Zhang, Ge Yu, Elisa Bertino, Guandong Xu

About this paper

Cite this paper

Liu, W., Shen, D., Nie, T. (2008). An Effective Method Supporting Data Extraction and Schema Recognition on Deep Web. In: Zhang, Y., Yu, G., Bertino, E., Xu, G. (eds) Progress in WWW Research and Development. APWeb 2008. Lecture Notes in Computer Science, vol 4976. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-78849-2_42

DOI: https://doi.org/10.1007/978-3-540-78849-2_42
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-78848-5
Online ISBN: 978-3-540-78849-2
eBook Packages: Computer Science, Computer Science (R0)
