Abstract
Web information extraction is the problem of extracting target information items from Web pages. It comprises two general problems: extracting information from natural language text and extracting structured data from Web pages. This chapter focuses on the second, extracting structured data; a program that performs such extraction is usually called a wrapper. Extracting information from text is studied mainly in the natural language processing community.
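To make the notion of a wrapper concrete, the following is a minimal hand-written sketch in Python, not a method taken from the chapter. It assumes a hypothetical page layout in which each data record is an `<li class="item">` element containing `<span class="name">` and `<span class="price">` fields; the class names, the `ProductWrapper` class, and the sample page are all illustrative assumptions.

```python
from html.parser import HTMLParser


class ProductWrapper(HTMLParser):
    """Hypothetical wrapper: turns each <li class="item"> into one record."""

    def __init__(self):
        super().__init__()
        self.records = []    # extracted data records, one dict per item
        self._field = None   # name of the field whose text is being read

    def handle_starttag(self, tag, attrs):
        cls = dict(attrs).get("class")
        if tag == "li" and cls == "item":
            self.records.append({})      # a new data record begins
        elif tag == "span" and cls in ("name", "price") and self.records:
            self._field = cls            # the next text belongs to this field

    def handle_data(self, data):
        if self._field is not None:
            self.records[-1][self._field] = data.strip()

    def handle_endtag(self, tag):
        if tag == "span":
            self._field = None           # stop collecting text for the field


def extract(html):
    """Run the wrapper over one page and return its list of records."""
    wrapper = ProductWrapper()
    wrapper.feed(html)
    wrapper.close()
    return wrapper.records


# Assumed sample page; a real site would use its own template.
PAGE = """
<ul>
  <li class="item"><span class="name">Lamp</span> <span class="price">$12.50</span></li>
  <li class="item"><span class="name">Desk</span> <span class="price">$89.00</span></li>
</ul>
"""

records = extract(PAGE)
```

Here the extraction rules are written by hand for one assumed template. Wrapper generation, the subject named in the chapter title, is concerned with producing such extraction programs rather than coding them manually for every site.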
Copyright information
© 2011 Springer-Verlag Berlin Heidelberg
About this chapter
Cite this chapter
Liu, B. (2011). Structured Data Extraction: Wrapper Generation. In: Web Data Mining. Data-Centric Systems and Applications. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-19460-3_9
DOI: https://doi.org/10.1007/978-3-642-19460-3_9
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-19459-7
Online ISBN: 978-3-642-19460-3
eBook Packages: Computer Science, Computer Science (R0)