An Indent Shape Based Approach for Web Lists Mining

Zhu, Yanxu; Yin, Gang; Wang, Huaimin; Shi, Dianxi; Li, Xiang; Yuan, Lin

doi:10.1007/978-3-642-23982-3_15

Conference paper

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 6988)

Abstract

Mining repeated patterns from HTML documents is a key step in typical Web information extraction applications, which require efficient pattern mining techniques to generate wrappers automatically. Existing approaches such as tree matching and string matching detect repeated patterns with high precision, but their efficiency remains a challenge. In this paper, we present a novel approach to Web list mining based on the indent shape of HTML documents. The indent shape is a simplified abstraction of an HTML document in which tandem repeated waves indicate potential repeated patterns. Scanning a horizontal line along the indent shape identifies the tandem repeated waves efficiently; these reveal the repeated patterns in the document, from which the lists of the target Web page can be extracted. Extensive experiments show that our approach achieves better performance and efficiency than existing approaches.
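
The abstract does not give the algorithm's details, so the following Python sketch is only one possible reading of the idea, not the authors' implementation. It assumes that the indent shape can be approximated by the nesting depth of each start tag in document order, that a "wave" is the stretch of the shape between two consecutive touches of a horizontal line at a fixed depth, and that a list is a run of identical adjacent waves under the same parent. The names `IndentShape`, `sibling_waves`, `tandem_runs`, and `find_lists` are hypothetical, and the input is assumed to be well-formed HTML with explicit closing tags.

```python
from html.parser import HTMLParser

# Void elements have no closing tag, so they must not change the depth.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class IndentShape(HTMLParser):
    """Records (depth, tag) per start tag: an assumed stand-in for the indent shape."""

    def __init__(self):
        super().__init__()
        self.depth = 0
        self.shape = []  # (depth, tag) pairs in document order

    def handle_starttag(self, tag, attrs):
        self.shape.append((self.depth, tag))
        if tag not in VOID_TAGS:
            self.depth += 1

    def handle_endtag(self, tag):
        if tag not in VOID_TAGS and self.depth > 0:
            self.depth -= 1


def sibling_waves(shape, level):
    """Cut the shape with a horizontal line at `level`.

    Returns groups of waves; each group holds the waves that share a parent.
    A wave is a tuple of (depth relative to the line, tag) points.
    """
    groups, waves, current = [], [], None
    for depth, tag in shape:
        if depth <= level and current is not None:
            waves.append(tuple(current))  # the line is touched: close the wave
            current = None
        if depth < level and waves:
            groups.append(waves)          # the parent ended: close the group
            waves = []
        if depth == level:
            current = [(0, tag)]
        elif depth > level and current is not None:
            current.append((depth - level, tag))
    if current is not None:
        waves.append(tuple(current))
    if waves:
        groups.append(waves)
    return groups


def tandem_runs(waves, min_repeats=2):
    """Return (start, count) for each maximal run of identical adjacent waves."""
    runs, i = [], 0
    while i < len(waves):
        j = i
        while j + 1 < len(waves) and waves[j + 1] == waves[i]:
            j += 1
        if j - i + 1 >= min_repeats:
            runs.append((i, j - i + 1))
        i = j + 1
    return runs


def find_lists(html, min_repeats=2):
    """Scan every depth level and report (level, repeated wave, repeat count)."""
    parser = IndentShape()
    parser.feed(html)
    parser.close()
    max_depth = max((d for d, _ in parser.shape), default=-1)
    found = []
    for level in range(max_depth + 1):
        for waves in sibling_waves(parser.shape, level):
            for start, count in tandem_runs(waves, min_repeats):
                found.append((level, waves[start], count))
    return found


page = ("<html><body><h1>t</h1><ul><li><a>1</a></li><li><a>2</a></li>"
        "<li><a>3</a></li></ul><p>x</p></body></html>")
print(find_lists(page))  # [(3, ((0, 'li'), (1, 'a')), 3)]
```

Exact equality between waves is a deliberate simplification; real pages usually need some tolerance for optional child elements, which this sketch does not attempt.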

Author information

Authors and Affiliations

College of Computer Science and Technology, National University of Defense Technology, 410073, Changsha, Hunan, China
Yanxu Zhu, Gang Yin, Huaimin Wang, Dianxi Shi & Xiang Li
College of Electronic Technology, Information Engineering University, 450004, Zhengzhou, Henan, China
Lin Yuan

Editor information

Editors and Affiliations

Department of Computer and Information Science, University of Macau, Av. Padre Tomás Pereira, Taipa, Macau, China
Zhiguo Gong
School of Computer, Shanghai University, 200444, Shanghai, China
Xiangfeng Luo
College of Computer and Software, Taiyuan University of Technology, 030024, Taiyuan, China
Junjie Chen
School of Computer and Information Engineering, Shanghai University of Electric Power, 200090, Shanghai, China
Jingsheng Lei
Department of Business Administration, Caritas Institute of Higher Education, 18 Chui Ling Road, Tseung Kwan O, Hong Kong, China
Fu Lee Wang

About this paper

Cite this paper

Zhu, Y., Yin, G., Wang, H., Shi, D., Li, X., Yuan, L. (2011). An Indent Shape Based Approach for Web Lists Mining. In: Gong, Z., Luo, X., Chen, J., Lei, J., Wang, F.L. (eds) Web Information Systems and Mining. WISM 2011. Lecture Notes in Computer Science, vol 6988. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-23982-3_15

DOI: https://doi.org/10.1007/978-3-642-23982-3_15
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-23981-6
Online ISBN: 978-3-642-23982-3
eBook Packages: Computer Science, Computer Science (R0)
