An Indent Shape Based Approach for Web Lists Mining

  • Conference paper

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 6988)

Abstract

Mining repeated patterns from HTML documents is a key step in typical Web information extraction applications, which need efficient pattern-mining techniques to generate wrappers automatically. Existing approaches such as tree matching and string matching can detect repeated patterns with high precision, but their efficiency remains a challenge. In this paper, we present a novel approach to mining Web lists based on the indent shape of HTML documents. An indent shape is a simplified abstraction of an HTML document in which tandem repeated waves indicate the potential repeated patterns to be detected. By efficiently identifying the tandem repeated waves with a horizontal line scanned along the indent shape, the repeated patterns in a document can be recognized, and the lists of the target Web page can then be extracted from them. Extensive experiments show that our approach achieves better performance and efficiency than existing approaches.
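The abstract does not define the indent shape precisely, so the following Python sketch is only an illustration of the idea, not the authors' algorithm. It assumes the indent shape is the sequence of nesting depths of the start tags in document order, and that a "wave" is the stretch between two consecutive touches of a horizontal scan line at a given depth. The names `IndentShape`, `indent_shape`, `tandem_waves`, and `VOID_TAGS` are hypothetical.

```python
from html.parser import HTMLParser

# Elements without an end tag; they must not increase the nesting depth.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


class IndentShape(HTMLParser):
    """Record (depth, tag) for every start tag (assumed form of the indent shape)."""

    def __init__(self):
        super().__init__()
        self.depth = 0
        self.shape = []

    def handle_starttag(self, tag, attrs):
        self.shape.append((self.depth, tag))
        if tag not in VOID_TAGS:
            self.depth += 1

    def handle_endtag(self, tag):
        if tag not in VOID_TAGS and self.depth > 0:
            self.depth -= 1


def indent_shape(html):
    parser = IndentShape()
    parser.feed(html)
    parser.close()
    return parser.shape


def tandem_waves(shape, level, min_repeats=2):
    """Scan a horizontal line at `level` and report runs of identical adjacent waves.

    Returns a list of (start_index, repeat_count, wave) tuples.
    """
    # Step 1: cut the shape into waves that start on the scan line and
    # continue while the shape stays strictly deeper than the line.
    waves = []
    i, n = 0, len(shape)
    while i < n:
        if shape[i][0] != level:
            i += 1
            continue
        j = i + 1
        while j < n and shape[j][0] > level:
            j += 1
        waves.append((i, j, tuple(shape[i:j])))
        i = j

    # Step 2: group waves that are both contiguous and identical.
    runs, current = [], []
    for start, end, wave in waves:
        if current and current[-1][1] == start and current[-1][2] == wave:
            current.append((start, end, wave))
        else:
            if len(current) >= min_repeats:
                runs.append(current)
            current = [(start, end, wave)]
    if len(current) >= min_repeats:
        runs.append(current)
    return [(run[0][0], len(run), run[0][2]) for run in runs]


page = ("<html><body><h1>Title</h1><ul>"
        "<li><a href='#'>a</a></li>"
        "<li><a href='#'>b</a></li>"
        "<li><a href='#'>c</a></li>"
        "</ul><p>footer</p></body></html>")
shape = indent_shape(page)
print(tandem_waves(shape, level=3))
```

On this sample page, scanning at depth 3 reports one run of three identical `li`/`a` waves, which corresponds to the list items; scanning at depth 2 reports nothing, because `h1`, `ul`, and `p` produce different waves. Exact wave equality is a deliberate simplification: real pages usually need a tolerance for small differences between records, and malformed HTML would need more careful depth tracking.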

Copyright information

© 2011 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Zhu, Y., Yin, G., Wang, H., Shi, D., Li, X., Yuan, L. (2011). An Indent Shape Based Approach for Web Lists Mining. In: Gong, Z., Luo, X., Chen, J., Lei, J., Wang, F.L. (eds) Web Information Systems and Mining. WISM 2011. Lecture Notes in Computer Science, vol 6988. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-23982-3_15

  • DOI: https://doi.org/10.1007/978-3-642-23982-3_15

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-642-23981-6

  • Online ISBN: 978-3-642-23982-3

  • eBook Packages: Computer Science, Computer Science (R0)