
SEDE: A Schema Explorer and Data Extractor for HTML Web Pages

  • Xubin Deng
Conference paper
Part of the Communications in Computer and Information Science book series (CCIS, volume 236)

Abstract

We present an approach for automatically exploring a relation schema and extracting data from HTML pages. By abstracting the DOM-tree constructed from an HTML page into a set of generalized lists, the approach automatically generates a relation schema for storing the data extracted from the page. Based on this approach, we have developed a software system named SEDE (Schema Explorer and Data Extractor for HTML pages), which reduces the workload of extracting and storing data objects found in HTML pages. This paper mainly introduces SEDE.
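The abstract does not spell out the algorithm, so the following is only a rough illustration of the general idea, not SEDE's actual method. It assumes well-formed HTML without void elements such as `<br>`, treats the first node whose children all share one structural signature as a "list" of records, and derives a flat relation schema from it. All names here (`TreeBuilder`, `signature`, `find_record_list`, `extract`, the table name `extracted`, and the `colN` column names) are hypothetical.

```python
from html.parser import HTMLParser


class Node:
    """A minimal DOM-like node: tag name, children, and direct text."""

    def __init__(self, tag, parent=None):
        self.tag = tag
        self.parent = parent
        self.children = []
        self.text = ""


class TreeBuilder(HTMLParser):
    """Builds a simple tree; assumes every start tag has an end tag."""

    def __init__(self):
        super().__init__()
        self.root = Node("#root")
        self.current = self.root

    def handle_starttag(self, tag, attrs):
        node = Node(tag, self.current)
        self.current.children.append(node)
        self.current = node

    def handle_endtag(self, tag):
        if self.current.parent is not None:
            self.current = self.current.parent

    def handle_data(self, data):
        self.current.text += data.strip()


def signature(node):
    """Structural abstraction of a subtree: tags only, text ignored."""
    return (node.tag, tuple(signature(c) for c in node.children))


def leaf_texts(node):
    """Text of the leaf nodes below `node`, in document order."""
    if not node.children:
        return [node.text]
    texts = []
    for child in node.children:
        texts.extend(leaf_texts(child))
    return texts


def find_record_list(node):
    """First node (pre-order) with two or more structurally identical children."""
    kids = node.children
    if len(kids) >= 2 and len({signature(k) for k in kids}) == 1:
        return node
    for kid in kids:
        found = find_record_list(kid)
        if found is not None:
            return found
    return None


def extract(html):
    """Return (CREATE TABLE statement, rows), or (None, []) if no list is found."""
    builder = TreeBuilder()
    builder.feed(html)
    builder.close()
    container = find_record_list(builder.root)
    if container is None:
        return None, []
    rows = [tuple(leaf_texts(child)) for child in container.children]
    columns = [f"col{i + 1}" for i in range(len(rows[0]))]
    schema = "CREATE TABLE extracted (" + ", ".join(f"{c} TEXT" for c in columns) + ")"
    return schema, rows
```

Calling `extract` on a small table with two identical rows yields one generated column per cell position and one tuple per row. A real system would also need to handle irregular records, nested lists, and meaningful column names, which this sketch leaves out.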

Keywords

DOM-tree abstraction, HTML page, relational database, relation schema



Copyright information

© Springer-Verlag Berlin Heidelberg 2011

Authors and Affiliations

  • Xubin Deng
    • 1
  1. School of Information, Zhejiang University of Finance & Economics, Hangzhou, China
