HTML-LSTM: Information Extraction from HTML Tables in Web Pages Using Tree-Structured LSTM

  • Conference paper
Discovery Science (DS 2021)

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 12986)

Abstract

In this paper, we propose a novel method for extracting information from HTML tables that have similar contents but different structures. We aim to integrate multiple HTML tables into a single table so that information contained in various Web pages can be retrieved. The method extends tree-structured LSTM, a neural network for tree-structured data, to extract both the linguistic and the structural information of HTML data. We evaluate the proposed method through experiments using real data published on the WWW.
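To make the idea of running an LSTM over an HTML tree concrete, the following is a minimal sketch of the standard Child-Sum Tree-LSTM (Tai et al., 2015) applied to a parsed table. It is not the HTML-LSTM proposed in the paper, whose extensions are not described in this abstract. The names `Node`, `TreeBuilder`, `ChildSumTreeLSTM`, and `encode` are hypothetical, the weights are random and untrained, and the HTML is assumed to be well formed.

```python
import math
import random
from html.parser import HTMLParser

DIM = 4  # hidden size, chosen arbitrarily for this illustration


class Node:
    """One HTML element or text chunk in the parsed tree."""
    def __init__(self, label):
        self.label = label
        self.children = []


class TreeBuilder(HTMLParser):
    """Builds a Node tree; assumes every opened tag is explicitly closed."""
    def __init__(self):
        super().__init__()
        self.root = Node("root")
        self.stack = [self.root]

    def handle_starttag(self, tag, attrs):
        node = Node(tag)
        self.stack[-1].children.append(node)
        self.stack.append(node)

    def handle_endtag(self, tag):
        if len(self.stack) > 1:
            self.stack.pop()

    def handle_data(self, data):
        text = data.strip()
        if text:  # skip whitespace-only text between tags
            self.stack[-1].children.append(Node(text))


def vsigmoid(v): return [1.0 / (1.0 + math.exp(-a)) for a in v]
def vtanh(v): return [math.tanh(a) for a in v]
def matvec(m, v): return [sum(w * a for w, a in zip(row, v)) for row in m]
def add(*vs): return [sum(t) for t in zip(*vs)]
def mul(a, b): return [x * y for x, y in zip(a, b)]


class ChildSumTreeLSTM:
    """Child-Sum Tree-LSTM cell with random, untrained parameters."""
    def __init__(self, dim=DIM, seed=0):
        self.dim = dim
        self.rng = random.Random(seed)
        def mat():
            return [[self.rng.uniform(-0.5, 0.5) for _ in range(dim)]
                    for _ in range(dim)]
        self.W = {g: mat() for g in "ifou"}  # input weights per gate
        self.U = {g: mat() for g in "ifou"}  # recurrent weights per gate
        self.b = {g: [0.0] * dim for g in "ifou"}
        self.emb = {}  # label -> random embedding (placeholder for real features)

    def embed(self, label):
        if label not in self.emb:
            self.emb[label] = [self.rng.uniform(-0.5, 0.5) for _ in range(self.dim)]
        return self.emb[label]

    def gate(self, g, x, h, act):
        return act(add(matvec(self.W[g], x), matvec(self.U[g], h), self.b[g]))

    def forward(self, node):
        """Returns (h, c) for a node, computed bottom-up from its children."""
        states = [self.forward(child) for child in node.children]
        x = self.embed(node.label)
        h_sum = add([0.0] * self.dim, *[h for h, _ in states])
        i = self.gate("i", x, h_sum, vsigmoid)
        o = self.gate("o", x, h_sum, vsigmoid)
        u = self.gate("u", x, h_sum, vtanh)
        c = mul(i, u)
        for h_k, c_k in states:  # one forget gate per child
            f_k = self.gate("f", x, h_k, vsigmoid)
            c = add(c, mul(f_k, c_k))
        return mul(o, vtanh(c)), c


def encode(html, model):
    builder = TreeBuilder()
    builder.feed(html)
    builder.close()
    return model.forward(builder.root)


# Two tables with the same content but a different row structure.
html_a = "<table><tr><th>Name</th><td>Alice</td></tr></table>"
html_b = "<table><tr><th>Name</th></tr><tr><td>Alice</td></tr></table>"
model = ChildSumTreeLSTM()
h_a, _ = encode(html_a, model)
h_b, _ = encode(html_b, model)
```

Here `h_a` and `h_b` are fixed-size vectors summarizing each table. Because the hidden state of every node depends on both its own label and the states of its children, the encoding reflects text content and tag structure together, which is the property the abstract attributes to tree-structured LSTM.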

K. Kawamura—Now at Sony Group Corporation.

Notes

  1. https://mocobeta.github.io/janome/.

Corresponding author

Correspondence to Kazuki Kawamura.

Copyright information

© 2021 Springer Nature Switzerland AG

About this paper

Cite this paper

Kawamura, K., Yamamoto, A. (2021). HTML-LSTM: Information Extraction from HTML Tables in Web Pages Using Tree-Structured LSTM. In: Soares, C., Torgo, L. (eds) Discovery Science. DS 2021. Lecture Notes in Computer Science, vol 12986. Springer, Cham. https://doi.org/10.1007/978-3-030-88942-5_3

  • DOI: https://doi.org/10.1007/978-3-030-88942-5_3

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-88941-8

  • Online ISBN: 978-3-030-88942-5

  • eBook Packages: Computer Science, Computer Science (R0)
