HTML-LSTM: Information Extraction from HTML Tables in Web Pages Using Tree-Structured LSTM

Kawamura, Kazuki; Yamamoto, Akihiro

doi:10.1007/978-3-030-88942-5_3

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 12986)

Included in the following conference series:

International Conference on Discovery Science

Abstract

In this paper, we propose a novel method for extracting information from HTML tables that have similar contents but different structures. We aim to integrate multiple HTML tables into a single table so that information contained in various Web pages can be retrieved. The method extends tree-structured LSTM, a neural network for tree-structured data, in order to extract both the linguistic and the structural information of HTML data. We evaluate the proposed method through experiments using real data published on the WWW.
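
As an illustration of the setting only (not the authors' implementation), the sketch below parses two HTML tables that carry the same content in different layouts into simple trees, which is the kind of tree-structured input a tree-structured LSTM consumes. The `Node` class, the helper functions, and the two sample tables are hypothetical and assume well-formed HTML in which every tag is explicitly closed.

```python
from html.parser import HTMLParser


class Node:
    """One element of the HTML tree: tag name, direct text, child elements."""

    def __init__(self, tag):
        self.tag = tag
        self.text = ""
        self.children = []


class TableTreeBuilder(HTMLParser):
    """Builds a Node tree; assumes every opened tag is explicitly closed."""

    def __init__(self):
        super().__init__()
        self.root = Node("root")
        self._stack = [self.root]

    def handle_starttag(self, tag, attrs):
        node = Node(tag)
        self._stack[-1].children.append(node)
        self._stack.append(node)

    def handle_endtag(self, tag):
        # With well-formed input the closing tag matches the top of the stack.
        if len(self._stack) > 1 and self._stack[-1].tag == tag:
            self._stack.pop()

    def handle_data(self, data):
        self._stack[-1].text += data.strip()


def parse_table(html):
    """Parse an HTML string and return the root of its Node tree."""
    builder = TableTreeBuilder()
    builder.feed(html)
    builder.close()
    return builder.root


def leaf_texts(node):
    """Texts of leaf elements, collected depth-first in document order."""
    if not node.children:
        return [node.text] if node.text else []
    texts = []
    for child in node.children:
        texts.extend(leaf_texts(child))
    return texts


def shape(node):
    """Nested tag structure of the tree, ignoring all text."""
    return (node.tag, tuple(shape(child) for child in node.children))


# Hypothetical sample tables: same content, headers laid out differently.
ROW_LAYOUT = (
    "<table><tr><th>Name</th><td>Kyoto</td></tr>"
    "<tr><th>Tel</th><td>075</td></tr></table>"
)
COL_LAYOUT = (
    "<table><tr><th>Name</th><th>Tel</th></tr>"
    "<tr><td>Kyoto</td><td>075</td></tr></table>"
)

row_tree = parse_table(ROW_LAYOUT)
col_tree = parse_table(COL_LAYOUT)
```

Here `leaf_texts` yields the same set of cell texts for both trees while `shape` differs, which is the situation the abstract describes: a model has to use both the text in the cells and the position of each cell in the tree to align such tables.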

K. Kawamura—Now at Sony Group Corporation.

Notes

1. https://mocobeta.github.io/janome/

Author information

Authors and Affiliations

Graduate School of Informatics, Kyoto University, Yoshida-Honmachi, Sakyo-ku, Kyoto, 606-8501, Japan
Kazuki Kawamura & Akihiro Yamamoto

Corresponding author

Correspondence to Kazuki Kawamura.

Editor information

Editors and Affiliations

Universidade do Porto and Fraunhofer Portugal AICOS, Porto, Portugal
Carlos Soares
Dalhousie University, Halifax, NS, Canada
Luis Torgo

About this paper

Cite this paper

Kawamura, K., Yamamoto, A. (2021). HTML-LSTM: Information Extraction from HTML Tables in Web Pages Using Tree-Structured LSTM. In: Soares, C., Torgo, L. (eds) Discovery Science. DS 2021. Lecture Notes in Computer Science, vol 12986. Springer, Cham. https://doi.org/10.1007/978-3-030-88942-5_3

DOI: https://doi.org/10.1007/978-3-030-88942-5_3
Published: 09 October 2021
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-88941-8
Online ISBN: 978-3-030-88942-5
eBook Packages: Computer Science, Computer Science (R0)
