DOM Semantic Expansion-Based Extraction of Topical Information from Web Pages

Chen, Junjie; Jia, Junyao; Duan, Liguo

doi:10.1007/978-3-642-23982-3_42

Conference paper

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 6988)

Abstract

Web pages usually contain a great deal of irrelevant information that users do not need, so effective methods are required to extract the relevant content from this clutter. Exploiting the semi-structured nature of HTML, theme-relevant information can be extracted from a web page by semantically pruning its DOM representation, using features of the page structure together with a fuzzy classification of keywords.
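The general idea of pruning a DOM tree by keyword relevance can be illustrated with a minimal sketch. This is not the authors' algorithm, which is not available in this preview. The keyword memberships in `TOPIC_KEYWORDS`, the mean-membership scoring rule, the `threshold` value, and every class and function name below are assumptions made for illustration only. The sketch uses only Python's standard `html.parser` module and assumes well-nested HTML.

```python
from html.parser import HTMLParser

# Hypothetical fuzzy keyword memberships in [0, 1]; not taken from the paper.
TOPIC_KEYWORDS = {"dom": 1.0, "extraction": 0.8, "semantic": 0.6}
# Elements that never have a closing tag, so they must not become "current".
VOID_TAGS = {"br", "img", "hr", "input", "meta", "link"}


class Node:
    """A minimal DOM-like node: tag name, parent, children, own text."""

    def __init__(self, tag, parent=None):
        self.tag = tag
        self.parent = parent
        self.children = []
        self.text = ""


class TreeBuilder(HTMLParser):
    """Builds a simple tree from well-nested HTML."""

    def __init__(self):
        super().__init__()
        self.root = Node("root")
        self.current = self.root

    def handle_starttag(self, tag, attrs):
        node = Node(tag, self.current)
        self.current.children.append(node)
        if tag not in VOID_TAGS:
            self.current = node

    def handle_endtag(self, tag):
        # Only move up when the closing tag matches the open node.
        if self.current.tag == tag and self.current.parent is not None:
            self.current = self.current.parent

    def handle_data(self, data):
        self.current.text += data


def relevance(node):
    """Mean keyword membership over the words directly inside this node."""
    words = node.text.lower().split()
    if not words:
        return 0.0
    total = sum(TOPIC_KEYWORDS.get(w.strip(".,;:!?"), 0.0) for w in words)
    return total / len(words)


def prune(node, threshold):
    """Drop subtrees with no relevant node; return True if this node stays."""
    node.children = [c for c in node.children if prune(c, threshold)]
    return bool(node.children) or relevance(node) >= threshold


def collect_text(node):
    """Gather the non-empty text of the surviving nodes in document order."""
    parts = [node.text.strip()] if node.text.strip() else []
    for child in node.children:
        parts.extend(collect_text(child))
    return parts


def extract_topical_text(html, threshold=0.2):
    builder = TreeBuilder()
    builder.feed(html)
    builder.close()
    prune(builder.root, threshold)
    return collect_text(builder.root)
```

In this sketch a node survives if its own text, or that of any descendant, scores at or above the threshold, so a navigation block with no topical words is removed as a whole while a content block keeps only its relevant paragraphs. A surviving container's own text is kept even when that text is not itself relevant, which a real system would likely need to refine.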

Author information

Authors and Affiliations

Computer Science & Technology College, Taiyuan University of Technology, Taiyuan, China
Junjie Chen, Junyao Jia & Liguo Duan

Editor information

Editors and Affiliations

Department of Computer and Information Science, University of Macau, Av. Padre Tomás Pereira, Taipa, Macau, China
Zhiguo Gong
School of Computer, Shanghai University, 200444, Shanghai, China
Xiangfeng Luo
College of Computer and Software, Taiyuan University of Technology, 030024, Taiyuan, China
Junjie Chen
School of Computer and Information Engineering, Shanghai University of Electric Power, 200090, Shanghai, China
Jingsheng Lei
Department of Business Administration, Caritas Institute of Higher Education, 18 Chui Ling Road, Tseung Kwan O, Hong Kong, China
Fu Lee Wang

About this paper

Cite this paper

Chen, J., Jia, J., Duan, L. (2011). DOM Semantic Expansion-Based Extraction of Topical Information from Web Pages. In: Gong, Z., Luo, X., Chen, J., Lei, J., Wang, F.L. (eds) Web Information Systems and Mining. WISM 2011. Lecture Notes in Computer Science, vol 6988. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-23982-3_42

DOI: https://doi.org/10.1007/978-3-642-23982-3_42
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-23981-6
Online ISBN: 978-3-642-23982-3
eBook Packages: Computer Science, Computer Science (R0)
