Abstract
Web pages usually contain a great deal of information that is irrelevant to what users need, so effective methods are required to extract the relevant content from this clutter. Because HTML is semi-structured, topic-relevant information in a web page can be extracted by semantic pruning of its DOM representation, combined with structural features of the page and fuzzy classification of keywords.
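The abstract gives no algorithmic detail, so the following is only a minimal sketch of the general idea (build a DOM-like tree, score each block by keyword relevance, prune low-scoring subtrees), not the authors' method. Everything in it is an assumption made for illustration: the `Node` and `TreeBuilder` classes, the `BLOCK_TAGS` list, the weighted-keyword `membership` function standing in for fuzzy classification, and the 0.5 threshold.

```python
import re
from html.parser import HTMLParser

# Assumed tag sets for this sketch; not taken from the paper.
VOID_TAGS = {"br", "img", "hr", "input", "meta", "link"}
SKIP_TAGS = {"script", "style"}
BLOCK_TAGS = {"div", "p", "ul", "ol", "li", "table", "tr", "td",
              "h1", "h2", "h3", "nav", "header", "footer", "aside"}


class Node:
    """A tree node; text runs are stored as child nodes tagged '#text'."""

    def __init__(self, tag, parent=None, text=""):
        self.tag, self.parent, self.text = tag, parent, text
        self.children = []

    def full_text(self):
        if self.tag == "#text":
            return self.text
        return " ".join(child.full_text() for child in self.children)


class TreeBuilder(HTMLParser):
    """Builds a simplified DOM-like tree from HTML."""

    def __init__(self):
        super().__init__()
        self.root = Node("root")
        self.current = self.root

    def handle_starttag(self, tag, attrs):
        node = Node(tag, self.current)
        self.current.children.append(node)
        if tag not in VOID_TAGS:
            self.current = node

    def handle_endtag(self, tag):
        # Close the nearest matching open element; ignore stray end tags.
        node = self.current
        while node is not self.root and node.tag != tag:
            node = node.parent
        if node is not self.root:
            self.current = node.parent

    def handle_data(self, data):
        if self.current.tag not in SKIP_TAGS:
            self.current.children.append(Node("#text", self.current, data))


def membership(text, keywords):
    """Graded relevance in [0, 1]: weighted share of topic keywords present."""
    tokens = set(re.findall(r"\w+", text.lower()))
    total = sum(keywords.values())
    if total == 0:
        return 0.0
    return sum(w for kw, w in keywords.items() if kw in tokens) / total


def prune(node, keywords, threshold):
    """Drop block-level subtrees whose membership falls below the threshold."""
    kept = []
    for child in node.children:
        if (child.tag in BLOCK_TAGS
                and membership(child.full_text(), keywords) < threshold):
            continue
        prune(child, keywords, threshold)
        kept.append(child)
    node.children = kept


def extract(html, keywords, threshold=0.5):
    builder = TreeBuilder()
    builder.feed(html)
    builder.close()
    prune(builder.root, keywords, threshold)
    return " ".join(builder.root.full_text().split())


page = (
    '<html><body>'
    '<div id="nav"><a href="/">Home</a> <a href="/login">Log in</a></div>'
    '<div id="main"><h1>DOM tree pruning</h1>'
    '<p>Pruning the DOM tree keeps <b>topical</b> text.</p></div>'
    '<div id="ad"><p>Buy cheap watches today</p></div>'
    '</body></html>'
)
keywords = {"dom": 1.0, "pruning": 1.0, "tree": 0.5}  # hypothetical topic weights
result = extract(page, keywords)
print(result)
```

Only block-level elements are tested in this sketch, so inline markup such as `<b>` inside a retained paragraph is never removed on its own. A real system would need a richer relevance measure and structural features than this keyword share.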
Copyright information
© 2011 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Chen, J., Jia, J., Duan, L. (2011). DOM Semantic Expansion-Based Extraction of Topical Information from Web Pages. In: Gong, Z., Luo, X., Chen, J., Lei, J., Wang, F.L. (eds) Web Information Systems and Mining. WISM 2011. Lecture Notes in Computer Science, vol 6988. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-23982-3_42
DOI: https://doi.org/10.1007/978-3-642-23982-3_42
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-23981-6
Online ISBN: 978-3-642-23982-3
eBook Packages: Computer Science, Computer Science (R0)