Juicer: Scalable Extraction for Thread Meta-information of Web Forum

Guo, Yan; Wang, Yu; Ding, Guodong; Cao, Donglin; Zhang, Gang; Lv, Yi

doi:10.1007/978-3-642-01393-5_15

Part of the book series: Lecture Notes in Computer Science (LNSC, volume 5477)

Included in the following conference series:

Pacific-Asia Workshop on Intelligence and Security Informatics

Abstract

In Web forums, the thread meta-information contained in the list of threads on a board page provides fundamental data for further forum mining. This paper describes a complete system named Juicer, which was developed as a subsystem of an industrial application that involves forum mining. Juicer's task is to extract thread meta-information from the board pages of a great many large-scale online Web forums, which requires scalable extraction with high accuracy, high speed, and minimal user effort for maintenance. Among the many existing approaches to information extraction, we could not find one that fully satisfies these requirements, so we present the simple, scalable extraction approach behind Juicer. Juicer consists of four modules: template generation, labeling-setting specification, automatic extraction, and label assignment. Both experiments and practical use show that Juicer satisfies the requirements.

This work is partially supported by the National Grand Fundamental Research 973 Program of China under Grant 2004CB318109, and the National High Technology Development 863 Program of China under Grant 2007AA01Z147.

Author information

Authors and Affiliations

Institute of Computing Technology, Chinese Academy of Sciences, China
Yan Guo, Yu Wang, Guodong Ding, Donglin Cao & Gang Zhang
State Key Laboratory of Computer Science, Institute of Software, Chinese Academy of Sciences, China
Yi Lv

Editor information

Editors and Affiliations

The University of Arizona, Tucson, AZ, USA
Hsinchun Chen
Drexel University, Philadelphia, PA, USA
Christopher C. Yang
The University of Hong Kong, Hong Kong, China
Michael Chau
National Taiwan University, Taipei, Taiwan, R.O.C.
Shu-Hsing Li

About this paper

Cite this paper

Guo, Y., Wang, Y., Ding, G., Cao, D., Zhang, G., Lv, Y. (2009). Juicer: Scalable Extraction for Thread Meta-information of Web Forum. In: Chen, H., Yang, C.C., Chau, M., Li, SH. (eds) Intelligence and Security Informatics. PAISI 2009. Lecture Notes in Computer Science, vol 5477. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-01393-5_15

DOI: https://doi.org/10.1007/978-3-642-01393-5_15
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-01392-8
Online ISBN: 978-3-642-01393-5
eBook Packages: Computer Science, Computer Science (R0)
