Effective and Efficient Web Reviews Extraction Based on Hadoop

Wan, Jian; Yan, Jiawei; Jiang, Congfeng; Zhou, Li; Ren, Zujie; Ren, Yongjian

doi:10.1007/978-3-642-37804-1_12


Part of the book series: Lecture Notes in Computer Science (LNPSE, volume 7759)

Included in the following conference series:

International Conference on Service-Oriented Computing

Abstract

The rapid development of Web 2.0 has led to a proliferation of web reviews. Traditional methods for extracting web review data perform poorly on massive data sets. To address this problem, we propose an effective and efficient Hadoop-based approach to extracting web reviews. It overcomes the inefficiency of processing large-scale data and delivers both accuracy and efficiency when extracting from massive data sets. The approach consists of two components: a review record extraction algorithm based on node similarity, and a review content extraction algorithm based on text depth. We design a Hadoop-based system for automatic web review extraction and test it on massive sets of web review pages. The experimental results show that the system achieves an accuracy of more than 96% and a higher speedup than traditional web extraction.
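The abstract does not define the node similarity measure, so the following is only an illustrative sketch of the general idea behind record extraction by node similarity: adjacent sibling nodes in the DOM tree whose subtrees look structurally alike are grouped as candidate review records. The similarity measure (a ratio over pre-order tag sequences), the 0.8 threshold, the sample page, and all function names are assumptions made for this example and are not taken from the paper. The Hadoop layer and the text-depth content extraction step are not shown.

```python
from difflib import SequenceMatcher
import xml.etree.ElementTree as ET


def tag_sequence(node):
    """Pre-order list of tag names in the subtree rooted at node."""
    return [el.tag for el in node.iter()]


def node_similarity(a, b):
    """Similarity in [0, 1] between two subtrees (assumed measure: tag-sequence ratio)."""
    return SequenceMatcher(None, tag_sequence(a), tag_sequence(b)).ratio()


def find_review_records(root, threshold=0.8, min_records=2):
    """Return the longest run of adjacent, structurally similar siblings in the tree."""
    best = []
    for parent in root.iter():
        run = []
        for child in parent:
            if run and node_similarity(run[-1], child) >= threshold:
                run.append(child)  # child continues the current run of similar siblings
            else:
                if len(run) > len(best):
                    best = run
                run = [child]  # start a new run at this child
        if len(run) > len(best):
            best = run
    return best if len(best) >= min_records else []


# Hypothetical, well-formed sample page with three review records.
PAGE = """<html><body>
<div><h1>Product</h1></div>
<ul>
<li><span>alice</span><p>Great phone.</p></li>
<li><span>bob</span><p>Battery is weak.</p></li>
<li><span>carol</span><p>Good value.</p></li>
</ul>
<div><a>next</a></div>
</body></html>"""

records = find_review_records(ET.fromstring(PAGE))
review_texts = [record.find("p").text for record in records]
print(review_texts)
```

In a Hadoop setting, a function like `find_review_records` would plausibly run inside a map task, one page per input record, but that mapping is likewise an assumption rather than a description of the authors' system.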



Author information

Authors and Affiliations

School of Computer Science and Technology, Hangzhou Dianzi University, Hangzhou, 310018, China
Jian Wan, Jiawei Yan, Congfeng Jiang, Li Zhou, Zujie Ren & Yongjian Ren


Editor information

Editors and Affiliations

School of Computer Science and Software Engg., University of Wollongong, 2522, Wollongong, NSW, Australia
Aditya Ghose
Software Engineering Institute, East China Normal University, 3663 Zhongshan Road (North), 200062, Shanghai, China
Huibiao Zhu
College of Computing and Information Sciences, Rochester Institute of Technology, 1 Lomb Memorial Drive, 14623, Rochester, NY, USA
Qi Yu
Informatics and Telecommunications, Univ. Campus, University of Athens, 15784, Athens, Greece
Alex Delis
School of Computer Science, University of Adelaide, 5005, Adelaide, South Australia, Australia
Quang Z. Sheng
Lorraine University/LORIA, Campus Scientifique, Nancy 2 University, BP 239, 54506, Vandoeuvre-les-Nancy Cedex, France
Olivier Perrin
School of Software, Tsinghua University, Haidian District, 100084, Beijing, China
Jianmin Wang
Shandong University, School of Computer Science and Technology, Jinan, China
Yan Wang

About this paper

Cite this paper

Wan, J., Yan, J., Jiang, C., Zhou, L., Ren, Z., Ren, Y. (2013). Effective and Efficient Web Reviews Extraction Based on Hadoop. In: Ghose, A., et al. Service-Oriented Computing - ICSOC 2012 Workshops. ICSOC 2012. Lecture Notes in Computer Science, vol 7759. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-37804-1_12

DOI: https://doi.org/10.1007/978-3-642-37804-1_12
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-37803-4
Online ISBN: 978-3-642-37804-1
eBook Packages: Computer Science, Computer Science (R0)
