Effective and Efficient Web Reviews Extraction Based on Hadoop

  • Jian Wan
  • Jiawei Yan
  • Congfeng Jiang
  • Li Zhou
  • Zujie Ren
  • Yongjian Ren
Part of the Lecture Notes in Computer Science book series (LNCS, volume 7759)


The rapid development of Web 2.0 has led to a proliferation of web reviews. Traditional web review extraction methods perform poorly on massive data sets. To address this problem, we propose an effective and efficient approach to extracting web reviews based on Hadoop. It overcomes the inefficiency of processing large-scale data while preserving extraction accuracy on massive data sets. The proposed approach consists of two components: a review record extraction algorithm based on node similarity, and a review content extraction algorithm based on text depth. We design a Hadoop-based system for the automatic extraction of web reviews and test it on massive sets of web review pages. The experimental results show that the system achieves an accuracy of more than 96% and obtains a higher speedup than traditional web extraction.
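
The abstract does not specify how node similarity is computed, so the following is only a minimal sketch of the general idea and not the authors' algorithm. It assumes that review records appear as adjacent sibling subtrees with near-identical tag structure, and it compares subtrees by their pre-order tag sequences. The function names, the 0.8 threshold, and the sample page are all hypothetical and chosen for illustration.

```python
from difflib import SequenceMatcher
import xml.etree.ElementTree as ET


def tag_sequence(node):
    """Flatten a subtree into its pre-order list of tag names."""
    return [el.tag for el in node.iter()]


def node_similarity(a, b):
    """Similarity in [0, 1] between two subtrees, compared by tag sequence.

    This measure is an assumption; the paper may define similarity differently.
    """
    return SequenceMatcher(None, tag_sequence(a), tag_sequence(b)).ratio()


def candidate_records(parent, threshold=0.8):
    """Return the longest run of adjacent children with similar structure."""
    best, run = [], []
    for child in parent:
        if run and node_similarity(run[-1], child) >= threshold:
            run.append(child)   # child continues the current run
        else:
            run = [child]       # child starts a new run
        if len(run) > len(best):
            best = list(run)
    # A single node is not a repeated pattern, so report nothing.
    return best if len(best) > 1 else []


# Hypothetical, well-formed sample page (real pages need an HTML parser).
PAGE = """
<div>
  <h1>Product</h1>
  <div><span>alice</span><p>Great battery life.</p></div>
  <div><span>bob</span><p>Screen scratches easily.</p></div>
  <div><span>carol</span><p>Good value.</p></div>
  <footer><a>About</a></footer>
</div>
"""

root = ET.fromstring(PAGE)
records = candidate_records(root)
texts = [rec.find("p").text for rec in records]
print(texts)
```

In a Hadoop setting, a function like `candidate_records` could plausibly run inside a map task, one page per input record, but the abstract gives no detail on how the authors partition the work.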


Keywords: web reviews, information extraction, massive data, cloud computing, Hadoop



Copyright information

© Springer-Verlag Berlin Heidelberg 2013

Authors and Affiliations

  • Jian Wan (1)
  • Jiawei Yan (1)
  • Congfeng Jiang (1)
  • Li Zhou (1)
  • Zujie Ren (1)
  • Yongjian Ren (1)
  1. School of Computer Science and Technology, Hangzhou Dianzi University, Hangzhou, China
