Abstract
The rapid development of Web 2.0 has led to a proliferation of web reviews. Traditional methods for extracting web review data perform poorly on massive data sets. To address this problem, we propose an effective and efficient Hadoop-based approach to web review extraction that maintains both accuracy and efficiency on large-scale data. The approach consists of two components: a review record extraction algorithm based on node similarity, and a review content extraction algorithm based on text depth. Building on these algorithms, we design a Hadoop-based system for automatic web review extraction. Finally, we evaluate the system on a massive set of web review pages. The experimental results show that the system achieves an accuracy of more than 96% and a higher speedup than traditional web extraction.
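As a rough illustration of what record extraction by node similarity could look like, the sketch below groups sibling DOM nodes whose subtree structure is alike and emits their text from a mapper-style function. It is not the paper's algorithm: the Jaccard measure over tag paths, the 0.8 threshold, and the names `tag_paths`, `similarity`, `find_records`, and `map_page` are all assumptions made for this example, and the input is assumed to be well-formed XHTML so that the standard-library XML parser can read it. The text-depth step is not sketched, since the abstract does not define it.

```python
import xml.etree.ElementTree as ET


def tag_paths(node, prefix=""):
    """Collect the set of tag paths under a node as a structural signature."""
    path = f"{prefix}/{node.tag}"
    paths = {path}
    for child in node:
        paths |= tag_paths(child, path)
    return paths


def similarity(a, b):
    """Jaccard similarity of two nodes' tag-path sets (an assumed measure)."""
    pa, pb = tag_paths(a), tag_paths(b)
    return len(pa & pb) / len(pa | pb)


def find_records(root, threshold=0.8, min_records=2):
    """Return the largest group of children that resemble an adjacent sibling."""
    best = []
    for parent in root.iter():
        children = list(parent)
        if len(children) < min_records:
            continue
        # Keep each child that is similar to its previous or next sibling.
        similar = [
            c for i, c in enumerate(children)
            if (i > 0 and similarity(children[i - 1], c) >= threshold)
            or (i + 1 < len(children)
                and similarity(c, children[i + 1]) >= threshold)
        ]
        if len(similar) > len(best):
            best = similar
    return best


def map_page(url, html):
    """Mapper-style function: yield (url, review_text) pairs for one page."""
    root = ET.fromstring(html)  # assumes well-formed XHTML input
    for record in find_records(root):
        text = " ".join(t.strip() for t in record.itertext() if t.strip())
        yield url, text


# Hypothetical page: one navigation block and three structurally alike reviews.
PAGE = """<html><body>
<div id="nav"><a>Home</a></div>
<ul>
  <li><span>alice</span><p>Great phone.</p></li>
  <li><span>bob</span><p>Battery is weak.</p></li>
  <li><span>carol</span><p>Good value.</p></li>
</ul>
</body></html>"""

results = list(map_page("http://example.com/p1", PAGE))
for url, review in results:
    print(url, review, sep="\t")
```

Because each page is processed independently, a function shaped like `map_page` could in principle be run as the map step of a Hadoop job over a large page set, which is the kind of parallelism the abstract attributes its speedup to.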
Copyright information
© 2013 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Wan, J., Yan, J., Jiang, C., Zhou, L., Ren, Z., Ren, Y. (2013). Effective and Efficient Web Reviews Extraction Based on Hadoop. In: Ghose, A., et al. Service-Oriented Computing - ICSOC 2012 Workshops. ICSOC 2012. Lecture Notes in Computer Science, vol 7759. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-37804-1_12
DOI: https://doi.org/10.1007/978-3-642-37804-1_12
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-37803-4
Online ISBN: 978-3-642-37804-1
eBook Packages: Computer Science, Computer Science (R0)