Abstract
Many real-world data sets are not only large in volume but also heterogeneous and rapidly generated. Such data, known as big data, typically cannot be analyzed with traditional software tools and techniques. Although the open-source project Apache Hadoop has been successfully developed and used for handling big data, the complexity of its setup and configuration, together with the need to learn additional related tools, has hindered non-technical researchers and educators from entering the area of big data analytics. To support the big data community, this paper describes the procedures followed and the experience gained in building a big data analytics framework, and demonstrates its use on a popular case study, Twitter sentiment analysis. The framework comprises a cluster of four commodity computers running Cloudera CDH 6.0.1 and RapidMiner Studio 9.3 with the Text Processing, Hive Connector, and Radoop extensions. According to the study results, setting up a big data analytics framework on a cluster of computers does not require advanced computer knowledge, but it does require meticulous system configuration to satisfy the installation and software integration requirements. Once setup and configuration are completed correctly, data analysis can be readily performed using the visual workflow designers provided by RapidMiner. Finally, the framework is further evaluated on a large data set of 185 million records, the “TalkingData AdTracking Fraud Detection” data set. The outcome is highly satisfactory and shows that the framework is easy to use and can be practically deployed for big data analytics.
Copyright information
© 2019 Springer Nature Singapore Pte Ltd.
About this paper
Cite this paper
Kunnakorntammanop, S., Thepwuttisathaphon, N., Thaicharoen, S. (2019). An Experience Report on Building a Big Data Analytics Framework Using Cloudera CDH and RapidMiner Radoop with a Cluster of Commodity Computers. In: Berry, M., Yap, B., Mohamed, A., Köppen, M. (eds) Soft Computing in Data Science. SCDS 2019. Communications in Computer and Information Science, vol 1100. Springer, Singapore. https://doi.org/10.1007/978-981-15-0399-3_17
DOI: https://doi.org/10.1007/978-981-15-0399-3_17
Publisher Name: Springer, Singapore
Print ISBN: 978-981-15-0398-6
Online ISBN: 978-981-15-0399-3
eBook Packages: Computer Science (R0)