Abstract
The impact of social media on daily life cannot be overlooked. Harnessing this rich and varied data is a challenging job for data analysts. Because social media data of every type is unstructured, it has to be processed, represented and then analysed in ways suited to the requirements at hand. Although the retail industry and politicians use social media extensively to gather feedback and market new ideas, its significance in other public-facing fields such as health care and security has not been dealt with effectively. Though the information coming from social media may be informal, it contains genuine opinions and experiences that are necessary to improve healthcare services. This work explores the analysis of Twitter data related to the most dreaded disease, ‘cancer’. We collected over one million tweets related to various types of cancer and summarized them into a set of representative tweets, which may give healthcare professionals key inputs regarding the symptoms, diagnosis, treatment and recovery related to cancer. When correlated with clinical research and inputs, this may provide rich information for delivering holistic treatment to patients. We propose additional pre-processing of the raw data. We also explore a combination of feature selection methods, two feature extraction methods and a soft clustering algorithm to study their feasibility for our data. The results confirm our intuition about the underlying information and show that there is tremendous scope for further research in the area.
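The pipeline outlined in the abstract (pre-processing, feature selection, soft clustering) can be illustrated with a minimal sketch. This is not the authors' implementation: the regular expressions, the stopword list, the document-frequency threshold used as a stand-in for feature selection, the sample tweets and all function names are assumptions made for illustration. Fuzzy c-means is assumed here as one common soft clustering algorithm, since the abstract does not name one, and the feature extraction step is omitted for brevity.

```python
import math
import re
from collections import Counter

# Hypothetical patterns and stopwords; the chapter's actual rules are not given here.
URL_RE = re.compile(r"https?://\S+")
MENTION_RE = re.compile(r"@\w+")
TOKEN_RE = re.compile(r"[a-z]+")
STOPWORDS = {"the", "and", "for", "was", "rt", "my", "after", "again"}


def preprocess(tweet):
    """Lowercase, strip URLs and @mentions, keep alphabetic non-stopword tokens."""
    text = URL_RE.sub(" ", tweet.lower())
    text = MENTION_RE.sub(" ", text)
    return [t for t in TOKEN_RE.findall(text) if t not in STOPWORDS and len(t) > 2]


def select_features(docs, min_df=2):
    """Keep terms that occur in at least min_df documents (a simple stand-in)."""
    df = Counter(term for doc in docs for term in set(doc))
    return sorted(t for t, n in df.items() if n >= min_df)


def vectorize(docs, vocab):
    """Represent each document as raw term counts over the selected vocabulary."""
    return [[float(doc.count(t)) for t in vocab] for doc in docs]


def fuzzy_c_means(points, centers, m=2.0, iters=20):
    """Return (centers, memberships); memberships[i][j] is the degree of point i in cluster j."""

    def memberships_for(cs):
        rows = []
        for p in points:
            # Clamp distances so a point sitting on a center does not divide by zero.
            inv = [max(math.dist(p, c), 1e-9) ** (-2.0 / (m - 1.0)) for c in cs]
            total = sum(inv)
            rows.append([v / total for v in inv])
        return rows

    dims = len(points[0])
    for _ in range(iters):
        u = memberships_for(centers)
        new_centers = []
        for j in range(len(centers)):
            w = [row[j] ** m for row in u]
            wsum = sum(w)
            new_centers.append(
                [sum(wi * p[k] for wi, p in zip(w, points)) / wsum for k in range(dims)]
            )
        centers = new_centers
    return centers, memberships_for(centers)


# Made-up example tweets, not taken from the chapter's data set.
tweets = [
    "RT @user1: Chemo side effects: fatigue and nausea https://t.co/x",
    "Fatigue and nausea after chemo again #cancer",
    "Early screening caught my breast cancer",
    "Screening saves lives, breast cancer awareness",
]
docs = [preprocess(t) for t in tweets]
vocab = select_features(docs)
points = vectorize(docs, vocab)
# Seed the two clusters with the first and third tweets (an arbitrary choice).
centers, memberships = fuzzy_c_means(points, [points[0], points[2]])
for tweet, row in zip(tweets, memberships):
    print([round(v, 2) for v in row], tweet)
```

Unlike hard clustering, each tweet receives a membership degree in every cluster, and each row of `memberships` sums to one. A representative tweet per cluster could then be chosen as the one with the highest membership, although the chapter's own selection rule is not stated in the abstract.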
Copyright information
© 2020 Springer Nature Singapore Pte Ltd.
About this paper
Cite this paper
Lavanya, P.G., Kouser, K., Suresha, M. (2020). Efficient Pre-processing and Feature Selection for Clustering of Cancer Tweets. In: Thampi, S., et al. Intelligent Systems, Technologies and Applications. Advances in Intelligent Systems and Computing, vol 910. Springer, Singapore. https://doi.org/10.1007/978-981-13-6095-4_2
DOI: https://doi.org/10.1007/978-981-13-6095-4_2
Publisher Name: Springer, Singapore
Print ISBN: 978-981-13-6094-7
Online ISBN: 978-981-13-6095-4
eBook Packages: Intelligent Technologies and Robotics (R0)