Efficient Pre-processing and Feature Selection for Clustering of Cancer Tweets

Lavanya, P. G.; Kouser, K.; Suresha, Mallappa

doi:10.1007/978-981-13-6095-4_2


Part of the book series: Advances in Intelligent Systems and Computing (AISC, volume 910)

Abstract

The impact of social media on daily life cannot be overlooked, yet harnessing this rich and varied data for information is challenging for data analysts. Because social media data is unstructured, it has to be processed, represented and then analysed in ways suited to the task at hand. Although the retail industry and political actors use social media extensively to gather feedback and promote new ideas, its significance in other public-facing fields such as health care and security has not been addressed effectively. Information from social media may be informal, but it contains genuine opinions and experiences that are valuable for improving healthcare services. This work explores the analysis of Twitter data related to the most dreaded disease, cancer. We collected over one million tweets related to various types of cancer and summarized them into a set of representative tweets that may give healthcare professionals key inputs regarding symptoms, diagnosis, treatment and recovery. When correlated with clinical research and inputs, these tweets may provide rich information to support holistic treatment of patients. We propose additional pre-processing of the raw data. We also explore a combination of feature selection methods, two feature extraction methods and a soft clustering algorithm to study their feasibility for our data. The results support our intuition about the underlying information and show that there is considerable scope for further research in the area.
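The abstract does not name the specific methods used, so the following is only a minimal sketch of the kind of pipeline it describes, under stated assumptions: the cleaning rules, the tiny stopword list, the document-frequency filter used as feature selection, and fuzzy c-means as the soft clustering algorithm are all illustrative choices, not the authors' implementation. The feature extraction step is omitted, and the sample tweets are invented for the example.

```python
import math
import random
import re

# Tiny illustrative stopword list (an assumption, not the authors' list).
STOPWORDS = {"a", "an", "and", "the", "of", "to", "in", "is", "for", "on", "my", "rt", "with"}

def clean_tweet(text):
    """Lowercase a tweet, strip URLs, mentions and non-letters, then tokenize."""
    text = text.lower()
    text = re.sub(r"https?://\S+", " ", text)  # URLs
    text = re.sub(r"@\w+", " ", text)          # user mentions
    text = re.sub(r"[^a-z\s]", " ", text)      # digits, punctuation, '#'
    return [t for t in text.split() if t not in STOPWORDS and len(t) > 2]

def select_features(docs, min_df=2):
    """Keep terms that occur in at least min_df tweets (simple frequency filter)."""
    df = {}
    for tokens in docs:
        for term in set(tokens):
            df[term] = df.get(term, 0) + 1
    return sorted(t for t, n in df.items() if n >= min_df)

def vectorize(docs, vocab):
    """Build term-count vectors over the selected vocabulary."""
    index = {t: i for i, t in enumerate(vocab)}
    rows = []
    for tokens in docs:
        row = [0.0] * len(vocab)
        for t in tokens:
            if t in index:
                row[index[t]] += 1.0
        rows.append(row)
    return rows

def fuzzy_c_means(X, c=2, m=2.0, iters=50, seed=0):
    """Plain fuzzy c-means: returns (memberships U, cluster centres)."""
    rng = random.Random(seed)
    n, d = len(X), len(X[0])
    # Random initial memberships, each row normalised to sum to 1.
    U = []
    for _ in range(n):
        r = [rng.random() + 1e-9 for _ in range(c)]
        s = sum(r)
        U.append([v / s for v in r])
    centres = []
    for _ in range(iters):
        # Centres are membership-weighted means of the points.
        centres = []
        for k in range(c):
            w = [U[i][k] ** m for i in range(n)]
            total = sum(w)
            centres.append([sum(w[i] * X[i][j] for i in range(n)) / total
                            for j in range(d)])
        # Memberships are inversely related to distance from each centre.
        for i in range(n):
            dist = [max(math.dist(X[i], centres[k]), 1e-12) for k in range(c)]
            inv = [dk ** (-2.0 / (m - 1.0)) for dk in dist]
            s = sum(inv)
            U[i] = [v / s for v in inv]
    return U, centres

# Invented sample tweets, for illustration only.
tweets = [
    "RT @a: Chemo session today, feeling tired #breastcancer https://t.co/1",
    "Chemo side effects: tired and nauseous",
    "Early screening helps diagnosis of lung cancer",
    "Lung cancer screening saved my dad, early diagnosis matters",
]
docs = [clean_tweet(t) for t in tweets]
vocab = select_features(docs)
X = vectorize(docs, vocab)
U, centres = fuzzy_c_means(X, c=2)
for tweet, memberships in zip(tweets, U):
    print([round(u, 2) for u in memberships], tweet[:40])
```

Each row of `U` holds one tweet's degree of membership in every cluster, which is what distinguishes soft clustering from a hard assignment such as K-means. A representative tweet per cluster could then be taken as the one with the highest membership, though that selection rule is again an assumption.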

Author information

Authors and Affiliations

DoS in Computer Science, University of Mysore, Mysuru, India
P. G. Lavanya & Mallappa Suresha
Government First Grade College, Gundlupet, India
K. Kouser

Corresponding author

Correspondence to P. G. Lavanya.

Editor information

Editors and Affiliations

School of Computer Science and Information Technology, Indian Institute of Information Technology and Management—Kerala (IIITM-K), Trivandrum, Kerala, India
Sabu M. Thampi
School of Engineering Science, Simon Fraser University, Burnaby, BC, Canada
Ljiljana Trajkovic
Machine Intelligence Unit, Indian Statistical Institute, Kolkata, West Bengal, India
Sushmita Mitra
Indian Institute of Information Technology, Allahabad (IIIT-A), Allahabad, Uttar Pradesh, India
P. Nagabhushan
Department of Computer Science and Engineering, Indian Institute of Technology Kharagpur, Kharagpur, West Bengal, India
Jayanta Mukhopadhyay
Departamento de Informática y Automática, Universidad de Salamanca, Salamanca, Spain
Juan M. Corchado
Dipartimento di Ingegneria dell’Informazione (DINFO), Università degli Studi di Firenze, Florence, Italy
Stefano Berretti
Indian Institute of Space Science and Technology, Trivandrum, Kerala, India
Deepak Mishra

About this paper

Cite this paper

Lavanya, P.G., Kouser, K., Suresha, M. (2020). Efficient Pre-processing and Feature Selection for Clustering of Cancer Tweets. In: Thampi, S., et al. Intelligent Systems, Technologies and Applications. Advances in Intelligent Systems and Computing, vol 910. Springer, Singapore. https://doi.org/10.1007/978-981-13-6095-4_2

DOI: https://doi.org/10.1007/978-981-13-6095-4_2
Published: 24 February 2019
Publisher Name: Springer, Singapore
Print ISBN: 978-981-13-6094-7
Online ISBN: 978-981-13-6095-4
eBook Packages: Intelligent Technologies and Robotics (R0)
