Adaptive System for Handling Variety in Big Text

  • Conference paper

Part of the book series: Lecture Notes in Networks and Systems (LNNS, volume 19)

Abstract

Today, every corporate, banking, judicial, or medical ecosystem generates a variety of text, such as customer reviews, product manuals, white papers, system logs, and usage data. These texts vary in language, size, context, and format. Handling all of them with a single system remains a challenge. Traditionally, separate systems handle each specific kind of generated text. This work proposes a concrete step toward an integrated solution to this challenge. The proposed system seamlessly handles text of different formats, sizes, languages, and contexts, encompassing text generated across the ecosystem. An implementation over a heterogeneous text dataset shows promising results. This integrated approach gives analytics an extra edge in learning hidden relational and contextual patterns across the complete system.

Author information

Correspondence to Shantanu Pathak.

Copyright information

© 2018 Springer Nature Singapore Pte. Ltd.

About this paper

Cite this paper

Pathak, S., Rajeshwar Rao, D. (2018). Adaptive System for Handling Variety in Big Text. In: Hu, YC., Tiwari, S., Mishra, K., Trivedi, M. (eds) Intelligent Communication and Computational Technologies. Lecture Notes in Networks and Systems, vol 19. Springer, Singapore. https://doi.org/10.1007/978-981-10-5523-2_28

  • DOI: https://doi.org/10.1007/978-981-10-5523-2_28

  • Publisher Name: Springer, Singapore

  • Print ISBN: 978-981-10-5522-5

  • Online ISBN: 978-981-10-5523-2

  • eBook Packages: Engineering, Engineering (R0)
