Abstract
Today, every corporate, banking, judicial, or medical ecosystem generates many varieties of text, such as customer reviews, product manuals, white papers, system logs, and usage data. These texts vary in language, size, context, and format. Handling such text with a single system remains a challenge: traditionally, a separate system handles each specific kind of generated text. This work proposes a concrete step toward an integrated solution. The proposed system seamlessly handles text of different formats, sizes, languages, and contexts, encompassing the text generated across the ecosystem. An implementation evaluated on a heterogeneous text dataset shows promising results. This integrated approach gives analytics an extra edge in learning hidden relational and contextual patterns across the complete system.
© 2018 Springer Nature Singapore Pte. Ltd.
About this paper
Cite this paper
Pathak, S., Rajeshwar Rao, D. (2018). Adaptive System for Handling Variety in Big Text. In: Hu, YC., Tiwari, S., Mishra, K., Trivedi, M. (eds) Intelligent Communication and Computational Technologies. Lecture Notes in Networks and Systems, vol 19. Springer, Singapore. https://doi.org/10.1007/978-981-10-5523-2_28
DOI: https://doi.org/10.1007/978-981-10-5523-2_28
Publisher Name: Springer, Singapore
Print ISBN: 978-981-10-5522-5
Online ISBN: 978-981-10-5523-2
eBook Packages: Engineering, Engineering (R0)