Abstract
This chapter presents a comparative study of text preprocessing techniques for natural language call routing. Seven unsupervised and supervised term weighting methods were considered. Four dimensionality reduction methods were applied: stop-word filtering with stemming, feature selection based on term weights, feature transformation based on term clustering, and a novel feature transformation method based on terms belonging to classes. The classification algorithms were k-NN and the SVM-based algorithm Fast Large Margin. The numerical experiments showed that the most effective term weighting method is Term Relevance Ratio (TRR). Unlike the other dimensionality reduction methods, feature transformation based on term clustering can significantly decrease dimensionality without significantly changing classification effectiveness. The novel feature transformation method reduces the dimensionality radically: the number of features equals the number of classes.
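The abstract does not spell out how the class-based feature transformation works, so the following is only a minimal sketch of one plausible reading: each term is assigned to the class in which it is relatively most frequent, and a document is then represented by one summed weight per class. The weighting used here (relative frequency within a class) is an illustrative stand-in and is not TRR; the function names `fit_term_classes` and `transform` and the toy utterances are hypothetical.

```python
from collections import Counter, defaultdict


def fit_term_classes(docs, labels):
    """Assign every training term to the class where its relative frequency is highest.

    Assumption: relative frequency stands in for the chapter's term weights.
    """
    counts = defaultdict(Counter)  # class label -> term counts
    for tokens, label in zip(docs, labels):
        counts[label].update(tokens)
    totals = {c: sum(cnt.values()) for c, cnt in counts.items()}
    classes = sorted(counts)
    vocab = {t for cnt in counts.values() for t in cnt}

    term_class, term_weight = {}, {}
    for term in vocab:
        # Relative frequency of the term inside each class.
        rel = {c: counts[c][term] / totals[c] for c in classes}
        best = max(classes, key=lambda c: rel[c])
        term_class[term] = best
        term_weight[term] = rel[best]
    return classes, term_class, term_weight


def transform(tokens, classes, term_class, term_weight):
    """Map a tokenized utterance to one feature per class."""
    features = dict.fromkeys(classes, 0.0)
    for t in tokens:
        if t in term_class:  # unseen terms contribute nothing
            features[term_class[t]] += term_weight[t]
    return [features[c] for c in classes]


# Toy data (hypothetical call-routing utterances).
docs = [["pay", "bill"], ["bill", "balance"],
        ["reset", "password"], ["password", "login"]]
labels = ["billing", "billing", "account", "account"]

classes, term_class, term_weight = fit_term_classes(docs, labels)
vec = transform(["pay", "bill", "please"], classes, term_class, term_weight)
print(classes, vec)
```

Whatever the vocabulary size, the resulting vector has exactly as many entries as there are classes, which is the property the abstract highlights.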
Copyright information
© 2017 Springer Science+Business Media Singapore
About this chapter
Cite this chapter
Sergienko, R., Shan, M., Schmitt, A. (2017). A Comparative Study of Text Preprocessing Techniques for Natural Language Call Routing. In: Jokinen, K., Wilcock, G. (eds) Dialogues with Social Robots. Lecture Notes in Electrical Engineering, vol 427. Springer, Singapore. https://doi.org/10.1007/978-981-10-2585-3_2
DOI: https://doi.org/10.1007/978-981-10-2585-3_2
Publisher Name: Springer, Singapore
Print ISBN: 978-981-10-2584-6
Online ISBN: 978-981-10-2585-3
eBook Packages: Engineering, Engineering (R0)