Abstract
Short texts such as advertisements are characterised by numerous slogans, phrases, words, symbols, etc. To improve the quality of textual data, noise must be filtered out from the important data. The aim of this work is to determine to what extent time-consuming data pre-processing is necessary when discovering sequential patterns in English and Slovak advertisement corpora. For this purpose, an experiment was conducted focusing on data pre-processing in these two comparable corpora. We examine to what extent removing stop words influences the quantity and quality of the extracted rules. The results show that stop-word removal has no impact on the quantity and quality of extracted rules in either the English or the Slovak advertisement corpus. Only language has a significant impact on the quantity and quality of extracted rules.
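The comparison described in the abstract can be illustrated with a minimal sketch. It is not the authors' procedure: the stop list, the toy advertisements, the function names, and the simplification of a "rule" to a pair of directly adjacent words are all assumptions made for illustration.

```python
from collections import Counter

# Hypothetical stop list for illustration only; not the list used in the paper.
STOP_WORDS = {"the", "a", "of", "and", "for", "to", "in"}


def tokenize(text, remove_stop_words=False):
    """Lowercase and split on whitespace, optionally dropping stop words."""
    tokens = text.lower().split()
    if remove_stop_words:
        tokens = [t for t in tokens if t not in STOP_WORDS]
    return tokens


def sequence_rules(texts, remove_stop_words=False, min_support=2):
    """Extract rules 'a => b' where b directly follows a in a text.

    Support is the number of texts containing the adjacent pair;
    confidence is that support divided by the number of texts containing a.
    """
    pair_support = Counter()
    item_support = Counter()
    for text in texts:
        tokens = tokenize(text, remove_stop_words)
        # Sets ensure each text contributes at most once per item or pair.
        item_support.update(set(tokens))
        pair_support.update(set(zip(tokens, tokens[1:])))
    return {
        pair: (count, count / item_support[pair[0]])
        for pair, count in pair_support.items()
        if count >= min_support
    }


# Toy advertisements, invented for this sketch.
ads = [
    "the best price for the best quality",
    "best price in town",
    "quality for the best price",
]
rules_raw = sequence_rules(ads)
rules_filtered = sequence_rules(ads, remove_stop_words=True)
```

Comparing `rules_raw` with `rules_filtered` (how many rules there are, and their support and confidence) mirrors the kind of quantity and quality comparison the experiment reports. The toy data says nothing about the paper's actual findings.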
Copyright information
© 2014 Springer International Publishing Switzerland
About this paper
Cite this paper
Munková, D., Munk, M., Vozár, M. (2014). Influence of Stop-Words Removal on Sequence Patterns Identification within Comparable Corpora. In: Trajkovik, V., Anastas, M. (eds) ICT Innovations 2013. Advances in Intelligent Systems and Computing, vol 231. Springer, Heidelberg. https://doi.org/10.1007/978-3-319-01466-1_6
DOI: https://doi.org/10.1007/978-3-319-01466-1_6
Publisher Name: Springer, Heidelberg
Print ISBN: 978-3-319-01465-4
Online ISBN: 978-3-319-01466-1
eBook Packages: Engineering, Engineering (R0)