Influence of Stop-Words Removal on Sequence Patterns Identification within Comparable Corpora

Munková, Daša; Munk, Michal; Vozár, Martin

doi:10.1007/978-3-319-01466-1_6


Conference paper


Part of the book series: Advances in Intelligent Systems and Computing (AISC, volume 231)

Abstract

Short texts such as advertisements are characterised by a number of slogans, phrases, words, symbols, etc. To improve the quality of textual data, it is necessary to separate noise from the important data. The aim of this work is to determine to what extent time-consuming data pre-processing is necessary when discovering sequential patterns in English and Slovak advertisement corpora. For this purpose, an experiment focusing on data pre-processing was conducted on these two comparable corpora. We examine to what extent removing stop words influences the quantity and quality of extracted rules. Stop-word removal has no impact on the quantity or quality of extracted rules in either the English or the Slovak advertisement corpus. Only language has a significant impact on the quantity and quality of extracted rules.
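
The kind of comparison described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' procedure: the stop-word list, the three example advertisements, the support threshold, and the helper names (`tokenize`, `remove_stop_words`, `sequence_rules`) are all hypothetical, and a "rule" is simplified here to an ordered word pair that occurs in a sufficient share of texts. The toy data is far too small to say anything about the reported finding; it only shows how rule sets extracted with and without stop-word removal could be compared.

```python
from collections import Counter
from itertools import combinations

# Hypothetical toy stop-word list; the lists used in the paper are not given here.
STOP_WORDS = {"the", "a", "an", "of", "and", "to", "for", "in", "is", "your"}


def tokenize(text):
    """Split on whitespace, strip basic punctuation, and lowercase."""
    stripped = (w.strip(".,!?;:") for w in text.split())
    return [w.lower() for w in stripped if w]


def remove_stop_words(tokens, stop_words=STOP_WORDS):
    """Drop every token that appears in the stop-word list."""
    return [t for t in tokens if t not in stop_words]


def sequence_rules(sequences, min_support=0.5):
    """Return ordered pairs (a, b), with a occurring before b, mapped to
    their support: the fraction of sequences containing the pair in that
    order. Only pairs with support >= min_support are kept."""
    counts = Counter()
    for seq in sequences:
        pairs = set()  # count each pair at most once per sequence
        for i, j in combinations(range(len(seq)), 2):
            if seq[i] != seq[j]:
                pairs.add((seq[i], seq[j]))
        counts.update(pairs)
    n = len(sequences)
    return {p: c / n for p, c in counts.items() if c / n >= min_support}


# Made-up advertisement texts, for illustration only.
ads = [
    "Buy the best coffee for your morning",
    "The best coffee in the city",
    "Buy coffee and save",
]

raw = [tokenize(ad) for ad in ads]
filtered = [remove_stop_words(seq) for seq in raw]

rules_raw = sequence_rules(raw)
rules_filtered = sequence_rules(filtered)

# Compare the quantity of rules with and without stop-word removal.
print(len(rules_raw), len(rules_filtered))
```

In a real experiment the two rule sets would be compared statistically across both corpora rather than by a raw count on a handful of texts.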



Author information

Authors and Affiliations

Constantine the Philosopher University in Nitra, Tr. A. Hlinku 1, 949 74, Nitra, Slovakia
Daša Munková, Michal Munk & Martin Vozár


Corresponding author

Correspondence to Daša Munková.

Editor information

Editors and Affiliations

Faculty of Computer Science, Ss Cyril and Methodius University, Rudgjer Boshkovikj 16, Skopje, 1000, Macedonia
Vladimir Trajkovik & Misev Anastas


Cite this paper

Munková, D., Munk, M., Vozár, M. (2014). Influence of Stop-Words Removal on Sequence Patterns Identification within Comparable Corpora. In: Trajkovik, V., Anastas, M. (eds) ICT Innovations 2013. Advances in Intelligent Systems and Computing, vol 231. Springer, Heidelberg. https://doi.org/10.1007/978-3-319-01466-1_6


DOI: https://doi.org/10.1007/978-3-319-01466-1_6
Publisher Name: Springer, Heidelberg
Print ISBN: 978-3-319-01465-4
Online ISBN: 978-3-319-01466-1
eBook Packages: Engineering, Engineering (R0)
