Text Preprocessing

  • Murugan Anandarajan
  • Chelsey Hill
  • Thomas Nolan
Chapter
Part of the Advances in Analytics and Data Science book series (AADS, volume 2)

Abstract

This chapter begins the process of preparing text data for analysis. It introduces the choices available for cleansing text data, including tokenizing, standardizing and cleaning, removing stop words, and stemming. It also covers advanced topics in text preprocessing, such as n-grams, part-of-speech tagging, and custom dictionaries. These preprocessing decisions influence the text document representation created for analysis.
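
As a rough illustration of the steps named above, the following is a minimal sketch in Python using only the standard library. It is not the chapter's own implementation: the function names, the tiny stop word list, and the suffix-stripping rule are all assumptions made for illustration, and the stemmer is a toy stand-in rather than an established algorithm such as Porter's.

```python
import re

# Hypothetical, deliberately tiny stop word list; real lists are much longer.
STOP_WORDS = {"a", "an", "and", "the", "of", "to", "is", "are", "for", "in"}


def tokenize(text):
    """Standardize to lowercase and split the text into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def remove_stop_words(tokens):
    """Drop tokens that appear in the stop word list."""
    return [t for t in tokens if t not in STOP_WORDS]


def naive_stem(token):
    """Strip one common suffix if at least three characters remain.

    A toy rule for illustration only, not the Porter or Lovins stemmer.
    """
    for suffix in ("ing", "ed", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def ngrams(tokens, n=2):
    """Return all runs of n consecutive tokens as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def preprocess(text):
    """Tokenize, remove stop words, then stem each remaining token."""
    return [naive_stem(t) for t in remove_stop_words(tokenize(text))]


tokens = preprocess("The analysts are cleaning and parsing the documents.")
bigrams = ngrams(tokens, 2)
print(tokens)
print(bigrams)
```

Each step is a separate function so that individual choices (which stop words to drop, whether to stem at all, which n to use) can be changed independently, since those choices shape the resulting document representation.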

Keywords

Text preprocessing; Text parsing; n-grams; POS tagging; Stemming; Lemmatization; Natural language processing; Tokens; Stop words

References

  1. Bird, S., Klein, E., & Loper, E. (2009). Natural language processing with Python: Analyzing text with the natural language toolkit. Beijing: O’Reilly Media, Inc.
  2. Dumais, S., Platt, J., Heckerman, D., & Sahami, M. (1998, November). Inductive learning algorithms and representations for text categorization. In Proceedings of the seventh international conference on Information and knowledge management (pp. 148–155). ACM.
  3. Indurkhya, N., & Damerau, F. J. (Eds.). (2010). Handbook of natural language processing (Vol. 2). Boca Raton: CRC Press.
  4. Inmon, B. (2017). Turning text into gold: Taxonomies and textual analytics. Bradley Beach: Technics Publications.
  5. Johansson, S., Leech, G. N., & Goodluck, H. (1978). The Lancaster-Oslo/Bergen Corpus of British English. Oslo: Department of English: Oslo University Press.
  6. Kučera, H., & Francis, W. N. (1967). Computational analysis of present-day American English. Providence: Brown University Press.
  7. Lovins, J. B. (1968). Development of a stemming algorithm. Mechanical Translation and Computational Linguistics, 11(1–2), 22–31.
  8. Manning, C. D., & Schütze, H. (1999). Foundations of statistical natural language processing. Cambridge: MIT Press.
  9. Manning, C., Raghavan, P., & Schütze, H. (2008). Introduction to information retrieval. Cambridge: Cambridge University Press. https://doi.org/10.1017/CBO9780511809071
  10. Paice, C. D. (1990). Another stemmer. ACM SIGIR Forum, 24(3), 56–61.
  11. Paice, C. D. (1994, August). An evaluation method for stemming algorithms. In Proceedings of the 17th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (pp. 42–50). Springer-Verlag New York, Inc.
  12. Porter, M. F. (1980). An algorithm for suffix stripping. Program, 14(3), 130–137.
  13. Salton, G. (1971). The SMART retrieval system: Experiments in automatic document processing. Englewood Cliffs: Prentice-Hall.
  14. Salton, G. (1989). Automatic text processing: The transformation, analysis, and retrieval of. Reading: Addison-Wesley.
  15. Struhl, S. (2015). Practical text analytics: Interpreting text and unstructured data for business intelligence. London: Kogan Page Publishers.
  16. Taylor, A., Marcus, M., & Santorini, B. (2003). The Penn Treebank: An overview. In Treebanks (pp. 5–22). Dordrecht: Springer.
  17. Weiss, S. M., Indurkhya, N., Zhang, T., & Damerau, F. (2010). Text mining: Predictive methods for analyzing unstructured information. Springer Science & Business Media.
  18. Wilbur, W. J., & Sirotkin, K. (1992). The automatic identification of stop words. Journal of Information Science, 18(1), 45–55.
  19. Yatsko, V. A. (2011). Methods and algorithms for automatic text analysis. Automatic Documentation and Mathematical Linguistics, 45(5), 224–231.

Further Reading

  1. For a more comprehensive treatment of natural language processing, see Indurkhya and Damerau (2010), Jurafsky and Martin (2014), or Manning and Schütze (1999).

Copyright information

© Springer Nature Switzerland AG 2019

Authors and Affiliations

  • Murugan Anandarajan (1)
  • Chelsey Hill (2)
  • Thomas Nolan (3)

  1. LeBow College of Business, Drexel University, Philadelphia, USA
  2. Feliciano School of Business, Montclair State University, Montclair, USA
  3. Mercury Data Science, Houston, USA
