A Review of Best Practice Recommendations for Text Analysis in R (and a User-Friendly App)

Banks, George C.; Woznyj, Haley M.; Wesslen, Ryan S.; Ross, Roxanne L.

doi:10.1007/s10869-017-9528-3

Original Paper
Published: 11 January 2018

Volume 33, pages 445–459, (2018)

Journal of Business and Psychology

George C. Banks¹,
Haley M. Woznyj²,
Ryan S. Wesslen³ &
Roxanne L. Ross⁴

Abstract

In recent decades, the amount of text available for organizational science research has grown tremendously. Despite the availability of text and advances in text analysis methods, many of these techniques remain largely segmented by discipline. Moreover, there is an increasing number of open-source tools (R, Python) for text analysis, yet social science researchers, who likely have limited programming knowledge and exposure to computational methods, cannot easily take advantage of them. In this article, we compare quantitative and qualitative text analysis methods used across the social sciences. We describe basic terminology and the overlooked, but critically important, steps in pre-processing raw text (e.g., selection of stop words; stemming). Next, we provide an exploratory analysis of open-ended responses from a prototypical survey dataset using topic modeling with R. We provide a list of best practice recommendations for text analysis focused on (1) hypothesis and question formation, (2) design and data collection, (3) data pre-processing, and (4) topic modeling. We also discuss the creation of scale scores for more traditional correlation and regression analyses. All the data are available in an online repository for the interested reader to practice with, along with a reference list for additional reading, an R markdown file, and an open-source interactive topic model tool (topicApp; see https://github.com/wesslen/topicApp, https://github.com/wesslen/text-analysis-org-science, https://dataverse.unc.edu/dataset.xhtml?persistentId=doi:10.15139/S3/R4W7ZS).
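The pre-processing steps named in the abstract (stop word selection, stemming) can be illustrated with a minimal sketch. This is not the authors' code: their materials are an R markdown file, whereas the example below is a language-agnostic illustration written in Python using only the standard library. The stop list, the suffix-stripping rule, the sample responses, and all function names are assumptions made for illustration; a real analysis would use a documented stop list and an established stemmer.

```python
import re
from collections import Counter

# Illustrative stop list (an assumption; real analyses use a longer, documented list).
STOP_WORDS = {"a", "an", "and", "the", "is", "are", "my", "me", "to", "of", "in"}


def crude_stem(token):
    """Strip a few common English suffixes.

    A toy stand-in for a real stemmer, used only to show where stemming
    sits in the pipeline.
    """
    for suffix in ("ing", "ed", "s"):
        # Keep at least three characters so short words are left alone.
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def preprocess(text, stop_words=STOP_WORDS, start_words=None):
    """Lowercase, tokenize, drop stop words, optionally keep only start words, stem."""
    tokens = re.findall(r"[a-z]+", text.lower())
    tokens = [t for t in tokens if t not in stop_words]
    if start_words is not None:
        # Start words: only the listed words are retained.
        tokens = [t for t in tokens if t in start_words]
    return [crude_stem(t) for t in tokens]


def document_term_counts(docs, **kwargs):
    """Return one term-frequency Counter per document (a sparse document-term matrix)."""
    return [Counter(preprocess(d, **kwargs)) for d in docs]


# Hypothetical open-ended survey responses.
responses = [
    "My leader supports me and listens to my concerns.",
    "The leader is supporting the team.",
]
counts = document_term_counts(responses)
for i, c in enumerate(counts):
    print(i, dict(c))
```

After stemming, "supports" and "supporting" collapse to the same term, which is the kind of pre-processing choice that changes the document-term counts a topic model later receives.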

Notes

Changes from pre-registered protocol: The final sample size (n = 585) was lower than expected (n = 1000), but was dictated by our prespecified budgetary limit. Also, we originally planned to ask participants about their time working with the leader, but dropped the question due to space concerns. We had planned to examine how occupation related to LMX. However, there were not enough respondents for the majority of the occupations (n < 20); given the small n there is not adequate power to detect even a small magnitude effect (e.g., d = .30). When we aggregated the occupations, the information became redundant with our industry question. Hence, our question about how LMX varied by occupation was dropped.
Start words also exist, in which case a researcher specifies that only certain words be included in an analysis.

Author information

Authors and Affiliations

Department of Management, Belk College of Business, University of North Carolina at Charlotte, 9201 University City Blvd, Charlotte, NC, 28223, USA
George C. Banks
Department of Management, Longwood University, Farmville, VA, USA
Haley M. Woznyj
Department of Computer Science, University of North Carolina at Charlotte, Charlotte, NC, USA
Ryan S. Wesslen
Department of Organizational Science, University of North Carolina at Charlotte, Charlotte, NC, USA
Roxanne L. Ross

Corresponding author

Correspondence to George C. Banks.

Additional information

We dedicate this article to Jared Borns for his insight, patience, and guidance in the data collection process. We thank the three reviewers at Journal of Business and Psychology as well as John Batchelor, Wenwen Dou, Katherine Frear, Tiffany Gallicano, Andy Loignon, Aaron McKenny, Bob Muenchen, Ernest O’Boyle, Jeremy Short, Anne Smith, Allison Toth, and Christopher Whelpley for their feedback on previous versions of the manuscript and our analysis. The article was pre-registered via the Open Science Framework (https://osf.io/g9wjy/?view_only=045606c4e42843f7b3d131de6d0908d0).

About this article

Cite this article

Banks, G.C., Woznyj, H.M., Wesslen, R.S. et al. A Review of Best Practice Recommendations for Text Analysis in R (and a User-Friendly App). J Bus Psychol 33, 445–459 (2018). https://doi.org/10.1007/s10869-017-9528-3

Published: 11 January 2018
Issue Date: August 2018
DOI: https://doi.org/10.1007/s10869-017-9528-3
