As an emerging research paradigm, big data analytics has been gaining currency in various fields. However, the existing hospitality and tourism literature offers little discussion of data quality, which may affect the validity and generalizability of research findings. This study examines the reliability of online hotel reviews on TripAdvisor by developing a text classifier that predicts travel purpose (i.e., business vs. leisure) from the textual content of reviews. The classifier is tested over a range of cities and data sizes to examine its sensitivity to data samples. The findings show that, while the classifier’s performance is consistent across different cities, it varies with data size and sampling method. More importantly, a considerable amount of noise is found in the data, which leads to misclassification. Furthermore, a novel approach is developed to address the misclassification problem resulting from data noise. This study reveals important data quality issues and contributes to the theoretical development of social media analytics in hospitality and tourism.
Keywords: Big data; Data quality; Online hotel reviews; Social media analytics; Text classification; Methodology
This study was sponsored by the National Natural Science Foundation of China (71373023) and the Beijing Municipal Commission of Education (SM201611417001).