
App store mining is not enough for app improvement


Abstract

The rise in popularity of mobile devices has led to a parallel growth of the app store market, prompting numerous research studies and commercial platforms that mine app stores. App store reviews are used to analyze various aspects of app development and evolution. However, app users’ feedback is not limited to the app store. In fact, despite the large number of posts made daily on social media, the value of these discussions remains largely untapped in the context of mobile app development. In this paper, we study how Twitter can provide complementary information to support mobile app development. By analyzing a total of 30,793 apps over a period of six weeks, we found strong correlations between the number of reviews and the number of tweets for most apps. Moreover, by applying machine learning classifiers, topic modeling, and subsequent crowdsourcing, we successfully mined 22.4% additional feature requests and 12.89% additional bug reports from Twitter. We also found that 52.1% of all feature requests and bug reports were discussed in both tweets and reviews. In addition to identifying information common to and unique to Twitter and the app store, we performed sentiment and content analysis for 70 randomly selected apps. We found that tweets provided more critical and objective views on apps than app store reviews. These results show that app store review mining is indeed not enough: other information sources ultimately provide added value and information for app developers.






Acknowledgments

We would like to thank Homayoon Farrahi and Ada Lee for their help with this study. We thank all the anonymous reviewers and the associate editor for their valuable comments and suggestions. This research was partially supported by the Natural Sciences and Engineering Research Council of Canada (NSERC) Discovery Grant 250343-12.

Author information


Corresponding author

Correspondence to Maleknaz Nayebi.

Additional information

Communicated by: Yasutaka Kamei

Appendix: Crowdsourced evaluation of RQ2 and RQ3

In Section 5 we discussed why and how we used crowdsourcing:

First, we used crowdsourcing to validate the results of the similarity analysis (cosine similarity) between tweet topics and review topics, that is, to confirm (i) whether assigning a tweet topic to a review topic was correct and (ii) whether we missed assigning a tweet topic to a review topic (RQ2). A sketch of this similarity computation follows below.

Second, we used crowdsourcing to compare the degree of specification and understandability of tweets and reviews (RQ3).
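
For illustration, the following is a minimal sketch of how the cosine similarity between a tweet topic and a review topic can be computed when each topic is represented as a bag of word weights (e.g., LDA word probabilities). The topics, weights, and function below are hypothetical and are not the paper’s implementation.

    # Minimal sketch (not the paper's implementation): cosine similarity between
    # two topics represented as word -> weight dictionaries.
    import math

    def cosine_similarity(topic_a, topic_b):
        vocab = set(topic_a) | set(topic_b)
        dot = sum(topic_a.get(w, 0.0) * topic_b.get(w, 0.0) for w in vocab)
        norm_a = math.sqrt(sum(v * v for v in topic_a.values()))
        norm_b = math.sqrt(sum(v * v for v in topic_b.values()))
        return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

    # Hypothetical topic-word weights, for illustration only.
    tweet_topic = {"crash": 0.4, "login": 0.3, "update": 0.3}
    review_topic = {"crash": 0.5, "freeze": 0.3, "update": 0.2}
    print(round(cosine_similarity(tweet_topic, review_topic), 3))

Topic pairs whose similarity exceeds a chosen threshold would be candidates for assignment; the crowd then confirms or rejects such assignments.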

“The crowd” is composed of workers who are not known in person to the authors. To assess the validity of the crowdsourced results, we hired three developers known to the authors from similar former work and asked them to perform the same tasks as the crowd. Overall, across RQ2 and RQ3 and across the different tasks, the average Fleiss’ kappa among the three developers was 0.84, which indicates almost perfect agreement. We then compared the results achieved by the crowd with the results achieved by the three developers.
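
For reference, a minimal sketch of computing Fleiss’ kappa from an items-by-categories count matrix is shown below; the ratings are made up for illustration, and this is not necessarily the tooling used in the study.

    # Minimal sketch of Fleiss' kappa with a fixed number of raters per item.
    # counts[i][j] = number of raters who put item i into category j.
    def fleiss_kappa(counts):
        n_items = len(counts)
        n_raters = sum(counts[0])            # raters per item (constant)
        n_cats = len(counts[0])
        # Per-item agreement P_i and overall category proportions p_j.
        p_i = [(sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
               for row in counts]
        p_j = [sum(row[j] for row in counts) / (n_items * n_raters)
               for j in range(n_cats)]
        p_bar = sum(p_i) / n_items           # observed agreement
        p_e = sum(p * p for p in p_j)        # chance agreement
        return (p_bar - p_e) / (1 - p_e)

    # Hypothetical example: 4 items rated by 3 developers into 2 categories.
    print(round(fleiss_kappa([[3, 0], [2, 1], [3, 0], [0, 3]]), 3))  # 0.625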

Evaluating crowdsourced results in RQ2

We randomly selected 500 pairs of tweet and review topics. Among them, 250 pairs had been marked as similar and 250 as different by the crowd. We asked three app developers with whom we had formerly worked to manually label these 500 pairs. They answered the same questions as the crowd (Fig. 5). We compared the results and found that the developers classified 99.4% of the pairs in the same way as we did based on the crowd’s evaluation.

We then compared the results achieved by the crowd with the results achieved by the three developers, as presented in Table 3.

Table 3 Comparison of crowdsourced results with the results from known developers for RQ2

Evaluating crowdsourced results in RQ3

We randomly selected 250 tweets and 250 reviews and asked the three developers to judge both the degree of specification and the degree of understandability. Tables 4 and 5 compare the results of this task as performed by the crowd and by the three developers.

17.2% of the reviews and 16.4% of the tweets were classified into a different specification category by the developers than by the crowd. With the same setup, we asked the developers to evaluate the degree of understandability and compared the results with those received from the crowd:

11.2% of the reviews and 17.6% of the tweets were classified into a different understandability category by the developers than by the crowd.

Table 4 Comparison of crowdsourced results with the results from known developers in RQ3 for degree of specification
Table 5 Comparison of crowdsourced results with the results from known developers in RQ3 for degree of understandability
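
The disagreement percentages above can be obtained by directly comparing the two sets of labels. The sketch below illustrates the computation with hypothetical category names and labels; it does not use the study’s data.

    # Minimal sketch: fraction of items the developers placed in a different
    # category than the crowd (hypothetical labels, not the study's data).
    def disagreement_rate(crowd_labels, developer_labels):
        pairs = list(zip(crowd_labels, developer_labels))
        return sum(1 for c, d in pairs if c != d) / len(pairs)

    crowd = ["specific", "vague", "specific", "specific", "vague"]
    developers = ["specific", "specific", "specific", "vague", "vague"]
    print(f"{disagreement_rate(crowd, developers):.1%}")  # prints 40.0%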


Cite this article

Nayebi, M., Cho, H. & Ruhe, G. App store mining is not enough for app improvement. Empir Software Eng 23, 2764–2794 (2018). https://doi.org/10.1007/s10664-018-9601-1
